In an earlier post I discussed how to avoid overfitting when using Support Vector Machines. This was achieved using cross validation. In cross validation, prediction accuracy is maximized by varying the cost parameter. Importantly, prediction accuracy is calculated on a different subset of the data from that used for training.

In this blog post I take that concept a step further, by automating the manual search for the optimal cost.

The data set I'll be using describes different types of glass based upon physical attributes and chemical composition. You can read more about the data here, but for the purposes of my analysis all you need to know is that the outcome variable is categorical (7 types of glass) and the 4 predictor variables are numeric.

Creating the base support vector machine model

I start, as in my earlier analysis, by splitting the data into a larger 70% training sample and a smaller 30% testing sample. Then I train a support vector machine on the training sample with the following code:

svm = SupportVectorMachine(Type ~ RefractiveIndex + Ca + Ba + Fe,
                           subset = training,
                           cost = 1)

This produces output as shown below. There are 2 reasons why we can largely disregard the 64.67% accuracy:

  1. We used the training data (and not the independent testing data) to calculate accuracy.
  2. We used the default cost of 1 and made no attempt to optimize it.
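The 70/30 split that training and testing refer to is not shown in this post. As a rough sketch, indicators like these could be built in base R as follows. It assumes they are 0/1 variables (which the later test testing == 1 suggests); the row count n_rows and the seed are hypothetical placeholders, and in practice n_rows would be the number of rows in the glass data.

```r
# Minimal sketch of a 70/30 split into 0/1 indicator variables.
# n_rows and the seed are hypothetical placeholders, not values from the post.
set.seed(123)
n_rows <- 200

# sample(n_rows) is a random permutation of 1..n_rows, so exactly
# round(0.7 * n_rows) positions receive a rank at or below the cutoff.
in_training <- sample(n_rows) <= round(0.7 * n_rows)

training <- as.integer(in_training)  # 1 = row is used to fit the model
testing  <- 1L - training            # 1 = row is held out for scoring

table(training)  # counts of held-out (0) and training (1) rows
```

Because every row lands in exactly one of the two samples, any accuracy computed on the rows where testing is 1 comes from data the model never saw during fitting.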

Amending the R code

I am going to amend the code above in order to loop over a range of values of cost. For each value, I will calculate the accuracy on the test sample. The updated code is as follows:

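library(flipMultivariates)  # SupportVectorMachine()
library(flipRegression)     # assumed source of ConfusionMatrix()
costs = c(0.1, 1, 10, 100, 1000, 10000)
i = 1
accuracies = rep(0, length(costs))

# Fit one model per cost on the training sample, score it on the testing sample
for (cost in costs)
{
    svm = SupportVectorMachine(Type ~ RefractiveIndex + Ca + Ba + Fe,
                               subset = training,
                               cost = cost)
    # Store the held-out accuracy in the slot for this cost
    accuracies[i] = attr(ConfusionMatrix(svm, subset = (testing == 1)), "accuracy")
    i = i + 1
}
# Costs span several orders of magnitude, so use a log-scaled x axis
plot(costs, accuracies, type = "l", log = "x")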

The first 5 lines set things up. I load the libraries required to run the Support Vector Machine and calculate the accuracy. Next I choose a range of costs, then initialize a loop counter i and a zero-filled vector accuracies, where I store the results.

Then I add a loop around the code that created the base model to iterate over costs. The next line calculates and stores the accuracy on the testing sample. Finally, I plot the results, which tell me that the greatest accuracy occurs at a cost of around 100. This allows us to go back and update costs to a more granular range around this value.

Re-running the code with the new costs (10, 20, 50, 75, 100, 150, 200, 300, 500, 1000), I get the final chart shown below. This indicates that a cost of 50 gives the best performance.
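The finer range above was chosen by hand. As an alternative, a similar grid could be generated programmatically. The sketch below is one possible approach rather than the method used in this post: it spaces the costs evenly on a log scale, one order of magnitude either side of the coarse optimum of 100. The names best_cost and finer_costs, the choice of 9 points, and the rounding to 2 significant figures are all illustrative assumptions.

```r
# Sketch: build a finer, log-spaced cost grid around the coarse optimum.
best_cost <- 100  # rough optimum read off the first chart in the post

# One decade below to one decade above best_cost, evenly spaced in log10
exponents <- seq(log10(best_cost) - 1, log10(best_cost) + 1, length.out = 9)

# Round to 2 significant figures so the grid values are easy to read
finer_costs <- signif(10^exponents, 2)

finer_costs
```

Log spacing suits the cost parameter because, as the first chart's log-scaled axis implies, its effect is felt across orders of magnitude rather than in fixed increments.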

The analysis in this post used R in Displayr. The flipMultivariates package (available on GitHub), which uses the e1071 package, performed the calculations. You can try automatically fitting the Support Vector Machine Cost Parameter yourself using the data in this example.
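For readers working outside Displayr, the same kind of search can be sketched with the e1071 package directly. This is not the code behind the charts in this post: it uses R's built-in iris data as a stand-in for the glass data, an arbitrary seed, and illustrative names (train_rows, train_data, test_data, best_cost), so the resulting accuracies will not match the figures quoted above.

```r
library(e1071)  # provides svm(); the post notes that flipMultivariates uses it

# Stand-in data: iris replaces the glass data, which is not bundled with R.
set.seed(1)  # arbitrary seed, for reproducibility only
train_rows <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train_data <- iris[train_rows, ]
test_data  <- iris[-train_rows, ]

costs <- c(0.1, 1, 10, 100, 1000, 10000)

# Fit one SVM per cost on the training rows and score it on the
# held-out rows only; vapply() replaces the manual loop counter.
accuracies <- vapply(costs, function(cost) {
  fit <- svm(Species ~ ., data = train_data, cost = cost)
  predicted <- predict(fit, newdata = test_data)
  mean(predicted == test_data$Species)  # share of correct predictions
}, numeric(1))

# which.max() returns the first maximum if several costs tie
best_cost <- costs[which.max(accuracies)]

plot(costs, accuracies, type = "l", log = "x")
```

Using vapply() keeps each accuracy aligned with its cost without a separate counter, and best_cost can then seed a narrower second pass in the same way as the manual refinement described earlier.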