SVM Parameter Tuning

Carl Dawson
Towards Data Science


The support-vector machine is one of the most popular classification algorithms. The SVM approach to classifying data is elegant, intuitive and includes some very cool mathematics. In this tutorial we’ll take an in-depth look at the different SVM parameters to get an understanding of how we can tune our models.

Before we can develop our understanding of what the parameters do, we have to understand how the algorithm itself works.

How SVMs Work

Support-vector machines work by finding data points of different classes and drawing boundaries between them. The selected data points are called the support-vectors and the boundaries are called hyperplanes.

Intuitively, the algorithm looks for the points from different classes that lie closest to one another (these become the support vectors) and places a straight line (or plane) midway between them, so that the margin on either side is as large as possible.

If the input data is linearly separable, then solving for the hyperplane is simple. But it’s often the case that classification regions overlap and that no single straight plane can act as a boundary.

One way to get around this is to project your data into higher dimensions by creating additional features. Instead of the two-dimensional space you get from having features a and b, you could combine them (for example ab, a², b²) and try to find patterns (or a dividing hyperplane) in those dimensions.
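As a concrete illustration (not from the original article), Scikit-Learn’s PolynomialFeatures will generate exactly these kinds of combined features for you:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# three rows of data with two features, a and b
X = np.array([[1, 2], [3, 4], [5, 6]])

# a degree-2 expansion adds a^2, a*b and b^2 (plus a constant column)
X_expanded = PolynomialFeatures(degree=2).fit_transform(X)

X_expanded.shape
>>> (3, 6)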

But there’s a problem with this approach. While having the extra dimensions makes it easier to find a hyperplane, it also gives your algorithm more features to learn.

SVMs allow us to sidestep this additional learning by using the Kernel Trick.

The Kernel Trick

You may have heard of the Kernel Trick. It’s one of those things that is included in every discussion of SVMs, but it’s often not explained as well as it could be.

A kernel is just a function which takes two data points as inputs and returns a similarity score. This similarity can be interpreted as a metric of closeness. The nearer the data points are, the higher the similarity.

The cool thing about kernel functions is that they can give us similarity scores from higher dimensions without us having to transform our data.

We get to find the closest data points in much higher dimensions without them actually being there. That means we can get all the upside from the additional features without engineering and learning them.

The kernel trick, then, is using a kernel function instead of doing a high-cost transformation.
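To make that concrete, here’s a small, hand-rolled example (not from the original article) using the degree-2 polynomial kernel k(x, y) = (x·y)². Evaluating the kernel directly in our two-dimensional input space gives exactly the same number as explicitly mapping both points into three dimensions and taking the inner product there:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

# the kernel, computed entirely in the original two-dimensional space
k = (x @ y) ** 2                     # (1*3 + 2*4)^2 = 121.0

# the explicit degree-2 feature map that this kernel corresponds to
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

# inner product in the three-dimensional feature space: also 121.0
k_explicit = phi(x) @ phi(y)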

(If you’re mathematically minded you might be wondering what functions can be used as a kernel. The answer is that the implicit feature map, the one that gives us the similarity score in the higher dimensions, exists whenever the kernel function is symmetric and positive semi-definite, i.e. whenever it satisfies Mercer’s condition.)

SVM Kernels

Now that we understand how SVMs work and what the kernel trick is, we can move over to a Jupyter notebook and see how our choice of kernel impacts our models.

If you’re using Scikit-Learn, you’ll see from the documentation that you can choose from several different kernels when you create your support-vector classifier object (SVC). These include:

  • linear
  • poly (polynomial)
  • rbf (radial basis function, the default)
  • sigmoid
  • precomputed (for supplying your own kernel matrix)

Let’s load the Iris dataset.

from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()

For visual convenience, let’s keep just the first two features of the dataset (Sepal length and Sepal width). We can plot them and use colours to show their class (species).

sepal_length = iris.data[:,0] 
sepal_width = iris.data[:,1]
plt.scatter(sepal_length, sepal_width, c=iris.target)

Now we have a visual understanding of our data, we can move on to seeing how the SVM kernels affect the class boundaries.

The Linear Kernel

from sklearn.svm import SVC 
clf = SVC(kernel='linear')
clf.fit(np.c_[sepal_length, sepal_width], iris.target)
create_grid_plot(clf, sepal_length, sepal_width)

Linear kernels compute similarity in the input space. They don’t implicitly define a transformation to higher dimensions. Because of this, each of the hyperplanes in the figure above is a straight line.

You may have seen plots of SVMs where it looks as though the decision boundaries are curved (like those coming up). It’s important to note that the decision boundaries are only curved in the input space. In the implicit, higher-dimensional feature space they are straight lines or planes. (In fact, plots like this aren’t created with equations for curvy lines, they are created by classifying a grid of tiny points and colouring them according to the SVM output, as you can see from the create_grid_plot function.)
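The create_grid_plot function isn’t part of Scikit-Learn and its exact definition isn’t shown here, but a minimal sketch of a helper that does what the paragraph above describes might look like this (it assumes iris.target is available in the notebook’s scope):

def create_grid_plot(clf, x, y, step=0.02):
    # lay a fine grid of points over the feature space
    xx, yy = np.meshgrid(np.arange(x.min() - 1, x.max() + 1, step),
                         np.arange(y.min() - 1, y.max() + 1, step))
    # classify every grid point and colour the background by predicted class
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)
    # overlay the original observations, coloured by their true class
    plt.scatter(x, y, c=iris.target, edgecolors='k')
    plt.show()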

The RBF Kernel

clf = SVC(kernel='rbf') 
clf.fit(np.c_[sepal_length, sepal_width], iris.target)
create_grid_plot(clf, sepal_length, sepal_width)

The Radial Basis Function kernel doesn’t just help us avoid computing a few extra features. The RBF feature space has an infinite number of dimensions. This means that we can utilise the kernel to build very complex decision boundaries. The more dimensions, the better chance we’ll find a hyperplane that neatly separates our data.

The radial basis part of the name comes from the fact that the function’s value decreases as you move away from a centre point (the centre, in this case, being a support vector). This explains why the decision boundaries look bell-shaped when we visualise them.
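You can see this fall-off directly by calling the RBF kernel from Scikit-Learn’s pairwise metrics module (the points below are made up purely for illustration):

from sklearn.metrics.pairwise import rbf_kernel

center = np.array([[0.0, 0.0]])
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])

# similarity to the centre drops off rapidly as the points move away from it
rbf_kernel(center, points, gamma=1.0)
>>>
array([[1.00000000e+00, 3.67879441e-01, 1.83156389e-02, 1.23409804e-04]])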

The Polynomial Kernel

clf = SVC(kernel='poly') 
clf.fit(np.c_[sepal_length, sepal_width], iris.target)
create_grid_plot(clf, sepal_length, sepal_width)

We’ve actually already met the polynomial kernel. Linear kernels are a special case of polynomial kernels where the degree = 1.

The polynomial kernel allows us to learn patterns in our data as if we had access to the higher-order features, the ones that come from combining the pre-existing features (a², b², ab, etc.).

If you’re wondering how to choose between polynomial and RBF: RBF is your best bet unless you’re working in NLP where quadratic (degree=2) polynomials tend to work well.
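If you do reach for the polynomial kernel, remember that the degree is itself a parameter you can tune (Scikit-Learn’s default is degree=3). A quadratic version of the model above would simply be:

clf = SVC(kernel='poly', degree=2)
clf.fit(np.c_[sepal_length, sepal_width], iris.target)
create_grid_plot(clf, sepal_length, sepal_width)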

Sigmoid Kernel and Pairwise Metrics

So far each kernel has worked fairly well without additional parameter tuning or feature alterations, but that’s all about to change. Let’s take a look at the sigmoid kernel.

clf = SVC(kernel='sigmoid') 
clf.fit(np.c_[sepal_length, sepal_width], iris.target)
create_grid_plot(clf, sepal_length, sepal_width)

Using the sigmoid kernel has made our model predict the same class for every input row, leading to an accuracy score of around 17%. So what happened?

We’ve talked about how kernel functions are just similarity measures. If you’re having a problem with an SVM model, it can be useful to run your data through the kernel to see if anything unexpected happens. Handily, each of the kernels available for the support-vector classifier are included in Scikit-Learn’s pairwise metrics module. This means we can pass our data in to just the kernel and see what we get back. Let’s try it for the sigmoid kernel.

from sklearn.metrics.pairwise import sigmoid_kernel

sigmoid_kernel(np.c_[sepal_length, sepal_width])
>>>
array([[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]])

That output looks suspicious to me. The sigmoid kernel is saying that every point is equally similar to every other point — no wonder the SVM is having difficulty drawing class boundaries!

If you’ve studied deep learning (or logistic regression), you’ll recognise the sigmoid function. It serves as an activation: on or off, one extreme or the other. You may also know that large inputs push a sigmoid into its flat, saturated region, where the output barely changes; that’s exactly what our raw measurements are doing here. Let’s normalise our iris features and see what impact that has.

from sklearn.preprocessing import normalize 
sepal_length_norm = normalize(sepal_length.reshape(1, -1))[0]
sepal_width_norm = normalize(sepal_width.reshape(1, -1))[0]

clf.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
create_grid_plot(clf, sepal_length_norm, sepal_width_norm)

Success. Our accuracy score is back to being over 75%.
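(As an aside: up until the sampling experiment later on, the accuracy figures quoted in this article are computed against the same data the models were fitted on. Assuming that’s how they were produced, the check is a one-liner:)

from sklearn.metrics import accuracy_score

accuracy_score(iris.target, clf.predict(np.c_[sepal_length_norm, sepal_width_norm]))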

C: The Penalty Parameter

What does the C parameter do in SVM classification? It tells the algorithm how much you care about misclassified points.

SVMs, in general, seek to find the maximum-margin hyperplane. That is, the line that has as much room on both sides as possible.

A high value for C tells the algorithm that you care more about classifying all of the training points correctly than leaving wiggle room for future data.

Think about it like this — if you increase the C parameter, you’re betting that the training data contains the most extreme possible observations. You’re betting that the future observations will be further away from the boundaries than the points you trained the model on.

Let’s see how changing the C parameter can influence our models.

clf1 = SVC(kernel='linear', C=1000000)
clf1.fit(np.c_[sepal_length, sepal_width], iris.target)
create_grid_plot(clf1, sepal_length, sepal_width)

clf2 = SVC(kernel='linear', C=.0000001)
clf2.fit(np.c_[sepal_length, sepal_width], iris.target)
create_grid_plot(clf2, sepal_length, sepal_width)

You can see that the larger C value has produced boundaries that try harder to classify every training point correctly. Let’s see how it influences the RBF kernel.

clf1 = SVC(kernel='rbf', C=1000000)
clf1.fit(np.c_[sepal_length, sepal_width], iris.target)
create_grid_plot(clf1, sepal_length, sepal_width)

clf2 = SVC(kernel='rbf', C=.0000001)
clf2.fit(np.c_[sepal_length, sepal_width], iris.target)
create_grid_plot(clf2, sepal_length, sepal_width)

When combined with an RBF (or Gaussian) kernel, large values for the C parameter can drastically overfit the data. This caused the isolated regions of classification in the first figure above. If we run GridSearchCV on the C parameter we find that the ideal value for C is 10:
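The search itself isn’t shown in the original; a sketch of it might look like the following (the grid of candidate values here is an assumption), followed by a refit with the winning value:

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(np.c_[sepal_length, sepal_width], iris.target)
search.best_params_   # the article reports C=10 as the winner

clf = SVC(kernel='rbf', C=10)
clf.fit(np.c_[sepal_length, sepal_width], iris.target)
create_grid_plot(clf, sepal_length, sepal_width)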

The overfitting and isolated regions have disappeared, leaving us with an 83% accuracy score, which is a 1% improvement over our original RBF kernel model.

Before we move on to a different parameter, here’s how large and small values of the C parameter affect the sigmoid kernel:
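(The comparison figure isn’t reproduced here; the experiment itself presumably mirrors the earlier C comparisons, run on the normalised features that the sigmoid kernel needs:)

clf1 = SVC(kernel='sigmoid', C=1000000)
clf1.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
create_grid_plot(clf1, sepal_length_norm, sepal_width_norm)

clf2 = SVC(kernel='sigmoid', C=.0000001)
clf2.fit(np.c_[sepal_length_norm, sepal_width_norm], iris.target)
create_grid_plot(clf2, sepal_length_norm, sepal_width_norm)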

You can see that we’ve increased the accuracy score of our (initially) very poor sigmoid model to around the same accuracy as the other kernels by simply increasing the C parameter. And that’s why we tune parameters!

Gamma in SVM

To mix things up, let’s change the features we’re working with. Instead of sepal length and sepal width, let’s use sepal length and petal length and take a look at our data:
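(The original code for this switch isn’t shown; it’s just a matter of selecting different columns of the Iris data and plotting them as before:)

sepal_length = iris.data[:, 0]
petal_length = iris.data[:, 2]
plt.scatter(sepal_length, petal_length, c=iris.target)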

(We can see that the different classes seem to be more separable than when using the previous features but that the green and grey species still overlap.)

The gamma parameter makes most intuitive sense when we think about the RBF (or Gaussian) kernel. As I mentioned above, the influence of each Gaussian bump dissipates as you move further from its support vector. The gamma parameter determines how quickly this dissipation happens: the larger the value, the faster the fall-off, and the smaller the region each individual support vector affects.

The above plot makes increasing gamma seem like a great idea. After all, if we could just shrink the influence of each support vector enough, we could carve out nicely delineated boundaries around each cluster:

clf = SVC(kernel='rbf', C=10, gamma=100) 
clf.fit(np.c_[sepal_length, petal_length], iris.target)
create_grid_plot(clf, sepal_length, petal_length)

99.3% accuracy isn’t bad, is it?

It’s important to remember, however, that we haven’t split the data into training and test sets yet, and we’ve made no assumptions about future data. If we choose a high gamma value like this for our final model, we need to be sure that all future instances will fall into the tiny regions we’ve delineated.

As a sanity check, let’s do 500 rounds of random sampling and assess the stability of our model:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

scores = []
for i in range(500):
    # random train/test split, fit, and score on the held-out portion
    X_train, X_test, y_train, y_test = train_test_split(np.c_[sepal_length, petal_length], iris.target)
    clf = SVC(kernel='rbf', C=10, gamma=100)
    clf.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))
plt.hist(scores)

Looking at this histogram, we can see that the 99.3% accuracy we achieved when using all of the data never occurs on the smaller samples. Most of the scores fall into a range similar to the models without gamma tuning, and many of them fare significantly worse.

Refining Gamma

Let’s try a more reasonable value for gamma. Here’s what we get when we set the parameter equal to 5:
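(Again, the code isn’t shown in the original, but it presumably mirrors the gamma=100 block with the new value:)

clf = SVC(kernel='rbf', C=10, gamma=5)
clf.fit(np.c_[sepal_length, petal_length], iris.target)
create_grid_plot(clf, sepal_length, petal_length)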

The influence of the support vectors reaches further with the severely reduced gamma and in return we have scores that do much better, on average, than those of the untuned models. Moreover, we actually received multiple test scores in the 97.5–100% accuracy bin.

Let’s severely reduce the gamma parameter again, this time to 0.001, and see what happens to our decision boundaries.
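(As before, a sketch of the corresponding code:)

clf = SVC(kernel='rbf', C=10, gamma=0.001)
clf.fit(np.c_[sepal_length, petal_length], iris.target)
create_grid_plot(clf, sepal_length, petal_length)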

The model’s score has dropped significantly, and the feature space is now divided into larger, more cleanly delineated regions. On the other hand, the 500 iterations show that performance is improved, on average, compared with the untuned models (which hovered around 80%).

Scikit-Learn’s current default setting for gamma is ‘auto’. This takes the value 1 / n_features, which in our case would be 1/2, or 0.5. In later versions of sklearn the default will be ‘scale’, which takes the value 1 / (n_features * X.var()); in our case that works out to around 0.168. Let’s try the ‘scale’ approach:

1 / (2 * np.c_[sepal_length, petal_length].var()) 
>>> 0.16804089263919647
clf = SVC(kernel='rbf', C=10, gamma=(1 / (2 * np.c_[sepal_length, petal_length].var())))
clf.fit(np.c_[sepal_length, petal_length], iris.target)
create_grid_plot(clf, sepal_length, petal_length)

This gives us a 96% accuracy and doesn’t restrict the classification areas. The histogram shows that the subsamples also give good accuracy. The ‘scale’ setting, then, seems to be an excellent choice for the Iris dataset.

Summary

As with any parameter tuning, you shouldn’t look just for the changes that bring you the highest accuracy, but rather for those that make the most sense for the problem you’re working on.

You’ve got to assess, using your own experience, whether:

  • Future observations are likely to be clustered together
  • Classes should have large margins between them
  • The apparent curves in the decision boundaries are due to random chance or measurement error

I hope that you’ve found this exploration of SVM parameters helpful. We’ve covered the different kernel types, pairwise functions, the penalty parameter, and the purpose of gamma. All of which should help you refine and tune your future support-vector machine models.

If you have any questions, please leave a comment.

Thanks for reading!
