
The goal of this article is to visualize the effect of each hyperparameter and kernel function of a SVM in order to better understand how they shape the resulting models.
First, a brief introduction to SVMs is presented, followed by the dataset. For the purpose of this project, the dataset is a simple one with just two features so it can be easily visualized in two dimensions. The dataset is divided into a training and a validation set. Then, for each kernel function, a visualization is presented showing the effect of its different hyperparameters.
You can find the source code in this Notebook.
Table of Contents
- Introduction to SVMs
- Dataset
- Popular kernel functions
- Conclusions
- Acknowledgments
Introduction to SVMs
Support Vector Machines are supervised Machine Learning models used for classification (or regression) tasks. In the case of binary classification, there is a dataset made of 𝑛 observations, each observation consisting of a vector 𝑥𝑖 of 𝑑 dimensions and a target variable 𝑦𝑖 that can be either −1 or 1 depending on whether the observation belongs to one class or the other.
Using this data, a SVM learns the parameters of a hyperplane, 𝑤⋅𝑥−𝑏=0, that separates the space into two parts: one for the observations of one class and the other for the other class. Furthermore, among all possible hyperplanes that separate both classes, a SVM learns the one that separates them the most, that is, leaving as much distance/margin as possible between each class and the hyperplane.
To learn the parameters of the hyperplane, a SVM tries to maximize the margin between the observations, with the additional constraint that observations of different classes have to be on different sides of the hyperplane.
This article will not get into much detail about the math behind how SVMs get these parameters. However, to provide some intuition that may be useful, it is interesting to note that training data, 𝑥, will only appear as part of scalar products when computing the parameters of the hyperplane.
Thanks to this observation, SVMs can take advantage of what is called a kernel function. This is a function that returns the same result as the scalar product between two vectors in some transformed space, but without needing to compute the coordinates of those vectors in that space. This is very useful because it lets us simulate mapping 𝑥 to a higher dimension without ever computing that mapping explicitly, which is computationally cheaper (this is called the kernel trick).
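For instance, here is a quick numerical check of that equivalence for a degree-2 polynomial kernel (an illustrative Python sketch, not taken from the article's Notebook): the scalar product computed in an explicit 3-dimensional feature space equals the kernel evaluated directly in the original 2-dimensional space.

```python
import numpy as np

# Two 2-dimensional points
x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Explicit feature map to 3 dimensions: phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2)
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

explicit = phi(x) @ phi(y)   # scalar product in the higher-dimensional space
kernel = (x @ y) ** 2        # degree-2 polynomial kernel in the original space

print(explicit, kernel)      # both equal 16.0 (up to floating-point error)
```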
Thanks to this, SVMs can handle data that is not linearly separable. Instead of trying to fit some complicated function to the data, a SVM moves to a higher dimension where the data may be linearly separable and finds the separating hyperplane there (back in the original dimension, the separation will no longer look linear).
It is also worth noting that, because of this, SVMs do not store the parameters of the hyperplane explicitly. Instead, they remember the 𝑥's needed to define the hyperplane and, when new input data arrives, they compute the scalar products between these 𝑥's, called Support Vectors, and the input. Again, this keeps everything expressed as scalar products so that kernel functions can be used.
Another interesting thing to note is that a perfect separation between two classes often does not exist. This is a problem for SVMs because, during training (calculating the hyperplane parameters), they try to maximize the margin between classes under the constraint that observations of one class must be on one side of the hyperplane and observations of the other class on the other side. To help with this, SVMs use a "soft margin": a hyperparameter relaxes the constraint by regulating how many observations are allowed to cross the margin.
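For reference, this relaxation is usually written with slack variables ξ𝑖 and a hyperparameter 𝐶 that sets the price of crossing the margin (this is the standard soft-margin formulation, not something specific to this article's Notebook):

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad\text{subject to}\quad
y_i\,(w \cdot x_i - b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

A small 𝐶 tolerates margin violations (a softer margin), while a large 𝐶 penalizes them heavily (a harder margin); this is the 𝐶 hyperparameter explored in the visualizations below.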
Dataset
The dataset is a modification of the iris dataset to have just two dimensions so it can be easily visualized. In particular, it consists of sepal length and width for setosa and versicolor iris flowers.
The next figure shows how this dataset was divided into training and validation sets. Furthermore, the data has been preprocessed to have a mean of 0 and unit variance. This is a good practice because SVMs use scalar products, so if a variable has a larger range of values than the rest, it will dominate the result of the scalar product.
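As a rough sketch of how this preprocessing could be done with scikit-learn (the split ratio and random seed are assumptions; the Notebook may differ):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
mask = iris.target < 2                # keep setosa (0) and versicolor (1) only
X = iris.data[mask][:, :2]            # sepal length and sepal width
y = iris.target[mask]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Standardize to mean 0 and unit variance, using training statistics only
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
```

Note that the scaler is fitted on the training set only and then applied to both sets, so no information from the validation data leaks into training.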

Popular kernel functions
This section introduces the four most common kernel functions, each with a visualization of the effect that its hyperparameters have on the resulting SVM.
Linear kernel function

This is the simplest kernel function because it is equivalent to not using any kernel function at all: it directly computes the scalar product between the inputs. It does not add any extra hyperparameters to the SVM, which makes it perfect for observing the effect of the hyperparameter 𝐶 that regulates the margin.
The next plots show the result of training the SVM with a linear kernel on the training dataset.

The background color represents the decision of the SVM. The training dataset is shown as points in the plane, with their class also encoded by color. Highlighted points are the Support Vectors, that is, the data points that define the hyperplane. Dashed lines mark the margin of the SVM. Above each plot you can find the R² score of that SVM on the validation dataset and the value of the hyperparameter used.
As seen in the plots, the effect of incrementing the hyperparameter 𝐶 is to make the margin tighter and, thus, fewer Support Vectors are needed to define the hyperplane.
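A minimal sketch of this experiment, reusing the X_train/X_val arrays from the dataset sketch above (the particular 𝐶 values are illustrative, not necessarily the ones in the plots):

```python
import numpy as np
from sklearn.svm import SVC

for C in (0.01, 1, 100):
    model = SVC(kernel="linear", C=C).fit(X_train, y_train)
    w = model.coef_[0]
    margin = 2 / np.linalg.norm(w)    # distance between the two dashed lines
    print(
        f"C={C}: margin width={margin:.2f}, "
        f"support vectors={model.n_support_.sum()}, "
        f"validation score={model.score(X_val, y_val):.2f}"
    )
```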
RBF Kernel Function (Radial Basis Function)

This function adds an extra hyperparameter to tune, 𝛾. Unlike the linear kernel function, this one does map the data to a higher dimension and, as can be seen in the plots below, the SVM can now represent a non-linear separation.

From top to bottom we can see the effect of incrementing the hyperparameter 𝐶, which affects the margin of the SVM as seen previously.
On the other hand, from left to right, we can see the effect of incrementing 𝛾: lower values result in separations that look almost linear and, as 𝛾 increases, the separations become more complex.
With a sufficiently high 𝛾, every observation of the training set becomes a Support Vector. In other words, every training point is used to define the hyperplane, which indicates clear overfitting. Moreover, the more Support Vectors used, the more computationally expensive the SVM is and the more time it needs to make a prediction.
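One way to see this overfitting numerically is to count the Support Vectors as 𝛾 grows, as in the sketch below (illustrative values, reusing the arrays from the dataset sketch):

```python
from sklearn.svm import SVC

for gamma in (0.1, 1, 10, 100):
    model = SVC(kernel="rbf", C=1, gamma=gamma).fit(X_train, y_train)
    print(
        f"gamma={gamma}: support vectors={model.n_support_.sum()} "
        f"out of {len(X_train)} training points, "
        f"validation score={model.score(X_val, y_val):.2f}"
    )
```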
Finally, it might be interesting to note that, looking at the separations in the original dimension of the data, they have a Gaussian shape, just as the kernel function formula itself is a Gaussian.
Sigmoid Kernel Function

This function has two extra hyperparameters: 𝑟, and another one also called 𝛾 which, as can be seen below, again affects the complexity of the separation.

This time the 𝐶 hyperparameter has been fixed because, as shown above, it affects the margin. From top to bottom, we can see the effect of varying 𝛾, which makes the separations more complex as it increases, just like with the RBF kernel function.
The 𝑟 hyperparameter doesn't have a very clear visual effect for small changes. However, looking at the function definition, tanh(𝛾 𝑥𝑖⋅𝑥𝑗 + 𝑟), the effect of this constant 𝑟 is to shift the hyperbolic tangent:

Large positive or negative values of the 𝑟 constant dominate the result, making it harder for the scalar product to have an impact on the output of the function.
If we interpret the scalar product as a measure of similarity between two vectors, kernel functions also represent similarities. So we could interpret this effect of 𝑟 as biasing the result, which might be useful if we know that one class should be preferred over the other.
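For completeness, in scikit-learn the sigmoid kernel is tanh(𝛾 𝑥𝑖⋅𝑥𝑗 + coef0), so the constant 𝑟 corresponds to the coef0 parameter; a small sketch with illustrative values:

```python
from sklearn.svm import SVC

for r in (-1, 0, 1):
    model = SVC(kernel="sigmoid", C=1, gamma=1, coef0=r).fit(X_train, y_train)
    print(f"r (coef0)={r}: validation score={model.score(X_val, y_val):.2f}")
```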
Polynomial Kernel Function

This last kernel function has three hyperparameters: a factor 𝛾 that scales the scalar product, a constant 𝑟, and the degree of the polynomial, 𝑑.

From top to bottom, it can be seen that incrementing the degree of the polynomial and 𝛾 makes the separations more complex.
From left to right, it can be seen that the effect of varying 𝑟 looks similar to what happens with 𝑟 in the sigmoid kernel, at least for lower degrees of the polynomial.
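In scikit-learn the polynomial kernel is (𝛾 𝑥𝑖⋅𝑥𝑗 + coef0)^degree, so 𝑑, 𝛾 and 𝑟 map to the degree, gamma and coef0 parameters respectively; a small sketch with illustrative values:

```python
from sklearn.svm import SVC

for degree in (2, 3, 5):
    for r in (0, 1):
        model = SVC(kernel="poly", degree=degree, gamma=1, coef0=r, C=1)
        model.fit(X_train, y_train)
        print(
            f"degree={degree}, r (coef0)={r}: "
            f"validation score={model.score(X_val, y_val):.2f}"
        )
```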
Conclusions
This article started with a brief introduction to SVMs, followed by the dataset used. Then, it showed the visual effect, in the dimension of the original data, of varying the hyperparameters of the most common kernel functions. Following the visual intuition, these effects fall into two categories: they affect the margin of the SVM, the non-linearity of the separation, or both.
It is commonly advised to use the RBF kernel as the "go-to" kernel function, since it has just two hyperparameters to tune and it can model non-linear separations. But, as with many things, it might not be the perfect kernel function for every task: depending on the data, there could be a kernel function that models the separation between the classes better. Hopefully this article gives readers an intuition to help choose which kernel function to use and/or to better understand SVMs 🙂
Acknowledgments
Special thanks to Angel Igareta for reviewing this article before publishing it 😊
Thank you for reading this article! 😄 Any feedback? It would be much appreciated 🙂 Feel free to share your thoughts here in the comments, on the Notebook, or even on LinkedIn!
Have a nice day! 🙂