
In-Depth Support Vector Machines (SVMs) for Linear & Non-linear Classification & Regression

A deeper understanding of how SVMs work behind the scenes

Photo by vackground.com on Unsplash

A support vector machine is a versatile machine-learning algorithm mainly used for linear and non-linear classification; it can also be used for linear and non-linear regression.

It falls under the supervised learning category, which requires both a feature matrix and a label vector to train the model.

The main objective of support vector machines is to find a hyperplane that separates data points into two or more groups (classes) in the best possible manner.


Definition of a hyperplane in different dimensions

In Machine Learning, a hyperplane is a linear decision boundary whose dimension is one less than the dimension of the data. If the data is plotted in an N-dimensional space, the hyperplane has N-1 dimensions.

Examples:

  • In a one-dimensional (1D) space (line), the hyperplane is a point (0D).
Hyperplane is a point in the 1D space (Image by author)
  • In a two-dimensional (2D) space, the hyperplane is a line (1D).
Hyperplane is a line in the 2D space (Image by author)
  • In a three-dimensional (3D) space, the hyperplane is a plane (2D).
Hyperplane is a plane in the 3D space (Image by author)

Beyond 3D space (i.e. 4D, 5D, and so on), it is impossible to visualize the hyperplane, but it still exists and has N-1 dimensions!


The maximum separability or large margin concept

It is possible to draw different hyperplanes for a group of linearly separable data. Each such hyperplane can separate the data into distinct groups.

Two possible hyperplanes that can separate data (Image by author)

However, an SVM tries to find the hyperplane that has the margins (dotted lines) with the largest possible width (here, d1).

Hyperplanes with margins shown (Image by author)

Therefore, the best hyperplane is the one with the widest margin.

Each margin should touch the closest data point(s) of each class. The hyperplane lies in the middle of the two margins. The data points that lie on the two margins (dotted lines) are called support vectors (hence the name, support vector machines).

Once the best hyperplane is found, the algorithm uses it to predict the classes of future data.


The math behind finding the best possible hyperplane (optional)

The hyperplane can be expressed in terms of the weight and bias parameters and the features (denoted by x1, x2, …). To get the best possible hyperplane, we should look for the parameter values that give the widest margin.

For ease of calculation, we only consider a linear SVM in 2-dimensional space. However, the calculations remain valid in higher-dimensional spaces and also for non-linear forms of SVMs.

Now, consider the following diagram.

SVM math explained (Image by author)

The function g(x) in 2-dimensional space can be defined by:

g(x) = w1.x1 + w2.x2 + b

In the n-dimensional space, this can be represented with:

 g(x) = w1.x1 + w2.x2 + ... + wn.xn + b

Mathematically,

  • g(x) = 0 gives the equation for the hyperplane.
  • g(x) = 1 gives the equation for the right margin line
  • g(x) = -1 gives the equation for the left margin line

In addition to that, when

  • g(x) > 1, points are classified as "Red"
  • g(x) < -1, points are classified as "Blue"
  • -1 < g(x) < 1, points lie inside the margin, between the two margin lines (that didn’t happen in this case)

The perpendicular distance (denoted by d) from point A (a support vector) to the hyperplane is given by:

d = 1 / L2-norm of the weight vector

The weight vector W (capital W) can be represented as:

W = (w1, w2)

This can be extended to the n-dimensional space as:

W = (w1, w2, ..., wn)

The L2-norm of W is the length or magnitude of W, denoted by || W ||. Mathematically, it is:

|| W || = sqrt(w1² + w2² + ... + wn²)

This looks like the Pythagorean theorem!

Since d = 1 / || W ||, the total margin, 2d, is:

Total margin = 2 / || W ||

To maximize the total margin, we want to minimize || W ||.

Training a linear SVM means finding the optimal values of w1, w2, and b using the training data. Those optimal values give the equation for the best possible hyperplane, the one with the widest margin.
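
As a quick numerical check, here is a minimal NumPy sketch that evaluates g(x), || W || and the total margin for a hypothetical set of parameters (w1 = 3, w2 = 4, b = -5 are made-up values, not learned from data):

import numpy as np

# Hypothetical weights and bias, chosen only for illustration (not learned from data)
W = np.array([3.0, 4.0])   # (w1, w2)
b = -5.0

def g(x):
    # g(x) = w1*x1 + w2*x2 + b
    return np.dot(W, x) + b

l2_norm = np.linalg.norm(W)    # || W || = sqrt(3^2 + 4^2) = 5.0
total_margin = 2 / l2_norm     # 2 / || W || = 0.4

print(g(np.array([2.0, 1.0])))  # 5.0 -> g(x) > 1, so this point would be classified as "Red"
print(l2_norm, total_margin)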


Linear classification with SVMs

Support vector machines can be used for both linear and non-linear classification.

In linear classification, we say classes are linearly separable if we can separate them with a hyperplane (linear decision boundary). The dimension of data doesn’t matter here! The dataset that includes linearly separable classes is called linearly separable data.


Non-linear classification with SVMs

Conversely, we say classes are linearly inseparable if we cannot separate them with a hyperplane (linear decision boundary). The dataset that includes linearly inseparable classes is called linearly inseparable data.

We need to follow special methods to perform non-linear classification with SVMs.

There are three main methods:

  • Softening margins: Intentionally allowing a few points to lie inside the margin (or even be misclassified) by increasing the width of the margin.
  • Using the kernel trick: Adding extra dimension(s) to the data to transform linearly inseparable data into a linearly separable form.
  • Using non-linear kernel functions: Using special functions to transform non-linear data into linearly separable form by creating flexible (non-linear) decision boundaries.

Softening margins

By softening margins, we can still use a linear decision boundary (linear hyperplane) to separate data into different groups, even if the data seems to be non-linear! We do that by increasing the width of the margin. This is called softening the margin. The opposite is hardening the margin, which means decreasing the width of the margin. You may have heard the terms soft and hard margin classification, which mean essentially the same thing.

When softening margins, some data points will lie inside the margin or even be misclassified. The advantage is that we can still use a linear hyperplane to classify data points instead of a more complex non-linear decision boundary. This also helps avoid overfitting and keeps the model simpler.

Soft margin vs hard margin classification (Image by author)

Technically, we can soften the margins by decreasing the value of the C hyperparameter (see the section on C hyperparameter below).
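
As a rough sketch of this effect, the snippet below trains a linear SVC on a small synthetic dataset (make_blobs is just an illustrative choice, not part of the article) with a low and a high value of C, and prints the resulting margin width 2 / || W ||:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC
import numpy as np

# Synthetic 2D data with two classes (hypothetical example)
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

for C in (0.01, 100):
    clf = SVC(C=C, kernel='linear').fit(X, y)
    w = clf.coef_[0]
    print(f"C={C}: margin width = {2 / np.linalg.norm(w):.3f}")
# A small C (softer margin) typically produces a wider margin than a large C.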

Using kernel trick in SVMs

The kernel trick in machine learning is the process of transforming linearly inseparable data into a linearly separable form by adding extra dimension(s) to the original data, without explicitly computing the coordinates in the higher dimension. After the transformation, the data becomes linearly separable with a linear hyperplane in the higher dimension.

Kernel trick - Transforming linearly inseparable data into linearly separable form (Image by author)

In SVMs, this type of transformation can be achieved using a linear kernel function.
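
To make the idea of adding an extra dimension concrete, here is a minimal sketch that does the mapping explicitly (the kernel trick does this implicitly, without computing the new coordinates). One-dimensional points x are mapped to (x, x²); after the mapping, the two classes can be separated by a straight line:

import numpy as np

# 1D data: class 0 in the middle, class 1 on both sides (not linearly separable in 1D)
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Add an extra dimension: x -> (x, x^2)
X_mapped = np.column_stack([x, x ** 2])

# In the new 2D space, the horizontal line x2 = 2.5 separates the two classes
print((X_mapped[:, 1] > 2.5).astype(int))  # [1 1 0 0 0 1 1], which matches y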

Using kernel functions in SVMs

The SVMs that we have discussed so far work with linear decision boundaries (linear hyperplanes).

However, there are some cases in which all points cannot be separated by a linear decision boundary. To handle such cases, we need to use non-linear decision boundaries to separate data points into different classes.

Non-linear kernel functions such as the RBF and polynomial kernels perform the same kernel trick, but they also allow flexible (non-linear) decision boundaries.

There are two main types of non-linear kernels used in SVMs.

RBF kernel

RBF is short for Radial Basis Function. It is one of the non-linear kernel functions used in SVMs. The RBF kernel assigns a similarity value to each data point based on its Euclidean distance from a reference (landmark) point. If the distance is large, the kernel assigns a smaller value, and vice versa.

Gamma is the kernel coefficient for the RBF kernel. It controls the level of overfitting: the higher the value, the more the model tends to overfit the data. In other words, the decision boundary becomes more flexible with higher values of Gamma.

In SVMs, the value of Gamma is specified as a hyperparameter. It takes a non-negative float.
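
Behind the scenes, the RBF kernel computes K(x, x') = exp(-Gamma * ||x - x'||²). Here is a minimal sketch that evaluates this formula with NumPy and cross-checks it against scikit-learn's rbf_kernel helper (the specific points and Gamma value are made up for illustration):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x1 = np.array([1.0, 2.0])
x2 = np.array([2.0, 0.0])
gamma = 0.5

# exp(-gamma * squared Euclidean distance) = exp(-0.5 * 5) ≈ 0.082
manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
from_sklearn = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=gamma)[0, 0]

print(manual, from_sklearn)  # the two values match; larger distances give smaller kernel values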

Polynomial kernel

The other type of non-linear kernel function is the polynomial kernel. It has two parameters: Gamma and Degree.

Gamma is the same as the one we discussed under the RBF kernel.

The other parameter defines the degree of the polynomial function. Mathematically, a polynomial kernel of degree 1 is equivalent to the linear kernel. So, we avoid setting the degree parameter to 1 if we want to perform non-linear classification.

Higher values such as 3 or 4 define higher-degree polynomial kernels, which can create more flexible decision boundaries. The flexibility increases with higher values of the degree parameter.

In SVMs, the value of Degree is specified as a hyperparameter. It takes a non-negative integer.
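
For reference, scikit-learn's polynomial kernel computes K(x, x') = (Gamma * x·x' + coef0)^degree, where coef0 is an extra parameter not covered above (it defaults to 1 in the helper used here). Below is a minimal sketch checking the formula against sklearn's polynomial_kernel helper, with made-up inputs:

import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x1 = np.array([1.0, 2.0])
x2 = np.array([2.0, 0.0])
gamma, degree, coef0 = 1.0, 3, 1.0

manual = (gamma * np.dot(x1, x2) + coef0) ** degree   # (1*2 + 1)^3 = 27
from_sklearn = polynomial_kernel(x1.reshape(1, -1), x2.reshape(1, -1),
                                 degree=degree, gamma=gamma, coef0=coef0)[0, 0]

print(manual, from_sklearn)  # both give 27.0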


C hyperparameter

In SVMs, there is a hyperparameter called C, which controls the strength of the regularization term applied to the loss function when training an SVM. The strength of the regularization is inversely proportional to C, and the penalty is a squared L2 penalty for both SVM classification and regression.

C takes a positive float. The default value is 1.0.

When using C with a linear kernel, it controls the width of the margin. When the value of C is high, the margin becomes narrow, and the model tries to keep all points off the margin. A lower value of C allows a wider margin, but some points will end up inside the margin or even be misclassified.

When using C with a non-linear kernel such as RBF, it controls the smoothness of the decision boundary. The lower the value of C, the smoother the decision boundary becomes.
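
One rough way to see this effect is to count the support vectors kept by an RBF-kernel SVC for different values of C (a lower C usually keeps more support vectors and gives a smoother boundary). The dataset below (make_moons) is just an illustrative choice:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic, non-linearly separable data (hypothetical example)
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for C in (0.1, 1.0, 100.0):
    clf = SVC(C=C, kernel='rbf').fit(X, y)
    print(f"C={C}: number of support vectors = {clf.n_support_.sum()}")
# Typically, the support-vector count shrinks as C grows and the boundary fits the data more tightly.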


Linear and non-linear regression with SVMs

As mentioned earlier, SVMs also support linear and non-linear regression. In classification, the objective is to find the hyperplane whose margin is as wide as possible while keeping training instances off the margin as much as possible.

We can achieve regression tasks with SVMs by reversing this objective. In SVM regression, the objective is to fit as many data points as possible inside the margin while limiting the instances that fall outside it.

The width of the margin is controlled by two hyperparameters:

  • C: A lower C value allows a wider margin and a higher value allows a narrower margin.
  • epsilon: A lower epsilon value allows a narrower margin and a higher value allows a wider margin.

In linear SVM regression, the margins will be linear. In non-linear SVM regression, the margins are flexible as they can fit complex non-linear patterns.
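
A minimal sketch of this objective: train an SVR on noisy synthetic data (the data-generating code here is a made-up example) and count how many training points fall within epsilon of the prediction for two values of epsilon.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = 2 * X.ravel() + rng.normal(scale=0.5, size=80)   # noisy linear target

for eps in (0.1, 1.0):
    reg = SVR(C=1.0, kernel='linear', epsilon=eps).fit(X, y)
    inside = np.abs(y - reg.predict(X)) <= eps
    print(f"epsilon={eps}: {inside.mean():.0%} of training points lie inside the margin")
# A larger epsilon widens the margin (tube), so more points fit inside it.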


Using Scikit-learn for training SVMs

Scikit-learn provides multiple ways to implement SVMs with Python. Here are examples.

For linear classification

  • The most common way to perform SVM linear classification is to use the SVC class with a linear kernel. SVC stands for Support Vector Classification.
from sklearn.svm import SVC
linear_clf = SVC(C=1.0, kernel='linear')
  • Another way to perform SVM linear classification is to use the SVC class with a polynomial kernel of degree 1.
from sklearn.svm import SVC
linear_clf = SVC(C=1.0, kernel='poly', degree=1)
  • The LinearSVC class performs SVM linear classification almost the same way as SVC with the parameter kernel='linear'.
from sklearn.svm import LinearSVC
linear_clf = LinearSVC(C=1.0)

Out of the three options, I recommend using the first one as it has a built-in kernel trick for linear transformations!
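
For completeness, here is a minimal usage sketch showing the classifier being trained and evaluated (the make_blobs dataset and the default C=1.0 are illustrative choices, not requirements):

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data for demonstration
X, y = make_blobs(n_samples=200, centers=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linear_clf = SVC(C=1.0, kernel='linear')
linear_clf.fit(X_train, y_train)
print(linear_clf.score(X_test, y_test))   # mean accuracy on the test set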

For non-linear classification

Non-linear SVM classification can be performed by using a non-linear kernel function with the SVC class. There are two options.

  • Using the RBF kernel
from sklearn.svm import SVC
non_linear_clf = SVC(C=1.0, kernel='rbf')
  • Using the polynomial kernel with a degree of 2 or higher
from sklearn.svm import SVC
non_linear_clf = SVC(C=1.0, kernel='poly', degree=3)

For linear regression

Linear SVM regression can be performed in two ways.

  • Using the SVR class with a linear kernel. SVR stands for Support Vector Regression.
from sklearn.svm import SVR
linear_reg = SVR(C=1.0, kernel='linear', epsilon=0.1)
  • The LinearSVR class performs SVM linear regression almost the same way as SVR with the parameter kernel='linear'.
from sklearn.svm import LinearSVR
linear_reg = LinearSVR(C=1.0, epsilon=0.0)

For non-linear regression

Non-linear SVM regression can be performed by using a non-linear kernel function with the SVR class. There are two options.

  • Using the RBF kernel
from sklearn.svm import SVR
non_linear_reg = SVR(C=1.0, kernel='rbf', epsilon=0.1)
  • Using the polynomial kernel with a degree of 2 or higher
from sklearn.svm import SVR
non_linear_reg = SVR(C=1.0, kernel='poly', degree=2, epsilon=0.1)

Effect of feature scaling in SVMs

SVMs are sensitive to the relative scales of features. We need to scale the features if they are not measured on similar scales. After scaling, the margins (and the hyperplane) will be positioned much better.
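
A common way to do this is to put a scaler and the SVM into a single pipeline, so the scaling learned from the training data is reused at prediction time. Here is a minimal sketch (make_classification is just an illustrative dataset choice):

from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# StandardScaler standardizes each feature; the SVC then sees features on comparable scales
svm_clf = make_pipeline(StandardScaler(), SVC(C=1.0, kernel='rbf'))
svm_clf.fit(X, y)
print(svm_clf.score(X, y))   # training accuracy, just to confirm the pipeline runs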


This is the end of today’s article.

Please let me know if you have any questions or feedback.

How about an AI course?

Join my private list of emails

Never miss a great story from me again. By subscribing to my email list, you will directly receive my stories as soon as I publish them.

Thank you so much for your continuous support! See you in the next article. Happy learning to everyone!

Designed and written by: Rukshan Pramoditha

2024–05–19

