Learn the Preliminary Details Behind Support Vector Machines

With Examples in R

Robby Sneiderman
Towards Data Science


Vlad Shapochnikov via Unsplash

Introduction:

Support Vector Machines are a popular tool used in several branches of Machine Learning. In particular, they are extremely useful for binary classification. Support Vector Machines have their basis in the concept of separating hyperplanes, so it is useful to first be introduced to this concept.

In this article, I introduce the method of classification via separating hyperplanes. We start off simple and describe how even linear regression can be used to make simple binary classifications. We then move on to using separating hyperplanes and ‘optimal’ separating hyperplanes. Finally, with this background, we are in the position to introduce support vectors and the powerful Support Vector Machine. We also include examples using R, and discuss how one can fit kernel based algorithms in R using the kernlab (Kernel Based Machine Learning) package.

Much of the mathematical detail, and the toy example datasets, in this article come from The Elements of Statistical Learning (2nd edition, Hastie et al.). The data and more information on the subject can be found on the textbook's website. The data can be downloaded from the Data tab of that page, where it is called “ESL.mixture”.

Let’s start with a simple example. You can load the following toy dataset into R from this link. The dataset consists of observations, each a pair of coordinates (x1, x2). Each observation also has a binary outcome, y, which we denote in the plots by colour: red for y = 1 and blue for y = 0.
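A minimal sketch of loading and plotting the data is shown below. It assumes you have downloaded the file ESL.mixture.rda from the textbook site into your working directory, and that the loaded object contains a 200 x 2 matrix x of coordinates and a 0/1 class vector y (this matches the object on the ESL website, but check names(ESL.mixture) on your copy).

# Load the mixture data (assumes ESL.mixture.rda is in the working directory)
load("ESL.mixture.rda")
x1 <- ESL.mixture$x[, 1]
x2 <- ESL.mixture$x[, 2]
y  <- ESL.mixture$y

# Plot the two classes: blue for y = 0, red for y = 1 (Figure 1)
plot(x1, x2, col = ifelse(y == 1, "red", "blue"),
     pch = 19, xlab = "x1", ylab = "x2")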

Figure 1: A Toy example of data from an unknown distribution (x1,x2) where the class is indicated by colour (blue y=0, or red y=1). Image Citation: Created in R by author.

Figure 1 is a plot of the toy example dataset. It appears that red and blue points are somewhat separated, but clearly there would be no way to perfectly separate the classes with just a simple line.

Suppose we were given a new observation with coordinates (x1,x2). How should we classify this new point? Clearly, the location of the coordinates will help inform our choice of assigning our new observation to red or blue.

One simple method would be to fit a linear regression to the data and classify the new point based on whether its fitted value lies above or below 0.5.

If we want to classify based on the 0.5 threshold (the midpoint between the two class labels), we can fit the linear regression and solve for x2 in terms of x1, as in the code and plot below.
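A sketch of that fit, continuing from the variables loaded above (this is illustrative rather than the article's original code):

# Linear regression of the 0/1 response on the two coordinates
fit_lm <- lm(y ~ x1 + x2)
b <- coef(fit_lm)   # named: "(Intercept)", "x1", "x2"

# The decision boundary is where the fitted value equals 0.5:
#   b0 + b1 * x1 + b2 * x2 = 0.5  =>  x2 = (0.5 - b0 - b1 * x1) / b2
plot(x1, x2, col = ifelse(y == 1, "red", "blue"),
     pch = 19, xlab = "x1", ylab = "x2")
abline(a = (0.5 - b["(Intercept)"]) / b["x2"],
       b = -b["x1"] / b["x2"], lwd = 2)

# Classify a new point as red (y = 1) if its fitted value exceeds 0.5
new_point <- data.frame(x1 = 1, x2 = 0.5)
predict(fit_lm, new_point) > 0.5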

Figure 2: The same toy example of data from an unknown distribution (x1,x2) where the class is indicated by colour (blue y=0, or red y=1). We fit a linear regression to the response and classify new points based on whether they fall above or below the line. Image Citation: Created in R by author.

We could then classify a point based on whether it was located above (red) or below (blue) the regression line.

But is this the optimal way to make classifications? We don’t know the true underlying distribution, so we can’t say definitively, but in practice it is extremely unlikely that such a simple classifier is ideal.

We come back to this toy dataset later on, but for now, we move to the more general problem.

Separating Hyperplanes:

As we noted for the toy example dataset, no line can perfectly separate our two classes. But consider a simpler case, where such a line does exist.

Figure 3: A different simple example dataset. Shown in orange is the linear regression line and in blue are two separating hyperplanes found using the perceptron algorithm. Image Citation: This is Figure 4.14 in the Elements of Statistical Learning II.

Let's go back to Rosenblatt (1958) and the perceptron. The perceptron algorithm can be used to find a separating hyperplane (provided one exists); it finds the plane using stochastic gradient descent.

Figure 3 shows three lines fit to a simple two-class dataset. The orange line is the least squares solution, and it does not even separate the two classes. The two blue lines both perfectly separate the classes, but they are not identical: the perceptron algorithm stops as soon as it finds one solution, and with random starting points it can end up at different solutions.

The perceptron algorithm is better than simple linear regression in that it can actually separate the classes (when that is possible). However, it still has problems. For one, it tells us nothing about the many different lines it can find (are some ‘better’ than others?). And if no separating hyperplane exists, the algorithm does not converge.
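To make the idea concrete, here is a minimal perceptron sketch in R. It illustrates the update rule rather than Rosenblatt's original formulation or the textbook's code: the coefficients are nudged whenever a point is misclassified, the visiting order is random (so different runs can return different separating lines, as in Figure 3), and the loop only terminates early if the classes are linearly separable.

perceptron <- function(X, y, rate = 1, max_epochs = 1000) {
  # X: n x p matrix of predictors; y: class labels coded as -1 / +1
  w <- rep(0, ncol(X))
  b <- 0
  for (epoch in seq_len(max_epochs)) {
    mistakes <- 0
    for (i in sample(nrow(X))) {        # stochastic: visit points in random order
      if (y[i] * (sum(w * X[i, ]) + b) <= 0) {
        w <- w + rate * y[i] * X[i, ]   # update only on misclassified points
        b <- b + rate * y[i]
        mistakes <- mistakes + 1
      }
    }
    if (mistakes == 0) break            # a separating hyperplane has been found
  }
  list(w = w, b = b)                    # boundary: w[1]*x1 + w[2]*x2 + b = 0
}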

Optimal Separating Hyperplanes:

We saw that the perceptron can find a separating hyperplane, provided one exists, but it tells us nothing special about the one it happens to converge on. The next big improvement came from Vapnik (1996) and forms the basis of the support vector machine. Vapnik described how to find not just any separating hyperplane but the unique optimal separating hyperplane, the ‘perceptron of optimal stability’. This has been shown to outperform naive hyperplane separation and often results in lower test error.

Figure 4: The same simple dataset as in Figure 3, fit with the optimal separating hyperplane. The three blue squares are points that lie on the boundary of the margin, and are known as support vectors. Image Citation: Figure 4.16 in The Elements of Statistical Learning II.

Figure 4 illustrates the optimal separating hyperplane classifier fit to the same simple dataset. Three points lie exactly on the boundary of the margin; these are the support points (support vectors). The red line is the logistic regression solution, which is very similar, although not identical.

Support Vectors:

Figure 4 was a simple way to visualise the basic support vector machine. We now describe the details surrounding the support vector classifier, and what can be done when there is no way to perfectly separate the data.

In simple terms, SVMs find an optimal hyperplane to separate distinct classes. In the most basic case, this means finding the line that best separates two classes, where ‘best separates’ means the margin is maximised. The margin is the distance from the decision boundary to the closest point(s) of each class. New observations are then classified according to their location with respect to the decision boundary.

Consider a set of N training examples:

Figure 5: A set of N training examples.

where each y belongs to one of only two classes, -1 or 1. That is:

Figure 6: The two classes that each training example can belong to.

The support vector classifier works by choosing the decision boundary (hyperplane) for which the ‘margin’ is maximised:

Figure 7: The decision boundary that we wish to form.
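Spelled out, the standard setup (restated here from The Elements of Statistical Learning, which is presumably what Figures 5 to 7 display) consists of the training set, the class labels, and the decision boundary:

\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, \qquad x_i \in \mathbb{R}^p,

y_i \in \{-1, 1\}, \qquad i = 1, \ldots, N,

\{\, x : f(x) = x^T \beta + \beta_0 = 0 \,\}.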

Points that lie exactly on the edges of the margin (and therefore determine it) are known as the support vectors.

Figure 8: This image shows a simple support vector classifier. We classify new points according to the sign of x^T β + β_0. Image Citation: The Elements of Statistical Learning II (Figure 12.1).

New observations are thus classified based on whether they lie above or below the boundary (the solid middle blue line); that is, simply by the sign of the fitted function.

Figure 9: How we classify new observations (based on their sign). Image by Author.
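In symbols, with hats denoting the fitted coefficients, the classification rule for a new observation x is:

\hat{G}(x) = \operatorname{sign}\left( x^T \hat{\beta} + \hat{\beta}_0 \right).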

When the data is linearly separable, we can perfectly separate the training data with a maximum margin hyperplane.

When we cannot separate the classes perfectly, we must allow some violations of the margin. We then fit the model with a cost term that permits some misclassifications (see the Details section below).

Details:

For those interested in more of the background, read on; feel free to skip this section if you are satisfied with knowing that the algorithm is solved using quadratic programming. The way the optimal hyperplane is found is not simple: it is a quadratic programming problem involving Lagrange multipliers. Thankfully, multiple packages in R and Python handle this automatically.

Essentially, we want to maximise the margin M subject to the constraint that every training point lies on the correct side of the hyperplane, at distance at least M. This can be written as a maximisation problem:

Figure 10: How the margin is maximised subject to constraints. Image by Author.

This is equivalent to a minimisation in terms of the parameters:

Figure 11: The equivalent form that allows for a simpler calculation of the optimal hyperplane. Image by Author.
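Written out, the two formulations (restating the standard separable-case problem from The Elements of Statistical Learning, corresponding to Figures 10 and 11) are:

\max_{\beta,\, \beta_0,\, \|\beta\| = 1} \; M
\quad \text{subject to} \quad
y_i \left( x_i^T \beta + \beta_0 \right) \ge M, \quad i = 1, \ldots, N,

and, after dropping the unit-norm constraint and rescaling so that M = 1/\|\beta\|,

\min_{\beta,\, \beta_0} \; \frac{1}{2} \|\beta\|^2
\quad \text{subject to} \quad
y_i \left( x_i^T \beta + \beta_0 \right) \ge 1, \quad i = 1, \ldots, N.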

The solution to this problem is the hyperplane that produces the maximum margin, and in practice it is found numerically via quadratic programming.
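When the classes overlap, as in our toy data, the standard modification (the ‘soft margin’) introduces slack variables ξ_i and a cost parameter C; this is exactly the cost parameter C we set when fitting models in kernlab below:

\min_{\beta,\, \beta_0,\, \xi} \; \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i \left( x_i^T \beta + \beta_0 \right) \ge 1 - \xi_i, \quad \xi_i \ge 0.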

Back to the Toy Data:

Let’s return to the mixture example, the toy dataset we discussed above. In R, the kernlab package allows us to fit support vector machines easily. We have to specify a few parameters.

Type: We specify C-svc (the default setting) for SVM classification.

Kernel: We first want a linear kernel (linear decision boundary), so we use the vanilladot kernel.

C: The cost parameter, which determines how violations of the margin are penalised. We first try C = 10000. Larger values of C focus the fit on points close to the margin, while smaller values of C give more weight to points far from the margin. A sketch of the corresponding ksvm call is shown below.
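Putting these together (variable names continue from the earlier snippets and are not necessarily the author's exact code; note that ksvm expects a factor response for classification):

library(kernlab)

dat <- data.frame(x1 = x1, x2 = x2, y = as.factor(y))

# Linear support vector classifier with a large cost parameter
svm_linear <- ksvm(y ~ x1 + x2, data = dat, type = "C-svc",
                   kernel = "vanilladot", C = 10000)

svm_linear                    # training error, number of support vectors (Figure 13)
plot(svm_linear, data = dat)  # decision boundary and margin (Figure 12)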

Figure 12: A plot of our linear SVM using C=10000. Image created in R by author.

The plot above visualises our linear support vector machine, which aims to separate the two classes. Since the data are not perfectly separable, the parameter C acts as a regularisation parameter.

Figure 13: Summary of our linear SVM. Image created in R by author.

The model obtains a training error of 0.27. Considering that the toy dataset was visually non-separable, that isn’t too bad. There are a total of 125 support vectors. These are points that lie on the boundary of the margin or inside it.

We now try the same thing with C = 0.01:
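Refitting is just a matter of changing C (again a sketch, continuing from the objects above):

svm_linear_small <- ksvm(y ~ x1 + x2, data = dat, type = "C-svc",
                         kernel = "vanilladot", C = 0.01)

svm_linear_small                    # training error and support vectors (Figure 15)
plot(svm_linear_small, data = dat)  # Figure 14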

Figure 14: Another linear SVM fit but with C=0.01. Image created in R by author.

This produces a similar plot, but the training error is slightly reduced, to 0.265. We now have more support vectors, 172, because the margin is wider for this smaller value of C.

Figure 15: Results of our second linear SVM. Image created in R by author.

Above, we fit linear Support Vector Machines, but our data did not look linearly separable. Can we improve on the 0.265 training error? Yes. This is one reason SVMs are so popular: they can be greatly improved and extended via what is known as the ‘kernel trick’ (a form of basis expansion).

Basis Functions and the Kernel Trick:

Clearly our toy dataset is not linearly separable. One of the main strengths of support vector machines is that they extend easily via the ‘kernel trick’: we transform the original observations using a dot-product-like transformation and THEN find the optimal separating hyperplane. Above, we simply worked with the observations in their original form (x1, x2).

But, suppose for each observation we instead worked with

Figure 16: The Gaussian or Radial Basis Function transformation. Image by Author.

This is known as the Radial Basis Kernel and allows us to fit more complicated models.
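In kernlab’s parameterisation, the Gaussian (radial basis function) kernel compares two observations x and x' via

k(x, x') = \exp\left( -\sigma \, \| x - x' \|^2 \right),

where σ > 0 controls the width of the kernel (other references write this with γ, or with 1/(2σ²), in place of σ).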

Look what happens when we apply the simple radial basis function to our original data:

Figure 17: The results of applying the radial basis function to each set of observations. Image created in R by Author.

The transformed data can then be fit linearly and the boundary mapped back to the original space. In kernlab we now specify the kernel as ‘rbfdot’ (instead of ‘vanilladot’ as before).
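A sketch of the radial basis fit (by default, kernlab estimates the kernel width sigma automatically from the data):

# Gaussian (radial basis function) kernel SVM with the default cost C = 1
svm_rbf <- ksvm(y ~ x1 + x2, data = dat, type = "C-svc",
                kernel = "rbfdot", C = 1)

svm_rbf                    # training error and support vector count (Figure 19)
plot(svm_rbf, data = dat)  # non-linear decision boundary (Figure 18)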

Figure 18: Plot of SVM fit using the radial basis function (Gaussian Kernel) with a C=1. Image created in R by Author.
Figure 19: Summary of our toy data fit with a Gaussian Kernel. Image created in R by author.

This has an even lower training error of 0.155. The number of support vectors is reduced to 115, as fewer points lie inside the margin. The cost parameter C can be tuned by cross-validation over a grid of candidate values; C = 1 is the default.
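One simple way to do such a search (not from the original article; it relies on ksvm's built-in cross argument, which reports a k-fold cross-validation error through the cross() accessor):

C_grid <- c(0.01, 0.1, 1, 10, 100)
cv_error <- sapply(C_grid, function(C_val) {
  fit <- ksvm(y ~ x1 + x2, data = dat, type = "C-svc",
              kernel = "rbfdot", C = C_val, cross = 5)
  cross(fit)   # 5-fold cross-validation error for this value of C
})
rbind(C = C_grid, cv_error = cv_error)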

Summary and Conclusion:

Support Vector Machines are an important and useful tool originally designed for binary classification. However, they can be applied to many other machine learning problems. The history and method by which they were developed is also interesting. The mathematics behind the SVM is solid, with a firm foundation in linear algebra and optimisation. They are a worthwhile algorithm for anyone interested in Data Science and Machine Learning to study in more detail.

SVMs extend the idea of separating hyperplanes by finding the optimal separating hyperplane. In low dimensions this is easy to visualise and understand; in higher dimensions we cannot visualise it, but the mathematics is a straightforward extension.

SVMs can be extended to non-linear decision boundaries using ‘kernel tricks’ (basis expansions) to fit more complicated boundaries. A common choice is the Gaussian (radial basis function) kernel. These methods transform the data first, so that we can fit what appears to be a linear boundary in the transformed input space.

Despite the heavier theoretical background and the fact that they are harder to interpret in words, SVMs are widely used in industry. In particular, they are applied to medical data, neurological imaging, and more.

In this article, we considered the original use of SVMs, binary classification. However, SVMs remain a topic of ongoing research and several extensions have been developed: you can use SVMs to perform regression and even clustering via unsupervised learning.

Thanks for Reading!

Did you enjoy this article or learn something new? If so, please check out my other Medium articles and consider sharing the article or leaving a clap. If you would like to connect, add me on LinkedIn. Also please leave any comments/corrections/questions below, and check out the sources.

Github Code:

Sources:

[1] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).

[2] Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.

[3] Vapnik, V. (1996). Support Vector Regression Machines.

[4] Noble, W. S. (2006). What is a support vector machine? Nature Biotechnology, 24(12), 1565–1567.

[5] Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., & Haussler, D. (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10), 906–914.

[6] Géron, A. (2020). Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow.

[7] Karatzoglou, A., & Smola, A. (2004). kernlab: An S4 Package for Kernel Methods in R.
