So why the heck are they called Support Vector Machines?

Vidhur Kumar
Towards Data Science
6 min read · Mar 23, 2020

My first pass with Supervised Machine Learning was a quick one: I did some research on curriculums and handpicked the topics I wanted to learn. Then, I read a few articles on them until I understood the general concept (or so I thought), experimented with them using the out-of-the-box Scikit-Learn version of the algorithm, and moved on. Recently, I began gaining a formal, in-depth knowledge of these supervised learning algorithms (the “why they work”, if you will). Naturally, SVMs were part of the fundamentals, and it was then that I had an unsettling revelation: I had no idea what the words “Support Vector Machines” meant! I had completely overlooked the fact that this fancy name probably had a deeper meaning. In this piece, I summarize my findings and attempt to explain the mathematical reasoning behind this ‘complex’ name.

The Shortcomings of the Perceptron Algorithm

The Perceptron algorithm is a supervised learning algorithm for binary classification. Formally, it lets us learn a binary classifier called a threshold function, which maps an input x in n-dimensional space to a binary output value f(x):

f(x) = 1 if w ⋅ x + b > 0, and f(x) = 0 otherwise.

The binary classifier expressed as a piecewise function.

Note that the conditional expression for f(x) to be 1 contains a weighted summation in the form of a dot product w ⋅ x along with a bias added to it. A closer look tells us that we can rewrite the condition such that the bias represents the threshold value for our binary classifier:

f(x) = 1 if w ⋅ x > −b, and f(x) = 0 otherwise.

The binary classifier conditional rewritten to emphasize the bias as the threshold value mentioned above.
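To make this concrete, here is a minimal NumPy sketch of the threshold function described above. The function names and the example values are mine, purely for illustration:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Threshold function: 1 if w . x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

def perceptron_predict_threshold(x, w, b):
    """Equivalent form: the weighted sum w . x must exceed the threshold -b."""
    return 1 if np.dot(w, x) > -b else 0

print(perceptron_predict(np.array([1.0, 2.0]), w=np.array([0.5, -0.25]), b=0.1))
```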

Note that the perceptron is a linear classifier, which means that if the training set D is not linearly separable, the learning algorithm never converges and does not settle on an ‘approximate’ solution. However, if D is linearly separable, then the algorithm is guaranteed to find a linear separator.

At the same time, every dataset that is linearly separable admits an infinite number of possible linear decision boundaries that attain a misclassification rate of 0. The perceptron is assured to find one of these, but there is no guarantee as to how “good” this decision boundary is.


A binary classification problem in 2D space with an infinite number of decision boundaries that separate the classes. The Perceptron algorithm could produce any one of these decision boundaries.

Maximum Margin Separators: The Case for SVMs

This motivates us to come up with another algorithm that maximizes the margin between the two classes. The margin is defined as the minimum distance between the data instances and the decision boundary.

Let the decision boundary be defined by the equation w ⋅ x + b = 0. In this case, w is the n-dimensional normal vector to the hyperplane. We normalize w, i.e., divide it by its length ||w||, in case it’s not already a unit vector.

min over positive xᵢ of (w ⋅ xᵢ + b) / ||w||   and   min over negative xᵢ of −(w ⋅ xᵢ + b) / ||w||

The distance between the boundary and the training point closest to it from the positive class, followed by the distance between the boundary and the training point closest to it from the negative class.

Let T be the set of n training instances that we are given. We need to find a vector w that maximizes the margin, which is given by:

max over w, b of [ min over (xᵢ, yᵢ) in T of yᵢ (w ⋅ xᵢ + b) / ||w|| ]

The maximum margin expressed mathematically. Notice that the expression inside the square brackets represents the distance between a point and the decision boundary (the yᵢ exists to differentiate between the positive and negative class). We define the margin in relation to the point closest to the boundary from each class, which is why we minimize over all training examples.
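Here is a minimal NumPy sketch of the inner quantity: the geometric margin of a given separator (w, b). The toy dataset and the candidate (w, b) are made up for illustration:

```python
import numpy as np

# Toy, linearly separable 2D dataset (illustrative values).
X = np.array([[2.0, 3.0], [3.0, 4.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

def geometric_margin(w, b, X, y):
    """min_i y_i * (w . x_i + b) / ||w|| -- the margin of the separator (w, b)."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

w = np.array([1.0, 1.0])
b = 0.0
print(geometric_margin(w, b, X, y))  # distance from the boundary to the closest point
```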

Formulating this as a Constrained Optimization Problem

Consider the set of vectors w (together with the offset b) such that the following constraint holds for every training instance (xᵢ, yᵢ):

yᵢ (w ⋅ xᵢ + b) ≥ D for all i, where D = minᵢ yᵢ (w ⋅ xᵢ + b) > 0.

These vectors produce a linear separator that perfectly separates the data with a non-zero margin. It helps to introduce some equality into the above constraint and scale w to arrive at the following solution set S:

S = { (w, b) : yᵢ (w ⋅ xᵢ + b) ≥ 1 for all i }.

If D = 1, then the closest training point lies at a distance of exactly 1 / ||w|| from the separator, so the margin equals 1 / ||w||.

On the other hand, if D > 1, then we can rescale (w, b) to (w / D, b / D), which still lies in S but has the smaller norm ||w|| / D.

Notice that if the closest point is more than a distance of 1 / ||w|| away from the separator, we can always rescale w in this way to decrease ||w|| while keeping the constraint satisfied. Thus, at the optimum the quantity 1 / ||w|| is the margin, and we need to minimize ||w|| in order to maximize the margin. We have the following optimization problem:

minimize (1/2) ||w||² over w and b,

subject to the following constraints:

yᵢ (w ⋅ xᵢ + b) ≥ 1 for all i = 1, …, n.
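Just to convince myself this formulation behaves as advertised, here is a small sketch that feeds the problem to a generic constrained solver (SciPy’s SLSQP). The toy dataset is made up for illustration, and a real SVM library would use a dedicated QP solver instead:

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable 2D dataset (illustrative values).
X = np.array([[2.0, 3.0], [3.0, 4.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as v = [w_1, w_2, b].
objective = lambda v: 0.5 * np.dot(v[:2], v[:2])             # (1/2) ||w||^2
constraints = {"type": "ineq",                               # y_i (w . x_i + b) - 1 >= 0
               "fun": lambda v: y * (X @ v[:2] + v[2]) - 1.0}

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
               method="SLSQP", constraints=constraints)
w_opt, b_opt = res.x[:2], res.x[2]
print(w_opt, b_opt, 1.0 / np.linalg.norm(w_opt))             # margin = 1 / ||w||
```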

Now that we have a mathematical framework to solve the problem, there is a caveat that we need to address: we have been operating under the assumption that the data we have is linearly separable. But what if it’s not? In that case, S would be empty and we’d be left with an infeasible problem. We can work around this by relaxing our constraints to the following:

yᵢ (w ⋅ xᵢ + b) ≥ 1 − ϵᵢ for all i, with slack variables ϵᵢ ≥ 0.

Of course, we want the slack variables ϵᵢ to stay as small as possible, so instead of fixing them in advance we penalize them in the objective. We rewrite our minimization problem to be:

minimize (1/2) ||w||² + C Σᵢ ϵᵢ over w, b and the ϵᵢ.

The parameter C represents the tradeoff between minimizing the error and maximizing the margin. Solving this problem is typically done via Quadratic Programming.
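In practice we rarely solve this quadratic program by hand; scikit-learn does it for us. A minimal sketch follows, where the toy data and the values of C are arbitrary choices of mine:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D dataset with one negative point sitting among the positives,
# so the classes are not linearly separable (illustrative values).
X = np.array([[2.0, 3.0], [3.0, 4.0], [2.5, 3.5], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1, -1])

# Small C tolerates more slack (wider margin, more violations);
# large C penalizes violations heavily (narrower margin, fewer violations).
for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)
    print(f"C={C}: w={clf.coef_.ravel()}, b={clf.intercept_[0]:.3f}, margin={margin:.3f}")
```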

Enter Support Vectors

The support vectors are the training instances that satisfy the constraint:

yᵢ (w ⋅ xᵢ + b) = 1

The constraint that needs to be satisfied for a training instance to become a support vector.

The solution to our problem, i.e., the optimal (maximum-margin) hyperplane, remains unchanged if we remove all training instances but the support vectors. That is why they are given the name ‘support vectors’. These training instances can be thought of as ‘supporting’ or ‘holding up’ the optimal hyperplane.
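To see this ‘supporting’ behaviour in code, here is a small scikit-learn sketch (toy data again, and I approximate the hard-margin case with a very large C): fit on everything, keep only the support vectors the model reports, refit on just those points, and check that the hyperplane barely changes.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2D dataset (illustrative values).
X = np.array([[2.0, 3.0], [3.0, 4.0], [4.0, 6.0],
              [-1.0, -2.0], [-2.0, -1.0], [-3.0, -4.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Hard-margin behaviour is approximated with a very large C.
clf_full = SVC(kernel="linear", C=1e6).fit(X, y)
sv_X, sv_y = clf_full.support_vectors_, y[clf_full.support_]

# Refit using only the support vectors.
clf_sv = SVC(kernel="linear", C=1e6).fit(sv_X, sv_y)

print(clf_full.coef_, clf_full.intercept_)  # hyperplane from all training instances
print(clf_sv.coef_, clf_sv.intercept_)      # (essentially) the same hyperplane
```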


The Objective of Support Vector Machines

So, what do SVMs do with the data we provide them? The objective is as follows:

Assuming that each data point is n-dimensional, a Support Vector Machine attempts to find the (n-1)-dimensional hyperplane that separates the two classes with the maximum margin.

Intuitively, this proves to be very effective in boosting the performance of the classifier/regressor: the larger the margin, the lower the generalization error tends to be.

References

“Large Margin Separators”: http://cs.brown.edu/people/pfelzens/engn2520/CS1420_Lecture_10.pdf

“Support Vector Machines”: https://en.wikipedia.org/wiki/Support-vector_machine
