
Understand Support Vector Machines

Part 1 of a 5-part series of brief articles providing a comprehensive introduction; no prior knowledge assumed

SVM attempts to find the boundary with the maximum margin (Image by Author)

Overview

This is part 1 of a 5-part series of short articles that provide a comprehensive introduction to Support Vector Machines (SVM). The objective of the series is to help you thoroughly understand SVM and confidently use it in your own projects. The series assumes no prior knowledge of Machine Learning (ML); familiarity with high-school mathematics is sufficient to follow along.


The Big Picture

If you are an absolute beginner in Data Science and do not yet understand the difference between Supervised Learning and Unsupervised Learning, then I suggest you read my earlier article, which assumes absolutely no prior knowledge and uses a story to help you make sense of these terms.

Learning from Data: A Bird’s Eye View

Within the ecosystem of data science, SVM is a very sought-after algorithm that can help you solve two types of Machine Learning problems within Supervised Learning: Classification and Regression. In classification, we want to develop an algorithm that can help us predict a "class" (a specific prediction out of a discrete set of possibilities). Some examples of classification are Spam Email vs Not Spam, identifying any of the 0–9 digits from an image, having a disease vs not having a disease (hopefully, you get the idea). In regression, what we predict is a "continuous" output (e.g. the price of a house, or the amount of force predicted from brain signals).

Conceptually, both regression and classification are similar. In both cases, we have a set of data points or attributes (commonly called features) that we then use to predict an output. If the output is discrete, we call it classification. If the output is continuous, we call it regression.

SVM can help us solve both classification and regression problems. The mathematics behind SVM can look scary at first, but it rests on a very clear geometric intuition (and hopefully the math won't look as scary once you are through this article series).
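If you want to see what this looks like in practice, here is a minimal sketch (not taken from this article) using scikit-learn's SVC and SVR on tiny made-up datasets, just to show the two problem types side by side:

```python
# A minimal sketch: scikit-learn's SVC (classification) and SVR (regression)
# applied to tiny, made-up datasets, purely to illustrate the two problem types.
import numpy as np
from sklearn.svm import SVC, SVR

# Classification: predict a discrete class (e.g. 1 = spam, 0 = not spam)
X_cls = np.array([[1.0, 0.2], [0.9, 0.4], [0.2, 0.8], [0.1, 0.9]])
y_cls = np.array([1, 1, 0, 0])
clf = SVC(kernel="linear").fit(X_cls, y_cls)
print(clf.predict([[0.8, 0.3]]))   # -> a class label, e.g. array([1])

# Regression: predict a continuous value (e.g. a house price)
X_reg = np.array([[50.0], [80.0], [120.0], [200.0]])   # floor area
y_reg = np.array([100.0, 150.0, 220.0, 380.0])         # price
reg = SVR(kernel="linear").fit(X_reg, y_reg)
print(reg.predict([[100.0]]))      # -> a continuous estimate
```

Same workflow in both cases: features in, prediction out; only the type of output changes.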


Is SVM Generative or Discriminative?

Broadly, there are two approaches to classification. One approach is to learn the statistical properties of the data for each class and then use that knowledge to identify which class a specific data point belongs to. This is called a generative approach, because we are attempting to learn the underlying probabilistic model of each class (if we have a probability model of each class, we can use that model to "generate" new data points, hence the name, generative models).

Alternatively, we can attempt to directly learn the decision boundary that determines which class a specific data point belongs to. This is the "discriminative" approach to solving a classification problem. It is equivalent to dividing the "space" of data points into different regions, where each region belongs to a specific class.

Hold on, what do I mean by "space"? …

Space is just a conceptual construct that helps us represent data points. For example, think of a data point consisting of two elements: heart rate, and breathing rate. We can represent this data point on a 2-dimensional space, where one dimension is the heart rate, and the other dimension is the breathing rate.

A 2-dimensional space where each data point is represented by a pair of values (Image by Author)

It doesn’t have to be heart rate and breathing rate. It can be anything really, depending on the problem at hand. For example, think of a house price prediction problem using floor area and the number of bedrooms. In that case, the two dimensions would be floor area and the number of bedrooms, and each point in the 2-dimensional space would correspond to a unique house. Also, we are not restricted to a 2-dimensional space; the same concepts extend to any number of dimensions.

The key point to note is: for a problem with n features, each data point can conceptually be represented in an n-dimensional space.

A discriminative model, such as SVM, tries to find a boundary that partitions this n-dimensional space, so that each region of the space belongs to a specific class. Mathematically, this decision boundary is an (n-1)-dimensional subspace called a hyperplane (for example, in a 2-dimensional space the hyperplane is a 1-dimensional line; in a 3-dimensional space it is a 2-dimensional plane).
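To make the "partitioning the space" idea concrete, here is a small sketch in 2 dimensions. The weight vector w and offset b below are made up purely for illustration; in practice they are exactly what SVM training learns:

```python
# A sketch of partitioning a 2-dimensional space with a hyperplane (a line in 2D).
# The values of w and b are made up for illustration; SVM training learns them.
import numpy as np

w = np.array([1.0, -1.0])   # normal vector of the hyperplane
b = 0.0                     # offset of the hyperplane from the origin

def which_side(x):
    """Return +1 or -1 depending on which side of the line w.x + b = 0 the point lies."""
    return int(np.sign(np.dot(w, x) + b))

print(which_side(np.array([80.0, 20.0])))   # e.g. heart rate 80, breathing rate 20 -> +1
print(which_side(np.array([60.0, 90.0])))   # -> -1, i.e. the other region of the space
```

The sign of w·x + b tells us which region (and hence which class) a point falls into; everything that follows is about choosing w and b well.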


How Do We Determine the Decision Boundary?

The problem boils down to finding a decision boundary in an n-dimensional space. Conceptually, we achieve this by considering different possible decision boundaries and then picking the one that is the most suitable. Mathematically, this is equivalent to finding the minimum of a "cost" function that has a high value when predictions are poor and a low value when predictions are good (my previous article on logistic regression introduces the concept of a cost function in more detail). All the magic, thus, happens in how we define the "cost" function.

And this is where SVM differs from another well-known classification technique, logistic regression. In logistic regression, the mathematics is formulated such that we use all the data points to find the decision boundary that gives us the minimum cost. In SVM, however, we attempt to find the decision boundary that leads to the maximum distance (known as the margin) between the decision boundary (the hyperplane) and the data points of the two classes closest to it. (These closest data points "support" the hyperplane, like a pillar supports a building.)
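As a sketch of the maximum-margin idea (the toy data below is made up for illustration), we can fit a linear SVM with scikit-learn and read off the margin. With a very large C the fit approximates the hard-margin SVM, and the distance between the two margin boundaries is 2/||w||, a relationship we will derive later in the series:

```python
# A sketch with made-up, linearly separable toy data: fit a linear SVM,
# then read off the margin width and the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # one class
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # the other class
y = np.array([-1, -1, -1, 1, 1, 1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

w = svm.coef_[0]                       # normal vector of the separating hyperplane
margin = 2.0 / np.linalg.norm(w)       # distance between the two margin boundaries
print("margin width:", margin)
print("support vectors:\n", svm.support_vectors_)   # the points that "support" the hyperplane
```

Logistic regression fit to the same data would also separate the two classes, but nothing in its cost function pushes the boundary to sit as far as possible from the closest points; that is the distinguishing feature of SVM.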


Overview of the Mathematics Behind SVM

If the mathematics of SVM is off-putting, here is a bird's-eye view in words (we will derive the relevant equations in a subsequent article in this series). As mentioned earlier, the real magic that differentiates SVM from other algorithms such as logistic regression happens in the "cost" function. For logistic regression, we define a cost function that is "convex" (this means that it has a U shape and a single minimum point; such problems are easy to solve). For SVM, we have two requirements:

(i): we want to maximize the margin (an optimization)

(ii): we want to ensure that the points lie on the correct side of the decision boundary (a constraint)

Thus, the SVM problem is a "constrained" optimization problem where we want to maximize the margin (i), while also ensuring that the constraints (ii) are satisfied. The method of Lagrange multipliers provides us with a way of solving a constrained optimization problem (by converting it into an unconstrained problem, which we can then solve with calculus; we will derive the relevant equations in a future article in the series).
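For readers who like to see the symbols early, here is roughly what that constrained optimization problem looks like in the simple, linearly separable case (this is the standard hard-margin formulation, which we will derive carefully from the geometry later in the series):

```latex
% Hard-margin SVM as a constrained optimization problem.
% x_i are the feature vectors, y_i in {-1, +1} are the class labels,
% w is the normal vector of the hyperplane and b its offset.
\min_{\mathbf{w},\, b} \;\; \frac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, m
```

Minimizing ½||w||² turns out to be equivalent to maximizing the margin 2/||w||, which is how requirement (i) ends up in the objective while requirement (ii) becomes the constraints.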

Support Vector Machines can be formulated as a constrained optimization problem, where the goal is to maximize margin while ensuring that points lie on the correct side of the hyperplane (Image by Author)

In summary, SVM is a maximum margin classifier. This article provided a gentle introduction and used a simple, linearly separable case to illustrate the underlying concepts. Subsequent articles in the series will show you how the mathematics behind SVM can be derived from geometric intuitions, and cover situations where the data points are not linearly separable. Stay tuned!


If you enjoyed reading this, then you may consider reading the next article in the sequence where the SVM equations are derived, using geometric intuitions.

Understand Support Vector Machines

