Support Vector Machines explained with Python examples

How to use SVMs in classification problems.

Carolina Bento
Towards Data Science


The support vector machine (SVM) is a supervised machine learning technique. And, even though it’s mostly used in classification, it can also be applied to regression problems.

SVMs define a decision boundary along with a maximal margin that separates almost all the points into two classes, while also leaving some room for misclassifications.

Support vector machines are an improvement over maximal margin algorithms. Their biggest advantage is that they can define either a linear or a non-linear decision boundary by using kernel functions. This makes them more suitable for real-world problems, where data are not always completely separable with a straight line.

Separating hyperplanes

The main goal of an SVM is to define a hyperplane that separates the points into two different classes. The hyperplane is also called the separating hyperplane or decision boundary.

So let’s start with hyperplanes. The easiest way to visualize a hyperplane is to think about a 2-dimensional dataset.

Hyperplane that completely separates the points in two different classes.

There’s going to be an infinite number of hyperplanes that separate the points into two classes. But, since we’re working in a 2-dimensional space, any hyperplane we define will always have (2–1) = 1 dimension. So, we can represent the hyperplane with a simple regression line.

With the decision boundary defined, we can now classify points based on where they fall in relation to it.

Classification based on where the vector falls relative to the decision boundary.

If you are working with more than two dimensions, that is, if your feature vector X has more than two features, you’re classifying vectors instead of points.

So, to generalize, all vectors that fall below the decision boundary belong to class -1, and if they fall above it they belong to class 1.
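To make this rule concrete, here’s a sketch of one standard way to write the 2-dimensional decision boundary and the resulting classification rule, where the β coefficients are the parameters of the hyperplane (notation assumed here for illustration):

\[
\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0
\]
\[
\hat{y} = \begin{cases} \;\;\,1 & \text{if } \beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0 \\ -1 & \text{if } \beta_0 + \beta_1 X_1 + \beta_2 X_2 < 0 \end{cases}
\]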

Using margins to increase prediction confidence

We used training data to define the decision boundary. But what about the quality of predictions for the testing set?

If a vector is far from the decision boundary, we can be confident about its class, even if the model has some error. But what happens when we classify a vector that is very close to the decision boundary? How can we be sure about which class to assign?

To tackle this issue, support vector machines also draw a margin around the decision boundary. The goal of this margin is to separate the vectors from the decision boundary as much as possible. The intuition behind it is that a margin gives us more confidence in our predictions: because the vectors are at least the length of the margin away from the decision boundary, there’s less ambiguity during classification.

The position of the margin is defined using the vectors that are closest to the decision boundary. That’s why the vectors that lie on top of the margin are called the support vectors.

And with the margin working as a buffer, we can classify a vector based on where it falls relative to the margin, where M is the width of the margin.

Classification based on where vectors fall relative to the margin.
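Written out in the same notation as before, the margin-based rule only assigns a class when a vector clears the margin (a sketch, with M the width of the margin):

\[
\beta_0 + \beta_1 X_1 + \beta_2 X_2 \geq M \;\Rightarrow\; \text{class } 1
\]
\[
\beta_0 + \beta_1 X_1 + \beta_2 X_2 \leq -M \;\Rightarrow\; \text{class } -1
\]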

Some room for misclassification

The addition of a margin improves the quality of the predictions in the testing set, but it assumes the classes are completely separable.

But, in most real world problems, data is messy and it’s usually not completely separable.

That’s why SVMs share an important characteristic with the algorithm that came before them, the support vector classifier: they allow the algorithm to make mistakes and assign the wrong class to some vectors.

Decision boundary and margin for support vector classifiers, along with the corresponding support vectors.

So, instead of trying to completely separate the vectors into two classes, SVMs make a trade-off. They allow some vectors to fall inside the margin and on the wrong side of the decision boundary.

Support vector machines allow some misclassification during the learning process, so they can do a better job at classifying most vectors in the testing set.

Besides the margin, our model now includes slack variables, which will tell us two things:

  • if a test observation is misclassified,
  • where the observation is relative to the decision boundary and the margin.

Slack variables can have three possible values:

  • zero, when the vector is on the correct side of the margin,
  • between zero and one, when the vector is inside the margin but still on the correct side of the decision boundary,
  • greater than one, when the vector is on the wrong side of the decision boundary, that is, misclassified.

And the number of misclassified vectors is bounded by a parameter C.

Classification based on where vectors fall relative to the margin, including slack variables.
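Putting margin, slack variables and the parameter C together, the resulting optimization problem can be sketched in the classic textbook form, for n training vectors with p features each:

\[
\max_{\beta_0, \ldots, \beta_p,\; \varepsilon_1, \ldots, \varepsilon_n} M
\quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1,
\]
\[
y_i \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \right) \geq M (1 - \varepsilon_i),
\qquad \varepsilon_i \geq 0, \qquad \sum_{i=1}^{n} \varepsilon_i \leq C
\]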

We can see the model captures much more nuance. But it’s still built on top of the maximal margin classifier. For instance, if you set the parameter C to zero, meaning you allow zero slack, it falls back to a maximal margin classifier: a linear decision boundary, a margin that is as large as possible, and no vectors allowed inside it.

The higher the number of slack variables, the higher the number of misclassified vectors allowed. This impacts the width of the margin, because different support vectors end up being picked. It also controls the bias-variance tradeoff of the model.

How the number of slack variables controls the bias-variance tradeoff.
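If you want to see this effect in code, scikit-learn’s SVC exposes a C parameter you can play with. Be aware that it is defined in the opposite direction of the slack budget described above: a larger scikit-learn C penalizes slack more heavily (narrower margin, lower bias, higher variance), while a smaller C tolerates more slack (wider margin, higher bias, lower variance). A minimal sketch, assuming the x_train and y_train arrays built in the examples further down:

from sklearn import svm

# Compare how scikit-learn's C affects the fitted model.
# Larger C -> less tolerance for slack -> narrower margin, typically fewer support vectors.
for c in [0.01, 1, 100]:
    model = svm.SVC(kernel='linear', C=c)
    model.fit(x_train, y_train)
    print("C =", c, "-> number of support vectors:", model.n_support_.sum())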

Having some room for misclassification makes SVMs more flexible, but it only applies to a limited set of problems.

In most real world problems, it’s hard to separate data into two classes with a linear decision boundary. Even with some room for error.

Support vector machines

SVMs share the characteristics of the margin classifiers that came before them. What is unique about them is how they can define both linear and non-linear decision boundaries.

To support non-linear decision boundaries, SVMs use functions to transform the original feature space into a new space that can represent those non-linear relationships.

For instance, say you augment the original feature space with the square of its features. In this case, you applied a quadratic function to the original feature set to create the square of those features. Now you have your original features and their quadratic versions in this augmented space. And so, implicitly, there’s a function that maps these two feature spaces.

Augmenting the feature space with the quadratic version of original features.

If you try to draw the decision boundary in the original feature space, it has a quadratic shape. But if you train your model in the augmented feature space, you’ll find a linear decision boundary that separates the two classes. Because it is a transformation, the quadratic boundary in the original feature space corresponds to a linear one in the augmented feature space.
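As a toy illustration of this idea, you can build the augmented space by hand: square the original features, stack them next to the originals, and fit a linear SVM on the result. This is only a sketch with made-up data, just to show the mapping.

import numpy as np
from sklearn import svm

# Toy 2-dimensional dataset
X = np.array([[0.5, 1.0],
              [1.5, 2.0],
              [2.0, 0.5],
              [3.0, 3.5]])
y = np.array([0, 0, 1, 1])

# Augment the feature space with the square of each feature:
# (x1, x2) -> (x1, x2, x1^2, x2^2)
X_augmented = np.hstack([X, X ** 2])

# A linear boundary in the augmented 4-dimensional space corresponds to a
# quadratic boundary in the original 2-dimensional space.
model = svm.SVC(kernel='linear')
model.fit(X_augmented, y)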

The functions that define these transformations are called kernels. They work as similarity functions between observations in the training and testing sets.

Decision boundary and margin for SVM, along with the corresponding support vectors, using a linear kernel (right) and a polynomial kernel (left).

Whenever you have a model that is represented with inner products, you can plug in a kernel function. For instance, a linear kernel is the same as applying linear transformations to feature space. And, in this case, it’s the same as a support vector classifier, because the decision boundary is linear.

With polynomial kernels, you’re projecting the original feature space into a polynomial feature space. So the decision boundary that separates the classes is defined with a higher order polynomial.
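For reference, here is a sketch of how these two kernels are commonly defined as functions of a pair of observations x and x′, with d the degree of the polynomial (scikit-learn’s exact parametrization differs slightly, adding a gamma scaling factor and a configurable constant term):

\[
K_{\text{linear}}(x, x') = \sum_{j=1}^{p} x_j x'_j
\]
\[
K_{\text{poly}}(x, x') = \left( 1 + \sum_{j=1}^{p} x_j x'_j \right)^{d}
\]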

The use of kernels is what distinguishes support vector classifiers from support vector machines, and they open up the possibility to tackle more complex problems. But augmenting the feature space can mean extra computational needs, because with a big enough feature space it can be expensive to fit a model, in terms of both time and resources.

Despite the augmented feature space, kernels bring a significant advantage. SVMs don’t actually compute the transformation of each observation into the augmented space. Instead, they use a trick: they compute the inner products between observations as if they were in the augmented space, which is computationally much cheaper. This is called the kernel trick.
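As a small numerical sanity check of the kernel trick, consider the homogeneous quadratic kernel K(x, z) = (x·z)² on 2-dimensional vectors. One explicit mapping that corresponds to it is φ(x) = (x1², √2·x1·x2, x2²); the point is that evaluating the kernel never requires building φ(x). A minimal sketch:

import numpy as np

def phi(v):
    """Explicit quadratic feature map for a 2-dimensional vector."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Inner product computed explicitly in the augmented (3-dimensional) space
explicit = phi(x) @ phi(z)

# The same quantity computed with the kernel, without leaving the original space
kernel = (x @ z) ** 2

print(explicit, kernel)  # both evaluate to 16.0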

In the end, SVMs make two important assumptions:

  • Data is linearly separable, even if the linear boundary lives in an augmented feature space.
  • The model is represented using inner products, so that kernels can be used.

Let’s look at a few examples

To see support vector machines in action, I’ve generated a random dataset and split it into two different classes. Here's the code snippet that generates and plots the data.

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def generate_random_dataset(size):
    """ Generate a random dataset that follows a quadratic distribution
    """
    x = []
    y = []
    target = []
    for i in range(size):
        # class zero
        x.append(np.round(random.uniform(0, 2.5), 1))
        y.append(np.round(random.uniform(0, 20), 1))
        target.append(0)
        # class one
        x.append(np.round(random.uniform(1, 5), 2))
        y.append(np.round(random.uniform(20, 25), 2))
        target.append(1)
        x.append(np.round(random.uniform(3, 5), 2))
        y.append(np.round(random.uniform(5, 25), 2))
        target.append(1)
    df_x = pd.DataFrame(data=x)
    df_y = pd.DataFrame(data=y)
    df_target = pd.DataFrame(data=target)
    data_frame = pd.concat([df_x, df_y], ignore_index=True, axis=1)
    data_frame = pd.concat([data_frame, df_target], ignore_index=True, axis=1)
    data_frame.columns = ['x', 'y', 'target']
    return data_frame

# Generate dataset
size = 100
dataset = generate_random_dataset(size)
features = dataset[['x', 'y']]
label = dataset['target']
# Hold out 20% of the dataset for testing
test_size = int(np.round(size * 0.2, 0))
# Split dataset into training and testing sets
x_train = features[:-test_size].values
y_train = label[:-test_size].values
x_test = features[-test_size:].values
y_test = label[-test_size:].values
# Plotting the training set
fig, ax = plt.subplots(figsize=(12, 7))
# removing the top, left, and right borders
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['right'].set_visible(False)
# adding major gridlines
ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)
ax.scatter(features[:-test_size]['x'], features[:-test_size]['y'], color="#8C7298")
plt.show()

Before any classification, the training set looks like this.

Random training set.

There’s a little space between the two groups of data points. But closer to the center, it’s not clear which data point belongs to which class.

A quadratic curve might be a good candidate to separate these classes. So let’s fit an SVM with a second-degree polynomial kernel.

from sklearn import svm
model = svm.SVC(kernel='poly', degree=2)
model.fit(x_train, y_train)

To see the result of fitting this model, we can plot the decision boundary and the margin along with the dataset.

Dataset after classification, with decision boundary (full line), margin (dashed lines) and support vectors marked with a circle.

Here’s the code to plot the decision boundary and margins.

fig, ax = plt.subplots(figsize=(12, 7))
# Removing the top, left, and right borders
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['right'].set_visible(False)
# Create grid to evaluate model
xx = np.linspace(-1, max(features['x']) + 1, len(x_train))
yy = np.linspace(0, max(features['y']) + 1, len(y_train))
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
train_size = len(features[:-test_size]['x'])
# Assigning different colors to the classes
colors = y_train
colors = np.where(colors == 1, '#8C7298', '#4786D1')
# Plot the dataset
ax.scatter(features[:-test_size]['x'], features[:-test_size]['y'], c=colors)
# Get the separating hyperplane
Z = model.decision_function(xy).reshape(XX.shape)
# Draw the decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
# Highlight support vectors with a circle around them
ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=100, linewidth=1, facecolors='none', edgecolors='k')
plt.show()

If we calculate the accuracy of this model against the testing set we get a good result, granted the dataset is very small and generated at random.

Accuracy of the SVM model with 2nd degree polynomial kernel.

from sklearn.metrics import accuracy_score
predictions_poly = model.predict(x_test)
accuracy_poly = accuracy_score(y_test, predictions_poly)
print("2nd degree polynomial Kernel\nAccuracy (normalized): " + str(accuracy_poly))

The accuracy is good, but let's see if a simpler approach could have solved our problem. To fit an SVM with a linear kernel, we just need to update the kernel parameter.

model = svm.SVC(kernel='linear')
model.fit(x_train, y_train)

And plot the decision boundary the same way we did before.

Dataset after classification, with decision boundary (full line), margin (dashed lines) and support vectors marked with a circle.

Now it looks like there are fewer points inside the margin and fewer misclassified points. Calculating the accuracy of this model, it comes out slightly higher than the accuracy of the one with the polynomial kernel.

Accuracy of SVM model with linear kernel.
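In case you want to reproduce that number, the accuracy can be computed the same way as for the polynomial kernel; a minimal sketch, reusing the linear-kernel model fitted above:

from sklearn.metrics import accuracy_score
predictions_linear = model.predict(x_test)
accuracy_linear = accuracy_score(y_test, predictions_linear)
print("Linear Kernel\nAccuracy (normalized): " + str(accuracy_linear))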

So it turns out that for this problem a simpler model, an SVM with a linear kernel, was the best solution.

Hope you enjoyed these examples, and that you got a better understanding of SVMs and what kinds of problems they can be applied to.

Thanks for reading!
