
Support Vector Machines are very versatile Machine Learning algorithms. The main reason for their popularity is their ability to perform both linear and non-linear classification and regression using what is known as the kernel trick; if you don't know what that is, don't worry. By the end of this article, you will be able to:
- Understand what an SVM is and how it works
- Distinguish between a hard margin SVM and a soft margin SVM
- Code an SVM from Scratch in Python
So, without further ado, let’s dive in!
What is an SVM and why do I need it?

With so many other algorithms out there (Linear Regression, Logistic Regression, Neural Networks, etc.), you may be wondering why you need another one in your toolkit! Perhaps this question can be answered with the aid of a diagram:

Here we see three potential decision boundaries for classifying the data: H1, H2, and H3. First off, H1 does not separate the classes at all, so it is not a useful hyperplane. H2 does separate the classes, but notice how small the margin (or street) between the boundary and the nearest points is; this classifier is unlikely to perform well on unseen instances.
The third hyperplane, H3, represents the decision boundary of the SVM classifier; this line not only separates the two classes but also keeps the widest distance between the most extreme points of the two classes.
You can think of the SVM as fitting the widest possible margin between the two classes. This is known as large margin classification.
Large Margin Classification

Like I said, a large margin SVM classifier essentially tries to fit the widest possible street (shown by the dashed parallel lines) between two classes. It is important to note that adding more instances that are "off the street" (not on the dashed lines) will not affect the decision boundary.
The decision boundary is fully determined (or "supported") by the most extreme instances of each class, or, in other words, the instances located on the edge of the street. These are called _support vectors_ (they are circled in black in the diagram).
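If you want to see large margin classification in action before we build anything ourselves, here is a minimal sketch using scikit-learn's SVC with a linear kernel on a toy blob dataset (the dataset and parameter values are illustrative assumptions, not the data from the diagrams above):
# A minimal sketch of large margin classification with scikit-learn
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs, so a wide "street" exists between the classes
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1000)   # a very large C approximates a hard margin
clf.fit(X, y)

print(clf.support_vectors_)          # the instances that "support" the street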
Limits of Hard Margin Classification

So, essentially, a hard margin SVM tries to fit a decision boundary that maximises the distance between the support vectors of the two classes. However, there are a few issues with this model:
- It is very sensitive to outliers
- It only works on data that is linearly separable
These two concepts can be clearly highlighted in this visualisation:
Problem 1: It is very sensitive to outliers

Note how the red point is an extreme outlier, and hence the SVM algorithm uses it as a support vector. Because the hard margin classifier maximises the distance between the support vectors, it uses the red outlier and the blue support vectors to set the decision boundary.
This results in a very poor decision boundary that is likely overfitting and will fail to predict new classes well.
Problem 2: it only works on data that is linearly separable

In this example, we can clearly observe that there is no possible linear classifier that will separate the classes. Additionally, there is a major outlier. So, the question is, how can an SVM separate non-linearly separable data?
Dealing with Outliers and Non-Linear Data
Approach one: Soft Margin SVM

One approach is to find a good balance between keeping the street as wide as possible (maximising the margin) and limiting the _margin violations_ (instances that end up in the middle of the street, or even on the wrong side of it). This is called a soft margin SVM.
Essentially you are controlling the trade-off between two objectives:
- Maximise the distance between the decision boundary and support vectors
- Maximise the number of points that are correctly classified by the decision boundary
This trade-off is usually controlled by a hyperparameter that can be denoted by λ or, more commonly (in scikit-learn), the C parameter. It essentially controls the misclassification cost. Concretely (see the short sketch after this list):
- A small value of C leads to a wider street but more margin violations (higher bias, lower variance)
- A large value of C leads to a narrower street with fewer margin violations (lower bias, higher variance).
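As a rough illustration of this trade-off, here is a sketch using scikit-learn's LinearSVC on a hypothetical synthetic dataset (not any of the data shown above):
# A rough sketch of the C trade-off using scikit-learn's LinearSVC
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.8, random_state=0)

soft = LinearSVC(C=0.01, max_iter=10000).fit(X, y)   # small C: wide street, more violations
hard = LinearSVC(C=100, max_iter=10000).fit(X, y)    # large C: narrow street, fewer violations
print(soft.score(X, y), hard.score(X, y))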
Although this approach can work, we have to figure out the optimal C parameter using cross-validation techniques. This can take a considerable amount of time. Additionally, one may want a model with no "slack" at all, i.e. no margin violations. So what is our solution now?
Approach two: Kernels

While linear SVMs work well in many cases, it is rare to have a dataset that is perfectly linearly separable. One approach to combat this is to add more features, such as polynomial features (these transform your features by raising their values to higher powers: think X², X³, etc.).
For instance, let’s say we have the following data:

Clearly, this dataset isn’t linearly separable.
However, when we apply a polynomial transformation, raising the feature to the 20th power:

We get a dataset that is linearly separable.
However, this is not feasible for large datasets; the polynomial transformation would simply take too long and be too computationally expensive.
Additionally, using high-order polynomial degrees creates a huge number of features, making the model too slow.
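For reference, a sketch of this "add polynomial features" approach could look like the following, using a scikit-learn Pipeline on the toy make_moons dataset (an assumed stand-in for the data above):
# A sketch of the "add polynomial features" approach with a scikit-learn Pipeline
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

poly_svm = make_pipeline(
    PolynomialFeatures(degree=3),    # explicitly creates the extra features
    StandardScaler(),
    LinearSVC(C=10, max_iter=10000),
)
poly_svm.fit(X, y)
print(poly_svm.score(X, y))
Note how PolynomialFeatures really does materialise the extra columns; that is exactly the cost we would like to avoid.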
That's where the beauty of SVMs comes into play; more specifically, the beauty of the kernel trick.
The Kernel Trick

Essentially, kernels are functions that calculate the relationships between data points as if they had been mapped into a higher dimension, and a standard support vector classifier is then fit using those relationships. The effect is the same as mapping the features from a relatively low dimension to a relatively high dimension.
However, kernel functions only calculate these high-dimensional relationships as if the points were in a higher dimension; they do not actually perform the transformation. This means the kernel function does not add any features, yet we get the same results as if we had.
This trick(calculating the high dimensional relationships between the data points without actually creating or transforming them) is known as the Kernel Trick.
The kernel trick reduces the computational complexity of the SVM by avoiding the math that transforms features from low dimensions to high dimensions!
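A tiny numerical sketch may help make this concrete. For a degree-2 polynomial kernel of the form (a · b + 1)², the value computed directly in the original 2-D space matches the dot product of an explicit 6-dimensional feature map (the feature map phi below is one standard choice, written out purely for illustration):
# A numerical sketch of the kernel trick for a degree-2 polynomial kernel
import numpy as np

def phi(x):
    # explicit degree-2 feature map for a 2-D point (this is what we want to avoid computing)
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(a, b):
    # the kernel computes the same quantity directly in the original 2-D space
    return (np.dot(a, b) + 1) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(a), phi(b)))   # 4.0 -- dot product in the 6-D space
print(poly_kernel(a, b))        # 4.0 -- same result, no explicit transformation
Same number, but the kernel never had to build the 6-dimensional vectors.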
Let’s look at two common Kernel functions:
1. The Polynomial Kernel
2. The Gaussian Radial Basis Function (RBF) Kernel
Kernel Function 1: The Polynomial Kernel

Essentially, the polynomial kernel calculates the high-dimensional relationships between the data points as if they had been mapped into a higher dimension, without actually adding any features.
The formula for the polynomial kernel is the following (I do apologise for hitting you with the maths without warning!):
K(x, x′) = (x · x′ + r)^d
- the d here is a hyperparameter and refers to the degree of the polynomial that the kernel should use, while r is a constant term (coef0 in scikit-learn).
An example
To give a concrete example, let’s suppose we have data that looks like this:

Clearly, this data is not linearly separable. However, if we were to use an SVM with a polynomial kernel, we would get the following high dimension mapping:

Again, the important idea to take from this is that the kernel function only calculates the high dimensional relationship between the points as if they were in high dimensions, but does not create or transform new features.
The following code implements a polynomial kernel with the SVC class in scikit-learn:
from sklearn.svm import SVC
svc = SVC(kernel="poly", degree=3, coef0=1, C=5)
svc.fit(X_train,y_train)
Obviously if your model is overfitting, you may need to reduce the degree of the polynomial. You may have noticed a few parameters here. Let me explain them briefly:
- kernel: defines the kernel to be used(we will explore some other options later)
- degree: this defines the degree of the polynomial kernel
- C: the misclassification cost parameter that controls the trade-off between having the widest possible margin and maximising the number of points correctly classified by the decision boundary.
Kernel Function 2: The Gaussian RBF Kernel

Another SVM kernel that is extremely popular is the Gaussian Radial Basis Function (Gaussian RBF). Essentially, this is a similarity function that measures how close an instance is to a landmark. The formula for the kernel function is given below:
K(x, x′) = exp(−γ ‖x − x′‖²)
To clarify a few notations (NOTE: anything that is unclear will be explained shortly):
- x: refers to an instance
- x′: refers to a landmark
- γ (sometimes written in terms of σ): refers to the gamma hyperparameter
The function itself follows a bell-shaped curve (hence why it's Gaussian), ranging from 0 (very far away from the landmark) to 1 (at the landmark).
I'm sure this is still fuzzy in your mind, so let's clear up the confusion the only way I know how: with an example!
An example
Observe the diagram of the 1D data below:

A landmark is essentially a reference point against which we measure the similarity of each instance. Here we have two landmarks, X2 and X3.
We are now ready to compute new features. For example, let's look at the instance x, which is equal to -1. It is located at a distance of 1 from the first landmark and a distance of 2 from the second. Therefore (taking γ = 0.3), its new mapped features would be:
x2 = exp(−0.3 × 1²) ≈ 0.74
and
x3 = exp(−0.3 × 2²) ≈ 0.30
Which is plotted in the following diagram:

And now we simply use a normal SVC to classify the points!
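If you want to check the arithmetic yourself, here is the same calculation in numpy (assuming, as in the numbers above, γ = 0.3 and distances of 1 and 2 to the two landmarks):
# Reproducing the worked example above in numpy (gamma assumed to be 0.3)
import numpy as np

gamma = 0.3
distances = np.array([1.0, 2.0])               # distances of x = -1 to the two landmarks
new_features = np.exp(-gamma * distances ** 2)
print(new_features)                            # approximately [0.74, 0.30]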
You may be thinking, cool! But:
- How do you select landmarks?
- You still didn’t explain what the heck gamma means!
Selecting Landmarks
True! Well, to address the first question: usually you create a landmark at the location of each and every instance in the dataset. This creates many dimensions and thus increases the chances that the transformed training set will be linearly separable.
However, once again, just like polynomial transformations, this is computationally expensive and requires a lot of features to be added: a training set with m instances and n features gets transformed into a training set with m instances and m features (assuming you drop the original features).
In other words, if you had 20 instances, each one would get 20 landmark features, which means computing 20 × 20 = 400 similarity values for even this tiny dataset!
Luckily for us, the kernel trick does its magic; it makes it possible to compute these higher dimensional relationships without actually transforming or creating new features, and still getting the same result as if you had!
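As a sketch of what this looks like in practice, scikit-learn's rbf_kernel can compute every instance-to-landmark similarity in one go (the 20 random 2-D instances below are just a hypothetical example):
# Computing all pairwise RBF similarities at once
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.RandomState(0).randn(20, 2)   # 20 hypothetical instances
K = rbf_kernel(X, X, gamma=0.3)             # every instance acts as a landmark
print(K.shape)                              # (20, 20): 20 similarity features per instance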
Gamma
Now, gamma is a special hyperparameter that is specific to the RBF kernel. Referring back to our plot of the RBF function above, gamma controls the width of the bell-shaped curve.
Concretely, a large value of gamma narrows the bell-shaped curve and reduces the range of influence of each instance, resulting in a more irregular decision boundary that wiggles between individual data points. Conversely, a small value of gamma increases the range of influence of each data point and results in a much smoother decision boundary.
The scikit-learn implementation of an SVC with an rbf kernel would be the following:
SVC(kernel="rbf", gamma=5, C=0.001)
NOTE: In this article, I will only be coding a soft and hard margin SVM. In future articles I will cover how to implement the kernel trick in an SVM, so be sure to stay tuned for that.
The Math of the Hard Margin and Soft Margin SVM

Yes, I'm sorry, but you do unfortunately need to understand the math in order to code it. If you really hate math, feel free to skip this section, but I highly advise at least trying to understand what is going on, as it will give you a better sense of the problem at hand.
Before I actually get to the math, let me give you a step by step guide as to how the SVM works:
- Fit a hyperplane to the data and try to classify the points
- Using an optimisation algorithm, adjust the parameters of the model so that it keeps the largest margin possible between the support vectors.
- Repeat for n iterations, or until the cost function is minimised.
Now, let me explain some of those key terms before we dive into the SVM Math.
Cost Functions

A cost function is essentially a formula that measures the loss, or the "cost" of your model. If you have ever done any Kaggle competitions, you may have come across some of them. A few common ones include:
- Mean Squared Error
- Root Mean Squared Error
- Mean Absolute Error
- Log Loss
The cost function that we will be using is called the hinge loss. The formula for the function is the following:
loss = max(0, 1 − y · (w · x − b))
Graphically, the hinge loss looks like this:

In this plot, the blue represents the loss for correct classification, and the green represents the loss for misclassification.
Note that the hinge loss penalises predictions where data points are inside the margin, even if we classified them correctly.
Essentially, we will be using this to measure our algorithm's performance and make sure we reach our goal (maximising the margin).
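For the curious, a minimal numpy sketch of the hinge loss (assuming labels in {-1, +1} and scores of the form w · x − b) could look like this:
# A minimal numpy sketch of the hinge loss
import numpy as np

def hinge_loss(y, scores):
    # zero for points that are correctly classified and outside the margin,
    # growing linearly once a point violates the margin or is misclassified
    return np.maximum(0, 1 - y * scores).mean()

y = np.array([1, -1, 1])                 # labels in {-1, +1}
scores = np.array([2.0, -0.5, -1.0])     # hypothetical values of w . x - b
print(hinge_loss(y, scores))             # (0 + 0.5 + 2) / 3 = 0.8333...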
Optimisation Algorithms
Optimisation is usually defined as the process of improving something so that it operates at its full potential. This is also applicable in Machine Learning. In the world of ML, optimisation is essentially trying to find the best combination of parameters for a certain dataset. This is essentially the "learning" bit of Machine Learning.
While many optimisation algorithms exist, two of the most common ones are Gradient Descent and the Normal Equation; here I will focus on Gradient Descent.
Gradient Descent
Gradient Descent is an optimisation algorithm that aims to find the minimum of a function. It achieves this goal by iteratively taking steps in the direction of the negative slope. In our case, gradient descent continuously updates the weights by moving along the slope of the line tangent to the cost function. Well, fantastic, sounds great. English please? 🙂
A concrete example of Gradient Descent

To better illustrate gradient descent, let's go through a simple example. Imagine a person at the top of a mountain who wants to get to the bottom. They might look around and see in which direction they should take a step in order to get down quickest. Then they take a step in that direction, and now they are closer to their goal. However, they have to be careful on the way down, as they might get stuck at a certain point, so they have to choose their step sizes accordingly.
Similarly, the objective of gradient descent is to minimise a function; in our case, to minimise the cost of our model. It does this by computing the slope (gradient) of the function and moving in the opposite, downhill direction. The size of each "step" of the algorithm is defined by what is known as the learning rate, which essentially controls how far we move down at each iteration. With this parameter, we have to be careful of two cases:
- If the learning rate is too large, the algorithm might not converge (reach a minimum) and may bounce around the minimum without ever settling at it
- If the learning rate is too small, the algorithm will take too long to reach the minimum, and might also get "stuck" at a sub-optimal point.
We also have a parameter that controls the number of times the algorithm iterates over the dataset.
Visually, the algorithm would do something like this:

Because this algorithm is so vital to know in Machine Learning, let’s recap what it does:
- The model randomly initialises the weights. This is called (you guessed it) random initialisation.
- The model then makes predictions using these random weights.
- The model’s predictions are evaluated through a cost function.
- The model then runs a gradient descent step: it computes the gradient (the slope of the tangent line) of the cost function and takes a step downhill along it
- The process is repeated for N iterations, or until a stopping criterion is met.
This process is shown mathematically like so:
θn := θn − α · (1/m) · Σ (hθ(x) − y) · xn    (summed over the m training examples)
Important things to note here:
- α: this is the symbol for the learning rate (remember: the size of the step)
- m: the number of training examples
- hθ(x): our predictions
- θn: the nth coefficient of our model
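To make this concrete, here is a tiny gradient descent sketch that fits a simple line by repeatedly stepping against the gradient of the mean squared error (the data, learning rate and iteration count below are arbitrary illustrative choices):
# A tiny gradient descent sketch: fitting a line by stepping against the MSE gradient
import numpy as np

rng = np.random.RandomState(42)
X = rng.rand(100)
y = 3 * X + 2 + 0.1 * rng.randn(100)   # hypothetical data: slope 3, intercept 2, plus noise

theta = np.zeros(2)                    # [intercept, slope]
alpha, n_iters = 0.5, 1000             # learning rate and number of iterations

for _ in range(n_iters):
    preds = theta[0] + theta[1] * X
    error = preds - y
    grad = np.array([error.mean(), (error * X).mean()])  # gradient of the (halved) MSE
    theta -= alpha * grad              # step in the negative gradient direction

print(theta)                           # approximately [2, 3]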
Advantages and Disadvantages of Gradient Descent
Advantages:
- For convex cost functions (like the one we are using), gradient descent is very likely to reduce the cost to a value at or very close to the global minimum
- One of the most effective optimisation algorithms
Disadvantages:
- Can be slow on large datasets, as it uses the whole dataset to compute the gradient at every step
- Is liable to get stuck at a sub-optimal point (a local minimum) when the cost function is not convex
- The user has to manually choose the learning rate and the number of iterations, which can be time-consuming
Ok, I know you really want to get on to the actual coding, but this last part is the most important part of the math for the SVM, so hang in there!
The math of SVM
For predictions:
ŷ = sign(w · x − b)
We essentially do the following:
- Predict class 1 if the dot product of the weights and the features, minus the bias term, is greater than or equal to 1
- Predict class 0 (represented internally as −1) if the dot product of the weights and the features, minus the bias term, is less than or equal to −1
The condition:
yᵢ · (w · xᵢ − b) ≥ 1
This is the condition of the hinge loss for correct classification. It corresponds to the blue line on the plot of the hinge loss we saw earlier. Basically, this checks whether the given instance was classified correctly (with enough margin) or not.
For correct classification:
w := w − α · (2λw)    (the bias b is left unchanged)
This is our formula for correct classification. To clarify:
w: the weight vector of the model
α: the learning rate for gradient descent that we talked about earlier
λ: the regularisation parameter (it plays the role of the C parameter in our implementation, although here a larger λ means stronger regularisation)
For incorrect classification:
w := w − α · (2λw − yᵢ · xᵢ)    and    b := b − α · yᵢ
This is the formula for incorrect classification. Note how we also adjust the bias term.
Ok, so finally, ladies and gentlemen, the main event: SVM from Scratch!
Coding an SVM from Scratch

Let’s finally begin! First, let’s do some basic imports:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
Yes, no sklearn models! This is going to be coded in plain numpy!
Next, let’s make a basic dataset and split our data into train and test:
X, y = datasets.make_classification(n_samples=500, n_features=5, random_state=42, class_sep=0.7)
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42, test_size=0.2)
I'm going to make a synthetic dataset for this tutorial and set class_sep to 0.7, meaning that it should be relatively simple to classify the data.
Now, to adhere to software engineering principles, I will build an SVM class, which makes the code look cleaner and easier to share:
class SVM:
    def __init__(self, learning_rate=0.001, C=0.001, n_iters=100):
        self.lr = learning_rate
        self.C = C
        self.n_iters = n_iters
        self.w = None
        self.b = None
So here, we essentially store our learning rate, regularisation parameter and number of iterations, and we set the weights and bias to None (they will be initialised in the fit method).
Next, we define our fit method:
    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Convert 0 labels to -1 so that we can use the hinge loss
        y_ = np.where(y <= 0, -1, 1)
        self.w = np.random.rand(n_features)
        self.b = 0
Once we are given the training features and target vector, we randomly initialise our weights as a vector with one entry per feature and set the bias to zero. Note how we convert the 0 labels in our dataset to −1 so that we can use the hinge loss.
Now, we move on to the core of the method:
        for _ in range(self.n_iters):
            for idx, x_i in enumerate(X):
                condition = y_[idx] * (np.dot(x_i, self.w) - self.b) >= 1
So, we are essentially doing the following:
- Loop over the dataset n_iters times (by default n_iters=100)
- For each instance x_i and its index, set the condition: is the target value multiplied by (the dot product of the instance and the weights, minus the bias) greater than or equal to one? This essentially checks whether we classified the instance correctly, with enough margin, according to the hinge loss.
Next, we translate our formula for correct classification into code:
                if condition:
                    self.w -= self.lr * (2 * self.C * self.w)
and our formula for incorrect classification into code:
                else:
                    self.w -= self.lr * (2 * self.C * self.w - np.dot(x_i, y_[idx]))
                    self.b -= self.lr * y_[idx]
Finally, we define our predict function:
    def predict(self, X):
        preds = np.sign(np.dot(X, self.w) - self.b)
        return np.where(preds == -1, 0, 1)
We make sure to convert our labels that equal -1 back to zero.
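Putting the pieces together, the full class (exactly the snippets above, assembled in one place) looks like this:
import numpy as np

class SVM:
    def __init__(self, learning_rate=0.001, C=0.001, n_iters=100):
        self.lr = learning_rate
        self.C = C
        self.n_iters = n_iters
        self.w = None
        self.b = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Convert 0 labels to -1 so that we can use the hinge loss
        y_ = np.where(y <= 0, -1, 1)
        self.w = np.random.rand(n_features)
        self.b = 0
        for _ in range(self.n_iters):
            for idx, x_i in enumerate(X):
                condition = y_[idx] * (np.dot(x_i, self.w) - self.b) >= 1
                if condition:
                    self.w -= self.lr * (2 * self.C * self.w)
                else:
                    self.w -= self.lr * (2 * self.C * self.w - np.dot(x_i, y_[idx]))
                    self.b -= self.lr * y_[idx]

    def predict(self, X):
        preds = np.sign(np.dot(X, self.w) - self.b)
        return np.where(preds == -1, 0, 1)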
Now, we simply call our function and get the accuracy of our model on the test set:
clf = SVM()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
(preds == y_test).mean()
OUT:
0.82
I have added a visualise_svm() function to help visualise the SVM, which can be accessed from the GitHub repo linked at the end of this article. Running the function outputs the following:

Now, if you have not guessed, we have just implemented a soft margin SVM. We set the C value to be 0.001, and we can clearly see that the decision boundary allowed some points to be in the margin and on the wrong side, but resulted in a better hyperplane.
Now, when we change the C value to 0.9 (which, in our formulation, where C multiplies the regularisation term, means much stronger regularisation), we get the following plot:

And our accuracy has decreased from 0.82 to 0.7
Lockdown homework
I want you to try out some tasks:
- Try adjusting C to a really small and a really big value. How does the decision boundary change?
- Adjust the learning rate for the gradient descent algorithm. Do you get a better result?
I really appreciate everyone who motivates me to write good articles. I thank my loyal followers and everybody who has decided to read my work. I assure you, it does not go unnoticed. I hope to always produce interesting and fascinating work for readers.
I hope you learned something new, and possibly refreshed some old knowledge. Be sure to stay tuned for more, and I wish you all the best!
PS: here is the link to the Github Code
