
Ace your Machine Learning Interview - Part 4

Dive into Support Vector Machines using Python

This is the fourth article in the series I started on the Machine Learning fundamentals you should know in order to approach your interview. Here are the links to the previous articles.

  1. Ace your Machine Learning Interview – Part 1: Dive into Linear, Lasso and Ridge Regression and their assumptions
  2. Ace your Machine Learning Interview – Part 2: Dive into Logistic Regression for classification problems using Python
  3. Ace your Machine Learning Interview – Part 3: Dive into Naive Bayes Classifier using Python

Introduction

The algorithm underlying Support Vector Machines (SVMs) is intuitively very simple, although the math can be complex for those without a strong mathematical background.

As with logistic regression, we want to separate two classes of points using a straight line, or a hyperplane in n dimensions. But unlike logistic regression, we try to find the hyperplane that maximizes the margin.

The margin is defined as the distance between the separating hyperplane (decision boundary) and the training examples that are closest to this hyperplane, which are the so-called support vectors. Let’s look at an example.

By maximizing the margin we find a line that, as in the example above, has a large generalization capacity. It will not just fit (or overfit) the training data, but will also be able to make good predictions on new data.

Goal: find the hyperplane that maximizes the margin. Large margins allow us to generalize better than small margins.

How does it work?

Let’s assume that our positive points are labeled +1 and our negative points -1. We can thus define the functions describing the margin lines and the decision boundary. In addition, we denote by w the vector perpendicular to the hyperplane. This is summarized in the following figure.

Now suppose you are a positive point x that lies exactly on the decision boundary and needs to be classified. How many steps k in the direction of w must you take to be classified as positive? Let’s calculate it.
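Here is a sketch of that calculation in standard notation, assuming the conventions above: the decision boundary is $w^T x + b = 0$ and the margin lines are $w^T x + b = \pm 1$.

$$
x' = x + k\,\frac{w}{\lVert w \rVert}
\quad\Longrightarrow\quad
w^T x' + b \;=\; \underbrace{w^T x + b}_{=\,0} \;+\; k\,\frac{w^T w}{\lVert w \rVert} \;=\; k\,\lVert w \rVert
$$

Requiring $w^T x' + b = 1$ (the positive margin line) gives $k = 1/\lVert w \rVert$, so the full margin between the two lines has width $2/\lVert w \rVert$.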

This should give you an idea of how we arrive at wanting to maximize the margin. Since the margin is 2/||w||, with ||w|| in the denominator, maximizing it is equivalent to minimizing ||w||. Of course, we want to minimize ||w|| while still respecting the constraint that the points must be classified correctly.

Finally, the Hard Margin SVM optimization problem can be defined as follows.
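In standard notation, with training points $x_i$ and labels $y_i \in \{-1, +1\}$:

$$
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w^T x_i + b) \ge 1 \quad \text{for all } i
$$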

Soft Margin SVM

As you have probably already guessed, if there is a Hard Margin SVM, there is also a Soft Margin SVM.

In this version, we allow the model to make some errors, that is, we let some points lie within the margin lines. By allowing a few errors we can keep a decision boundary that generalizes well. At the theoretical level, we are introducing so-called slack variables, first introduced by Corinna Cortes and Vladimir Vapnik in 1995. In the context of SVMs, the strength of the penalty on these slack variables is controlled by a hyperparameter C, which has to be tuned. In this case our minimization problem changes as follows.
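In standard notation, each slack variable $\xi_i$ measures how much point $i$ violates its margin, and C weighs the total violation against the size of the margin:

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{subject to} \quad
y_i\,(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
$$

A large C penalizes margin violations heavily, behaving almost like the hard-margin case, while a small C tolerates more violations and produces a wider, softer margin.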

Let’s code!

Implementing an SVM using sklearn is quite simple; let’s see how to do it. Here too we use the well-known Iris dataset, which consists of four features: sepal length, sepal width, petal length, and petal width (all in centimeters).

The dataset is provided by sklearn under an open license; you can find it here.

We will use only two features of the Iris dataset, for visualization purposes. So let’s load and standardize our data.
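A minimal sketch of this step; the choice of petal length and petal width as the two features, and the 70/30 split, are my assumptions for illustration:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load Iris and keep only two features (petal length, petal width)
# so we can plot the decision boundary in 2D.
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]  # petal length (cm), petal width (cm)
y = iris.target

# Hold out 30% of the data for testing, stratified to keep class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Standardize features: fit the scaler on the training set only,
# then apply the same transformation to the test set.
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
```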

Now we can train the SVM and plot the decision boundaries.
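A sketch of this step, reusing the variables from the snippet above; the linear kernel and C=1.0 are assumptions chosen to keep the example simple:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC

# Train a linear SVM; C controls how strongly margin violations are penalized.
svm = SVC(kernel="linear", C=1.0, random_state=1)
svm.fit(X_train_std, y_train)
print(f"Test accuracy: {svm.score(X_test_std, y_test):.3f}")

# Evaluate the classifier on a grid of points to draw the decision regions.
x1_min, x1_max = X_train_std[:, 0].min() - 1, X_train_std[:, 0].max() + 1
x2_min, x2_max = X_train_std[:, 1].min() - 1, X_train_std[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                       np.arange(x2_min, x2_max, 0.02))
Z = svm.predict(np.c_[xx1.ravel(), xx2.ravel()]).reshape(xx1.shape)

# Shade the decision regions and overlay the (standardized) training points.
plt.contourf(xx1, xx2, Z, alpha=0.3, cmap="viridis")
for label in np.unique(y_train):
    plt.scatter(X_train_std[y_train == label, 0],
                X_train_std[y_train == label, 1],
                label=iris.target_names[label], edgecolor="black")
plt.xlabel("petal length (standardized)")
plt.ylabel("petal width (standardized)")
plt.legend(loc="upper left")
plt.show()
```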

Final Thoughts

Support Vector Machines are one of the main algorithms to know when approaching Machine Learning. As we have seen, they have their advantages and disadvantages, but certainly when you are asked about basic Machine Learning, expect a question about them!

SVMs can do much more than what I covered in this article. In particular, SVM kernels make it possible to classify data that is not separable by a hyperplane. Using the kernel trick, we can implicitly map the data into a higher-dimensional space and classify the points there. We will take a good look at how to do this in the next article in this series! 😁

The End

Marcello Politi

Linkedin, Twitter, CV

