
Support Vector Machines (SVM)

Separated and yet supportive

Image by Viktor Bondar on 123rf

History

This article is an introduction to SVMs: the mathematical intuition behind them, regularization, an implementation of the concept in code, and the fields where they are applied. So tighten your seat belt and let's enter the wonderland of this beautiful concept, the Support Vector Machine (SVM). Vladimir Vapnik, one of the pioneers of the machine learning world, developed the idea behind SVMs in the early 1960s. However, he was unable to demonstrate the power of the algorithm computationally, because the resources he needed simply did not exist back then. It was in the early 1990s that Bell Laboratories invited him to the US for research work. Around 1993 there was fast-moving research in handwritten digit recognition, and Vapnik bet that his SVM could do better than neural nets at the task. One of his colleagues tried the SVM and, to his surprise, it worked wonders. So the work conceived in the 1960s finally received its recognition in the 1990s, when it was implemented and shown to give better computational results.

Introduction

Support Vector Machines are a supervised machine learning algorithm used for classification and regression modeling. The purpose of an SVM is to locate samples in space and segregate them by class. For example, if I give you a list of fruit images, the algorithm classifies each input image as, say, an Apple or a Strawberry. Note that in this example the classification rests on the assumption that the classes are easily separable. What if the images were distributed in such a way that it's difficult to split the classes by drawing a straight line? Before we analyze that situation, we need to familiarize ourselves with a few basics.

Let’s take a step further by understanding a few technical terms:

Hyperplane: The plane that separates the positive and negative classes of the distribution in a given space.

Margin: The lines drawn parallel to the hyperplane on either side of it that bound the positive and negative classes; the gap between them is the margin width.

Support Vectors: The positive and negative points nearest to the hyperplane, lying on the margins on either side of it.

Image by Author

The Tricky Part !!

I am sure a question might have popped up in your head: why can't we find a separating hyperplane and margins in some other direction? Yes, that's possible, but we have to choose the hyperplane in such a way that the distance from the hyperplane to the nearest points on either side, and hence the gap between the margins, is maximum. And why does it have to be maximum? To avoid misclassifying points in that region: the wider the separating margins, the more accurate the classification will be. Fig 1 below shows enough separation between the margins to avoid misclassifying new points and to correctly allocate them to their respective classes. Fig 2 shows a hyperplane in a direction that leaves an insufficient gap between the margins and ends up misclassifying points.

Fig 1: Image by Author

Fig 2: Image by Author

Maximize Margin – Mathematical Intuition

Image by Author

Consider a vector Ŵ in a direction orthogonal to the hyperplane, as shown in the figure. Assume we have an unknown point u whose class (positive or negative) is not yet known. Since Ŵ is perpendicular to the hyperplane, we can project u onto it by taking the dot product of Ŵ and u. For our mathematical assumption, we consider the point to be on the positive side whenever this dot product is at least some constant c.

Ŵ∙u >= c

Rearranging, and setting b = –c:

Ŵ∙u – c >= 0

Ŵ∙u + b >= 0 – (1)

Equation (1) is our decision rule, which we will utilize at a later stage.
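To make the decision rule concrete, here is a tiny sketch in Python. The weight vector Ŵ and bias b are made-up numbers purely for illustration; in practice they are learned from data.

```python
import numpy as np

# Hypothetical (made-up) weight vector and bias, for illustration only
w = np.array([1.0, 2.0])
b = -3.0

def decide(u):
    """Decision rule from Eqn (1): positive side if w.u + b >= 0, else negative."""
    return "+" if np.dot(w, u) + b >= 0 else "-"

print(decide(np.array([3.0, 1.0])))  # w.u + b = 2.0  -> "+"
print(decide(np.array([0.0, 0.0])))  # w.u + b = -3.0 -> "-"
```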

Let us assume that the points on the right of the plane are positive and represented as +1.

The points on the left of the plane are negative and represented as -1.

So, using the decision rule from equation (1), we get the two equations below for positive and negative points.

Ŵ∙(x+) + b >= 1 – (2)

Ŵ∙(x-) + b <= -1 – (3)

Now, for mathematical convenience, we introduce a class label variable y that satisfies the above equations:

y = +1 when Ŵ∙(x+) + b >= +1 – (4)

y = -1 when Ŵ∙(x-) + b <= -1 – (5)

Multiplying both sides of Eqns (2) and (3) by the corresponding y, the two cases collapse into a single necessary condition for any point to be correctly classified:

y (Ŵ∙x + b) >= 1 – (6)

For the points that lie exactly on the margins, i.e. the support vectors, the equality holds:

y (Ŵ∙x + b) – 1 = 0 – (7)

Now we want to maximize the distance between the two margins. Let the support vectors lying on the positive and negative margins be X+ and X-. Our goal is to compute the width of the gap between them, measured along the direction orthogonal to the hyperplane. To do that we project the difference (X+) – (X-) onto the unit vector Ŵ / || Ŵ ||; multiplying by a unit vector does not change the scale, it simply picks out the component of the difference along the direction of Ŵ.

Margin Width = [(X+) – (X-)] ∙ ( Ŵ / || Ŵ || )

Image by Author

Now, from Eqn (7), we can calculate Ŵ∙(X+) and Ŵ∙(X-).

Putting y = +1 for X+:

(+1) (Ŵ∙(X+) + b) – 1 = 0

Ŵ∙(X+) = 1 – b – (8)

Putting y = -1 for X-:

(-1) (Ŵ∙(X-) + b) – 1 = 0

Ŵ∙(X-) = -1 – b – (9)

Substituting eqns (8) and (9) into the margin width calculation:

Margin Width = [(X+) – (X-)] ∙ ( Ŵ / || Ŵ || )

= [ Ŵ∙(X+) – Ŵ∙(X-) ] / || Ŵ ||

= (1 – b + 1 + b) / || Ŵ ||

= 2 / || Ŵ ||

Our goal is to maximize the margin width: Max ( 2 / || Ŵ || )

subject to the condition y (Ŵ∙x + b) >= 1 for every point, with equality for the support vectors.

There you go, we have finally reached the margin width formula.
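As a quick numeric sanity check of the formula, here is a minimal sketch with a hand-picked hyperplane; the values of Ŵ, b, X+ and X- are assumptions chosen so that the margin conditions hold exactly.

```python
import numpy as np

# Assumed hyperplane w.x + b = 0 with w = (2, 0), b = 0
w = np.array([2.0, 0.0])
b = 0.0

# Points lying exactly on the margins w.x + b = +1 and w.x + b = -1
x_plus = np.array([0.5, 0.0])    # w.x_plus + b = +1
x_minus = np.array([-0.5, 0.0])  # w.x_minus + b = -1

# Projection of (x_plus - x_minus) onto the unit vector w / ||w||
width = (x_plus - x_minus) @ (w / np.linalg.norm(w))
print(width, 2 / np.linalg.norm(w))  # both print 1.0
```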

Does it end here? No, not really!

We now want to optimize this formula and generalize it. To optimize it, we have to make sure the problem is convex. If you are not familiar with convexity in optimization, that's okay; for now, just note that maximizing 2 / || Ŵ || is equivalent to minimizing || Ŵ ||² / 2, which is a convex objective and therefore much easier to optimize.

So we reframe our margin-width objective as:

Min (½) || Ŵ ||²

subject to the condition y (Ŵ∙x + b) >= 1 for every point, with equality for the support vectors.
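Because the reframed problem is a convex quadratic program, it can be handed to a generic convex solver. Below is a minimal sketch using the cvxpy library (cvxpy and the toy data are my assumptions, not part of the original article) that solves the hard-margin primal directly and recovers the margin width 2 / || Ŵ ||.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data with labels in {-1, +1} (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Hard-margin primal: minimize (1/2)||w||^2  subject to  y_i (w.x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin width =", 2 / np.linalg.norm(w.value))
```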

Optimizing using Lagrange multipliers

We know that when we have an objective function and constraints, we can use the method of Lagrange multipliers to combine them into a single function and study how it depends on the variables. So, using our objective function and constraints, we construct the Lagrangian:

L = (½) || Ŵ ||² – ∑ αi [ yi ( Ŵ∙xi + b) – 1 ] – (1)

At the optimum, the partial derivatives of the Lagrangian vanish:

∂L / ∂Ŵ = 0

∂L / ∂b = 0

Applying these to our Lagrangian (1), we get:

From ∂L / ∂Ŵ = 0:

Ŵ – ∑ αi yi xi = 0

Ŵ = ∑ αi yi xi – (2)

From ∂L / ∂b = 0:

∑ αi yi = 0 – (3)

Applying (2) and (3) in (1)

L = ½ ( ∑ αi yi xi ) ∙ ( ∑ αj yj xj ) – ( ∑ αi yi xi ) ∙ ( ∑ αj yj xj ) – b ∑ αi yi + ∑ αi

L = ∑ αi – ½ ∑∑ αi αj yi yj ( xi ∙ xj )

So we finally conclude that the optimization depends on the data only through the dot products xi ∙ xj of pairs of samples. This observation is what later lets us swap a kernel function in place of the plain dot product.
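We can also verify the relation Ŵ = ∑ αi yi xi with scikit-learn, whose fitted SVC exposes the products αi yi for the support vectors through its dual_coef_ attribute. The dataset below is a made-up pair of well-separated blobs, and the very large C approximates the hard-margin setting.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters, so the hard-margin picture roughly applies
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so the weight
# vector w = sum_i alpha_i y_i x_i can be rebuilt from the support vectors
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True
```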

Linearly separable or Not?

Going back to our first figure, both the positive and negative classes were separated neatly and cleanly by drawing a straight line. Such data is known as linearly separable. But is it always possible to separate the two classes with a straight line? What if we end up with a huge number of misclassified points when we attempt to draw one? Consider the figure below, where the points are not easily separable. In such a case we use kernels. The role of a kernel is to map the points from the given dimensional space into a higher-dimensional space. For example, in the figure below the points live in 2-D, and after applying the kernel mapping they are lifted to 3-D, where, to our pleasant surprise, they become linearly separable.

Image by Arash Saeidpour on ResearchGate
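As a rough sketch of this idea (the concentric-circles dataset and the explicit x1² + x2² feature are my own choices for illustration), the snippet below lifts 2-D points into 3-D by hand and shows that a linear separator that fails in 2-D succeeds in 3-D; an RBF-kernel SVC achieves the same effect implicitly.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC, SVC

# Concentric circles: impossible to separate with a straight line in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Lift each 2-D point (x1, x2) to 3-D by adding the feature x1^2 + x2^2
X3 = np.c_[X, X[:, 0] ** 2 + X[:, 1] ** 2]

print(LinearSVC(max_iter=10000).fit(X, y).score(X, y))    # poor in 2-D
print(LinearSVC(max_iter=10000).fit(X3, y).score(X3, y))  # ~1.0 after lifting to 3-D
print(SVC(kernel="rbf").fit(X, y).score(X, y))            # ~1.0 via the kernel trick
```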

Regularization

Imagine a case where the margin width is maximized but, at the same time, a couple of points end up misclassified. What do we do in such a case? Do we rearrange the hyperplane so that the misclassification is avoided? If you think carefully, the answer has to be no: we are here to optimize and generalize. We cannot afford to compromise the margin width by moving the hyperplane every time, even if that means classifying a few points wrongly. This is where regularization comes in. We look for a balance between the alignment of the hyperplane (keeping the margin wide) and the errors we allow. We introduce two quantities: a penalty parameter C, which controls how heavily misclassifications are punished, and slack variables Ɛi, one per point, which measure how far each point falls on the wrong side of its margin. Tuning C tells the SVM how much to bend toward outliers: a small C makes the model tolerant of a few misclassifications in exchange for a wide margin, while a large C forces it to fit almost every point, so we can find the balance between immunity to misclassification and a maximized margin width.

Regularised objective = Min (½) || Ŵ ||² + C ∑ Ɛi, subject to yi (Ŵ∙xi + b) >= 1 – Ɛi with Ɛi >= 0
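To see the trade-off controlled by C, the sketch below (the overlapping-blobs dataset and the specific C values are assumptions chosen for illustration) fits a linear SVC for several values of C and reports the resulting margin width 2 / || Ŵ ||: a small C tolerates more slack and keeps the margin wide, while a large C punishes every misclassification and narrows it.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes, so some misclassification is unavoidable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2 / np.linalg.norm(clf.coef_)
    print(f"C={C:>7}: margin width={width:.2f}, "
          f"support vectors={clf.n_support_.sum()}")
```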

Code Implementation

Iris classification is one of the best examples for understanding the SVM classifier. Iris is a type of flower with around 250–300 species. In our example we have a dataset containing flower measurements: sepal width, sepal length, petal width, and petal length. Our goal is to classify each flower into the correct species based on these measurements using the SVM classifier.

Image by Author

We split the data into training and test sets using the utilities available in the sklearn library. We also import the SVC classifier from sklearn and train a model on the training data.
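A minimal sketch of these steps is shown below; the kernel, C, test split size, and random seed are assumptions, since the original screenshots do not pin them down.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the Iris dataset bundled with sklearn (sepal/petal measurements + species)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Train an SVC classifier on the training split
model = SVC(kernel="rbf", C=1.0)
model.fit(X_train, y_train)
```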

After training the model, we apply the predict method to the test dataset. We then validate the model output using accuracy metrics such as the confusion matrix and the classification report.
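Continuing the sketch above, prediction and validation might look like this:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Predict the species of the held-out test flowers
y_pred = model.predict(X_test)

# Validate with a confusion matrix and per-class precision/recall/F1
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```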

Confusion matrix (Image by Author)

Image by Author

Applications of SVM

· Character Recognition

· Bioinformatics

· Image Classification

· Face Detection

Thank you for taking the time to read this article!

