Support Vector Machines & Imbalanced Data

How does SVM work in the case of an imbalanced dataset?

Deepthi A R
Towards Data Science


Contents:

  1. A brief overview of SVMs
  2. Observe how SVM works on an imbalanced dataset.
  3. Observe how the hyperplane changes as the regularization term changes.

A brief overview of SVMs

In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other.
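As a quick illustration, here is a minimal sketch using scikit-learn's SVC on a tiny made-up dataset (the points and labels here are invented purely for demonstration):

# A minimal sketch: train a linear SVM on a tiny made-up dataset
# and let it assign a new example to one of the two categories.
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [2, 2], [2, 3], [3, 2]]  # training examples
y = [0, 0, 0, 1, 1, 1]                                # their class labels

clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[2.5, 2.5]]))  # -> [1], the new point lands in class 1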

In SVM, the hyperplane is chosen so that it's equidistant from the closest observations of both classes, which makes that closest distance as large as possible.

Confused? Take a look at the explanation below.

Figure 1

Let's suppose the red dots are class 1 and the green dots are class 2.

Now, if I'm asked to draw a threshold that separates these two classes, I could draw it almost anywhere; there are literally countless ways of doing it.

Figure 2

What happens when I draw a threshold as shown in Figure 2? Anything on the left will be classified as a red point and anything on the right will be classified as a green point.

Let’s analyse.

Figure 3

This new point can be classified as a red dot, that is, class 1.

Figure 4

This new point can be classified as a green dot, that is, class 2.

Okay! That was pretty straightforward. But let's look at the situation below.

Figure 5

This new point will be classified as a green point as it's on the right side of the threshold. According to the threshold, yes, that's a correct classification. But is it actually right? NO!

The point is much closer to the red points and much farther from the green points. So this is a wrong classification.

Then how do we decide the right threshold?

Figure 6

Let's focus on the observations that lie on the edge of each class. Now let's draw a threshold in such a way that it's equidistant from both of these points.

Figure 7

Now, any point on the left of this threshold will be closer to the red points than to the green points, and hence will be classified as a red point. And any point on the right will be a green point.

Figure 8

The edge observations are called the support vectors.

The shortest distance between the observations and the threshold is called the margin. When the threshold is halfway between the two observations, the margin is as large as it can be.
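To make this concrete, here is a small sketch (with made-up Gaussian blobs) that fits a nearly hard-margin linear SVM and reads off the support vectors and the margin width, which for a linear SVM equals 2/||w||:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack((rng.normal(0, 0.3, size=(20, 2)),   # class 0 blob
               rng.normal(2, 0.3, size=(20, 2))))  # class 1 blob
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1000).fit(X, y)  # large C ~ hard margin

print(clf.support_vectors_)   # the edge observations
w = clf.coef_[0]
print(2 / np.linalg.norm(w))  # the margin width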

Now let's dive in…

SVM & Imbalanced Data

First, let's create the imbalanced datasets; each of these will have positive and negative classes.

Dataset 1 — 100 positive points and 2 negative points

Dataset 2 — 100 positive points and 20 negative points

Dataset 3 — 100 positive points and 40 negative points

Dataset 4 — 100 positive points and 80 negative points

import numpy as np
import matplotlib.pyplot as plt

ratios = [(100, 2), (100, 20), (100, 40), (100, 80)]
plt.figure(figsize=(20, 6))
for j, i in enumerate(ratios):
    plt.subplot(1, 4, j + 1)
    X_p = np.random.normal(0, 0.05, size=(i[0], 2))    # positive class
    X_n = np.random.normal(0.13, 0.02, size=(i[1], 2)) # negative class
    y_p = np.array([1] * i[0]).reshape(-1, 1)
    y_n = np.array([0] * i[1]).reshape(-1, 1)
    X = np.vstack((X_p, X_n))
    y = np.vstack((y_p, y_n))
    plt.title("Dataset " + str(j + 1) + ": " + str(i))
    plt.scatter(X_p[:, 0], X_p[:, 1])
    plt.scatter(X_n[:, 0], X_n[:, 1], color='red')
plt.show()

This code creates the 4 different datasets, which look as follows:

Figure 9

So now that you have seen what our datasets look like, let's go ahead.

Now we have to draw a hyperplane separating the points. We will consider 3 different values of our regularization term C and observe how the hyperplane changes with it on each of our imbalanced datasets.
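For reference, C is the regularization term in the soft-margin SVM objective, which is essentially what LinearSVC optimizes (up to the exact loss it uses):

\min_{w,\,b,\,\xi} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

A small C makes margin violations cheap, so the optimizer favors a wide margin even if many points are misclassified; a large C penalizes violations heavily, pushing the boundary to fit the training points more tightly.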

We just have to add a couple of lines to complete our code. So, the updated code looks like:

def draw_line(coef, intercept, mi, ma):
    # Two points on the decision boundary, at y = mi and y = ma.
    points = np.array([[(-coef[1] * mi - intercept) / coef[0], mi],
                       [(-coef[1] * ma - intercept) / coef[0], ma]])
    plt.plot(points[:, 0], points[:, 1])
The above code draws the line that separates the points.
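Why this formula? With w = (w_0, w_1) = model.coef_[0] and b = model.intercept_, the decision boundary is the set of points (x, y) where

w_0 x + w_1 y + b = 0 \quad\Longrightarrow\quad x = \frac{-w_1 y - b}{w_0}

so draw_line picks two y-values (mi and ma, the vertical extent of the data) and computes the matching x-coordinates.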

from sklearn.svm import LinearSVC

c = [0.001, 1, 100]
plt.figure(figsize=(20, 30))
ratios = [(100, 2), (100, 20), (100, 40), (100, 80)]
num = 1
for j, i in enumerate(ratios):
    for k in range(0, 3):
        model = LinearSVC(C=c[k])
        plt.subplot(4, 3, num)
        num = num + 1
        X_p = np.random.normal(0, 0.05, size=(i[0], 2))    # positive class
        X_n = np.random.normal(0.13, 0.02, size=(i[1], 2)) # negative class
        y_p = np.array([1] * i[0]).reshape(-1, 1)
        y_n = np.array([0] * i[1]).reshape(-1, 1)
        X = np.vstack((X_p, X_n))
        y = np.vstack((y_p, y_n))
        model.fit(X, y.ravel())  # ravel: fit expects a 1-D label array
        plt.scatter(X_p[:, 0], X_p[:, 1])
        plt.scatter(X_n[:, 0], X_n[:, 1], color='red')
        plt.title('C = ' + str(c[k]) + ', ' + str(i))
        draw_line(coef=model.coef_[0], intercept=model.intercept_,
                  ma=max(X[:, 1]), mi=min(X[:, 1]))
plt.show()

Output:

Figure 10

Observations

1. When C = 0.001

As C is very small, the penalty for misclassification is negligible and the model is unable to classify the data; we can observe that the hyperplane sits far away from the data points. Whether the data is balanced or imbalanced makes no difference here, because the C value is so small.

2. When C = 1

The hyperplane is still far from the data points, so we can say this model cannot classify the imbalanced datasets. In the last case, where the dataset is almost balanced, we can see that the model classifies with only a few errors, but this seems to work well only when the data is balanced.

3. When C = 100

Even with a high value of C, the model is not able to classify the highly imbalanced dataset. So we can conclude that this model does not work well, and is not recommended, when we have a highly imbalanced dataset. As the data becomes more balanced, the model does a pretty good job of classification.
