Easily Implement Multiclass SVM From Scratch in Python

Comes with a Free, In-Depth Overview of SVMs

Essam Wisam
Towards Data Science


In this story, we shall implement the support vector machine learning algorithm in its general soft-margin and kernelized form. We will start by providing a brief overview of SVM and its training and inference equations, then translate these into code to develop the SVM model. Afterwards, we extend our implementation to handle multiclass scenarios and conclude by testing our model against Scikit-Learn.

Thus, by the end of this story:

  • You will gain a clear perspective of various important SVM concepts.
  • You will be able to implement, with genuine comprehension, the SVM model from scratch for the binary and multiclass cases.
A Van Gogh-style painting of two stars with a line between them, in the spirit of Starry Night. Generated by the author using DALL·E 2

Table of Contents

· Brief Overview
Hard Margin SVM
Soft Margin SVM
Kernel Soft Margin SVM
· Implementation
Basic Imports
Defining Kernels and SVM Hyperparameters
Define the Fit Method
Define the Predict Method
Test the Implementation
Generalizing Fit to Multiclass
Generalizing Predict to Multiclass
Testing the Implementation

Brief Overview

Hard Margin SVM

The goal in SVM is to fit the hyperplane that attains the maximum margin (the distance to the closest points in the two classes). It can be shown, and is intuitive, that such a hyperplane (A) has better generalization properties and is more robust to noise than a hyperplane that doesn’t maximize the margin (B).

Figure by Ennepetaler86 on Wikimedia. CC BY-SA 3.0 Unported.

In order to achieve this, SVM finds the hyperplane’s W and b by solving the following optimization problem:
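In standard notation, with N training points xₙ and labels yₙ = ±1, one common way to write this problem is:

$$\max_{w,\,b}\;\min_{n}\;\frac{\lvert w^\top x_n + b\rvert}{\lVert w\rVert}\qquad\text{s.t.}\quad y_n\,(w^\top x_n + b) > 0\quad\forall n$$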

It attempts to find w and b that maximize the distance to the closest point while classifying everything correctly (as in the constraint, where y takes the values ±1). This can be shown to be equivalent to the following optimization problem:
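Namely, the familiar hard-margin primal, where maximizing the margin turns into minimizing wᵗw:

$$\min_{w,\,b}\;\frac{1}{2}\,w^\top w\qquad\text{s.t.}\quad y_n\,(w^\top x_n + b)\ge 1\quad\forall n$$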

For which one can write the equivalent dual optimization problem:
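In its standard form, it introduces one Lagrange multiplier αₙ per training point:

$$\max_{\alpha}\;\sum_{n=1}^{N}\alpha_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m\,y_n y_m\,x_n^\top x_m\qquad\text{s.t.}\quad \alpha_n\ge 0\;\;\forall n,\quad \sum_{n=1}^{N}\alpha_n y_n=0$$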

The solution to this yields a Lagrange multiplier for each point in the dataset, which we assume to have size N: (α₁, α₂, …, α_N). The objective function is clearly quadratic in α and the constraints are linear, which means this can be easily solved with quadratic programming. Once the solution is found, it follows from the derivation of the dual that:
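Specifically, the hyperplane’s parameters are recovered as:

$$w=\sum_{n=1}^{N}\alpha_n\,y_n\,x_n,\qquad b=y_s-w^\top x_s$$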

(xₛ,yₛ) is any point with α>0

Notice that only points with α>0 define the hyperplane (contribute to the sum). Those are called support vectors.

And thereby, the prediction equation, which given a new example x returns its prediction y=±1, is:
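In standard notation:

$$\hat{y}=\operatorname{sign}\Big(\sum_{\alpha_n>0}\alpha_n\,y_n\,x_n^\top x+b\Big)$$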

It involves plugging w into the decision rule and doing some algebraic simplification

This basic form of SVM is called hard margin SVM because the optimization problem it solves (as defined above) enforces that all training points be classified correctly. In practical scenarios, there may be noise that prevents or limits the existence of a hyperplane that perfectly separates the data, in which case the optimization problem returns no solution or a poor one.

Soft Margin SVM

Fit by Soft Margin SVM by Mangat et al on Research Gate. CC BY-SA 4.0 International

To generalize hard margin SVM, soft margin SVM adapts the optimization problem by introducing a constant C (a user-given hyperparameter) that controls how “hard” the margin should be. In particular, it modifies the primal optimization problem to become the following:
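With a slack variable ϵₙ measuring the violation committed by each point, the soft-margin primal reads:

$$\min_{w,\,b,\,\epsilon}\;\frac{1}{2}\,w^\top w+C\sum_{n=1}^{N}\epsilon_n\qquad\text{s.t.}\quad y_n\,(w^\top x_n + b)\ge 1-\epsilon_n,\quad \epsilon_n\ge 0\quad\forall n$$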

The slack variables ϵₙ and the C-weighted penalty term are the modifications relative to hard margin

which allows each point to commit some violation ϵₙ (e.g., be on the wrong side of the hyperplane) but still aims to reduce violations overall by weighting their sum in the objective function by C. It becomes equivalent to hard margin SVM as C approaches infinity (generally, well before it actually does). Meanwhile, a smaller C allows more violations (in return for a wider margin; i.e., smaller wᵗw).

Quite surprisingly, it can be shown that the equivalent dual problem only changes by constraining α for each point to be ≤C.
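That is, the objective and the equality constraint stay the same and the multipliers simply become box-constrained:

$$\max_{\alpha}\;\sum_{n}\alpha_n-\frac{1}{2}\sum_{n}\sum_{m}\alpha_n\alpha_m\,y_n y_m\,x_n^\top x_m\qquad\text{s.t.}\quad 0\le\alpha_n\le C,\quad \sum_{n}\alpha_n y_n=0$$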

Since violations are allowed, support vectors (points with α>0) are no longer all on the margin’s edge. It can be shown that any support vector that has committed a violation will have α=C and that non-support vectors (α=0) cannot commit violations. We call support vectors that potentially committed violations (α=C) “non-margin support vectors” and other pure ones (that have not committed violations; lie on the edge) “margin support vectors” (0<α<C).

It can be shown that the inference equation doesn’t change:
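It is still:

$$\hat{y}=\operatorname{sign}\Big(\sum_{\alpha_n>0}\alpha_n\,y_n\,x_n^\top x+b\Big),\qquad b=y_s-\sum_{\alpha_n>0}\alpha_n\,y_n\,x_n^\top x_s$$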

However, (xₛ,yₛ) must now be a support vector that has not committed any violation, because the equation assumes it lies on the margin’s edge (previously, any support vector could be used).

Kernel Soft Margin SVM

Soft Margin SVM extends Hard Margin SVM to handle noise, but frequently data is not separable by a hyperplane for reasons beyond noise, such as being naturally nonlinear. Soft Margin SVM can still be used in such cases, but then the optimal solution will probably involve a hyperplane that permits far more errors than is tolerable in reality.

Figure by Machine Learner on Wikimedia. CC BY-SA 4.0 International

Kernel Soft Margin SVM generalizes Soft Margin SVM to deal with situations where the data is naturally nonlinear. For instance, in the example shown above, there is no linear hyperplane that Soft Margin SVM can find, regardless of the setting of C, that would decently separate the data.

It is possible, however, to map each point x in the dataset to a higher dimension via some transform function z=Φ(x) to make the data more linear (or perfectly linear) in the new higher dimensional space. This is equivalent to replacing x with z in the dual to get:
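That is, with zₙ = Φ(xₙ):

$$\max_{\alpha}\;\sum_{n}\alpha_n-\frac{1}{2}\sum_{n}\sum_{m}\alpha_n\alpha_m\,y_n y_m\,z_n^\top z_m\qquad\text{s.t.}\quad 0\le\alpha_n\le C,\quad \sum_{n}\alpha_n y_n=0$$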

In reality, especially when Φ transforms into a very high-dimensional space, computing z can take a very long time. This is solved by the kernel trick, which replaces zᵗz' with an equivalent but much faster computation of a mathematical function (called a kernel function); e.g., an algebraic simplification of zᵗz'. For instance, here are some popular kernel functions (each of which corresponds to some transformation Φ to a higher-dimensional space):
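These are the three kernels implemented later in the code:

$$K_{\text{linear}}(x,x')=x^\top x',\qquad K_{\text{poly}}(x,x')=\big(1+x^\top x'\big)^{Q},\qquad K_{\text{RBF}}(x,x')=\exp\!\big(-\gamma\,\lVert x-x'\rVert^{2}\big)$$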

Degree of polynomial (Q) and RBF γ are hyperparameters (decided by the user)

With this, the dual optimization problem becomes:
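Every inner product is simply replaced by a kernel evaluation:

$$\max_{\alpha}\;\sum_{n}\alpha_n-\frac{1}{2}\sum_{n}\sum_{m}\alpha_n\alpha_m\,y_n y_m\,K(x_n,x_m)\qquad\text{s.t.}\quad 0\le\alpha_n\le C,\quad \sum_{n}\alpha_n y_n=0$$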

and intuitively, the inference equation becomes (after algebraic manipulation):
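where (xₛ, yₛ) is a margin support vector:

$$\hat{y}=\operatorname{sign}\Big(\sum_{\alpha_n>0}\alpha_n\,y_n\,K(x_n,x)+b\Big),\qquad b=y_s-\sum_{\alpha_n>0}\alpha_n\,y_n\,K(x_n,x_s)$$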

A full derivation of all the equations above, assuming you have the mathematical background, can be found here.

Photo by Scott Graham on Unsplash

Implementation

For the implementation, we will use NumPy and SciPy for the array and kernel computations and CVXOPT to solve the dual optimization problem.

Basic Imports

We start by importing some basic libraries:

import numpy as np                  # for basic operations over arrays
from scipy.spatial import distance # to compute the Gaussian kernel
import cvxopt # to solve the dual opt. problem
import copy # to copy numpy arrays

Defining Kernels and SVM Hyperparameters

We start by defining the three kernels using their respective functions. Recall that the polynomial degree Q and the RBF γ are hyperparameters decided by the user:

class SVM:
    linear = lambda x, x_, c=0: x @ x_.T
    polynomial = lambda x, x_, Q=5: (1 + x @ x_.T)**Q
    rbf = lambda x, x_, γ=10: np.exp(-γ * distance.cdist(x, x_, 'sqeuclidean'))
    kernel_funs = {'linear': linear, 'polynomial': polynomial, 'rbf': rbf}

For consistency with the other kernels, the linear kernel takes an extra, unused hyperparameter c. As you can see, kernel_funs takes the kernel’s name as a string and maps it to the corresponding kernel function.
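As a quick sanity check (a small sketch, assuming the class above is defined), a kernel can be applied to two matrices of points and returns the matrix of pairwise kernel values:

import numpy as np

X1 = np.random.randn(5, 3)   # 5 points with 3 features
X2 = np.random.randn(4, 3)   # 4 points with 3 features

K = SVM.kernel_funs['rbf'](X1, X2, 10)   # γ = 10
print(K.shape)   # (5, 4): one kernel value per pair of points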

Now let’s carry on by defining the constructor:

class SVM:
    linear = lambda x, x_, c=0: x @ x_.T
    polynomial = lambda x, x_, Q=5: (1 + x @ x_.T)**Q
    rbf = lambda x, x_, γ=10: np.exp(-γ * distance.cdist(x, x_, 'sqeuclidean'))
    kernel_funs = {'linear': linear, 'polynomial': polynomial, 'rbf': rbf}

    def __init__(self, kernel='rbf', C=1, k=2):
        # set the hyperparameters
        self.kernel_str = kernel
        self.kernel = SVM.kernel_funs[kernel]
        self.C = C                  # regularization parameter
        self.k = k                  # kernel parameter (Q or γ)

        # training data and support vectors (set later by fit)
        self.X, self.y = None, None
        self.αs = None

        # for multi-class classification (set later)
        self.multiclass = False
        self.clfs = []

The SVM has three main hyperparameters: the kernel (we store both the given string and the corresponding kernel function), the regularization parameter C, and the kernel hyperparameter k (passed to the kernel function); it represents Q for the polynomial kernel and γ for the RBF kernel.

Define the Fit Method

To extend this class with fit and predict functions in separate cells, we will define the following function and use it as a decorator later:

SVMClass = lambda func: setattr(SVM, func.__name__, func) or func
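For example (with a hypothetical describe method, used only to illustrate the mechanism), the decorator attaches the decorated function to the class so it behaves like a normal method:

@SVMClass
def describe(self):
    # hypothetical helper, only to demonstrate the decorator
    return f"SVM(kernel={self.kernel_str}, C={self.C}, k={self.k})"

print(SVM(kernel='linear', C=1, k=2).describe())   # SVM(kernel=linear, C=1, k=2)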

Recall that fitting the SVM corresponds to finding the Lagrange multiplier αₙ for each point by solving the dual optimization problem stated above.

Let α be a variable column vector (α₁ α₂ … α_N)ᵗ, let y be a constant column vector of the labels (y₁ y₂ … y_N)ᵗ, and let K be a constant matrix where K[n,m] is the kernel evaluated at (xₙ, xₘ). Recall the following index-based equivalences for the dot product, outer product, and quadratic form, respectively:
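For any vectors a, b and any square matrix Q:

$$a^\top b=\sum_{n} a_n b_n,\qquad \big(a\,b^\top\big)[n,m]=a_n\,b_m,\qquad \alpha^\top Q\,\alpha=\sum_{n}\sum_{m}\alpha_n\,\alpha_m\,Q[n,m]$$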

to be able to write the dual optimization problem in matrix form as follows:
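where ⊙ denotes the element-wise product:

$$\max_{\alpha}\;\mathbf{1}^\top\alpha-\frac{1}{2}\,\alpha^\top\big((y\,y^\top)\odot K\big)\,\alpha\qquad\text{s.t.}\quad 0\le\alpha\le C\mathbf{1},\quad y^\top\alpha=0$$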

Knowing that this is a quadratic program, as hinted earlier, we can consult the Quadratic Programming section of CVXOPT’s documentation:
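There, cvxopt.solvers.qp(P, q, G, h, A, b) solves quadratic programs of the standard form:

$$\min_{x}\;\frac{1}{2}\,x^\top P\,x+q^\top x\qquad\text{s.t.}\quad G\,x\preceq h,\quad A\,x=b$$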

From CVXOPT Documentation. GNU General Public License

The square brackets tell you that you can call it with (P, q) only, with (P, q, G, h), with (P, q, G, h, A, b), and so on (anything not given defaults to None).

To know the values of (P, q, G, h, A, b) in our case, we compare our dual problem with CVXOPT’s standard form. Let’s make the comparison easier by rewriting our problem as follows:
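In matrix form, our problem then reads:

$$\min_{\alpha}\;\frac{1}{2}\,\alpha^\top\big((y\,y^\top)\odot K\big)\,\alpha-\mathbf{1}^\top\alpha\qquad\text{s.t.}\quad -\alpha\le 0,\quad \alpha\le C\mathbf{1},\quad y^\top\alpha=0$$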

Notice that we changed the max to min by multiplying the function by -1

It’s now obvious that (note that 0≤α is equivalent to -α≤0):
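Matching terms between the two problems yields:

$$P=(y\,y^\top)\odot K,\qquad q=-\mathbf{1},\qquad G=\begin{bmatrix}-I\\ I\end{bmatrix},\qquad h=\begin{bmatrix}\mathbf{0}\\ C\,\mathbf{1}\end{bmatrix},\qquad A=y^\top,\qquad b=0$$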

By this, we can write the following fit function:

@SVMClass
def fit(self, X, y, eval_train=False):
    # if there are more than two unique labels, call the multiclass version
    if len(np.unique(y)) > 2:
        self.multiclass = True
        return self.multi_fit(X, y, eval_train)

    # if labels are given in {0,1}, change them to {-1,1}
    if set(np.unique(y)) == {0, 1}: y[y == 0] = -1

    # ensure y is an Nx1 column vector (needed by CVXOPT)
    self.y = y.reshape(-1, 1).astype(np.double)
    self.X = X
    N = X.shape[0]              # number of points

    # compute the kernel over all possible pairs (x, x') in the data;
    # thanks to NumPy's vectorization this yields the matrix K at once
    self.K = self.kernel(X, X, self.k)

    ### Set up the optimization parameters
    # For 1/2 x^T P x + q^T x
    P = cvxopt.matrix(self.y @ self.y.T * self.K)
    q = cvxopt.matrix(-np.ones((N, 1)))

    # For Ax = b
    A = cvxopt.matrix(self.y.T)
    b = cvxopt.matrix(np.zeros(1))

    # For Gx <= h
    G = cvxopt.matrix(np.vstack((-np.identity(N),
                                  np.identity(N))))
    h = cvxopt.matrix(np.vstack((np.zeros((N, 1)),
                                 np.ones((N, 1)) * self.C)))

    # Solve
    cvxopt.solvers.options['show_progress'] = False
    sol = cvxopt.solvers.qp(P, q, G, h, A, b)
    self.αs = np.array(sol["x"])        # our solution

    # a Boolean array that flags points which are support vectors
    self.is_sv = ((self.αs - 1e-3 > 0) & (self.αs <= self.C)).squeeze()
    # the index of some margin support vector
    self.margin_sv = np.argmax((0 < self.αs - 1e-3) & (self.αs < self.C - 1e-3))

    if eval_train:
        print(f"Finished training with accuracy {self.evaluate(X, y)}")

We ensure that this is a binary problem, that the binary labels are set as assumed by SVM (±1), and that y is a column vector with dimensions (N,1). Then we solve the optimization problem to find (α₁, α₂, …, α_N)ᵗ.

We use (α₁, α₂, …, α_N)ᵗ to get an array of flags that is True at any index corresponding to a support vector, so that we can later apply the prediction equation by summing over support vectors only, and an index of a margin support vector to use as (xₛ,yₛ). Notice that in these checks we assume that non-support vectors may not have α=0 exactly; if α≤1e-3, we treat it as approximately zero (CVXOPT’s results are not infinitely precise). Likewise, we assume that non-margin support vectors may not have α=C exactly.

Define the Predict Method

Recall the prediction equation is:
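With (xₛ, yₛ) being any margin support vector, it reads:

$$\hat{y}=\operatorname{sign}\Big(\sum_{\alpha_n>0}\alpha_n\,y_n\,K(x_n,x)+b\Big),\qquad b=y_s-\sum_{\alpha_n>0}\alpha_n\,y_n\,K(x_n,x_s)$$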

@SVMClass
def predict(self, X_t):
    if self.multiclass: return self.multi_predict(X_t)
    # get a margin support vector (xₛ, yₛ)
    xₛ, yₛ = self.X[self.margin_sv, np.newaxis], self.y[self.margin_sv]
    # restrict to support vectors
    αs, y, X = self.αs[self.is_sv], self.y[self.is_sv], self.X[self.is_sv]
    # compute the bias term b
    b = yₛ - np.sum(αs * y * self.kernel(X, xₛ, self.k), axis=0)
    # compute the score
    score = np.sum(αs * y * self.kernel(X, X_t, self.k), axis=0) + b
    return np.sign(score).astype(int), score

That’s it. We can also implement an evaluate method to compute the accuracy (used in fit above).

@SVMClass
def evaluate(self, X, y):
    outputs, _ = self.predict(X)
    accuracy = np.sum(outputs == y) / len(y)
    return round(accuracy, 2)

Test the Implementation

from sklearn.datasets import make_classification
import numpy as np

# Load the dataset
np.random.seed(1)
X, y = make_classification(n_samples=2500, n_features=5,
                           n_redundant=0, n_informative=5,
                           n_classes=2, class_sep=0.3)

# Test the implemented SVM
svm = SVM(kernel='rbf', k=1)
svm.fit(X, y, eval_train=True)

y_pred, _ = svm.predict(X)
print(f"Accuracy: {np.sum(y==y_pred)/y.shape[0]}")  # 0.9108

# Test with Scikit-Learn
from sklearn.svm import SVC
clf = SVC(kernel='rbf', C=1, gamma=1)
clf.fit(X, y)
y_pred = clf.predict(X)
print(f"Accuracy: {sum(y==y_pred)/y.shape[0]}")  # 0.9108

You can change the dataset and hyperparameters to further confirm that the two models behave identically. Ideally, do so by comparing decision functions rather than accuracy, as in the sketch below.
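For instance, a minimal sketch (assuming svm and clf were fit as above; since the labels were mapped to ±1 for both models, the signs of the two decision functions should agree):

# Compare raw decision scores instead of accuracies
_, our_scores = svm.predict(X)          # scores from our implementation
sk_scores = clf.decision_function(X)    # scores from Scikit-Learn's SVC

print(np.max(np.abs(our_scores - sk_scores)))               # should be small if the solutions agree
print(np.mean(np.sign(our_scores) == np.sign(sk_scores)))   # should be 1.0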

Generalizing Fit to Multiclass

@SVMClass
def multi_fit(self, X, y, eval_train=False):
    self.k = len(np.unique(y))      # number of classes
    # for each class, train a one-vs-rest binary classifier
    for i in range(self.k):
        # copy the data for this classifier
        Xs, Ys = X, copy.copy(y)
        # relabel the current class as +1 and every other class as -1
        Ys[Ys != i], Ys[Ys == i] = -1, +1
        # fit the classifier (self.k, now the class count, is also passed as the kernel parameter)
        clf = SVM(kernel=self.kernel_str, C=self.C, k=self.k)
        clf.fit(Xs, Ys)
        # save the classifier
        self.clfs.append(clf)
    if eval_train:
        print(f"Finished training with accuracy {self.evaluate(X, y)}")

To generalize the model to a multiclass setting over k classes, we train one binary SVM classifier per class: we loop over the classes, relabel points belonging to the current class as +1, and relabel points from all other classes as -1.

The result of training with k classes is k classifiers, where the ith classifier was trained on the data with the ith class labeled as +1 and all other classes labeled as -1.

Generalizing Predict to Multiclass

Then, to perform a prediction on a new example, we choose the class whose corresponding classifier is most confident (has the highest score).

@SVMClass
def multi_predict(self, X):
    # get the score from each of the k classifiers
    N = X.shape[0]
    preds = np.zeros((N, self.k))
    for i, clf in enumerate(self.clfs):
        _, preds[:, i] = clf.predict(X)

    # return the class with the highest score and that score
    return np.argmax(preds, axis=1), np.max(preds, axis=1)

Testing the Implementation

from sklearn.datasets import make_classification
import numpy as np

# Load the dataset
np.random.seed(1)
X, y = make_classification(n_samples=500, n_features=2,
                           n_redundant=0, n_informative=2,
                           n_classes=4, n_clusters_per_class=1,
                           class_sep=0.3)

# Test SVM
svm = SVM(kernel='rbf', k=4)
svm.fit(X, y, eval_train=True)

y_pred, _ = svm.predict(X)
print(f"Accuracy: {np.sum(y==y_pred)/y.shape[0]}")  # 0.65

# Test with Scikit-Learn
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

clf = OneVsRestClassifier(SVC(kernel='rbf', C=1, gamma=4)).fit(X, y)
y_pred = clf.predict(X)
print(f"Accuracy: {sum(y==y_pred)/y.shape[0]}")  # 0.65

Plotting the decision regions of each model leads to the following plot:

Figure by the author

Beware that although Scikit-Learn’s SVC can handle multiclass data on its own (without the explicit OneVsRestClassifier call shown above), its built-in strategy is one-vs-one rather than one-vs-rest, and it may include further optimizations specific to SVM.

Full Code


import numpy as np                  # for basic operations over arrays
from scipy.spatial import distance  # to compute the Gaussian kernel
import cvxopt                       # to solve the dual optimization problem
import copy                         # to copy numpy arrays


class SVM:
    linear = lambda x, x_, c=0: x @ x_.T
    polynomial = lambda x, x_, Q=5: (1 + x @ x_.T)**Q
    rbf = lambda x, x_, γ=10: np.exp(-γ * distance.cdist(x, x_, 'sqeuclidean'))
    kernel_funs = {'linear': linear, 'polynomial': polynomial, 'rbf': rbf}

    def __init__(self, kernel='rbf', C=1, k=2):
        # set the hyperparameters
        self.kernel_str = kernel
        self.kernel = SVM.kernel_funs[kernel]
        self.C = C                  # regularization parameter
        self.k = k                  # kernel parameter

        # training data and support vectors
        self.X, self.y = None, None
        self.αs = None

        # for multi-class classification
        self.multiclass = False
        self.clfs = []

# only needed when the methods are defined in separate notebook cells
SVMClass = lambda func: setattr(SVM, func.__name__, func) or func


@SVMClass
def fit(self, X, y, eval_train=False):
    if len(np.unique(y)) > 2:
        self.multiclass = True
        return self.multi_fit(X, y, eval_train)

    # relabel {0,1} to {-1,1} if needed
    if set(np.unique(y)) == {0, 1}: y[y == 0] = -1
    # ensure y has dimensions Nx1
    self.y = y.reshape(-1, 1).astype(np.double)
    self.X = X
    N = X.shape[0]

    # compute the kernel over all possible pairs (x, x') in the data
    self.K = self.kernel(X, X, self.k)

    # For 1/2 x^T P x + q^T x
    P = cvxopt.matrix(self.y @ self.y.T * self.K)
    q = cvxopt.matrix(-np.ones((N, 1)))

    # For Ax = b
    A = cvxopt.matrix(self.y.T)
    b = cvxopt.matrix(np.zeros(1))

    # For Gx <= h
    G = cvxopt.matrix(np.vstack((-np.identity(N),
                                  np.identity(N))))
    h = cvxopt.matrix(np.vstack((np.zeros((N, 1)),
                                 np.ones((N, 1)) * self.C)))

    # Solve
    cvxopt.solvers.options['show_progress'] = False
    sol = cvxopt.solvers.qp(P, q, G, h, A, b)
    self.αs = np.array(sol["x"])

    # flag the support vectors and pick a margin support vector
    self.is_sv = ((self.αs > 1e-3) & (self.αs <= self.C)).squeeze()
    self.margin_sv = np.argmax((1e-3 < self.αs) & (self.αs < self.C - 1e-3))

    if eval_train:
        print(f"Finished training with accuracy {self.evaluate(X, y)}")


@SVMClass
def multi_fit(self, X, y, eval_train=False):
    self.k = len(np.unique(y))      # number of classes
    # for each class, train a one-vs-rest binary classifier
    for i in range(self.k):
        # copy the data for this classifier
        Xs, Ys = X, copy.copy(y)
        # relabel the current class as +1 and every other class as -1
        Ys[Ys != i], Ys[Ys == i] = -1, +1
        # fit the binary classifier
        clf = SVM(kernel=self.kernel_str, C=self.C, k=self.k)
        clf.fit(Xs, Ys)
        # save the classifier
        self.clfs.append(clf)
    if eval_train:
        print(f"Finished training with accuracy {self.evaluate(X, y)}")


@SVMClass
def predict(self, X_t):
    if self.multiclass: return self.multi_predict(X_t)
    xₛ, yₛ = self.X[self.margin_sv, np.newaxis], self.y[self.margin_sv]
    αs, y, X = self.αs[self.is_sv], self.y[self.is_sv], self.X[self.is_sv]

    b = yₛ - np.sum(αs * y * self.kernel(X, xₛ, self.k), axis=0)
    score = np.sum(αs * y * self.kernel(X, X_t, self.k), axis=0) + b
    return np.sign(score).astype(int), score


@SVMClass
def multi_predict(self, X):
    # get the score from each of the k classifiers
    preds = np.zeros((X.shape[0], self.k))
    for i, clf in enumerate(self.clfs):
        _, preds[:, i] = clf.predict(X)

    # return the class with the highest score and that score
    return np.argmax(preds, axis=1), np.max(preds, axis=1)


@SVMClass
def evaluate(self, X, y):
    outputs, _ = self.predict(X)
    accuracy = np.sum(outputs == y) / len(y)
    return round(accuracy, 2)


from sklearn.datasets import make_classification
import numpy as np

# Load the dataset
np.random.seed(1)
X, y = make_classification(n_samples=500, n_features=2,
                           n_redundant=0, n_informative=2, n_classes=4,
                           n_clusters_per_class=1, class_sep=0.3)

# Test SVM
svm = SVM(kernel='rbf', k=4)
svm.fit(X, y, eval_train=True)

y_pred, _ = svm.predict(X)
print(f"Accuracy: {np.sum(y==y_pred)/y.shape[0]}")


# Test with Scikit-Learn
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

clf = OneVsRestClassifier(SVC(kernel='rbf', C=1, gamma=4)).fit(X, y)
y_pred = clf.predict(X)
print(f"Accuracy: {sum(y==y_pred)/y.shape[0]}")

Photo by Nathan Van Egmond on Unsplash

In summary, we implemented the support vector machine (SVM) learning algorithm in its general soft-margin and kernelized form. We provided an overview of SVM, developed the model in code, extended it to multiclass scenarios, and validated our implementation against Scikit-Learn.

Hope you find what you have learnt from this story useful for your work. Till next time, au revoir.

Resources:

The code is mostly an adaptation of the one presented here (MIT License):
Persson, Aladdin. “SVM from Scratch — Machine Learning Python (Support Vector Machine).” YouTube.
