Contents
In this post, we’ll go through:
(i) Role of Support Vectors in SVMs
(ii) Cost Function for SVMs
(iii) SVMs as a Large Margin Classifier
(iv) Non-Linear Decision Boundaries through SVMs with the help of Kernels
(v) Detecting Fraudulent Credit Card Transactions with SVMs (Kaggle dataset)
In the previous post, we took a good look at the high bias and high variance problems in machine learning and discussed how regularization, along with some other techniques, plays a big role in solving them. In this post, we'll take a detailed look at another supervised learning algorithm called the Support Vector Machine. Later in the post, we'll solve a Kaggle dataset to detect fraudulent credit card transactions using an SVM.
Support Vector Machines (SVM)
SVM is a supervised machine learning method which can solve both regression and classification problems. However, it is mostly used for classification, where it constructs hyperplanes in the n-dimensional feature space. An n-dimensional feature space has hyperplanes of n-1 dimensions, e.g. in a dataset with 2 features (a 2-dimensional feature space), the hyperplane constructed by the SVM is a line. If we are solving a classification problem on 2 classes, then the job of the SVM classifier is to find the hyperplane that maximizes the margin between the 2 classes. Before we look at how SVMs work, let's understand where the name Support Vector Machine came from.

What is a Support Vector?
We know that an SVM classifier constructs hyperplanes for classification. But how does it construct a hyperplane? Let's develop intuition by considering just 2 classes. We know that the hyperplane has to pass somewhere through the middle of the 2 classes. A good separation between the classes is achieved by the hyperplane that has the largest distance to the nearest training data points of both classes. In the figure alongside, the training points lying on the 2 dotted lines that mark the extremes of each class are the support vectors. These support vectors are what determine the hyperplane: the SVM picks the hyperplane that maximizes its distance (margin) from the support vectors of the 2 classes.
Working of SVMs
Support Vector Machines can fit both linear and non-linear decision boundaries as a classifier, and one of the main advantages SVMs have over Logistic Regression is that they compute the training parameters quickly, thanks to a much simpler cost function.
Cost Function
Let's recall the binary cross-entropy cost function used for binary classification in logistic regression. For the sake of simplification, we'll ignore the bias term, so the prediction that we make for the i-th training example out of a total of 'm' training examples through logistic regression is represented as h(x(i)) = sigmoid(W · x(i)).
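As a quick reminder (a sketch in the notation above, with λ as the regularization strength), the regularized binary cross-entropy cost is:

$$J(W) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h\big(x^{(i)}\big) + \big(1-y^{(i)}\big)\log\big(1-h\big(x^{(i)}\big)\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}W_j^{2}$$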

This cost function can be divided into 2 parts: when y(i) = 1 the term (1 – y(i))log(1 – h(x(i))) becomes 0 and when y(i) = 0, the term y(i)log(h(x(i))) becomes 0. The corresponding graphs for these equations (Cost vs W * x) (excluding the regularization term, since it is common to both) are:

SVM uses a slight modification of this cost function which gives it computational advantages over logistic regression. For the case y = 1, the cost is essentially 0 when W·x >= 1, and for W·x < 1 the -log(h(x)) curve is approximated by a straight line. Similarly, for the case y = 0, the cost is essentially 0 when W·x <= -1, and for W·x > -1 the -log(1 – h(x)) curve is approximated by a straight line.
Now, since we are no longer using the logarithmic cost, let's rename the log parts of the logistic regression cost function: we replace -log(h(x)) with cost1(W·x) and -log(1 – h(x)) with cost0(W·x). We also drop the constant (1/m) factor, as it doesn't affect our minimization objective and simplifies our calculations. So, the final cost function for the support vector machine looks like:

This leads to the following mathematical equation for the cost function:
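Sketching that equation out: up to a scale factor, the straight-line approximations are the standard hinge costs, and the resulting cost function can be written as

$$\text{cost}_1(z) = \max(0,\ 1-z), \qquad \text{cost}_0(z) = \max(0,\ 1+z)$$

$$J(W) = \sum_{i=1}^{m}\Big[y^{(i)}\,\text{cost}_1\big(W\cdot x^{(i)}\big) + \big(1-y^{(i)}\big)\,\text{cost}_0\big(W\cdot x^{(i)}\big)\Big] + \frac{\lambda}{2}\sum_{j=1}^{n}W_j^{2}$$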

Unlike logistic regression, which outputs probability values, SVMs directly output 0/1. In logistic regression, we saw that when W·x > 0 the output was a probability > 0.5, which was rounded off to 1, and when W·x < 0 the output was a probability < 0.5, which was rounded off to 0. The SVM cost function asks for more: it wants W·x >= 1 for examples with y = 1 and W·x <= -1 for examples with y = 0. The interval (-1, 1) acts as an extra safety margin that allows SVMs to make more confident predictions than logistic regression.
Let us now re-parameterize the cost function a bit. Currently it is of the form A + λB, where A is the cost term and B is the regularization term. Let's convert it to the form C·A + B, where C plays a role similar to 1/λ.
SVM as a Large Margin Classifier
Earlier, we read about the role of support vectors in finding a hyperplane that acts as a classifier at a maximum distance from each of the 2 classes. Such a classifier is known as a Large Margin Classifier. A large margin classifier is one that predicts the classes of given data points with greater confidence, which was not the case with logistic regression. Let's see why this happens with the SVM.
We re-parameterized the SVM cost function to C·A + B, where A is the loss associated with the SVM output (the cost component), B is the regularization term (the regularization component) and C plays a role similar to 1/λ. When we choose C to be very large (which, like a very small λ, makes the model prone to overfitting), any non-zero value of A incurs a huge cost. The optimizer therefore tries to drive A to zero, which it can do by choosing the parameters W (represented by 'theta' in the figures) so that W·x >= 1 whenever y = 1 and W·x <= -1 whenever y = 0. With A ≈ 0, the optimization objective reduces to minimizing the regularization term B, i.e. keeping the parameters small. Formally, the minimization objective in this case is:
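Written out as a sketch (with theta denoting the parameter vector W), the objective for very large C reduces to:

$$\min_{\theta}\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^{2} \quad \text{subject to} \quad \theta\cdot x^{(i)} \ge 1 \ \text{if } y^{(i)}=1, \qquad \theta\cdot x^{(i)} \le -1 \ \text{if } y^{(i)}=0$$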

In the implementation of SVMs, vectorization is used and the dot product of the parameter vector W ('theta' in the figures) and the feature vector X is computed. From vector algebra, we know that the dot product of 2 vectors u and v equals the (signed) length of the projection of u onto v multiplied by ||v||. In the same way, when the dot product of W and X is computed, we get the projection of X onto W, and the length of this projection (p) satisfies W · X = p · ||theta||

where ||theta|| is the L2 norm of theta. This equation can be represented as:

So, the minimization objective can be rephrased as follows:
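In terms of the projection lengths p(i), the same objective (again, as a sketch) becomes:

$$\min_{\theta}\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^{2} \quad \text{subject to} \quad p^{(i)}\,\lVert\theta\rVert \ge 1 \ \text{if } y^{(i)}=1, \qquad p^{(i)}\,\lVert\theta\rVert \le -1 \ \text{if } y^{(i)}=0$$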

Now let’s take an example to see how this minimization objective leads to a large margin classifier in SVM.

Consider the following binary classification problem, where the 'X' marks represent the positive class (y = 1) and the 'O' marks represent the negative class (y = 0). From our current optimization objective we know that the constraints are satisfied only when p·||theta|| >= 1 for y = 1 and p·||theta|| <= -1 for y = 0, i.e. the projection p of a training example onto the parameter vector has to be sufficiently positive for the positive class (an angle of less than 90° between the example and the parameter vector) and sufficiently negative for the negative class (an angle of more than 90°).
In the image below, consider the decision boundary in the left image (green). The vector of parameters (the theta vector) is perpendicular to the decision boundary. For this decision boundary, we see that the lengths of the projections of the training examples onto theta (represented by red and pink) are quite small. To still satisfy p·||theta|| >= 1 for y = 1 and p·||theta|| <= -1 for y = 0 with such small projections, ||theta|| has to be large. That directly conflicts with our objective of minimizing the sum of the squared parameters, so the decision boundary shown in green in the left image won't be selected by the SVM.
Now, consider the decision boundary (green) in the image at the right. For the same training examples, we see that their projections onto the vector of parameters are much larger in magnitude than for the previous decision boundary. With large projection lengths p, the conditions p·||theta|| >= 1 for y = 1 and p·||theta|| <= -1 for y = 0 can be satisfied with a small ||theta||, which in turn keeps the cost function small. Hence the decision boundary in the image on the right is the one the SVM will select for this binary classification task, and by comparing the two decision boundaries we can clearly see that the classifier selected by the SVM is indeed a large margin classifier.

Keep in mind that the conclusion we have drawn above was derived for the case where the parameter C in the SVM objective C·A + B is very large. A similar result also holds when C is not very large, and hence the claim generalizes. I encourage the reader to think about why; this will help clarify the mathematics behind the SVM even more.
SVM for Non-Linear Decision Boundaries
So far we have seen the scenario where the SVM is used as a classifier with a linear decision boundary. For non-linear decision boundaries, SVMs use kernel methods. Before we get onto kernel methods, let us understand why complex polynomial features don't work well for non-linear classification. Let's build an intuition for it with the help of linear regression.
In linear regression, for training data with ‘n’ input features, we represent the output as follows:
y = W1x1 + W2x2 + W3x3 + ………. + Wnxn.
The output we get is a linear function of the input features. So how can we tweak this to get a non-linear decision boundary? Well, we know that polynomials of degree > 1 produce non-linear outputs. But using such polynomial terms as input features has the following problems:
(i) Even for a number of features as small as 5, there are tonnes of possibilities for choosing the polynomial terms. A sample representation of the output can be y = W1x1² + W2x2³x3² + W3x1²x4²x5 + W4x2x4³ + ……… (there can be tonnes of possibilities). Remember that a degree-5 polynomial does not need every term to be of degree 5. (A short sketch after this list shows how quickly the number of candidate terms grows.)
(ii) With such a complex feature representation, the decision boundary can't be intuitively determined, which makes it difficult to tune the features to get better results.
(iii) Additionally, computing these complex features is computationally very expensive and hence undesirable.
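To get a feel for point (i), here's a small sketch (an addition using scikit-learn's PolynomialFeatures, not part of the original code) that counts how many candidate polynomial terms exist for just 5 input features at degree 5:

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.random.rand(10, 5)                          # 10 samples with only 5 input features
poly = PolynomialFeatures(degree=5)                # every polynomial term up to degree 5
X_poly = poly.fit_transform(X)
print("Original features:", X.shape[1])            # 5
print("Polynomial features:", X_poly.shape[1])     # 252 terms (including the constant term)

And degree 5 is still modest; the count blows up further as the degree or the number of raw features grows.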
Kernel Methods
Due to all the drawbacks of polynomial feature vectors for non-linear classifiers mentioned above, SVMs provide a very smart and efficient way to generate non-linear decision boundaries by incorporating kernel methods. The magic of the kernel is that it avoids the trouble of explicitly computing high-dimensional features. In machine learning, kernel methods are a class of algorithms for pattern analysis, whose best-known member is the Support Vector Machine. Kernel methods require a user-specified kernel, i.e. a similarity function over pairs of data points in their raw representation. Kernel methods are preferred over explicit feature vectors since they drastically reduce the computation time needed to fit complex non-linear decision boundaries. The output of a kernel function is a scalar.
Here, although we are limiting the use of kernels to SVMs, one important thing to note is that any machine learning algorithm that can be expressed purely in terms of dot products can have those dot products replaced by kernels. We can then use any of these algorithms with kernels and get results that are both quick and better. Although there are different types of kernels, I'll be using the Gaussian kernel for further illustration.
Generating Non-Linear Decision Boundaries
If kernels don't do such complex calculations, then how do they generate such good non-linear decision boundaries? Some landmarks are defined for the training examples, and the similarity of each example with every defined landmark is computed using a similarity function (the kernel). The Gaussian kernel, or Gaussian similarity function, for a training example x(i) and a landmark l(j) is exp(-||x(i) – l(j)||² / (2σ²)). From this, we can see that for a training example x(i) close to the landmark, the corresponding feature value f(j) is close to 1, and for a training example x(i) quite far away from the landmark, the feature value is close to 0, which is a good measure of similarity.
The calculated similarity values using the Gaussian kernel are considered as the new features values of the respective training examples which can then be attached with their weight parameters (W/theta) to compute the final output class. Earlier in the post we saw that the minimisation objective was subject to the conditions when the dot product of the weight vector and the input feature vector was either >=1 or <= -1. After computing the new feature values using kernels, the conditions of the optimization objective change a bit. Although the inequality values remain the same, the dot product is now computed between the weight vector and the new feature vector computed using Gaussian kernel. Just to clarify, although the output of the kernel method is a scalar value, that scalar value is computed against a single landmark. Against L landmarks, L values are computed which then form a feature vector for a given training example.

Consider a training dataset (with 2 features only, for demonstration purposes) with 2 output classes that don't have a linear decision boundary. Let's see how defining a few landmarks (say 3) on the training set helps in fitting a non-linear decision boundary. In the figure below, we see that the training example X in pink is quite close to the landmark L1 compared to the landmarks L2 and L3. Computing the similarity using the Gaussian kernel for this example, we get the feature value f1 corresponding to L1 as close to 1 and the ones corresponding to L2 and L3 (f2 and f3) as close to 0. According to the above equation, we predict 1 when theta·f >= 1. Now, if the theta values corresponding to these features are 2, 0.5, and 1, then 2·1 + 0.5·0 + 1·0 = 2 >= 1. Performing the same step for multiple training examples gives us a proper non-linear decision boundary with high accuracy.
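Here's a minimal sketch of this computation (the landmark positions and theta values below are made up to mirror the example above, not taken from any real dataset):

import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    # similarity between a training example and a landmark
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

x = np.array([3.0, 2.0])                            # the training example (2 features)
landmarks = [np.array([3.0, 2.0]),                  # L1: very close to x
             np.array([-2.0, -2.0]),                # L2: far from x
             np.array([1.0, -4.0])]                 # L3: far from x

f = np.array([gaussian_kernel(x, l) for l in landmarks])
print(f)                                            # roughly [1, 0, 0]

theta = np.array([2.0, 0.5, 1.0])                   # hypothetical weights from the example
print("Predicted class:", int(theta.dot(f) >= 1))   # 2*1 + 0.5*0 + 1*0 = 2 >= 1, so predict 1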

Choosing Landmarks
We saw how landmarks combined with kernel methods help in defining complex non-linear decision boundaries in SVMs. Now the obvious question of how to select the landmarks arises. One common way is to use every training example as a landmark; the feature vector for each training example is then calculated by computing its similarity with every landmark using the Gaussian kernel.
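As a rough sketch of what this means in code (an illustration only, not how SVM libraries implement it internally), using every training example as a landmark turns an m × n dataset into an m × m matrix of kernel features:

import numpy as np

def gaussian_features(X, landmarks, sigma=1.0):
    # returns an (m, L) matrix: similarity of every example to every landmark
    diffs = X[:, np.newaxis, :] - landmarks[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.rand(6, 2)        # m = 6 training examples with 2 raw features
F = gaussian_features(X, X)     # the training examples themselves act as landmarks
print(F.shape)                  # (6, 6): the new feature count equals m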

Once these landmarks are chosen and these new feature vectors are calculated, the final optimization objective is modified to:
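Sketching it out in the notation used earlier, with the kernel features f(i) in place of the raw features x(i), it looks roughly like:

$$\min_{\theta}\ C\sum_{i=1}^{m}\Big[y^{(i)}\,\text{cost}_1\big(\theta\cdot f^{(i)}\big) + \big(1-y^{(i)}\big)\,\text{cost}_0\big(\theta\cdot f^{(i)}\big)\Big] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^{2}$$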

where f(i) represents the new feature vector calculated using the Gaussian kernel. One thing to note here is that the number of features (n) in the new representation will be equal to the number of training examples (m), since each training example is a landmark.
Computing Weight Parameters
We saw in linear and logistic regression how we computed the weights using gradient descent. We can't use gradient descent in its raw form to compute the optimal weight parameters for an SVM, because the SVM cost function is not differentiable at all points. So, to compute the weight parameters, variants of gradient descent such as sub-gradient descent, or more specialized solvers such as SMO (which popular SVM libraries use), are employed. The weight update in sub-gradient descent looks like:
W = W – alpha * subgradient_wrt_W(J), where ‘alpha’ is the learning rate.
Solving for sub-derivatives is not as simple as taking partial derivatives and requires a good grasp of limits and inequalities. The first 2 pages of this pdf from Stanford are a good resource that will give you an intuition of how sub-derivatives are calculated.
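To make the update rule concrete, here is a minimal sketch of sub-gradient descent on the regularized hinge loss. Note that the labels here are in {-1, +1} rather than the 0/1 convention used earlier, and real libraries use more sophisticated solvers, so treat this purely as an illustration:

import numpy as np

def svm_subgradient_descent(X, y, C=1.0, alpha=0.01, epochs=100):
    # y must contain labels in {-1, +1}
    m, n = X.shape
    W = np.zeros(n)
    for _ in range(epochs):
        margins = y * X.dot(W)
        violated = margins < 1                     # examples inside the margin or misclassified
        # a valid subgradient of J(W) = C * sum(max(0, 1 - y * W.x)) + 0.5 * ||W||^2
        subgrad = W - C * (y[violated][:, np.newaxis] * X[violated]).sum(axis=0)
        W = W - alpha * subgrad
    return W

# toy usage on a linearly separable set
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
W = svm_subgradient_descent(X, y)
print(np.sign(X.dot(W)))   # should match the training labels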
Once the optimal weight parameters for a classification task are calculated, we can run the SVM classifier and calculate the train and test accuracy to evaluate the model’s performance.
At various points in this post, I've used examples from Andrew Ng's treatment of SVMs on Coursera, along with my own intuition for them. I recommend checking out his machine learning course if you can spare the time.
So far, we've learned quite a bit about SVMs. We started with the role of support vectors, then looked in depth at how the SVM cost function implicitly favours a large margin classifier, and then saw how the kernel trick lets us compute non-linear decision boundaries cheaply, with various illustrations to make things easy to understand. In the end, we learnt why vanilla (plain) gradient descent does not work for computing the weight parameters, and hence why sub-gradient descent is needed. Having learnt this much, it's now time to shift our focus to the Credit Card Transactions Kaggle dataset in order to consolidate our concepts and see the SVM in action.
Problem Statement
In this section, we'll be using an SVM to detect fraudulent credit card transactions. The dataset for this problem can be found here. One thing to note is that the features of this dataset have already been computed as the result of a PCA (Principal Component Analysis, which we'll see in a later post) transformation. This helps in 2 ways:
(i) The confidentiality of the user data is maintained.
(ii) The features in the dataset are independent of each other due to PCA transformation.
Let’s get started by loading the dataset in memory as a dataframe.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from sklearn.metrics import confusion_matrix
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_csv("/kaggle/input/creditcardfraud/creditcard.csv")
df.head()

The dataset has 31 columns, out of which 28 (V1–V28) have been computed through the PCA transformation; the remaining columns (Time, Amount and the Class label) are plain numerical features. Let's now see how the fraudulent and non-fraudulent transactions are distributed in the dataset.
print(df['Class'].value_counts())

There are only 492 fraudulent (1) transactions compared to 284315 non-fraudulent (0) ones, which is a highly skewed distribution. Clearly we need to resample our data, otherwise a model can achieve a very high accuracy by simply predicting every transaction as 0 (non-fraudulent).
print("Accuracy by predicting all transactions as non-fraudulent: " + str((284315 / (284315 + 492)) * 100) + "%")

Such a system is useless, since all the fraudulent transactions are ignored. We need a comparable number of fraudulent and non-fraudulent transactions for training in order to properly capture the characteristics of both kinds of transactions. First of all, let's analyse the fraudulent transactions with respect to 'Time' and 'Amount'.
df_fraud = df[df['Class'] == 1]
plt.figure(figsize=(15, 12))
plt.scatter(df_fraud['Time'], df_fraud['Amount'])
plt.title("Fraudulent transactions")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()

There are fraudulent transactions at all times, which makes time an irrelevant factor for predicting them. Additionally, most of the fraudulent transactions involve very small amounts, as seen in the plot above.
df_huge_fraud_amounts = df_fraud[df_fraud['Amount'] > 1000]
print("Number of fraudulent transactions over the amount of 1000 are: " + str((df_huge_fraud_amounts.shape[0])))

The given dataset says that the features have been calculated using PCA transformation, hence they should be independent of each other. Let us check if this is the case by calculating the correlation between the features.
import seaborn as sns
plt.figure(figsize = (15, 12))
df_correlation = df.corr()
sns.heatmap(df_correlation)
plt.title("Heatmap representing correlations")
plt.show()

The diagonal of the heatmap represents the highest correlation (close to 1.0), which is the correlation of each feature with itself. The correlations between other pairs of features lie between -0.2 and 0.2, which corresponds to very little correlation. This indicates that the features are indeed largely independent of each other, and hence no feature can be eliminated on the basis of its dependence on another feature.
Since the number of total fraudulent transactions is too small as compared to non-fraudulent transactions, we need to resample the dataset. By applying oversampling, we repeat the fraudulent transactions until they are close in number to non-fraudulent transactions. By applying undersampling, we eliminate a number of non-fraudulent transactions so that the final number of non-fraudulent transactions is roughly the same as fraudulent transactions in the dataset. By applying oversampling to this dataset, the training dataset will become huge (284315 non-fraudulent transactions as compared to just 492 fraudulent ones), so we use undersampling.
df_train = df[:200000]
df_train_fraud = df_train[df_train['Class'] == 1]
df_train_not_fraud = df_train[df_train['Class'] == 0]
print(df_train_fraud.shape[0]) # 385
Since there are 385 fraudulent transactions in this portion, we'll bring the number of non-fraudulent transactions down to around this figure so that we have a roughly equal number of training examples of each kind.
df_sample = df_train_not_fraud.sample(400)
df_train_final = pd.concat([df_train_fraud, df_sample])
df_train_final = df_train_final.sample(frac = 1).reset_index(drop = True)
df_train_final.head()

We can now see that the 'Class' column has both 0s and 1s. We took the first 200,000 rows of the dataset and randomly chose 400 (close to 385) non-fraudulent transactions from them. Hence, we have successfully implemented undersampling, and our final training set consists of 785 training examples. Undersampling can lead to the loss of some important information in the data, but first let's see the results we get by applying the SVM classifier. Let's now split the data into train and test sets.
X_train = df_train_final.drop(['Time', 'Class'],axis=1)
y_train = df_train_final['Class']
X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
df_test = df[200000:]
X_test = df_test.drop(['Time', 'Class'],axis=1)
y_test = df_test['Class']
X_test = np.asarray(X_test)
y_test = np.asarray(y_test)
print(df_test['Class'].value_counts())

We see that the test dataset contains 107 fraudulent transactions. If the SVM classifier performs well on the non-fraudulent transactions and can also detect most of these fraudulent ones, we can say that our model did pretty well.
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
Now, we have applied the SVM classifier to the training dataset and stored its predictions on the test dataset in the 'predictions' variable. To evaluate the model's performance, let's plot the confusion matrix and identify the following (treating the fraudulent class as the positive class):
(i) True positives: The original output class was a positive example (here a fraudulent transaction) and the predicted output class was also positive (fraudulent).
(ii) False positives: The original output class was a negative example (here a non-fraudulent transaction) but the predicted output class was positive (fraudulent).
(iii) False negatives: The original output class was a positive example (fraudulent) but the predicted output class was negative (non-fraudulent).
(iv) True negatives: The original output class was a negative example (non-fraudulent) and the predicted output class was also negative (non-fraudulent).
For a banking system, it's fine if some non-fraudulent transactions are flagged as fraudulent (false positives), they'll look into them, but if fraudulent transactions are labelled as non-fraudulent (false negatives), that can cause huge losses. Hence, our objective is to reduce the total number of false negatives as much as possible while, at the same time, keeping the number of false positives on the lower side as well. Now, let's plot the confusion matrix.
import itertools
classes = np.array(['0','1'])
def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cm = confusion_matrix(y_test, predictions)
plot_confusion_matrix(cm, classes)

The confusion matrix has
(i) 96 true positives
(ii) 3224 false positives
(iii) 11 false negatives
(iv) 81746 true negatives
From this, we can conclude that most of the fraudulent transactions have been captured and only 11 of them were misclassified, which is far better than a model that achieves an extremely high accuracy but doesn't capture a single fraudulent transaction, as we discussed earlier while setting up this problem.
print('Total fraudulent transactions detected: ' + str(cm[1][1]) + ' / ' + str(cm[1][1]+cm[1][0]))
print('Total non-fraudulent transactions detected: ' + str(cm[0][0]) + ' / ' + str(cm[0][1]+cm[0][0]))
print('Probability to detect a fraudulent transaction: ' + str(cm[1][1]/(cm[1][1]+cm[1][0])))
print('Probability to detect a non-fraudulent transaction: ' + str(cm[0][0]/(cm[0][1]+cm[0][0])))
print("Accuracy of the SVM model : "+str(100*(cm[0][0]+cm[1][1]) / (sum(cm[0]) + sum(cm[1]))) + "%")

The entire code for this post can be found here.
Our primary focus was on capturing as many fraudulent transactions as we could, and we've done a great job by detecting 96/107 of the fraudulent transactions in the test set, given that we trained on fewer than 800 examples in total, only 385 of which were fraudulent.
That's it for this post. We went into a lot of depth on the concepts behind SVMs, and I've tried to provide intuition for each part so that readers can grasp things quickly. The only topic we didn't discuss in detail was sub-gradient descent, but as long as you know how SVMs work and what their parameters mean, the Python packages that implement SVMs handle that optimization for you automatically. In the next post, we'll dive deep into decision trees and random forests and solve a Kaggle dataset using them.