
The basics
The important job that SVMs perform is to find a decision boundary that classifies our data. This decision boundary is also called the hyperplane.
Let's start with an example. If you look at figure 1, you will see that the purple line makes more sense as a hyperplane than the black line. The black line would also do the job, but it skates a little too close to one of the red points to be a good decision boundary.
Visually, this is quite easy to spot.

Linear classification
So, the question remains: how does SVM identify this? If we stick with our purple line as the optimal hyperplane, SVM looks for the points closest to it and calculates the distance between those points and the hyperplane. This distance is called the margin, as indicated by the green dotted line in figure 2.
In figure 2, the data points closest to the hyperplane (the ones sitting on the bright-green margin lines) are known as the support vectors.
Once SVM finds the hyperplane with the maximum margin between the two classes, BOOM–BAM, we have found our optimal hyperplane. SVM thus ensures that the gap between the classes is as wide as possible.
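If you would like to see this in code, here is a minimal sketch (separate from the dataset walkthrough later in this post) that fits a linear SVM to two made-up clusters with MATLAB's fitcsvm and circles the support vectors it finds; the data and settings are placeholders, purely for illustration.
%% minimal linear SVM sketch on synthetic 2D data (illustrative only)
rng(1);                                  % fixed seed so the sketch is repeatable
X = [randn(20,2) + 2; randn(20,2) - 2];  % two well-separated clusters
y = [ones(20,1); -ones(20,1)];           % class labels
mdl = fitcsvm(X, y, 'KernelFunction', 'linear');
sv = mdl.SupportVectors;                 % the points that sit on the margin
gscatter(X(:,1), X(:,2), y); hold on;
plot(sv(:,1), sv(:,2), 'ko', 'MarkerSize', 10);  % circle the support vectors
hold off;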

Non-linear classification
Non-linear classification is also possible with SVM, and in fact this is where SVM shines. Here SVM uses 'kernels' to map the non-linear data into a higher-dimensional space where it becomes linearly separable, and then finds the optimal hyperplane there.
A kernel function is always used by SVM, regardless of whether the data is linear or non-linear, but it really comes into play when the data cannot be separated by a straight line as in figure 1. The kernel function adds dimensions to the problem so the classes can be separated.
Figure 3 shows a simplified illustration of an SVM hyperplane in a non-linear scenario.
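As a quick illustration (again with made-up data, not the data behind figure 3), compare a linear and an RBF kernel on a class that is completely surrounded by another; the linear kernel cannot separate them, while the RBF kernel does so almost perfectly.
%% non-linear sketch: one class enclosed by another (illustrative only)
rng(2);
theta = 2*pi*rand(50,1);
X = [0.3*randn(50,2); ...                             % inner cluster
     [2*cos(theta), 2*sin(theta)] + 0.1*randn(50,2)]; % surrounding ring
y = [ones(50,1); -ones(50,1)];
mdl_linear = fitcsvm(X, y, 'KernelFunction', 'linear');
mdl_rbf    = fitcsvm(X, y, 'KernelFunction', 'rbf');
[resubLoss(mdl_linear), resubLoss(mdl_rbf)]           % training error: high vs close to zero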

Kernels
There are a few kernel choices available to assist us with SVM, and I find you need to try them to see which fits your data best. The most popular is the Gaussian (RBF) kernel; here is a detailed explanation of this kernel. The parameters we select decide the shape of the Gaussian curve. I won't go into the maths, since we have libraries that do this for us; suffice to say that figure 3 is a top view of figure 4. The shape of the curve depends on the parameters, which we will discuss next.
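For reference, the Gaussian kernel is usually written as K(x1, x2) = exp(-||x1 - x2||^2 / (2*sigma^2)), where sigma is the width parameter mentioned above. Here it is as a MATLAB one-liner, just to show the shape of the function (the sample points are arbitrary):
% Gaussian (RBF) kernel between two points; sigma controls the width of the curve
gaussKernel = @(x1, x2, sigma) exp(-sum((x1 - x2).^2) / (2*sigma^2));
gaussKernel([1 2], [1.5 2.5], 1)   % nearby points  -> value close to 1
gaussKernel([1 2], [5 8], 1)       % distant points -> value close to 0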

Important parameters
"C" controls the trade off between smooth decision boundary and classifying training points correctly. This parameter represents the error penalty for misclassification for SVM. It maintains the tradeoff between smoother hyperplane and misclassifications.
- A large value of C means more training points will be classified correctly, since you get a lower bias with a higher variance. Don't make it too large or you can overfit your data.
- A small value of C means a higher bias with lower variance. Don't make it too small or you could underfit your data.
"Gamma" The variance determines the width of the Gaussian kernel. In statistics, when we consider the Gaussian probability density function it is called the standard deviation, and the square of it, s2, the variance.
When do we use SVM, logistic regression or neural networks?
If n = the number of features and m = the number of rows in the training set, then you can take this as a guide:
- n = large (>1 000), m = small (<1 000): use logistic regression or SVM without a kernel
- n = small (<1 000), m = intermediate (<10 000): use SVM with a Gaussian kernel
- n = small (<1 000), m = large (>50 000): add more features, then use logistic regression or SVM without a kernel
Neural networks will work well in all of the above scenarios, but they are slower to train.
Let's get coding in MATLAB
Now that we have the basics, let's get coding so we can see this in action. Before you start, download the dataset from Kaggle. Get to know the data and you will see that the first column is not needed, while our classification label is in column 2. Our features are therefore columns 3 to 32.

We will start with loading the data and then normalising the features.
%% setup X & Y
clear;
tbl = readtable('breast cancer.csv');
[m,n] = size(tbl);
X = zeros(m,n-2);
X(:,1:30) = tbl{:,3:32};
[y,labels] = grp2idx(tbl{:,2});
[X_norm, mu, sigma] = featureNormalize(X);
Next, we can create our training and testing datasets. We will also create 'cv', a 5-fold partition that splits the training set into training and cross-validation folds; 'cv' will be used by the feature selection below.
%% split up train, cross validation and test set
rand_num = randperm(size(X_norm,1));                            % shuffle the rows
X_train = X_norm(rand_num(1:round(0.8*length(rand_num))),:);    % 80% for training
y_train = y(rand_num(1:round(0.8*length(rand_num))),:);
X_test = X_norm(rand_num(round(0.8*length(rand_num))+1:end),:); % 20% for testing
y_test = y(rand_num(round(0.8*length(rand_num))+1:end),:);
cv = cvpartition(y_train,'KFold',5);                            % 5-fold partition for sequentialfs
Before we run SVM, let's let sequentialfs tell us which features are the best.
%% feature selection
opts = statset('display','iter');
% criterion function: train an RBF SVM on a candidate feature subset and count its misclassifications
classf = @(train_data, train_labels, test_data, test_labels)...
sum(predict(fitcsvm(train_data, train_labels,'KernelFunction','rbf'), test_data) ~= test_labels);
% fs will hold the most appropriate features
[fs, history] = sequentialfs(classf, X_train, y_train, 'cv', cv, 'options', opts);
Now we can drop the features that sequentialfs did not select.
%% retain columns from feature selection
X_train = X_train(:,fs);
X_test = X_test(:,fs);
Finally, run SVM and use the test dataset to check the accuracy.
%% predict, try linear, gaussian, rbf
mdl = fitcsvm(X_train,y_train,'KernelFunction','rbf',...
'OptimizeHyperparameters','auto',...
'HyperparameterOptimizationOptions',...
struct('AcquisitionFunctionName',...
'expected-improvement-plus','ShowPlots',true));
pred = predict(mdl,X_test);
accuracy = (mean(double(pred == y_test)) * 100)
You should end up with an accuracy of around 98.24% (the exact figure will vary slightly, since the train/test split is random), and the result plots should look like those below.
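If you want more detail than a single accuracy figure, a confusion matrix (not part of the original plots) breaks the test-set errors down per class:
%% optional: per-class breakdown of the test-set predictions
confusionmat(y_test, pred)    % rows = true class, columns = predicted class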

Finally, before we conclude, here is the feature normalisation function
function [X_norm, mu, sigma] = featureNormalize(X)
mu = mean(X);                              % column means
X_norm = bsxfun(@minus, X, mu);            % centre each feature
sigma = std(X_norm);                       % column standard deviations
X_norm = bsxfun(@rdivide, X_norm, sigma);  % scale to unit variance
end
Conclusion
SVM is a good choice when classifying a dataset. It is useful to understand when it is a better choice than logistic regression, and when to use a Gaussian kernel versus running SVM without one.
Sources
- Figures 1–6 were constructed in www.desmos.com