
Intro
To succeed in your Data Science projects, it is essential to understand how different Machine Learning algorithms work.
I have written this story as part of a series that dives into individual ML algorithms, explaining their mechanics and supplementing the explanations with Python code examples and intuitive visualizations.
The story covers the following topics:
- The category of algorithms that SVM classification belongs to
- An explanation of how the algorithm works
- What are kernels, and how are they used in SVM?
- A closer look into RBF kernel with Python examples and graphs
What category of algorithms does Support Vector Machines classification belong to?
Support Vector Machines (SVMs) are most frequently used for solving classification problems, which fall under the supervised Machine Learning category.
With small adaptations, SVMs can also be used for regression via the Support Vector Regression (SVR) algorithm.
SVM classification algorithm – a brief explanation
Let’s assume we have a set of points that belong to two separate classes. We want to separate those two classes in a way that allows us to correctly assign any future new points to one class or the other.
The SVM algorithm attempts to find a hyperplane that separates the two classes with the largest possible margin. If the classes are fully linearly separable, a hard margin can be used; otherwise, a soft margin is required.
Note that the points that end up on the margins are known as support vectors.
To aid understanding, let's review the examples in the illustrations below.
Hard-margin

- The hyperplane "H1" cannot correctly separate the two classes; hence, it is not a viable solution to our problem.
- The "H2" hyperplane separates the classes correctly. However, the margin between the hyperplane and the nearest blue and green points is tiny, so there is a high chance of incorrectly classifying future new points. E.g., the new grey point (x1=3, x2=3.6) would be assigned to the green class by the algorithm even though it obviously belongs to the blue class.
- Finally, the "H3" hyperplane separates the two classes correctly and with the highest possible margin (yellow shaded area). Solution found!
Note that finding the largest possible margin allows more accurate classification of new points, making the model a lot more robust. You can see that the new grey point would be assigned correctly to the blue class when using the "H3" hyperplane.
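To make this concrete, below is a minimal sketch of a hard-margin-style fit using sklearn's SVC with a linear kernel. The tiny toy dataset and the very large C value are my own assumptions for illustration; a very large C effectively forbids margin violations, which approximates a hard margin.

# Minimal hard-margin sketch on an assumed, linearly separable toy dataset
import numpy as np
from sklearn.svm import SVC

X_toy = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.5],   # class 0 cluster
                  [4.0, 4.0], [4.5, 4.2], [5.0, 5.0]])  # class 1 cluster
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin (virtually no violations tolerated)
hard_svm = SVC(kernel='linear', C=1e6).fit(X_toy, y_toy)

# The points lying on the margins are the support vectors
print(hard_svm.support_vectors_)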
Soft-margin
Sometimes, it may not be possible to separate the two classes perfectly. In such scenarios, a soft margin is used, where some points are allowed to be misclassified or to fall inside the margin (yellow shaded area). This is where the "slack" value comes in, denoted by the Greek letter ξ (xi).

In this example, we can see that the "H4" hyperplane treats the green point inside the margin as an outlier. Hence, the support vectors are the two green points closer to the main group of green points. This allows a larger margin to exist, increasing the model's robustness.
Note that the algorithm lets you control how much you care about misclassifications (and points inside the margin) by adjusting the hyperparameter C. Essentially, C acts as a weight assigned to ξ. A low C makes the decision surface smooth (more robust), while a high C aims to classify all training examples correctly, producing a closer fit to the training data but making the model less robust.
Beware: while setting a high value for C is likely to lead to better model performance on the training data, there is a high risk of overfitting, producing poor results on the test data.
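As a rough illustration of the C trade-off, the sketch below fits a linear SVM to an assumed pair of overlapping clusters (my own toy data, not the article's) with a low and a high C, and compares training accuracy and the number of support vectors.

# Sketch of the effect of C on an assumed, non-separable toy dataset
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0.0, 1.2, size=(50, 2)),   # class 0
                   rng.normal(2.5, 1.2, size=(50, 2))])  # class 1 (overlapping)
y_toy = np.array([0] * 50 + [1] * 50)

for C in [0.01, 100]:
    clf = SVC(kernel='linear', C=C).fit(X_toy, y_toy)
    # Low C: wider margin, more tolerated violations; high C: tighter fit to the training data
    print(f"C={C}: train accuracy={clf.score(X_toy, y_toy):.2f}, "
          f"support vectors={len(clf.support_vectors_)}")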
Kernel trick
The above explanation of SVM covered examples where blue and green classes are linearly separable. However, what if we wanted to apply SVMs to non-linear problems? How would we do that?
This is where the kernel trick comes in. A kernel is a function that takes the original non-linear problem and transforms it into a linear one in a higher-dimensional space. To explain this trick, let's study the below example.
Suppose you have two classes – red and black, as shown below:

As you can see, the red and black points are not linearly separable, since no straight line can place the two classes on different sides of it. However, we can separate them by drawing a circle with all the red points inside it and the black points outside it.
How to transform this problem into a linear one?
Let’s add a third dimension and make it a sum of squared x and y values:
z = x² + y²
Using this three-dimensional space with x, y, and z coordinates, we can now draw a hyperplane (flat 2D surface) to separate red and black points. Hence, the SVM classification algorithm can now be used.
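The short sketch below reproduces this idea on an assumed toy dataset (red points inside a circle, black points on an outer ring): a linear SVM struggles in the original 2D space but succeeds once we add the z = x² + y² feature by hand.

# Sketch: adding z = x^2 + y^2 makes circular toy data linearly separable
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(42)
angles = rng.uniform(0, 2 * np.pi, 100)
radii = np.concatenate([rng.uniform(0.0, 1.0, 50),   # red points (inner circle)
                        rng.uniform(2.0, 3.0, 50)])  # black points (outer ring)
x, y = radii * np.cos(angles), radii * np.sin(angles)
labels = np.array([0] * 50 + [1] * 50)

X_2d = np.column_stack([x, y])                # original two dimensions
X_3d = np.column_stack([x, y, x**2 + y**2])   # with the added z dimension

print('2D linear SVM accuracy:', SVC(kernel='linear').fit(X_2d, labels).score(X_2d, labels))
print('3D linear SVM accuracy:', SVC(kernel='linear').fit(X_3d, labels).score(X_3d, labels))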

Radial Basis Function (RBF) kernel and Python examples
RBF is the default kernel used within sklearn's SVM classification algorithm and can be described with the following formula:

K(x, x') = exp(-γ ||x - x'||²)

where gamma (γ) can be set manually and has to be > 0. The default value for gamma ('scale') in sklearn's SVM classification algorithm is:

γ = 1 / (n_features × X.var())
Briefly:
||x - x'||² is the squared Euclidean distance between two feature vectors (2 points).
Gamma is a scalar that defines how much influence a single training example (point) has.
So, given the above setup, we can control individual points’ influence on the overall algorithm. The larger gamma is, the closer other points must be to affect the model. We will see the impact of changing gamma in the below Python examples.
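To check the formula yourself, the sketch below computes the RBF kernel value for two arbitrary points by hand and compares it with sklearn's rbf_kernel helper (the points and gamma value are just illustrative assumptions).

# Sketch: manual RBF kernel value vs. sklearn's implementation
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x1 = np.array([[1.0, 2.0]])   # two example feature vectors
x2 = np.array([[2.0, 4.0]])
gamma = 0.5

# K(x, x') = exp(-gamma * ||x - x'||^2)
manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))

print(manual, rbf_kernel(x1, x2, gamma=gamma)[0, 0])  # the two values should match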
Setup
We will use the following data and libraries:
- Chess games data from Kaggle
- Scikit-learn library for splitting the data into train-test samples, building SVM classification models, and model evaluation
- Plotly for data visualizations
- Pandas and Numpy for data manipulation
Let’s import all the libraries:
import pandas as pd # for data manipulation
import numpy as np # for data manipulation
from sklearn.model_selection import train_test_split # for splitting the data into train and test samples
from sklearn.metrics import classification_report # for model evaluation metrics
from sklearn.svm import SVC # for Support Vector Classification model
import plotly.express as px # for data visualization
import plotly.graph_objects as go # for data visualization
Then we get the chess games data from Kaggle, which you can download by following this link: https://www.kaggle.com/datasnaek/chess.
Once you have saved the data on your machine, ingest it with the code below. Note that we also derive a couple of new variables to use in the modeling.
# Read in the csv
df=pd.read_csv('games.csv', encoding='utf-8')
# Difference between white rating and black rating - independent variable
df['rating_difference']=df['white_rating']-df['black_rating']
# White wins flag (1=win vs. 0=not-win) - dependent (target) variable
df['white_win']=df['winner'].apply(lambda x: 1 if x=='white' else 0)
# Print a snapshot of a few columns
df.iloc[:,[0,1,5,6,8,9,10,11,13,16,17]]

Now, let’s create a couple of functions to reuse when building different models and plotting the results.
This first function will split the data into train and test samples, fit the model, predict the result on a test set, and generate model performance evaluation metrics.
def fitting(X, y, C, gamma):
    # Create training and testing samples
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # Fit the model
    # Note: available kernels are {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
    model = SVC(kernel='rbf', probability=True, C=C, gamma=gamma)
    clf = model.fit(X_train, y_train)
    # Predict class labels on the training data
    pred_labels_tr = model.predict(X_train)
    # Predict class labels on the test data
    pred_labels_te = model.predict(X_test)
    # Use the score method to get the accuracy of the model
    print('----- Evaluation on Test Data -----')
    score_te = model.score(X_test, y_test)
    print('Accuracy Score: ', score_te)
    # Look at the classification report to evaluate the model
    print(classification_report(y_test, pred_labels_te))
    print('--------------------------------------------------------')
    print('----- Evaluation on Training Data -----')
    score_tr = model.score(X_train, y_train)
    print('Accuracy Score: ', score_tr)
    # Look at the classification report to evaluate the model
    print(classification_report(y_train, pred_labels_tr))
    print('--------------------------------------------------------')
    # Return relevant data for chart plotting
    return X_train, X_test, y_train, y_test, clf
The following function will draw a Plotly 3D scatter graph with the test data and model prediction surface.
def Plot_3D(X, X_test, y_test, clf):
    # Specify the step size of the mesh grid on which we will run our model
    mesh_size = 5
    margin = 1
    # Create a mesh grid covering the range of both features
    x_min, x_max = X.iloc[:, 0].fillna(X.iloc[:, 0].mean()).min() - margin, X.iloc[:, 0].fillna(X.iloc[:, 0].mean()).max() + margin
    y_min, y_max = X.iloc[:, 1].fillna(X.iloc[:, 1].mean()).min() - margin, X.iloc[:, 1].fillna(X.iloc[:, 1].mean()).max() + margin
    xrange = np.arange(x_min, x_max, mesh_size)
    yrange = np.arange(y_min, y_max, mesh_size)
    xx, yy = np.meshgrid(xrange, yrange)
    # Calculate the predicted probability of class 1 on the grid
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    Z = Z.reshape(xx.shape)
    # Create a 3D scatter plot with the test data
    fig = px.scatter_3d(x=X_test['rating_difference'], y=X_test['turns'], z=y_test,
                        opacity=0.8, color_discrete_sequence=['black'])
    # Set figure colors
    fig.update_layout(paper_bgcolor='white',
                      scene=dict(xaxis=dict(backgroundcolor='white', color='black', gridcolor='#f0f0f0'),
                                 yaxis=dict(backgroundcolor='white', color='black', gridcolor='#f0f0f0'),
                                 zaxis=dict(backgroundcolor='lightgrey', color='black', gridcolor='#f0f0f0')))
    # Update marker size
    fig.update_traces(marker=dict(size=1))
    # Add the prediction surface
    fig.add_traces(go.Surface(x=xrange, y=yrange, z=Z, name='SVM Prediction',
                              colorscale='RdBu', showscale=False,
                              contours={"z": {"show": True, "start": 0.2, "end": 0.8, "size": 0.05}}))
    fig.show()
Build a model with default values for C and Gamma
Let’s build our first SVM model using ‘rating_difference’ and ‘turns’ fields as our independent variables (attributes/predictors) and the ‘white_win’ flag as our target.
Note that we are somewhat cheating here, as the total number of moves would only be known after the match. Hence, 'turns' would not be available if we wanted to generate a model prediction before the match starts. Nevertheless, this is for illustration purposes only, so we will use it in the below examples.
Since we are using our previously defined ‘fitting’ function, the code is short.
# Select data for modeling
X=df[['rating_difference', 'turns']]
y=df['white_win'].values
# Fit the model and display results
X_train, X_test, y_train, y_test, clf = fitting(X, y, 1, 'scale')
The function prints the following model evaluation metrics:

We can see that the model performance on test data is similar to that on training data, which gives reassurance that the model can generalize well using the default hyperparameters.
Let’s now visualize the prediction by simply calling the Plot_3D function:
Plot_3D(X, X_test, y_test, clf)

Note that the black points at the top are actual class=1 (white won), and the ones at the bottom are actual class=0 (white did not win). Meanwhile, the surface is the probability of a white win produced by the model.
While there is local variation in the probability, the decision boundary lies around x=0 (i.e., rating difference=0) since this is where the probability crosses the p=0.5 boundary.
SVM model 2 – Gamma = 0.1
Let’s now see what happens when we set a relatively high value for gamma.
# Select data for modeling
X=df[['rating_difference', 'turns']]
y=df['white_win'].values
# Fit the model and display results
X_train, X_test, y_train, y_test, clf = fitting(X, y, 1, 0.1)
# Plot 3D chart
Plot_3D(X, X_test, y_test, clf)

We can see that increasing gamma has led to better model performance on training data but worse performance on test data. The below graph helps us see exactly why that is.

Instead of having a smooth prediction surface like before, we now have a very "spiky" one. To understand why this happens, we need to study the kernel function a bit closer.
When we choose a high gamma, we tell the kernel that nearby points are much more important for the prediction than points further away. Hence, we get these "spikes", as the prediction largely depends on individual training points rather than on what is around them.
Conversely, reducing gamma tells the kernel that it is not just the individual point but also the points around it that matter when making the prediction. To verify this, let's look at another example with a relatively low value for gamma.
SVM model 3 – Gamma = 0.000001
Let’s rerun the functions:
# Select data for modeling
X=df[['rating_difference', 'turns']]
y=df['white_win'].values
# Fit the model and display results
X_train, X_test, y_train, y_test, clf = fitting(X, y, 1, 0.000001)
# Plot 3D chart
Plot_3D(X, X_test, y_test, clf)

As expected, reducing gamma made the model more robust with an increase in model performance on the test data (accuracy = 0.66). The below graph illustrates how much smoother the prediction surface has become after assigning more influence to the points further away.

Adjusting hyperparameter C
I decided not to include examples with different C values in this story because C affects the smoothness of the prediction surface in a similar way to gamma, albeit for different reasons. You can try this yourself by passing a value such as C=100 to the "fitting" function and observing how the surface changes.
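If you want to try it, one possible way (reusing the functions defined above and keeping the default gamma='scale') could look like this:

# Refit with a high C value and inspect the resulting prediction surface
X_train, X_test, y_train, y_test, clf = fitting(X, y, 100, 'scale')
Plot_3D(X, X_test, y_test, clf)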
Conclusion
The SVM algorithm is powerful and flexible. While I only covered the basic usage with one of the available kernels, I hope this has given you an understanding of the inner workings of SVM and the RBF kernel, enabling you to explore the rest of the options on your own.
Please give me a shout if you have any questions, and thanks for reading!
Cheers 👏 Saul Dobilas