
What is a Clustering Algorithm?
In a business context: A clustering algorithm is a technique that assists customer segmentation, the process of classifying similar customers into the same segment. Clustering helps us better understand customers, in terms of both static demographics and dynamic behaviors. Customers with comparable characteristics often interact with the business similarly, so the business can benefit from this technique by creating a tailored marketing strategy for each segment.
In a data science context: A clustering algorithm is an unsupervised machine learning algorithm that discovers groups of closely related data points. The fundamental difference between supervised and unsupervised algorithms is:
- supervised algorithms: require partitioning the dataset into train and test sets; the algorithm learns from the outputs/labels of the training data and generalizes to unobserved data. For instance: decision trees, regression, neural networks.
- unsupervised algorithms: discover hidden patterns when there are no defined outputs/labels in the dataset. For instance: clustering, association rule mining, dimensionality reduction.
Here I have curated a collection of practical guides on various ML algorithms.
Prepare Data for Clustering
After giving an overview of what clustering is, let’s delve deeper into an actual customer data example. I am using the Kaggle dataset "Mall Customer Segmentation Data", which has five fields: customer ID, age, gender, annual income, and spending score. What the mall is most concerned about is customers’ spending scores, hence the objective of this exercise is to find hidden clusters with respect to the spending score field.
1. Load and Preview Data
Load the dataset and summarize column statistics using describe().
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype, is_numeric_dtype

# load the dataset and summarize column statistics
df = pd.read_csv("../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv")
print(df.describe())

Examine the distribution of each field: use bar chart for categorical variables and histogram for numeric variables.
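A sketch of this step, reusing the type-checking helpers imported above (plot styling is omitted for brevity):

# bar chart for categorical fields, histogram for numeric fields
for col in df.columns:
    plt.figure(figsize=(6, 4))
    if is_string_dtype(df[col]):
        df[col].value_counts().plot(kind="bar", title=col)
    elif is_numeric_dtype(df[col]):
        df[col].plot(kind="hist", bins=20, title=col)
    plt.show()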



If you would like more details about data visualization and EDA, please check out these two articles.
2. Data Preprocessing
Now that we have defined the objectives and have a better understanding of our data, we need to preprocess the data to meet the model requirements. In this example, I am choosing K Means clustering as the main algorithm, and the reason will be explained later on.
The standard K Means algorithm is not directly applicable to categorical data. To be more specific, the search space of a categorical variable (e.g. Gender in this example) is discrete (either male or female), hence it cannot be directly combined with a continuous space where distance is measured in the same manner. Therefore, I discarded the "Gender" field as well as the "CustomerID" field.
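A minimal sketch of this step, using the dataset’s original column names:

# drop the identifier and the categorical field before clustering
df = df.drop(columns=["CustomerID", "Gender"])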

Since K Means interprets the closeness between data points based on Euclidean distance, it is important to reconcile all dimensions into a standard scale. An appropriate type of data transformation should be selected to align with the distribution of the data. This article from Google provides general guidance on data transformation for clustering.
In Summary:
- if normal distribution → normalization
- if power law distribution → log transformation
- if data doesn’t conform to any distribution → quantiles
From the earlier univariate analysis, we can see that these variables conform to neither a normal distribution nor a power law distribution. Therefore, I use MinMaxScaler to shrink the data range to between 0 and 1 while maintaining the shape of the distribution.
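A sketch using scikit-learn’s MinMaxScaler (the name df_scaled is my own, reused in later snippets; the fitted scaler is kept so results can be mapped back to original units):

from sklearn.preprocessing import MinMaxScaler

# scale every remaining column to the [0, 1] range, preserving distribution shape
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)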


3. EDA
Exploratory Data Analysis provides visual clues about whether combining multiple variables is likely to form insightful clusters. It is also an imperative step because the choice of an appropriate clustering algorithm depends on the shape of the clusters. Center-based algorithms (e.g. K Means) are more adaptable to globular-shaped clusters and tend to break elongated, linearly shaped clusters apart, while density-based algorithms (e.g. DBSCAN) are better at clusters with irregular shapes and a lot of noise.
I have visualized those three fields in the following way.
2D Scatter Plot:
- age vs. annual income
- annual income vs. spending score
- age vs. spending score

3D Scatter Plot:
- age vs. annual income vs. spending score
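A sketch of these plots, using the original (unscaled) fields and the dataset’s column names:

# 2D scatter plots for each pair of fields
pairs = [("Age", "Annual Income (k$)"),
         ("Annual Income (k$)", "Spending Score (1-100)"),
         ("Age", "Spending Score (1-100)")]
for x, y in pairs:
    sns.scatterplot(data=df, x=x, y=y)
    plt.show()

# 3D scatter plot of all three fields
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(df["Age"], df["Annual Income (k$)"], df["Spending Score (1-100)"])
ax.set_xlabel("Age")
ax.set_ylabel("Annual Income (k$)")
ax.set_zlabel("Spending Score (1-100)")
plt.show()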


In this case, it is apparent that the plot of "annual income vs. spending score" shows some center-based clusters. Therefore, I will use the K Means algorithm for further exploration.
K Means Clustering
1. How does K Means Clustering work?
K Means Clustering is a center-based clustering algorithm, which means that it assigns data points to clusters based on closeness or distance, following these steps:
- Specify the number of clusters "K".
- Initiate K random centroids and assign one to each cluster: the centroid is the center of a cluster. K data points are randomly selected as the centroids at the beginning, and the cluster labels of the other data points are later defined relative to them. Consequently, different initial centroid assignments may lead to different cluster formations.
- Form K clusters by assigning each data point to the closest centroid: the closest centroid is usually defined by the smallest Euclidean distance, but it can also be based on correlation or cosine similarity, depending on the use case.
- Recompute the centroid of each cluster: after all data points have been assigned to a cluster, we recalculate the mean of all data points belonging to each cluster and define it as the new centroid.
- Reach convergence when centroids no longer change: iterate steps 3 and 4 until a stopping criterion is met, i.e. either the centroids no longer change or the maximum number of iterations is reached.
This procedure also explains the limitations of the K Means algorithm: it tends to cluster points into circular shapes of similar size, and it is heavily reliant on the predefined number K.
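To make these steps concrete, here is a minimal NumPy sketch of the procedure (illustrative only; it is not the scikit-learn implementation used later, and it assumes no cluster ever becomes empty):

def kmeans_sketch(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 3: assign each point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 5: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids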
2. How to build a K Means Clustering model?
Now let’s take a closer look at how to implement it using Python scikit-learn.
Firstly, I define a KMeansAlgorithm function that takes the dataset and the number of clusters as parameters.
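A sketch of what such a function might look like (KMeansAlgorithm is the assumed name, and the parameter values follow the description below):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def KMeansAlgorithm(X, n_clusters):
    # fit K Means with the key parameters described below
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++",
                    max_iter=300, algorithm="elkan", random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    return kmeans.cluster_centers_, kmeans.inertia_, labels, score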

This function calls KMeans() from the sklearn library and specifies some key parameters:
- n_clusters: the number of clusters
- init: controls the initialization technique; I chose "k-means++" to select initial centroids in a smart way and speed up convergence
- max_iter: specifies the maximum number of iterations allowed for each run
- algorithm: there are several options to choose from, "auto", "full" and "elkan"; I chose "elkan" because it is a K Means variation that uses the triangle inequality to be more time efficient
- random_state: an integer that determines the random generation of the initial centroids
The fitted model generates the following attributes and evaluation metrics:
- cluster_centers_: returns an array of the centroid locations
- inertia_: the sum of squared distances of data points to their closest centroid
- labels_: the cluster label assigned to each data point
- silhouette_score: compares the distance from a data point to other points in the same cluster against its distance to points in the nearest neighboring cluster
In this example, I am mainly interested in the formation of clusters with regard to "Spending Score". Therefore, I use the datasets X1, X2 and X3, as specified below, to feed the model.
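The exact definitions are not reproduced here; a plausible sketch, assuming the scaled dataframe df_scaled from the preprocessing step:

# subsets of the scaled data used to feed the model
X1 = df_scaled[["Age", "Spending Score (1-100)"]].values
X2 = df_scaled[["Annual Income (k$)", "Spending Score (1-100)"]].values
X3 = df_scaled.values  # all three fields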

Notice that we no longer need to partition the dataset into training and test sets for unsupervised models.
Age vs. Spending Score
The code below uses the dataset X1 – scaled Age vs. scaled Spending Score – and visualizes how the clustering changes as the number of clusters increases from 2 to 10.
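A sketch of this loop, reusing the KMeansAlgorithm helper sketched above and collecting the two metrics evaluated below:

inertias, silhouettes = [], []
for k in range(2, 11):
    centers, inertia, labels, silhouette = KMeansAlgorithm(X1, k)
    inertias.append(inertia)
    silhouettes.append(silhouette)
    # scatter plot of the clusters, with red squares marking the centroids
    plt.scatter(X1[:, 0], X1[:, 1], c=labels, cmap="viridis")
    plt.scatter(centers[:, 0], centers[:, 1], c="red", marker="s")
    plt.title(f"k = {k}")
    plt.xlabel("Age (scaled)")
    plt.ylabel("Spending Score (scaled)")
    plt.show()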

The first part of the code iterates through the number of clusters and generates a scatter plot for each, with red squares representing the centroids. As shown below, the clusters change as the specified "k" value changes.

Afterwards, I evaluate the model performance based on two metrics: inertia and silhouette coefficient.
Since clustering is an unsupervised algorithm, we cannot directly assess the model quality based on discrepancies between predicted results and actual results. Instead, the evaluation is based on the principle of minimizing intra-cluster distance while maximizing inter-cluster distance. There are two evaluation methods that determine the optimal number of clusters based on this principle.
1) Elbow method using inertia:
Inertia measures the sum of squared distances of samples to their closest cluster centroid. For the same number of clusters, a smaller inertia indicates better clusters. The elbow method determines the optimal number of clusters by looking at the turning point (the "elbow") of the graph below, which in this case is 6.
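A sketch of the elbow plot, using the inertia values collected in the loop above:

# plot inertia against the number of clusters to locate the elbow
plt.plot(range(2, 11), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()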

2) Silhouette coefficient:
Here is the Silhouette coefficient formula, which can appear daunting at first:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
But it basically translates into minimizing intra-cluster distance while maximizing inter-cluster distance: a(i) is the average distance of data point i to the other data points in the same cluster, and b(i) is the average distance of data point i to all points in the nearest neighboring cluster. The aim is to minimize a(i) and maximize b(i), therefore a coefficient closer to 1 indicates better clustering. In this example, both 2 and 6 show a peak in the score.
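The silhouette scores collected in the same loop can be plotted against k in the same way:

# plot the silhouette coefficient against the number of clusters
plt.plot(range(2, 11), silhouettes, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("silhouette coefficient")
plt.show()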

Based on the results of both metrics, 6 is most likely the optimal number of clusters when plotting "Age" against "Spending Score". However, neither the scores nor the visualizations are convincing enough to say that age and spending score together form insightful clusters.
Annual Income vs. Spending Score
From the EDA process, we observed several distinct clusters in the chart. Now, let’s take a closer look at whether the model can differentiate customers into distinct segments as expected. The code is similar to the one above, just changing the input dataset to X2 – scaled Annual Income vs. scaled Spending Score.




It is quite obvious that a decent segmentation emerges when the number of clusters equals 5, and it is clearly shown in the scatter plot as well.

Age vs. Annual Income vs. Spending Score
If you are interested, we can also investigate further how these three fields interact with each other.



Although, in terms of both the visualization and the metric values, there is no indication of any outstanding clusters forming, unsupervised learning is about discovering and exploring hidden patterns, so it may still be worth the effort to take this investigation further.
3. How does it apply to customer segmentation?
After all, the objective of a clustering model is to bring insights to customer segmentation. In this exercise, grouping customers into 5 segments based on two aspects, Spending Score vs. Annual Income, is most beneficial for creating tailored marketing strategies.
As shown below, customers can be segmented into five groups:
- high annual income (above 70 k$) and high spending score (above 60)
- high annual income (above 70 k$) and low spending score (below 40)
- low annual income (below 40 k$) and high spending score (above 60)
- low annual income (below 40 k$) and low spending score (below 40)
- average annual income (between 40 k$ and 70 k$) and average spending score (between 40 and 60)
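As a sketch of this step, the k = 5 labels from X2 can be attached back to the original, unscaled dataframe to summarize each segment in its original units (again using the assumed KMeansAlgorithm helper):

# fit k = 5 on Annual Income vs. Spending Score and label each customer
centers, inertia, labels, silhouette = KMeansAlgorithm(X2, 5)
df["Segment"] = labels

# average income and spending score of each segment, in original units
print(df.groupby("Segment")[["Annual Income (k$)", "Spending Score (1-100)"]].mean())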

Based on the distinct features of each group, the business can approach each customer group differently. The first and third types of customers generate the most revenue, and they call for retention efforts through marketing strategies such as a loyalty program or discounts via newsletter. On the other hand, customer acquisition strategies are more suitable for groups 2 and 4. These can be tailored marketing campaigns based on income level, since high income and low income groups have different preferences in products.
Take a Step Further
It is worth noting several limitations of K Means clustering. It doesn’t work well when the clusters vary in size and density, or when there are significant outliers. Therefore, we need to consider other clustering algorithms when encountering situations like this. For example, DBSCAN is more resistant to noise and outliers, and hierarchical clustering does not assume any particular number of clusters. If you would like to know more about the comparison with DBSCAN, check out my Kaggle Notebook.
Hope you enjoy my article :). If you would like to read more of my articles on Medium, please feel free to contribute by signing up for a Medium membership using this affiliate link (https://destingong.medium.com/membership).
Take Home Message
This article introduces clustering algorithms, specifically K Means Clustering, and how we can apply them in a business context to assist customer segmentation. Some key takeaways:
- prepare data for Clustering Analysis: data transformation and exploratory data analysis
- K Means clustering algorithm implementation using scikit learn
- clustering algorithm evaluation using inertia and silhouette score
More Articles Like This
Semi-Automated Exploratory Data Analysis (EDA) in Python
Originally published at https://www.visual-design.net on July 4th, 2021.