In the previous article, we saw how to categorize our customers based on how recently they purchased, how often they transact, and other purchasing habits. There, we used RFM analysis, a customer segmentation technique. In this post, we’ll look at how k-means clustering, a machine learning algorithm, can be used to segment the same customer dataset, allowing us to better serve our customers while also increasing profitability.
As before, the complete code and the dataset used in this article are available on GitHub.
· Introduction
· Dataset
· Feature Engineering
· Find the optimal number of clusters
· Implementing K-Means Clustering
· Observations
· Conclusion
Introduction
K-Means is one of the most popular unsupervised clustering algorithms. It draws inferences using only the input vectors, without referring to known or labeled outcomes. The input parameter ‘k’ stands for the number of clusters or groups that we would like to form in the given dataset.
We won’t go through the mathematical details of the k-means algorithm, as that is outside the scope of this article. Instead, we’ll concentrate on the business requirement: identify the different customer segments using the algorithm and see how we can better serve our customers.
I recommend the following site to learn more about the k-means algorithm, its applications, advantages, and disadvantages:
K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks
To apply k-means clustering, all we have to do is tell the algorithm how many clusters we want, and it will divide the dataset into that many clusters. There are several methods to determine the optimal number of clusters; the Elbow method, which we’ll use in this article, is one of them.
Essentially, we will run the clustering algorithm several times with different values of k (e.g. 2–10), then calculate and plot the cost produced by each run. As the number of clusters increases, the average distortion decreases and each data point moves closer to its cluster centroid. However, the improvement in average distortion shrinks as k grows. Finally, we’ll get a chart (plotting average distortion against each k) that resembles an arm with a bent elbow. The improvement in distortion drops off the most at the value of k where the arm bends; that point is called the elbow, and it marks the optimal cluster count.
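For intuition, the “cost” being plotted is the within-cluster sum of squared errors (WSSSE): every point is charged the squared distance to its nearest centroid. Here is a minimal NumPy sketch of that calculation; the points and centroids below are made-up toy values, not our customer data:
import numpy as np

def wssse(points, centroids):
    # Each point is charged the squared distance to its nearest centroid
    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

# Toy example: three 2-D points, two centroids
points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]])
centroids = np.array([[1.0, 1.0], [5.0, 5.0]])
print(wssse(points, centroids))  # ~0.08 -- points sit close to their centroids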
Dataset
We’ll start with a semi-prepared dataset from the previous article, in which the recency, frequency, and monetary values for each unique customer have already been calculated. If you would like to start from the raw dataset, you can refer to my previous article as well as the code on GitHub.
# Load the semi-prepared RFM dataset (assumes an active SparkSession named `spark`)
rfm_numbers = spark.read.csv("retail_rfm_numbers.csv",
                             inferSchema=True,
                             header=True)

We have three distinctive features:
- Recency: How recently customers made their purchase.
- Frequency: For simplicity, we’ll count the number of times each customer made a purchase.
- Monetary: The total amount of money they spent.
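Before exploring further, it can be worth confirming what actually came in from the CSV. A quick sanity check (the column names are assumed from the previous article):
# Confirm the loaded schema and peek at a few rows
rfm_numbers.printSchema()
rfm_numbers.show(5)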
Explore the dataset using Pandas+Seaborn
Let’s look at the features using distribution charts to get a better understanding of the dataset.
import matplotlib.pyplot as plt
import seaborn as sns

rfm_scores_df = rfm_numbers.toPandas()
fig, ax = plt.subplots(1, 3, figsize=(16, 8))
# Recency distribution plot
sns.histplot(rfm_scores_df['Recency'], kde=True, ax=ax[0])
# Frequency distribution plot (extreme values filtered out to keep the plot readable)
sns.histplot(rfm_scores_df.query('Frequency < 1000')['Frequency'], kde=True, ax=ax[1])
# Monetary distribution plot (extreme values filtered out to keep the plot readable)
sns.histplot(rfm_scores_df.query('Monetary < 10000')['Monetary'], kde=True, ax=ax[2])

Feature Engineering
As we can see, all three features (recency, frequency, and monetary) are right-skewed and sit on different scales and ranges. We therefore need to standardize the data so that the ML algorithm can evaluate the relative distance between features and identify trends across them. Thankfully, Spark ML provides a StandardScaler class that lets us easily scale the features (by default it scales each feature to unit standard deviation without centering it).
In the following code we perform these feature engineering steps:
- Replace zero and negative values in the Monetary column
- Vectorize all the features (mandatory for K-Means from Spark ML to work)
- Standardize the feature-vector
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Replace zero and negative Monetary values with 1
rfm_data = (
    rfm_numbers.withColumn("Monetary",
                           F.when(F.col("Monetary") <= 0, 1)
                            .otherwise(F.col("Monetary")))
)

# Identify the feature columns (everything except CustomerID)
features = rfm_data.columns[1:]

# Vectorize all the features
assembler = VectorAssembler(
    inputCols=features,
    outputCol="rfm_features")
assembled_data = assembler.transform(rfm_data)
assembled_data = assembled_data.select(
    'CustomerID', 'rfm_features')

# Standardization
scaler = StandardScaler(
    inputCol='rfm_features',
    outputCol='rfm_standardized')
data_scale = scaler.fit(assembled_data)
scaled_data = data_scale.transform(assembled_data)
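If you want to sanity-check the result of the pipeline above, the standardized feature vector can be inspected directly (truncate=False keeps Spark from cutting the vector values short):
scaled_data.select('CustomerID', 'rfm_standardized').show(5, truncate=False)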
Find the optimal number of clusters
As we’ve discussed in the beginning, we’ll use the Elbow method to identify the optimal number of clusters for our dataset.
from pyspark.ml.clustering import KMeans

costs = {}
# Apply k-means with different values of k
for k in range(2, 10):
    k_means = KMeans(featuresCol='rfm_standardized', k=k)
    model = k_means.fit(scaled_data)
    # computeCost was removed in Spark 3.0; on newer versions use
    # model.summary.trainingCost instead
    costs[k] = model.computeCost(scaled_data)

# Plot the cost function
fig, ax = plt.subplots(1, 1, figsize=(16, 8))
ax.plot(list(costs.keys()), list(costs.values()))
ax.set_xlabel('k')
ax.set_ylabel('cost')

At k = 4, the line bends like an elbow (the improvement in distortion declines the most), so we can take k = 4 as the optimal number of clusters.
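The elbow can sometimes be ambiguous, so as an optional cross-check (not strictly required for this workflow) we can also compute the silhouette score for each k using Spark ML’s ClusteringEvaluator; higher silhouette values indicate better-separated clusters:
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator(featuresCol='rfm_standardized',
                                predictionCol='prediction',
                                metricName='silhouette')
silhouettes = {}
for k in range(2, 10):
    model = KMeans(featuresCol='rfm_standardized', k=k).fit(scaled_data)
    silhouettes[k] = evaluator.evaluate(model.transform(scaled_data))
print(silhouettes)  # look for a k with a relatively high silhouette score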
Implementing K-Means Clustering
In this step, we set the number of clusters ‘k’ to 4 and run the k-means algorithm one final time on the whole dataset; the predicted cluster for each customer ends up in a column named ‘prediction’.
k_means = KMeans(featuresCol='rfm_standardized', k=4)
model = k_means.fit(scaled_data)
predictions = model.transform(scaled_data)
result = predictions.select('CustomerID', 'prediction')
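It can also be informative to look at the learned centroids (expressed in the standardized feature space), which the fitted model exposes via clusterCenters():
# Print the centroid of each cluster in standardized (R, F, M) space
for i, center in enumerate(model.clusterCenters()):
    print(f"Cluster {i}: {center}")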


Observations
Let’s join the prediction with the starting dataset so that we can inspect the result with some charts.
# Join other information with the prediction result-set
rfm_score = spark.read.csv("retail_rfm_numbers.csv",
                           inferSchema=True,
                           header=True)
rfm_score = rfm_score.select("CustomerID", "Recency", "Frequency", "Monetary",
                             "RFM_Score", "RFM_ScoreGroup", "Loyalty")
combined_result = result.join(rfm_score, on='CustomerID', how='inner')

Now, to better comprehend the prediction result, we plot the recency, frequency, and monetary values for each cluster in a box plot. The resulting graph reveals these key points:
- Cluster 2 clearly has higher frequency and monetary numbers than the other clusters, and it also has the lowest recency value.
- Cluster 1, on the other hand, has the highest recency numbers but the lowest frequency and monetary numbers.
- The other clusters (0 and 3) sit in the middle.
This observation indicates that our customers fall into distinct groups: some exhibit the characteristics of our best-performing customers (cluster 2), while others are losing interest in shopping with us (cluster 1; they haven’t visited us in a long time). The remaining customers (clusters 0 and 3) may need some extra attention or incentives to shop with us more often. Based on this analysis, the business can respond differently to each group of customers to increase profit.
analysis_df = combined_result.toPandas()
fig, ax = plt.subplots(1, 3, figsize=(20, 12))
sns.boxplot(x='prediction', y='Recency', data=analysis_df, ax=ax[0])
sns.boxplot(x='prediction', y='Frequency', data=analysis_df, ax=ax[1])
sns.boxplot(x='prediction', y='Monetary', data=analysis_df, ax=ax[2])
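It is also worth checking how many customers land in each cluster; a segment that turns out to be very small may not justify a dedicated campaign:
# Number of customers per predicted cluster
combined_result.groupBy('prediction').count().orderBy('prediction').show()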

We can also take a look at the pairwise feature comparison to understand how the features influence each segment.
# Recency vs Monetary
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
sns.scatterplot(x='Recency', y='Monetary',
                data=analysis_df,
                hue='prediction',
                palette="deep")

# Recency vs Frequency
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
sns.scatterplot(x='Recency', y='Frequency',
                data=analysis_df,
                hue='prediction',
                palette="deep")

# Monetary vs Frequency
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
sns.scatterplot(x='Monetary', y='Frequency',
                data=analysis_df,
                hue='prediction',
                palette="deep")



Conclusion
K-means clustering is one of the most popular and widely used techniques for cluster analysis. Its results are not always as good as those of more sophisticated clustering techniques, but it can still provide great insights and help us understand our data. What are your thoughts on k-means clustering?