Implementing Customer Segmentation using RFM analysis with PySpark

A step-by-step guide to implementing customer segmentation using the Recency, Frequency, and Monetary (RFM) method with Python and Apache Spark.

Photo by Devin Avery on Unsplash

Customer segmentation is a marketing tool to group your customers based on common characteristics so that you can focus and market to each group effectively and maximize the value of each customer to the business.

As in many other disciplines, the old 80–20 rule shows up in business too: roughly 80% of your revenue comes from 20% of your customers. That's why, to grow your business, you need to understand your customers.

There are at least two main goals of customer segmentation:

  1. Continue to provide the best service to your best customers.
  2. Focus on prospective customers who resemble your best customers.

Table of contents

  ∘ RFM Model
  ∘ Dataset
  ∘ Recency, Frequency & Monetary value calculation
  ∘ RFM score calculation
  ∘ Segmentation based on RFM Score
  ∘ Segmentation results
  ∘ Conclusion

RFM Model

RFM stands for recency, frequency, and monetary value. It is a highly flexible managerial model for customer segmentation.

This article walks through a step-by-step approach to segmenting a customer base using the RFM model with PySpark, the Python API of the popular distributed data processing framework Apache Spark. In the future, we'll see how we can take advantage of Machine Learning algorithms (such as K-Means) to improve the segmentation process and even try to predict customer churn.

The complete code used in this article is available on GitHub.

Dataset

For this article, we are going to use a publicly available online retail transaction dataset from Kaggle, which includes the transaction information of each customer from all over the world. It includes information such as invoice number, invoice date, customer id, description of the product, purchased quantity, and country where the customer lives. To keep the article short, I’m going to exclude the data exploration and data preparation steps, and start with a clean dataset.

Since more than 90% of the data points are from the UK, we are going to consider only those data points. Let’s read our clean dataset and inspect it.
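As a rough illustration of the UK-only filter, here is a minimal sketch in plain Python (so it runs without a Spark session). The sample rows and the field names `invoice_no`, `customer_id`, and `country` are assumptions made for this example, not the dataset's actual schema. In PySpark the same idea would be a filter on the country column of the DataFrame.

```python
# Hypothetical sample rows; the real dataset has far more rows and columns.
transactions = [
    {"invoice_no": "536365", "customer_id": 17850, "country": "United Kingdom"},
    {"invoice_no": "536370", "customer_id": 12583, "country": "France"},
    {"invoice_no": "536373", "customer_id": 17850, "country": "United Kingdom"},
]


def keep_country(rows, country="United Kingdom"):
    """Return only the rows whose 'country' field matches the given country."""
    return [row for row in rows if row["country"] == country]


uk_transactions = keep_country(transactions)

# Share of rows that survive the filter (2 of 3 in this toy sample).
uk_share = len(uk_transactions) / len(transactions)
print(f"{len(uk_transactions)} UK rows, share = {uk_share:.0%}")
```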

Recency, Frequency & Monetary value calculation

The first thing we’ll calculate is the three key factors of RFM Analysis (recency, frequency, and monetary).

  • Recency: How recently customers made their purchase.
  • Frequency: For simplicity, we’ll count the number of times each customer made a purchase.
  • Monetary: How much money they spent in total.

We are going to calculate these three key factors by grouping the transactions by customer, taking "2011/12/10" as our reference end date since this is the last transaction date listed in our dataset.
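The calculation can be sketched in plain Python as follows. This is not the PySpark code from the repository, only an illustration of the logic under a few assumptions: the row layout `(customer_id, invoice_no, invoice_date, line_total)` is hypothetical, and frequency is counted as the number of distinct invoices per customer. In PySpark, the equivalent would be a group-by on the customer column followed by aggregations.

```python
from collections import defaultdict
from datetime import date

# Reference end date used in the article.
REFERENCE_DATE = date(2011, 12, 10)

# Hypothetical rows: (customer_id, invoice_no, invoice_date, line_total).
rows = [
    (17850, "536365", date(2011, 12, 1), 20.0),
    (17850, "536365", date(2011, 12, 1), 15.5),
    (17850, "540001", date(2011, 11, 20), 30.0),
    (13047, "536367", date(2011, 6, 10), 54.0),
]


def compute_rfm(rows, reference_date=REFERENCE_DATE):
    """Group rows by customer and derive recency, frequency, and monetary value."""
    last_purchase = {}
    invoices = defaultdict(set)
    spend = defaultdict(float)
    for customer_id, invoice_no, invoice_date, line_total in rows:
        previous = last_purchase.get(customer_id)
        if previous is None or invoice_date > previous:
            last_purchase[customer_id] = invoice_date
        invoices[customer_id].add(invoice_no)   # distinct invoices (assumption)
        spend[customer_id] += line_total
    return {
        customer_id: {
            # Days between the last purchase and the reference date.
            "recency": (reference_date - last_purchase[customer_id]).days,
            "frequency": len(invoices[customer_id]),
            "monetary": round(spend[customer_id], 2),
        }
        for customer_id in last_purchase
    }


rfm = compute_rfm(rows)
for customer_id, metrics in rfm.items():
    print(customer_id, metrics)
```

A lower recency value means a more recent purchase, which is why recency is scored in the opposite direction from frequency and monetary value later on.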

Explore the RFM values using Pandas+Seaborn

Once we have every customer’s individual recency, frequency, and monetary value calculated, we’d like to see the distribution graph to understand the data better. Unfortunately, Apache Spark is not very suitable for visualization. That’s why we’ll use Pandas with Seaborn for this part.

RFM score calculation

As we see in the graph, our recency, frequency, and monetary values are on different scales and ranges, and all three indicators are right-skewed. This is not a problem for the RFM segmentation model, but if we want to segment using machine learning models (which we'll do in the future), we will have to normalize this data.

Let's start with the segmentation. First, we'll assign each customer a score for each of their recency, frequency, and monetary values. Then, we'll aggregate those individual scores into a combined segmentation score. It works like college grades: your marks in each subject are converted into subject grades, and a final grade is then computed by combining the individual grades.

We’re going to divide our customers into three equal sections (33% in each section) and assign scores from 1 to 3 (best to worst) to each section.

For recency, we'll assign a score of "1" to the customers who have purchased most recently (first 33%), a score of "2" to the middle group, and "3" to the third group (customers who last purchased long ago). Since customers who have purchased recently are more likely to buy again, we assign better scores to them.

For frequency and monetary value, the direction is reversed. With customers sorted from lowest to highest value, we'll assign a score of "1" to the last 33% (those who shop most frequently and spend the most) and a score of "3" to the first 33% (those who shop least frequently and spend the least).

In the end, we'll get a segmentation grade ranging from "111" to "333" (best to worst) and an aggregated score ranging from 3 to 9.

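Our scoring matrix is as follows:

  • Score 1: the most recent third of customers by recency; the top third by frequency and by monetary value.
  • Score 2: the middle third on each indicator.
  • Score 3: the least recent third of customers by recency; the bottom third by frequency and by monetary value.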

Let’s have a look at the code for RFM Scoring. The benchmark for the scoring will depend on the percentile of each indicator.
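The sketch below illustrates the percentile-based scoring in plain Python rather than PySpark. It is one possible reading of the scheme described above, not the code from the repository: the customers and their values are hypothetical, and the two cut points (roughly the 33rd and 66th percentiles) come from `statistics.quantiles` in the standard library.

```python
from statistics import quantiles

# Hypothetical per-customer RFM values.
rfm = {
    "A": {"recency": 2, "frequency": 12, "monetary": 900.0},
    "B": {"recency": 5, "frequency": 9, "monetary": 750.0},
    "C": {"recency": 40, "frequency": 5, "monetary": 300.0},
    "D": {"recency": 60, "frequency": 4, "monetary": 250.0},
    "E": {"recency": 200, "frequency": 1, "monetary": 40.0},
    "F": {"recency": 300, "frequency": 2, "monetary": 35.0},
}


def tercile_score(value, low_cut, high_cut, lower_is_better):
    """Map a value to a score from 1 (best) to 3 (worst) using two cut points."""
    if value <= low_cut:
        bucket = 1
    elif value <= high_cut:
        bucket = 2
    else:
        bucket = 3
    # Bucket 1 holds the lowest values: best for recency, worst for the others.
    return bucket if lower_is_better else 4 - bucket


def score_customers(rfm):
    """Attach r/f/m scores, the combined grade, and the aggregated score."""
    cuts = {}
    for key in ("recency", "frequency", "monetary"):
        values = [metrics[key] for metrics in rfm.values()]
        cuts[key] = quantiles(values, n=3)  # two cut points per indicator
    scored = {}
    for customer_id, metrics in rfm.items():
        r = tercile_score(metrics["recency"], *cuts["recency"], lower_is_better=True)
        f = tercile_score(metrics["frequency"], *cuts["frequency"], lower_is_better=False)
        m = tercile_score(metrics["monetary"], *cuts["monetary"], lower_is_better=False)
        scored[customer_id] = {
            "r_score": r,
            "f_score": f,
            "m_score": m,
            "rfm_grade": f"{r}{f}{m}",   # elementwise grade, "111" to "333"
            "rfm_score": r + f + m,      # aggregated score, 3 to 9
        }
    return scored


scored = score_customers(rfm)
for customer_id, scores in scored.items():
    print(customer_id, scores["rfm_grade"], scores["rfm_score"])
```

Because the cut points are derived from the data itself, customers with identical values always receive the same score, so the three groups are only approximately equal in size when there are many ties.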

Explore the RFM scores using Pandas+Seaborn

As we can see in the left chart, where we aggregate each customer's individual scores into a single rfm-score, we get a fairly evenly distributed plot. But when we look into the elementwise scores (right chart), we see they are not that evenly distributed, and the aggregation loses important details. For example, an rfm-score of "5" (in the left chart) can be achieved by the elementwise score "212", "131", "221", or even "113". These are not the same kind of customer.
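To make the loss of detail concrete, this small sketch lists every elementwise grade that collapses into a given aggregated score. The helper name `grades_for_total` is made up for this example.

```python
from itertools import product


def grades_for_total(total):
    """List every r/f/m grade (each score in 1..3) whose scores sum to `total`."""
    return [
        "".join(str(score) for score in combo)
        for combo in product((1, 2, 3), repeat=3)
        if sum(combo) == total
    ]


grades_5 = grades_for_total(5)
print(grades_5)  # ['113', '122', '131', '212', '221', '311']
```

Six different customer profiles share the aggregated score 5, from "113" (recent and frequent, but low spending) to "311" (frequent and high spending, but long inactive).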

Segmentation based on RFM Score

So far, we have computed each customer's individual recency, frequency, and monetary value, then their separate r_score, f_score, and m_score, and finally an aggregated rfm-score. We've also kept the elementwise rfm-scores.

Now, depending on the business requirements we can divide the customer base in whichever way we want. However, for simplicity, we are going to divide our customer base into 3 segments based on the aggregated rfm-score and assign a loyalty badge (Platinum, Gold, Silver):

  • Segment 1 (Platinum): first 33%
  • Segment 2 (Gold): 33% – 66%
  • Segment 3 (Silver): last 33%
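The badge assignment can be sketched like this in plain Python. It assumes, in line with the 1-is-best scoring above, that a lower aggregated rfm-score is better, and it again derives the two cut points with `statistics.quantiles`. The customer scores are hypothetical.

```python
from statistics import quantiles

# Hypothetical aggregated rfm-scores (3 = best, 9 = worst).
rfm_scores = {"A": 3, "B": 4, "C": 5, "D": 6, "E": 6, "F": 7, "G": 8, "H": 9, "I": 9}


def assign_badges(rfm_scores):
    """Split customers into three loyalty levels by aggregated rfm-score."""
    low_cut, high_cut = quantiles(list(rfm_scores.values()), n=3)
    badges = {}
    for customer_id, score in rfm_scores.items():
        if score <= low_cut:
            badges[customer_id] = "Platinum"  # best third (lowest scores)
        elif score <= high_cut:
            badges[customer_id] = "Gold"      # middle third
        else:
            badges[customer_id] = "Silver"    # worst third (highest scores)
    return badges


badges = assign_badges(rfm_scores)
for customer_id, badge in badges.items():
    print(customer_id, rfm_scores[customer_id], badge)
```

Since the aggregated score only takes seven distinct values, many customers tie, so on real data the three segments will rarely be exactly 33% each.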

Inspect our 3 loyalty levels using Pandas+Seaborn

Inspect the result

Finally, let’s look into the segmentation result using a distribution chart.

Interestingly enough, when we observe the recency-vs-monetary and recency-vs-frequency charts, a clear contrast can be observed between our Platinum and Silver customers. It looks like Platinum customers tend to shop more frequently and spend more than others (which is great). Silver customers, on the other hand, are likely to have lost interest in shopping with us: they shop less often and spend less (which is not so great). The Gold customers are in the middle, and it looks like there has been more activity among them recently.

So, the takeaway from the analysis is that we need to think about how to keep our Platinum customers satisfied so that they continue shopping with us. We might also have to think about how we can influence our Gold customers to increase their engagement with us. And it might be too late to pursue those Silver customers.

Monetary vs Recency

Frequency vs Recency

Monetary vs Frequency


Conclusion

The RFM segmentation model is an insightful and powerful tool for understanding your customers. In this article, I tried to show how you can implement the RFM managerial segmentation model with PySpark. I hope this post was helpful. I'll try to show you a machine learning approach to customer segmentation in a future post.

