
Customer segmentation – Part II

Segmentation of online customers by RFM-country and combination with part I

Photo by Hal Gatewood on Unsplash

Customer segmentation is one of the most common uses of data analysis and data science. This is the second part of a two-post series in which we work through an example of customer segmentation. The dataset we use is the Online Retail II data set, which contains the transactions of a UK-based online retailer between 01/12/2009 and 09/12/2011. The dataset contains 1,067,371 rows describing the purchases of 5,943 customers.

First rows of Online Retail II data set

In the first part, we created a customer segmentation based on the product categories. In this second part, we are going to perform a clustering based on Recency, Frequency, Monetary Value (RFM), and country of origin. Then, we are going to combine the results with the segmentation of part I.

The code for the RFM-Country based segmentation, as well as for combining the results of the two segmentations, can be found on GitHub.

Recency, Frequency, Monetary Value (RFM)

Recency, Frequency, Monetary Value (RFM) is a way of analyzing customer value. The name comes from the three aspects it examines for each customer:

  • Recency: how recent the customer's last purchase was. A customer who bought something recently has more value than a customer whose last purchase was long ago.
  • Frequency: how frequently a customer makes a purchase. The more frequently a customer buys, the better.
  • Monetary Value: how much a customer spends. The higher, the better.

Both Investopedia and Wikipedia have extensive articles on RFM.

There are several ways to define Recency, Frequency, and Monetary Value. We are going to use the following definitions:

  • Recency is the number of days from the customer's last InvoiceDate to the most recent InvoiceDate of the dataset (2011-12-09),
  • Frequency is the number of days from the customer's first InvoiceDate to the last InvoiceDate, divided by the number of invoices,
  • Monetary value is the average cost, where cost is the product of Quantity and Price.

The code that calculates RFM is listed below.
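As a rough illustration of the three definitions above, here is a minimal pandas sketch. It is not the code from the repository: the function name `compute_rfm`, the column names (`Customer ID`, `Invoice`, `InvoiceDate`, `Quantity`, `Price`) and the choice to average the cost per invoice (rather than per line item) are assumptions made for this example, and the four transactions are toy data.

```python
import pandas as pd


def compute_rfm(df, snapshot_date=None):
    """Return Recency, Frequency and MonetaryValue per customer.

    Hypothetical helper; the column names are assumptions about the data layout.
    """
    df = df.copy()
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    df["Cost"] = df["Quantity"] * df["Price"]

    # Reference date: the most recent InvoiceDate in the data unless one is given.
    if snapshot_date is None:
        snapshot_date = df["InvoiceDate"].max()
    snapshot_date = pd.Timestamp(snapshot_date)

    per_customer = df.groupby("Customer ID")
    first_purchase = per_customer["InvoiceDate"].min()
    last_purchase = per_customer["InvoiceDate"].max()
    n_invoices = per_customer["Invoice"].nunique()

    # Assumption: "average cost" is read as the mean of the per-invoice totals.
    invoice_totals = df.groupby(["Customer ID", "Invoice"])["Cost"].sum()
    monetary = invoice_totals.groupby(level="Customer ID").mean()

    return pd.DataFrame({
        "Recency": (snapshot_date - last_purchase).dt.days,
        "Frequency": (last_purchase - first_purchase).dt.days / n_invoices,
        "MonetaryValue": monetary,
    })


# Toy transactions, invented for the example.
transactions = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": ["2011-11-01", "2011-11-01", "2011-12-01", "2011-12-09"],
    "Quantity": [2, 1, 3, 10],
    "Price": [5.0, 10.0, 2.0, 1.5],
})
rfm = compute_rfm(transactions)
print(rfm)
```

Note that with this definition of Frequency a lower number means a more frequent buyer, and a customer with a single invoice gets a Frequency of 0, which may need special handling before clustering.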

Country

The Online Retail II dataset contains the country each item was shipped to. Using each country’s GDP, we obtain a proxy for the financial strength of each customer. GDP data from Wikipedia are used to create an xlsx file, which you can also find on GitHub. A few customers are related to two countries. To deal with this, for every customer we weight each country’s GDP by the number of the customer’s invoices for that country.
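The weighting can be sketched as follows. This is an illustration under assumptions, not the repository code: the helper name `weighted_gdp_per_customer` and the column names are hypothetical, and the GDP values are arbitrary placeholders rather than real figures.

```python
import pandas as pd


def weighted_gdp_per_customer(df, gdp_by_country):
    """Invoice-weighted average GDP per customer (hypothetical helper)."""
    # One row per (customer, country, invoice), so line items are not double counted.
    invoices = df[["Customer ID", "Country", "Invoice"]].drop_duplicates()
    counts = (
        invoices.groupby(["Customer ID", "Country"])
        .size()
        .rename("n_invoices")
        .reset_index()
    )
    counts["GDP"] = counts["Country"].map(gdp_by_country)
    counts["weighted"] = counts["GDP"] * counts["n_invoices"]
    totals = counts.groupby("Customer ID")[["weighted", "n_invoices"]].sum()
    return (totals["weighted"] / totals["n_invoices"]).rename("WeightedGDP")


# Placeholder GDP values, NOT real data.
gdp_by_country = {"United Kingdom": 100.0, "France": 200.0}

orders = pd.DataFrame({
    "Customer ID": [1, 1, 1, 1, 1, 2, 2],
    "Invoice": ["A1", "A1", "A2", "A3", "A4", "B1", "B2"],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "United Kingdom", "France", "France", "France"],
})
weighted_gdp = weighted_gdp_per_customer(orders, gdp_by_country)
print(weighted_gdp)
```

Customer 1 has three invoices for the first country and one for the second, so the weighted value sits closer to the first country's GDP.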

Result of calculation of weighted GDP per customer

Customer Segmentation based on RFM – Country GDP

The final preprocessing step is to combine RFM analysis with GDP data.

For clustering, we will use the k-Means algorithm. After scaling the input data we perform k-Means clustering for a range of k (= number of clusters created). This will allow us to create a plot of the total sum of squared distances per k and use the elbow method to select the best value for k.
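A minimal sketch of this step with scikit-learn is shown below. The post does not say which scaler was used, so `StandardScaler` is an assumption, and the input is synthetic data standing in for the RFM and weighted GDP table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 4 input features (R, F, M, weighted GDP):
# three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
features = np.vstack(
    [rng.normal(loc=center, scale=0.5, size=(50, 4)) for center in (0.0, 5.0, 10.0)]
)

# Assumption: standardize every feature to zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)

# Fit k-Means for a range of k and record the total within-cluster
# sum of squared distances (exposed by scikit-learn as `inertia_`).
inertia_per_k = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertia_per_k[k] = model.inertia_

for k, inertia in inertia_per_k.items():
    print(k, round(inertia, 1))
```

Plotting `inertia_per_k` against k gives the elbow plot: the value of k after which the curve flattens is a candidate for the number of clusters.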

Based on the elbow method, we will examine clustering into 3 or into 7 clusters. For profiling the clusters, we create a custom function named cluster_profile_RFM_country. This function

  1. calculates the median of recency, frequency, monetary value, and weighted GDP for each cluster, then
  2. sums the medians of each variable (recency, frequency, monetary value, and weighted GDP) over the clusters, and
  3. divides each median by the corresponding sum.

This way, cluster_profile_RFM_country calculates the share of each variable (recency, frequency, monetary value, and weighted GDP) that corresponds to each cluster. Clusters with few members can optionally be excluded from the profiling.
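The three steps could look roughly like the sketch below. It is a reconstruction from the description, not the actual cluster_profile_RFM_country: the name `cluster_profile_sketch`, its signature, and the `min_size` parameter are hypothetical, and it assumes the sums in step 2 are taken per variable across clusters.

```python
import pandas as pd


def cluster_profile_sketch(data, labels, min_size=0):
    """Share of each variable's cluster medians per cluster (hypothetical helper)."""
    labels = pd.Series(labels, index=data.index, name="cluster")

    # Optionally drop clusters with fewer than `min_size` members.
    sizes = labels.value_counts()
    keep = labels.isin(sizes[sizes >= min_size].index)

    # Step 1: median of every variable per cluster.
    medians = data[keep].groupby(labels[keep]).median()
    # Steps 2 and 3: divide each median by the column-wise sum of medians.
    return medians / medians.sum(axis=0)


# Toy data: two regular clusters and one single-customer outlier cluster.
toy = pd.DataFrame({
    "Recency": [8, 10, 12, 28, 30, 32, 1000],
    "Frequency": [14, 15, 16, 44, 45, 46, 1],
    "MonetaryValue": [230, 240, 250, 150, 160, 170, 9000],
    "WeightedGDP": [2, 2, 2, 2, 2, 2, 1],
})
toy_labels = [0, 0, 0, 1, 1, 1, 2]

profile = cluster_profile_sketch(toy, toy_labels, min_size=2)
print(profile)
```

Each column of the result sums to 1, so the values can be read as the share of that variable attributed to each cluster.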

If we cluster into 3 groups, we end up with clusters of 4,225, 1,708, and 9 customers. The small size of the third cluster is an indication of outliers. If we use the first two clusters for profiling, we see that the first cluster contains customers

  • that have made a transaction more recently (in the last 9 days vs. the last 91),
  • that buy more frequently (every 15 days vs. every 85), and
  • that have a bigger monetary value per transaction (243 vs. 197)

than the customers of the second cluster.

Profiling of the two major clusters when clustering in 3 groups

Selecting 7 clusters, we obtain clusters with 3,549, 1,547, 572, 247, 17, 9, and 1 customers. Profiling the clusters with more than 100 customers, we have

  • cluster 4 with 572 customers. Cluster 4 contains the customers with the lowest RFM score (i.e. customers that shop less frequently, with a lower monetary value per transaction, and that shopped longer ago than those in the other clusters).
  • cluster 0 with 1,547 customers. Cluster 0 contains the customers with the second lowest RFM score.
  • cluster 2 with 3,549 customers and cluster 5 with 247 customers. These contain the customers with the best RFM score. The difference between the two is that cluster 5 has customers related to countries with lower GDP than the rest.

We could say that these four clusters can be ranked by RFM – Country score (from worst to best) as:

cluster 4 < cluster 0 < cluster 5 < cluster 2. (Cluster 5 has a higher monetary value than cluster 2 but contains customers from countries with lower GDP, so we prioritize cluster 2.)

Cluster 5 has customers from countries with a lower GDP than the rest.

Profiling of the four major clusters when clustering in 7 groups

Using hierarchical clustering we can gain a better understanding of the possible number of clusters.
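For readers who want to try this, a minimal sketch with SciPy follows. The linkage method is not stated in the post, so Ward linkage is an assumption here, and the data are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic, already scaled stand-in for the 4 input features.
rng = np.random.default_rng(0)
points = np.vstack(
    [rng.normal(loc=center, scale=0.3, size=(20, 4)) for center in (0.0, 5.0, 10.0)]
)

# Build the merge tree (assumption: Ward linkage on Euclidean distances).
tree = linkage(points, method="ward")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
labels_3 = fcluster(tree, t=3, criterion="maxclust")
cluster_sizes = np.bincount(labels_3)[1:]
print(cluster_sizes)

# The third column of `tree` holds the merge distances; a large jump between
# consecutive late merges hints at a natural number of clusters.
print(tree[-4:, 2])
```

`scipy.cluster.hierarchy.dendrogram(tree)` draws the corresponding dendrogram when matplotlib is available.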

Hierarchical clustering supports the choice of splitting into 3 or 7 clusters. Profiling the major clusters in both cases, we see that the results are similar to those of k-Means.

Profiling major clusters in hierarchical clustering. To the left two major clusters when clustering into 3 groups. To the right four major clusters when clustering into 7 groups.

We will use the k-Means solution with 7 clusters. The relevant information is exported with pickle.
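Exporting and re-importing such a result with pickle can be sketched as below. The file name, the column names, and the three toy rows are placeholders for this example; the repository may store different objects.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy cluster assignment; labels and IDs are invented for the example.
segmentation = pd.DataFrame({
    "Customer ID": [1, 2, 3],
    "rfm_country_cluster": [2, 0, 4],
})

# Write to a temporary folder so the example leaves no files behind in the project.
path = os.path.join(tempfile.mkdtemp(), "rfm_country_segmentation.pkl")
with open(path, "wb") as handle:
    pickle.dump(segmentation, handle)

# Later (e.g. in the notebook that combines the segmentations) load it back.
with open(path, "rb") as handle:
    restored = pickle.load(handle)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.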


Combining results with part I

Finally, we combine the results of the item category segmentation and the RFM-Country segmentation. The code can be found on GitHub.

The process is straightforward.

  1. Import the data of the two segmentations.
  2. Merge small clusters that can be seen as outliers into one.

For example, in the RFM-Country segmentation there are three clusters with fewer than 20 customers each. If we combined them with the item category segmentation, the resulting groups would contain a very small percentage of our customers, so we merge these three clusters into one. We also take the opportunity to rename the clusters so that the cluster with the lowest RFM score comes first (keeping the merged outlier cluster last).

  3. Combine the two segmentations.

  4. Create a cross-tabulation of the two segmentations.

    Cross-tabulation of the two segmentations. Left: number of customers per group. Right: percentage of customers per group.
  5. Describe the resulting customer groups.
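The merging, renaming, combining, and cross-tabulation steps of the list above can be sketched with pandas as follows. Everything here is illustrative: the helper `merge_small_clusters`, the series names, the label values, and the nine toy customers are assumptions, not the repository code.

```python
import pandas as pd


def merge_small_clusters(labels, min_size, merged_label):
    """Replace the labels of clusters smaller than `min_size` (hypothetical helper)."""
    sizes = labels.value_counts()
    small = sizes[sizes < min_size].index
    # Keep the label where the cluster is large enough, otherwise use merged_label.
    return labels.where(~labels.isin(small), merged_label)


customers = range(101, 110)
rfm_country = pd.Series([2, 2, 2, 2, 0, 0, 0, 6, 3], index=customers,
                        name="rfm_country_cluster")
item_category = pd.Series([1, 1, 0, 2, 1, 3, 2, 1, 1], index=customers,
                          name="item_cluster")

# Merge outlier clusters (here: fewer than 2 customers) into one.
merged = merge_small_clusters(rfm_country, min_size=2, merged_label=-1)

# Rename so that labels follow the RFM ranking, merged outliers last.
# Assumption for the toy data: cluster 0 ranks below cluster 2.
renamed = merged.map({0: 1, 2: 2, -1: 3})

# Combine the two segmentations and cross-tabulate them.
combined = pd.concat([item_category, renamed], axis=1)
counts = pd.crosstab(combined["item_cluster"], combined["rfm_country_cluster"])
shares = pd.crosstab(combined["item_cluster"], combined["rfm_country_cluster"],
                     normalize="all") * 100

print(counts)
print(shares.round(2))
```

`counts` corresponds to the left-hand table (customers per group) and `shares` to the right-hand one (percentage of all customers per group).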

Here, we will focus on a few customer segments. Based on their profile we can try to give a name to each group.

  • "High-value average customer": the biggest segment, with 2,501 customers (42.09% of the total). It contains customers that belong to cluster 1 of the item category segmentation and cluster 4 of the RFM-Country segmentation. This means that the customers in this segment shop from the general item category (i.e. they are not focused on bags, decoration, or Christmas items) and have a relatively high RFM score.
  • "High-value Christmas shopper": segment in item category cluster 0 and RFM-Country cluster 4. In our item category clustering, we have identified that cluster 0 corresponds to customers that have a tendency to buy more Christmas items. This segment contains 289 (4.86%) customers.
  • "High-value decoration lovers": segment in item category cluster 2 and RFM-Country cluster 4. In our item category clustering, we have identified that cluster 2 corresponds to customers that have a preference for decoration. This segment contains 489 (8.23%) customers.
  • "Low-value decoration lovers": segment in item category cluster 2 and RFM-Country cluster 1. This segment contains 69 (1.16%) customers. We could add to this segment customers in item category cluster 2 and RFM-Country cluster 2. This way the segment would contain 305 (5.13%) customers.
  • "High-value bag lovers": segment in item category cluster 3 and RFM-Country cluster 4. In our item category clustering, we have identified that cluster 3 corresponds to customers that have a preference for bags. This segment contains 270 (4.54%) customers.
  • "Low-value bag lovers": segment in item category cluster 3 and RFM-Country cluster 1. This segment contains 42 (0.71%) customers. We could also assign to this segment the customers in item category cluster 3 and RFM-Country cluster 2. This way the segment would contain 132 (2.22%) customers.

Final words

This ends our two-part introduction to customer segmentation. In a real-world scenario, a more detailed description of each segment would be provided. For every segment, we would calculate several KPIs, such as:

  • total cost per segment,
  • average cost per customer,
  • average cost per customer per transaction/invoice,
  • number of transactions/invoices per customer,
  • average frequency per customer,
  • country distribution per customer, etc.

All this information would be used to plan specific actions for each segment. Unfortunately, this is beyond the scope of our post. I hope you get the chance to see it in a real-life situation.

