
RFM Analysis

Finally, a clustering model you won’t have to spend time explaining.

Customer Segmentation with RFM Analysis

Image by Author

Any good data scientist is (or at least should be) adept at taking complex mathematical and statistical models and explaining them in a simple and concise manner. In the end, our job is to create value for our company or client. Even if we have a model with 99.9999999% accuracy, management is unlikely to use it to make decisions unless they understand (at the very least) the basics of the model.

Our Problem

A large part of any business is built around understanding the company’s clients, and ensuring their needs and wants are being satisfied. This helps us ensure our clients are actually using the products we’re creating/providing and that we’re spending our own resources in optimal business segments. A common approach to understanding our clients is to segregate them into distinct groups. Instead of trying to understand and develop products for hundreds of thousands of individual people or companies, we can instead focus our efforts on a few distinct groups which represent our underlying clients. This allows us to make more informed, targeted decisions that will have a greater impact. In short, it ensures we see the forest and not the trees.

If you’ve been around machine learning over the past few years, your brain has already switched into unsupervised learning mode and you’re thinking of coding up a k-means or nearest-neighbor model. I can’t deny that I’m usually right there with you. But let’s take a step back. Is there a simpler method? One that requires almost zero explanation to management? One that’s much less computationally expensive?

RFM to the Rescue

RFM analysis began in the mid-’90s, when companies were trying to find optimal groups for direct mail. Most cite Jan Bult and Tom Wansbeek’s article, "Optimal Selection for Direct Mail" in Marketing Science, as the first emergence of the idea. And the idea is simple.

RFM stands for Recency, Frequency, and Monetary Value. In short we want to group customers based on:

  1. How recent was their last transaction?
  2. How frequently do they purchase?
  3. How much money have they spent with us?

Customers who have purchased three items within the last month are more important than customers who purchased two items over the last three years. Customers who have spent $10,000 on our products/services are more important than customers who have spent $50. As you can see, this methodology requires little (if any) explanation.

The implementation of the analysis is equally simple. For each metric (R, F, M) we’ll divide our customers into N segments. We’ll then issue a rank to each segment from 1 to N (with 1 being optimal). At the end, we can use our RFM scores as-is, or we can use some aggregation method to develop one super-score for each customer.

Let’s get started

First, we’ll need a list of all invoices over the time period we’re analyzing. For each invoice we’ll only need three things:

  1. Client Name
  2. Invoice Date
  3. Amount

For any analyst, this should be nothing more than a simple SQL query.

We’ll then pull our SQL data into a pandas dataframe. (The data in this example has been anonymized.)
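As a rough sketch of this step, the snippet below loads invoice rows into pandas with `pd.read_sql_query`. The table name `invoices`, the column names `client_name`, `invoice_date`, and `amount`, and the in-memory SQLite database are all assumptions made for illustration; your own schema and database connection will differ.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for the real invoice database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (client_name TEXT, invoice_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        ("Client A", "2020-09-28", 1200.0),
        ("Client A", "2020-06-15", 800.0),
        ("Client B", "2019-11-02", 50.0),
    ],
)

# One row per invoice, with only the three fields RFM needs.
query = "SELECT client_name, invoice_date, amount FROM invoices"
invoices = pd.read_sql_query(query, conn, parse_dates=["invoice_date"])
conn.close()
```

Parsing the date column up front keeps the later recency arithmetic simple.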

Image by Author

We’ll now create a new dataframe (rfm) where we’ll house our RFM metrics. Our new dataframe will contain one row for each client, along with their individual R,F,M ratings.

We’ll need a few helper functions to generate our metrics. We can use Pandas’ apply to use these on our data.
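One possible way to compute the three raw metrics is sketched below. It swaps the per-client helper functions for a single `groupby` aggregation, which is an alternative rather than the original code. The sample data, the column names, and the `as_of` reference date are hypothetical.

```python
import pandas as pd

# Hypothetical invoice data; column names are assumptions.
invoices = pd.DataFrame(
    {
        "client_name": ["Client A", "Client A", "Client B",
                        "Client C", "Client C", "Client C"],
        "invoice_date": pd.to_datetime(
            ["2020-09-28", "2020-06-15", "2019-11-02",
             "2020-10-01", "2020-08-20", "2020-01-05"]
        ),
        "amount": [1200.0, 800.0, 50.0, 300.0, 450.0, 250.0],
    }
)

# Reference date for recency; assumed here to be the end of the analysis window.
as_of = pd.Timestamp("2020-10-15")

# One row per client: latest invoice date, invoice count, total spend.
rfm = invoices.groupby("client_name").agg(
    last_invoice=("invoice_date", "max"),
    frequency=("invoice_date", "count"),
    monetary=("amount", "sum"),
)

# Recency is the number of days since the client's most recent invoice.
rfm["recency_days"] = (as_of - rfm["last_invoice"]).dt.days
rfm = rfm.drop(columns="last_invoice")
```

Lower `recency_days` is better, while higher `frequency` and `monetary` are better, which matters when the metrics are ranked.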

Finally, we’ll need one last bit of code to split our customers into N groups and assign a ranking. For this example I’ve chosen 4 groups and am splitting based on simple quartiles.
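A minimal sketch of the quartile ranking, assuming an `rfm` dataframe with hypothetical columns `recency_days`, `frequency`, and `monetary`. The helper `quartile_rank` is an illustrative name, not a pandas function. It ranks the values first so that ties (common in purchase counts) don't produce duplicate bin edges in `pd.qcut`.

```python
import pandas as pd

# Hypothetical per-client metrics; column names are assumptions.
rfm = pd.DataFrame(
    {
        "recency_days": [5, 17, 40, 90, 120, 200, 348, 400],
        "frequency": [12, 9, 7, 5, 4, 3, 1, 1],
        "monetary": [9500.0, 10000.0, 4000.0, 2500.0, 900.0, 600.0, 50.0, 75.0],
    },
    index=[f"Client {c}" for c in "ABCDEFGH"],
)


def quartile_rank(values, higher_is_better):
    """Return a 1-4 rank per customer, where 1 is the best quartile."""
    # Position 1 goes to the best customer; method="first" breaks ties
    # by order of appearance so every customer gets a unique position.
    order = values.rank(method="first", ascending=not higher_is_better)
    return pd.qcut(order, q=4, labels=[1, 2, 3, 4]).astype(int)


rfm["R"] = quartile_rank(rfm["recency_days"], higher_is_better=False)
rfm["F"] = quartile_rank(rfm["frequency"], higher_is_better=True)
rfm["M"] = quartile_rank(rfm["monetary"], higher_is_better=True)
```

Changing `q=4` and the label list is all it takes to move from quartiles to a finer split.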

Our ranking is now complete. Now we can dive into our analysis and pull out different segments that are useful for management.

Image by Author

Final Analysis

With RFM analysis, there are many different types of customer segments we can choose. Of course our best customers (with recent, frequent purchases who have spent a large amount of money) are denoted by R=1, F=1, M=1 – or more concisely as (1–1–1). A listing of some other notable groups is below:

  1. Low-spending but Active, Loyal clients – (1–1–3|4)
  2. Best customers we let get away – (4–1|2–1|2)
  3. New, big-spending customers – (1–4–1|2)
  4. The list goes on and on…

All of these segments are easily accessible with a simple pandas filter.
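For instance, the segments listed above could be pulled out with boolean masks along these lines. The `rfm` dataframe, its `R`, `F`, `M` column names, and the client labels are made up for illustration.

```python
import pandas as pd

# Hypothetical ranked customers (1 = best quartile, 4 = worst).
rfm = pd.DataFrame(
    {"R": [1, 1, 4, 1, 3], "F": [1, 1, 2, 4, 3], "M": [1, 4, 1, 2, 3]},
    index=["Client A", "Client B", "Client C", "Client D", "Client E"],
)

# Best customers: (1-1-1)
best = rfm[(rfm["R"] == 1) & (rfm["F"] == 1) & (rfm["M"] == 1)]

# Low-spending but active, loyal clients: (1-1-3|4)
loyal_low_spend = rfm[
    (rfm["R"] == 1) & (rfm["F"] == 1) & rfm["M"].isin([3, 4])
]

# Best customers we let get away: (4-1|2-1|2)
lost_best = rfm[
    (rfm["R"] == 4) & rfm["F"].isin([1, 2]) & rfm["M"].isin([1, 2])
]

# New, big-spending customers: (1-4-1|2)
new_big_spenders = rfm[
    (rfm["R"] == 1) & (rfm["F"] == 4) & rfm["M"].isin([1, 2])
]
```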

Of course, what would an analysis be without a visualization? A small side-advantage of RFM analysis is that it provides 3 dimensions – making it easy to visualize. Let’s step into it with plotly.

Image by Author

Conclusion

Of course this analysis isn’t as elegant or intense as k-means or almost any true machine learning algorithm – but that’s not what we were after. We traded elegance for ease and simplicity.

Generating the data, coding the solution, and reviewing the results will probably take you no more than an extended coffee break. We also didn’t need a GPU… Developmentally and computationally, this analysis is a win.

As we discussed in the beginning, our job is to bring actionable insights to decision makers. Here again, time is money. The simplicity of our approach saves us time (in that we don’t have to develop a 10-page PowerPoint to explain our methodology) and saves management time (because they won’t have to sit through that PowerPoint). In the end, however, we’ve provided actionable segments that can be used throughout marketing and product development.

What will you do with all of that spare time?

P.S.

  • If you want to condense the data even further, you could sum the R,F,M metrics to produce one super-metric.
  • If you’d rather have more granular results, use more bins instead of the quartiles I used above.
  • If one or two of the metrics is more important than others, combine the metrics with weighted sums to reflect this – a(R) + b(F) + c(M) = y where a+b+c = 1.
  • The ability to customize this simple methodology is truly remarkable.
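The simple sum and the weighted sum from the list above might look like the sketch below. The weights (0.5, 0.3, 0.2), the column names, and the sample ranks are arbitrary choices for illustration. Since 1 is the best rank, a lower score means a better customer.

```python
import pandas as pd

# Hypothetical ranked customers (1 = best quartile, 4 = worst).
rfm = pd.DataFrame(
    {"R": [1, 4, 2], "F": [1, 2, 3], "M": [1, 1, 4]},
    index=["Client A", "Client B", "Client C"],
)

# Illustrative weights a, b, c; they must sum to 1.
weights = {"R": 0.5, "F": 0.3, "M": 0.2}
assert abs(sum(weights.values()) - 1.0) < 1e-9

# Unweighted super-metric: R + F + M
rfm["simple_sum"] = rfm[["R", "F", "M"]].sum(axis=1)

# Weighted super-metric: a(R) + b(F) + c(M)
rfm["weighted_score"] = sum(w * rfm[col] for col, w in weights.items())
```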

Thank you for taking the time to read. I truly appreciate it. The full code can be found on my GitHub.


Originally published at http://lowhangingfruitanalytics.com on October 15, 2020.

