The world’s leading publication for data science, AI, and ML professionals.

A Simple Way to Segment Customers Using Google BigQuery and Data Studio

A guide to RFM Segmentation and visualization of the resulting segments

Photo by Jacek Dylag on Unsplash
Photo by Jacek Dylag on Unsplash

Catering to thousands of customers can be a handful – each customer can have their own needs and wants, and it would seem like a never-ending challenge to satisfy them all. This is where market segmentation can be beneficial.

Market segmentation is defined as the separation of a group of customers into subgroups of customers according to their needs and preferences. Customers belonging to the same segment would have roughly similar needs and preferences. This analysis makes it easier for companies to tailor their products and services to meet each segment’s needs.

There are several ways to segment customers – methods can range from simple grouping according to demographics, such as males vs females, age group or generations, and grouping by country, to more sophisticated multivariate statistical techniques, like cluster analysis. In this article, I shall demonstrate a simple method using percentiles that can be applied to any retail purchase data.

Apart from the segmentation analysis, I shall also show a method to visualize the segments in a way that will be actionable for company decision-makers. To give a preview, here’s the dashboard that I have created using Google Data Studio:


What is RFM Segmentation?

RFM stands for Recency, Frequency, and Monetary. The following buying behaviors are analyzed:

  • Recency – how recent the last transaction is. We would want to continually engage recent buyers and discover why less recent buyers have lapsed.
  • Frequency – how many times the customer has bought from us.
  • Monetary – how much each customer has paid for our products and services.

Using these three pieces of information we can come up with a grouping that will help inform strategic business decisions. For example, the UK Data & Marketing Association (DMA) laid out 11 segments, and specified marketing strategies according to their respective characteristics:

Source: UK Data & Marketing Association (DMA)
Source: UK Data & Marketing Association (DMA)

Note that this type of segmentation focuses on the actual buying behavior and ignores the differences in motivations, intentions, and lifestyles of consumers. RFM is nonetheless a useful start-off point, and because of its simplicity can be executed fast and in an automated way, giving companies the power to act and decide on business strategies swiftly.


The dataset

The data used for this demo is the Online Retail Dataset from the UCI Machine Learning Repository. It is a dataset containing transactions from an online retail store. Any other dataset that contains sales and order dates can be used to do the same analysis.

The data was downloaded from the website and uploaded manually to Google BigQuery, a scalable and cost-effective data warehouse. If your data comes from an online store, integrations can be set up so that the data flows from the shop to a third-party connector, and then to your BigQuery warehouse. If you have this setup, the data will refresh regularly, ensuring that you get the most updated insights about your business.

I will not go into detail on how to set up the data warehouse as I intend to focus more on the Analytics and visualization, but BigQuery has quite extensive documentation here, for reference.

The data used for this analysis has been cleaned to remove outliers and negative sales values.


RFM Segmentation in BigQuery

The RFM Segmentation can be executed using these four steps:

  1. Compute for recency, frequency, and monetary values per customer
  2. Determine quintiles for each RFM metric
  3. Assign scores for each RFM metric
  4. Define the RFM segments using the scores in step 3

The whole code can be accessed here but I shall go through each step and explain the chunks of SQL code.

Step 1: Compute for recency, frequency, and monetary values per customer

For monetary, this is just a simple sum of sales, while for frequency, this is a count of distinct invoice numbers per customer.

For recency, I chose a reference date, which is the most recent purchase in the dataset. In other situations, one may select the date when the data was analyzed instead.

After choosing the reference date, we get the date difference between the reference date and the last purchase date of each customer. This is the recency value for that particular customer.

Here’s a snippet of the code chunk used for this calculation:

WITH 
--Compute for F & M
t1 AS (
    SELECT  
    CustomerID,
    Country,
    MAX(InvoiceDate) AS last_purchase_date,
    COUNT(DISTINCT InvoiceNo) AS frequency,
    SUM(Sales) AS monetary 
    FROM `online_retail_preprocessed_nooutliers`
    GROUP BY CustomerID, Country 
),
--Compute for R
t2 AS (
    SELECT *,
    DATE_DIFF(reference_date, last_purchase_date, DAY) AS recency
    FROM (
        SELECT  *,
        MAX(last_purchase_date) OVER () + 1 AS reference_date
        FROM t1
    )  
),

Step 2: Determine quintiles for each RFM metric

The next step would be to group the customers into quintiles in terms of their RFM values – we divide the customers into 5 equal groups, according to how high and low they scored in the RFM metrics. We do this for each of recency, frequency, and monetary values per customer. I used Bigquery‘s APPROX_QUANTILES() to achieve this:

t3 AS (
SELECT 
    a.*,
    --All percentiles for MONETARY
    b.percentiles[offset(20)] AS m20, 
    b.percentiles[offset(40)] AS m40,
    b.percentiles[offset(60)] AS m60, 
    b.percentiles[offset(80)] AS m80,
    b.percentiles[offset(100)] AS m100,    
    --All percentiles for FREQUENCY
    c.percentiles[offset(20)] AS f20, 
    c.percentiles[offset(40)] AS f40,
    c.percentiles[offset(60)] AS f60, 
    c.percentiles[offset(80)] AS f80,
    c.percentiles[offset(100)] AS f100,    
    --All percentiles for RECENCY
    d.percentiles[offset(20)] AS r20, 
    d.percentiles[offset(40)] AS r40,
    d.percentiles[offset(60)] AS r60, 
    d.percentiles[offset(80)] AS r80,
    d.percentiles[offset(100)] AS r100
FROM 
    t2 a,
    (SELECT APPROX_QUANTILES(monetary, 100) percentiles FROM
    t2) b,
    (SELECT APPROX_QUANTILES(frequency, 100) percentiles FROM
    t2) c,
    (SELECT APPROX_QUANTILES(recency, 100) percentiles FROM
    t2) d
),

Step 3: Assign scores for each RFM metric

Now that we know how each customer fare relative to other customers in terms of RFM values, we can now assign scores from 1 to 5.

Just keep in mind that while with F and M, we give higher scores for higher quintiles, R should be reversed as more recent customers should be scored higher in this metric.

t4 AS (
    SELECT *, 
    CAST(ROUND((f_score + m_score) / 2, 0) AS INT64) AS fm_score
    FROM (
        SELECT *, 
        CASE WHEN monetary <= m20 THEN 1
            WHEN monetary <= m40 AND monetary > m20 THEN 2 
            WHEN monetary <= m60 AND monetary > m40 THEN 3 
            WHEN monetary <= m80 AND monetary > m60 THEN 4 
            WHEN monetary <= m100 AND monetary > m80 THEN 5
        END AS m_score,
        CASE WHEN frequency <= f20 THEN 1
            WHEN frequency <= f40 AND frequency > f20 THEN 2 
            WHEN frequency <= f60 AND frequency > f40 THEN 3 
            WHEN frequency <= f80 AND frequency > f60 THEN 4 
            WHEN frequency <= f100 AND frequency > f80 THEN 5
        END AS f_score,
        --Recency scoring is reversed
        CASE WHEN recency <= r20 THEN 5
            WHEN recency <= r40 AND recency > r20 THEN 4 
            WHEN recency <= r60 AND recency > r40 THEN 3 
            WHEN recency <= r80 AND recency > r60 THEN 2 
            WHEN recency <= r100 AND recency > r80 THEN 1
        END AS r_score,
        FROM t3
        )
),

Step 4: Define the RFM segments using the scores in step 3

The next step is to combine the scores we obtained from step 4 to define the RFM segment each customer will belong to.

As there are 5 groups for each of the R, F, and M metrics, there are 125 potential permutations, a number that is too much to manage in terms of marketing strategies. For this, I would refer to the 11 personas defined in the DMA guide and was also introduced earlier in this article.

One way to cut down the segment permutations is to combine the scores in the frequency and monetary aspect in terms of averaging (as both of them are indicative to purchase volume anyway). After that, I used the 11 personas in the DMA as a guide and defined the R vs. FM scores accordingly.

For example, in the Champions segment, customers should have bought recently, bought often, and spent the most. Therefore, their R score should be 5 and their combined FM score should be 4 or 5. On the other hand, Can’t Lose Them customers made the biggest purchases, and often, but haven’t returned for a long time. Hence their R score should be 1, and FM score should be 4 or 5.

Here’s the code snippet for this step, and the rest of the definitions can be found here as well:

t5 AS (
    SELECT 
        CustomerID, 
        Country,
        recency,
        frequency, 
        monetary,
        r_score,
        f_score,
        m_score,
        fm_score,
        CASE WHEN (r_score = 5 AND fm_score = 5) 
            OR (r_score = 5 AND fm_score = 4) 
            OR (r_score = 4 AND fm_score = 5) 
        THEN 'Champions'
        WHEN (r_score = 5 AND fm_score =3) 
            OR (r_score = 4 AND fm_score = 4)
            OR (r_score = 3 AND fm_score = 5)
            OR (r_score = 3 AND fm_score = 4)
        THEN 'Loyal Customers'
        WHEN (r_score = 5 AND fm_score = 2) 
            OR (r_score = 4 AND fm_score = 2)
            OR (r_score = 3 AND fm_score = 3)
            OR (r_score = 4 AND fm_score = 3)
        THEN 'Potential Loyalists'
        WHEN r_score = 5 AND fm_score = 1 THEN 'Recent Customers'
        WHEN (r_score = 4 AND fm_score = 1) 
            OR (r_score = 3 AND fm_score = 1)
        THEN 'Promising'
        WHEN (r_score = 3 AND fm_score = 2) 
            OR (r_score = 2 AND fm_score = 3)
            OR (r_score = 2 AND fm_score = 2)
        THEN 'Customers Needing Attention'
        WHEN r_score = 2 AND fm_score = 1 THEN 'About to Sleep'
        WHEN (r_score = 2 AND fm_score = 5) 
            OR (r_score = 2 AND fm_score = 4)
            OR (r_score = 1 AND fm_score = 3)
        THEN 'At Risk'
        WHEN (r_score = 1 AND fm_score = 5)
            OR (r_score = 1 AND fm_score = 4)        
        THEN 'Cant Lose Them'
        WHEN r_score = 1 AND fm_score = 2 THEN 'Hibernating'
        WHEN r_score = 1 AND fm_score = 1 THEN 'Lost'
        END AS rfm_segment 
    FROM t4
)

After this step, each customer should have an RFM segment assignment like this:

Image by the Author
Image by the Author

Connecting BigQuery to Data Studio

Now we have all the data we need. The next step is to visualize the segments in a Data Studio dashboard. To be able to do this, we would need to connect BigQuery to Data Studio and create a data source inside Data Studio. Here are the steps to accomplish this:

  1. On the Data Studio home page, click Create. Choose Data Source.
  2. A variety of Google Connectors will appear. Choose BigQuery.
  3. Choose the BigQuery project and dataset for the transactional data.
  4. Click Connect.
  5. All the fields inside the data will appear. Check that the data formats are correct, and add custom calculated fields when needed.
  6. Click Create a Report.

A blank dashboard will appear, and you are now ready for the fun part – visualizing the RFM segments!


Creating the Dashboard

The RFM dashboard I created contains these key elements:

  • Page Filters
  • Topline Performance on Key Metrics
  • RFM Segment Sizes
  • Detailed Recency, Frequency, and Monetary Data
  • Specific Products Bought

Topline Performance on Key Metrics

This section gives context about the business’s health and provides benchmarks to compare against each segment’s performance. Simple scorecards shall suffice for this section.

Image by the Author
Image by the Author

RFM Segment Sizes

When doing Market Segmentation, it’s good practice to visualize the relative sizes of each segment. This will inform strategy and prioritization. For instance, knowing which segments contribute more to headcount and sales can give us an idea of who we should be reaching out to for loyalty privileges.

Using bubble charts as a method of visualization enables us to display three information at a time: Monetary (sizes of the bubbles), Frequency (y-axis), and Recency (x-axis). In doing so, we can visualize how different these segments are from each other.

In this example, we see that Champions are relatively large in terms of revenue, is lying on the upper end of the y-axis in terms of frequency, and on the left end of the recency spectrum. On the other end are the Hibernating and Lost customers, on the lower right part of the chart.

Image by the Author
Image by the Author

Detailed Recency, Frequency, and Monetary Data

This part summarizes how each segment performs in terms of several key measures: number and percentage of customers belonging to each segment, and the average recency, frequency, and monetary values for each segment.

Image by the Author
Image by the Author

Products Bought

This portion will let you determine which particular products a segment would most likely buy. Likewise, this will give us an idea as to which products most segments do not care about buying.

Image by the Author
Image by the Author

For this part, I had to join back the RFM segmentation results to the main dataset that contains each product ordered. Here is the code I used to do this:

SELECT  a.*,
rfm_segment
FROM `online_retail_preprocessed_nooutliers` a
LEFT JOIN
(SELECT CustomerID, 
rfm_segment
FROM `online_retail_rfm`) b
ON a.CustomerID = b.CustomerID

Summary

In this article, I demonstrated a simple way to execute RFM Segmentation, and display results into a dashboard. This type of analytics is designed to help businesses quickly know the different customer personas, and create targeted marketing actions for each of these personas.

This type of analysis can be applied to other retail and sales data, for as long as we have information about the recency, frequency, and monetary behavior of each customer.

Again, this is just one of the several ways of doing market segmentation. For example, in this Medium article, the author used a more advanced statistical methodology (k-means algorithm) and utilized BigQuery ML to do the segmentation. It’s up to the analyst and business stakeholders to determine which method would be most fitting in terms of data complexity and timelines. In the end, the segments should be clear enough and actionable for the analysis to benefit the business.

I hope you enjoy creating your own market segmentation analysis!

All the code I created for this article can be found on my GitHub.


Useful Resources for Google Data Studio and Big Query:

Special thanks to the following resources for the guide and inspiration in creating this article:


Related Articles