Marketing automation — Customer segmentation

Published in

Towards Data Science

11 min read · Jun 17, 2021

Do you know your customers? Customer analytics is becoming critical. These insights power businesses’ sales, marketing, and product development efforts and studies show that companies that use customer analytics are more profitable.

Customers have access to information anywhere, anytime including where to shop, what to buy, how much to pay, and so on. This makes it increasingly important to utilize predictive analytics and data to forecast how customers will behave when interacting with brands.

The goal of customer analytics is to create a single, accurate view of a customer to make decisions about how best to acquire and retain customers, identify high-value customers and proactively interact with them. The better the understanding of a customer’s buying habits and lifestyle preferences, the more accurate predictive behaviors become and the better the customer journey becomes. Without large amounts of accurate data, any insight derived from the analysis could be wildly inaccurate.

In this article, I will show how to build a customer analytics dashboard and how to use this information in real life. For this experiment, I will use the Kaggle dataset called Retail Shop case study Dataset.

First, let’s do some exploratory data analysis and clean up the data.

The dataset contains three tables: customer, transaction, and product. The transaction table has the following columns:

transaction_id — transaction identifier;
cust_id — customer identifier;
tran_date — date of the transaction;
prod_subcat_code — product sub-category identifier;
prod_cat_code — product category identifier;
Qty — product quantity;
Rate — product price;
Tax — tax for this purchase;
total_amt — purchase value + tax;
Store_type — a type of the store where the purchase was made;

The transaction table has about 23K records and no NULL values. It is a toy dataset; in real life, we must check and clean the data carefully before using it.

The customer table has got the following columns:

customer_Id — customer identifier;
DOB — customer date of birth;
Gender — customer gender;
city_code — a city where a customer was registered in a store;

The customer table has 5647 unique customers, and it looks like some customers have no Gender or city_code information. I will show how to handle this issue later.

The product table has got the following columns:

prod_cat_code — product category identifier;
prod_cat — product category name;
prod_sub_cat_code — product sub-category identifier;
prod_subcat — product sub-category name;

The product table has 23 unique sub-categories and no records with null values.

Let’s join our tables and run some drill-down analytics on the transaction table. I would like to check the date range, how many customers the transaction table has, how many purchases they made, and the most popular product.
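The join and the drill-down questions above can be sketched with pandas as follows. This is a minimal illustration, not the original notebook: the column names follow the tables described earlier, while the sample rows and the variable names are made up for the example.

```python
import pandas as pd

# Tiny made-up sample; the real data comes from the Kaggle files.
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "cust_id": [100, 100, 101, 102],
    "tran_date": pd.to_datetime(
        ["2014-01-05", "2014-02-10", "2014-02-11", "2014-03-01"]
    ),
    "prod_subcat_code": [1, 2, 1, 1],
    "prod_cat_code": [1, 1, 1, 2],
    "Qty": [2, 1, 3, 1],
    "total_amt": [200.0, 50.0, 300.0, 80.0],
})
products = pd.DataFrame({
    "prod_cat_code": [1, 1, 2],
    "prod_cat": ["Books", "Books", "Clothing"],
    "prod_sub_cat_code": [1, 2, 1],
    "prod_subcat": ["Fiction", "Comics", "Mens"],
})

# The sub-category key is named differently in the two tables,
# so the join keys are listed explicitly for each side.
df = transactions.merge(
    products,
    left_on=["prod_cat_code", "prod_subcat_code"],
    right_on=["prod_cat_code", "prod_sub_cat_code"],
    how="left",
)

date_range = (df["tran_date"].min(), df["tran_date"].max())
n_customers = df["cust_id"].nunique()
purchases_per_customer = df.groupby("cust_id")["transaction_id"].nunique()
most_popular = df.groupby("prod_subcat")["Qty"].sum().idxmax()
```

Here "most popular" is taken to mean the sub-category with the largest total quantity; counting transactions instead would be an equally reasonable definition.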

Next, let’s look at summary statistics for the Quantity, Price, Tax, and Value columns (Qty, Rate, Tax, and total_amt).

As I wrote, this is a toy dataset, so we have no anomalous values. The only unusual values are negative ones, which represent purchase returns, and we will use them in further analysis.

Let’s start on customer segmentation. The first technique is RFM analysis. In my view, it is one of the simplest and fastest ways to get to know your customers. So, what is RFM analysis?

RFM is a method used for analyzing customer value. It is commonly used in database marketing and direct marketing and has received particular attention in the retail and professional services industries.

RFM stands for the three dimensions:

Recency — How recently did the customer purchase?
Frequency — How often do they purchase?
Monetary Value — How much do they spend?

There are also other variants of this technique:

RFD — Recency, Frequency, Duration is a modified version of RFM analysis that can be used to analyze consumer behavior of viewership/readership/surfing-oriented business products.

RFE — Recency, Frequency, Engagement is a broader version of the RFD analysis, where Engagement can be defined to include visit duration, pages per visit, or other such metrics.

RFM-I — Recency, Frequency, Monetary Value — Interactions is a version of the RFM framework modified to account for recency and frequency of marketing interactions with the client (e.g. to control for possible deterring effects of very frequent advertising engagements).

RFMTC — Recency, Frequency, Monetary Value, Time, Churn rate is an augmented RFM model proposed by I-Cheng et al. (2009). The model utilizes the Bernoulli sequence in probability theory and creates formulas that calculate the probability of a customer buying at the next promotional or marketing campaign. The model has been implemented by Alexandros Ioannidis for datasets such as the Blood Transfusion and CDNOW data sets.

The first thing we need to do is calculate three values, Recency, Frequency, and Monetary, for every customer in the dataset over some period of time. I will base the calculation on half-year windows and repeat it for the last three periods.
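A minimal sketch of this calculation for a single half-year window is shown below. The helper name rfm_table, the window boundaries, and the sample rows are assumptions made for illustration; Recency is measured here in days from the last purchase to the end of the window, which is one common convention.

```python
import pandas as pd

def rfm_table(tx, period_start, period_end):
    """Recency, Frequency, Monetary per customer for [period_start, period_end)."""
    start, end = pd.Timestamp(period_start), pd.Timestamp(period_end)
    window = tx[(tx["tran_date"] >= start) & (tx["tran_date"] < end)]
    rfm = window.groupby("cust_id").agg(
        last_purchase=("tran_date", "max"),
        Frequency=("transaction_id", "nunique"),
        Monetary=("total_amt", "sum"),
    )
    # Recency: days between the last purchase and the end of the window.
    rfm["Recency"] = (end - rfm["last_purchase"]).dt.days
    return rfm[["Recency", "Frequency", "Monetary"]]

# Made-up sample transactions.
tx = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "cust_id": [100, 100, 101, 102],
    "tran_date": pd.to_datetime(
        ["2014-07-01", "2014-11-30", "2014-09-15", "2014-01-10"]
    ),
    "total_amt": [200.0, 50.0, 300.0, 80.0],
})

rfm = rfm_table(tx, "2014-06-01", "2014-12-01")
```

Calling the same function with three consecutive windows gives the three periods. Customers with no purchases inside a window (customer 102 here) simply do not appear in that window's table.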

The next step creates three groupings, based on the Recency, Frequency, and Monetary Value of customers, which help us name our customer segments. The RFM score is a simple sum of the Recency, Frequency, and Monetary group values, so it is an integer such as 10, 9, or 8. This score lets us make decisions about a product or about our customers, which makes it an important metric for later decision-making.
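One possible way to build such a score is sketched below. The number of groups per dimension (quartiles labelled 1 to 4), the segment thresholds, and the helper names are assumptions for this example; the article does not state the exact cut-offs.

```python
import pandas as pd

# Made-up RFM values for eight customers.
rfm = pd.DataFrame({
    "Recency":   [5, 20, 40, 60, 90, 120, 150, 170],
    "Frequency": [12, 9, 8, 6, 4, 3, 2, 1],
    "Monetary":  [900.0, 700.0, 650.0, 400.0, 300.0, 200.0, 100.0, 50.0],
}, index=pd.Index(range(1, 9), name="cust_id"))

def quartile_score(values, higher_is_better=True):
    # Rank first so that tied values cannot produce duplicate bin edges.
    ranks = values.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, 4, labels=[1, 2, 3, 4]).astype(int)

# A small Recency is good, so its ranking is reversed.
rfm["R"] = quartile_score(rfm["Recency"], higher_is_better=False)
rfm["F"] = quartile_score(rfm["Frequency"])
rfm["M"] = quartile_score(rfm["Monetary"])
rfm["RFM_score"] = rfm[["R", "F", "M"]].sum(axis=1)

def name_segment(score):
    # Hypothetical thresholds; tune them to your own business.
    if score >= 10:
        return "Top"
    if score >= 6:
        return "Middle"
    return "Low"

rfm["segment"] = rfm["RFM_score"].apply(name_segment)
```

With quartile labels the score ranges from 3 to 12, which is consistent with values such as 10, 9, and 8 mentioned above.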

In the end, we have an RFM analysis for three periods of time.

But how can we use it, you may ask? First, we can calculate some global metrics for the different periods and analyze how they change over time.

That is interesting, but let’s do the same for our named RFM score groups.

Here we can see that we have increased the number of TOP clients over the last 1.5 years.

Let’s dive into each RFM value separately and analyze how it changes over time.

The period between purchases has increased, which indicates a decrease in customer engagement.

The average number of purchases has decreased for the Middle and Low groups of customers.

And here we can see a slight increase in Monetary for the TOP group and a decrease for the Middle and Low groups.

To draw conclusions, the next step is a drill-down analysis for every client, which helps us understand the flows between groups.

Here we can see the number of users that migrate from one group to another during the last half-year.
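A migration table of this kind can be produced with a cross-tabulation of each customer's segment in two consecutive periods. The sketch below uses made-up segment labels and hypothetical variable names.

```python
import pandas as pd

# Segment of each customer in two consecutive half-year periods (made up).
previous = pd.Series(
    {100: "Top", 101: "Middle", 102: "Middle", 103: "Low"},
    name="previous_period",
)
current = pd.Series(
    {100: "Top", 101: "Top", 102: "Low", 103: "Low"},
    name="current_period",
)

# Keep only customers that were scored in both periods.
segments = pd.concat([previous, current], axis=1).dropna()

# Rows: where customers came from; columns: where they ended up.
migration = pd.crosstab(segments["previous_period"], segments["current_period"])
```

Off-diagonal cells count customers who moved between groups, for example from Middle to Top or from Middle to Low.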

As you can see, RFM analysis is quite easy to implement, yet it offers many ways to analyze the results. What I have shown is only a small part of what RFM can tell you about your clients.

Let’s go on with further analysis. The next step is cohort analysis, which I will run for the year 2011. This analysis is based only on date values. Defining a cohort is the first step of cohort analysis. Let’s create monthly cohorts based on the month in which each customer made their first transaction (the earliest tran_date for that customer). The next step is to calculate a time offset for each customer transaction, which lets us report the metrics for each cohort consistently. To create the offsets, I will write a function that splits dates into year, month, and day columns, which makes date calculations easy.
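The cohort assignment and the time offsets can be sketched as follows. The helper split_date and the column names tran_month, cohort_month, and cohort_index are hypothetical names chosen for this example, and the sample rows are made up.

```python
import pandas as pd

tx = pd.DataFrame({
    "cust_id": [100, 100, 100, 101, 101],
    "tran_date": pd.to_datetime(
        ["2011-01-15", "2011-02-03", "2011-04-20", "2011-02-10", "2011-03-05"]
    ),
})

def split_date(dates):
    """Return the year, month, and day components of a datetime Series."""
    return dates.dt.year, dates.dt.month, dates.dt.day

# Truncate every transaction date to the first day of its month.
tx["tran_month"] = tx["tran_date"].dt.to_period("M").dt.to_timestamp()

# A customer's cohort is the month of their first transaction.
tx["cohort_month"] = tx.groupby("cust_id")["tran_month"].transform("min")

tran_year, tran_mon, _ = split_date(tx["tran_month"])
cohort_year, cohort_mon, _ = split_date(tx["cohort_month"])

# Offset in months since the cohort month; 1 means the first month.
tx["cohort_index"] = (tran_year - cohort_year) * 12 + (tran_mon - cohort_mon) + 1
```

Starting the offset at 1 rather than 0 is a convention; it makes the first column of a cohort table read as "month 1".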

Customer retention is a very powerful metric for understanding how many of our customers are still ‘alive’ (or active). Retention shows the percentage of active customers out of total customers.
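Given a cohort month and a month offset for every transaction, a retention table can be sketched like this. The column names cohort_month and cohort_index and the sample rows are assumptions for the example.

```python
import pandas as pd

# One row per transaction, already labelled with cohort and month offset.
tx = pd.DataFrame({
    "cust_id":      [100, 100, 100, 101, 101, 102],
    "cohort_month": ["2011-01", "2011-01", "2011-01", "2011-01", "2011-01", "2011-02"],
    "cohort_index": [1, 2, 3, 1, 3, 1],
})

# Number of distinct active customers per cohort and month offset.
cohort_counts = (
    tx.groupby(["cohort_month", "cohort_index"])["cust_id"]
    .nunique()
    .unstack("cohort_index")
)

# The first column holds the size of each cohort.
cohort_sizes = cohort_counts[1]

# Share of each cohort that is still active in every later month.
retention = cohort_counts.divide(cohort_sizes, axis=0)
```

Cells with no data yet (for example, later months of the most recent cohort) stay empty (NaN), which is what gives cohort heatmaps their triangular shape.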

Another useful metric for behavioral and customer analytics is the average quantity of products that customers buy in the store, which can be visualized in a cohort table like the one above for retention.

As you can see in the toy data set we mostly have no changes in customer behavior during the whole year.

Now it’s time to use data science algorithms to build customer segments. First, we need to preprocess the data before we start building the model. I will show a couple of concepts used in the preprocessing step, along with other considerations. After building the preprocessing pipeline, I will show how to build a popular machine learning algorithm, K-Means clustering, on top of our calculated RFM scores and other factors that we will calculate later. This will help us identify groups of users based on their behavior metrics.

What is K-Means Clustering and why use this algorithm?

K-Means is one of the most popular unsupervised learning methods to identify different patterns
K-Means is simple and fast
Works well on big datasets

There are some critical assumptions with the K-Means algorithm before building it:

All variables must have symmetrical distribution and should not be skewed.
All variables should have the same or almost the same average values.
All variables should have the same level of variance.

OK, let’s create some additional factors to make our clusters more informative and interesting. Let’s start with socio-demographic information about our customers. I have the date of birth, gender, and city, all of which I would like to use. From the date of birth, I will calculate each customer’s age on the maximum transaction date; gender and city I will transform with one-hot encoding. As a result, I get the following table of customers, which I will join to the latest RFM analysis, made on 2014-12-02.
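A minimal sketch of these two transformations is shown below. The sample customers are made up. Treating missing Gender and city_code values as their own category (dummy_na=True) is an assumption of this example, not necessarily what the original notebook does.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_Id": [100, 101, 102],
    "DOB": pd.to_datetime(["1980-12-10", "1990-05-20", "1975-12-02"]),
    "Gender": ["M", "F", None],
    "city_code": [1.0, 5.0, None],
})
max_tran_date = pd.Timestamp("2014-12-02")

# Age in full years on the reference date: subtract one year
# if the birthday has not happened yet in the reference year.
dob = customers["DOB"]
had_birthday = (dob.dt.month < max_tran_date.month) | (
    (dob.dt.month == max_tran_date.month) & (dob.dt.day <= max_tran_date.day)
)
customers["age"] = max_tran_date.year - dob.dt.year - (~had_birthday).astype(int)

# One-hot encode the categorical columns; missing values get their own column.
features = pd.get_dummies(
    customers.drop(columns=["DOB"]),
    columns=["Gender", "city_code"],
    dummy_na=True,
)
```

The resulting table has one row per customer and can be joined to the RFM table on the customer identifier.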

The next features are built from category and transaction information: the average purchase value by category for every client.
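These per-category features can be sketched with a pivot table. The sample rows, the category names, and the avg_amt_ prefix are illustrative assumptions.

```python
import pandas as pd

# Transactions already joined with the product table (made-up rows).
df = pd.DataFrame({
    "cust_id": [100, 100, 100, 101],
    "prod_cat": ["Books", "Books", "Clothing", "Books"],
    "total_amt": [100.0, 300.0, 50.0, 80.0],
})

# One row per customer, one column per category, mean purchase value inside.
category_features = df.pivot_table(
    index="cust_id",
    columns="prod_cat",
    values="total_amt",
    aggfunc="mean",
    fill_value=0.0,  # the customer never bought in this category
).add_prefix("avg_amt_")
```

Filling missing combinations with 0 is a design choice: it tells the clustering step that the customer spent nothing in that category.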

For now, these are all the features I would like to use in this experiment, but their number and diversity depend only on your business needs and your engineers’ imagination.

It is time to preprocess all this data because of the algorithm’s limitations. There are some tips and tricks we can use:

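We have several ways to reduce skewness in data. The first one is the logarithmic transformation, but it works only with positive values.
Another widely used method is Z-transformation (standardization), which works well on mixed (negative and positive) values in the data. Note that it rescales each variable to zero mean and unit variance, which addresses the mean and variance assumptions above, but as a linear transformation it does not change the shape of the distribution.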
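The two transformations can be compared on a small example. This is a sketch with made-up numbers. It also checks what each transformation does to skewness: the log transform reduces it, while the Z-transformation only changes location and scale.

```python
import numpy as np
import pandas as pd

# Strictly positive, right-skewed values: the log transform applies.
values = pd.Series([1.0, 2.0, 4.0, 8.0, 100.0])
log_values = np.log(values)

# Mixed negative and positive values: the log is undefined for some of them,
# so we standardize instead (zero mean, unit variance).
mixed = pd.Series([-50.0, 0.0, 10.0, 40.0, 500.0])
z_values = (mixed - mixed.mean()) / mixed.std(ddof=0)

skew_before_log = values.skew()
skew_after_log = log_values.skew()   # smaller than before
skew_before_z = mixed.skew()
skew_after_z = z_values.skew()       # same as before, up to rounding
```

In practice, scikit-learn's StandardScaler performs the same standardization for all columns at once.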

I will explain other methods as we go. For now, this is enough information to dive into the work.

A further data exploration task is to identify skewness. Skewness is a measure of symmetry (or lack of symmetry). The distribution of a dataset is symmetric if it looks the same to the left and right of the center point. The histogram is an effective graph technique for showing the skewness of data.

Usually, there are 3 types of skewness:

Left Skewed data
Normally distributed (symmetric) data
Right Skewed data
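Besides looking at a histogram, skewness can be checked numerically. The sketch below uses the sample skewness from pandas; the helper name describe_skew and the tolerance of 0.5 are assumptions for this example, a rule of thumb rather than a standard.

```python
import pandas as pd

def describe_skew(values, tolerance=0.5):
    """Classify a sample as left-skewed, right-skewed, or roughly symmetric."""
    skew = pd.Series(values).skew()
    if skew > tolerance:
        return "right-skewed"   # long tail towards large values
    if skew < -tolerance:
        return "left-skewed"    # long tail towards small values
    return "approximately symmetric"

right = describe_skew([1, 1, 2, 2, 3, 50])
symmetric = describe_skew([1, 2, 3, 4, 5])
left = describe_skew([-50, 1, 2, 2, 3, 3])
```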

I will use Z-transformation because my data contains negative values that I want to keep in the analysis.

The next thing we need to do is choose the number of clusters. For the K-Means algorithm, there are several ways to do this:

Visual method — the so-called elbow method (or elbow criterion)
The quantitative method called silhouette coefficient
Experimentation and imagination
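The first two approaches can be sketched with scikit-learn. The synthetic data (three well-separated blobs), the range of k, and the variable names are assumptions made for this illustration; on real customer features the picture is usually far less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic, already scaled data: three well-separated groups of points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertia = {}     # within-cluster sum of squares, plotted for the elbow method
silhouette = {}  # mean silhouette coefficient, higher is better

for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = model.inertia_
    silhouette[k] = silhouette_score(X, model.labels_)

# Quantitative choice: the k with the highest silhouette coefficient.
best_k = max(silhouette, key=silhouette.get)
```

For the elbow method, plot inertia against k and look for the point where the curve stops dropping sharply.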

I have many factors, so to make the analysis meaningful I need to build a lot of clusters. I will keep 20 micro-clusters for further analysis.

Next, I want to show how to calculate basic metrics to assess the ‘fit’ of the K-Means algorithm.

Such a chart could help you to analyze the relative importance of attributes.
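One common way to compute the values behind such a chart, offered here as an assumption since the article does not spell out the formula, is to compare each cluster's average with the population average for every attribute. The sample data and names below are made up.

```python
import pandas as pd

# Customers with their assigned cluster and two attributes (made up).
data = pd.DataFrame({
    "cluster":   [0, 0, 1, 1],
    "Recency":   [10.0, 20.0, 100.0, 110.0],
    "Frequency": [8.0, 12.0, 1.0, 3.0],
})

cluster_avg = data.groupby("cluster").mean()
population_avg = data.drop(columns="cluster").mean()

# 0 means "same as the average customer"; the further from 0,
# the more this attribute distinguishes the cluster.
relative_importance = cluster_avg / population_avg - 1
```

Plotting relative_importance as a heatmap gives one row per cluster and one column per attribute. Note that this ratio is only meaningful on the original, unscaled features, because standardized features have a population mean of zero.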

There is one more useful chart, the spider chart, which helps us see which factors make the clusters differ from one another.

Here we can see that these four clusters differ mostly by the group of goods that customers in each cluster bought.

That is all I wanted to show in this analysis; it is time for a conclusion.

Conclusions

As you can see, these approaches are powerful tools that can help you know your customers better.

Segmentation allows businesses to make better use of their marketing budgets, gain a competitive edge over rival companies and, importantly, demonstrate better knowledge of your customers’ needs and wants. It can also help:

Marketing efficiency — Breaking down a large customer base into more manageable pieces, making it easier to identify your target audience and launch campaigns to the most relevant people, using the most relevant channel.
Determine new market opportunities — During the process of grouping your customers into clusters, you may find that you have identified a new market segment, which could in turn alter your marketing focus and strategy to fit.
Better brand strategy — Once you have identified the key motivators for your customer, such as design or price, or practical needs, you can brand your products appropriately.
Improve distribution strategies — Identifying where and when customers shop can inform product distribution strategies, such as what type of products are sold at particular outlets.
Customer retention — Using segmentation, marketers can identify groups that require extra attention and those that churn quickly, along with customers with the highest potential value. It can also help with creating targeted strategies that capture your customers’ attention and create positive, high-value experiences with your brands.

So, don’t be afraid and start making your marketing strategies more efficient and profitable.

You can find all the code in the Git repository: link.

Written by Andrii Shchur