Customer Segmentation in Online Retail

A detailed, step-by-step explanation of performing customer segmentation on an online retail dataset using Python, focusing on cohort analysis, understanding purchase patterns using RFM analysis, and clustering.

Rahul Khandelwal
Towards Data Science


In this article, I am going to walk through how to carry out customer segmentation and related analysis on online retail data using Python.

This is going to get a bit long, so feel free to go through some sections at a time and come back again.

Before going into the definition of customer segmentation, let us take a look at how online retail works and what the associated data looks like. When a person goes into a retail store and purchases a few items, the following basic data points should be generated:

  1. Customer Name/Customer ID
  2. Address/Contact number of the customer (Demographic information)
  3. Invoice number
  4. Product name and product code
  5. Quantity
  6. Unit Price
  7. Date and time of the transaction
  8. Coupon Code (if applicable)
  9. Discount amount (if applicable)

Now that we have developed a basic idea of what retail data looks like, let us think about how a company should think in order to make effective marketing policies.

For a small company, the customer base is usually quite small and individually targetable. But, as a business grows in size, it will not be possible for the business to have an intuition about each and every customer. At such a stage, human judgments about which customers to pursue will not work and the business will have to use a data-driven approach to build a proper strategy.

For a medium-to-large retail store, it is also imperative to invest not only in acquiring new customers but also in customer retention. Many businesses earn most of their revenue from their ‘best’, high-value customers. Since a company’s resources are limited, it is crucial to find these customers and target them. It is equally important to find the customers who are dormant or at high risk of churning, and to address their concerns. For this purpose, companies use the technique of customer segmentation.

One axiom frequently used in business and economics is the Pareto principle. This can be applied to understanding the revenue stream of a company as well.

As per the Pareto Principle, 80% of outcomes result from 20% of all the causes of any given event.

In business terms, we can say that 20% of customers contribute 80% of the total revenue of a company. That’s why finding this set of people is important. I will explain the importance of customer segmentation in more detail later in this article.

Let us now try to understand what customer segmentation is and why it is such a powerful tool for developing an effective strategy. Then, we will work through how to perform the segmentation.

Understanding Customer Segmentation

Customer segmentation is the process of separating customers into groups on the basis of their shared behavior or other attributes. The groups should be homogeneous within themselves and heterogeneous with respect to each other. The overall aim of this process is to identify the high-value customer base, i.e., customers that have the highest growth potential or are the most profitable.

Insights from customer segmentation are used to develop tailor-made marketing campaigns and for designing overall marketing strategy and planning.

A key consideration for a company is whether or not to segment its customers, and how to go about it. This depends upon the company’s philosophy and the type of products or services it offers. The segmentation criterion a company follows makes a big difference in the way it operates and formulates its strategy. This is elucidated below.

  1. Zero segments (undifferentiated approach): The company treats all of its customers in the same manner. In other words, there is no differentiated strategy, and the entire customer base is reached through a single mass-marketing campaign.
  2. One segment (focused approach): The company targets a particular group or niche of customers in a tightly defined target market.
  3. Two or more segments (differentiated approach): The company targets two or more groups within its customer base and builds specific marketing strategies for each segment.
  4. Thousands of segments (hyper-segmentation approach): The company treats each customer as unique and makes a customized offer for each one of them.

Once the company has identified its customer base and the number of segments it aims to focus upon, it needs to decide the factors on whose basis it will decide to segment its customers.

Factors for segmentation for a business-to-business (B2B) marketing company:

  1. Industry
  2. Number of employees
  3. Location
  4. Market cap/Company size
  5. Age of the company

Factors for segmentation for a business-to-consumer (B2C) marketing company:

  1. Demographic: Age, gender, education, ethnicity, income, employment, hobbies, etc.
  2. Recency, Frequency, and Monetary: Time since the last transaction, the frequency with which the customer transacts, and the total monetary value of their trade.
  3. Behavioral: Previous purchasing behavior, brand preferences, life events, etc.
  4. Psychographic: Beliefs, personality, lifestyle, personal interests, motivation, priorities, etc.
  5. Geographical: Country, zip code, climatic conditions, urban/rural differentiation, accessibility to markets, etc.

Why segment your customers?


Customer segmentation has a lot of potential benefits. It helps a company develop an effective strategy for targeting its customers. This has a direct impact on the entire product development cycle, the budget management practices, and the plan for delivering targeted promotional content to customers. For example, a company can build a high-end product, a budget product, or a cheap alternative, depending on whether the product is intended for its highest-yield customers, its frequent purchasers, or its low-value customer segment. It may also fine-tune the features of the product to fulfill the specific needs of its customers.

Customer segmentation can also help a company understand how its customers are alike, what is important to them, and what is not. Often, such information can be used to develop personalized, relevant content for different customer bases. Many studies have found that customers appreciate such individual attention and are more likely to respond and buy the product. They also come to respect the brand and feel connected with it. This is likely to give the company a big advantage over its competitors. In a world where everyone receives hundreds of emails, push notifications, messages, and ads in their content stream, no one has time for irrelevant content.

Finally, this technique can also be used by companies to test the pricing of their different products, improve customer service, and upsell and cross-sell other products or services.

How to segment your customers?

To start with customer segmentation, a company needs to have a clear vision and a goal in mind. The following steps can be undertaken to find segments in the customer base on a broad level.

  1. Analyze the existing customer pool: Understanding the geographical distribution, customer preferences/beliefs, reviewing website search page analytics, etc.
  2. Develop an understanding of each customer: Mapping each customer to a set of preferences to understand and predict their behavior: the products, services, and content they would be interested in.
  3. Define segment opportunities: Once the segments have been defined, there should be a proper business understanding of each segment and its challenges and opportunities. The entire company’s marketing strategy can be branched out to cater to different niches of customers.
  4. Research the segment: After cementing the definition and business relevance of different customer segments, a company needs to understand how to modify its products or services to better cater to them. For example, it may decide to provide higher discounts to some customers compared to others to expand its active customer base.
  5. Tweak strategy: Experiment with new strategies and understand the impact of time and economy on the purchasing behavior of customers in different segments. And then the process should be repeated for refining the strategy as much as possible.

Getting started

In the following analysis, I am going to use the Online Retail Data Set, which was obtained from the UCI Machine Learning Repository. The dataset contains transnational transactions for a UK-based and registered non-store online retailer. The link to the data can be found here.

Data Snapshot

Fig: data sample

Data Attributes

  1. InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
  2. StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
  3. Description: Product (item) name. Nominal.
  4. Quantity: The quantities of each product (item) per transaction. Numeric.
  5. InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
  6. UnitPrice: Unit price. Numeric, Product price per unit in sterling.
  7. CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
  8. Country: Country name. Nominal, the name of the country where each customer resides.

Exploring the data

Before diving into insights from the data, duplicate entries were removed. The data contained 5268 duplicate entries (~1% of the data).
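
A minimal sketch of these first steps; the file name assumes the Excel distribution of the UCI dataset:

import pandas as pd

# Load the Online Retail data
data = pd.read_excel('Online Retail.xlsx')

# Count and drop exact duplicate rows
print(data.duplicated().sum())
data = data.drop_duplicates()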

Let us now look at the total number of products, transactions, and customers in the data, which correspond to the total unique stock codes, invoice number, and customer IDs present in the data.

Fig: number of products, transactions and customers in the data

Thus, for 4070 products, there are 25900 transactions in the data, meaning each product is likely to appear in multiple transactions. There are almost as many products as customers as well.

Since the UCI Machine Learning Repository describes the data as being based on transactions for a UK-based and registered non-store online retailer, let us check the percentage of orders from each country in the data.

Fig: %Orders from each country

The above graph shows the percentage of orders from the top 10 countries, sorted by the number of orders. It shows that more than 90% of orders come from the United Kingdom, and no other country accounts for even 3% of the orders in the data.

Therefore, for the purpose of this analysis, I will be taking data corresponding to orders from the United Kingdom. This subset will be made in one of the next steps and will be mentioned as required.

Let us now look at the number of canceled orders in the data. As per the data, if the invoice number code starts with the letter ‘c’, it indicates a canceled order.

A flag column was created to indicate whether an order corresponds to a cancellation. All canceled orders contain negative quantities (since they are cancellations) and hence were removed from the data.

Finally, I ran a check to confirm whether there were any orders with negative quantities in the orders that were not canceled. There were 1336 such cases.

Fig: checking for orders with negative quantities in the data

As we can see from the above figure, these cases are the ones where CustomerID values are NaNs. These cases were also removed from the data.

Now, the data was filtered to contain only orders from the United Kingdom, and finally the structure of the data was checked by calling the .info() method:
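
The cleaning steps described above might look like the following sketch; the flag-column name is my own choice, and the check is made case-insensitive since the description mentions a leading letter ‘c’:

# Flag cancellations: invoice numbers starting with 'C' (case-insensitive)
data['order_canceled'] = data['InvoiceNo'].astype(str).str.upper().str.startswith('C')

# Remove canceled orders (these carry negative quantities)
data = data[~data['order_canceled']]

# Remove the remaining negative-quantity rows, all of which have missing CustomerID
data = data[data['Quantity'] > 0]
data = data.dropna(subset=['CustomerID'])

# Keep only orders from the United Kingdom
data = data[data['Country'] == 'United Kingdom']

data.info()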

Fig: .info() method output on the finalized clean data

There were no nulls in any of the columns, and there were a total of 349227 rows in the data. Let us now check the number of products, transactions, and customers in our cleaned data.

Understanding Cohort Analysis

Let us now try to understand cohort analysis so that we can perform it on our data.

But, what is a Cohort?

A cohort is a set of users who share similar characteristics over a given period. Cohort analysis splits users into mutually exclusive groups and measures their behavior over time.

It can provide information about the product and customer lifecycle.

There are three types of cohort analysis:

  1. Time cohorts: groups customers by when they made their first purchase and tracks their behavior over time.
  2. Behavior cohorts: groups customers by the product or service they signed up for.
  3. Size cohorts: groups customers by size, for example based on the amount they spend on the company’s products or services in some period of time.

Understanding the needs of various cohorts can help a company design custom-made services or products for particular segments.

In the following analysis, we will create Time cohorts and look at customers who remain active during particular cohorts over a period of time that they transact over.

Diving into Cohort Analysis

Checking the date range of our data, we find that it ranges from the start date: 2010-12-01 to the end date: 2011-12-09. Next, a column called InvoiceMonth was created to indicate the month of the transaction by taking the first date of the month of InvoiceDate for each transaction. Then, information about the first month of the transaction was extracted, grouped by the CustomerID.

import datetime as dt

def get_month(x):
    # Truncate a timestamp to the first day of its month
    return dt.datetime(x.year, x.month, 1)

cohort_data['InvoiceMonth'] = cohort_data['InvoiceDate'].apply(get_month)

# Each customer's first transaction month becomes their cohort
grouping = cohort_data.groupby('CustomerID')['InvoiceMonth']
cohort_data['CohortMonth'] = grouping.transform('min')
cohort_data.head()
Fig: creating the InvoiceMonth and CohortMonth column

Next, we need to find the difference between the InvoiceMonth and the CohortMonth column in terms of the number of months. The following code was used:

def get_date_int(df, column):
    # Split a datetime column into year, month, and day series
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day

invoice_year, invoice_month, _ = get_date_int(cohort_data, 'InvoiceMonth')
cohort_year, cohort_month, _ = get_date_int(cohort_data, 'CohortMonth')

# Number of months between the transaction and the start of its cohort
years_diff = invoice_year - cohort_year
months_diff = invoice_month - cohort_month
cohort_data['CohortIndex'] = years_diff * 12 + months_diff
cohort_data.head()
Fig: creating the CohortIndex column

After obtaining the above information, we obtain the cohort analysis matrix by grouping the data by CohortMonth and CohortIndex and aggregating on the CustomerID column by applying the pd.Series.nunique function. Here are the cohort counts obtained:
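
A sketch of how that matrix can be assembled from the cohort_data built above:

# Count unique customers in each (CohortMonth, CohortIndex) cell
grouping = cohort_data.groupby(['CohortMonth', 'CohortIndex'])
cohort_data_counts = grouping['CustomerID'].apply(pd.Series.nunique).reset_index()

# Pivot so rows are cohorts and columns are cohort indices
cohort_counts = cohort_data_counts.pivot(index='CohortMonth',
                                         columns='CohortIndex',
                                         values='CustomerID')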

Fig: cohort counts analysis

What does the above table tell us?

Consider CohortMonth 2010-12-01: for CohortIndex 0, this tells us that 815 unique customers made transactions during CohortMonth 2010-12-01. For CohortIndex 1, this tells us that 289 of those 815 customers made their first transaction during CohortMonth 2010-12-01 and also made transactions during the following month. That is, they remained active.

For CohortIndex 2, this tells us that 263 of those 815 customers made their first transaction during CohortMonth 2010-12-01 and also made transactions during the second-following month. And so on for higher CohortIndices.

Let us now calculate the Retention Rate. It is defined as the percentage of active customers out of total customers. Since the number of active customers in each cohort corresponds to the CohortIndex 0 values, we take the first column of the data as the cohort sizes.

import matplotlib.pyplot as plt
import seaborn as sns

# Cohort size = number of customers at CohortIndex 0 (the first column)
cohort_sizes = cohort_counts.iloc[:, 0]

# Divide all values in the cohort_counts table by the cohort sizes
retention = cohort_counts.divide(cohort_sizes, axis=0)

# Check the retention table
retention.round(3) * 100

# Drawing a heatmap
plt.figure(figsize=(10, 8))
plt.title('Retention rates')
sns.heatmap(data=retention, annot=True, fmt='.0%', vmin=0.0, vmax=0.5, cmap='BuGn')
plt.show()
Fig: retention heatmap

From the above retention rate heatmap, we can see that there is an average retention of ~35% for the CohortMonth 2010-12-01, with the highest retention rate occurring after 11 months (50%). For all the other CohortMonths, the average retention rates are around 18-25%. Only this percentage of users are making transactions again in the given CohortIndex ranges.

From this analysis, a company can understand and create strategies to increase customer retention by providing more attractive discounts or by doing more effective marketing, etc.

RFM Segmentation

RFM stands for Recency, Frequency, and Monetary. RFM analysis is a commonly used technique that generates and assigns a score to each customer based on how recent their last transaction was (Recency), how many transactions they have made in the last year (Frequency), and the total monetary value of their transactions (Monetary).


RFM analysis helps to answer the following questions: Who was our most recent customer? How many times have they purchased items from our shop? And what is the total value of their trade? All this information can be critical to understanding how good or bad a customer is to the company.

After getting the RFM values, a common practice is to create ‘quartiles’ on each of the metrics and assign scores in the required order. For example, suppose we divide each metric into 4 cuts. For the recency metric, the highest score, 4, is assigned to the customers with the smallest recency values (since they are the most recent customers). For the frequency and monetary metrics, the highest score, 4, is assigned to the customers in the top 25% of frequency and monetary values, respectively. After dividing the metrics into quartiles, we can collate the three scores into a single column (a string of characters such as ‘213’) to create classes of RFM values for our customers. We can divide the RFM metrics into fewer or more cuts depending on our requirements.

Let’s get down to RFM analysis on our data now.

Firstly, we need to create a column for the monetary value of each transaction, obtained by multiplying the UnitPrice column by the Quantity column. Let’s call this the TotalSum. Calling the .describe() method on this column gives the summary below.
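
In pandas this is a single element-wise product:

# Monetary value of each line item
data['TotalSum'] = data['UnitPrice'] * data['Quantity']
data['TotalSum'].describe()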

Fig: .describe() method on TotalSum column

This gives us an idea of how consumer spending is distributed in our data. We can see that the mean value is 20.86 and the standard deviation is 328.40. But the maximum value is 168,469. This is a very large value. Therefore, the TotalSum values in the Top 25% of our data increase very rapidly from 17.85 to 168,469.

Now, for RFM analysis, we need to define a ‘snapshot date’, which is the day on which we are conducting this analysis. Here, I have taken the snapshot date as the highest date in the data + 1, i.e., the day after the last recorded transaction. This is equal to 2011-12-10 (YYYY-MM-DD).
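
A minimal sketch of the snapshot date, assuming the cleaned data frame from earlier:

import datetime as dt

# One day after the last invoice in the data: 2011-12-10
snapshot_date = data['InvoiceDate'].max() + dt.timedelta(days=1)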

Next, we confine the data to a period of one year to limit the recency value to a maximum of 365 and aggregate the data on a customer level and calculate the RFM metrics for each customer.

# Aggregate data on a customer level and compute the RFM metrics
data = data_rfm.groupby(['CustomerID'], as_index=False).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'count',
    'TotalSum': 'sum'
}).rename(columns={'InvoiceDate': 'Recency',
                   'InvoiceNo': 'Frequency',
                   'TotalSum': 'MonetaryValue'})

As the next step, we create quartiles on this data as described above and collate these scores into an RFM_Segment column. The RFM_Score is calculated by summing up the RFM quartile metrics.
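
The exact scoring code is not shown here; below is a minimal sketch using pd.qcut, with label orders chosen so that 4 is always the best score (recency labels run in reverse). If tied values cause duplicate bin edges, ranking the column first (e.g., data['Frequency'].rank(method='first')) is a common workaround.

# Quartile labels: for Recency, lower values are better, so labels run 4..1
r_labels = list(range(4, 0, -1))
f_labels = list(range(1, 5))
m_labels = list(range(1, 5))

data['R'] = pd.qcut(data['Recency'], q=4, labels=r_labels).astype(int)
data['F'] = pd.qcut(data['Frequency'], q=4, labels=f_labels).astype(int)
data['M'] = pd.qcut(data['MonetaryValue'], q=4, labels=m_labels).astype(int)

# Collate the quartiles into a segment string and a summed score
data['RFM_Segment'] = data['R'].astype(str) + data['F'].astype(str) + data['M'].astype(str)
data['RFM_Score'] = data[['R', 'F', 'M']].sum(axis=1)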

Fig: RFM scores and quartiles.

We are now in a position to analyze our results. The RFM_Score values will range from 3 (1+1+1) to 12 (4+4+4). So, we can group by the RFM scores and check the mean values of recency, frequency, and monetary corresponding to each score.
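
A sketch of that aggregation:

# Mean R, F, M values (and segment size) for each RFM_Score
data.groupby('RFM_Score').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
}).round(1)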

Fig: mean values of recency, frequency, and monetary for different RFM score values

As expected, customers with the lowest RFM scores have the highest recency values and the lowest frequency and monetary values, and vice-versa. Finally, we can create segments within the RFM_Score range of 3-12 by manually defining categories in our data: customers with an RFM_Score of 9 or more can be put in the ‘Top’ category, customers with an RFM_Score between 5 and 8 can be put in the ‘Middle’ category, and the rest can be put in the ‘Low’ category. Let us call our categories the ‘General_Segment’. We then analyze the mean values of recency, frequency, and monetary for each category.
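
One way to implement this manual bucketing is sketched below; the closing groupby produces the summary shown in the figure.

def segment_label(score):
    # Manual thresholds: 9+ is 'Top', 5-8 is 'Middle', the rest is 'Low'
    if score >= 9:
        return 'Top'
    elif score >= 5:
        return 'Middle'
    return 'Low'

data['General_Segment'] = data['RFM_Score'].apply(segment_label)
data.groupby('General_Segment').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
}).round(1)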

Fig: mean values of recency, frequency, and monetary for different categories

Note that we had to create the logic for distributing customers into the ‘Top’, ‘Middle’, and ‘Low’ categories manually. In many scenarios, this would be fine. But if we want to find segments in our RFM values in a more principled way, we can use a clustering algorithm like K-means.

In the next section, we are going to preprocess the data for K-means clustering.

Preprocessing data for K-means clustering

K-means is a well-known clustering algorithm that is frequently used for unsupervised learning tasks. I am not going into details regarding how the algorithm works here, as there are plenty of resources online.

For our purpose, we need to understand that the algorithm makes certain assumptions about the data. Therefore, we need to preprocess the data so that it can meet the key assumptions of the algorithm, which are:

  1. The variables should be distributed symmetrically
  2. Variables should have similar average values
  3. Variables should have similar standard deviation values

Let us check the first assumption by building histograms of Recency, Frequency, and MonetaryValue variables using the seaborn library:

# Checking the distribution of Recency, Frequency and MonetaryValue variables
plt.figure(figsize=(12, 10))
# Plot distribution of Recency
plt.subplot(3, 1, 1); sns.distplot(data['Recency'])
# Plot distribution of Frequency
plt.subplot(3, 1, 2); sns.distplot(data['Frequency'])
# Plot distribution of MonetaryValue
plt.subplot(3, 1, 3); sns.distplot(data['MonetaryValue'])
Fig: distribution of recency, frequency, and monetary value metrics

From the above figure, we can see that none of the variables has a symmetrical distribution; all of them are skewed to the right. To remove the skewness, we can try the following transformations:

  1. Log transformation
  2. Box-Cox transformation
  3. Cube root transformation

I will use the log transformation here. Since the log transformation is not defined for negative values, we would need to handle them if they existed. A common practice is to add a constant to each observation, usually the absolute value of the variable’s most negative value, to shift everything positive. However, our data has no negative values, since we are dealing with a cleaned customer transactions dataset.

Checking the distribution of the recency, frequency, and monetary variables by calling the .describe() method, we get:

Fig: .describe() method on RFM metrics

From the above description, we can see that the minimum MonetaryValue for one particular CustomerID is 0. A zero monetary value makes no sense for a transaction, so it needs to be removed. Checking the occurrence:

Fig: The CustomerID with 0 monetary value

This customer was removed from the data. We also see that the variables do not share a common mean and standard deviation. To fix that, we will standardize the data: first apply the log transformation, then pass the result through the StandardScaler() method from the sklearn library to obtain the preprocessed data. After preprocessing, we check the distribution of the RFM variables for symmetry again:
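
A sketch of that two-step preprocessing (log transform, then StandardScaler), assuming data holds the per-customer RFM table:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Step 1: log-transform to reduce the right skew
data_log = np.log(data[['Recency', 'Frequency', 'MonetaryValue']])

# Step 2: standardize each variable to zero mean and unit variance
scaler = StandardScaler()
data_norm = scaler.fit_transform(data_log)
data_norm = pd.DataFrame(data_norm, index=data_log.index, columns=data_log.columns)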

Fig: distribution of recency, frequency, and monetary value metrics on preprocessed data

As we can see from the above plots, skewness has been removed from the data.

Clustering with K-means

In this section, we will build multiple clusterings of our normalized RFM data and try to find the optimal number of clusters using the elbow method. Attached below is the code for this purpose. For each value of k, I have also extracted the within-cluster sum of squared errors (inertia), from which we can build the elbow plot to find the desired number of clusters in our data.

from sklearn.cluster import KMeans

sse = {}
# Fit KMeans and calculate SSE for each k
for k in range(1, 21):

    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)

    # Fit KMeans on the normalized dataset
    kmeans.fit(data_norm)

    # Assign sum of squared distances to k element of dictionary
    sse[k] = kmeans.inertia_

# Plotting the elbow plot
plt.figure(figsize=(12, 8))
plt.title('The Elbow Method')
plt.xlabel('k')
plt.ylabel('Sum of squared errors')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()
Fig: elbow plot

One can also use silhouette analysis to find the optimal number of clusters. You can read more about it in my previous article here. For the purpose of this analysis, I have only used the elbow plot method.

From the above plot, we can see that the optimal number of clusters is 3 or 4.

Let us now compare the clustering performance. For this purpose, I calculated the mean values of recency, frequency, and monetary metrics to get the following result:
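
A sketch of that comparison: fit K-means with k=3 and k=4 on the normalized data, attach the labels to the raw RFM table, and average per cluster.

from sklearn.cluster import KMeans

for k in (3, 4):
    kmeans = KMeans(n_clusters=k, random_state=1)
    cluster_labels = kmeans.fit_predict(data_norm)
    # Average the raw (untransformed) RFM metrics within each cluster
    summary = data.assign(Cluster=cluster_labels).groupby('Cluster').agg({
        'Recency': 'mean',
        'Frequency': 'mean',
        'MonetaryValue': ['mean', 'count']
    }).round(0)
    print(f'k = {k}')
    print(summary)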

Fig: comparing cluster performance on our data (between 3 and 4 clusters)

From the above table, we can compare the distribution of mean values of recency, frequency, and monetary metrics across 3 and 4 cluster data. It seems that we get a more detailed distribution of our customer base using k=4. However, this may not be a very visually appealing method to extract insights.

Another commonly used method to compare cluster segments is the snake plot. Snake plots are commonly used in marketing research to understand customer perceptions.

Let us build a snake plot for our data with 4 clusters below.

Before building snake plots, we need to melt the data into a long format so that the RFM values and metric names are each stored in a single column. Link to understanding the pd.melt method: link.

# Melt the data into a long format so RFM values and metric names
# are each stored in a single column
data_melt = pd.melt(data_norm_k4.reset_index(),
                    id_vars=['CustomerID', 'Cluster'],
                    value_vars=['Recency', 'Frequency', 'MonetaryValue'],
                    var_name='Attribute',
                    value_name='Value')

# Building the snake plot
plt.title('Snake plot of standardized variables')
sns.lineplot(x="Attribute", y="Value", hue='Cluster', data=data_melt)
Fig: Snake plot for data with 4 clusters

From the above snake plot, we can see the distribution of the recency, frequency, and monetary metric values across the four clusters. The four clusters appear well separated from each other, which indicates good heterogeneity between clusters.

As the final step in this analysis, we can now extract this information for each customer, which the company can use to map each customer to their relative importance:
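
A sketch, assuming data_norm_k4 carries the 4-cluster labels (aligned by CustomerID) from the snake-plot step:

# Attach the 4-cluster assignment to each customer's raw RFM values
data_final = data[['Recency', 'Frequency', 'MonetaryValue']].copy()
data_final['Cluster'] = data_norm_k4['Cluster']
data_final.head()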

Fig: RFM values for each customer along with the cluster assignment

Final Thoughts

From the above analysis, we can see that there should be 4 clusters in our data. To understand what these 4 clusters mean in a business scenario, we should look back at the table comparing the clustering performance of 3 and 4 clusters on the mean values of the recency, frequency, and monetary metrics. On this basis, let us label the clusters as ‘New customers’, ‘Lost customers’, ‘Best customers’, and ‘At-risk customers’.

Below is the table giving the RFM interpretation of each segment and the points that a company is recommended to keep in mind while designing the marketing strategy for that segment of customers.

Fig: Business interpretation of clusters and recommended action on each segment

Further analysis

  • Addition of new variables like Tenure: The number of days since the first transaction by each customer. This will tell us how long each customer has been with the system.
  • Conducting deeper segmentation on customers based on their geographical location, and demographic and psychographic factors.
  • Incorporating data from the business’s Google Analytics account. Google Analytics is a great resource for tracking many important business metrics, such as Customer Lifetime Value, traffic source/medium, page views per visit, bounce rate of the company’s website, etc.

If you liked the article and found it informative, please share it to spread knowledge :). Thank you!

Link to the Github repository for this project: link
