Using Machine Learning to Predict Customers’ Next Purchase Day

A machine learning model to predict whether customers will make their next purchase within a certain period.

Evans Doe Ocansey
Towards Data Science


Image by Mediamodifier; it can be accessed here.

Introduction

If there is one major lesson that those in the retail business have learnt from the SARS-CoV-2 pandemic, it is the need to switch to doing business via the Internet, i.e., e-commerce. E-commerce also helps those in managerial positions make decisions for the progress of their companies. Undoubtedly, most of these decisions are informed by what experts in data analysis, data science, and machine learning derive from studying the purchasing behaviour of online customers.

Suppose the managerial team of an online retail shop approaches you, a data scientist, with a dataset, wanting to know whether their customers will make their next purchase within 90 days of their last purchase. Your answer to their inquiry will help them identify which customers their marketing team needs to focus on in the next promotional offers they roll out.

In this article, my goal as a data scientist is to build a model that provides a suitable answer to the question posed by the firm's managers. More precisely, using the given dataset, I build a machine learning model that predicts whether an online customer of the retail shop will make their next purchase within 90 days of their last purchase.

It is worth mentioning that Barış Karaman¹ has done similar work, although it answers a different question from the one this article seeks to address.

Some Information about the Dataset

Before proceeding to answer the question of interest, I will first present some general and useful information about the dataset.

The dataset records 5,942 online customers from 43 different countries. Of these online customers, 90.8% were living in the United Kingdom.

Figure 1: Customer Country Count in Percentage — Image by Author

With such a huge customer base in the United Kingdom, it is not surprising that 83% of the company’s revenue came from the United Kingdom.

Figure 2: Revenue of Countries in Percentage — Image by Author

Figure 3 below gives a visual representation of the monthly revenue earned by the online retail company.

Figure 3: Monthly Revenue from December 2009 until December 2011 — Image by Author.

Here, one can observe that the company recorded its highest revenue in November 2010, followed by November 2011. In addition, monthly revenue rises after August.

The analysis in this section also yields advice that the managers may wish to consider. In the company's bid to increase its customer base in countries other than the United Kingdom, what could a data scientist suggest to the managerial team? In answer to this, I would say that…

Since the company has a solid customer base in the United Kingdom, it could capitalise on that and roll out a “win-win promotion”. Specifically, for any product that an existing customer buys, he or she gets the opportunity to invite a new customer outside the United Kingdom via a web link. If the new customer buys something from the online shop using that link, the company gives both the existing customer and the new customer a cash voucher to be used on their next purchase. In this way, the company, the existing customer, and the new customer all gain from the transaction.

Predicting Customers’ Next Purchase

In this section, I focus on the methods I deployed to solve the problem of interest: building a machine learning model that predicts whether an online customer of the retail shop will make their next purchase within 90 days of their last purchase.

The major steps included the following:

  • Data Wrangling
  • Feature Engineering
  • Building Machine Learning Models
  • Selecting Model

I begin by importing the necessary Python packages, downloading the dataset, and loading it into my Python environment. The code snippet below summarises this step.
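A minimal sketch of this step; the file name is a placeholder for the actual dataset, which is available via the Jupyter notebook linked at the end of the article:

    # Packages used throughout the analysis.
    import pandas as pd
    import numpy as np
    from sklearn.cluster import KMeans

    # Load the transaction data; the file name is a placeholder.
    df_data = pd.read_excel("online_retail_II.xlsx")
    df_data["InvoiceDate"] = pd.to_datetime(df_data["InvoiceDate"])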

Data Wrangling

I then wrangle the dataset into good shape so that new features can be introduced.

The CustomerID column of the given dataset has 243,007 missing values, which represents 22.77% of all records. Moreover, the Description column has 4,382 missing values. How do I deal with these missing data? When I raised this with the company leaders, they suggested that any record with a missing CustomerID should be dropped.
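A sketch of this step, using the column names discussed above:

    # Quantify the missing values before dropping anything.
    print(df_data[["CustomerID", "Description"]].isnull().sum())

    # As the company leaders suggested, drop every record whose
    # CustomerID is missing.
    df_data = df_data.dropna(subset=["CustomerID"])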

The dataframe df_data is split into two pandas dataframes, as sketched in the code after the list below. Namely,

  • The first sub-dataframe, ctm_bhvr_dt, contains purchases made by customers from 01–12–2009 to 30–08–2011. From this dataset, I get the last purchase date of all online customers.
  • The second sub-dataframe ctm_next_quarter is used to get the first purchase date of the customers from 01–09–2011 to 30–11–2011.
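A sketch of the split, using 1 September 2011 as the cut-off date:

    # Cut-off separating the behaviour window from the next quarter.
    cutoff = pd.Timestamp(2011, 9, 1)

    # Purchases made from 01-12-2009 to 30-08-2011.
    ctm_bhvr_dt = df_data[df_data["InvoiceDate"] < cutoff].copy()

    # Purchases made from 01-09-2011 to 30-11-2011.
    ctm_next_quarter = df_data[
        (df_data["InvoiceDate"] >= cutoff)
        & (df_data["InvoiceDate"] < pd.Timestamp(2011, 12, 1))
    ].copy()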

Next, I create a pandas dataframe that contains a set of features for each customer, from which to build the prediction model. I begin by creating a dataframe that contains the distinct customers in ctm_bhvr_dt.

I then add a new label, NextPurchaseDay, to the dataframe ctm_dt. This new label is the number of days between a customer's last purchase date in the dataframe ctm_bhvr_dt and his/her first purchase date in the dataframe ctm_next_quarter.
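A sketch of this computation; the sentinel value of 9999 assigned to customers who made no purchase in the next quarter is an assumption:

    # Distinct customers observed in the behaviour window.
    ctm_dt = pd.DataFrame(ctm_bhvr_dt["CustomerID"].unique(),
                          columns=["CustomerID"])

    # Last purchase date per customer in the behaviour window.
    last_purchase = ctm_bhvr_dt.groupby("CustomerID")["InvoiceDate"].max().reset_index()
    last_purchase.columns = ["CustomerID", "LastPurchaseDate"]

    # First purchase date per customer in the next quarter.
    first_purchase = ctm_next_quarter.groupby("CustomerID")["InvoiceDate"].min().reset_index()
    first_purchase.columns = ["CustomerID", "FirstPurchaseDate"]

    # Days between a customer's last purchase and their first purchase
    # in the next quarter.
    purchase_dates = pd.merge(last_purchase, first_purchase,
                              on="CustomerID", how="left")
    purchase_dates["NextPurchaseDay"] = (
        purchase_dates["FirstPurchaseDate"] - purchase_dates["LastPurchaseDate"]
    ).dt.days

    ctm_dt = pd.merge(ctm_dt,
                      purchase_dates[["CustomerID", "NextPurchaseDay"]],
                      on="CustomerID", how="left")

    # Customers who did not buy in the next quarter get a large sentinel.
    ctm_dt["NextPurchaseDay"] = ctm_dt["NextPurchaseDay"].fillna(9999)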

Figure 5 below is the output of the code snippet above. It shows the first 5 entries of the dataframe ctm_dt.

Figure 5: Customers’ next purchase day data

In the next section, I introduce some features and add them to the dataframe ctm_dt to build our machine learning model.

Feature Engineering

I introduce features into the dataframe ctm_dt that segment customers into groups based on their value to the company. To do this, I use the RFM segmentation method. RFM stands for

  • Recency: how recently a customer made a purchase.
  • Frequency: how often, i.e., the number of times, a customer makes purchases.
  • Monetary Value/Revenue: the amount of money a customer spends when making a purchase at a point in time.

Using these three features, recency, frequency, and monetary value/revenue, I create an RFM scoring system to group the customers. Essentially, the derived RFM score gives insight into what a customer will probably do regarding purchase decisions in the future.

After calculating the RFM scores, I apply unsupervised machine learning to identify different groups (clusters) for each score and add them to the dataframe ctm_dt. Finally, I apply the pandas dataframe method get_dummies to ctm_dt to deal with the categorical features in the dataframe. I now move into the code that computes the RFM scores and the clusters.

Recency

The recency feature captures how long a customer has gone without making a purchase, so I use it to gauge who is likely to purchase again soon. It is relevant to note that a customer who purchased recently is worth far more to the company than one who has not bought in a while.

Let us get into the code below.
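A minimal sketch of the recency computation, assuming recency is measured against the last invoice date in the behaviour window:

    # Observation point: the last invoice date in the behaviour window.
    max_date = ctm_bhvr_dt["InvoiceDate"].max()

    # Days since each customer's last purchase.
    recency = ctm_bhvr_dt.groupby("CustomerID")["InvoiceDate"].max().reset_index()
    recency["Recency"] = (max_date - recency["InvoiceDate"]).dt.days
    ctm_dt = pd.merge(ctm_dt, recency[["CustomerID", "Recency"]], on="CustomerID")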

Figure 6 below gives a visual presentation of the recency data of the online customers.

Figure 6: Histogram of the recency data of customers — Image by Author.

The code used to generate Figure 6 above can be accessed in the Jupyter notebook here.

Next, I need to assign a recency score to the recency values. This can be achieved by applying the K-means clustering algorithm. However, the number of clusters must be chosen before running the algorithm; the Elbow Method can determine a suitable number of clusters for a given dataset. In our case, with the recency values as our data, the number of clusters comes out to 4. The code used to compute the number of clusters is available in the Jupyter notebook here.
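A sketch of the Elbow Method computation (the plotting is omitted):

    # Sum of squared errors (inertia) for k = 1, ..., 9.
    sse = {}
    for k in range(1, 10):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(ctm_dt[["Recency"]])
        sse[k] = kmeans.inertia_
    # Plotting sse against k shows the elbow, here at k = 4.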

I can now build 4 clusters using the Recency column in the dataframe ctm_dt and create a new column, RecencyCluster, in ctm_dt whose values are the cluster labels predicted by the unsupervised K-means algorithm. Using the user-defined Python function order_cluster, accessible here, I reorder the cluster labels in ctm_dt so that higher RecencyCluster values correspond to more recent customers. The code snippet below produces Figure 7.
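The snippet below is a sketch of this step; the order_cluster function shown is my reconstruction of the helper linked above:

    # Fit 4 clusters on the Recency values.
    kmeans = KMeans(n_clusters=4, random_state=42)
    ctm_dt["RecencyCluster"] = kmeans.fit_predict(ctm_dt[["Recency"]])

    def order_cluster(cluster_field, target_field, df, ascending):
        # Relabel clusters by the mean of the target field so that the
        # labels have a consistent order.
        grouped = df.groupby(cluster_field)[target_field].mean().reset_index()
        grouped = grouped.sort_values(by=target_field, ascending=ascending)
        grouped["index"] = range(len(grouped))
        df = pd.merge(df, grouped[[cluster_field, "index"]], on=cluster_field)
        df = df.drop([cluster_field], axis=1)
        df = df.rename(columns={"index": cluster_field})
        return df

    # Lower recency is better, so sort descending: the most recent
    # customers receive the highest cluster label.
    ctm_dt = order_cluster("RecencyCluster", "Recency", ctm_dt, False)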

Figure 7: Customer Recency Cluster Data

Let us group the dataframe ctm_dt by the cluster values in the column RecencyCluster and fetch the statistical description of the Recency data for each cluster.
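In code, this is a single groupby:

    ctm_dt.groupby("RecencyCluster")["Recency"].describe()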

Figure 8: Statistical Summary of Recency Data against RecencyCluster

From Figure 8 above, it can be observed that cluster 3 contains the most recent customers, whereas cluster 0 contains the most inactive ones.

In the next subsections, I apply the method discussed here to the Frequency and Revenue features.

Frequency

As mentioned earlier, frequency captures the number of times a customer has made a purchase within a particular time frame. This characteristic helps us gauge a customer's loyalty to a specific company or brand. In view of this, it gives the company insight into which marketing strategies to deploy, and at what points in time, in order to reach such customers.

Here, I conduct a similar analysis to the one in the previous subsection (Recency).
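A sketch of the frequency computation, counting the purchase records of each customer in the behaviour window:

    # Number of purchase records per customer.
    ctm_freq = ctm_bhvr_dt.groupby("CustomerID")["InvoiceDate"].count().reset_index()
    ctm_freq.columns = ["CustomerID", "Frequency"]
    ctm_dt = pd.merge(ctm_dt, ctm_freq, on="CustomerID")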

Figure 9: First 5 entries of the main dataset with Frequency

Figure 10 below illustrates the histogram of customers whose purchase frequency is less than 1200.

Figure 10: Histogram of customers with purchase frequency less than 1200 — Image by Author.

The code snippet below assigns a cluster value for the purchase frequency of each customer and sorts the cluster values in decreasing order.
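A sketch of that snippet, reusing the order_cluster helper from the Recency subsection; here a higher frequency is better, so the sort is ascending:

    kmeans = KMeans(n_clusters=4, random_state=42)
    ctm_dt["FrequencyCluster"] = kmeans.fit_predict(ctm_dt[["Frequency"]])

    # The most frequent buyers receive the highest cluster label.
    ctm_dt = order_cluster("FrequencyCluster", "Frequency", ctm_dt, True)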

Figure 11: First five entries of the main dataset with FrequencyCluster

The code snippet below groups the dataframe ctm_dt by the cluster values recorded in the column FrequencyCluster and fetches the statistical description of the Frequency data for each of these cluster values.
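As before, this is a single groupby:

    ctm_dt.groupby("FrequencyCluster")["Frequency"].describe()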

Figure 12: Statistical Summary of Frequency Data against FrequencyCluster

As was the case for Recency, customers with a higher frequency cluster value are better customers. In other words, they patronise the retail shop's products more often than those with a lower frequency cluster value.

Monetary Value/Revenue

To give a little more detail, monetary value, or revenue, centres on the money a customer spends when making a purchase at any point in time. It helps to ascertain how much money a customer is likely to part with when making a purchase. Even though this feature does not help predict when the customer's next purchase will occur, it is still worth knowing how much revenue to expect when the customer does come through with a transaction.

I again follow a similar procedure to obtain a revenue score for each customer and assign each customer a cluster value based on that score.
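A sketch of the revenue computation; the Quantity and UnitPrice column names are assumptions about the raw data:

    # Revenue per record, then total revenue per customer.
    ctm_bhvr_dt["Revenue"] = ctm_bhvr_dt["Quantity"] * ctm_bhvr_dt["UnitPrice"]
    ctm_rev = ctm_bhvr_dt.groupby("CustomerID")["Revenue"].sum().reset_index()
    ctm_dt = pd.merge(ctm_dt, ctm_rev, on="CustomerID")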

Figure 13: First five entries of the main dataset with Revenue

The figure below is a histogram of customers whose revenue is below £10,000.

Figure 14: Histogram of customers with a monetary value below £10000 — Image by Author.

The code snippet below assigns a cluster value for the revenue of each customer and sorts the cluster values in ascending order.
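A sketch of that snippet:

    kmeans = KMeans(n_clusters=4, random_state=42)
    ctm_dt["RevenueCluster"] = kmeans.fit_predict(ctm_dt[["Revenue"]])

    # Higher revenue is better, so sort ascending.
    ctm_dt = order_cluster("RevenueCluster", "Revenue", ctm_dt, True)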

Figure 15: Statistical Summary of Revenue Data against RevenueCluster

Overall Score

In the code snippet below, I add a new column, OverallScore, to the dataframe ctm_dt, whose values are the sum of the cluster values obtained for Recency, Frequency, and Revenue.
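A sketch of that snippet, together with the grouped means shown in Figure 16:

    # Overall score: sum of the three cluster values.
    ctm_dt["OverallScore"] = (ctm_dt["RecencyCluster"]
                              + ctm_dt["FrequencyCluster"]
                              + ctm_dt["RevenueCluster"])
    ctm_dt.groupby("OverallScore")[["Recency", "Frequency", "Revenue"]].mean()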

Figure 16: Mean of the Recency, Frequency and Revenue grouped against OverallScore value

The scoring above clearly shows that customers with an overall score of 8 are the outstanding customers who bring the most value to the company, whereas those with a score of 3 are the least engaged and least reliable.

As a follow-up, I group the customers into segments based on their overall score, as follows:

  • 3 to 4: Low Value
  • 5 to 6: Mid Value
  • 7 to 8: High Value

The code snippet is as follows:
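A sketch of it, with the segment boundaries matching the list above:

    # Map the overall score to a named segment.
    ctm_dt["Segment"] = "Low-Value"
    ctm_dt.loc[ctm_dt["OverallScore"] > 4, "Segment"] = "Mid-Value"
    ctm_dt.loc[ctm_dt["OverallScore"] > 6, "Segment"] = "High-Value"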

Figure 17: First five entries of the main dataset with Segment column

I then create a copy of the dataframe ctm_dt and apply the method get_dummies to it so as to convert the categorical column Segment to indicator variables.
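A sketch of this step; the name ctm_class matches the dataframe used in the correlation computation below:

    # Copy the dataframe and one-hot encode the Segment column.
    ctm_class = ctm_dt.copy()
    ctm_class = pd.get_dummies(ctm_class)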

In pursuance of my goal to estimate whether a customer will make a purchase in the next quarter, I create a new column, NextPurchaseDayRange, whose values are either 1 or 0, defined as follows (a sketch follows the list):

  • If the value is 1, the customer will buy something in the next quarter, i.e., within 90 days of his or her last purchase.
  • The value 0 indicates that the customer will buy something more than 90 days after his or her last purchase.
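A minimal sketch of this labelling step:

    # 1 if the customer buys within 90 days of the last purchase,
    # 0 otherwise.
    ctm_class["NextPurchaseDayRange"] = 1
    ctm_class.loc[ctm_class["NextPurchaseDay"] > 90, "NextPurchaseDayRange"] = 0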

I conclude this section by computing the correlation between our features and the label. I achieve this by applying the corr method to the dataframe ctm_class.
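A sketch of this computation:

    # Correlation between every pair of features and the label.
    corr_matrix = ctm_class.corr()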

Figure 18: Minimum and Maximum Correlation Coefficient

From Figure 18 above, it can be seen that OverallScore has the highest positive correlation, 0.97, with RecencyCluster, and Segment_Low-Value has the highest negative correlation, -0.99, with Segment_Mid-Value.

In Figure 19 below, I present a visualisation of the correlation matrix. The code snippet is below.
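A sketch of that snippet, assuming seaborn is used for the heatmap:

    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure(figsize=(30, 20))
    sns.heatmap(corr_matrix, annot=True, linewidths=0.2, fmt=".2f")
    plt.show()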

Figure 19: Correlation Matrix — Image by Author.

Building Machine Learning Models

At this point, I have all the necessary prerequisites to build the machine learning model. The code snippet below separates the dataframe ctm_class into the X features and the target variable y, splits X and y into training and test datasets, and then measures the accuracy, F₁-score, recall, and precision of the different models.
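A sketch of this step; apart from LogisticRegression and the XGB classifier, which the text names, the exact set of models compared is an assumption:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import (accuracy_score, f1_score,
                                 precision_score, recall_score)
    from xgboost import XGBClassifier

    # Separate the features from the label; the label-related columns
    # and the customer identifier are excluded from X.
    X = ctm_class.drop(["CustomerID", "NextPurchaseDay",
                        "NextPurchaseDayRange"], axis=1)
    y = ctm_class["NextPurchaseDayRange"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    models = {
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "KNeighbors": KNeighborsClassifier(),
        "RandomForest": RandomForestClassifier(random_state=42),
        "XGB": XGBClassifier(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(name,
              accuracy_score(y_test, pred),
              f1_score(y_test, pred),
              recall_score(y_test, pred),
              precision_score(y_test, pred))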

Figure 20: Metric of all models

From the results in Figure 20 above, we see that the LogisticRegression model is the best in terms of the metrics accuracy and F₁-score.

Let’s see whether the XGB Classifier, which ranks fourth in Figure 20 above, can be improved by finding suitable parameters to control the model’s learning process. This process is called hyperparameter tuning. I then use the computation below to check whether the improved XGB Classifier outperforms the LogisticRegression model.
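A sketch of the tuning step; the grid below is an assumption built around the two parameters named in the conclusion, max_depth and min_child_weight:

    from sklearn.model_selection import GridSearchCV

    # Search over the two parameters mentioned in the conclusion.
    param_grid = {"max_depth": [3, 5, 7],
                  "min_child_weight": [1, 3, 5]}
    grid = GridSearchCV(XGBClassifier(), param_grid,
                        scoring="accuracy", cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)

    # Refit the refined model and score it on the test set.
    xgb_refined = XGBClassifier(**grid.best_params_).fit(X_train, y_train)
    print("Refined XGB accuracy:", xgb_refined.score(X_test, y_test))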

Figure 21: XGB Classifier Model — Hyperparameter Tuning
Figure 22: Refined XGB Classifier Accuracy Score

Selecting Model

Comparing the accuracy of the LogisticRegression model in Figure 20 above with that of the refined XGB classifier in Figure 22 above, it is obvious that the refined XGB classifier is more accurate than the LogisticRegression model, by a margin of 0.1. How about the other metrics?

Figure 23: Metric scores of XGB and LogisticRegression Classifiers

It is obvious from the output in Figure 23 above that on every metric (accuracy, F₁-score, recall, and precision), the refined XGB classifier outperforms the LogisticRegression model.

When forecasting whether a customer will make another purchase at the online retail shop within 90 days of their last purchase, we need our predictions to be accurate. As a result, I am interested in the model with the highest possible accuracy, and so the best option is the improved XGB classifier model over the LogisticRegression model.

Conclusion

From the dataset, I highlighted the fact that the online shop's strong customer base in the United Kingdom is a major reason for the high revenue the company earns from that region.

I also gave a detailed demonstration of how to build a machine learning model that predicts whether an online customer of the retail shop will make their next purchase within 90 days of their last purchase. Among the models I used, the XGB classifier had to be further improved through hyperparameter tuning before it outperformed the LogisticRegression model. The first round of tuning, with max_depth and min_child_weight both set to 3, did not beat the LogisticRegression model, so I had to tweak these values heuristically to get the XGB classifier to come out ahead.

The above notwithstanding, it would be interesting to investigate in further work how the model's accuracy and F₁-score could be improved still further. I suggest improving the dataset by introducing the “right” X features so as to avoid the need for a hyperparameter tuning process. My question, then, stands:

What X features would be appropriate to introduce into the dataset to reach or exceed the model's heightened accuracy and F₁-score without hyperparameter tuning?

The Jupyter notebook used for this article is available here.

Reference

[1] Barış Karaman. Predicting Next Purchase Day. Accessed April 28, 2021.
