
Cohort Analysis is a very useful and relatively simple technique that provides valuable insights into the behavior of any business's customers/users. For the analysis, we can focus on different metrics (dependent on the business model): conversion, retention, generated revenue, etc.
In this article, I provide a brief theoretical introduction to Cohort Analysis and show how to carry it out in Python.
Introduction to Cohort Analysis
Let’s start with the basics. A cohort is a group of people sharing something in common, such as the date they signed up for an app, the month of their first purchase, their geographical location, their acquisition channel (organic users, users coming from performance marketing, etc.), and so on. In Cohort Analysis, we track these groups of users over time to identify common patterns or behaviors.
When carrying out the cohort analysis, it is crucial to consider the relationship between the metric we are tracking and the business model. Depending on the company’s goals, we can focus on user retention, conversion ratio (signing up to the paid version of the service), generated revenue, etc.
In this article, I cover the case of user retention. By understanding user retention, we can infer the stickiness/loyalty of the customers and evaluate the health of the business. It is important to remember that the expected retention values vary greatly between businesses: three purchases a year might be a lot for one retailer, while for another it might be far too few.
Retaining customers is critical for any business, as it is far cheaper to keep the current customers (by using CRM tools, member discounts, etc.) than to acquire new ones.
Furthermore, cohort analysis can also help us observe the impact of product changes on user behavior, be it design changes or entirely new features. By seeing how the groups behave over time, we can assess whether our efforts had a noticeable effect on the users.
This should be enough theory for now; let’s move on to the real-life example.
Setup
In this article, we will be using the following libraries:
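A minimal setup that covers all the steps below could look as follows (assuming we stick to pandas and numpy for data wrangling, and matplotlib/seaborn for plotting):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns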
The dataset
We will use a dataset downloaded from the UCI Machine Learning Repository, which is a great source for all kinds of datasets. They are already labeled according to the area of machine learning they can be used for:
- supervised (regression/classification),
- unsupervised (clustering).
You can find the dataset here. Alternatively, you can download the data directly from the Jupyter Notebook using the following line:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx
The dataset can be briefly described as: "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."
Next, we load the data from the Excel file.
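Assuming the file sits in the working directory, this boils down to a single pd.read_excel call (note that reading .xlsx files requires an Excel engine such as openpyxl to be installed):

df = pd.read_excel('Online Retail.xlsx')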
The loaded DataFrame contains the following columns: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.
We also inspect the DataFrame using df.info() to see if there are missing values. As the analysis requires customer IDs, we drop all the rows without them.
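# drop rows without a customer ID, as they cannot be assigned to any cohort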
df.dropna(subset=['CustomerID'], inplace=True)
For completeness’ sake, we also do a very quick exploratory data analysis (EDA), with a focus on the users. EDA is always a very important step of any analysis, as it is how we discover the specifics of the dataset we are working with.
We start by inspecting the distribution of the numeric variables – quantity and unit price.
df.describe().transpose()

From the summary statistics above, we can see that there are orders with a negative quantity – most likely returns. In total, there are around 9 thousand transactions with a negative quantity, and we remove them from the dataset. This introduces a kind of bias, as we keep the initial orders and remove only the returns – this way, an initial order is taken into account even though in theory it was not realized and did not generate revenue. However, for simplicity, we leave the initial orders in, as for a metric such as retention (which indicates the customers’ engagement) this should still be a valid assumption.
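In code, this is a simple filter:

# keep only actual purchases, i.e., rows with a positive quantity
df = df[df['Quantity'] > 0]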
Then, we calculate an aggregate metric indicating how many orders were placed by each customer.
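One possible implementation is to count the unique invoice numbers per customer and then calculate the share of customers with more than one order:

n_orders = df.groupby('CustomerID')['InvoiceNo'].nunique()
mult_orders_perc = np.sum(n_orders > 1) / df['CustomerID'].nunique()
print(f'{100 * mult_orders_perc:.2f}% of customers ordered more than once.')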
Using the code above, we can state that 65.57% of customers ordered more than once. This is already a valuable piece of information, as it seems that the customers place multiple orders, which means that there will be at least some retention. Given that the dataset has no sign-up/join date, it would be problematic if the majority of the users had only placed one order, but we will get back to this later.
Additionally, we look at the distribution of the number of orders per customer. For that, we can reuse the previously aggregated data (n_orders) and plot it on a histogram.
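For example, with seaborn (the number of bins is an arbitrary choice here):

ax = sns.histplot(n_orders, bins=100)
ax.set(title='Distribution of the number of orders per customer',
       xlabel='# of orders',
       ylabel='# of customers')
plt.show()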
Running the code generates a histogram of the distribution of the number of orders per customer.
There are some infrequent cases of customers who ordered more than 50 times.
Cohort Analysis
The dataset we are using for this example does not contain the customer sign-up date – the date when they registered with the retailer. That is why we assume that the cohort a customer belongs to is based on their first purchase date. A possible downside of this approach is that the dataset does not contain past data, and what we see in this snapshot (between 01/12/2010 and 09/12/2011) already includes recurring clients. In other words, the first purchase we see in this dataset might not be the actual first purchase of a given client. However, there is no way to account for this without access to the retailer’s entire historical dataset.
As the first step, we keep only the relevant columns and drop duplicated values – one order (indicated by InvoiceNo) can contain multiple items (indicated by StockCode).
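One way to do this:

df = df[['CustomerID', 'InvoiceNo', 'InvoiceDate']].drop_duplicates()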
As the second step, we create the cohort and order_month variables. The first one indicates the monthly cohort based on the first purchase date (calculated per customer). The latter one is the truncated month of the purchase date.
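A sketch using pandas’ monthly Period type for the truncation:

# month of each purchase
df['order_month'] = df['InvoiceDate'].dt.to_period('M')
# month of the customer's first purchase, broadcast to all of their rows
df['cohort'] = df.groupby('CustomerID')['InvoiceDate'].transform('min').dt.to_period('M')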
Then, we aggregate the data per cohort and order_month and count the number of unique customers in each group. Additionally, we add the period_number, which indicates the number of periods between the cohort month and the month of the purchase.
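This could be done as follows; subtracting two monthly Periods yields an offset whose n attribute is the number of months between them:

df_cohort = (
    df.groupby(['cohort', 'order_month'])
      .agg(n_customers=('CustomerID', 'nunique'))
      .reset_index()
)
df_cohort['period_number'] = (df_cohort['order_month'] - df_cohort['cohort']).apply(lambda diff: diff.n)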
The next step is to pivot the df_cohort table in such a way that each row contains information about a given cohort and each column contains values for a certain period.
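Using pivot_table:

cohort_pivot = df_cohort.pivot_table(index='cohort',
                                     columns='period_number',
                                     values='n_customers')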
To obtain the retention matrix, we need to divide the values in each row by the row’s first value, which is actually the cohort size – all customers who made their first purchase in the given month.
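In pandas, this is a row-wise division:

cohort_size = cohort_pivot.iloc[:, 0]
retention_matrix = cohort_pivot.divide(cohort_size, axis=0)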
Lastly, we plot the retention matrix as a heatmap. Additionally, we want to include extra information regarding the cohort size. That is why we in fact create two heatmaps, where the one indicating the cohort size uses a white-only colormap – no coloring at all.
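A sketch of such a plot: two heatmaps sharing the y-axis, the narrow left one showing the cohort sizes on a white colormap and the right one showing the retention percentages (the figure size and the RdYlGn colormap are arbitrary choices):

with sns.axes_style('white'):
    fig, ax = plt.subplots(1, 2,
                           figsize=(12, 8),
                           sharey=True,
                           gridspec_kw={'width_ratios': [1, 11]})

    # cohort sizes as a single-column 'heatmap' with a white-only colormap
    cohort_size_df = pd.DataFrame(cohort_size).rename(columns={0: 'cohort_size'})
    white_cmap = mcolors.ListedColormap(['white'])
    sns.heatmap(cohort_size_df, annot=True, cbar=False, fmt='g',
                cmap=white_cmap, ax=ax[0])

    # the retention matrix itself, annotated with percentages
    sns.heatmap(retention_matrix, mask=retention_matrix.isnull(),
                annot=True, fmt='.0%', cmap='RdYlGn', ax=ax[1])
    ax[1].set_title('Monthly Cohorts: User Retention', fontsize=16)
    ax[1].set(xlabel='# of periods', ylabel='')

    fig.tight_layout()
    plt.show()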
The end result is the following retention matrix:

In the retention matrix, we can see that there is a sharp drop-off already in the second month (indexed as 1): on average, around 80% of customers do not make any purchase in the second month. The first cohort (2010–12) seems to be an exception and performs surprisingly well compared to the other ones, still showing 50% retention a year after the first purchase. This might be a cohort of dedicated customers who first joined the platform based on some already-existing connections with the retailer. However, from the data alone, that is very hard to explain accurately.
Throughout the matrix, we can see fluctuations in retention over time. These might be caused by the characteristics of the business, where clients make periodic purchases followed by periods of inactivity.
Conclusions
In this article, I showed how to carry out Cohort Analysis using Python’s pandas and seaborn. Along the way, I made some simplifying assumptions, mostly due to the nature of the dataset. When working on a real-life scenario for a company, we would have a better understanding of the business and could draw more meaningful conclusions from the analysis.
You can find the code used for this article on my GitHub. As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.
Liked the article? Become a Medium member to continue learning by reading without limits. If you use this link to become a member, you will support me at no extra cost to you. Thanks in advance and see you around!
I recently published a book on using Python for solving practical tasks in the financial domain. If you are interested, I posted an article introducing the contents of the book. You can get the book on Amazon or Packt’s website.