
What business people really mean when they say segmentation

and how to create an effective segmentation analysis using Python


Photo by Mel Poole on Unsplash

It’s time to brush up on my Python skills again! 👩🏻 ‍💻

This week I want to share a project I did at my former employer, a German fintech startup. On this project, I worked with the marketing team to deliver a user segmentation analysis in Python. It eventually drove the transition from an opportunistic growth strategy to a sharp brand positioning and targeting strategy, and helped grow the customer base from 5K to 400K.

In this post, I would love to share my approach using a Kaggle dataset, combined with my takeaways from a recent course, "Managing Customer Value," in my MBA program. If you are curious what my first week at the INSEAD MBA looked like, check this post for more details!

In addition to this post, you can find a Kaggle notebook that implements the approach described here on a different dataset: https://www.kaggle.com/yaowenling/clustering-analysis-and-clv

If you are interested in how I did a specific segmentation analysis based on the RFM model, check here!

Below are my key takeaways:

🎯 When business people talk about segmentation, they are not looking for groups of data points, but a set of clear and actionable criteria with which one can identify distinct, big enough, and value-driven segments that pass the reality check.

🧩 It is not sufficient to use one algorithm only. To achieve the goal we need to combine regression, clustering, mean comparison, and synthesis.

🕵🏻 ‍♀️ It is not sufficient to look at internal data only. To make sure our segments pass the reality check we need to work closely with the Product UX researchers and the brand marketing team.


The real business question

During the second week after I joined my former employer, the CEO approached me with the following question: "This is our second year. We have accumulated a customer base of 5K. The CPA looks good. The product looks good. We are also making revenue. Where should we head next? Who are our most valuable customers (MVC)? What do they look like? How can we acquire more of them?"

We defined how the expected output should look:

  • Segments should be distinct enough. We want to classify our customers into some groups. Individuals within each group should be as "close" to each other as possible, and individuals across groups should be as "far" from each other as possible.
  • Segments should be big enough. The groups we identify should be representative of our customers. In other words, with a total customer base of 5K, each identified group should be larger than 10% of the total population. Ideally, we also want to limit the number of groups to a maximum of 5.
  • Segments should be value-driven. The identified groups should demonstrate different values to the business. In the best case, we are able to identify a customer group that accounts for 20% of the customer base yet contributes 80% of the revenue – the famous 80/20 rule.
  • Bonus: segments should pass the reality check. Suppose we have identified a perfectly distinct customer group that meets the 80/20 requirement above, yet these customers have an average income of over 200K. We might need to reconsider the criteria, because there might not be enough people with such a high income on the market to acquire at a reasonable CPA. Or imagine that we have identified a valuable customer group, yet our user research suggests that users with such characteristics demand features that are clearly out of the scope of our product. Then we should also reconsider our classification criteria (or reconsider the scope of our product :)). Addressing these two considerations takes more than looking at our user data: it requires extensive collaboration with UX researchers and the marketing team. I had a chance to work with talented colleagues in these two areas in the second iteration of my analysis, and I will share some learnings at the end of this post.
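The "big enough" and "value-driven" requirements above can be checked with a few lines of pandas. The snippet below is only a minimal sketch: the `customers` table, its column names, and the revenue figures are made up for illustration, and the thresholds simply mirror the 10% and 5-segment limits mentioned above.

```python
import pandas as pd

# Hypothetical toy data: one row per customer, with an assigned segment and revenue.
customers = pd.DataFrame({
    "segment": ["A"] * 2 + ["B"] * 5 + ["C"] * 3,
    "revenue": [400.0, 400.0, 20.0, 20.0, 20.0, 20.0, 20.0, 30.0, 40.0, 30.0],
})

MIN_SHARE = 0.10    # each segment should exceed 10% of the customer base
MAX_SEGMENTS = 5    # and we want at most 5 segments

# Size and revenue per segment
summary = customers.groupby("segment").agg(
    n_customers=("revenue", "size"),
    total_revenue=("revenue", "sum"),
)
summary["customer_share"] = summary["n_customers"] / len(customers)
summary["revenue_share"] = summary["total_revenue"] / customers["revenue"].sum()
summary["big_enough"] = summary["customer_share"] > MIN_SHARE

few_enough = len(summary) <= MAX_SEGMENTS
print(summary)
print("At most", MAX_SEGMENTS, "segments:", few_enough)
```

In this toy example, segment A is the 80/20 case: 20% of the customers and 80% of the revenue.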

An approach combining regression, clustering, and mean comparison, and most importantly, synthesis

Coming from a quantitative background, I initially believed that we need a clustering analysis to generate the segments and regression analysis to identify predictors of customer value. Very soon I realized that doing these two analyses independently won’t help answer the question we asked.

Why a simple regression analysis is not sufficient

Usually when we ask questions like "what kind of customers give us higher revenue?", our instinct is to take revenue or CLV (customer lifetime value) as a dependent variable and run a regression to find out what predictors can help explain the variance in revenue. However, the direct output from a regression model is not enough.

For example, a regression model may show that, all other things being equal, younger customers generate higher revenue than older customers, and customers with higher income generate higher revenue. These insights are great, but each of them is based on a single dimension. Segmentation criteria, however, are always multi-dimensional.

That said, regression analysis still makes perfect sense – with it one can have a general idea of what variables are influential and should be included in our analysis.
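As a rough illustration of using regression to screen variables, the sketch below fits an ordinary least squares model on standardized predictors with NumPy and ranks them by the size of their coefficients. Everything here is an assumption for illustration: the data is synthetic, the variable names (`age`, `income`, `household_size`) are hypothetical, and `household_size` is irrelevant by construction. In a real analysis you would probably reach for a statistics package that also reports standard errors and p-values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic customers (hypothetical variables, for illustration only)
age = rng.uniform(18, 75, n)
income = rng.normal(50, 15, n)                      # in thousands
household_size = rng.integers(1, 6, n).astype(float)
# CLV depends on age and income only; household_size has no effect by construction
clv = 300 - 2.0 * age + 3.0 * income + rng.normal(0, 10, n)

names = ["age", "income", "household_size"]
X = np.column_stack([age, income, household_size])

# Standardize predictors so coefficient sizes are comparable across variables
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
design = np.column_stack([np.ones(n), X_std])       # add an intercept column

coef, *_ = np.linalg.lstsq(design, clv, rcond=None)
effects = dict(zip(names, coef[1:]))                # drop the intercept

# Rank predictors by absolute standardized effect
ranked = sorted(names, key=lambda k: abs(effects[k]), reverse=True)
for name in ranked:
    print(f"{name:>15}: {effects[name]:+.2f}")
```

The ranking only tells us which variables deserve a place in the later steps; it says nothing yet about how they combine into segments.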

Why a simple clustering analysis is not sufficient

Just to recap what a clustering analysis is and what it brings to us:

  • Clustering is an unsupervised machine learning technique: there is no target variable, so by itself it neither predicts anything nor explains variation in an outcome such as revenue.
  • Suppose each record in our dataset is an individual user and there are typical demographic variables such as gender, age, and job. The output of clustering is a new column with a cluster label (e.g., 1, 2, 3, …k, where k = the number of generated clusters) assigned to each user.
  • One could evaluate the clustering performance by checking the distance between data points within and across clusters. However, there are no right or wrong answers, or better or worse answers.
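To make the recap concrete, here is a minimal sketch using scikit-learn's `KMeans` on synthetic data. The choice of k-means, the two features (age and income), and all the numbers are assumptions for illustration, not necessarily what I used in the original project. Note that scikit-learn numbers the cluster labels from 0 to k-1 rather than 1 to k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two synthetic groups of users: columns are age and income (in thousands)
young = np.column_stack([rng.normal(27, 4, 200), rng.normal(40, 8, 200)])
older = np.column_stack([rng.normal(50, 4, 200), rng.normal(80, 8, 200)])
X = np.vstack([young, older])

# Scale features so that no single variable dominates the distance
X_scaled = StandardScaler().fit_transform(X)

# Assign a cluster label (0 .. k-1) to each user
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

# Silhouette score compares within-cluster and across-cluster distances
score = silhouette_score(X_scaled, labels)
print("cluster sizes:", np.bincount(labels))
print(f"silhouette score: {score:.2f}")
```

The silhouette score is one way to summarize the within-cluster and across-cluster distances mentioned above; it helps compare settings, but it does not tell you whether the clusters are useful for the business.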

The direct output of a clustering analysis, however, does not answer our questions, because in most cases the clusters do not form "good" descriptive criteria by themselves. For example, suppose we end up with two Clusters X and Y and examine their descriptive statistics: we might see that the age range for both clusters is 18–75. No matter how fine-tuned your algorithm is, you will always find some observations in a cluster that look like outliers in certain dimensions. In Cluster X, where the average age is 25, you might find a user who is 70 or 18 and who was assigned to Cluster X because they are extremely close to the other cluster members in all other dimensions. Comparing the average age of each cluster, however, may give us a hint that these clusters do differ in their age distribution.

Hence, to generate clear and actionable criteria, one needs to go one level deeper and generate learnings about the cluster characteristics – I call this process "synthesis". In the example above, one can derive a more reasonable age range by looking at the 10th and 90th percentiles of each cluster, where the exact cutoff depends on your tolerance. We may find that 80% of users in Cluster X fall into the age range 20–35 whereas for Cluster Y the range is 45–55. Similarly, we may find that users in both Cluster X and Cluster Y have all types of jobs, yet a closer look reveals that 80% of users in Cluster X are developers, and 80% in Cluster Y work in the financial industry.
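The synthesis step can be sketched with pandas as follows. The `users` table is a tiny made-up example with hypothetical column names, built so that both clusters contain age outliers; the 10th and 90th percentiles are just one possible cutoff.

```python
import pandas as pd

# Hypothetical users with a cluster label from a previous clustering step
users = pd.DataFrame({
    "cluster": ["X"] * 10 + ["Y"] * 10,
    "age": [18, 20, 22, 24, 25, 26, 28, 30, 35, 70,
            30, 45, 46, 48, 50, 51, 52, 54, 55, 75],
    "job": ["developer"] * 8 + ["teacher", "retired"]
         + ["finance"] * 8 + ["developer", "retired"],
})

# The raw min/max ranges overlap heavily because of a few outliers ...
print(users.groupby("cluster")["age"].agg(["min", "max"]))

# ... whereas the 10th to 90th percentile range covers the middle 80% of each cluster
age_range = users.groupby("cluster")["age"].quantile([0.10, 0.90]).unstack()
age_range.columns = ["p10", "p90"]
print(age_range)

# Share of each job within each cluster
job_share = users.groupby("cluster")["job"].value_counts(normalize=True)
print(job_share)
```

With these toy numbers, the min/max ranges (18–70 and 30–75) look similar, while the percentile ranges and the dominant job per cluster make the difference between X and Y visible.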

The next step is to use these summarised learnings to define your segments. Sticking to the same example, and recalling that 80% of people in Cluster X are between 20 and 35 and that 80% are developers, I can define the criteria as follows: 1) age between 20 and 35; 2) job is developer. I then use these filters to generate a segment called "young developers". Note that this segment is necessarily smaller than Cluster X as given by the clustering algorithm: if age and job are independent within the cluster, only 80% × 80% = 64% of its members meet both criteria, so the segment is 36% smaller. If a segment turns out to be too small (e.g., <10% of the total population), we need to consider merging it with other segments.
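Here is a small sketch of turning the criteria into a filter and checking how much the segment shrinks. The `cluster_x` table and the total customer count are hypothetical, constructed so that 80% of members are in the age range, 80% are developers, and the two attributes are independent.

```python
import pandas as pd

# Hypothetical members of Cluster X: 80% aged 20-35, 80% developers
cluster_x = pd.DataFrame({
    "age": [25] * 20 + [60] * 5,
    "job": ["developer"] * 16 + ["other"] * 4 + ["developer"] * 4 + ["other"] * 1,
})
TOTAL_CUSTOMERS = 100   # hypothetical size of the whole customer base

# Apply the two criteria as filters (between() is inclusive on both ends)
in_age_range = cluster_x["age"].between(20, 35)
is_developer = cluster_x["job"] == "developer"
young_developers = cluster_x[in_age_range & is_developer]

retained = len(young_developers) / len(cluster_x)   # share of Cluster X kept
shrinkage = 1 - retained                            # share of Cluster X lost
share_of_base = len(young_developers) / TOTAL_CUSTOMERS
needs_merging = share_of_base < 0.10                # the "big enough" rule

print(f"retained: {retained:.0%}, shrinkage: {shrinkage:.0%}")
print(f"share of customer base: {share_of_base:.0%}, merge? {needs_merging}")
```

When the attributes are correlated (for example, if most developers are also young), the real shrinkage will be smaller than 36%, so it is worth computing it from the data rather than relying on the back-of-the-envelope number.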

Evaluate customer value across defined clusters

After generating clusters, we want to check how valuable these clusters are. One way to do that is to compare the average CLV across clusters. Alternatively, one can use other monetization metrics such as the average number of orders, conversion rate, CLV/CPA, etc.

It is worth noting that these value-related metrics should not be used as the input for the clustering algorithm in our previous steps. This is because when we define clusters we want to find criteria that we can use to identify customers on the market, whereas these value-related metrics (e.g., user activity, revenue, CPA) are only generated once users are acquired. However, if we ask a different question "can we group our customers by their engagement patterns?", then we can use these value-related metrics for clustering.

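When comparing average customer value across clusters, it is also worthwhile to check whether the differences are statistically significant, for example with Tukey's HSD (honestly significant difference) test, which compares all pairs of groups.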
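A minimal sketch of such a pairwise comparison, using `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later). The CLV values are randomly generated for illustration and are not real results; only the segment names are borrowed from this post.

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(1)

# Synthetic CLV samples per segment (made-up numbers, for illustration only)
clv = {
    "Young Professionals": rng.normal(120, 20, 200),
    "Students": rng.normal(80, 20, 200),
    "Urban Families": rng.normal(82, 20, 200),
}
names = list(clv)

# Tukey's HSD compares every pair of groups while controlling the family-wise error rate
result = tukey_hsd(*clv.values())

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        diff = result.statistic[i, j]    # difference in group means
        p = result.pvalue[i, j]
        print(f"{names[i]} vs {names[j]}: mean diff = {diff:.1f}, p = {p:.4f}")
```

A small p-value for a pair suggests that the difference in average CLV between those two segments is unlikely to be due to chance alone.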

In the end, to entertain our business stakeholders, we can create a 2D plot of the manually defined clusters based on PCA (Principal Component Analysis) to show how distinct the clusters are from each other!
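For the plot, one possible sketch uses scikit-learn's `PCA` to project the features onto two components and matplotlib to draw them. The helper names `project_to_2d` and `plot_segments`, the synthetic features, and the segment labels are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_to_2d(features):
    """Scale the features and project them onto the first two principal components."""
    scaled = StandardScaler().fit_transform(features)
    return PCA(n_components=2).fit_transform(scaled)


def plot_segments(coords, segment_labels):
    """Scatter plot of the 2D coordinates, one colour per segment."""
    import matplotlib.pyplot as plt   # imported here so the projection works without it

    fig, ax = plt.subplots()
    for name in np.unique(segment_labels):
        mask = segment_labels == name
        ax.scatter(coords[mask, 0], coords[mask, 1], label=str(name), alpha=0.6)
    ax.set_xlabel("Principal component 1")
    ax.set_ylabel("Principal component 2")
    ax.legend()
    return fig


# Synthetic example: two segments described by four numeric features
rng = np.random.default_rng(7)
features = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (100, 4))])
segment_labels = np.array(["Segment A"] * 100 + ["Segment B"] * 100)

coords = project_to_2d(features)
print(coords[:3])
```

Calling `plot_segments(coords, segment_labels).savefig("segments_pca.png")` would then write the chart to a file. Keep in mind that two components rarely capture all of the variance, so segments that overlap in the plot may still be well separated in the full feature space.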


The business impact

With the analysis I did during my second week, we were able to identify one segment we called "Young Professionals" which showed significantly higher revenue performance than other segments such as "Students", "Urban Families" and "Veteran". I communicated the results to the CEO and CMO. Very soon, we switched from mass marketing with the key message "our app helps you save money" to a targeted message towards the young generation featuring our digital solution. At the same time, we started to shift our marketing budget from TV to digital channels (e.g., CPC, social media, and in-app ads) because we believed they were the best channels to attract our Most Valuable Customers (MVC).

Two years later, when our customer base had grown to around 100K, I was tasked with revisiting this analysis. I replicated the same approach. The good news was that we saw an increase in the share of "Young Professionals", and these customers remained equally valuable (i.e., their average revenue contribution was not diluted even though we had been acquiring more of their type). What was new to me personally this time was that I got a chance to collaborate with our product UX researchers and the brand marketing team. It was a really valuable experience because I saw how my insights were used to support their user research and further market research with external brand agencies.


That’s it! Let me know if you have thoughts or questions!

