Customer Journey-Based Segmentation for Marketplaces

Know Your Customers With the Help of a Clustering Analysis: How to Extract Actionable Insights Into Your Customers’ Lifecycle Stages

Photo by Margarida CSilva on Unsplash


You have very likely come across the idea of ‘Know Your Audience’. Most businesses use it to identify different customer groups and their respective needs. Its value lies in understanding and reaching customers better, which lets a business deliver its content and messages in a more efficient and personalised way.

This concept gives businesses a chance to move from one-size-fits-all to more customer-centric strategies.

Customer Segmentation is a broad, commonly used term for applying the ‘Know Your Audience’ concept to e-businesses. There are dozens of ways to do it, and in this article we would like to share our way of doing Customer Segmentation for the marketplaces in eCG (eBay Classifieds Group), based on a Customer Journey Analysis.

The rest of the article will follow the outline below:

Problem Definition

  • What do we try to accomplish?
  • What’s the main objective?
  • For which user group is this analysis done?
  • For which eCG marketplace brand is the model developed?

Data Exploration

  • An overview of Customer Journey
  • Short Descriptions Of Viewers’ Features
  • Short Descriptions Of Repliers’ Features
  • Determining Time Period for the Analysis


  • Assumptions of K-Means Clustering
  • Step 1: Outliers Removal
  • Step 2: Data Rescaling
  • Step 3: Giving Weights to Dimensions
  • Step 4: K-Means Clustering


  • Short Descriptions and Indicators of Viewer Clusters
  • Short Descriptions and Indicators of Replier Clusters
  • Clusters In Detail & Transitions between Clusters


Problem Definition

What do we try to accomplish?

In this project, we want to group our users into smaller cohorts depending on their current stage in the Customer Journey. The main question is therefore: ‘Can we clearly identify meaningful journey stages of a complete user experience on our marketplace platform?’ Answering it lets us define user clusters with different levels of engagement with our platforms.

What’s the main objective?

The ultimate goal of this project is to target our users differently depending on their current status on the platforms; in other words, ‘Personalised Targeting’. With the help of this model, we can differentiate our customer targeting strategies to reach users more effectively and increase the number of active users on the platforms.

For which user group is this analysis done?

The question we asked ourselves at this stage was whether to go for Buyers or Sellers. Buyers look for a product to buy on the platform, whereas sellers are there to sell their items by posting a new listing. Buyers generate far more user actions on a marketplace platform than sellers do (viewing an item, saving an item, sending a message to a user), so we decided to run this Customer Journey analysis for our buyers only. A similar analysis could be done for sellers as well, but for now we stick with buyers for the sake of simplicity and efficiency. In the rest of the article, we will therefore sometimes refer to the problem as the ‘Buyer Journey’.

On a marketplace platform, there are mainly 2 user types: B2C (Dealers) and C2C (Individuals). Dealers are the main source of ads posted on our platforms and account for a significant part of our sellers, so it does not make sense to assign them to one of the buyer journey stages. To protect the analysis from the misleading effect of dealer data, we excluded all dealers and ran the entire analysis on our C2C users only.

For which eCG marketplace brand is the model developed?

eBay Classifieds Group (eCG) is an umbrella company that currently runs 14 classifieds platforms around the world. There are 2 platform types under the eCG brand: Horizontal and Vertical Marketplaces. A horizontal marketplace is a platform consisting of many different categories, as in the case of our Dutch tenant ‘Marktplaats’. A vertical marketplace, in eCG, is a platform dedicated to a specific product category: Autos.

This project was developed for one of eCG’s vertical platforms: Kijiji Autos, the leading marketplace brand in Canada where people can buy or sell cars.


Data Exploration

An overview of Customer Journey

On an e-commerce platform such as eBay, where we know about users’ transactions or purchases, a customer journey mostly starts with signing up and ends with buying a product. For that reason, these journeys are generally called ‘Purchase Funnels’ or ‘Conversion Funnels’ for those businesses. In our classifieds business, however, things are not that well-defined. We therefore need to approach the problem more delicately, especially in the first step of the project: defining the factors (features) that will be used to form the customer journey stages in the following steps.

During the Exploration Step, we discovered that there are mainly 2 different groups among the C2C users on our marketplace platform: Viewers and Repliers.

Viewers: Users who have not replied to a listing recently and are just browsing our platform.

Repliers: Users who have replied to at least 1 listing recently.

The main reason we first split all our users into these two sub-groups is that they exhibit different behavioural characteristics. For users with no recent reply (Viewers), we can only track their browsing actions on the platform, whereas for Repliers we prefer to focus on their past messaging actions rather than their browsing.
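
As an illustration, the Viewer/Replier split could be sketched in pandas as follows. The event log, its column names and the use of the 60-day window for "recently" are assumptions made for this example; the project itself extracted features with Spark.

```python
import pandas as pd

# Hypothetical event log: one row per user action, with the action's age in days.
events = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3],
    "action":   ["view", "reply", "view", "view", "view", "reply"],
    "days_ago": [3, 10, 1, 5, 2, 75],
})

WINDOW_DAYS = 60  # assumed meaning of "recently" in this sketch


def split_user_groups(events: pd.DataFrame, window_days: int = WINDOW_DAYS) -> pd.Series:
    """Label each user as 'Replier' (at least 1 reply in the window) or 'Viewer'."""
    recent = events[events["days_ago"] <= window_days]
    has_reply = recent["action"].eq("reply").groupby(recent["user_id"]).any()
    return has_reply.map({True: "Replier", False: "Viewer"}).rename("user_group")


groups = split_user_groups(events)
print(groups)
```

User 3 replied 75 days ago, which falls outside the window, so that user is treated as a Viewer.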

Here is the list of features extracted for Viewers and Repliers separately:

Feature lists for 2 different user groups: Viewers-no reply and Repliers-with reply (Image by Author)
Short Descriptions Of Viewers’ Features

Days since last Ad view: Number of days since the user last viewed a listing on the platform.

Number of visiting days: Number of distinct days on which the user visited the platform.

Number of distinct Ad views: Number of different listings viewed by the user.

Number of Ads favourited: Number of listings saved by the user.

Interest in a specific Ad: The ratio of the number of days on which the user viewed their most viewed listing to the user’s total number of visiting days.

Interest in a group of Ads: The ratio of the number of distinct listings viewed by the user to the total number of listing views by that user.
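
To make these definitions concrete, here is a minimal pandas sketch of the Viewer features. The two input tables, all column names and the reference date are hypothetical. One further assumption: the "most viewed listing" is taken to be the listing the user viewed on the largest number of distinct days.

```python
import pandas as pd

# Hypothetical view log: one row per ad view (column names are illustrative).
views = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "ad_id":   ["a", "a", "b", "c", "c", "d"],
    "date":    pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-02",
                               "2020-01-05", "2020-01-03", "2020-01-03"]),
})
# Hypothetical favourites log: one row per saved ad.
favourites = pd.DataFrame({"user_id": [1, 1, 2], "ad_id": ["a", "b", "c"]})
reference_date = pd.Timestamp("2020-01-10")  # "today" for the recency feature


def viewer_features(views, favourites, reference_date):
    by_user = views.groupby("user_id")
    feats = pd.DataFrame({
        "days_since_last_view": (reference_date - by_user["date"].max()).dt.days,
        "n_visiting_days": by_user["date"].nunique(),
        "n_distinct_ad_views": by_user["ad_id"].nunique(),
        "n_total_ad_views": by_user["ad_id"].size(),
    })
    feats["n_ads_favourited"] = (
        favourites.groupby("user_id").size().reindex(feats.index, fill_value=0)
    )
    # Distinct viewing days per (user, ad) pair; keep each user's top ad.
    days_per_ad = views.groupby(["user_id", "ad_id"])["date"].nunique()
    top_ad_days = days_per_ad.groupby(level="user_id").max()
    feats["interest_in_specific_ad"] = top_ad_days / feats["n_visiting_days"]
    feats["interest_in_group_of_ads"] = (
        feats["n_distinct_ad_views"] / feats["n_total_ad_views"]
    )
    return feats.drop(columns="n_total_ad_views")


features = viewer_features(views, favourites, reference_date)
print(features)
```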

Short Descriptions Of Repliers’ Features

Days since last reply: Number of days since the user last replied to a seller on the platform.

Number of conversations: Number of conversations with different sellers.

Number of replied days: Number of days on which the user replied to sellers on the platform.

Max number of days spent on a particular Ad: Maximum number of days on which the user replied to a seller about the same listing.

Number of Ads favourited: Number of listings saved by the user.

Average conversation length: The ratio of the total number of replies by the user to the number of conversations that user had.

Number of replies in the longest conversation: Number of messages sent to the seller in the user’s longest conversation.
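
The Replier features can be sketched the same way. The reply log and its column names are hypothetical, and the sketch assumes one conversation per (user, listing) pair. ‘Number of Ads favourited’ is left out because it would be computed exactly as for Viewers.

```python
import pandas as pd

# Hypothetical reply log: one row per message a buyer sent about a listing.
replies = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2],
    "ad_id":   ["a", "a", "a", "b", "c"],
    "date":    pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-04",
                               "2020-01-04", "2020-01-02"]),
})
reference_date = pd.Timestamp("2020-01-10")


def replier_features(replies, reference_date):
    # Assumption: one conversation per (user, listing) pair.
    per_conv = replies.groupby(["user_id", "ad_id"])["date"].agg(
        n_replies="count", n_days="nunique"
    )
    conv_by_user = per_conv.groupby(level="user_id")
    by_user = replies.groupby("user_id")["date"]
    feats = pd.DataFrame({
        "days_since_last_reply": (reference_date - by_user.max()).dt.days,
        "n_conversations": conv_by_user.size(),
        "n_replied_days": by_user.nunique(),
        "max_days_on_one_ad": conv_by_user["n_days"].max(),
        "replies_in_longest_conversation": conv_by_user["n_replies"].max(),
    })
    # Total replies divided by the number of conversations.
    feats["avg_conversation_length"] = by_user.size() / feats["n_conversations"]
    return feats


features = replier_features(replies, reference_date)
print(features)
```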

Determining Time Period for the Analysis

Before extracting any behavioural features from the data, we need to set a time frame, because only relatively recent user actions can affect a user’s current customer journey stage. As you can see from the picture above, this time period was 60 days in our case, and it was the result of an iterative process, summarised in the figure below.

Iterative Time Period Determination (Image by Author)
The process starts with extracting features from the users’ actions on the platform over the last X days. We then generate clusters from those features and, in the last step, evaluate the quality of the resulting clusters. This loop repeats until we arrive at the best historical time period, which is 60 days in our case.
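
The loop above might look roughly like the sketch below. The article does not say how cluster quality was measured, so the silhouette score used here is purely an assumption, as are the synthetic event log, the three features, the candidate windows and the use of scikit-learn in place of H2O.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical view log: 300 users, each event tagged with its age in days.
events = pd.DataFrame({
    "user_id": rng.integers(0, 300, size=6000),
    "ad_id": rng.integers(0, 500, size=6000),
    "days_ago": rng.integers(0, 120, size=6000),
})


def features_for_window(events, window_days):
    """Extract a few features from the last `window_days` days only."""
    recent = events[events["days_ago"] < window_days]
    by_user = recent.groupby("user_id")
    return pd.DataFrame({
        "days_since_last_view": by_user["days_ago"].min(),
        "n_visiting_days": by_user["days_ago"].nunique(),
        "n_distinct_ad_views": by_user["ad_id"].nunique(),
    })


def score_window(events, window_days, n_clusters=3):
    """Cluster the window's features and return an (assumed) quality score."""
    X = StandardScaler().fit_transform(features_for_window(events, window_days))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)


scores = {w: score_window(events, w) for w in (30, 60, 90)}
best_window = max(scores, key=scores.get)
print(scores, best_window)
```

On random data like this the chosen window is meaningless; the point is only the shape of the loop: extract, cluster, evaluate, repeat.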


Since we do not have any pre-defined user journey stages in our data, we cast our segmentation problem as unsupervised learning: Clustering. We only know about our users’ past actions on the platform; how many meaningful customer journey stages (clusters) there are is still an open question to be answered by this analysis. That is both the beauty and the challenge of a clustering analysis: you may find something new that yields useful insights for your business, but you also have to solve a problem with many unknowns.

Many data scientists might think of clustering analysis as unchallenging, but it is easy to make basic mistakes if you miss a few critical points. Regardless of the algorithm you use, there are always assumptions behind a clustering algorithm that you should take into consideration. Since we picked the ‘K-Means’ algorithm for our clustering problem, I will dive into the assumptions of this algorithm specifically and talk about how we dealt with them one by one in the rest of the article.

Before talking about those assumptions and their respective remedies, let’s take a moment to mention the technology stack we have used throughout this project.

  • Previous Step: Feature Extraction → SparkSQL and Spark’s DataFrame API.
  • Following Steps → Various H2O algorithms running on Spark via the Sparkling Water package.

If you haven’t tried using H2O on Spark, you should give it a chance: I believe you will like it once you realise how fast the H2O algorithms are compared to their counterparts in the SparkML package.

Assumptions of K-Means Clustering

  • Clusters are spatially grouped or ‘spherical’
  • Clusters are of a similar size

Since K-Means tries to minimise the total sum of squares within clusters, it can be very sensitive to outliers. In other words, outliers in the data may pull cluster centroids away from where they belong and distort the cluster formation. Outliers must therefore be removed before applying the K-Means algorithm to the data.

Step 1: Outliers Removal

We used the Isolation Forest algorithm to remove outliers from the data. Discussing the details of the algorithm is beyond the scope of this article, but we would like to briefly describe the logic behind it.

An Isolation Forest is basically a Random Forest that uses random splits rather than selecting the best feature candidate at each node split. It is an approach to unsupervised anomaly detection, which is what we need here. It builds multiple decision trees such that the trees isolate the observations in their leaves. Since outliers have extreme values for some features, they are easier and quicker to isolate than other observations; in other words, their branches in a tree tend to be shorter. Therefore, by averaging the number of node splits needed across the trees for each observation, we end up with scores where the fewer splits an observation needs, the more likely it is to be anomalous.

As with a Random Forest, there are parameters to tune, such as the number of trees, the height of a tree and the sampling ratio. Isolation Forest has one extra parameter, the ‘contamination rate’, which indicates the overall ratio of outliers in the data. It determines the cut-off point on the scores for declaring an observation anomalous.
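
For readers who want to try this step outside H2O, here is a minimal sketch using scikit-learn’s `IsolationForest` as a stand-in for the H2O implementation we used. The synthetic data and every parameter value, including the 2% contamination rate, are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Hypothetical feature matrix: 500 typical users plus 10 extreme ones.
typical = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
extreme = rng.normal(loc=8.0, scale=1.0, size=(10, 4))
X = np.vstack([typical, extreme])

forest = IsolationForest(
    n_estimators=100,    # number of trees
    max_samples=256,     # rows sampled per tree
    contamination=0.02,  # assumed share of outliers; sets the score cut-off
    random_state=0,
)
labels = forest.fit_predict(X)    # +1 = inlier, -1 = outlier
scores = forest.score_samples(X)  # lower = easier to isolate = more anomalous

X_clean = X[labels == 1]          # data passed on to the next step
print(X.shape, "->", X_clean.shape)
```

In practice the contamination rate is a judgment call: set it too high and genuine heavy users are thrown away, too low and extreme accounts keep distorting the centroids.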

In the plot below, you can see the effect of the outlier removal process on the distribution of observations in our data space. Our data space is multi-dimensional, so we applied Principal Component Analysis (PCA) to reduce it to 3 dimensions, which lets us show the data points in a Cartesian coordinate system. Before the outliers were eliminated, almost all data points were packed densely together and therefore not suitable for a cluster analysis. After the process, as the right-hand plot shows, the observations are spread more uniformly and less densely across the coordinate system.

Outliers Removal: Before - After PCA Plots (Image by Author)
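
The projection used for such plots can be sketched with scikit-learn’s `PCA`. The random matrix below is only a stand-in for the real feature matrix, and the plotting itself is left out.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # hypothetical matrix: 200 users x 6 features

pca = PCA(n_components=3)
X_3d = pca.fit_transform(X)    # coordinates to plot on the x, y and z axes

# Share of the total variance that survives the projection.
explained = pca.explained_variance_ratio_.sum()
print(X_3d.shape, round(explained, 2))
```

The explained-variance share is worth checking before reading too much into such a plot: the lower it is, the more structure the 3D view hides.
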
Step 2: Data Rescaling

Since K-Means is a spatial algorithm that uses a distance metric to determine how close 2 observations are to each other, mean normalisation is important to equalise the impact of each dimension on the cluster formation.

It is generally recommended to rescale the data before removing outliers, because any dimension with a much larger scale than the others may dominate them if you apply a linear outlier detection method. An Isolation Forest splits on one feature at a time, so the scale of a feature does not affect it, and we don’t have that restriction in our case. We applied data rescaling only to even out the effects of the dimensions and to prepare the data for the next step: Giving Weights to Dimensions.
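
A minimal NumPy sketch of the rescaling step follows. It assumes that the rescaling amounts to centring each feature on its mean and dividing by its standard deviation; dividing by the range (max minus min) would be an equally plausible reading of ‘mean normalisation’. The toy values are made up.

```python
import numpy as np


def standardise(X):
    """Centre each column on 0 and scale it to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled (avoid division by zero)
    return (X - mean) / std


# Hypothetical features on very different scales:
# days since last view (0-60) and number of ad views (can run into the hundreds).
X = np.array([[1.0, 200.0], [30.0, 5.0], [59.0, 20.0]])
X_scaled = standardise(X)
print(X_scaled.round(2))
```

Without this step, the second column would dominate every Euclidean distance simply because its numbers are bigger.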

Step 3: Giving Weights to Dimensions

When working on any clustering problem, you should always take your business assumptions into account while formulating your approach. Things are not always straightforward for a clustering algorithm, and you may want to adjust it to your needs and assumptions along the way.

The K-Means algorithm is designed so that all dimensions matter equally when generating clusters, which is not always what you want. In our case, we wanted to give more weight to some dimensions to increase their importance and their effect on the cluster formation.

For Viewers, we doubled the weights of the dimensions listed below:

  • Days since last Ad view
  • Number of visiting days
  • Number of distinct Ad views

For Repliers, we doubled the weights of the dimensions listed below:

  • Days since last reply
  • Number of conversations
  • Number of replies in the longest conversation
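
One simple way to weight dimensions for K-Means is to multiply the rescaled columns by a weight vector before clustering. The sketch below assumes that this is what "doubling the weight" means; the column names and random data are placeholders.

```python
import numpy as np

viewer_features = [
    "days_since_last_view", "n_visiting_days", "n_distinct_ad_views",
    "n_ads_favourited", "interest_in_specific_ad", "interest_in_group_of_ads",
]
# Weight 2.0 for the three doubled Viewer dimensions, 1.0 for the rest.
weights = np.array([2.0, 2.0, 2.0, 1.0, 1.0, 1.0])


def apply_weights(X_scaled, weights):
    """Stretch each (already rescaled) column by its weight."""
    return X_scaled * weights


rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(5, len(viewer_features)))  # stand-in for rescaled data
X_weighted = apply_weights(X_scaled, weights)
print(X_weighted.round(2))
```

Because K-Means relies on squared Euclidean distance, multiplying a column by 2 multiplies its contribution to the squared distance by 4. If the goal is to double that contribution instead, a factor of √2 would be the one to use.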

In the plot below, you can see how the distribution of observations in a 2D coordinate system differs before and after giving different weights to the dimensions. It is clear that the cluster formation is strongly affected by this step.

Weighted Dimensions: Before - After PCA Plots (Image by Author)
Step 4: K-Means Clustering

With the weighted dimensions in place, the next step is to apply the K-Means algorithm to this data. Since we don’t know anything yet about the possible stages of a customer journey, this is the point where we determine the best number of clusters or, in other words, the number of different journey stages. We find the optimum number of clusters with an Elbow Analysis: we fit the K-Means algorithm for a range of cluster counts and evaluate the quality of the resulting clusters by looking at the within-cluster sum of squared distances.
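
An elbow analysis of this kind can be sketched with scikit-learn’s `KMeans`, used here in place of the H2O implementation. The three synthetic blobs and the range of k are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical weighted data: three well-separated blobs in 2D.
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centres])

wcss = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = model.inertia_  # within-cluster sum of squared distances

# Print the curve; the "elbow" is where adding a cluster stops paying off.
for k, value in wcss.items():
    print(k, round(value, 1))
```

On data like this the curve drops sharply up to k = 3 and flattens afterwards. Real user data rarely gives such a clean elbow, so the final choice usually also depends on whether the resulting clusters can be interpreted.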

We ran this analysis separately for our two main user groups, Viewers and Repliers, as shown in the pictures below. For Viewers, the elbow analysis in the first plot suggests a total of 3 clusters (journey stages). For Repliers, the elbow analysis suggests a total of 4 clusters, as shown in the plots on the right.

Elbow Analysis and 2D-representation of optimum clusters for both Viewers and Repliers (Image by Author)
Below, you can find the big picture summing up all the steps of the project pipeline, from dividing our user base into two major user groups to ending up with the clusters, i.e. the customer journey stages.

Big picture of whole pipeline (Image by Author)
After finding the optimum number of clusters for both user groups (Viewers and Repliers), we gave every cluster a meaningful, representative name based on the dimension values of its centroid.

First, let’s take a closer look at the centroids of clusters presented in the table below. We came up with the names of clusters by interpreting values of cluster centroids across different dimensions. We indicated decisive dimensions in clusters by putting a red border around those cells in the table.

After checking the centroids, we strongly suggest reading through the short descriptions and indicators of the clusters in the following section to see the rationale behind the cluster names more clearly.
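
Reading names off centroids can be partly automated. The sketch below computes each cluster’s centroid in original units and reports, for every dimension, which cluster has the highest and the lowest value. The toy table and its numbers are invented for illustration and are not our real centroids.

```python
import pandas as pd

# Hypothetical unscaled Viewer features with a K-Means label per user.
users = pd.DataFrame({
    "cluster":              [0, 0, 1, 1, 2, 2],
    "days_since_last_view": [41, 39, 11, 9, 1, 2],
    "n_distinct_ad_views":  [2, 3, 6, 8, 40, 35],
    "n_ads_favourited":     [0, 0, 1, 0, 5, 7],
})

# Centroid in original units: the mean of each dimension per cluster.
centroids = users.groupby("cluster").mean()

# For each dimension, which cluster has the largest and the smallest centroid value?
indicators = pd.DataFrame({
    "max_cluster": centroids.idxmax(),
    "min_cluster": centroids.idxmin(),
})
print(centroids)
print(indicators)
```

In this toy example, cluster 0 has the maximum ‘days since last view’ and would be the candidate for a churn-like label, while cluster 2 holds the maxima for views and favourites.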

Cluster centroids' dimension values (Image by Author)
Short Descriptions and Indicators of Viewer Clusters

1- Churn: They stopped visiting our platform a long time ago.

  • Maximum ‘Days since last Ad view’ among Viewers

2- Browser: They still visit our platform but not so actively.

  • In general, no extreme values for any dimensions

3- Prospective Replier: Currently, they aggressively visit our platform. This group of users is the most likely to start having conversations with sellers.

  • Minimum ‘Days since last Ad view’ among Viewers
  • Maximum ‘Number of distinct Ad views’ among Viewers
  • Maximum ‘Number of Ads favourited’ among Viewers

Short Descriptions and Indicators of Replier Clusters

1- Losing Interest: They stopped replying to listings a while ago. Most likely, they already gave up on their search for a car on the platform.

  • Maximum ‘Days since last reply’ among Repliers

2- Vanilla Buyer: Currently, they are still having some conversations with sellers but not so actively.

  • In general, no extreme values for any dimensions

3- Prospective Buyer: Currently, they are aggressively having lots of conversations with different sellers. This group of users is the most likely to buy a car.

  • Minimum ‘Days since last reply’ among Repliers
  • Maximum ‘Number of conversations’ among Repliers
  • Maximum ‘Number of replied days’ among Repliers
  • Maximum ‘Number of Ads favourited’ among Repliers

4- Bought: They recently bought a car and left our platform.

  • Maximum ‘Average conversation length’ among Repliers
  • Maximum ‘Max number of days spent on a particular Ad’ among Repliers
  • Maximum ‘Number of replies in the longest conversation’ among Repliers

In the plot below, you can see how users are distributed across the different clusters.

Distribution of users across the journey stages (Image by Author)
Clusters In Detail & Transitions between Clusters

After generating the clusters, we also looked at the transitions between clusters over 7 days. The question is what percentage of users in each cluster moves to another cluster within the following 7 days; in other words, we measured how stable each cluster tends to be. Another reason for computing these transitions is to sanity-check our cluster generation process and see whether any unexpected transitions happen between clusters. In the pie charts below, you will see the percentages of transitions from each cluster to the other clusters in 7 days.
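
Given each user’s stage today and 7 days later, the transition percentages are a row-normalised cross-tabulation. The table below is a made-up example with hypothetical column names, not our real data.

```python
import pandas as pd

# Hypothetical stage labels for the same users, 7 days apart.
stages = pd.DataFrame({
    "user_id":   [1, 2, 3, 4, 5, 6, 7, 8],
    "stage_now": ["Churn", "Churn", "Browser", "Browser", "Browser",
                  "Browser", "Prospective Replier", "Prospective Replier"],
    "stage_7d":  ["Churn", "Churn", "Churn", "Browser", "Browser",
                  "Prospective Replier", "Prospective Replier", "Browser"],
})

# Row-normalised counts: each row sums to 1 and gives
# the share of users moving from the current stage to each next stage.
transitions = pd.crosstab(stages["stage_now"], stages["stage_7d"], normalize="index")

# Share of users who stay in the same stage (the diagonal).
stay_rate = pd.Series({s: transitions.loc[s, s] for s in transitions.index})
print(transitions)
print(stay_rate)
```

Each row of `transitions` corresponds to one pie chart: the diagonal entry is the share of users who stay put, and the off-diagonal entries are the moves to other stages.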

Viewer Clusters


Churn

  • 36% of the whole user group
  • No reply in the last 60 days (Viewer)
  • They even stopped visiting our platforms a while ago (last visit: ~40 days ago)
  • Compared to other journey stages, they have a higher tendency to stay in the same stage (~90%)
Transition distribution from the Churn journey stage in next 7 days (Image by Author)
Browser

  • 45% of the whole user group
  • No reply in the last 60 days (Viewer)
  • They still visit our platform but not so actively (last visit: ~10 days ago)
  • These users are the most likely to churn (churn rate in 7 days: 14%)
  • In 7 days, only 1.5% of these users become ‘Prospective Replier’ (next stage)
Transition distribution from the Browser journey stage in next 7 days (Image by Author)
Prospective Replier

  • 6.7% of the whole user group
  • No reply in the last 60 days (Viewer)
  • Currently, they aggressively visit our platform.
  • This group of users is the most likely to start having conversations with sellers.
Transition distribution from the Prospective Replier journey stage in next 7 days (Image by Author)
Replier Clusters

Losing Interest

  • 5.4% of the whole user group
  • They had at least 1 reply during the last 60 days (Replier)
  • They stopped replying to Ads a while ago (last reply: ~45 days ago)
  • They are prone to churn (churn rate in one week: ~5.6%)
  • 6.8% of these users become ‘Browser’ (previous stage) again in 7 days
  • Only 5.4% of these users continue their buying journey (next stage: Vanilla Buyer)
Transition distribution from the Losing Interest journey stage in next 7 days (Image by Author)
Vanilla Buyer

  • 6.4% of the whole user group
  • They are still having some conversations with sellers but not so actively.
  • One of the most dynamic user groups in the buyer journey (25% of the users tend to change their stage in 7 days)
  • 18.2% of these users move to ‘Losing Interest’ (previous stage) in 7 days
  • Only 5% of these users continue their buying journey (next stages: Prospective Buyer or Bought)
Transition distribution from the Vanilla Buyer journey stage in next 7 days (Image by Author)
Prospective Buyer

  • 0.6% of the whole user group
  • Currently, they are aggressively having a lot of conversations with sellers.
  • One of the most stable user clusters (share staying in the same stage after 7 days: 90%)
  • This group of users is the most likely to buy a car. (Ratio of moving to ‘Bought’ stage in 7 days: 2.4%)
Transition distribution from the Prospective Buyer journey stage in next 7 days (Image by Author)
Bought

  • 1% of the whole user group
  • They already bought a car
  • Left our platform recently (Last reply: ~20 days ago)
  • These users had the longest conversations with the sellers before they left
  • These users might be excluded from our campaigns for a while
Transition distribution from the Bought journey stage in next 7 days (Image by Author)
In this article, we walked you through the steps we took in our project: Customer Journey-based Segmentation for Marketplaces. First, we defined the dimensions (features) used throughout the analysis. Then we touched upon some critical assumptions of K-Means clustering and how we handled them in our case. After that, we described how we came up with the cluster names by interpreting the cluster centroids. Towards the end of the article, we dived deep into the cluster details and the transitions between clusters.

The main takeaway from this article is that solving a clustering problem can be tricky, so you should keep a few critical points in mind while conducting your analysis. We hope you enjoyed reading this article and found it useful.

