Introduction

Customer segmentation is one of the most common applications of data analysis and data science. In this two-part series, we will work through an example of customer segmentation. We are going to use the Online Retail II data set, which contains the transactions of a UK-based online retailer between 01/12/2009 and 09/12/2011. The data set contains 1,067,371 rows covering the purchases of 5,943 customers.

As we can see, the data set contains invoice orders that can span multiple rows. Each row represents a specific item and has information on the purchase quantity, price, invoice date, customer ID, and customer country.
The plan is to combine two different segmentations.
- A segmentation based on the items that each customer has bought.
- A segmentation based on Recency, Frequency, Monetary Value (RFM), and country (part II).
In this first part, we will perform the segmentation based on items. In the second part, we will perform the RFM-country clustering and combine the two segmentations.
The full code for this part can be found on Github.
Item-based segmentation
To perform an item-based segmentation, we will assign each item to a specific category. We could perform the segmentation at the item level, but rolling up to categories has two benefits:
- since there are 5,699 unique items, using them directly would mean working in a high-dimensional space (i.e., a 5,699-dimensional space), and high-dimensional spaces tend to be problematic for clustering (e.g., see the discussion in "k-Means Advantages and Disadvantages"),
- it is easier to describe clusters if we use categories.
Unfortunately, item categories are not available in the data set. Hence, we must find a way to create them. To do this, we are going to perform clustering on the items’ descriptions. We will use k-means clustering (or rather a variant of k-means called MiniBatchKMeans). Since k-means works only on numerical data, we must first map the descriptions to high-dimensional vectors. This is done using:
- TfidfVectorizer
- CountVectorizer
(sklearn's documentation provides a detailed description of how both methods work.)
The two different methods will allow us to find the best way to model the product categories.
Preprocessing
A little preprocessing is required. First, we check for missing values in the description and replace them with 'NA'. Then, we convert the 'Description' column to string (one cell is an integer, which causes problems later) and create a list of the unique descriptions.
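A minimal sketch of this step (the file name and the two-sheet layout follow the UCI distribution of the data set; adjust to your copy):

```python
import pandas as pd

# Load and stack both sheets of the Excel file (the UCI distribution of
# Online Retail II splits the data into two yearly sheets).
sheets = pd.read_excel("online_retail_II.xlsx", sheet_name=None)
df = pd.concat(sheets.values(), ignore_index=True)

# Replace missing descriptions with 'NA' and force everything to string
# (one cell is an integer, which would cause problems later).
df["Description"] = df["Description"].fillna("NA").astype(str)

# The unique descriptions are what we will cluster into item categories.
descriptions = df["Description"].unique().tolist()
```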
Also, we would like to remove common words from our analysis. One way to find them is to create a word cloud from all the descriptions.

Based on the word cloud, we create our stopwords list.
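A sketch of this step (the stopword list shown here is purely illustrative; the actual list is in the notebook on Github):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# One word cloud over all descriptions, to spot words that are frequent
# everywhere and therefore carry little category information.
wc = WordCloud(background_color="white", max_words=100).generate(" ".join(descriptions))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

# Illustrative only -- the real list is chosen by inspecting the cloud above.
stopwords = ["set", "pink", "blue", "red", "white", "design"]
```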
Clustering using TfidfVectorizer
The code below uses TfidfVectorizer to map descriptions to vectors. Then it uses MiniBatchKMeans, a variant of k-means designed for faster computation, to perform clustering for 2, 3, …, 10 clusters. The results are used to plot the sum of squared distances.
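The original snippet is on Github; a minimal sketch of it could look like this (random_state is an arbitrary choice for reproducibility; descriptions and stopwords come from the preprocessing above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans
import matplotlib.pyplot as plt

# Map each unique description to a TF-IDF vector, dropping the custom stopwords.
vectorizer = TfidfVectorizer(stop_words=stopwords)
X = vectorizer.fit_transform(descriptions)

# Fit MiniBatchKMeans for k = 2, ..., 10 and record the sum of squared
# distances (inertia_) for the elbow plot.
ks = range(2, 11)
inertias = []
for k in ks:
    km = MiniBatchKMeans(n_clusters=k, random_state=42)
    inertias.append(km.fit(X).inertia_)

plt.plot(ks, inertias, "o-")
plt.xlabel("number of clusters")
plt.ylabel("sum of squared distances")
plt.show()
```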

According to the elbow method, the most promising values for the number of clusters are 4, 6, and 8. The code below clusters into 4 groups and then displays a word cloud from the descriptions in each cluster.
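A sketch of this step, under the same assumptions as above (X and descriptions come from the TF-IDF snippet):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans

# Final item model with 4 clusters.
km = MiniBatchKMeans(n_clusters=4, random_state=42)
labels = km.fit_predict(X)

# One word cloud per cluster, built from that cluster's descriptions.
for c in range(4):
    text = " ".join(d for d, l in zip(descriptions, labels) if l == c)
    wc = WordCloud(background_color="white", max_words=50).generate(text)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"category {c} ({(labels == c).sum()} descriptions)")
    plt.show()
```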

It seems that:
- category 0 is quite general,
- category 1 is about bags,
- category 2 is about Christmas decoration,
- category 3 is about other types of decoration.
Notice that category 0 contains many more descriptions (5,026) than the remaining categories (275, 204, and 194 respectively). Generally, this could be an indication that we should increase the number of clusters.
We can try to split into 6 clusters, hoping that the big one will split into several smaller ones. In this case, category 0 has 4,956 descriptions and the rest have 197, 173, 168, 130, and 75.
- category 0 is quite general,
- category 1 is about bags in this case too,
- category 2 is about aromatic candles,
- category 3 is about Christmas decoration,
- category 4 is about other types of decoration,
- category 5 is about cups, bowls, and plates.
There seems to be more overlap between the categories. The word "mug" appears in category 0 while the word "cup" appears in category 5. The word "bowl" appears in both categories 0 and 5, the word "bag" in both categories 1 and 5, and the word "christmas" in both categories 1 and 3.

Clustering using CountVectorizer
We repeat the process using CountVectorizer to map descriptions to vectors. We will omit the code snippet; you can read the code on Github. (The very careful reader will notice that we do not scale before applying k-means/MiniBatchKMeans. This is because the items’ descriptions have more or less the same length.) According to the elbow method, the most suitable numbers of clusters are 3 and 8.

If we cluster into 3 clusters, we get three categories with 5,334, 199, and 166 descriptions respectively. As we can see from the word clouds, there is significant overlap between the categories (especially categories 0 and 2).

It seems that the 4-cluster solution with TfidfVectorizer is cleaner, so we are going to use it. Note that, ideally, one should try both TfidfVectorizer and CountVectorizer for various numbers of clusters, complete the customer clustering with each, and then decide which to keep. (More on this later.)
Customer Segmentation – Preprocessing
Having created the item categories, we add them to the initial dataset. We also calculate the cost of each row by multiplying the quantity by the price.
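In code, this step might look as follows (descriptions and labels come from the item clustering above; Quantity and Price are the data set's column names):

```python
# Map every row's description to its item category.
desc_to_category = dict(zip(descriptions, labels))
df["Category"] = df["Description"].map(desc_to_category)

# Cost of each row: quantity times unit price.
df["Cost"] = df["Quantity"] * df["Price"]
```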
With this, we are in a position to calculate the total cost (spend) per category per month. Category 2 shows a sharp increase in November while staying close to zero during the rest of the year. This confirms that category 2 has to do with Christmas-related products.

We calculate total spending (cost) per category for every customer.
Finally, we replace the total cost per category with its percentage of the total cost across all categories. Notice that in some cases we have negative values; these probably signify returns. Since a returned item must have been bought at an earlier time, we change the negative values to positive. This will be the input to k-means for creating the customer clusters.
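A sketch of these two steps, pivoting to a customer-by-category matrix, flipping the sign of the returns, and converting to percentages ('Customer ID' is the data set's column name):

```python
# Total spend per customer per category; missing combinations become 0.
spend = df.pivot_table(index="Customer ID", columns="Category",
                       values="Cost", aggfunc="sum", fill_value=0)

# Negative totals come from returns; a returned item was bought earlier,
# so we flip the sign rather than drop the values.
spend = spend.abs()

# Each customer's row becomes a percentage of their total spend; this is
# the matrix we feed to k-means.
spend_pct = spend.div(spend.sum(axis=1), axis=0)
```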
Customer Segmentation – Clustering
Classically, we scale the input matrix and use the elbow method to decide the number of customer clusters.
Remember that, according to the elbow method, the best value for the number of clusters is where there is an elbow/angle in the graph. Based on this, we can select either 4 or 7 clusters.
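A sketch of the scaling and elbow computation (StandardScaler and the range of k mirror the item-clustering step; both are assumptions):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Scale the percentage matrix, then plot the inertia for k = 2, ..., 10.
X_cust = StandardScaler().fit_transform(spend_pct)

ks = range(2, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_cust).inertia_
            for k in ks]

plt.plot(ks, inertias, "o-")
plt.xlabel("number of clusters")
plt.ylabel("sum of squared distances")
plt.show()
```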

The code for creating 4 clusters and for profiling them is listed below. The function cluster_profile calculates the median of the spending percentages for each cluster and plots a heatmap of the results.
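Since the original listing is on Github, here is a minimal sketch of what cluster_profile could look like, based on its description above (X_cust and spend_pct come from the previous sketches):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Final 4-cluster model on the scaled customer matrix.
km = KMeans(n_clusters=4, n_init=10, random_state=42)
clusters = km.fit_predict(X_cust)

def cluster_profile(data, clusters):
    """Median spending percentage per category for each cluster, as a heatmap."""
    profile = data.groupby(clusters).median()
    sns.heatmap(profile, annot=True, fmt=".2f", cmap="Blues")
    plt.xlabel("item category")
    plt.ylabel("customer cluster")
    plt.show()
    return profile

cluster_profile(spend_pct, clusters)
```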

We see that for all clusters there is a high percentage of spending in category 0. In addition:
- cluster 0 has customers with high spending in category 2,
- cluster 1 has customers with high spending only in category 0,
- cluster 2 has customers with high spending in category 3,
- cluster 3 has customers with high spending in category 1.
If we try creating 6 clusters, we get, in some cases, clusters with a higher percentage in category 1 or category 2 than in category 0. The drawback is that, to achieve this, we have to create clusters with few customers.

Using hierarchical clustering, we can gain a better understanding of the possible number of clusters.
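For example, a Ward-linkage dendrogram (the linkage method and truncation level are choices, not necessarily the ones in the notebook):

```python
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Ward linkage on the scaled customer matrix; the dendrogram hints at how
# many natural groups the data supports.
Z = linkage(X_cust, method="ward")
dendrogram(Z, truncate_mode="level", p=5)
plt.xlabel("customers")
plt.ylabel("distance")
plt.show()
```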

Selecting 4 clusters, we observe that the cluster with high spending in category 1 (cluster 3) now has higher spending in category 1 than in category 0. This inversion is a result of its smaller size. Thus, we will choose to use k-means with 4 clusters. The relevant information is exported with pickle.
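The export itself can be as simple as the following (the file name and dictionary keys are illustrative):

```python
import pickle

# Persist what part II needs: the fitted model and the cluster assignments.
with open("item_based_segmentation.pkl", "wb") as f:
    pickle.dump({"kmeans": km, "customer_clusters": clusters}, f)
```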
In the next part, we will perform customer segmentation based on RFM values and each customer's country, and we will combine the two segmentations into a complete analysis.
As a final remark, the image below shows the results we get if we use 6 item categories. On the left is the elbow graph for selecting the number of clusters in the customer segmentation; according to it, the best number of clusters is either 3 or 5. If we try 3 clusters, we get one cluster with most of the customers and two much smaller ones. On the right is the profiling with 5 clusters. As you can see, the results are similar to our 4-cluster solution with 4 item categories, the difference being that there are now two clusters where most of the spending is in category 0.

For better or worse, there are no hard rules for many of these clustering decisions, only guidelines. You can experiment and make your own choices by running the Jupyter notebook in Google Colab.