Customer Segmentation with Machine Learning

K-means algorithm applied to a real-world e-commerce sales data

Ceren Iyim
Towards Data Science
8 min read · Feb 26, 2020

Imagine treating the owner of the grocery shop you visit every day the way you treat your significant other. That might be fun at first, but it could also lead to disastrous situations. Likewise, it can be unfavourable for a company to manage its relationships with every customer in the same way.

Customer segmentation enables a company to customize its relationships with the customers, as we do in our daily lives.

When you perform customer segmentation, you look for similar characteristics in each customer’s behaviour and needs. Those customers are then generalized into groups whose demands can be met with different strategies. Those strategies, in turn, can be an input to the:

  • Targeted marketing activities to specific groups
  • Launch of features aligning with the customer demand
  • Development of the product roadmap

There are different products and solutions available in the market, ranging from packaged software to CRM products. Today, I will instead apply an unsupervised machine learning algorithm with Python.

The dataset covers November 2018 to April 2019 and contains actual sales data, courtesy of an e-commerce company. It was provided to me for an interview case study.

Yes, they have an amazing interview process, but more on that later, at the end of this article. I also did not get the role, mainly because of remote employment logistics, but that’s another story for another day.

I will apply K-Means clustering to the dataset with the following steps.

  1. Business Case
  2. Data Preparation
  3. Segmentation with K-means Clustering
  4. Hyperparameter Tuning
  5. Visualization and Interpretation of the Results

Along the way, I will explain how K-means clustering works. Eventually, I will provide specific strategies for the segments formed.

P.S. I anonymized the data for confidentiality reasons.

1. Business Case

In the case study, I visualized the customer behaviour and characteristics from diverse aspects. Taking it one step further, I will form the business case around the question: Can the customer base be grouped to develop customized relationships?

I will approach this question from a behavioural aspect (alternatives are geographic or demographic perspectives) to better understand customers’ spending and ordering habits, using the following features: number of products ordered, average return rate and total spending.

2. Data Preparation

There are approximately 25,000 unique customers, combined with their order information, in the raw dataset:

The dataset is well-formatted and has no NA values, so we can start by forming the features. 3 features will be calculated per customer_id; they will help with visualization (using the Plotly library) and algorithm explainability in the later steps. Data preparation will be done with pandas and NumPy.

  • Number of products ordered: It is calculated by counting the product_type values ordered by a customer.
  • Average return rate: It is the ratio of returned_item_quantity to the ordered_item_quantity averaged for all orders of a customer.
  • Total spending: It is the aggregated sum of total sales, which is the final amount after taxes and returns.
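The feature engineering described above can be sketched with a pandas groupby. The column names (customer_id, product_type, ordered_item_quantity, returned_item_quantity, total_sales) follow the article's descriptions, and the tiny orders frame is made-up illustration data, not the actual dataset:

```python
import pandas as pd

# hypothetical raw orders; column names follow the article's descriptions
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "product_type": ["shoes", "shirt", "shoes", "shoes", "hat"],
    "ordered_item_quantity": [1, 2, 1, 3, 1],
    "returned_item_quantity": [0, 0, 1, 0, 0],
    "total_sales": [50.0, 80.0, -50.0, 120.0, 30.0],
})

# per-order return rate, later averaged per customer
orders["return_rate"] = (orders["returned_item_quantity"]
                         / orders["ordered_item_quantity"])

# aggregate the 3 features per customer_id
customers = orders.groupby("customer_id").agg(
    products_ordered=("product_type", "nunique"),
    average_return_rate=("return_rate", "mean"),
    total_spending=("total_sales", "sum"),
).reset_index()
```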

After the calculations, the 3 features are merged into the customers data frame:

Let’s have a look at the individual distribution of the features:

All 3 distributions are positively skewed. Products ordered follows a power-law-like distribution, and the average return rate is 0 for 99% of the customers.

The 3 features have different ranges, varying between [1, 13], [0, 1] and [0, 1000]. This is an important observation: the features need scaling!

Scaling:

The K-means algorithm interprets each row in the customers data frame as a point in 3-dimensional space. When grouping the points, it uses the Euclidean distance between each data point and the centre of its group. With highly varying ranges, the algorithm may perform poorly and fail to form the groups as expected.

For K-means to perform effectively, we are going to scale the data with a logarithmic transformation, which is a suitable transformation for skewed data. This proportionally scales down the 3D space our data is spread across while preserving the proximity between the points.
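The original transformation function is not shown here, but a minimal version could look like the sketch below. It assumes np.log1p (log of 1 + x) so that zero values, such as a return rate of 0, map cleanly to 0:

```python
import numpy as np
import pandas as pd

def apply_log_transformation(df, columns):
    """Scale skewed features with log1p, which maps 0 to 0
    and preserves the ordering of the values."""
    for column in columns:
        df["log_" + column] = np.log1p(df[column])
    return df
```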

After applying the transformation, the customers data frame is ready to be fed into K-means clustering:

3. Segmentation with K-means Clustering

We are going to use K-means algorithm from scikit-learn. Let’s first understand how the algorithm will form customer groups:

  1. Initialize k centroids (k = the number of clusters), randomly or smartly
  2. Assign each data point to the closest centroid based on Euclidean distance, thus forming the groups
  3. Move each centroid to the average of all points in its cluster

Repeat steps 2 and 3 until convergence.

K-means in action with k=3 centroids. Source: Wikimedia
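The three steps above can be sketched from scratch in NumPy. This is a toy illustration of the algorithm, not the scikit-learn implementation used later:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=42):
    """Minimal K-means: random init, assign, move centroids, repeat."""
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each point to the closest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                                   axis=2)
        labels = distances.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned points
        # (a production version would guard against empty clusters)
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```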

While running these steps, the algorithm checks the sum of squared distances between each clustered point and its cluster centre. Mathematically speaking, it tries to minimize the within-cluster sum of squared distances, also called the inertia, of each cluster.

Mathematical expression of the within-cluster sum of squared distances (inertia), where x are the points in the cluster and µ is the current centroid
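Written out, following scikit-learn's definition, the inertia over n points and candidate centroids µ_j in the set C is:

```latex
\text{inertia} = \sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)
```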

When the inertia value does not decrease any further, the algorithm has converged and iteration stops.

from sklearn.cluster import KMeans

kmeans_model = KMeans(init='k-means++',
                      max_iter=500,
                      random_state=42)
  • init parameter with the value k-means++ allows the algorithm to place the initial centres smartly, rather than randomly.
  • max_iter is the maximum number of iterations of the algorithm in a single run, default value is 300.
  • random_state guarantees the reproducibility of the model results.

This algorithm is easy to understand, scales well to large datasets in terms of computing time and guarantees convergence. However, when centroids are initialized randomly, the algorithm may not assign the points to groups in the most optimal way.

One important consideration is the selection of k. In other words, how many groups should be formed? For example, the K-means model above uses the default value, k=8.

In the next step, we are going to choose k which is the most important hyperparameter of K-means.

4. Hyperparameter Tuning

To select k, we are going to use the elbow method, judging candidate values against K-means' optimization criterion, the inertia. We will build K-means models with k values from 1 to 15 and save the corresponding inertia values.

results = make_list_of_K(15, customers.iloc[:, 3:])
clusters = list(range(1, 16))
k_values_distances = pd.DataFrame({"clusters": clusters,
                                   "within cluster sum of squared distances": results})
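The make_list_of_K helper is not shown in the post; one plausible implementation, reusing the model settings from above and collecting each fitted model's inertia_, is:

```python
from sklearn.cluster import KMeans

def make_list_of_K(max_k, data):
    """Fit K-means for k = 1..max_k and collect each model's inertia."""
    inertias = []
    for k in range(1, max_k + 1):
        model = KMeans(n_clusters=k,
                       init='k-means++',
                       max_iter=500,
                       random_state=42)
        model.fit(data)
        inertias.append(model.inertia_)
    return inertias
```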

When we plot inertia against the k values:

With the elbow method, we are going to select the k value where the decrease in the inertia stabilizes.

When k=1, inertia is at its highest, meaning the data is not grouped yet. Inertia decreases steeply until k=2, and between k=2 and k=4 the curve continues to drop quickly.

At k=4, the descent stabilizes and continues almost linearly afterwards, forming an elbow at k=4. This points to 4 as the optimal number of customer groups.

5. Visualization and Interpretation of the Results

Let’s plug k=4 into K-means and visualize how the customer groups are created:

# create the clustering model with the optimal k=4
updated_kmeans_model = KMeans(n_clusters=4,
                              init='k-means++',
                              max_iter=500,
                              random_state=42)

updated_kmeans_model.fit_predict(customers.iloc[:, 3:])

Data points are shown as spheres, and the centroid of each group is shown as a cube. The 4 customer groups are as follows:

Blue: Customers who ordered at least one product, with a maximum total spending of 100 and the highest average return rates. They might be newcomers to the e-commerce website.

Red: Customers who ordered 1 to 4 products, with average total spending of 150 and a maximum return rate of 0.5.

Purple: Customers who ordered 1 to 4 products, with average total spending of 300 and a maximum return rate of 0.5.

Green: Customers who ordered 1 to 13 products, with average total spending of 600 and an average return rate of 0. This makes them the most favourable customer group for the company.

Let’s look at how many customers there are in each group, known as the cluster magnitudes:
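Cluster magnitudes can be computed directly from the labels returned by fit_predict; the labels below are hypothetical:

```python
import numpy as np
import pandas as pd

# hypothetical cluster labels produced by fit_predict
labels = np.array([0, 0, 1, 2, 2, 2, 3, 0, 1, 0])

# count the customers in each cluster, ordered by cluster id
cluster_magnitudes = pd.Series(labels).value_counts().sort_index()
print(cluster_magnitudes)
```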

The overall strategy would be to preserve the most favourable customer group, the green one, while moving the blue customer group towards the red and purple areas.

The blue group makes up 42% of all customers, so any improvement achieved in this group will dramatically increase revenue. Eliminating high return rates and offering gift cards can move these customers towards the low-average-return-rate and high-total-spending area. If we assume they are newcomers, gift cards can also expedite their return to the site.

The red and purple groups together make up 50% of all customers. They show the same characteristics in terms of average return rate and products ordered, but differ in total spending. These customers already know the brand and order multiple products, so they can be kept up to date with the brand through specialized communications and discounts.

The green customer group makes up 8% of all customers and is the most favourable group for the brand. They order multiple products and are highly likely to keep them. Special deals and pre-launch access might help maintain and possibly expand this group. Moreover, these customers can act as magnets for new ones, expanding the customer base.

Conclusion

We approached the customer segmentation problem from a behavioural aspect, using the number of products ordered, the average return rate and the total spending of each customer. Using 3 features helped with the understandability and visualization of the model.

All in all, the dataset was apt for an unsupervised machine learning problem. At first, we only had customer and order information and did not know whether the customers belonged to any groups. K-means clustering found patterns in the data and extended them into groups. We carved out strategies for the resulting groups, making meaning out of a dataset that initially looked like a dust cloud.

About the interview: I enjoyed working on the case study, and the process was the best interview experience I have had so far. It was a very realistic evaluation of my skills on a typical, practical task I would have performed regularly. This made the entire process straightforward and fun. I hope this type of practical data science interview becomes mainstream, benefiting both candidates and employers.

Thanks for reading! For comments or constructive feedback, you can reach me in the responses, or on Twitter or LinkedIn.
