Selecting optimal K for K-means clustering

Using unsupervised clustering in a supervised way.

Tamjid Ahsan
Towards Data Science


K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition observations into clusters based on their means. Let’s start by clarifying the premise of the clustering case explored here: segmenting clients. Client segmentation is the practice of partitioning an organization’s clients into groups that reflect similarity among the clients in each group. The objective of such segmentation is to determine how to relate to the clients in each segment in order to maximize the value of every client to the business.

One of the most popular machine learning techniques for this is K-means clustering, among the simplest and most widely used unsupervised algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known or labelled outcomes.

In Steve Jobs: The Exclusive Biography by Walter Isaacson, although often misinterpreted, Jobs said:

“Some people say, ‘Give customers what they want.’ But that’s not my approach. Our job is to figure out what they’re going to want before they do. I think Henry Ford once said, ‘If I’d asked customers what they wanted, they would have told me, “A faster horse!”’ People don’t know what they want until you show it to them. That’s why I never rely on market research. Our task is to read things that are not yet on the page.”

Photo by AB on Unsplash

It made some believe that, according to him, market research is not important; in reality, what Mr. Jobs meant was to go beyond typical market research and decipher and discover customer needs in advance. For this, market segmentation is a good approach, followed by targeted market research on each cluster. As an example, a college freshman’s needs are not the same as those of a middle-aged head of a household shopping for financial services, so the marketing approach for the two should not be the same.

It is crucial for marketers and policy makers to be aware of the different classes of clients in order to address their needs and serve them better. For mature businesses, relying solely on attracting new customers is no longer a good strategy, since the cost of serving existing customers is much lower than the cost of acquiring new ones. Which attributes should be used for this segmentation? Well, that depends.

Four types of client segmentation:

  • Behavioral segmentation: focuses on the habits of the client. E.g., usage-based segmentation.
  • Psychographic segmentation: segmentation based on traits that are not immediately apparent. E.g., values or opinions.
  • Demographic segmentation: based on cursory traits. E.g., occupation, marital status.
  • Geographic segmentation: based on location. E.g., city, country.

By doing customer segmentation, you will find similar characteristics in customers’ behavior and needs. Those are then generalized into groups that can be used to satisfy specific demands with a variety of strategies. Moreover, those strategies can serve as input for:

  • Targeted marketing
  • Introducing features aligning with the client demand
  • Development of the product roadmap

Photo by Emily Morter on Unsplash

Using unsupervised clustering also raises the question of how many clusters to create. This is a tricky one to answer directly. Oftentimes a manager or CEO will have a specific requirement on the number of clusters based on a particular business goal. But how do you decide when you are expected to come up with the number of clusters the data itself divulges? In this blog post I shall demonstrate techniques that can be used to get that number. That being said, it is important to know that what the algorithm suggests might not be the optimal number of clusters; the analyst should use business judgement to justify her choice. It is more of an art than a science, and selecting the number of clusters is critical to the success of any strategy formulated upon the analysis, as nonsense input data leads to nonsense output.

For this demo, the dataset from LEAPS Analyttica is used. The dataset is minimally cleaned; numerical features are scaled using StandardScaler and categorical variables are one-hot encoded using OneHotEncoder for use in machine learning algorithms via the scikit-learn API. All transformations are performed using a scikit-learn pipeline and transformers.
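
A minimal sketch of such a pipeline is below; the file name and the dtype-based column selection are illustrative assumptions, not the exact code from the analysis:

```python
# Sketch: scale numeric columns and one-hot encode categorical ones.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("BankChurners.csv")  # hypothetical file name for the LEAPS Analyttica data

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(include="object").columns

# sparse_threshold=0 forces a dense output array, which the scoring
# functions used later work with directly
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,
)

X_processed = preprocessor.fit_transform(df)
```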

Method 1: Using K-means++ with different ‘K’s

In total, 20 models are created, and inertia, Silhouette Score, and Calinski-Harabasz Score are plotted for each value of K.
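
A minimal sketch of this step, assuming X_processed is the feature matrix produced by the pipeline above:

```python
# Sketch: fit K-means++ over a range of K and record the three metrics.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

ks = range(2, 21)  # silhouette and Calinski-Harabasz are undefined for K=1
inertias, sil_scores, ch_scores = [], [], []

for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", random_state=42).fit(X_processed)
    inertias.append(km.inertia_)  # sum of squared errors for the elbow plot
    sil_scores.append(silhouette_score(X_processed, km.labels_))
    ch_scores.append(calinski_harabasz_score(X_processed, km.labels_))

fig, axes = plt.subplots(1, 3, figsize=(18, 4))
for ax, scores, title in zip(
        axes,
        [inertias, sil_scores, ch_scores],
        ["Inertia (elbow)", "Silhouette Score", "Calinski-Harabasz Score"]):
    ax.plot(list(ks), scores, marker="o")
    ax.set_xlabel("K")
    ax.set_title(title)
plt.show()
```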

This produced the following plot:

Selecting optimal `k`, image by Author

A higher Silhouette Coefficient indicates a model with better-defined clusters, and so does a higher Calinski-Harabasz score.
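
For reference, the Silhouette Coefficient of a single sample is s = (b − a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest neighboring cluster; the plotted score is the mean of s over all samples and ranges from −1 to 1.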

However, no obvious optimal K can be spotted just by looking at the plots.

Based on the Silhouette Score and the sum of squared errors (a.k.a. the elbow plot), five clusters seemed optimal for the initial model. The Calinski-Harabasz Score also supports this segmentation.

Method 2: Using the yellowbrick package

K-means models are tested with K from 2 to 10, using a fixed random state for reproducibility and without showing the timing of each model fit.
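
A sketch of this step with yellowbrick’s KElbowVisualizer, again assuming X_processed from the pipeline above:

```python
# Sketch: yellowbrick's elbow visualizer over K = 2..10.
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans(init="k-means++", random_state=42)
visualizer = KElbowVisualizer(model, k=(2, 11), timings=False)  # range(2, 11) -> K = 2..10
visualizer.fit(X_processed)
visualizer.show()
```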

Image by Author

This plot suggests K=5 as the optimal number of clusters.

Now, principal component analysis is used to visualize the clustering in two-dimensional space with yellowbrick.
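
One way to do this, assuming yellowbrick 1.0+ and a K-means model fitted with K=5 (as suggested by the elbow plot):

```python
# Sketch: 2D PCA projection colored by K-means cluster labels.
from sklearn.cluster import KMeans
from yellowbrick.features import PCA as PCAVisualizer  # yellowbrick >= 1.0

labels = KMeans(n_clusters=5, init="k-means++", random_state=42).fit_predict(X_processed)

visualizer = PCAVisualizer(
    scale=True,
    projection=2,
    classes=[f"cluster {i}" for i in range(5)],  # legend labels for the 5 clusters
)
visualizer.fit_transform(X_processed, labels)
visualizer.show()
```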

Image by Author

A clear separation between clusters is detected.

Method 3: Using MeanShift to discover clusters

Mean shift clustering aims to discover “blobs” in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids. (From scikit-learn documentation)

Code used:
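
A minimal sketch, assuming X_processed as before; the quantile and sample size passed to estimate_bandwidth are illustrative assumptions, not the article’s exact settings:

```python
# Sketch: let MeanShift estimate the number of clusters on its own.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# estimate_bandwidth picks a kernel width from the data; quantile is a tunable choice
bandwidth = estimate_bandwidth(X_processed, quantile=0.2, n_samples=500, random_state=42)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X_processed)

n_clusters = len(np.unique(ms.labels_))
print("Number of estimated clusters : %d" % n_clusters)
```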

The output of the code is:

Number of estimated clusters : 5

MeanShift suggests 5 as the optimal number of clusters.

Now I created a K-means++ model with K=5 for my analysis. The clusters are visualized with three principal components, which together explain about 40% of the variance in the dataset. A decent distinction between clusters is observed.
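
A sketch of this visualization, reusing X_processed and the K=5 labels; matplotlib 3.2+ is assumed for the 3D projection:

```python
# Sketch: reduce to 3 principal components and plot the K=5 clusters in 3D.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

pca = PCA(n_components=3, random_state=42)
X_pca = pca.fit_transform(X_processed)
print("Explained variance:", pca.explained_variance_ratio_.sum())  # ~0.40 per the article

labels = KMeans(n_clusters=5, init="k-means++", random_state=42).fit_predict(X_processed)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection="3d")  # requires matplotlib >= 3.2
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels, cmap="tab10", s=10)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
plt.show()
```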

PCA3 clusters, image by Author

Pretty good-looking clustering, isn’t it?

Supervising unsupervised clustering:

Next, I validated my clustering using a Random Forest classification model. After splitting the dataset into train and test sets at an 80%–20% ratio, I used the predictions from the clustering model as the dependent variable for the Random Forest classifier and tested the predictive ability of the model. If the clustering makes sense, the Random Forest model will be able to predict the clusters accurately. The model achieved an accuracy of 0.93 on the test set. The clusters are then explored to identify their characteristics, with insights from a combination of the Random Forest model’s feature importances and permutation importance for further exploration of features, both intra-cluster and inter-cluster. Based on the observed attributes of the clusters, the K-means model was able to cluster fairly well.
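
A sketch of this validation step, assuming X_processed and the K=5 cluster labels from above:

```python
# Sketch: validate the clustering with a Random Forest classifier,
# using the cluster labels as the target (80/20 train-test split).
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_processed, labels, test_size=0.2, random_state=42, stratify=labels)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = rf.predict(X_test)

print("Test accuracy:", accuracy_score(y_test, y_pred))  # ~0.93 per the article
print(classification_report(y_test, y_pred))

# Impurity-based and permutation importances for exploring what drives the clusters
impurity_importance = rf.feature_importances_
perm_importance = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
```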

Model report of the Random Forest classifier:

Random Forest classifier model report, image by Author

Distribution of clusters:

Image by Author.

After exploring each cluster, they are labeled as:

  • Cluster 0: Low-value, frequent users of services.
  • Cluster 1: High-risk clients.
  • Cluster 2: Regular clients.
  • Cluster 3: Most loyal clients (mostly older clients).
  • Cluster 4: High-value clients.

This workflow is a good option for deciding on the optimal K for an unsupervised clustering model and validating the choice with a supervised classification model.

All of this can be found on GitHub following this link. This analysis is expanded with churn analysis, which can be found on GitHub using this link.

That's all for today. Until next time!
