How to make clustering explainable

In this article, I will explain how to use SHAP values to gain a better understanding of clustering results

Shuyang Xiang
Towards Data Science

--

Clustering is always a black box

Clustering is often required in business cases to gain a better understanding of different types of clients. But that alone might not be enough. From a business point of view, it does not suffice to know how many clusters we have and who belongs to which cluster. We also need to know what exactly forms these different clusters. In other words, we would like to explain the result of clustering, which, unfortunately, is usually a black box. I often find myself in the situation where I know very well which group a person belongs to, but I am not sure why he is in that group and what makes him different from the members of the other groups.

The answer is not obvious. In supervised learning, SHAP values provide a general framework for machine learning explainability, whether the model is a glass box or a black box. I am not going to explain SHAP values in detail, as this has already been done well and references can be found almost everywhere on the internet, e.g. https://christophm.github.io/interpretable-ml-book/shap.html. Unfortunately, though, computing SHAP values requires the data to have labels: SHAP values are defined as how much each feature of a sample contributes to the prediction of the output label. Without labels, SHAP can hardly be applied.

To build a bridge between clustering and SHAP values, I will use the labels produced by clustering the data and compute SHAP values on that basis. I will explain my idea by working through an example. Please find the link to the notebook for the detailed code.

Wine dataset clustering

Wine dataset

I will be working on the wine dataset which is available here: https://www.kaggle.com/harrywang/wine-dataset-for-clustering. The data is the result of a chemical analysis of wines grown in the same region in Italy. The analysis determined the quantities of 13 constituents found in the wines under study. Let us first take a quick look at the head of the table.
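As a minimal loading sketch (the file name wine-clustering.csv is assumed from the Kaggle download above; adjust the path to your local copy):

import pandas as pd
# Load the 13 chemical features into a dataframe; file name assumed from the Kaggle download
wine_df = pd.read_csv("wine-clustering.csv")
wine_df.head()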

Image by author: Head of the wine dataset table

Here I show the histogram of all the features as well.
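These histograms can be reproduced with pandas' built-in plotting; the bin count and figure size below are arbitrary choices:

import matplotlib.pyplot as plt
# One histogram per feature, 13 in total
wine_df.hist(bins=20, figsize=(14, 10))
plt.tight_layout()
plt.show()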

Image by author: Histogram of all features of the wine dataset

Data preparation

Before running the clustering algorithm, we have to normalize the features. I used MinMaxScaler.

import pandas as pd
from sklearn import preprocessing
# Scale every feature of the wine dataframe to the [0, 1] range
wine_value = wine_df.copy().values
min_max_scaler = preprocessing.MinMaxScaler()
wine_scaled = min_max_scaler.fit_transform(wine_value)
wine_df_scaled = pd.DataFrame(wine_scaled, columns=wine_df.columns)

Here are two scatter plots of the alcohol and ash values before and after scaling.
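A rough sketch of how such a comparison can be drawn, assuming the columns are named Alcohol and Ash as in the Kaggle file:

import matplotlib.pyplot as plt
# Alcohol vs Ash, before and after min-max scaling
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(wine_df["Alcohol"], wine_df["Ash"])
axes[0].set_title("Before scaling")
axes[1].scatter(wine_df_scaled["Alcohol"], wine_df_scaled["Ash"])
axes[1].set_title("After scaling")
for ax in axes:
    ax.set_xlabel("Alcohol")
    ax.set_ylabel("Ash")
plt.show()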

Image by author: Scatter plots of two features before vs after normalization

Clustering algorithm

We are now ready to run a clustering algorithm on this wine dataset. I am going to use the K-means algorithm. We can easily run K-means for a range of numbers of clusters and collect the distortions into a list.

from sklearn.cluster import KMeans
# Fit K-means for k = 1..9 and record the distortion (inertia) of each fit
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(wine_df_scaled)
    distortions.append(kmeanModel.inertia_)

Plotting the distortions against the number of clusters, the elbow of the curve makes it clear that 3 is the most appropriate number of clusters.
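The elbow plot itself can be drawn directly from the list of distortions collected above:

import matplotlib.pyplot as plt
# Plot distortion (inertia) against the number of clusters
plt.plot(K, distortions, "bx-")
plt.xlabel("Number of clusters k")
plt.ylabel("Distortion")
plt.title("Elbow method")
plt.show()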

Image by author: Distortions of different values of clusters.

Explain cluster results with SHAP values

Now 3 clusters are created. The K-means model simply outputs a number from 0 to 2 indicating which cluster a sample belongs to, and no more than that. Since the dataset has 13 features, even a complete visualization would not be very enlightening. To explain the clustering result better, here is what I am going to do: fit a classifier whose target is exactly the labels provided by the clustering, and compute SHAP values based on this classifier.

I will be using a RandomForestClassifier. The great thing is that there is no need for normalization before fitting a random forest, so we can use the original data directly. Since the goal of the classifier is only to better understand the clustering and overfitting is not a concern, I will fit it on the entire dataset.

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import label_binarize
# Re-run K-means with 3 clusters and use its labels as classification targets
kmeanModel = KMeans(n_clusters=3)
y = kmeanModel.fit(wine_df_scaled).labels_
y = label_binarize(y, classes=[0, 1, 2])
# Fit the classifier on the original, unscaled features
clf = RandomForestClassifier()
clf.fit(wine_df, y)

Now SHAP is finally ready to do its job. I will let SHAP explain the features of the wine dataset.

import shap
# Tree explainer for the random forest; one SHAP value per sample, feature and output
explainer = shap.TreeExplainer(clf)
shap_values = explainer(wine_df).values

What did SHAP tell us?

Take group 0 as an example and let us have a look at its summary plot.

Each point on this summary plot is a Shapley value for a feature of a sample. The x-axis gives the SHAP value, while the y-axis lists the features ordered by their importance.
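As a sketch, such a summary plot can be produced for cluster label 0. I assume here that shap_values comes back as an array whose last axis indexes the outputs; it is worth checking shap_values.shape first, since the exact layout depends on the SHAP version and the model.

# Check the layout of the SHAP values, then plot the slice for cluster label 0
print(shap_values.shape)
shap.summary_plot(shap_values[:, :, 0], wine_df)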

Image by author: SHAP value of label 0

We can see that OD280, Flavanoids, and Hue are the features with the most positive impact on defining this cluster. We can then imagine a kind of wine with relatively high values of OD280, Flavanoids, and Hue. As a person very ignorant of wine, I found that this summary plot did give me a better understanding of the clustering result.

To end this article, I would like to say that clustering is a very good way to interpret data without labels, but explaining the result of clustering has always been a problem. With the help of SHAP values, we can understand the clusters better.
