Cluster analysis is a powerful technique for identifying groups of samples with similar patterns. However, once clusters are formed, it can still be challenging to determine which features drive them. Yet this step is crucial because it reveals insights that may otherwise be missed and that can be used for decision-making and a deeper understanding of your data set. One way to determine the driving features is by coloring the samples on their feature values. Although this is insightful, it becomes labor-intensive with hundreds of features, and the exact contribution of a (set of) feature(s) is difficult to judge when clusters differ in size and density. I will demonstrate how to quantitatively detect the driving features behind the clusters. In this blog, the clusteval library is used for cluster evaluation and to determine the features that drive the formation of the clusters.
Background
Unsupervised clustering is a technique to identify natural or data-driven groups in data without using predefined labels or categories. The challenge is that different clustering methods can result in different groupings due to the implicit structure they impose on the data. To determine what constitutes a "good" clustering, we can use quantitative measures. More in-depth details can be read in the blog "From Data to Clusters: When is Your Clustering Good Enough?" [1].
The clusteval library tests whether features are significantly associated with the cluster labels.
Clusteval is a Python package developed to evaluate the clustering tendency, the quality, the number of clusters, and to determine the statistical association of the features with the clusters. Clusteval returns the cluster labels for the optimal number of clusters that produces the best partitioning of the samples. The following evaluation strategies are implemented: the Silhouette score, the Davies-Bouldin index, and the derivative (or Elbow) method, which can be used in combination with K-means, agglomerative clustering, DBSCAN, and HDBSCAN [1].
pip install clusteval
To detect the driving features behind the cluster labels, clusteval utilizes the HNET library [2], which performs the Hypergeometric test for categorical features and the Mann-Whitney U test for continuous features to assess whether they are significantly associated with the cluster labels. More in-depth details can be read here:
Explore and understand your data with a network of significant associations.
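To give an intuition of the kind of tests involved, below is a minimal sketch with scipy. This is illustrative only and not HNET's actual implementation; the counts are made up, except for the 12,330 samples of the data set used later in this blog.

import numpy as np
from scipy.stats import hypergeom, mannwhitneyu

# Hypergeometric test: is a categorical value over-represented within one cluster?
# M = total samples, n = samples carrying the value, N = cluster size, k = overlap.
M, n, N, k = 12330, 2000, 900, 400
pval_cat = hypergeom.sf(k - 1, M, n, N)  # P(X >= k)

# Mann-Whitney U test: does a continuous feature differ between a cluster and the rest?
rng = np.random.default_rng(0)
inside, outside = rng.normal(1.0, 1.0, 900), rng.normal(0.0, 1.0, 11430)
pval_num = mannwhitneyu(inside, outside, alternative='two-sided').pvalue
print(pval_cat, pval_num)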
Make sure the clustering is trustworthy.
Before we can detect the driving features behind the clusters, we first need to cluster the data and be convinced that our clustering is valid. In contrast to supervised approaches, clustering algorithms work with homogeneous data, where all variables have similar types or units of measurement. This is very important because clustering algorithms group data points based on their similarity, and will therefore produce unreliable results when mixing data types or using non-homogeneous data. Be convinced of the following (a minimal sketch follows the list):
- The data is normalized according to the research aim and the statistical properties of the data.
- The appropriate distance metric is used.
- The clustering tendency and quality are evaluated.
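As a minimal sketch of these checks, assume a purely numerical data set X in a pandas DataFrame; scikit-learn's StandardScaler is used here as one possible normalization.

from sklearn.preprocessing import StandardScaler
from clusteval import clusteval

# 1. Normalize so that all features are on a comparable scale
X_scaled = StandardScaler().fit_transform(X)

# 2. Choose a distance metric that matches the data type (euclidean for numerical data)
ce = clusteval(cluster='agglomerative', metric='euclidean', linkage='complete')

# 3. Evaluate the clustering tendency and quality
results = ce.fit(X_scaled)
ce.plot()             # evaluation scores across the number of clusters
ce.plot_silhouette()  # per-cluster quality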
With the cluster labels, we can investigate the contribution of the features. Let’s make a small use case in the next section.
Toy example to reveal driving features behind the cluster labels.
For this use case, we will load the online shoppers' intentions data set and go through the steps of preprocessing, clustering, evaluation, and then determining the features that are significantly associated with the cluster labels. This data set contains a total of 12,330 samples with 18 features. This mixed data set requires a few more pre-processing steps to make sure that all variables have similar types or units of measurement. Thus, the first step is to create a homogeneous data set with comparable units. A common manner is by discretizing and creating a one-hot matrix. I will use the df2onehot library, with the following pre-processing steps to discretize:
- Categorical values 0, None, ? and False are removed.
- One-hot features with less than 50 positive values are removed.
- For features that had 2 categories, only one is kept.
- Features with 80% unique values or more are considered to be numeric.
The pre-processing step converted the data set into a one-hot matrix containing the same 12,330 samples but now with 121 one-hot features. Notably, the above-mentioned criteria are not a gold standard but should be explored for each use case. For clustering, we will use agglomerative clustering with the hamming distance and complete linkage. See the code section below.
# Install libraries
pip install df2onehot
# Import libraries
from clusteval import clusteval
from df2onehot import df2onehot
# Load data from UCI
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv'
# Initialize clusteval
ce = clusteval()
# Import data from url
df = ce.import_example(url=url)
# Preprocessing
cols_as_float = ['ProductRelated', 'Administrative']
df[cols_as_float] = df[cols_as_float].astype(float)
dfhot = df2onehot(df, excl_background=['0.0', 'None', '?', 'False'], y_min=50, perc_min_num=0.8, remove_mutual_exclusive=True, verbose=4)['onehot']
# Initialize using the specific parameters
ce = clusteval(evaluate='silhouette',
               cluster='agglomerative',
               metric='hamming',
               linkage='complete',
               min_clust=2,
               verbose='info')
# Clustering and evaluation
results = ce.fit(dfhot)
# [clusteval] >INFO> Saving data in memory.
# [clusteval] >INFO> Fit with method=[agglomerative], metric=[hamming], linkage=[complete]
# [clusteval] >INFO> Evaluate using silhouette.
# [clusteval] >INFO: 100%|██████████| 23/23 [00:28<00:00, 1.23s/it]
# [clusteval] >INFO> Compute dendrogram threshold.
# [clusteval] >INFO> Optimal number clusters detected: [9].
# [clusteval] >INFO> Fin.
After running clusteval on the data set, it returns 9 clusters. Because the data contains 121 dimensions (the features), we cannot directly inspect the clusters visually in a scatterplot. However, we can perform an embedding and then visually inspect the data in a scatterplot, as shown in the code section below. The embedding is automatically performed when specifying embedding='tsne'.
# Plot the Silhouette and show the scatterplot using tSNE
ce.plot_silhouette(embedding='tsne')

The results in Figure 1 (right panel) depict the scatterplot after a t-SNE embedding, where the samples are colored on the cluster labels. The left panel shows the Silhouette plot, in which we can visually assess the quality of the clustering results, such as the homogeneity and separation of the clusters, and the optimal number of clusters detected by the clustering algorithm.
Moreover, the Silhouette score ranges from -1 to 1 (x-axis), where a score close to 1 indicates that data points within a cluster are very similar to each other and dissimilar to points in other clusters. Clusters 0, 2, 3, and 5 appear to be well-separated. A Silhouette score close to 0 indicates overlapping clusters, or that the data points are equally similar to their own cluster and neighboring clusters. A score close to -1 suggests that data points are more similar to points in neighboring clusters than to their own cluster.
The width of the bars represents the density or size of each cluster. Wider bars indicate larger clusters with more data points, while narrower bars indicate smaller clusters with fewer data points. The dashed red line (close to 0 in our case) represents the average Silhouette score over all data points and serves as a reference to assess the overall quality of the clustering. Clusters with average Silhouette scores above the dashed line are considered well-separated, while clusters with scores below the dashed line may indicate poor clustering. In general, a good clustering should have Silhouette scores close to 1, indicating well-separated clusters. However, be aware that we clustered our data in the high-dimensional space and evaluate the clustering results after a t-SNE embedding in the low, 2-dimensional space. The projection can give a different view of reality.
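If you want the underlying numbers rather than the plot, the per-sample Silhouette scores can also be computed directly. Below is a minimal sketch with scikit-learn, assuming the detected cluster labels are available in results['labx'].

import numpy as np
from sklearn.metrics import silhouette_samples

# Per-sample Silhouette scores, using the same hamming distance as the clustering
labels = results['labx']
sil = silhouette_samples(dfhot.values, labels, metric='hamming')

# Average Silhouette width per cluster (the bars in the plot summarize these values)
for label in np.unique(labels):
    print(f'Cluster {label}: mean Silhouette = {sil[labels == label].mean():.3f}')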
Alternatively, we can also do the embedding first and then cluster the data in the low-dimensional space (see the code section below). We now use the Euclidean distance metric because our input data is no longer one-hot but consists of the coordinates from the t-SNE mapping. After fitting, we detect an optimal number of 27 clusters, which is a lot more than in our previous results. We can also see that the cluster evaluation scores (Figure 2) appear to be turbulent. This has to do with the structure of the data and whether an optimal clustering can be formed at all.
# Import library and compute the t-SNE embedding
from sklearn.manifold import TSNE
xycoord = TSNE(n_components=2, init='random', perplexity=30).fit_transform(dfhot.values)
# Initialize clusteval
ce = clusteval(cluster='agglomerative', metric='euclidean', linkage='complete', min_clust=5, max_clust=30)
# Clustering and evaluation
results = ce.fit(xycoord)
# Make plots
ce.plot()
ce.plot_silhouette()


The Silhouette plot now shows better results than previously, indicating that clusters are better separated. In the next section, we will detect which features are significantly associated with the cluster labels.
After determining the optimal number of clusters comes the challenging step: understanding which features drive the formation of the clusters.
Detect the Driving Features Behind the Cluster Labels.
At this point, we have detected the optimal number of clusters, and each sample is assigned a cluster label. To detect the driving features behind the cluster labels, we can compute the statistical association between the features and the detected cluster labels. This determines whether certain values of one variable tend to co-occur with one or more cluster labels. Various statistical measures of association, such as the Chi-square test, Fisher's exact test, and the Hypergeometric test, are commonly used when dealing with ordinal or nominal variables. I will use the Hypergeometric test for the association between categorical variables and the cluster labels, and the Mann-Whitney U test for the association between continuous variables and the cluster labels. These tests are readily implemented in HNET, which is in turn utilized in the clusteval library. With the enrichment functionality, we can now test for statistically significant associations. After this step, we can use the scatter functionality to plot the enriched features on top of the clusters.
# Enrichment between the detected cluster labels and the input dataframe
enrichment_results = ce.enrichment(df)
# [df2onehot] >Auto detecting dtypes.
# 100%|██████████| 18/18 [00:00<00:00, 53.55it/s]
# [df2onehot] >Set dtypes in dataframe..
# [hnet] >Analyzing [cat] Administrative...........................
# [hnet] >Analyzing [num] Administrative_Duration...........................
# [hnet] >Analyzing [cat] Informational...........................
# [hnet] >Analyzing [num] Informational_Duration...........................
# [hnet] >Analyzing [cat] ProductRelated...........................
# [hnet] >Analyzing [num] ProductRelated_Duration...........................
# [hnet] >Analyzing [num] BounceRates...........................
# [hnet] >Analyzing [num] ExitRates...........................
# [hnet] >Analyzing [num] PageValues...........................
# [hnet] >Analyzing [num] SpecialDay...........................
# [hnet] >Analyzing [cat] Month...........................
# [hnet] >Analyzing [cat] OperatingSystems...........................
# [hnet] >Analyzing [cat] Browser...........................
# [hnet] >Analyzing [cat] Region...........................
# [hnet] >Analyzing [cat] TrafficType...........................
# [hnet] >Analyzing [cat] VisitorType...........................
# [hnet] >Analyzing [cat] Weekend...........................
# [hnet] >Analyzing [cat] Revenue...........................
# [hnet] >Multiple test correction using holm
# [hnet] >Fin
# Make scatterplot and show the top n_feat enriched features per cluster.
ce.scatter(n_feat=2)
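As a quick manual sanity check of such an association, a single feature can also be tested by hand. This is illustrative only; cluster 0 is an arbitrary choice, and the cluster labels are assumed to be available in results['labx'].

from scipy.stats import mannwhitneyu

# Compare PageValues between one cluster and all remaining samples
labels = results['labx']
in_cluster = df['PageValues'][labels == 0]
rest = df['PageValues'][labels != 0]
print(mannwhitneyu(in_cluster, rest, alternative='two-sided'))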

Final words.
Understanding which features drive the formation of clusters is crucial for extracting valuable insights from complex data sets. Visual inspection of clusters by coloring on feature values is labor-intensive and challenging when dealing with large data sets that contain numerous features and clusters of varying sizes and densities. The clusteval library provides a quantitative approach to determine the driving features behind clusters through statistical testing of the associations between categorical and continuous variables and the cluster labels, using the Hypergeometric test and the Mann-Whitney U test, respectively.
An important but challenging step is ensuring that the clustering is trustworthy through proper data normalization, distance metric selection, and cluster evaluation. Only then can the driving features behind the clusters provide sensible information. The example data set of online shoppers' intentions demonstrates a practical application of clusteval in identifying the driving features behind clusters. Overall, incorporating quantitative methods for determining driving features in cluster analysis can greatly enhance the interpretability and value of complex data sets.
Be Safe. Stay Frosty.
Cheers E.
If you like this blog about clustering, feel free to follow me to stay up-to-date with my latest content because I write more blogs like this one. If you use my referral link, you can support my work, and get access to all Medium blogs without limits.
References
- E. Taskesen, From Data to Clusters: When is Your Clustering Good Enough?, July 2023, Medium.
- E. Taskesen, Explore and understand your data with a network of significant associations, Aug. 2021, Medium.