Consensus Clustering

A Robust Clustering Method With an Application to Song Playlists

Lance Fernando
Jul 6, 2018


In this post we will first briefly go over clustering and its common issues. Then we will explain how consensus clustering can alleviate some of those issues and how to interpret its output. Finally, we will introduce a cool RShiny app designed to run consensus clustering on a set of songs and explore possible playlists.

What is Clustering?

Clustering is the task of creating non-overlapping groupings of things based on their similarity. In the context of data science, we may wish to cluster users on our social media platform for targeted advertising, group together gene expression profiles in a study to identify cancerous behavior, or match documents on our blog platform to show readers similar posts.

Clustering is an unsupervised learning task: we do not have a labeled response variable to train a machine learning algorithm on. Instead, we look for similarities between observations using the full set of features. There are many clustering algorithms, each with its pros and cons; in this post we will not go over each algorithm and its respective theory.

What’s wrong?

Methods like K-means or K-medoids use a random initialization, so every run can produce slightly different results and a poor initialization can bias the outcome. They also require you to specify a value for K up front. Hierarchical clustering avoids choosing K in advance thanks to its tree-like structure, which lets you cut the tree at any desired number of clusters, and it provides a nice dendrogram for visualizing clusters, which other methods lack. However, like other clustering methods, it is difficult to assess the stability of your chosen cut-point and the resulting K clusters. Within- and between-cluster distances are commonly used to assess clusters, but we can only trust these metrics as much as we trust the random start and our choice of K.

What is Consensus Clustering?

Consensus clustering (or aggregated clustering) is a more robust approach that relies on multiple iterations of the chosen clustering method on sub-samples of the dataset. Inducing sampling variability through sub-sampling gives us metrics to assess the stability of the clusters and of our parameter decisions (i.e., K and linkage), as well as a nice visual component in the form of a heatmap.

Consensus clustering pseudocode

As the name suggests, consensus clustering reaches a consensus on each observation’s cluster assignment based on its assignments across all iterations of the clustering algorithm.

We first decide how many iterations we wish to run. Next, we choose a set of K values to test. For each K, we iterate many times, producing a set of cluster assignments for our observations each time. Since we are sub-sampling the data (e.g., taking 80% of the observations each iteration), not all observations are clustered in every iteration.
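To make this concrete, here is a minimal sketch in R of the resampling loop for a single value of K, using base k-means and an 80% sub-sample on each iteration. The function name consensus_matrix and the bookkeeping matrices are my own, not from any particular package.

```r
# Sketch: consensus matrix for one K via repeated k-means on sub-samples.
consensus_matrix <- function(X, K, n_iter = 100, frac = 0.8) {
  n <- nrow(X)
  together <- matrix(0, n, n)  # times each pair landed in the same cluster
  sampled  <- matrix(0, n, n)  # times each pair appeared in the same sub-sample

  for (b in seq_len(n_iter)) {
    idx <- sample(n, size = floor(frac * n))                  # draw the sub-sample
    cl  <- kmeans(X[idx, , drop = FALSE], centers = K)$cluster

    same <- outer(cl, cl, "==") * 1                           # 1 if a pair shares a cluster
    together[idx, idx] <- together[idx, idx] + same
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }

  M <- together / pmax(sampled, 1)  # consensus = co-clustered / co-sampled
  diag(M) <- 1
  M
}

# Hypothetical usage on a scaled numeric dataset `my_data`:
# M <- consensus_matrix(scale(my_data), K = 3, n_iter = 100)
```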

Suppose we wish to cluster four users: Alice, Brian, Sally and James. We choose k-means and decide on running four iterations.

For K = 2, we get the following table. Brian and Sally are clustered together in the 1st and 4th iterations. Brian and James are clustered together in the 3rd iteration, while Sally and James are clustered together in the 2nd. Alice is never clustered with anyone.

To get our consensus matrix, we look at the pairwise relationship between each pair of users. The result is an NxN matrix, analogous to a similarity matrix: out of all the iterations in which two users were sub-sampled together, how many times did they land in the same cluster? Our consensus matrix would look something like this.

Brian and Sally were sub-sampled together 3 times but clustered together only twice. Thus, the entry for Brian ~ Sally is 2/3 ≈ 0.67.
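To tie the toy example together, here is a small R sketch that computes a consensus matrix from per-iteration assignments. The assignment table below is hypothetical, just one possible set of assignments consistent with the description above (NA means the user was not in that sub-sample), and the function name is my own.

```r
# One possible assignment table consistent with the toy example
# (rows = users, columns = iterations, NA = not in that sub-sample).
assignments <- rbind(
  Alice = c(2,  2, NA,  2),
  Brian = c(1, NA,  1,  1),
  Sally = c(1,  1,  2,  1),
  James = c(NA, 1,  1, NA)
)

# Pairwise consensus: times co-clustered / times co-sampled.
consensus_from_assignments <- function(A) {
  n <- nrow(A)
  M <- diag(1, n)
  dimnames(M) <- list(rownames(A), rownames(A))
  for (i in seq_len(n - 1)) for (j in (i + 1):n) {
    both <- !is.na(A[i, ]) & !is.na(A[j, ])  # iterations where both users were sampled
    M[i, j] <- M[j, i] <- if (any(both)) mean(A[i, both] == A[j, both]) else 0
  }
  M
}

round(consensus_from_assignments(assignments), 2)
# Brian ~ Sally: co-sampled in 3 iterations, co-clustered in 2 of them -> 0.67
```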

This toy example shows how we get a consensus matrix. We create one for each candidate K and pick the one that gives us the cleanest distribution. Say we take the upper triangle of our consensus matrix and plot the distribution of its values. Ideally, we would want something like the plot on the left: consensus values bundled at 0 and 1 mean that observations are consistently clustered together or apart throughout all iterations.

We then compute the empirical CDF of the consensus values for each K. For each step in K, we calculate the change in the area under the CDF and use the usual elbow method to choose K. With the plot on the left, we’d choose K = 3.
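As a rough sketch of this step, the snippet below assumes M_list is a named list of consensus matrices, one per candidate K starting at K = 2 (for example, produced by the consensus_matrix sketch above); it plots the relative change in area under the CDF so the elbow can be read off.

```r
# Area under the empirical CDF of the upper-triangle consensus values.
cdf_area <- function(M) {
  v    <- M[upper.tri(M)]
  grid <- seq(0, 1, by = 0.01)
  cdf  <- ecdf(v)(grid)            # empirical CDF evaluated on a grid
  sum(diff(grid) * cdf[-1])        # simple right-endpoint approximation of the area
}

areas <- sapply(M_list, cdf_area)                          # one area per K (assumed K = 2, 3, ...)
delta <- c(areas[1], diff(areas) / areas[-length(areas)])  # relative change as K grows

plot(seq_along(delta) + 1, delta, type = "b",
     xlab = "K", ylab = "Relative change in area under CDF")
# Choose K at the "elbow", where adding clusters stops increasing the area much.
```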

Now, with our consensus matrix M, we compute 1 - M and treat it as a dissimilarity matrix for hierarchical clustering, which we use to sort our observations. We do this so that we can obtain a nice heatmap like the one below: sorting the rows/columns by the hierarchical clustering puts similar observations next to each other, creating nice blocks along the diagonal.
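A minimal sketch of this step, assuming M is the consensus matrix for the chosen K; the average linkage and the final cut at K = 3 (the value suggested by the elbow example above) are illustrative choices, not prescriptions.

```r
# Turn the consensus matrix into a dissimilarity, cluster, and reorder for a heatmap.
D   <- as.dist(1 - M)                   # 1 - consensus behaves like a dissimilarity
hc  <- hclust(D, method = "average")    # hierarchical clustering on 1 - M
ord <- hc$order                         # leaf order from the dendrogram

# Reordering rows/columns by the dendrogram pulls co-clustered observations together,
# so stable clusters show up as blocks along the diagonal of the heatmap.
image(M[ord, ord], col = gray(seq(1, 0, length.out = 100)), axes = FALSE,
      main = "Consensus matrix (sorted)")

clusters <- cutree(hc, k = 3)           # final consensus cluster assignments
```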

Heatmap of a consensus matrix after sorting with hierarchical clustering. Source: https://link.springer.com/content/pdf/10.1023%2FA%3A1023949509487.pdf

Summary Statistics

There are two summary statistics we can compute that help us determine the stability of a particular cluster as well as the significance of certain observations within a cluster. First is the cluster consensus m(k), which simply computes the average consensus value over all pairs of observations within a cluster.

Cluster consensus

Next is the item consensus m_i(k), which focuses on a particular item (observation) and computes the average consensus value between that item and all other items in its cluster.

Item consensus

Using these summary statistics, we can rank clusters by their consensus or identify observations that are central in their cluster.
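Here is a rough R sketch of both statistics, assuming M is the consensus matrix and clusters holds the final cluster assignments (as in the earlier sketches); the function names are my own.

```r
# Average consensus over all pairs of observations within each cluster.
cluster_consensus <- function(M, clusters) {
  sapply(sort(unique(clusters)), function(k) {
    idx <- which(clusters == k)
    if (length(idx) < 2) return(NA)      # a singleton cluster has no within-cluster pairs
    sub <- M[idx, idx]
    mean(sub[upper.tri(sub)])
  })
}

# Average consensus of each item with the other members of its own cluster.
item_consensus <- function(M, clusters) {
  sapply(seq_along(clusters), function(i) {
    idx <- setdiff(which(clusters == clusters[i]), i)
    if (length(idx) == 0) return(NA)
    mean(M[i, idx])
  })
}
```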

RShiny App For Playlist Creation

This app uses the top 50 songs on Spotify from 2017. Spotify provides various attributes for each song, but only the numeric features were used, after scaling. The app lets you try different values of K and different numbers of resampling iterations, in addition to other parameter options depending on the clustering algorithm chosen. The first visualization is a PCA plot displaying the first two PCs and the cluster assignments for all observations. PCA lets us project all the features into a 2D plot and see how well our clustering methods work.
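For illustration, a PCA view along these lines can be produced with a couple of lines of base R; here songs (the scaled numeric Spotify features) and clusters (the consensus assignments) are assumed placeholder names, not the app's actual objects.

```r
# Project the (already scaled) song features onto the first two principal components
# and color each song by its consensus cluster assignment.
pca <- prcomp(songs)
plot(pca$x[, 1], pca$x[, 2], col = clusters, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Songs in the first two principal components")
```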

Visual Heatmap

The image above is an example of the heatmap created using hierarchical clustering with K = 10, n.iter = 26, distance = euclidean and linkage = complete. As you can see, the observations are not clustered perfectly, which causes coloring outside of the diagonal blocks.

This app is not limited to playlist creation. You can grab the code from my GitHub and plug in your own dataset if you’d like! It is designed to help you visualize how different clustering algorithms (and their respective tuning) perform under consensus clustering.

Important Considerations

In some cases, the data may simply not be clusterable. Even if the data contain no inherently well-separated clusters, consensus clustering can still produce nice-looking ones. It is important to consider the chance of a false positive: apparent stability where there really isn’t any.

Summary

Consensus clustering alleviates common issues that arise in most clustering methods, such as random initialization, choosing K, the lack of an intuitive visualization, and assessing the stability of clusters. The method gathers a consensus of cluster assignments by sub-sampling the dataset and running a chosen clustering algorithm many times. There are various metrics we can use to validate our chosen K. We can also assess clusters using cluster consensus and see which items best represent a given cluster using item consensus. An RShiny app is provided to explore consensus clustering on a dataset of Spotify’s top 50 songs of 2017.

The app is still a work in progress so suggestions or corrections are highly encouraged!

Read this awesome paper to learn more about consensus clustering.
Here’s a link to the RShiny app as well as the github repo.

Feel free to contact me with any questions! Hope you enjoyed the post!
