
Unsupervised Learning Series - Exploring Hierarchical Clustering

Let's explore how hierarchical clustering works and how it builds clusters based on pairwise distances.

Photo by Nathan Anderson @unsplash.com

In my last post of the Unsupervised Learning Series, we explored one of the most famous clustering methods, K-means Clustering. In this post, we are going to discuss the methods behind another important clustering technique – Hierarchical Clustering!

This method is also based on distances (euclidean, manhattan, etc.) and uses a hierarchical representation of the data to combine data points. Contrary to k-means, it does not require data scientists to configure a hyperparameter for the number of centroids (such as k) up front.

Broadly, hierarchical clustering can be divided into two groups: agglomerative clustering and divisive clustering. In the first, data points start as individual units and are progressively aggregated with nearby data points based on distances. In the latter, all data points start as a single cluster, which is then split based on certain criteria. As the agglomerative version is the most famous and widely used (sklearn’s built-in implementation follows this protocol), that is the hierarchical type we are going to explore in this post.

In this blog post we’ll approach Agglomerative Hierarchical Clustering in two steps:

  • First, we’ll do a play-by-play analysis of how hierarchies are built, using agglomerative clustering with the average method (one of the methods we can use to build the hierarchy of data points).
  • Then, we’ll see some examples of how to fit a hierarchical clustering on a real dataset using sklearn’s implementation. This is also where we’ll detail other methods we can use to build our hierarchies (ward, minimum, etc.)

Let’s start!


Agglomerative Clustering Example – Step by Step

In our step-by-step example, we are going to use a fictional dataset with 5 customers:

Hierarchical Clustering Example – Image by Author

Let’s imagine that we run a shop with 5 customers and we want to group these customers based on their similarities. We have two variables that we want to consider: the customer’s age and their annual income.

The first step of our agglomerative clustering consists of creating pairwise distances between all our data points. Let’s do just that, representing each data point by its coordinates in an [x, y] format:

  • Distance between [60, 30] and [60, 55]: 25.0
  • Distance between [60, 30] and [30, 75]: 54.08
  • Distance between [60, 30] and [41, 100]: 72.53
  • Distance between [60, 30] and [38, 55]: 33.30
  • Distance between [60, 55] and [30, 75]: 36.06
  • Distance between [60, 55] and [41, 100]: 48.85
  • Distance between [60, 55] and [38, 55]: 22.0
  • Distance between [30, 75] and [41, 100]: 27.31
  • Distance between [30, 75] and [38, 55]: 21.54
  • Distance between [41, 100] and [38, 55]: 45.10
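These numbers can be reproduced with a few lines of scipy – a minimal sketch, with variable names of my own choosing:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# [age, annual income] for the 5 fictional customers
customers = np.array([[60, 30], [60, 55], [30, 75], [41, 100], [38, 55]])

# pairwise euclidean distances, shown as a readable 5x5 matrix
print(np.round(squareform(pdist(customers, metric='euclidean')), 2))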

Although we can use any type of distance metric we want, we’ll use euclidean due to its simplicity. From the pairwise distances we’ve calculated above, which distance is the smallest one?

It’s the distance between the middle-aged customers who make less than 90k dollars a year – the customers at coordinates [30, 75] and [38, 55]!

Reviewing the formula for euclidean distance between two arbitrary points p1 and p2:

Euclidean Distance Formula – Image by Author
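In plain notation, for two points p1 = (x1, y1) and p2 = (x2, y2) in our 2-D example, the distance is:

d(p_1, p_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}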

Let’s visualize our smallest distance on the 2-D plot by connecting the two customers that are nearest:

Connecting the two closest customers – Image by Author

The next step of hierarchical clustering is to consider these two customers as our first cluster!

Considering closest customers as one cluster – Image by Author

Next, we are going to calculate the distances between the data points again. But this time, the two customers we’ve grouped into a single cluster will be treated as a single data point. Consider the red point below, positioned at the midpoint of the two original data points:

Considering closest customers as one cluster – Image by Author

In summary, for the next iterations of our hierarchical solution, we won’t consider the coordinates of the original data points (emojis) but the red point (the average of those data points). Strictly speaking, average linkage averages all pairwise distances between the members of two clusters; using the cluster’s average point, as we do here, is a convenient simplification (distances between such average points are also what centroid linkage uses).

Other methods we can use to calculate distances involving aggregated data points are the following (a small numeric comparison follows the list):

  • Maximum (or complete linkage): considers the farthest data point in the cluster relative to the point we are trying to aggregate.
  • Minimum (or single linkage): considers the closest data point in the cluster relative to the point we are trying to aggregate.
  • Ward (or ward linkage): chooses the merge that minimizes the increase in within-cluster variance.
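To make the difference between these linkages concrete, here is a small numeric sketch (my own illustration) comparing the distance between our first cluster {[30, 75], [38, 55]} and the customer at [60, 30] under each definition:

import numpy as np

cluster = np.array([[30, 75], [38, 55]])
point = np.array([60, 30])

pairwise = np.linalg.norm(cluster - point, axis=1)                 # [54.08, 33.30]
print('single   :', pairwise.min())                                # closest member
print('complete :', pairwise.max())                                # farthest member
print('average  :', pairwise.mean())                               # mean of pairwise distances (~43.69)
print('centroid :', np.linalg.norm(cluster.mean(axis=0) - point))  # average-point simplification (~43.60)

Note that average linkage (≈43.69) and the average-point simplification (≈43.60) are close here but not identical; ward works differently, scoring each candidate merge by the increase in within-cluster variance rather than by a single distance.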

Let me take a short break from the step-by-step explanation to delve a bit deeper into the linkage methods, as they are crucial in this type of clustering. Here’s a visual example of the different linkage methods available in hierarchical clustering, for a fictional example of 3 clusters to merge:

Linkage Methods Visualization

In the sklearn implementation, we’ll be able to experiment with some of these linkage methods and see a significant difference in the clustering results.

Returning to our example, let’s now generate the distances between all our new data points – remember that the two customers we’ve grouped are treated as a single data point from now on:

Considering closest customers as one cluster – Image by Author
  • Distance between [60, 30] and [60, 55]: 25.0
  • Distance between [60, 30] and [34, 65]: 43.60
  • Distance between [60, 30] and [41, 100]: 72.53
  • Distance between [60, 55] and [34, 65]: 27.85
  • Distance between [60, 55] and [41, 100]: 48.85
  • Distance between [34, 65] and [41, 100]: 35.69

Which distance is the shortest? It’s the one between the data points at coordinates [60, 30] and [60, 55]:

Considering next closest customers as one cluster – Image by Author

The next step is, naturally, to join these two customers into a single cluster:

Creating next cluster – Image by Author

With this new landscape of clusters, we calculate pairwise distances again! Remember that we are always using the average of the data points in each cluster (due to the linkage approach we chose) as the reference point for the distance calculation:

  • Distance between [60, 42.5] and [34, 65]: 34.38
  • Distance between [60, 42.5] and [41, 100]: 60.56
  • Distance between [34, 65] and [41, 100]: 35.69

Interestingly, the next elements to aggregate are the two clusters themselves, whose reference points lie at coordinates [60, 42.5] and [34, 65]:

Merging the next clusters – Image by Author

Finally, we finish the algorithm by aggregating all data points into a single big cluster:

Merging the final data point into our cluster – Image by Author
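For reference, here is a compact sketch of the merging loop we just walked through – my own illustration, using the average-point (centroid) simplification rather than sklearn’s exact linkage formulas:

import numpy as np

# our 5 fictional customers as (age, annual income)
points = [(60, 30), (60, 55), (30, 75), (41, 100), (38, 55)]
clusters = {i: [p] for i, p in enumerate(points)}

def centroid(members):
    # reference point of a cluster: the average of its members
    return np.mean(members, axis=0)

while len(clusters) > 1:
    ids = list(clusters)
    # find the pair of clusters whose reference points are closest
    a, b = min(
        ((i, j) for idx, i in enumerate(ids) for j in ids[idx + 1:]),
        key=lambda pair: np.linalg.norm(centroid(clusters[pair[0]]) - centroid(clusters[pair[1]])),
    )
    print(f'merging {clusters[a]} and {clusters[b]}')
    clusters[a] += clusters[b]
    del clusters[b]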

With this in mind, where exactly do we stop? It’s probably not a great idea to end up with a single big cluster containing all data points, right?

To know where to stop, there are some heuristic rules we can use. But first, we need to get familiar with another way of visualizing the process we’ve just done – the dendrogram:

Dendrogram of Our Hierarchical Clustering Solution – Image by Author

On the y-axis, we have the distances that we’ve just calculated. On the x-axis, we have each data point. Climbing up from each data point, we reach a horizontal line – the y-axis value of this line is the distance at which the data points (or clusters) hanging below it were connected.

Remember the first customers we connected into a single cluster? What we saw in the 2D plot matches the dendrogram, as those are exactly the first customers connected by a horizontal line (climbing the dendrogram from below):

First Horizontal Line in Dendrogram – Image by Author

The horizontal lines represent the merging process we’ve just done! Naturally, the dendrogram ends in a big horizontal line that connects all data points.
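For the curious, the dendrogram for our 5 fictional customers can be reproduced with scipy – a minimal sketch under my own naming; method='centroid' mirrors the average-point simplification used in the walkthrough, while method='average' is the classic average linkage:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

customers = np.array([[60, 30], [60, 55], [30, 75], [41, 100], [38, 55]])

# build the merge history and draw the corresponding dendrogram
Z = linkage(customers, method='centroid', metric='euclidean')
dendrogram(Z)
plt.title('Dendrogram of Our Hierarchical Clustering Solution')
plt.show()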

Now that we’re familiar with the dendrogram, we’re ready to check the sklearn implementation and use a real dataset to understand how to select an appropriate number of clusters based on this cool clustering method!

Sklearn Implementation

For the sklearn implementation, I’m going to use the Wine Quality dataset available here.

import pandas as pd

# load the red wine quality dataset (semicolon-separated)
wine_data = pd.read_csv('winequality-red.csv', sep=';')
wine_data.head(10)
Wine Quality Data Set Preview – Image by Author

This dataset contains information about wines (particularly red wines) with different characteristics such as citric acid, chlorides or density. The last column of the dataset refers to the quality of the wine, a score assigned by a tasting panel.

As hierarchical clustering deals with distances and we’re going to use the euclidean distance, we need to standardize our data. We’ll start by applying a StandardScaler on top of our data:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
wine_data_scaled = sc.fit_transform(wine_data)

With our scaled dataset, we can fit our first hierarchical clustering solution! We can access hierarchical clustering by creating an AgglomerativeClustering object:

from sklearn.cluster import AgglomerativeClustering

average_method = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=0,
                                         linkage='average')
average_method.fit(wine_data_scaled)

Let me detail the arguments we are using inside the AgglomerativeClustering:

  • n_clusters=None is used so that the full hierarchy is computed (which lets us produce the full dendrogram).
  • distance_threshold = 0 must be set in the sklearn implementation for the full dendrogram to be produced.
  • linkage = 'average' is a very important hyperparameter. Remember that, in the step-by-step walkthrough, we described one way to measure distances involving newly formed clusters. average computes the distance between two clusters as the average of the pairwise distances between their members (which we approximated with the clusters’ average points above). The sklearn implementation also offers the three other methods we described: single, complete and ward.

After fitting the model, it’s time to plot our dendrogram. For this, I’m going to use the helper function provided in the sklearn documentation:

import numpy as np
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

If we plot our hierarchical clustering solution:

import matplotlib.pyplot as plt

plot_dendrogram(average_method, truncate_mode="level", p=20)
plt.title('Dendrogram of Hierarchical Clustering - Average Method')
plt.show()
Dendrogram of Average Method – Image by Author

The dendrogram is not great, as our observations seem to get a bit jammed together. Sometimes the average, single and complete linkages result in strange dendrograms, particularly when there are strong outliers in the data. The ward method may be more appropriate for this type of data, so let’s test it:

ward_method = AgglomerativeClustering(n_clusters=None,
                                      distance_threshold=0,
                                      linkage='ward')
ward_method.fit(wine_data_scaled)

plot_dendrogram(ward_method, truncate_mode="level", p=20)
Dendrogram of Ward Method – Image by Author

Much better! Notice that the clusters seem to be better defined according to the dendrogram. The ward method merges clusters by minimizing the within-cluster variance of the newly formed clusters (https://online.stat.psu.edu/stat505/lesson/14/14.7), as we described in the first part of the post. The objective is that, at each iteration, the pair of clusters chosen for aggregation is the one that produces the smallest increase in variance (the spread of the data points around the centroid of the newly formed cluster).
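For reference, one common way to write the merge cost that Ward’s method minimizes – the increase in total within-cluster sum of squares when merging candidate clusters A and B with sizes n_A, n_B and centroids c_A, c_B – is:

\Delta(A, B) = \frac{n_A \, n_B}{n_A + n_B} \, \lVert c_A - c_B \rVert^2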

Again, switching methods is just a matter of changing the linkage parameter of the AgglomerativeClustering object!

As we’re happy with the look of the ward method dendrogram, we’ll use that solution for our cluster profiling:

Dendrogram of Ward Method – Image by Author

Can you guess how many clusters we should choose?

According to the distances, a good candidate is to cut the dendrogram at this point, where the resulting clusters seem relatively far from each other:

Dendrogram of Ward Method with Cutoff at 30 – Image by Author

The number of vertical lines that our cutoff line crosses is the number of final clusters in our solution. Choosing the number of clusters is not very "scientific", and different clustering solutions may be reached depending on business interpretation. For example, in our case, cutting the dendrogram a bit higher and reducing the number of clusters in the final solution would also be a valid hypothesis.
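If you want to draw that cutoff on the plot itself, here is a minimal sketch – the value 30 simply mirrors the threshold shown in the figure above:

plot_dendrogram(ward_method, truncate_mode="level", p=20)
plt.axhline(y=30, color='red', linestyle='--')  # cutoff line; each vertical line it crosses is one final cluster
plt.title('Dendrogram of Hierarchical Clustering - Ward Method')
plt.show()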

We’ll stick with the 7 cluster solution, so let’s fit our ward method with those n_clusters in mind:

ward_method_solution = AgglomerativeClustering(n_clusters=7,
                                               linkage='ward')
wine_data['cluster'] = ward_method_solution.fit_predict(wine_data_scaled)

As we want to interpret our clusters based on the original variables, we fit and predict on the scaled data (the distances are based on the scaled dataset) but add the cluster labels to the original dataset.

Let’s compare our clusters using the means of each variable conditioned on the cluster variable:

wine_data.groupby(['cluster']).mean()
Cluster Profiling – Image by Author

Interestingly, we can start to have some insights about the data – for example:

  • Low quality wines seem to have higher levels of total sulfur dioxide – notice the difference between the cluster with the highest average quality and the one with the lowest:
Sulfur Dioxide between Cluster 6 and 2 – Image by Author

And if we compare the quality of the wines in these clusters:

Quality Density Plot between Cluster 6 and 2 – Image by Author
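A rough sketch of how such a density comparison could be produced – assuming seaborn is available and that clusters 2 and 6 are the ones shown in the figure (cluster numbering depends on your fit):

import seaborn as sns
import matplotlib.pyplot as plt

# compare the quality distribution of the two clusters highlighted above
subset = wine_data[wine_data['cluster'].isin([2, 6])]
sns.kdeplot(data=subset, x='quality', hue='cluster', fill=True)
plt.title('Quality Density Plot between Cluster 6 and 2')
plt.show()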

Clearly, on average, Cluster 2 contains higher quality wines.

Another cool analysis we can do is computing a correlation matrix of the cluster means:

Correlation Matrix of Cluster Means – Image by Author
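A sketch of how this matrix can be built – assuming seaborn for the heatmap and reusing the cluster means computed above:

import seaborn as sns
import matplotlib.pyplot as plt

cluster_means = wine_data.groupby('cluster').mean()
# correlation between the original variables, computed across the cluster means
sns.heatmap(cluster_means.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix of Cluster Means')
plt.show()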

This gives us some good hints of potential things we can explore (even for supervised learning). For example, on a multidimensional level, wines with higher sulphates and chlorides may get bundled together. Another conclusion is that wines with higher alcohol content tend to be associated with higher quality wines.


Conclusion

That’s it! Thank you for taking the time to read this blog post about Unsupervised Learning. I’ll keep adding more unsupervised learning algorithms to this series to showcase different types of methods we can use to get to know the structure of our data.

Naturally, Hierarchical Clustering has some pros and cons that we can discuss:

  • A big con of the algorithm is that it may require a lot of heuristics to reach a final solution. A combination of dendrogram analysis, distance-based analysis and silhouette coefficient methods may be needed to reach a number of clusters that makes sense. Also, one should not skip crossing these technical approaches with some business knowledge about the data, to avoid falling into some type of clustering trap.
  • On the positive side, the hierarchical clustering approach is very explainable, helping uncover hidden structures in the data.
  • Additionally, hierarchical clustering does not suffer from the centroid initialization problem – something that may be an advantage for some datasets.

Hierarchical clustering is a very famous clustering method that has been applied to use cases as diverse as:

  • Customer segmentation;
  • Outlier analysis;
  • Analyzing multi-dimensional gene expression data;
  • Document clustering;

It’s a very cool method that data scientists should have on their tool belt. Feel free to try it on your next project and stay tuned for more posts on this Unsupervised Learning Series!

If you would like to drop by my Python courses, feel free to join my free course here (Python For Busy People – Python Introduction in 2 Hours) or a longer 16 hour version (The Complete Python Bootcamp for Beginners). My Python courses are suitable for beginners/mid-level developers and I would love to have you on my class!

The dataset used in this post is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, as shown in the following link: https://archive.ics.uci.edu/dataset/186/wine+quality

