Evaluating Clustering Results

The criteria used to evaluate clustering results

Vijini Mallawaarachchi
Towards Data Science



The main goal of clustering approaches is to obtain high intra-cluster similarity and low inter-cluster similarity (objects in the same cluster are more similar to each other than to objects in different clusters). I have been using different clustering approaches in my work on metagenomic binning, and in this article I would like to share some of my thoughts and the techniques I have used.

Cluster Analysis (Clustering)

According to Wikipedia,

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

I hope you are familiar with the terms used in classification such as TP, TN, FP, FN, accuracy, precision, recall and F1-score. I won’t get into the details of these terms. If you want to recap, I found the following articles very useful.

  1. Accuracy, Recall, Precision, F-Score & Specificity, which to optimize on
  2. Accuracy, Precision, Recall or F1?
  3. Beyond Accuracy: Precision and Recall

The problems I came across when evaluating my clustering results

In some of my analyses, the clustering approaches were unable to find the correct number of clusters present in the gold standard: some approaches over-estimated the number of clusters while others under-estimated it. In such cases, we cannot use the standard criteria for analysing classification results, which assume that the number of predicted classes is the same as the number of classes in the gold standard.

After going through the relevant literature in my field, I came across criteria that have been adapted for the situations described above. Let's see how we can continue with our evaluation.

Evaluation Criteria

Fig. 1. K x S matrix (Image by Author)

The clustering result is represented as a K x S matrix, as shown in Figure 1, where K is the number of clusters predicted by the clustering approach and S is the number of classes present in the gold standard.

Here, an element aₖₛ denotes the number of objects assigned to the kᵗʰ cluster that belong to the sᵗʰ class in the gold standard.

In an ideal case, K = S where the number of clusters predicted by the clustering approach is the same as the number of classes present in the gold standard. However, in certain cases, K < S or K > S where the number of clusters predicted by the clustering approach is not equal to the number of classes present in the gold standard.
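
As a concrete illustration, here is a minimal Python sketch of how such a K x S matrix can be built from two label lists, one giving the predicted cluster of each clustered object and one giving its gold-standard class. The function name and the numpy representation are my own choices for illustration, not code from the original article.

import numpy as np

def build_contingency_matrix(cluster_labels, class_labels):
    """Build the K x S matrix: entry (k, s) counts the objects assigned
    to cluster k that belong to gold-standard class s."""
    clusters = sorted(set(cluster_labels))   # K predicted clusters
    classes = sorted(set(class_labels))      # S gold-standard classes
    matrix = np.zeros((len(clusters), len(classes)), dtype=int)
    for cluster, cls in zip(cluster_labels, class_labels):
        matrix[clusters.index(cluster), classes.index(cls)] += 1
    return matrix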

Using this matrix, we can calculate the following evaluation criteria.

  1. Precision
  2. Recall
  3. F1-score
  4. Adjusted Rand Index

Precision

For each cluster, we find the gold-standard class with the maximum number of assigned objects. We then sum these per-cluster maxima and divide by the total number of clustered objects. The resulting value is the precision, and using the K x S matrix it is calculated as shown in Figure 2.

Fig. 2. Equation of Precision (Image by Author)
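
In symbols, this is the sum of the row maxima of the K x S matrix divided by the sum of all of its entries. A minimal sketch of the calculation, assuming the matrix is stored as a numpy array as in the sketch above:

import numpy as np

def precision(matrix):
    """Sum of the per-cluster (row) maxima divided by the total number
    of clustered objects."""
    matrix = np.asarray(matrix)
    return matrix.max(axis=1).sum() / matrix.sum()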

Recall

For each gold-standard class, we find the cluster with the maximum number of assigned objects. We then sum these per-class maxima and divide by the total number of clustered and unclustered objects. The resulting value is the recall (also known as sensitivity), and using the K x S matrix it is calculated as shown in Figure 3.

Fig. 3. Equation of Recall (Image by Author)

Here U denotes the number of unclustered objects.
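
In symbols, recall is the sum of the column maxima of the K x S matrix divided by the number of clustered objects plus U. A corresponding sketch, again assuming a numpy array, with the unclustered count passed in separately:

import numpy as np

def recall(matrix, unclustered=0):
    """Sum of the per-class (column) maxima divided by the total number
    of clustered plus unclustered objects."""
    matrix = np.asarray(matrix)
    return matrix.max(axis=0).sum() / (matrix.sum() + unclustered)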

F1-score

The F1-score is the harmonic mean of precision and recall and is calculated as shown in Figure 4.

Fig. 4. Equation of F1-score (Image by Author)
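
In code this is a one-liner:

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)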

Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) is the corrected-for-chance version of the Rand Index (RI). It is a measure of how similar the clustering result is to its gold-standard grouping. Using the K x S matrix, the ARI is calculated as shown in Figure 5.

Fig. 5. Equation of ARI (Image by Author)

Here N denotes the total number of clustered objects, and C(N, 2) is the binomial coefficient, calculated as N(N-1)/2.
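
Below is a sketch of this calculation using the standard contingency-table form of the ARI (math.comb provides the binomial coefficients); it should match the formula in Figure 5.

import numpy as np
from math import comb

def adjusted_rand_index(matrix):
    """Adjusted Rand Index computed from the K x S contingency matrix,
    i.e. the Rand Index corrected for chance."""
    matrix = np.asarray(matrix)
    index = sum(comb(int(n), 2) for n in matrix.ravel())
    sum_rows = sum(comb(int(n), 2) for n in matrix.sum(axis=1))  # per-cluster totals
    sum_cols = sum(comb(int(n), 2) for n in matrix.sum(axis=0))  # per-class totals
    expected = sum_rows * sum_cols / comb(int(matrix.sum()), 2)  # expected index by chance
    max_index = (sum_rows + sum_cols) / 2
    return (index - expected) / (max_index - expected)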

Example Walkthrough

Let us see an example where we calculate the different evaluation criteria for a given clustering result. Assume that there are 257 objects in the gold standard, and that we know the class of each object.

Case 1: K=S

Let us assume an ideal case where we have 5 classes in the gold standard and 5 clusters are predicted by the clustering method. We can get a K x S matrix as shown below.

64, 4, 1, 5, 0
2, 37, 1, 6, 0
0, 0, 44, 0, 0
1, 1, 0, 72, 0
0, 0, 0, 0, 9

If we relate to Figure 1, rows represent clusters and columns represent classes. The number of objects assigned to the first cluster that belong to the first class in the gold standard (the intersection of the first row and the first column) is 64. Similarly, the number of objects assigned to the first cluster that belong to the second class in the gold standard (the intersection of the first row and the second column) is 4.

Note that the maximum values of each row and each column (64, 37, 44, 72 and 9) lie on the diagonal, which tells us that the first cluster corresponds to the first class in the gold standard, the second cluster corresponds to the second class, and so on.

The total number of objects clustered will be 247.

Total number of objects clustered 
=64+4+1+5+2+37+1+6+44+1+1+72+9
=247

The number of unclustered objects will be 10.

Total number of unclustered objects
=257-247
=10

Now, let us calculate the precision, recall and F1-score using this matrix.

Precision
=(64+37+44+72+9)/(64+4+1+5+2+37+1+6+44+1+1+72+9)
=91.50%
Recall
=(64+37+44+72+9)/(64+4+1+5+2+37+1+6+44+1+1+72+9+10)
=87.94%
F1-score
=2*(91.50*87.94)/(91.50+87.94)
=89.68%
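
Using the precision, recall and f1_score functions sketched in the previous sections, these values can be reproduced directly from the matrix:

import numpy as np

case1 = np.array([
    [64,  4,  1,  5,  0],
    [ 2, 37,  1,  6,  0],
    [ 0,  0, 44,  0,  0],
    [ 1,  1,  0, 72,  0],
    [ 0,  0,  0,  0,  9],
])

p = precision(case1)                # 226 / 247, about 91.50%
r = recall(case1, unclustered=10)   # 226 / 257, about 87.94%
print(f"P={p:.2%}, R={r:.2%}, F1={f1_score(p, r):.2%}")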

Case 2: K<S

Let us assume a case where we have 5 classes in the gold standard but only 4 clusters are predicted by the clustering method. Compared to Case 1, two clusters have been merged into one (the third row now contains both 44 and 9).

64, 4, 1, 5, 0
2, 37, 1, 6, 0
0, 0, 44, 0, 9
1, 1, 0, 72, 0
Total number clustered = 64+4+1+5+2+37+1+6+44+9+1+1+72 = 247
Total number unclustered = 257-247 = 10

Now, let us calculate the precision, recall and F1-score using this matrix.

Precision
=(64+37+44+72)/(64+4+1+5+2+37+1+6+44+9+1+1+72)
=87.85%
Recall
=(64+37+44+72+9)/(64+4+1+5+2+37+1+6+44+9+1+1+72+10)
=87.94%
F1-score
=2*(87.85*87.94)/(87.85+87.94)
=87.89%

Case 3: K>S

Let us assume a case where we have 5 classes in the gold standard and 6 clusters are predicted by the clustering method. Compared to Case 1, one cluster has been split into two (the third and fourth rows with 24 and 20).

64, 4, 1, 5, 0
2, 37, 1, 6, 0
0, 0, 24, 0, 0
0, 0, 20, 0, 0
1, 1, 0, 72, 0
0, 0, 0, 0, 9
Total number clustered = 64+4+1+5+2+37+1+6+24+20+1+1+72+9 = 247
Total number unclustered = 257-247 = 10

Now, let us calculate the precision, recall and F1-score using this matrix.

Precision
=(64+37+24+20+72+9)/(64+4+1+5+2+37+1+6+24+20+1+1+72+9)
=91.50%
Recall
=(64+37+24+72+9)/(64+4+1+5+2+37+1+6+24+20+1+1+72+9+10)
=80.16%
F1-score
=2*(91.50*80.16)/(91.50+80.16)
=85.46%
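
The Case 2 and Case 3 matrices can be checked the same way with the earlier sketches (small differences in the F1-score can appear because the percentages above were rounded before being combined):

import numpy as np

case2 = np.array([                  # K < S: two clusters merged
    [64,  4,  1,  5,  0],
    [ 2, 37,  1,  6,  0],
    [ 0,  0, 44,  0,  9],
    [ 1,  1,  0, 72,  0],
])
case3 = np.array([                  # K > S: one cluster split in two
    [64,  4,  1,  5,  0],
    [ 2, 37,  1,  6,  0],
    [ 0,  0, 24,  0,  0],
    [ 0,  0, 20,  0,  0],
    [ 1,  1,  0, 72,  0],
    [ 0,  0,  0,  0,  9],
])
for name, m in [("Case 2", case2), ("Case 3", case3)]:
    p, r = precision(m), recall(m, unclustered=10)
    print(f"{name}: P={p:.2%}, R={r:.2%}, F1={f1_score(p, r):.2%}")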

Analysis

Let us go through the values we have obtained for each of the example cases as shown in Figure 6.

Fig. 6. Precision, recall and F1-score values (Image by Author)

We can see that if the clustering method under-estimates the number of clusters (case K<S), i.e. merges clusters so that a single cluster contains multiple gold-standard classes, then the precision is reduced while the recall remains the same as in the ideal case. This is because, for the merged cluster, only the maximum value in its row (44) is counted.

Similarly, if the clustering method over-estimates the number of clusters (case K>S), i.e. splits clusters so that a gold-standard class is spread across multiple clusters, then the recall is reduced while the precision remains the same as in the ideal case. This is because, for the split clusters, only the maximum value in the column (24) is counted.

In both the K<S and K>S cases, the F1-score decreases because either precision or recall is reduced.

Overall, we can see that precision tends to increase and recall tends to decrease as the number of clusters predicted by the clustering approach grows.

Ending Thoughts

I hope you found this article useful for your studies or research work.

I have attached below a simple script where you can input the clustering result and the gold standard and obtain the precision, recall, F1-score and ARI values. This script takes in two .csv files: one for the clustering result and the other for the gold standard.

You can run this script using the command,

python evaluate.py --clustered clustering_result.csv --goldstandard gold_standard.csv

Here, clustering_result.csv and gold_standard.csv should be in the format object_id,cluster_id. The precision, recall, F1-score and ARI values will be printed.
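
For reference, here is a minimal sketch of what such an evaluate.py could look like, assuming header-less object_id,cluster_id rows in both files. This is an illustrative reconstruction, not the author's original script.

# evaluate.py -- sketch of a clustering-evaluation script (illustrative,
# not the author's original code). Inputs: two header-less CSV files with
# object_id,cluster_id rows, one for the clustering result and one for
# the gold standard.
import argparse
import csv
from math import comb

import numpy as np


def read_labels(path):
    """Return a dict mapping object_id -> cluster/class id."""
    with open(path, newline="") as f:
        return {row[0]: row[1] for row in csv.reader(f) if row}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--clustered", required=True)
    parser.add_argument("--goldstandard", required=True)
    args = parser.parse_args()

    clustered = read_labels(args.clustered)
    gold = read_labels(args.goldstandard)

    # Objects in the gold standard that are missing from the clustering
    # result are treated as unclustered.
    unclustered = len(set(gold) - set(clustered))

    shared = [o for o in clustered if o in gold]
    clusters = sorted({clustered[o] for o in shared})
    classes = sorted({gold[o] for o in shared})

    # Build the K x S contingency matrix.
    matrix = np.zeros((len(clusters), len(classes)), dtype=int)
    for o in shared:
        matrix[clusters.index(clustered[o]), classes.index(gold[o])] += 1

    precision = matrix.max(axis=1).sum() / matrix.sum()
    recall = matrix.max(axis=0).sum() / (matrix.sum() + unclustered)
    f1 = 2 * precision * recall / (precision + recall)

    # Adjusted Rand Index from the contingency matrix.
    index = sum(comb(int(n), 2) for n in matrix.ravel())
    sum_rows = sum(comb(int(n), 2) for n in matrix.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in matrix.sum(axis=0))
    expected = sum_rows * sum_cols / comb(int(matrix.sum()), 2)
    max_index = (sum_rows + sum_cols) / 2
    ari = (index - expected) / (max_index - expected)

    print(f"Precision: {precision:.4f}")
    print(f"Recall:    {recall:.4f}")
    print(f"F1-score:  {f1:.4f}")
    print(f"ARI:       {ari:.4f}")


if __name__ == "__main__":
    main()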

Thank you for reading!

Cheers!

