
In Supervised Learning, the labels are known, so evaluation can measure the degree of correctness by comparing the predicted values against those labels. In Unsupervised Learning, however, the labels are not known, which makes correctness hard to evaluate as there is no ground truth.
That said, a good clustering algorithm consistently produces clusters with small within-cluster variance (data points in a cluster are similar to each other) and large between-cluster variance (clusters are dissimilar to other clusters).
There are two types of evaluation metrics for clustering:
- Extrinsic Measures: These measures require ground truth labels, which may not be available in practice
- Intrinsic Measures: These measures do not require ground truth labels (applicable to all unsupervised learning results)
This article will discuss the various evaluation metrics for clustering algorithms, focusing on their definition, intuition, when to use them, and how to implement them with the sklearn library. Formulas for all the algorithms can be found in the Appendix section of the article.
Note: I checked all the algorithms and formulas by hand; do reach out if you need the calculations! Otherwise, for each algorithm, the variables and formula are explained in words and equations for better understanding – more in the Appendix 🙂
Table of Contents
Extrinsic Measures
Intrinsic Measures
Appendix: Formulas for measures

Extrinsic Measures require ground truth labels, which may not be available or require manual labeling by humans.
№1. Rand Index (RI, ARI)
Rand Index (RI, ARI) measures the similarity between the cluster assignments by making pair-wise comparisons. A higher score signifies higher similarity.
For each pair of points, the assignment is considered correct if the pair is predicted to be in the same cluster when it truly is in the same cluster (somewhat like a "true positive"), and correct if the pair is predicted to be in different clusters when it truly is in different clusters (somewhat like a "true negative").

However, Rand Index does not consider chance; if the cluster assignment was random, there can be many cases of "true negative" by fluke. Ideally, we want random (uniform) label assignments to have scores close to 0, and this requires adjusting for chance.
Adjusted Rand Index (ARI) adjusts for chance by discounting a chance normalization term. The formula for ARI can be found in this article’s Appendix (Fig 2) to avoid visual clutter.
When to use Rand Index
- You want interpretability: RI is intuitive and easy to understand.
- You are unsure about cluster structure: RI and ARI do not make assumptions about the cluster structure and can be applied to all clustering algorithms.
- You want a basis for comparison: RI is bounded between the [0, 1] range, and ARI is bounded between the [-1, 1] range. The bounded range makes it easy to compare the scores between different algorithms.
When to NOT use Rand Index
- You do not have the ground truth labels: RI and ARI are extrinsic measures and require ground truth cluster assignments.
Implementing Rand Index
from sklearn.metrics import rand_score, adjusted_rand_score
labels = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 2, 2, 3, 3]
RI = rand_score(labels, labels_pred)
ARI = adjusted_rand_score(labels, labels_pred)
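To see why adjusting for chance matters, we can score a purely random assignment: RI stays deceptively high because of the many "true negative" pairs, while ARI drops to around 0. A minimal sketch (the cluster sizes and seed below are arbitrary choices for illustration):
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

rng = np.random.default_rng(0)  # arbitrary seed
labels_true = np.repeat([0, 1, 2, 3], 25)   # 100 points in 4 true clusters
labels_rand = rng.integers(0, 4, size=100)  # uniformly random assignment

print(rand_score(labels_true, labels_rand))           # typically well above 0.5
print(adjusted_rand_score(labels_true, labels_rand))  # close to 0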
№2. Mutual Information (MI, NMI, AMI)
Mutual Information (MI, NMI, AMI) measures the agreement between the cluster assignments. A higher score signifies higher agreement.
The degree of agreement between clusters is computed by joint and marginal probabilities. There are two variations to Mutual Information: Normalized Mutual Information (NMI) and Adjusted Mutual Information (AMI).
Normalized MI is MI divided by the average of the entropies of the two assignments and is often used in the literature, while Adjusted MI is Normalized MI adjusted for chance by discounting a chance normalization term. The formulas for MI, NMI, and AMI can be found in this article’s Appendix (Fig 3 and 4).
When to use Mutual Information
- You want a basis for comparison: NMI and AMI have an upper bound of 1 (raw MI is bounded by the label entropies rather than by 1, so it is harder to compare directly).
When to NOT use Mutual Information
- You do not have the ground truth labels: MI, NMI, and AMI are extrinsic measures and require ground truth cluster assignments.
Implementing Mutual Information
from sklearn.metrics import (
mutual_info_score,
normalized_mutual_info_score,
adjusted_mutual_info_score,
)
labels = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 2, 2, 3, 3]
MI = mutual_info_score(labels, labels_pred)
NMI = normalized_mutual_info_score(labels, labels_pred)
AMI = adjusted_mutual_info_score(labels, labels_pred)
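To mirror the joint/marginal definition above, MI can also be computed directly from the contingency table of the two assignments. The sketch below (reusing labels and labels_pred from above) should reproduce mutual_info_score, which uses natural logarithms:
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

p_joint = contingency_matrix(labels, labels_pred) / len(labels)  # joint p(i, j)
p_true = p_joint.sum(axis=1, keepdims=True)   # marginal p(i)
p_pred = p_joint.sum(axis=0, keepdims=True)   # marginal p'(j)

nonzero = p_joint > 0
MI_manual = np.sum(
    p_joint[nonzero] * np.log(p_joint[nonzero] / (p_true @ p_pred)[nonzero])
)
print(MI_manual)  # matches mutual_info_score(labels, labels_pred)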
№3. V-measure
V-measure measures the correctness of the cluster assignments using conditional entropy analysis. A higher score signifies higher correctness.
Two metrics measure the correctness of cluster assignments; they are intuitive because they mirror concepts from supervised learning.
- Homogeneity: Each cluster contains only members of a single class (somewhat like "precision")
- Completeness: All members of a given class are assigned to the same cluster (somewhat like "recall")
V-measure is the harmonic mean of the homogeneity and completeness measures, similar to how the F-score is a harmonic mean of precision and recall. The formulas for homogeneity, completeness, and V-measure can be found in this article’s Appendix (Fig 5).
When to use V-measure
- You want interpretability: V-measure is intuitive and easy to understand in terms of homogeneity and completeness.
- You are unsure about cluster structure: V-measure does not make assumptions about the cluster structure and can be applied to all clustering algorithms.
- You want a basis for comparison: Homogeneity, completeness, and V-measure are bounded between the [0, 1] range. The bounded range makes it easy to compare the scores between different algorithms.
When to NOT use V-measure
- You do not have the ground truth labels: Homogeneity, completeness, and V-measure are extrinsic measures and require ground truth cluster assignments.
- Your sample size is less than 1000 and the number of clusters is more than 10: V-measure does not adjust for chance. This means that random labeling would not yield zero scores, especially when the number of clusters is large.
Implementing V-measure
from sklearn.metrics import (
homogeneity_score,
completeness_score,
v_measure_score,
)
labels = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 2, 2, 3, 3]
HS = homogeneity_score(labels, labels_pred)
CS = completeness_score(labels, labels_pred)
V = v_measure_score(labels, labels_pred, beta=1.0)
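sklearn also provides homogeneity_completeness_v_measure, which returns all three scores in one call; with the default beta of 1, V-measure is simply the harmonic mean of homogeneity and completeness. A quick check (reusing labels and labels_pred from above):
from sklearn.metrics import homogeneity_completeness_v_measure

h, c, v = homogeneity_completeness_v_measure(labels, labels_pred)
print(v, 2 * h * c / (h + c))  # the two values should agree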
№4. Fowlkes-Mallows Scores
Fowlkes-Mallows Scores measure the correctness of the cluster assignments using pairwise precision and recall. A higher score signifies higher correctness.
While V-measure is a harmonic mean between homogeneity ("precision") and completeness ("recall"), Fowlkes-Mallows Index (FMI) is the geometric mean of pairwise precision and recall, using True Positive (TP), False Positive (FP), and False Negative (FN).
Because the Fowlkes-Mallows Score does not take True Negatives (TN) into account, it is not inflated by chance agreement, so there is no need for chance adjustments, unlike Rand Index and Mutual Information.
The definitions of TP, FP, and FN and the formula for the Fowlkes-Mallows Index (FMI) are found in this article’s Appendix (Fig 6 and 7).
When to use Fowlkes-Mallows Scores
- You are unsure about cluster structure: Fowlkes-Mallows Score does not make assumptions about the cluster structure and can be applied to all clustering algorithms.
- You want a basis for comparison: Fowlkes-Mallows Score has an upper bound of 1. The bounded range makes it easy to compare the scores between different algorithms.
When to NOT use Fowlkes-Mallows Scores
- You do not have the ground truth labels: Fowlkes-Mallows Scores are extrinsic measures and require ground truth cluster assignments.
Implementing Fowlkes-Mallows Scores
from sklearn.metrics import fowlkes_mallows_score
labels = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 2, 2, 3, 3]
FMI = fowlkes_mallows_score(labels, labels_pred)
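The same score can be reproduced from the pairwise counts directly: here TP counts pairs placed in the same cluster in both assignments, FP counts pairs grouped together only in the prediction, and FN counts pairs grouped together only in the ground truth. A minimal sketch (reusing labels and labels_pred from above):
from itertools import combinations
from math import sqrt

TP = FP = FN = 0
for i, j in combinations(range(len(labels)), 2):
    same_true = labels[i] == labels[j]
    same_pred = labels_pred[i] == labels_pred[j]
    TP += same_true and same_pred
    FP += same_pred and not same_true
    FN += same_true and not same_pred

print(TP / sqrt((TP + FP) * (TP + FN)))  # matches fowlkes_mallows_score(labels, labels_pred)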

Intrinsic Measures do not require ground truth labels, making them applicable to all clustering results.
№5. Silhouette Coefficient
Silhouette Coefficient measures the between-cluster distance against within-cluster distance. A higher score signifies better-defined clusters.
The Silhouette Coefficient of a sample compares the average distance from the sample to all points in the next nearest cluster against the average distance to all other points in its own cluster. A higher ratio signifies that the sample's cluster is far away from its nearest cluster and is more well-defined.
The Silhouette Coefficient for a set of samples is the average of the Silhouette Coefficients of the individual samples. The formula is found in this article’s Appendix (Fig 8).
When to use Silhouette Coefficient
- You want interpretability: The Silhouette Coefficient is intuitive and easy to understand.
- You want a basis for comparison: Silhouette Coefficient has a range of [-1, 1], from incorrect clustering to highly dense clustering, with 0 indicating overlapping clusters. The bounded range makes it easy to compare the scores between different algorithms.
- You define good clusters as well-defined clusters: Silhouette Coefficient follows the general definition of good clusters being dense and well-separated.
When to NOT use Silhouette Coefficient
- You are comparing different types of clustering algorithms: Silhouette Coefficient scores tend to be higher for convex clusters (i.e., any line that connects any two points in a convex cluster is contained within the cluster), and would be unfair to compare against other types of clustering algorithms.
Implementing Silhouette Coefficient
from sklearn.metrics import silhouette_score
data = [
[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
]
clusters = [1, 1, 2, 2, 3, 3]
s = silhouette_score(data, clusters, metric="euclidean")
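For a per-sample view, silhouette_samples returns the coefficient of every point, and silhouette_score is simply their mean; this helps spot individual points sitting between clusters. A short sketch (reusing data and clusters from above):
import numpy as np
from sklearn.metrics import silhouette_samples

per_sample = silhouette_samples(data, clusters, metric="euclidean")
print(per_sample)           # one value in [-1, 1] per point
print(np.mean(per_sample))  # equals silhouette_score(data, clusters)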
№6. Calinski-Harabasz Index
Calinski-Harabasz Index measures the between-cluster dispersion against within-cluster dispersion. A higher score signifies better-defined clusters.
The Calinski-Harabasz Index, or Variance Ratio Criterion, measures the sum of between-cluster dispersion against the sum of within-cluster dispersion, where dispersion is the sum of squared distances.
A higher ratio signifies that clusters are far away from one another and more well-defined. The formula is found in this article’s Appendix (Fig 9).
When to use Calinski-Harabasz Index
- You want efficiency: The Calinski-Harabasz Index is fast to compute.
- You define good clusters as well-defined clusters: Calinski-Harabasz Index follows the general definition of good clusters being dense and well-separated.
When to NOT use Calinski-Harabasz Index
- You are comparing different types of clustering algorithms: Calinski-Harabasz Index tends to be higher for convex clusters than for other cluster concepts (such as density-based clusters), so it would be unfair to compare it against other types of clustering algorithms.
Implementing Calinski-Harabasz Index
from sklearn.metrics import calinski_harabasz_score
data = [
[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
]
clusters = [1, 1, 2, 2, 3, 3]
s = calinski_harabasz_score(data, clusters)
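The ratio can also be computed by hand from the between-cluster and within-cluster dispersions (sums of squared distances to the centroids), scaled by their degrees of freedom. A minimal numpy sketch (reusing data and clusters from above):
import numpy as np

X, y = np.array(data), np.array(clusters)
n, k = len(X), len(np.unique(y))
overall_mean = X.mean(axis=0)

between = within = 0.0
for label in np.unique(y):
    members = X[y == label]
    centroid = members.mean(axis=0)
    between += len(members) * np.sum((centroid - overall_mean) ** 2)  # between-cluster dispersion
    within += np.sum((members - centroid) ** 2)                       # within-cluster dispersion

print((between / (k - 1)) / (within / (n - k)))  # matches calinski_harabasz_score(data, clusters)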
№7. Davies-Bouldin Index
Davies-Bouldin Index measures the size of clusters against the average distance between clusters. A lower score signifies better-defined clusters.
The Davies-Bouldin Index measures the average similarity between clusters, where similarity compares the size of clusters against the between-cluster distance.
A lower score means that the cluster is relatively small compared to the distance to another cluster, hence well-defined. The formula is found in this article’s Appendix (Fig 10).
When to use Davies-Bouldin Index
- You want simplicity: The Davies-Bouldin Index is simpler to compute than Silhouette scores, and it only uses point-wise distances.
When to NOT use Davies-Bouldin Index
- You are comparing different types of clustering algorithms: Davies-Bouldin Index tends to be higher for convex clusters than for other cluster concepts (such as density-based clusters), so it would be unfair to compare it against other types of clustering algorithms.
- You want other distance measures besides Euclidean distance: The size of clusters is computed from centroid distances, which limits the distance metric to Euclidean space.
Implementing Davies-Bouldin Index
from sklearn.metrics import davies_bouldin_score
data = [
[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
]
clusters = [1, 1, 2, 2, 3, 3]
DB = davies_bouldin_score(data, clusters)
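To make the definition concrete, the index can also be computed by hand: each cluster's "size" is the average distance of its points to its centroid, each pair of clusters is scored by the ratio of summed sizes to centroid distance, and the index averages the worst ratio per cluster. A minimal numpy sketch (reusing data and clusters from above):
import numpy as np

X, y = np.array(data), np.array(clusters)
uniq = np.unique(y)
centroids = np.array([X[y == label].mean(axis=0) for label in uniq])

# s_i: average distance of points in cluster i to its own centroid
sizes = np.array([
    np.linalg.norm(X[y == label] - centroids[i], axis=1).mean()
    for i, label in enumerate(uniq)
])

DB_manual = 0.0
for i in range(len(uniq)):
    ratios = [
        (sizes[i] + sizes[j]) / np.linalg.norm(centroids[i] - centroids[j])
        for j in range(len(uniq)) if j != i
    ]
    DB_manual += max(ratios)

print(DB_manual / len(uniq))  # matches davies_bouldin_score(data, clusters)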
Conclusion
The characteristics of each algorithm are summarized below:

I hope you have understood more about the different ways to evaluate a clustering algorithm – using intrinsic or extrinsic measures depending on whether you have the ground-truth labels. In practice, we may care more about whether the clusters make business sense than about statistical measures of within-cluster or between-cluster distance. Nonetheless, these evaluation metrics are still good to know!
Appendix
The formula for Rand Index (ARI)
Chance normalization term considers the number of pairs occurring in the same cluster in actual cluster assignment and predicted cluster assignment.
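With $a$ the number of pairs of points placed in the same cluster in both the actual and predicted assignments, $b$ the number of pairs placed in different clusters in both, and $n$ the number of samples, the formulation used by scikit-learn is:
\[ \text{RI} = \frac{a + b}{\binom{n}{2}}, \qquad \text{ARI} = \frac{\text{RI} - \mathbb{E}[\text{RI}]}{\max(\text{RI}) - \mathbb{E}[\text{RI}]} \]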

The formula for Mutual Information (MI, NMI, AMI)
The formulas for joint and marginal probabilities and entropies form the basis for calculating Mutual Information.
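With $P(i, j)$ the probability that a point falls in class $U_i$ of the actual assignment and cluster $V_j$ of the predicted assignment, $P(i)$ and $P'(j)$ the marginals, and $H(\cdot)$ the entropy of an assignment, the formulation used by scikit-learn is:
\[ \text{MI}(U, V) = \sum_{i}\sum_{j} P(i, j)\,\log\frac{P(i, j)}{P(i)\,P'(j)} \]
\[ \text{NMI}(U, V) = \frac{\text{MI}(U, V)}{\operatorname{mean}\big(H(U), H(V)\big)}, \qquad \text{AMI}(U, V) = \frac{\text{MI}(U, V) - \mathbb{E}[\text{MI}]}{\operatorname{mean}\big(H(U), H(V)\big) - \mathbb{E}[\text{MI}]} \]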


The formula for V-measure
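With $H(C \mid K)$ the conditional entropy of the classes given the cluster assignments (and $H(K \mid C)$ the reverse), the formulation used by scikit-learn is:
\[ h = 1 - \frac{H(C \mid K)}{H(C)}, \qquad c = 1 - \frac{H(K \mid C)}{H(K)}, \qquad v = \frac{(1 + \beta)\,h\,c}{\beta\,h + c} \]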

The formula for Fowlkes-Mallows Scores
Fowlkes-Mallows Index (FMI) is calculated using True Positives, False Positives, and False Negatives. TP, FP, and FN are defined by counting pairs of points according to whether they are allocated to the same or different clusters in the predicted and actual label assignments.
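Using these pairwise counts, the formula is:
\[ \text{FMI} = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}} \]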


The formula for Silhouette Coefficient
Note that for the calculation of b, it considers the next nearest cluster to the sample itself and not the next nearest cluster to the assigned cluster.
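With $a$ the mean distance between a sample and all other points in its own cluster, and $b$ the mean distance between the sample and all points in the next nearest cluster, the per-sample coefficient is:
\[ s = \frac{b - a}{\max(a, b)} \]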

Calinski-Harabasz Index
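For $n$ samples and $k$ clusters, with $\mathrm{tr}(B_k)$ the between-cluster dispersion and $\mathrm{tr}(W_k)$ the within-cluster dispersion, the formulation used by scikit-learn is:
\[ s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{n - k}{k - 1} \]
\[ \mathrm{tr}(W_k) = \sum_{q=1}^{k} \sum_{x \in C_q} \lVert x - c_q \rVert^2, \qquad \mathrm{tr}(B_k) = \sum_{q=1}^{k} n_q \lVert c_q - c \rVert^2 \]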

Davies-Bouldin Index
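With $s_i$ the average distance of the points in cluster $i$ to its centroid, and $d_{ij}$ the distance between the centroids of clusters $i$ and $j$, the formula is:
\[ R_{ij} = \frac{s_i + s_j}{d_{ij}}, \qquad \text{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} R_{ij} \]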

Related Links
sklearn Documentation: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation