
Clustering is a branch of Unsupervised Learning in which unlabelled data is divided into groups: similar data instances are assigned to the same cluster, while dissimilar data instances are assigned to different clusters. Clustering has various uses in market segmentation, outlier detection, and network analysis, to name a few.
There are different types of clustering methods, each with its advantages and disadvantages. This article introduces the different types of clustering methods with algorithm examples, and when to use each algorithm.
Table of Contents
- Centroid-based / Partitioning (K-means)
- Connectivity-based (Hierarchical Clustering)
- Density-based (DBSCAN)
- Graph-based (Affinity Propagation)
- Distribution-based (Gaussian Mixture Model)
- Compression-based (Spectral Clustering, BIRCH)
Centroid-based / Partitioning
Centroid-based methods group data points together based on the proximity of data points to the centroid (cluster center)

The proximity between data points and the centroid is measured by:
- Euclidean distance: Shortest (straight-line) distance, useful for numerical features; this is the default distance measure
- Manhattan distance: Sum of absolute differences, useful for categorical features
- Hamming distance: Percentage of bits that differ, useful for binary features
- Mahalanobis distance: Standardised form of Euclidean distance
- Minkowski distance: Generalized form of Euclidean and Manhattan distance
- Chebyshev distance: Maximum absolute difference
Regardless of the distance measure used, numerical features should always be standardized before clustering!
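As a quick illustration of these measures, here is a minimal sketch using SciPy's `scipy.spatial.distance` module; the example vectors are made up purely for demonstration.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(a, b))           # shortest straight-line distance
print(distance.cityblock(a, b))           # Manhattan: sum of absolute differences
print(distance.chebyshev(a, b))           # maximum absolute difference
print(distance.minkowski(a, b, p=3))      # generalizes Euclidean (p=2) and Manhattan (p=1)
print(distance.hamming([1, 0, 1], [1, 1, 1]))  # fraction of differing bits
```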
K-means
K-means involves an initialization step followed by iterative Expectation Maximization (EM) steps, repeated until convergence or until the maximum number of iterations is reached.
During initialization, k centroids are assigned using the k-means++ optimization algorithm. In the Expectation (E) step, data points are assigned to their closest centroid (by Euclidean distance). In the Maximization (M) step, centroids are updated to minimize the inertia, or within-cluster sum-of-squares (WCSS). The EM steps are repeated until convergence to a local minimum, where the cluster assignments and centroids no longer change.
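For illustration, here is a minimal sketch using scikit-learn's KMeans on synthetic blob data; the data set and parameter values are assumptions chosen for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data with 3 well-separated globular clusters
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)  # standardize numerical features first

# k-means++ initialization is the scikit-learn default
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroids
print(kmeans.inertia_)          # within-cluster sum-of-squares (WCSS)
```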
When to use K-means
- You want interpretability: K-means is easy to understand and interpret.
- Clusters are even-sized and globular-shaped: K-means works well when clusters are well-separated and globular, but does not perform well if clusters are long and irregularly shaped.
When to NOT use K-means
- You are unsure about the number of clusters: K-means requires the number of clusters to be predefined. Usually, the Elbow method, which plots WCSS against the number of clusters, is used to determine the optimal number of clusters (see the sketch after this list).
- You have outliers in the data: All data points are assigned to a cluster, so the presence of outliers can skew the results; outliers have to be removed or transformed.
- You want computation efficiency: The computation cost of K-means increases with the size of the data, as it runs in O(tkn) time, where t is the number of iterations, k is the number of clusters, and n is the number of data points. Using dimensionality reduction algorithms such as PCA can speed up computation.
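A minimal sketch of the Elbow method mentioned above, assuming synthetic blob data and matplotlib for the plot:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Fit K-means for a range of cluster counts and record the WCSS (inertia)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in range(1, 11)]

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS / inertia")
plt.title("Elbow method")
plt.show()
```

The "elbow" of the curve, where the decrease in WCSS starts to flatten out, suggests a reasonable number of clusters.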
Other centroid-based algorithms: K-Medoids, Mean Shift, Fuzzy Clustering (soft K-means; data points can be part of multiple clusters)
Edit: There are also Variance-Ratio Criterion (VRC) and Bayesian Information Criterion (BIC) as more robust alternatives to the Elbow method (Credits to Victor for this information – link to paper).
Connectivity-based
Connectivity-based methods group data points together based on the proximity between the clusters

The linkage criterion determines how the proximity between clusters is calculated:
- Single linkage: Distance between closest points of clusters; minimum distance between clusters. Good for detecting arbitrarily shaped clusters but cannot detect overlapping clusters. It is efficient to compute but is not robust to noisy data
- Complete / Maximum linkage: Distance between furthest points of clusters; maximum distance between clusters. Good for detecting overlapping clusters but cannot detect arbitrarily shaped clusters
- Average linkage: Average of all distances across two clusters
- Centroid linkage: Distance between centers of two clusters
- Ward linkage: Sum of squared distances from each data point to the centroid of the cluster it is assigned to. This results in merging the pair of clusters that gives the smallest increase in total within-cluster variance. Ward linkage is the default linkage criterion
Hierarchical Clustering
Agglomerative hierarchical clustering takes an iterative bottom-up approach: each data point starts as an individual cluster, and the two closest clusters (by the linkage criterion) are merged repeatedly until one large cluster is left.
Divisive hierarchical clustering does the opposite and takes an iterative top-down approach: it starts from one large cluster that is repeatedly split into smaller clusters until every data point is an individual cluster.
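A minimal sketch of agglomerative clustering with Ward linkage, plus a dendrogram of the merge hierarchy, assuming scikit-learn, SciPy, and synthetic blob data:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# Bottom-up (agglomerative) clustering with Ward linkage, the default criterion
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

# The same merge hierarchy visualized as a dendrogram via SciPy
Z = linkage(X, method="ward")
dendrogram(Z, truncate_mode="level", p=4)
plt.show()
```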
When to use Hierarchical Clustering
- You are unsure about the number of clusters: Although the number of clusters can be specified upfront, the clusters can be viewed on a dendrogram and the number of clusters can be tweaked without recomputation.
- You want computation efficiency: Cluster labels at any intermediate stage can be recovered, so it is not necessary to recompute the proximity matrix if the number of clusters changes.
- You have high dimensional data: The output can be visualized using a dendrogram, which remains usable even with higher-dimensional data.
Density-based
Density-based methods group data points together based on density instead of distance

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
The DBSCAN algorithm clusters data points that lie in dense regions together, separated by areas of low density. Samples in denser areas, with at least min_samples data points (including the sample itself) within a distance of eps, are called core samples.
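A minimal sketch using scikit-learn's DBSCAN on the two-moons toy data set; the eps and min_samples values are assumptions chosen for this data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: arbitrarily shaped clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

# Points labelled -1 fall in no dense region and are treated as outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Outliers:", int(np.sum(labels == -1)))
```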
When to use DBSCAN
- You are unsure about the number of clusters: DBSCAN does not require the number of clusters to be predefined.
- You want arbitrarily-shaped clusters: Clusters are determined by dense regions, hence cluster shapes can be odd-shaped or complex.
- You want to detect outliers: Data points that are not assigned to any cluster as they do not fall in any dense region will be considered outliers.
When to NOT use DBSCAN
- When you want stable performance: DBSCAN output is highly sensitive to the eps parameter, which must be chosen appropriately for the data set.
Other density-based algorithms: Ordering Points To Identify the Clustering Structure (OPTICS), Hierarchical DBSCAN (HDBSCAN)
Graph-based
Graph-based methods group data points together based on graph distance

Affinity Propagation
Affinity propagation works by pair-wise sending of messages between data points until convergence. Exemplars, which are points that best represent the surrounding data points, are chosen, and each point is assigned to the cluster of its nearest exemplar.
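A minimal sketch using scikit-learn's AffinityPropagation on synthetic blob data; the damping value is an assumption for the example.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# The number of clusters is not specified; it emerges from the message passing
ap = AffinityPropagation(damping=0.9, random_state=42)
labels = ap.fit_predict(X)

# Exemplars are the data points elected to represent each cluster
print("Exemplar indices:", ap.cluster_centers_indices_)
print("Clusters found:", len(ap.cluster_centers_indices_))
```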
When to use Affinity Propagation
- You are unsure about the number of clusters: Affinity Propagation does not require the number of clusters to be predefined.
When to NOT use Affinity Propagation
- You want computation efficiency: Affinity propagation runs in O(tn²) time, where t is the number of iterations and n is the number of data points, as messages are sent pair-wise between every data point.
- You want space efficiency: If a dense similarity matrix is used, affinity propagation takes up O(n²) memory, as messages sent between every data point are considered a ‘vote’ cast to elect the exemplar.
Distribution-based
Distribution-based methods group data points together based on their likelihood of belonging to the same probability distribution

Distribution-based methods use statistical inference to cluster data such that the closer a data point is to a central point, the higher its probability of being assigned to that cluster.
Compared to clustering methods that rely on distance measures, distribution-based methods are more flexible in determining the shape of clusters, provided that the clusters follow a predefined distribution, as with synthetic or simulated data. If the clusters are noisy, the results will overfit.
Gaussian Mixture Model (GMM)
GMM assumes that every data point is generated from a mixture of Gaussian distributions with unknown parameters and performs iterative Expectation Maximization (EM) steps to fit the data points.
In the Expectation (E) step, data points are assigned to clusters based on the current (initially randomly selected) Gaussian parameters. In the Maximization (M) step, the parameters of the Gaussian distributions are updated to best fit the data points assigned to them.
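A minimal sketch using scikit-learn's GaussianMixture on synthetic blob data, assuming three components and a full covariance matrix:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# covariance_type can be 'spherical', 'diag', 'tied', or 'full'
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
labels = gmm.fit_predict(X)

# Soft assignments: probability of each point belonging to each component
probs = gmm.predict_proba(X)
print(gmm.means_)   # fitted Gaussian means
print(gmm.bic(X))   # BIC, one way to compare different n_components
```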
When to use Gaussian Mixture Model
- You want computation efficiency: GMM is the fastest distribution-based algorithm.
- Your clusters follow a Gaussian distribution: The covariance structure can be constrained to spherical, diagonal, tied, or full covariance matrices.
When to NOT use Gaussian Mixture Model
- You have insufficient data in each cluster: It is hard to compute the covariance matrices when there are insufficient data points in a Gaussian distribution.
Compression-based
Compression-based methods transform data points to an embedding and subsequently perform clustering on the lower-dimension data
Spectral Clustering
Spectral clustering transforms the affinity matrix between data points to a low-dimension embedding before performing clustering (such as K-means).
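A minimal sketch using scikit-learn's SpectralClustering on the concentric-circles toy data set; the affinity and label-assignment choices are assumptions for the example.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Concentric circles: a shape where plain K-means on raw coordinates struggles
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

sc = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",  # keeps the affinity matrix sparse
    assign_labels="kmeans",        # K-means on the low-dimension embedding
    random_state=42,
)
labels = sc.fit_predict(X)
```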
When to use Spectral Clustering
- You want computation efficiency: Spectral clustering is fast if the affinity matrix is sparse and the pyamg solver is used.
When to NOT use Spectral Clustering
- You are unsure about the number of clusters: Spectral Clustering requires the number of clusters to be predefined, and the number of clusters should ideally be small.
BIRCH
BIRCH performs lossy compression of data points into a set of Clustering Feature nodes (CF Nodes) that form the Clustering Feature Tree (CFT). New data points are ‘shuffled’ down the tree until they reach a leaf, and information regarding the subcluster is stored within the node.
Information regarding the final cluster centroids is read off the leaf and other clustering algorithms (such as hierarchical clustering) can be subsequently performed.
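A minimal sketch using scikit-learn's Birch on a larger synthetic data set; the threshold and branching_factor values are assumptions for the example.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=42)

# threshold and branching_factor control how the CF Tree is built
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = birch.fit_predict(X)

# Centroids of the subclusters read off the CFT leaves
print(birch.subcluster_centers_.shape)
```

In scikit-learn, n_clusters can also be set to another clustering estimator (for example, AgglomerativeClustering), which is then applied to the subcluster centroids, matching the hand-off to another algorithm described above.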
When to NOT use BIRCH
- You have high dimensional data: BIRCH does not scale well to high dimensional data
It is interesting how the same end goal of obtaining clusters can be reached with vastly different approaches and algorithms. This article does not aim to cover all the possible clustering algorithms or go in-depth into the mathematical formulas involved in each algorithm, but I hope it provides some high-level detail on the types of clustering methods and when to use different algorithms.
Related Stories
Another aspect of clustering is to evaluate the clustering algorithms to determine which performs the best using statistical measures.
Related Links
sklearn Clustering Documentation: https://scikit-learn.org/stable/modules/clustering.html