
Clustering - A Daunting Task for a Data Scientist

Guidelines for designing a clustering model

Consider a situation where a political party invites you, a data scientist, to analyze population data to help them win the upcoming elections. Your task is to identify the groups in the country’s population that are in favor of the party. With this information, the party can strategize its campaigns in different regions. This is one of the classic examples of clustering (forming groups) on a massive dataset. There are also many situations where datasets are small, mainly because of data privacy constraints or the way the data is collected.

Whether you have a small or an enormous dataset, clustering is always a challenge, even for a seasoned data scientist. This is mainly because the notion of a "cluster" itself is not precisely defined. One data scientist may find three clusters in a dataset, while another may partition the same dataset into four regions. There may not be a consensus among data scientists on this; the final call is usually made by the stakeholders of the dataset, because it is their knowledge and understanding of the data that ultimately determines whether the clustering meets their needs.

So, how does a data scientist meet the client’s requirements?

Data Scientist’s Approach

The first and most important step for a data scientist in solving a clustering problem is to examine the size of the dataset. Based on the size, the data scientist selects an appropriate clustering model and algorithm. Over the years, researchers have developed many clustering algorithms that meet the specific needs of various data sizes. For small to medium-sized datasets, one can use simple connectivity-based clustering algorithms, while for really large datasets with multivariate distributions, one has to use algorithms based on more advanced statistical techniques. Selecting the right algorithm is where the knowledge and skill of the data scientist come in. In this article, I will describe the approach a data scientist follows to solve this non-trivial problem. To understand the approach, let us briefly consider the clustering categories and the various algorithms in each one.

Clustering Categories

The clustering algorithms are broadly categorized as follows:

  • Centroid-based
  • Connectivity-based
  • Distribution-based
  • Density-based
  • Clustering huge datasets
  • Gossip-style clustering
  • Grid-based

I will now introduce you to these categories and briefly describe the algorithms in each.

Centroid-based

The k-means and k-medoids algorithms fall into this category. These are probably the simplest to understand and are usually the starting point for learning clustering. Here, you need to specify the number of clusters you want to form before running the algorithm. Techniques such as the Elbow method, Average Silhouette, and the Gap Statistic are available for selecting the optimal number of clusters. The underlying optimization problem is NP-hard, so these algorithms usually provide only approximate solutions. This model works well for small to medium-sized datasets.
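To make this concrete, here is a minimal sketch using scikit-learn, assuming a synthetic make_blobs dataset: it runs k-means for several candidate values of k and keeps the one with the best average silhouette score.

```python
# Minimal k-means sketch: pick k by the average silhouette score.
# The synthetic data and the candidate range of k are illustrative assumptions.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

best_k, best_score = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)          # higher is better
    if score > best_score:
        best_k, best_score = k, score

print(f"Best k by average silhouette: {best_k} (score = {best_score:.3f})")
```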

Connectivity-based

The agglomerative and divisive clustering algorithms fall into this category. These algorithms are based on the concept of connecting objects according to their distance from each other. They create a dendrogram, a hierarchical structure over the entire dataset, and this visual representation helps you identify the clusters. The clustering depends mainly on the choice of distance function and linkage criterion. We use this type of clustering in specialized applications such as determining the phylogenetic tree of animal evolution, tracking viruses by charting phylogenetic trees, classifying documents by analyzing text similarity, and so on. If you wish to get a hierarchical representation of your dataset while determining the clusters, these are the algorithms to use.
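As a rough sketch of how this looks in practice (assuming SciPy and a small synthetic dataset), the snippet below builds the hierarchy with Ward linkage, draws the dendrogram, and then cuts it into three flat clusters:

```python
# Agglomerative (connectivity-based) clustering sketch with a dendrogram.
# The synthetic data, Ward linkage, and the cut at 3 clusters are assumptions.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# The distance function and linkage criterion drive the shape of the hierarchy.
Z = linkage(X, method="ward", metric="euclidean")

dendrogram(Z)                 # visual hierarchy of the whole dataset
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
```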

Distribution-based

The Gaussian Mixture Model (GMM) is one algorithm that falls under this category. Here, we assume that the stakeholders have some knowledge of the statistical distribution of their dataset. Because the algorithm makes a strong assumption about the data distribution, it will produce poor clusters if that assumption is wrong. So, use this category of algorithms only if you are confident about the statistical distribution of your dataset, in which case the algorithm can produce excellent clustering results.
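A minimal sketch, assuming scikit-learn and data that really is drawn from a few Gaussians, looks like this; note that a GMM also gives you soft (probabilistic) cluster memberships:

```python
# Distribution-based clustering sketch with a Gaussian Mixture Model.
# The synthetic data and the choice of 3 components are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Data sampled from three Gaussians -- exactly the assumption a GMM relies on.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(3.0, 3.0), scale=0.7, size=(100, 2)),
    rng.normal(loc=(0.0, 4.0), scale=0.4, size=(100, 2)),
])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
hard_labels = gmm.predict(X)        # one cluster per point
soft_labels = gmm.predict_proba(X)  # membership probabilities per component
```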

Density-based

The DBSCAN algorithm falls under this category. A high density of data points in a certain area defines a cluster, and a drop in density marks a cluster border. The algorithm works well when clusters are dense regions separated by sparser areas. The most important aspect of this type of clustering is that it can create clusters of arbitrary shape, while the algorithms discussed above mostly produce convex, roughly spherical clusters. Second, the algorithm's complexity is fairly low and its output is largely deterministic, meaning it discovers essentially the same clusters in every run.
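Here is a minimal DBSCAN sketch, assuming scikit-learn and the two-moons toy dataset (a shape centroid-based methods struggle with); the eps and min_samples values are illustrative:

```python
# Density-based clustering sketch with DBSCAN on non-convex clusters.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: points required to call a region dense.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_   # label -1 marks points treated as noise (no dense region)
```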

OPTICS is another algorithm in the same category and generalizes DBSCAN. It produces a reachability plot, a visualization that helps the data scientist carve out clusters of their choice.
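A short sketch of OPTICS with scikit-learn (parameter values are assumptions); the reachability plot is the visualization referred to above, and its valleys correspond to clusters:

```python
# OPTICS sketch: the reachability plot helps you choose the clusters yourself.
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

optics = OPTICS(min_samples=10).fit(X)

# Valleys in the reachability plot correspond to dense regions, i.e. clusters.
plt.plot(optics.reachability_[optics.ordering_])
plt.ylabel("Reachability distance")
plt.show()
```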

The Mean-Shift algorithm is yet another algorithm in this category, where clusters are formed based on kernel density estimation (KDE). It can detect arbitrarily shaped clusters but is slower than DBSCAN, and it does not cluster high-dimensional datasets well.
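A minimal Mean-Shift sketch with scikit-learn; the bandwidth, estimated here from a quantile of pairwise distances, controls the kernel density estimate and is an illustrative assumption:

```python
# Mean-Shift sketch: clusters follow the modes of a kernel density estimate.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

bandwidth = estimate_bandwidth(X, quantile=0.2)   # kernel width for the KDE
ms = MeanShift(bandwidth=bandwidth).fit(X)
print("Clusters found:", len(ms.cluster_centers_))
```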

Data scientists use density-based clustering when the datasets are large.

Clustering Huge Datasets

How do you cluster enormous datasets? You use a well-known strategy: divide-and-conquer. The BIRCH algorithm falls under this category. We divide the dataset into smaller units, keeping as much information as possible from the original data in each split, and then cluster each unit using other known clustering techniques. The advantage of this technique is that it eliminates the need to load the entire dataset into memory.
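A minimal BIRCH sketch with scikit-learn; feeding the data in chunks through partial_fit mimics the situation where the full dataset never fits in memory at once (the chunking and parameter values are assumptions):

```python
# BIRCH sketch: build a compact summary (CF-tree) incrementally, chunk by chunk.
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)

birch = Birch(n_clusters=5)
for chunk in np.array_split(X, 20):   # pretend each chunk arrives separately
    birch.partial_fit(chunk)

labels = birch.predict(X)             # final cluster labels from the summary
```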

CLARANS is another algorithm in this category; it uses a randomized search while clustering the dataset and applies the PAM (Partitioning Around Medoids) algorithm on each unit to determine its clustering quality. If the quality is not good enough, it repeats the clustering process. It thus maintains a balance between computational cost and random sampling of the data. It supports polygonal objects, can use any arbitrary distance function, and provides high-quality clustering.
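CLARANS itself is not part of scikit-learn, but the PAM step it builds on is easy to try; the sketch below uses the optional scikit-learn-extra package's KMedoids estimator as a stand-in (the data and the number of clusters are assumptions):

```python
# PAM (Partitioning Around Medoids) sketch -- the building block CLARANS
# repeatedly applies to randomly sampled neighborhoods of the search space.
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids   # requires scikit-learn-extra

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

pam = KMedoids(n_clusters=4, method="pam", random_state=0).fit(X)
print("Medoid coordinates:\n", pam.cluster_centers_)
```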

Gossip-style Clustering

Also known as Affinity Propagation clustering, here we form clusters by "gossiping" between peers, much like in a social network. Groups form based on the affinity between members and their leaders: messages exchanged between points and each point's suitability to act as a group leader (an exemplar) decide the affinity measure. You don't need to estimate the number of clusters in advance, and the algorithm often produces excellent results, especially when you have no idea how many clusters a dataset may contain. The trade-off is that it is computationally expensive.
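A minimal Affinity Propagation sketch with scikit-learn; notice that the number of clusters appears nowhere in the code (the damping value and the data are assumptions):

```python
# Affinity Propagation sketch: exemplars emerge from message passing.
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Points exchange "responsibility" and "availability" messages until a set of
# exemplars (group leaders) and their followers stabilizes.
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
print("Clusters found:", len(ap.cluster_centers_indices_))
```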

Grid-based

All the algorithms we have considered so far are query-based: if you decide to try out another query, you need to scan the entire dataset all over again. This makes all of these clustering techniques computationally expensive.

The STING algorithm falls under this category. We first split the entire dataset into rectangular cells, with each higher-level cell in the hierarchy containing a summary of the set of cells below it. Because the structure is query-independent, the entire clustering process can be parallelized, saving clustering time. Moreover, by its nature it can handle incremental updates, which is a requirement for dynamic datasets. Another advantage is the low complexity of the algorithm.
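STING is rarely available in mainstream Python libraries, so the following is only a conceptual sketch of the grid idea: assign points to rectangular cells once, keep per-cell summaries, and answer later queries from the summaries instead of rescanning the raw data (the grid resolution and the density query are assumptions):

```python
# Grid-summary sketch in the spirit of STING (not the full algorithm).
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10_000, 2))           # points in the unit square

n_bins = 16
# Assign every point to a rectangular cell.
cells = np.floor(X * n_bins).astype(int).clip(0, n_bins - 1)
cell_ids = cells[:, 0] * n_bins + cells[:, 1]

# Per-cell summaries (here just counts); higher levels of the hierarchy would
# aggregate these summaries further instead of touching the points again.
counts = np.bincount(cell_ids, minlength=n_bins * n_bins).reshape(n_bins, n_bins)

# A density query ("which cells hold at least 50 points?") uses only the summary.
dense_cells = np.argwhere(counts >= 50)
```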

CLIQUE is another algorithm in this category. It combines density-based and grid-based clustering techniques: it starts with one-dimensional subspaces and keeps merging them to compute higher-dimensional subspaces. It uses an Apriori-like technique to find clusterable subspaces and is well suited for clustering high-dimensional datasets.
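For a quick experiment, the pyclustering library ships a CLIQUE implementation; the sketch below assumes that optional library, with illustrative values for the number of intervals per dimension and the density threshold:

```python
# CLIQUE sketch using the optional pyclustering library (assumed installed).
from pyclustering.cluster.clique import clique
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Arguments: data, intervals per dimension (10), density threshold (0 means any
# non-empty cell counts as dense).
clique_instance = clique(X.tolist(), 10, 0)
clique_instance.process()
clusters = clique_instance.get_clusters()   # lists of point indices per cluster
print("Clusters found:", len(clusters))
```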

Concluding Remarks

So far, I have discussed several clustering categories that can help you decide which algorithm to use based on the dataset size and the customer's expectations of the clusters. The overview presented in this short article can help you select an algorithm to cluster small, medium, large, or even spatial datasets.

Clustering is usually non-trivial because of its unsupervised nature and the fact that there may not be a consensus on the end results. As a data scientist, you should have a good conceptual overview of these various clustering algorithms, the purpose for which they were built, and the clustering problem each addresses. Several libraries provide efficient implementations of these algorithms, so as a data scientist you need not worry about the mathematics behind them. Learn the purpose of each category of algorithm and you will be well equipped to tackle most clustering problems.

For further information, you may like to refer to my upcoming book, Thinking Data Science (Springer series in Applied Machine Learning). It has a large section on clustering, with practical examples on all the algorithms.
