
Clustering types with various applications

Clustering types and their usage areas are explained with Python implementations

Hands-on Tutorials

Table of Contents
1. Introduction
2. Clustering Types
2.1. K-Means
-----Theory
-----The optimal number of clusters
-----Implementation
2.2. Mini-Batch K-Means
2.3. DBSCAN
2.4. Agglomerative Clustering
2.5. Mean-Shift
2.6. BIRCH
3. Image Segmentation with Clustering
4. Data Preprocessing with Clustering
5. Gaussian Mixture Model
-----Implementation
-----How to select the number of clusters?
6. Summary

1. Introduction

Unlabeled datasets can be grouped according to their similar properties with unsupervised learning techniques; however, each algorithm looks at these similarities from a different point of view. Besides labeling the data, unsupervised learning provides detailed information about the dataset, and with this information the dataset can be rearranged and made more understandable. In this way, unsupervised learning is used in customer segmentation, image segmentation, genetics (clustering DNA patterns), recommender systems (grouping together users with similar viewing patterns), anomaly detection, and many more. With PCA, one of the most frequently used dimensionality reduction techniques and the subject of the previous article, new and concise components are obtained according to the statistical properties of the dataset. This article explains clustering types, using clustering for image segmentation, data preprocessing with clustering, and the Gaussian mixture model in detail. All explanations are supported with Python implementations.

Photo by Valentin Salja on Unsplash

2. Clustering Types

2.1. K-Means

Theory

K-means clustering is one of the most frequently used clustering algorithms. The underlying idea is to assign samples to clusters according to their distance from the cluster centers, where the number of clusters is set by the user. The code block below explains how k-means clustering can be built from scratch.
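A minimal from-scratch sketch of this idea, using NumPy (the function name kmeans_from_scratch and all parameter choices here are illustrative assumptions, not the exact original code):

```python
import numpy as np

def kmeans_from_scratch(X, n_clusters, n_iters=100, seed=42):
    """A minimal k-means sketch: random centers, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # 1. Place the cluster centers randomly by picking random samples
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        # 2. Assign every sample to the closest center (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Recenter: each center moves to the mean of its assigned samples
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        # Stop when an equilibrium state is reached (centers no longer move)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```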

The centers of the chosen number of clusters are initially placed randomly in the dataset. Every sample is assigned to the closest center; this closeness is calculated with the Euclidean distance in the code block above, but other metrics such as the Manhattan distance can also be used. Each center then updates its location to the mean of the samples assigned to it. Stage 2 (the distances of the samples to the centers are recalculated and each sample is assigned to the nearest center) and Stage 3 (each cluster center recenters itself according to its assigned samples) are repeated until an equilibrium state is reached. The steps described verbally here are visualized in Figure 1 using the mglearn library.

Figure 1. mglearn.plots.plot_kmeans_algorithm(), Image by author

The from-scratch version above is available ready-made in the sklearn library and is exemplified in the practical section. As is known, sklearn estimators provide methods such as .fit, .predict, and .transform. Calling .fit runs the K-means algorithm as described above: placing the samples in clusters, updating the cluster centers, and repeating these operations. When the equilibrium state is reached, that is, when nothing changes anymore, the algorithm is finished. The cluster centers can be inspected with kmeans.cluster_centers_. With .predict, we can find out which cluster any external sample belongs to. With .transform, the distance between each sample and every cluster center can be obtained as a matrix; each row of this matrix represents the distances of one sample to all cluster centers. Choosing the smallest value (assigning the sample to the nearest center) is called hard clustering. At this point, it is worth mentioning that the new dataset obtained with .transform (the distances of each sample to the selected number of cluster centers) is actually a very effective nonlinear dimensionality reduction technique.
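A quick illustration of these methods on an assumed toy blobs dataset might look like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.cluster_centers_)         # coordinates of the 3 cluster centers
print(kmeans.predict([[0.0, 0.0]]))    # hard clustering: index of the nearest center
print(kmeans.transform([[0.0, 0.0]]))  # distances to every center -> (1, 3) matrix
```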

The optimal number of clusters

Computing the inertia value for each k value is one way to choose the optimal number of clusters. Inertia is the sum of squared distances of the samples to their closest cluster center. According to this definition, the k with the lowest inertia value should give the most accurate result. Although this is theoretically true, inertia keeps decreasing as k grows and only reaches its smallest value, 0, when k equals the number of samples, that is, when every sample is its own cluster. Therefore, the best approach is to examine the inertia graph and pick the breakpoint (the elbow) in order to choose the most suitable number of clusters.
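A short sketch of the elbow approach, assuming a simple blobs dataset rather than the exact data behind Figure 2:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, random_state=42)

# Fit k-means for a range of k values and record the inertia of each model
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 10)]

plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("number of clusters (k)")
plt.ylabel("inertia")
plt.show()
```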

Figure 2. graphs of inertia - number of clusters, Image by author

Figure 2 shows the inertia vs. number of clusters plots of the datasets grouped with k-means in the practical part. Looking at the graphs, a bend (elbow) occurs at n=2, which is interpreted as the optimal number of clusters to select.

Another method is to compute the silhouette score and compare its values across different numbers of clusters. The silhouette score is a measure that takes a value between -1 and 1 and is equal to (b - a) / max(a, b), where a is the mean distance to the other instances in the same cluster and b is the mean distance to the instances of the nearest other cluster. By definition, the highest score indicates the best number of clusters. The only downside is its computational cost. Figure 3 shows the silhouette score graph for the practical part.
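A comparable sketch for the silhouette score, again on an assumed blobs dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=2, random_state=42)

# The silhouette score is only defined for k >= 2
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))
```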

Figure 3. Silhouette Score, Image by author

As can be seen, the highest value is obtained at the level of 2 clusters.

k=1 is not included because at least 2 clusters are required to compute the silhouette score.

Implementation

Datasets available in the sklearn library are imported and the K-Means clustering algorithm is applied to both of them as follows:
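A comparable sketch, assuming the two datasets are make_blobs and make_moons from sklearn and that both are standardized before clustering:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.preprocessing import StandardScaler

# Two toy datasets: globular blobs and the non-convex moons
X_blobs, _ = make_blobs(n_samples=300, centers=2, random_state=42)
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, X, title in zip(axes, [X_blobs, X_moons], ["blobs", "moons"]):
    X = StandardScaler().fit_transform(X)          # standardize the features
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
    ax.set_title(f"k-means on {title}")
plt.show()
```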

The elbow graphs and the silhouette results are already shown above; the clustering result of k-means is shown in Figure 4.

Figure 4. Clustering capability of k-means on the datasets, Image by author

2.2. Mini-Batch K-Means

As the name suggests, Mini-Batch K-Means updates the cluster centers using mini-batches instead of the entire dataset. As expected, the inertia value is somewhat higher, although the training time is shorter than that of k-means, which makes it useful for large datasets. It is implemented in Python as follows:
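A minimal sketch of this comparison (the dataset size and batch_size are assumptions):

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=2, random_state=42)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
mb_kmeans = MiniBatchKMeans(n_clusters=2, batch_size=256, n_init=10,
                            random_state=42).fit(X)

# Mini-batch k-means trades a slightly higher inertia for a shorter fit time
print("k-means inertia:           ", kmeans.inertia_)
print("mini-batch k-means inertia:", mb_kmeans.inertia_)
```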

The results are almost the same as those of K-Means, as seen in Figure 5.

Figure 5. Clustering capability of mini-batch k-means on the datasets, Image by author

2.3. DBSCAN

We can compare the principle of Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to sorting out the chains in a box full of chains. Clusters are created based on the proximity of the samples to each other rather than their proximity to cluster centers, and the number of clusters is not set by the user. In more formal terms, the algorithm groups samples that are closer to each other than the eps hyperparameter set by the user. Another hyperparameter, min_samples, is the minimum number of samples required for such a collection to be assigned as a cluster. As can be understood from this description and the name, grouping is done according to the density of the dataset, and the samples in dense regions are called core samples. The above recipe is visualized in Figure 6 using the mglearn library.

Figure 6. mglearn.plots.plot_dbscan(), Image by author

Examining the case of eps=1.5: when min_samples=2, 3 different clusters are created and all samples are labeled, because every sample is closer than eps=1.5 to its nearest neighbor. In the same case, when min_samples=5 is set, only a single cluster of 5 samples is created, while 7 samples remain unlabeled, because they are farther than eps=1.5 from the created cluster; no additional clusters are formed because there are not 5 samples within eps=1.5 of each other anywhere else. As can be seen, with DBSCAN the number of clusters is determined by the hyperparameter settings rather than by the user directly. As would be expected, it is useful for anomaly detection with the appropriate settings.

It has been applied to datasets in the following code block:
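A minimal sketch for the moons dataset (the eps and min_samples values here are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)   # standardization before DBSCAN

# eps: maximum neighbour distance, min_samples: samples needed for a core point
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)          # label -1 marks noise samples

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
plt.title("DBSCAN on the moons dataset")
plt.show()
```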

The moons dataset is not suitable for separation with k-means because of its structure, but after standardization it is clustered quite elegantly with DBSCAN. Unlike K-Means, DBSCAN does not provide a .predict method, so it cannot directly determine which cluster external data belongs to. However, if the core samples are extracted with dbscan.components_ (call this matrix X), the labels are extracted with dbscan.labels_ (call them y), and a KNeighborsClassifier is trained on them, the model becomes useful for external data as well, as sketched below.
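A minimal sketch of this workaround (the eps value and the test points are assumptions):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_raw, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
scaler = StandardScaler().fit(X_raw)
X_scaled = scaler.transform(X_raw)

dbscan = DBSCAN(eps=0.3, min_samples=5).fit(X_scaled)

# Core samples and their cluster labels become the training set for k-NN
X_core = dbscan.components_
y_core = dbscan.labels_[dbscan.core_sample_indices_]

knn = KNeighborsClassifier(n_neighbors=5).fit(X_core, y_core)

# Now external points can be assigned to one of the DBSCAN clusters
X_new = scaler.transform([[0.0, 0.5], [2.0, -0.5]])
print(knn.predict(X_new))
```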

Figure 7. Clustering capability of DBSCAN on the datasets, Image by author

2.4. Agglomerative Clustering

Each sample starts as its own cluster, and these mini-clusters are merged according to a user-selected linkage criterion until the specified number of clusters is reached. The linkage options are:

linkage='ward': minimizes the variance of the clusters being merged (default)

linkage='average': uses the average of the distances between all observations of the two sets

linkage='complete': uses the maximum of the distances between all observations of the two sets

linkage='single': uses the minimum of the distances between all observations of the two sets

In addition, the distance metric between clusters can be adjusted with the affinity hyperparameter: "euclidean", "l1", "l2", "manhattan", "cosine", or "precomputed". Agglomerative clustering is visualized with the mglearn library as seen in Figure 8.

Figure 8. mglearn.plots.plot_agglomerative_algorithm(), Image by author

As can be seen, the clustering process starts from the closest samples, and merging continues until the number of clusters determined by the user is reached. Due to its structure, agglomerative clustering does not include a .predict method, just like DBSCAN; cluster assignments are instead obtained with the fit_predict method. Agglomerative clustering works by generating hierarchical clusterings, and the best way to visualize these hierarchies is to create dendrograms. It is applied to our moons dataset with the following code block:
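A comparable sketch (the SciPy ward linkage used for the dendrogram and the noise level are assumptions):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# Agglomerative clustering with 2 clusters and ward linkage (the default)
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
axes[0].set_title("Agglomerative clustering")

# The dendrogram is built from the ward linkage matrix computed with SciPy
axes[1].set_title("Dendrogram")
dendrogram(ward(X), ax=axes[1], no_labels=True)
plt.show()
```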

The output of the code block is as in figure 9.

Figure 9. Clustering capability of Agglomerative Clustering on the datasets, Image by author

Reading the dendrogram from bottom to top, each sample starts as its own cluster, so at the lowest level there are as many clusters as samples. Since this number is very high, the merging steps at the bottom appear as a dense red block, and clustering proceeds upwards as in Figure 10. This process continues until the number of clusters specified by the user is reached and is interrupted at that point.

Figure 10. Dendrogram of the moons dataset, Image by author

2.5. Mean-Shift

Mean-shift starts with a circle (a kernel of a given bandwidth) placed over the dataset, computes the mean of the data inside the circle, and moves the circle to that mean. After reaching its new position, the mean of the data inside is recalculated and the circle is recentered again, and this process is repeated until an equilibrium state is reached. Since regions with high density pull the mean towards them (hence "mean-shift"), it can be described as a density-based algorithm. We can also detect separate clusters within dense regions if we reduce the size of the circle (the bandwidth); once the bandwidth is changed, a different number of clusters is created, as seen in Figure 12.

bandwidth=0.75 is set for the moons dataset and the results are as follows:
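A minimal sketch of this setting (the standardization step is an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# bandwidth controls the size of the "circle" (kernel) that is shifted
mean_shift = MeanShift(bandwidth=0.75)
labels = mean_shift.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
plt.title(f"Mean-shift, bandwidth=0.75 -> {len(mean_shift.cluster_centers_)} clusters")
plt.show()
```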

Figure 11. Clustering capability of Mean-Shift on the datasets, Image by author
Figure 12. Different bandwidth effects on mean-shift clustering algorithm, Image by author

2.6. BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a tree-based algorithm suitable for large datasets. Compared to Mini-Batch K-Means, it gives much faster results with similar success rates. Since the clustering process is tree-based, the model can quickly assign samples to clusters without storing the whole dataset, which is why it is fast. However, because it is tree-based, it is recommended only when the number of features is not more than about 20.
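A minimal BIRCH sketch on the moons dataset (the threshold and branching_factor values are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import Birch
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# threshold and branching_factor control the CF-tree built by BIRCH
birch = Birch(n_clusters=2, threshold=0.3, branching_factor=50)
labels = birch.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
plt.title("BIRCH on the moons dataset")
plt.show()
```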

Figure 12. Clustering capability of Birch on the datasets, Image by author

3. Image Segmentation with Clustering

Since we have dealt with the sklearn library enough up to this section, let's apply K-Means to the image seen in Figure 13 by using the cv2 (OpenCV) library. The following code block has been applied to the image at various k values.
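A minimal sketch of this step with cv2 (the file name flower.jpg, the value k=4, and the stopping criteria are illustrative assumptions):

```python
import cv2
import numpy as np

# Read the image and flatten it to a list of BGR pixel values
image = cv2.imread("flower.jpg")                 # assumed file name
pixels = image.reshape(-1, 3).astype(np.float32)

# Stopping criteria: 10 iterations or an accuracy (epsilon) of 1.0
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)

k = 4                                            # number of colour clusters; vary to compare
_, labels, centers = cv2.kmeans(pixels, k, None, criteria, 10,
                                cv2.KMEANS_RANDOM_CENTERS)

# Replace every pixel with the center of its cluster to get the segmented image
segmented = centers[labels.flatten()].astype(np.uint8).reshape(image.shape)
cv2.imwrite("flower_segmented.jpg", segmented)
```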

  • The results of K-Means are as follows:
Figure 13. Image segmentation with k-means, Image by author
  • The results of DBSCAN are as follows:
Figure 14. Image segmentation with DBSCAN, Image by author

4. Data Preprocessing with Clustering

Considering an image dataset, there are hundreds of features (pixels); if these features are replaced by cluster distances, the features are effectively grouped and dimensionality reduction is performed. Classification with the decision tree algorithm, with and without clustering as a preprocessing step, is applied to the load_digits dataset in the following code block:
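A sketch of such a comparison (the train/test split and the use of a pipeline are assumptions; the exact scores depend on these choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: a decision tree on the raw 64 pixel features
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("without clustering:", tree.score(X_test, y_test))

# Preprocessing: replace the pixels with distances to 40 k-means centers
pipeline = make_pipeline(
    KMeans(n_clusters=40, n_init=10, random_state=42),
    DecisionTreeClassifier(random_state=42),
)
pipeline.fit(X_train, y_train)
print("with clustering:   ", pipeline.score(X_test, y_test))
```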

It is seen that the accuracy score increases from about 83% to 88% when the number of clusters is set to 40.

5. Gaussian Mixture Model

GMM is a probabilistic model that assumes a dataset is made up of a combination of individual Gaussians with unknown parameters. With k-means, as mentioned above, hard clustering is performed and each sample is assigned to the nearest cluster center. Since k-means is a cluster-center-based unsupervised learning technique, it may give misleading results when factors such as distribution and density matter. For density, it can be seen above that DBSCAN is more successful; if we consider the distribution instead, clustering performance is better with GMM. To understand GMM better, let's first define its building block: what is the Gaussian, which is its main component, and what does it reveal?

The Gaussian distribution shows up as a distinctive bell-shaped curve when a sample is plotted in a histogram. A normal (Gaussian) distribution occurs when many random factors interact with the measurement. In a normal distribution, the majority of data points have values close to the mean, which sits at the center of the bell-shaped curve, while fewer data points have values much larger or smaller than the mean. The width of the curve corresponds to the standard deviation: about 68.3% of the dataset lies within 1 standard deviation of the mean, and about 95.4% within 2 standard deviations. Figure 15 shows the normal (Gaussian) distribution with the mean and standard deviation marked on the histogram.

Figure 15. Gaussian (normal) distribution, Image by author

The Gaussian Mixture Model forms groups as a combination of Gaussians, which reveals the statistical distribution of the dataset. Variance (for 1D) or covariance (for 2D and higher) values derived from the standard deviation of each Gaussian play an active role here. K-means does not consider the variance; it only assigns samples according to their proximity to the cluster centers. While a Gaussian mixture generates probabilistic ratios of which cluster a sample belongs to (the sum of these ratios is 1), that is, soft clustering, K-means performs hard clustering. It should also be noted that GMM uses the Expectation-Maximization algorithm to fit the Gaussians that form it. Gaussian Mixture Models will be covered in detail in subsequent articles.

Implementation

Since it is quite easy to implement GMM with the sklearn library, it has been applied to the flower image used in the image segmentation section in the code block below:
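A comparable sketch with sklearn's GaussianMixture (again, the file name and the number of components are assumptions):

```python
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

# Read the same flower image and flatten it to one row per pixel
image = cv2.imread("flower.jpg")                     # assumed file name
pixels = image.reshape(-1, 3).astype(np.float64)

# Fit a Gaussian mixture with 4 components to the pixel colours
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=42)
labels = gmm.fit_predict(pixels)

# Paint every pixel with the mean colour of its Gaussian component
segmented = gmm.means_[labels].astype(np.uint8).reshape(image.shape)
cv2.imwrite("flower_gmm_segmented.jpg", segmented)
```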

Figure 16. Image segmentation with GMM, Image by author

It is necessary to mention one of the essential hyperparameters, covariance_type, in this implementation section. covariance_type can be:

covariance_type='full': each component has its own general covariance matrix, which means clusters can take on any shape, size, and orientation (default)

covariance_type='tied': all components share the same general covariance matrix, which means all clusters have the same ellipsoidal shape, size, and orientation

covariance_type='diag': each component has its own diagonal covariance matrix, which means clusters can be ellipsoidal with any size, but their axes must be aligned with the coordinate axes

covariance_type='spherical': each component has its own single variance, which means all clusters must be spherical, although they can have different diameters

Figure 17. Visualization of covariance types, source

How to select the number of clusters?

It was mentioned in the k-means section above that the silhouette score or inertia can be used. However, because the clusters in a Gaussian Mixture Model may not be spherical and may have different sizes, choosing according to those metrics can be misleading. It is more appropriate to use the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) when choosing the best number of clusters for GMM.

AIC and BIC are probabilistic model selection criteria used for scoring models and selecting the best one.

Let's look at the BIC and AIC values of the moons dataset above and visualize them:
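A sketch of this comparison on an assumed standardized moons dataset, covering both the AIC/BIC curve and the covariance-type comparison:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# BIC and AIC for a range of component counts
ks = range(1, 10)
bics, aics = [], []
for k in ks:
    gmm = GaussianMixture(n_components=k, random_state=42).fit(X)
    bics.append(gmm.bic(X))   # Bayesian Information Criterion
    aics.append(gmm.aic(X))   # Akaike Information Criterion

plt.plot(ks, bics, marker="o", label="BIC")
plt.plot(ks, aics, marker="s", label="AIC")
plt.xlabel("number of components (k)")
plt.legend()
plt.show()

# BIC for each covariance type at each k
for cov_type in ["full", "tied", "diag", "spherical"]:
    bic_values = [GaussianMixture(n_components=k, covariance_type=cov_type,
                                  random_state=42).fit(X).bic(X) for k in ks]
    plt.plot(ks, bic_values, marker="o", label=cov_type)
plt.xlabel("number of components (k)")
plt.ylabel("BIC")
plt.legend()
plt.show()
```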

Figure 18. graph of AIC&BIC - k, Image by author

As can be seen, both AIC and BIC point to n=2 as the best choice (lower values indicate a better model). Their behavior for different covariance types is also tested, and the result is shown in Figure 19.

Figure 19. Comparison of different covariance types, Image by author

It is seen that the best BIC value is obtained with covariance_type='spherical' for each value of k.

6. Summary

Most of the common clustering techniques are discussed above: their theoretical background was explained and they were implemented in Python with basic examples. An image summarizing the clustering techniques is shown in Figure 20.

Figure 20. summary of clustering techniques, source

The previous article covered one of the most useful dimensionality reduction techniques, Principal Component Analysis (PCA). This article has covered the clustering types and some of their usage areas with Python implementations.

Remaining subjects such as Outlier Detection, the Expectation-Maximization (EM) meta-algorithm, Self-Organizing Maps (SOM), Fuzzy C-Means, etc. will be discussed in further articles.

Comprehensive guide for Principal Component Analysis

Machine Learning Guideline

