POCS-based Clustering Algorithm Explained

Each data point has a different importance

LA Tran
Towards Data Science


Photo by Kier in Sight on Unsplash

Cluster analysis (or clustering) is a data analysis technique that explores and groups a set of vectors (or data points) in such a way that vectors in the same cluster are more similar to one another than to those in other clusters. Clustering algorithms are widely used in numerous applications, e.g., data analysis, pattern recognition, and image processing.

This article reviews a new clustering algorithm based on the method of Projection onto Convex Sets (POCS), called the POCS-based clustering algorithm. The original paper was introduced at IWIS 2022, and the source code has also been released on GitHub.

Convex Sets

A convex set is defined as a set of data points in which the line segment connecting any two points x1 and x2 of the set lies entirely within the set. By this definition, the empty set ∅, a singleton set, a line segment, a hyperplane, and a Euclidean ball are all convex sets. In particular, a single data point is a convex set, since it is a singleton set (a set with exactly one element). This observation opens a new path: the concept of POCS can be applied to clustering data points.
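Formally, the definition above can be written as follows: a set C is convex if every convex combination of two of its points stays inside it.

```latex
\lambda x_1 + (1 - \lambda) x_2 \in C,
\quad \forall x_1, x_2 \in C,\ \forall \lambda \in [0, 1]
```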

Projection onto Convex Sets (POCS)

Let me briefly review the concept of POCS (without equations). The method of POCS can be roughly divided into two forms: alternating and parallel.

Alternating POCS

Starting from an arbitrary point in the data space, alternating projections from this point onto two (or more) intersecting convex sets will converge to a point in the intersection of the sets. A graphical illustration is shown below.

Image by author.

When the convex sets do not intersect, the alternating projections instead converge to greedy limit cycles, which depend on the order of the projections.

Image by author.
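The alternating form is easy to try out in a few lines of numpy. Below is a minimal sketch (not from the paper): two intersecting Euclidean balls serve as the convex sets, and `project_ball` is a hypothetical helper implementing the standard projection onto a ball.

```python
import numpy as np

def project_ball(x, center, radius):
    """Project x onto a Euclidean ball (a convex set)."""
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= radius else center + radius * d / n

# Two intersecting balls
c1, r1 = np.array([0.0, 0.0]), 1.5
c2, r2 = np.array([2.0, 0.0]), 1.5

x = np.array([5.0, 4.0])          # arbitrary starting point
for _ in range(100):
    x = project_ball(x, c1, r1)   # project onto the first set,
    x = project_ball(x, c2, r2)   # then onto the second, alternately

# x now lies in the intersection of both balls
print(np.linalg.norm(x - c1) <= r1, np.linalg.norm(x - c2) <= r2)
```

Swapping the second ball for a disjoint one (e.g., centered at (5, 0)) makes the iterate bounce between the two boundaries instead of settling in a common point, which is the limit-cycle behavior described above.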

Parallel POCS

Unlike the alternating form, the parallel form of POCS performs projections from the data point onto all the convex sets simultaneously, with each projection carrying an importance weight. For two convex sets with a non-empty intersection, the parallel projections converge, as in the alternating version, to a point in the intersection of the sets.

Image by author.

In the case of non-intersecting convex sets, the parallel projections converge to a point that minimizes the weighted sum of squared distances to the sets. The main idea of the POCS-based clustering algorithm stems from this property.

Image by author.
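The minimization behavior can also be sketched in numpy. This toy example is our own (not from the paper): two disjoint balls, equal weights, and a hypothetical `project_ball` helper. Each iteration projects onto all sets at once and combines the projections by their weights.

```python
import numpy as np

def project_ball(x, center, radius):
    """Project x onto a Euclidean ball (a convex set)."""
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= radius else center + radius * d / n

# Two disjoint balls: no common point exists
c1, r1 = np.array([0.0, 0.0]), 1.0
c2, r2 = np.array([4.0, 0.0]), 1.0
w1, w2 = 0.5, 0.5                 # importance weights, summing to 1

x = np.array([1.0, 3.0])
for _ in range(200):
    # project onto all sets simultaneously, then combine by weight
    x = w1 * project_ball(x, c1, r1) + w2 * project_ball(x, c2, r2)

print(x)  # settles at (2, 0), midway between the two balls
```

With equal weights, the limit point sits symmetrically between the two sets; shifting the weights pulls it toward the more heavily weighted set, which is exactly the "importance" mechanism the clustering algorithm exploits.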

For more details on POCS, you can visit the original paper and/or some other recommended papers (with PDF files available).

POCS-based Clustering Algorithm

Taking advantage of the convergence property of the parallel POCS method, the authors proposed a simple yet effective (to some extent) clustering algorithm. It operates in a similar spirit to the classical K-Means algorithm but differs in how each data point is treated: K-Means gives every data point equal importance, whereas the POCS-based clustering algorithm weights each data point in proportion to its distance from the cluster prototype. That's it!

The pseudocode of the algorithm is shown below:

Image from paper.
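As a rough illustration of the idea, here is a minimal numpy re-sketch of the procedure, not the released implementation. The function name `pocs_clustering`, the random initialization, and the normalization of the distance-proportional weights (so they sum to 1 per cluster) are our own assumptions based on the description above.

```python
import numpy as np

def pocs_clustering(X, k, n_iters=100, seed=None):
    rng = np.random.default_rng(seed)
    # Initialize prototypes from k distinct random data points
    protos = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: nearest prototype, as in K-Means
        dists = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: parallel projection of each prototype onto its
        # member points (singleton convex sets), weighted by distance
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            d = np.linalg.norm(members - protos[j], axis=1)
            if d.sum() == 0:
                continue
            w = d / d.sum()        # weights proportional to distance
            protos[j] = protos[j] + w @ (members - protos[j])
    return protos, labels
```

Since projecting a prototype onto a singleton set {x_i} simply yields x_i, each update moves the prototype to a distance-weighted combination of its members. For real use, prefer the `clustering` function from the released package described next.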

Experimental Results

The authors examined the performance of the POCS-based clustering algorithm on some public benchmark datasets from the website Clustering basic benchmark. The descriptions of these datasets are summarized in the table below.

Image from paper.

In the paper, the authors compared the performance of the POCS-based clustering algorithm against other conventional clustering methods, including the K-Means and Fuzzy C-Means algorithms. The evaluations in terms of execution time and clustering error are summarized in the following tables.

Image from paper.
Image from paper.

The visual clustering results are also illustrated in the following figure.

Image from paper.

For more details, you can drop by the original paper here.

Example Code

Let’s play with this algorithm on a very simple dataset. For the sake of simplicity, one can install the released package of the algorithm using this command:

pip install pocs-based-clustering

First, let’s import several necessary packages and create a simple dataset with 5000 data points centered around 10 clusters:

# Import packages
import time
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from pocs_based_clustering.tools import clustering


# Generate a simple dataset
num_clusters = 10
X, y = make_blobs(n_samples=5000, centers=num_clusters,
                  cluster_std=0.5, random_state=0)

plt.figure(figsize=(8,8))
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.show()
Image by author.

Now, perform clustering using the built-in function and display the result:

# POCS-based clustering algorithm
centroids, labels = clustering(X, num_clusters, 100)

# Display results
plt.figure(figsize=(8,8))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=100, c='red')
plt.show()
Image by author.

Conclusions

In this post, I have briefly reviewed a simple yet effective clustering technique based on the method of Projection onto Convex Sets (POCS), called the POCS-based clustering algorithm. The algorithm exploits the convergence property of POCS for clustering tasks and has realized feasible improvements to some extent. Its effectiveness has been validated on several benchmark datasets. The original paper can be found on arXiv (preprint) or IEEE Xplore (published paper). The code has also been released on GitHub.

You are welcome to visit my Facebook page, where I share things regarding Machine Learning: Diving Into Machine Learning. Further notable posts from me can also be found there.

Thanks for reading!
