
Use this clustering method if you have many outliers

The k-medians variation for robust outcomes

Photo by fabio on Unsplash

If you have ever wanted to segment your data into groups, you have probably tried the famous k-means algorithm. Its simplicity makes it widely used, but that simplicity also comes with several drawbacks. One of them is its sensitivity to outliers, since it uses the classic Euclidean distance as its dissimilarity metric.

Unfortunately, real-world data sets often come with many outliers that you might not be able to remove completely during the data cleanup phase. If you have run into this problem, I want to introduce you to the k-medians algorithm. By using the median instead of the mean and a more robust dissimilarity metric, it is much less sensitive to outliers.

In this article, I will show you the following:

The intuition behind k-medians
An implementation from scratch in R
A test on the iris data set, including a comparison to k-means

As usual, you can also find all the code for this article on my GitHub.


k-medians intuition

k-medians tries to alleviate the sensitivity of k-means to outliers by choosing a different dissimilarity metric. Instead of the Euclidean distance, we typically use the absolute difference, which is also called the L1 norm or the Manhattan or taxicab distance (because it corresponds to the distance a taxi has to drive to reach its target in a rectangular grid of blocks). This is much less sensitive to outliers, because each outlier only contributes its actual distance to the center, instead of the square of that distance, as is the case for the Euclidean distance:

The Manhattan metric between two points p and q: for each dimension, calculate the absolute difference between the points and sum over them. This is also called the L1 norm. Image by author.
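For illustration, this is how the metric could be computed in R (a minimal sketch; the function name manhattan_dist is just my own choice):

# L1 (Manhattan) distance: sum of absolute differences per dimension
manhattan_dist <- function(p, q) {
  sum(abs(p - q))
}

manhattan_dist(c(1, 2), c(4, 6))  # |1 - 4| + |2 - 6| = 7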

But one could also use other metrics here, if they are more appropriate, such as the Kullback-Leibler divergence to compare distributions.

To make it even more reliable, we also choose the median instead of the mean for the centers. So finally we need to optimize the following problem:

Mathematical formulation of the optimization problem of k-medians: find a set of clusters C that minimizes the absolute difference of each point to its cluster center Ci. Image by author.
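Expressed in code, the quantity we try to minimize could look roughly like this (a minimal sketch; kmedians_cost and its arguments are illustrative names, not part of any package):

# Total k-medians cost: sum of L1 distances of each point to its assigned center
# X: numeric matrix (rows = points), centers: matrix (rows = cluster centers),
# assignment: integer vector mapping each point to a row of centers
kmedians_cost <- function(X, centers, assignment) {
  sum(abs(X - centers[assignment, , drop = FALSE]))
}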

The approach of k-medians is very similar to k-means; it is again Lloyd’s algorithm. To summarize it briefly:

Input parameters k (number of clusters) and n_iter (number of iterations)
Randomly initialize k objects in the data set as centers
Do n_iter times:
  Assign each object to its closest center
  Calculate the new centers

You can find a much more detailed explanation of Lloyd’s algorithm in my post on k-means:

A deep dive into k-means


Implementation from scratch in R

If we look at the programmatic implementation, we notice that it is not as ubiquitously available as k-means. In R, for example, there is no k-medians function in the stats package. So let’s code it ourselves:
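A minimal sketch of what such an implementation could look like (the function name kmedians and its arguments are my own choice, not a standard library API):

# k-medians via Lloyd's algorithm, a minimal sketch
# X: numeric matrix (rows = observations), k: number of clusters, n_iter: number of iterations
kmedians <- function(X, k, n_iter = 10) {
  X <- as.matrix(X)
  # Randomly pick k observations of the data set as initial centers
  centers <- X[sample(nrow(X), k), , drop = FALSE]

  for (iter in seq_len(n_iter)) {
    # Assignment step: L1 (Manhattan) distance of every point to every center
    d <- apply(centers, 1, function(ctr) rowSums(abs(sweep(X, 2, ctr))))
    clusters <- apply(d, 1, which.min)

    # Update step: the new center is the per-dimension median of the assigned points
    for (i in seq_len(k)) {
      if (any(clusters == i)) {
        centers[i, ] <- apply(X[clusters == i, , drop = FALSE], 2, median)
      }
    }
  }

  list(centers = centers, clusters = clusters)
}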


Testing it on the iris data set

Next, let’s see how our function performs on the common iris data set. We will also compare it to the base R k-means implementation to see where they might differ:
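The comparison could be set up roughly like this (assuming the kmedians() sketch from above; cluster labels are arbitrary, so only the grouping itself is comparable):

set.seed(42)

# Cluster the four numeric iris columns into 3 groups with both methods
X <- as.matrix(iris[, 1:4])
km   <- kmeans(X, centers = 3)
kmed <- kmedians(X, k = 3, n_iter = 20)

# Cross-tabulate both clusterings against the species labels
table(iris$Species, km$cluster)
table(iris$Species, kmed$clusters)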

Comparison of k-medians to the ground truth and to k-means. Both clustering algorithms tend to find the correct clusters, with only minimal differences observed.

For this data set we only observe minor differences between k-medians and k-means, but it also does not contain too many outliers to begin with.


Summary

As you can see, k-medians is very similar to k-means; we only use a different distance function and the median instead of the mean.

One common misconception about k-medians is that the medians returned as centers are always actual points in the data set. This is easily seen to be false. Consider the following example cluster, consisting of 5 points:

Example of why the center of a cluster does not need to be an actual data point for k-medians. Consider a cluster consisting of the above 5 points. Because the median is calculated for each dimension separately in k-medians, the medians would be x = 3 and y = 3. But there exists no point (3, 3) in the data set. Image by author.

Because the median is calculated for each dimension separately in k-medians, the medians would be x = 3, and y = 3. But there exists no point (3, 3) in the data set.
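You can check this behaviour yourself with a few lines of R (the five points below are my own illustrative values, chosen so that the per-dimension medians are 3 and 3; they are not the points from the figure):

# Five points whose per-dimension medians form a point that is not in the data
pts <- matrix(c(1, 2,
                2, 4,
                3, 1,
                4, 3,
                5, 5), ncol = 2, byrow = TRUE)

apply(pts, 2, median)  # returns 3 3, but (3, 3) is not one of the five points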

Finally, you could of course couple k-medians with an improved initialization such as k-means++ to make it even more robust. You can find the details on how to do that in my article here:

Try this simple trick to improve your clustering

See you next time, when we will discuss the most advanced variation of k-means, partitioning around medoids, concluding this mini-series.

