
During my work as a Data Scientist, I have often come across problems where simple algorithms do not suffice because they get stuck in a local optimum. This leads to a lot of frustration during development: you first think your approach is working, only to discover later that it doesn’t do so consistently, or not for all of your data sets. In this final article of my mini-series on k-means and its variants, I will talk about the k-medoids algorithm, also commonly called partitioning around medoids (PAM). It has the beauty of being essentially deterministic and of finding very good solutions reliably. This comes at the price of a higher computational cost, but if your data set is not extremely large, it is still a good candidate to try out if you need reliable results. As usual, you can also find all the code for this article on my GitHub.
Intuition of k-medoids
As with k-medians, k-medoids also commonly uses the Manhattan metric, but the centers are now always actual points in the data set. Instead of centroids, we now determine medoids: the representative point of each cluster is the cluster member with the smallest total distance to all other members. This increases the explainability of the approach, as the representative point can always be retrieved from the data. It is often confused with k-medians (which you can check out in the article below), where the center points do not need to be actual objects.
Consider the following example cluster, consisting of 5 points in 2 dimensions:

Because the median is calculated for each dimension separately in k-medians, the medians would be x = 3 and y = 3. But there is no point (3, 3) in the data set, so the k-medians center is not an actual data object, whereas the medoid always is.
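To make this concrete, here is a minimal R sketch; the five points are hypothetical stand-ins with the same per-dimension medians, not necessarily the exact points from the figure:

```r
# Hypothetical 5-point cluster in 2 dimensions (stand-in for the figure above)
cluster <- matrix(c(1, 5,
                    2, 3,
                    3, 1,
                    4, 4,
                    5, 2), ncol = 2, byrow = TRUE)

# k-medians: the per-dimension medians give (3, 3), which is not a point in the data
apply(cluster, 2, median)

# k-medoids: the medoid is the actual point with the smallest summed
# Manhattan distance to all other points in the cluster
d <- as.matrix(dist(cluster, method = "manhattan"))
cluster[which.min(rowSums(d)), ]
```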
Multiple algorithms for k-medoids exist; the most common ones are again a Lloyd-style algorithm (also called Voronoi iteration) and true partitioning around medoids (PAM). Unfortunately, the Lloyd-style algorithm is often also called PAM, but this is not accurate, as the BUILD phase of PAM (as we will see later) is very different from Lloyd's random initialization. The BUILD phase of true PAM is the crucial step for the success of this algorithm, which is also why Lloyd-style k-medoids usually arrives at worse results than PAM.
k-medoids Lloyd style
To start off easy, let us first implement k-medoids in Lloyd style and later build upon it for true PAM. As usual, we first initialize the centers randomly, but the update of the centers is now fundamentally different.
The update step is now called the SWAP phase. As the name suggests, we consider swapping the current medoid with every other point in its cluster. For each of these candidate swaps, we calculate the total cost, which is the sum of the distances of all points in the cluster to the new medoid. We then perform the best swap, i.e. the one that results in the lowest cost.
The algorithm terminates once the cost can no longer be decreased. Do note that this does not mean that we arrive at a global minimum. Because we only perform steps that decrease the cost, the algorithm has no way to escape a local minimum and reach the global one if it was not initialized within the global minimum "valley".
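A compact sketch of this Lloyd-style procedure could look as follows; the function and variable names are illustrative and not necessarily those used in my actual implementation:

```r
# Minimal sketch of Lloyd-style k-medoids with Manhattan distance
lloyd_kmedoids <- function(data, k, max_iter = 100) {
  d <- as.matrix(dist(data, method = "manhattan"))
  n <- nrow(data)
  medoids <- sample(n, k)                         # random initialization

  for (iter in seq_len(max_iter)) {
    # assignment step: each point goes to its closest medoid
    assignment <- apply(d[, medoids, drop = FALSE], 1, which.min)

    # SWAP phase: within each cluster, try every member as the new medoid
    new_medoids <- medoids
    for (j in seq_len(k)) {
      members <- which(assignment == j)
      if (length(members) == 0) next              # keep the old medoid if the cluster is empty
      costs <- colSums(d[members, members, drop = FALSE])
      new_medoids[j] <- members[which.min(costs)]
    }

    if (all(new_medoids == medoids)) break        # cost can no longer be decreased
    medoids <- new_medoids
  }
  list(medoids = medoids, assignment = assignment)
}
```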
Partitioning around medoids (PAM)
Finally, the PAM algorithm. As I already hinted before, it has a unique BUILD phase that ensures a very good initialization. The SWAP phase that follows works largely like the one we implemented for the Lloyd-style k-medoids, with a difference we will get to below.
During the BUILD phase, the first medoid is selected to be the point with the minimum cost, where the cost is the sum of the distances to all other points. The first medoid is therefore the most central point of the data set.
All further medoids are then selected iteratively. For every non-medoid, we calculate the total cost that would result from adding it as the next medoid (i.e. the sum of the distances of all non-medoids to their then-closest medoid), and select the candidate with the smallest total cost.
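A sketch of this BUILD phase, assuming a precomputed Manhattan distance matrix d (names again illustrative), could look like this:

```r
# Sketch of the PAM BUILD phase; d is a precomputed, symmetric distance matrix
pam_build <- function(d, k) {
  n <- nrow(d)
  # first medoid: the most central point, minimizing the sum of distances to all others
  medoids <- which.min(rowSums(d))

  # distance of every point to its currently closest medoid
  nearest <- d[, medoids]

  while (length(medoids) < k) {
    candidates <- setdiff(seq_len(n), medoids)
    # total cost if the candidate were added as the next medoid
    total_cost <- sapply(candidates, function(c) sum(pmin(nearest, d[, c])))
    best <- candidates[which.min(total_cost)]
    medoids <- c(medoids, best)
    nearest <- pmin(nearest, d[, best])
  }
  medoids
}
```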
To clarify that this is indeed the true PAM algorithm, you can check out the paper or book by the authors who originally invented it here.
As one immediately sees, this is computationally expensive. In our implementation we calculate all distances in each iteration; a less expensive solution would be to compute the distance matrix only once (and only one of its triangles, as it is symmetric) and then simply index into it as needed for the cost calculations.
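If you want to go down that route, the one-off computation only takes a couple of lines; the iris features and the point indices below are purely illustrative:

```r
# Compute all pairwise Manhattan distances once; dist() stores only one triangle,
# as.matrix() expands it so we can index rows and columns directly
d <- as.matrix(dist(iris[, 1:4], method = "manhattan"))

# e.g. the cost of point 7 as medoid of a cluster containing points 3, 7, 12 and 25
sum(d[c(3, 12, 25), 7])
```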
The advantage of this algorithm is that the exhaustive BUILD phase typically arrives at a very good clustering already. The SWAP phase that follows usually only runs for a few iterations before it converges. The authors even state that one could sometimes omit it entirely and still get a good partitioning.
There are also some differences in the SWAP phase between Lloyd-style k-medoids and true PAM: Lloyd only considers swaps within the same cluster, whereas PAM considers all current non-medoids for a potential swap, irrespective of whether they currently belong to the same cluster or not. This increases the search space for PAM and potentially enables it to find better solutions.
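A sketch of a single pass of this exhaustive SWAP phase, again assuming a precomputed distance matrix d and illustrative names:

```r
# One pass of the PAM SWAP phase: try replacing every current medoid with every
# current non-medoid, and keep the single swap with the lowest total cost
pam_swap_once <- function(d, medoids) {
  n <- nrow(d)
  current_cost <- sum(apply(d[, medoids, drop = FALSE], 1, min))
  best <- list(cost = current_cost, medoids = medoids)

  for (m in seq_along(medoids)) {
    for (candidate in setdiff(seq_len(n), medoids)) {
      trial <- medoids
      trial[m] <- candidate
      cost <- sum(apply(d[, trial, drop = FALSE], 1, min))
      if (cost < best$cost) best <- list(cost = cost, medoids = trial)
    }
  }
  best   # unchanged if no swap decreases the cost
}
```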
Another characteristic of PAM is that it is close to being deterministic, because it does not use random elements during initialization and always considers all points as possible next medoids. Since there can be ties between the costs of two candidate medoids, the algorithm is not 100% deterministic; the result depends on how these ties are resolved (e.g. randomly or by the order in which the points are presented).
Comparison between algorithms
Throughout the series we have implemented many different algorithms, so let's compare them a bit regarding runtime and outcome. Because we implemented everything in base R without taking advantage of vectorization, the runtimes will be significantly longer than those of optimized implementations written in C or FORTRAN.
Clustering outcome
Let’s start by visualizing the results. Of course the colors for the "same" cluster can differ between the different algorithms, because they do not know which cluster belongs to which species.

Most algorithms do find more or less correct clusters, with the exception of k-medians. We also see that the PAM algorithm does not actually perform any swaps at all, highlighting the strength of its BUILD phase! Also keep in mind, if you compare the number of iterations between PAM and Lloyd k-medoids, that PAM only performs a single swap per iteration, whereas Lloyd k-medoids performs one swap per current medoid; the total number of swaps for Lloyd is therefore k * iterations.
If you want to learn more about which objective metrics you can use to judge your clustering outcomes, check out the corresponding section in my article on k-means:
Runtime
Finally, let’s compare the runtimes of the different algorithms, and let’s also check how much faster the implementation in FORTRAN from R is:

As expected, PAM is the slowest algorithm, followed by Lloyd-style k-medoids. Since the other lines are very close on this scale, let's look at ratios instead:

Our vanilla k-means algorithm is 4000x slower than the k-means from base R! This demonstrates the drastic performance gain you can get by implementing an algorithm more efficiently and in a lower-level language, like C++. But our goal here was not efficiency, but understanding.
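If you want to measure such a ratio on your own machine, a rough sketch could look like this; my_kmeans is a hypothetical stand-in for our own base R implementation, and for calls this fast you would want to repeat the timings (e.g. with the microbenchmark package) to get stable numbers:

```r
# Rough timing comparison; my_kmeans() is a placeholder for our own base R
# implementation, with k = 3 clusters as in the iris example
data <- as.matrix(iris[, 1:4])

t_base <- system.time(kmeans(data, centers = 3))["elapsed"]
t_ours <- system.time(my_kmeans(data, k = 3))["elapsed"]

t_ours / t_base   # how many times slower our implementation is
```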
Summary
Congratulations if you made it this far. With PAM, you now know a very sophisticated clustering method that can be applied robustly to many data sets. Due to its high computational cost, it might not be suitable for very large data sets. If that is the case for you, check out algorithms designed for that scenario, such as CLARA or CLARANS. This post also concludes my mini-series on k-means and related clustering algorithms. Stay tuned for what comes next!