Speeding up your code (1): the example of the mean shift clustering in Poincaré ball space

Vincenzo Lavorini
Towards Data Science
6 min read · Aug 13, 2017


From this series:

  1. The example of the mean shift clustering in Poincaré ball space (this post)
  2. Vectorizing the loops with Numpy
  3. Batches and multithreading
  4. In-time compilation with Numba

Hello community!

This is the first of a series of posts where I will describe the steps I took to build a fast clustering algorithm for a particular, “hyperbolic” space, using Python and Numpy.

Because I want to show a real case of development, in this first post I will describe the goal I had to reach, with some detail about the environment where I had to accomplish it and the reasons why I was there, and then the first raw version of the code.

So let’s begin!

Why such a thing?

In May 2017, a paper by Maximilian Nickel and Douwe Kiela from Facebook AI Research was published in the world of Artificial Intelligence. The content of that paper promised to have a huge impact on the world of Representation Learning.

Representation Learning techniques, roughly speaking, aim to translate the “meaning” of objects (e.g. words) belonging to a collection (e.g. a book), taking into account the properties (e.g. semantics) that relate them, into a form that AI algorithms can then build upon. The better the objects are represented, the more effective the subsequent algorithms will be.

The cutting edge of these techniques uses “embeddings”: each object is represented by a point in a multi-dimensional space, and that point ideally “embeds” all the properties the object has with respect to the collection. As a consequence, objects with similar meanings will have embeddings that lie geometrically near each other. And that’s why I had to build a clustering algorithm.

This multi-dimensional space is usually of the “flat” type, i.e. it has the geometrical properties we see in our everyday space, the ones we studied at school. Instead, in the mentioned paper they use a particular, “curved” space: the Poincaré ball model. This is because such spaces are very good at representing hierarchies, so the latent hierarchies present in the collections are easily represented. And in fact, they show state-of-the-art results in many fields using spaces of much lower dimensionality.

What does this Poincaré ball look like? As an example, below are two representations of a 2-d ball (i.e. a disk). The arcs that you see are the so-called “geodesics”, i.e. the shortest paths between two points. That’s one of the most evident differences with respect to the “flat” Euclidean space, where the shortest path between two points is a straight line. Another big difference is that nothing can exist outside the circle: points which are at a very large distance from the center (“at infinity”) lie on the disk’s circumference. And so, the arcs that connect the points you see in the figure on the left all have the same length. Mind blowing, isn’t it?

Left: the arcs connecting the points all have the same length. Right: the animation of a rotation within the Poincaré disk. Credits: Wolfram.com

The goal

I had to build an algorithm which groups together points that are near each other in such a space. And I needed this algorithm to work without specifying the number of clusters a priori. I therefore chose the mean shift procedure.

This procedure shifts every point towards its nearest bulk of points, and a single parameter defines how far away that bulk may be. If the nearest bulk is too far, the point will form a lone cluster by itself.

Here is an animation of the result in a 2-d Poincaré disk, where the algorithm ends with 13 clusters:

Mean shift clustering in the Poincaré disk.

You can notice that points near the boundary of the circle, for which the relative distance seems to be small, don’t cluster together. On the contrary, points near the center of the circle, for which the relative distance seems to be bigger than the peripheral one, actually do cluster.

That’s an effect of the curved space we are in, the Poincaré ball: as I already mentioned, the farther you move from the center, the larger the distances actually are compared to what you see in the picture.

Much more could be said, but that goes beyond the scope of this post. If you want to know more about the mean shift procedure and the fancy Poincaré ball space model, just search the web: there are plenty of resources.

Implementation

Descriptively, to implement the mean shift procedure we have to substitute each point, P, with the weighted mean of all the other points. The weight applied to each point depends on its distance from the considered point P. And this procedure has to be repeated until all the points converge to their clusters.
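In formulas, a single step of this procedure (writing K for the weighting kernel, typically a Gaussian, and d for the distance) is the standard mean shift update:

$$x_i \leftarrow \frac{\sum_j K\big(d(x_i, x_j)\big)\, x_j}{\sum_j K\big(d(x_i, x_j)\big)}, \qquad K(d) = e^{-d^2 / 2\sigma^2}$$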

You can find several implementations of this algorithm on the web, in basically every programming language. But none of them will work in this space!

The reason is the way we calculate distances between points. All those algorithms are made to work in the “flat” Euclidean space, and so they compute the distances between points in the Euclidean way. In this specific space, instead, the distance is more complicated.

So, here is the basic code, written in Python and Numpy. I took this version from the wonderful course by Jeremy Howard:

The basic code
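In case the embedded snippet doesn’t render, a minimal sketch of that looped version could look like the following (the bandwidth value and the number of steps here are illustrative):

```python
import math
import numpy as np

def gaussian(d, bw):
    # Gaussian kernel: d is a distance (number or array), bw the bandwidth
    return np.exp(-0.5 * (d / bw)**2) / (bw * math.sqrt(2 * math.pi))

def meanshift(data, sigma=0.5, steps=10):
    X = np.copy(data)
    for _ in range(steps):
        for i, x in enumerate(X):
            # Euclidean distance from x to every point
            dists = np.sqrt(((x - X)**2).sum(axis=1))
            weights = gaussian(dists, sigma)
            # replace x with the weighted mean of all points
            X[i] = (np.expand_dims(weights, 1) * X).sum(axis=0) / weights.sum()
    return X
```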

This is a general mean shift implementation. For our case, the difference is in the way we define the distance between points:
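Again, as a sketch of what such a replacement might look like (the function name dist_poinc is the one referenced later in this post; its exact signature is my assumption, and the formula it implements is given just below):

```python
def dist_poinc(a, B):
    # Poincare ball distance between a single point `a`
    # and each row of the array of points `B`
    sq_norm_diff = ((a - B)**2).sum(axis=1)
    denom = (1 - (a**2).sum()) * (1 - (B**2).sum(axis=1))
    return np.arccosh(1 + 2 * sq_norm_diff / denom)
```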

As you can see, the broadcasting capabilities of Numpy help a lot in writing simple code. To get a grasp of broadcasting, look at the Gaussian function: it accepts two inputs (d, the distance, and bw, the bandwidth). Both of them can be numbers or n-dimensional arrays (vectors, matrices, tensors), and if their shapes are compatible Numpy will figure out how to combine them properly. In our example we feed the function a vector (dists) and a single number (sigma).
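For instance, reusing the gaussian function sketched above:

```python
dists = np.array([0.1, 0.5, 2.0])  # a vector of distances
sigma = 0.5                        # a single number as bandwidth
weights = gaussian(dists, sigma)   # sigma is broadcast against every entry
# weights is again a vector, with one weight per distance
```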

For completeness, the definition of the distance between two points u and v in the Poincaré ball model is:

$$d(u, v) = \operatorname{arccosh}\left(1 + 2\,\frac{\lVert u - v \rVert^2}{\big(1 - \lVert u \rVert^2\big)\big(1 - \lVert v \rVert^2\big)}\right)$$

So, is it done? Yes and no. It works: the animation you saw before was made with this algorithm. But it is slow: on my laptop (i7–4750HQ CPU @ 2.00GHz, 8GB RAM), clustering 1,200 points takes 14 seconds per step, and ten steps are needed for the algorithm to converge, i.e. well over two minutes in total.

So I needed to improve the code.

Speed up the code

Lots of good attempts can be made to improve it: using approximate techniques, random matrices, multithreading, GPUs, etc.

But we are programmers, and we are obliged to use more brain and less brute force. So the first thing I wanted to try was to vectorize two loops: the one within the dist_poinc function, and the one in the mean shift algorithm.

In the next post I will show the reasons and the results. But just as an appetizer, take a look at this plot:

Percentage of execution time of a vectorized product with respect to a looped one.

Yes, you read that right: the vectorized version of a simple operation executes in less than 1.3% of the time needed by the loop.
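If you want a quick taste of this effect yourself, a comparison along these lines will do (the array size here is illustrative, not the one behind the plot):

```python
import time
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# element-wise product with a plain Python loop
t0 = time.perf_counter()
looped = [x * y for x, y in zip(a, b)]
t_loop = time.perf_counter() - t0

# the same product, vectorized by Numpy
t0 = time.perf_counter()
vectorized = a * b
t_vec = time.perf_counter() - t0

print("vectorized time: {:.2f}% of the looped time".format(100 * t_vec / t_loop))
```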

Isn’t it worth a try?
