The world’s leading publication for data science, AI, and ML professionals.

Efficient Euclidean distance computation in pandas

A neat little trick

Photo by Markus Spiske on Unsplash
Photo by Markus Spiske on Unsplash

In Data Science, we often encountered problems where geography matters such as the classic house price prediction problem. In most cases, it never harms to use k-nearest neighbour (k-NN) or similar strategy to compute a locality based reference price as part of your feature engineering.

The most important hyperparameter in k-NN is the distance metric and the Euclidean distance is an obvious choice for geospatial problems. Also known as the "straight line" distance or the L² norm, it is calculated using this formula:

The problem with using k-NN for feature training is that in theory, it is an O(n²) operation: every data point needs to consider every other data point as a potential nearest neighbour. In the absence of specialized techniques like spatial indexing, we can do well speeding things up with some vectorization.


Let’s begin with a set of Geospatial data points:

We usually do not compute Euclidean distance directly from latitude and longitude. Instead, they are projected to a geographical appropriate coordinate system where x and y share the same unit. I will elaborate on this in a future post but just note that

One degree latitude is not the same distance as one degree longitude in most places on Earth. The discrepancy grows the further away you are from the equator.

A non-vectorized Euclidean distance computation looks something like this:

In the example above we compute Euclidean distances relative to the first data point. Because we are using pandas.Series.apply, we are looping over every element in data['xy']. If we were to repeat this for every data point, the function euclidean will be called n² times in series. We can be more efficient by vectorizing.

Computation is now vectorized. But it is not as readable and has many intermediate variables. Is there a cleaner way?


As it turns out, the trick for efficient Euclidean distance calculation lies in an inconspicuous NumPy function: numpy.absolute.

One oft overlooked feature of Python is that complex numbers are built-in primitives. Instead of expressing xy as two-element tuples, we can cast them into complex numbers.

Notice the data type has changed from object to complex128.

Unless you are someone trained in pure mathematics, you are probably unaware (like me) until now that complex numbers can have absolute values and that the absolute value corresponds to the Euclidean distance from origin. Applying this knowledge we can simplify our code to:


There is one final issue: complex numbers do not lend themselves to easy serialization if you need to persist your table. Fortunately, it is not too difficult to decompose a complex number back into its real and imaginary parts.

Hope you’ll find this useful.


Related Articles