
Optimising pairwise Euclidean distance calculations using Python

Exploring ways of calculating the distance in the hope of finding a high-performing solution for large data sets.

Photo by Mathew Schwartz on Unsplash

Euclidean distance is one of the most commonly used metrics, serving as a basis for many machine learning algorithms. However, when faced with very large data sets containing multiple features, the simple distance calculation becomes a source of headaches and memory errors.

Although I am aware that packages like SciPy provide a robust solution, I couldn’t resist exploring other ways of calculating the distance, in the hope of finding a high-performing approach for large data sets.

So, what is Euclidean distance?

We begin with a quick reminder of the formula, which is quite straightforward: given two vectors x and y, we take the square root of the sum of squared differences in their elements.
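Written out, for two n-dimensional vectors this is the standard definition:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}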

So, why do we get memory errors?

This question comes up a lot when dealing with extremely large data sets. Say we have 1,000 vectors for which we need to calculate pairwise distances. This results in an output matrix with 1,000,000 entries, meaning that for larger volumes of data you are very likely to run out of memory.
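As a rough back-of-the-envelope check (my own illustration, not figures from the article), the footprint of a dense float64 distance matrix grows quadratically with the number of vectors:

# Back-of-the-envelope footprint of a dense float64 distance matrix
# (8 bytes per entry); illustrative numbers, not benchmark results.
for n in (1_000, 10_127, 100_000):
    entries = n * n
    print(f"{n:>7} vectors -> {entries:>14,} entries, ~{entries * 8 / 1e9:.2f} GB")

At 100,000 vectors the full matrix alone already needs roughly 80 GB, well beyond typical local memory.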

All the computations happen in local memory, and Python does not give allocated memory back straightaway, which means you are bound by the specs of your machine. Working in cloud services can help to scale the memory accordingly, but in most cases you would still have to parallelise the computations.

Although the memory limitation is not going anywhere, it is desirable to have an optimised script.

On to the calculations…

Data set: credit card customers

For the task of testing the performance of different approaches to calculating the distance, I needed a fairly large data set. The data set is available on Kaggle and can be downloaded using the link below.

Credit Card customers

Some of the features in the data set aren’t so useful in this case, so we will be using a reduced set.

import pandas as pd

cc_customers = pd.read_csv('BankChurners.csv')
cc_customers = cc_customers[['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
                             'Dependent_count', 'Education_Level', 'Marital_Status',
                             'Income_Category', 'Card_Category', 'Months_on_book',
                             'Total_Relationship_Count', 'Months_Inactive_12_mon',
                             'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
                             'Total_Trans_Amt', 'Total_Trans_Ct']]

We have a mixed-type data set representing information on individual customers, with demographic and credit-card-related attributes.

Some of the features in the customer credit card data

Before we can use the data as an input, we need to ensure we transform categorical variables to numeric.

cat_col = ['Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status',
           'Income_Category', 'Card_Category']
for col in cat_col:
    cc_customers[col] = cc_customers[col].astype('category').cat.codes

When dealing with large data sets, feature transformation is quite an important aspect to consider: among other things, it can help to reduce the amount of memory used by the data frame. Let’s look at the memory breakdown for the data frame before and after the transformations take place.

Memory usage comparison before and after transforming variables.

Once we have transformed the categorical variables to numeric, we can see that the memory usage is reduced quite substantially.
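If you want to reproduce this check yourself, pandas can report the per-column footprint directly (a small illustration of mine; exact numbers depend on your pandas version):

print(cc_customers.memory_usage(deep=True))                              # per-column breakdown
print(f"Total: {cc_customers.memory_usage(deep=True).sum() / 1e6:.2f} MB")

The deep=True flag accounts for the Python object overhead of string columns, which is where most of the savings come from.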

Now that we are done with the basic transformations, we can return to our goal, which is calculating pairwise Euclidean distances while bearing in mind the speed of computation. We have 10,127 unique customers, which results in a 10,127 × 10,127 distance matrix.

Main script

To understand how the code scales with larger data sets, a for loop was introduced in which, at each iteration, we consider a larger random sample from the original data. We start with 10% of the data and at each step the sample increases by 10%; for the performance time of the code we take the average of 20 runs.

The code below was used for every approach, the only difference being the distance function.

import numpy as np
import time
from tqdm import tqdm_notebook as tqdm

input_data = cc_customers.drop('CLIENTNUM', axis=1)  # drop the customer ID

iter_times = []
for _ in tqdm(range(20)):                       # repeat the whole experiment 20 times
    times = []
    for r in np.arange(0.1, 1.1, 0.1):          # sample 10%, 20%, ..., 100% of the data
        data_reduced = input_data.sample(frac=r, random_state=1)
        tic = time.perf_counter()
        ###INSERT DISTANCE CALCULATIONS###
        toc = time.perf_counter()
        times.append(toc - tic)
    iter_times.append(times)

tot_time = [sum(i) / 20 for i in zip(*iter_times)]  # average time per sample size
data_size = [len(input_data.sample(frac=r, random_state=1)) for r in np.arange(0.1, 1.1, 0.1)]
for i, j in zip(tot_time, data_size):
    print(f"# of customers {j}: {i:0.4f} seconds")

Method 1: Python packages (SciPy and Sklearn)

Using Python packages might seem a trivial choice; however, since they usually provide quite good speed, they can serve as a good baseline.

from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import euclidean_distances
scipy_cdist = cdist(data_reduced, data_reduced, metric='euclidean')
sklearn_dist = euclidean_distances(data_reduced, data_reduced)

Quite interestingly, Sklearn euclidean_distances outperformed SciPy cdist, with the differences in time becoming more noticeable with larger data sets.

cdist vs. euclidean_distances

The difference in implementation is a likely reason for the better performance of the Sklearn package: it uses a vectorisation trick to compute the distances, which is more efficient, whereas a look at the source code for the cdist implementation shows that SciPy uses a double loop.
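For reference, the vectorisation trick rests on expanding the squared distance into terms that can be computed with matrix operations:

\lVert x - y \rVert^2 = \lVert x \rVert^2 + \lVert y \rVert^2 - 2\, x \cdot y

which is exactly the identity that Method 3 below implements by hand.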

Method 2: single for loop

Optimisation and for loops aren’t usually best friends! However, when it comes to pairwise distances, they can be difficult to avoid, unless you go down the vectorisation route (an implementation is presented later in the article).

# Without pre-allocating memory
arr = data_reduced.to_numpy()   # work with a NumPy array rather than the DataFrame
dist = []
for i in range(len(arr)):
    dist.append(((arr - arr[i])**2).sum(axis=1)**0.5)

# Pre-allocating memory
D = np.empty((len(arr), len(arr)))
for i in range(len(arr)):
    D[i, :] = ((arr - arr[i])**2).sum(axis=1)**0.5

We compared two approaches, with and without pre-allocating memory before calculating the distance.

It comes as no surprise that pre-allocating memory helped improve performance, though the time taken still exceeded that of the Sklearn implementation.

Despite the slower performance, in some cases it might still be preferable to use this approach, as it is capable of handling larger data sets without running out of memory.
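To illustrate that point, here is a minimal sketch (my own, not from the article) of how the row-by-row loop can be consumed without ever materialising the full matrix, for example to find each customer’s nearest neighbour:

import numpy as np

def nearest_neighbours(arr):
    """Return, for each row of arr, the index of its closest other row."""
    nn = np.empty(len(arr), dtype=int)
    for i in range(len(arr)):
        row = ((arr - arr[i])**2).sum(axis=1)**0.5  # one row of distances at a time
        row[i] = np.inf                             # ignore the zero self-distance
        nn[i] = row.argmin()
    return nn

Calling nearest_neighbours(data_reduced.to_numpy()) keeps only one row of distances in memory at a time, so the peak usage is O(n) instead of O(n²).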

Method 3: vectorised implementation

After reading a few research papers on this topic, I have to say I was very hopeful about the performance of this approach. Seeing the performance of Sklearn euclidean_distances boosted those hopes even higher…

import numpy as np

arr = data_reduced.to_numpy()
x = np.sum(arr**2, axis=1)[:, np.newaxis]    # squared norms as a column vector
y = np.sum(arr**2, axis=1)                   # squared norms as a row vector
xy = np.dot(arr, arr.T)                      # pairwise dot products
dist = np.sqrt(np.maximum(x + y - 2*xy, 0))  # clip tiny negatives from floating-point error

The approach comes quite close in time to the cdist implementation for smaller data samples; however, it doesn’t scale very well. For the largest data sample the time is almost the same as for the loop approach without pre-allocated memory. Unsurprisingly, it didn’t outperform euclidean_distances.

Wrap up

After testing multiple approaches to calculating pairwise Euclidean distances, we found that Sklearn euclidean_distances has the best performance, thanks to its vectorised implementation, which we also tried to reproduce using NumPy commands without much success in reducing computation time.

Although we have shown yet again that in most cases Python modules provide an optimal solution, sometimes one would still have to go with a different option, depending on the nature of the task.

