
Categorical Data, Jaccard’s Coefficient and Multiprocessing

Quickly find out which observations are most like your favorite

Photo by William Krause on Unsplash

An airport in Florida is closer to the Detroit airport than one in Hyderabad, a huge city in India, and we know that because we can measure the distance using latitude and longitude. But how do we say that one shopping basket’s contents are closer to another’s? Or that one forest is more similar to another in terms of the animals that live in them? We can treat these as comparisons between sets and measure the similarity (or dissimilarity) between them using Jaccard’s coefficient (we’ll use coefficient and similarity score interchangeably). For large datasets, this can be a big task, so we can use parallel processing to do it in much less time. See the full notebook on kaggle.


Quick Example to Set the Stage

When comparing two sets (which can be an array, a series, or even a vector of binary values), the numerator is the count of elements shared between the sets and the denominator is the count of elements in either set — the size of their union. So we can also say that this similarity score is the number of shared elements divided by the number of elements that could have been shared.
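As a minimal sketch of that definition (not from the article’s notebook), the coefficient can be computed directly on Python sets:

```python
def jaccard_similarity(a, b):
    """Jaccard's coefficient: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not (a | b):
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

basket_1 = {"milk", "bread", "eggs"}
basket_2 = {"milk", "eggs", "butter"}
print(jaccard_similarity(basket_1, basket_2))  # 2 shared / 4 in union = 0.5
```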

Let’s check out a simple example:

from sklearn.metrics import jaccard_score
from scipy.spatial.distance import jaccard

x = [[1, 1, 1], [1, 0, 1], [0, 0, 0]]

jaccard(x[0], x[1])        # 0.33 (1/3) — dissimilarity
jaccard_score(x[0], x[1])  # 0.67 (2/3) — similarity

The array x has three rows. The first row will be the observation we wish to compare to. Notice how the jaccard function from SciPy returns the proportion of elements not shared between the first two rows, while the jaccard_score function from scikit-learn returns the opposite: the proportion of elements shared between the first two rows. One shows the dissimilarity and the other shows the similarity. I personally prefer the similarity score offered in scikit-learn, but it’s important that you are aware of the difference.
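For binary vectors, the two functions are exact complements: the SciPy distance is one minus the scikit-learn score. A quick check (assuming both libraries are installed):

```python
import math

from scipy.spatial.distance import jaccard
from sklearn.metrics import jaccard_score

u = [1, 1, 1]
v = [1, 0, 1]

# dissimilarity + similarity = 1 for binary vectors
assert math.isclose(jaccard(u, v) + jaccard_score(u, v), 1.0)
```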

(Be aware, too, that some people make the distinction that positions where both sets have a 0 shouldn’t be included in the calculation at all. This makes sense under certain circumstances, such as sparse indicator data, where shared absences carry little information.)
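A small illustration of why excluding the 0/0 positions can matter (my own toy example): two sparse vectors can agree on almost every position simply because both are mostly zeros, yet share nothing.

```python
# Two sparse indicator vectors that agree mostly on absences (0/0).
u = [1, 0, 0, 0, 0, 0]
v = [0, 1, 0, 0, 0, 0]

# Plain matching accuracy counts the shared zeros and looks decent...
accuracy = sum(a == b for a, b in zip(u, v)) / len(u)  # 4/6 ≈ 0.67

# ...while Jaccard keeps only positions where at least one value is 1,
# so the similarity is 0: the two vectors share no 1s at all.
informative = [(a, b) for a, b in zip(u, v) if a or b]
jaccard_sim = sum(a and b for a, b in informative) / len(informative)  # 0/2 = 0.0

print(accuracy, jaccard_sim)
```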

Now that we’ve seen this metric under a simple case, let’s apply it to a larger dataset.


Measure Distance with Jaccard’s and Parallel Processing

import numpy as np
import pandas as pd

# Three blocks of 100k observations x 100 binary features,
# each block with a different probability of a feature being 1.
x0 = np.random.choice([0, 1], size=(100000, 100), p=[4./5, 1./5])
x1 = np.random.choice([0, 1], size=(100000, 100), p=[1./3, 2./3])
x2 = np.random.choice([0, 1], size=(100000, 100), p=[1./2, 1./2])

colnames = ['x_' + str(i) for i in range(100)]
X = pd.DataFrame(data=np.vstack([x0, x1, x2]), columns=colnames)

target = np.ones(100).astype(int)

Our target is one observation where all the features are set to 1. Imagine a basket that has bought every item available on your web store and you want to see which observations are closest to it. This is mainly for the purposes of the example, but you can see how this can extend to other use-cases.

A huge array of 300k observations is created with binary value data (1s and 0s) to stand in for indicator features or dummy variables. The first third has a probability of (1/5) of being 1, the second third a probability of (2/3), and the last third a probability of (1/2). Let’s see how many observations overlap with our target and by how much! But first, let’s make use of the multiprocessing package and create a partial function to compare several observations to the target in parallel (this is a huge time and memory saver).

from functools import partial
import multiprocessing as mp

partial_jaccard = partial(jaccard_score, target)

# On Windows and macOS (spawn start method), wrap the pool in
# `if __name__ == "__main__":` to avoid endless re-imports.
with mp.Pool() as pool:
    results = pool.map(partial_jaccard, X.values)

The above code takes almost a minute (~50 seconds). This is after multiprocessing and on 300k observations with 100 features. You’re likely to come across datasets with more features and more observations. Attempting to accomplish the above task in a loop caused my computer to crash entirely (blue screen/frowny face), but if you are brave then you should try on a subset of data and see how long it takes.
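One practical knob worth knowing about here is chunksize: pool.map with many small, cheap tasks spends a lot of its time on inter-process communication, and batching rows into larger chunks per worker can cut that overhead noticeably. A sketch on a smaller demo array (the size and chunksize value are illustrative, not from the original notebook):

```python
import multiprocessing as mp
from functools import partial

import numpy as np
from sklearn.metrics import jaccard_score

target = np.ones(100).astype(int)
X_small = np.random.choice([0, 1], size=(10_000, 100))  # smaller demo set

partial_jaccard = partial(jaccard_score, target)

# NOTE: on Windows/macOS this should run under `if __name__ == "__main__":`.
with mp.Pool() as pool:
    # chunksize batches many small tasks per worker, reducing IPC overhead.
    scores = pool.map(partial_jaccard, X_small, chunksize=256)

print(len(scores))  # one similarity score per row
```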

Below is the result. For the first third of the data (the one with a 1/5 probability of being a 1), there is a peak at a Jaccard similarity score of 0.2 (20%), and likewise for the other two thirds. This confirms that the comparisons made with our multiprocessing and partial function are working.
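Those peak locations aren’t a coincidence. Because the target has every feature set to 1, the union in the denominator is always the full feature count, so each row’s score reduces to its fraction of 1s — which is exactly the generating probability for each third. A small check of that identity (illustrative, not from the original notebook):

```python
import numpy as np
from sklearn.metrics import jaccard_score

rng = np.random.default_rng(0)
target = np.ones(100, dtype=int)
row = rng.choice([0, 1], size=100, p=[4 / 5, 1 / 5])

# With an all-ones target: intersection = the row's 1s, union = all 100
# positions, so the Jaccard score is simply the mean of the row.
assert np.isclose(jaccard_score(target, row), row.mean())
```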


Conclusion

When you have binary data (as is the case with indicator features or dummy variables) and you want some type of distance metric between your observations, consider Jaccard’s coefficient/similarity score. It’s fairly intuitive, but it takes a little extra work to make the measurement on lots of data. Consider this metric for a dashboard or a report. If you consider it for a clustering task, remember that making all pairwise comparisons is a huge job for your computer — it’s usually better to compute cluster centers and compare observations to those instead.


References

Jaccard index

sklearn.metrics.jaccard_score – scikit-learn 0.24.1 documentation

multiprocessing – Process-based parallelism – Python 3.9.1 documentation

GitHub – caseywhorton/medium-blog-code

