
User Similarity with Binary Data in Python

What to consider when choosing similarity measures for binary data.

Image by author.

User clustering and similarity computation have become increasingly relevant across today’s industries. Customers are asked to rate, e.g., specific products, and their ratings are then compared with other customers’ ratings in order to find similarities among them.

Most user clustering applications make use of scaled ratings, e.g. 0–5 stars, or rating scales of 1–10 like on IMDb. In these cases, we can easily apply measures like Euclidean Distance or Cosine Similarity to find how similar or different the users’ choices are. But what if we don’t have such ratings and are dealing with binary data instead?

In this article I will show you why you should be careful when using the Euclidean Distance measure on binary data, which measure to use instead for computing user similarity, and how to create a ranking of these users. I will be using Python 3 with SciPy and pandas.

import scipy.spatial
import pandas as pd

1 | Computing the appropriate similarity measure

Let’s assume we have three users: A, B and C. These users all filled out a multiple-choice survey regarding their favorite fruits.

Table 1: User choices of favorite fruits. Image by author.

The user choices can be interpreted as one-hot encoded vectors, where the ✔️ is replaced by a 1, and the ❌ by a 0.

user_choices = [[1, 1, 0, 0], 
                [1, 1, 1, 0], 
                [1, 0, 0, 0]]
df_choices = pd.DataFrame(user_choices, columns=['Apples', 
                          'Bananas', 'Pineapples', 'Kiwis'], 
                          index=(["User A", "User B", "User C"]))

Why are measures like Euclidean Distance or Cosine Similarity not appropriate for this dataset?

At a first look at Table 1 we would expect User A and User B to have the most similar taste, because they both chose "Apples" and "Bananas" as their favorite fruits.

However, computing the Euclidean Distance between the users gives us the following results as shown in Table 2:

euclidean = scipy.spatial.distance.cdist(df_choices, df_choices, 
                                         metric='euclidean')
user_distance = pd.DataFrame(euclidean,    
                             columns=df_choices.index.values,
                             index=df_choices.index.values)
Table 2: Euclidean Distance between users. Image by author.

Even though User A and User B chose more of the same fruits, Euclidean Distance returns the identical distance value of 1.00 from User A to both User B and User C. Why is that?

Euclidean Distance counts joint absences (i.e. both users having a 0 at the same position) as agreement. It therefore also considers User A and User C to be similar, because neither of them chose "Pineapples". Cosine Similarity, in contrast, ignores joint zeros, but it normalizes by vector length rather than measuring the overlap of the users’ selections, so it is not a set-based measure either. For some use cases, Euclidean Distance may still be an appropriate measure for binary data, but for our use case it gives misleading results.
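The identical values in Table 2 can be verified by hand. The following sketch recomputes the two distances in plain Python (no SciPy needed) to show where the 1.00 comes from:

```python
import math

# User choice vectors from Table 1 (Apples, Bananas, Pineapples, Kiwis)
user_a = [1, 1, 0, 0]
user_b = [1, 1, 1, 0]
user_c = [1, 0, 0, 0]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# A vs. B disagree only on "Pineapples"; A vs. C disagree only on "Bananas".
# Each pair differs at exactly one position, so both distances come out as 1.0.
print(euclidean(user_a, user_b))  # 1.0
print(euclidean(user_a, user_c))  # 1.0
```

The joint absence on "Kiwis" (and, for A vs. C, on "Pineapples") contributes zero to the sum, which is exactly how the two pairs end up tied.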

We do not want to assume users being similar based on selections they did not make.

For our aim, we should turn to a measure called Jaccard Distance.

Fig. 1: Jaccard Distance equation, d(u, v) = (TF + FT) / (TT + TF + FT). Image by author.

TT (True True) is the number of times both users chose the same fruit (i. e. both have 1 at the same position). TF (True False) and FT (False True) are the number of times only one of the users chose a fruit (i. e. a 1 for one user and a 0 for the other on the same position). We do not consider the cases where both users did not choose a fruit.
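We can verify this formula by hand before calling SciPy. The sketch below implements the ratio from Fig. 1 directly; for binary input, SciPy’s jaccard metric computes the same value:

```python
# User choice vectors from Table 1 (Apples, Bananas, Pineapples, Kiwis)
user_a = [1, 1, 0, 0]
user_b = [1, 1, 1, 0]
user_c = [1, 0, 0, 0]

def jaccard_distance(u, v):
    """Jaccard distance: disagreements divided by all positions where
    at least one user chose the fruit. Joint absences never enter the ratio."""
    tt = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)      # both chose
    tf_ft = sum(1 for x, y in zip(u, v) if x != y)              # only one chose
    return tf_ft / (tt + tf_ft)

# A vs. B: TT = 2 (Apples, Bananas), TF + FT = 1 (Pineapples) -> 1/3
# A vs. C: TT = 1 (Apples), TF + FT = 1 (Bananas) -> 1/2
print(jaccard_distance(user_a, user_b))  # 0.3333333333333333
print(jaccard_distance(user_a, user_c))  # 0.5
```

The joint absence on "Kiwis" is simply not counted, which is the behavior we were missing with Euclidean Distance.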

Now let’s look at how the Jaccard Distance rates the similarity of our users:

jaccard = scipy.spatial.distance.cdist(df_choices, df_choices,  
                                       metric='jaccard')
user_distance = pd.DataFrame(jaccard,
                             columns=df_choices.index.values,  
                             index=df_choices.index.values)
Table 3: Jaccard Distance between users. Image by author.

From the above table we can see that the Jaccard Distance indeed rates User B as more similar to User A than User C is, because its distance value is lower. This is the result we want for our use case.

2 | Ranking the users by similarity of their choices

To finalize our task, let’s rank the users by their similarity and export them as a Python dictionary.

# prepare a dictionary
user_rankings = {}
# iterate over the columns in the dataframe
for user in user_distance.columns:
    # extract the distances of the column ranked by smallest
    distance = user_distance[user].nsmallest(len(user_distance))

    # for each user, create a key in the dictionary and assign a  
    # list that contains a ranking of its most similar users
    data = {user : [i for i in distance.index if i!=user]}
    user_rankings.update(data)

The output dictionary will look as follows:

{'User A': ['User B', 'User C'],
 'User B': ['User A', 'User C'],
 'User C': ['User A', 'User B']}

User A’s choices of favorite fruits are more similar to User B’s than to User C’s. User B’s choices are more similar to User A’s than to User C’s. And User C’s choices are more similar to User A’s than to User B’s.


This article demonstrated how the Jaccard Distance can be an appropriate measure for computing similarities between binary vectors.

Nevertheless, always make sure to choose your similarity/distance measure wisely, depending on the type of your dataset and the goal you are trying to achieve.



