Familiarity With Coefficients Of Similarity

Jayesh Salvi
Towards Data Science
8 min read · Apr 5, 2019

If you have worked on a project involving a recommendation system or semantic segmentation of images, you must have come across similarity scores. Based on these scores, you predicted that one product is similar to another, or measured how closely a predicted segmentation matches the ground truth.

Similarity metrics are important because they are used by a number of data mining techniques to determine the similarity between items or objects for different purposes, such as clustering, anomaly detection, automatic categorization, and correlation analysis. This article will give you a brief idea about different similarity measures without going too deep into the technical details.

The main focus of this article is to introduce you to the following similarity metrics:

1. Simple matching coefficient (SMC)

2. Jaccard index

3. Euclidean distance

4. Cosine similarity

5. Centered or adjusted cosine similarity / Pearson’s correlation

Let’s start!

Suppose two users A and B have provided reviews of ten products, in the form of whether they liked each product or not. Let’s write it in vector form,

A = (P, P, P, P, P, N, N, N, N, N)

B = (P, P, P, N, N, N, N, N, N, N)

Where P means the user liked the product and N means that the user didn’t like the product.

SMC for users A & B is calculated as,

SMC = (M11 + M00) / (M11 + M00 + M10 + M01)

Where,

M11 (Both A & B liked the product) = 3

M00 (Both did not like the product) = 5

M10 (A liked the product but B did not) = 2

M01 (A did not like the product but B did) = 0

For users A & B, SMC is 8/10, i.e. 0.8. This tells us that users A & B agree 80% of the time. So if A likes a new product that B hasn’t seen, you can recommend it to B, since the two users are highly similar.
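Here is a minimal sketch of this calculation in Python, using the two vectors above:

```python
# Reviews by users A and B: P = liked, N = didn't like.
A = ["P", "P", "P", "P", "P", "N", "N", "N", "N", "N"]
B = ["P", "P", "P", "N", "N", "N", "N", "N", "N", "N"]

matches = sum(a == b for a, b in zip(A, B))  # M11 + M00
smc = matches / len(A)                       # (M11 + M00) / all attributes
print(smc)  # 0.8
```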

This is a very simple and intuitive approach to building a recommendation system. Ideally, we would use ratings in the range 1 to 5 given by users A and B, from which we could derive the extent to which they liked each product, and then provide recommendations based on that.

The important thing to note here is that SMC is useful in cases where vectors have binary attributes such as (Positive/Negative), (True/False), or (Male/Female), and both classes carry equal information.

Can you think of any reason why we can’t use SMC in cases where the classes do not carry equal information? This question takes us to our next similarity metric.

Jaccard Index:

Let’s consider another situation. An insurance company wants to segment the claims filed by its customers based on some similarity. It has a database of claims with 100 attributes, on the basis of which the company decides whether a claim is fraudulent or not. The attributes can be the driving skill of a person, the car’s inspection record, purchase records, etc. Each attribute can generate a red flag for the claim. In most cases, only a few attributes generate a red flag; the other attributes rarely change.

In this case, the presence of a red flag provides more information to the insurance company than a green flag does (asymmetry).

If we use SMC, we will get scores that are biased by the attributes which rarely create red flags. In such cases, the Jaccard index is used. Let’s check that with numbers.

Consider three claims A, B & C with 20 binary attributes,

Claim A = (R,R,R,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G)

Claim B = (R,R,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G)

Claim C = (R,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G)

The Jaccard index for each pair is calculated as,

J = M11 / (M11 + M10 + M01)

Where,

M11- Number of attributes where both claims have the red flag,

M10, M01- Number of attributes where one claim has the red flag and the other has the green flag.

For claim A and B, Jaccard index is 2/3 i.e. 0.66 and SMC is 19/20 i.e. 0.95.

For claim A and C, Jaccard index is 1/3 i.e. 0.33 and SMC is 18/20 i.e. 0.90.

For claim B and C, Jaccard index is 1/2 i.e. 0.5 and SMC is 19/20 i.e. 0.95.

We see that the SMC scores of all three pairs are close to each other, while the Jaccard index shows a significant difference. This is the problem with SMC when the classes do not carry equal information. For example, in our case, the R class carries more information than G, but SMC treats them as equal.
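A minimal sketch comparing the two scores on the claims above:

```python
# Claims as binary attribute vectors: R = red flag, G = green flag.
claim_a = ["R"] * 3 + ["G"] * 17
claim_b = ["R"] * 2 + ["G"] * 18
claim_c = ["R"] * 1 + ["G"] * 19

def smc(x, y):
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):
    m11 = sum(a == b == "R" for a, b in zip(x, y))  # both red-flagged
    mismatch = sum(a != b for a, b in zip(x, y))    # M10 + M01
    return m11 / (m11 + mismatch)

for name, (x, y) in {"A-B": (claim_a, claim_b),
                     "A-C": (claim_a, claim_c),
                     "B-C": (claim_b, claim_c)}.items():
    print(name, round(jaccard(x, y), 2), smc(x, y))
# A-B 0.67 0.95 / A-C 0.33 0.9 / B-C 0.5 0.95
```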

The Jaccard index is also called the IoU (intersection over union) metric, which is used while evaluating semantic segmentation of an image.

The similarity index is calculated as the number of highlighted pixels in the intersection of the two masks divided by the number of highlighted pixels in their union.
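For instance, a minimal IoU sketch on two tiny hypothetical binary masks, where 1 marks a highlighted pixel:

```python
import numpy as np

# Predicted and ground-truth segmentation masks (1 = highlighted pixel).
pred = np.array([[0, 1, 1],
                 [0, 1, 1],
                 [0, 0, 0]])
truth = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [0, 1, 0]])

intersection = np.logical_and(pred, truth).sum()  # highlighted in both masks
union = np.logical_or(pred, truth).sum()          # highlighted in either mask
print(intersection / union)  # 3/5 = 0.6
```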

The Jaccard index can also be thought of as a generalized case of SMC. In cases where we have multiple symmetric classes (multiple classes having equal weights) we cannot use SMC, as it works only with binary symmetric classes. In that case, we can create a dummy variable for each class, which makes the individual dummy variables asymmetric: the presence of a class in its dummy variable provides more information than its absence. We can then use the Jaccard index to find the similarity score. Basically, we convert multiple symmetric classes into binary asymmetric dummy variables and then calculate the Jaccard index.
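A minimal sketch of this trick, with hypothetical categorical attributes:

```python
# Two items with three symmetric multi-class attributes each.
item1 = {"color": "red", "size": "small", "shape": "round"}
item2 = {"color": "red", "size": "large", "shape": "round"}

classes = {"color": ["red", "green", "blue"],
           "size": ["small", "medium", "large"],
           "shape": ["round", "square"]}

def one_hot(item):
    # One binary (asymmetric) dummy per (attribute, class) pair.
    return [int(item[attr] == c) for attr in classes for c in classes[attr]]

v1, v2 = one_hot(item1), one_hot(item2)
m11 = sum(a and b for a, b in zip(v1, v2))      # class present in both
mismatch = sum(a != b for a, b in zip(v1, v2))  # M10 + M01
print(m11 / (m11 + mismatch))  # Jaccard = 2/4 = 0.5
```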

Until now, we were discussing vectors with binary attributes. What if the attributes are continuous/numeric? This is where we turn to distance- and angle-based similarity scores.

Euclidean Distance:

Euclidean distance is more of a dissimilarity measure, like the Minkowski and Mahalanobis distances. I have included it here as it forms the basis of discussion for the upcoming metrics.

We know that points which are closer in space have a smaller distance between them than points which are far from each other. So a smaller distance relates to more similarity; this is the thought behind using Euclidean distance as a similarity metric. The Euclidean distance between vectors p and q is calculated as,

d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)

Consider three users A, B and C. They have provided ratings for a few movies; each rating can range from 1 to 5, and 0 means that the user hasn’t watched that movie.

User A = (0.5, 1, 1.5, 2, 2.5, 3, 3.5)

User B = (0.5, 1.5, 2.5, 0, 0, 0, 0)

User C = (0.5, 1, 1.5, 2, 0, 0, 0)

Using the above formula, we get the distance between A & B as 5.72, between B & C as 2.29, and between A & C as 5.24. If you look carefully, the A & C vectors have the same ratings for the first four movies, which tells us that both have a similar liking for those movies; but since C has not seen a few of the movies, we still get a significant distance between them.
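A quick check of these numbers with NumPy:

```python
import numpy as np

A = np.array([0.5, 1, 1.5, 2, 2.5, 3, 3.5])
B = np.array([0.5, 1.5, 2.5, 0, 0, 0, 0])
C = np.array([0.5, 1, 1.5, 2, 0, 0, 0])

# Euclidean distance is the L2 norm of the difference vector.
print(np.linalg.norm(A - B))  # ~5.72
print(np.linalg.norm(B - C))  # ~2.29
print(np.linalg.norm(A - C))  # ~5.24
```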

Since the above vectors have seven dimensions, we cannot visualize them here. Instead, let’s look at similar vectors on two axes, where each axis represents one movie. In the plot, the red vector represents user A, the green vector represents user B, and the blue vector represents user C. All the vectors have their tails at the origin.
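Since the original figure is not reproduced here, below is an illustrative reconstruction with made-up 2D vectors (A and C collinear, B not), which is all the argument needs:

```python
import matplotlib.pyplot as plt

# Hypothetical 2D rating vectors: red (A) and blue (C) are collinear.
vectors = {"A": ((2, 4), "red"), "B": ((3, 1), "green"), "C": ((1, 2), "blue")}
for label, ((x, y), color) in vectors.items():
    plt.quiver(0, 0, x, y, angles="xy", scale_units="xy", scale=1,
               color=color, label=f"User {label}")
plt.xlim(0, 5); plt.ylim(0, 5)
plt.xlabel("Movie 1 rating"); plt.ylabel("Movie 2 rating")
plt.legend(); plt.show()
```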

As per the above plot, we should expect the blue and red vectors to show high similarity, since they are collinear. But we get a significant distance between them when we calculate the Euclidean distance. What if, instead of using the distance between the vectors, we calculate the cosine of the angle between them? Whether the vectors are short or long, the angle between them remains the same. This takes us to our next similarity metric.

Cosine Similarity:

In our academics, we have come across the dot product and cross product of two vectors. The dot product of two vectors is the product of their magnitudes and the cosine of the angle between them, i.e.,

A · B = |A| |B| cos(θ)

Where |A| and |B| represent the lengths of vectors A and B, i.e. their distances from the origin.

A · B is also obtained by summing the element-wise products of vectors A & B, i.e.,

A · B = A₁B₁ + A₂B₂ + … + AₙBₙ

Cosine similarity is calculated as,

cos(θ) = (A · B) / (|A| |B|)

Since the ratings are positive, our vectors will always lie in the first quadrant. So we will get cosine similarity in the range [0, 1], with 1 being highly similar.
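A minimal sketch, reusing the illustrative 2D vectors from the plot above:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (|A| |B|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A, B, C = np.array([2, 4]), np.array([3, 1]), np.array([1, 2])
print(cosine_similarity(A, C))  # 1.0   -- collinear, maximally similar
print(cosine_similarity(A, B))  # ~0.71 -- larger angle, lower similarity
```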

We thought of using cosine similarity because we knew that the angle between the vectors remains the same irrespective of their lengths, but can we improve it further? Do you see any problem yet? Let’s see!

Centered or Adjusted Cosine Similarity:

Centered? What’s that? Up until now we were trying to find similarity between smaller apples and bigger apples. How’s that? We know that there are some people who will always be strict when giving ratings and then there are the generous ones (I belong to this category 😃). If we try to find similarity between them we will always get some bias because of this behavior.

This can be handled by subtracting the average rating a user gives from all the movie ratings of that user, thereby centering the ratings around the mean; this is nothing but mean-centering the ratings. Once all the vectors are centered, we calculate the cosine similarity. This is nothing but centered or adjusted cosine similarity! It is also known by the popular name Pearson’s correlation!

To prove the above point, I created two arrays, where the second array is obtained by adding a constant offset to the first array, keeping all the variation of the first array the same.

We got the correlation as 1 and the cosine similarity as 0.85, which shows that correlation performed better than plain cosine similarity. This is because of the mean-centering of the vectors.
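A minimal sketch of the same experiment (with different illustrative arrays, so the exact cosine value differs from the 0.85 above):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = a + 3.0  # same variation as a, plus a constant offset

# Plain cosine similarity is dragged away from 1 by the offset.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # ~0.98

# Centering each vector around its mean removes the offset entirely.
a_c, b_c = a - a.mean(), b - b.mean()
centered = np.dot(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))
print(centered)                 # 1.0
print(np.corrcoef(a, b)[0, 1])  # 1.0 -- matches Pearson's correlation
```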

There are a few other similarity metrics available too, but the metrics we discussed so far are the ones we encounter most of the time while working on a data science problem. Below are some reference links where you can read more about these metrics and their use cases.

  1. Cosine similarity for vector space model
  2. Comparison study of similarity and dissimilarity measures
  3. Evaluating image segmentation models
  4. User-Item similarity ResearchGate
  5. Pearson’s correlation & Salton’s cosine measure

Thanks for reading till the end. I hope you enjoyed it.

Happy learning and see you soon!
