
Cosine Similarity Intuition With Implementation in Python

Including hands-on calculations using the cosine similarity formula

Image by Pete Linforth from Pixabay

It’s often the case in Machine Learning that you need to compare data and analyze how similar they are.

For instance, in automatic text summarization, during training, you need to identify which sentences from the original document are similar to the sentences in the reference summary. The same is required to evaluate the quality of the summarization.

Or perhaps you need to categorize a document in, let’s say, Science, Parenting, or Technology.

One technique to use for working out the similarity between two texts is called Cosine Similarity.

Consider the base text and three other ones below. I’d like to measure how similar text1, text2 and text3 are to the base text.

Base text: Quantum computers encode information in 0s and 1s at the same time, until you "measure" it

Text1: A qubit stores "0 and 1 at the same time" in the same way how a car travelling north-west travels north and west at the same time

Text2: Considering how quickly the brain reorganizes, it's suggested that dreams are a defence mechanism

Text3: Computers will come with more processing power due to more advanced processors

How do you do that?

Texts Become Vectors

I'll keep the example simple so you can clearly visualize how it works.

First, you need to decide: similar in terms of what? For example, are two texts with the same word count similar?

Here are the two attributes, also known as features, that I use to measure similarity:

  • Word count (no prepositions or articles)
  • Words in the base text
Figure 1. Texts as vectors representation. By author.

The two attributes can be considered elements of a vector, as represented in Figure 1. As you can see, one feature goes on the X-axis and the other on the Y-axis.

The two vectors form an angle between them. This angle tells you how similar or different they are.

A 0-degree angle means they're identical.

A 90-degree angle means they're orthogonal: they have nothing in common.
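To make the angle concrete, here's a small sketch (using NumPy, with the base text and text1 vectors from the worked example later in the article) that recovers the angle between the two vectors:

```python
import numpy as np

# Vectors read off Figure 1 for the base text and text1:
# (words shared with the base text, word count).
base = np.array([11, 11])
text1 = np.array([4, 17])

# cos(theta) = A . B / (||A|| * ||B||); arccos turns it back into an angle.
cos_theta = np.dot(base, text1) / (np.linalg.norm(base) * np.linalg.norm(text1))
angle_deg = np.degrees(np.arccos(cos_theta))
print(round(angle_deg, 1))  # roughly a 32-degree angle
```

The smaller this angle, the more similar the two texts are under the chosen features.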

Cosine Similarity

Although knowing the angle tells you how similar the texts are, it's more convenient to have a value between 0 and 1, where 1 means the texts are identical.

That's where Cosine Similarity comes into the picture. Here's the formula to calculate it:

similarity = cos(θ) = (A · B) / (||A|| ||B||)

Figure 2. Source: https://en.wikipedia.org/wiki/Cosine_similarity

Here's some maths for you to have fun with.

A and B are the vector representations of the two texts. The numerator, A · B, is the dot product of the two vectors, and ||A|| and ||B|| are the magnitudes (lengths) of the vectors.

Let’s work out the calculations for one of the pairs – Base text and text1.

Let A be the base text vector = (11, 11)
Let B be the text1 vector = (4, 17)

Dot product of A and B = 11 × 4 + 11 × 17 (A's first element times B's first element, plus A's second element times B's second element)
= 44 + 187
= 231

The magnitude of a vector is found by squaring each element, adding the squares up, and taking the square root.

||A|| = √(11² + 11²) = √242 ≈ 15.5563
||B|| = √(4² + 17²) = √305 ≈ 17.4642
||A|| × ||B|| ≈ 15.5563 × 17.4642 ≈ 271.6783

Cosine similarity = 231 / 271.6783 ≈ 0.85 (85% similar!)

Python Code

The code is simple, especially as I’m using a built-in function to calculate the cosine similarity.
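Here's a minimal sketch of that idea, assuming SciPy's built-in cosine distance (the article's exact listing may differ) and the vectors from Figure 1. The base text and text1 vectors match the worked example above; the text2 vector is my own estimate, with zero words shared with the base text:

```python
import numpy as np
from scipy.spatial.distance import cosine  # cosine *distance*, not similarity

# Vectors: (words shared with the base text, word count).
# base and text1 come from the worked example; text2 is my estimate:
# no words in common with the base text, roughly 12 words long.
base = np.array([11, 11])
text1 = np.array([4, 17])
text2 = np.array([0, 12])

# SciPy's cosine() returns the distance, so similarity = 1 - distance.
sim1 = 1 - cosine(base, text1)
sim2 = 1 - cosine(base, text2)
print(f"text1: {sim1:.2f}")  # text1: 0.85
print(f"text2: {sim2:.2f}")  # text2: 0.71
```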

Here are the results: text1 comes out at about 85% similar to the base text, and text2 at about 70%.

Conclusion

As you can see in the results, text2 is less similar to the base text. Text2 is about dreams, and there are no words in common with the base text about quantum computers.

However, did you notice that text2 is still 70% similar!? That can't be right.

That's because the features I chose are not quite right, the word count in particular. Unless that's what I truly want: texts with similar word counts are similar, and the subject doesn't matter, which is a weird take on text comparison.
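To see how much the word-count feature skews things, here's a quick sketch that swaps in a simple bag-of-words representation instead (my own choice of features, with a small hand-picked stop-word list standing in for "no prepositions or articles"; none of this is from the article):

```python
import re
import numpy as np

# Hand-picked stop words, approximating the article's
# "no prepositions or articles" rule.
STOP = {"a", "an", "the", "in", "at", "of", "and", "that", "it", "it's"}

def tokens(text):
    """Lowercase words, punctuation stripped, stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP]

def bow_cosine(a, b):
    """Cosine similarity over simple bag-of-words count vectors."""
    vocab = sorted(set(tokens(a)) | set(tokens(b)))
    va = np.array([tokens(a).count(w) for w in vocab], float)
    vb = np.array([tokens(b).count(w) for w in vocab], float)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

base = 'Quantum computers encode information in 0s and 1s at the same time, until you "measure" it'
text2 = "Considering how quickly the brain reorganizes, it's suggested that dreams are a defence mechanism"
print(bow_cosine(base, text2))  # 0.0 — no content words in common
```

With word counts out of the picture, text2's similarity to the base text drops to zero, which matches intuition.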

In future articles, I’ll explore more about the techniques used in the real world for feature selection.

Thanks for reading.


For further reading:

The Other Approach to Solve Linear Regression

Unsupervised versus Supervised Machine Learning

