Cosine similarity: how it measures similarity, the maths behind it, and usage in Python

Varun
Towards Data Science
3 min read · Sep 27, 2020


Photo by Artem Kniaz on Unsplash

What is cosine similarity?

Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them.

Cosine similarity is one of the most widely used and powerful similarity measures in Data Science. It is used in many applications, such as finding similar documents in NLP, information retrieval, finding DNA sequences similar to a given sequence in bioinformatics, detecting plagiarism, and many more.
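To make the document-similarity use case concrete, here is a minimal sketch using scikit-learn's TfidfVectorizer together with cosine_similarity. The vectorizer choice and the toy documents are illustrative assumptions, not a prescription:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "cosine similarity measures angles between vectors",
]
# Turn each document into a TF-IDF vector (one row per document)
tfidf = TfidfVectorizer().fit_transform(docs)
# Pairwise cosine similarities between all documents
sims = cosine_similarity(tfidf)
```

The first two documents share most of their words, so their similarity score comes out higher than their similarity to the third.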

Cosine similarity is calculated as follows:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

Angle between two 2-D vectors A and B (Image by author)

Why does the cosine of the angle between A and B give us the similarity?

If you look at the cosine function, it is 1 at θ = 0° and −1 at θ = 180°. That means cosine is highest for two overlapping vectors and lowest for two exactly opposite vectors. You can use 1 − cosine as a distance.

Values of cosine at different angles (Image by author)
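The values described above are easy to check numerically; a quick sketch with NumPy:

```python
import numpy as np

# Cosine at a few angles (in degrees): +1 for aligned vectors,
# ~0 for perpendicular vectors, -1 for opposite directions
angles = np.array([0, 90, 180])
cos_values = np.cos(np.deg2rad(angles))
```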

How to calculate it in Python?

The numerator of the formula is the dot product of the two vectors, and the denominator is the product of the L2 norms of the two vectors. The dot product of two vectors is the sum of the element-wise products of the vectors, and the L2 norm is the square root of the sum of squares of a vector's elements.
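These two building blocks can be written from scratch in a few lines of plain Python (function names here are illustrative, not from any library):

```python
import math

def dot(u, v):
    # sum of element-wise products
    return sum(a * b for a, b in zip(u, v))

def l2_norm(u):
    # square root of the sum of squares
    return math.sqrt(sum(a * a for a in u))

def cosine_similarity_plain(u, v):
    # numerator: dot product; denominator: product of L2 norms
    return dot(u, v) / (l2_norm(u) * l2_norm(v))

cos_ab = cosine_similarity_plain([7, 3], [3, 7])
```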

We can either use the built-in functions in the NumPy library to calculate the dot product and L2 norms of the vectors and plug them into the formula, or directly use cosine_similarity from sklearn.metrics.pairwise. Consider two vectors A and B in 2-D; the following code calculates the cosine similarity:

import numpy as np
import matplotlib.pyplot as plt
# consider two vectors A and B in 2-D
A=np.array([7,3])
B=np.array([3,7])
ax = plt.axes()
ax.arrow(0.0, 0.0, A[0], A[1], head_width=0.4, head_length=0.5)
plt.annotate(f"A({A[0]},{A[1]})", xy=(A[0], A[1]),xytext=(A[0]+0.5, A[1]))
ax.arrow(0.0, 0.0, B[0], B[1], head_width=0.4, head_length=0.5)
plt.annotate(f"B({B[0]},{B[1]})", xy=(B[0], B[1]),xytext=(B[0]+0.5, B[1]))
plt.xlim(0,10)
plt.ylim(0,10)
plt.show()
plt.close()
# cosine similarity between A and B
cos_sim=np.dot(A,B)/(np.linalg.norm(A)*np.linalg.norm(B))
print (f"Cosine Similarity between A and B:{cos_sim}")
print (f"Cosine Distance between A and B:{1-cos_sim}")
Code output (Image by author)
# using sklearn to calculate cosine similarity
from sklearn.metrics.pairwise import cosine_similarity,cosine_distances
cos_sim=cosine_similarity(A.reshape(1,-1),B.reshape(1,-1))
print (f"Cosine Similarity between A and B:{cos_sim}")
print (f"Cosine Distance between A and B:{1-cos_sim}")
Code output (Image by author)
# using scipy, which calculates the cosine distance (1 - cosine similarity)
# note: scipy's distance.cosine expects 1-D arrays, so no reshape is needed
from scipy.spatial import distance
distance.cosine(A, B)
Code output (Image by author)

Proof of the formula

The cosine similarity formula can be proved using the Law of Cosines:

c² = a² + b² − 2ab cos(θ)

Law of cosines (Image by author)

Consider two vectors A and B in two dimensions, such that:

Two 2-D vectors (Image by author)

Using the Law of Cosines with a = ‖A‖, b = ‖B‖, and c = ‖A − B‖:

‖A − B‖² = ‖A‖² + ‖B‖² − 2‖A‖‖B‖ cos(θ)

Expanding the left-hand side gives ‖A‖² + ‖B‖² − 2(A · B), so the squared-norm terms cancel and we are left with

A · B = ‖A‖‖B‖ cos(θ),  i.e.  cos(θ) = (A · B) / (‖A‖ ‖B‖)

Cosine similarity using Law of cosines (Image by author)

You can prove the same in three dimensions, or any number of dimensions in general; the proof follows exactly the same steps as above.
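The proof can also be sanity-checked numerically: the cosine recovered from the Law of Cosines should match the dot-product formula for the same two vectors used earlier.

```python
import numpy as np

A = np.array([7.0, 3.0])
B = np.array([3.0, 7.0])

# Triangle sides: a = |A|, b = |B|, c = |A - B|
a, b, c = np.linalg.norm(A), np.linalg.norm(B), np.linalg.norm(A - B)

# Law of cosines solved for cos(theta)
cos_from_law = (a**2 + b**2 - c**2) / (2 * a * b)

# Dot-product formula for cos(theta)
cos_from_dot = A.dot(B) / (a * b)
```

Both expressions give the same value (up to floating-point rounding), as the derivation predicts.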

Summary

We saw how cosine similarity works, how to use it, and why it works. I hope this article helped in understanding the whole concept behind this powerful metric.
