Cosine Similarity Matrix using broadcasting in Python

Andrea Grianti
Towards Data Science
7 min read · Dec 7, 2020


Learn how to code an (almost) one-liner Python function that manually calculates the cosine similarity or correlation matrices used in many data science algorithms, using the broadcasting feature of the NumPy library.

Photo by mostafa rezaee on Unsplash

Do you think we can say that a professional MotoGP rider and the kid in the picture share the same passion for motorsports, even though they will never meet and are different in every other aspect of their lives? If you answered yes, then you have grasped the idea of cosine similarity and correlation.

Now suppose you work for a pay-TV channel and you have the results of a survey from two groups of subscribers. One possible analysis concerns the similarity of tastes between the two groups. For this type of analysis we want to select people who share similar behaviours regardless of "how much time" they spend watching TV. This is well captured by the concept of cosine similarity, which allows us to consider as "close" those observations that are aligned along the directions we care about, regardless of how different their magnitudes are.

So, as an example, if person A watches 10 hours of sport and 4 hours of movies, and person B watches 5 hours of sport and 2 hours of movies, the two are (in this case perfectly) aligned: regardless of how many hours of TV they watch in total, in proportion they share the same behaviour.

By contrast, if the objective is to analyse people who watch a similar number of hours, the Euclidean distance would be more appropriate, as it evaluates distance the way we are normally used to thinking about it.
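As a quick check of both claims, here is a minimal NumPy sketch (the variable names and data are mine, taken from the example above):

import numpy as np

a = np.array([10, 4])  # person A: hours of sport, hours of movies
b = np.array([5, 2])   # person B: half the hours, same proportions

# cosine similarity: 1.0, the two viewing profiles are perfectly aligned
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Euclidean distance: ~5.39, yet the two are far apart in hours watched
print(np.linalg.norm(a - b))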

The chart below makes this intuitive: comparing the two points A and B, the length of segment f = 10 (the Euclidean distance) says they are far apart, while the cosine of angle alpha = 0.9487 says they are well aligned. The cosine ranges between 1 and -1, where 1 means same direction and same orientation, and -1 means same direction but opposite orientation.

Simple example of how the cosine of alpha (0.94) shows good alignment between the two vectors OA and OB

If orientation is not important in our analysis, taking the absolute value of the cosine cancels this effect and treats +1 the same as -1.

In terms of formulas, cosine similarity is related to Pearson's correlation coefficient by almost the same formula: Pearson's correlation is the cosine similarity of vectors that have been centered on their means:

(image by author)
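For reference, the relationship in the image can be transcribed in LaTeX notation as follows (the symbol names are mine):

\text{cos\_sim}(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\ \sqrt{\sum_i y_i^2}} \qquad r(x, y) = \text{cos\_sim}(x - \bar{x},\ y - \bar{y})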

Cosine Similarity Matrix:

Generalizing the cosine similarity concept, comparing the many points of a data matrix A with themselves (the cosine similarity matrix of A vs. A) or with the points of a second data matrix B with the same number of dimensions (the cosine similarity matrix of A vs. B) is essentially the same problem.

So, to make things different from usual, we want to calculate the cosine similarity matrix of a group of points A vs. a second group of points B, both with the same number of variables (columns), like this:

(image by author)

Assuming the vectors to be compared are in the rows of A and B, the cosine similarity matrix appears as follows, where each cell is the cosine of the angle between a vector of A (rows) and a vector of B (columns):

(image by author)

If you look at the color pattern, you see that the "a" vectors repeat themselves along the rows, while the "b" vectors repeat themselves along the columns.

To calculate this matrix in (almost) one line of code, we need a way to use what we know of linear algebra for the numerator and the denominator, and then put it all together.

Cell Numerator:

If we keep matrix A fixed at (3,3), we take the 'dot' product with the transpose of B [=> (3,5)] and we get a (3,5) result. In Python this is easy with:

num=np.dot(A,B.T)
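As a quick shape check (a sketch with placeholder data of my own):

import numpy as np

A = np.ones((3, 3))  # 3 vectors, 3 variables each
B = np.ones((5, 3))  # 5 vectors, same 3 variables

num = np.dot(A, B.T)
print(num.shape)  # (3, 5): one numerator per pair of rows of A and B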

Cell Denominator:

It's a simple multiplication between two numbers, but first we have to calculate the lengths of the two vectors. Let's find a way to do that in a few lines of Python using the NumPy broadcasting operation, which is a smart way to solve this problem.

To calculate the lengths of vectors in A (and B) we should do this:

  1. square the elements of matrix A
  2. sum the values by row
  3. take the square root of the results of step 2

In the case above, where A=(3,3) and B=(5,3), the two lines below (remember that axis=1 means 'by row') return two arrays (not matrices!):

p1=np.sqrt(np.sum(A**2,axis=1)) # array with 3 elements (it's not a matrix)
p2=np.sqrt(np.sum(B**2,axis=1)) # array with 5 elements (it's not a matrix)

If we just multiply them together it doesn't work, because '*' works element by element and, as you can see, the shapes are different.
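To see the failure concretely (a minimal sketch with placeholder data of my own):

import numpy as np

A = np.ones((3, 3))
B = np.ones((5, 3))
p1 = np.sqrt(np.sum(A**2, axis=1))  # shape (3,)
p2 = np.sqrt(np.sum(B**2, axis=1))  # shape (5,)
p1 * p2  # ValueError: operands could not be broadcast together with shapes (3,) (5,)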

Because the '*' operation is element by element, we want two matrices: the first with vector p1 vertical, copied across as many columns as p2 has elements; the second with p2 horizontal, copied down as many rows as p1 has elements.

To do this with broadcasting we have to modify p1 so that it becomes fixed vertically (a1, a2, a3) but "elastic" in a second dimension, and p2 so that it becomes fixed horizontally and "elastic" in a second dimension.

(image by author)

To achieve this we leverage np.newaxis like this:

p1=np.sqrt(np.sum(A**2,axis=1))[:,np.newaxis]
p2=np.sqrt(np.sum(B**2,axis=1))[np.newaxis,:]

p1 can be read as: take the vector vertically (:) and add a column dimension; p2 can be read as: add a row dimension and take the vector horizontally. In theory the operation on p2 is not necessary, because p2 was already horizontal, and even if it stayed a one-dimensional array, multiplying a matrix (p1) by an array (p2) results in a matrix (if the shapes are compatible, of course). But I prefer the version above because it is cleaner and more robust to changes.

Now, if you look at p1 and p2 before and after, you will notice that p1 is now a (3,1) matrix and p2 a (1,5) matrix: two-dimensional, even though each still holds a single column or row of values.

If you now multiply them with p1*p2, the magic happens and the result is a 3x5 matrix, like the p1*p2 shown in grey in the picture above.
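Here is the same mechanism on its own, with two small vectors of my choosing:

import numpy as np

p1 = np.array([[1.], [2.], [3.]])      # shape (3, 1): a column, elastic across columns
p2 = np.array([[1., 2., 3., 4., 5.]])  # shape (1, 5): a row, elastic across rows

print((p1 * p2).shape)  # (3, 5): p1 is repeated along columns, p2 along rows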

So we can now finalize the (almost) one-liner for our cosine similarity matrix, complete with some example data for A and B:

import numpy as np

A=np.array([[2,2,3],[1,0,4],[6,9,7]])
B=np.array([[1,5,2],[6,6,4],[1,10,7],[5,8,2],[3,0,6]])

def csm(A,B):
    num=np.dot(A,B.T)                              # numerator: all dot products at once, shape (3,5)
    p1=np.sqrt(np.sum(A**2,axis=1))[:,np.newaxis]  # row lengths of A as a (3,1) column
    p2=np.sqrt(np.sum(B**2,axis=1))[np.newaxis,:]  # row lengths of B as a (1,5) row
    return num/(p1*p2)                             # denominator broadcasts to (3,5)

print(csm(A,B))
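A quick sanity check of my own: every vector has cosine similarity 1 with itself, so the diagonal of csm(A, A) must be all ones.

print(np.diag(csm(A, A)))  # [1. 1. 1.] (up to floating point)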

Correlation Matrix between A and B

If you want to modify the function to calculate the correlation matrix instead, the only difference is that you must subtract from the original matrices A and B their means by row, and here too you can leverage np.newaxis.

In this case you first calculate the vector of the row means as you usually would, but remember that the result is again a one-dimensional (horizontal) vector, so you cannot proceed with the code below. For B this raises a broadcasting error, and for a square matrix like A it even runs silently while subtracting along the wrong axis:

B-B.mean(axis=1)
A-A.mean(axis=1)

We must make the vector of means compatible with matrix A by making it vertical and letting it repeat across the width of A, and the same for B. For this we can use the broadcasting feature again, "verticalizing" the vector (using ':') and creating a new (elastic) dimension for the columns.

B=B-B.mean(axis=1)[:,np.newaxis]
A=A-A.mean(axis=1)[:,np.newaxis]
(image by author)
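A quick way to verify the centering (my own check): after the two lines above, every row mean should be numerically zero.

print(np.allclose(A.mean(axis=1), 0))  # True: the rows of A are now centered
print(np.allclose(B.mean(axis=1), 0))  # True: the rows of B are now centered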

Now we can modify our function to include a boolean: if it is True, the function calculates the correlation matrix between A and B; if it is False, it calculates the cosine similarity matrix:

import numpy as np

A=np.array([[1,2,3],[5,0,4],[6,9,7]])
B=np.array([[4,0,9],[1,5,4],[2,8,6],[3,2,7],[5,9,4]])

def csm(A,B,corr):
    if corr:
        B=B-B.mean(axis=1)[:,np.newaxis]  # center the rows of B on their means
        A=A-A.mean(axis=1)[:,np.newaxis]  # center the rows of A on their means
    num=np.dot(A,B.T)
    p1=np.sqrt(np.sum(A**2,axis=1))[:,np.newaxis]
    p2=np.sqrt(np.sum(B**2,axis=1))[np.newaxis,:]
    return num/(p1*p2)

print(csm(A,B,True))

Note that if you use this function to calculate the correlation matrix, the result matches the NumPy function np.corrcoef(A,B), with the difference that the NumPy function also calculates the correlations of A with A and of B with B, which may be redundant and force you to cut out the parts you don't need. For example, the correlation of A with B sits in the top-right submatrix, which you can slice out knowing the shapes of A and B and working with indices.
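Assuming A has 3 rows and B has 5, as above, the top-right block of the NumPy result should match our function (a sketch of the slicing just described):

full = np.corrcoef(A, B)  # shape (8, 8): all 8 rows of A and B against each other
a_vs_b = full[:A.shape[0], A.shape[0]:]  # the 3x5 block of A rows vs. B rows
print(np.allclose(a_vs_b, csm(A, B, True)))  # True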

Of course there are many ways to do what is described here, including other libraries and functions, but np.newaxis is quite smart, and with this example I hope I helped you in that … direction

… just Follow me

Hi everybody, my name is Andrea Grianti. I spent my professional life in IT and data warehousing, but I later became more and more passionate about data science and analytics topics.

Please consider following me, so that I can reach the number of followers Medium requires to consider me for their partner program.
