
Using Cosine Similarity to Build a Movie Recommendation System

A step-by-step guide to build a Python-based Movie Recommender System using Cosine Similarity

Image by Jade87 from Pixabay

Did you ever imagine that a simple formula you studied in high school would play a part in recommending you a movie on the basis of one you already like?

Well, here we are, using the Cosine Similarity (the dot product for normalized vectors) to build a Movie Recommender System!

What are Recommender Systems?

Recommender systems are an important class of Machine Learning algorithms that offer "relevant" suggestions to users. YouTube, Amazon, and Netflix all rely on recommendation systems, which suggest the next video or product based on your past activity (Content-based Filtering) or on the activities and preferences of other users similar to you (Collaborative Filtering). Likewise, Facebook uses a recommendation system to suggest people you may know offline.

Photo by Glen Carrie on Unsplash

Recommendation Systems work based on the similarity between either the content or the users who access the content.

There are several ways to measure the similarity between two items. A recommendation system stores these pairwise scores in a similarity matrix and uses it to recommend the next most similar product to the user.

In this article, we will build a machine learning algorithm that recommends movies based on a movie the user likes. The model will be based on Cosine Similarity.

Get the Dataset

The first step to build a Movie Recommendation system is getting the appropriate data. You may download the movies dataset from the web, or from the link below which contains a 22MB CSV file titled "movie_dataset.csv":

MahnoorJaved98/Movie-Recommendation-System

Let’s explore the dataset now!

Our CSV file contains a total of 4802 movies and 24 columns: index, budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count, cast, crew and director (sigh!).

Among all these different features, the ones we will use to measure similarity for the next recommendation are the following:

keywords, cast, genres & director.

A user who likes a horror movie will most probably like another horror movie. Some users may like seeing their favorite actors in the cast of the movie. Others may love movies directed by a particular person. Combining all of these aspects, our shortlisted 4 features are sufficient to train our recommendation algorithm.

Start Coding

Now, let us start with the coding. First things first, let’s import the libraries we need, as well as the CSV file of the movies’ dataset.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
df = pd.read_csv(r"...movie_dataset.csv")

We will import two important libraries for data analysis and manipulation: pandas and numpy. We will also import Scikit-learn’s CountVectorizer, which converts a collection of text documents into a matrix of term/token counts.

Lastly, we will import cosine_similarity from sklearn as the metric for our similarity matrix (which will be discussed in detail later).

We will read our CSV file into a dataframe df, which can then be accessed in the variable explorer of our Python IDE.

CSV loaded into a Dataframe (Image by Author)

Features List

We will make a list of the features that we will be using. As discussed above, we will only use the features most relevant to us, considering our problem at hand. Hence, our chosen features will be keywords, cast, genres & director.

Moreover, we will do a little bit of data preprocessing and replace any NaN values in these columns with an empty string, so that joining the columns later does not raise an error. This preprocessing is done in the for loop.

features = ['keywords', 'cast', 'genres', 'director']
for feature in features:
    df[feature] = df[feature].fillna('')
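
To see what the loop does, here is a minimal sketch on a made-up two-row frame (the column values are invented for illustration and are not taken from the dataset). It assumes only that missing values arrive as NaN or None, as they would after read_csv.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the second movie has no keywords or director.
toy = pd.DataFrame({
    "keywords": ["space war", np.nan],
    "director": ["Jane Doe", None],
})

for feature in ["keywords", "director"]:
    toy[feature] = toy[feature].fillna("")

# Every row now yields a plain string when the columns are concatenated.
combined = toy["keywords"] + " " + toy["director"]
print(combined.tolist())
```

The second row becomes a single space, which contributes no tokens later on, so movies with missing metadata simply match on fewer words.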

Combining Relevant Features into a Single Feature

Next, we will define a function called combined_features. The function will combine all our useful features (keywords, cast, genres & director) from their respective rows, and return a row with all the combined features in a single string.

def combined_features(row):
    return row['keywords']+" "+row['cast']+" "+row['genres']+" "+row['director']
df["combined_features"] = df.apply(combined_features, axis =1)

We will add a new column, combined_features, to our existing dataframe (df) and apply the above function to each row (axis = 1). The dataframe will now have an extra column at the end, in which each row holds the combined features of one movie.
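
To make the result concrete, here is a small sketch on a single made-up row (all values are illustrative, not from the dataset). It builds the same space-separated string with a row-wise join, which is one possible alternative to writing the function by hand.

```python
import pandas as pd

features = ["keywords", "cast", "genres", "director"]

# One hypothetical movie; every value here is invented for illustration.
toy = pd.DataFrame({
    "keywords": ["heist museum"],
    "cast": ["Ann Lee Bo Chen"],
    "genres": ["Crime Thriller"],
    "director": ["Sam Roe"],
})

# Join the four text columns of each row with single spaces.
toy["combined_features"] = toy[features].apply(" ".join, axis=1)
print(toy.loc[0, "combined_features"])
# heist museum Ann Lee Bo Chen Crime Thriller Sam Roe
```

Passing the features list keeps the column order in one place, so adding a fifth feature would only mean extending that list.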

The Combined Features column in our dataframe (Image by Author)

Extracting Features

Next, we will extract features from our data.

The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and images. We will use CountVectorizer’s fit_transform to count how many times each word occurs in each movie’s combined_features string, and we will print the resulting count_matrix as an array for better understanding.

cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combined_features"])
print("Count Matrix:", count_matrix.toarray())

Using the Cosine Similarity

We will use the Cosine Similarity from Sklearn, as the metric to compute the similarity between two movies.

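Cosine similarity is a metric used to measure how similar two items are. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. In general the value lies between -1 and 1, but because our count vectors contain no negative entries, the output here ranges from 0 to 1.

0 means no similarity (the two movies share no words), whereas 1 means that both vectors point in the same direction, which we treat as 100% similar.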
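
Written out, the standard definition looks as follows. The notation is an assumption made here for illustration: A and B are the count vectors of two movies, and n is the size of the vocabulary.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

Dividing by the two vector lengths is what makes the score independent of how many words a movie has, so only the proportion of shared words matters.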

Cosine Similarity (Image by Author)

Scikit-learn’s cosine similarity, or cosine kernel, computes similarity as the normalized dot product of the input samples X and Y. We will use sklearn’s cosine_similarity to find cos θ for every pair of vectors in the count matrix.

cosine_sim = cosine_similarity(count_matrix)
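
For a single pair of vectors, the same normalized dot product can be computed by hand. This is a minimal NumPy sketch; the helper name cosine and the two vectors are made up for illustration.

```python
import numpy as np

def cosine(a, b):
    """Normalized dot product of two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two invented count vectors over a three-word vocabulary.
a = np.array([2, 1, 0])
b = np.array([0, 1, 1])

# Dot product is 1; the lengths are sqrt(5) and sqrt(2), so the score is 1/sqrt(10).
print(round(cosine(a, b), 4))   # 0.3162
```

The helper divides by both norms, so it would fail on an all-zero vector; a movie whose four features are all empty is the case to watch for if you compute scores this way.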

The cosine_sim matrix is a numpy array holding the cosine similarity between every pair of movies. As you can see in the image below, the cosine similarity of movie 0 with movie 0 is 1; they are 100% similar (as they should be).

Similarly the cosine similarity between movie 0 and movie 1 is 0.105409 (the same score between movie 1 and movie 0 – order does not matter).

Movies 0 and 4 are more similar to each other (with a similarity score of 0.23094) than movies 0 and 3 (score = 0.0377426).

The diagonal of 1s confirms this: each movie ‘x’ is 100% similar to itself!
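
Both properties, the symmetry and the diagonal of 1s, can be checked on a small example. The count matrix below is made up; only cosine_similarity itself comes from the article's code.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three invented "movies" described by counts over three words.
counts = np.array([
    [2, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
])

sim = cosine_similarity(counts)   # shape (3, 3): one row and one column per movie

# Order does not matter: sim[i, j] equals sim[j, i].
print(np.allclose(sim, sim.T))            # True
# Every movie is fully similar to itself.
print(np.allclose(np.diag(sim), 1.0))     # True
```

np.allclose is used instead of == because the computed values can differ from the exact ones by tiny floating-point rounding errors.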

Cosine Similarity Matrix (Image by Author)

Content User likes

The next step is to take as input a movie that the user likes in the movie_user_likes variable.

Since we are building a content based filtering system, we need to know the users’ likes in order to predict a similar item.

movie_user_likes = "Dead Poets Society"
def get_index_from_title(title):
    return df[df.title == title]["index"].values[0]
movie_index = get_index_from_title(movie_user_likes)

Suppose I like the movie "Dead Poets Society". Next, I will build a function to get the index from the name of this movie. The index will be saved in the movie_index variable.
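
The lookup above raises an IndexError if the title is misspelled or missing. A slightly more defensive variant might look like the sketch below. The function name find_index, the case-insensitive match, and the toy frame are all assumptions made for illustration.

```python
import pandas as pd

# Invented stand-in for the real dataframe, with the same two columns.
toy = pd.DataFrame({"index": [0, 1], "title": ["Movie A", "Movie B"]})

def find_index(frame, title):
    """Return the 'index' value of the first row whose title matches, ignoring case."""
    matches = frame.loc[frame["title"].str.lower() == title.lower(), "index"]
    if matches.empty:
        raise ValueError(f"No movie titled {title!r} in the dataset")
    return int(matches.iloc[0])

print(find_index(toy, "movie b"))   # 1
```

If several movies share a title, this returns the first match, which is also what the original .values[0] does.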

Movie Index variable of the movie the User likes (Image by Author)

Generating the Similar Movies Matrix

Next, we will generate a list of similar movies. We take the row of cosine_sim at movie_index, the index of the movie given as input in movie_user_likes. The enumerate() function pairs each score in that row with its position, and list() turns the result into the list similar_movies of (index, similarity score) tuples.

similar_movies = list(enumerate(cosine_sim[movie_index]))
Similar Movies list (Image by Author)
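
A quick sketch with made-up scores shows what the pairing produces and why it is needed.

```python
# Invented similarity scores of one movie against a three-movie catalogue.
row = [0.2, 1.0, 0.5]

similar = list(enumerate(row))
print(similar)   # [(0, 0.2), (1, 1.0), (2, 0.5)]
```

Keeping the index next to each score matters because the list is about to be reordered; without the pairing, there would be no way to tell which movie a score belongs to.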

Sorting the Similar Movies List in Descending Order

The next step is to sort the movies in the list similar_movies by their similarity score. We use the parameter reverse=True since we want the list in descending order, with the most similar item at the top.

sorted_similar_movies = sorted(similar_movies, key=lambda x:x[1], reverse=True)

The sorted_similar_movies will be a list of all the movies sorted in descending order with respect to their similarity score with the input movie movie_user_likes.
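
Here is the same sort on a few made-up pairs. The last line is an optional extra that is not part of the original code: it drops the input movie so that it does not appear among its own recommendations.

```python
# Invented (index, score) pairs; index 1 plays the role of the movie the user likes.
pairs = [(0, 0.10), (1, 1.00), (2, 0.45), (3, 0.30)]
liked = 1

# Sort by the score (second element of each pair), highest first.
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
print(ranked)   # [(1, 1.0), (2, 0.45), (3, 0.3), (0, 0.1)]

# Optional: remove the liked movie itself from the ranking.
recommendations = [pair for pair in ranked if pair[0] != liked]
print(recommendations)   # [(2, 0.45), (3, 0.3), (0, 0.1)]
```

Python's sorted is stable, so movies with identical scores keep their original relative order.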

As can be seen in the image below, the most similar entry, with a similarity score of 0.9999999999999993, is at the very top with index number 2453. That is ‘Dead Poets Society’ itself, the movie we gave as input, which makes sense, right? (The score falls just short of 1 only because of floating-point rounding.)

Sorted Similar Movies List with the Similarity Score (Image by Author)

Printing the Similar Movies

Now, here comes the last part of the project, which is to print the names of the movies similar to the one we have given as input to the system through the movie_user_likes variable.

As seen in the sorted_similar_movies list, each movie is identified only by its index number. Printing index numbers would be of no use to us, so we will define a simple function that takes an index number and converts it into the movie title from the dataframe.

Index Number → Movie Title

Next we will call this function inside the for loop to print the first ‘x’ number of movies from the sorted_similar_movies.

In our case, the loop prints 16 titles from a pool of 4802 movies: the input movie itself, followed by its 15 most similar movies.

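def get_title_from_index(index):
    # Look up the title of the row whose dataframe index matches
    return df[df.index == index]["title"].values[0]

# Print the first 16 entries: the input movie itself, then 15 recommendations
for movie_idx, score in sorted_similar_movies[:16]:
    print(get_title_from_index(movie_idx))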

Running the Entire Code

Now comes the application. Use the steps above to code your own recommender system, and run the code by assigning a movie you like to the movie_user_likes variable.
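
As a compact recap, the steps can be wrapped into one function. This is a sketch under stated assumptions, not the article's exact code: the function name recommend, the parameter n, and the four-movie catalogue are all made up so that the example runs without the CSV file.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented mini catalogue standing in for movie_dataset.csv.
toy = pd.DataFrame({
    "title": ["Space Saga", "Star Quest", "Love Letters", "Galaxy Wars"],
    "combined_features": [
        "space battle alien scifi",
        "space alien explorer adventure",
        "romance letters paris drama",
        "space battle empire scifi",
    ],
})

def recommend(frame, title, n=2):
    """Return the n titles most similar to `title`, excluding the title itself."""
    counts = CountVectorizer().fit_transform(frame["combined_features"])
    sim = cosine_similarity(counts)
    # Row position of the liked movie (assumes the title exists in the frame).
    pos = int(np.flatnonzero((frame["title"] == title).to_numpy())[0])
    ranked = sorted(enumerate(sim[pos]), key=lambda pair: pair[1], reverse=True)
    return [frame["title"].iloc[i] for i, _ in ranked if i != pos][:n]

print(recommend(toy, "Space Saga"))   # ['Galaxy Wars', 'Star Quest']
```

"Galaxy Wars" shares three of its four words with "Space Saga" and "Star Quest" shares two, which gives scores of 0.75 and 0.5, while "Love Letters" shares none and scores 0.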

I have given "Dead Poets Society", and it prints the following similar movies:

IPython Console (Image by Author)

As can be seen, the most similar one is obviously the movie itself. The algorithm ranks "Much Ado About Nothing" as the next most similar movie! (will add it to my "To-watch list" 😄 )

That’s it for this article! The article provided a hands-on approach to building a Recommendation System from scratch, by coding it in any Python IDE.

Now, once the algorithm is built, it’s time to grab some popcorn, and watch the movie your system recommends!! 😁

Popcorn Time! (Photo by Georgia Vagim on Unsplash)
