Recommender System in Python – Part 2 (Content-Based System)

Dive into the Amazing World of Recommendation Systems and Build one on your own (part 2)

Photo by Pixabay from Pexels

Welcome to the second part of this 2-part series. This post focuses on developing a simple, content-based recommender system from the previously explored movie dataset. All of the data exploration and analysis was done in the first part, so here’s the link if you’ve missed it:

Recommender System in Python – Part 1 (Preparation and Analysis)

I’ve got to say, today’s post will be much shorter than the previous one. The main reason is that there’s not much to a recommender system (at this basic level at least). With that being said, today’s post will explain the intuition and logic behind a simple content-based recommender system (see Part 1 if you don’t know what content-based systems are), and you’ll see that no actual Machine Learning is happening here, only advanced (sort of) filtering.


Why should you read this post?

As with the previous post, there are two main benefits:

  1. You’ll dive into the world of recommender systems, and build your first (probably)
  2. You’ll find out that they are simpler than they seem

How is the post structured?

As I said earlier, this part will be much shorter than the previous one, which covered the process of data gathering, preparation and exploration. Reading that post is a prerequisite: if you don’t read it (or at least copy the code from it), your data won’t be in the same format as mine, and you won’t be able to proceed.

The contents of the post are as follows:

  1. Matrix creation
  2. Creating a function for fetching recommendations
  3. Getting and validating results

Before diving right in, this is what your dataset should look like:

Got it in this shape? You may proceed.


Matrix Creation

By matrix I mean a table with the following layout:

  • Every user ID as a row
  • Every movie title as a column
  • Rating each user gave to each movie as the intersection of row and column

This is easy to obtain with the Pandas pivot_table() function:
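Here is a minimal sketch of that step on a tiny made-up dataset. The column names userId and rating are assumptions about the Part 1 format, and the variable name movie_matrix is only illustrative, so adjust them to match your own DataFrame.

```python
import pandas as pd

# Toy stand-in for the Part 1 dataset. The column names 'userId' and
# 'rating' are assumptions; the ratings themselves are invented.
df = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "title": ["Toy Story", "Pulp Fiction", "Toy Story", "Heat", "Pulp Fiction"],
    "rating": [4.0, 5.0, 3.5, 4.0, 4.5],
})

# One row per user, one column per title, ratings at the intersections.
# Cells with no rating are filled with NaN.
movie_matrix = df.pivot_table(index="userId", columns="title", values="rating")

print(movie_matrix.head())
```

Note that pivot_table() averages duplicate entries by default, so a user who rated the same title twice would end up with the mean of the two ratings.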

This matrix is a regular Pandas DataFrame object, which means you can call .head() on it:

Yeah, a lot of NaNs, I know. Take a minute to think about why so many values are missing, and then continue with the reading.

Did you get it?

The reason is that not every person has seen and rated every movie. There are over 9700 movies in this table, so think about yourself. How many movies have you watched? Of those, how many did you publicly rate? Not so many, right?
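If you want to put a number on that sparsity, something like the following should work. It is a sketch on a toy matrix with invented ratings, where movie_matrix is a stand-in name for the pivoted table.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix; NaN means the user never rated that movie.
movie_matrix = pd.DataFrame(
    {
        "Heat": [np.nan, 4.0, np.nan],
        "Pulp Fiction": [5.0, np.nan, 4.5],
        "Toy Story": [4.0, 3.5, np.nan],
    },
    index=pd.Index([1, 2, 3], name="userId"),
)

# Share of empty cells across the whole matrix.
missing_share = movie_matrix.isna().to_numpy().mean()

# Number of movies each user actually rated.
ratings_per_user = movie_matrix.notna().sum(axis=1)

print(f"{missing_share:.0%} of the cells are empty")
print(ratings_per_user)
```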

You have the matrix now, and you can proceed to the next step.


Creating a Function for Fetching Recommendations

This is the meat and potatoes of the post. If this is your first-ever recommendation system, you’ll be surprised how easy it is to make one, at least at this beginner level.

Here’s the logic you’ll have to implement:

  • Calculate the correlation of the desired movie with every other movie (using the .corrwith() method)
  • Store movie titles with their correlations in a separate DataFrame
  • Merge that DataFrame with the original one, drop duplicates, and keep only the title, correlation coefficient, and numRatings columns
  • Sort by correlation in descending order (from largest to smallest correlation)
  • Filter out movies that have a low number of ratings (those movies are irrelevant because they were seen only by a handful of people)
  • Return top n correlated movies

It sounds like a lot of work, but in reality it’s only about 10 lines of code.
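The steps above can be sketched roughly as follows. This is one possible implementation, not necessarily the exact code behind the results in this post: the function name get_similar_movies, the parameter names, the default values of 50 and 5, and the userId and rating column names are all assumptions, and the toy data at the bottom is invented.

```python
import pandas as pd


def get_similar_movies(df, movie_matrix, title, min_ratings=50, n=5):
    """Return the n movies whose ratings correlate most with `title`.

    Hypothetical helper; the defaults are arbitrary placeholders.
    """
    # Correlate the chosen movie's ratings with every other column.
    correlations = movie_matrix.corrwith(movie_matrix[title])

    # Store titles with their correlation in a separate DataFrame,
    # dropping movies for which the correlation is undefined.
    corr_df = (
        correlations.dropna()
        .rename("correlation")
        .rename_axis("title")
        .reset_index()
    )

    # Merge with the rating counts from the original data, one row per title.
    counts = df[["title", "numRatings"]].drop_duplicates()
    merged = corr_df.merge(counts, on="title")

    # Sort from largest to smallest correlation, then drop rarely rated movies.
    ranked = merged.sort_values("correlation", ascending=False)
    popular = ranked[ranked["numRatings"] >= min_ratings]
    return popular.head(n)


# Invented ratings, keyed by title and then by user ID.
toy = {
    "Movie A": {1: 5.0, 2: 4.0, 3: 2.0, 4: 1.0},
    "Movie B": {1: 4.5, 2: 4.0, 3: 2.5, 4: 1.0},
    "Movie C": {1: 1.0, 2: 2.0, 3: 4.0, 4: 5.0},
    "Movie D": {1: 5.0, 2: 4.0},  # rated by only two users
}
rows = [
    (user, title, rating)
    for title, by_user in toy.items()
    for user, rating in by_user.items()
]
df = pd.DataFrame(rows, columns=["userId", "title", "rating"])
df["numRatings"] = df.groupby("title")["rating"].transform("count")
movie_matrix = df.pivot_table(index="userId", columns="title", values="rating")

top = get_similar_movies(df, movie_matrix, "Movie A", min_ratings=3, n=2)
print(top)
```

In the toy data, Movie D correlates perfectly with Movie A only because two users rated it, which is exactly the kind of noise the rating-count filter is meant to remove.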

As you can see, I’ve set the filter amount and number of movie recommendations to return as default function arguments, so they are easier for you to tweak.

And that is the whole logic you need to implement, in a nutshell.

Easy, right?


Getting and Validating Results

Getting recommendations is now as simple as a function call. The only parameter you need to pass in is the movie title, and it has to match the title in the dataset exactly. Even a small spelling mistake will break everything. Feel free to play around with the function to get around this.
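One way to soften the exact-match requirement is to resolve the user's input to the closest known title before calling the recommendation function. The sketch below uses difflib from the Python standard library; resolve_title is a hypothetical helper, and the cutoff of 0.6 is simply difflib's default, not a tuned value.

```python
import difflib


def resolve_title(query, known_titles, cutoff=0.6):
    """Return the known title closest to `query`, or None if nothing is close."""
    # Compare in lowercase so capitalization differences do not matter.
    lowered = {title.lower(): title for title in known_titles}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


# In practice you would pass the column labels of the rating matrix instead.
titles = ["Pulp Fiction", "Toy Story", "Finding Nemo"]

print(resolve_title("pulp fictoin", titles))  # prints "Pulp Fiction"
print(resolve_title("zzzz", titles))          # prints "None"
```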

Here is what I get when I fetch recommendations for Pulp Fiction:

The first one is obvious: Pulp Fiction is perfectly correlated with itself. But take a look at the ones after it.

Take a look at what recommendations IMDB gave for this movie. Cool, right?

Obtained from IMDB on 29th September 2019 - https://www.imdb.com/title/tt0110912/?ref_=tt_sims_tt

I can now do the same with the movie Toy Story:

It’s already clear that the recommendations are valid, but let’s confirm that just in case:

Obtained from IMDB on 29th September 2019 - https://www.imdb.com/title/tt0114709/?ref_=nv_sr3?ref=nv_sr_3

Monsters, Inc and Finding Nemo are among the first 6 recommendations on IMDB, and The Incredibles is on the next page.


Conclusion

Recommender systems are a lot of fun to build, and easy to validate (at least in this case). I hope that after reading this article you won’t look at them as black boxes, as they basically come down to doing some data manipulation with Python (or your language of choice).

Which movies did you test? Are you satisfied with the results? Please let me know.