Collaborative Filtering Using fast.ai

Understanding recommender systems

Ranjith Menon
Towards Data Science


Ever wondered how Netflix recommends the right content, tailor-made for each user? This deep dive focuses on recommender systems and embeddings (latent factors) to derive meaning from user-item interactions. If you haven’t worked with recommender systems before, this blog is a perfect start for you. The model described in this article uses the fast.ai library and assumes that you have a basic knowledge of the Python programming language as well as PyTorch. Let’s dive right in.

Photo by Sayan Ghosh on Unsplash

Data Analysis

To get started, let’s get ourselves comfortable with the MovieLens dataset. To get the data into your Jupyter notebook, just run the cells below. The fast.ai functions will do the rest for you.

Note: this article uses a subset of the MovieLens data containing 100,000 movie ratings.
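
A minimal sketch of what those cells might look like, using fast.ai’s untar_data helper to download the ML-100k subset (the column names are assumptions based on the dataset’s README):

```python
import pandas as pd
from fastai.collab import *
from fastai.tabular.all import *

# Download and extract the MovieLens 100k subset
path = untar_data(URLs.ML_100k)

# u.data is tab-separated with no header: user, movie, rating, timestamp
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()
```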

Sample ratings data

The ratings dataset has exactly what we need: a user column, a movie ID, the corresponding rating and a timestamp. Why build a model at all? Because the ratings are incomplete, and the recommender tries to fill in the missing ratings for movies that users have not watched before.

Understanding the latent factors

This is an important step for those who are new to recommender systems. Latent factors are randomly initialized parameters that exist for each user and each movie in the dataset. Don’t worry too much if you don’t get it at this point, but try to understand the role these latent factors play in collaborative filtering (explanation below).

Assume we know to what degree a user likes the category a movie falls into (its genre). Now assume we also know the same information about each of the movies (i.e. how closely the movie fits that category). To fill in a missing rating, simply multiply these two sets of latent factors together and sum the result. The answer is your predicted rating for that movie. Too much information? Let’s look at a small example to understand this concept.

Try running the blocks of code that follow this passage:

The assumption here is that the latent factors range between -1 and 1, where positive numbers indicate a strong match and negative numbers indicate a weak match. The categories in this example are “science-fiction”, “action” and “old movie”. With that said, the movie “the_force_awakens” has its latent factor 1 as 0.98 (indicating “how sci-fi” it is), latent factor 2 as 0.9 (indicating “how action” it is) and latent factor 3 as -0.9 (indicating “how old” it is).

Sci-Fi movie prediction
Romance movie prediction
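
Here is a rough sketch of those two predictions. Only the latent factors of “the_force_awakens” (0.98, 0.9, -0.9) come from the passage above; the user’s preferences and the factors for “A star is born” are made-up values purely for illustration:

```python
import numpy as np

# Latent factors, in the order [science-fiction, action, old movie]
the_force_awakens = np.array([0.98, 0.9, -0.9])   # values from the passage above
a_star_is_born    = np.array([-0.9, -0.8, 0.9])   # hypothetical values for illustration
user_prefs        = np.array([0.9, 0.8, -0.6])    # a hypothetical sci-fi/action fan

# Predicted rating = dot product of the user and movie latent factors
print((user_prefs * the_force_awakens).sum())   # high score: strong match
print((user_prefs * a_star_is_born).sum())      # low score: weak match
```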

What did we do here? We multiplied two latent vectors element-wise and summed up the result. If you want to use data science jargon, you could call this the “dot product”. Quite simple, right? This happens for every user and movie in the dataset, and that is the underlying principle of this model. From the example above, you can clearly see that I’m into sci-fi/action movies, so “the_force_awakens” (2015) is rated higher than “A star is born” (1937).

Important: As mentioned above, the rating predictions are simply the dot product of the latent factors. Once the dot products are computed, the model tries to minimize the loss by tweaking the latent factors using stochastic gradient descent (the details of which will not be covered in this article, though the model implementation below briefly explains its role).

That’s that — let’s get started with the actual code :)

Creating the DataLoaders

For the purpose of making this a fun exercise, let’s introduce the movie titles into our dataset and merge them with the ratings data. The table “u.item” consists of movie IDs mapped to titles, so let’s pull them in.
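
A sketch of what that merge might look like, assuming the path variable from the download cell earlier (u.item is pipe-separated and we only need its first two columns):

```python
# u.item maps movie IDs to titles; keep only those two columns
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0, 1), names=('movie', 'title'), header=None)

# Merge the titles into the ratings table on the shared 'movie' column
ratings = ratings.merge(movies)
ratings.head()
```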

Titles merged for better intuition

The DataLoaders object by default takes the first column as the user, the second column as the item and the third column as the ratings. Since our items are movies, we tell it to use the “title” column as the item and set a batch size of 64.
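
With fast.ai this is essentially a one-liner; something along these lines:

```python
# Build collaborative-filtering DataLoaders: user, item ('title'), rating
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
```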

PyTorch representation of collaborative filtering

The latent factors need to be represented one way or another for us to make some sense of the model. The PyTorch way of representing them is with matrices, created using the “torch.randn” function. PyTorch randomly initializes the user and movie latent factors based on the total number of users and movies, and then takes the dot product to arrive at ratings. We’ve established this already. But to compute that dot product, we have to look up the index of the movie in our movie latent factor matrix and the index of the user in the user matrix. This is not something deep learning models know how to do directly, which is what calls for embeddings (a very simple topic).

Before we move forward, feel free to check out how PyTorch creates the latent factors without embeddings.
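
Something along these lines, assuming the dls object from the previous step (the choice of 5 factors is just for illustration):

```python
# Number of distinct users and movies in the DataLoaders vocabulary
n_users  = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

# Randomly initialized latent factors, one row per user / per movie
# (torch comes in via the fastai star imports above)
user_factors  = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)
```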

Embeddings

As fancy as the term ‘embedding’ sounds, the underlying concept is straightforward. It is again just jargon for indexing into a matrix directly. To define the embeddings, let’s create a class with a function that defines the latent factors (embeddings) as well as the dot product of these factors to arrive at predicted ratings.
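
A sketch of what such a class might look like, built on fast.ai’s Module and Embedding helpers (the class name DotProduct and the n_factors argument are my own choices):

```python
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        # Block 1: user and movie latent factors as embedding layers
        self.user_factors  = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        # Block 2: look up each row's user/movie factors and take the dot product
        users  = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        return (users * movies).sum(dim=1)
```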

Quite simple, right? The first block creates the user and movie latent factors using embeddings, while the second block multiplies the two (dot product).

The key thing to understand here is that the model takes as input a tensor of shape (batch_size x 2), where the first column represents the user IDs and the second represents the movie IDs. The embedding layers are used to represent the matrices of user and movie latent factors. Again, explore the data a bit by yourself. Take a look at x and y individually, for example.

Model Learner

With the embeddings done, there is nothing left to do but run the model.
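
A sketch of the learner, assuming the DotProduct class above (50 factors, 5 epochs and a learning rate of 5e-3 are arbitrary choices, not tuned values):

```python
model = DotProduct(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
```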

Epochs & Loss — I

The results are quite decent. We could improve them further by using a sigmoid_range to ensure our predicted ratings are between 0 and 5.

Let’s try running the model now
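
A sketch of the same model with sigmoid_range added. The upper bound is set slightly above 5 because a sigmoid never quite reaches its maximum; that 5.5 is an assumption, and a common trick:

```python
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        # Squash the dot product into the target rating range
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)

learn = Learner(dls, DotProduct(n_users, n_movies, 50), loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
```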

Epochs & Loss — II (with sigmoid range)

Not a big difference. We can definitely do better than this. One obvious thing we can try is factoring in biases. If we have a single number for every user that gets added to the score, and do the same for every movie, we account for user and movie biases. So let’s go ahead and try this.
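
A sketch of what that might look like: one extra single-number embedding per user and per movie, added to the dot product before the sigmoid:

```python
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users, n_factors)
        self.user_bias     = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias    = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        res = (users * movies).sum(dim=1, keepdim=True)
        # A single learned number per user and per movie
        res += self.user_bias(x[:, 0]) + self.movie_bias(x[:, 1])
        return sigmoid_range(res, *self.y_range)

learn = Learner(dls, DotProductBias(n_users, n_movies, 50), loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
```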

Epochs & Loss — III (with biases)

Not great. Our results just got a bit worse. There’s one last thing we could try that should definitely improve our model’s results. Let’s introduce weight decay, or L2 regularization as we data scientists would call it. This parameter controls the sum of squared weights we add to the loss, which also reduces overfitting.
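
The only change needed is the wd argument when fitting; the value 0.1 is an assumption rather than a tuned choice:

```python
# Weight decay adds wd * (sum of squared weights) to the loss being minimized
learn = Learner(dls, DotProductBias(n_users, n_movies, 50), loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```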

Epochs & Loss — IV (with L2)

Much better. Our results have finally improved. Now let’s draw some visuals and insights from the recommendations.

Interpreting embeddings and biases

The model provides us with recommendations, but it’s quite intriguing to see what factors it has identified. Let’s start by looking at the biases. The code below gives us the movies with the lowest bias values.
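
A sketch of that lookup, assuming the DotProductBias model trained above (its movie biases live in learn.model.movie_bias):

```python
# Lowest-bias movies: disliked even by users whose latent factors match them
movie_bias = learn.model.movie_bias.weight.squeeze()
idxs = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in idxs]
```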

Movies with lowest bias

What does this mean? It is basically telling us that these movies may match certain users’ latent factors strongly and yet still end up being disliked. We could have just sorted the movies by average rating, but this point of view using bias is more interesting: it does not just tell us that a movie is the kind people do not enjoy as much, but that it sits well within their preferred categories and they still do not like it.

I quite agree with the model. Those are some bad movies (some of them unheard of, at least to me; Showgirls and Bio-Dome have internet ratings below 5).

Along the same lines, here is the code to get the movies with the highest biases.
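
Same idea, reusing the movie_bias tensor from the previous snippet but sorting in descending order:

```python
# Highest-bias movies: liked even by users whose latent factors don't match them
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]
```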

Movies with highest bias

Now, this is again something I’d have to agree with. That’s an insane list of movies. I’ve personally seen all 10, so the biases have been factored in quite well for the movies.

PCA for visualizing the movies

Interpreting the embeddings directly is not that straightforward, as the number of dimensions is usually high. But we can use PCA to extract the most important information and look at how the movies are scattered in a lower-dimensional space.
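
A sketch of such a plot using scikit-learn’s PCA. Restricting the projection to the 1,000 most-rated movies and labelling only the first 50 are choices made here for readability:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Take the learned factors of the 1,000 most-rated movies
g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.movie_factors.weight[top_idxs].cpu().detach().numpy()

# Project the latent factors onto their two main principal components
movie_pca = PCA(n_components=2).fit_transform(movie_w)

# Scatter the first 50 movies with their titles
X, Y = movie_pca[:50, 0], movie_pca[:50, 1]
plt.figure(figsize=(12, 12))
plt.scatter(X, Y)
for title, x, y in zip(top_movies[:50], X, Y):
    plt.text(x, y, title, fontsize=11)
plt.show()
```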

PCA Plot of movies in the latent space

This plot is quite meaningful. If you notice, we’ve got some science fiction movies like ‘The Empire Strikes Back’, ‘Return of the Jedi’, ‘Star Trek’ and ‘Indiana Jones’ clubbed near each other. And towards the left we’ve got some neo-noir movies like ‘The Godfather’ and ‘Fargo’. Those are a couple of examples of how closely similar movies are represented in the latent space.

That’s the end of this blog. To take a look at the notebook with the entire code and other machine learning algorithms, follow me on GitHub.
