“COLLABORATIVE FILTERING FROM SCRATCH”
Build a Collaborative Filtering Model for Movie Recommendation from Scratch
Welcome to the second part of the fifth episode of Fastdotai, where we will deal with Collaborative Filtering from Scratch, a technique widely used in recommendation systems. Before we start, I would like to thank Jeremy Howard and Rachel Thomas for their efforts to democratize AI.
To make the best out of this blog post series, feel free to explore every part of this series in the following order:
- Dog Vs Cat Image Classification
- Dog Breed Image Classification
- Multi-label Image Classification
- Time Series Analysis using Neural Network
- NLP- Sentiment Analysis on IMDB Movie Dataset
- Basics of Movie Recommendation System
- Collaborative Filtering from Scratch
- Collaborative Filtering using Neural Network
- Writing Philosophy like Nietzsche
- Performance of Different Neural Network on Cifar-10 dataset
- ML Model to detect the biggest object in an image Part-1
- ML Model to detect the biggest object in an image Part-2
Reason behind Netflix and Chill.
First of all, let's import all the required packages.
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.learner import *
from fastai.column_data import *
Set the path where
- Input data is stored.
- Temporary files will be stored. (Optional: used in Kaggle kernels.)
- Model weights will be stored. (Optional: used in Kaggle kernels.)
path='../input/'
tmp_path='/kaggle/working/tmp/'
models_path='/kaggle/working/models/'
- Read in the data.
ratings = pd.read_csv(path+'ratings.csv')
ratings.head()
# This contains the userId, the movieId the user watched, the timestamp, and the rating provided by the user.
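For reference, ratings.csv in the MovieLens dataset has four columns, so ratings.head() shows something along these lines (illustrative rows, not output from this run):

userId  movieId  rating  timestamp
1       31       2.5     1260759144
1       1029     3.0     1260759179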
movies = pd.read_csv(path+'movies.csv')
movies.head()
# This table is just for informational purposes and is not intended for modelling.
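movies.csv maps each movieId to its title and genres, along these lines (again illustrative):

movieId  title             genres
1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy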
u_uniq = ratings.userId.unique()
user2idx = {o:i for i,o in enumerate(u_uniq)}
# Take every unique user id and map it to a contiguous integer index.
ratings.userId = ratings.userId.apply(lambda x: user2idx[x])
# Replace each userId with its contiguous index.
# Similarly, we do it for the movies.
m_uniq = ratings.movieId.unique()
movie2idx = {o:i for i,o in enumerate(m_uniq)}
ratings.movieId = ratings.movieId.apply(lambda x: movie2idx[x])
Converting movieId and userId into contiguous integers helps us decide the size of the embedding matrices. These userId and movieId values aren't contiguous to begin with: they may start at one million and have gaps. If we used the raw values to size our embedding matrices, the matrices would be far larger than necessary, which could lead to slow processing or overfitting.
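The code below also assumes a few variables that are worth defining explicitly. A minimal sketch, assuming fastai's get_cv_idxs helper for the validation split and n_factors = 50 (the embedding dimensionality is a free choice, not fixed by the lesson):

val_idxs = get_cv_idxs(len(ratings)) # random row indices for the validation set
n_users = int(ratings.userId.nunique()) # number of distinct users
n_movies = int(ratings.movieId.nunique()) # number of distinct movies
n_factors = 50 # embedding dimensionality (assumed value)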
class EmbeddingDot(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors) # user embedding matrix
        self.m = nn.Embedding(n_movies, n_factors) # movie embedding matrix
        self.u.weight.data.uniform_(0,0.05) # initialize the weights in place
        self.m.weight.data.uniform_(0,0.05)
    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1]
        u,m = self.u(users),self.m(movies)
        return (u*m).sum(1).view(-1, 1)

model = EmbeddingDot(n_users, n_movies).cuda() # Class Instantiation
The concept of OOP is involved in the above code, so let me explain it in detail.
- self is a reference variable which stores the object (i.e. model) when it is created.
- def __init__(self, n_users, n_movies): is a magic method. It is called automatically whenever an object of the class is created. This kind of method is known as a constructor.
- model = EmbeddingDot(n_users, n_movies).cuda() is where the object is created, and with its creation the constructor is called automatically.
- But what's an Object? An Object (i.e. model) is an entity with some attributes and behavior.
- These attributes are the shapes and values of the embeddings, as shown below.
self.u = nn.Embedding(n_users, n_factors) # User Embeddings
self.m = nn.Embedding(n_movies, n_factors) # Movie Embeddings
self.u.weight.data.uniform_(0,0.05) # Values for User Embeddings
self.m.weight.data.uniform_(0,0.05) # Values for Movie Embeddings
- To create these embeddings we use nn.Embedding, which is inherited from nn.Module using the OOP concept of inheritance, via this line of code: super().__init__().
- self.u is set as an instance of the Embedding class. It has a .weight attribute which contains the actual embedding matrix. The embedding matrix is a Variable. A Variable is the same as a Tensor, except that it also supports automatic differentiation.
- To get access to the underlying Tensor, use the self.u.weight.data attribute.
- self.u.weight.data.uniform_: the underscore at the end denotes an in-place operation. It fills the tensor with uniform random numbers of the appropriate size rather than returning a new tensor.
- The forward function comes into action when we do a fit, which comes later on. After a short demo of these embedding mechanics, let's get into the details of what happens when the forward function is called.
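Here is a minimal standalone sketch of these embedding mechanics, with illustrative sizes rather than the lesson's real ones:

import torch
import torch.nn as nn

emb = nn.Embedding(5, 3) # a tiny embedding matrix: 5 rows, 3 factors
emb.weight.data.uniform_(0, 0.05) # fill it in place with values from U(0, 0.05)
idx = torch.LongTensor([0, 2, 4]) # a "mini-batch" of three ids
print(emb(idx).shape) # torch.Size([3, 3]): one embedding row per looked-up id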
def forward(self, cats, conts):
    users,movies = cats[:,0],cats[:,1]
    u,m = self.u(users),self.m(movies)
    return (u*m).sum(1).view(-1, 1)
- users,movies = cats[:,0],cats[:,1] :- grab a mini-batch of users and movies.
- u,m = self.u(users),self.m(movies) :- for that mini-batch of users and movies, look up their rows in the user and movie embedding matrices using self.u(users) and self.m(movies).
- After getting the embeddings for users and movies, we take their element-wise product and sum it across the factors (a dot product) to get a single number, which is the predicted rating.
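To make that last step concrete, here is a tiny worked example of (u*m).sum(1).view(-1, 1) with made-up numbers:

import torch

u = torch.FloatTensor([[1, 2, 3],
                       [0, 1, 0]]) # embeddings of two users
m = torch.FloatTensor([[4, 5, 6],
                       [7, 8, 9]]) # embeddings of two movies
pred = (u * m).sum(1).view(-1, 1) # row-wise dot product -> one predicted rating per pair
print(pred) # [[32.], [8.]] because 1*4 + 2*5 + 3*6 = 32 and 0*7 + 1*8 + 0*9 = 8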
x = ratings.drop(['rating', 'timestamp'], axis=1)
# x contains the users and movies from the dataframe: the independent variables.
y = ratings['rating'].astype(np.float32)
# y contains the dependent variable, i.e. the ratings.
data = ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], 64)
# path :- path of the file.
# val_idxs :- indices of the validation rows.
# x, y :- the independent and dependent variables described above.
# ['userId', 'movieId'] :- list of categorical variables.
# 64 :- batch size.

wd=1e-5 # Regularization (weight decay) parameter
opt = optim.SGD(model.parameters(), 1e-1, weight_decay=wd, momentum=0.9)
# Optimizer used to update the weights, i.e. model.parameters().
# model.parameters() is inherited from nn.Module and gives the list of all
# the weights that need to be updated; it is passed to the optimizer
# along with the learning rate, weight decay and momentum.
For fitting our data, i.e. for training, earlier we were using fast.ai's learner, but now we will make use of PyTorch's own capabilities. When the fit command below is executed, check out the model.py file within the fastai folder to see what fit does under the hood. Basically it does the following (see the sketch after this list):
- A forward pass, by calling the forward function def forward(self, cats, conts) defined above.
- A backward pass to update the embeddings, which is standard PyTorch functionality.
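For intuition, here is a simplified sketch of what one training pass inside fit roughly looks like; the real fastai loop also handles callbacks, metrics and device placement, so treat this as an approximation, not the actual implementation:

for cats, conts, y in data.trn_dl: # iterate over training mini-batches
    preds = model(V(cats), V(conts)) # forward pass: calls model.forward(...)
    loss = F.mse_loss(preds, V(y)) # mean squared error against the true ratings
    opt.zero_grad() # clear the gradients from the previous step
    loss.backward() # backward pass: compute fresh gradients
    opt.step() # update the embedding weights

(V() is fastai's wrapper that turns a tensor into a Variable.)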
fit(model, data, 3, opt, F.mse_loss)
Here we don't get the functionality of SGDR (stochastic gradient descent with restarts), so we manually reset the learning rate and check out the loss.
set_lrs(opt, 0.01)
fit(model, data, 3, opt, F.mse_loss)
Although our model is performing well, since we aren't implementing SGDR properly, our loss is higher than before.
HOW DO WE FURTHER IMPROVE THE MODEL?
Now we will take bias into consideration. Some users are highly enthusiastic and rate all movies higher on average. For this reason we add a constant for each movie and each user; this constant is known as a bias.
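A toy example of the idea, with made-up numbers:

dot = 2.1 # dot product of a user embedding and a movie embedding
user_bias = 0.3 # an enthusiastic rater who scores everything a bit higher
movie_bias = 0.5 # a generally well-liked movie
raw_score = dot + user_bias + movie_bias # 2.9, before squashing into the rating range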
min_rating,max_rating = ratings.rating.min(),ratings.rating.max()
min_rating,max_rating

def get_emb(ni, nf):
    # ni = number of rows (users or movies), nf = number of factors,
    # i.e. the embedding dimensionality.
    e = nn.Embedding(ni, nf) # creation of the embedding matrix
    e.weight.data.uniform_(-0.01,0.01)
    # fill it in place with randomly initialized values between (-0.01, 0.01)
    return e

class EmbeddingDotBias(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        # Create embeddings for users (self.u), movies (self.m),
        # user bias (self.ub) and movie bias (self.mb) by calling get_emb().
        (self.u, self.m, self.ub, self.mb) = [get_emb(*o) for o in [
            (n_users, n_factors), (n_movies, n_factors), (n_users,1), (n_movies,1)
        ]]
    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1]
        um = (self.u(users)* self.m(movies)).sum(1)
        res = um + self.ub(users).squeeze() + self.mb(movies).squeeze()
        # Add in the user bias and movie bias. Using .squeeze() enables broadcasting.
        res = F.sigmoid(res) * (max_rating-min_rating) + min_rating
        # This squishes the value between min_rating and max_rating: a good
        # movie gets a really high number, a bad one a low number.
        # F.sigmoid(res) squishes it between 0 and 1 first.
        return res.view(-1, 1)

wd=2e-4
model = EmbeddingDotBias(cf.n_users, cf.n_items).cuda() # cf is the CollabFilterDataset from the previous post in this series
opt = optim.SGD(model.parameters(), 1e-1, weight_decay=wd, momentum=0.9)
fit(model, data, 3, opt, F.mse_loss)
set_lrs(opt, 1e-2)
fit(model, data, 3, opt, F.mse_loss)
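To see the squishing step from EmbeddingDotBias in isolation, here is a small sketch assuming min_rating = 0.5 and max_rating = 5.0 (the MovieLens range); torch.sigmoid is equivalent to the F.sigmoid used above:

import torch

raw = torch.FloatTensor([-3.0, 0.0, 3.0]) # unbounded scores from the network
squished = torch.sigmoid(raw) * (5.0 - 0.5) + 0.5 # map them into [0.5, 5.0]
print(squished) # roughly [0.71, 2.75, 4.79]: always a valid rating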
Finally we reach a loss of 0.8, which is reasonably good.
If you like it , then ABC (Always be clapping . 👏 👏👏👏👏😃😃😃😃😃😃😃😃😃👏 👏👏👏👏👏)
If you have any questions, feel free to reach out on the fast.ai forums or on Twitter: @ashiskumarpanda
P.S. This blog post will be updated and improved as I continue with further lessons. For more interesting stuff, feel free to check out my GitHub account.
Edit 1: TFW Jeremy Howard approves of your post. 💖💖 🙌🙌🙌 💖💖