“COLLABORATIVE FILTERING FROM SCRATCH”
Build a Collaborative Filtering Model for Movie Recommendation from Scratch
Welcome to the second part of the fifth episode of Fastdotai, where we will deal with Collaborative Filtering from Scratch, a technique widely used in recommendation systems. Before we start, I would like to thank Jeremy Howard and Rachel Thomas for their efforts to democratize AI.
To make the best out of this blog post series, feel free to explore every part of this series in the following order:
- Dog Vs Cat Image Classification
- Dog Breed Image Classification
- Multi-label Image Classification
- Time Series Analysis using Neural Network
- NLP- Sentiment Analysis on IMDB Movie Dataset
- Basics of Movie Recommendation System
- Collaborative Filtering from Scratch
- Collaborative Filtering using Neural Network
- Writing Philosophy like Nietzsche
- Performance of Different Neural Network on Cifar-10 dataset
- ML Model to detect the biggest object in an image Part-1
- ML Model to detect the biggest object in an image Part-2
Reason behind Netflix and Chill.
First of all, let's import all the required packages.
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.learner import *
from fastai.column_data import *
Set the path where
- Input data is stored.
- Temporary files will be stored. (Optional: used in Kaggle kernels.)
- Model weights will be stored. (Optional: used in Kaggle kernels.)
path='../input/'
tmp_path='/kaggle/working/tmp/'
models_path='/kaggle/working/models/'
- Read in the data.
ratings = pd.read_csv(path+'ratings.csv')
ratings.head()
# This contains the userId, the movieId the user watched, the timestamp, and the rating provided by the user.
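For reference, ratings.csv in the MovieLens dataset has four columns, so ratings.head() shows something along these lines (illustrative rows, not output from this run):

userId  movieId  rating  timestamp
1       31       2.5     1260759144
1       1029     3.0     1260759179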
movies = pd.read_csv(path+'movies.csv')
movies.head()
# This table is just for informational purposes and is not intended for modelling.
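movies.csv maps each movieId to its title and genres, along these lines (again illustrative):

movieId  title             genres
1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy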
u_uniq = ratings.userId.unique()
user2idx = {o:i for i,o in enumerate(u_uniq)}
# Take every unique user id and map it to a contiguous integer index.
ratings.userId = ratings.userId.apply(lambda x: user2idx[x])
# Replace each userId with its contiguous index.
# Similarly, we do it for the movies.
m_uniq = ratings.movieId.unique()
movie2idx = {o:i for i,o in enumerate(m_uniq)}
ratings.movieId = ratings.movieId.apply(lambda x: movie2idx[x])
Converting movieId and userId into contiguous integers helps us decide the size of the embedding matrices. These userId and movieId values aren't contiguous to begin with: they may start at one million and have gaps. If we used the raw values to size our embedding matrices, the matrices would be far larger than necessary, which could lead to slow processing or overfitting.
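The code below also assumes a few variables that are worth defining explicitly. A minimal sketch, assuming fastai's get_cv_idxs helper for the validation split and n_factors = 50 (the embedding dimensionality is a free choice, not fixed by the lesson):

val_idxs = get_cv_idxs(len(ratings)) # random row indices for the validation set
n_users = int(ratings.userId.nunique()) # number of distinct users
n_movies = int(ratings.movieId.nunique()) # number of distinct movies
n_factors = 50 # embedding dimensionality (assumed value)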
class EmbeddingDot(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors) # user embedding matrix
        self.m = nn.Embedding(n_movies, n_factors) # movie embedding matrix
        self.u.weight.data.uniform_(0,0.05) # initialize the weights in place
        self.m.weight.data.uniform_(0,0.05)
    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1]
        u,m = self.u(users),self.m(movies)
        return (u*m).sum(1).view(-1, 1)

model = EmbeddingDot(n_users, n_movies).cuda() # Class Instantiation
The concept of OOP is involved in the above code, so let me explain it in detail.
- self is a reference variable which stores the object (i.e. model) when it is created.
- def __init__(self, n_users, n_movies): is a magic method. It is called automatically whenever an object of the class is created. This kind of method is known as a constructor.
- model = EmbeddingDot(n_users, n_movies).cuda() is where the object is created, and with its creation the constructor is called automatically.
- But what's an Object? An Object (i.e. model) is an entity with some attributes and behavior.
- These attributes are the shapes and values of the embeddings, as shown below.
self.u = nn.Embedding(n_users, n_factors) # User Embeddings
self.m = nn.Embedding(n_movies, n_factors) # Movie Embeddings
self.u.weight.data.uniform_(0,0.05) # Values for User Embeddings
self.m.weight.data.uniform_(0,0.05) # Values for Movie Embeddings
- To create these embeddings we use nn.Embedding, which is inherited from nn.Module using the OOP concept of inheritance, via this line of code: super().__init__().
- self.u is set as an instance of the Embedding class. It has a .weight attribute which contains the actual embedding matrix. The embedding matrix is a Variable. A Variable is the same as a Tensor, except that it also supports automatic differentiation.
- To get access to the underlying Tensor, use the self.u.weight.data attribute.
- self.u.weight.data.uniform_: the underscore at the end denotes an in-place operation. It fills the tensor with uniform random numbers of the appropriate size rather than returning a new tensor.
- The forward function comes into action when we do a fit, which comes later on. After a short demo of these embedding mechanics, let's get into the details of what happens when the forward function is called.
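Here is a minimal standalone sketch of these embedding mechanics, with illustrative sizes rather than the lesson's real ones:

import torch
import torch.nn as nn

emb = nn.Embedding(5, 3) # a tiny embedding matrix: 5 rows, 3 factors
emb.weight.data.uniform_(0, 0.05) # fill it in place with values from U(0, 0.05)
idx = torch.LongTensor([0, 2, 4]) # a "mini-batch" of three ids
print(emb(idx).shape) # torch.Size([3, 3]): one embedding row per looked-up id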
def forward(self, cats, conts):
    users,movies = cats[:,0],cats[:,1]
    u,m = self.u(users),self.m(movies)
    return (u*m).sum(1).view(-1, 1)
- users,movies = cats[:,0],cats[:,1] :- grab a mini-batch of users and movies.
- u,m = self.u(users),self.m(movies) :- for that mini-batch of users and movies, look up their rows in the user and movie embedding matrices using self.u(users) and self.m(movies).
- After getting the embeddings for users and movies, we take their element-wise product and sum it across the factors (a dot product) to get a single number, which is the predicted rating.
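To make that last step concrete, here is a tiny worked example of (u*m).sum(1).view(-1, 1) with made-up numbers:

import torch

u = torch.FloatTensor([[1, 2, 3],
                       [0, 1, 0]]) # embeddings of two users
m = torch.FloatTensor([[4, 5, 6],
                       [7, 8, 9]]) # embeddings of two movies
pred = (u * m).sum(1).view(-1, 1) # row-wise dot product -> one predicted rating per pair
print(pred) # [[32.], [8.]] because 1*4 + 2*5 + 3*6 = 32 and 0*7 + 1*8 + 0*9 = 8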
x = ratings.drop(['rating', 'timestamp'], axis=1)
# x contains the users and movies from the dataframe: the independent variables.
y = ratings['rating'].astype(np.float32)
# y contains the dependent variable, i.e. the ratings.
data = ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], 64)
# path :- path of the file.
# val_idxs :- indices of the validation rows.
# x, y :- the independent and dependent variables described above.
# ['userId', 'movieId'] :- list of categorical variables.
# 64 :- batch size.

wd=1e-5 # Regularization (weight decay) parameter
opt = optim.SGD(model.parameters(), 1e-1, weight_decay=wd, momentum=0.9)
# Optimizer used to update the weights, i.e. model.parameters().
# model.parameters() is inherited from nn.Module and gives the list of all
# the weights that need to be updated; it is passed to the optimizer
# along with the learning rate, weight decay and momentum.
For fitting our data, i.e. for training, earlier we were using fast.ai's learner, but now we will make use of PyTorch's own capabilities. When the fit command below is executed, check out the model.py file within the fastai folder to see what fit does under the hood. Basically it does the following (see the sketch after this list):
- A forward pass, by calling the forward function def forward(self, cats, conts) defined above.
- A backward pass to update the embeddings, which is standard PyTorch functionality.
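For intuition, here is a simplified sketch of what one training pass inside fit roughly looks like; the real fastai loop also handles callbacks, metrics and device placement, so treat this as an approximation, not the actual implementation:

for cats, conts, y in data.trn_dl: # iterate over training mini-batches
    preds = model(V(cats), V(conts)) # forward pass: calls model.forward(...)
    loss = F.mse_loss(preds, V(y)) # mean squared error against the true ratings
    opt.zero_grad() # clear the gradients from the previous step
    loss.backward() # backward pass: compute fresh gradients
    opt.step() # update the embedding weights

(V() is fastai's wrapper that turns a tensor into a Variable.)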
fit(model, data, 3, opt, F.mse_loss)
Here we don't get the functionality of SGDR (stochastic gradient descent with restarts), so we manually reset the learning rate and check out the loss.
set_lrs(opt, 0.01)
fit(model, data, 3, opt, F.mse_loss)
Although our model is performing well, since we aren't implementing SGDR properly, our loss is higher than before.
HOW DO WE FURTHER IMPROVE THE MODEL?
Now we will take bias into consideration. Some users are highly enthusiastic and rate all movies higher on average. For this reason we add a constant for each movie and each user; this constant is known as a bias.
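A toy example of the idea, with made-up numbers:

dot = 2.1 # dot product of a user embedding and a movie embedding
user_bias = 0.3 # an enthusiastic rater who scores everything a bit higher
movie_bias = 0.5 # a generally well-liked movie
raw_score = dot + user_bias + movie_bias # 2.9, before squashing into the rating range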
min_rating,max_rating = ratings.rating.min(),ratings.rating.max()
min_rating,max_rating

def get_emb(ni, nf):
    # ni = number of rows (users or movies), nf = number of factors,
    # i.e. the embedding dimensionality.
    e = nn.Embedding(ni, nf) # creation of the embedding matrix
    e.weight.data.uniform_(-0.01,0.01)
    # fill it in place with randomly initialized values between (-0.01, 0.01)
    return e

class EmbeddingDotBias(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        # Create embeddings for users (self.u), movies (self.m),
        # user bias (self.ub) and movie bias (self.mb) by calling get_emb().
        (self.u, self.m, self.ub, self.mb) = [get_emb(*o) for o in [
            (n_users, n_factors), (n_movies, n_factors), (n_users,1), (n_movies,1)
        ]]
    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1]
        um = (self.u(users)* self.m(movies)).sum(1)
        res = um + self.ub(users).squeeze() + self.mb(movies).squeeze()
        # Add in the user bias and movie bias. Using .squeeze() enables broadcasting.
        res = F.sigmoid(res) * (max_rating-min_rating) + min_rating
        # This squishes the value between min_rating and max_rating: a good
        # movie gets a really high number, a bad one a low number.
        # F.sigmoid(res) squishes it between 0 and 1 first.
        return res.view(-1, 1)

wd=2e-4
model = EmbeddingDotBias(cf.n_users, cf.n_items).cuda() # cf is the CollabFilterDataset from the previous post in this series
opt = optim.SGD(model.parameters(), 1e-1, weight_decay=wd, momentum=0.9)
fit(model, data, 3, opt, F.mse_loss)
set_lrs(opt, 1e-2)
fit(model, data, 3, opt, F.mse_loss)
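To see the squishing step from EmbeddingDotBias in isolation, here is a small sketch assuming min_rating = 0.5 and max_rating = 5.0 (the MovieLens range); torch.sigmoid is equivalent to the F.sigmoid used above:

import torch

raw = torch.FloatTensor([-3.0, 0.0, 3.0]) # unbounded scores from the network
squished = torch.sigmoid(raw) * (5.0 - 0.5) + 0.5 # map them into [0.5, 5.0]
print(squished) # roughly [0.71, 2.75, 4.79]: always a valid rating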
Finally we reach a loss of 0.8, which is reasonably good.
If you like it , then ABC (Always be clapping . 👏 👏👏👏👏😃😃😃😃😃😃😃😃😃👏 👏👏👏👏👏)
If you have any questions, feel free to reach out on the fast.ai forums or on Twitter: @ashiskumarpanda
P.S. This blog post will be updated and improved as I continue with further lessons. For more interesting stuff, feel free to check out my GitHub account.
Edit 1: TFW Jeremy Howard approves of your post. 💖💖 🙌🙌🙌 💖💖