Introduction To Recommender Systems- 2: Deep Neural Network Based Recommendation Systems

How Deep Learning is used in Recommendation systems?

Abhijit Roy
Towards Data Science
15 min readJul 31, 2020

--

It is my second article on the Recommendation systems. In my previous article, I have talked about content-based and collaborative filtering systems. I will encourage you to go through the article if you have any confusion. In this article, we are going to see how Deep Learning is used in Recommender systems. We will go through the recommender system’s candidate generation architecture of Youtube. So, let’s jump in.

Why Deep Neural Networks?

Limitations of matrix factorization methods:

  1. The difficulty of using side features, that may affect the recommendation like the U/PG rating of a movie or the country of the user. We can only use the item ID and the user ID in the case of matrix factorization. It also prevents us from querying an item or user not present in the training set.
  2. The matrix factorization also had the cold start problem due to the fact that it had no feature vector or embedding for the new items
  3. Matrix factorization often tends to recommend popular items to everyone which does not always reflect the specific user interests mostly when Dot products are used.
  4. Matrix factorization works on the simple inner product of the User and item feature embeddings, it is often not enough to capture and represent the complex relations in the user and items.

Deep Neural networks are designed and used to address these shortcomings of the matrix factorization methods.

We will use the references from two research papers to get the idea of how deep learning works for recommender systems:

  1. Neural Collaborative Filtering: https://arxiv.org/abs/1708.05031
  2. Deep Neural Networks for YouTube Recommendations: https://dl.acm.org/doi/10.1145/2959100.2959190

Neural Collaborative Filtering:

Neural Collaborative Filtering (NCF) is a paper published by the National University of Singapore, Columbia University, Shandong University, and Texas A&M University in 2017. The paper proposed a neural network-based collaborative learning framework that will use Multi perceptron layers to learn user-item interaction function.

Matrix Factorization

Matrix factorization, which I have talked about in my last article, (please check out the working, if you have any confusion) maps both the users and items latent vectors, or the dense feature vectors used to describe items or users, on the same latent space or embedding space and represents user-item interaction as inner product or dot product of the user and item vectors. This paper argued that the simple choice of interaction function, i.e, the dot product may cause hindrance in the case of matrix factorization, as it may fail to capture the complex relations of the user-item interactions.

The paper demonstrated the following example:

Suppose there are M users and N items, and Y(M x N) denotes the item user matrix such that:

So, the generated matrix looks like,

Say, this is our user-item interaction matrix, 1 denotes the user interacted with the item and vice versa. Now if we obtain our user and item latent vectors or embedding vectors, they will look something like this:

For items and users respectively, we get:

(A) Item embedding (B) User Embedding

The above matrices on the inner product give back the interaction matrix. Now, the paper concentrated on implicit feedback, for example, if a user has interacted with an item its positive feedback, as considered here. But in actual cases, 1 here may not directly imply the user liked the content. Similarly 0 means, there was no interaction between that user and the item. Elaborately speaking, if I as a user watch a video, I interact with that item, it is considered as 1 implying positive feedback to that video by me, which may not be true in real-world, as I may not like the video. I think the intuition behind this is that, if I interact with an item, it shows I am interested in that type of item, so, it may give an idea about what type of content I like or want to watch. It is some kind of positive feedback. Though the 0’s may not be negative feedback, they may simply mean the user didn’t watch them or missing data. So, there is a crisis of negative feedback.

Now, for ranking the items using the model-based approach the paper uses:

Source

y(u, i) is the predicted score for interaction between user u and item and theta denotes the model parameters, which we need to optimize and f is the interaction function.

Now let’s talk about the objective function or the loss functions used:

Pointwise Loss: This considers the problem as a classification problem though it uses the regression framework loss function of Mean Square Error. It tries to minimize the MSE between the predicted score and the target value for an entry. If we go for the implementation details, we take a positive example and k negative examples and fit a sigmoid activation. We then use MSE to minimize the error. The negative example means the 0’s which are randomly sampled against a positive example or 1. Basically it is a classification, we try to get the layer to predict 1 for positive examples and 0 for the negative examples.

Pairwise Loss or Ranking Loss: It tries to rank the observed entries or the 1’s higher than the 0’s or unobserved entries, i.e, produces higher scores for 1’s than 0’s. It uses an equation like:

Loss= 1 — sigmoid (Positive score -Negative score).

It trains on triplets, an example, a positive or observed item, and a negative or unobserved are coupled together to create a training sample. If the difference between the scores for positive and negative is large, sigmoid gives a number close to 1 and the number is subtracted from 1 giving a lower loss. Such losses are used by Siamese Networks. There is an adaptive version of this loss function, which we are not going to discuss, but briefly, it does not sample the negative example randomly, it chooses the negative sample from a shortlisted set of negative samples, for whom the generated score is the largest, i.e, the chance that the model will produce a large number wrongly for the sample is maximum.

The paper used the pointwise loss.

Next, let’s observe a drawback of the matrix factorization method,

Source

Suppose, this is our item-user interaction matrix. We will use the cosine function to get the similarity between the user Latent vectors.

S{i,j}= Cosine(i,j)

S denotes similarity between user i and j.

So, we get,

S{23} > S{12} > S{13}.

Now, if we arrange them according to their similarities we obtain:

Source

This type of vector representation where each vector P represents a user.

Now, a new user comes in U4.

On recalculating, we will get.

S{41} > S{43} > S{42}.

If we observe closely there is a confusion U4 is most similar to P1, then P3 and finally P2.

Let’s see another vector representation:

Source

If by any chance u4 is placed p4, it will be similar to u2 and give the wrong recommendations as it will give u2 more weightage than u3 for prediction of u4.

This drawback was answered using deep learning for recommendation systems, which used deep neural nets to create embeddings.

Architecture for NCF:

Source

Working: The model takes in two sparse vectors, one representing the user and the other represents items. The item vector has 1 at an index means the user has interacted with the item corresponding to the index. So, elaborately,

User vector=[ 0,0,1 ………..0] with m elements, means this vector represents the 3 rd user out of m.

Item vector=[0,1,0,1,1,0…..1] with n elements, means the user interacted with those items out of n items.

Basically both items and users are one-hot encoded.

The next layers are the embedding layers that obtain the dense or latent vectors for the sparse inputs, from the input layer. Now, the obtained latent vectors are fed into the multi-layer neural architecture, to map the latent vectors to the predicted probability scores. The layers are responsible to find the complex user-item relations from the data. The output layer produces the predicted score Y_pred(ui), i.e, how much is the probability that the user u will interact with the item i.

For training the model, the authors have used pointwise loss function, to minimize the difference between the target value Y(ui) and the corresponding predicted value.

The NCF model’s Prediction equation is given by:

Source

“where P ∈ R(M×K) and Q ∈ R (N×K), denoting the latent factor matrix for users and items, respectively and Θf denotes the model parameters of the interaction function f”.- Source

We know for a multi perceptron layer, for each layer,

y= f1(x), where f1 is the activation function or mapping function. Here, the output prediction is formulated as,

Source

where Psi-out and Psi-x respectively denote the mapping function for the output layer and x-th neural collaborative filtering (CF) layer, and there are X neural CF layers in total.

Learning and Loss of NCF

The NCF uses a pointwise loss to train model parameters. So, it trains using a mean squared loss. The loss is described as:

“Source

“where Y denotes the set of observed interactions in Y and Y − denotes the set of negative instances, which can be all (or sampled from) unobserved interactions; and Wui is a hyperparameter denoting the weight of training instance (u, i)”.-Source.

Now, as we know, the squared loss is used generally for regression and explained on Gaussian distribution, but in our case, it is a type of binary classification with target value 0 or 1, depending upon whether the user interacted with the items or not. So, here to fit our requirements the paper proposed a probabilistic approach for learning Pointwise Loss that pays more attention to the binary property of the data. The paper uses a logistic function to keep the predicted output in the range of [0,1].

The function is redefined as :

Source

So, taking the negative logarithm, the obtained function is:

Source

The loss function is actually the binary cross-entropy loss or log loss. The optimization is done using the SGD optimizer.

The NCF thus obtains the loss function value and uses it to train the embeddings and model parameters using the backpropagation. The paper proved the Matrix factorization is the special case of NCF. Thus, we can obtain the item and user embeddings accurately using Neural collaborative filtering.

We have just talked about the NCF model here, the paper also proposed an idea of a model as a Fusion of general Matrix factorization GMF and Multiperceptron layer MLP with the name NeuMF.

Source

The general idea behind the fusion is that the GMF better generates the linear dependencies and the MLP better models the non-linear dependencies of in the user-item interactions.

Here the MLP’s item and user vectors and GMF layer’s user and item vectors are separately trained, so, that they can reach individual optimality.

Then the layers are concatenated and fed to a NueMF layer, for final predictions and backpropagation from loss. We are not going to talk about it’s working here if you want to know more, feel free to go through the original paper.

Deep Neural Networks Softmax approach

Now, let’s talk about the second paper to get a clear intuition about how youtube implements its recommender systems with the softmax layer.

Deep Neural Networks for YouTube Recommendations:

The paper is written by Paul Covington, Jay Adams, and Emre Sargin. This paper talks about how Youtube recommends videos to its users. The paper mentions mainly three challenges:

  1. The scale of data: Youtube has a massive user base and a huge corpus
  2. Freshness: The corpus of youtube is very dynamic, as every second, hour of videos are uploaded
  3. Noise: There are sparsity and external factors that create noisy implicit feedback.

System Overview

The youtube’s system comprises of two neural networks, one for candidate generation and another for ranking. The candidate generator is responsible for taking in the users watch history as input and give a small subset of videos as recommendations from youtube’s huge corpus of videos. The candidate generation networks work based on collaborative filtering. The features like watching history and demographics are used to decide the similarities between users. The ranking network accomplishes the choosing of top N items by assigning scores to each video according to the desired objective function using the set of features describing the video and user.

Source

The image shows the system overview for youtube. The system uses performance metrics like precision, recall, and ranking loss, which we talked about above.

Candidate Generation

The candidate generation neural network is based on the matrix factorization using ranking loss, where the embedding layer for a user is completely constructed using the user’s watch history. According to the paper, the method can be termed as a “non-linear generalization of factorization techniques”.-Source

Working

This paper has used recommendation as an extreme multiclass classification with a softmax layer as the output layer.

The softmax function is given by:

The paper proposes the problem’s classification equation as:

Source

Here the model tries to classify the probability that a user U will watch a video at time t, given by (W{t}) from the huge corpus of videos i in a given context C. ‘u’ represents the embedding or dense feature vector of a user and ‘v{j}’ represents the embedding of a candidate video. The embeddings are created from the sparse vectors of the users, created from their watch histories. The neural network learns the user embeddings ‘u’ as the function of the user’s history which is fed to the softmax layer to classify the videos that the user might want to watch based on the history and embeddings. The probability that the user will watch the video is given by 1 and the condition that the user won’t watch is given by 0.

Now, having said the youtube works on a classification mechanism, we should note that youtube has a huge corpus of items and if we want to classify which content or items a particular user likes, using a softmax layer it will be a huge classification, with a huge number of nodes in the output layer, with most of the item being a negative sample. Most of the items will be negative because, given the huge corpus of items given on youtube, there exists no user who has watched say, 20% of the total content available on youtube, we can say from common sense. In addition, if we analyze our watch patterns we will realize that we, as viewers are interested in some specific types of contents, i.e, we don’t want to actually watch any type of content, which I think can also clearly satisfy, why in the first place our watch history is so important to describe us as a viewer. So, most items in the interaction matrix will have 0 probability.

Using a classification on so many items at a time using softmax will take a lot of time, so Youtube uses a negative sampling approach. This approach helps youtube to randomly sample negative classes for a user from the background distribution. In practice, thousands of negative classes are sampled. Basically the idea is driven by the fact that we want to know what type of content the user likes, so, there is no point in using 100K negative samples which will delay the model performances. This decreases the training set size, consequently the number of nodes in the output layer, and finally, the time taken by the softmax model.

Now, once the candidate generation network generates the dense vectors or embeddings, the scoring network becomes just the nearest neighbor’s search in the dot product space of the user and item vectors, and thus produces top N choices that appear as recommendations.

Feedback Mechanisms

Youtube works based on an implicit feedback mechanism. Though it has explicit feedbacks available, given the huge number of videos and users, explicit feedback becomes very sparse in nature, as a result, implicit feedbacks is used to train the model. If a user completes watching a video, it is regarded as positive feedback, and so on. This way the feedbacks are collected. The approach results in a large amount of feedback which can be used to train models.

Source

Model Architecture

The paper states youtube’s module learns the embeddings of the videos using a fixed vocabulary. For a user, the user’s watch history is represented as a variable-length sequence of the sparse video id’s which are mapped to dense vectors, that serve as the embedding vector for the user. The embeddings are concatenated to form the features. Along with that, other features are taken into consideration like age, gender, and country. The full feature set is fed to fully connected multilayered Rectified Linear (ReLU) units. The parameters and embeddings are trained using normal gradient descent. We obtain the final embeddings from the softmax layer for the candidate videos and users and feed it to the nearest neighbor algorithm to generate top choices.

The paper also focuses on the inclusion of side features to train the model as an important aspect of using deep neural nets. It states that the approach helps to include a number of binary and continuous features in a normalized manner, like age, gender, and demographics which is very important for recommendation to a new user. For example, the paper mentions, youtube has observed “Fresh” contents or contents newly uploaded on youtube are preferred by users, so, the generator needs to recommend the fresh content to a user based on their interest. It also needs to do the same with the viral contents which are completely dynamic in nature.

The paper also talks about the ranking system, which we won’t be discussing. If you are interested to know the full procedure, I suggest going through the original paper in order to get a full idea.

Lastly, I want to talk about another type of Deep Learning-based recommender system.

Recommendation as sequence prediction

If we observe our interactions with different items say, we are watching videos of youtube, we watch the videos in a sequence, i.e, we pick one item, interact with it and then move to the new item. So, basically what we are doing is, as a viewer, we are creating a sequence of items in order of which we are accessing the items. This type of recommender system takes up these sequences and based on the sequence tries to predict the next item in the sequence.

Now, to predict the next items in the sequence it assigns a variable probability ‘p’ to each of the items for a particular user. It states that it is p% probable that the user will want to interact with the corresponding item next, based on the user’s history.

So, it may be represented as:

[a]->b

[a b]->c

[a b c ] -> d

And so on.

The system defines each user as the function of his/her item interaction history. There are several approaches to this problem formulation:

  1. The user is defined as an average of the item vectors he has previously interacted with, and then the obtained user vector is used for recommendation
  2. Recurrent Neural Networks are trained on the user’s sequential watch history to generate the embedding for the user
  3. A Convolutional Neural Network can be used. We can split the watch history sequence into training windows and shift the 1 Dimensional training window over the interaction sequence of a particular user to get the embedding

These three are the common approaches used to build recommendation systems treating the user data as a sequence.

Conclusion

In this article, we have talked about the ways in which deep learning is used in recommender systems.

Hope this article helps

--

--