RecoTour II: neural recommendation algorithms

Javier Rodriguez Zaurin
Towards Data Science
15 min read · Sep 11, 2019


This is the second of a series of posts on recommendation algorithms in python. In the first post of the series, which I wrote quite a while ago, I quickly went through a number of algorithms that I implemented and tried using Kaggle’s Ponpare dataset. You can find all the related code in the repo. In this post I will use the Amazon Reviews dataset [1] [2], in particular the 5-core Movies and TV reviews, to illustrate the use of two DL-based recommendation algorithms: Neural Collaborative Filtering [3] and the more recent Neural Graph Collaborative Filtering [4].

Before I move forward, let me emphasise the following: what I have done here is implement the authors’ original solutions (released in Keras and Tensorflow respectively) in Pytorch, after reading their papers. As with all the other algorithms in the RecoTour repo, I have coded a number of notebooks with plenty of detail, ranging from how to prepare the data to how to train and validate the results. I have also included a number of additional functionalities and adapted the code to my preferences. Of course, all credit goes to the authors, for their nice papers and for releasing the code, which is always appreciated. Once that is clear, and without further ado, let’s move on to the algorithms.

1. Neural Collaborative Filtering (NCF)

Let me start by clarifying that the authors refer to NCF as a general framework, formulated in Section 3 of their paper as a series of so-called NCF layers that receive the user and item embeddings.

1.1 The algorithm

The specific algorithm that they eventually implement within that general framework is called Neural Matrix Factorisation (NeuMF). This algorithm is relatively simple, so I do not intend to spend too much time on it here. There are a number of posts online where you can find more information and, as always, I strongly recommend reading the paper, where you can find all the details. Moreover, in the first few sections you will find the rationale that led the authors to the implementation of the algorithm (you know…the scientific justification).

NeuMF consists of two parts, one that the authors refer to as Generalised Matrix Factorisation (GMF) and a second one which is a Multi-Layer Perceptron (MLP). GMF is literally the element-wise product between the user and item embeddings. I guess it is so simple that the authors did not even consider including a figure just for GMF. Nonetheless, I made one and here it is:

Fig 1. GMF model

The corresponding forward pass is simply:
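For reference, a minimal sketch of what such a module could look like in Pytorch (class and attribute names are mine and may differ from those in the repo):

```python
import torch
import torch.nn as nn

class GMF(nn.Module):
    def __init__(self, n_users, n_items, n_emb):
        super().__init__()
        self.embeddings_user = nn.Embedding(n_users, n_emb)
        self.embeddings_item = nn.Embedding(n_items, n_emb)
        self.out = nn.Linear(in_features=n_emb, out_features=1)

    def forward(self, users, items):
        user_emb = self.embeddings_user(users)
        item_emb = self.embeddings_item(items)
        # GMF is literally the element-wise product of the two embeddings
        prod = user_emb * item_emb
        return torch.sigmoid(self.out(prod))
```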

GMF forward pass

where out=nn.Linear(in_features=n_emb, out_features=1).

The MLP is not much more complex, really. Figure 2 in their paper shows the general NCF framework:

Fig 2. MLP model. (Figure 2 in their paper)

If we think of the Neural CF layers as linear layers with some activation, we have the MLP model. The corresponding forward pass:
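A minimal sketch of the module (the layer sizes here are illustrative, not necessarily the ones used in the experiments):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_users, n_items, n_emb, layers=(32, 16, 8)):
        super().__init__()
        self.embeddings_user = nn.Embedding(n_users, n_emb)
        self.embeddings_item = nn.Embedding(n_items, n_emb)
        # sequential stack of linear layers with ReLU activations
        mlp = []
        input_dim = 2 * n_emb  # user and item embeddings are concatenated
        for output_dim in layers:
            mlp += [nn.Linear(input_dim, output_dim), nn.ReLU()]
            input_dim = output_dim
        self.mlp = nn.Sequential(*mlp)
        self.out = nn.Linear(in_features=layers[-1], out_features=1)

    def forward(self, users, items):
        emb_vector = torch.cat(
            [self.embeddings_user(users), self.embeddings_item(items)], dim=1)
        return torch.sigmoid(self.out(self.mlp(emb_vector)))
```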

MLP forward pass

where mlp is just a Sequential stack of linear layers and out is the same as in GMF, with the corresponding number of input features.

Once we have both building blocks, NeuMF is simply their combination, as illustrated in their Figure 3:

Fig 3. NeuMF model (Figure 3 in their paper.)

Formally, NeuMF can be defined as:

$$\Phi^{GMF} = p_u^{G} \odot q_i^{G}$$

$$\Phi^{MLP} = a_L \left( W_L^T \left( \dots a_2 \left( W_2^T \begin{bmatrix} p_u^{M} \\ q_i^{M} \end{bmatrix} + b_2 \right) \dots \right) + b_L \right)$$

$$\hat{y}_{ui} = \sigma \left( h^T \begin{bmatrix} \Phi^{GMF} \\ \Phi^{MLP} \end{bmatrix} \right)$$

Equation(s) 1. NeuMF, all the math you need

where Φ-GMF is the element-wise product between the MF embeddings, Φ-MLP is the result of passing the so-called MLP embeddings through a series of linear layers with some activation a, and ŷ is simply the concatenation of Φ-GMF and Φ-MLP passed through a linear layer with a sigmoid activation (σ).

In code (i.e. the forward pass):
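A minimal sketch combining the two branches above (again, names and layer sizes are illustrative; the repo additionally supports loading pre-trained GMF/MLP weights):

```python
import torch
import torch.nn as nn

class NeuMF(nn.Module):
    def __init__(self, n_users, n_items, n_emb, layers=(32, 16, 8)):
        super().__init__()
        # separate embeddings for the GMF and the MLP branches
        self.mf_emb_user = nn.Embedding(n_users, n_emb)
        self.mf_emb_item = nn.Embedding(n_items, n_emb)
        self.mlp_emb_user = nn.Embedding(n_users, n_emb)
        self.mlp_emb_item = nn.Embedding(n_items, n_emb)
        mlp = []
        input_dim = 2 * n_emb
        for output_dim in layers:
            mlp += [nn.Linear(input_dim, output_dim), nn.ReLU()]
            input_dim = output_dim
        self.mlp = nn.Sequential(*mlp)
        # the output layer receives the concatenation [phi_GMF ; phi_MLP]
        self.out = nn.Linear(n_emb + layers[-1], 1)

    def forward(self, users, items):
        phi_gmf = self.mf_emb_user(users) * self.mf_emb_item(items)
        phi_mlp = self.mlp(torch.cat(
            [self.mlp_emb_user(users), self.mlp_emb_item(items)], dim=1))
        return torch.sigmoid(self.out(torch.cat([phi_gmf, phi_mlp], dim=1)))
```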

NeuMF forward pass

And as far as the NeuMF algorithm itself is concerned, this is it, really.

1.2 Training/Validation strategy and results

Before we move on to the next algorithm, let me briefly comment on the authors’ training/validation strategy, and on the results I obtained using that strategy for the Amazon Reviews dataset.

For a given user, the authors split the dataset so that all but one of the items rated by that user are used for training, and the remaining rated item is used for testing. Then, during training, negative implicit feedback is represented by including N negative items (never rated) per positive item. For example, if a user has rated 10 movies and N=4, 50 observations will be used for that user during training (10 positives + 40 negatives). Note that the N negative samples are randomly re-sampled every epoch, so that the algorithm gets a more comprehensive view of the types of items that users don’t like (see here in the repo). During training, the prediction score is the result of applying the sigmoid function to the output layer, and the loss is the Binary Cross Entropy (BCE) loss.
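To make this concrete, here is a hypothetical sketch of the per-epoch sampling (the helper name and data structures are mine, not the repo’s):

```python
import numpy as np

def sample_train_epoch(train_pos, n_items, n_neg=4):
    """Per epoch, pair every (user, rated item) with n_neg randomly
    sampled never-rated items.

    train_pos: dict mapping each user to the set of items they rated.
    Returns arrays of users, items and binary labels (1=pos, 0=neg).
    """
    users, items, labels = [], [], []
    for u, rated in train_pos.items():
        for i in rated:
            users.append(u); items.append(i); labels.append(1.0)
            for _ in range(n_neg):
                j = np.random.randint(n_items)
                while j in rated:  # resample until we hit a never-rated item
                    j = np.random.randint(n_items)
                users.append(u); items.append(j); labels.append(0.0)
    return np.array(users), np.array(items), np.array(labels)
```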

During testing, a set of 100 items is used per user. One of them is the positive item that was not used during training, and the remaining 99 are randomly selected negative items. We then predict the score for these 100 items and compute the ranking metrics. In the case of NeuMF, these are the Hit Ratio (HR) and Normalised Discounted Cumulative Gain (NDCG) at k recommendations (with k=10 in this example).
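For a single user, these two metrics reduce to a few lines. The sketch below assumes the positive item sits at index 0 of the 100 scored items and ignores ties:

```python
import math

def hit_ratio_and_ndcg(scores, k=10):
    """scores: 1D tensor with the predicted scores for the 100 test items
    of one user, where index 0 is the held-out positive item."""
    # position of the positive item in the ranking (0 = top of the list)
    rank = (scores > scores[0]).sum().item()
    if rank < k:
        return 1.0, 1.0 / math.log2(rank + 2)  # HR@k, NDCG@k
    return 0.0, 0.0
```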

NeuMF can be run with or without GMF or MLP pre-trained weights, with varying embedding dimensions, numbers of MLP layers, etc. I have run a number of experiments (which can be found in the script run_experiments.sh in the repo), including the use of cyclical learning rates [5], and a discussion of the results can be found in this notebook. Let me briefly summarise here.

Fig 4. Ranking metrics vs Embedding dimension for the GMF and MLP models

Figure 4 shows the ranking metrics obtained for the GMF and the MLP models plotted against the embedding dimension (8, 16, 32 and 64). It is interesting to see that the better ranking metrics are obtained with the simplest model (i.e. GMF with a low embedding dimension). This suggests that good recommendations for this particular dataset can be attained with simple algorithms which, in general, is not an unusual result. Nonetheless, it is worth mentioning that the best overall results are obtained with the NeuMF algorithm using pre-trained weights, although the improvement is marginal. Of course, in “the real world” one might wonder whether the added complexity compensates for that marginal increase in ranking metrics.

Fig 5. Ranking metrics vs BCE Loss for the GMF and MLP models

Another interesting finding is shown in Fig 5, where the ranking metrics during testing are plotted against the BCE loss obtained during training. Here one can see that the BCE loss and the ranking metrics are positively correlated, while a priori one might expect the opposite behaviour (i.e. an anti-correlation: the lower the BCE loss, the higher the ranking metric). This is not extremely unusual and perhaps deserves a bit more attention. I will come back to this result towards the end of the post, once I have described Neural Graph Collaborative Filtering and its corresponding results.

2. Neural Graph Collaborative Filtering

This algorithm is a bit more complex than the previous one, so I will describe it in more detail. Nonetheless, to keep this post readable, I will limit the content to what I consider the minimum necessary to understand the algorithm. As always, please read the paper and, in this case, the references therein, in particular Semi-Supervised Classification with Graph Convolutional Networks [6] and Graph Convolutional Matrix Completion [7]. The code released by Thomas Kipf and co-authors for those papers is fantastic. Let’s move on to the algorithm: NGCF.

2.1. The Algorithm

Fig 6. Illustration of the user-item interaction up to order 3 (Figure 1 in their paper)

The left side of Figure 6 is an illustration of the user-item interaction graph and what the authors call high-order connectivity. The node/user u₁ is the target user to provide recommendations for. We can see that at first order (l-hop, with l=1), one would capture the information that the user interacted with items 1, 2 and 3. At 2nd order (2-hop), one would capture u₁ ← i₂ ← u₂ and u₁ ← i₃ ← u₃, and so on.

To the right we have the model scheme. In the author’s words: “we design a neural network method to propagate embeddings recursively on the graph. This is inspired by the recent developments of graph neural networks […], which can be seen as constructing information flows in the embedding space. Specifically, we devise an embedding propagation layer, which refines a user’s (or an item’s) embedding by aggregating the embeddings of the interacted items (or users). By stacking multiple embedding propagation layers, we can enforce the embeddings to capture the collaborative signal in high-order connectivities.”

Let’s try to add clarity to the previous paragraph through some math and code. The first building block one needs is, of course, representing the graph, i.e. building the Laplacian matrix, defined as:

$$L = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}, \qquad A = \begin{bmatrix} \mathbf{0} & R \\ R^T & \mathbf{0} \end{bmatrix}$$

Eq 2. Laplacian Matrix

where D is the diagonal degree matrix, A is the adjacency matrix and R is the rating matrix, in this case a binary matrix with 1/0 indicating whether or not the user rated a movie. Building the Laplacian matrix happens in load_data.py, within the utils module. There you will see that, following the authors’ original implementation, I build a number of different adjacency matrices to explore different scenarios (e.g. in some cases we consider self-connections, or use different decay factors between connected nodes).
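As an illustration, a bare-bones version of that construction with scipy could look like this (without the variants just mentioned, and assuming R is a sparse binary matrix):

```python
import numpy as np
import scipy.sparse as sp

def build_norm_laplacian(R):
    """R: (n_users x n_items) binary sparse rating matrix.
    Builds A = [[0, R], [R^T, 0]] and returns D^-1/2 A D^-1/2."""
    A = sp.bmat([[None, R], [R.T, None]], format="csr")
    deg = np.asarray(A.sum(axis=1)).flatten()
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.power(deg, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0  # isolated nodes have degree 0
    D_inv_sqrt = sp.diags(d_inv_sqrt)
    return (D_inv_sqrt @ A @ D_inv_sqrt).tocoo()
```

Once the Laplacian matrix is built, let’s move on to the model.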

Schematically, the model described in the quote above can be drawn like this:

Fig 7. Illustration of NGCF model architecture (Figure 2 in their paper).

The first mathematical operation within the embedding propagation layers (grey squares in Figure 7) is the “embedding aggregation”, i.e. aggregating the embeddings of all the interacted items/users per user/item. The result of that aggregation is then passed through a series of linear layers with LeakyReLU activations and concatenated with the initial embeddings. Finally, we directly compute the loss (in this case the BPR loss, see below). That is, the output of the forward pass in this particular implementation is directly the loss value. All this will become clearer once we go through the math and the code.

Formally, the model consists of two pieces: message construction and message aggregation.

Message construction is defined as:

$$m_{u \leftarrow i} = \frac{1}{\sqrt{|N_u||N_i|}} \left( W_1 e_i + W_2 \left( e_i \odot e_u \right) \right)$$

Eq 3. NGCF message construction

where ⊙ denotes element-wise multiplication, 1/√(|Nᵤ||Nᵢ|) is the graph Laplacian norm, and Nᵤ and Nᵢ are the first-hop neighbours of user u and item i. If you look at the code in load_data.py, you will see that this factor (the decay factor between two connected nodes) is already accounted for in our Laplacian matrix by construction. This means that once we reach the forward pass in our Pytorch implementation, we can focus only on the two terms within the parentheses in Equation 3.

It is also worth noting that here eᵢ and eᵤ are not the initial embeddings, but the aggregated embeddings, i.e. for user 1, eᵢ would be the aggregated embeddings of all the items that that user interacted with. Programmatically, this is simply achieved by multiplying the initial embeddings by the Laplacian matrix. We will see this later in the forward pass.

Message aggregation is defined as:

$$e_u^{(1)} = \mathrm{LeakyReLU} \left( m_{u \leftarrow u} + \sum_{i \in N_u} m_{u \leftarrow i} \right)$$

Eq 4. NGCF message aggregation

which is simply a LeakyReLU activation applied to the sum of all the constructed messages.

Once the messages are constructed and aggregated, we repeat the process as many times (i.e. layers) as we want. If you want to know more about the reasoning behind the formulation of the message construction and aggregation, please go to the paper. It is easy to read and understand. For what I need in this post, Equations 2, 3 and 4 will be enough.

Let’s now focus on the loss function used during training, the so-called Bayesian Personalised Ranking (BPR) loss [8]. In the NGCF paper it is defined as:

$$Loss = \sum_{(u,i,j) \in O} -\ln \sigma \left( \hat{y}_{ui} - \hat{y}_{uj} \right) + \lambda \lVert \Theta \rVert_2^2$$

Eq 5. BPR Loss

where O = {(u,i,j) | (u,i) ∈ R⁺, (u,j) ∈ R⁻} is the set of training triplets, with R⁺ and R⁻ corresponding to observed and unobserved interactions (aka positive and negative) respectively, σ is the sigmoid function and Θ denotes all trainable parameters.
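In Pytorch, and leaving aside the implementation details of the repo, the BPR loss can be sketched as follows:

```python
import torch.nn.functional as F

def bpr_loss(pos_scores, neg_scores, model, reg=1e-5):
    """pos_scores / neg_scores: predicted scores y_ui and y_uj for the
    positive (u, i) and negative (u, j) pairs in the batch."""
    # -ln sigma(y_ui - y_uj), averaged over the batch
    loss = -F.logsigmoid(pos_scores - neg_scores).mean()
    # L2 regularisation: lambda * ||Theta||^2 over all trainable parameters
    l2_reg = sum(p.pow(2).sum() for p in model.parameters())
    return loss + reg * l2_reg
```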

In summary, we have seen how to build the Laplacian matrix, how to construct and aggregate the messages, and how to compute the loss during training. We can now move to the code:
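Below is a minimal sketch that approximates the forward pass in the repo (names are mine; among other simplifications, self-connections are assumed to be included in the Laplacian by construction, and the fold-wise multiplication is collapsed into a single sparse matmul):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGCF(nn.Module):
    def __init__(self, n_users, n_items, Lap, n_emb=64,
                 layers=(64, 64), dropout=0.1):
        super().__init__()
        self.n_users, self.n_items = n_users, n_items
        self.Lap = Lap  # normalised sparse Laplacian, (n_users+n_items)^2
        # initial user and item embeddings, stored as a single matrix
        self.embeddings = nn.Parameter(
            torch.empty(n_users + n_items, n_emb).normal_(std=0.01))
        dims = [n_emb] + list(layers)
        # W1 and W2 in Equation 3, one pair per propagation layer
        self.W1 = nn.ModuleList(
            [nn.Linear(dims[k], dims[k + 1]) for k in range(len(layers))])
        self.W2 = nn.ModuleList(
            [nn.Linear(dims[k], dims[k + 1]) for k in range(len(layers))])
        self.dropout = nn.Dropout(dropout)  # "message dropout"

    def forward(self, users, pos_items, neg_items):
        # ego embeddings: users and items concatenated row-wise
        ego = self.embeddings
        all_embeddings = [ego]
        for W1, W2 in zip(self.W1, self.W2):
            # embedding aggregation: L @ E (done fold by fold in the repo)
            side = torch.sparse.mm(self.Lap, ego)
            # first term within the parentheses in Equation 3
            sum_emb = W1(side)
            # second term: element-wise product with the ego embeddings
            bi_emb = W2(side * ego)
            ego = self.dropout(F.leaky_relu(sum_emb + bi_emb))
            all_embeddings.append(F.normalize(ego, p=2, dim=1))
        # concatenate the embeddings of all propagation layers
        all_emb = torch.cat(all_embeddings, dim=1)
        users_emb, items_emb = all_emb.split([self.n_users, self.n_items])
        u = users_emb[users]
        pos_scores = (u * items_emb[pos_items]).sum(dim=1)
        neg_scores = (u * items_emb[neg_items]).sum(dim=1)
        # the output of the forward pass is directly the BPR loss
        return -F.logsigmoid(pos_scores - neg_scores).mean()
```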

NGCF Forward Pass

The code in this snippet is similar to the one in the repo, with a few extra components removed. Let’s comment a bit on this code to see how it relates to the mathematical expressions shown before:

  • The ego embeddings: the forward pass starts with a simple concatenation of the initial embeddings row-wise, which results in a matrix of dimensions (# users + # items, # embeddings). The authors call the result of this concatenation ego_embeddings because, formally, if we only used these embeddings we would only be considering information passed from a given node (aka focal node) to the nodes directly connected to it, i.e. ego-networks.
  • The embedding aggregation: per every hop (layer, or connectivity order), we multiply the Laplacian matrix by the current ego embeddings. The result of this matrix multiplication is the aggregated embeddings mentioned before, referred to as eᵤ and eᵢ in Equation 3. Because the Laplacian matrix is…BIG, in the repo we cannot do this at once, so we break it into “folds” (partitions by rows) and multiply sequentially.
  • The first term within the parentheses in Equation 3 is simply a linear layer applied to the aggregated embeddings.
  • The second term within the parentheses in Equation 3 is a linear layer applied to the element-wise product between the aggregated and the ego embeddings.

And from there on it is rather simple: we normalise the embeddings and concatenate them before computing the BPR loss.

Note that we also apply what the authors call “message dropout”, which is the usual nn.Dropout applied directly to the embeddings. They make a distinction between message dropout and node dropout, implementing the latter with tf.sparse_retain (see here) on the Laplacian matrix, which keeps a random subset of its non-zero entries. Personally, I would call this edge dropout, since node dropout would be, to my understanding, zeroing an entire row of the Laplacian matrix. Nonetheless, whether edge or node dropout, these are computationally very expensive. Therefore, even though I have implemented them and they are included in the code in my repo, I have not explored their effect when running experiments.
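For illustration, a Pytorch analogue of tf.sparse_retain could look like this (a sketch, not necessarily the implementation in the repo):

```python
import torch

def sparse_retain(sp_mat, keep_prob):
    """Randomly keep a fraction keep_prob of the non-zero entries of a
    sparse COO tensor (analogous to tf.sparse_retain)."""
    mask = torch.rand(sp_mat._nnz()) < keep_prob
    indices = sp_mat._indices()[:, mask]
    values = sp_mat._values()[mask] / keep_prob  # rescale the survivors
    return torch.sparse_coo_tensor(indices, values, sp_mat.shape)
```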

At this stage, I would like to pause for a second and focus on the NGCF forward pass in the snippet above. As we can see, until the very last step, where the final embeddings are sliced to select the users and items in the batch, the size of the batch does not matter: one has to build and execute the entire graph for all users and items. In this scenario static (aka declarative) frameworks, such as Tensorflow, are more suitable. When using static frameworks the graph is built once and is then executed end-to-end as the data flows through it (e.g. as we call the loss function). On the other hand, when using dynamic (aka imperative) frameworks the graph is built and executed in every forward pass. Therefore, while the latter have the advantage of being more flexible than the former, in problems like the one described here they are notably slower. For example, the only way for Pytorch to be “competitive” when running NGCF is to use huge batch sizes, so that the number of forward passes per epoch is reduced to a minimum. Still, the original TF implementation by the authors is faster.

2.2 Training/Validation strategy and results

In the case of NGCF, training follows a strategy that is standard when using the BPR loss. Every batch is comprised of (user, positive item, negative item) triplets, i.e. a batch size of 256 will contain 256 of these triplets, and the output of the forward pass is directly the BPR loss that we need to minimise.

Then, during testing, we rank, for each user, all the items that that user never rated. In this case, the ranking score is simply the dot product between the user and item embeddings. Note that since we are now ranking a large number of items per user (all the items the user did not interact with during training), the ranking metrics obtained here are significantly smaller than those obtained for NeuMF, which were computed within groups of 100 items per user.
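Schematically, and with hypothetical names, scoring and ranking at test time for one user reduces to:

```python
import torch

def recommend_for_user(user_emb, item_embs, train_items, k=10):
    """user_emb: (n_emb,) final embedding for one user.
    item_embs: (n_items, n_emb) final item embeddings.
    train_items: indices of the items the user rated during training."""
    scores = item_embs @ user_emb         # dot product with every item
    scores[train_items] = -float("inf")   # mask items seen during training
    return torch.topk(scores, k).indices  # top-k recommended items
```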

In the case of NGCF, I have only run 15 experiments, so this is by no means an exhaustive exploration of the parameter space. I have tried the RAdam optimiser [9], which is meant to lead to SOTA solutions in a number of problems. The truth is that when using RAdam I obtained the best BPR loss in the smallest number of epochs. However, as I will discuss just below, in these types of problems (recommendation algorithms) the best loss value does not always imply the best ranking metrics. Nonetheless, I find these results really promising and encouraging, and I am looking forward to trying RAdam in other projects.

In general, a summary of the results and a brief discussion can be found in this notebook. Let me include here just the following figure.

Fig 8. Ranking metrics vs BPR Loss

Figure 8 shows a trend for the ranking metrics to improve as the BPR loss decreases, as one might expect. However, it is far from a smooth trend. In fact, the second best result in terms of ranking metrics is obtained for the 6th best BPR loss value (see the notebook). This relates to my earlier comment about “loss vs ranking metrics” evaluation, and deserves perhaps a dedicated paragraph or two.

In general, when building a recommendation algorithm you can evaluate its performance as a classification/regression problem or as a ranking problem. The latter is more related to information retrieval effectiveness and is normally my preference. In the first place, because I believe it is a more robust measure of how the recommendation algorithm performs, and secondly, because ratings can sometimes be a bit “erratic”. For example, they might be influenced by the mood of the user that day, or by something that happened during the movie (the internet failed, or the site failed).

Also, you do not want to get “too good” at predicting ratings. In general you want your algorithm to have good coverage (i.e. covering as much of the item space as possible) and diversity (i.e. recommending items as diverse as possible that are still likely to be liked by the user). This also relates to the notions of novelty (recommending items that the user was not aware of) and serendipity (recommending unexpected items to the user). If your recommendations rely entirely on achieving the best loss when predicting explicit ratings, you will end up reducing coverage, diversity, novelty and serendipity, and ultimately, engagement.

If you want more details on how to evaluate recommendation algorithms, make sure you check, for example, Chapter 7 in this fantastic book [10].

And that’s it for now. In both cases, NCF and NGCF, I have included a script called get_embeddings.py where I use the learned embeddings and KNN to show that these embeddings “make sense”. That is, given a certain movie, the closest movies in the embedding space would be sensible recommendations.

3. Future Work

When I find some time I intend to review the notebooks related to Factorisation Machines [11] and Field Aware Factorisation Machines [12] using the latest release of the xlearn library. I believe they have solved a lot of the issues I faced when using the package back in the day, and even then it was already a very promising package, so I think it is worth giving it another go. In terms of adding more algorithms, the next in line is Mult-VAE [13]. As described in the excellent work of Ferrari Dacrema et al [14], Mult-VAE seems to be the only Deep Learning-based recommendation algorithm that actually performs better than simpler, non-DL techniques.

As always, any ideas or comments, please email me: jrzaurin@gmail.com

References:

[1] J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015

[2] R. He, J. McAuley. Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016

[3] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, Tat-Seng Chua. Neural Collaborative Filtering. WWW 2017. arXiv:1708.05031

[4] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng and Tat-Seng Chua. Neural Graph Collaborative Filtering. SIGIR 2019. arXiv:1905.08108

[5] Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. WACV 2017. arXiv:1506.01186

[6] Thomas N. Kipf and Max Welling. Semi-supervised Classification with Graph Convolutional Networks. ICLR 2017. arXiv:1609.02907

[7] Rianne van den Berg, Thomas N. Kipf, Max Welling. Graph Convolutional Matrix Completion. KDD 2018. arXiv:1706.02263

[8] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI. 452– 461.

[9] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han. On the Variance of the Adaptive Learning Rate and Beyond. arXiv:1908.03265

[10] Recommender Systems: The Textbook. Charu C. Aggarwal. Springer 2016

[11] Steffen Rendle. Factorization Machines. ICDM ’10 Proceedings of the 2010 IEEE International Conference on Data Mining

[12] Junwei Pan, Jian Xu, Alfonso Lobos Ruiz, Wenliang Zhao, Shengjun Pan, Yu Sun, Quan Lu. Field-weighted Factorization Machines for Click-Through Rate Prediction in Display Advertising. WWW 2018. arXiv:1806.03514

[13] Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, Tony Jebara: Variational Autoencoders for Collaborative Filtering. WWW 2018. arXiv:1802.05814

[14] Maurizio Ferrari Dacrema, Paolo Cremonesi, Dietmar Jannach: Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019). arXiv:1907.06902
