
Modern Recommender Systems

A Deep Dive into the AI algorithms that companies like Facebook and Google have built their business around.

Maximilian Beckers
Towards Data Science
15 min read · Jan 23, 2021

--

As recently as May 2019, Facebook open-sourced some of its recommendation approaches and introduced DLRM (the Deep Learning Recommendation Model). This blog post is meant to explain how and why DLRM and other modern recommendation approaches work so well, by looking at how they can be derived from previous results in the domain and by explaining their inner workings and intuitions in detail.

Personalized AI-based advertisement is the name of the game in online marketing these days and companies like Facebook, Google, Amazon, Netflix, and co are kings of the online marketing jungle because they have not only adopted this trend but have essentially invented it and built their entire business strategies around it. Netflix’s “other movies you might enjoy” or Amazon’s “Customers who bought this item also bought…” are just some examples of many in the online world.

So naturally, as the everyday Facebook and Google user that I am, I asked myself at some point:

“HOW EXACTLY DOES THIS THING WORK?“

And yes, we all know the basic movie recommendation example used to explain how collaborative filtering/matrix factorization works. I am also not talking about the approach of training a straightforward classifier per user that outputs a probability of whether or not that user likes a certain product. Those two approaches, collaborative filtering and content-based recommendation, certainly yield usable predictions, but Google, Facebook, and co. surely have something better up their sleeves, otherwise they wouldn't be where they are today.

In order to understand where today's high-end recommendation systems come from, we have to take a look at two of the basic approaches to the problem of

predicting how much a certain user likes a certain item.

which in the online-marketing world amounts to predicting click-through rates (CTR) for possible ads, based on explicit feedback such as ratings and likes, as well as implicit feedback such as clicks, search histories, comments, or website visits.

Content-based filtering vs. Collaborative-filtering

1. Content-based filtering

Loosely speaking, content-based recommendation means predicting whether a user likes a certain product by using the user's own online history. That includes, among other things, likes the user gave (e.g. on Facebook), keywords he or she searched for (e.g. on Google), and simply the clicks and visits he or she made to certain websites. All in all, it focuses on the user's own preferences. We can, for example, think of a simple binary classifier (or regressor) that outputs a click-through rate (or rating) for a certain ad group for this user.
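To make this concrete, here is a minimal sketch of such a per-user content-based classifier (the feature layout, dimensionality, and model choice are illustrative assumptions, not from any of the papers discussed here): a logistic regression over the user's history features that outputs a click probability.

```python
import torch
import torch.nn as nn

# Content-based CTR sketch: a logistic regression over one user's own history
# features (past clicks, likes, searched keywords, ...). Feature layout and
# dimensionality are hypothetical.
n_features = 1000                              # e.g. bag-of-keywords + click counts
model = nn.Sequential(
    nn.Linear(n_features, 1),                  # one weight per history feature
    nn.Sigmoid(),                              # output = predicted click-through rate
)

user_history = torch.rand(1, n_features)       # dummy feature vector for one user
ctr = model(user_history)                      # probability that this user clicks the ad
```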

2. Collaborative-filtering

Collaborative filtering, however, tries to predict whether a user might like a certain product by looking at the preferences of similar users. Here we can think of the standard matrix factorization (MF) approach for movie recommendations, where the ratings matrix gets factorized into one embedding matrix for the users and one for the movies.
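As a point of reference for the rest of the article, here is a minimal sketch of such a classic MF model (dimensions and the absence of bias terms are simplifying assumptions): one embedding table for users, one for movies, and a dot product as the predicted rating.

```python
import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    """Classic MF sketch: rating ≈ dot product of user and movie embeddings."""
    def __init__(self, n_users, n_movies, k=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, k)    # one k-dim vector per user
        self.movie_emb = nn.Embedding(n_movies, k)  # one k-dim vector per movie

    def forward(self, user_ids, movie_ids):
        u = self.user_emb(user_ids)                 # (batch, k)
        v = self.movie_emb(movie_ids)               # (batch, k)
        return (u * v).sum(dim=1)                   # elementwise product + sum = dot product

# Dummy usage: predict ratings for two (user, movie) pairs
model = MatrixFactorization(n_users=1000, n_movies=500)
ratings = model(torch.tensor([3, 42]), torch.tensor([7, 99]))
```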

A disadvantage of classic MF is that we cannot use any side features (e.g. movie genre, release date, etc.); the MF has to learn them itself from the existing interactions. MF also suffers from the so-called "cold start problem": a new movie that hasn't been rated by anyone yet cannot be recommended. Content-based filtering solves these two issues, but it lacks the predictive power of looking at similar users' preferences.

The complementary advantages and disadvantages of the two approaches make the need for a hybrid model very clear: one that somehow combines both ideas.

Hybrid recommendation models

1. Factorization Machine

One idea, introduced by Steffen Rendle in 2010, is the Factorization Machine. It provides the basic mathematical approach for combining matrix factorization with regression:

ŷ(x) = w₀ + Σᵢ₌₁ⁿ wᵢxᵢ + Σᵢ₌₁ⁿ Σⱼ₌ᵢ₊₁ⁿ ⟨vᵢ, vⱼ⟩ xᵢxⱼ

where the model parameters that need to be estimated during learning are:

w₀ ∈ ℝ, w ∈ ℝⁿ, V ∈ ℝⁿˣᵏ

and ⟨∙, ∙⟩ is the dot product between two vectors vᵢ and vⱼ in ℝᵏ, which can be seen as rows of V.

It is pretty straightforward to see how this equation makes sense when looking at an example of how the data x that gets fed into this model is represented. Let's have a look at the example described in Steffen Rendle's paper on Factorization Machines:

Imagine having the following transaction data on movie reviews where users give ratings to movies at a certain time:

  • user u ∈ U = {Alice (A), Bob (B), . . .}
  • movie (item) i ∈ I = {Titanic (TI), Notting Hill (NH), Star Wars (SW), Star Trek (ST), . . .}
  • rating r ∈ {1,2,3,4,5} at time t ∈ ℝ
Fig. 1 from S. Rendle, "Factorization Machines", 2010 IEEE International Conference on Data Mining, 2010.

Looking at the figure above we can see the data setup for a hybrid recommendation model. Both the sparse features that represent the user and the item as well as any additional meta or side information (e.g. “Time” or “Last Movie Rated” in this example) are part of a feature vector x that gets mapped to a target y. Now the key is how they are processed by the model.

  • The regression part of the FM handles both the sparse data (e.g. “User”) as well as the dense data (e.g. “Time”) like a standard regression task and thus can be interpreted as the content-based filtering approach within the FM.
  • The MF part of the FM accounts for the interactions between feature blocks (e.g. the interaction between "User" and "Movie"), where the matrix V can be interpreted as the embedding matrix used in collaborative filtering approaches. These cross user–movie relationships bring us insights such as:

a user i who has a similar embedding vᵢ (representing his or her preferences for movie attributes) to another user j with embedding vⱼ might very well like similar movies as user j.

Adding the two predictions of the regression part and the MF part together and learning their parameters simultaneously in one cost function leads to the hybrid FM model that now uses a “best of both worlds” approach to making a recommendation for a user.
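The FM equation above translates almost directly into code. Here is a minimal sketch (feature count and embedding size are illustrative assumptions), using Rendle's reformulation of the pairwise term so it can be computed in O(kn):

```python
import torch
import torch.nn as nn

class FactorizationMachine(nn.Module):
    """FM sketch: ŷ(x) = w₀ + Σᵢ wᵢxᵢ + Σᵢ Σⱼ>ᵢ ⟨vᵢ, vⱼ⟩ xᵢxⱼ."""
    def __init__(self, n_features, k=8):
        super().__init__()
        self.w0 = nn.Parameter(torch.zeros(1))                       # global bias
        self.w = nn.Linear(n_features, 1, bias=False)                # regression part
        self.V = nn.Parameter(torch.randn(n_features, k) * 0.01)     # embedding rows vᵢ

    def forward(self, x):                                            # x: (batch, n_features)
        linear = self.w(x).squeeze(-1)
        # Rendle's identity: Σᵢ Σⱼ>ᵢ ⟨vᵢ,vⱼ⟩xᵢxⱼ = ½ Σ_f [(Σᵢ vᵢ_f xᵢ)² − Σᵢ vᵢ_f² xᵢ²]
        xv = x @ self.V                                              # (batch, k)
        x2v2 = (x ** 2) @ (self.V ** 2)                              # (batch, k)
        pairwise = 0.5 * (xv ** 2 - x2v2).sum(dim=1)
        return self.w0 + linear + pairwise                           # raw rating / CTR score

# Dummy usage on a batch of 4 feature vectors of the kind shown in Fig. 1
model = FactorizationMachine(n_features=14)
y_hat = model(torch.rand(4, 14))
```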

This hybrid approach of a Factorization Machine at first glance already seems to be a perfect “best of both worlds” model, however, as many different AI fields like NLP or computer vision have proven in the past:

“Throw it in a Neural Net and you will make it even better”

2. Wide and Deep, Neural Collaborative Filtering (NCF) and Deep Factorization Machines (DeepFM)

We will first have a look at how collaborative filtering can be solved with a neural net by looking at the NCF paper. This will lead us to Deep Factorization Machines (DeepFM), a neural net version of factorization machines. We will see why they are superior to regular FMs and how to interpret the neural net architecture. We will also see how DeepFM was developed as an improvement over the previously released Wide&Deep model by Google, one of the first major breakthroughs of deep learning in recommendation systems. This will finally lead us to the aforementioned DLRM paper, released by Facebook in 2019, which can be seen as a simplification and slight adjustment of DeepFM.

NCF

In 2017 a group of researchers released their work on Neural Collaborative Filtering. It contains a generalized framework for learning the functional relationship modeled by matrix factorization in collaborative filtering with a neural network. The authors also explained how to achieve higher-order interactions (MF is only order 2) and how to fuse the two approaches together.

The general idea is that a neural network can (in theory) learn any functional relationship. That means that the relationship a collaborative filtering model expresses with its MF can also be learned by a neural net. NCF proposes a simple embedding layer for both users and items (similar to standard MF), followed by a straightforward multi-layer perceptron that learns the MF-style interaction between the two embeddings.

Fig. 2 from "Neural Collaborative Filtering" by X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, Proceedings of the 26th International Conference on World Wide Web, 2017.

The advantage of this approach lies in the non-linearity of the MLP. The simple dot product used in MF will always limit the model to learning interactions of degree 2, whereas a neural net with X layers can in theory learn interactions of a much higher degree. Think of 3 categorical features that all have an interaction, like male, teenager, and RPG computer games for example.

In real-world problems, we don't just use binarized user and item vectors as raw input to our embeddings but obviously include various other meta or side information that might be valuable (e.g. age, country, audio/text recordings, timestamp, …), so in reality we have a very high-dimensional, highly sparse, mixed continuous–categorical dataset. At this point, the neural net presented in Fig. 2 could just as well be interpreted as a content-based recommender in the form of a simple binary classification feed-forward neural net. And this interpretation is key to understanding how it ends up being a hybrid approach between CF and content-based recommendation. The network can in fact learn any functional relationship: interactions in the CF sense of degree 3 or higher, e.g. x₁ ∙ x₂ ∙ x₃, as well as any non-linear transformation in the classical neural net classification sense of the form σ( … σ(w₁x₁ + w₂x₂ + w₃x₃ + b)).
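A minimal sketch of this NCF idea (layer sizes and embedding dimensions are made-up assumptions): user and item IDs are looked up in embedding tables, concatenated, and passed through an MLP ending in a sigmoid.

```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    """NCF sketch: an MLP replaces the MF dot product between user and item embeddings."""
    def __init__(self, n_users, n_items, k=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, k)
        self.item_emb = nn.Embedding(n_items, k)
        self.mlp = nn.Sequential(                      # learns the user–item interaction
            nn.Linear(2 * k, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),            # probability of an interaction / click
        )

    def forward(self, user_ids, item_ids):
        z = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=1)
        return self.mlp(z).squeeze(-1)

# Dummy usage for two (user, item) pairs
model = NCF(n_users=1000, n_items=500)
p = model(torch.tensor([3, 42]), torch.tensor([7, 99]))
```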

Equipped with the power of learning high-order interactions, we can specifically make it easy for our model to also learn the low-order interactions of order 1 and 2 by combining the neural net with a model that is well known to learn such low-order interactions: the Factorization Machine. That's exactly what the authors of DeepFM proposed in their paper. This idea of simultaneously learning high- and low-order feature interactions is the key part of many modern recommender systems and can be found in some form or another in almost every network architecture proposed in the industry.

DeepFM

DeepFM is a mixed approach between an FM and a deep neural network that share the same input embedding layer. Raw features are transformed such that continuous fields are represented by themselves and categorical fields are one-hot encoded. The final (e.g. CTR) prediction, given by the last layer of the network, is defined as:

ŷ = σ(y_FM + y_DNN)

which is a sigmoid-activated sum of the two network components: the FM component and the Deep component.

The FM component is a regular Factorization Machine dressed up in neural net architecture style:

Fig.2 from Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: a factorization machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.

The Addition part of the FM layer gets the raw input vector x directly (Sparse Features layer) and multiplies each element with its weight ("Normal Connection") before summing them up. The Inner Product part of the FM layer also gets the raw inputs x, but only after they have been passed through the embedding layer, and simply takes the dot product between the embedding vectors without any weights ("Weight-1 Connection"). Adding the two parts together through another "Weight-1 Connection" yields the aforementioned FM equation:

y_FM = ⟨w, x⟩ + Σᵢ₌₁ⁿ Σⱼ₌ᵢ₊₁ⁿ ⟨vᵢ, vⱼ⟩ xᵢxⱼ

The xᵢxⱼ multiplication in this equation is only needed to be able to write the sum over i = 1 through n; it isn't really part of the neural network computation. The network automatically knows which embedding vectors vᵢ, vⱼ to take the dot product between, due to the embedding layer architecture.

This embedding layer architecture looks as follows:

Fig.4 from Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: a factorization machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.

with Vᵖ being the embedding matrix for each field p = {1, …, m}, with k columns and as many rows as the binarized version of the field has elements. The output of the embedding layer is thus given as:

a⁽⁰⁾ = [e₁, e₂, …, eₘ]

with eₚ the embedding of field p. It is important to note that this is not a fully connected layer: there is no connection between any field's raw inputs and any other field's embedding. Think of it this way: the one-hot encoded vector for gender (e.g. (0,1)) cannot have anything to do with the embedding vector for weekday (e.g. the raw binarized weekday "Tuesday" (0,1,0,0,0,0,0) and its embedding vector with, say, k=4: (12,4,5,9)).
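To illustrate this field-wise embedding layer, here is a small sketch (field names, vocabulary sizes, and k are made-up values): each field has its own embedding table, and the layer output is simply the list of looked-up vectors, with no connections across fields.

```python
import torch
import torch.nn as nn

# Field-wise embedding layer sketch: one table Vᵖ per categorical field, all
# mapping to the same embedding size k. Field names and vocab sizes are made up.
k = 4
field_sizes = {"gender": 2, "weekday": 7, "ad_category": 300}
embeddings = nn.ModuleDict({f: nn.Embedding(n, k) for f, n in field_sizes.items()})

# One sample: gender index 1, weekday "Tuesday" = index 1, ad category index 42
sample = {"gender": torch.tensor([1]),
          "weekday": torch.tensor([1]),
          "ad_category": torch.tensor([42])}

# Output of the embedding layer, a⁽⁰⁾ = [e₁, e₂, …, eₘ]: one k-dim vector per field.
# No field's raw input ever touches another field's embedding table.
a0 = [embeddings[f](idx) for f, idx in sample.items()]
```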

The FM component being a Factorization Machine reflects the high importance of both order 1 and order 2 interactions, which are directly added to the Deep component output and fed into the sigmoid activation in the final layer.

The Deep Component is proposed to be any deep neural net architecture in theory. The authors specifically took a look at a regular feed-forward MLP neural net (as well as a so-called PNN). The regular MLP is given by the following figure:

Fig.3 from Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: a factorization machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.

a standard MLP network with an embedding layer between the raw data (highly sparse due to the one-hot encoded categorical input) and the following neural net layers, given as:

a⁽ˡ⁺¹⁾ = σ(W⁽ˡ⁾ a⁽ˡ⁾ + b⁽ˡ⁾)

with σ the activation function, W the weight matrix, a the activation of the previous layer, and b the bias.

This yields the overall DeepFM network architecture, with the parameters:

  • latent vector Vᵢ to measure impact of feature i’s interactions with other features (Embedding layer)
  • Vᵢ gets passed to the FM component to model order-2 interactions (FM Component)
  • wᵢ weighting the order 1 importance of raw feature i (FM Component)
  • Vᵢ also gets passed to the Deep component to model all higher-order interactions (>2) (Deep Component)
  • Wˡ and bˡ, the neural net’s weights and biases (Deep Component)

The key to getting both high and low order interactions simultaneously is training all parameters at the same time under one cost function, specifically using the same embedding layer for both the FM as well as the Deep component.
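To summarize the architecture in code, below is a hedged sketch of a DeepFM forward pass (purely categorical inputs for brevity; field sizes, embedding size, and MLP widths are made-up assumptions): the same field embeddings feed both the FM component and the Deep component, and the two outputs are summed and squashed by a sigmoid.

```python
import torch
import torch.nn as nn

class DeepFM(nn.Module):
    """DeepFM sketch: shared field embeddings feed both the FM and the Deep component."""
    def __init__(self, field_sizes, k=8):
        super().__init__()
        self.emb = nn.ModuleList([nn.Embedding(n, k) for n in field_sizes])   # shared Vᵢ
        self.lin = nn.ModuleList([nn.Embedding(n, 1) for n in field_sizes])   # order-1 weights wᵢ
        m = len(field_sizes)
        self.mlp = nn.Sequential(nn.Linear(m * k, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):                     # x: (batch, m) integer indices, one per field
        e = torch.stack([emb(x[:, i]) for i, emb in enumerate(self.emb)], dim=1)  # (batch, m, k)
        # FM component: order-1 weights plus pairwise dot products between field embeddings
        order1 = torch.stack([lin(x[:, i]) for i, lin in enumerate(self.lin)], dim=1).sum(dim=(1, 2))
        order2 = 0.5 * ((e.sum(dim=1) ** 2) - (e ** 2).sum(dim=1)).sum(dim=1)
        # Deep component: the very same embeddings, concatenated and fed through an MLP
        deep = self.mlp(e.flatten(start_dim=1)).squeeze(-1)
        return torch.sigmoid(order1 + order2 + deep)                  # ŷ = σ(y_FM + y_DNN)

# Dummy usage: two samples with three categorical fields each
model = DeepFM(field_sizes=[2, 7, 300])
y_hat = model(torch.tensor([[1, 3, 42], [0, 6, 7]]))
```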

Comparison to Wide&Deep and NeuMF

There are many variations one can dream up on how to tweak this architecture to potentially make it even better. At their core, however, they are all similar in their hybrid approach to modeling high- and low-order interactions simultaneously. The authors of DeepFM also proposed interchanging the MLP part with a so-called PNN, a deep neural network that gets the FM layer, combined with the embedding layer, as its initial input.

The authors of the NCF paper also came up with a similar architecture, which they called NeuMF ("Neural Matrix Factorization"). Instead of having an FM as the low-order component, they used a regular matrix factorization fed into an activation function. This approach, however, lacks the specific order-1 interactions modeled by the linear part of the FM. The authors also specifically allowed the model to learn separate user and item embeddings for the matrix factorization and the MLP part.

As mentioned before, Google’s research team was one of the first to propose a neural network for a hybrid recommendation approach. DeepFM can be thought of as a further development of Google’s Wide&Deep algorithm that looks like this:

Fig.1 from Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. Wide & deep learning for recommender systems. In Proc. 1st Workshop on Deep Learning for Recommender Systems, pages 7–10, 2016.

The right side is our well-known MLP with an embedding layer; the left side, however, has different, manually engineered inputs that are fed directly into the final overall output unit. The low-order interactions in the form of dot product operations are hidden in these manually engineered features, which the authors say can be many different things, for example the cross-product transformation:

φₖ(x) = Πᵢ₌₁ᵈ xᵢ^(cₖᵢ),  cₖᵢ ∈ {0, 1}

which captures the interactions between d features (with or without a previous embedding) by cross-multiplying them with each other (the exponent cₖᵢ equals 1 if xᵢ is part of the k-th transformation, and 0 otherwise).
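As an illustration of that cross-product idea on binary features (the feature indices and crosses below are made up), each engineered wide feature fires only when every feature in its cross is active:

```python
import torch

# Cross-product transformation sketch: φₖ(x) = Πᵢ xᵢ^(cₖᵢ) with cₖᵢ ∈ {0, 1}.
# x is a binary feature vector; each "cross" lists the indices whose features
# must all be active for the engineered wide feature to fire.
def cross_product_features(x, crosses):
    # x: (batch, d) binary tensor; crosses: list of index tuples, e.g. [(0, 2), (1, 3, 4)]
    return torch.stack([x[:, list(c)].prod(dim=1) for c in crosses], dim=1)

x = torch.tensor([[1., 0., 1., 1., 0.]])
wide_input = cross_product_features(x, crosses=[(0, 2), (1, 3, 4)])   # -> [[1., 0.]]
```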

It is easy to see why DeepFM is an improvement: it does not require any a priori feature engineering and is able to learn low- and high-order interactions from exactly the same input data, all sharing one common embedding layer. DeepFM really has the FM model as part of its core network, whereas Wide&Deep does not compute dot products inside the actual neural net but beforehand, in a feature engineering step.

3. DLRM — Deep Learning Recommendation Model

So with all these different options from Google, Huawei (the research team behind the DeepFM architecture), and others, let’s take a look at how Facebook views things. They came out with their DLRM paper in 2019, which focuses a lot on the practical side of these models: the parallel training setup, GPU computing, and the different handling of continuous versus categorical features.

The DLRM architecture is described in the figure below and works as follows: categorical features are each represented by an embedding vector, while continuous features are processed by an MLP so that they have the same length as the embedding vectors. In a second stage, the dot products between all pairs of embedding vectors and processed (MLP output) dense vectors are computed. The dot products are then concatenated with the MLP output of the dense features, passed through another MLP, and finally fed into a sigmoid function to give a click probability.

The DLRM network as described in the DLRM paper. Figure by Max Beckers.
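Below is a hedged sketch of this forward pass (layer sizes are made up and the interaction step is simplified compared to Facebook's actual implementation): a bottom MLP brings the dense features to the embedding length, dot products are taken between all pairs of embedding and processed dense vectors, and a top MLP plus sigmoid turns the concatenation into a click probability.

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Simplified DLRM sketch: bottom MLP, per-field embeddings, pairwise dots, top MLP."""
    def __init__(self, n_dense, field_sizes, k=8):
        super().__init__()
        self.bottom = nn.Sequential(nn.Linear(n_dense, 16), nn.ReLU(), nn.Linear(16, k))
        self.emb = nn.ModuleList([nn.Embedding(n, k) for n in field_sizes])
        m = len(field_sizes) + 1                       # embedding vectors + processed dense vector
        n_pairs = m * (m - 1) // 2                     # number of pairwise dot products
        self.top = nn.Sequential(nn.Linear(k + n_pairs, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, dense, cats):                    # dense: (batch, n_dense), cats: (batch, m-1)
        d = self.bottom(dense)                         # dense features brought to length k
        vecs = [d] + [emb(cats[:, i]) for i, emb in enumerate(self.emb)]
        z = torch.stack(vecs, dim=1)                   # (batch, m, k)
        dots = z @ z.transpose(1, 2)                   # all pairwise dot products, (batch, m, m)
        iu = torch.triu_indices(len(vecs), len(vecs), offset=1)
        interactions = dots[:, iu[0], iu[1]]           # keep each pair once, (batch, n_pairs)
        out = self.top(torch.cat([d, interactions], dim=1))
        return torch.sigmoid(out).squeeze(-1)          # predicted click probability

# Dummy usage: two samples with 3 dense features and 3 categorical fields
model = TinyDLRM(n_dense=3, field_sizes=[2, 7, 300])
p = model(torch.rand(2, 3), torch.tensor([[1, 3, 42], [0, 6, 7]]))
```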

This DLRM proposal is somewhat of a simplified and modified version of DeepFM in the sense that it also uses dot product computations between embedding vectors, but it specifically tries to stay away from high-order interactions by not directly forcing the embedded categorical features through an MLP. The design is tailored to mimic the way Factorization Machines compute the second-order interactions between the embeddings. We can think of the entire DLRM setup as the specialized part of DeepFM, the FM component. The classical Deep component of DeepFM, which gets added to the outcome of the FM component in the final layer (and then fed into a sigmoid function), can be seen as completely omitted in the DLRM setup. The theoretical advantages of DeepFM are clear, as it is, by design, better equipped to learn high-order interactions; however, according to Facebook:

“… higher-order interactions beyond second-order found in other networks may not necessarily be worth the additional computational/memory cost”

4. Outlook and Coding

Having introduced various deep recommendation approaches, their intuitions, and their theoretical pros and cons, I had a look at the PyTorch implementation of DLRM provided on Facebook’s GitHub page.

I checked out the details of the implementation and tried out the predefined dataset APIs built in to handle different raw datasets directly. Both Criteo’s Kaggle display advertising challenge dataset and their Terabyte dataset are pre-implemented and can be downloaded and then used to train a full DLRM with just one bash command (see the DLRM repo for instructions). I then extended Facebook’s DLRM model API to include preprocessing and data loading steps for another dataset, the 2020 DIGIX Advertisement CTR Prediction. Please check it out here.

In a similar fashion, after downloading and unzipping the DIGIX data, you can now train a model on this data with a single bash command as well. All the preprocessing steps, embedding shapes, and neural net architecture parameters are adjusted to handle the DIGIX dataset. A notebook that takes you through the commands can be found here. The model delivers some decent results, and I am continuing to work on improving its performance by better understanding the raw data and the advertisement process behind the DIGIX data. Specific data cleaning, hyperparameter tuning, and feature engineering are all things I would like to work on further; they are mentioned in the notebook. The first goal was simply to have a technically sound extension of the DLRM model API that can use the raw DIGIX data as input.

All in all, I believe hybrid deep models are one of the most powerful tools for solving recommendation tasks. However, there have recently been some seriously interesting and creative unsupervised approaches to solving collaborative filtering problems using autoencoders. So at this point, I can only guess what the large internet giants are using today to feed us the ads we are most likely to click on. I assume it could very well be a combination of the aforementioned autoencoder approach and some form of the deep hybrid models presented in this article.

References

Steffen Rendle. Factorization machines. In Proc. 2010 IEEE International Conference on Data Mining, pages 995–1000, 2010.

Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proc. 26th Int. Conf. World Wide Web, pages 173–182, 2017.

Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: a factorization machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.

Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proc. of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1754–1763. ACM, 2018.

Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. Wide & deep learning for recommender systems. In Proc. 1st Workshop on Deep Learning for Recommender Systems, pages 7–10, 2016.

M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy. Deep learning recommendation model for personalization and recommendation systems. CoRR, abs/1906.00091, 2019. Available: http://arxiv.org/abs/1906.00091
