RecSys Series

Recommendation System Series Part 4: The 7 Variants of Matrix Factorization For Collaborative Filtering

Mathematical Deep Dive Into Matrix Factorization

James Le
Towards Data Science
12 min read · Feb 24, 2020


Update: This article is part of a series where I explore recommendation systems in academia and industry. Check out the full series: Part 1, Part 2, Part 3, Part 4, Part 5, and Part 6.

Collaborative filtering lies at the heart of most modern recommendation systems and has seen considerable success at companies like Amazon, Netflix, and Spotify. It works by collecting human judgments (known as ratings) for items in a given domain and matching people who share the same information needs or tastes. Users of a collaborative filtering system share their judgments and opinions about the items they consume so that other users can better decide which items to consume. In return, the system provides useful personalized recommendations for new items.

The two primary areas of collaborative filtering are (1) neighborhood methods and (2) latent factor models.

  • Neighborhood methods focus on computing the relationships between items or between users. This approach evaluates a user’s preference for an item based on ratings of neighboring items by the same user. An item’s neighbors are other products that tend to get similar ratings when rated by the same user.
  • Latent factor methods explain the ratings by characterizing both items and users on many factors inferred from the rating pattern. For example, in music recommendation, the discovered factors might measure precise dimensions such as hip-hop versus jazz, amount of high notes, or length of the song, as well as the less well-defined dimensions such as the meaning behind the lyrics, or completely uninterpretable dimensions. For users, each factor measures how much the user likes songs that score high on the corresponding song factor.

Some of the most successful latent factor models are based on matrix factorization. In its natural form, matrix factorization characterizes items and users using vectors of factors inferred from item rating patterns. High correspondence between item and user factors leads to a recommendation.

Sharmistha Chatterjee — Overview of Matrix Factorization Techniques using Python (https://towardsdatascience.com/overview-of-matrix-factorisation-techniques-using-python-8e3d118a9b39)

In this post and those to follow, I will be walking through the creation and training of recommendation systems, as I am currently working on this topic for my Master’s thesis.

  • Part 1 provided a high-level overview of recommendation systems, how they are built, and how they can be used to improve businesses across industries.
  • Part 2 provided a careful review of the ongoing research initiatives about the strengths and application scenarios of these models.
  • Part 3 provided a couple of research directions that might be relevant to the recommendation system scholar community.

In part 4, I dig into the nitty-gritty mathematical details of matrix factorization, arguably the most common baseline model for recommendation system research these days. More specifically, I will show you the seven variants of matrix factorization that can be constructed — ranging from the use of side features to the application of Bayesian methods.

1 — Vanilla Matrix Factorization

A straightforward matrix factorization model maps both users and items to a joint latent factor space of dimensionality D — such that user-item interactions are modeled as inner products in that space.

  • Accordingly, each item i is associated with a vector q_i, and each user u is associated with a vector p_u.
  • For a given item i, the elements of q_i measure the extent to which the item possesses those factors, positive or negative.
  • For a given user u, the elements of p_u measure the extent of interest the user has in items that are high on the corresponding factors, positive or negative.
  • The resulting dot product (q_i * p_u) captures the interaction between user u and item i, which is the user’s overall interest in the item’s characteristics.

Thus, we have equation 1 as follows:
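
With \hat{r}_{ui} denoting the predicted rating of user u for item i,

\hat{r}_{ui} = q_i^T p_u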

The big challenge is to compute the mapping of each item and user to factor vectors q_i and p_u. Matrix factorization does this by minimizing the regularized squared error on the set of known ratings, as seen in equation 2 below:
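
Here K is the set of (u, i) pairs for which the rating r_{ui} is known, and \lambda controls the strength of the regularization:

\min_{q, p} \sum_{(u,i) \in K} (r_{ui} - q_i^T p_u)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2)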

The model is learned by fitting the previously observed ratings. However, the goal is to generalize those previous ratings in a way that predicts future, unknown ratings. Thus, we want to avoid overfitting the observed data, which we do by adding an L2 regularization penalty on the learned parameters and optimizing them with stochastic gradient descent.

This article by Shay Palachy does a great job explaining the intuition behind regularization; here are the quick notes based on it:

  • When we use SGD to fit a model’s parameters to the learning problem at hand, at each iteration of the algorithm we take a step in the solution space in the direction opposite to the gradient of the loss function with respect to the model’s parameters. Since the user-item interaction matrix in our recommendation setting is very sparse, this method of learning can overfit to the training data.
  • L2 regularization (also known as Tikhonov regularization, and the basis of ridge regression) is a specific way of regularizing a cost function by adding a complexity-representing term. Here, that term is the squared Euclidean norm of the user and item latent factors. A parameter \lambda controls the strength of the regularization.
  • Adding the L2 term usually results in much smaller parameters across the entire model.

Let’s see what this looks like in code:
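
What follows is a minimal PyTorch-style sketch of the model and loss from equations 1 and 2; the class and variable names are illustrative rather than copied from the linked repository.

```python
import torch
import torch.nn as nn

class VanillaMF(nn.Module):
    """Vanilla matrix factorization: r_hat = <p_u, q_i> (equation 1)."""
    def __init__(self, n_users, n_items, n_factors=40):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)  # p_u
        self.item_factors = nn.Embedding(n_items, n_factors)  # q_i

    def forward(self, user, item):
        p_u = self.user_factors(user)      # (batch, D)
        q_i = self.item_factors(item)      # (batch, D)
        return (p_u * q_i).sum(dim=1)      # dot product per example

def mf_loss(model, user, item, rating, lam=1e-3):
    """Regularized squared error (equation 2); lam is the L2 weight."""
    pred = model(user, item)
    mse = ((rating - pred) ** 2).mean()
    l2 = (model.user_factors(user).pow(2).sum(1)
          + model.item_factors(item).pow(2).sum(1)).mean()
    return mse + lam * l2
```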

The full experiment for this model can be accessed here: https://github.com/khanhnamle1994/transfer-rec/tree/master/Matrix-Factorization-Experiments/Vanilla-MF

2 — Matrix Factorization with Biases

One benefit of the matrix factorization approach to collaborative filtering is its flexibility in dealing with various data aspects and other application-specific requirements. Recall that equation 1 attempts to capture the interactions between users and items that produce different rating values. However, much of the observed variation in rating values is due to effects associated with either users or items, known as biases, independent of any interactions. The intuition is that some users systematically give higher ratings than others, and some items systematically receive higher ratings than others.

Thus, we can extend equation 1 to equation 3 as follows:
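
Using the bias terms defined in the list just below,

\hat{r}_{ui} = b + w_u + w_i + q_i^T p_u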

  • The global bias, which captures the overall average rating, is denoted by b.
  • The parameters w_i and w_u indicate the observed deviations of item i and user u from the average, respectively.
  • Note that the observed ratings are broken down into 4 components: (1) user-item interaction, (2) global average, (3) item bias, and (4) user bias.

The model is learned by minimizing a new squared error function, as seen in equation 4 below:
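
As in equation 2, the sum runs over the known ratings in K, and the learned parameters are L2-regularized:

\min_{q, p, w} \sum_{(u,i) \in K} (r_{ui} - b - w_u - w_i - q_i^T p_u)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2 + w_u^2 + w_i^2)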

Let’s see what this looks like in code:
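
Below is a minimal PyTorch-style sketch of equation 3, with the bias terms added as one-dimensional embeddings; this is a simplified illustration, not the exact code from the linked repository.

```python
import torch
import torch.nn as nn

class MFBiases(nn.Module):
    """MF with biases: r_hat = b + w_u + w_i + <p_u, q_i> (equation 3)."""
    def __init__(self, n_users, n_items, n_factors=40):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)  # p_u
        self.item_factors = nn.Embedding(n_items, n_factors)  # q_i
        self.user_bias = nn.Embedding(n_users, 1)              # w_u
        self.item_bias = nn.Embedding(n_items, 1)              # w_i
        self.global_bias = nn.Parameter(torch.zeros(1))        # b

    def forward(self, user, item):
        interaction = (self.user_factors(user) * self.item_factors(item)).sum(dim=1)
        return (self.global_bias
                + self.user_bias(user).squeeze(1)
                + self.item_bias(item).squeeze(1)
                + interaction)
```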

The full experiment for this model can be accessed here: https://github.com/khanhnamle1994/transfer-rec/tree/master/Matrix-Factorization-Experiments/MF-Biases

3 — Matrix Factorization with Side Features

A common challenge in collaborative filtering is the cold-start problem: the model cannot handle new items and new users, and many users supply very few ratings, which makes the user-item interaction matrix very sparse. A way to relieve this problem is to incorporate additional sources of information about the users, aka side features. These can be user attributes (demographics) and implicit feedback.

Going back to my example, let’s say I know the occupation of the user. I have two choices for this side feature: adding it as a bias (say, artists rate movies higher than people in other occupations do) and adding it as a vector (say, realtors love real estate shows). The matrix factorization model then integrates all signal sources into an enhanced user representation, as seen in equation 5:
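
One plausible way to write the resulting prediction rule, where o denotes the occupation of user u and the terms d_o and t_o are explained in the list just below, is

\hat{r}_{ui} = b + w_u + w_i + d_o + q_i^T p_u + q_i^T t_o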

  • The bias for occupation is denoted by d_o, capturing how much an occupation shifts ratings regardless of the item.
  • The vector for occupation is denoted by t_o, capturing how the effect of an occupation varies with the item (via the interaction q_i * t_o).
  • Note that items can get a similar treatment when necessary.

What does the loss function look like now? Equation 6 below shows it:

Let’s see what this looks like in code:
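
A minimal PyTorch-style sketch, assuming occupation is the single side feature and enters both as a bias d_o and as a vector t_o interacting with the item, as described above (the names here are illustrative):

```python
import torch
import torch.nn as nn

class MFSideFeatures(nn.Module):
    """Equation 5 as read above: biases (incl. occupation) + <p_u, q_i> + <t_o, q_i>."""
    def __init__(self, n_users, n_items, n_occupations, n_factors=40):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)              # p_u
        self.item_factors = nn.Embedding(n_items, n_factors)              # q_i
        self.occupation_factors = nn.Embedding(n_occupations, n_factors)  # t_o
        self.user_bias = nn.Embedding(n_users, 1)                         # w_u
        self.item_bias = nn.Embedding(n_items, 1)                         # w_i
        self.occupation_bias = nn.Embedding(n_occupations, 1)             # d_o
        self.global_bias = nn.Parameter(torch.zeros(1))                   # b

    def forward(self, user, item, occupation):
        q_i = self.item_factors(item)
        user_item = (self.user_factors(user) * q_i).sum(dim=1)               # q_i . p_u
        occ_item = (self.occupation_factors(occupation) * q_i).sum(dim=1)    # q_i . t_o
        return (self.global_bias
                + self.user_bias(user).squeeze(1)
                + self.item_bias(item).squeeze(1)
                + self.occupation_bias(occupation).squeeze(1)
                + user_item + occ_item)
```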

The full experiment for this model can be accessed here: https://github.com/khanhnamle1994/transfer-rec/tree/master/Matrix-Factorization-Experiments/MF-Side-Features

4 — Matrix Factorization with Temporal Features

So far, our matrix factorization models have been static. In reality, item popularity and user preferences change constantly. Therefore, we should account for the temporal effects reflecting the dynamic nature of user-item interactions. To accomplish this, we can add a temporal term that affects user preferences and, therefore, the interaction between users and items.

To mix it up a bit, let’s try out a new equation 7 below, with a dynamic prediction rule for a rating at time t:

  • p_u(t) treats the user factors as a function of time. On the other hand, q_i stays the same because items are assumed to be static.
  • The occupation effect can also vary with the user, via the interaction (p_u * t_o).

Equation 8 displays the new loss function that incorporates the temporal features:

Let’s see what this looks like in code:
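
A minimal PyTorch-style sketch, under the simplifying assumption that time is discretized into buckets and that p_u(t) is the static user vector plus a per-bucket offset; the occupation terms from the previous section are left out for brevity, and the names are again illustrative:

```python
import torch
import torch.nn as nn

class MFTemporal(nn.Module):
    """Time-aware MF: p_u(t) = p_u + p_t, where p_t is a per-time-bucket offset."""
    def __init__(self, n_users, n_items, n_time_bins, n_factors=40):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)      # static p_u
        self.item_factors = nn.Embedding(n_items, n_factors)      # static q_i
        self.time_factors = nn.Embedding(n_time_bins, n_factors)  # temporal offset
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)
        self.global_bias = nn.Parameter(torch.zeros(1))

    def forward(self, user, item, time_bin):
        p_u_t = self.user_factors(user) + self.time_factors(time_bin)  # p_u(t)
        q_i = self.item_factors(item)
        return (self.global_bias
                + self.user_bias(user).squeeze(1)
                + self.item_bias(item).squeeze(1)
                + (p_u_t * q_i).sum(dim=1))
```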

The full experiment for this model can be accessed here: https://github.com/khanhnamle1994/transfer-rec/tree/master/Matrix-Factorization-Experiments/MF-Temporal-Features

5 — Factorization Machines

One of the more powerful techniques for recommendation systems is Factorization Machines, which generalize matrix factorization methods in a robust and expressive way. In many applications, we have plenty of item metadata that can be used to make better predictions. This is one of the benefits of using Factorization Machines with feature-rich datasets: there is a natural way to include extra features in the model, and higher-order interactions can be modeled using the dimensionality parameter d. For sparse datasets, a second-order Factorization Machine model suffices, since there is not enough information to estimate more complex interactions.

Berwyn Zhang — Factorization Machines (http://berwynzhang.com/2017/01/22/machine_learning/Factorization_Machines/)

Equation 9 shows what a second-order FM model looks like:
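
In Rendle’s notation, where x is the sparse feature vector describing one user-item interaction (one-hot user and item indicators plus any metadata) and n is the number of features,

\hat{y}(x) = w_0 + \sum_{j=1}^{n} w_j x_j + \sum_{j=1}^{n} \sum_{j'=j+1}^{n} \langle v_j, v_{j'} \rangle x_j x_{j'}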

where the v’s represent k-dimensional latent vectors associated with each variable (here, users and items), and the angle brackets denote the inner product. Following Steffen Rendle’s original paper on Factorization Machines, if we assume that the feature vector x is non-zero only at the positions corresponding to user u and item i, we recover the classic Matrix Factorization model with biases (equation 3).

The main difference between these two equations is that Factorization Machines introduce higher-order interactions in terms of latent vectors that are also affected by categorical or tag data. This means that the models go beyond co-occurrences to find stronger relationships between the latent representations of each feature.

The loss function for the Factorization Machines model is simply the sum of the mean squared error and a regularization term over the feature set, as shown in equation 10:

Let’s see what this looks like in code:
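
A minimal PyTorch-style sketch of a second-order Factorization Machine, computing the pairwise term with Rendle’s O(kn) reformulation; the class name and defaults are illustrative:

```python
import torch
import torch.nn as nn

class FactorizationMachine(nn.Module):
    """Second-order FM (equation 9) over a sparse feature vector x."""
    def __init__(self, n_features, k=40):
        super().__init__()
        self.w0 = nn.Parameter(torch.zeros(1))                    # global bias
        self.w = nn.Parameter(torch.zeros(n_features))            # linear weights
        self.v = nn.Parameter(torch.randn(n_features, k) * 0.01)  # latent vectors v_j

    def forward(self, x):
        # x: (batch, n_features), e.g. one-hot user/item indicators plus metadata
        linear = x @ self.w                              # (batch,)
        # Rendle's identity for sum_{j<j'} <v_j, v_j'> x_j x_j':
        #   0.5 * sum_f [ (sum_j v_jf x_j)^2 - sum_j v_jf^2 x_j^2 ]
        xv = x @ self.v                                  # (batch, k)
        x2v2 = (x ** 2) @ (self.v ** 2)                  # (batch, k)
        pairwise = 0.5 * (xv ** 2 - x2v2).sum(dim=1)     # (batch,)
        return self.w0 + linear + pairwise
```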

The full experiment for this model can be accessed here: https://github.com/khanhnamle1994/transfer-rec/tree/master/Matrix-Factorization-Experiments/Factorization-Machines

6 — Matrix Factorization with Mixture of Tastes

The techniques presented so far implicitly treat user tastes as unimodal, that is, as representable by a single latent vector. This may lead to a lack of nuance in representing the user, where a dominant taste may overpower more niche ones. It may also reduce the quality of item representations, decreasing the separation in the embedding space between groups of items belonging to different tastes/genres.

Maciej Kula proposes and evaluates representing users as mixtures of several distinct tastes, represented by different taste vectors. Each taste vector is coupled with an attention vector, describing how competent it is at evaluating any given item. The user’s preference is then modeled as a weighted average of all the user’s tastes, with the weights given by how relevant each taste is to the item being evaluated.

Equation 11 gives a mathematical formula for this mixture-of-taste model:
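
With the symbols defined in the list just below, the predicted score is the attention-weighted average of the per-taste scores:

\hat{r}_{ui} = \sigma(A_u q_i)^T (U_u q_i)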

  • U_u is an m x k matrix representing the m tastes of user u.
  • A_u is an m x k matrix representing the affinities of each taste from U_u for representing particular items.
  • \sigma is the softmax activation function.
  • \sigma(A_u * q_i) gives the mixture probabilities.
  • U_u * q_i gives the recommendation scores for each mixture component.
  • Note that we assume identity variance matrices for all mixture components.

So equation 12 below captures the loss function:

Let’s see what this looks like in code:
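
A minimal PyTorch-style sketch of equation 11, storing the m taste vectors and m attention vectors of each user as flat embeddings; the bias terms and names are illustrative rather than taken from the repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfTastesMF(nn.Module):
    """Equation 11: r_hat = softmax(A_u q_i) . (U_u q_i) over m taste components."""
    def __init__(self, n_users, n_items, n_factors=40, n_tastes=4):
        super().__init__()
        self.n_tastes, self.n_factors = n_tastes, n_factors
        # m taste vectors (U_u) and m attention vectors (A_u) per user, stored flat
        self.taste_factors = nn.Embedding(n_users, n_tastes * n_factors)
        self.attention_factors = nn.Embedding(n_users, n_tastes * n_factors)
        self.item_factors = nn.Embedding(n_items, n_factors)   # q_i
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)
        self.global_bias = nn.Parameter(torch.zeros(1))

    def forward(self, user, item):
        batch = user.shape[0]
        U_u = self.taste_factors(user).view(batch, self.n_tastes, self.n_factors)
        A_u = self.attention_factors(user).view(batch, self.n_tastes, self.n_factors)
        q_i = self.item_factors(item).unsqueeze(2)                  # (batch, D, 1)
        scores = torch.bmm(U_u, q_i).squeeze(2)                     # per-taste scores
        weights = F.softmax(torch.bmm(A_u, q_i).squeeze(2), dim=1)  # mixture probabilities
        mixed = (weights * scores).sum(dim=1)                       # weighted average
        return (self.global_bias
                + self.user_bias(user).squeeze(1)
                + self.item_bias(item).squeeze(1)
                + mixed)
```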

The full experiment for this model can be accessed here: https://github.com/khanhnamle1994/transfer-rec/tree/master/Matrix-Factorization-Experiments/MF-Mixture-Tastes

7 — Variational Matrix Factorization

The last variant of matrix factorization that I want to present is called Variational Matrix Factorization. While most of what this blog post has discussed so far is about optimizing a point estimate of the model parameters, the variational approach is about optimizing a posterior distribution, which, loosely speaking, expresses a spectrum of model configurations that are consistent with the data.

Here are the practical reasons to go variational:

  • Variational methods can provide alternative regularization.
  • Variational methods can measure what your model doesn’t know.
  • Variational methods can reveal entailment as well as novel ways of grouping data.

We can make the matrix factorization in equation 3 variational by (1) replacing point estimates with samples from a distribution, and (2) replacing the regularization of those points with regularization of the new distribution. The math is quite involved, so I won’t attempt to explain it in this blog post; the Wikipedia page on Variational Bayesian methods is a helpful place to start. The most common type of variational Bayes uses the Kullback-Leibler divergence as the dissimilarity function, which makes the loss minimization tractable.

Let’s see what that looks like in code:
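
A minimal PyTorch-style sketch, assuming a diagonal Gaussian posterior over each factor vector, the reparameterization trick for sampling, and a KL term against a standard normal prior in place of the plain L2 penalty; this is one way to realize the idea, not necessarily the repository’s exact formulation:

```python
import torch
import torch.nn as nn

class VariationalMF(nn.Module):
    """Each factor vector is a diagonal Gaussian, sampled via the reparameterization trick."""
    def __init__(self, n_users, n_items, n_factors=40):
        super().__init__()
        self.user_mu = nn.Embedding(n_users, n_factors)
        self.user_logvar = nn.Embedding(n_users, n_factors)
        self.item_mu = nn.Embedding(n_items, n_factors)
        self.item_logvar = nn.Embedding(n_items, n_factors)

    @staticmethod
    def _sample(mu, logvar):
        # z = mu + sigma * eps, with eps ~ N(0, I)
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def forward(self, user, item):
        p_u = self._sample(self.user_mu(user), self.user_logvar(user))
        q_i = self._sample(self.item_mu(item), self.item_logvar(item))
        return (p_u * q_i).sum(dim=1)

    def kl(self, user, item):
        # KL( N(mu, sigma^2) || N(0, I) ), summed over factors, averaged over the batch
        def _kl(mu, logvar):
            return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1).mean()
        return (_kl(self.user_mu(user), self.user_logvar(user))
                + _kl(self.item_mu(item), self.item_logvar(item)))

# training objective: squared error on the ratings + beta * model.kl(user, item)
```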

The full experiment for this model can be accessed here: https://github.com/khanhnamle1994/transfer-rec/tree/master/Matrix-Factorization-Experiments/Variational-MF

Model Evaluation

You can check out all 7 Matrix Factorization experiments that I did on the MovieLens 1M dataset at this repository. All models were trained for 50 epochs, and the results were captured in TensorBoard. The evaluation metric is Mean Squared Error, the average of the squared differences between the predicted and actual ratings.

The result table is at the bottom of the README, and as you can see:

  • Variational Matrix Factorization has the lowest training loss.
  • Matrix Factorization with Side Features has the lowest test loss.
  • Factorization Machines has the fastest training time.

Conclusion

In this post, I have discussed the intuitive meaning of matrix factorization and its use in collaborative filtering. I also talked about many different extensions of it: (1) adding biases, (2) adding side features, (3) adding temporal features, (4) upgrading to Factorization Machines to take advantage of higher-order interactions, (5) using a mixture of tastes with an “attention” mechanism, and (6) making the model variational. I hope that you have found this mathematical deep dive into the world of matrix factorization helpful. Stay tuned for future blog posts in this series that go beyond the realm of matrix factorization and into deep learning approaches to collaborative filtering.


Now continue on to Recommendation System Part 5!

If you would like to follow my work on Recommendation Systems, Deep Learning, and Data Science Journalism, you can check out my Medium and GitHub, as well as other projects at https://jameskle.com/. You can also tweet at me on Twitter, email me directly, or find me on LinkedIn. Sign up for my newsletter to receive my latest thoughts on data science, machine learning, and artificial intelligence right at your inbox!
