Factorization Machines for Item Recommendation with Implicit Feedback Data

Go beyond classic Matrix Factorization approaches to include user/item auxiliary features and directly optimize item rank-order

Published in

Towards Data Science

13 min readJun 28, 2020

Introduction

In this article, we’ll introduce Factorization Machines (FM) as a flexible and powerful modeling framework for collaborative filtering recommendation. We’ll then describe how specialized loss functions that directly optimize item rank-order make it possible to apply FM models to implicit feedback data. We’ll finish up by demonstrating these points with a real-world implicit feedback data set using a new open-source FM modeling library. Onwards!

Recommender Systems

As the digital economy continues to expand in size and increase in sophistication, the role of recommender systems to provide personalized, relevant content to each individual user is more important than ever. E-commerce sites with ever-larger product catalogs can present individualized store fronts to millions of users simultaneously. Digital content providers can help users navigate more books, articles, songs, movies, etc. than could be consumed in a lifetime to find the small subset best matching each user’s specific interests and tastes. For example, Netflix reported in 2015 that its recommender system influenced roughly 80% of streaming hours on the site and further estimated the value of the system at over $1B annually.[1]

The two broad high-level approaches to recommender systems are Content-Based Filtering (CBF) and Collaborative Filtering (CF). CBF models represent users and items as vectors of attributes or features (e.g. user age, state, income, activity level; item department, category, genre, price). In contrast, CF methods rely only on past user behavior: the model analyzes co-occurrence patterns to determine user and/or item similarities and attempts to infer a user’s preferences over unseen items using only the user’s recorded interactions. CF-based approaches have the advantages of being domain-free (i.e. no specific business knowledge or feature engineering required) as well as generally more accurate and more scalable than CBF models.[2]

Matrix Factorization

Many of the most popular and most successful CF approaches are based on Matrix Factorization (MF), with development accelerating rapidly due to the 2006–2009 Netflix Prize in which the winning entry made heavy use of MF techniques, including the now-popular SVD++ algorithm.[3] MF models attempt to learn low-dimensional representations, or embeddings, of users and items in a shared latent factor space. In essence, the observed sparse user-item interaction matrix is “factorized” into the approximate product of two low-rank matrices which contain user and item embeddings. After these latent factors are learned, user/item similarity can be computed and unobserved preferences inferred by comparing the user/item latent factor representations.

Many popular MF algorithms learn these user/item latent factors by minimizing the squared error between observed and predicted ratings, where predicted ratings are computed as the inner product of the relevant user and item latent factors. Some model specifications further include global user/item biases and/or regularization terms to prevent overfitting. A common MF loss function can be expressed as:

Implicit Feedback

However, this formulation breaks down when dealing with implicit feedback data, where you do not directly observe any explicit numerical ratings or positive/negative responses, but rather only raw user behavior (e.g. watches, page views, purchases, clicks). Implicit feedback data is far more common in real-world recommendation contexts, and in fact recommender systems built solely using explicit feedback data (even when it exists) typically perform poorly due to the fact that ratings are not missing at random, but instead highly correlated with latent user preferences.[4] To adapt the MF approach to implicit feedback data Hu et al.[5] introduced the concept of ratings as a binary implied preferences (has the user observed the item or not) with a numeric confidence weight representing the assumed strength of that binary preference. This model formulation is the basis of the popular implicit feedback ALS algorithm implemented in both SparkML and the Implicit Python library:

While simple and effective, this approach has a few key drawbacks. First, by representing the data as a matrix of user/item interactions it is not possible to include auxiliary features, such as the user/item attributes used in CBF models and/or other contextual information about the interaction itself. This is a major lost opportunity when rich auxiliary features exist, and also prevents the model from generating informative predictions for new users/items, often referred to as the Cold-Start Problem. Second, encoding user implicit feedback as binary (0, 1) ratings and minimizing the model’s prediction error is a very indirect representation of user preferences — you don’t know for sure that the unobserved items actually are negative for the user, and certainly some positive (and negative) items are preferred more than others with the exact same encoded (0, 1) rating.

Consider how most recommendations are actually served to users: as one or more ordered lists of items. Intuitively, the true objective should, therefore, be to derive the correct rank-ordering of items for each user. This will allow us to generate recommendations where all items are ordered by relevance for each user, with the most relevant items appearing at the very top of each user’s recommended item list(s).

The Author’s Amazon Prime Video Homepage

To overcome these limitations we need a more general model framework that can extend the latent factor approach to incorporate arbitrary auxiliary features, and specialized loss functions that directly optimize item rank-order using implicit feedback data. Enter Factorization Machines and Learning-to-Rank.

Factorization Machines

Factorization Machines (FM) are generic supervised learning models that map arbitrary real-valued features into a low-dimensional latent factor space and can be applied naturally to a wide variety of prediction tasks including regression, classification, and ranking. FMs can estimate model parameters accurately under very sparse data and train with linear complexity, allowing them to scale to very large data sets[6] — these characteristics make FMs an ideal fit for real-world recommendation problems. Unlike the classic MF model discussed above which inputs a user-item interaction matrix, FM models represent user-item interactions as tuples of real-valued feature vectors and numeric target variables — this data format should be familiar to anyone who’s ever trained a standard regression or classification model.

Typically for collaborative filtering the base features will be binary vectors of user and item indicators, such that each training sample has exactly two non-zero entries corresponding to the given user/item combination. However, these user/item indicators can be augmented with arbitrary auxiliary features, for example, user or item attributes and/or contextual features relevant to the interaction itself (e.g. day-of-week, add-to-cart order, etc.).

The FM model equation is comprised of n-way interactions between features. A second-order model (by far the most common) includes weights for each base feature as well as interaction terms for each pairwise feature combination.

This model formulation may look familiar — it’s simply a quadratic linear regression. However, unlike polynomial linear models which estimate each interaction term separately, FMs instead use factorized interaction parameters: feature interaction weights are represented as the inner product of the two features’ latent factor space embeddings:

This greatly decreases the number of parameters to estimate while at the same time facilitating more accurate estimation by breaking the strict independence criteria between interaction terms. Consider a realistic recommendation data set with 1,000,000 users and 10,000 items. A quadratic linear model would need to estimate U + I + UI ~ 10 billion parameters. A FM model of dimension F=10 would need only U + I + F(U + I) ~ 11 million parameters. Additionally, many common MF algorithms (including SVD++, ALS) can be re-formulated as special cases of the more general/flexible FM model class.[7]

However, it’s not immediately obvious how to adapt FM models for implicit feedback data. One naïve approach would be to label all observed user-item interactions as (1) and all unobserved interactions as (-1) and train the model using a common classification loss function such as hinge or log loss. But with real-world recommendation data sets this would require the creation of billions of unobserved user-item training samples and result in a severe class imbalance due to interaction sparsity. This approach also has the same conceptual problem as the implicit feedback MF adaptation discussed above: it’s still minimizing rating prediction error instead of directly optimizing item rank-order.

Learning-to-Rank

Optimization techniques that learn rank-order directly instead of minimizing prediction error are referred to as Learning-to-Rank (LTR). LTR models train on pairs or lists of training samples instead of individual observations. The loss functions are based on the relative ordering of items instead of their raw scores. Models using LTR have produced state-of-the-art results in search, information retrieval, and collaborative filtering. These techniques are the key to adapting FM models to implicit feedback recommendation problems.

One of the most popular LTR techniques for item recommendation is Bayesian Personalized Ranking (BPR). BPR attempts to learn the correct rank-ordering of items for each user by maximizing the posterior probability (MAP) of the model parameters given a data set of observed user-item preferences and a chosen prior distribution. Each user’s observed items (implicit feedback) are assumed to be preferred over the unobserved items, and all pairwise preferences are assumed to be independent. To learn these preferences, one creates training samples comprised of [user (u), observed item (i), unobserved item (j)] tuples and maximizes the following log-likelihood function with respect to the model parameters.[8]

Where (>u|theta) is the model’s predicted item ranking for user (u). This can be learned using the training data described above by maximizing the joint probability that users’ observed items are preferred over their unobserved items. We can define this probability as the difference between the predicted utility scores of the user’s observed (i) and unobserved (j) items mapped onto [0, 1] using the sigmoid function:

Where the real-valued user-item utility scores f(u,i|theta) and f(u,j|theta) are generated using the main FM model equation given above. Putting it all together and including a regularization term the criterion to maximize becomes:

Taking things a step further, Weighted Approximate Pairwise Rank (WARP) doesn’t simply sample unobserved items (j) at random, but rather samples many unobserved items for each observed training sample until it finds a rank-reversal for the user, thus yielding a more informative gradient update. This is especially important in contexts with a large number of items and highly skewed item popularity (very common). [9] The basic procedure is:

1. Randomly sample an unobserved item for the user and compute its utility score. If the unobserved item’s score exceeds the observed item’s score plus a fixed margin then make a gradient update, otherwise continue to sample negative items

2. Scale the magnitude of the gradient update based on the number of negative items sampled before finding a margin violation — make smaller updates if more negative items were sampled as it’s more likely the model is currently ranking user preferences correctly

In fact, if you scale the magnitude of the gradient updates with a (0, 1] multiplier, BPR can be seen as a special case of WARP where the maximum number of negative samples is equal to one, resulting in a constant gradient update multiplier of one. Using WARP increases per-epoch training time relative to BPR, but often yields faster convergence and superior model performance.

Model Evaluation

Now that all the theory is out of the way, let’s see how these components come together to produce high-quality recommendations on a well-known real-world data set. We’ll train an implicit feedback FM model using the author’s new RankFM package which implements the techniques described above and compare its performance to the popular implicit feedback ALS MF algorithm discussed earlier (via the Implicit package). The goal is to show that a FM model incorporating auxiliary features and trained using LTR optimization techniques yields superior performance relative to a similarly-specified classical MF model.

For this exercise, we’ll use the 2017 Instacart Orders Data. It contains 200,000 users, 50,000 items, 3.4 million orders, and over 32 million total recorded user-item interactions. A quick exploratory look at the data reveals high sparsity: the median user has purchased only 48 items across 9 orders and the median item has been purchased by only 35 users across 60 orders. The overall interaction sparsity rate is 99.87%.

We’ll randomly split the overall interaction data using 75% to train the model and the remaining 25% to evaluate validation metrics and assess model performance. To reflect the idea that repeat-purchases carry stronger preference signals, we’ll incorporate sample weights defined using the log of each user-item purchase count. We’ll augment the main user-item interaction data with a set of item features denoting the supermarket department and aisle of each item. Unfortunately, there are no user auxiliary features to take advantage of with this data set.

Now let’s train the RankFM model. We’ll use 50 latent factors, WARP loss with max 100 negative samples, and an inverse scaling learning schedule that will decrease the learning rate over time to help with SGD convergence. RankFM will print the log-likelihood of each training epoch so you can monitor both training time and performance gains epoch-over-epoch:

We can generate real-valued utility scores using the main FM model equation with the predict() method. Users and items not present in the training data can either be dropped or have scores set to np.nan in the resulting scores vector. We can use the recommend() method to generate the TopN recommended items for each user in the validation set. There’s a helpful flag that lets you choose whether to include previously observed training items or generate only novel-item recommendations. The result is a pandas DataFrame with UserID index values and each user’s recommended items in the columns ordered from left to right by expected preference:

Lastly, we’ll evaluate performance metrics on the held-out validation data. RankFM includes a separate evaluation module which implements many popular recommender metrics including hit rate, reciprocal rank, discounted cumulative gain, and precision/recall.

Using a data set with 50,000 unique items where the median user purchases fewer than 50 we are able to get a validation hit rate above 80% with very little feature engineering and required data prep. Not too shabby.

Baseline Model Comparison

Now let’s compare RankFM’s performance against our baseline ALS algorithm. To do this we first need to convert our user-item interaction data into a sparse CSR matrix. We’ll use the exact same data split and input our generated sample weights as confidence scores to train the ALS model.

Now that we have our data in the required input format we can fit the ALS model and generate TopN recommendations for the validation users. We’ll use the same latent factor dimension as RankFM and otherwise use the default values.

Lastly we’ll calculate the model’s hit rate, precision, and recall for the same set of hold-out interactions used to evaluate our RankFM model.

Although a full analysis would need to appropriately tune both models across various hyperparameter combinations, we can see modest performance gains (5–10%) with the FM model formulation and LTR optimization techniques using identical interaction data and roughly equivalent model specifications (i.e. number of latent factors). We’d likely see even larger performance gains relative to the baseline ALS MF model if users had fewer recorded interactions (in the Instacart data every user has at least 3 orders) and/or we had more informative user/item auxiliary features.

Summary

In this article, we explored Factorization Machines (FM) as a generic yet powerful model framework especially well-suited to collaborative filtering recommendation problems. Unlike traditional Matrix Factorization (MF) approaches, FM models can be naturally extended to include user, item, or contextual auxiliary features to augment the main interaction data. We then showed how Learning-to-Rank (LTR) loss functions such as Bayesian Personalized Ranking (BPR) and Weighted Approximate Pairwise Rank (WARP) are the key to successfully adapting FM models to implicit feedback data. To demonstrate these points, we showed an implicit feedback FM model outperforming a popular ALS MF baseline algorithm on a well-known open-source implicit feedback recommendation data set. Lastly, we introduced RankFM: a new python package for building and evaluating FM models for recommendation problems with implicit feedback data.