Recommendation System PART 1 — Use of Collaborative Filtering and Hybrid Collaborative — Content in Retail using LightFM library on Instacart Dataset

Norma Dani Risdiandita

Published in Towards Data Science · 6 min read · Nov 4, 2019

Introduction

Disclaimer: I use “items” and “products” interchangeably in the text and code. They mean the same thing.

When you open an online marketplace such as Amazon, you will find recommendations such as “frequently bought together”, “customers who bought this also bought”, “similar items”, and so on, which make it easier to find the items you want. Recommendation systems are now widely used in online marketplaces, and retailers use them to improve user experience, user retention, and conversion rate.

In this tutorial, I will show you how to build a product recommender on the Instacart dataset that mimics a real business problem: a warm-start problem, where “interaction” data is abundant. The dataset can be found on Kaggle. I will also compare pure collaborative filtering with the hybrid (collaborative plus content-based) method on this warm-start problem.

Before going through the details, we need to install the LightFM library using pip:

pip install lightfm

Once it is installed, you can import everything needed, including the cross-validation utilities and metrics

importing libraries including standard pandas, numpy, and the lightfm’s model and cross-validation stacks

How to prepare data for LightFM library

LightFM needs one sparse matrix, the user-item interaction matrix. For the hybrid (collaborative plus content-based) case, we can also supply an item-feature matrix:

The user-item interaction matrix defines the interactions between users (customers) and items (products). The classic example is movie ratings given by customers. In the retail case, however, the historical data contains no explicit ratings, so I use the “number of purchases” as an implicit rating. If customer A bought product B 10 times, then we say customer A rated product B with rating 10. You can also use binary ratings, where 1 means customer A has bought product B and 0 means they never have. The user-item interaction matrix represents the collaborative filtering contribution to the model.
The item-feature matrix defines the features of the items. Features can be any item metadata, such as the product’s category, sub-category, or other information. With Instacart’s dataset, I use “aisles” and “department” as the products’ features. If product A is located in aisle B, then the matrix element for product A and aisle B is 1, and 0 otherwise. This matrix adds the content-based contribution to the model.

**User-item interaction matrix**. User Id represents the customers and product rated represents how many times a specific user id bought the products. The matrix is sparse.

item-feature matrix. Item id represents the products and features refer to the metadata embedded to an item such as category, sub-category, and so on.
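To make the rating convention concrete, here is a minimal sketch on a made-up purchase log. The table and its values are invented for illustration; only the column names user_id and product_id follow the Instacart files.

```python
import pandas as pd

# Toy purchase log, one row per purchased product (values are made up).
purchases = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2],
    "product_id": [10, 10, 11, 10, 12],
})

# Implicit rating = number of times a user bought a product.
ratings = (
    purchases.groupby(["user_id", "product_id"])
    .size()
    .reset_index(name="count")
)

# Binary alternative: 1 for every observed (user, product) pair.
# Pairs that never occur are simply absent, i.e. implicit zeros.
ratings["bought"] = 1

print(ratings)
```

User 1 bought product 10 twice, so that pair gets a count of 2, while every other observed pair gets 1.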

Constructing the matrix using Instacart Retail Dataset

We will create the two matrices from the Instacart Market Basket Analysis Kaggle competition data (https://www.kaggle.com/c/instacart-market-basket-analysis/data), in the form LightFM expects. First, we need to download the datasets from Kaggle and read the ones required for our recommendation system
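A minimal sketch of the reading step, assuming the competition CSVs have been unzipped into one local folder. The helper load_instacart and the dictionary keys are hypothetical names, and the article does not say exactly which files it loaded, so all six data files are read here.

```python
from pathlib import Path

import pandas as pd

# File names as distributed with the Kaggle competition data.
INSTACART_FILES = {
    "aisles": "aisles.csv",
    "departments": "departments.csv",
    "products": "products.csv",
    "orders": "orders.csv",
    "order_products_prior": "order_products__prior.csv",
    "order_products_train": "order_products__train.csv",
}


def load_instacart(data_dir):
    """Read the Instacart CSV files found in data_dir into DataFrames."""
    data_dir = Path(data_dir)
    return {
        name: pd.read_csv(data_dir / fname)
        for name, fname in INSTACART_FILES.items()
    }


# Usage (assumes the CSVs were unzipped into ./instacart):
# tables = load_instacart("instacart")
# print(tables["products"].head())
```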

The display of the datasets can be shown below:

datasets preview

We also need to remove the rows of aisles and departments where the aisle or department is “missing” or “other”:
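The filter can be sketched with pandas as follows. The two small tables are toy stand-ins with invented ids, and UNINFORMATIVE is a hypothetical name. The column names aisle and department follow the Instacart files.

```python
import pandas as pd

# Toy stand-ins for the aisles and departments tables (ids are made up).
aisles = pd.DataFrame({
    "aisle_id": [1, 2, 3, 4],
    "aisle": ["fresh fruits", "missing", "other", "yogurt"],
})
departments = pd.DataFrame({
    "department_id": [1, 2, 3],
    "department": ["produce", "other", "missing"],
})

# Labels that carry no usable information about a product.
UNINFORMATIVE = ["missing", "other"]

aisles_clean = aisles[~aisles["aisle"].isin(UNINFORMATIVE)].reset_index(drop=True)
departments_clean = departments[
    ~departments["department"].isin(UNINFORMATIVE)
].reset_index(drop=True)

print(aisles_clean)
print(departments_clean)
```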

Constructing a user-item interaction matrix. LightFM expects the interactions as a SciPy sparse matrix in COO format, which can be constructed using coo_matrix from scipy.sparse, so user_id and item_id must first be converted into integer indexes. Therefore, I build the user-item interaction matrix with user_id converted into the row index and item_id converted into the column index. We also must not forget to create dictionary mappings for user_id to index, index to user_id, item_id to index, and index to item_id.
Constructing an item-feature matrix. As with the user-item interaction matrix, we map items/products and features to indexes and convert the item-to-feature relations into a sparse matrix.

The matrix generators and some helper functions for indexing and more are shown below

First, put all users, items, and features into lists

The lists of users, items, and features are displayed below

displaying users, items (products), and features (departments and aisles)

LightFM can’t read unindexed objects, so we need to map users, items, and features to their corresponding indexes.
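A minimal sketch of such mappings, in both directions. The helper build_mappings and the toy values are hypothetical and are not the author's original helper functions.

```python
def build_mappings(values):
    """Return (value -> index, index -> value) dicts over the unique values."""
    unique = sorted(set(values))  # sorting keeps the indexes reproducible
    to_index = {value: i for i, value in enumerate(unique)}
    to_value = {i: value for value, i in to_index.items()}
    return to_index, to_value


# Toy lists standing in for the real users, items, and features.
users = [27, 3, 27, 15]
items = ["Banana", "Organic Garlic", "Banana"]
features = ["produce", "fresh fruits", "fresh vegetables"]

user_to_index, index_to_user = build_mappings(users)
item_to_index, index_to_item = build_mappings(items)
feature_to_index, index_to_feature = build_mappings(features)

print(user_to_index)  # {3: 0, 15: 1, 27: 2}
print(index_to_item)  # {0: 'Banana', 1: 'Organic Garlic'}
```

Keeping the reverse dictionaries makes it possible to translate matrix rows and columns back into real ids and product names when printing recommendations.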

Before generating the interaction matrices, I prepared the train, test, and product_features tables under those names

Showing the table of users with their corresponding purchased products/items. I also regard **count** as the ratings of the products/items or features

Convert every table into interaction matrices
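The conversion can be sketched with scipy.sparse.coo_matrix as follows. The function to_interaction_matrix, the toy tables, and the mappings are hypothetical, and the sketch assumes every id in the table is present in its mapping.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix


def to_interaction_matrix(df, row_col, col_col, value_col, row_to_index, col_to_index):
    """Build a sparse COO matrix from a long table of (row id, column id, value)."""
    rows = df[row_col].map(row_to_index).to_numpy()
    cols = df[col_col].map(col_to_index).to_numpy()
    data = df[value_col].to_numpy(dtype=np.float32)
    shape = (len(row_to_index), len(col_to_index))
    return coo_matrix((data, (rows, cols)), shape=shape)


# Toy user-to-product table with purchase counts as ratings.
user_item = pd.DataFrame({
    "user_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "count": [2, 1, 5],
})
# Toy product-to-feature table (each product sits in a department and an aisle).
product_features = pd.DataFrame({
    "product_id": [10, 10, 11],
    "feature": ["produce", "fresh fruits", "produce"],
    "count": [1, 1, 1],
})

user_to_index = {1: 0, 2: 1}
item_to_index = {10: 0, 11: 1, 12: 2}
feature_to_index = {"fresh fruits": 0, "produce": 1}

interactions = to_interaction_matrix(
    user_item, "user_id", "product_id", "count", user_to_index, item_to_index
)
item_features = to_interaction_matrix(
    product_features, "product_id", "feature", "count", item_to_index, feature_to_index
)

print(interactions.toarray())
print(item_features.toarray())
```

Product 12 appears in the mapping but in neither table, so its column in the first matrix and its row in the second stay all zero.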

resulting in

Sparse matrices representing interactions of users to products/items and products to features. The non-zero elements consist of **product count** as the number of products a user has bought

Model Selection and Cross-Validation

In this problem, I validate the model with LightFM by measuring the AUC on the test dataset (AUC ranges from 0 to 1). I used the “WARP” loss function, which often gives the best performance among LightFM’s options. From the Instacart data, I took the prior set as the training dataset and the train set as the testing dataset. By fitting on the training dataset and scoring on the testing dataset, we can evaluate the AUC of the test dataset.
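For intuition about the metric, ROC AUC for a single user is the probability that a randomly chosen held-out positive item is scored above a randomly chosen negative item. The sketch below computes it by brute force with NumPy. The function user_auc and the toy scores are hypothetical, and this illustrates the definition rather than LightFM's own implementation.

```python
import numpy as np


def user_auc(scores, positive_items):
    """ROC AUC for one user: chance that a held-out positive outranks a negative."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.zeros(scores.shape[0], dtype=bool)
    is_positive[list(positive_items)] = True
    pos, neg = scores[is_positive], scores[~is_positive]
    if pos.size == 0 or neg.size == 0:
        return float("nan")  # AUC is undefined without both classes
    diff = pos[:, None] - neg[None, :]  # every (positive, negative) pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()  # ties count as half
    return float(wins / diff.size)


# Model scores for four items; items 0 and 2 are the held-out purchases.
scores = [0.9, 0.1, 0.2, 0.3]
print(user_auc(scores, positive_items=[0, 2]))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly, which gives 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.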

Below, we validate the pure collaborative filtering method using LightFM

the output from the above is

time taken = 118.23 seconds
time taken = 1088.21 seconds
average AUC without adding item-feature interaction = 0.95

From the result above, training takes about 2 minutes and validation about 18 minutes on my laptop with 8 GB of RAM. The WARP loss function can be slow, but the performance is very good. An AUC of 0.95 is excellent and suggests that the Instacart dataset is a warm-start problem that is rich in transaction data.

The hybrid (collaborative plus content-based) method adds the product/item-to-feature interactions, with the code below
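One detail worth checking when building the item-feature matrix: LightFM's documentation notes that when item_features is supplied, each item is represented only through the features in that matrix, so per-item embeddings are kept only if an identity block is included. The article does not say whether its matrix contained one, so whether this affects the result below is untested here. A minimal sketch of adding the identity block with SciPy, where with_identity is a hypothetical helper and the feature matrix is a toy:

```python
import numpy as np
from scipy import sparse


def with_identity(item_features):
    """Prepend one indicator column per item to an (items x features) matrix."""
    item_features = sparse.csr_matrix(item_features, dtype=np.float32)
    n_items = item_features.shape[0]
    identity = sparse.identity(n_items, dtype=np.float32, format="csr")
    return sparse.hstack([identity, item_features], format="csr")


# Toy matrix: 3 items, 2 features (for example two departments).
item_features = sparse.csr_matrix(
    np.array([[1, 0], [1, 0], [0, 1]], dtype=np.float32)
)
augmented = with_identity(item_features)
print(augmented.toarray())
```

Each item keeps its shared metadata columns and gains one column of its own, so the model can still learn item-specific behavior on top of the aisle and department signal.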

the output from above is

time taken = 154.22 seconds
time taken = 1709.78 seconds
average AUC with adding item-feature interaction = 0.80

For the hybrid case, the recorded AUC is worse than for pure collaborative filtering. In this warm-start problem with abundant transaction data, pure collaborative filtering provides better recommendations.

Requesting Products/Items Recommendation

We need to combine the train and test datasets into one using the function below
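One way to combine the two tables is sketched below, assuming both hold user_id, product_id, and count columns as in the tables shown earlier. The function combine_train_test is hypothetical and may differ from the author's original.

```python
import pandas as pd


def combine_train_test(train, test, keys=("user_id", "product_id"), value="count"):
    """Stack both tables and add up the counts of pairs that occur in both."""
    combined = pd.concat([train, test], ignore_index=True)
    return combined.groupby(list(keys), as_index=False)[value].sum()


# Toy tables with made-up values.
train = pd.DataFrame({"user_id": [1, 1], "product_id": [10, 11], "count": [2, 1]})
test = pd.DataFrame({"user_id": [1, 2], "product_id": [10, 12], "count": [1, 3]})

combined = combine_train_test(train, test)
print(combined)
```

The pair (user 1, product 10) appears in both tables, so its counts are summed to 3.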

and create a user to product interaction matrix

resulting in

Training the final model on the combined data prints

time taken = 68.50 seconds

and the class used to request recommendations is given by

calling the recommendation using the final model
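A minimal sketch of such a request, written as a plain function rather than the author's class. It assumes model is a fitted LightFM model (anything exposing the same predict(user_ids, item_ids, item_features=...) call works), and recommend, index_to_item, and k are hypothetical names. Items the user already bought are not filtered out of the recommendations here.

```python
import numpy as np


def recommend(model, interactions, user_index, index_to_item, k=3, item_features=None):
    """Return up to k known positives and the k highest-scoring items for one user."""
    n_items = interactions.shape[1]
    item_ids = np.arange(n_items, dtype=np.int32)

    # Score every item for this user; a higher score means a stronger recommendation.
    scores = model.predict(user_index, item_ids, item_features=item_features)

    # Items the user has already interacted with (non-zero columns in their row).
    known = interactions.tocsr()[user_index].indices
    top = np.argsort(-scores)[:k]

    return {
        "known_positives": [index_to_item[int(i)] for i in known[:k]],
        "recommended": [index_to_item[int(i)] for i in top],
    }


# Usage, given a fitted model and the combined interaction matrix:
# result = recommend(final_model, interactions, user_index, index_to_item)
# print(result["known_positives"], result["recommended"])
```

The output shown next comes from the author's final model on the full dataset.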

which prints some recommendations for user 2 and user 10

User 2
     Known positives:
                  Organic Turkey Burgers
                  Wild Albacore Tuna No Salt Added
                  Cherry Pomegranate Greek Yogurt
     Recommended:
                  Organic Garlic
                  Organic Baby Spinach
                  Organic Hass Avocado
User 10
     Known positives:
                  Cantaloupe
                  Parsley, Italian (Flat), New England Grown
                  Seedless Red Grapes
     Recommended:
                  Organic Baby Spinach
                  Organic Strawberries
                  Bag of Organic Bananas

Conclusion

LightFM can provide both pure collaborative filtering and hybrid (collaborative plus content-based) recommendation systems for transactional data. However, for a warm-start problem like this one, pure collaborative filtering gave the better performance.