Recommendation System PART 1 — Use of Collaborative Filtering and Hybrid Collaborative — Content in Retail using LightFM library on Instacart Dataset
Introduction
Disclaimer: I use "items" and "products" interchangeably in the text and the code. They mean the same thing.
When you open an online marketplace such as Amazon, you will find recommendations such as “frequently bought together”, “customers who bought this also bought this”, “similar items”, and so on. These make it easier to find the items you want. Recommendation systems are now widely used in online marketplaces, and some retailers use this technology to improve user experience, user retention, and conversion rate.
In this tutorial, I will show you how to build a product recommender for a realistic business problem using the Instacart dataset, which can be found on Kaggle. This is a warm start problem, where the “interaction” data is abundant. I also compare the pure collaborative filtering method with the hybrid collaborative and content-based method on this warm start problem.
Before going into the details, we need to install the LightFM library using pip:
pip install lightfm
Once it is installed, you can import everything needed, including the cross-validation utilities and the metrics.
How to prepare data for LightFM library
To use LightFM we need one sparse matrix, called the user-item interaction matrix. For the hybrid collaborative and content-based case, we can also add an item-feature matrix:
- The user-item interaction matrix defines the interaction between each user (customer) and each item (product). A typical example is movie ratings given by customers. In the retail case, however, the historical data contains no explicit ratings, so I use the “number of purchases” as an implicit rating. If customer A bought product B 10 times, we say customer A rated product B with a rating of 10. You can also use binary ratings, where 1 means customer A has bought product B and 0 means customer A has never bought it. The user-item interaction matrix represents the collaborative filtering contribution to the model.
- The item-feature matrix defines the features of the items. The features can be any product metadata, such as the product’s category, sub-category, or other pieces of information. For the Instacart dataset, I use “aisles” and “department” as the products’ features. If product A is located in aisle B, the matrix element for product A and aisle B is 1, and 0 otherwise. This matrix adds the content-based contribution to the model.
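To make the two matrices concrete, here is a minimal sketch on a toy purchase log. The customers, products, and aisles are made up for illustration, only the aisle feature is shown, and dense NumPy arrays are used purely for readability (the real Instacart data needs sparse matrices).

```python
from collections import Counter

import numpy as np

# Hypothetical purchase log: one (customer, product) pair per purchase.
purchases = [
    ("A", "banana"), ("A", "banana"), ("A", "milk"),
    ("B", "milk"), ("B", "spinach"), ("B", "spinach"), ("B", "spinach"),
]
# Hypothetical product metadata: product -> aisle.
product_aisle = {"banana": "fresh fruits", "milk": "milk", "spinach": "fresh vegetables"}

users = ["A", "B"]
items = ["banana", "milk", "spinach"]
aisles = ["fresh fruits", "fresh vegetables", "milk"]

# Implicit rating = number of times the customer bought the product.
ratings = np.zeros((len(users), len(items)), dtype=np.float32)
for (user, product), n_purchases in Counter(purchases).items():
    ratings[users.index(user), items.index(product)] = n_purchases

# Binary alternative: 1 if the customer ever bought the product, 0 otherwise.
binary_ratings = (ratings > 0).astype(np.float32)

# Item-feature matrix: 1 where the product is located in the aisle, 0 elsewhere.
item_features = np.zeros((len(items), len(aisles)), dtype=np.float32)
for product, aisle in product_aisle.items():
    item_features[items.index(product), aisles.index(aisle)] = 1.0

print(ratings)
print(binary_ratings)
print(item_features)
```

Rows of the first two arrays are customers and columns are products; rows of the last array are products and columns are aisles.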
Constructing the matrix using Instacart Retail Dataset
We will create two matrices derived from the Instacart Market Basket Analysis Kaggle competition (https://www.kaggle.com/c/instacart-market-basket-analysis/data), shaping the data to suit LightFM. Before that, we need to download the required datasets from Kaggle and read the ones needed for our recommendation system.
The datasets are displayed below:
We also need to remove the rows of aisles and departments where aisle == missing or aisle == other, and where department == missing or department == other:
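A minimal sketch of this filtering step with pandas is shown here. The two small tables are stand-ins for the Kaggle aisles and departments files, and the column names aisle and department are assumed to match those files.

```python
import pandas as pd

# Toy stand-ins for the aisles and departments tables (values are illustrative).
aisles = pd.DataFrame(
    {"aisle_id": [1, 2, 3, 4], "aisle": ["fresh fruits", "missing", "other", "yogurt"]}
)
departments = pd.DataFrame(
    {"department_id": [1, 2, 3], "department": ["produce", "other", "missing"]}
)

# Labels that carry no useful content information.
unwanted = ["missing", "other"]

# Keep only the rows whose label is not in the unwanted list.
aisles = aisles[~aisles["aisle"].isin(unwanted)].reset_index(drop=True)
departments = departments[~departments["department"].isin(unwanted)].reset_index(drop=True)

print(aisles)
print(departments)
```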
- Constructing a user-item interaction matrix. LightFM expects a sparse COO matrix, which can be constructed using coo_matrix from scipy.sparse, so we need to convert each item_id into an integer index. I therefore build the user-item interaction matrix with user_id converted into an index for the rows and item_id converted into an index for the columns. We also need dictionary mappings for user_id to index, index to user_id, item_id to index, and index to item_id.
- Constructing an item-feature matrix. As with the user-item interaction matrix, we map the items/products and the features to indexes, then convert the item-feature pairs into a sparse matrix.
The matrix generators and some helper functions for indexing and more are shown below
Allocate all users, items, and features into lists.
The lists of users, items, and features are displayed below.
LightFM can’t read unindexed objects, so we need to create mappings for users, items, and features into their corresponding indexes.
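The mapping and matrix-building steps can be sketched as follows. The helper names build_mappings and to_coo and the sample ids are hypothetical, not the notebook's own functions. They only illustrate the idea of indexing ids and filling a coo_matrix.

```python
import numpy as np
from scipy.sparse import coo_matrix


def build_mappings(values):
    """Return (id -> index, index -> id) dictionaries for a list of unique ids."""
    to_index = {value: index for index, value in enumerate(values)}
    to_id = {index: value for value, index in to_index.items()}
    return to_index, to_id


def to_coo(triples, row_to_index, col_to_index):
    """Turn (row_id, col_id, weight) triples into a float32 COO matrix."""
    rows = [row_to_index[row_id] for row_id, _, _ in triples]
    cols = [col_to_index[col_id] for _, col_id, _ in triples]
    data = np.array([weight for _, _, weight in triples], dtype=np.float32)
    shape = (len(row_to_index), len(col_to_index))
    return coo_matrix((data, (rows, cols)), shape=shape)


# Hypothetical ids and purchase counts, for illustration only.
users = [101, 205, 999]
items = [24852, 13176]
user_to_index, index_to_user = build_mappings(users)
item_to_index, index_to_item = build_mappings(items)

# User 101 bought item 24852 ten times; user 999 bought item 13176 once.
interactions = to_coo([(101, 24852, 10), (999, 13176, 1)], user_to_index, item_to_index)
print(interactions.toarray())
```

The same to_coo helper would work for an item-feature matrix by passing (item_id, feature, 1) triples together with the item and feature mappings.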
Before generating the interaction matrices, I prepared the train, test, and product_features data under their corresponding names
Convert every table into interaction matrices
resulting in
Model Selection and Cross-Validation
For this problem, I cross-validate with LightFM by measuring the AUC on the test dataset (AUC ranges from 0 to 1). I used the “WARP” loss function, which often provides the best performance among the loss functions in LightFM. From the Instacart dataset, I took the prior dataset as the training dataset and the train dataset as the testing dataset.
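To build intuition for the metric: for one user, AUC is the probability that a randomly chosen purchased item gets a higher score than a randomly chosen non-purchased item. The sketch below computes it by brute force on made-up scores. It is only an illustration of the definition, not the LightFM implementation.

```python
import numpy as np


def user_auc(scores, positive_items):
    """AUC for one user: share of (positive, negative) item pairs ranked correctly."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.zeros(len(scores), dtype=bool)
    is_positive[list(positive_items)] = True
    positive_scores = scores[is_positive]
    negative_scores = scores[~is_positive]
    # Compare every positive score with every negative score; ties count as half.
    wins = (positive_scores[:, None] > negative_scores[None, :]).sum()
    ties = (positive_scores[:, None] == negative_scores[None, :]).sum()
    return (wins + 0.5 * ties) / (len(positive_scores) * len(negative_scores))


scores = [0.9, 0.1, 0.8, 0.3]  # hypothetical model scores for 4 items

perfect = user_auc(scores, positive_items=[0, 2])  # both purchased items rank on top
mixed = user_auc(scores, positive_items=[0, 1])    # one purchased item ranks last
print(perfect, mixed)
```

A value of 1 means every purchased item is ranked above every non-purchased item, while 0.5 is what random scoring would give on average.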
By fitting on the training dataset and testing on the test dataset, we can evaluate the AUC score on the test dataset.
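A minimal sketch of this fit-then-score step is given here, assuming train and test are sparse user-item matrices of the same shape. The function name fit_and_score and its default arguments are hypothetical. The notebook's own code may differ.

```python
def fit_and_score(train, test, item_features=None, epochs=1, num_threads=4):
    """Fit a WARP LightFM model on `train` and return (model, mean test AUC)."""
    # Imported inside the function so the sketch can be pasted before lightfm is installed.
    from lightfm import LightFM
    from lightfm.evaluation import auc_score

    model = LightFM(loss="warp")
    model.fit(train, item_features=item_features, epochs=epochs, num_threads=num_threads)

    # auc_score returns one AUC value per user; the mean is reported.
    per_user_auc = auc_score(
        model, test, item_features=item_features, num_threads=num_threads
    )
    return model, float(per_user_auc.mean())
```

Calling fit_and_score(train, test) corresponds to the pure collaborative filtering case, and passing an item-feature matrix as item_features corresponds to the hybrid case.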
Below, we perform cross-validation of the pure collaborative filtering method using LightFM
the output from the above is
time taken = 118.23 seconds
time taken = 1088.21 seconds
average AUC without adding item-feature interaction = 0.95
From the result above, training took about 2 minutes (118 seconds) and validation took about 18 minutes (1,088 seconds) on my 8 GB RAM laptop. The WARP loss function can be slow, but the performance is very good. An AUC of 0.95 is very high, which suggests that the Instacart dataset is a warm start problem that is rich in transaction data.
The hybrid collaborative and content-based model adds the product/item-feature interactions with the code below
the output from above is
time taken = 154.22 seconds
time taken = 1709.78 seconds
average AUC with adding item-feature interaction = 0.80
The AUC recorded for the hybrid case (0.80) is worse than for the pure collaborative case (0.95). For a warm start problem with abundant transaction data like this one, pure collaborative filtering provides better recommendations.
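One detail worth checking when building a hybrid model: when a custom item-feature matrix is passed to LightFM, each item is represented only by the features in that matrix. If the matrix holds only aisle and department columns, products in the same aisle and department share one representation. A common variant stacks an identity block next to the metadata so that each product keeps its own feature. The sketch below shows that stacking on toy data. Whether the matrix used in this tutorial included such a block is an assumption I cannot confirm from the results above.

```python
import numpy as np
from scipy.sparse import coo_matrix, hstack, identity

n_items = 3

# Hypothetical metadata block: items 0 and 1 share an aisle, item 2 is in another.
aisle_features = coo_matrix(np.array([[1, 0], [1, 0], [0, 1]], dtype=np.float32))

# Identity block: one private indicator feature per item.
identity_block = identity(n_items, dtype=np.float32, format="csr")

# Final matrix has n_items rows and (n_items + n_aisles) columns.
item_features_with_identity = hstack([identity_block, aisle_features]).tocsr()
print(item_features_with_identity.toarray())
```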
Requesting Products/Items Recommendation
We need to combine the train and test datasets into one using the function below
and create a user to product interaction matrix
resulting in
returning in
time taken = 68.50 seconds
and the class object to ask about the recommendation is given by
calling the recommendation using the final model
printed out some recommendations for user 2 and user 10
User 2
Known positives:
Organic Turkey Burgers
Wild Albacore Tuna No Salt Added
Cherry Pomegranate Greek Yogurt
Recommended:
Organic Garlic
Organic Baby Spinach
Organic Hass Avocado
User 10
Known positives:
Cantaloupe
Parsley, Italian (Flat), New England Grown
Seedless Red Grapes
Recommended:
Organic Baby Spinach
Organic Strawberries
Bag of Organic Bananas
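A listing like the one above can be produced by scoring every item for a user and sorting. The sketch below works on a plain array of scores. With LightFM, such scores would typically come from model.predict(user_index, np.arange(n_items)). The function name, the scores, and the exclude_known option are hypothetical, not the notebook's class.

```python
import numpy as np


def sample_recommendation(scores, known_positive_idx, item_names, k=3, exclude_known=False):
    """Return up to k known positives and the k highest-scoring items for one user."""
    scores = np.asarray(scores, dtype=float).copy()
    if exclude_known:
        # Push already-purchased items to the bottom of the ranking.
        scores[list(known_positive_idx)] = -np.inf
    top_idx = np.argsort(-scores)[:k]
    return {
        "known_positives": [item_names[i] for i in known_positive_idx][:k],
        "recommended": [item_names[i] for i in top_idx],
    }


item_names = [
    "Organic Baby Spinach", "Cantaloupe", "Organic Strawberries",
    "Seedless Red Grapes", "Bag of Organic Bananas",
]
scores = [2.1, 0.4, 1.7, 0.2, 1.5]  # hypothetical predicted scores for one user

result = sample_recommendation(scores, known_positive_idx=[1, 3], item_names=item_names)
print("Known positives:", result["known_positives"])
print("Recommended:", result["recommended"])
```

Setting exclude_known=True is a design choice: it avoids recommending products the customer already buys, which may or may not be desirable for repeat grocery purchases.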
Conclusion
LightFM can provide both pure collaborative filtering and hybrid collaborative and content-based recommendation systems for transactional data. However, for a warm start problem with abundant interaction data, pure collaborative filtering performed better in this experiment (AUC of 0.95 versus 0.80).
Complete Jupyter Notebooks on Github
Bibliography
[1] LightFM documentation — https://lyst.github.io/lightfm/docs/home.html
[2] Instacart Kaggle competition — https://www.kaggle.com/c/instacart-market-basket-analysis
[3] Recommendation Systems — Learn Python for Data Science #3 https://www.youtube.com/watch?v=9gBC9R-msAk