Photo by JJ Ying on Unsplash

MLOps: How to Operationalise E-Commerce Product Recommendation System

Introduction

Burak Özen
Towards Data Science
9 min read · May 25, 2022


One of the most common challenges in an e-commerce business is building well-performing product recommendation and categorisation models. A product recommender suggests similar products to users so that the total time and money spent on the platform per user increases. There is also a need for a model that categorises products correctly, since such platforms may contain wrongly categorised products, especially where most of the content is user-generated, as on classifieds websites. A product categorisation model catches those products and places them back into their correct categories to improve the overall user experience on the platform.

This article has two main parts. In the first part, we will talk about how to build an e-commerce product recommendation system and do product categorisation with some hands-on coding exercises. In the second part, we will discuss how to operationalise this project in just a few steps with the help of an MLOps platform named Layer.

Brief Methodology

Methodology in a Nutshell (Image by Author)

I believe most, if not all, e-commerce platforms collect clickstream data of users, which is basically a simple table consisting of three columns: session_id, product_id and timestamp. This table is the only data you need to build the product recommendation model described in this article for your own business as well. Throughout this tutorial, we will be using a public Kaggle dataset (CC0: Public Domain) containing the clickstream data of an e-commerce shop; you can find the link in the references below. [1]

The Word2Vec algorithm resides at the heart of this methodology to produce product embeddings. Word2Vec is mostly used in NLP on text data, and there is an analogy in using it in this context: a product is treated as a single word, and a sequence of product views (a session) is treated as a sentence. The output of the Word2Vec algorithm is a numerical representative vector for each product.

In the next step, those product vectors are fed into a K-Means algorithm as inputs to create an arbitrary number of product clusters. These clusters represent groupings (categories) of similar products.

In the final step, we will produce product recommendations by randomly selecting from the cluster to which a given product belongs.

This article is more like a tutorial with some coding examples. For more information about the methodology and the long story behind such a project, we strongly recommend reading this article as well.

Table of Contents

PART I: HANDS-ON EXAMPLE

Step I: Load the CSV file into a Pandas DataFrame

Step II: Convert clickstream data into sequences of product views

Step III: Generate product vectors (embeddings) using Word2Vec algorithm

Step IV: Fit K-Means model on the product vectors (embeddings)

Step V: Save clusters as a data frame

Step VI: Get similar product recommendations for a given product

PART II: MLOPS

Layer Installation and Login

Layer Dataset Decorator

Layer Model Decorator

Layer Running Environment Modes

Full Notebook Integrated with Layer

PART I: HANDS-ON EXAMPLE

Step I: Load the CSV file into a Pandas DataFrame

Define a simple function named raw_session_based_clickstream_data which reads the CSV file from its location and returns a Pandas DataFrame.

Function #1: raw_session_based_clickstream_data()
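The gist embedded in the original post is not reproduced here; a minimal sketch of such a loader might look as follows, assuming the Kaggle file is semicolon-separated (the default file name is illustrative and should point at your local copy):

```python
import pandas as pd

def raw_session_based_clickstream_data(csv_path="e-shop clothing 2008.csv"):
    # The Kaggle e-shop clickstream file uses ';' as its column separator.
    return pd.read_csv(csv_path, sep=";")
```
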
raw_clickstream = raw_session_based_clickstream_data()
raw_clickstream.head(5)
Sample Data Records (Image by Author)

Step II: Convert clickstream data into sequences of product views

Define a function named generate_sequential_products which takes the data frame raw_clickstream from the previous function's output and applies some data cleaning, such as renaming columns and dropping sessions with only a single product view. After that, it groups the data by the session_id column and creates a list of products per session. It is important to use the order column while grouping product views by session, because the sequence of product views must be in chronological order. If you only have timestamps of product views in your data, you should first create such a separate order column from the timestamp column.

There is also a helper function named remove_consec_duplicates which drops consecutive duplicate products from the product view sequences. This is especially important since we will be using the Word2Vec algorithm in the next step to generate product embeddings, and it is very likely that your data contains many consecutive duplicate product views, which might distort the algorithm.

Function #2: generate_sequential_products()
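As a hedged sketch of the two functions described above, assuming the raw frame is passed in and uses the Kaggle column names "session ID", "page 2 (clothing model)" and "order" (rename these to match your own data):

```python
import pandas as pd

def remove_consec_duplicates(seq):
    # Keep a product only when it differs from the immediately preceding view.
    return [p for i, p in enumerate(seq) if i == 0 or p != seq[i - 1]]

def generate_sequential_products(raw_clickstream):
    df = raw_clickstream.rename(columns={"session ID": "session_id",
                                         "page 2 (clothing model)": "product_id"})
    # The 'order' column keeps product views in chronological order per session.
    df = df.sort_values(["session_id", "order"])
    sequences = (df.groupby("session_id")["product_id"]
                   .apply(list)
                   .apply(remove_consec_duplicates))
    # Drop sessions that end up with a single product view.
    sequences = sequences[sequences.apply(len) > 1]
    return sequences.rename("product_sequence").reset_index()
```
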
session_based_product_sequences = generate_sequential_products()
session_based_product_sequences.head(5)
Sample Data Records (Image by Author)

Step III: Generate product vectors (embeddings) using Word2Vec algorithm

Define a function named create_product_embeddings which takes the data frame session_based_product_sequences from the previous function's output and trains a Gensim Word2Vec model, setting the window size to 5 and the embedding size to 10. This function returns a two-column dataset where the first column is the product id and the second is its representative 10-dimensional numerical vector returned by the Word2Vec model.

Function #3: create_product_embeddings()
product_ids_and_vectors = create_product_embeddings()
product_ids_and_vectors.head(5)
Sample Data Records (Image by Author)

Step IV: Fit K-Means model on the product vectors (embeddings)

Define a function named fit_kmeans which trains a K-Means model using the product vectors data frame product_ids_and_vectors generated in the previous step. In the code snippet below, we set the number of clusters to an arbitrary number, 10. However, you could choose the number of clusters based on the total number of categories that are supposed to exist on your platform.

We also create two plots with two helper functions: plot_cluster_distribution and plot_cluster_scatter. The first creates a bar chart showing the distribution of the number of members across clusters, and the second is a scatter plot showing how the clusters are formed in 2D space, marking their centroids with black dots.

Function #4: fit_kmeans()
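A sketch of the clustering step with scikit-learn, assuming the embeddings frame from Step III is passed in (the plotting helpers are omitted here):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_kmeans(product_ids_and_vectors, n_clusters=10):
    # Stack the per-product embeddings into an (n_products, 10) matrix.
    X = np.vstack(product_ids_and_vectors["vector"].values)
    return KMeans(n_clusters=n_clusters, random_state=42, n_init=10).fit(X)
```
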
model = fit_kmeans()
Check out how these plots look on Layer: https://app.layer.ai/layer/Ecommerce_Recommendation_System/models/clustering_model#Product-Distribution-over-Clusters (Image by Author)

Step V: Save clusters as a data frame

Define a function named save_final_product_clusters which creates a data frame storing the member list of each cluster. This function uses the model from the previous function and the product_ids_and_vectors data frame from the create_product_embeddings function's output. Since we set the number of clusters to 10, there will be a total of 10 rows in the dataset. A cluster member, in this case, is a product id.

Function #5: save_final_product_clusters()
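A sketch of this step, assuming the fitted model and the embeddings frame from the earlier steps are passed in:

```python
import numpy as np
import pandas as pd

def save_final_product_clusters(model, product_ids_and_vectors):
    X = np.vstack(product_ids_and_vectors["vector"].values)
    labels = model.predict(X)
    # One row per cluster, holding the list of member product ids.
    return (pd.DataFrame({"product_id": product_ids_and_vectors["product_id"],
                          "cluster": labels})
              .groupby("cluster")["product_id"]
              .apply(list)
              .rename("cluster_members")
              .reset_index())
```
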
cluster_members_df = save_final_product_clusters()
cluster_members_df.head(10)
Sample Data Records (Image by Author)

Step VI: Get similar product recommendations for a given product

Now, let's write a code block to fetch some similar product recommendations for a specific product id: "A13".

For that, we first need to get the representative numerical vector of this product from the product_ids_and_vectors data frame and give it to the model to obtain its assigned cluster number. Then, we fetch the member list of the cluster to which product "A13" belongs. In the final step, we randomly select 5 similar products from that cluster and voilà, we are done!

Code Snippet for Demo
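The demo gist is embedded in the original post; the steps above can be sketched as follows, assuming the artefacts from the previous steps are available (the helper function name is illustrative):

```python
import random
import numpy as np

def recommend_similar_products(product_id, product_ids_and_vectors,
                               model, cluster_members_df, n=5):
    # Look up the product's embedding and ask the model for its cluster.
    vec = (product_ids_and_vectors
           .set_index("product_id").loc[product_id, "vector"])
    cluster = int(model.predict(np.asarray(vec, dtype=float).reshape(1, -1))[0])
    # Fetch the member list of that cluster.
    members = (cluster_members_df
               .set_index("cluster").loc[cluster, "cluster_members"])
    # Randomly pick n similar products, excluding the query product itself.
    candidates = [p for p in members if p != product_id]
    return random.sample(candidates, min(n, len(candidates)))
```

Called as `recommend_similar_products("A13", product_ids_and_vectors, model, cluster_members_df)`, it returns 5 random products from A13's cluster.
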

Output will look like:

5 Similar Product Recommendations for A13:  ['C17', 'P60', 'C44', 'P56', 'A6']

PART II: MLOPS

Layer is a collaborative machine learning platform that comes with some pre-defined function decorators. As a user, all you have to do is wrap your Python functions with one of the Layer decorators (the dataset and model decorators), depending on your function's return type. For example, if your function returns a dataset and you want Layer to track it, wrap it with the Layer dataset decorator and Layer will start versioning your dataset automatically. The procedure is the same for models: if your function returns an ML model, wrap it with the Layer model decorator and Layer will version your model automatically every time you run the same notebook.

Layer Installation and Login

Let's start by installing and logging into Layer in just a few lines of code. Then, initialise your project on Layer using 'layer.init(your_project_name)'.

!pip install layer
import layer
from layer.decorators import dataset, model
layer.login()
layer.init("Ecommerce_Recommendation_System")

Layer Dataset Decorator

Let's wrap the first function of this tutorial, raw_session_based_clickstream_data, with the Layer dataset decorator '@dataset(dataset_name)' and give your Layer dataset a name: "raw_session_based_clickstream_data". You can also log other types of data along with your dataset, such as a dataset description, using 'layer.log()' as shown in the code snippet below.
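The snippet below is a sketch of that pattern rather than the article's exact gist; the try/except fallback is added here so it also runs where Layer is not installed, and the file path is illustrative:

```python
import pandas as pd

try:
    from layer.decorators import dataset
except ImportError:
    # No-op stand-in so the sketch also runs without Layer installed.
    def dataset(name):
        return lambda f: f

@dataset("raw_session_based_clickstream_data")
def raw_session_based_clickstream_data(csv_path="e-shop clothing 2008.csv"):
    df = pd.read_csv(csv_path, sep=";")
    # With Layer available, you can also log metadata along with the dataset:
    # layer.log({"description": "Raw clickstream: session id, product id, order"})
    return df
```
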

From now on, Layer will track the dataset returned from the function, log other data along with it and version it automatically. Every time you run this function, it will create a new version of your dataset. In this way, Layer enables you to see the whole journey of the dataset on the Layer Web UI, as shown in the picture below.

Screenshot from the Layer Dataset Page (Image by Author)

You can see a list of dataset versions on the left and some data profile information on the right-hand side of the page. All other data logged along with the dataset appears under the tab named "Logged data".

Layer Model Decorator

Now, let's do the same for models. This time, wrap your model function fit_kmeans() with the Layer model decorator '@model(model_name)' and give your Layer model a name: "clustering_model". You can also log other types of data along with your model, such as some plots, using 'layer.log()' as shown in the code snippet below.

The only difference between the code block in Step IV of the previous section and the one below is just 3 extra Layer-specific lines.
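Again as a sketch rather than the exact gist: the decorator line and the (commented) layer.log call are the Layer-specific additions to the Step IV code, and the try/except fallback lets the snippet run without Layer installed:

```python
import numpy as np
from sklearn.cluster import KMeans

try:
    from layer.decorators import model
except ImportError:
    # No-op stand-in so the sketch also runs without Layer installed.
    def model(name):
        return lambda f: f

@model("clustering_model")
def fit_kmeans(product_ids_and_vectors, n_clusters=10):
    X = np.vstack(product_ids_and_vectors["vector"].values)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10).fit(X)
    # With Layer available, log the helper plots along with the model, e.g.:
    # layer.log({"Product Distribution over Clusters": distribution_fig})
    return kmeans
```
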

From now on, Layer will track and version your model and log all other data along with it. It will enable you to compare different versions of your model, revert to any previous model version in case of a failure and monitor your model's performance continuously. Here is a screenshot taken from a model page on the Layer Web UI.

Screenshot from the Layer Model Page (Image by Author)

Layer Running Environment Modes

Layer has 2 running environment modes: local and remote.

Local Mode: In local mode, you call your functions as usual in the order you wish, and the code runs on your local computer using your own computation power. This mode still logs all artefacts created along the run, such as datasets and models, to Layer's remote host.

# LAYER LOCAL MODE
raw_session_based_clickstream_data()
generate_sequential_products()
create_product_embeddings()
fit_kmeans()
save_final_product_clusters()

Remote Mode: In remote mode, you put all your Python function names into 'layer.run()', which runs your code remotely on Layer's infrastructure using Layer's resources. In this way, you can easily make use of the huge computation power of Layer machines and GPUs, for example to run your Deep Learning projects.

# LAYER REMOTE MODE
layer.run([raw_session_based_clickstream_data,
           generate_sequential_products,
           create_product_embeddings,
           fit_kmeans,
           save_final_product_clusters], debug=True)

If you are using Layer in remote mode, it is recommended to declare dependencies between datasets and models in the decorator signatures. For instance, in the sample code below, the model 'clustering_model' depends on the dataset 'product_ids_and_vectors', and the dataset 'final_product_clusters' depends on the dataset 'product_ids_and_vectors' and the model 'clustering_model'.

# MODEL DECORATOR WITH DEPENDENCIES
@model("clustering_model", dependencies=[Dataset("product_ids_and_vectors")])

# DATASET DECORATOR WITH DEPENDENCIES
@dataset("final_product_clusters", dependencies=[Model("clustering_model"), Dataset("product_ids_and_vectors")])

This is just meant to be a quick introduction to Layer. For more information about Layer SDK and other features, please visit:

Full Notebook Integrated with Layer

Let's put all the code blocks together in a single Python notebook. Here is the full version of the e-commerce product recommendation system notebook with the Layer integration:

You could also check out how this project looks on Layer by clicking the link below:

Thanks for reading! Your feedback is valuable. Please share your thoughts with us in the comment section below.

References:

  1. Łapczyński M., Białowąs S. (2013) Discovering Patterns of Users’ Behaviour in an E-shop — Comparison of Consumer Buying Behaviours in Poland and Other European Countries, “Studia Ekonomiczne”, nr 151, “La société de l’information : perspective européenne et globale : les usages et les risques d’Internet pour les citoyens et les consommateurs”, p. 144–153



Senior Machine Learning Scientist at Booking --- Living in Amsterdam --- M.Sc. in Machine Learning