
Modeling Product Search Relevance in e-Commerce: Home Depot Case Study

Predict the relevance of search results on homedepot.com using Machine Learning

Kriz Moses
Towards Data Science
31 min read · Jun 29, 2021


Photo by Anete Lusina from Pexels

Table of Contents

  1. Introduction
    • Abstract
    • Problem Statement
    • ML Formulation of the Business Problem
    • Data Overview
    • Objective and Metric for the ML problem
    • Real-World Constraints
  2. Literature Review
  3. Exploratory Data Analysis: Part 1
    • Relevance Scores
    • Product_uid
    • Attributes
    • Description
    • Merging product text data
    • Filling Null Values
  4. Data Cleaning
    • Basic Preprocessing
    • Standardizing Units
    • Correcting typos in Search
    • Further Preprocessing
  5. Exploratory Data Analysis: Part 2
    • Basic Text Statistics
    • Distribution of Unigrams and Bigrams
  6. Feature Engineering
    • Set-Theoretic Features
    • VSM based Features
    • Probabilistic Features
    • Query Expansion using Word2Vec
  7. Performance Chart of the Feature Sets
  8. Final Modeling and Evaluation
    • Stacking Regressor
    • Base Models
    • Meta Model
    • Performance on Kaggle (Top 10%)
  9. Extending into a Real-World Search Engine
  10. Full Pipeline
  11. Final Testing and Results
    • Testing
    • Run Time
    • Web Application Demo
  12. Conclusion and Future Work
  13. Acknowledgments
  14. References

Introduction

Abstract

Everything is moving toward digitalization, and so is the shopping experience. E-commerce has grown rapidly over the past few years, and online product search has become one of the most important factors in providing customers with a satisfying shopping experience. In this blog, I propose a robust way of predicting the desired products for a given search query, using techniques from machine learning, natural language processing, and information retrieval.

Problem Statement

The task is simple to understand. For any search query that the customer enters, I need to find the most relevant products and show them to the user in order of their relevance. From a business point of view, there are a few points that need to be considered. First, the products need to be ranked; even among the most relevant products, we need to be able to tell which one is more relevant. Second, there is a time constraint: the results need to be shown within seconds.

Machine Learning Formulation of the Business Problem

The task can be formulated as follows: Given a search and a product, find the relevance score between them i.e. how relevant that product is to the search query at hand. So let’s say my machine has learned how to predict the relevance score for a (search-query, product) pair. Now, for any search that the user enters, I can calculate the relevance score for that very search paired with all the products in my database and show the (say) top 10 results to the customer.

Thus, if I have a lot of labeled data, i.e. a lot of (search-query, product) pairs with their relevance scores, then I can pose this as a supervised ML problem. And that is exactly what I have done in this case study. The data I have used is provided by Home Depot for the Kaggle competition home-depot-product-search-relevance.

Now, in a real-world e-commerce search engine, calculating the relevance score of every product for a given search is not possible, because the number of products on a typical e-commerce website is very large, which makes this computationally expensive and very time-consuming.

✦ Thus, first, we retrieve a few candidate products using a simpler retrieval model which permits fast query evaluation. And in the second phase, a more accurate but computationally expensive machine-learned model is used to re-rank these products.

To explain this, consider that we have a set of 100,000 products and the search query is “solar lamp”. The first, simpler retrieval model will retrieve a few candidate products. This model can be as simple as an AND operator between the words of the search and the product text. So here, a model governed by the AND operator will get all the products containing the words “solar” and “lamp”. Say it retrieves some 500 products. Now, on top of these 500 products, we can run our complex machine learning algorithm and calculate the relevance score for each (search, product) pair, where our search term “solar lamp” remains constant and the products vary. We then re-rank them based on their relevance scores and show the top products to the user. This is called Learning to Rank (LTR). In this blog, the focus will be on the learning-to-rank part of the system. In the end, I have also explained how I extended it to make a full-fledged search engine.
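To make this two-phase idea concrete, here is a minimal illustrative sketch; the function names and the relevance_model callable are assumptions made purely for illustration, not the actual implementation built later in this case study:

```python
# Phase 1: a cheap AND-style filter retrieves candidates.
# Phase 2: the learned model re-ranks them.
def retrieve_candidates(query, products):
    """Return products whose text contains every word of the query (a simple AND operator)."""
    query_words = set(query.lower().split())
    return [p for p in products if query_words <= set(p["text"].lower().split())]

def rerank(query, candidates, relevance_model):
    """Score each (query, product) pair with the learned model and sort by descending relevance."""
    return sorted(candidates, key=lambda p: relevance_model(query, p), reverse=True)

# Example with hypothetical data and model:
# products = [{"text": "hampton bay outdoor solar led lamp"}, {"text": "5 gal bucket"}]
# top10 = rerank("solar lamp", retrieve_candidates("solar lamp", products), relevance_model)[:10]
```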

Data Overview

The data is in the form of four CSV files; train.csv, test.csv, attributes.csv, and product_description.csv. You can download it from here.

Training Data: We have a total of 74,067 rows for training in the train.csv file. A part of the data is shown below; each row corresponds to a (search_term, product_title) pair and a relevance score telling how relevant that product is to the search.

train.csv file

A relevance score is a real number from 1 to 3 (1 for irrelevant and 3 for a perfect match). For each product, we generally have multiple rows, i.e. multiple search queries with their relevance scores. The product_uid is a unique identifier for each product.

The product_description.csv contains a text description of each product. And the attributes.csv file contains some additional information about a subset of products.

The search query is in the form of a string and the product is represented by text data available to us in the form of its title, description, and some attributes. In the real world, we might also have other features for the product like its image, ratings, etc. but here we only have text to work with.

The relevance scores for the data were manually marked. Each pair (search, product) was evaluated by at least three human raters and then the final relevance score was taken as the average.

Test Data: As this is part of a Kaggle competition, we are also given a test.csv file containing around 166,000 rows for which we need to predict the relevance scores. About half of the products in the train set are found in the test set. The Venn diagram for the two sets is shown below.

Objective and Metric for the ML problem

Objective: For every given (search-query, product), predict the relevance score. This will be posed as a regression problem as it will be easier for us to rank the products.

Metric Used:
The metric given in the competition is Root Mean Square Error (RMSE). It is the most widely used metric for regression problems and works well in most cases. It differs from MAE (mean absolute error) in that it penalizes large errors more. One issue with RMSE is that its value is difficult to interpret; it can range from zero to infinity. Hence, a baseline score is generally established using a naive regression model, which can then be used as a reference for other models.

Source: https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/rmse.png
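For reference, the RMSE formula (the one shown in the image above) is:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

where yᵢ is the true relevance score, ŷᵢ the predicted score, and n the number of (search, product) pairs.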

Real-World Constraints

  • Latency constraint: The products need to be shown within a few seconds.
  • Ranking of products is necessary.
  • Interpretability is partially important: We would want to know why our model has suggested a particular product to the customer.

Note: In the blog, I have used the terms ‘search query’, ‘query’, ‘search term’ and ‘search’ interchangeably.

Literature Review

The problem at hand is one of search relevance, which broadly falls under the field of Information Retrieval. A great deal of research has been published on techniques for free-text querying.

The traditional, non-machine-learned models centered around exploring various ways of modeling language. Set-theoretic models like the Boolean model and Fuzzy Retrieval are based on representing queries and documents as sets of words and then applying basic set operations to get the relevance. Algebraic models like the VSM represent documents and queries as vectors (e.g. bag-of-words), and the similarity between them is then calculated as a scalar quantity. More popular models like Okapi BM25, Indri, and Language Models are based on probability theory. Although all these models come from different underlying theories, most of them use very similar concepts from Natural Language Processing, like bag-of-words, TF-IDF, and cosine similarity.

The more recent approaches include Learning to Rank, where the query-document pairs are typically represented by feature vectors and a supervised or semi-supervised machine learning model is trained on them to produce the rankings. Here, it is common to use the traditional retrieval models directly as features. LSA (or LSI) maps the term-document matrix to a ‘concept’ space, giving a denser and better representation of the text. The Word2Vec algorithm uses a neural network to learn word associations from a large corpus of text and maps words to high-dimensional vectors, which can be used as features. Query Expansion is another effective idea: expand queries with similar words present in the corpus. For measuring similarity, word vectors from a Word2Vec model are commonly used.

Exploratory Data Analysis: Part 1

In this section, I present the analysis of the data at hand. I have saved the train.csv file as a dataframe in the train_df variable.

train_df

The id column is of no use. The product_uid is a unique id of each product and is important while merging the data from other text files. There are no null values in the train data.

Relevance Scores

The relevance scores were found to be categorical in nature. This was because most of the scores are the average score of 3 raters and each rater could only give an integral score of either 1 or 2 or 3.

The majority of the products had a high relevance score i.e. most of the products were relevant to the query. Because of the categorical nature of the relevance scores, a classification model can also be built on this if it helps the final rankings. The scores {1.25, 1.5, 1.75, 2.25, 2.5, 2.75} occurred in very low frequencies.

Product_uid

A lot of products occur multiple times in the train data i.e. we have multiple search terms and corresponding relevance for the same product.

  • More than 10,000 products have multiple occurrences
  • The total number of unique products out of 74067 is 54667
  • Max times a product has occurred is 21

Attributes

For the products, we have text information in the form of product title, product description, and attributes. The product title is in the train.csv file. The description and attributes are extracted from the product_description.csv and attributes.csv file respectively.

An instance of the attributes.csv file is shown below. The field “name” represents the name of the attribute and “value”, the attribute text.

attributes.csv

Each product in the attributes.csv file has multiple attributes. The minimum number of attributes for a product is 5 and the maximum is 88.

number of attributes vs product-index

But not all the products in the train.csv file are found in the attributes.csv file: 16,263 out of a total of 54,667 products do not have attributes.

Sets of products in the train.csv and attributes.csv files

Then I explored the most frequently occurring attributes; Brand was the most common one, found in 86,250 instances in the attributes.csv file.

Most frequent attributes

Moreover, it was observed that the product_title usually contains the brand information. Thus, I added brand as a separate feature in my train_df.

Description

The product_description.csv has the description of the products. Unlike attributes, we do have descriptions for all the products in the train.csv file.

Merging the Attributes, Description, and the Brand feature

Finally, I merged all the attributes, the description, and the brand with the train_df. All the attributes for a product were first concatenated and then stored as the ‘combined_attr’ feature.

train_df after merging

Note: From here on, the text fields representing the product, i.e. the product_title, brand, combined_attr, and product_description have been referred to collectively as the ‘product-text’ or ‘document-fields’ or ‘documents’.

Filling Null Values

Before moving on to cleaning the text, I found that our train_df has a lot of null values. These appeared when I merged the attributes, as many products in the train data do not have attributes.

We can see that the null values occur only in the brand and combined_attr features. There is a considerable number of null values, so filling them properly was an important step in the cleaning.

Brand: It was observed that the brand of a product is generally present in the first few words of the title. For example, the brand for the product with the title “Simpson Strong-Tie 12-Gauge Angle” is “Simpson Strong-Tie”, and for “Delta Vero 1-Handle Shower Only Faucet” it is “Delta”. Also, 99% of the brands are four words long or less, so the brand should usually be present in the first four words of the title.

So I first stored all the unique brands in a list, unique_brands. Then, for any row with a null brand, I checked whether the first four words of that product’s title are present in the unique_brands list. If not, I checked the first three words, and so on down to the first word. This method worked pretty well, and I was able to fill 15,003 null brands. The remaining 2,625 values were filled with the first word of the title.
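Here is a minimal sketch of that heuristic; the helper name fill_brand and the exact pandas usage are illustrative:

```python
# Illustrative sketch of the brand-filling heuristic described above.
def fill_brand(title, unique_brands, max_words=4):
    """Try the first 4, 3, 2, 1 words of the title as a brand; fall back to the first word."""
    words = title.split()
    if not words:
        return ""
    for n in range(min(max_words, len(words)), 0, -1):
        candidate = " ".join(words[:n])
        if candidate in unique_brands:
            return candidate
    return words[0]  # fallback: first word of the title

# unique_brands = set(train_df["brand"].dropna().unique())
# mask = train_df["brand"].isna()
# train_df.loc[mask, "brand"] = train_df.loc[mask, "product_title"].apply(
#     lambda t: fill_brand(t, unique_brands))
```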

Description: Filling the remaining null values was not that hard, as it was observed that the description text is very similar to the combined attributes text. Hence, all the null values of the combined_attr field were filled with the corresponding product_description values.

Data Cleaning

This was one of the most important parts of this whole case study. A large amount of time went into observing the text and making the necessary preprocessing steps to make it suitable for modeling. Many techniques from natural language processing were used.

A very simple motivation for why text preprocessing is so important in this project is that we are, in a way, trying to find the similarity between the search query and the product text. Consider the search “5gal bucket” and the product title “Leaklite 5-Gal Black Bucket”. If we were using a measure like number-of-common-words, the similarity in this case would be zero, as there are no words in common. Now say that with a bit of preprocessing, our search and title become “5 gal bucket” and “leaklite 5 gal black bucket” respectively. Our similarity score now comes out as 3, because we have 3 words in common. Thus, a little preprocessing can make a huge difference.

Basic Preprocessing

The basic preprocessing steps involved lowercasing, removing special characters, removing stop-words, and stemming.
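A minimal sketch of these basic steps, assuming NLTK for stop-words and stemming, is shown below; the case study’s exact cleaning rules are not given in the post, and (as noted later) stop-word removal and stemming were applied only after the spell correction, hence the optional flags:

```python
import re
from nltk.corpus import stopwords          # requires: nltk.download("stopwords")
from nltk.stem import SnowballStemmer

STOPWORDS = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def basic_preprocess(text, remove_stopwords=True, stem=True):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # lowercase + drop special characters
    words = text.split()
    if remove_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    if stem:
        words = [stemmer.stem(w) for w in words]
    return " ".join(words)
```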

Standardizing Units

  1. Many of the product titles contain the dimensions of the product. A very common representation is “8 ft. x 5 ft. x 8 ft.”. The same is observed in the product description.
  2. In product_title, units are mostly represented as {in. lb. sq. lb. oz. gal. mph}, but in product_description and search_term, units are represented in multiple ways, for example inch as {in. in inches} and gallons as {gal. gals gallon}. Hence, these units need to be standardized.
  3. In search terms, there are some common errors when specifying measurements, for example ‘4shelves’, ‘9x12’, ‘g135’, ‘5gal’, ‘1/2 in by 12’. Hence, numbers needed to be separated from words.

Preprocessing Steps:

  1. Separate numbers from words: ‘9x12’ → ‘9 x 12’, ‘5gal’ → ‘5 gal’ …
  2. Bring all units of measurement to a consistent representation. For example, {gal. gals gallon} are all converted to “gallon”. Similarly, {ft fts feets foot foots} are converted to “feet”. A minimal sketch of these two steps is shown below.
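The following regex-based sketch illustrates the two steps above; the unit map shown here is only a small subset of what would actually be needed:

```python
import re

UNIT_MAP = {
    r"\b(gal|gals|gallon|gallons)\b": "gallon",
    r"\b(ft|fts|feets|foot|foots|feet)\b": "feet",
    r"\b(in|inch|inches)\b": "inch",
}

def standardize(text):
    # 1. separate numbers from words: "9x12" -> "9 x 12", "5gal" -> "5 gal"
    text = re.sub(r"(\d+)([a-zA-Z]+)", r"\1 \2", text)
    text = re.sub(r"([a-zA-Z]+)(\d+)", r"\1 \2", text)
    # 2. bring units of measurement to a consistent representation
    for pattern, unit in UNIT_MAP.items():
        text = re.sub(pattern, unit, text)
    return text

# standardize("5gal bucket 9x12")  ->  "5 gallon bucket 9 x 12"
```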

Correcting Typos in Search Terms

While going through the search terms, I found that they contain a lot of spelling errors: ‘mower’ was spelled as ‘mowe’, ‘toilet’ as ‘toiled’, etc. This is understandable, as these searches are entered by users, and such spelling errors are common in online search. But this was a big problem, as misspelled words could make features like common-word-count very ineffective. To correct this, I used a custom spelling corrector, and the results were pretty positive.

Incorrect search → Corrected search

The spelling corrector I used was built by Peter Norvig. It is based on probability theory but is very simple to understand and very effective. The best thing about it is that you can fit it to your own data. This was very important, as a general-purpose spelling corrector might not recognize brand names that are specific to our data. I encourage you to follow the link and read more about it.
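For reference, a condensed version of Norvig’s corrector (http://norvig.com/spell-correct.html) looks like this; the placeholder corpus string is mine, and in the case study the word counts would be built from the product text so that brand names are recognized:

```python
import re
from collections import Counter

corpus = "toilet lawn mower solar lamp angle bracket shower faucet"   # placeholder text
WORDS = Counter(re.findall(r"\w+", corpus.lower()))

def P(word, N=sum(WORDS.values())):
    """Probability of `word` under the corpus word frequencies."""
    return WORDS[word] / N

def correction(word):
    """Most probable spelling correction for `word`."""
    return max(candidates(word), key=P)

def candidates(word):
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def known(words):
    return set(w for w in words if w in WORDS)

def edits1(word):
    """All edits that are one edit (delete/transpose/replace/insert) away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

# correction("toiled") -> "toilet", correction("mowe") -> "mower"
```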

I corrected all the search terms and for experimental purposes, I also kept the raw/uncorrected search.

Further Preprocessing

In the basic preprocessing, I left the stop-word removal and the stemming part as I wanted to fix the incorrect search terms first. Here both these steps are performed and the cleaned text is stored in cleaned_df.

And our final cleaned dataframe looks as shown below

cleaned_df

I also created a dataframe cleaned_df2 where I store the text data without stemming. This was done to be able to use pre-trained encodings like Word2Vec later.

Exploratory Data Analysis: Part 2

After cleaning, I was able to make more accurate findings from the text data analysis. There was a lot of analysis done, and I’ll go over the most important findings/conclusions in this section.

I divided the relevance scores into two classes for easier analysis: “High” with scores > 2 and “Low” with scores ≤ 2, also referred to as the positive and negative class respectively. Although this is a regression problem, dividing the scores into two classes helped a lot in the analysis; a feature that helps separate the High class from the Low class will most likely be helpful in the final regression problem as well.

We have a total of 6 text fields and the analysis was done separately on each.

NOTE:
After close inspection, I observed that the description and attributes text are very similar and thus, I did not consider the attributes text in any further analysis.

Basic Text Statistics

A number of basic stats on the text were analyzed, namely the length of the text string, the word count, and the average word length. For corrected_search, the distributions of these stats are shown below.

Distributions of basic stats

Observations :

  1. Most search queries have a word count of between 2 and 5 words.
  2. The distributions of these stats are very similar for both positive and negative classes.
  3. The most important stat from the above was found to be word-count.

Notice that as we move towards the right, the ratio of the height/density of the positive class to the negative class starts decreasing, i.e. the negative class becomes more prominent. That means that as word-count increases, the probability of the relevance being High decreases. Thus, word-count gives some information about the relevance score.

4. The above statement can be justified by looking at the correlation of word-count with relevance: there is a small correlation, and it is negative in nature.

5. So word-count might not be that bad a feature.

6. number-of-characters and average-word-length had very low correlation scores with relevance (-0.092 and 0.084 respectively), and hence might not be helpful.

All these stats were analyzed for title, description, and brand text as well. And the same pattern was found. The only stat which had any noticeable correlation with the relevance was word-count.

Apart from this, I also experimented with two more stats: readability-index and sentiment-score. Their correlation scores with relevance were -0.035 and 0.019, which are very low, and they did not help distinguish the two classes.

Most common Unigrams and Bigrams

Search Text: Most common unigrams for corrected_search are as follows:

‘inch’ and ‘x’ are the most frequent terms. Also, a lot of numbers occur very frequently, which means that many search terms contain the dimensions of the desired product. These numbers and units are important, as they can be a good indication of what kind of product the user wants. Hence, I did not remove the numbers during data cleaning and will keep them for the final featurization as well.

Most frequent unigrams for each class

However, the most frequent words for both classes are very similar. On the left, you see the most frequent words in search for each class with their frequencies.

Even the most common bigrams for both classes were very similar.

Hence, from the above observations, we see that for the search text, the most frequent unigrams and bigrams are very similar in both classes. Thus, vectorizations like simple BOW or TF-IDF BOW might not work that well.

Description Text: Even in the description text, it was found that the most common word is inch, with numbers being prominent.

The most frequent unigrams and bigrams for both classes were found to be very similar.

The same analysis was done for brand and title text and a similar trend was found. The text fields in themselves did not give a lot of information about the relevance score.

Conclusions from EDA2

  1. Among all the stats on the text, word-count was the most effective at separating the two classes. We could not get much information from other stats like average-word-length and length-of-string, both of which showed very low correlation values. Hence, in the final featurization, I will only include the word-count feature.
  2. The unigram and bigram distributions of all the text fields (search, title, description, brand) are very similar for both classes, and hence simple vectorizations like BOW and TF-IDF BOW might not work that well on their own.

Feature Engineering

The features for this problem were mostly derived from the field of Information Retrieval. Note that here, the features need to be arbitrary functions of both the search and the product combined, as we are trying to find the relevance for the pair of (search, product). Some basic retrieval models were used directly as features. I have divided the features into four sets:

  • Set 1: Set-Theoretic Features
  • Set 2: Vector Space Models (VSM) based features
  • Set 3: Probabilistic Features
  • Set 4: Query Expansion using Word2Vec

Feature Set 1: Set-Theoretic Features

These features are derived from simple set-theoretic retrieval models like the Boolean model, Fuzzy Retrieval, etc. which are based on representing queries and documents as sets of words and then applying some operation on them to get the relevance score. The features include:

➤ The count of common words between search and document-fields (title, brand, description)

➤ Cosine coefficient and Jaccard coefficient between search and document fields

➤ Length of search and the document fields

➤ Whether the last word in the search is in the document fields or not

The last feature was chosen intuitively because, in a multi-word search query, the initial words generally describe the attributes of the product you need, but the last word represents the very noun of the product. In “black jacket”, for instance, the first word describes the color of the jacket, but the last word is the noun of the product you need.

The above features were created for both the corrected_search and the raw_search.
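As an illustration, a few of these set-theoretic features could be computed as in the sketch below; the formulas are the standard set-based definitions, and the feature names follow the suffix convention explained in the note further down (ST = search vs. title). The case study’s exact implementations are not shown in the post:

```python
def set_features(search, title):
    s, t = set(search.split()), set(title.split())
    common = s & t
    return {
        "num_common_ST": len(common),
        "jaccard_ST": len(common) / len(s | t) if (s | t) else 0.0,
        "cosine_ST": len(common) / ((len(s) * len(t)) ** 0.5) if (s and t) else 0.0,  # set-based cosine
        "len_search": len(s),
        "len_title": len(t),
        "islast_ST": int(search.split()[-1] in t) if search.split() else 0,
    }

# set_features("5 gal bucket", "leaklite 5 gal black bucket")
# -> {'num_common_ST': 3, 'jaccard_ST': 0.6, 'cosine_ST': 0.77..., 'islast_ST': 1, ...}
```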

Comparison between corrected_search and raw_search
  • It is observed that there is a considerable number of rows in which the num-common-words feature is higher for corrected_search than for raw_search. Hence, correcting the search terms did help.

EDA on Feature Set 1

A total of 29 features were created here. The PDF (Probability Density Function) plots of a few are given below.

Note: Features suffixed with ‘ST’ are those between the corrected_search and the title; ‘SB’ is for brand and ‘SD’ for description. For raw_search, the suffix takes the form ‘r_ST’. For example, ‘cosine_ST’ is the cosine coefficient between corrected_search and title, whereas ‘cosine_r_ST’ is that between raw_search and title.

PDF plots of set 1 features

It’s difficult to make sense of them unless you look closely. Take cosine_SD as an example. In general, across all the PDFs, the curve for the positive or “High” class lies above that of the negative or “Low” class. This is simply due to class imbalance, as there are more points in the positive class than the negative. You can see that at the mark 0, the curve for the positive class is slightly above the negative. As you move towards the right, the positive curve remains above, but the ratio of the height of the positive curve to the negative curve increases. That is, as cosine_SD increases, the ratio of points in the positive class to the negative class increases, and hence the probability of a point belonging to the positive class increases. So given two points, the one with the larger value of cosine_SD should have a higher relevance score.

Thus, from the above observation, we can conclude that cosine_SD does provide some information about the relevance score for a (search, product-text) pair.

The same analysis was done for all the features in this set including the ones for raw_search and the key observations are:

  1. The features num_common, cosine, and Jaccard showed considerable differences between the distributions of the positive and negative classes. Among them, cosine_ST was the most effective.
  2. The features suffixed with ST, i.e. the features between search and title, showed more distinction between the distributions of the positive and negative classes compared to SD and SB.
  3. The length features did not show much difference in distributions.
  4. One of the most interesting features was the islast feature, which showed a pretty considerable difference in distributions. For example, in islast_ST the density of the negative class is similar at both 0 and 1, but the density of the positive class increases considerably from 0 to 1.

✦ So as per the observations, cosine and islast features are the most important features followed by Jaccard. Among them, ST features should show the best results.

The correlation matrix for these features was also constructed:

correlation matrix for set 1 features

And their correlation scores with relevance were found

Most correlated set 1 features to the relevance score

Note: A desired feature set would be a set in which all the features have a high correlation score with relevance and low correlation scores with each other.

Observations and Conclusions:

  1. A few features have a high correlation with each other. These correlated features can be redundant and hence can be removed while modeling.
  2. The correlation scores with respect to relevance are not very high, but some correlation does exist. Hence, these features can be useful.
  3. We can see that the correlations of cosine_ST, jaccard_ST, and islast_ST are the highest, exactly as we predicted.
  4. These values were also calculated for the raw_search, and in general the correlation of the corrected_search features with relevance is higher than that of the raw_search features.
  5. I also applied t-SNE to these features to visualize them in 2-D space: I projected the embedded features into 2-D and colored the points by class. I tried it with the two classes from before and also with seven classes, but it did not give any insight at all; none of the classes showed any clustering.
Visualizing TSNE features with seven and two classes respectively.

Feature Set 2: VSM based Features

VSM based features are typically based on representing documents and queries as vectors and then calculating the similarity between them as a scalar quantity.

Here, to convert the text into vectors, I used Latent Semantic Indexing, where Truncated SVD is applied to the term-document matrix so that queries and documents are converted to vectors in the “concept space”. Basically, I first vectorized the text using TF-IDF BOW and then used Truncated SVD to reduce the dimensionality of the vectors and create a denser, more meaningful representation. The features include:

➤ Apply LSI on the term-document matrices of search, title, and description separately and use the low-rank approximations as features directly.

➤ Apply LSI on full text (combined title and description). Then, transform the search query into the ‘concept’ space and calculate cosine similarity between them.
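A sketch of how such LSI features could be built with scikit-learn is given below (TfidfVectorizer + TruncatedSVD); the number of components (100) and the exact column names of cleaned_df are assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

# Fit the "concept space" on the combined product text (title + description)
full_text = cleaned_df["product_title"] + " " + cleaned_df["product_description"]
tfidf = TfidfVectorizer()
svd = TruncatedSVD(n_components=100, random_state=42)
doc_concepts = svd.fit_transform(tfidf.fit_transform(full_text))

# Project the search queries into the same concept space ...
query_concepts = svd.transform(tfidf.transform(cleaned_df["corrected_search"]))

# ... and use the row-wise cosine similarity between query and document as a feature
lsi_cosine = np.sum(normalize(query_concepts) * normalize(doc_concepts), axis=1)
```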

EDA on Feature Set 2

The PDFs for the cosine-similarity features are shown below.

PDFs of set 2 cosine-similarity features

As you can see, the similarity features provide a considerable distinction between the two classes and were very useful. The correlation values of each of these three with the relevance class were found to be 0.32 which is pretty good.

Feature Set 3: Probabilistic Features

These features are based on the core concepts in probability theory like the Bayes Theorem. The features used are:

➤ Language Models with Dirichlet, Absolute Discounting, and Jelinek-Mercer smoothing

➤ Okapi BM25 ranking function

➤ Query and fields represented as TF-IDF weighted Word2Vec vectors

➤ (sum, min, max) of (tf, idf, tf-idf) for the search query words in each of the text fields

The BM25 Ranking Function:
Out of these features, I would like to focus on the BM25 ranking function, as it was used on its own for text querying in traditional search engines and is still very popular for simple free-text querying. For a given (query, document) pair, it calculates a similarity score as shown below:

Source: https://en.wikipedia.org/wiki/Okapi_BM25
Image by Author, inspired by the lectures by Victor Lavrenko on Probabilistic Models for IR

Above is an attempt at explaining the whole equation. It is basically a modified form of TF-IDF weighting, where the TF value has been modified to incorporate the varying lengths of the documents and to make the score increase non-linearly with term frequency. To understand it better, I encourage you to go through the following lecture by Victor Lavrenko.
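For reference, the standard BM25 scoring function (the formula shown in the image above, as given on the linked Wikipedia page) is:

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\cdot
\frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

where f(qᵢ, D) is the frequency of query term qᵢ in document D, |D| is the document length, avgdl is the average document length in the collection, and k₁ and b are free parameters (commonly k₁ between 1.2 and 2.0, and b = 0.75).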

EDA on Feature Set 3

A similar analysis was done, and all the features had a decent correlation with the relevance score, though a little lower than those of sets 1 and 2. Interestingly enough, the feature with the highest correlation was found to be min_tf_ST, i.e. the minimum term frequency of the search words in the title text. Intuitively, too, it should be a good feature: min_tf_ST = 0 means that there are words in the search which do not occur in the title, whereas min_tf_ST ≠ 0 means that all the words in the search are found in the title, signifying high relevance.

most correlated set 3 features to relevance

The PDF plots of the ST features i.e between search and title are shown below.

PDFs of set 3 features

Observations and Conclusions:

  1. Among the tf, idf, and tf-idf features, the tf features, i.e. min_tf, max_tf, and sum_tf, are the most effective, whereas the idf features are not able to separate the two classes and do not show a high correlation with the relevance score either.
  2. The bm25 and language model features (JM, AD, Dir) do show a distinction between the two classes. Note that the language models as I designed them have a negative correlation with relevance, hence lower values point to high relevance.
  3. The only caveat with all these features was that they have very high correlations with each other, so there is a lot of redundancy, which might affect linear models like Linear Regression and SVM. To tackle this, one can drop one of each pair of highly correlated features.
High correlation between AD_SB and Dir_SB features

Feature Set 4: Query Expansion

Query Expansion is a very simple concept to understand.

  • First, all the words in the search query are converted to word vectors using Word2Vec.
  • Let q be a user query represented by a bag of terms, q = (t1, t2, …, t|q|).
  • To expand the query q: for each term t ∈ q, collect the k most similar terms to t using the Word2Vec vectors, and include these new terms in the query set, giving us the expanded query.

After query expansion, we can recalculate all the basic features from feature set 1 using the expanded search queries.
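A minimal sketch of this expansion step, assuming a gensim 4.x Word2Vec model trained on the unstemmed text in cleaned_df2 and k = 2 (both assumptions):

```python
from gensim.models import Word2Vec

sentences = [str(doc).split() for doc in cleaned_df2["product_description"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)

def expand_query(query, k=2):
    """Append the k most similar corpus terms for every word of the query."""
    expanded = query.split()
    for term in query.split():
        if term in w2v.wv:
            expanded += [word for word, _ in w2v.wv.most_similar(term, topn=k)]
    return " ".join(expanded)

# The set-theoretic features of feature set 1 can then be recomputed on expand_query(search).
```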

I did perform a bit of EDA on this set, and the observations were very similar to those for the search without expansion, i.e. expanding the terms did not prove to be very helpful. That might be because I trained a custom Word2Vec model on my own corpus and the model might not have been trained well. With a well-trained model, I believe query expansion could be very helpful.

Performance Chart of the Feature Sets

On these feature sets, I trained an XGBoost model with some hyperparameter tuning. I kept on adding the feature sets, checking for any improvements and the results were very positive. The training was done on the whole train data with 74067 data points. For evaluation, the metric used was RMSE with 5-fold cross-validation.
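The comparison was done roughly as sketched below, assuming scikit-learn’s cross_val_score with placeholder hyperparameters; the tuned values behind the chart are not given in the post:

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

def cv_rmse(X, y, cv=5):
    """5-fold cross-validated RMSE of an XGBoost regressor on one feature set."""
    model = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    return -scores.mean()   # average RMSE across folds

# rmse_f1 = cv_rmse(X_f1_comb, relevance)                            # set 1 only
# rmse_f1_f2 = cv_rmse(np.hstack([X_f1_comb, X_f2_lsi]), relevance)  # sets 1 + 2, and so on
```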

Performance chart of feature sets

Observations and Conclusions :

  1. The error continuously reduced as we added more features.
  2. The error for f1_raw was 0.4805 which improved a little with f1_corrected. It improved considerably to 0.4763 when we used both the feature sets combined. Hence both raw and corrected feature sets do improve the overall performance.
  3. Adding the f2_lsi set also saw a considerable drop in error to 0.471.
  4. The most improvement was observed when we added the feature set 3 where the error dropped from 0.471 to 0.4547. Not only this, feature set 3 alone performed very well with the error of 0.4569. And hence this was the most important set of all.
  5. The features from query expansion that is f4 did not bring a lot of improvement.
  6. The best score observed was 0.4541 on features from all the sets combined.

Final Modeling and Evaluation

Stacking Regressor

The XGBoost models in the previous section were built mainly to observe and compare the performance of different feature sets. For the final modeling part, I built a custom Stacking Regressor with 17 base models and a Ridge Regression model as the meta-model. For this, I proceeded as follows:

  1. I split the whole data from the train.csv file (74,067 points) into a train and a test set (80-20).
  2. I then split the 80% train set into D1 and D2 (50-50), so D1 and D2 have about 29k points each.
  3. From D1, I sampled with replacement to create 17 samples d1, d2, d3, …, d17 for my 17 base models. The sampling ratio used was between 0.85 and 0.9; thus, each base model was trained on around 25k data points.
  4. I then created 17 models and trained each of them on one of these 17 samples.
  5. I passed the D2 set to each of these 17 models and got 17 predictions for D2.
  6. Using these 17 predictions, I created a new dataset. Since I already knew the corresponding target values for D2, I trained my meta-model on these 17 predictions as features. The meta-model was hence trained on about 29k points, each having 17 features.
  7. For model evaluation, I used the 20% of the data kept aside as the test set at the beginning. I passed that test set to each of the base models, got the 17 predictions, created a new dataset from these predictions, and finally passed it to the meta-model to get the final prediction. Using this final prediction and the targets for the test set, I calculated the model’s performance score. (A simplified sketch of this procedure is given right after this list.)
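The sketch below follows the steps above in simplified form; it uses only three placeholder base models instead of the actual 17, and X, y are synthetic stand-ins for the real feature matrix and relevance targets:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor

# Placeholders for the real 74k x 71 feature matrix and the relevance scores (1 to 3)
X, y = np.random.rand(2000, 71), np.random.uniform(1, 3, 2000)

# Steps 1-2: 80-20 train/test split, then split the train part into D1 and D2 (50-50)
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_tr, y_tr, test_size=0.5, random_state=42)

# Steps 3-4: one bootstrap sample of D1 per base model
base_models = [XGBRegressor(), RandomForestRegressor(), Ridge()]   # ... up to 17 models
for model in base_models:
    idx = np.random.choice(len(X_d1), size=int(0.9 * len(X_d1)), replace=True)
    model.fit(X_d1[idx], y_d1[idx])

# Steps 5-6: predictions on D2 become the meta-features for the Ridge meta-model
meta_X = np.column_stack([m.predict(X_d2) for m in base_models])
meta_model = Ridge(alpha=1.0).fit(meta_X, y_d2)

# Step 7: evaluate on the held-out 20% test set
meta_X_test = np.column_stack([m.predict(X_test) for m in base_models])
rmse = np.sqrt(np.mean((meta_model.predict(meta_X_test) - y_test) ** 2))
```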

Base Models

I divided the base models into three sets M1, M2, and M3. Mainly tree-based and linear models were used.

  • M1 models were trained on a total of 71 base features. These are f1_comb(29) + f2_lsi(3) + f3(39). A variety of models (total 7) were trained on this set.
  • M2 models were trained on a total of 971 features. 71 of these were the base features. The other 900 features are from Word2Vec embeddings for search, title, and description.
  • M3 models were trained on a total of 4071 features: 71 base features and the rest 4000 from Truncated-SVD in feature set 3. For this set, due to its high dimensionality, I stuck to linear models only.

For cross-validation, I generally used 5-fold cross-validation. The performance report of the base models is shown below:

Performance of base models

Meta Model

For the meta-model, I used a simple ridge regression model.

After hyperparameter tuning, I got a final RMSE of 0.4539 on the test set. On the train set, I got an RMSE of 0.4531; hence, the model was not overfitting either.

Performance on Kaggle

I tested my final model on the test set provided by Kaggle containing over 160,000 datapoints and it got a Private Score of 0.46525. It ranked in the top 10% out of more than two thousand teams that participated in the competition.

Kaggle: Top 10%

Extending into a Real-World Search Engine

As mentioned earlier in the article, in a real-world search engine, for any search-query, we would first retrieve a few candidate products using a simpler retrieval model. And then, a machine-learned model is used to re-rank these products.

We have already built the machine-learned model in the form of a stacking regressor. For the initial retrieval part, I first created a database out of all the products in the train.csv and test.csv files. The database contains 124,428 unique products, with the fields product_uid, title, description, and attributes. For the retrieval model, I use BM25, which we have also used as a feature for our machine learning model. Given a search query, the BM25 model retrieves the top 100 products based on its ranking function.

Now, on top of these 100 products, I run my stacking regressor, giving it as input the 100 (search, product-text) pairs where the search remains constant and the products vary. My model then ranks them, and we show the customer the products with the highest relevance scores.
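The sketch below illustrates this retrieve-then-rerank flow; the rank_bm25 package, the toy product_db, and the rerank_fn hook are assumptions for illustration, not the exact implementation used in the case study:

```python
import numpy as np
import pandas as pd
from rank_bm25 import BM25Okapi

# Toy product database standing in for the real 124,428-product table
product_db = pd.DataFrame({
    "title": ["hampton bay solar led lamp", "rheem 40 gal water heater", "5 gal bucket"],
    "description": ["outdoor solar powered lamp", "electric tank water heater", "plastic bucket"],
})

tokenized_corpus = [(t + " " + d).split()
                    for t, d in zip(product_db["title"], product_db["description"])]
bm25 = BM25Okapi(tokenized_corpus)

def search(query, rerank_fn=None, top_k=100, show_n=10):
    tokens = query.lower().split()                 # real pipeline: full cleaning + spell correction
    scores = bm25.get_scores(tokens)               # phase 1: cheap BM25 retrieval
    candidate_idx = np.argsort(scores)[::-1][:top_k]
    candidates = product_db.iloc[candidate_idx]
    if rerank_fn is None:
        return candidates.head(show_n)
    relevance = rerank_fn(query, candidates)       # phase 2: featurize + stacking regressor
    return candidates.iloc[np.argsort(relevance)[::-1][:show_n]]

# search("solar lamp")
```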

Full Pipeline

The full pipeline of the model is shown below:

Full Pipeline

Final Testing and Results

So now the whole search engine was complete. I tested it for a bunch of queries. And the results were very interesting. I am attaching the screenshots of some results below.

First, a simple search “pan” and the results are good enough.

Then I searched for “air conditioner” and then with a spelling error “aor condiotioner”. And the results for both are exactly the same. This means that our spelling corrector is working fine!

‘air conditioner’
‘aor condiotioner’
‘warm water’

For “warm water”, the results are very interesting. None of the product titles contains both the words ‘warm’ and ‘water’; in the top results, you can see that every title has only one of the two words present. Then what is the reason the first title was given the first rank? This is due to the presence of the word ‘Icicle’, whose definition is “a hanging, tapering piece of ice formed by the freezing of dripping water”. The word embeddings that we used must have encoded it in such a way that its encoding is similar to that of the word ‘water’. Hence, our model found an association of ‘icicle’ with ‘water’ and ranked the results in such a manner. Also notice that the next three results are more of what we would want, in the form of water heaters. Here too, the word embeddings must have let the model know that the word ‘warm’ is associated with the word ‘heat’, hence the results.

Results of a few more searches are shown below.

Run Time
The run time varied for different searches. For one-word simple searches like “pan”, the average run-time was around 5.09 seconds but for long searches and the ones requiring spelling correction, the run time was between 5.5 and 5.7 seconds.

Note: The main purpose of this case study was to model product search, mainly focusing on coming up with useful features on which we can apply our ML algorithms. We were not focused so much on the design and optimization part. For real-world use, this search engine can be optimized like crazy and it wouldn't take long for the run time to be reduced to below one second.

Web Application Demo

Conclusion and Future Work

In this case study, I explored a variety of techniques from the field of Information Retrieval to extract features from the data provided. The techniques ranged from simple set operators to Latent Semantic Indexing. LSI-based similarities and traditional retrieval models such as BM25 and Language Models have proven to be extremely effective. For modeling, I built a Stacking Regressor with 17 base models and Ridge Regressor as the meta-model which got a final score of 0.4652 on Kaggle. Further, I extended this to a full-fledged search engine by adding an initial retrieval model in the form of BM25.

There is still a significant difference between my score and the top-ranked team’s (0.4319). Thus, it’s clear that much more work needs to be done to improve the model. Here are a few suggestions:

  1. The text data we have is pretty different from the general text found on the internet. Hence, whenever it's possible to train a model on our own custom data then we should explore that possibility. For example, we can train our own Word2Vec model. With this, Para2Vec and Sent2Vec models can also be explored.
  2. A better spelling corrector can be implemented as this could make the model much more robust.
  3. The attributes text was not used in any of the featurization. It can be included.
  4. In the preprocessing part, lemmatization can be performed.
  5. There must be brands which have similar types of products; hence, the brands can be clustered, and the cluster number can then be used as a categorical feature.
  6. Only the “count” of common words was used as a feature. We can also try to find a way to embed these common words to tell our model which kind of common words are present.
  7. In query expansion, better vectorization techniques can be explored for calculating similar words.
  8. For the modeling part, deep learning models like LSTMs, GRUs, CNNs, and BERT can be explored.

Acknowledgments

I would like to thank the whole AppliedAICourse team for guiding me throughout the Case Study.

References

  1. https://www.kaggle.com/c/home-depot-product-search-relevance
  2. https://en.wikipedia.org/wiki/Information_retrieval
  3. Boolean Model in Information Retrieval for Search Engines
  4. Fuzzy information retrieval model revisited by Sławomir Zadrożny and Katarzyna Nowacka
  5. Integrating the Probabilistic Model BM25/BM25F into Lucene by Joaquín Pérez-Iglesias, José R. Pérez-Agüera, et al.
  6. https://en.wikipedia.org/wiki/Learning_to_rank
  7. Indri: A language-model-based search engine for complex queries by Trevor Strohman, Donald Metzler, Howard Turtle, and W. Bruce Croft.
  8. Using Linear Algebra for Information Retrieval by Michael W. Berry, et al.
  9. Query Expansion Using Word Embeddings by Saar Kuzi, et al.
  10. Feature Selection and Model Comparison on Microsoft Learning-to-Rank Data Sets by Sen Lei, et al.
  11. The Vector Space Model in Information Retrieval: Term Weighting Problem
  12. https://opensourceconnections.com/blog/2016/03/29/semantic-search-with-latent-semantic-analysis/
  13. http://www.ccs.neu.edu/home/jaa/CSG339.06F/Lectures/vector.pdf
  14. Video Lecture on Language Models for IR by Victor Lavrenko
  15. http://datasciencebar.github.io/blog/home-depot-produce-relevance-review/
  16. https://medium.com/kaggle-blog/home-depot-product-search-relevance-winners-interview-2nd-place-thomas-sean-qingchen-nima-68068f9f9ffd
  17. https://arxiv.org/pdf/1803.05127.pdf
  18. http://billy-inn.github.io/papers/cmput690.pdf
  19. https://arxiv.org/pdf/2001.04980.pdf
  20. https://www.appliedaicourse.com
