Learning product similarity in e-commerce using a supervised approach

A practical solution to finding similar products using deep learning. A product-centric approach.

Slava Kisilevich
Towards Data Science


Photo by Scott Webb on Unsplash

The two main components of every e-commerce platform are products and the consumers who interact with them. Among the many use cases e-commerce deals with, some are consumer-centric: they try to understand consumer behavior by analyzing the history of purchased or viewed products and recommend similar products to consumers with similar purchasing behavior. Others are product-centric, such as price estimation for a new product, demand forecasting, and finding similar products.

The scope of this article is the product-centric approach to learning product similarities, with the goal of obtaining a lower-dimensional vector representation of each product. Such a numerical representation allows us to apply a similarity measure, such as cosine similarity, to establish the similarity between different products. I show two practical approaches to building a lower-dimensional product representation: a regression deep learning model is first trained using PyTorch, and the representation is then obtained either by extracting and averaging the embedding weights of the words that make up a product title or by extracting a complete vector comprising all product attributes. Each approach has advantages and disadvantages, which are also addressed in this article.

The number of articles published by e-commerce companies (Price2Spy, EDITED, Walmart, Webinterpret) shows that building product similarity solutions is indeed an important task. The topic is also addressed in several scientific papers.

The definition of similarity between products is domain- and use-case-specific. For example, given two Adidas shoes, what is their degree of similarity on a 0–1 scale?

Screenshots from amazon.com

The answer: it depends. If we compare product titles only, the products are identical. If we bring color attributes in, the similarity is still high but no longer 1, maybe 0.95 or 0.9; we do not know exactly. How about comparing shoes of the same color but different sizes? Now the price changes as well.

Screenshot from amazon.com

This example shows that similarity depends on the chosen product attributes. What are the possible solutions to estimate product similarity based on product attributes?

  • We could train a word2vec or BERT model on titles or use pre-trained embeddings to learn vector representation of words comprising titles or vector representation of complete titles using doc2vec models — However, short product titles, titles having misspelled or very technical words, domain-specific abbreviations, or titles written in languages other than English may not be represented well enough by these models.
  • We could apply image similarity of products and combine them with vector representation of titles. But images of products may be missing or have inadequate quality for training an image similarity model.
  • We could cluster product representations constructed from different attributes, image embeddings, and title embeddings. Among the disadvantages are scalability issues, the difficulty of establishing the optimal number of clusters, and the need to perform additional similarity comparisons inside a cluster.
  • Last but not least, we could use a supervised approach (classification or regression) based on deep learning, in which the product vector representation is learned during loss minimization. The only question we must answer is what the minimization task should be. One candidate is price estimation, another is demand forecasting. In either case, the model learns which products are more similar to each other with respect to price (price estimation) or with respect to sales (demand forecasting). The advantages of this approach are numerous: (1) it doesn't require special treatment or complex preprocessing of product attributes, including product titles and other textual information; (2) it solves the main supervised task and generates product vector representations in parallel; (3) the main supervised task can then be improved by updating the knowledge about similar products.

I follow the supervised approach and demonstrate it on the data from Kaggle Mercari Price Suggestion Challenge where the task was to automatically suggest product prices to online sellers:

Mercari, Japan’s biggest community-powered shopping app, would like to offer pricing suggestions to sellers, but this is tough because their sellers are enabled to put just about anything, or any bundle of things, on Mercari’s marketplace.

The Mercari data set has about 1.5 million products and the following columns:

  • name — the product title provided by the seller.
  • item_condition_id — five conditions (1,2…5) of the items provided by the seller
  • category_name — three levels of the category listing for each product like men/tops/t-shirts, women/jewelry/necklaces
  • brand_name — the corresponding brands to which each product may belong
  • shipping — 1, if the shipping fee is paid by the seller, and 0, otherwise
  • item_description — product description (the seller is free to provide any description)
  • price — the target variable and is represented in USD

As we can see, the Mercari data contains all the feature types that usually appear in e-commerce data: numeric, categorical, and textual. There are also several challenges: since the seller is free to provide any information about the product, the product description can be completely empty or contain misspellings. Product names can also contain typos, as in the example below.

Image by the author

The deep learning solution has the following architecture:

Image by the author

The input consists of all sorts of features: numeric, categorical, and text. The code provided as a PoC is written with maximum flexibility in mind: it allows defining any set of features belonging to one of the three mentioned types. Categorical input is translated into a lower-dimensional space through an Embedding Layer. Text input (product title, product description) is first tokenized into words and characters and then transformed into a lower-dimensional space through Embedding Layers. Learning product titles and descriptions on the character level may increase matches for misspelled products or products with slight differences in their text input. A GRU/LSTM Layer returns the hidden state of the last word or character in the sequence. Finally, all layer outputs are concatenated into a dense layer, and additional dense or skip-connection layers can be defined on top. The two layers highlighted in green play an important role in our downstream task.
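To make the architecture more concrete, below is a minimal PyTorch sketch written under simplifying assumptions: it covers only numeric features, per-feature categorical embeddings, and a single word-level bidirectional GRU, and it omits the character-level branch, dropout, and skip connections. The class name, argument names, and dimensions are illustrative and not the exact implementation from the repository. Note that the forward pass already returns both the regression output and the concatenated layer, a modification discussed further below for extracting product representations.

import torch
import torch.nn as nn

class ProductNet(nn.Module):
    def __init__(self, vocab_size, num_numeric, cat_cardinalities,
                 text_emb_dim=50, cat_emb_dim=8, hidden_size=100):
        super().__init__()
        # one Embedding Layer per categorical feature
        self.cat_embeddings = nn.ModuleList(
            [nn.Embedding(cardinality + 1, cat_emb_dim, padding_idx=0)
             for cardinality in cat_cardinalities])
        # word-level text branch: Embedding Layer + bidirectional GRU
        self.word_embedding = nn.Embedding(vocab_size + 1, text_emb_dim, padding_idx=0)
        self.gru = nn.GRU(text_emb_dim, hidden_size, batch_first=True, bidirectional=True)
        concat_dim = num_numeric + cat_emb_dim * len(cat_cardinalities) + 2 * hidden_size
        self.head = nn.Sequential(nn.Linear(concat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, numeric, categorical, title_tokens):
        # categorical has shape [batch, n_cat_features]: one embedding lookup per feature
        cat_vectors = [emb(categorical[:, i]) for i, emb in enumerate(self.cat_embeddings)]
        # embed the title tokens and keep the last hidden state of both GRU directions
        _, hidden = self.gru(self.word_embedding(title_tokens))
        text_vector = torch.cat([hidden[-2], hidden[-1]], dim=1)
        # the concatenated layer: numeric + categorical + text
        concatenated = torch.cat([numeric] + cat_vectors + [text_vector], dim=1)
        prediction = self.head(concatenated)
        return prediction, concatenated

# toy forward pass: 2 products, 5 numeric features, 3 categorical features, 3-token titles
model = ProductNet(vocab_size=10, num_numeric=5, cat_cardinalities=[5, 4, 3])
prediction, product_repr = model(torch.randn(2, 5),
                                 torch.LongTensor([[1, 2, 3], [4, 1, 2]]),
                                 torch.LongTensor([[1, 2, 3], [3, 9, 8]]))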

After the model is trained, there are two ways to extract product representations:

The Word Embedding Layer contains an embedding for each word. We extract the representation of every word belonging to a product title and average all word vectors. The resulting averaged vector is a numeric representation of the product title. Having constructed vector representations for all products, we can apply similarity measures like cosine similarity to compare the degree of similarity between products. The advantage of this approach is that the trained model can be discarded after we extract the word embeddings. The disadvantage is that title words not seen during training have no representation; as a result, it is not possible to measure the similarity of product titles whose words lack a representation. Another downside is sensitivity to products with almost identical names that belong to different categories: since the product title vectors are constructed from the Word Embedding Layer, each word has only a single representation.

PyTorch Example: Assume we have tokenized our product titles into words. There are 10 words in the dictionary and our word vector representation has 4 dimensions.

import torch
import torch.nn as nn
from sklearn.metrics.pairwise import cosine_similarity

# size of the dictionary + 1 out-of-vocabulary token
word_embedding = nn.Embedding(10 + 1, 4, padding_idx=0)

# here we would have trained our model

# extract word representations
# there are 4 products, each having 3 words indexed by the vocabulary
# index 0 is reserved for any out-of-vocabulary token
product_words = torch.LongTensor([[1, 2, 3],
                                  [3, 9, 8],
                                  [4, 5, 7],
                                  [1, 2, 4]])

# average the words per product
title_vectors = word_embedding(product_words).mean(dim=1).detach().numpy()

# compare the similarity of the first product to all products
cosine_similarity(title_vectors[0].reshape(1, -1), title_vectors)

The Concatenated Layer is the first dense layer, which concatenates the outputs of the individual numeric, categorical, and text layers. We can treat this layer as a product representation built from all types of input. The advantage is that if a test product contains words not present in the Word Embedding Layer, it is still possible to find similar products using the other product attributes. This approach requires a change in the forward propagation function: instead of returning only the result of the regression (the last output layer), we return both the result of the last output layer and the concatenated layer. The extraction of product representations is then performed by applying the model to the train and test products.

PyTorch Example: Assume we have tokenized our product titles into words. There are 10 words in the dictionary and our word vector representation has 4 dimensions. In addition, we have one categorical input having 4 unique values that we want to represent as a 3-dimensional embedding vector.

First, we process product titles by using an Embedding Layer and LSTM Layer.

import torch
import torch.nn as nn

EMBEDDING_DIM = 4

# size of the dictionary + 1 out-of-vocabulary token
word_embedding = nn.Embedding(10 + 1, EMBEDDING_DIM, padding_idx=0)

# do we want to process text from left to right only or in both directions?
BIDIRECTIONAL = True
HIDDEN_SIZE = 5
NUM_LAYERS = 2

lstm = nn.LSTM(input_size=EMBEDDING_DIM, hidden_size=HIDDEN_SIZE,
               batch_first=True, num_layers=NUM_LAYERS,
               bidirectional=BIDIRECTIONAL)

product_words = torch.LongTensor([[1, 2, 3],
                                  [3, 9, 8]])

# apply the LSTM to the embedded words
output, (hidden_state, cell_state) = lstm(word_embedding(product_words))
batch_size = hidden_state.shape[1]

# extract the last LSTM layer from the hidden state
last_step = hidden_state.view(NUM_LAYERS, 2 if BIDIRECTIONAL else 1, batch_size, -1)
last_step = last_step[NUM_LAYERS - 1]

# reorder so that the batch dimension comes first
last_step = last_step.permute(1, 0, 2)

# flatten to (batch_size, HIDDEN_SIZE * 2 if BIDIRECTIONAL else HIDDEN_SIZE)
last_step = last_step.reshape(batch_size, -1)
last_step.shape

Second, we process the categorical input by running it through an Embedding Layer.

# the categorical embedding has 3 dimensions
EMBEDDING_DIM = 3
cat_embedding = nn.Embedding(4 + 1, EMBEDDING_DIM, padding_idx=0)

# indices of the categorical values
batch_categories = torch.LongTensor([[1],
                                     [3]])

batch_size = batch_categories.shape[0]
cat_result = cat_embedding(batch_categories)
cat_result = cat_result.reshape(batch_size, -1)
cat_result.shape

Lastly, we concatenate two vectors and form our product representation vector:

# this is our product representation
# tensor dimensionality: [batch_size, last_step.shape[1] + cat_result.shape[1]]
concatenated_layer = torch.cat([last_step, cat_result], dim=1)
concatenated_layer.shape
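With the modified forward pass returning the concatenated layer alongside the prediction, product similarity is measured exactly as in the Word Embedding example. The following lines continue the toy example above; in practice, the vectors would come from a trained model rather than from randomly initialized layers:

from sklearn.metrics.pairwise import cosine_similarity

# product representations taken from the concatenated layer above
product_vectors = concatenated_layer.detach().numpy()

# compare the first product to both products
cosine_similarity(product_vectors[0].reshape(1, -1), product_vectors)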

Training on Kaggle Mercari data

I applied data cleaning and feature engineering using some methods from the following Kaggle notebook.

I used the following numeric features in the model (a sketch of how the derived features could be computed follows the feature lists):

  • shipping — 1, if the shipping fee is paid by the seller, and 0, otherwise
  • desc_len — number of words in the product description
  • name_len — number of words in the product title
  • is_brand_missing — 1, if information about the brand is missing, and 0, otherwise
  • is_item_description_missing — 1, if the product description is missing, and 0, otherwise

Categorical features:

  • item_condition_id — five conditions (1,2…5) of the items provided by the seller
  • brand_name — the corresponding brands to which each product may belong
  • subcategory_1, subcategory_2, subcategory_3 — three levels of the category listing for each product

Text features (Word and Character Level)

  • name — product title
  • item_description — description of the product
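The sketch below shows how such derived features could be computed with pandas. It is illustrative rather than the exact preprocessing from the referenced notebook; the file name, separator, and handling of missing values are assumptions.

import pandas as pd

# load the Mercari training data (file name and separator are assumptions)
df = pd.read_csv("train.tsv", sep="\t")

# derived numeric features listed above
df["desc_len"] = df["item_description"].fillna("").str.split().str.len()
df["name_len"] = df["name"].fillna("").str.split().str.len()
df["is_brand_missing"] = df["brand_name"].isna().astype(int)
df["is_item_description_missing"] = df["item_description"].isna().astype(int)

# split category_name such as "men/tops/t-shirts" into three categorical columns
df[["subcategory_1", "subcategory_2", "subcategory_3"]] = (
    df["category_name"].fillna("missing/missing/missing").str.split("/", n=2, expand=True)
)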

The target variable price was first log-transformed and then transformed with the scikit-learn PowerTransformer.

I split the train set provided by Kaggle into train and validation sets, allocating 20% of the products for validation. Finally, I trained the model on 1,185,328 products for 10 epochs with early stopping.
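A minimal sketch of the target transformation and the split, continuing the pandas example above; the use of log1p, the random seed, and the variable names are assumptions rather than the exact settings behind the reported model:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# log-transform the price, then apply the PowerTransformer
price_log = np.log1p(df["price"].values).reshape(-1, 1)
power_transformer = PowerTransformer()
target = power_transformer.fit_transform(price_log)

# 80/20 train/validation split
train_df, valid_df, y_train, y_valid = train_test_split(
    df, target, test_size=0.2, random_state=42)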

Model parameters are summarised below:

Image by the author

The product representation vector constructed from the Word Embedding Layer has a dimension of 50 (text_embedding_dimension). The product representation vector constructed from the Concatenated Layer has a dimension of 787, calculated as follows:

  • 5 categorical features (categorical_embedding_size) with varying embedding dimensions of 4, 178, 6, 23, 71 → 282
  • 5 numeric features (numerical input_size) → 5
  • 2 word-based bidirectional text features (text_recurrent_hidden_size): 2 * (100 * 2) → 400
  • 2 character-based unidirectional text features (char_recurrent_hidden_size): 2 * 50 → 100
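For completeness, a quick check that the breakdown above adds up to 787:

categorical_dims = [4, 178, 6, 23, 71]   # embedding sizes of the 5 categorical features
numeric_dims = 5                         # numeric input features
word_rnn_dims = 2 * (100 * 2)            # 2 word-level bidirectional recurrent outputs
char_rnn_dims = 2 * 50                   # 2 character-level unidirectional recurrent outputs

sum(categorical_dims) + numeric_dims + word_rnn_dims + char_rnn_dims  # 282 + 5 + 400 + 100 = 787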

Results

First comparison: A product that doesn’t exist in the training set

One such product is a "speacker", misspelled by the seller.

Image by the author

In this case, the Word Embedding approach doesn’t work because the model doesn’t have this word mapped in the vocabulary. The Concatenated Layer approach works.

Image by the author

The first 4 most similar products are speakers with prices comparable to the searched product. Categories, brands, and shipping have the same values. I guess that the first product has a higher similarity because its product description is shorter than the others. If features like product description length should not be taken into account during similarity comparison, we can create two concatenated layers: one used for similarity comparison, and another that holds the inputs from features that are important for modeling but not for similarity comparison in the downstream task.

Second comparison: “express portofino blouse size m”

Image by the author

Top 5 similar products from the Word Embedding Layer:

Image by the author

The topmost product has a similar brand to the searched product and shares three words in the title.

Top 5 similar products from the Concatenated Layer:

Image by the author

Although the top products share different numbers of title words with the searched product, all of them belong to the same brand. Moreover, their prices are comparable to the price of the searched product.

Third comparison: “iphone 5c 16gb att gsm”

Image by the author

Top 5 similar products from the Word Embedding Layer:

Image by the author

Here the top similar products share more title words with the searched product.

Top 5 similar products from the Concatenated Layer:

Image by the author

The searched product and the fourth product both mention AT&T (att vs. at&t), and they have the same price. It is difficult to decide whether the fourth product should rank as more similar than the first one, but the difference in similarity is small (0.808 vs. 0.805).

Conclusion

The few examples presented in this article suggest that both approaches to product similarity work quite well. The Word Embedding Layer approach has the disadvantage that product titles consisting of completely new words, unknown at training time, cannot be compared. On the other hand, the trained model can be discarded once product title vectors have been constructed from the word representations. In the Concatenated Layer approach, the model is needed both to make predictions on new products and to extract product representations. Irrespective of the approach, proper tuning of the model parameters and domain expertise are required. In the Concatenated Layer approach, one also needs to decide which features are important for the product representation. Additional text preprocessing and/or the inclusion of image embeddings may further improve similarity.

The provided code is flexible enough to be adapted to many tabular datasets with numeric, categorical, and text features for regression tasks.

The complete code can be downloaded from my GitHub repo.

Thanks for reading!
