Vector representation of products with Prod2Vec: how to get rid of a lot of embeddings

Alexander Golubev
Towards Data Science
7 min read · Apr 4, 2022


Hello! My name is Alex, and I work at Ozon on the Product Matching team. Ozon is an e-commerce marketplace where customers buy goods from many different sellers. Every day we deal with tens of millions of products, and our task is to identify similar offers on the site (find matches) so that offers from different sellers can be collected into a single product card.

Every product comes with pictures, a title, a description, and additional attributes. We want to retrieve and process all of this information for a variety of tasks, and it is especially important for the product matching team.

To extract features from a product, we create vector representations (embeddings) using different text models (fastText, transformers) for titles and descriptions, and several convolutional neural networks (ResNet, Effnet, NFNet) for images. These vectors are then used for feature generation and product matching.

There are millions of product updates on Ozon every day, which is why computing embeddings with all of these models becomes challenging. Moreover, each vector usually describes only one aspect of the product. What if, instead, we obtained just one vector for the whole product at once? That sounds good, but how do we implement it correctly…

Image by the Author, inspired by “The Hangover” movie

To create a vector representation of a product, we can use:

1. Content — image information, texts, product names, and attributes.

2. User sessions — the history of products viewed/purchased by our users.

Let’s talk about how we solved this problem using the first approach (content).

Image by the author

Scheme

Besides being useful for recommendations, search, and matching, such an architecture allows us to unite all the information about a product (images, title, description, and attributes) into a single vector and thus simplify several pipelines (ranking, search, and selection of match candidates).

Architecture

It would be natural to use a Metric Learning approach for this task: minimize the distance between similar products and push dissimilar products apart, using, for example, a triplet loss. This raises many interesting questions (negative sampling, what to treat as positive examples, how to build the dataset correctly). Since we already have some models of this type, we decided to solve the problem with a supervised learning approach: predicting the lowest-level category in the category tree.
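For reference, the metric-learning variant mentioned above would be built around something like PyTorch's triplet margin loss (a minimal sketch; the hard part, sampling anchors, positives, and negatives, is not shown, and the margin value is illustrative):

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)  # margin is an illustrative value

# anchor, positive, negative: batches of product embeddings, where "positive"
# is a product considered the same as the anchor and "negative" is a different one.
anchor = torch.randn(32, 128)
positive = torch.randn(32, 128)
negative = torch.randn(32, 128)

loss = triplet_loss(anchor, positive, negative)
```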

Each product belongs to a branch of the category tree, from a high-level category (clothes, books, electronics) down to a low-level one (shorts, mugs, smartphone cases). We have several thousand such low-level categories.

For example, Electronics (cat1) → Phones, Tablets (cat2) → Apple Smartphone (cat3).

Image by the author

To classify such a large number of categories, instead of the usual Softmax (which did not give satisfactory results), we decided to try another approach, originally proposed for face recognition: ArcFace.

Classic Softmax does not directly affect how close the learned embeddings are within one class or how far apart they are across classes. ArcFace is designed specifically for this: by choosing the margin penalty parameter m, we can control how strongly embeddings of the same class are pulled together and embeddings of different classes are pushed apart.

Image by the author
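To make the idea concrete, here is a minimal PyTorch sketch of an ArcFace-style head (the scale s and margin m values are illustrative defaults, not the ones we used):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin head (illustrative sketch, not Ozon's exact code)."""

    def __init__(self, embedding_dim: int, num_classes: int, s: float = 30.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class centres.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Add the angular margin m only to the target-class angle.
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, num_classes=cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return logits * self.s  # scaled logits go into a standard cross-entropy loss
```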

The first version of the model architecture looked like this:

Image by the author

Distinguishing cat3 right away proved too difficult for the model: we tried to train the image, text, and attribute models against a single final cross-entropy loss on cat3 at each iteration, which caused their weights to converge poorly and slowly. Therefore, we improved the model in the following way:

1. From each encoder we take intermediate outputs that predict cat1 (the higher-level category).

2. The total loss is a weighted sum of all of these losses: at first we give more weight to the cat1 losses, and then gradually shift the weight towards the cat3 loss.

As a result, we arrive at the following architecture:

Image by the author

We use a simple exponential function as the weighting coefficient:

Image by the author
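As an illustration only, a schedule of this shape can be written as follows (the decay rate and the exact form of the combination are assumptions; the real constants are in the formula above):

```python
import math

def cat1_weight(epoch: int, decay: float = 0.1) -> float:
    """Weight of the auxiliary cat1 losses; decays exponentially over epochs (decay rate is hypothetical)."""
    return math.exp(-decay * epoch)

def total_loss(loss_cat1, loss_cat3, epoch: int):
    w = cat1_weight(epoch)
    # Early epochs: the cat1 losses dominate; later epochs: the cat3 (ArcFace) loss takes over.
    return w * loss_cat1 + (1.0 - w) * loss_cat3
```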

During inference we are no longer interested in the cat3 prediction but in the vector representation of the product, so we take the output of the layer that feeds into ArcFace: this is the embedding we need.
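Putting the pieces together, the forward pass could look roughly like this (a simplified sketch: the encoder objects, their out_dim attribute, and the projection layer are assumptions, the auxiliary cat1 heads are omitted for brevity, and ArcFaceHead refers to the sketch above):

```python
import torch
import torch.nn as nn

class Prod2VecModel(nn.Module):
    """Simplified sketch of the fused model; names and dimensions are illustrative."""

    def __init__(self, image_encoder, title_encoder, attr_encoder, emb_dim, num_cat3):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a ResNet returning a feature vector
        self.title_encoder = title_encoder  # e.g. a small BERT returning a pooled vector
        self.attr_encoder = attr_encoder    # same, for the attribute string
        fused_dim = image_encoder.out_dim + title_encoder.out_dim + attr_encoder.out_dim
        self.projection = nn.Sequential(nn.Linear(fused_dim, emb_dim), nn.BatchNorm1d(emb_dim))
        self.arcface = ArcFaceHead(emb_dim, num_cat3)  # head from the sketch above

    def forward(self, image, title, attrs, cat3_labels=None):
        fused = torch.cat(
            [self.image_encoder(image), self.title_encoder(title), self.attr_encoder(attrs)],
            dim=1,
        )
        embedding = self.projection(fused)  # the vector we keep at inference time
        if cat3_labels is None:             # inference: no ArcFace logits needed
            return embedding
        return embedding, self.arcface(embedding, cat3_labels)
```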

Preparing data

If we simply count all product categories, we get about 6,000. Some of them are very similar (vitamin-and-mineral complexes vs. dietary supplements), others are nested within each other (coffee and coffee capsules), and still others contain too few products (physiotherapy machines).

That is why taking the raw categories as the target is not an option: we had to do quite a lot of preprocessing, merging similar categories. As a result, we got a dataset of about 5 million items with 1,300 cat3 categories and at least 500 samples per category.

The data itself was processed as follows:

1. Texts were converted to lowercase and stop words were removed.

2. Images were augmented using standard methods (horizontal and vertical flips, brightness and contrast changes).

3. Attributes that did not carry much meaning and appeared in almost every product (for example, serial number) were removed. We tried different ways of processing attributes: feeding each “key: value” pair separately, or combining all of them into one string. There was not much difference in the end, but the second option looked more elegant in the training pipeline, so we settled on it (a rough sketch follows below).
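A rough sketch of these steps for the text and attribute parts might look like this (the stop-word list and the dropped attributes are illustrative, not our real ones):

```python
import re

STOP_WORDS = {"и", "в", "на", "the", "a", "of"}   # illustrative subset only
DROPPED_ATTRIBUTES = {"serial number"}            # attributes present in almost every product

def preprocess_text(text: str) -> str:
    # Lowercase, tokenize, and drop stop words.
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def attributes_to_string(attributes: dict) -> str:
    # The second option from the list above: all "key: value" pairs combined into one string.
    pairs = [f"{k}: {v}" for k, v in attributes.items() if k.lower() not in DROPPED_ATTRIBUTES]
    return preprocess_text("; ".join(pairs))
```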

Learning process

Since we had a lot of data and needed to fit two text models and one image model into the training pipeline, we chose lighter architectures: ResNet34 as the CNN and two Rubert-Tiny models for the texts, one for titles and one for attributes.

Because we have both text and image models, we set up a separate optimizer for each: AdamW for the BERTs and SGD for the ResNet and the model head. In total we trained for 60 epochs: the first 15 epochs with a higher learning rate, then continuing with a smaller one, with training parallelized across GPUs using Horovod.
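A two-optimizer setup of this kind could be wired up roughly as follows (module names refer to the Prod2VecModel sketch above, "model" and "loss" come from the surrounding training loop, and the learning rates are placeholders, not our production values):

```python
import itertools
import torch

# Transformer encoders get AdamW; the CNN plus the fused head get SGD.
bert_params = itertools.chain(
    model.title_encoder.parameters(), model.attr_encoder.parameters()
)
cnn_and_head_params = itertools.chain(
    model.image_encoder.parameters(),
    model.projection.parameters(),
    model.arcface.parameters(),
)

optimizer_bert = torch.optim.AdamW(bert_params, lr=3e-5)
optimizer_cnn = torch.optim.SGD(cnn_and_head_params, lr=1e-2, momentum=0.9)

# Inside the training loop: one backward pass, then both optimizers step.
loss.backward()
optimizer_bert.step()
optimizer_cnn.step()
optimizer_bert.zero_grad()
optimizer_cnn.zero_grad()

# With Horovod, each optimizer would additionally be wrapped in
# hvd.DistributedOptimizer(...) and the initial weights broadcast from rank 0.
```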

The result on validation was 85% Acc@1 and 94% Acc@5. For comparison, fastText trained on titles gave 60% Acc@1.

Image by the author

Category prediction accuracy alone is not enough to tell whether we managed to generate good product embeddings. In addition, we used the Projector with its 3D vector visualization: there you can choose different dimensionality reduction methods and see how our vectors look when projected onto a sphere.
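For example, embeddings can be exported to TensorBoard's Projector in a few lines (the sample data below is a random stand-in for real vectors and labels, and the log directory is illustrative):

```python
import torch
from torch.utils.tensorboard import SummaryWriter

# Stand-ins for a random sample of real Prod2Vec vectors and their cat3 labels.
sample_embeddings = torch.randn(1000, 128)
sample_cat3 = [f"cat3_{i % 10}" for i in range(1000)]

writer = SummaryWriter(log_dir="runs/prod2vec")
writer.add_embedding(sample_embeddings, metadata=sample_cat3, tag="prod2vec")
writer.close()
```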

Here are, for example, t-SNE and UMAP visualizations:

gif by the author
gif by the author

If we take a closer look, we will see that each cluster contains products of the same category:

Image by the author

And here’s what happens when you look at the nearest neighbors of products in the production pipeline:

Image by the author

Most importantly, the inference time of the ranking model was reduced greatly: using Prod2Vec embeddings instead of the separate image and text embeddings gave a speedup of more than three times:

Image by the author

Results and perspectives

We are pleased with the results, so we launched the finished architecture into production and now compute millions of such embeddings every day via Spark Structured Streaming. They are then fed into the ranking model, which produces good candidates for matches.
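As a rough illustration of such a streaming job (the Kafka source, the Parquet sink, and the run_prod2vec_model helper are all hypothetical and not our actual pipeline):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.appName("prod2vec-stream").getOrCreate()

@pandas_udf(ArrayType(FloatType()))
def embed(titles: pd.Series) -> pd.Series:
    # run_prod2vec_model is a hypothetical helper that loads the exported model
    # once per executor and returns one vector per product.
    vectors = run_prod2vec_model(titles)
    return pd.Series([list(map(float, v)) for v in vectors])

products = (
    spark.readStream.format("kafka")              # source and topic are illustrative
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "product_updates")
    .load()
    .selectExpr("CAST(value AS STRING) AS title")
)

query = (
    products.withColumn("embedding", embed("title"))
    .writeStream.format("parquet")                # sink and paths are illustrative
    .option("path", "/data/prod2vec")
    .option("checkpointLocation", "/data/prod2vec_chk")
    .start()
)
```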

In addition, the embeddings can be used in many other tasks that arise in our team or related ones.

The result of product matching looks like this: offers from different sellers are visible in one product card.

Image by the author

Still, it would be interesting to check whether this architecture would work just as well if we trained it with Metric Learning. That remains to be explored in the future.

If you have done something similar or know another way to solve this kind of problem, please leave a comment below. I hope you found this article interesting and helpful :)
