
Motivation
The number of times I’ve thought about a magic tool that could find similar items of clothing is worrying.
Let’s imagine a situation: you are scrolling through Instagram and suddenly see a stunning coat. What if there were an ‘I want this coat’ button, and Instagram shops could recommend that exact coat along with an array of similar ones you might like even more? On top of this, what if you could filter them by price range and brand and get the full online shopping experience?
Maybe I wouldn’t have to worry after all. Sometimes the only thing that I need when I’m shopping online is for someone or something to understand me and my fashion needs.
Read this article and check out the project on GitHub if you’d like to gain end-to-end experience in data science. You’ll practice everything from neural network modeling to docker deployment.
What is similarity
Characteristics of items (clothes, in my examples) such as color, shape, and type can be represented as numerical values that together form a vector. We can thus build a vector space where each item has a corresponding vector.
The ”similarity” of objects is then simply defined by the distance between them in this space: the smaller the distance, the more similar or connected they are.
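As a toy illustration (not the project’s actual code), here is how this idea looks with plain NumPy, using Euclidean and cosine distance between made-up item vectors:

```python
import numpy as np

# Toy 3-dimensional "item vectors" (imagine they encode color, shape, type).
coat_a = np.array([0.9, 0.1, 0.3])
coat_b = np.array([0.8, 0.2, 0.3])   # close to coat_a
sneaker = np.array([0.1, 0.9, 0.7])  # quite different

def euclidean(u, v):
    return np.linalg.norm(u - v)

def cosine_distance(u, v):
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(euclidean(coat_a, coat_b), euclidean(coat_a, sneaker))              # small vs. large
print(cosine_distance(coat_a, coat_b), cosine_distance(coat_a, sneaker))  # small vs. large
```

The two coats end up close to each other and the sneaker far away: that distance is all that “similarity” means here.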

The ideology of the project
This project aims to be as simple to implement as possible. For building vectors of images, I would fine-tune a pre-trained model. For storing vectors and searching for similar ones, I would use a vector search engine. A simple web interface would be built with an app framework, while the backend would be based on a web framework for building APIs. Finally, the full project would be deployed on a virtual machine using Docker containers. It may sound overwhelming at first glance, but it isn’t that difficult and only requires following the steps one by one.
Stages of the project: from the idea to app
The key stages are presented below:
- Finding a dataset
- Finding the relevant transformer model
- Retraining the model and evaluating it
- Applying the model to all data and extracting vectors for the dataset
- Coding frontend and backend to make a simple web interface
- Deploying the project on a virtual machine.
Let’s start with the first stage and go ahead with the rest.
1. The dataset was found on Medium and contains 5,000 images across 20 categories of clothes, ranging from shoes to hats. The top 10 classes contain at least 150 items each.
More details about how it was collected are in the article.

2. Vision Transformer (ViT) is a vision model based on the transformer architecture originally designed for text-based tasks (you may have heard of BERT). It is a fairly new model, published in December 2020. More details are here.
3. Transformer models aren’t usually applied without additional training. Embeddings, or vectors, extracted directly from a transformer capture many different attributes, and some of them matter more than others for the particular problem you are solving.
Hence, such models require retraining, or fine-tuning. I’ve come up with the following scheme for retraining based on the approach presented in the notebook.

The model used a linear layer on top of a pre-trained ViT: I placed the linear layer on top of the last hidden state of the [CLS] token, which serves as a good representation of the entire image. Dropout was added for regularization, and the last layer returns class probabilities. The input of the added linear layer can then be used directly as an object’s vector in a similarity search.
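A minimal sketch of this architecture with the Hugging Face transformers library (the checkpoint name and class layout are my own simplification, not the exact code from the notebook):

```python
import torch
import torch.nn as nn
from transformers import ViTModel

class ViTClassifier(nn.Module):
    def __init__(self, num_classes: int, dropout: float = 0.1):
        super().__init__()
        # Pre-trained ViT backbone; the checkpoint name is an assumption.
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.dropout = nn.Dropout(dropout)
        # Linear head on top of the [CLS] token representation.
        self.classifier = nn.Linear(self.vit.config.hidden_size, num_classes)

    def forward(self, pixel_values):
        outputs = self.vit(pixel_values=pixel_values)
        cls_embedding = outputs.last_hidden_state[:, 0]  # [CLS] token
        cls_embedding = self.dropout(cls_embedding)
        # cls_embedding is exactly what goes into the linear head,
        # i.e. the vector used later for the similarity search.
        return self.classifier(cls_embedding)  # class logits (softmax turns them into probabilities)
```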
The full script for model retraining can be found in the notebook.
For model evaluation I used two metrics:
- accuracy, compared on the training and test datasets,
- the "% of top N neighbors that belong to the same class", compared before and after fine-tuning.
I’ve also experimented with the number of added layers and with different ways of extracting vectors.
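The second metric is easy to compute with a nearest-neighbor search over the extracted vectors. A rough sketch of my own formulation (assuming `vectors` is an (n, d) array of embeddings and `labels` an array of class ids):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def same_class_share(vectors: np.ndarray, labels: np.ndarray, n: int = 5) -> float:
    """Average share of an item's n nearest neighbors that belong to its own class."""
    knn = NearestNeighbors(n_neighbors=n + 1).fit(vectors)  # +1: the item itself is returned
    _, idx = knn.kneighbors(vectors)
    neighbor_labels = labels[idx[:, 1:]]                     # drop the item itself
    return float((neighbor_labels == labels[:, None]).mean())
```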

Retraining (fine-tuning) the transformer model increased the class ‘similarity’ of the closest items: the "% of top 5 neighbors that belong to the same class" rose from 56.7% to 83.4%.

Adding an extra linear layer on top of the model didn’t improve any of the metrics. It also turned out that during retraining the transformer weights were tuned along with the weights of the added layers. Thanks to this, I could use the input of the added layer as the object’s vector in the ‘similarity’ search.
For the rest of the article, I’ll be referring only to the model with one linear layer added on top of the pre-trained model.
4. To save the inputs of the added linear layer I used a forward hook. A detailed guide on how to use it is provided in the article.
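In outline, the hook looks roughly like this (reusing the `ViTClassifier` sketch above; the exact code is in the linked article):

```python
import torch

extracted = {}

def save_input_hook(module, inputs, output):
    # inputs is a tuple; inputs[0] is the [CLS] embedding fed into the linear head.
    extracted["vector"] = inputs[0].detach().cpu()

model = ViTClassifier(num_classes=20)
hook_handle = model.classifier.register_forward_hook(save_input_hook)

pixel_values = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image batch
with torch.no_grad():
    model(pixel_values)                      # a normal forward pass fills extracted["vector"]

hook_handle.remove()                         # detach the hook once the vectors are collected
```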
5. For making a simple web interface, this documentation was used. The work consisted of the following steps, done on a local machine:
- A simple interface was created using Streamlit to show images and handle item selection.
- I ran Qdrant (a vector search engine) in a Docker container, created a collection, and added the vectors there (see the sketch after this list).
- The backend was coded with FastAPI and included the search requests (also sketched below).
- The presentation of similar objects was added to the frontend script.
- The frontend and backend were debugged without Docker.
- Dockerfiles as well as a docker-compose.yml were added; in the latter I described all the services and dependencies. There were three services: backend, frontend, and qdrant.
- I made sure that everything worked on the local machine, then pushed the project to the GitHub repository and moved on to the next stage.
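For reference, creating a collection and uploading vectors with the qdrant-client library could look roughly like this (the collection name, vector size, and payload fields are my assumptions, not the repository’s exact values; older client versions use slightly different arguments):

```python
import json
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)

# One-off collection setup; 768 matches the ViT hidden size.
client.recreate_collection(
    collection_name="clothes",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# vectors.json is assumed to hold one embedding and class name per image.
with open("vectors.json") as f:
    items = json.load(f)

client.upsert(
    collection_name="clothes",
    points=[
        PointStruct(id=i, vector=item["vector"], payload={"class": item["class"]})
        for i, item in enumerate(items)
    ],
)
```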
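A minimal FastAPI endpoint for the search request might look like the following sketch (the endpoint path and response shape are illustrative; here I use Qdrant’s recommend API, which searches by the stored vector of an existing point):

```python
from fastapi import FastAPI
from qdrant_client import QdrantClient

app = FastAPI()
client = QdrantClient(host="qdrant", port=6333)  # "qdrant" is the service name in docker-compose

@app.get("/similar/{item_id}")
def similar(item_id: int, limit: int = 3):
    # Find the points closest to the stored vector of the selected item.
    hits = client.recommend(collection_name="clothes", positive=[item_id], limit=limit)
    return {"similar_ids": [hit.id for hit in hits]}
```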
6. Deploying the project on a virtual machine consisted of the following steps:
- Booking a free Linux machine on Amazon Web Services with an HTTP connection enabled.
- Uploading the images and vectors.json from the local machine to the virtual one.
- Running the Qdrant Docker container and creating a collection with the points from vectors.json (alternatively, the collection could be uploaded directly from the local machine).
- Cloning the repository from GitHub and adjusting the Python scripts and docker-compose.yml.
- Building and running docker-compose.
- Enjoying the web interface.
The result
The web page includes:
- images of 5 random items of clothing
- item selector
- image of the selected item (it’s the first item by default)
- images of three of the most similar items of clothing.
When you select the number of an item you like, the page shows the selected item together with the three most similar ones. If you don’t like any of the presented clothes, just tap "Show other items" and you’ll see another 5 items. Even for the pickiest amongst us, we hope it won’t take too long to find something truly special!
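A stripped-down version of such a page in Streamlit might look like the sketch below (the backend URL, endpoint, and image paths are placeholders rather than the project’s exact ones):

```python
import random

import requests
import streamlit as st

BACKEND_URL = "http://backend:8000"   # hypothetical service name from docker-compose
ALL_IDS = list(range(1, 101))         # placeholder for the full set of item ids

st.title("Find similar clothes")

# Keep the current random sample across reruns; refresh it on button click.
refresh = st.button("Show other items")
if refresh or "items" not in st.session_state:
    st.session_state["items"] = random.sample(ALL_IDS, 5)
item_ids = st.session_state["items"]

st.image([f"images/{i}.jpg" for i in item_ids], width=120, caption=[str(i) for i in item_ids])

selected = st.sidebar.selectbox("Select an item", item_ids)
st.image(f"images/{selected}.jpg", caption="Selected item", width=200)

# Ask the backend for the three most similar items and show them.
similar = requests.get(f"{BACKEND_URL}/similar/{selected}", params={"limit": 3}).json()
st.image([f"images/{i}.jpg" for i in similar["similar_ids"]], width=120)
```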

What’s next
The similar-item results don’t look good for some items. For example, black t-shirts are identified as similar to a grey one even though there are other grey t-shirts in the dataset. I think that the background color may play a key role in this case.

The model may likely be improved using contrastive learning. If manually selected ‘wrong’ pairs are marked in the data, the model could learn to ignore the background of the images.
Besides, the search for similar items can be implemented for a completely new item! Just apply the model to a new image, get the corresponding vector, and run a search against the existing collection. Coding-wise, it’s very easy to add this functionality to the app service.
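A hedged sketch of how that could look, reusing the fine-tuned model, the forward hook, and the Qdrant client from the earlier snippets (the feature-extractor checkpoint is an assumption):

```python
import torch
from PIL import Image
from transformers import ViTFeatureExtractor

extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

def vector_for_new_image(path: str) -> list:
    image = Image.open(path).convert("RGB")
    pixel_values = extractor(images=image, return_tensors="pt")["pixel_values"]
    with torch.no_grad():
        model(pixel_values)                 # the forward hook fills extracted["vector"]
    return extracted["vector"][0].tolist()

# Search the existing collection with the brand-new item's vector.
hits = client.search(
    collection_name="clothes",
    query_vector=vector_for_new_image("new_coat.jpg"),
    limit=3,
)
```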
If I had an online store, I could also combine the similar-clothing search with the essential filters such as price, brand, delivery time, and so on. That would make the search even more effective and precise. Note that filtered search is supported by the vector search engine I used.
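With Qdrant, for example, such a filter can be passed alongside the query vector (the payload field names here are hypothetical and would have to exist in the collection):

```python
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

hits = client.search(
    collection_name="clothes",
    query_vector=vector_for_new_image("new_coat.jpg"),
    query_filter=Filter(
        must=[
            FieldCondition(key="brand", match=MatchValue(value="SomeBrand")),  # exact brand
            FieldCondition(key="price", range=Range(lte=100)),                 # price cap
        ]
    ),
    limit=3,
)
```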
Having said all this… I believe that the future of online shopping will be significantly improved, through the understanding of our retail habits, supported by AI.
What extra learnings I got during this project
This was a very intense and productive journey for me. I used Transformers for the first time, and I deployed a web service for the first time. Here are the top discoveries that I’d like to share with you:
- It’s better to debug model training on a local computer but train the model in Google Colab. Uploading the images to Colab took a while, whereas training the model on the local machine was time-consuming. After training in Colab, I downloaded the model and applied it to extract vectors on the local machine.
- Weights in the transformer layers are updated when a transformer is being fine-tuned. I had thought that only the added layers were being trained, but the transformer changes inside too. However, it’s possible to freeze the transformer layers so they don’t change; see the short sketch after this list.
- The Streamlit framework can’t track clicks on images, which is why I used a sidebar-based selection. The framework has some pros though; for example, it comes with a nice mobile version without any extra coding.
- Docker composition was the trickiest part of the deployment. It turned out that requests from the frontend to the backend worked on a Mac but not on an Amazon EC2 machine. It required some tuning, and, fortunately, the solution for Linux also worked on macOS.
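If you do want to keep the backbone untouched, the usual PyTorch way (sketched here for the ViTClassifier from earlier) is to switch off gradients for its parameters so that only the added head is trained:

```python
import torch

model = ViTClassifier(num_classes=20)

# Freeze the ViT backbone; only the added linear head will be updated.
for param in model.vit.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),  # only the head's weights
    lr=1e-3,
)
```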