How to Implement Image Similarity Using Deep Learning

We can use machine learning to return meaningfully similar images, text, or audio in the proper context. Simple and fast.

Imagine the programmatic effort needed to implement an algorithm that visually compares different T-shirts to find matching ones. Achieving good results with a traditional programmatic approach is challenging.

With deep learning, we can easily scale this to an unlimited number of different use cases. We do this by using the learned visual representation of a Deep Learning model.

This article covers a simple and fast way to implement an Image Similarity Search.

For the sake of simplicity, we use images, but the same approaches generalize to everything that can be represented as a vector.

Photo by the author — Image Similarity result based on Deep Learning, source: https://ioannotator.com/image-similarity-search

Jump Directly to the Notebook and Code

If you want to run the code covered in this article head over to the Google Colab notebook.
You need to upload and reference your own images in the notebook.

Use cases

Let us start with a few use cases to inspire you before we dig deeper into the technical details. The use cases are endless, and you can use image similarity in many different areas.
A few use cases I worked on in the past several years:

E-Commerce

E-Commerce has many use cases for a similarity search. The most obvious one is finding similar products based on an actual product image. But there are also less obvious ones, like keeping the product catalog clean by preventing multiple uploads of the same product image.

Digital Document Archives

Imagine a digital archive with millions of scanned documents, large and unstructured. Assigning the right labels to find similar documents is extremely time-consuming. A deep learning approach removes the need to label the images.

Machine Learning

High-quality training data is key for successful machine learning projects. Having duplicates in the training data can lead to bad results. Image Similarity can be used to find duplicates in the datasets.

Visual Representation of an Image

When using a deep learning model, we usually use the last layer of the model, the output layer. This layer gives us, for example, the class of the image: is there a cat, car, table, dog, or mouse in the image?

We can remove this classification layer and get the visual representation the model produced for the input image. The representation is a vector of float numbers between 0 and 1.

Photo by the author — classification vs feature vector
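To make this concrete, here is a minimal sketch in Keras. Loading a model with `include_top=False` drops the classification layer and leaves us with the feature vector (the choice of MobileNetV2 here is my own for illustration, not prescribed by this article):

```python
import tensorflow as tf

# A sketch of removing the classification layer. MobileNetV2 is an
# arbitrary example model, not the article's prescribed choice.
# include_top=False drops the final classification layer, and
# pooling="avg" returns the pooled feature vector instead.
feature_extractor = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    weights="imagenet",
    include_top=False,
    pooling="avg",
)
# The output is now a feature vector (1280 floats for MobileNetV2)
# instead of class probabilities.
```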

Let us generalize the approach and compare it to how we as humans perceive the world. As children, we learned unique characteristics, like that a car is mostly square and has tires, lights, and doors. The numbers in our vector might represent similar characteristics. The deep learning model learned this information during training.

What the model learned is “mostly” a black box. Recent efforts in research help us better understand that. I recommend reading more about Explainable AI if you are interested in that specific topic.

To get such a vector, also called an embedding, we need just a few lines of code and a pre-trained model.

Pre-Trained Model

We can take advantage of one of the many pre-trained deep learning models. They generalize well enough to fit most of the use cases. It can make sense to fine-tune the pre-trained model for your specific use case. But before you invest that effort, I recommend starting without fine-tuning.

TensorFlow Hub makes it easy to reuse pre-trained image feature vector models.
We load the model using TensorFlow Keras. The input shape defines the image size the model was trained on. That means the model learned to find patterns of a certain size, so we do well to stick to the recommended image size.
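A minimal sketch of loading such a model (the concrete feature vector model and URL are my choice for illustration; the notebook may use a different one):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Image feature vector model from TensorFlow Hub. The specific model
# URL is an assumption for illustration; any feature vector model works.
MODEL_URL = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/5"
IMAGE_SHAPE = (224, 224)  # the image size the model was trained on

model = tf.keras.Sequential([
    hub.KerasLayer(MODEL_URL, input_shape=IMAGE_SHAPE + (3,))
])
```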

If you change the model URL, keep in mind that different models expect different image shapes.

Embedding

To get the embedding, we input an image and let the model do its prediction. The image itself needs a bit of preprocessing (see the sketch after this list):

  1. Resizing to the size the model was trained on
  2. Converting the image into a color representation for each pixel
  3. Normalizing the values between 0 and 1
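A minimal sketch of these three steps, reusing `model` and `IMAGE_SHAPE` from the snippet above (using Pillow for image loading is my assumption; the notebook may do it differently):

```python
import numpy as np
from PIL import Image

def get_embedding(image_path):
    # 1. Resize to the size the model was trained on
    image = Image.open(image_path).convert("RGB").resize(IMAGE_SHAPE)
    # 2. Convert the image into an RGB color value per pixel and
    # 3. normalize the values to between 0 and 1
    pixels = np.array(image) / 255.0
    # The model expects a batch dimension, so we add one
    embedding = model.predict(pixels[np.newaxis, ...])
    return embedding[0]  # a vector of 1280 float numbers
```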

With these few lines of code, we can get the visual representation, a vector of float numbers between 0 and 1. The length of the vector differs depending on the model. In our example, the model used produces a vector of 1280 numbers.

We can use that visual representation to calculate how similar images are.

How to measure Image Similarity

To calculate the image similarity, we need a metric. For simplicity, we cover just the most common ones: Euclidean distance, cosine similarity, and dot product.

To avoid unnecessary math, I try to describe it as practically as possible.

Imagine a coordinate system with 3 axes, x, y, and z, and assume for a moment that our feature vector has a length of 3 elements instead of 1280.
Let us also imagine we have the following feature vectors for 3 images represented in that coordinate system:

  • cat [0.1, 0.1, 0.2] (blue point)
  • cat [0.2, 0.12, 0.2] (orange point)
  • table [0.8, 0.6, 0.6] (green point)

Photo by the author

The Euclidean distance, as the name suggests, measures the distance between points. We can clearly see that the distance between the two cats (orange and blue) is smaller than their distance to the table (green).

While this works for this specific case, it might not fit all the time. Let us add two additional vectors:

  • mouse [0.8, 0.6, 0.6] (red point)
  • cat [0.2, 0.16, 0.2] (purple point)

We can see the mouse (red) is close to the cat (purple); if we measure the Euclidean distance, it is in fact even closer than the other cat (blue). We need another factor to measure the similarity accurately. We can use the direction of the points inside the coordinate system.

Photo by the author

It gets clearer if we draw a few lines visualizing the direction; this way, we can see that similar images share the same direction. The dot product metric, in comparison to Euclidean distance, uses the distance and additionally the direction to calculate the similarity.

Photo by the author

The last metric is cosine similarity; it only takes the direction into account, not the distance between the two points. Which one you choose depends on your specific use case. If you are not sure, I recommend evaluating all three, starting with cosine.

The example we just covered is a simplification; the feature vectors we get from our models have a much higher dimensionality.

To calculate the similarity values, we don't need to implement the math on our own. Many libraries provide these methods for us, ready to use.

For this example, we use SciPy, but you could also use scikit-learn, to mention just two.
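A minimal sketch with SciPy, assuming the `get_embedding` helper from above and placeholder file names:

```python
from scipy.spatial import distance

# Placeholder image files; replace them with your own uploads.
cat_1 = get_embedding("cat_1.jpg")
cat_2 = get_embedding("cat_2.jpg")
rocket = get_embedding("rocket.jpg")

# SciPy returns the cosine *distance* (1 - cosine similarity),
# so smaller values mean more similar images.
print(distance.cosine(cat_1, cat_2))   # the two cats, e.g. ~0.345
print(distance.cosine(cat_1, rocket))  # e.g. ~0.731
print(distance.cosine(cat_2, rocket))  # e.g. ~0.690

# The Euclidean distance is available as well:
print(distance.euclidean(cat_1, cat_2))
```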

The smaller the value, the more similar the images. You can see the cats are very close to each other, with a value of 0.345, while the cats compared to the rocket have higher values of 0.731 and 0.690, which means a lower similarity.

Photo by the author

How to scale it to millions of images

Iterating over millions of images to find the most similar ones doesn't scale very well if we need to process all of the images every time we want to find a similar one. To optimize this process, we need to build some kind of index and find a way to iterate over it more efficiently. Luckily, Spotify, Google, and Facebook have open-sourced their solutions to exactly that problem (Annoy, ScaNN, and Faiss, respectively). There are also managed services if you don't want to take care of the required infrastructure and scaling on your own.
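As a small illustration, here is a minimal sketch of building such an index with Spotify's Annoy (picking Annoy over ScaNN or Faiss is my own choice for the example):

```python
from annoy import AnnoyIndex

DIMENSION = 1280  # the length of our feature vectors
index = AnnoyIndex(DIMENSION, "angular")  # "angular" is closely related to cosine

# embeddings: a list of feature vectors computed with get_embedding()
for i, embedding in enumerate(embeddings):
    index.add_item(i, embedding)

index.build(10)  # 10 trees; more trees improve accuracy at index-build cost
index.save("images.ann")

# query_embedding: the feature vector of the query image
# Find the ids of the 5 most similar images without scanning everything
neighbor_ids = index.get_nns_by_vector(query_embedding, 5)
```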

What next?

The approaches covered in this article can be easily applied to other Machine Learning areas like text, video, audio, or even tabular data. Every model that outputs a feature vector is suitable.

Let me know if you are interested in a deep dive into one of the open-source frameworks or managed services.

I hope this article inspired you to think of use cases in your specific domain. I have a Google Vertex AI Matching Engine article and video if you plan to put Similarity Search into production.

Thanks for Reading

Your feedback and questions are highly appreciated. You can find me on Twitter @HeyerSascha or connect with me via LinkedIn. Even better, subscribe to my YouTube channel ❤️.


Hi, I am Sascha, Senior Machine Learning Engineer at DoiT International and founder of IOAnnotator, an AI platform.