Similar Image Retrieval using Autoencoders

Anson Wong
Towards Data Science
4 min read · Aug 13, 2017


Fig 1: Querying a test burger image in a small training set of steakhouse food images

In this article, we explain how autoencoders can be used to find similar images in an unlabeled image dataset. Along with this, we provide a Python implementation of an image-similarity recommender system trained on steakhouse food images at:

https://github.com/ankonzoid/artificio/tree/master/image_retrieval

For more of my blogs, tutorials, and projects on Deep Learning and Reinforcement Learning, please check out my Medium and my GitHub.

Q: What is an Autoencoder?

Autoencoders are neural networks composed of an encoder and a decoder. The goal is to compress your input data with the encoder, then decompress the encoded data with the decoder, such that the output is a good (ideally perfect) reconstruction of the original input.
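For concreteness, here is a minimal sketch of such an autoencoder in Keras. The framework choice, layer sizes, and variable names are illustrative assumptions on my part, not the exact architecture used in the linked repo:

```python
# Minimal autoencoder sketch (illustrative sizes, not the repo's architecture)
import numpy as np
from tensorflow.keras import layers, Model

input_dim = 784      # e.g. a flattened 28x28 grayscale image (assumption)
encoding_dim = 32    # size of the compressed representation (assumption)

inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation="relu")(inputs)    # encoder: compress
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)   # decoder: reconstruct

autoencoder = Model(inputs, decoded)   # full encode -> decode pipeline
encoder = Model(inputs, encoded)       # encoder alone, reused later to produce embeddings
autoencoder.compile(optimizer="adam", loss="mse")
```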

Q: Why are Autoencoders useful?

As a black box, the autoencoder might seem useless: why bother reconstructing an imperfect copy of data we already have? Good point. But the true worth of the autoencoder lies in the encoder and decoder themselves as separate tools, rather than in the joint black box that reconstructs the input. For example, if we lower the encoding dimensionality enough, we can guide the autoencoder to learn the most salient features of the data during training (and ignore its noise): forced to reduce its reconstruction error with only a limited number of degrees of freedom available, the autoencoder prioritizes retaining the most macroscopic details of the data first. Note that autoencoders fall into the category of unsupervised learning, as the ground-truth target is the data itself. Training is typically done through backpropagation, with the discrepancy between the input and its reconstruction providing the error signal that propagates through the network to update the weights.
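Continuing the sketch above, training might look like the following: the target passed to `fit` is the input itself, so minimizing the MSE reconstruction loss via backpropagation requires no labels. The training data here is a random placeholder, and the hyperparameters are illustrative assumptions:

```python
# Training sketch: the ground-truth target is the input itself (unsupervised),
# and backpropagation minimizes the reconstruction (MSE) error.
x_train = np.random.rand(1000, input_dim).astype("float32")  # placeholder data (assumption)

autoencoder.fit(
    x_train, x_train,        # input and target are the same data
    epochs=20,
    batch_size=64,
    shuffle=True,
    validation_split=0.1,
)
```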

Fig 2: A schematic of how an autoencoder works. An image of a burger is encoded and reconstructed.

Ok ok ok. So now that we can train an autoencoder, how do we put it to practical use? It turns out that the encoded representations (embeddings) produced by the encoder are excellent objects for similarity retrieval. Raw, highly unstructured data, for example an image of your face, lives in a vector space (of RGB pixels) that is not human-interpretable. So instead of operating painstakingly in pixel space, we can use a trained encoder to convert the image of your face into a lower-dimensional embedding whose dimensions are hopefully more meaningful, such as “image brightness”, “head shape”, “location of eyes”, “color of hair”, etc. With such a condensed representation, simple vector similarity measures between embeddings (such as cosine similarity) produce much more human-interpretable notions of similarity between images.
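Continuing the same sketch, retrieval then reduces to comparing embeddings. The query array and variable names below are hypothetical placeholders of mine, not the repo's actual code:

```python
# Retrieval sketch: embed the database and the query with the trained encoder,
# then rank database images by cosine similarity to the query embedding.
from sklearn.metrics.pairwise import cosine_similarity

x_query = x_train[:1]                      # placeholder (1, input_dim) query image (assumption)
db_embeddings = encoder.predict(x_train)   # embeddings of the image database
query_embedding = encoder.predict(x_query) # embedding of the query image

scores = cosine_similarity(query_embedding, db_embeddings)[0]  # one score per database image
top_k = 5
top_indices = np.argsort(-scores)[:top_k]  # indices of the k most similar images
print("Most similar database images:", top_indices)
```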

The similar-image retrieval recommender code

Using the method described above of embedding images with a trained encoder (extracted from an autoencoder), we provide here a simple, concrete example of how to query and retrieve similar images from a database. As mentioned earlier, the code for our similar-image recommender system can be found at:

https://github.com/ankonzoid/artificio/tree/master/image_retrieval

Specifically, we work with a small training database of five common steakhouse food classes: steak, burger, salad, fries, and asparagus. In the figures below, we show t-SNE visualizations of our steakhouse food embeddings (Fig 3), a top k=5 image retrieval for a test burger image (Fig 4), and reconstructions of some training steakhouse food images (Fig 5). The performance of the provided algorithms is far from perfect, but it offers a good starting point for anyone interested in deep-learning image retrieval.

Fig 3: t-SNE visualizations of the training steakhouse food item images
Fig 4: Test burger image retrieval from the training set
Fig 5: Steakhouse food item image reconstructions using a convolutional autoencoder
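For reference, a 2-D t-SNE plot like Fig 3 can be produced from the same encoder embeddings with scikit-learn and matplotlib; the perplexity and styling below are illustrative choices of mine, not the repo's settings:

```python
# t-SNE sketch: project the encoder embeddings to 2-D for visualization (cf. Fig 3).
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(db_embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("t-SNE of steakhouse food image embeddings")
plt.show()
```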


AI Research Scientist / Machine Learning Engineer / Theoretical Physicist