Image similarity using Triplet Loss

Shibsankar Das
Towards Data Science
5 min read · Jul 16, 2019


Have you ever trained a Machine Learning model to solve a classification problem? If yes, how many classes did it have: 10, 200, maybe 1,000? Would the same model still work when the number of classes is on the order of millions? If not, this article is for you.

Several real-world applications in industry, ranging from Face Recognition and Object Detection to POS tagging and document ranking in NLP, are formulated as multi-class classification problems. A typical softmax-based deep network does not scale when the number of classes in the output layer is very high, because the output layer becomes enormous and the per-class training signal becomes sparse. Instead, this kind of problem can be formulated differently: learn a distributed embedding representation of the data points such that, in the high-dimensional vector space, contextually similar data points are projected into nearby regions while dissimilar data points are projected far away from each other.

The Triplet Loss architecture helps us learn such distributed embeddings through the notions of similarity and dissimilarity. It is a neural network architecture in which multiple parallel networks are trained while sharing weights with each other. At prediction time, input data is passed through one of these networks to compute the distributed embedding representation of the input.

In this article, we will discuss how to train a model with Triplet Loss and how to use the trained model at prediction time.

Training Data Preparation:

For Triplet Loss, the objective is to build triplets <anchor, positive, negative> consisting of an anchor image, a positive image (which is similar to the anchor image), and a negative image (which is dissimilar to the anchor image). There are different ways to define similar and dissimilar images. If you have a dataset with multiple labels as the target class, then images of the same class can be considered similar, and images from different classes can be considered dissimilar.

I have a dataset of Geological images with 6 different classes. To generate triplets, first, 2 classes are selected randomly. Then, two images are selected from one class and one image from the other. Since images of the same class are considered similar, one of the pair is used as the anchor and the other as the positive, whereas the image from the other class is used as the negative.

Likewise, for every batch, a set of n triplets is selected.
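A minimal sketch of this sampling scheme, assuming the dataset is available as a mapping from class label to a list of image identifiers (the names `images_by_class` and `generate_triplets` are mine, not from the repository):

```python
import random

def generate_triplets(images_by_class, n_triplets):
    """Build <anchor, positive, negative> triplets from a class-labelled dataset.

    `images_by_class` maps class label -> list of image identifiers
    (a hypothetical structure; adapt it to your data loader).
    """
    classes = list(images_by_class.keys())
    triplets = []
    for _ in range(n_triplets):
        # Pick two distinct classes at random.
        pos_class, neg_class = random.sample(classes, 2)
        # Two distinct images from one class: one anchor, one positive.
        anchor, positive = random.sample(images_by_class[pos_class], 2)
        # One image from the other class serves as the negative.
        negative = random.choice(images_by_class[neg_class])
        triplets.append((anchor, positive, negative))
    return triplets
```

Repeating this per batch yields the n triplets described above.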

Loss function:

The cost function for Triplet Loss is as follows:

L(a, p, n) = max(0, D(a, p) - D(a, n) + margin)

where D(x, y) is the distance between the learned vector representations of x and y. L2 distance or (1 - cosine similarity) can be used as the distance metric. The objective of this function is to keep the distance between the anchor and the positive smaller than the distance between the anchor and the negative, by at least the margin.
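Written out directly on embedding vectors, a minimal NumPy version of this loss with squared L2 distance looks as follows (the margin of 0.2 is an illustrative default, not a value from the article):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """L(a, p, n) = max(0, D(a, p) - D(a, n) + margin), with D the squared
    L2 distance between embedding vectors."""
    d_ap = np.sum((anchor - positive) ** 2)  # distance anchor <-> positive
    d_an = np.sum((anchor - negative) ** 2)  # distance anchor <-> negative
    return max(0.0, d_ap - d_an + margin)
```

When the positive is already closer than the negative by more than the margin, the loss is zero and the triplet contributes no gradient.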

Model Architecture:

The idea is to have three identical networks with the same neural net architecture that share weights: all three branches must use the same underlying weight vectors. [Please refer to the GitHub repository to see how to share weights between networks in the TensorFlow implementation.] The last layer of the deep network has D neurons, so the model learns a D-dimensional vector representation.

The anchor, positive, and negative images are passed through their respective branches, and during backpropagation the shared weight vectors are updated. At prediction time, any one branch is used to compute the vector representation of the input data.
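Conceptually, weight sharing just means all three branches run the same forward pass with the same parameters. A toy NumPy sketch, with a two-layer network standing in for the deep CNN and an illustrative D = 16:

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of weights, shared by all three branches.
W1 = rng.normal(size=(64, 32))   # input -> hidden
W2 = rng.normal(size=(32, 16))   # hidden -> D-dim embedding (D = 16 here)

def embed(x):
    """Forward pass of the shared network: the same W1/W2 are used whether
    x is the anchor, the positive, or the negative image."""
    h = np.maximum(0, x @ W1)    # ReLU hidden layer
    return h @ W2                # D-dimensional embedding

# The three "parallel networks" are simply three calls to the same function.
x_anchor, x_pos, x_neg = rng.normal(size=(3, 64))
e_a, e_p, e_n = embed(x_anchor), embed(x_pos), embed(x_neg)
```

Because there is only one set of weights, gradients from all three branches update the same parameters, which is exactly what sharing buys you.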

Triplet Loss architecture

Following is the TensorBoard visualization of the implementation.

TensorBoard visualization of the computation graph

Model Learning:

The model not only learns to form clusters for the different classes, it also succeeds in projecting similar-looking images into neighboring regions. A classification architecture, by contrast, learns a decision boundary between pairs of classes but does not preserve the similarity structure among images within a class.

To get a sense of how well the model has learned, I randomly selected 20% of the images and plotted them in 2D space after applying dimensionality reduction to their high-dimensional vector representations.

Plot of data points in 2D space
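The projection step can be sketched with plain PCA via SVD (a simple stand-in for the dimensionality reduction used for the plot; t-SNE or UMAP are also common choices for such visualizations):

```python
import numpy as np

def project_2d(embeddings):
    """Reduce (N, D) embeddings to (N, 2) via PCA for plotting."""
    X = embeddings - embeddings.mean(axis=0)     # center the data
    # The top-2 right singular vectors are the principal directions.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T
```

The resulting 2D coordinates can be scatter-plotted, colored by class label, to inspect the cluster structure.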

Results:

Following is a snapshot of how the model performs. I randomly chose 20 query images from the corpus of test images and, for each query image, plotted the top 10 most similar images in terms of cosine similarity in the high-dimensional vector space.
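This retrieval step amounts to ranking the corpus embeddings by cosine similarity against the query embedding; a minimal NumPy sketch (function name is mine):

```python
import numpy as np

def top_k_similar(query, corpus, k=10):
    """Return indices of the k corpus embeddings most similar to `query`
    by cosine similarity. `corpus` has shape (N, D), `query` shape (D,)."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                     # cosine similarity to every corpus vector
    return np.argsort(-sims)[:k]     # indices of the k highest similarities
```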

Conclusion:

The Triplet Loss architecture helps us solve problems with a very high number of classes. Say you want to build a Face Recognition system with a database of 1 million human faces: pre-compute a D-dimensional vector for each face. Then, given a test face image, compute its cosine similarity with all 1 million pre-computed vectors; the image with the highest similarity is the selected candidate. If the test image is nothing but noise, the highest similarity will be very low and will fall below a threshold parameter.
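A sketch of this lookup with a rejection threshold (the function name and the 0.5 threshold are illustrative; in practice the threshold would be tuned on validation data):

```python
import numpy as np

def best_match(query, db, threshold=0.5):
    """Return (index, similarity) of the most similar database vector,
    or None when the best cosine similarity falls below `threshold`
    (e.g. for a noise image that matches nothing)."""
    q = query / np.linalg.norm(query)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = d @ q
    best = int(np.argmax(sims))
    return (best, float(sims[best])) if sims[best] >= threshold else None
```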

Computing cosine similarity against every image in the corpus would be computationally inefficient, and it might also require a huge amount of physical memory to store the vector representations of the corpus images.

Faiss, an open-source library developed by Facebook Research, helps to build an index of the corpus based on the available memory and provides several methods to find similar images considerably faster.

Git repository:

Clone the git repository: https://github.com/sanku-lib/triplet_loss.git

Further Reading:

Implementing Content-Based Image Retrieval with Siamese Networks in PyTorch



Shibsankar is a "40 under 40 Data Scientist" currently working at Walmart Labs as a Data Scientist. Prior to that, he worked at Envestnet and Microsoft Research.