Novel Approaches to Similarity Learning

From Siamese networks and the triplet loss to the ArcFace loss

--

What is Similarity Learning?

A neural network can be trained to classify images or to predict the price of a product, tasks known as classification and regression. These two setups are common not only in deep learning but across machine learning in general. Similarity learning takes a different route: instead of assigning each object to a fixed category, it learns to judge whether two objects are similar, i.e. belong to the same category, hence the name “similarity learning”.

Many applications, such as face verification/identification and recommendation systems, rely on similarity learning, and it is most commonly applied to images.

An Intuitive Understanding

When we want to decide whether two images are similar, a good approach is to compare their embeddings produced by a trained CNN. Simply put, an embedding is a vector extracted from the network that captures the important patterns and information the network has learned.

For example, to check whether a picture of a dog is similar to a picture of a cat, we pass both images through the same neural network with the same weights and take the activations of the last layer before the output layer. The resulting vectors are the embeddings of the images.

Embeddings. Image by author
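As a concrete illustration, here is a minimal PyTorch sketch of this idea. It assumes a pretrained ResNet-18 as the CNN and hypothetical file paths `dog.jpg` and `cat.jpg`; any backbone and images would do.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a pretrained CNN and replace its classification head with an identity,
# so the forward pass returns the activations of the last layer before the output.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # keep the 512-d pooled features
backbone.eval()

# Standard ImageNet preprocessing for the pretrained weights.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return the embedding vector of a single image."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(image).squeeze(0)   # shape: (512,)

# Both images go through the *same* network with the *same* weights.
dog_embedding = embed("dog.jpg")   # hypothetical file paths
cat_embedding = embed("cat.jpg")
```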

To compare the two images, we can measure the distance between their embedding vectors. Mathematically, we define the distance function as the L2-norm of the difference between the two vectors, written as

d(x_1, x_2) = \lVert x_1 - x_2 \rVert_2

where x_1 and x_2 are the embeddings of the cat and the dog, respectively. If the two images are similar, the distance should be small; here, where the two images are very different, it should be large.
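In code, this distance is a single call. The sketch below uses random stand-in vectors; in practice the embeddings come from the CNN in the previous sketch.

```python
import torch

def l2_distance(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    # d(x1, x2) = ||x1 - x2||_2: small for similar images, larger for dissimilar ones.
    return torch.linalg.norm(x1 - x2)

# Stand-in embeddings; in practice these come from the CNN sketch above.
cat_embedding = torch.randn(512)
dog_embedding = torch.randn(512)
print(l2_distance(cat_embedding, dog_embedding).item())
```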

Comparing two images through embeddings produced by the same neural network is the idea behind the Siamese network. We can train such a network using the triplet loss.

Triplet Loss

Still using the cats-and-dogs example, let’s call the dog image the anchor: the original image we compare against. Our second image shows another dog, similar to the first but slightly different in its details; we call this image the positive, and we want the network to place it in the same category as the anchor. Finally, we have the negative, an image of a cat, which does not belong to the same category as the dog.

For the network to correctly decide whether two images belong to the same category, we want the distance between the anchor and the positive to be smaller than the distance between the anchor and the negative. Using the distance function above, we can write this as

d(A, P) \le d(A, N)

where A, P, and N are the embedding vectors of the anchor, the positive, and the negative. Moving the terms to one side, the inequality becomes

d(A, P) - d(A, N) \le 0

However, the network can find a way “around” this condition without any actual learning, for example by mapping every input to the zero vector so that both distances are zero. To fix this, we add a “margin” to the inequality, call it \alpha:

d(A, P) - d(A, N) + \alpha \le 0

This way, the network cannot fall back on a trivial solution and has to learn. To turn this into a loss function, we take the maximum between the left-hand side of the inequality above and 0.

The loss then becomes

\mathcal{L}(A, P, N) = \max\big(d(A, P) - d(A, N) + \alpha,\; 0\big)

The loss is zero when the anchor–negative distance exceeds the anchor–positive distance by at least the margin. Otherwise, the loss measures how far the network is from satisfying that condition.
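The formula translates almost directly into code. Below is a small sketch of the batched triplet loss; the batch size, embedding size, and margin value are arbitrary choices for illustration.

```python
import torch

def triplet_loss(anchor: torch.Tensor,
                 positive: torch.Tensor,
                 negative: torch.Tensor,
                 alpha: float = 0.2) -> torch.Tensor:
    """max(d(A, P) - d(A, N) + alpha, 0), averaged over the batch."""
    d_ap = torch.linalg.norm(anchor - positive, dim=1)   # anchor-positive distances
    d_an = torch.linalg.norm(anchor - negative, dim=1)   # anchor-negative distances
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()

# Toy batch: 8 triplets of 128-dimensional embeddings (stand-ins for CNN outputs).
A, P, N = torch.randn(3, 8, 128)
print(triplet_loss(A, P, N))
```

PyTorch also ships a built-in version of this loss, torch.nn.TripletMarginLoss, which can be used instead of a hand-written function.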

The triplet loss has a few disadvantages that should be considered.

First, it requires careful selection of the anchor, positive, and negative images. The negative can’t be too different from the anchor; if it is, the network satisfies the loss easily without learning anything. The anchor and the negative should be similar while still belonging to different classes; a common way to automate this selection is in-batch negative mining, sketched below.

Second, it is computationally expensive. Lastly, the triplet loss requires the margin hyperparameter \alpha, which can lead to worse results when not chosen carefully.
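A rough sketch of in-batch hard-negative mining is shown below, assuming we already have a batch of embeddings and integer class labels. In practice, semi-hard mining (negatives that fall inside the margin) is often preferred to avoid collapse, but the idea is the same: let the batch statistics pick the negatives instead of choosing triplets by hand.

```python
import torch

def hardest_negative_indices(embeddings: torch.Tensor,
                             labels: torch.Tensor) -> torch.Tensor:
    """For each sample, return the index of the closest embedding in the batch
    that carries a *different* label (the hardest in-batch negative)."""
    dists = torch.cdist(embeddings, embeddings)           # pairwise L2 distances
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    dists = dists.masked_fill(same_class, float("inf"))   # never pick the same class
    return dists.argmin(dim=1)

# Toy batch: 8 embeddings of size 128 drawn from 4 classes.
emb = torch.randn(8, 128)
lab = torch.randint(0, 4, (8,))
print(hardest_negative_indices(emb, lab))
```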

Alternatives: ArcFace Loss and the Angular Margin Penalty

There are many alternatives to the triplet loss; one of them is the ArcFace loss. It is built on top of the cross-entropy loss and aims to widen the decision boundary between classes, thus grouping similar data points closer together.


The idea behind ArcFace is to maximize the angles between classes (inter-class) while minimizing the angles within each class (intra-class) on a hypersphere. An angular margin penalty is then added to the angle between the embedding and the weight vector of the true class, before that angle is turned back into a logit.

The angular margin penalty penalizes embeddings that drift away from their class center and pulls the embedding features of each class closer together.
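The sketch below is a simplified ArcFace head, not the reference implementation: embeddings and class weights are L2-normalized so their dot product is cos(theta), the margin m is added to the angle of the true class, and the scaled logits are fed to cross-entropy. The class name, scale, and margin values are illustrative defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Simplified ArcFace: cos(theta + m) for the true class, cos(theta) otherwise."""
    def __init__(self, embedding_dim: int, num_classes: int,
                 scale: float = 64.0, margin: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale, self.margin = scale, margin

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Normalizing both sides puts everything on a hypersphere, so the
        # dot product between an embedding and a class weight is cos(theta).
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin penalty only to the angle of the true class.
        one_hot = F.one_hot(labels, num_classes=self.weight.size(0)).bool()
        logits = torch.where(one_hot, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)

# Toy usage: a batch of 8 embeddings (dim 128) and 10 identity classes.
head = ArcFaceHead(embedding_dim=128, num_classes=10)
loss = head(torch.randn(8, 128), torch.randint(0, 10, (8,)))
print(loss)
```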

There are many other losses that operate on a similar idea to ArcFace, such as SphereFace and CosFace.

Training a Siamese network with the triplet loss is an early approach to similarity learning and comes with the problems and disadvantages discussed above. ArcFace, on the other hand, widens the margin between classes cleverly, without the expensive triplet selection.
