A THOROUGH COMPARISON

How to choose your loss when designing a Siamese Neural Network: Contrastive, Triplet, or Quadruplet?

A performance comparison of three popular techniques (Contrastive, Triplet & Quadruplet loss) used to train similarity learning algorithms on the Quora dataset [1]

Thomas Di Martino
Towards Data Science
8 min read · Jun 30, 2020


In this article, I will evaluate and compare three different losses for the task of Deep Similarity Learning. If you are not yet familiar with this topic, I have written an introductory article covering the main concepts, with code examples, as well as a complete GitHub repository for you to check:

Table of Contents

I. Quick overview of the task

II. Siamese Recurrent Network: similarity learning for sequences

III. Losses for Deep Similarity Learning

IV. Concrete Application: question pairs detection

I. Quick overview of the task

For this task, I used the famous Quora question pairs dataset, where the main goal is to predict whether two questions have the same intent. For instance:

  • What can make Physics easy to learn? / How can you make Physics easy to learn? have similar intents
  • What is the best way to make money online? / What is the best way to ask for money online? have different intents

Different solutions can be used for this task; the one we will see today combines Word Embeddings with a Siamese Recurrent Network. The word embedding algorithm is not the point of focus here (Word2Vec will be used); instead, we will focus on training the Siamese Recurrent Network. Hence, before talking about training, let us have a quick overview of what a Siamese Recurrent Network is (more details can be found in my other article linked above).

II. Siamese Recurrent Network: similarity learning for sequences

Figure: a Siamese BiLSTM architecture

As presented above, a Siamese Recurrent Neural Network is a neural network that takes two sequences of data as input and classifies them as similar or dissimilar.

The Encoder

To do so, it uses an Encoder whose job is to transform the input data into a vector of features. One vector is created for each input, and both are passed on to the Classifier. When working with images, this encoder is often a stack of convolutional layers; when working with sequences, it is often a stack of RNNs. In our case, we use a stack of 3 Bidirectional LSTMs.

The Classifier

The classifier then computes a distance between the two feature vectors (any distance function can be used: L1, L2, …). This distance is then classified as belonging to a similar or a dissimilar pair of data instances: in practice, this amounts to finding the right distance threshold above which two data objects are considered dissimilar.
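The thresholding step can be sketched as follows (the function name and the threshold value of 0.5 are illustrative assumptions; in practice the threshold is tuned on validation data):

```python
import torch
import torch.nn.functional as F

def classify_pair(v1, v2, threshold=0.5):
    # L2 distance between the two batches of feature vectors
    dist = F.pairwise_distance(v1, v2)
    # Pairs closer than the threshold are labelled similar (1),
    # the rest dissimilar (0)
    return (dist < threshold).long()
```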

Training a Siamese Neural Network

Given the definitions of the encoder and the classifier, one may realize that the main difficulty of working with Siamese NNs lies in the creation of the feature vector. Indeed, this vector needs the following properties:

  • To be descriptive enough that two similar pieces of data (despite their variability) produce similar vectors (and hence, small distances);
  • To be discriminative enough that two dissimilar pieces of data produce dissimilar vectors.
Animation of Data Comparison process

Hence, we see that training this network means teaching it, on one hand, to recognize similar things and, on the other hand, to recognize when things are dissimilar, both with good confidence. Teaching a model only what similar data looks like is not enough: it would overfit to the training data and tend to find everything similar (high recall but low precision). It must also be trained to recognize dissimilar data (and thus balance its recall and precision) and, ultimately, to learn what makes two pieces of data inherently different.

The most popular loss function for training a Siamese Neural Network is the Contrastive Loss [2] (reviewed more thoroughly in my earlier post linked above). However, it is not the only one that exists. I will compare it to two other losses, detailing the main idea behind each as well as its PyTorch implementation.

III. Losses for Deep Similarity Learning

Contrastive Loss

When training a Siamese Network with the Contrastive Loss [2], it takes two input samples to compare at each training step. These two inputs can be either similar or dissimilar. This is modelled by a binary class variable Y whose values are:

  • 0 if dissimilar;
  • 1 if similar.

These class values can of course be swapped, provided the loss function is adapted accordingly.

Illustration of the Contrastive Loss details

You can find the PyTorch code of the Contrastive Loss below:
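The original code embed is not reproduced here, but a minimal implementation consistent with the label convention above (Y = 1 similar, Y = 0 dissimilar) looks like this; the margin value of 1.0 is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """Contrastive loss (Hadsell et al., 2006), with y = 1 for
    similar pairs and y = 0 for dissimilar pairs."""

    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, x1, x2, y):
        # Euclidean distance between the two feature vectors
        dist = F.pairwise_distance(x1, x2)
        # Similar pairs are pulled together; dissimilar pairs are
        # pushed apart until they are at least `margin` away
        loss = y * dist.pow(2) + (1 - y) * F.relu(self.margin - dist).pow(2)
        return loss.mean()
```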

Triplet Loss

When training a Siamese Network with the Triplet Loss [3], it takes three input samples to compare at each training step. Unlike the Contrastive Loss, the inputs are intentionally sampled according to their class:

  • We sample an anchor object, used as a point of comparison for the two other data objects;
  • We sample a positive object, known to be similar to the anchor object;
  • We then sample a negative object, known to be dissimilar to the anchor object.
Illustration of the Triplet Loss details

You can find the PyTorch code of the Triplet Loss below:
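The original code embed is not reproduced here, but a minimal sketch of the triplet loss from [3], using the anchor/positive/negative sampling described above (the margin value is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletLoss(nn.Module):
    """Triplet loss (Schroff et al., 2015): pushes the anchor-negative
    distance to exceed the anchor-positive distance by at least `margin`."""

    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, anchor, positive, negative):
        d_pos = F.pairwise_distance(anchor, positive)
        d_neg = F.pairwise_distance(anchor, negative)
        # Zero loss once the negative is `margin` further away
        # than the positive
        return F.relu(d_pos - d_neg + self.margin).mean()
```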

Quadruplet Loss

When training a Siamese Network with the Quadruplet Loss [4], it takes four input samples to compare at each training step. Just like with the Triplet Loss, the inputs are intentionally sampled according to their class:

  • We sample an anchor object, used as a point of comparison for the other data objects;
  • We sample a positive object, known to be similar to the anchor object;
  • We sample a negative object, known to be dissimilar to the anchor object;
  • Then, we sample a second negative object, known to be dissimilar to the three other data objects.
Illustration of Quadruplet Loss details

You can find the code of the Quadruplet Loss below:
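The original code embed is not reproduced here, but a minimal sketch of the quadruplet loss from [4] looks like the following; the two margin values are assumptions (the second margin is typically chosen smaller than the first):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuadrupletLoss(nn.Module):
    """Quadruplet loss (Chen et al., 2017): a triplet term plus a
    second term pushing the anchor-positive distance below the
    distance between the two negatives."""

    def __init__(self, margin1=1.0, margin2=0.5):
        super().__init__()
        self.margin1 = margin1
        self.margin2 = margin2

    def forward(self, anchor, positive, negative1, negative2):
        d_ap = F.pairwise_distance(anchor, positive)
        d_an = F.pairwise_distance(anchor, negative1)
        d_nn = F.pairwise_distance(negative1, negative2)
        # Classic triplet term (anchor vs. first negative)
        triplet_term = F.relu(d_ap.pow(2) - d_an.pow(2) + self.margin1)
        # Extra term specific to the quadruplet loss: the positive pair
        # must also be closer than the negative-negative pair
        push_term = F.relu(d_ap.pow(2) - d_nn.pow(2) + self.margin2)
        return (triplet_term + push_term).mean()
```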

Visual Comparison of the losses and their impact on the network

Comparison of the 3 Losses and their impact on the Siamese network architecture

In this illustration, the three losses are compared side by side. We can easily see how the number of inputs differs depending on the loss used.

On the right of each encoder is a graph representation of the computed loss. It gives some insight into how the outputs of the encoder are used to train the network, and it also shows the different levels of complexity among the losses. The output of each loss is the purple computation node.

IV. Concrete applications

Architecture & Loss definitions (PyTorch)

I trained three different models, one per loss. They all used the same encoder to process their input; the only difference between them was the number of inputs:

  • 2 Inputs for the Contrastive Loss model;
  • 3 Inputs for the Triplet Loss model;
  • 4 Inputs for the Quadruplet Loss model.

This encoder has the following architecture (built in PyTorch):
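The original code embed is not reproduced here; as a sketch, a stack of 3 Bidirectional LSTMs in PyTorch could look like the following (the hidden size and the choice of pooling the final hidden states are my own assumptions, not the article's exact code):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the encoder: a stack of 3 bidirectional LSTMs that
    maps a sequence of word embeddings to a single feature vector."""

    def __init__(self, embedding_size=40, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=embedding_size,
            hidden_size=hidden_size,
            num_layers=3,
            bidirectional=True,
            batch_first=True,
        )

    def forward(self, x):
        # x: (batch, seq_len, embedding_size)
        _, (h_n, _) = self.lstm(x)
        # Concatenate the last layer's forward and backward hidden
        # states into one feature vector per input sequence
        return torch.cat([h_n[-2], h_n[-1]], dim=1)  # (batch, 2*hidden_size)
```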

Each model then holds a single instance of this encoder, which it uses to generate the feature vectors for all of its inputs. For example, for the Quadruplet Loss model, we have:
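The original code embed is not reproduced here; a sketch of the weight-sharing wrapper could look like this (the stand-in Linear encoder is purely for illustration, in place of the BiLSTM stack):

```python
import torch
import torch.nn as nn

class QuadrupletModel(nn.Module):
    """One shared encoder applied to all four inputs: the weight
    sharing is what makes the network 'Siamese'."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder  # a single shared encoder instance

    def forward(self, anchor, positive, negative1, negative2):
        # The same weights encode all four inputs
        return (self.encoder(anchor), self.encoder(positive),
                self.encoder(negative1), self.encoder(negative2))

# Usage with a stand-in encoder:
model = QuadrupletModel(nn.Linear(40, 16))
a, p, n1, n2 = (torch.randn(8, 40) for _ in range(4))
za, zp, zn1, zn2 = model(a, p, n1, n2)
```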

Training details & results

I trained my networks in parallel (using the same for-loop) using the following hyper-parameters:

  • 25 epochs
  • Learning Rate of 1e-3
  • Batch Size of 64
  • Embedding Size (Word2Vec modelling) of 40
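A single training step with these hyper-parameters can be sketched as follows (the stand-in Linear encoder and the contrastive-loss margin of 1.0 are illustrative assumptions; the real training loop iterates over 25 epochs of the dataset):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
encoder = nn.Linear(40, 16)  # placeholder for the BiLSTM encoder
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# One dummy batch of 64 pairs (1 = similar, 0 = dissimilar)
x1, x2 = torch.randn(64, 40), torch.randn(64, 40)
y = torch.randint(0, 2, (64,)).float()

# Contrastive loss on the encoded pair
dist = F.pairwise_distance(encoder(x1), encoder(x2))
loss = (y * dist.pow(2) + (1 - y) * F.relu(1.0 - dist).pow(2)).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```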

The performance of the three models was measured with the AUC score at the end of each epoch, after the validation phase, computed on the training, validation and test sets. They all followed the same overall progression, so I will only display the test results:
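Since the classifier scores pairs by distance, the AUC can be computed by ranking pairs with the negative distance (closer means more similar); a small sketch with toy values:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical pair distances and ground-truth labels (1 = similar)
distances = np.array([0.2, 0.9, 0.4, 1.5])
labels = np.array([1, 0, 1, 0])

# Negate distances so that higher score = more likely similar
auc = roc_auc_score(labels, -distances)  # perfect ranking here -> 1.0
```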

Plot of the AUC Score on the test set for each model as a function of epoch number

We see here that throughout training, the model trained with the Quadruplet Loss consistently outperforms the Contrastive Loss model. While they all seem to converge by the end, there is still a 0.01 AUC difference between the two. This shows how effective the Quadruplet Loss is at training a model to transform data into feature-dense vectors.

My github repository with the full code of this project is here:

DISCLAIMER: The Quora dataset was slightly modified for the purpose of this experiment. While it originally contains both similar and dissimilar question pairs, only the similar pairs were kept; dissimilar examples were created by randomly sampling any other question from the dataset. The whole difficulty of the original dataset is that some questions are very close in meaning while actually being different. The key to solutions built for that challenge was not only to build a Deep Similarity Network but also to hand-craft "magic features" that were merged with the feature vector for classification.

Additional Resources

For another concrete application of Siamese Neural Networks, I recommend this extensive article on the subject by Raúl Gómez Bruballa, posted on the Neptune.ai blog. It presents the concept of Image Retrieval through the matching of similarly embedded content.

References

[1] Quora. 2017. Quora Question Pairs, Kaggle

[2] R. Hadsell, S. Chopra, Y. LeCun. Dimensionality Reduction by Learning an Invariant Mapping. 2006.

[3] F. Schroff, D. Kalenichenko, J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. 2015.

[4] W. Chen, X. Chen, J. Zhang, K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. 2017.


As a French PhD student, I am passionate about anything related to Artificial Intelligence and Earth Observation.