What is a Siamese Neural Network?

A neural network that specializes in measuring similarity

Wanshun Wong
Towards Data Science


Many applications of machine learning revolve around checking whether two objects are similar. For example:

  • For face recognition, check if an input facial image is similar to any of the images in the database
  • On question-and-answer websites, check if a new question is similar to any of the archived questions

Nowadays, one of the most common and convenient approaches to measure the similarity between two objects is as follows:

  1. Obtain a vector representation (a.k.a. embedding) for each object, e.g. the output of an intermediate layer of a pre-trained ResNet, or the average of pre-trained word vectors for all the words in a sentence
  2. Compute the cosine similarity between the two vectors from step (1)
  3. Use the value from step (2) to determine whether the two objects are similar
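
For example, steps (1)–(3) might look like the following sketch in PyTorch, assuming image inputs and a pre-trained ResNet-18 from torchvision as the embedding model; the 0.8 decision threshold is an arbitrary placeholder.

import torch
import torch.nn.functional as F
from torchvision import models

# Step 1: turn a pre-trained ResNet-18 into an embedding model by dropping
# its final classification layer (torchvision >= 0.13 weights API).
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
embedder = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

def embed(images):
    with torch.no_grad():
        return embedder(images).flatten(1)  # shape (batch, 512)

# Step 2: cosine similarity between the two embeddings.
x1 = torch.rand(1, 3, 224, 224)  # placeholders; use properly preprocessed images
x2 = torch.rand(1, 3, 224, 224)
similarity = F.cosine_similarity(embed(x1), embed(x2))

# Step 3: threshold the similarity score (the cut-off is application-specific).
is_similar = similarity.item() > 0.8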

However, this approach often does not perform well, because the vector representations from step (1) are not specialized for our application. If we want to fine-tune these vector representations, or even train a new model from scratch, we can use a Siamese neural network architecture.

Network Architecture

A Siamese neural network consists of two identical subnetworks, a.k.a. twin networks, joined at their outputs. Not only do the twin networks have identical architecture, they also share weights. They work in parallel and are responsible for creating the vector representations of the inputs. For instance, we can use ResNet as the twin network if our inputs are images. We can think of the Siamese neural network as a wrapper around the twin networks: by measuring the similarity between the two output vectors, it helps the twin networks learn better vector representations.

In the above diagram, x₁ and x₂ are the two objects that we want to compare, and v₁ and v₂ are the vector representations of x₁ and x₂. The architecture of the comparison layers depends on the loss function and the labels of the training data. Since our goal is to have as much information in the vector representations as possible, the comparison layers usually have very simple architecture. The following are some of the common choices:

  • Computing the cosine similarity between v₁ and v₂
    The labels are real numbers between -1 and 1, and the loss function is mean squared error.
  • Concatenating v₁, v₂ and the absolute element-wise difference |v₁ − v₂|, followed by fully-connected layers and a softmax layer
    This is for multi-class classification, and the loss function is cross entropy.
  • Computing the Euclidean distance ‖v₁ − v₂‖ between v₁ and v₂
    The loss function is contrastive loss or triplet loss. Since these two losses are less well known, we will briefly explain them in the next section.

Notice that the comparison layers should be symmetric w.r.t. v₁ and v₂. Combined with the fact that the twin networks have identical architecture and weights, this makes the whole Siamese neural network symmetric w.r.t. x₁ and x₂.
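
To make the architecture concrete, here is a minimal PyTorch sketch of a Siamese network that uses the second comparison option above (concatenating v₁, v₂ and |v₁ − v₂|, followed by fully-connected layers). The small convolutional twin network and the layer sizes are illustrative assumptions, not part of the original architecture.

import torch
import torch.nn as nn

class SiameseClassifier(nn.Module):
    """Siamese network whose comparison layers classify a pair of inputs."""

    def __init__(self, embedding_dim=128, num_classes=2):
        super().__init__()
        # Twin network: a single module, so both inputs share the same weights.
        self.twin_network = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embedding_dim),
        )
        # Comparison layers: operate on [v1, v2, |v1 - v2|] and output class logits
        # (the softmax is folded into the cross-entropy loss during training).
        self.comparison_layers = nn.Sequential(
            nn.Linear(3 * embedding_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x1, x2):
        v1 = self.twin_network(x1)
        v2 = self.twin_network(x2)
        features = torch.cat([v1, v2, (v1 - v2).abs()], dim=1)
        return self.comparison_layers(features)

Training such a model with cross-entropy loss on same/different (or multi-class) pair labels fine-tunes the twin network's embeddings for the task at hand.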

Loss Functions

Both contrastive loss and triplet loss are distance-based loss functions that are mainly used for learning vector representations, and are often used in conjunction with Siamese neural networks.

Contrastive Loss

Assume our dataset consists of different classes of objects. For example, the ImageNet dataset consists of images of cars, images of dogs, etc. Then for every pair of inputs (x₁, x₂),

  • If x₁ and x₂ belong to the same class, we want their vector representations to be similar. Therefore we want to minimize L = ‖v₁ − v₂‖².
  • On the other hand if x₁ and x₂ belong to different classes, we want ‖v₁ − v₂‖ to be large. The term we want to minimize is
    L = max(0, m − ‖v₁ − v₂‖)², where m is a hyperparameter called the margin. The idea of the margin is that, when v₁ and v₂ are sufficiently far apart, L is already 0 and cannot be minimized further. Hence the model will not waste effort on separating v₁ and v₂ any further, and will focus on other input pairs instead.

We can combine these two cases into a single formula:

L = y * ‖v₁ − v₂‖² + (1 − y) * max(0, m − ‖v₁ − v₂‖)²

where y = 1 if x₁ and x₂ belong to the same class, y = 0 otherwise.
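
As a sketch, this combined formula translates directly into PyTorch, assuming v1 and v2 are batches of embeddings and y is a tensor of 0/1 labels:

import torch

def contrastive_loss(v1, v2, y, margin=1.0):
    """Contrastive loss; y = 1 for same-class pairs, y = 0 otherwise."""
    d = torch.norm(v1 - v2, dim=1)                            # Euclidean distance per pair
    same = y * d.pow(2)                                       # pull same-class pairs together
    diff = (1 - y) * torch.clamp(margin - d, min=0).pow(2)    # push different-class pairs apart, up to the margin
    return (same + diff).mean()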

Triplet Loss

Triplet loss takes the above idea one step further by considering triplets of inputs (xₐ, xₚ, xₙ). Here xₐ is an anchor object, xₚ is a positive object (i.e. xₐ and xₚ belong to the same class), and xₙ is a negative object (i.e. xₐ and xₙ belong to different classes). Our goal is to make the vector representation vₐ more similar to vₚ than to vₙ. The precise formula is given by

L = max(0, m + ‖vₐ − vₚ‖ − ‖vₐ − vₙ‖)

where m is the hyperparameter margin. Just like the case for contrastive loss, the margin determines when the difference between ‖vₐ − vₚ‖ and ‖vₐ − vₙ‖ has become big enough that the model will no longer adjust its weights for this triplet.
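
A direct translation of this formula into a PyTorch sketch, where va, vp and vn are batches of anchor, positive and negative embeddings:

import torch

def triplet_loss(va, vp, vn, margin=1.0):
    """Triplet loss: pull the anchor towards the positive and away from the negative."""
    d_pos = torch.norm(va - vp, dim=1)   # anchor-positive distance
    d_neg = torch.norm(va - vn, dim=1)   # anchor-negative distance
    return torch.clamp(margin + d_pos - d_neg, min=0).mean()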

For both contrastive loss and triplet loss, how we sample the input pairs (x₁, x₂) and the triplets (xₐ, xₚ, xₙ) from different classes of objects has a great impact on the model training process. Ideally we want input pairs and triplets that are not too easy but also not too hard for our model.

Implementation

Even though we have twin networks A and B in the previous diagram, in practice it is usually more convenient to keep just a single copy of the twin network and call it on both inputs, as in the following PyTorch-style forward method:

class SiameseNetwork(nn.Module):
    def forward(self, x1, x2):
        # Both inputs go through the same module, so the weights are shared.
        v1 = self.twin_network(x1)
        v2 = self.twin_network(x2)
        return self.comparison_layers(v1, v2)

Both TensorFlow and PyTorch have some of the common loss functions built-in.

  • TensorFlow
  • PyTorch
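
For instance, the cosine-similarity and triplet-loss setups described above roughly correspond to torch.nn.CosineEmbeddingLoss and torch.nn.TripletMarginLoss in PyTorch. A brief sketch, using random tensors as stand-ins for embedding batches:

import torch
import torch.nn as nn

v1, v2 = torch.randn(8, 128), torch.randn(8, 128)                           # placeholder embedding batches
va, vp, vn = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)

# Cosine-similarity-based loss: the target is +1 for similar pairs, -1 for dissimilar ones.
cosine_loss = nn.CosineEmbeddingLoss(margin=0.0)
print(cosine_loss(v1, v2, torch.ones(8)))

# Triplet loss with Euclidean distance and margin m = 1.0.
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
print(triplet_loss(va, vp, vn))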

Further Reading

  • [1] is a nice overview of Siamese neural networks. In particular, it contains many references to applications of Siamese neural networks in different fields, such as image analysis, text mining, biology, medicine and health.
  • [3] uses a Siamese neural network to fine-tune the vector representations produced by BERT. This is a good example of making use of pre-trained models.
  • Triplet loss is very popular with Siamese neural networks, and variants of it have been introduced in the mini-batch setting. [4] selects “semi-hard” negatives within a batch for every anchor-positive pair, and trains the network only on these triplets. The concept of “semi-hard” negatives is best illustrated by the following diagram:
[5]: The three types of negatives, given an anchor and a positive
  • [2] selects the hardest positive and the hardest negative within a batch for every anchor when forming triplets. Hardest here means the largest ‖vₐ − vₚ‖ for the positive, or the smallest ‖vₐ − vₙ‖ for the negative. We refer to [5] for a more detailed explanation of these two triplet loss variants, and to the TensorFlow Addons documentation for using them in TensorFlow; a minimal usage sketch follows below.
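
For reference, a minimal sketch of using these mini-batch triplet losses via TensorFlow Addons (assuming tensorflow and tensorflow-addons are installed; the toy embedding model below is an arbitrary illustration, not a recommendation):

import tensorflow as tf
import tensorflow_addons as tfa

# A toy embedding model; any network that outputs embedding vectors will do.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(64),
    tf.keras.layers.Lambda(lambda v: tf.math.l2_normalize(v, axis=1)),  # L2-normalized embeddings
])

# Semi-hard mining as in [4]; tfa.losses.TripletHardLoss implements the batch-hard variant of [2].
model.compile(optimizer="adam", loss=tfa.losses.TripletSemiHardLoss(margin=1.0))
# model.fit(images, class_labels, ...)  # labels are class ids; triplets are formed within each batch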

References

  1. D. Chicco. Siamese Neural Networks: An Overview (2020), Artificial Neural Networks, Methods in Molecular Biology, vol 2190, Springer Protocols, p. 73–94
  2. A. Hermans, L. Beyer, and B. Leibe. In Defense of the Triplet Loss for Person Re-Identification (2017), arXiv
  3. N. Reimers and I. Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019), EMNLP-IJCNLP 2019
  4. F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering (2015), CVPR 2015
  5. Triplet Loss and Online Triplet Mining in TensorFlow
