Illustrated Guide to Siamese Networks

Using triplet loss for one-shot learning

Pranjal Gupta
Towards Data Science


Building an accurate machine learning model often requires a lot of training data, and even then the model may over-fit or only work for the limited set of classes it was trained on. But what if a machine could “learn to learn”? For example, if we show our machine a single image of a parrot, it could accurately identify another image of a parrot by understanding how similar it is to the reference image. This problem of designing a machine that can “learn to learn” is called meta-learning.

In this article, we will focus on a famous meta-learning model, the Siamese network: how it works and how to implement it.

Siamese, as the name suggests, comes from ‘Siamese twins’: we use two or more networks (here CNNs, since we are working with images) that share weights, with the intention of learning similarity and dissimilarity between images. The network outputs an n-dimensional embedding in which each direction represents some visual pattern of the image.

Understanding embeddings

Suppose we use an n-dimensional space to map our image, where each dimension corresponds to the value of a particular trait or pattern; each dimension narrates a unique visual feature of the input image. Check out this amazing article by Jay Alammar for a better intuition of embeddings. For example, in the case of different animal images, an output embedding might look like this (intensity of colour denotes its value between 0 and 1):

Image by author

The first two images, of dogs, output similar embeddings, whereas the third and fourth, of a cat and a llama, output embeddings very different from the dog’s, because they contain a very different bag of visual features.
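To make this concrete, here is a toy sketch of how such embeddings can be compared. The values are hand-made for illustration, not outputs of a trained model:

import numpy as np

# Hypothetical 4-dimensional embeddings; values are made up to
# illustrate the idea, not produced by any trained network.
dog_1 = np.array([0.9, 0.1, 0.8, 0.0])
dog_2 = np.array([0.8, 0.2, 0.9, 0.1])
cat = np.array([0.1, 0.9, 0.2, 0.7])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(dog_1, dog_2))  # close to 1: similar images
print(cosine_similarity(dog_1, cat))    # much lower: dissimilar images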

In this article, we will be working on the Omniglot dataset, which contains images of handwritten characters from alphabets used across the world in different languages.

Omniglot dataset

How It Works

CNN

CNN architecture inspired by this paper

Our CNN outputs a 1-D array of the desired embedding size. Note that the last layer performs L2 normalization: it normalizes the output vector and maps it onto the surface of an n-dimensional hypersphere of radius 1. This is done so that the similarity between images can be compared by calculating the distance between two embeddings; since all embeddings reside on the same surface, the distances are directly comparable. Our model has three of these CNNs, all sharing the same weights. This helps our model learn one similarity and one dissimilarity from each sample, where each sample is a triplet consisting of an anchor, a positive and a negative image. More on this in a while.

Final Siamese Network Architecture
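As a rough sketch of this architecture (assuming TensorFlow/Keras and 105×105 grayscale Omniglot inputs; the exact layers in the linked implementation may differ), the embedding CNN and the weight sharing could look like this:

import tensorflow as tf
from tensorflow.keras import layers

def build_embedding_cnn(embedding_size=64):
    inputs = tf.keras.Input(shape=(105, 105, 1))
    x = layers.Conv2D(64, 7, activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(128, 5, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(128, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(embedding_size)(x)
    # L2 normalization maps every embedding onto the unit hypersphere.
    outputs = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(x)
    return tf.keras.Model(inputs, outputs)

# One network, reused for all three inputs, gives the shared weights.
embedder = build_embedding_cnn()
anchor_in = tf.keras.Input(shape=(105, 105, 1))
positive_in = tf.keras.Input(shape=(105, 105, 1))
negative_in = tf.keras.Input(shape=(105, 105, 1))
siamese = tf.keras.Model(
    [anchor_in, positive_in, negative_in],
    [embedder(anchor_in), embedder(positive_in), embedder(negative_in)],
)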

Loss Function

Now comes the most important part of building this network. Here we define how it compares output embeddings and quantifies the similarity/dissimilarity between them, and thus fulfils the task of meta-learning.

Since our output embeddings are mapped onto the surface using the L2 normalization discussed earlier, we can use either L2 distance or cosine similarity. Using triplet loss allows our model to map two similar images close to one another, and far from a dissimilar sample image. This approach is implemented by feeding the network triplets consisting of:

1. Anchor Image: a sample image.

2. Positive Image: another variation of the anchor image. This helps the model learn the similarities between the two images.

3. Negative Image: an image different from the two similar images above. This helps our model learn dissimilarities with the anchor image.

To increase the distance between similar and dissimilar output vectors, and to map similar images even closer to one another, we introduce one more term called the margin. It increases the separation between the similar and dissimilar vectors and also rules out trivial solutions (such as mapping every image to the same point). Since it is difficult for humans to imagine an embedding mapped onto an N-dimensional sphere, and thus to understand how this loss works, I made the following render to build intuition for N=3 (our much more familiar 3D).

This illustration describes our model’s output after training. Similar images, marked as green points, are mapped close to one another, and dissimilar images, marked as red, are mapped far apart, with the minimum separation being the size of the margin, shown in yellow. In an ideal case after training, no point should be mapped in the yellow region of the anchor image. Nevertheless, points often tend to have an overlapping region, since the space on the surface is limited. More on this is discussed later.

Our model uses a margin of 0.2, as used in the paper.

This similarity/dissimilarity is measured as the distance between the two vectors, using either L2 distance or cosine distance.

So our loss is defined as follows:

loss(a, p, n) = max(0, d(a, p) - d(a, n) + margin)

where a, p and n are the anchor, positive and negative embeddings, and d is the chosen distance (L2 or cosine).

Cosine Triplet Loss

L2 Triplet Loss
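As a minimal sketch of the two variants (assuming TensorFlow and L2-normalized embeddings of shape (batch, embedding_size); the linked implementation may differ in detail):

import tensorflow as tf

MARGIN = 0.2  # margin size used in the paper

def l2_triplet_loss(anchor, positive, negative, margin=MARGIN):
    # Squared L2 distances between anchor/positive and anchor/negative.
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=1)
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))

def cosine_triplet_loss(anchor, positive, negative, margin=MARGIN):
    # For unit-length vectors the dot product is the cosine similarity,
    # so cosine distance is simply 1 - similarity.
    d_pos = 1.0 - tf.reduce_sum(anchor * positive, axis=1)
    d_neg = 1.0 - tf.reduce_sum(anchor * negative, axis=1)
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))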

Implementation

The tricky part of implementing this network is defining the loss function. We prepare the input data by sampling triplets of the desired batch size (a sketch of this sampling follows), define the network and our loss function, and TRAIN! The complete implementation can be found here.
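For illustration, the triplet sampling might look like the sketch below, assuming a hypothetical images_by_class dictionary that maps each character class to an array of its images (not the exact data pipeline of the linked implementation):

import numpy as np

def sample_triplets(images_by_class, batch_size):
    classes = list(images_by_class.keys())
    anchors, positives, negatives = [], [], []
    for _ in range(batch_size):
        # Pick two distinct classes: one for anchor/positive, one for negative.
        pos_cls, neg_cls = np.random.choice(len(classes), 2, replace=False)
        pos_imgs = images_by_class[classes[pos_cls]]
        neg_imgs = images_by_class[classes[neg_cls]]
        # Two different variations of the same character.
        a_idx, p_idx = np.random.choice(len(pos_imgs), 2, replace=False)
        anchors.append(pos_imgs[a_idx])
        positives.append(pos_imgs[p_idx])
        negatives.append(neg_imgs[np.random.randint(len(neg_imgs))])
    return np.array(anchors), np.array(positives), np.array(negatives)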

Results and Observations

After training the network with different hyper-parameters and loss functions, and testing its accuracy on n-way one-shot classification (i.e., against “n” single images, each of a different class that the network has never seen before), we can observe the following:

  1. Cosine loss tends to perform well for smaller embeddings.
  2. Increasing the size of the embedding increases accuracy, as more dimensions can represent more unique features of the image.
  3. Model accuracy decreases slightly as “n” in n-way one-shot increases, as it gets more difficult for the network to map a larger number of embeddings far apart on the hypersphere.
Accuracy comparison between various sizes of the output vector
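For reference, a single n-way one-shot trial can be sketched as follows, assuming embedder is the trained embedding model and the support set is arranged so that index 0 holds the image from the query’s true class (this mirrors the evaluation described above rather than the exact code in the linked implementation):

import numpy as np

def one_shot_trial(embedder, query_image, support_images):
    # support_images holds n images, one per unseen class; index 0 is
    # the single image from the query's true class.
    query_emb = embedder.predict(query_image[None, ...], verbose=0)[0]
    support_embs = embedder.predict(support_images, verbose=0)
    distances = np.linalg.norm(support_embs - query_emb, axis=1)
    return np.argmin(distances) == 0  # correct if nearest is the true class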

A follow-up article can be read that uses contrastive loss to make a Siamese network learn similarities only, using two different images of a character. It performs poorly and tends to over-fit on the training data, whereas the triplet-loss model does not over-fit at all. The accuracy of the contrastive-loss Siamese network falls drastically as “n” in n-way one-shot classification increases.

Accuracy comparison between contrastive and triplet loss for different sizes of the output vector
