In a previous post I wrote about how you could use a Mask R-CNN model to detect and segment out articles of clothing for use by some second stage model. For this post I built an example of that second stage model using a PyTorch siamese neural network. The idea is that by combining these two models you can take a raw image, segment out just the articles of clothing, and then match those articles against a database of clothing items to find similar ones.
So the first part of this post will focus on building the siamese network, and near the end I will show an example using the output from my segmentation model, including how the siamese network can generalize beyond shoes to items it has never seen before.
What Are Siamese Networks, and Why Should We Care?
Rather than classifying what an object is, the goal of a siamese style network is to determine whether items are similar or different. These items can really be anything: Facebook uses siamese networks for facial recognition, the original application was signature verification, and I have used them to compare artists' styles and game characters.
You train a siamese network by initializing two network "towers" whose weights are held equal and conducting pairwise comparisons, feeding one input through each tower. The network is trained to determine whether the two objects it is presented with are the same or different by minimizing the distance between the embeddings of similar inputs and maximizing the distance between those of dissimilar ones.
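To make the weight sharing concrete, here is a minimal PyTorch sketch of the idea; the layer sizes are illustrative, not the ones from my notebook. In practice the "two towers" are usually a single module applied twice, which keeps the weights equal by construction:

```python
import torch
import torch.nn as nn

class SiameseNetwork(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        # One tower; applying it to both inputs shares the weights.
        self.tower = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embedding_dim),
        )

    def forward_once(self, x):
        return self.tower(x)

    def forward(self, input1, input2):
        # Both images pass through the same weights and come out as embeddings.
        return self.forward_once(input1), self.forward_once(input2)
```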

Siamese networks are not answering the classic classification question, "What is X?" Rather, they answer the question, "Are objects X and Y similar?"
With that said, I think it is fair to ask why you should bother building a siamese architecture for pairwise comparison when you could just train a standard network to do these identifications.
In response, siamese networks have a number of properties that make them quite useful. The main one is that they can generalize beyond what they were trained on and tell the difference between classes they have never been exposed to. A related property is that, because they generalize to unseen classes, siamese networks do not need to be retrained every time they are required to look at something new.
See below for some examples of shoes that the network believes to be similar and then different.

In the above example the model generates an embedding for each shoe, and the Euclidean distance between the embeddings is small, so the network believes that these shoes are fairly similar.

Technical Implementation
For this post I built my siamese network using PyTorch in a Jupyter notebook. I wrote a custom dataset class, trained the network from scratch, and tested it in the notebook as well.
Feel free to check out the notebook on GitHub here.
I gathered 220 images of shoes from online sources and grouped them into six categories: "heel", "boot", "loafer", "sandal", "sneaker", and "flat". The siamese network is trained by being presented with pairs of shoes and having to decide whether the two images are from the same category.
The SiameseNetworkDataset class in the Jupyter notebook references a given image by index and pairs it with another image from the dataset. It returns the image pair and a target of 0 or 1 depending on whether or not they match. During training this dataset is called to generate samples, which shows another useful characteristic of siamese networks: the effective size of the dataset is much larger than its actual size. In this case I had 200 or so training images, so when you select that first image there are 199 other images to pair it against, and all of those pairs are valid training comparisons. In general this characteristic lets you train siamese networks with very little data, though more data is still better.
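The real class is in the notebook; a minimal sketch of the pairing logic might look like this, assuming a list of (image, label) samples and using the convention that a target of 0 marks a same-category pair:

```python
import random
import torch
from torch.utils.data import Dataset

class SiameseNetworkDataset(Dataset):
    def __init__(self, samples, transform=None):
        self.samples = samples  # list of (image, category_label) pairs
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        img1, label1 = self.samples[index]
        # Flip a coin so roughly half the pairs are same-category;
        # otherwise most random pairs would be mismatches.
        want_same = random.random() < 0.5
        while True:
            img2, label2 = random.choice(self.samples)
            if (label1 == label2) == want_same:
                break
        if self.transform is not None:
            img1, img2 = self.transform(img1), self.transform(img2)
        # Target 0 for a matching pair, 1 for a mismatched pair.
        target = torch.tensor(0.0 if label1 == label2 else 1.0)
        return img1, img2, target
```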
The contrastive loss function measures the Euclidean distance between the two output embeddings and adjusts the network accordingly, shrinking the distance between similar images while pushing dissimilar inputs further apart.
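A common formulation of this loss (following Hadsell et al.) can be implemented like so; the margin value here is an assumption, not necessarily the one I used:

```python
import torch
import torch.nn.functional as F

class ContrastiveLoss(torch.nn.Module):
    def __init__(self, margin=2.0):
        super().__init__()
        self.margin = margin

    def forward(self, output1, output2, target):
        # Euclidean distance between the two embeddings.
        dist = F.pairwise_distance(output1, output2)
        # Similar pairs (target 0) are pulled together; dissimilar pairs
        # (target 1) are pushed apart until they clear the margin.
        loss = (1 - target) * dist.pow(2) + \
               target * torch.clamp(self.margin - dist, min=0.0).pow(2)
        return loss.mean()
```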

The training loop ran for 50 epochs and took around an hour to complete on my NVIDIA GTX 1080 GPU machine.
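Wiring the sketches above together, the loop itself is standard PyTorch; the optimizer choice, learning rate, and batch size here are placeholders rather than my exact settings, and `train_dataset` stands in for the shoe dataset built earlier:

```python
import torch
from torch.utils.data import DataLoader

net = SiameseNetwork().cuda()
criterion = ContrastiveLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=5e-4)
# train_dataset is a SiameseNetworkDataset built from the shoe images.
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

for epoch in range(50):
    for img1, img2, target in loader:
        img1, img2, target = img1.cuda(), img2.cuda(), target.cuda()
        optimizer.zero_grad()
        output1, output2 = net(img1, img2)
        loss = criterion(output1, output2, target)
        loss.backward()
        optimizer.step()
```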
Cool… But Somewhat Impractical?
I agree with this. In practice it is very impractical to do pairwise comparisons against everything in a database: with millions of images, finding a match could take hours or days, which is terrible. So we need a way to use the ability of siamese networks to determine similarity that isn't terrible.
Stepping back for a second: in many areas of machine learning we find that condensing the representation of a sample in a meaningful way can help our analysis significantly. Raw images are pretty hefty; even after resizing a color image to 224×224 pixels you have roughly 150K values to look through (224x224x3 = 150,528). They are also very sparse in terms of the information that actually matters to us here. So if you could create a useful embedding that represents the image in a less complicated space, you could match similar images using other methods/models pretty quickly.
Siamese networks fill this embedding role well because of how they are optimized to determine similarity. You can feed in those ~150K raw image values and get an output embedding of 128 values (the length of my network's output vector). Then when you test that image against another, you simply compare the 128 values of the two output embeddings.
So for example, take the below image of the shoe and pass it through the siamese network. The shoe goes in as a 224x224x3 array and comes out as a single vector of length 128. I find this to be quite elegant!
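In code, that transformation is a single pass through one tower. Here `shoe_tensor`, `shoe_a`, and `shoe_b` are placeholders for preprocessed 3x224x224 images:

```python
import torch
import torch.nn.functional as F

net.eval()
with torch.no_grad():
    # Add a batch dimension, run one tower, get a (1, 128) embedding.
    embedding = net.forward_once(shoe_tensor.unsqueeze(0).cuda())

# Comparing two shoes is then just a distance between two such vectors.
with torch.no_grad():
    e1 = net.forward_once(shoe_a.unsqueeze(0).cuda())
    e2 = net.forward_once(shoe_b.unsqueeze(0).cuda())
    distance = F.pairwise_distance(e1, e2).item()  # small value = similar
```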

In a large scale environment you would use a siamese-trained network to extract feature embeddings for all the images in the database, and then build other models or apply other methods to match against that database. This leverages the fact that similar images fed through the siamese network produce similar embeddings, and that this extends to images and classes the network has never been exposed to.
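As a sketch of what that might look like, assuming a hypothetical `catalog_images` tensor of preprocessed database images, you could embed the catalog once and answer queries with a simple nearest-neighbor lookup:

```python
import torch

net.eval()
with torch.no_grad():
    # One-time pass: embed the whole catalog, shape (N, 128).
    catalog_embeddings = net.forward_once(catalog_images.cuda())

def top_k_matches(query_image, k=5):
    """Return indices of the k catalog items closest to the query image."""
    with torch.no_grad():
        query = net.forward_once(query_image.unsqueeze(0).cuda())  # (1, 128)
        dists = torch.cdist(query, catalog_embeddings).squeeze(0)  # (N,)
    return torch.topk(dists, k, largest=False).indices
```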
Discussing Siamese Network Results
I have shown some of the successes of this network, but it is important to show its limitations too. Since this network was trained on a very small dataset, it still does not evaluate well in a lot of cases.



These three quick examples show some weaknesses of a network trained on a small set of images. With more data and longer training time these would likely be addressed.
Another issue is that images of objects that are technically similar, when viewed from different angles, will return as dissimilar. This gets solved in cases like facial recognition by providing the model multiple angles of a person's face to optimize against, which lets it handle side views or partial obstruction. In the below image I labeled both shoes as a "loafer", so they should technically return as similar; however, the model evaluates them as being quite different.

Another interesting thing I found is that when I used the bounding box crop instead of the segmented boot from the Mask R-CNN model, it performed very similarly to the segmented image of the boot. This might be because the background is fairly clean, but if it holds up on other images it means you might not need a full Mask R-CNN model: a lighter weight object detection model could perform well, which would be a cool finding if it remains a consistent trend!

However, I digress. The goal of this post is to show that you can segment an image with a first stage model, pass the result to a second stage siamese network, and then match it against some known image. While this network is small and has issues that could be improved, it still performed well when tested in combination with my Mask R-CNN image segmentation model, so without further ado…
Welcome to the Jungle!
Say, for instance, a company wanted to take an image from the wild, submitted by a user, and return items from its store similar to the clothing in that image. Feeding in a raw image like the one below would be fairly difficult because there is a very high amount of "noise" in the image, where noise means everything that isn't an article of clothing. Even if you were able to classify the different types of clothing in the image, it would be hard to return good matches without localizing to those items first.

So to solve that signal-to-noise problem, you could use an image segmentation model like my custom trained multi-class Mask R-CNN model, which goes through the image and localizes the items of clothing.

Then you can use the generated image masks to extract just the items of clothing.

The final step would be to call the siamese network that I outlined in this post on those items so you can look for similar known items of clothing.
Here is an example using those pairwise comparisons for easier illustration. The shoe on the left represents the known shoe we are testing against, and the shoe on the right is a boot that was extracted from the raw input image above.

This first image pair does not have a very good score, so we would say they are not similar.

This set does much better with a score of .43. In this case you could return the boot in the database, a Stuart Weitzman boot, as being similar to the boot worn by the model in the raw input image, which is also a Stuart Weitzman boot.
Generalizing Beyond Their Training
I have stated that siamese networks are useful because of their ability to generalize beyond what they were trained on. To showcase this, there is also a checkered handbag in the original image I ran my segmentation model over. Since I had this handbag extracted from the original image, I ran a few tests and found that the shoe-trained siamese network performed well at finding similar bags.
The first image shows the extracted handbag on the right, matched with a black handbag on the left. The network finds them to be fairly dissimilar, with a score of .77.

Now onto the second image. Once again we have the extracted handbag on the right; on the left is an image of a checkered handbag I found online. The model gives these a dissimilarity score of .43, which means it thinks they are fairly similar.

This last bit was a cool little test that I was unsure my siamese network, trained on a small set of images, would be able to pass. I am guessing it helps that the checkered pattern makes the bag fairly distinctive. Still, it illustrates the idea that siamese networks can generalize to types of images they have not been explicitly trained on.
Closing Thoughts
In this post I walked through what siamese networks are, why they are useful, how I built this one, and how to use one as a second stage model to match items from an image in the wild against a repository of clothes.
This post, in combination with my previous image segmentation post, shows how to build a full end-to-end deep learning pipeline that addresses some very real business concerns in a variety of areas, not just clothing. It does not even have to use the specific models I showcased here.
For the first stage you may not need a full image segmentation model; a simple object detection model might work. That would also make training much faster, since it is easier to build object detection datasets than full image segmentation ones.
For the second stage I showcased how a siamese network can be used to build feature embeddings. I really like how siamese networks generalize well beyond what they were trained on, but you could also take a model trained to classify images in that particular domain and repurpose it for feature extraction. Personally I like using a siamese network here because it is trained explicitly to do similarity comparisons rather than being repurposed for that job.