
Paper explained: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

The secrets of why the SwAV model architecture works so well for self-supervised pre-training

Teaching a neural network to understand the world around it without human supervision has been one of the north stars of the computer vision research community for years. Recently, multiple publications have shown the potential of novel methods to make significant advancements in this area.

One of the most promising methods is presented in the paper "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments" by Caron et al., published in 2020 [2]. In this article, we will go over the ideas introduced by the authors, the SwAV model architecture, and the resulting implications for self-supervised pre-training. I’ve tried to keep the article simple so that even readers with little prior knowledge can follow along. Without further ado, let’s dive in!

A visualization of the training procedure of SwAV. Source: [1]

Pre-requisites: Self-supervised pre-training for computer vision

Before we go deeper into the SwAV paper, it’s worth quickly revisiting what self-supervised pre-training is all about. If you are already familiar with the concept, feel free to skip this part.

Traditionally, computer vision models have been trained using supervised learning: humans look at images and create all sorts of labels for them so that the model can learn the patterns behind those labels. For example, a human annotator assigns a class label to an image or draws bounding boxes around the objects in it. But as anyone who has ever been in contact with labeling tasks knows, the effort required to create a sufficient training dataset is high.

In contrast, self-supervised learning does not require any human-created labels. As the name suggests, the model learns to supervise itself. In computer vision, the most common way to model this self-supervision is to take different crops of an image, or apply different augmentations to it, and pass the modified inputs through the model. Even though the resulting views no longer look identical, they still contain the same visual information, i.e., the same object, and the model is trained to recognize this. As a result, the model learns a similar latent representation (an output vector) for the same objects.
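To make this idea concrete, here is a minimal sketch in PyTorch of the general principle, not of SwAV itself: two augmented views of the same image are passed through one encoder, and the training signal pulls their embeddings together. The augmentation choices and the 128-dimensional output are illustrative assumptions.

```python
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.models import resnet50

# Illustrative augmentation pipeline: each call produces a different view.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
])

encoder = resnet50(num_classes=128)  # hypothetical 128-d embedding head

def view_similarity_loss(image):
    # image: float tensor of shape (C, H, W).
    # Two differently augmented views of the same image...
    v1, v2 = augment(image), augment(image)
    z1 = F.normalize(encoder(v1.unsqueeze(0)), dim=1)
    z2 = F.normalize(encoder(v2.unsqueeze(0)), dim=1)
    # ...should map to similar embeddings (high cosine similarity).
    return -(z1 * z2).sum(dim=1).mean()
```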

We can later apply transfer learning to this pre-trained model. Typically, the model is then fine-tuned on a small labeled subset of the data (e.g., 10%) to perform downstream tasks such as object detection and semantic segmentation.
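As a rough sketch of what this transfer step can look like (the checkpoint name and class count below are made up), a common protocol freezes the pre-trained backbone and trains only a small head on the labeled subset:

```python
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(num_classes=128)  # stands in for the pre-trained encoder
# backbone.load_state_dict(torch.load("swav_pretrained.pth"))  # hypothetical checkpoint

for p in backbone.parameters():
    p.requires_grad = False           # keep the pre-trained features fixed

head = nn.Linear(128, 1000)           # train only this layer on the labeled data
model = nn.Sequential(backbone, head)
```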

How SwAV uses unsupervised learning to understand the visual world around it

The training pipeline for SwAV is separated into multiple stages.

In the first part, different crops of an image are created. Whereas previous techniques cropped an image at a fixed size, SwAV introduces a multi-crop strategy: not only two larger crops are created, but several additional smaller, low-resolution crops are taken as well. This is one of the keys to why SwAV can boost its performance compared to other approaches. Other image augmentations such as blurs and flips can be applied, too.

SwAV creates different sized crops from the same image for training. This is called a multi-crop strategy. Source: Own curation with image from Unsplash
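A minimal sketch of such a multi-crop pipeline using torchvision transforms could look as follows; the crop sizes, scale ranges, and counts are illustrative rather than the paper’s exact settings:

```python
import torchvision.transforms as T

# Two large "global" crops covering most of the image...
global_crop = T.Compose([
    T.RandomResizedCrop(224, scale=(0.14, 1.0)),
    T.RandomHorizontalFlip(),
    T.GaussianBlur(kernel_size=23),
])
# ...plus several small "local" crops at lower resolution.
local_crop = T.Compose([
    T.RandomResizedCrop(96, scale=(0.05, 0.14)),
    T.RandomHorizontalFlip(),
])

def multi_crop(image, n_global=2, n_local=4):
    return [global_crop(image) for _ in range(n_global)] + \
           [local_crop(image) for _ in range(n_local)]
```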

Next, the crops are passed through a convolutional neural network. In the paper, the authors use the popular ResNet-50 architecture and observe increased performance as they widen the model. For each crop, the ResNet outputs a feature vector of dimension 128.

This visualization shows the input of each of the different crops of the image into the ResNet, which outputs a vector representation for each of the images. Source: [1]
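A sketch of this encoder, assuming a standard torchvision ResNet-50 with its classification layer removed and a small projection head on top (the exact head layout is an assumption):

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class Encoder(nn.Module):
    def __init__(self, out_dim=128):
        super().__init__()
        resnet = resnet50()
        resnet.fc = nn.Identity()         # keep the 2048-d pooled features
        self.backbone = resnet
        self.projection = nn.Sequential(  # small MLP head (layout assumed)
            nn.Linear(2048, 2048),
            nn.ReLU(inplace=True),
            nn.Linear(2048, out_dim),
        )

    def forward(self, x):
        z = self.projection(self.backbone(x))
        return F.normalize(z, dim=1)      # unit-length 128-d feature per crop
```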

Once these feature vectors are computed, they are matched against so-called prototype vectors. These are learnable vectors that span the space of possible distinctions between visual features; the difference between a dog and a cat, for example, is captured by different prototypes. The mapping of the features onto the prototypes is formulated to maximize the similarity between the image features and the prototypes. The output of this process is a cluster assignment, also called a code.

The matching of output feature vectors into the prototype vector space results in the assignment of the vectors to the prototype clusters. Source: [1]
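In code, this matching step can be sketched as follows: the prototypes are a learnable matrix, and the assignment is computed with a few Sinkhorn-Knopp normalization iterations that spread the batch roughly evenly across clusters. The number of prototypes and iterations below are assumptions for illustration:

```python
import torch
import torch.nn as nn

K, D = 3000, 128                          # number of prototypes, feature dim
prototypes = nn.Linear(D, K, bias=False)  # each row acts as a prototype vector

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    # scores: (batch, K) similarities between features and prototypes.
    Q = torch.exp(scores / eps).T         # (K, batch)
    Q /= Q.sum()
    n_clusters, n_samples = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)   # balance mass across clusters...
        Q /= n_clusters
        Q /= Q.sum(dim=0, keepdim=True)   # ...and across samples
        Q /= n_samples
    return (Q * n_samples).T              # (batch, K) soft assignments ("codes")

features = torch.nn.functional.normalize(torch.randn(32, D), dim=1)
codes = sinkhorn(prototypes(features))
```

The even spreading is important: it prevents the trivial solution in which every image collapses onto the same prototype.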

Now that we have created these cluster assignments, we make an assumption: if two crops were taken from the same image, the model should be trained in such a way that it outputs similar cluster assignments for both of them.

A simplified visualization of the matched cluster assignment vectors. Source: [1]

Therefore, it should be possible to swap the two assignment vectors (crop 1 gets vector 2, crop 2 gets vector 1) and train the model to predict the assignment of the respective other crop.

Swap of the cluster assignments. This becomes the new training target. Source: [1]

Each swapped vector thus becomes the prediction target for the other input, and the resulting error is what we backpropagate through the model’s weights.
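Putting it together, the swapped prediction can be written as a cross-entropy between one crop’s code and the softmaxed prototype scores of the other crop; this is a sketch, and the temperature value is an assumption:

```python
import torch.nn.functional as F

def swapped_loss(scores_1, scores_2, codes_1, codes_2, temp=0.1):
    # scores_i: (batch, K) prototype similarities for crop i
    # codes_i:  (batch, K) Sinkhorn assignments for crop i (used as targets)
    log_p1 = F.log_softmax(scores_1 / temp, dim=1)
    log_p2 = F.log_softmax(scores_2 / temp, dim=1)
    # Crop 1 is trained to predict crop 2's code, and vice versa.
    return -(codes_2 * log_p1).sum(dim=1).mean() \
           - (codes_1 * log_p2).sum(dim=1).mean()
```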

This mechanism allows SwAV to learn consistent vector representations for the same image or object class. This ability will prove to be crucial for downstream transfer learning.

Now, let’s look at some results.

Results

When the SwAV model is evaluated on linear classification on ImageNet using just the frozen weights from pre-training, it outperforms all other state-of-the-art self-supervised methods.

Performance of SwAV on the ImageNet dataset. Source: [2]

Most impressively, the self-supervised model almost matches the performance of a fully supervised model, despite using no annotations during pre-training. All evaluations were conducted using a ResNet-50 (R50) backbone. The graph on the right shows that SwAV’s advantage over other methods also holds as the number of parameters increases.

Evaluation of SwAV in comparison to a fully-supervised training procedure for different datasets. Source: [2]

This performance advantage for linear classification holds true on other datasets as well. Remarkably, when used as the backbone of an object detection model such as Mask R-CNN and fine-tuned with only 10% of the labeled data in ImageNet, all object detection architectures shown outperform a fully-supervised training of the same model architectures. This means self-supervised pre-training of an object detection backbone can lead to better performance than training the model fully supervised.

Wrapping it up

In this article, you have learned about SwAV, a method that leverages self-supervised pre-training to achieve new heights in performance on downstream tasks such as object detection. It even outperforms fully-supervised approaches on some tasks. While I hope this story gave you a good first insight into the paper, there is still so much more to discover. Therefore, I would encourage you to read the paper yourself, even if you are new to the field. You have to start somewhere 😉

If you are interested in more details on the method presented in the paper, feel free to drop me a message on Twitter, my account is linked on my Medium profile.

I hope you’ve enjoyed this paper explanation. If you have any comments on the article or if you see any errors, feel free to leave a comment.

And last but not least, if you would like to dive deeper into the field of advanced computer vision, consider becoming a follower of mine. I try to post a story once a week to keep you and anyone else interested up-to-date on what’s new in computer vision research!


References:

[1] SwAV GitHub Implementation: https://github.com/facebookresearch/swav

[2] Caron, Mathilde, et al. "Unsupervised learning of visual features by contrasting cluster assignments." arXiv preprint arXiv:2006.09882 (2020). https://arxiv.org/pdf/2006.09882.pdf

