BYOL - The Alternative to Contrastive Self-Supervised Learning

Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning by J. Grill et al.

🚀 Sascha’s Paper Club

In today’s paper analysis we will take a close look at the paper behind BYOL (Bootstrap Your Own Latent). It provides an alternative to contrastive self-supervised learning techniques for representation learning, removing the need for a large corpus of negative samples and gigantic batch sizes. Furthermore, it is a landmark paper on the path to understanding today’s state-of-the-art foundation models such as the DINO family, including DINOv2.

While contrastive self-supervised learning frameworks still feel kind of intuitive, BYOL can be confusing and intimidating at first. Therefore, it’s a great paper to analyze together. So let’s dive into it and strip it down to uncover its core ideas!

Image created from publication by Sascha Kirch

Paper: Bootstrap your own latent: A new approach to self-supervised learning by Jean-Bastien Grill et al., 13 Jun. 2020

Resources: GitHub

Category: similarity learning, representation learning, Computer Vision, foundation models

Other Walkthroughs: [CLIP] – [GLIP] – [Segment Anything] – [Depth Anything] – [DINO] – [DDPM]

Outline

  1. Context & Background
  2. Claimed Contributions
  3. Method
  4. Experiments
  5. Conclusion
  6. Further Readings & Resources

Context & Background

BYOL falls into the category of self-supervised representation learning via similarity learning. Self-supervised means that no explicit ground-truth labels are provided, but a supervision signal can be constructed from unlabeled data. Representation learning means the model learns to encode its input into a lower-dimensional and semantically rich representation space. And finally, in similarity learning, features that are similar are mapped close to each other in the latent representation space, while dissimilar features are mapped further apart. These representations are crucial in many deep learning tasks that build upon them, for example to generate new data or to perform classification, segmentation or monocular depth estimation.
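As a tiny, hypothetical illustration of this idea (the embedding values below are made up), similar inputs should end up close together in the representation space, which we can measure for instance via cosine similarity:

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings: two views of a cat and one view of a car.
z_cat_a = torch.tensor([0.9, 0.1, 0.0])
z_cat_b = torch.tensor([0.8, 0.2, 0.1])
z_car = torch.tensor([0.0, 0.2, 0.9])

# Similar concepts end up close together (cosine similarity close to 1) ...
print(F.cosine_similarity(z_cat_a, z_cat_b, dim=0))
# ... while dissimilar concepts end up far apart (cosine similarity close to 0).
print(F.cosine_similarity(z_cat_a, z_car, dim=0))
```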

Many successful methods, such as CLIP, GLIP, MoCo or SimCLR, use a contrastive learning approach. In contrastive learning, a score for matching data pairs is maximized, while a score for non-matching data is minimized. This process depends heavily on the batch size and the number of negative samples provided during training. This dependency makes data collection and training more challenging.

BYOL aims to:

  1. Get rid of the need for negative samples and large batch sizes as required for contrastive learning.
  2. Decrease the dependency on domain-specific augmentations so that the approach can be applied to other domains, such as language, in addition to images.

Among many references made in the paper, BYOL highlights its similarities to mean teacher, the momentum encoder and predictions of bootstrapped latents (PBL).

Claimed Contributions (According to Authors)

  1. Introduction of BYOL (Bootstrap your own latent), a self-supervised representation learning method that does not require negative pairs (as in contrastive learning)
  2. BYOL representations are shown to outperform the state-of-the-art (at the time of the paper’s release)
  3. BYOL is shown to be more resilient to changes in batch size and in the set of image augmentations compared to its contrastive counterparts


Method

Now that we have seen what BYOL claims to solve, let’s try to understand how this is achieved. First, let’s look at the architecture presented in Fig. 1.

Fig. 1: Framework architecture. Image Source + annotations by Sascha Kirch.

BYOL consists of two networks: the online network and the target network. The online network consists of three submodules, namely the encoder, the projector and the predictor. The target network consists of two submodules, namely the encoder and the projector. The encoder and projector of both networks share the exact same architecture; they only differ in their model weights. While the online network is optimized during training, the target network updates its weights via an exponential moving average of itself and the online network.

Encoder – The encoder consists of a ResNet convolutional neural network. It translates the input image into a latent representation.

Projector – Projects the encoder’s latent representation into a 256-dimensional space via a multi-layer perceptron (MLP) with a 4096-dimensional hidden layer. I guess the projector is not critical for the framework to work, but 256 is simply a convenient output dimension often used in the field of representation learning.

Predictor – Aims to predict the projected latent space of the target network from the projected latent space of the online network. Crucial to avoid representation collapse.
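To make this structure concrete, here is a minimal PyTorch-style sketch (not the official implementation; the 2048/4096/256 dimensions and the Linear–BatchNorm–ReLU–Linear head follow the ResNet-50 setup described in the paper):

```python
import copy

import torch
import torch.nn as nn
import torchvision


def mlp(in_dim: int, hidden_dim: int = 4096, out_dim: int = 256) -> nn.Sequential:
    """Projector/predictor head: Linear -> BatchNorm -> ReLU -> Linear."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )


class OnlineNetwork(nn.Module):
    """Encoder + projector + predictor (the branch that is optimized)."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50()
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop the FC head
        self.projector = mlp(in_dim=2048)
        self.predictor = mlp(in_dim=256)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.encoder(x).flatten(start_dim=1)  # representation (kept after training)
        z = self.projector(y)                     # projection
        return self.predictor(z)                  # prediction of the target's projection


online = OnlineNetwork()
# The target network is a copy without the predictor; it is never optimized directly.
target_encoder = copy.deepcopy(online.encoder)
target_projector = copy.deepcopy(online.projector)
for p in list(target_encoder.parameters()) + list(target_projector.parameters()):
    p.requires_grad = False
```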

During training, two different, randomly sampled augmentations are applied to an input image to construct two different views of that image. One view is fed into the online model and the other into the target model. These augmentations include, among others: resizing, flipping, cropping, color distortion, grayscale conversion, Gaussian blur and solarization. The training objective is to minimize the squared L2-distance between both networks’ outputs. After training, only the encoder of the online network is kept as the final model!
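A sketch of such an augmentation pipeline with torchvision could look as follows (the probabilities and jitter strengths are illustrative, not the exact values from the paper’s appendix):

```python
from torchvision import transforms


def byol_augmentation(image_size: int = 224) -> transforms.Compose:
    """One randomly sampled view of an input image (illustrative parameters)."""
    return transforms.Compose([
        transforms.RandomResizedCrop(image_size),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.RandomApply([transforms.GaussianBlur(kernel_size=23)], p=0.5),
        transforms.RandomSolarize(threshold=128, p=0.2),
        transforms.ToTensor(),
    ])


# Two independently sampled views v and v' of the same image x:
# v, v_prime = byol_augmentation()(x), byol_augmentation()(x)
```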

That’s all. Easy, right? 😜 Well, after reading the paper my face was more like this: 😵 While it is relatively straightforward to understand the processing within the framework if you break it down to its key components, gaining an intuition for it did cost me quite some time.

Before we try to gain some intuition of why BYOL actually works, let’s first strip down the presented equations and demystify them.

Math Demystification

Having a rough overview of BYOL’s architecture and how it is trained, let’s have a closer look at the equations. I have to say, the math presented in the paper is way more complicated than it needs to be. While in some cases it is presented in an overly complex way, in other cases it lacks clarity and leaves room for interpretation, causing confusion.

I’ll focus on those equations that I think are important for understanding what is happening. Let’s analyze them in the exact reverse order, because why not? 😜

First, let’s talk about the update of the models’ parameters during training. Recall that we have two models: the online model and the target model. The online model is updated by optimizing a loss function using a LARS optimizer.

Equation 1: Weight update of online network. Source + annotations by Sascha Kirch

The equation above simply says: "update the model’s parameters theta by calling an optimizer function upon the current parameters, the gradient of the loss function with respect to these parameters, and a learning rate eta".
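In code this is just a standard gradient step; the toy snippet below uses plain SGD as a stand-in for LARS (which is not part of core PyTorch), purely to illustrate the structure of Equation 1:

```python
import torch

# Toy illustration of Equation 1: theta <- optimizer(theta, grad_theta(loss), eta).
# Plain SGD stands in for the LARS optimizer used in the paper.
theta = torch.nn.Parameter(torch.randn(4))
optimizer = torch.optim.SGD([theta], lr=0.2)  # eta = 0.2 (illustrative value)

loss = (theta ** 2).sum()  # any differentiable loss of theta
optimizer.zero_grad()
loss.backward()            # computes the gradient of the loss w.r.t. theta
optimizer.step()           # applies the parameter update with learning rate eta
```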

The target model, on the other hand, is not updated via optimization but by an exponential moving average of the updated weights of the online model and the current weights of the target network:

Equation 2: Weight update of target network. Source + annotations by Sascha Kirch

The equation above simply says: "update the model’s parameters xi by calculating an exponential moving average with decay rate tau of the current weights xi and the updated weights of the online model". Tau follows a cosine schedule to decrease the contribution of the online model throughout training.
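A minimal sketch of this update and the cosine schedule (tau_base = 0.996 is the base decay rate reported in the paper; the parameter names are mine):

```python
import math

import torch


@torch.no_grad()
def ema_update(target_params, online_params, tau: float) -> None:
    """Equation 2: xi <- tau * xi + (1 - tau) * theta."""
    for xi, theta in zip(target_params, online_params):
        xi.mul_(tau).add_(theta, alpha=1.0 - tau)


def tau_schedule(step: int, max_steps: int, tau_base: float = 0.996) -> float:
    """Cosine schedule that moves tau from tau_base towards 1 over training."""
    return 1.0 - (1.0 - tau_base) * (math.cos(math.pi * step / max_steps) + 1.0) / 2.0


# Called after every optimizer step on the online network, e.g.:
# ema_update(target_encoder.parameters(), online.encoder.parameters(),
#            tau=tau_schedule(step, max_steps))
```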

Now let’s have a look at the loss function used to update the online model. It is defined as the sum of two other loss functions. These losses share the same equation, as we will see later, but are calculated on two different inputs to the network. Recall from Fig. 1 that two different views (i.e. v and v’) are generated from an image x by applying different augmentations. One view is input into the online model and the other one into the target model. During training, two forward passes are performed before calculating the loss, where the inputs to the networks are swapped: the view fed to the online model is fed to the target model and vice versa.

Equation 3: BYOL’s loss function. Source + annotations by Sascha Kirch

The loss for the individual forward passes is a squared L2 distance of the L2-normalized outputs of the online model and the target model. Let’s break down the corresponding equation from the paper:

Equation 4: Individual loss function. Source + annotations by Sascha Kirch

Note: The paper calls this a mean squared error, which is not quite accurate: the squared L2-distance does not divide by the number of elements. I guess they refer to the mean taken over the batch.
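Putting Equations 3 and 4 together, a minimal PyTorch-style sketch could look like this (`v`, `v_prime`, `online` and `target` are placeholder names for the two views and the two networks; averaging over the batch is my addition):

```python
import torch
import torch.nn.functional as F


def regression_loss(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Equation 4: squared L2 distance between the L2-normalized vectors.

    For unit-length vectors this is identical to 2 - 2 * cosine_similarity(p, z).
    """
    p = F.normalize(p, dim=-1)
    z = F.normalize(z, dim=-1)
    return (p - z).pow(2).sum(dim=-1).mean()  # mean over the batch


# Equation 3: symmetrize by swapping the views between the two networks.
# p_1 = online(v);        z_1 = target(v_prime)   # no gradient through the target branch
# p_2 = online(v_prime);  z_2 = target(v)
# total_loss = regression_loss(p_1, z_1.detach()) + regression_loss(p_2, z_2.detach())
```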

Intuition of BYOL

Now that we are equipped with an understanding of the framework and the key message of the equations, let us try to gain some intuition. I’ll present what the authors think and then I’ll try to add some intuition of my own, well knowing that it might not be accurate 🤡.

How does BYOL learn its representations? – The model is encouraged to generate the same latent representation of its two inputs, which represent two different views of the same object/scene. A cat is still a cat regardless of the image being blurred, in grayscale or flipped. In fact, I think the heavy augmentations are crucial here. It basically tells the model "Look, these are different variations of the same thing, so ignore these variations and consider them equal when extracting representations of the object/scene!".

Why are the representations not collapsing? – Recall that earlier we said BYOL falls into the category of similarity learning. Wouldn’t it be the easiest way for the network to just map everything onto the same point in the latent space to achieve the highest similarity? In fact, this is one of the major difficulties in similarity learning and is called "collapsing solutions". Contrastive learning approaches solve this issue by providing many negative samples for a given match, mapping similar features close to each other in the latent space while mapping dissimilar features farther apart. BYOL solves this issue by introducing an asymmetry between the online and the target network via the online network’s predictor submodule, and by updating the target network’s parameters with an exponential moving average to ensure near-optimality of the predictor throughout training.


Experiments

The authors of BYOL presented experiments and ablations to demonstrate the effectiveness of their method.

Ablation on Batch Size

From contrastive representation learning methods (e.g. [CLIP](https://towardsdatascience.com/the-clip-foundation-model-7770858b487d?source=friends_link&sk=a7b10ba1d0c3a20ecd4adb8200a48500) and GLIP) we know that there is a large dependence on the batch size during training. CLIP for example was trained on a batch size of 32,768, which is crazy considering it is a multi-modal language-image model.

The authors claim that, since BYOL does not require negative samples, it is not as sensitive to lower batch sizes, which they back up with the experiment shown in Fig. 2.

Fig. 2: Impact of batch size. Image Source + annotations by Sascha Kirch.

Sadly, this might still be too large for my private laptop 😅

Ablation on Robustness of Image Augmentations

The SimCLR paper has shown that contrastive vision methods are sensitive to the choice of image augmentations, especially those affecting the color histogram. While crops of the same image share a similar color histogram, crops from negative pairs do not. The model can take a shortcut during training and focus on differences in color histograms rather than on semantic features.

The authors claim that BYOL is more robust to the choice of image augmentations because of the way the online and target networks are updated. While this hypothesis is backed up by an experiment, there is still a strong dependency and hence a drop in performance.

Fig. 3: Robustness towards image augmentations. Image Source + annotations by Sascha Kirch.

Linear Evaluation on ImageNet

In the field of representation learning, an important characteristic is the model’s ability to project semantically rich features into a latent space, clustering similar features and separating dissimilar ones. A common test is to freeze the model (in the case of BYOL, only the encoder of the online model) and to train a linear classifier on top of the representations.
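A sketch of such a linear probe, reusing the hypothetical `online` network from the architecture sketch above (the hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

# Freeze the (pre-trained) online encoder and train only a linear classifier on top.
encoder = online.encoder
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

classifier = nn.Linear(2048, 1000)  # ResNet-50 features -> ImageNet classes
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()


def probe_step(images: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():  # the representations stay fixed
        features = encoder(images).flatten(start_dim=1)
    loss = criterion(classifier(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```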

Linear evaluation of BYOL has been performed on ImageNet and compared to many other models; it outperforms the previous state of the art at the time.

In many papers you’ll find a differentiation between the ResNet-50 encoder and other variations of ResNet. It’s just that ResNet-50 has emerged as the standard network to evaluate performance on.

Table 1: Linear evaluation on ImageNet. Source

Semi-Supervised Fine-Tuning for Classification

Another very typical experiment setup in representation learning is the model’s performance when fine-tuned to a specific downstream task and dataset.

Table 2 depicts the metrics when fine-tuning BYOL on a classification task using either 1% or 10% of the entire ImageNet training set.

Table 2: Semi-supervised training on ImageNet. Source
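In contrast to the linear probe above, fine-tuning trains the encoder end-to-end together with the classifier on the labeled subset. A minimal sketch, again reusing the hypothetical `encoder` from the earlier sketches and not the paper’s exact recipe:

```python
import torch

# All parameters are trainable now (unlike the frozen linear probe).
model = torch.nn.Sequential(encoder, torch.nn.Flatten(), torch.nn.Linear(2048, 1000))
for p in model.parameters():
    p.requires_grad = True
model.train()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
# ... then iterate over the labeled 1% / 10% subset and optimize as usual.
```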

Transfer to Other Vision Tasks

The authors also present experiments where they transfer-learn BYOL on a semantic segmentation task and a monocular depth estimation task, two other important fields of computer vision.

The differences to previous approaches are marginal, but I guess the key message here is: "We have a different approach that works just as well."

Table 3: Transfer to other vision tasks. Source

Conclusion

BYOL presented an alternative approach to self-supervised representation learning. By implementing two networks that perform similarity learning, BYOL can be trained without the negative training samples required by contrastive learning approaches. To avoid collapsing solutions, the target network is updated via an exponential moving average of the online network, and an extra prediction submodule is built on top of the online network.

Further Readings & Resources

If you have made it this far: congratulations 🎉 and thank you 😉! Since it seems that you are quite interested in the topic, here are some further resources:

Here is a list of papers that build upon BYOL:

  1. DINO: Emerging Properties in Self-Supervised Vision Transformers
  2. DINOv2: Learning Robust Visual Features without Supervision

Here are two of my articles about the contrastive learning methods CLIP and GLIP for self-supervised representation learning:

The CLIP Foundation Model

GLIP: Introducing Language-Image Pre-Training to Object Detection
