
DINO – A Foundation Model for Computer Vision

Emerging Properties in Self-Supervised Vision Transformers by M. Caron et al.

🚀 Sascha’s Paper Club

It is an exciting decade for Computer Vision. Great successes from the natural language domain are being transferred to the vision domain, including the introduction of the ViT (vision transformer), and lately large-scale self-supervised pre-training techniques have made headlines under the name of foundation models.

Today we are looking into a framework called DINO (self DIstillation, NO labels), a visual foundation model built on interesting properties of ViTs. It is also the predecessor of one of today’s best performing foundation models: DINOv2.

Image created from publication by Sascha Kirch

Paper: Emerging Properties in Self-Supervised Vision Transformers, by Mathilde Caron et al., 29 Apr. 2021

Resources: GitHub Blog Post

Category: foundation model, computer vision, vision transformer, Knowledge Distillation, similarity learning, self-supervised learning

Other Walkthroughs: [BYOL] – [CLIP] – [GLIP] – [Segment Anything] – [DINO] – [Depth Anything] – [DDPM]

Outline

  1. Context & Background
  2. Method
  3. Experiments
  4. Ablations
  5. Conclusion
  6. Further Readings & Resources

Context & Background

The year is 2021, April to be precise. It has been four years since the release of the transformer model in Attention Is All You Need. Self-supervised pre-training has long been practiced in NLP by models such as BERT, and the term foundation model will not be coined for a few more months, until the release of "On the Opportunities and Risks of Foundation Models". Six months earlier, the vision transformer (ViT) was first published on arXiv, and it is still one month until ICLR 2021, where it will be presented.

Let that sink in for a moment: ViT had its debut on arXiv in October 2020 and was presented at ICLR 2021 in May 2021. DINO was released on arXiv in April 2021, one month before ViT was even presented at a conference. This means the authors had only 5 months, had they started right away, to come up with the project's idea, compile a team, lay out the theoretical foundation, train the model, perform experiments and ablations, and write the paper. No wonder PhD students these days feel constantly anxious. At least that's what happens to me sometimes 😅

While ViTs are very competitive with convolutional networks, they are demanding in terms of computational resources and the amount of training data required.

The authors of DINO made a simple observation: the success of transformers in NLP was coupled with self-supervised pre-training, while self-supervised methods in the vision domain at that time were built on convnets, such as BYOL.

BYOL – The Alternative to Contrastive Self-Supervised Learning

Inspired by BYOL and the mean teacher, the authors came up with a framework to train a ViT in a self-supervised fashion and found:

  1. Self-supervised ViT features explicitly contain the scene layout and, in particular, object boundaries.
  2. Self-supervised ViT features perform particularly well with a basic nearest-neighbor classifier (k-NN), without any fine-tuning, linear classifier, or data augmentation.

In contrast to BYOL and the mean teacher, DINO implements a knowledge-distillation framework consisting of a student and a teacher model that act upon different views of the same image, and adds extra measures to deal with the inherent instabilities of similarity-learning approaches, where solutions often collapse.

An interesting finding about the underlying vision transformer architecture (ViT) is that, when trained with unsupervised learning techniques, its features contain explicit information about the semantic segmentation of an image. One can simply visualize the self-attention maps of selected heads of the multi-head attention layer, as shown in the video below:

Fig. 1: Self-attention maps for selected heads. Source

Let's unpack another layer of abstraction and try to understand how DINO implements its framework, how it tackles instabilities, and how it performs compared to previous methods!

Paper Walkthroughs by Sascha Kirch


Method

The DINO framework shares the same overall structure with other similarity-learning frameworks like BYOL or the mean teacher, but also with knowledge distillation. Let's first have a look at how DINO does it and then differentiate it from the other frameworks.

Fig. 2: DINO architecture. Source + annotations by Sascha Kirch

Networks and Update Rule

Let's start from the middle. DINO implements two networks with the exact same architecture but a different set of weights: the student and the teacher. The student is trained with backpropagation, while the teacher updates its weights as an exponential moving average of its own weights and those of the student.

Equation 1: Update rule of the teacher's weights. Source + annotations by Sascha Kirch
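
To make this update rule a bit more tangible, here is a minimal PyTorch-style sketch of the EMA step. The names student, teacher and the momentum value are placeholders of mine, not taken from the official implementation:

```python
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, m: float = 0.996):
    """EMA update: theta_teacher = m * theta_teacher + (1 - m) * theta_student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(m).add_((1.0 - m) * p_s.detach().data)
```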

Backbones are either a ResNet50 or a DeiT (a ViT variant originally trained with knowledge distillation). An MLP-based projection head is attached to the backbone to produce the outputs on which the loss is computed; it is removed for inference.

Nice, but which model is used for inference: student or teacher? – Well, that's a good question, and funnily enough not a single word is mentioned in the paper. Intuitively you might think the student; at least I did at first. But as we will see later, the teacher outperforms the student throughout training. The only hint, besides the better performance, is that in the code implementation the teacher checkpoint is the default one for evaluations such as video segmentation, linear probing and k-NN. Since this parameter can be changed, though, I cannot tell you with certainty.

Inputs and Outputs

From an input image x, different views x1 and x2 are created by cropping and applying image augmentations like in BYOL (e.g. color jitter, Gaussian blur and solarization). The technique used for cropping is called multi-crop, where multiple crops of different sizes are generated to save memory while providing more data. Small crops, called local views, consist of 96×96 pixels and are exclusively fed into the student. Larger crops, called global views, consist of 224×224 pixels and are fed into the teacher. As we will see later in the ablation section, 2 global views and 10 local views have been used during training.

NOTE: The paper is a bit confusing regarding the multi-crop technique because neither the provided pseudo-code nor the architecture shown in Fig. 2 above reflects it. The pseudo-code only shows two global views x1 and x2 being fed into both the student and the teacher, like in BYOL, without the additional local views used in multi-crop.
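
To make the multi-crop idea concrete, here is a rough torchvision sketch producing 2 global 224×224 views and several local 96×96 views. The augmentation parameters are simplified and only illustrative, not the exact values of the official implementation:

```python
from torchvision import transforms

# Shared flip + color augmentations (illustrative values, not DINO's exact settings)
flip_and_jitter = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
])

global_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),   # large crop of the image
    flip_and_jitter,
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

local_crop = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.4)),   # small crop of the image
    flip_and_jitter,
    transforms.ToTensor(),
])

def multi_crop(image, n_local: int = 10):
    """Return [2 global views] + [n_local local views] of a single PIL image."""
    return [global_crop(image) for _ in range(2)] + [local_crop(image) for _ in range(n_local)]
```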

In contrast to similarity learning, where the objective is to maximize the similarity of embeddings, DINO minimizes the cross-entropy between the teacher's and the student's output distributions. As indicated by the equation below, the cross-entropy is calculated for every pair of a global view (fed to the teacher) and a different view (fed to the student), and is then summed up.

Equation 2: Optimization objective. Source + annotations by Sascha Kirch

And what do the models output? – Like in similarity learning, the student and the teacher output an embedding for a given image rather than a prediction score. Like in knowledge distillation, that output is transformed via a SoftMax into a probability distribution. The SoftMax has a temperature parameter that controls the smoothing or sharpening of the resulting distribution. This temperature plays a crucial role in knowledge distillation because it controls the balance between transferring general knowledge and fine-grained details from the teacher network to the student network, making the distillation process more effective for different tasks.

Fig. 3: Effect of temperature value on the SoftMax output. Illustration by Sascha Kirch created with this python notebook
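
For a quick impression of what the temperature does, here is a minimal NumPy version of the temperature-scaled SoftMax:

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Temperature-scaled SoftMax: small temperatures sharpen, large ones smooth."""
    z = logits / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])
print(softmax(logits, temperature=0.1))   # sharp, close to one-hot
print(softmax(logits, temperature=1.0))   # moderate
print(softmax(logits, temperature=10.0))  # nearly uniform
```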

I created a notebook for you so you can investigate the impact of the temperature on the resulting distribution:

ML_Notebooks/Softmax_Temperature.ipynb at main · sascha-kirch/ML_Notebooks

Avoiding Collapse

As mentioned earlier, student and teacher have the exact same architecture. This kind of setup is unstable (if no countermeasures are implemented) and might result in collapsing solutions, where all features are mapped to the same region in the latent space, e.g. a single point in the worst case. BYOL addressed this issue by adding an extra prediction head to only one of the models, introducing an asymmetry. Since DINO has symmetric models, another trick is required: centering and sharpening. Both are applied to the teacher network only. Centering prevents a single dimension in the latent space from dominating by adding a bias term c to the teacher's output, g(x) = g(x) + c, where

Equation 3: Update rule of the centering term. Source + annotations by Sascha Kirch

While centering has a positive effect, it also encourages the output to collapse into a uniform distribution. Sharpening has the opposite effect, so applying both balances their effects and stabilizes training. Sharpening is achieved by using a smaller temperature in the teacher's SoftMax than in the student's (see Fig. 3).

To avoid collapse, the hyperparameter m from Equation 3 and the temperature of the teacher are crucial. In the ablation study in the appendix, the authors show that m = 0.9…0.999 works best and that the teacher's temperature is linearly increased from 0.04 to 0.07 during warm-up.
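
Putting Equations 2 and 3 together, here is a simplified PyTorch sketch of the loss and the center update. The temperatures and the momentum m follow the values reported in the paper; the function names and the assumption that the first two student outputs correspond to the global views are my own simplifications, not the official implementation:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the centered+sharpened teacher distribution and the
    student distribution, summed over all pairs of different views (Equation 2).
    student_out: list of logit tensors, one per crop (assumed: 2 global views first)
    teacher_out: list of logit tensors, one per global view
    """
    teacher_probs = [F.softmax((t - center) / tau_t, dim=-1).detach() for t in teacher_out]
    loss, n_terms = 0.0, 0
    for i, t in enumerate(teacher_probs):
        for j, s in enumerate(student_out):
            if i == j:  # skip pairs where teacher and student see the same global view
                continue
            loss += torch.sum(-t * F.log_softmax(s / tau_s, dim=-1), dim=-1).mean()
            n_terms += 1
    return loss / n_terms

@torch.no_grad()
def update_center(center, teacher_out, m=0.9):
    """EMA of the teacher's batch mean, used as the centering term c (Equation 3)."""
    batch_center = torch.cat(teacher_out).mean(dim=0, keepdim=True)
    return m * center + (1.0 - m) * batch_center
```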

What does DINO do? Knowledge Distillation or Similarity Learning?

The answer is a little bit of both!

While knowledge distillation usually distils knowledge from an already trained, larger and more accurate teacher model into a smaller student model, it could also be seen as a form of similarity learning because it encourages the student network to produce predictions similar to those of the teacher. In similarity learning, the two models are usually trained jointly and often align their latent-space representations rather than probability distributions.

Since the authors of DINO phrase their objective as knowledge distillation, let's have a look at some differences compared to "standard" knowledge distillation:

  1. DINO's teacher is not available a priori but "trained" alongside the student. It can even be considered a form of co-distillation, since knowledge is also distilled from the student into the teacher.
  2. DINO’s teacher and student are not acting on the same input but on different views of the image cropped to different sizes.
  3. DINO uses different temperatures in the SoftMax of both models to perform sharpening.
  4. DINO calculates the cross-entropy over the temperature-scaled SoftMax of the embeddings rather than prediction scores.

And how is it similar to knowledge distillation?:

  1. DINO consists of a student and a teacher network, where the teacher performs better than the student as we will see in the experiments.
  2. Rather than maximizing a similarity metric, DINO minimizes the cross-entropy loss of a temperature scaled SoftMax output.



Experiments

The paper presents a vast number of experiments. They pre-train the model on ImageNet, a commonly used dataset in representation learning.

For the evaluation, common protocols either train a linear classifier on top of frozen features or fine-tune the model on new downstream tasks, adapting the model's parameters.
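
As a rough illustration of the first protocol, here is a minimal scikit-learn sketch of linear probing on frozen features. The paper trains a linear layer with SGD and augmentations; a logistic regression fitted on pre-extracted features only conveys the basic idea:

```python
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Linear probing: fit a linear classifier on frozen features,
    leaving the pre-trained backbone untouched."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```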

The authors of DINO claim that those techniques are very sensitive to hyperparameters, which makes comparisons unfair and hard to reproduce. Hence, they propose to use a simple nearest-neighbor classifier (k-NN) on the features of the pre-trained model.
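
And a similarly simplified sketch of the k-NN protocol. The paper uses a weighted k-NN with k=20 on frozen, L2-normalized features; this plain scikit-learn version is only meant to convey the idea:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_eval(train_feats, train_labels, test_feats, test_labels, k=20):
    """k-NN classification on frozen, L2-normalized features."""
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_feats, train_labels)
    return knn.score(test_feats, test_labels)
```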

Linear and k-NN Classification on ImageNet

In this experiment, the models are tested on their image classification accuracy on ImageNet. A variety of self-supervised pre-trained models are tested with either a ResNet or a ViT backbone. The classification is done either with linear probing or a k-NN classifier.

Table 1: Linear and k-NN classification on ImageNet. Source + annotations by Sascha Kirch

I guess the key take-aways are:

  1. K-NN performs better on ViT features than on ResNet features.
  2. Decreasing the patch size of the ViT yields a larger improvement than increasing the backbone size, but at the cost of slower inference.

Video Instance Segmentation

An important experiment is the video instance segmentation task, since the paper is about the ViT's capability to capture semantic segmentation in its features when trained with unsupervised methods. Or let's say that's what is claimed 😁

Table 2: Video instance segmentation. Source + annotations by Sascha Kirch

Looking at those results, I am missing two further experiments:

  1. It would be nice to see a comparison of a supervised ResNet50 and a self-supervised ResNet50 trained in the DINO framework to support the claim that the ViT is superior to the ResNet architecture.
  2. It would also be great to see the same set of ViT backbones for supervised as for self-supervised training, to see the impact of patch size and model size.

But as I always say: asking questions is easy 😁 In real-world projects, authors often face resource constraints and project deadlines, so not every single detail can be covered!

Probing the Self-Attention Map

In this experiment, the authors investigate the self-attention maps of different heads in the multi-head self-attention layers of the ViT. They visualize the attention maps of selected heads from the last layer of ViT-S/8, specifically those of the learned [CLS] token.
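
The sketch below shows how such maps could be extracted. It assumes the ViT implementation from the official facebookresearch/dino repository, which exposes a get_last_selfattention helper; treat the exact interface as an assumption rather than a guarantee:

```python
import torch

def cls_attention_maps(model, img, patch_size=8):
    """Return one [CLS]-to-patches attention map per head of the last layer.
    img: normalized tensor of shape (1, 3, H, W) with H, W divisible by patch_size.
    Assumes model.get_last_selfattention() as in facebookresearch/dino."""
    with torch.no_grad():
        attn = model.get_last_selfattention(img)   # (1, n_heads, n_tokens, n_tokens)
    n_heads = attn.shape[1]
    h, w = img.shape[-2] // patch_size, img.shape[-1] // patch_size
    # attention of the [CLS] token (index 0) to all patch tokens (indices 1:)
    return attn[0, :, 0, 1:].reshape(n_heads, h, w)
```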

Fig. 4: Attention maps from selected heads. Source + annotations by Sascha Kirch

Other Experiments

In other experiments, DINO improved upon the supervised baseline. These tasks include image retrieval and copy detection.


Ablations

For their ablation study the authors experiment with the ViT-S model.

Importance of Patch Size

Recall that a vision transformer takes a patchified version of an input image, transforms each patch into a token, and then applies a transformer with its self-attention mechanism. This was a trick by the authors of ViT to reduce the compute requirements, trading off some performance, to make transformers applicable to image data.

DINO claims that a smaller patch size increases performance while decreasing throughput (the number of images that can be processed per second), which is exactly what ViT claims as well.

Fig. 5: Impact of patch size on accuracy and throughput. Source + annotations by Sascha Kirch

Intuitively, I'd say this is no surprise: with smaller patches you effectively increase the input resolution and end up with more tokens to attend to, resulting in a more fine-grained attention map.
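
To see why smaller patches hurt throughput, a quick back-of-the-envelope calculation of the token count for a 224×224 image (ignoring the [CLS] token) and the quadratic cost of self-attention:

```python
# Number of tokens a ViT attends over for a 224x224 image, and the relative
# cost of self-attention, which scales quadratically with the token count.
for patch_size in (16, 8):
    n_tokens = (224 // patch_size) ** 2
    print(f"patch {patch_size}x{patch_size}: {n_tokens} tokens, "
          f"~{n_tokens ** 2:,} token pairs per attention layer")
```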

Different Teacher Update Rules

The teacher in DINO is updated by calculating the exponential moving average from the updated student and the current teacher. This is the "momentum encoder" approach they refer to.

When using the momentum encoder and plotting the accuracy of teacher and student during training, the teacher performs better throughout the entire training. From this we can hypothesize that:

  1. the teacher can provide a strong learning signal to the student.
  2. an improving student improves the teacher due to the EMA update rule (co-distillation).
  3. One can use the teacher as the final model, which performs better but has the same architecture as the student, hence no change in compute requirements.

Fig. 6: Teacher performance. Source + annotations by Sascha Kirch

They also experiment with three other update rules: copying the student's weights to the teacher, using the student's weights from the previous iteration of the optimizer, and using the student's weights from the previous epoch.

Multi-Crop vs. Time and GPU Memory

As mentioned earlier, DINO inputs multiple cropped views of the same image and feeds the global views into the teacher and the local views into the student. In this ablation, the authors experiment with different amounts of local views and report the impact on performance, training time and peak memory per GPU.

Table 3: Multi-crop vs. time and GPU memory. Source + annotations by Sascha Kirch

Avoiding Collapse

In this ablation the authors evaluated the role of their stabilizing measures to avoid collapsing solutions: centering and sharpening.

To do so, they decompose the cross-entropy into an entropy term and a Kullback-Leibler (KL) divergence term. The KL divergence measures the difference between two probability distributions; a KL divergence of 0 means the two distributions are equal.

The intuition behind this is the following: if the KL divergence between the teacher's and the student's output distributions is zero, both outputs are identical and there is no learning signal for updating the weights of the student.
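
A tiny NumPy example verifies the decomposition H(t, s) = H(t) + KL(t || s) for two made-up distributions:

```python
import numpy as np

t = np.array([0.7, 0.2, 0.1])   # "teacher" output distribution
s = np.array([0.5, 0.3, 0.2])   # "student" output distribution

cross_entropy = -np.sum(t * np.log(s))   # H(t, s)
entropy = -np.sum(t * np.log(t))         # H(t)
kl = np.sum(t * np.log(t / s))           # KL(t || s)

print(np.isclose(cross_entropy, entropy + kl))  # True
```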

Fig. 7: Analysis of collapsing solutions. Source + annotations by Sascha Kirch

Effect of Batch Size

An interesting property is that DINO can be trained with small batch sizes without a large drop in performance. Being less dependent on the batch size than contrastive self-supervised learning was actually one of the motivations of BYOL, a paper DINO builds upon.

Table 4: Batch size vs. accuracy. Source + annotations by Sascha Kirch

Contrastive methods like CLIP and GLIP provide many negative samples for a given positive sample to avoid collapsing solutions. The more negative samples per optimizer update step (and hence per batch), the better they work.


Conclusion

In conclusion, DINO is a knowledge-distillation framework. It is a visual foundation model that exploits interesting properties of ViTs and is the predecessor of one of today's best-performing foundation models, DINOv2. DINO's framework consists of a student and a teacher model that act upon different views of the same image, with extra measures to deal with the inherent instabilities of similarity-learning approaches. The experiments show that DINO outperforms other self-supervised pre-trained models on various tasks.


Further Readings & Resources

Papers

In the meantime, an improved version of DINO has been released:

  1. DINOv2: Learning Robust Visual Features without Supervision
  2. Blog post DINOv2 by Meta

Paper Walkthroughs

You might also like my other paper walkthroughs covering concepts we discussed in this article:

The CLIP Foundation Model

GLIP: Introducing Language-Image Pre-Training to Object Detection

BYOL – The Alternative to Contrastive Self-Supervised Learning

Segment Anything – Promptable Segmentation of Arbitrary Objects

