🚀 Sascha’s Paper Club
It is an exciting decade for Computer Vision. Great successes from the natural language domain are being transferred to the vision domain, including the introduction of the ViT (vision transformer), and lately large-scale self-supervised pre-training techniques have made headlines under the name of foundation models.
Today we are looking into a framework called DINO (self-DIstillation with NO labels), a visual foundation model built on interesting properties of ViTs. It is also the predecessor of one of today’s best performing foundation models: DINOv2.

Paper: Emerging Properties in Self-Supervised Vision Transformers, by Mathilde Caron et al., 29 Apr. 2021
Category: foundation model, computer vision, vision transformer, Knowledge Distillation, similarity learning, self-supervised learning
Other Walkthroughs: [BYOL] – [CLIP] – [GLIP] – [Segment Anything] – [DINO] – [Depth Anything] – [DDPM]
Outline
- Context & Background
- Method
- Experiments
- Ablations
- Conclusion
- Further Readings & Resources
Context & Background
The year is 2021, April to be precise. It has been four years since the release of the transformer model with "Attention Is All You Need". Self-supervised pre-training has long been practiced in NLP by models such as BERT, and the term foundation model will not be coined for another few months, until the release of "On the Opportunities and Risks of Foundation Models". Six months earlier the vision transformer (ViT) was first published on arXiv, and it is still one month until ICLR 2021, where it will be presented.
Let that sink in for a moment: ViT had its debut on arxiv.org in October 2020 and was presented at ICLR 2021 in May 2021. DINO was released on arXiv in April 2021, so one month before ViT was even presented at a conference. This means the authors had roughly six months, assuming they started right away, to come up with the project’s idea, compile a team, lay out the theoretical foundation, train the model, perform experiments and ablations, and write the paper. No wonder PhD students these days feel constantly anxious. At least that’s what happens to me sometimes 😅
While ViTs are very competitive with convolutional networks, they are demanding in terms of computational resources and the amount of training data required.
The authors of DINO made a simple observation: the success of transformers in NLP was coupled with self-supervised pre-training, while current self-supervised methods in the vision domain were built on convnets, e.g. BYOL.
BYOL -The Alternative to Contrastive Self-Supervised Learning
Inspired by BYOL and the mean teacher, the authors came up with a framework to train a ViT in a self-supervised fashion and found:
- Self-supervised ViT features explicitly contain the scene layout and, in particular, object boundaries.
- Self-supervised ViT features perform particularly well with a basic nearest-neighbor classifier (k-NN), without any fine-tuning, linear classifier, or data augmentation.
In contrast to BYOL and the mean teacher, DINO implements a knowledge-distillation framework consisting of a student and a teacher model that act on different views of the same image, and it adds extra measures to deal with the inherent instabilities of similarity-learning approaches, whose solutions often collapse.
An interesting finding about the underlying vision transformer architecture (ViT) is that when trained with unsupervised learning techniques, its features contain explicit information about the semantic segmentation of an image. One can simply visualize the self-attention maps of selected heads of the multi-head attention layer, as shown in the video below:

Let’s unpack another layer of abstraction and try to understand how DINO implements its framework, how it tackles instabilities, and how it performs compared to previous methods!
Method
The DINO framework shares the same overall structure with other similarity-learning frameworks like BYOL or the mean teacher, but also with knowledge distillation. Let’s first have a look at how DINO does it and then differentiate it from the other frameworks.

Networks and Update Rule
Let’s start from the middle. DINO implements two networks with the exact same architecture but different sets of weights: the student and the teacher. The student is trained with backpropagation, while the teacher updates its weights with an exponential moving average of its own weights and those of the student.

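Written as a minimal sketch (assuming PyTorch-style modules; the fixed momentum value is only illustrative, the official implementation increases it towards 1 over training):

```python
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float = 0.996):
    """EMA update: teacher = momentum * teacher + (1 - momentum) * student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(momentum).add_((1.0 - momentum) * p_s.data)
```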
The backbone is either a ResNet-50 or a DeiT (a ViT adapted for knowledge distillation). An MLP-based projection head is connected to the backbone to reduce the dimensionality of the features, but it is removed for inference.
Nice, but which model is used for inference: student or teacher? – Well, that’s a good question, and funnily enough not a single word about it is mentioned in the paper. Intuitively you might think the student; at least I did at first. But as we will see later, the teacher outperforms the student throughout training. The only hint besides the better performance is that in the code implementation the teacher checkpoint is the default one for the evaluations, e.g. video segmentation, linear probing and k-NN. Since this parameter can be changed though, I cannot tell you with certainty.
Inputs and Outputs
From an input image x, different views x1 and x2 are created by cropping and applying image augmentations like in BYOL (e.g. color jitter, Gaussian blur and solarization). The cropping technique is called multi-crop, where multiple crops of different sizes are generated to save memory while providing more data. Small crops, called local views, consist of 96×96 pixels and are exclusively fed into the student. Larger crops, called global views, consist of 224×224 pixels and are the only views fed into the teacher. As we will see later in the ablation section, 2 global views and 10 local views were used during training.
NOTE: The paper is a bit confusing regarding the multi-crop technique because neither the provided pseudo-code nor the architecture shown in Fig. 3 above reflects it. The pseudo-code only shows the two global views x1 and x2 being fed into both the student and the teacher, like in BYOL, and leaves out the additional local views that only the student receives.
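To make the multi-crop setup a bit more concrete, here is a rough sketch (assuming torchvision; the crop sizes and counts follow the description above, while the scale ranges and augmentation parameters are my own assumptions, not values taken from the paper):

```python
from torchvision import transforms as T

# 224x224 global views (fed to the teacher) and 96x96 local views (student only).
global_aug = T.Compose([
    T.RandomResizedCrop(224, scale=(0.4, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2, hue=0.1),
    T.GaussianBlur(kernel_size=23),
    T.RandomSolarize(threshold=128, p=0.2),
    T.ToTensor(),
])
local_aug = T.Compose([
    T.RandomResizedCrop(96, scale=(0.05, 0.4)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2, hue=0.1),
    T.GaussianBlur(kernel_size=9),
    T.ToTensor(),
])

def multi_crop(image, n_global=2, n_local=10):
    """Return the global and local views of a single PIL image."""
    return ([global_aug(image) for _ in range(n_global)],
            [local_aug(image) for _ in range(n_local)])
```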
In contrast to similarity learning, where the objective is to maximize the similarity of embeddings, DINO minimizes the cross-entropy between the teacher’s and the student’s output distributions. As indicated by the equation below, the cross-entropy is calculated for every pair of a global (teacher) view and a different (student) view and then summed up.

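In the paper’s notation, with V the set of all generated views and H(a, b) = −a·log b the cross-entropy, the objective can be written as:

$$\min_{\theta_s} \sum_{x \in \{x_1^g,\, x_2^g\}} \;\; \sum_{\substack{x' \in V \\ x' \neq x}} H\big(P_t(x),\, P_s(x')\big)$$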
And what do the models output? – Like in similarity learning, the student and the teacher output an embedding for a given image rather than a prediction score. Like in knowledge distillation, this output is transformed via a SoftMax into a probability distribution. The SoftMax has a temperature parameter that controls the smoothing or sharpening of the resulting distribution. This temperature plays a crucial role in knowledge distillation because it controls the balance between transferring general knowledge and fine-grained details from a teacher network to a student network, making the distillation process more effective for different tasks.

I created a notebook for you so you can investigate the impact of the temperature on the resulting distribution:
ML_Notebooks/Softmax_Temperature.ipynb at main · sascha-kirch/ML_Notebooks
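If you just want a quick feel for the effect without opening the notebook, here is a minimal sketch (assuming PyTorch; the logits are made up):

```python
import torch
import torch.nn.functional as F

def tempered_softmax(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Temperature-scaled softmax: small temperatures sharpen, large ones smooth."""
    return F.softmax(logits / temperature, dim=-1)

logits = torch.tensor([2.0, 1.0, 0.5])
print(tempered_softmax(logits, 0.04))  # very sharp, close to one-hot
print(tempered_softmax(logits, 1.0))   # plain softmax
print(tempered_softmax(logits, 10.0))  # close to uniform
```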
Avoiding Collapse
As mentioned earlier, student and teacher have the exact same architecture. This kind of setup is unstable (if no counter measures are implemented) and might result in collapsing solutions, where all features are mapped to the same region in latent space, e.g. a single point in the worst case. BYOL addressed this issue by introducing an asymmetry via an extra prediction head for only one of the models. Since DINO has symmetric models, another trick is required: centering and sharpening. Both are applied to the teacher network only. Centering prevents a single dimension in the latent space from dominating by adding a bias term c to the teacher’s output: g(x) ← g(x) + c.

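The center c itself is updated with an exponential moving average over the teacher’s outputs of the current batch (B denotes the batch size, m the centering momentum):

$$c \leftarrow m\, c + (1 - m)\, \frac{1}{B} \sum_{i=1}^{B} g_{\theta_t}(x_i)$$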
While centering has a positive effect, it also encourages the output to collapse into a uniform distribution. Sharpening has the opposite effect, hence applying both balances their effects and stabilizes training. Sharpening is achieved by using a smaller temperature in the SoftMax (see Fig. 3) for the teacher than for the student.
To avoid collapse, the momentum m of the centering update and the temperature of the teacher are crucial. In their ablation study in the appendix, the authors show that m = 0.9…0.999 works best and that the teacher’s temperature is linearly increased from 0.04 to 0.07 during warm-up.
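Putting centering and sharpening together, the teacher-side post-processing can be sketched as follows (assuming PyTorch; the hyperparameter values are illustrative defaults, not a faithful reproduction of the official code):

```python
import torch
import torch.nn.functional as F

class TeacherPostprocess:
    """Center, then sharpen the teacher's outputs; keep an EMA of the center."""

    def __init__(self, out_dim: int, center_momentum: float = 0.9, teacher_temp: float = 0.04):
        self.center = torch.zeros(1, out_dim)
        self.center_momentum = center_momentum
        self.teacher_temp = teacher_temp

    @torch.no_grad()
    def __call__(self, teacher_logits: torch.Tensor) -> torch.Tensor:
        # Centering: subtract the running center so no single dimension dominates.
        # Sharpening: divide by a small temperature before the softmax.
        probs = F.softmax((teacher_logits - self.center) / self.teacher_temp, dim=-1)
        # EMA update of the center with the batch mean of the teacher's outputs.
        batch_center = teacher_logits.mean(dim=0, keepdim=True)
        self.center = self.center_momentum * self.center + (1 - self.center_momentum) * batch_center
        return probs
```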
What does DINO do? Knowledge Distillation or Similarity Learning?
The answer is a little bit of both!
While knowledge distillation usually distils knowledge from an already trained, larger and more accurate teacher model into a smaller student model, it can also be seen as a form of similarity learning because it encourages the student network to produce predictions similar to those of the teacher. In similarity learning, the two models are usually trained jointly and often align their latent-space representations rather than probability distributions.
Since the authors of DINO phrase their objective as knowledge distillation, let’s have a look at some differences compared to "standard" knowledge distillation:
- DINO’s teacher is not available a priori but "trained" alongside the student. It can even be considered a form of co-distillation, since knowledge is also distilled from the student into the teacher.
- DINO’s teacher and student are not acting on the same input but on different views of the image cropped to different sizes.
- DINO uses different temperatures in the SoftMax of both models to perform sharpening.
- DINO calculates the cross-entropy over the temperature-scaled SoftMax of the embeddings rather than prediction scores.
And how is it similar to knowledge distillation?
- DINO consists of a student and a teacher network, where the teacher performs better than the student as we will see in the experiments.
- Rather than maximizing a similarity metric, DINO minimizes the cross-entropy loss of a temperature-scaled SoftMax output, as sketched in the code below.
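As a minimal sketch of that loss (assuming PyTorch and that the teacher probabilities have already been centered and sharpened as described above; the student temperature of 0.1 is, as far as I know, the value used in the official implementation):

```python
import torch.nn.functional as F

def dino_loss(student_logits, teacher_probs, student_temp: float = 0.1):
    """Cross-entropy between the teacher's distribution and the
    temperature-scaled student softmax, averaged over the batch."""
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```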
Experiments
The paper presents a vast number of experiments. They pre-train the model on ImageNet, a commonly used dataset in representation learning.
For the evaluation, common techniques usually either train a linear classifier on top of frozen features or fine-tune the model to new downstream tasks, where the parameters of the model are adapted.
The authors of DINO claim that those techniques are very sensitive to hyperparameters, which makes comparisons unfair and hard to reproduce. Hence, they propose to evaluate with a simple weighted nearest-neighbor classifier (k-NN) on the frozen features of the pre-trained model.
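A rough sketch of such a k-NN evaluation (the paper uses a weighted k-NN with k=20 on frozen features; the sklearn-based version below is only an approximation of that protocol, not the authors’ exact implementation):

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_eval(train_feats, train_labels, val_feats, val_labels, k: int = 20):
    """Classify validation features by their nearest neighbors in the training features.
    The feature arrays are assumed to come from the frozen, pre-trained backbone."""
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance", metric="cosine")
    knn.fit(train_feats, train_labels)
    return knn.score(val_feats, val_labels)  # top-1 accuracy
```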
Linear and k-NN Classification on ImageNet
In this experiment the models are tested on their image-classification accuracy on ImageNet. A variety of self-supervised pre-trained models are tested with either a ResNet or a ViT backbone. The classification is done either with linear probing or with the k-NN classifier.

I guess the key take-aways are:
- K-NN performs better on ViT features than on ResNet features.
- Decreasing the patch size of the ViT yields a larger improvement than switching to a larger backbone, but at the cost of slower inference.
Video Instance Segmentation
An important experiment is the video segmentation task, since the paper is about the ViT’s capability to capture semantic segmentation in its features when trained with unsupervised methods. Or let’s say that’s what is claimed 😁

Looking at those results, I am missing two further experiments:
- It would be nice to see a comparison of a supervised ResNet50 and a self-supervised ResNet50 in the DINO framework to support their claim that the ViT is superior to the ResNet architecture.
- It would also be great to see the same set of ViT backbones for supervised as for self-supervised to see the impact on patch-size and model size.
But as I always say: asking questions is easy 😁 In real-world projects the authors often face resource constraints and project deadlines so not every single little detail can be covered!
Probing the Self-Attention Map
In this experiment the authors investigated the self-attention maps of different heads in the multi-head self-attention layers of the ViT. They visualize the attention maps from selected heads from the last layer of ViT-S/8, those of the learned [CLS] token to be precise.

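As a rough sketch of how such maps can be extracted (assuming you capture the attention tensor of the last block, e.g. with a forward hook, and that token 0 is the [CLS] token as in the standard ViT layout):

```python
import torch

def cls_attention_maps(attn: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """Reshape the [CLS] token's attention over the patch tokens into per-head 2D maps.

    attn: attention weights of shape (batch, heads, tokens, tokens),
          where token 0 is [CLS] and the remaining tokens are image patches.
    """
    cls_to_patches = attn[:, :, 0, 1:]  # (batch, heads, num_patches)
    return cls_to_patches.reshape(attn.shape[0], attn.shape[1], grid_h, grid_w)
```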
Other Experiments
In other experiments, DINO improved upon the supervised baseline. Those tasks include image retrieval and copy detection.
Ablations
For their ablation study the authors experiment with the ViT-S model.
Importance of Patch Size
Recall that a vision transformer takes a patchified version of the input image, transforms each patch into a token, and then applies a transformer with its self-attention mechanism. Patchifying was a trick by the ViT authors to reduce compute requirements at the cost of some performance, making transformers applicable to image data.
DINO shows that smaller patch sizes increase performance while decreasing throughput (the number of images that can be processed per second), which is exactly what ViT claims.

Intuitively I’d say this is no surprise: with smaller patches the effective resolution of the token grid increases, so you end up with more tokens to attend to and hence with more fine-grained attention maps, but also with more compute per image.
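A quick back-of-the-envelope check for the 224×224 global views:

```python
# Number of patch tokens for a 224x224 input at different patch sizes
for patch_size in (16, 8):
    tokens = (224 // patch_size) ** 2
    print(f"ViT-S/{patch_size}: {tokens} patch tokens (+1 [CLS] token)")
# ViT-S/16: 196 patch tokens, ViT-S/8: 784 patch tokens
```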
Different Teacher Update Rules
The teacher in DINO is updated by calculating the exponential moving average from the updated student and the current teacher. This is the "momentum encoder" approach they refer to.
When using the momentum encoder and plotting the accuracy of teacher and student during training, one sees that the teacher performs better throughout the entire process. From this we can hypothesize:
- the teacher can provide a strong learning signal to the student.
- an improving student improves the teacher due to the EMA update rule (co-distillation).
- One can use the teacher as the final model, which performs better but has the same architecture as the student, hence no change in compute requirements.

They also experiment with three other update rules: copying the student’s weights to the teacher, using the student’s weights from the previous iteration of the optimizer, and using the student’s weights from the previous epoch.
Multi-Crop vs. Time and GPU Memory
As mentioned earlier, DINO inputs multiple cropped views of the same image and feeds the global views into the teacher and the local views into the student. In this ablation, the authors experiment with different amounts of local views and report the impact on performance, training time and peak memory per GPU.

Avoiding Collapse
In this ablation the authors evaluated the role of their stabilizing measures to avoid collapsing solutions: centering and sharpening.
To do so, they decompose the cross-entropy into an entropy term and a Kullback-Leibler (KL) divergence term. The KL divergence measures the difference between two probability distributions; if it is 0, the two distributions are equal.
The intuition behind this is the following: if the KL divergence between the teacher’s and the student’s output distributions converges to zero, the outputs have collapsed to a constant and there is no learning signal left to update the weights of the student.
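Written out, this is the standard decomposition of the cross-entropy (with h denoting the entropy of the teacher’s output):

$$H(P_t, P_s) = h(P_t) + D_{\mathrm{KL}}(P_t \,\|\, P_s)$$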

Effect of Batch Size
An interesting property is that DINO can be trained with small batch sizes without a large drop in performance. Being less dependent on the batch size than contrastive self-supervised learning was actually one of the motivations behind BYOL, a paper DINO builds upon.

Contrastive methods like CLIP and GLIP rely on a large number of negative samples for each positive sample to avoid collapsing solutions; the more negative samples per optimizer update step (and hence per batch), the better they work.
Conclusion
In conclusion, DINO is a knowledge-distillation framework and a visual foundation model that exploits interesting properties of ViTs; it is the predecessor of one of today’s best-performing foundation models, DINOv2. DINO’s framework consists of a student and a teacher model that act on different views of the same image, with extra measures to deal with the inherent instabilities of similarity-learning approaches. The experiments show that DINO outperforms other self-supervised pre-trained models on various tasks.
Further Readings & Resources
Papers
In the meantime an improved version of DINO has been released: DINOv2.
Paper Walkthroughs
You might also like my other paper walkthroughs covering concepts we discussed in this article:
GLIP: Introducing Language-Image Pre-Training to Object Detection
BYOL -The Alternative to Contrastive Self-Supervised Learning
Segment Anything – Promptable Segmentation of Arbitrary Objects