A simple way to learn generally from a large training set: DINO
This post describes a self-supervised learning method: self-distillation with no labels (DINO).
While the method (DINO [1]) itself is simple and straightforward, there are a few prerequisites to understanding it: 1) supervised learning, 2) self-supervised learning, 3) knowledge distillation, and 4) the vision transformer. If you are already familiar with all of them, you can skip ahead.
Supervised Learning
Supervised learning is straightforward. We have a bunch of images, and for each image we have a label. We then train a model by telling it which image belongs to which label. In this setting, called image classification, the learning objective is the cross-entropy loss between the one-hot label and the predicted probability distribution. By taking the index of the maximum value of the probability distribution, we obtain the predicted label of an image.
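To make this concrete, here is a minimal PyTorch sketch of that training objective. The toy linear model, tensor shapes, and variable names are my own illustration rather than anything from the paper; the point is only to show the cross-entropy loss against integer labels and the argmax that turns the probability distribution into a predicted label.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 1000  # a predefined set of categories

# Toy stand-in for an image classifier (a real model would be a CNN or ViT).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, num_classes),
)

images = torch.randn(8, 3, 224, 224)           # a batch of 8 RGB images
labels = torch.randint(0, num_classes, (8,))   # one ground-truth label per image

logits = model(images)                         # unnormalized class scores
loss = F.cross_entropy(logits, labels)         # cross-entropy vs. the (one-hot) labels
loss.backward()                                # the label is the only learning signal

predicted = logits.argmax(dim=1)               # index of the max score = predicted label
```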
But the problem with supervised learning is that the model reduces the rich visual information contained in an image to a single category selected from a predefined set of a few thousand categories. In other words, the only learning signal is the label prediction.