
A simple way to learn generally from a large training set: DINO

This post describes a self-supervised learning method: self-distillation with no labels (DINO).

KamWoh Ng
Towards Data Science
9 min read · Apr 6, 2022

While the method (DINO [1]) itself is simple and straightforward, there are some prerequisites to understanding it: 1) supervised learning, 2) self-supervised learning, 3) knowledge distillation, and 4) the vision transformer. If you already know all of these, you can skip ahead to the DINO section.

Supervised Learning

Supervised learning is straightforward: we have a bunch of images and, for each image, a label. We then train a model by telling it which image belongs to which label. In this setting, called image classification, the learning objective is the cross-entropy loss between the one-hot label and the predicted probability distribution. By taking the index of the maximum value of the probability distribution, we obtain the predicted label for an image.

Supervised Learning for Image Classification. Image by Author.
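To make this concrete, here is a minimal sketch of that training step, assuming PyTorch; the model, batch, and hyperparameters are placeholders for illustration, not those used in the DINO paper.

```python
import torch
import torch.nn as nn

num_classes = 1000                        # a predefined set of categories
model = nn.Sequential(                    # stand-in for any image classifier
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, num_classes),
)
criterion = nn.CrossEntropyLoss()         # cross-entropy between one-hot label and prediction
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.randn(8, 3, 224, 224)          # dummy batch of 8 images
labels = torch.randint(0, num_classes, (8,))  # one label per image

logits = model(images)                    # (8, num_classes) unnormalized scores
loss = criterion(logits, labels)          # the learning signal is only the label
loss.backward()
optimizer.step()

predicted = logits.argmax(dim=1)          # index of the max value = predicted label
```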

But the problem with supervised learning is that the model reduces the rich visual information contained in an image to a single category selected from a predefined set of a few thousand categories. In other words, the only learning signal is the label prediction.

Self-Supervised Learning



