Semi-supervised learning is a machine learning technique for deriving useful information from labelled and unlabelled data.
In this tutorial:
- You will learn what supervised, unsupervised, semi-supervised, and self-supervised learning are.
- You will go step by step through PyTorch code for Bootstrap your own latent (BYOL) – a self-supervised method we will use for semi-supervised learning – which you can implement and run yourself in Google Colab, with no paid cloud services or GPU of your own needed!
- You will learn the basic theory behind this method.
Before doing this tutorial, you should have basic familiarity with supervised learning on images with PyTorch.
What is semi-supervised learning, and why do we need it?
Generally speaking, Machine Learning methods can be divided into three categories:
- supervised learning
- unsupervised learning
- reinforcement learning
We will omit reinforcement learning here and concentrate on the first two types.
In supervised learning, our data consists of labelled objects. A machine learning model is tasked with learning how to assign labels (or values) to objects.
Examples: 1) A hospital has ECG readings labelled with ICD-10 codes. Based on the ECG reading, we want to automatically pre-diagnose a patient. 2) A bank has data about creditors – their financial status, how much they owe, whether they are paying on time, etc. The bank wants to assess how much more money it can lend someone.
On the contrary, unsupervised learning deals only with unlabelled objects.
Example: We can task a computer with clustering images into 10 categories without specifying what these categories mean (k-means clustering).
Semi-supervised learning falls between these two: some objects are labelled, but the majority of them are not. The imbalance between labelled and unlabelled data comes from the fact that labelling data is usually resource-intensive.
Example: We have a dataset of tweets. Some of them are annotated with positive, negative, or neutral sentiment. Unfortunately, annotating is time- and cost-intensive – we need to pay annotators and cross-check their answers for correctness. Therefore, most of the tweets are not labelled: it is relatively cheap and easy to download them, but not so cheap to annotate them.
There is also another type of learning: self-supervised. We can talk about self-supervised learning when we come up with some supervised task that we do not necessarily want to solve but that can serve as a pretext for a model to learn. Self-supervised learning usually falls into the category of unsupervised learning and is used to enhance supervised learning.
Example: Let’s assume we have a big dataset with unlabelled images. We want to learn a model that extracts useful features from these images that could help us with other tasks (like cat/dog recognition). We randomly apply one of 9 different distortions to each picture (or leave it undistorted, so there are 10 possibilities). We then task a model with recognising which distortion (if any) was applied. We hope the model will learn to extract useful features that can be reused for other tasks (like object recognition).
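To make this concrete, here is a small, purely illustrative sketch of such a pretext task with torchvision – the particular distortions are made up for the example:

```python
import random
import torchvision.transforms as T

# a made-up set of distortions; index 0 means "no distortion"
distortions = [
    lambda img: img,
    T.RandomRotation((90, 90)),
    T.RandomRotation((180, 180)),
    T.RandomRotation((270, 270)),
    T.GaussianBlur(kernel_size=5),
    T.Grayscale(num_output_channels=3),
    T.RandomHorizontalFlip(p=1.0),
    T.RandomVerticalFlip(p=1.0),
    T.ColorJitter(brightness=0.8),
    T.RandomInvert(p=1.0),
]

def pretext_sample(img):
    # pick a distortion at random; its index becomes the label the model has to predict
    label = random.randrange(len(distortions))
    return distortions[label](img), label
```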
STL-10 – a benchmark dataset for semi-supervised learning
Before we explore methods, let’s look at the dataset we will use. The STL-10 dataset was created by researchers at Stanford University and inspired by CIFAR-10, which you might have heard of. STL-10 consists of 100,000 unlabelled images, 5,000 labelled images for training, and 8,000 images for testing. Images are spread equally across ten classes.
Open up Google Colab and create a new notebook with a GPU environment. First, mount a Google Drive for convenience. STL-10 is quite heavy, and redownloading it every time you run your environment might be inconvenient. Run the code below in the cell and follow the instructions.
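The cell for mounting Google Drive is the standard Colab snippet:

```python
from google.colab import drive

# mounts your Google Drive under /content/drive (you will be asked to authorise access)
drive.mount('/content/drive')
```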
Then, create a folder for the STL-10 dataset.
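For example (the exact location on your Drive is up to you):

```python
import os

# a possible location for the dataset on the mounted Drive
DATA_DIR = '/content/drive/MyDrive/stl10'
os.makedirs(DATA_DIR, exist_ok=True)
```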
Download the STL-10 dataset. As you can see, we are also defining a transformation here. By default, all images are PIL Image objects, which are not very handy for neural networks. Therefore, we are transforming them into tensors.
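A minimal version of this cell could look as follows (the variable names are only a suggestion):

```python
import torchvision
import torchvision.transforms as T

to_tensor = T.ToTensor()  # converts a PIL Image to a torch.Tensor with values in [0, 1]

# the unlabelled split is used for self-supervised learning,
# the train/test splits for supervised training and evaluation
unlabeled_ds = torchvision.datasets.STL10(DATA_DIR, split='unlabeled', download=True, transform=to_tensor)
train_ds = torchvision.datasets.STL10(DATA_DIR, split='train', download=True, transform=to_tensor)
test_ds = torchvision.datasets.STL10(DATA_DIR, split='test', download=True, transform=to_tensor)
```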
We should also define DataLoaders for these datasets. Try to do this yourself! Fill out missing parts of code.
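If you get stuck, one possible solution is:

```python
from torch.utils.data import DataLoader

BATCH_SIZE = 128

unlabeled_dl = DataLoader(unlabeled_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
test_dl = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
```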
Great! As you can see, the batch size is set to 128 – this value should not crash the Google Colab environment, but feel free to experiment.
Get a supervised baseline
First, we must obtain a baseline with only supervised learning to compare it with semi-supervised learning. Use the code below. If you need an explanation of this code, please let me know in the comments.
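A minimal sketch of such a baseline – a resnet18 trained from scratch on the 5,000 labelled images, with Adam, cross-entropy loss and 10 epochs chosen as reasonable defaults rather than taken from the original code:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# a plain resnet18 trained only on the labelled training split
model = resnet18()
model.fc = nn.Linear(512, 10)   # ten STL-10 classes
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def run_epoch(dl, train=True):
    model.train(train)
    total_loss, correct = 0.0, 0
    with torch.set_grad_enabled(train):
        for images, labels in dl:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * images.size(0)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
    return total_loss / len(dl.dataset), correct / len(dl.dataset)

for epoch in range(10):  # the number of epochs is an assumption; tune it yourself
    train_loss, train_acc = run_epoch(train_dl, train=True)
    test_loss, test_acc = run_epoch(test_dl, train=False)
    print(f'epoch {epoch}: train acc {train_acc:.3f}, test acc {test_acc:.3f}')
```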
Bootstrap your own latent
Bootstrap your own latent (BYOL) is a self-supervised method for representation learning that was first published in June 2020 and then presented at a top-tier scientific conference – NeurIPS 2020. We will implement this method.
A rough overview
BYOL has two networks – online and target. They learn from each other. We take an image and perform two different augmentations (t and t’). One augmented picture (v) is put into an online network, and the second augmented picture (v’) is fed to the target network.
[Figure 1: The BYOL architecture – one augmented view (v) is processed by the online network (encoder, projector, predictor), the other view (v’) by the target network (encoder, projector).]
The online network will return a pseudo-prediction (it’s pseudo because we have no actual labels to predict here), and the target network will return a projection. Both outputs need to have the same dimensions. The output of the target network will serve as our ground truth. We calculate the mean square error (MSE) between the outputs of these networks.
Then, we perform backpropagation through the online network but leave the target network for now. By doing so, the online network learns to predict the output of the target network.
After backpropagation, the target network is updated with an exponential moving average of the online network’s parameters. We’ll elaborate on this later.
The online network learns "quickly" from the target network, and the target network learns "slowly" from the online network. The online network tries to be as close as possible to the output of the target network.
The intuition behind this mechanism is that the outputs of these two networks should be similar – both of them get the same images but with different augmentations. If we have a picture of a cat, regardless of how we preprocess it (to some reasonable degree), it’s still a photo of a cat. The online network learns to predict the target network’s projection of the same object seen under a different augmentation.
At the very end, part of the online network (the encoder, fθ) will be taken out and used for supervised learning.
If you did not fully understand this explanation, don’t worry – we will go through it step-by-step so you will have a chance to learn.
Contrastive learning
It is also worth noting that this architecture is an example of contrastive learning. Contrastive learning is a technique in which we try to make the representations (embeddings) of similar objects as close as possible and the representations of distinct objects as different as possible. You can learn more in this Medium post. The vital difference in BYOL is that this method does not use negative pairs. With that improvement, it is computationally lighter and can therefore be demonstrated in a free Google Colab environment.
Augmentations
As mentioned earlier, two different augmentations are performed on the image. To be more specific, we sample two transformations from two different distributions, t~Τ and t’~Τ’. The paper published in the NeurIPS proceedings does not elaborate on these augmentations, but the pre-print published on arXiv does.
If you want to stay up-to-date with state-of-the-art methods, you should be able to read scientific papers. Therefore, I encourage you to review Section B (p. 16–17) of the pre-print and implement the missing code in the functions. Some values cannot be taken from the article because the information is missing. The documentation for torchvision.transforms should be helpful.
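For reference, here is one possible implementation; the crop scale, blur kernel size and jitter strengths are partly assumptions (the pre-print targets ImageNet-sized images), so adjust them as you see fit:

```python
import torchvision.transforms as T

IMG_SIZE = 96  # STL-10 images are 96x96

def byol_transform(blur_p, solarize_p):
    # roughly follows Section B of the pre-print; some values are assumptions for STL-10
    return T.Compose([
        T.RandomResizedCrop(IMG_SIZE, scale=(0.08, 1.0)),
        T.RandomHorizontalFlip(p=0.5),
        T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.RandomApply([T.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0))], p=blur_p),
        T.RandomSolarize(threshold=0.5, p=solarize_p),
    ])

augment = byol_transform(blur_p=1.0, solarize_p=0.0)        # t  ~ T
augment_prime = byol_transform(blur_p=0.1, solarize_p=0.2)  # t' ~ T'
```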
Skeleton of a BYOL architecture
Let’s have a look at Figure 1 again.

So, we have images (the STL-10 dataset) and augmentations. What about the rest? The encoder (f) can be any network that transforms the given images into features (a representation), e.g., a resnet18. The projection (g) is responsible for creating smaller representations from the output of the representation network (the encoder). The prediction layer makes pseudo-predictions from the projections.
Note that this architecture is asymmetric. The authors hypothesise that this prevents collapsed solutions (e.g., outputting the same vector for every image would give an MSE of 0). The prediction layer needs to have the same input and output dimensions so that the MSE between the output of the target’s projection network and the online’s prediction network can be calculated.
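Below is one possible skeleton of the class. The attribute and method names (`online_encoder`, `mlp`, `_update_target`, and so on), the optimiser (Adam) and τ = 0.99 are my own choices, used consistently in the snippets that follow – treat this as a sketch rather than the only way to structure it:

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class BYOL(nn.Module):
    # NOTE: this skeleton only becomes runnable once the stubbed methods
    # below are filled in (we do that step by step in the next sections).
    def __init__(self, device, tau=0.99, lr=3e-4):
        super().__init__()
        self.device = device
        self.tau = tau  # decay rate of the exponential moving average (more on this later)

        # online network: encoder f, projector g and predictor q
        self.online_encoder = resnet18()
        self.online_encoder.fc = nn.Identity()  # 512-d features instead of 1000-class logits
        self.online_projector = self.mlp(512)
        self.online_predictor = self.mlp(256)

        # target network: encoder and projector only, initialised as copies of the online ones
        self.target_encoder = copy.deepcopy(self.online_encoder)
        self.target_projector = copy.deepcopy(self.online_projector)

        # the two augmentation pipelines from the previous section: t ~ T and t' ~ T'
        self.augment = augment
        self.augment_prime = augment_prime

        self.mse = nn.MSELoss(reduction='sum')  # the "sum" reduction mentioned later in the text
        self.optimizer = torch.optim.Adam(
            list(self.online_encoder.parameters())
            + list(self.online_projector.parameters())
            + list(self.online_predictor.parameters()),
            lr=lr,
        )
        self.to(device)

    def mlp(self, in_features):
        ...  # builds the projector / predictor (next section)

    def forward(self, images):
        ...  # BYOL loss for a batch of images (see "Forward and backward propagation")

    def _update_target(self):
        ...  # EMA update of the target network

    def train_one_epoch(self, dl):
        ...

    def validate(self, dl):
        ...

    def fit(self, train_dl, valid_dl, epochs=1):
        ...
```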
Implementing the projector
As you can see above, `BYOL.mlp` should return the projector and the predictor. Let’s do this then. The pre-print in Section 3.3 Implementation details states:
[…] the representation y is projected to a smaller space by a multi-layer perceptron (MLP) gθ, and similarly for the target projection gξ. This MLP consists in a linear layer with output size 4096 followed by batch normalization, rectified linear units (ReLU), and a final linear layer with output dimension 256.
This translates to:
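For example, as a method of the `BYOL` class sketched above (the hidden size of 4096 and the output size of 256 come straight from the quote):

```python
def mlp(self, in_features, hidden_size=4096, projection_size=256):
    # linear -> batch norm -> ReLU -> linear, as described in the pre-print
    return nn.Sequential(
        nn.Linear(in_features, hidden_size),
        nn.BatchNorm1d(hidden_size),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_size, projection_size),
    )
```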
Fitting the model
We will split fitting the model to the unlabelled data into four steps:
- Train (fit) model on unlabelled data
- Validate on unlabelled train data
- Validate on validation data (labels will be omitted)
- Print results
All the above steps will be repeated `epochs` times.
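A possible `fit` method following these four steps could look like this:

```python
def fit(self, train_dl, valid_dl, epochs=1):
    for epoch in range(epochs):
        self.train_one_epoch(train_dl)           # 1. train on unlabelled data
        train_loss = self.validate(train_dl)     # 2. validate on unlabelled train data
        valid_loss = self.validate(valid_dl)     # 3. validate on validation data (labels are omitted)
        # 4. print results
        print(f'epoch {epoch}: train loss {train_loss:.4f}, valid loss {valid_loss:.4f}')
```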
Forward and backward propagation
Now, we will take care of the most crucial part of this code – self-supervised learning in `train_one_epoch`. Keep in mind that you can look at Figure 1 and compare it with the code. First, we have to set both networks to training mode.
Then, we need to iterate through the batches returned by the `DataLoader` and put them on the GPU.
The forward pass will be implemented in a separate function, as we will reuse it in the validation process (DRY rule).
We run backpropagation on the loss tensor and then update the parameters of the target network.
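Putting all of these steps together, a possible `train_one_epoch` looks as follows (`self._update_target`, which performs the update of the target network, is sketched a bit later):

```python
def train_one_epoch(self, dl):
    # set both the online and the target parts to training mode
    self.online_encoder.train()
    self.online_projector.train()
    self.online_predictor.train()
    self.target_encoder.train()
    self.target_projector.train()

    epoch_loss = 0.0
    for images, _ in dl:                 # labels of the unlabelled split are placeholders anyway
        images = images.to(self.device)  # move the batch to the GPU

        loss = self.forward(images)      # forward pass, shared with validation (DRY)

        self.optimizer.zero_grad()       # backpropagation through the online network...
        loss.backward()
        self.optimizer.step()

        self._update_target()            # ...and the EMA update of the target network
        epoch_loss += loss.item()

    return epoch_loss / len(dl.dataset)
```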
Let’s go to the `forward` method. First, we need to augment images with two different augmentation functions. We use `torch.no_grad()` as we don’t want to perform backpropagation through these transformations. The two differently augmented images are saved to the `v` and `v_prime` variables.
Image `v` is fed into the online network, which returns a pseudo-prediction. Please note that we do not use `torch.no_grad()` here, as we will be doing a backward pass on this network. Image `v_prime` is fed into the target network with `torch.no_grad()`. Both outputs are [normalize](https://Pytorch.org/docs/stable/generated/torch.nn.functional.normalize.html)d, and the mean square error is calculated on these outputs (or rather the sum of squared errors, as we set the `sum` reduction in `__init__`).
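A possible version of the `forward` method described above (for simplicity, the same randomly drawn augmentation is applied to the whole batch):

```python
def forward(self, images):
    # two differently augmented views of the same batch; no gradients through the augmentations
    with torch.no_grad():
        v = self.augment(images)
        v_prime = self.augment_prime(images)

    # online branch: encoder -> projector -> predictor (gradients are tracked here)
    prediction = self.online_predictor(self.online_projector(self.online_encoder(v)))

    # target branch: encoder -> projector, no gradients
    with torch.no_grad():
        projection = self.target_projector(self.target_encoder(v_prime))

    # normalise both outputs and compute the summed squared error
    prediction = F.normalize(prediction, dim=-1)
    projection = F.normalize(projection, dim=-1)
    return self.mse(prediction, projection)
```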
The paper also states that:
We symmetrize the loss […] by separately feeding v′ to the online network and v to the target network to compute [loss].
The code below introduces this symmetric loss.
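One way to write it is to move the computation from the previous snippet into a small helper (here called `_loss` – a name of my own choosing) and call it twice with the views swapped:

```python
def _loss(self, online_view, target_view):
    prediction = self.online_predictor(self.online_projector(self.online_encoder(online_view)))
    with torch.no_grad():
        projection = self.target_projector(self.target_encoder(target_view))
    prediction = F.normalize(prediction, dim=-1)
    projection = F.normalize(projection, dim=-1)
    return self.mse(prediction, projection)

def forward(self, images):
    with torch.no_grad():
        v = self.augment(images)
        v_prime = self.augment_prime(images)
    # v through the online network and v' through the target network...
    loss = self._loss(v, v_prime)
    # ...plus, symmetrically, v' through the online network and v through the target network
    loss = loss + self._loss(v_prime, v)
    return loss
```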
Now, we are left with updating the target network. Remember the `self.tau` defined in `__init__`? This is a decay parameter. The parameters ξ of the target network are updated in the i-th step with the parameters θ of the online network:
ξ ← τ·ξ + (1 − τ)·θ
This equation might be already familiar to you. It defines the exponential moving average over a series of parameters θ updated in each step (batch). It’s used for exponential smoothing – a process in which we smooth out time series. In this context, the parameters of the target network "smooth out" the "rapid" changes of the parameters in the online network.
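In code, this update can look as follows (only the encoder and the projector exist in the target network, so only their parameters are updated):

```python
@torch.no_grad()
def _update_target(self):
    # xi <- tau * xi + (1 - tau) * theta, parameter by parameter
    online_modules = [self.online_encoder, self.online_projector]
    target_modules = [self.target_encoder, self.target_projector]
    for online, target in zip(online_modules, target_modules):
        for theta, xi in zip(online.parameters(), target.parameters()):
            xi.mul_(self.tau).add_((1.0 - self.tau) * theta)
```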
This is the final code for the `BYOL` class, which also includes the validation process and the code for running training. We will run self-supervised learning for only one epoch, as it usually takes one hour to do so on Google Colab. You can change `train_loss = self.validate(train_dl)` to `train_loss = 0` to cut some time.
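The full class is simply the skeleton above with all methods filled in. A possible `validate` method and driver code are sketched below; using the test set as the “validation data” here is an assumption you can change:

```python
@torch.no_grad()
def validate(self, dl):
    # same forward pass as in training, but in eval mode and without any updates
    self.online_encoder.eval()
    self.online_projector.eval()
    self.online_predictor.eval()
    self.target_encoder.eval()
    self.target_projector.eval()

    total_loss = 0.0
    for images, _ in dl:
        images = images.to(self.device)
        total_loss += self.forward(images).item()
    return total_loss / len(dl.dataset)


# run self-supervised learning for a single epoch
byol = BYOL(device=device)
byol.fit(unlabeled_dl, test_dl, epochs=1)
```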
The last layer of resnet18 is substituted with an `Identity` layer – with that, we will get the features extracted by this network instead of predictions for 1000 classes.
Semi-supervised learning
Now, let’s combine self-supervised learning with supervised learning. First of all, we take out the online encoder (fθ) from the BYOL class and create a copy. As we want to predict ten classes, we substitute the last `Identity` layer with a `Linear` layer. If you want to freeze the encoding part of the network, you can do so by uncommenting the code.
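A possible version of this cell, with the freezing part left commented out:

```python
import copy

# take the encoder pretrained with BYOL and make an independent copy of it
semi_model = copy.deepcopy(byol.online_encoder)
semi_model.fc = nn.Linear(512, 10)  # replace the Identity layer with a 10-class head

# uncomment to freeze the encoder and train only the new linear head:
# for name, param in semi_model.named_parameters():
#     if not name.startswith('fc'):
#         param.requires_grad = False

semi_model = semi_model.to(device)
# from here on, train it exactly like the supervised baseline, just starting from `semi_model`
```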
Results
Now, we plot figures comparing the performance of supervised and semi-supervised learning using a colour-blind-friendly palette.
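Assuming the test accuracies of both runs were collected into lists (hypothetically named `supervised_test_acc` and `semi_supervised_test_acc` here), a simple comparison plot with colours from the colour-blind-friendly Okabe–Ito palette could be:

```python
import matplotlib.pyplot as plt

epochs = range(1, len(supervised_test_acc) + 1)

plt.figure(figsize=(8, 5))
plt.plot(epochs, supervised_test_acc, color='#0072B2', label='supervised')            # blue
plt.plot(epochs, semi_supervised_test_acc, color='#E69F00', label='semi-supervised')  # orange
plt.xlabel('epoch')
plt.ylabel('test accuracy')
plt.legend()
plt.show()
```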
As you can see below, semi-supervised learning got slightly better results than supervised learning.
[Figures: comparison of supervised and semi-supervised performance]
Conclusions
We went from using only labelled data for supervised training to leveraging unlabelled data with self-supervised and semi-supervised learning. As you can see, we did not get a significant difference in results, but we still showed that using semi-supervised learning can improve results in some cases.
I encourage you to experiment with this code – maybe change the optimiser, τ (tau), or the encoder architecture? If you have some exciting findings or if this post helped you in your use case, please leave a comment. I’d like to hear about it.
Thank you for reading this tutorial. If you liked it, please follow me on Medium – it will help me grow my blog and continue my work. Comments, feedback, and new ideas are much appreciated!