Semi-supervised learning made simple

Learn how to build your own semi-supervised model from scratch in PyTorch

Maciej Dzieżyc
Towards Data Science


Semi-supervised learning is a machine learning technique for deriving useful information from both labelled and unlabelled data.

In this tutorial:

  • You will learn what supervised, unsupervised, semi-supervised, and self-supervised learning are.
  • You will go step-by-step through PyTorch code for BYOL — a self-supervised learning method that you can implement and run yourself in Google Colab — no paid cloud service or your own GPU is needed!
  • You will learn the basic theory behind BYOL.

Before doing this tutorial, you should have basic familiarity with supervised learning on images with PyTorch.

What is semi-supervised learning and why do we need it?

Generally speaking, machine learning methods can be divided into three categories:

  • supervised learning
  • unsupervised learning
  • reinforcement learning

We will omit reinforcement learning here and concentrate on the first two types.

In supervised learning, our data consists of labelled objects. A machine learning model is tasked with learning how to assign labels (or values) to objects.

Examples:
1) A hospital has ECG readings labelled with ICD-10 codes. Based on an ECG reading, we want to automatically pre-diagnose a patient.
2) A bank has data about creditors — their financial status, how much they owe, whether they pay on time, etc. The bank wants to assess how much more money it can lend someone.

In contrast, unsupervised learning deals only with unlabelled objects.

Example: We can task a computer with clustering images into 10 categories without specifying what these categories mean (k-means clustering).

Semi-supervised learning falls in between these two: some objects are labelled, but the majority of them are not. The dominance of unlabelled data comes from the fact that labelling data is usually resource-intensive.

Example: We have a dataset of tweets. Some of them are annotated with positive, negative or neutral sentiment. Unfortunately, annotating is time- and cost-intensive — we need to pay annotators and also cross-check their answers for correctness. Therefore most of the tweets are not labelled: it is relatively cheap and easy to download them, but not so cheap to annotate them.

There is also another type of learning: self-supervised. We talk about self-supervised learning when we come up with a supervised task which we do not necessarily want to solve, but which can serve as a pretext for a model to learn from. Self-supervised learning usually falls into the category of unsupervised learning and is used to enhance supervised learning.

Example: Let’s assume we have a big dataset of unlabelled images. We want a model to extract some useful features from these images which could help us in other tasks (like cat/dog recognition). We randomly apply one of 9 different distortions to each image (or do not distort it at all, so there are 10 possibilities). We then task a model with recognising which distortion (if any) was applied. The hope is that the model will learn to extract features which can then be reused somewhere else (like cat/dog recognition).

STL-10 — a benchmark dataset for semi-supervised learning

Before we get into methods, let us have a look at the dataset we will be using. The STL-10 dataset was created by researchers at Stanford University and inspired by CIFAR-10, which you might have heard of. STL-10 consists of 100,000 unlabelled images, 5,000 labelled images for training, and 8,000 images for testing. The labelled images are spread equally across ten classes.

Open up Google Colab and create a new notebook with a GPU runtime. First, mount Google Drive for convenience. STL-10 is quite heavy, and re-downloading it every time you restart your environment would be inconvenient. Run the code below in a cell and follow the instructions.

Mount Google Drive in Google Colab
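Mounting Drive takes only a couple of lines; a minimal version of that cell could look like this:

```python
from google.colab import drive

# Colab will ask you to authorise access to your Google Drive
drive.mount('/content/drive')
```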

Then, create a folder for the STL-10 dataset.

Create a folder/directory for your mini-project and STL-10 dataset
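A simple sketch, assuming everything lives in a single folder on your Drive (the path below is just an example; pick your own):

```python
import os

# Example location on the mounted Drive -- adjust to your own folder structure
DATA_DIR = '/content/drive/MyDrive/byol_stl10'
os.makedirs(DATA_DIR, exist_ok=True)
```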

Download the STL-10 dataset. As you can see, we also define a transformation here. We do so because, by default, all images are PIL Image objects, which are not very handy for neural networks. Therefore we transform them into tensors.

Download the dataset and load it to variables
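One possible version of this cell, using torchvision's built-in STL10 dataset (the variable names are mine):

```python
import torchvision
from torchvision import transforms

# Convert PIL Images to tensors so they can be fed to a neural network
transform = transforms.ToTensor()

unlabelled_ds = torchvision.datasets.STL10(DATA_DIR, split='unlabeled', download=True, transform=transform)
train_ds = torchvision.datasets.STL10(DATA_DIR, split='train', download=True, transform=transform)
test_ds = torchvision.datasets.STL10(DATA_DIR, split='test', download=True, transform=transform)
```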

We should also define DataLoaders for these datasets. Try to do this yourself! Fill in the missing parts of the code.

Create DataLoader for the STL-10 dataset
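A possible solution, shuffling the training loaders and keeping the test loader deterministic:

```python
from torch.utils.data import DataLoader

BATCH_SIZE = 128

unlabelled_dl = DataLoader(unlabelled_ds, batch_size=BATCH_SIZE, shuffle=True)
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False)
```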

Great! As you can see, the batch size is set to 128 — this value was obtained experimentally as one that does not crash the free Google Colab environment, but feel free to experiment.

Get a supervised baseline

First, we need to obtain a baseline with only supervised learning to compare it with semi-supervised learning. Use the code below. If you need an explanation of this code, please let me know in the comments.

Supervised learning baseline
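A minimal baseline sketch: a resnet18 trained from scratch on the 5,000 labelled images. The number of epochs, the optimiser and the learning rate here are illustrative choices, not necessarily the ones behind Figure 2.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# resnet18 with a 10-class head instead of the original 1000-class one
baseline = resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(baseline.parameters(), lr=3e-4)

def test_accuracy(model, loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.size(0)
    return correct / total

for epoch in range(10):
    baseline.train()
    for x, y in train_dl:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(baseline(x), y)
        loss.backward()
        optimizer.step()
    print(f'epoch {epoch}: test accuracy {test_accuracy(baseline, test_dl):.3f}')
```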

Bootstrap your own latent

Bootstrap your own latent (BYOL) is a self-supervised method for representation learning which was first published as a pre-print in June 2020 and then presented at the top-tier scientific conference NeurIPS 2020. We will implement this method.

A rough overview

BYOL has two networks — online and target. They learn from each other. We take an image and perform two different augmentations on it (t and t’). One augmented picture (v) is fed to the online network, and the second augmented picture (v’) is fed to the target network.

Figure 1 — An overview of architecture (based on Fig. 2 from BYOL paper; image created by the author in Lucidchart)

The online network returns a pseudo-prediction (it’s “pseudo” because we have no actual labels to predict here), and the target network returns a projection. Both outputs need to have exactly the same dimensions. The output of the target network serves as our ground truth. We calculate the mean squared error (MSE) between the outputs of these networks.

Then, we perform backpropagation through the online network but leave the target network alone for now. By doing so, the online network learns to predict the output of the target network.

After backpropagation, the target network is updated with an exponential moving average of the online network’s parameters. We’ll elaborate on what that means later.

The online network learns “quickly” from the target network, and the target network learns “slowly” from the online network. The online network tries to be as close as possible to the output of the target network.

The intuition behind this mechanism is that the outputs of these two networks should be similar — both of them get the same images but with different augmentations. If we have an image of a cat, regardless of how we preprocess it (to some reasonable degree), it’s still a photo of a cat. The online network learns the projection of the target network for the same objects but with different “exposition” in an image.

At the very end, part of the online network (the encoder, f) will be taken out and used for supervised learning.

If you did not fully understand this explanation, don’t worry — we will be going step-by-step through this, so you will have a chance to learn this.

Contrastive learning

It is also worth noting that this architecture is an example of contrastive learning. Contrastive learning is a technique in which we try to make the representations (embeddings) of similar objects as similar as possible, and the representations of distinct objects as different as possible. You can learn more in this Medium post. The vital difference in BYOL is that this method does not use negative pairs. With that improvement, it is computationally lighter and, therefore, can be demonstrated in a free Google Colab environment.

Augmentations

As mentioned earlier, two different augmentations are performed on the image. To be more specific, we sample two transformations from two different distributions, t~Τ and t’~Τ’. The paper published in the NeurIPS proceedings does not elaborate on these augmentations, but the pre-print published on arXiv does.

If you want to stay up-to-date with state-of-the-art methods, you should be able to read scientific papers. Therefore, I encourage you to look at Section B (pp. 16–17) of the pre-print and implement the missing code in the functions. Some of the values are not taken from the article because the information is missing there. The documentation for torchvision.transforms should be helpful.

Code for image augmentation
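One possible solution, loosely following Section B of the pre-print. The blur kernel size is adapted to 96×96 STL-10 images and, as mentioned above, some probabilities are guesses rather than values from the paper; a reasonably recent torchvision is assumed.

```python
from torchvision import transforms as T

def byol_augmentation(image_size=96, blur_p=1.0, solarize_p=0.0):
    # Random crop, flip, colour jitter, grayscale, Gaussian blur and solarization
    return T.Compose([
        T.RandomResizedCrop(image_size, scale=(0.08, 1.0)),
        T.RandomHorizontalFlip(p=0.5),
        T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.RandomApply([T.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0))], p=blur_p),
        T.RandomSolarize(threshold=0.5, p=solarize_p),
    ])

augment_t = byol_augmentation(blur_p=1.0, solarize_p=0.0)        # t  ~ T
augment_t_prime = byol_augmentation(blur_p=0.1, solarize_p=0.2)  # t' ~ T'
```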

Skeleton of a BYOL architecture

Let’s have a look at Figure 1 again.

Figure 1 — An overview of architecture (based on Fig. 2 from BYOL paper; image created by the author in Lucidchart)

So we have images (the STL-10 dataset) and augmentations. What about the rest? The encoder (f) can be any network that transforms given images into features (a representation), e.g. a resnet18. The projector (g) is responsible for creating smaller representations from the output of the encoder. The prediction layer makes pseudo-predictions from projections.

Note that this architecture is asymmetric. The authors hypothesise that this prevents collapsed solutions (e.g. outputting the same vector for every image, which would give an MSE of 0). The prediction layer needs to have the same input and output dimensions so that the MSE can be calculated between the output of the target’s projection network and the output of the online’s prediction network.

Init method for BYOL class
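A sketch of how such an __init__ could look. The argument names, the value of tau and the choice of Adam are my own; self.mlp, which builds the projector and the predictor, is filled in a moment.

```python
import copy
import torch
import torch.nn as nn

class BYOL:
    def __init__(self, encoder, augment_online, augment_target, tau=0.99, lr=3e-4, device='cuda'):
        self.device = device
        self.tau = tau                      # decay rate of the exponential moving average
        self.augment_online = augment_online
        self.augment_target = augment_target

        # Online network: encoder f, projector g and predictor q
        self.online_encoder = encoder.to(device)
        self.online_projector = self.mlp(512).to(device)  # resnet18 features are 512-dimensional
        self.online_predictor = self.mlp(256).to(device)  # maps 256-dim projections to 256-dim outputs

        # Target network: copies of the online encoder and projector,
        # updated only via the exponential moving average, never by the optimizer
        self.target_encoder = copy.deepcopy(self.online_encoder)
        self.target_projector = copy.deepcopy(self.online_projector)
        for p in list(self.target_encoder.parameters()) + list(self.target_projector.parameters()):
            p.requires_grad = False

        self.mse = nn.MSELoss(reduction='sum')
        self.optimizer = torch.optim.Adam(
            list(self.online_encoder.parameters())
            + list(self.online_projector.parameters())
            + list(self.online_predictor.parameters()),
            lr=lr,
        )
```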

Implementing projector

As you can see above, BYOL.mlp should return the projector and the predictor. Let’s do this, then. The pre-print states in Section 3.3, Implementation details:

[…] the representation y is projected to a smaller space by a multi-layer perceptron (MLP) gθ, and similarly for the target projection gξ. This MLP consists in a linear layer with output size 4096 followed by batch normalization, rectified linear units (ReLU), and a final linear layer with output dimension 256.

This translates to:

MLP for projector and predictor
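As a static method on the BYOL class sketched above, that description could translate into:

```python
    @staticmethod
    def mlp(input_dim, hidden_dim=4096, output_dim=256):
        # Linear -> BatchNorm -> ReLU -> Linear, as described in Section 3.3 of the pre-print
        return nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, output_dim),
        )
```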

Fitting model

We will split fitting the model to unlabelled data into four steps:

  • Train (fit) model on unlabelled data
  • Validate on unlabelled train data
  • Validate on validation data (labels will be omitted)
  • Print results

All of the above steps will be repeated for the chosen number of epochs.

Training and validation loop
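A sketch of this loop as a method of the BYOL class. The names fit, train_one_epoch, validate, train_dl and val_dl match the ones referred to later in the text, but the exact structure is my own.

```python
    def fit(self, train_dl, val_dl, epochs=1):
        for epoch in range(epochs):
            self.train_one_epoch(train_dl)         # 1. train on unlabelled data
            train_loss = self.validate(train_dl)   # 2. validate on unlabelled train data
            val_loss = self.validate(val_dl)       # 3. validate on validation data (labels ignored)
            print(f'epoch {epoch}: train loss {train_loss:.4f}, val loss {val_loss:.4f}')  # 4. print results
```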

Forward and backward propagation

Now we will take care of the most crucial part of this code — the self-supervised learning in train_one_epoch. Keep in mind that you can look at Figure 1 and compare it with the code. First, we have to set both networks to training mode.

Set networks to training mode
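Sketched as the beginning of train_one_epoch:

```python
    def train_one_epoch(self, dataloader):
        self.online_encoder.train()
        self.online_projector.train()
        self.online_predictor.train()
        self.target_encoder.train()
        self.target_projector.train()
```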

Then we need to iterate through the batches returned by the DataLoader and move them to the GPU.

Put tensors to GPU
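Continuing the same method; the unlabelled split returns dummy labels, so we simply ignore them:

```python
        for (x, _) in dataloader:   # labels are ignored (they are -1 for the unlabelled split anyway)
            x = x.to(self.device)
```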

The forward pass will be implemented in a separate function, as we will reuse it in the validation process (the DRY rule).

Perform forward pass
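Inside the loop, this boils down to a single call that returns the loss:

```python
            loss = self.forward(x)
```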

We run backpropagation on a loss tensor…

Backpropagation added
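This is the standard PyTorch update step:

```python
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
```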

… and update the parameters of the target network.

Update the target network after backpropagation
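In this sketch, the update itself is delegated to a helper method defined further below:

```python
            self.update_target_network()
```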

Let’s go to the forward method. First, we need to augment the images with two different augmentation functions. We use torch.no_grad() here, as we don’t want to backpropagate through these transformations. The two differently augmented images are saved to the v and v_prime variables.

Image augmentation on batch data
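A sketch of the beginning of forward. Note that with plain torchvision transforms applied to a whole batch, the random augmentation parameters are sampled once per batch rather than per image; that simplification is my assumption here.

```python
    def forward(self, x):
        with torch.no_grad():
            v = self.augment_online(x)        # t  ~ T
            v_prime = self.augment_target(x)  # t' ~ T'
```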

The image v is fed into the online network, which returns a pseudo-prediction. Please note that we do not use torch.no_grad() here, as we will be doing a backward pass through this network. The image v_prime is fed into the target network under torch.no_grad(). Both outputs are normalised and …

Forward pass on target and online networks with outputs normalisation
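Continuing forward: the online branch keeps gradients, the target branch does not.

```python
        pseudo_prediction = self.online_predictor(self.online_projector(self.online_encoder(v)))
        with torch.no_grad():
            projection = self.target_projector(self.target_encoder(v_prime))

        pseudo_prediction = nn.functional.normalize(pseudo_prediction, dim=-1)
        projection = nn.functional.normalize(projection, dim=-1)
```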

… the mean squared error is calculated on these outputs (or rather the summed squared error, as we set sum reduction in __init__).

Calculation of loss
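Since the loss object was created in __init__, this is a single line:

```python
        loss = self.mse(pseudo_prediction, projection)
        return loss
```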

The paper also states that:

We symmetrize the loss […] by separately feeding v′ to the online network and v to the target network to compute [loss].

The code below introduces this symmetric loss.

Symmetrizing the loss function
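One way to express this is to move the computation above into a small helper (the name _one_direction is mine) and call it twice with the roles of v and v′ swapped, so that forward becomes:

```python
    def forward(self, x):
        with torch.no_grad():
            v = self.augment_online(x)
            v_prime = self.augment_target(x)
        # original direction (v -> online, v' -> target) plus the symmetric one (v' -> online, v -> target)
        return self._one_direction(v, v_prime) + self._one_direction(v_prime, v)

    def _one_direction(self, online_input, target_input):
        pseudo_prediction = self.online_predictor(self.online_projector(self.online_encoder(online_input)))
        with torch.no_grad():
            projection = self.target_projector(self.target_encoder(target_input))
        pseudo_prediction = nn.functional.normalize(pseudo_prediction, dim=-1)
        projection = nn.functional.normalize(projection, dim=-1)
        return self.mse(pseudo_prediction, projection)
```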

Now we are left with updating the target network. Remember the self.tau defined in __init__? This is a decay parameter. The parameters ξ of the target network are updated in the i-th step with the parameters θ of the online network:

ξ ← τξ + (1 − τ)θ

This equation might already be familiar to you. It defines the exponential moving average over the series of parameters θ updated at each step (batch). It is used for exponential smoothing — a process in which we smooth out a time series. In this context, the parameters of the target network “smooth out” the “rapid” changes of the parameters of the online network.

Updating target network with an exponential moving average of online network
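Sketched as the helper called at the end of train_one_epoch:

```python
    def update_target_network(self):
        with torch.no_grad():
            online_params = list(self.online_encoder.parameters()) + list(self.online_projector.parameters())
            target_params = list(self.target_encoder.parameters()) + list(self.target_projector.parameters())
            for online, target in zip(online_params, target_params):
                target.data = self.tau * target.data + (1.0 - self.tau) * online.data
```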

And this is the final code for the BYOL class, which also includes the validation process and the code for running training. We will run self-supervised learning for only one epoch, as it usually takes about an hour on Google Colab. You can change train_loss = self.validate(train_dl) to train_loss = 0 to cut some time.

The last layer of resnet18 is substituted with an Identity layer — with that, we get the features extracted by this network instead of predictions for 1000 classes.

Complete BYOL class code and its usage
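The only piece not shown yet is validation, which reuses the forward pass without gradients. Below is a sketch of it together with example usage; the choice of the labelled train loader as "validation data" and tau=0.99 are my own assumptions.

```python
    def validate(self, dataloader):
        for module in (self.online_encoder, self.online_projector, self.online_predictor,
                       self.target_encoder, self.target_projector):
            module.eval()
        total = 0.0
        with torch.no_grad():
            for (x, _) in dataloader:
                total += self.forward(x.to(self.device)).item()
        return total / len(dataloader.dataset)


# Usage
from torchvision.models import resnet18

encoder = resnet18()
encoder.fc = nn.Identity()   # keep the 512-dimensional features instead of 1000-class logits

byol = BYOL(encoder, augment_t, augment_t_prime, tau=0.99, device=device)
byol.fit(unlabelled_dl, train_dl, epochs=1)   # labelled train images (labels ignored) act as validation data
```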

Semi-supervised learning

Now, let’s combine self-supervised learning with supervised learning. First of all, we take the online encoder (f) out of the BYOL class and create a copy. As we want to predict ten classes, we substitute the last Identity layer with a Linear layer. If you want to freeze the encoding part of the network, you can do so by uncommenting the relevant code.

Semi-supervised learning using resnet18 trained on unlabelled data
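A sketch of that step; the variable names and the learning rate are mine, and the training loop itself can be the same one used for the supervised baseline.

```python
import copy

# Take the trained encoder out of BYOL and put a 10-class head on top of it
semi_model = copy.deepcopy(byol.online_encoder)
semi_model.fc = nn.Linear(512, 10)
semi_model = semi_model.to(device)

# Uncomment to freeze the pre-trained encoder and train only the new head
# for name, p in semi_model.named_parameters():
#     if not name.startswith('fc'):
#         p.requires_grad = False

optimizer = torch.optim.Adam((p for p in semi_model.parameters() if p.requires_grad), lr=3e-4)
criterion = nn.CrossEntropyLoss()
# ...then train on train_dl exactly like the supervised baseline
```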

Results

Now we plot figures comparing the performance of supervised and semi-supervised learning using a colour-blind-friendly palette.

Code generating learning curves
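A sketch of the plotting code; supervised_test_acc and semi_supervised_test_acc stand for whatever per-epoch accuracies your training loops recorded (the names are mine), and the colours come from a colour-blind-friendly palette.

```python
import matplotlib.pyplot as plt

colours = ['#0072B2', '#D55E00']   # blue and vermilion from the Okabe-Ito palette

plt.plot(supervised_test_acc, label='supervised', color=colours[0])
plt.plot(semi_supervised_test_acc, label='semi-supervised', color=colours[1])
plt.xlabel('epoch')
plt.ylabel('test accuracy')
plt.legend()
plt.show()
```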

As you can see below, semi-supervised learning managed to get slightly better results compared to supervised learning.

Figure 2 — Comparison of performance for supervised and semi-supervised learning. Your results may vary due to the randomness in parameters initialisation and batch shuffling in DataLoader. Image created by the author.

Conclusions

We went from using only labelled data for supervised training to leveraging unlabelled data with self-supervised and semi-supervised learning. As you can see, we did not get a large difference in results, but we still showed that using semi-supervised learning can improve results in some cases.

I encourage you to experiment with this code — maybe change the optimiser, τ (tau), or the encoder architecture? If you have some exciting findings, or this post helped you in your use case, please leave a comment. I’d love to hear about it.

Thank you for going through this tutorial. If you liked it, please follow me on Medium — it will help me grow my blog and continue my work. Comments, feedback and new ideas are much appreciated!
