
Efficient Nets with Noisy Student Training


An efficient way to build powerful CNN models and further improve their accuracy with the Noisy Student algorithm.

As you know, CNNs can be scaled up when we have more computational power, but the larger the network, the more resources are required to train it. So CNNs are built according to resource availability, in such a way that they can be scaled up for better accuracy when more resources become available. In EfficientNet, we study how to scale a model along three dimensions: the width, depth, and resolution of the input.

Table of contents:

  1. Introduction and Previous Methods
  2. Efficient Nets – Compound Scaling
  3. Architecture
  4. Noisy Student Training
  5. Stochastic Depth
  6. Transfer Learning

INTRODUCTION

The process of scaling neural networks is not well understood: there are many ways to do it, but the parameters are mostly chosen arbitrarily. Efficient Nets provide an efficient way to scale CNNs along three dimensions.

Depth Scaling – The most common way of scaling, where the number of layers is increased or decreased depending on the requirement. As the number of layers increases, the network captures more and more complex features. Theoretically, adding layers should always give us higher accuracy, but that's not the case: the deeper the network, the higher the chances of overfitting, vanishing gradients, and other problems. ResNet-1000 has accuracy similar to ResNet-101. Adding layers does not usually hurt performance either: ResNet-34 performs better than ResNet-18.

Width Scaling – Increasing the width of the network helps us capture more fine-grained features (smaller details in the image). This is done when we want to limit our model size, but accuracy saturates quickly with this type of scaling.

Resolution Scaling – Increasing the dimensions of the input image helps the model learn features better, since the information the pixels carry is more precise. State-of-the-art models like YOLO use a 416×416 input image size. But the accuracy gain diminishes very quickly for larger models.

Scaling along a single dimension increases accuracy, but the gain becomes insignificant for larger models.

EFFICIENT NETS – COMPOUND SCALING

Now, can we combine these scaling methods somehow to increase accuracy even in larger models? Why don't we just increase the parameters along all three dimensions?

Well, increasing them arbitrarily just makes things worse, as sometimes an increase in a dimension results in a loss of accuracy, or no change at all. And it would be a tedious task, as we would have to experiment along all three dimensions simultaneously.

So a new method was introduced: balancing the dimensions of width, depth, and resolution by scaling them with a constant ratio.

The constant ratio that we take is called a compound coefficient ɸ. The proposed equation is:
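In the paper's notation, with d, w, and r the depth, width, and resolution multipliers:

depth:      d = α^ɸ
width:      w = β^ɸ
resolution: r = γ^ɸ

subject to α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1. The constraint α · β² · γ² ≈ 2 is chosen because FLOPS grow roughly with d · w² · r², so every unit increase in ɸ approximately doubles the computational cost.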

ɸ is specified by the user depending on the resources available. α, β, and γ determine how those resources are assigned to the depth, width, and resolution dimensions respectively.

Now how it works:

1) We fix the value of ɸ (=1) and, under the constraint α · β² · γ² ≈ 2 (with α, β, γ ≥ 1), find the best possible values for α, β, and γ by a small grid search on the base model.

The EfficientNet-B0 model was trained and the values found were α = 1.2, β = 1.1, and γ = 1.15. α, β, and γ are constant coefficients determined by a grid search on the base network.

2) Now, with the values of α, β, and γ found above held fixed, we scale up the baseline model with different ɸ values and obtain EfficientNets B1–B7.

Better performance could be achieved by searching for α, β, and γ directly on a large model, but that has the downside of a much higher search cost.
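As a rough illustration (the published B1–B7 models also round and hand-tune these numbers, so this is only a sketch of the idea), here is how the multipliers follow from ɸ and the grid-searched constants:

```python
# Illustrative only: compute the depth / width / resolution multipliers
# implied by compound scaling for a given phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # grid-searched values for the B0 baseline

def compound_multipliers(phi):
    """Return the (depth, width, resolution) multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(1, 4):
    d, w, r = compound_multipliers(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```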

ARCHITECTURE

Any baseline architecture that is small and easy to scale can be used. The researchers showed that the compound scaling technique consistently improves model accuracy for ResNets and MobileNets.

The researchers developed a new baseline network by performing a neural architecture search using the AutoML MNAS framework. The architecture of the baseline model, B0, is simple and clean, which makes it easier to scale.

NOISY STUDENT TRAINING

Noisy Student training is a semi-supervised learning technique that implements the ideas of self-training and distillation, using an equal-or-larger student model and adding noise to the student at training time.

What!! Calm down. Everything will come to light if we dive a little deeper.

The algorithm is simple:

1) We train the teacher model on labeled images.

2) Now we generate soft or hard pseudo labels for unlabelled images using the teacher model.

3) We take an equal or larger-sized student model and train it on the combined data (labeled + pseudo-labeled), with noise added to the images as well as to the model.

4) We iterate the process from step 2 a few times, making the student model the new teacher.
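A minimal schematic of this loop in Python is given below; train_model and predict_soft_labels are placeholders for your own training and inference code, not library functions:

```python
def noisy_student(labeled, unlabeled, train_model, predict_soft_labels,
                  num_iterations=3):
    """Schematic Noisy Student loop (the helper functions are supplied by the caller)."""
    # Step 1: train the teacher on labeled images only, without noise.
    teacher = train_model(labeled, noise=False)
    for _ in range(num_iterations):                 # Step 4: iterate a few times
        # Step 2: generate (soft) pseudo labels for the unlabeled images.
        pseudo = predict_soft_labels(teacher, unlabeled)
        # Step 3: train an equal-or-larger student on labeled + pseudo-labeled
        # data, with input noise (augmentation) and model noise
        # (dropout, stochastic depth) switched on.
        student = train_model(labeled + pseudo, noise=True)
        teacher = student                           # the student becomes the new teacher
    return teacher
```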

Noisy Student training significantly improves the accuracy for all model sizes.

We generate a large number of pseudo labels (from roughly 300M unlabelled images) so that the student model can learn more than the teacher. We prefer to work with soft labels, as they produce better results (in some cases). Soft labels means that instead of labelling an image with a single class, we label it with the predicted probability of it belonging to each class.
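For example, for a three-class problem the two options look like this (a small NumPy sketch with made-up teacher logits):

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])   # teacher output for one unlabeled image, 3 classes

# Soft pseudo label: the full probability distribution (softmax of the logits).
soft_label = np.exp(logits) / np.exp(logits).sum()   # ~[0.79, 0.18, 0.04]

# Hard pseudo label: only the most likely class, one-hot encoded.
hard_label = np.eye(3)[logits.argmax()]               # [1., 0., 0.]
```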

Now as the name suggests, we add noise to the data and the model at the time of training of the student. What kind of noise are we talking about?

As for the image data, we do augmentation to add input noise; for the model noise, we add dropout and stochastic depth during training.
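The paper uses RandAugment for the input noise; as a simplified stand-in, here is what the two kinds of noise might look like in tf.keras (this assumes TensorFlow 2.6+ for the preprocessing layers, and the rates are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Input noise: random augmentations applied to the images the student sees.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

# Model noise: dropout inside the student's classification head.
# (Stochastic depth, the other kind of model noise, is covered in the next section.)
head = tf.keras.Sequential([
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(1000, activation="softmax"),
])
```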

We all know what data augmentation and dropout are, as they are common terms in deep learning. But what is meant by stochastic depth noise?

STOCHASTIC DEPTH

Stochastic depth is a simple idea: it shrinks the depth of the network during training by randomly bypassing a layer's transformation through its skip connection. With this, we get a network whose expected depth is much smaller than its maximum depth. This creates an ensemble-like model by skipping a random subset of layers, radically reducing training time.

The full model depth is used at test time, which reduces the test loss. Using stochastic depth at test time requires a small change to the network: the output of each layer is calibrated by the fraction of the time that layer participated in training.

But those details are for another article; for this one, you should be able to understand how stochastic depth noise is added while training the student.

Just like dropout disables the passage of information through individual nodes of a layer, stochastic depth disables an entire layer, or a subset of layers.
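A minimal sketch of a residual block with stochastic depth follows; residual_fn stands in for the block's actual layers, and the survival probability of 0.8 is an arbitrary example:

```python
import tensorflow as tf

def stochastic_depth_block(x, residual_fn, survival_prob=0.8, training=True):
    """Residual block with stochastic depth (minimal sketch)."""
    if training:
        # Randomly keep or drop the whole residual branch for this step.
        keep = tf.cast(tf.random.uniform([]) < survival_prob, x.dtype)
        return x + keep * residual_fn(x)
    # Test time: use the full depth, scaling the residual by the survival
    # probability so its expected contribution matches training.
    return x + survival_prob * residual_fn(x)
```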

TRANSFER LEARNING

Transfer learning with EfficientNets saves a lot of training time and computational power. You can use the pre-trained weights of an EfficientNet and fine-tune them for your own classification problems. In just a few lines of code, you can achieve higher accuracy than with many well-known models.
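For example, with tf.keras (TensorFlow 2.3+ ships EfficientNet under keras.applications) a minimal setup might look like this; the 10-class head, input size, and hyperparameters are placeholders for your own problem:

```python
import tensorflow as tf

# Load EfficientNet-B0 with ImageNet weights and drop the original classifier.
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False                      # freeze the pretrained backbone

# Attach a small head for your own classes (10 here, purely as an example).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # your own datasets
```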

Let’s take a look and compare the accuracy achieved with all the past models and Efficient Nets.

bharatdhyani13/EfficientNet-Noisy-Student-Training

CONCLUSION

Efficient Nets are powerful models that generate high-end results. In general, compound scaling increases accuracy by up to 2.5% compared to scaling in a single dimension. Noisy Student self-training is an effective way to leverage unlabelled datasets, improving accuracy by adding noise to the student model during training so that it learns beyond the teacher's knowledge. Stochastic depth is a simple yet ingenious idea for adding noise to the model by bypassing transformations through skip connections. It decreases training time, while the bypassed layers can be activated at test time to produce more accurate results.

These are some simple ideas combined to improve the accuracy of image classification. They open up possibilities to improve scaling and training with simple and clever mathematical transformations.


REFERENCE

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Self-training with Noisy Student improves ImageNet classification

Deep Networks with Stochastic Depth

