Fueling up your neural networks with the power of cyclical learning rates

Never limit the capability of your neural network. Let it explore its ability to learn

Bipin Krishnan P
Towards Data Science


Introduction

Choosing the best learning rate for training neural networks is a tedious task and most often it is done by a process of trial and error.

But what if you could provide your neural network with a range of learning rate values? The best part is that there is a method to find the best learning rate range without even starting the actual training of your neural network.

How cool is that?

These cool techniques were introduced by Leslie N. Smith in his paper Cyclical Learning Rates for Training Neural Networks.

By using the techniques discussed in the paper we get a better result in fewer iterations.

The abstract of the paper clearly says this:

Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations

The statement in the abstract is supported by several experiments that compare cyclical learning rates against other techniques such as adaptive learning rates.

While using cyclical learning rate for training on the CIFAR-10 data set, the following results were obtained:

Source: https://arxiv.org/abs/1506.01186

The same accuracy obtained by other methods at 70,000 iterations was achieved by the cyclical learning rate method at approximately 25,000 iterations.

Now, let’s jump straight into the details of the paper.

What is this so-called cyclical learning rate?

Most of the time while passing in a learning rate, what we do is simply give the neural network a fixed value.

But when using a cyclical learning rate, we instead pass in a minimum learning rate and a maximum learning rate.

For example, consider a minimum learning rate of 0.001 and a maximum learning rate of 0.01. During training, the learning rate rises from 0.001 (the minimum) up to 0.01 (the maximum), then falls back from 0.01 to 0.001, and the process continues until training is complete.

Just like a cyclical process, it starts from the minimum and goes to the maximum, and then it returns to the minimum. It’s as simple as that.

Now obviously, you might have a question.

How many iterations or epochs does it take for the learning rate to go from minimum to the maximum value and vice versa?

The answer to that question is the step size.

If the step size is 100, then it takes 100 iterations for the learning rate to go from minimum to the maximum value and another 100 iterations to return to the minimum.

Image by author

As shown in the above figure, the learning rate starts from a minimum value of 0.0001 and reaches a maximum value of 0.001 after 100 iterations and again returns to the minimum in the next 100 iterations.

One complete cycle is the time taken to return to the minimum value. In the above figure, it is equal to 200 iterations.

One complete cycle = 2*(step size)

Torch7 code for implementing cyclical learning rate given in the paper is:

Torch7 code for cyclical learning rate

epochCounter — Number of iterations

opt.LR — Minimum value for the learning rate

maxLR — Maximum value for the learning rate

But let’s convert the above code to numpy:

Numpy implementation of cyclical learning rate
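Here is one way to write it with numpy, following the formula from the paper; iteration, step_size, base_lr and max_lr play the roles of epochCounter, the step size, opt.LR and maxLR:

    import numpy as np

    def cyclical_lr(iteration, step_size, base_lr, max_lr):
        # which cycle we are currently in
        cycle = np.floor(1 + iteration / (2 * step_size))
        # distance from the peak of the current cycle (0 at the peak, 1 at the ends)
        x = np.abs(iteration / step_size - 2 * cycle + 1)
        # interpolate linearly between base_lr and max_lr
        return base_lr + (max_lr - base_lr) * np.maximum(0, 1 - x)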

Now, let’s test whether our numpy implementation is working as expected. For that let’s run a for loop and check whether the learning rate is moving from minimum to maximum as discussed before.

Code for testing our function
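A minimal check could sweep the iteration count and plot the resulting learning rates, using the same boundaries and step size as in the figure above:

    import matplotlib.pyplot as plt

    # one complete cycle of 200 iterations: min 1e-4, max 1e-3, step size 100
    lrs = [cyclical_lr(it, step_size=100, base_lr=1e-4, max_lr=1e-3) for it in range(200)]

    plt.plot(lrs)
    plt.xlabel('iterations')
    plt.ylabel('learning rate')
    plt.show()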

By running the above code we get a plot as shown below:

Image by author

As expected, our learning rate starts from the minimum value and moves linearly up and down within the specified step size.

The above technique is called the triangular policy. There are two more policies described in the paper:

  1. triangular2 — The only difference from the triangular policy is that the difference between the base learning rate and the maximum learning rate is cut in half at the end of each complete cycle.
  2. exp_range — The learning rate still varies between the minimum and the maximum, but the amplitude of each cycle declines by an exponential factor of gamma^iteration (gamma raised to the power of the iteration number). Both variants are sketched below.
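As a rough sketch (my own generalization of the cyclical_lr function defined above, not code from the paper), the three policies differ only in a scale factor applied to the triangular wave:

    def clr_policy(iteration, step_size, base_lr, max_lr, mode='triangular', gamma=1.0):
        cycle = np.floor(1 + iteration / (2 * step_size))
        x = np.abs(iteration / step_size - 2 * cycle + 1)
        scale = 1.0
        if mode == 'triangular2':
            scale = 1 / (2 ** (cycle - 1))   # halve the amplitude after every cycle
        elif mode == 'exp_range':
            scale = gamma ** iteration        # exponential decay of the amplitude
        return base_lr + (max_lr - base_lr) * np.maximum(0, 1 - x) * scale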

Training a model using cyclical learning rate

Now that you have an idea of what exactly a cyclical learning rate is, let's train a model using a cyclic learning rate and see whether it performs better than a model with a single learning rate.

To make our experiments faster, we will be using a small subset from the MNIST(Modified National Institute of Standards and Technology) data set. Let’s begin with our experimentation:

  1. Import the necessary modules.
Import the necessary modules
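Roughly the following imports cover everything used in the steps below (the exact list in the original notebook may differ slightly):

    import numpy as np
    import matplotlib.pyplot as plt

    import torch
    import torch.nn as nn
    from torch.utils.data import Dataset, DataLoader, random_split
    from torchvision import datasets, transforms, models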

2. Download the MNIST data set.

Download the MNIST data set

As I’ve said, we will be downloading only a small subset of the complete MNIST data set.

3. Now, we will create a custom data set class.

Custom data set class
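A minimal version of such a class might look like this (the class name and the transform handling are illustrative):

    class MNISTDataset(Dataset):
        """Wraps raw image/label tensors and applies an optional transform."""

        def __init__(self, images, labels, transform=None):
            self.images = images
            self.labels = labels
            self.transform = transform

        def __len__(self):
            return len(self.images)

        def __getitem__(self, idx):
            image = self.images[idx]
            if self.transform is not None:
                image = self.transform(image)
            return image, self.labels[idx]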

4. Build the required transforms and load the data set using PyTorch data loader.

Load the data
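A sketch of the transforms and loaders, assuming a batch size of 64 and the 8,000/2,000 split described below:

    # the raw uint8 tensors are converted to PIL images and back to
    # normalized float tensors before being fed to the data loader
    transform = transforms.Compose([
        transforms.ToPILImage(),
        transforms.ToTensor(),
    ])

    dataset = MNISTDataset(images, labels, transform=transform)

    # 8,000 samples for training, the remaining 2,000 for validation
    train_set, val_set = random_split(dataset, [8000, 2000])

    train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=64, shuffle=False)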

Since the downloaded MNIST data is in the form of raw tensors and the torchvision transforms we use expect PIL (Python Imaging Library) images, we first convert each sample into a PIL image and then back into a tensor before feeding it to the data loader.

The data set is split in such a way that 8,000 data points are used for training and the rest (approximately 2,000 data points) are used for validation.

5. Now let’s create a validate function to calculate the loss and accuracy of our model on validation data.

Validation function
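A simple version of such a function, assumed here to return the average loss and the accuracy (in percent) over the validation loader:

    def validate(model, val_loader, criterion, device):
        """Computes average loss and accuracy on the validation set."""
        model.eval()
        total_loss, correct, total = 0.0, 0, 0
        with torch.no_grad():
            for images, targets in val_loader:
                images, targets = images.to(device), targets.to(device)
                outputs = model(images)
                total_loss += criterion(outputs, targets).item() * targets.size(0)
                correct += (outputs.argmax(dim=1) == targets).sum().item()
                total += targets.size(0)
        model.train()
        return total_loss / total, 100 * correct / total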

6. Now, let’s create our model.

Building the model

We will be using resnet18 (without the pre-trained weights) as our model, with cross-entropy loss and the Adam optimizer.
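A sketch of this setup; replacing the first convolution layer so the network accepts single-channel MNIST images is my own assumption, since the stock resnet18 expects 3-channel input:

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    model = models.resnet18(pretrained=False, num_classes=10)
    # assumption: adapt the first conv layer to 1-channel 28x28 MNIST images
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    model = model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)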

7. Now we are all set to train our model.

Training using cyclic learning rate

On each iteration through the data loader, we will update the value of learning rate in the optimizer using the cyclical learning rate function that we’ve implemented earlier using numpy.

In the above code, we use a step size equal to two times the length of the training data loader and learning rate boundaries of 1e-3 and 1e-2.

We will also store the value of accuracy after each iteration to compare the results with another model trained using a single learning rate.
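Putting these pieces together, the training loop might look roughly like the following; recording the validation accuracy after every single iteration is one way to read "accuracy after each iteration":

    base_lr, max_lr = 1e-3, 1e-2
    step_size = 2 * len(train_loader)   # two times the length of the training loader

    acc = []        # validation accuracy after every iteration (cyclic-LR model)
    iteration = 0

    for epoch in range(4):
        for inputs, targets in train_loader:
            # update the optimizer's learning rate with our numpy function
            lr = cyclical_lr(iteration, step_size, base_lr, max_lr)
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr

            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()

            _, val_acc = validate(model, val_loader, criterion, device)
            acc.append(val_acc)
            iteration += 1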

8. Now, we will quickly create and train another model, but with a single learning rate value.

We will train the model using the same data set used before.
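A sketch of this baseline, assuming the same architecture as before, Adam's default learning rate of 0.001 and 8 epochs of training:

    # second model: identical architecture, fixed learning rate of 0.001
    model_single = models.resnet18(pretrained=False, num_classes=10)
    model_single.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    model_single = model_single.to(device)

    optimizer_single = torch.optim.Adam(model_single.parameters(), lr=1e-3)

    acc1 = []       # validation accuracy after every iteration (fixed-LR model)

    for epoch in range(8):
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer_single.zero_grad()
            loss = criterion(model_single(inputs), targets)
            loss.backward()
            optimizer_single.step()

            _, val_acc = validate(model_single, val_loader, criterion, device)
            acc1.append(val_acc)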

9. Now let's compare the results of our model trained for 4 epochs with the other model trained for 8 epochs using a single learning rate, i.e., 0.001 (the default value for the Adam optimizer).

Code for comparing the results

In the above code, ‘acc1’ holds the accuracy values for the model trained with a single learning rate, and ‘acc’ holds the accuracy values for the model trained using the cyclic learning rate.
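The comparison can be plotted with a few lines of matplotlib (the colors are chosen to match the description of the figure):

    plt.plot(acc, 'r', label='cyclical learning rate')   # 'acc'  -> cyclic-LR model
    plt.plot(acc1, 'b', label='single learning rate')    # 'acc1' -> fixed-LR model
    plt.xlabel('iterations')
    plt.ylabel('validation accuracy')
    plt.legend()
    plt.show()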

The above code gives a plot as follows:

Image by author

It is clear from the above plot that the model trained using the cyclical learning rate (red line) achieved higher accuracy than the model trained with a fixed learning rate (blue line), even with fewer iterations.

Finding the best learning rate range

As I’ve said earlier, there is a technique to find the best learning rate range using a technique. This technique is called the “LR range test” as mentioned in the paper.

There is a simple way to estimate reasonable minimum and maximum boundary values with one training run of the network for a few epochs. It is a “LR range test”.

This is done by setting the minimum learning rate to a small value like 1e-07 and the maximum learning rate to a high value. The model is then trained for a number of iterations while the learning rate increases between these bounds, and the loss obtained at each learning rate is plotted.

This is neatly implemented in the fastai library, but there is also a PyTorch implementation for the same.

But for that, we need to install a library called torch-lr-finder using pip as shown below:

pip install torch-lr-finder

Now we can test for the best range of learning rate to pass into our model.

We will just pass our model, loss function, optimizer and device(cuda or cpu) to initialize the learning rate finder.

We pass our training data loader, validation data loader, minimum learning rate(very low value) and maximum(very high value) learning rate to start the learning rate range test.
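With the torch-lr-finder package, the test might look like the following; the exact boundary values and number of iterations here are illustrative, and the test is usually run on a freshly initialized model:

    from torch_lr_finder import LRFinder

    # initialize the finder with the model, loss function, optimizer and device
    lr_finder = LRFinder(model, optimizer, criterion, device=device)

    # sweep from a very small to a large learning rate, evaluating on the validation loader
    lr_finder.range_test(
        train_loader,
        val_loader=val_loader,
        start_lr=1e-7,
        end_lr=1,
        num_iter=100,
        step_mode="linear",
    )

    lr_finder.plot()    # plot loss vs. learning rate, with a suggested value
    lr_finder.reset()   # restore the model and optimizer to their initial state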

After running the above code, we get a plot with a learning rate suggestion like the one below:

Image by author

From the figure, we can see that the loss keeps decreasing as the learning rate goes from approximately 3e-4 up to 1e-3, so these values can be used as the minimum and maximum values of the learning rate. The optimum learning rate suggested by the learning rate finder is 5.21e-04, which also lies within this range and can be used if you wish to train the model with a single learning rate.

Using PyTorch’s learning rate scheduler

PyTorch provides a learning rate scheduler to change the learning rate as discussed above.

So, let’s use PyTorch’s learning rate scheduler to train a model with the same architecture, hyper-parameters, optimizer and loss function that we’ve used before.

Let’s import the learning rate scheduler from PyTorch and quickly build the model.

Initializing the CyclicLR scheduler
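A sketch of the scheduler setup with torch.optim.lr_scheduler.CyclicLR, reusing the same boundaries and step size as before; note that cycle_momentum has to be disabled because Adam has no momentum parameter:

    from torch.optim.lr_scheduler import CyclicLR

    # assumption: same architecture and hyper-parameters as the earlier model
    model_sched = models.resnet18(pretrained=False, num_classes=10)
    model_sched.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    model_sched = model_sched.to(device)

    optimizer_sched = torch.optim.Adam(model_sched.parameters(), lr=1e-3)

    scheduler = CyclicLR(
        optimizer_sched,
        base_lr=1e-3,
        max_lr=1e-2,
        step_size_up=2 * len(train_loader),
        mode='triangular',
        cycle_momentum=False,   # required with Adam, which has no momentum parameter
    )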

Now we will train our model and use the learning rate scheduler to update the learning rates.

Training the model and updating the learning rate using the scheduler
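The loop is the same as before, except that the scheduler's step() is called after every batch instead of our numpy function (a rough sketch):

    for epoch in range(4):
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer_sched.zero_grad()
            loss = criterion(model_sched(inputs), targets)
            loss.backward()
            optimizer_sched.step()
            scheduler.step()   # advance the cyclical schedule once per iteration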

As seen in the above code, after each iteration through the data loader, the learning rate is updated using the scheduler.

After 4 epochs, the model gives the same accuracy (98.2638) as the one trained using the cyclical learning rate function we built ourselves.

Image by author

📎 Some of the important points to remember are:

  1. Increasing the learning rate from the minimum (low value) to the maximum (high value) might have a short-term negative effect, but in the long term it gives a better result.
  2. Cyclical learning rates help in getting out of saddle points during training.
  3. It is good to set the step size to 2 to 10 times the number of iterations in a single epoch (i.e., the length of the training data loader).
  4. It is better to stop training at the end of a cycle, i.e., when the learning rate is at its minimum.
  5. A rule of thumb is to keep the minimum learning rate at 1/3rd or 1/4th of the maximum learning rate.
  6. Whenever you start with a new data set or architecture, the learning rate range test is a good way to get an optimum learning rate value or an optimum learning rate range.

Conclusion

This is such an awesome technique to use in your day to day training of neural networks.

If you are still in doubt about the capability of cyclical learning rates, you should definitely check out the experiments section of the paper, and you should also run your own experiments with cyclical learning rates to understand their capabilities.

You can combine the power of this technique along with other methods like adaptive learning rate techniques to get powerful models.

There are many techniques that are not popular among deep learning enthusiasts but that can improve the generalization capability of your model or reduce training time, saving you a lot of your precious time.

If you wish to get the complete code discussed in the article, you can find it in this repository.
