Adaptive and Cyclical Learning Rates using PyTorch

Thomas Dehaene
Towards Data Science
Mar 20, 2019


The Learning Rate (LR) is one of the key hyperparameters to tune in your neural net. SGD optimizers with adaptive learning rates have been popular for quite some time now: Adam, Adamax and their older brothers are often the de facto standard. They take away the pain of having to search for and schedule your learning rate by hand (e.g. the decay rate).

Source: Jeremy Jordan’s blogpost

This post will give a short example-wise overview and comparison of the most popular adaptive learning rate optimizers. We’ll be using PyTorch, the hipster neural-network library of choice!

In addition, fast.ai has preached the concept of Cyclical Learning Rates (CLR) as well, referring to the great paper by Leslie Smith (link). We’ll take a look at how SGD with this schedule holds up against the other optimizers.

To demonstrate, we will take on two classification tasks:

  • Image classification with a vanilla CNN architecture
  • Image classification with a pretrained (on ImageNet) ResNet34 network

All code and training logs for this post can be found on GitHub.

The Data

For this post, we’ll be using the ‘Flowers Recognition’ dataset from Kaggle (link). It’s a great basic real-world-ish dataset for testing out image classification networks.

example image of the class ‘dandelion’

The data, which gets split 80/20 into training and validation sets, is fairly evenly divided over the 5 classes to predict.
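As a minimal sketch of that data setup (the folder path, image size and batch size are placeholders here, not necessarily what the notebook uses), the Kaggle folder can be loaded with ImageFolder and split 80/20:

```python
import torch
from torchvision import datasets, transforms

# Placeholder transform; the actual augmentation pipeline is discussed further down.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# 'flowers' contains one subfolder per class (daisy, dandelion, rose, sunflower, tulip)
full_set = datasets.ImageFolder("flowers", transform=transform)

n_val = int(0.2 * len(full_set))                      # 20% held out for validation
train_set, val_set = torch.utils.data.random_split(
    full_set, [len(full_set) - n_val, n_val]
)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=32)
```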

Adaptive Optimizers

Without going deeper into the mathematics of each one, here is a brief (and a tad oversimplified) overview of the optimizers we will be battling against one another:

  • Adagrad: this will scale the learning rate, for each parameter, based on the past history of the gradients. In essence: large gradients => smaller effective learning rate, and vice versa. The downside, though, is that learning rates can diminish very quickly.
  • Adadelta: builds on Adagrad, but with new tricks: only the past w gradients are stored instead of the entire history, and this (now limited) history is kept as a decaying average.
  • RMSProp: kinda the same (plz don’t shoot me, Mr. Hinton sir), but RMSProp divides the LR by an exponentially decaying average of squared gradients.
  • Adam: next to keeping an exponentially decaying average of past squared gradients (like RMSProp), it also keeps an exponentially decaying average of the past gradients themselves (similar to momentum).
  • Adamax: here, another trick is applied to the moving average of the squared gradients v(t): the authors replace it with the infinity norm ℓ∞ of past gradients, plug this into Adam, and thus obtain a surprisingly stable algorithm.

👉 Tip: if you’re looking for a more in-depth mathematical comparison of the optimizers, check out this fantastic blog post by Sebastian Ruder, which was of great help for me in writing this post.
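All of these (plus plain SGD for the CLR experiments) ship with PyTorch in torch.optim. As a quick reference, this is roughly how they are instantiated; `model` is assumed to be your nn.Module, and the learning rates below are illustrative rather than tuned values:

```python
import torch.optim as optim

# One optimizer per contender; in the actual experiments each run gets its own fresh model.
optimizers = {
    "Adagrad":  optim.Adagrad(model.parameters(), lr=1e-2),
    "Adadelta": optim.Adadelta(model.parameters()),              # adapts its step size internally
    "RMSProp":  optim.RMSprop(model.parameters(), lr=1e-3),
    "Adam":     optim.Adam(model.parameters(), lr=1e-3),
    "Adamax":   optim.Adamax(model.parameters(), lr=2e-3),
    "SGD+CLR":  optim.SGD(model.parameters(), lr=1e-3, momentum=0.9),  # scheduled with CLR below
}
```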

Cyclical Learning Rate

The CLR paper suggests two very interesting points:

  1. It gives us a way to schedule the Learning Rate in an efficient way during training, by varying it between an upper and a lower bound in a triangular fashion.
  2. It gives us a very decent estimate of which range of Learning Rates works well for your particular network.

There are a number of parameters to play around with here:

  • step size: the number of epochs (or iterations) over which the LR climbs from the lower bound up to the upper bound.
  • max_lr: the highest LR in the schedule.
  • base_lr: the lowest LR in the schedule. In practice, the author of the paper suggests taking this a factor R smaller than max_lr; the factor we used was 6.

The exact reason why this works well is of course difficult to analyse. The evolution of the LR might push the network towards a higher loss in the short term, but this short-term disadvantage proves advantageous in the long term: it gives the network the ability to jump to another local minimum if the current one isn’t very stable.

Source: Snapshot Ensembles (https://arxiv.org/abs/1704.00109)

One other advantage CLR has over the Adaptive methods described above is that it is less computationally intensive.

In the paper, it is also mentioned that you can play around with a linearly or exponentially decreasing upper bound over time, but this is not implemented in this blog post.

So how does this work in code?…

Step 1: find the upper LR

Using a vanilla CNN as an example: step 1 is to calculate the upper bound of the learning rate for your model. The way to do this is to:

  • define an initial learning rate, the lower boundary of the range you want to test (let’s say 1e-7)
  • define an upper boundary of the range (let’s say 0.1)
  • define an exponential scheme to run through this step by step:
Formula used for the LR finder scheduling (N = number of images, BS = batch size, lr = learning rate)

Luckily, PyTorch has a LambdaLR object which lets us define the above in a lambda function:
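Here is a sketch of what that can look like; the exact lambda in the notebook may differ, but the idea is to multiply the starting LR by a constant factor q at every batch, so that it ends at the upper boundary after epochs · N / BS steps (train_loader and model come from the earlier snippets):

```python
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

start_lr, end_lr = 1e-7, 0.1              # the range we want to sweep
epochs = 2                                # length of the LR-finder run
num_steps = epochs * len(train_loader)    # total number of batches = epochs * N / BS

optimizer = optim.SGD(model.parameters(), lr=start_lr, momentum=0.9)

# Factor that takes us from start_lr to end_lr in exactly num_steps steps:
# lr(i) = start_lr * (end_lr / start_lr) ** (i / num_steps)
q = (end_lr / start_lr) ** (1.0 / num_steps)
lr_finder_scheduler = LambdaLR(optimizer, lr_lambda=lambda step: q ** step)
```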

  • Next, do a run (I used two epochs) through your network. At each step (each batch): capture the LR, capture the loss, and optimize the gradients:

👉 Note: we don’t take the ‘raw’ loss at each step, but a smoothed loss: smoothed_loss = α · loss + (1 − α) · previous_smoothed_loss
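A sketch of that finder loop (the loss criterion and the smoothing factor α are my own choices here, not necessarily the notebook’s):

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
alpha = 0.05                 # smoothing factor for the loss, chosen for illustration
smoothed_loss = None
lrs, losses = [], []

model.train()
for epoch in range(epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

        # Smoothed loss as in the note above: alpha * current + (1 - alpha) * previous
        current = loss.item()
        smoothed_loss = current if smoothed_loss is None else alpha * current + (1 - alpha) * smoothed_loss

        lrs.append(optimizer.param_groups[0]["lr"])   # record the LR used for this batch
        losses.append(smoothed_loss)

        lr_finder_scheduler.step()                    # exponentially bump the LR for the next batch
```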

After this, we can clearly see the LR followed a nice exponential pattern:

The loss-lr plot for the basic network (see later) looks as follows:

We can clearly see that a too high LR causes divergence in the network loss, and too low of an LR doesn’t cause the network to learn very much at all…

In his fast.ai course, Jeremy Howard mentions that a good upper bound is not at the lowest point of the loss curve, but about a factor of 10 to the left (i.e. a 10× smaller LR).

Taking this into account, we can state that a good upper bound for the learning rate would be: 3e-3.

A good lower bound, according to the paper and other sources, is the upper bound divided by a factor of 6.

Step 2: CLR scheduler

Step 2 is to create a Cyclical learning schedule, which varies the learning rate between the lower and the upper bound.

This can be done in a number of fashions:

Various possibilities for the CLR shape (source: Jeremy Jordan’s blog)

We’re going for the plain ol’ triangular CLR schedule.

Programmatically, we just need to create a custom function:
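A sketch of such a function, following the triangular formula from the CLR paper; the step size of four epochs’ worth of batches is a choice of mine, and base_lr / max_lr come from the LR finder above:

```python
import math

max_lr = 3e-3                        # upper bound found with the LR finder
base_lr = max_lr / 6                 # lower bound, a factor 6 below as suggested
step_size = 4 * len(train_loader)    # batches per half-cycle (here: 4 epochs)

def triangular_clr(step):
    """Return the LR for a given batch index, oscillating linearly between base_lr and max_lr."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```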

Step 3: wrap it

In step 3, this can then be wrapped inside a LambdaLR object in PyTorch:
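One way to do that (again a sketch, not necessarily the exact notebook code) is to give the optimizer a base LR of 1, so that the value returned by the lambda is used as the learning rate itself, since LambdaLR multiplies the initial LR by whatever the lambda returns:

```python
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

# lr=1.0 so that triangular_clr's return value *is* the effective learning rate
optimizer = optim.SGD(model.parameters(), lr=1.0, momentum=0.9)
clr_scheduler = LambdaLR(optimizer, lr_lambda=triangular_clr)
```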

Step 4: train

During each epoch, we update the LR at every batch using the ‘.step()’ method of the scheduler object:
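Roughly like this, stepping the scheduler once per batch so the LR walks through the triangle during the epoch (criterion as before):

```python
num_epochs = 150   # as used in the first comparison below

for epoch in range(num_epochs):
    model.train()
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        clr_scheduler.step()   # move to the next point on the triangular schedule
```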

Comparison 1: Vanilla CNN

First up is the classification task using a vanilla (non-pretrained) CNN. I’ve used the following network architecture:
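The exact model definition lives in the notebook; as a rough, illustrative sketch of that kind of architecture (the channel sizes, layer counts and dropout rate here are placeholders, not the notebook’s exact values):

```python
import torch.nn as nn

class VanillaCNN(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),   nn.BatchNorm2d(32),  nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),  nn.BatchNorm2d(64),  nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = VanillaCNN()
```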

To prevent the model from overfitting on the (relatively small) dataset, we use the following techniques:

  • Dropout in the Linear layers
  • Batchnorm layer in the CNN blocks
  • Data augmentation (see the sketch below):

👉 Hint: you need to calculate the mean and std for the channel normalization in advance; have a look in the full notebook to see how to tackle this.
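An illustrative augmentation pipeline along those lines (the specific transforms and the normalization statistics below are placeholders; the real means/stds are computed from the training images as described in the notebook):

```python
from torchvision import transforms

channel_means = [0.46, 0.42, 0.30]   # placeholder values, compute these from your training set
channel_stds  = [0.27, 0.24, 0.27]   # placeholder values

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize(channel_means, channel_stds),
])
```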

We trained the network for 150 epochs for each of the 6 optimizers. To cancel out some variability, we did 3 runs for each optimizer.

The training and validation accuracy look as follows:

Training accuracy
Validation accuracy

Alrighty boys and girls, what can we see here:

👉 Adagrad: mediocre performance, as was to be expected

👉 Adadelta: not a real champ in training acc, but very decent performance in validation

👉 RMSProp: unless I’m doing something wrong here, I was a bit surprised by the poor performance

👉 Adam: consistently good

👉 Adamax: promising training accuracy evolution, but not perfectly reflected in the validation accuracy

👉 SGD with CLR: much faster convergence in training accuracy, fast convergence in validation accuracy terms, not too shabby…

In the end, SGD+CLR, Adam and Adadelta all seem to end at about the same final validation accuracy of about 83%.

Comparison 2: Resnet34 Transfer Learning

If you’re saying: “image classification on a small dataset”, you need to consider Transfer Learning.

So we did just that, using ResNet34, pretrained on ImageNet. I believe the dataset is fairly close to ImageNet pictures, so I only unfroze the last of the 5 convolutional blocks and replaced the last linear layer with a new one:
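A sketch of that setup with torchvision (in torchvision’s ResNet implementation the last convolutional block is layer4; newer torchvision versions prefer the weights= argument over pretrained=True):

```python
import torch.nn as nn
from torchvision import models

resnet = models.resnet34(pretrained=True)

# Freeze everything...
for param in resnet.parameters():
    param.requires_grad = False

# ...then unfreeze the last convolutional block
for param in resnet.layer4.parameters():
    param.requires_grad = True

# Replace the final linear layer with a fresh one for the 5 flower classes
# (a newly created layer has requires_grad=True by default)
resnet.fc = nn.Linear(resnet.fc.in_features, 5)
```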

The network was trained, for each of the 6 optimizers, for 100 epochs (due to much faster convergence):

Training accuracy
Validation accuracy

Key notes here:

👉 In general: much less difference between the optimizers, especially when observing the validation accuracy

👉 RMSProp: still a bit of an underperformer

👉 SGD+CLR: again good performance in training accuracy, but this is not immediately reflected in the validation accuracy.

It seems that for Transfer Learning, the absolute payoff from tuning your learning rate and carefully selecting your optimizer is smaller.

This is probably due to two main effects:

  • the network weights are already largely optimized
  • the optimizer typically only gets to optimize a smaller portion of the entire network weights, since most weights remain frozen

Conclusion

The main takeaway from this blog post would be:

Don’t just take any old off-the-shelf optimizer. The learning rate is one of the most important hyperparameters, so it pays off to take a closer look at it. And if you’re comparing, have a look at SGD with a CLR schedule.

Again: all code can be found here, feel free to check it out!
