Supercharge Training of Your Deep Learning Models

Super convergence with one-cycle learning rates

Published in

Towards Data Science

7 min read · Nov 22, 2023

Have you come across scenarios where it is easy to get an initial burst in accuracy, but once you reach 90% you have to push really hard to squeeze out any further improvement? Does your model take too long to train?

In this article, we will look at an interesting technique to supercharge your training setup, get that extra bit of performance you have been looking for, and train faster. Essentially, we will work towards dynamically changing the learning rate over epochs using a policy called the One-Cycle Learning Rate.

Originally mentioned in a paper by Leslie Smith, the one-cycle learning rate schedule [1], [2] is a strategy for dynamically updating the learning rate during the training process. If that sounds like a mouthful of terms, don’t worry: we will first start with a typical training setup and then gradually see how we can improve results using the one-cycle learning rate.

Training an Image Classifier

As we are working towards learning a neat trick (the one-cycle learning rate) to improve model performance, why not do it while enjoying the classic rock-paper-scissors game?

Problem Statement

The game of rock-paper-scissors is a classic children’s game in which two players use hand gestures (for rock, paper or scissors) to overpower their opponent. For instance, the rock gesture wins over scissors, but the paper gesture wins over the rock. Interesting, isn’t it?

Our objective here is to train an image classification model which can detect one of the three gestures. We can then leverage such a trained model to develop an end-to-end game. For the purpose of this article, we will limit the scope to training the classifier itself. The end-to-end game, complete with a deployable model, is probably for another article.

The Dataset

We are lucky that we already have a labelled dataset which we can leverage to train a classification model to great effect. The dataset is hosted on the TensorFlow Datasets catalog and was made available by Laurence Moroney (CC BY 2.0). It has the following attributes:

Number of data points: 2800
Number of classes: 3
Available train-test split: Yes
Dataset size: 220 MiB

TensorFlow provides a nice and clean API to access such datasets. The following snippet allows us to download the train and test splits:

import tensorflow_datasets as tfds

DATASET_NAME = 'rock_paper_scissors'
(dataset_train_raw, dataset_test_raw), dataset_info = tfds.load(
    name=DATASET_NAME,
    data_dir='tmp',
    with_info=True,
    as_supervised=True,
    split=[tfds.Split.TRAIN, tfds.Split.TEST],
)

# plot samples from the dataset
fig = tfds.show_examples(dataset_train_raw, dataset_info)

The following are a few sample images from this dataset:

Figure: Sample data points in the Rock Paper Scissors Dataset

Learning Rate

Learning Rate is one of the key hyper-parameters which can make or break a setup, yet it is typically overlooked because most libraries/packages come with good enough defaults to begin with. But these defaults can take you only so far.

Getting the correct learning rate for a bespoke use case such as ours is very important. It is a tricky trade-off to find the optimal value. Go too slow (or small) with the learning rate and your model will hardly learn anything. Go too fast (or large) and it will overshoot the ever so mysterious minima all neural networks aim to find. The illustration below depicts this trade-off.

Figure: Impact of Learning Rate on the Model’s ability to learn the objective (minima). Source: Author
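To make the trade-off concrete, here is a minimal sketch (not part of the original notebook) that runs plain gradient descent on the toy function f(w) = w**2, whose minimum is at w = 0. The function, the starting point and the three learning rates are illustrative assumptions chosen only to show the three regimes.

```python
def descend(lr, w=1.0, steps=20):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # step against the gradient
    return w


# Illustrative learning rates: too small, reasonable, too large
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final w = {descend(lr):.4f}")
```

With the tiny learning rate, w barely moves away from its starting value. With the moderate one it gets close to the minimum, and with the large one every step overshoots, so w moves further away from 0 instead of approaching it.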

Gradient Descent & Optimizers

Gradient descent is the standard way to train/optimise neural networks. It minimizes the objective function by updating the parameters of the network in the opposite direction of the gradient. Without going into much detail, it helps in travelling downhill along the slope of the objective function. A detailed introduction to gradient descent is available here for reference.
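As a rough illustration of the update rule, the sketch below fits a straight line to four toy points with vanilla gradient descent. The data, the learning rate and the number of steps are assumptions made for this example and are unrelated to the image classifier we train later.

```python
# Toy data generated from y = 2x + 1 (an illustrative assumption)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def fit_line(xs, ys, lr=0.1, steps=500):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # residuals of the current line
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # gradients of the mean squared error with respect to w and b
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # move the parameters in the opposite direction of the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = fit_line(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

The fitted values should end up very close to the slope 2 and intercept 1 used to generate the data.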

The deep learning community has come a long way since the initial models were trained with vanilla gradient descent. Over the years a number of improvements have helped train faster and avoid obvious pitfalls. Briefly, some of the notable and most popular ones are:

AdaGrad
The Adaptive Gradient algorithm adapts the learning rate of each individual parameter based on its historical gradients, allowing larger updates for infrequently updated parameters and smaller updates for frequent ones. This makes it well suited to sparse data.

RMSProp
Root Mean Square Propagation optimises the learning by adjusting the learning rates for each parameter individually. It addresses the diminishing learning rates problem in AdaGrad by using a moving average of squared gradients. This helps adaptively scale the learning rates based on recent gradient magnitudes.

ADAM
Adaptive Moment Estimation is an optimization algorithm that combines ideas from both RMSProp and momentum methods. It maintains exponentially decaying averages of past gradients and squared gradients, using them to adaptively update parameters. ADAM is known for its efficiency and effectiveness in training deep neural networks.
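The three update rules can be sketched for a single scalar parameter as follows. This is a simplified illustration of the textbook formulas, not the TensorFlow implementation. The helper names, the hyper-parameter values and the toy objective f(w) = w**2 are assumptions made for this example.

```python
import math


def adagrad_step(w, grad, state, lr=0.1, eps=1e-8):
    # accumulate ALL past squared gradients, so the step size only shrinks
    state["g2_sum"] = state.get("g2_sum", 0.0) + grad * grad
    return w - lr * grad / (math.sqrt(state["g2_sum"]) + eps)


def rmsprop_step(w, grad, state, lr=0.01, rho=0.9, eps=1e-8):
    # moving average of squared gradients instead of a running sum
    state["g2_avg"] = rho * state.get("g2_avg", 0.0) + (1 - rho) * grad * grad
    return w - lr * grad / (math.sqrt(state["g2_avg"]) + eps)


def adam_step(w, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # decaying averages of gradients (m) and squared gradients (v)
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad * grad
    # bias correction for the zero-initialised averages
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def minimise(step_fn, w=5.0, steps=200):
    """Apply one optimizer step function repeatedly to f(w) = w**2."""
    state = {}
    for _ in range(steps):
        grad = 2.0 * w
        w = step_fn(w, grad, state)
    return w


for fn in (adagrad_step, rmsprop_step, adam_step):
    print(fn.__name__, minimise(fn))
```

All three rules divide the step by a running estimate of the gradient magnitude. They differ in how that estimate is kept (full sum versus moving average) and in whether momentum is added on top.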

One-Cycle Learning Rate and Super Convergence

One-Cycle Learning Rate is a simple three-step process to adjust the learning rate and the momentum as the training progresses. It works as follows:

Step 1: We start by ramping up the learning rate initially from a lower to a higher value in a linear incremental fashion for a few epochs
Step 2: We maintain the highest value of learning rate for a few epochs
Step 3: We then go back to a lower learning rate decaying over time

During these three steps, the momentum is updated in the exact opposite direction, i.e. when the learning rate goes up, the momentum goes down and vice-versa.
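The three steps and the mirrored momentum can be sketched as a single function of the epoch. This is only an illustration: the function name, the linear decay in step 3, and the momentum bounds of 0.85 and 0.95 are assumptions for this example, not values taken from the training setup in this article.

```python
def one_cycle(epoch, rampup_epochs=6, sustain_epochs=2, total_epochs=24,
              min_lr=1e-3, max_lr=2e-3, min_mom=0.85, max_mom=0.95):
    """Return an illustrative (learning_rate, momentum) pair for an epoch."""
    if epoch < rampup_epochs:
        # Step 1: ramp up linearly
        frac = epoch / rampup_epochs
    elif epoch < rampup_epochs + sustain_epochs:
        # Step 2: hold the highest learning rate
        frac = 1.0
    else:
        # Step 3: decay back down (linearly in this sketch)
        decay_epochs = total_epochs - rampup_epochs - sustain_epochs
        done = epoch - rampup_epochs - sustain_epochs
        frac = max(0.0, 1.0 - done / decay_epochs)

    lr = min_lr + frac * (max_lr - min_lr)
    # momentum moves in the opposite direction of the learning rate
    momentum = max_mom - frac * (max_mom - min_mom)
    return lr, momentum


for epoch in (0, 3, 6, 8, 16, 24):
    lr, momentum = one_cycle(epoch)
    print(f"epoch {epoch:2d}: lr = {lr:.5f}, momentum = {momentum:.3f}")
```

A single fraction drives both quantities, which guarantees that the momentum is at its lowest exactly when the learning rate is at its highest.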

One-Cycle Learning Rate in Action

Let us first work our way through a simple implementation of the one-cycle learning rate and then use it for training our model. We will leverage a ready-to-use implementation of the one-cycle LR schedule from Martin Gorner’s 2019 talk at TensorFlow World, shown in the following listing.

def lr_function(epoch):
    # start, minimum and maximum values for the learning rate
    start_lr = 1e-3
    min_lr = 1e-3
    max_lr = 2e-3

    # number of epochs to ramp up linearly, number of epochs to hold
    # the maximum, and the per-epoch exponential decay factor
    rampup_epochs = 6
    sustain_epochs = 0
    exp_decay = 0.5

    if epoch < rampup_epochs:
        # linear ramp-up from start_lr to max_lr
        return (max_lr - start_lr) / rampup_epochs * epoch + start_lr
    if epoch < rampup_epochs + sustain_epochs:
        # hold the maximum (skipped here because sustain_epochs is 0)
        return max_lr
    # exponential decay from max_lr towards min_lr
    decay_epochs = epoch - rampup_epochs - sustain_epochs
    return (max_lr - min_lr) * exp_decay**decay_epochs + min_lr

We execute this function (see the listing above) for a fixed number of epochs to showcase how the learning rate changes as per the steps we discussed earlier. Here we start with an initial learning rate of 1e-3 and ramp it up linearly to 2e-3 over the first 6 epochs. Since sustain_epochs is set to 0, the sustain step is skipped, and the learning rate then decays exponentially back towards 1e-3 over the remaining epochs. This dynamic learning rate curve is depicted with a sample run of 24 epochs in the following figure.

One-cycle learning rate policy over 24 epochs. Learning rate is ramped up linearly, followed by a slow decay over the remaining epochs. Image Source: Author
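The values behind such a curve can be reproduced with a few lines of plain Python. The sketch below restates the schedule compactly with the same constants as the listing (exposed as keyword arguments here, which is an assumption of this example) and evaluates it for 24 epochs.

```python
def lr_function(epoch, start_lr=1e-3, min_lr=1e-3, max_lr=2e-3,
                rampup_epochs=6, sustain_epochs=0, exp_decay=0.5):
    """Compact restatement of the one-cycle schedule used in this article."""
    if epoch < rampup_epochs:
        return (max_lr - start_lr) / rampup_epochs * epoch + start_lr
    if epoch < rampup_epochs + sustain_epochs:
        return max_lr
    decay_epochs = epoch - rampup_epochs - sustain_epochs
    return (max_lr - min_lr) * exp_decay**decay_epochs + min_lr


# learning rate for each of the 24 epochs
schedule = [lr_function(epoch) for epoch in range(24)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

The printed values rise until epoch 6, where the learning rate peaks at 2e-3, and then halve their distance to 1e-3 with every further epoch. The `schedule` list can be passed to any plotting library to draw the curve.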

We will now put our one-cycle learning rate scheduler to the test by applying it to a MobileNetV2 model used as a feature extractor, while training a classification head for our current case of rock-paper-scissors. We will then compare it against a simple CNN as well as MobileNetV2 + classification head with the standard Adam optimiser. The complete notebook is available for reference on GitHub. For a quick overview, the following snippet outlines how we use TensorFlow callbacks to plug in our one-cycle learning rate utility.

import tensorflow as tf

# Set image shape and number of classes (rock, paper, scissors)
INPUT_IMG_SHAPE = (128, 128, 3)
NUM_CLASSES = 3

# Get Pretrained MobileNetV2
base_model = tf.keras.applications.MobileNetV2(
  input_shape=INPUT_IMG_SHAPE,
  include_top=False,
  weights='imagenet',
  pooling='avg'
)

# Attach a classification head
model_lr = tf.keras.models.Sequential()
model_lr.add(base_model)
model_lr.add(tf.keras.layers.Dropout(0.5))
model_lr.add(tf.keras.layers.Dense(
    units=NUM_CLASSES,
    activation=tf.keras.activations.softmax,
    kernel_regularizer=tf.keras.regularizers.l2(l=0.01)
))

# compile the model
model_lr.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.sparse_categorical_crossentropy,
    metrics=['accuracy']
)

# set number of epochs
initial_epochs = 24

# Set the model for training
# The LearningRateScheduler callback is where we
# plug our custom 1-cycle rate function
training_history_lr = model_lr.fit(
    x=dataset_train_augmented_shuffled.repeat(),
    validation_data=dataset_test_shuffled.repeat(),
    epochs=initial_epochs,
    steps_per_epoch=steps_per_epoch,
    validation_steps=validation_steps,
    callbacks=[
        tf.keras.callbacks.LearningRateScheduler(lr_function, verbose=1)
    ],
    verbose=1
)
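Because both datasets are repeated indefinitely with `.repeat()`, Keras needs `steps_per_epoch` and `validation_steps` to know where an epoch ends. The snippet above does not show how they are computed, so here is one plausible way, assuming we simply count the full batches in each split. The helper name and the split sizes are placeholders for illustration only.

```python
def steps_for(num_examples, batch_size):
    """Number of full batches in one pass over num_examples items."""
    return num_examples // batch_size


# Placeholder split sizes, for illustration only. In the real setup they
# could be read from dataset_info.splits['train'].num_examples and
# dataset_info.splits['test'].num_examples.
NUM_TRAIN_EXAMPLES = 2000
NUM_TEST_EXAMPLES = 400
BATCH_SIZE = 64

steps_per_epoch = steps_for(NUM_TRAIN_EXAMPLES, BATCH_SIZE)
validation_steps = steps_for(NUM_TEST_EXAMPLES, BATCH_SIZE)
print(steps_per_epoch, validation_steps)
```

Using floor division means a trailing partial batch is not counted, which is harmless here since the repeated dataset simply carries on into the next epoch.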

We train all 3 models for 24 epochs with a batch size of 64. The following figure showcases the impact of the one-cycle learning rate. It helps our model converge in just 5 epochs, sooner than the other two models. The super-convergence phenomenon is visible on the validation dataset as well.

MobileNetV2 with one-cycle learning rate (mobileNetV2_lr) outperforms MobileNetV2 and simple CNN architectures by achieving convergence in just 5 epochs

We reach consistent validation accuracies of 90–92% within 10 epochs, the best we have seen so far across all our models. On the test dataset, however, the picture is closer: MobileNetV2_LR (about 91.7% accuracy) performs on par with the plain MobileNetV2 (about 92.2%), and both comfortably beat the simple CNN (about 77.7%). The main gain from the one-cycle schedule here is faster convergence rather than a higher final test accuracy.

# Simple CNN
Test loss:  0.7511898279190063
Test accuracy:  0.7768816947937012

# MobileNetV2
Test loss:  0.24527719616889954
Test accuracy:  0.9220430254936218

# MobileNetV2_LR
Test loss:  0.27864792943000793
Test accuracy:  0.9166666865348816

Conclusion

Overcoming the plateau in model performance beyond 90% accuracy and optimizing training time can be achieved through the implementation of the One-Cycle Learning Rate. This technique, introduced by Leslie Smith and team, dynamically adjusts the learning rate during training, offering a strategic approach to supercharging the model performance. By adopting this method, you can efficiently navigate the complexities of training setups and unlock the potential for faster and more effective deep learning models. Embrace the power of One-Cycle Learning Rate to elevate your training experience and achieve superior results!