Speeding up Neural Net Training with LR-Finder

Finding good initial learning rate for your network

Faizan Ahemad
Towards Data Science

--

Intro: Optimisers and LR

While training a deep neural network, selecting a good learning rate is essential for both fast convergence and a lower final error. We also have to select an optimiser, which decides how weight updates are applied in a DNN.

There are various optimisers available, such as Adam, SGD+momentum, Adagrad, RMSprop, AdaDelta and AdaBound. Of these, Adam and SGD+momentum are the most popular. When training a fully connected or convolutional network, most state-of-the-art models use SGD+momentum, because it generalises better to unseen data and gives better validation/test scores.

Why do we need to Find a Good LR?

There are two small problems with SGD, though: it is slower to converge than Adam, and it requires learning rate tuning. Surprisingly, the solution to both problems is a good starting learning rate. If your LR is too high, the error will never decrease and training will not converge; if it is too low, you will wait far too long for training to converge. So we start with a good LR given by the LR Finder, then decay it a little as we approach the end of training.

So How Does an LR Finder Work?

The basic objective of an LR Finder is to find the highest LR that still minimises the loss and does not make the loss explode/diverge. We do this by training the model for one epoch while increasing the LR after each batch, recording the loss at each step, and finally using the LR just before the loss exploded.

start_lr = 0.0001
end_lr = 1.0
num_batches = len(Data) // batch_size
# Multiply the LR by a constant factor after every batch so that it grows
# exponentially from start_lr to end_lr over the course of one epoch
lr_multiplier = (end_lr / start_lr) ** (1.0 / num_batches)
cur_lr = start_lr
losses = []
lrs = []
for i in range(num_batches):
    loss = train_model_get_loss(batch=i, lr=cur_lr)  # one training step at cur_lr
    losses.append(loss)
    lrs.append(cur_lr)
    cur_lr = cur_lr * lr_multiplier  # increase LR for the next batch
plot(lrs, losses)  # loss vs LR, typically with LR on a log scale
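The plot call above is just a placeholder. A minimal matplotlib sketch for rendering it (assuming the lrs and losses lists recorded in the loop) could look like the following; a log scale on the x-axis helps, since the LR grows exponentially.

import matplotlib.pyplot as plt

def plot_lr_vs_loss(lrs, losses):
    # The LR grows exponentially, so a log scale spreads the points out evenly
    plt.figure(figsize=(8, 5))
    plt.plot(lrs, losses)
    plt.xscale("log")
    plt.xlabel("Learning Rate (log scale)")
    plt.ylabel("Loss")
    plt.title("Loss vs LR")
    plt.show()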

You will get a plot that looks like the one below.

Loss vs LR

In our implementation we annotate the plotted points with an arrow to indicate the location of the candidate LRs.

LR Finder with Annotation

From this plot we find the point after which the loss starts increasing sharply.
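As a rough illustration only (not necessarily the exact rule used by the library), a common heuristic is to take the LR at which the recorded loss was lowest and back off by a constant factor; the suggest_lr helper and the divisor of 10 below are assumptions for this sketch.

import numpy as np

def suggest_lr(lrs, losses, divisor=10.0):
    # Hypothetical helper: pick a safe LR from the recorded (lr, loss) curve
    lrs, losses = np.asarray(lrs), np.asarray(losses)
    best_idx = int(np.argmin(losses))  # LR with the lowest loss, just before it explodes
    return lrs[best_idx] / divisor     # back off so training starts safely below that point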

Usage

I have written a small library that contains the LR Finder. It is for Keras; for PyTorch, the fast.ai implementation works.
To install:

pip install --upgrade --upgrade-strategy only-if-needed https://github.com/faizanahemad/data-science-utils/tarball/master > /dev/null

Next, in your notebook (for CIFAR-10):

First, define the imports and data generators for our dataset.

from data_science_utils.vision.keras import *

X_train, Y_train, X_test, Y_test = get_cifar10_data()
cutout_fn = get_cutout_eraser(p=0.75, s_l=0.1, s_h=0.3, r_1=0.3, r_2=1 / 0.3,
                              max_erasures_per_image=2, pixel_level=True)
datagen = ImageDataGenerator(featurewise_center=True, featurewise_std_normalization=True,
                             preprocessing_function=cutout_fn)
datagen.fit(X_train)  # featurewise statistics must be computed before the generator is used
datagen_validation = ImageDataGenerator(featurewise_center=True, featurewise_std_normalization=True)
datagen_validation.fit(X_train)

Next we build our model and run the LR Finder on it.

model = build_model()  # returns an instance of a Keras Model
lrf = LRFinder(model)
generator = datagen.flow(X_train, Y_train, batch_size=256, shuffle=True)
test_generator = datagen_validation.flow(X_test, Y_test, batch_size=256, shuffle=True)
lrf.find_generator(generator, 0.0001, 10.0, test_generator, epochs=1, steps_per_epoch=None)
lrf.plot_loss()

The two plots above were generated using this code. You can see the full example in this Google Colab notebook.

Precautions while using LR Finder

  • We use the minima of the loss curve as our candidate LRs. Some of these are local minima where the overall loss is still quite high; for a candidate LR the loss should be close to the minimum loss, so we filter out these local minima (see the sketch after this list).
  • We need to use the validation set to measure the loss. Using the training set will not yield correct results, since the weights overfit to the training set.
  • When we generate candidate LRs, we need to ensure that they are distinct enough. For example, generating candidates like 0.552 and 0.563 makes no sense since these LRs are too close, so we apply a rule that each LR should be at least 20% larger than the previous lower LR.
  • Note that the LR Finder gives you an approximate value, so it doesn't matter whether you take the exact value or not. If it gives you 0.012, you can use 0.01 as well; if it gives 0.056, both 0.05 and 0.06 are fine.
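As a rough sketch of the first and third precautions (the function name, the loss tolerance of 1.2 and the exact filtering logic here are assumptions, not the library's code):

import numpy as np

def filter_candidate_lrs(lrs, losses, loss_tolerance=1.2, min_lr_gap=0.2):
    lrs, losses = np.asarray(lrs), np.asarray(losses)
    # Candidate LRs are the local minima of the loss curve
    candidates = [(lrs[i], losses[i]) for i in range(1, len(losses) - 1)
                  if losses[i] < losses[i - 1] and losses[i] < losses[i + 1]]
    # Drop local minima whose loss is far above the overall minimum loss
    min_loss = losses.min()
    candidates = [(lr, loss) for lr, loss in candidates if loss <= min_loss * loss_tolerance]
    # Keep only LRs that are at least 20% larger than the previous accepted LR
    filtered = []
    for lr, _ in sorted(candidates):
        if not filtered or lr >= filtered[-1] * (1 + min_lr_gap):
            filtered.append(lr)
    return filtered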

Fine-Tuning LR in later epochs

We can use LR scheduling or LR decay to reduce the LR in later epochs, since we started with a high LR initially.

Results

Before LR Finder

We used SGD directly with a learning rate of 0.01 and Nesterov momentum. We trained the network on CIFAR-10 for 100 epochs. Our network has 450K parameters. A rough sketch of this baseline setup is shown below.
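This is only a sketch of the described baseline; the momentum value of 0.9 and the compile arguments are assumptions, since the article only states a learning rate of 0.01 with Nesterov momentum.

from keras.optimizers import SGD

# Baseline: fixed LR of 0.01 with Nesterov momentum (momentum value of 0.9 is assumed)
sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss="categorical_crossentropy", metrics=["accuracy"])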

Without LR Finder

As you can see, it took about 60 epochs for the validation error to converge, reaching 86.7% accuracy.

After LR Finder

We use the LR provided by the LR Finder. Everything else stays the same.

After LR Finder

You can see that we get 86.4% accuracy here, but training converges in 40 epochs instead of 60. Using the LR given by the LR Finder along with EarlyStopping can reduce compute time massively (a sketch follows below).
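A minimal sketch of adding EarlyStopping in Keras (the patience value and restore_best_weights setting are assumptions):

from keras.callbacks import EarlyStopping

# Stop once the validation loss stops improving and keep the best weights seen so far
early_stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)

model.fit_generator(generator, epochs=100, validation_data=test_generator,
                    callbacks=[early_stop])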

After LR Finder with LR scheduling

We use LR scheduling from Keras to decrease the LR each epoch: after every epoch we multiply the LR by 0.97. You can see the LR Scheduling section in the example notebook. A sketch of this schedule is shown below.
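A minimal sketch of such a schedule with Keras' LearningRateScheduler callback (initial_lr is a placeholder for the value the LR Finder suggested):

from keras.callbacks import LearningRateScheduler

initial_lr = 0.01  # placeholder: replace with the LR suggested by the LR Finder

def schedule(epoch):
    # Multiply the starting LR by 0.97 for every epoch that has elapsed
    return initial_lr * (0.97 ** epoch)

lr_scheduler = LearningRateScheduler(schedule, verbose=1)

model.fit_generator(generator, epochs=100, validation_data=test_generator,
                    callbacks=[lr_scheduler])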

With LR scheduling

Notice that we get 88.4% accuracy with the same network and starting LR. Also notice that towards the end the loss and accuracy curves no longer fluctuate as before.

So we have gained over 1% accuracy just by using the LR Finder with LR scheduling, in the same 100 epochs we started with.

Conclusion and References

Using the LR Finder is beneficial for both faster training and higher accuracy. We also show how to use it in an example notebook. The code of the LR Finder is here.

If you found this useful, do visit my notebook/GitHub for the full code.
