Learning Rate Scheduler

Shreenidhi Sudhakar
Towards Data Science
3 min read · Aug 2, 2017


Adaptive Learning Rate

In training deep networks, it is helpful to reduce the learning rate as the number of training epochs increases. The intuition is that with a high learning rate, the deep learning model possesses high kinetic energy: its parameter vector bounces around chaotically and is unable to settle down into the deeper, narrower parts of the loss function (local minima). If, on the other hand, the learning rate is very small, the system has low kinetic energy and settles into shallow, narrow parts of the loss function (false minima).

The figure above depicts how a high learning rate leads to random to-and-fro movement of the parameter vector around a local minimum, while a very small learning rate results in getting stuck in a false minimum. Thus, it can be hard to determine when to decay the learning rate.

We base our experiment on the principle of step decay. Here, we reduce the learning rate by a constant factor every few epochs. Typical values might be reducing the learning rate by half every 5 epochs, or by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving.
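The article's own implementation is not reproduced here, but a minimal sketch of step decay as a standalone Python function might look like the following. The initial rate, drop factor, and epoch interval are illustrative values, not necessarily the ones used in the experiment.

```python
import math

def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_per_drop=5):
    """Reduce the learning rate by a constant factor (`drop`) every
    `epochs_per_drop` epochs (illustrative hyperparameter values)."""
    return initial_lr * math.pow(drop, math.floor(epoch / epochs_per_drop))

# e.g. epochs 0-4 -> 0.1, epochs 5-9 -> 0.05, epochs 10-14 -> 0.025, ...
```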

In practice, step decay is often preferred because its hyperparameters, the decay fraction and the step timing in epochs, are easy to interpret. It has also been found to stabilize the learning rate, which in turn helps stochastic gradient descent converge quickly and with a high rate of success.
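If training with Keras, for example, the `step_decay` function sketched above can be attached to training via the `LearningRateScheduler` callback. The toy model and random data below are purely for illustration and are not from the original experiment; the validation-error heuristic mentioned earlier corresponds to Keras's `ReduceLROnPlateau` callback.

```python
import numpy as np
import tensorflow as tf

# Toy model and data purely for illustration (not the article's setup).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss="mse")

x_train = np.random.rand(256, 8)
y_train = np.random.rand(256, 1)

# Apply step decay at the start of each epoch.
lr_callback = tf.keras.callbacks.LearningRateScheduler(step_decay, verbose=1)

# Alternative: reduce the rate by a constant factor whenever the validation
# loss stops improving, as described in the heuristic above.
# plateau_callback = tf.keras.callbacks.ReduceLROnPlateau(
#     monitor="val_loss", factor=0.5, patience=3)

model.fit(x_train, y_train, epochs=15, batch_size=32, callbacks=[lr_callback])
```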

Experimental Steps

Experimental Results

Source Code


I work in AI at Amazon to help customers search for products on their wishlist by uploading related images.