
Effect of Gradient Descent Optimizers on Neural Net Training

co-authored with Apurva Pathak


Experimenting with Gradient Descent Optimizers

Welcome to another instalment in our Deep Learning Experiments series, where we run experiments to evaluate commonly-held assumptions about training neural networks. Our goal is to better understand the different design choices that affect model training and evaluation. To do so, we come up with questions about each design choice and then run experiments to answer them.

In this article, we seek to better understand the impact of using different optimizers:

  • How do different optimizers perform in practice?
  • How sensitive is each optimizer to parameter choices such as learning rate or momentum?
  • How quickly does each optimizer converge?
  • How much of a performance difference does choosing a good optimizer make?

To answer these questions, we evaluate the following optimizers:

  • Stochastic gradient descent (SGD)
  • SGD with momentum
  • SGD with Nesterov momentum
  • RMSprop
  • Adam
  • Adagrad
  • Cyclic Learning Rate

How are the experiments set up?

We train a neural net using different optimizers and compare their performance. The code for these experiments can be found on GitHub.

  • Dataset: we use the Cats and Dogs dataset, which consists of 23,262 images of cats and dogs, split about 50/50 between the two classes. Since the images are differently-sized, we resize them all to the same size. We use 20% of the dataset as validation data (dev set) and the rest as training data.
  • Evaluation metric: we use the binary cross-entropy loss on the validation data as our primary metric to measure model performance.
Figure 1: Sample images from Cats and Dogs dataset
  • Base model: we also define a base model that is inspired by VGG16, where we apply (convolution -> max-pool -> ReLU -> batch-norm -> dropout) operations repeatedly. Then, we flatten the output volume and feed it into two fully-connected layers (dense -> ReLU -> batch-norm) with 256 units each, and dropout after the first FC layer. Finally, we feed the result into a one-neuron layer with a sigmoid activation, resulting in an output between 0 and 1 that tells us whether the model predicts a cat (0) or dog (1).
Figure 2: Base model architecture (created using NN SVG)
  • Training: we use a batch size of 32 and the default weight initialization (Glorot uniform). The default optimizer is SGD with a learning rate of 0.01. We train until the validation loss fails to improve over 50 iterations. A minimal sketch of this setup is shown below.
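
To make the setup above concrete, here is a minimal sketch of how it might be expressed in Keras. The layer stack is abbreviated, and the input image size, number of conv blocks, and per-block dropout rates are illustrative assumptions, not the actual code from the repo.

from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for the VGG16-inspired base model (conv -> max-pool -> ReLU -> batch-norm -> dropout blocks).
# The 128x128 input size and 0.2 conv dropout are assumptions for illustration.
model = keras.Sequential([
    layers.Conv2D(32, 3, padding="same", input_shape=(128, 128, 3)),
    layers.MaxPooling2D(),
    layers.ReLU(),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    # ... further conv blocks would repeat here ...
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.6),                      # dropout after the first FC layer
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dense(1, activation="sigmoid"),    # 0 = cat, 1 = dog
])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01),   # default optimizer
    loss="binary_crossentropy",
)

# Stop once the validation loss has not improved for 50 iterations.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=50)
# model.fit(train_data, validation_data=val_data, batch_size=32,
#           epochs=1000, callbacks=[early_stop])   # train_data/val_data are placeholders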

Stochastic Gradient Descent

We first start off with vanilla stochastic gradient descent. This is defined by the following update equation:

Figure 3: SGD update equation

where w is the weight vector and dw is the gradient of the loss function with respect to the weights. This update rule takes a step in the direction of greatest decrease in the loss function, helping us find a set of weights that minimizes the loss. Note that in pure SGD, the update is applied per example, but more commonly it is computed on a batch of examples (called a mini-batch).
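
As a rough sketch (not the experiment code itself), the update in Figure 3 amounts to the following, where w and dw are NumPy arrays of the same shape:

import numpy as np

def sgd_update(w, dw, lr=0.01):
    """Vanilla SGD: step against the gradient, scaled by the learning rate lr."""
    return w - lr * dw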

How does learning rate affect SGD?

First, we explore how learning rate affects SGD. It is well known that choosing a learning rate that is too low will cause the model to converge slowly, whereas a learning rate that is too high may cause it to not converge at all.

Figure 4: Illustration of optimizer convergence, taken from Jeremy Jordan’s website

To verify this experimentally, we vary the learning rate along a log scale between 0.001 and 0.1. Let’s first plot the training losses.

Figure 5: Training loss curves for SGD with different learning rates

We indeed observe that performance is optimal when the learning rate is neither too small nor too large (the red line). Initially, increasing the learning rate speeds up convergence, but after learning rate 0.0316, convergence actually becomes slower. This may be because taking a larger step may actually overshoot the minimum loss, as illustrated in figure 4, resulting in a higher loss.

Let’s now plot the validation losses.

Figure 6: Validation loss curves for SGD with different learning rates

We observe that validation performance suffers when we pick a learning rate that is either too small or too big. Too small (e.g. 0.001) and the validation loss does not decrease at all, or does so very slowly. Too large (e.g. 0.1) and the validation loss does not attain as low a minimum as it could with a smaller learning rate.

Let’s now plot the best training and validation loss attained by each learning rate*:

Figure 7: Minimum training and validation losses for SGD at different learning rates

The data above confirm the ‘Goldilocks’ theory of picking a learning rate that is neither too small nor too large, since the best learning rate (3.2e-2) is in the middle of the range of values we tried.

*Typically, we would expect the validation loss to be higher than the training loss, since the model has not seen the validation data before. However, we see above that the validation loss is surprisingly sometimes lower than the training loss. This could be due to dropout, since neurons are dropped only at training time and not during evaluation, resulting in better performance during evaluation than during training. The effect may be particularly pronounced when the dropout rate is high, as it is in our model (0.6 dropout on FC layers).

Best SGD validation loss

  • Best validation loss: 0.1899
  • Associated training loss: 0.1945
  • Epochs to converge to minimum: 535
  • Params: learning rate 0.032

SGD takeaways

  • Choosing a good learning rate (not too big, not too small) is critical for ensuring optimal performance on SGD.

Stochastic Gradient Descent with Momentum

Overview

SGD with momentum is a variant of SGD that typically converges more quickly than vanilla SGD. It is commonly defined as follows:

Figure 8: Update equations for SGD with momentum

Deep Learning by Goodfellow et al. explains the physical intuition behind the algorithm [0]:

Formally, the momentum algorithm introduces a variable v that plays the role of velocity – it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient.

In other words, the parameters move through the parameter space at a velocity that changes over time. The change in velocity is dictated by two terms:

  • 𝛼, the learning rate, which determines to what degree the gradient acts upon the velocity
  • 𝛽, the rate at which the velocity decays over time

Thus, the velocity is an exponential average of the gradients, which incorporates new gradients and naturally decays old gradients over time.

One can imagine a ball rolling down a hill, gathering velocity as it descends. Gravity exerts force on the ball, causing it to accelerate or decelerate, as represented by the gradient term 𝛼 * dw. The ball also encounters viscous drag, causing its velocity to decay, as represented by 𝛽.

One effect of momentum is to accelerate updates along dimensions where the gradient direction is consistent. For example, consider the effect of momentum when the gradient is a constant c:

Figure 9: change in velocity over time when gradient is a constant c.

Whereas vanilla SGD would make an update of -𝛼c each time, SGD with momentum would accelerate over time, eventually reaching a terminal velocity that is 1/(1-𝛽) times as large as the vanilla update (derived using the formula for an infinite geometric series). For example, if we set the momentum to 𝛽=0.9, then the update eventually becomes 10 times as large as the vanilla update.

Another effect of momentum is that it dampens oscillations. For example, consider a case when the gradient zigzags and changes direction often along a certain dimension:

Figure 10: Illustration of momentum, from More on Optimization Techniques by Ekaba Bisong

The momentum term dampens the oscillations because the oscillating terms cancel out when we add them into the velocity. This allows the update to be dominated by dimensions where the gradient points consistently in the same direction.
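
As a sketch of the update in Figure 8 (using the common formulation in which the learning-rate-scaled gradient is folded into the velocity; variable names are ours):

def momentum_update(w, v, dw, lr=0.01, beta=0.9):
    """SGD with momentum: the velocity is a decaying sum of past gradient steps."""
    v = beta * v - lr * dw   # old velocity decays by beta, new gradient step is added
    w = w + v                # move the weights along the velocity
    return w, v

With a constant gradient c, the velocity approaches -lr * c / (1 - beta), i.e. 1/(1 - beta) times the vanilla step, matching the terminal-velocity argument above.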

Experiments

Let’s look at the effect of momentum at learning rate 0.01. We try out momentum values [0, 0.5, 0.9, 0.95, 0.99].

Figure 11: Effect of momentum on training loss (left) and validation loss (right) at learning rate 0.01.

Above, we can see that increasing momentum up to 0.9 helps model training converge more quickly, since the training and validation loss decrease at a faster rate. However, once we go past 0.9, training loss and validation loss actually suffer, with model training entirely failing to converge at momentum 0.99. Why does this happen? One possibility is that excessively large momentum prevents the model from adapting to new directions in the gradient updates. Another is that the weight updates become so large that they overshoot the minimum. However, this remains an area for future investigation.

Do we observe the decrease in oscillation that is touted as a benefit of momentum? To measure this, we can compute an oscillation proportion for each update step – i.e. what proportion of parameter updates in the current update have the opposite sign compared to the previous update. Indeed, increasing the momentum decreases the proportion of parameters that oscillate:

Figure 12: Effect of momentum on oscillation
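
As a sketch, the oscillation proportion described above can be computed from two consecutive parameter updates like this (a simplified stand-in for the experiment code):

import numpy as np

def oscillation_proportion(prev_update, curr_update):
    """Fraction of parameters whose update changed sign relative to the previous step."""
    flipped = np.sign(prev_update) * np.sign(curr_update) < 0
    return float(flipped.mean())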

What about the size of the updates – does the acceleration property of momentum increase the average size of the updates? Interestingly, the higher the momentum, the larger the initial updates but the smaller the later updates:

Figure 13: Effect of momentum on average update size

Thus, increasing the momentum results in taking larger initial steps but smaller later steps. Why would this be the case? This is likely because momentum initially benefits from acceleration, causing the initial steps to be larger. Later, the momentum causes oscillations to cancel out, which could make the later steps smaller.

One data point that supports this interpretation is the distance traversed per epoch (defined as the Euclidean distance between the weights at the beginning of the epoch and the weights at the end of the epoch). We see that even though larger momentum values take smaller later steps, they actually traverse more distance:

Figure 14: Distance traversed per epoch for each momentum value.

This indicates that even though increasing the momentum values causes the later update steps to become smaller, the distance traversed is actually greater because the steps are more efficient – they do not cancel each other out as often.

Now, let’s look at the effect of momentum on a small learning rate (0.001).

Figure 15: Effect of momentum on training loss (left) and validation loss (right) at learning rate 0.001.

Surprisingly, increasing momentum at small learning rates helps the model converge when it previously did not! Now, let’s look at a large learning rate.

Figure 16: Effect of momentum on training loss (left) and validation loss (right) at learning rate 0.1.

When the learning rate is large, increasing the momentum degrades performance, and can even result in the model failing to converge (see flat lines above corresponding to momentum 0.9 and 0.95).

Now, to generalize our observations, let’s look at the minimum training loss and validation loss across all learning rates and momentums:

Figure 17: Minimum training loss (left) and validation loss (right) at different learning rates and momentums. Minimum value in each row is highlighted in green.

We see that the learning rate and the momentum are closely linked – the higher the learning rate, the lower the range of ‘acceptable’ momentum values (i.e. values that don’t cause the model training to diverge). Conversely, the higher the momentum, the lower the range of acceptable learning rates.

Altogether, the behavior across all the learning rates suggests that increasing momentum has an effect akin to increasing the learning rate. It helps smaller learning rates converge (Figure 15) but may cause larger ones to diverge (Figure 16). This makes sense if we consider the terminal velocity interpretation from Figure 9 – adding momentum can cause the updates to reach a terminal velocity much greater than the vanilla updates themselves.

Note, however, that this does not mean that increasing momentum is the same as increasing the learning rate – there are simply some similarities in terms of convergence/divergence behavior between increasing momentum and increasing the learning rate. More concretely, as we can see in Figures 12 and 13, momentum also decreases oscillations, and front-loads the large updates at the beginning of training – we would not observe the same behaviors if we simply increased the learning rate.

Alternative formulation of momentum

There is another way to define momentum, expressed as follows:

Figure 18: Alternative definition of momentum

Andrew Ng uses this definition of momentum in his Deep Learning Specialization on Coursera. In this formulation, the velocity term is an exponentially weighted moving average of the gradients, controlled by the parameter beta. The update is applied to the weights, with the size of the update controlled by the learning rate alpha. Note that this formulation is mathematically equivalent to the first formulation when expanded, except that all the terms are multiplied by (1 – beta).
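
In code, a sketch of this alternative formulation (with illustrative variable names) looks as follows:

def momentum_update_ema(w, v, dw, lr=0.01, beta=0.9):
    """Alternative formulation: v is an exponential moving average of the gradients."""
    v = beta * v + (1 - beta) * dw   # EMA of the gradient
    w = w - lr * v                   # the step size is controlled solely by lr
    return w, v

With a constant gradient c, the velocity simply converges to c, so the step never grows beyond lr * c – there is no acceleration, as discussed below.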

How does this formulation of momentum work in practice?

Figure 19: Effect of momentum (alternative formulation) on training loss (left) and validation loss (right)

Surprisingly, using this alternative formulation, it looks like increasing the momentum actually slows down convergence!

Why would this be the case? This formulation of momentum, while dampening oscillations, does not enjoy the same benefit of acceleration that the other formulation does. If we consider a toy example where the gradient is always a constant c, we see that the velocity never accelerates:

Figure 20: Change in velocity over time with repeated gradients of constant c

Indeed, Andrew Ng suggests that the main benefit of this formulation of momentum is not acceleration, but the fact that it dampens oscillations, allowing you to use a larger learning rate and therefore converge more quickly. Based on our experiments, increasing momentum by itself (in this formulation) without increasing the learning rate is not enough to guarantee faster convergence.

Best validation loss on SGD with momentum

  • Best validation loss: 0.2046
  • Associated training loss: 0.2252
  • Epochs to converge to minimum: 402
  • Params: learning rate 0.01, momentum 0.5

SGD with momentum takeaways

  • Momentum causes model training to converge more quickly, but is not guaranteed to improve the final training or validation loss, based on the parameters we tested.
  • The higher the learning rate, the lower the range of acceptable momentum values (ones where model training converges).

Stochastic Gradient Descent with Nesterov Momentum

One issue with momentum is that while the gradient always points in the direction of greatest loss decrease, the momentum may not. To correct for this, Nesterov momentum computes the gradient at a lookahead point (w + velocity) instead of w. This gives the gradient a chance to correct for the momentum term.

Figure 21: Nesterov update. Left: illustration. Right: equations.
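
A sketch of the Nesterov update, assuming a function grad_fn that returns the gradient of the loss at a given set of weights (names are illustrative):

def nesterov_update(w, v, grad_fn, lr=0.01, beta=0.9):
    """Nesterov momentum: evaluate the gradient at the lookahead point w + beta * v."""
    dw = grad_fn(w + beta * v)   # gradient after the momentum step, not at w itself
    v = beta * v - lr * dw
    w = w + v
    return w, v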

To illustrate how Nesterov can help training converge more quickly, let’s look at a dummy example where the optimizer tries to descend a bowl-shaped loss surface, with the minimum at the center of the bowl.

Figure 22. Left: regular momentum. Right: Nesterov momentum.

As the illustrations show, Nesterov converges more quickly because it computes the gradient at a lookahead point, thus ensuring that the update approaches the minimizer more quickly.

Let’s try out Nesterov on a subset of the learning rates and momentums we used for regular momentum, and see if it speeds up convergence. Let’s take a look at learning rate 0.001 and momentum 0.95:

Figure 23: Effect of Nesterov momentum on lr 0.001 and momentum 0.95.

Here, Nesterov does indeed seem to speed up convergence rapidly! How about if we increase the momentum to 0.99?

Figure 24: Effect of Nesterov momentum on lr 0.001 and momentum 0.99.

Now, Nesterov actually converges more slowly on the training loss, and though it initially converges more quickly on validation loss, it slows down and is overtaken by momentum after around 50 epochs.

How should we measure speed of convergence over all the training runs? Let’s take the loss that regular momentum achieves after 50 epochs, then determine how many epochs Nesterov takes to reach that same loss. We define the convergence ratio as this number of epochs divided by 50. If it is less than one, then Nesterov converges more quickly than regular momentum; conversely, if it is greater than one, Nesterov converges more slowly.
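
A sketch of that convergence-ratio computation, given the per-epoch loss history of each run (a hypothetical helper, not the experiment code):

def convergence_ratio(nesterov_losses, momentum_losses, reference_epoch=50):
    """Epochs Nesterov needs to match regular momentum's loss at reference_epoch,
    divided by reference_epoch. Values below 1 mean Nesterov converged faster."""
    target = momentum_losses[reference_epoch - 1]
    for epoch, loss in enumerate(nesterov_losses, start=1):
        if loss <= target:
            return epoch / reference_epoch
    return float("inf")   # Nesterov never reached the target loss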

Figure 25. Ratio of epochs for Nesterov’s loss to converge to the regular momentum’s loss after 50 epochs. Training runs where Nesterov was faster are highlighted in green; slower in red; and runs where neither Nesterov nor regular momentum converged in yellow.

We see that in most cases (10/14), adding Nesterov causes the training loss to decrease more quickly, as seen in Figure 25. The same applies, to a lesser extent (8/12), for the validation loss.

There does not seem to be a clear relationship between the speedup from adding Nesterov and the other parameters (learning rate and momentum), though this can be an area for future investigation.

Best validation loss on SGD with Nesterov momentum

  • Best validation loss: 0.2020
  • Associated training loss: 0.1945
  • Epochs to converge to minimum: 414
  • Params: learning rate 0.003, momentum 0.95
Figure 26. Minimum training and validation losses achieved by each training run. Minimum in each row is highlighted in green.

SGD with Nesterov momentum takeaways

  • Nesterov momentum computes the gradient at a lookahead point in order to account for the effect of momentum.
  • Nesterov generally converges more quickly compared to regular momentum.

RMSprop

In RMSprop, we do the following for each weight parameter:

  • Keep a moving average of the squared gradient.
  • Divide the gradient by the square root of the moving average, then apply the update.

The update equations are as follows:

Figure 27: RMSprop update equations – adapted from the Deep Learning Specialization by Andrew Ng

Here, rho is a hyperparameter that defines how quickly the moving average adapts to new terms – the higher rho is, the more slowly the moving average changes. Epsilon is a small number meant to prevent division by zero. Alpha is the learning rate, w_i is weight i, a_i is the moving average, and dw_i is the gradient for weight i.

What is RMSprop trying to do on a conceptual level? RMSprop is trying to normalize each element of the update so that no one element is excessively large or small. As an example, consider a weight parameter where the gradients are [5, 5, 5] (and assume that 𝛼=1). The denominator in the second equation is then 5, so the updates applied would be -[1, 1, 1]. Now, consider a weight parameter where the gradients are [0.5, 0.5, 0.5]; the denominator would be 0.5, giving the same updates -[1, 1, 1] as the previous case! In other words, RMSprop cares more about the sign (+ or -) of each gradient than its magnitude, and tries to normalize the size of the update step for each weight.

This is different from vanilla SGD, which applies larger updates for weight parameters with larger gradients. Considering the above example where the gradient is [5, 5, 5], we can see that the resulting updates would be -[5, 5, 5], whereas for the [0.5, 0.5, 0.5] case the updates would be -[0.5, 0.5, 0.5].
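
A sketch of the RMSprop step from Figure 27, applied elementwise to a weight tensor (variable names are ours; a library implementation may handle epsilon slightly differently):

import numpy as np

def rmsprop_update(w, a, dw, lr=0.001, rho=0.9, eps=1e-7):
    """RMSprop: normalize each update by a moving average of the squared gradients."""
    a = rho * a + (1 - rho) * dw ** 2         # moving average of squared gradients
    w = w - lr * dw / (np.sqrt(a) + eps)      # per-parameter normalized step
    return w, a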

How do learning rate and rho affect RMSprop?

Let’s try out RMSprop while varying the learning rate 𝛼 (default 0.001) and the coefficient 𝜌 (default 0.9). Let’s first try setting 𝜌 = 0 and vary the learning rate:

Figure 28: RMSProp training loss at different learning rates, with rho = 0.

First lesson learned – it looks like RMSProp with 𝜌=0 does not perform well. This results in the update being as follows:

Figure 29: RMSprop when rho = 0

Why this fails to perform well is an area for future investigation.

Let’s try again over nonzero rho values. We first plot the train and validation losses for a small learning rate (1e-3).

Figure 30: RMSProp at different rho values, with learning rate 1e-3.

Increasing rho seems to reduce both the training loss and validation loss, but with diminishing returns – the validation loss ceases to improve when increasing rho from 0.95 to 0.99.

Let’s now take a look at what happens when we use a larger learning rate.

Figure 31: RMSProp at different rho values, with learning rate 3e-2.

Here, the training and validation losses entirely fail to converge!

Let’s take a look at the minimum training and validation losses across all parameters:

Figure 32: Minimum training loss (left) and minimum validation loss (right) on RMSprop across different learning rates and rho values. Minimum value in each row is highlighted in green.

From the plots above, we find that once the learning rate reaches 0.01 or higher, RMSprop fails to converge. Thus, the optimal learning rate found here is around ten times smaller than the optimal learning rate on SGD! One hypothesis is that the denominator term is much smaller than one, so it effectively scales up the update, and we need to adjust the learning rate downward to compensate.

Regarding 𝜌, we can see from the graphs above that RMSprop performs best on our data with high 𝜌 values (0.9 to 1). Even though the Keras docs recommend using the default value of 𝜌=0.9, it’s worth exploring other values as well – when we increased rho from 0.9 to 0.95, the best attained validation loss improved substantially, from 0.2226 to 0.2061.

Best validation loss on RMSprop

  • Best validation loss: 0.2061
  • Associated training loss: 0.2408
  • Epochs to converge to minimum: 338
  • Params: learning rate 0.001, rho 0.95

RMSprop takeaways

  • RMSprop seems to work at much smaller learning rates than vanilla SGD (about ten times smaller). This is likely because we divide the original update (dw) by the averaged gradient.
  • Additionally, it seems to pay off to explore different values of 𝜌, contrary to the Keras docs’ recommendation to use the default value.

Adam

Adam is sometimes regarded as the optimizer of choice, as it has been shown to converge more quickly than SGD and other optimization methods [1]. It is essentially a combination of SGD with momentum and RMSprop, and uses the following update equations:

Figure 33: Adam update equations

Essentially, we keep a velocity term similar to the one in momentum – it is an exponential average of the gradient updates. We also keep a squared term, which is an exponential average of the squares of the gradients, similar to RMSprop. We apply a bias correction to both terms (dividing by 1 – β^t); otherwise, the exponential averages would start off too small at the beginning of training, since there are no previous terms to average over. Then we divide the corrected velocity by the square root of the corrected squared term, and use that as our update.
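
A sketch of the Adam step described above, where t is the 1-indexed update count (variable names are ours):

import numpy as np

def adam_update(w, v, s, dw, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Adam: momentum-style velocity plus RMSprop-style squared-gradient average,
    both bias-corrected so that early steps are not artificially small."""
    v = beta1 * v + (1 - beta1) * dw          # exponential average of gradients
    s = beta2 * s + (1 - beta2) * dw ** 2     # exponential average of squared gradients
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s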

How does learning rate affect Adam?

It has been suggested that the learning rate is more important than the β1 and β2 parameters, so let’s try varying the learning rate first, on a log scale from 1e-4 to 1:

Figure 34: Training loss (left) and validation loss (right) on Adam across learning rates.

We did not plot learning rates above 0.03, since they failed to converge. We see that as we increase the learning rate, the training and validation loss decrease more quickly – but only up to a certain point. Once we increase the learning rate beyond 0.001, the training and validation loss both start to become worse. This could be due to the ‘overshooting’ behavior illustrated in Figure 4.

So, which of the learning rates is the best? Let’s find out by plotting the best validation loss of each one.

Figure 35: Minimum training and validation loss on Adam across different learning rates.

We see that the validation loss on learning rate 0.001 (which happens to be the default learning rate) seems to be the best, at 0.2059. The corresponding training loss is 0.2077. However, this is still worse than the best SGD run, which achieved a validation loss of 0.1899 and training loss of 0.1945. Can we somehow beat that? Let’s try varying β1 and β2 and see.

How do β1 and β2 affect Adam?

We try the following values for β1 and β2:

beta_1_values = [0.5, 0.9, 0.95]
beta_2_values = [0.9, 0.99, 0.999]
Figure 36: Training loss (left) and validation loss (right) across different values for beta_1 and beta_2.
Figure 37: Minimum training losses (left) and minimum validation losses (right). Minimum value in each row highlighted in green.

The best run is β1=0.5 and β2=0.999, which achieves a training loss of 0.2071 and validation loss of 0.2021. We can compare this against the default Keras params for Adam (β1=0.9 and β2=0.999), which achieves 0.2077 and 0.2059, respectively. Thus, it pays off slightly to experiment with different values of beta_1 and beta_2, contrary to the recommendation in the Keras docs – but the improvement is not large.

Surprisingly, we were not able to beat the best SGD performance! It turns out that others have noticed that Adam sometimes works worse than SGD with momentum or other optimization algorithms [2]. While the reasons for this are beyond the scope of this article, it suggests that it pays off to experiment with different optimizers to find the one that works the best for your data.

Best Adam validation loss

  • Best validation loss: 0.2021
  • Associated training loss: 0.2071
  • Epochs to converge to minimum: 255
  • Params: learning rate 0.001, β1=0.5, and β2=0.999

Adam takeaways

  • Adam is not guaranteed to achieve the best training and validation performance compared to other optimizers, as we found that SGD outperforms Adam.
  • Trying out non-default values for β1 and β2 can slightly improve the model’s performance.

Adagrad

Adagrad accumulates the squares of gradients, and divides the update by the square root of this accumulator term.

Figure 38: Adagrad update equation [3]

This is similar to RMSprop, but the difference is that it simply accumulates the squares of the gradients, without using an exponential average. This should result in the size of the updates decaying over time.
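
A sketch of the Adagrad step (the accumulator only ever grows, which is why the effective step size shrinks over time):

import numpy as np

def adagrad_update(w, g2_sum, dw, lr=0.01, eps=1e-7):
    """Adagrad: accumulate squared gradients without decay and normalize the update."""
    g2_sum = g2_sum + dw ** 2                   # running sum of squared gradients, never decays
    w = w - lr * dw / (np.sqrt(g2_sum) + eps)   # updates shrink as the sum grows
    return w, g2_sum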

Let’s try Adagrad at different learning rates, from 0.001 to 1.

Figure 39: Adagrad at different learning rates. Left: training loss. Right: validation loss.

The best training and validation loss are 0.2057 and 0.2310, using a learning rate of 3e-1. Interestingly, if we compare with SGD using the same learning rates, we notice that Adagrad keeps pace with SGD initially but starts to fall behind in later epochs.

Figure 40: Adagrad vs SGD at same learning rate. Left: training loss. Right: validation loss.

This is likely because Adagrad initially divides by a small number, since the gradient accumulator term has not accumulated many gradients yet. This makes the updates comparable to those of SGD in the initial epochs. However, as the accumulator term accumulates more gradients, the size of the Adagrad updates decreases, and the loss begins to flatten or even rise as it becomes more difficult to reach the minimizer.

Surprisingly, we observe the opposite effect when we use a large learning rate (3e-1):

Figure 41: Adagrad vs SGD at large learning rate (0.316). Left: training loss. Right: validation loss.

At large learning rates, Adagrad actually converges more quickly than SGD! One possible explanation is that while large learning rates cause SGD to take excessively large update steps, Adagrad divides the updates by the accumulator terms, essentially making the updates smaller and more ‘optimal.’

Let’s look at the minimum training and validation losses across all params:

Figure 42: Minimum training and validation losses for Adagrad.

We can see that the best learning rate for Adagrad, 0.316, is significantly larger than that for SGD, which was 0.03. As mentioned above, this is most likely because Adagrad divides by the accumulator terms, causing the effective size of the updates to be smaller.

Best validation loss on Adagrad

  • Best validation loss: 0.2310
  • Associated training loss: 0.2057
  • Epochs to converge to minimum: 406
  • Params: learning rate 0.316

Adagrad takeaways

  • Adagrad accumulates the squares of gradients, then divides the update by the square root of the accumulator term.
  • The size of Adagrad updates decreases over time.
  • The optimal learning rate for Adagrad is larger than for SGD (at least 10x in our case).

Cyclic Learning Rate

Cyclic Learning Rate is a method that lets the learning rate vary cyclically between a min and max value [4]. It claims to eliminate the need to tune the learning rate, and can help the model training converge more quickly.

Figure 43: Cyclic learning rate using a triangular cycle

We try the cyclic learning rate with reasonable learning rate bounds (base_lr=0.1, max_lr=0.4), and a step size equal to 4 epochs, which is within the 4–8 range suggested by the author.
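
As a sketch, the triangular schedule can be written as follows, loosely following Smith's formulation [4] (variable names are ours; the defaults match the settings above, with the step size measured in epochs). In practice this would be wrapped in a callback that sets the learning rate each epoch.

def triangular_lr(iteration, base_lr=0.1, max_lr=0.4, step_size=4):
    """Triangular cyclic learning rate: ramp linearly from base_lr up to max_lr and
    back down over 2 * step_size steps, then repeat."""
    cycle = iteration // (2 * step_size)
    x = abs(iteration / step_size - 2 * cycle - 1)   # position within the cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)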

Figure 44: Cyclic learning rate. Left: Train loss. Right: validation loss.

We observe cyclic oscillations in the training loss, due to the cyclic changes in the learning rate. We also see these oscillations, to a lesser extent, in the validation loss.

Best CLR training and validation loss

  • Best validation loss: 0.2318
  • Associated training loss: 0.2267
  • Epochs to converge to minimum: 280
  • Params: Used the settings mentioned above. However, we may be able to obtain better performance by tuning the cycle policy (e.g. by allowing the max and min bounds to decay) or by tuning the max and min bounds themselves. Note that this tuning may offset the time savings that CLR purports to offer.

CLR takeaways

  • CLR varies the learning rate cyclically between a min and max bound.
  • CLR may potentially eliminate the need to tune the learning rate while attaining similar performance. However, we did not attain similar performance.

Comparison

So, after all the experiments above, which optimizer ended up working the best? Let’s take the best run from each optimizer, i.e. the one with the lowest validation loss:

Figure 45: Best validation loss achieved by each optimizer.

Surprisingly, SGD achieves the best validation loss, and by a significant margin. Then, we have SGD with Nesterov momentum, Adam, SGD with momentum, and RMSprop, which all perform similarly to one another. Finally, Adagrad and CLR come in last, with losses significantly higher than the others.

What about training loss? Let’s plot the training loss for the runs selected above:

Figure 46: Training loss achieved by each optimizer for best runs selected above.

Here, we see some correlation with the validation loss, but Adagrad and CLR perform better than their validation losses would imply.

What about convergence? Let’s first take a look at how many epochs it takes each optimizer to converge to its minimum validation loss:

Figure 47: Num epochs to converge to minimizer.

Adam is clearly the fastest, while SGD is the slowest.

However, this may not be a fair comparison, since the minimum validation loss for each optimizer is different. How about measuring how many epochs it takes each optimizer to reach a fixed validation loss? Let’s take the worst minimum validation loss of 0.2318 (the one achieved by CLR), and compute how many epochs it takes each optimizer to reach that loss.

Figure 48: Number of epochs to converge to worst minimum validation loss (0.2318, achieved by CLR).

Again, we can see that Adam does converge more quickly to the given loss than any other optimizer, which is one of its purported advantages. Surprisingly, SGD with momentum seems to converge more slowly than vanilla SGD! This is because the learning rate used by the best SGD with momentum run is lower than that used by the best vanilla SGD run. If we hold the learning rate constant, we see that momentum does in fact speed up convergence:

Figure 49: Comparing SGD and SGD with momentum.

As seen above, the best vanilla SGD run (blue) converges more quickly than the best SGD with momentum run (orange), since its learning rate is higher, at 0.03 compared to the latter’s 0.01. However, when we hold the learning rate constant by comparing with vanilla SGD at learning rate 0.01 (green), we see that adding momentum does indeed speed up convergence.

Why does Adam fail to beat vanilla SGD?

As mentioned in the Adam section, others have also noticed that Adam sometimes works worse than SGD with momentum or other optimization algorithms [2]. To quote Vitaly Bushaev’s article on Adam, "after a while people started noticing that despite superior training time, Adam in some areas does not converge to an optimal solution, so for some tasks (such as image classification on popular CIFAR datasets) state-of-the-art results are still only achieved by applying SGD with momentum." [2] Though the exact reasons are beyond the scope of this article, others have shown that Adam may converge to sub-optimal solutions, even on convex functions.

Conclusions

Overall, we can conclude that:

  • You should tune your learning rate – it makes a large difference in your model’s performance, even more so than the choice of optimizer.
  • On our data, vanilla SGD performed the best, but Adam achieved performance that was almost as good, while converging more quickly.
  • It is worth trying out different values for rho in RMSprop and the beta values in Adam, even though Keras recommends using the default params.

References

[0] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, Chapter 8: Optimization for Training Deep Models. https://www.deeplearningbook.org/contents/optimization.html

[1] Diederik P. Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. 2014. arXiv:1412.6980v9

[2] Vitaly Bushaev. Adam – latest trends in deep learning optimization. https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c

[3] Sebastian Ruder. An overview of gradient descent optimization algorithms. https://ruder.io/optimizing-gradient-descent/index.html#adagrad

[4] Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. 2015. https://arxiv.org/pdf/1506.01186.pdf

