Flux.jl on MNIST — What about ADAM?
So far we’ve seen a performance analysis using the standard gradient descent optimizer. But what results do we get if we use a more sophisticated one like ADAM?
In Flux.jl on MNIST — Variations of a theme, I presented three neural networks for recognizing handwritten digits as well as three variations of the gradient descent algorithm (GD) for training these networks.
The follow-up article Flux.jl on MNIST — A performance analysis showed how well each of these models performs using different parametrizations and how much effort it takes to train them, depending on the GD variant applied. In that analysis, the GD algorithm used the standard update formula with a constant learning rate.
Now we will perform this analysis using an advanced learning formula (in Flux called an “optimizer”) and compare the results to the performance achieved with the standard formula. The optimizer we will use is called ADAM (ADAptive Moment estimation), published in 2015 by Diederik P. Kingma and Jimmy Ba. It adapts the learning rate on each iteration and is currently one of the most sophisticated approaches in this field.
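To make the “adaptive” part concrete, here is a minimal plain-Julia sketch of a single ADAM update, following the formulas from Kingma and Ba’s paper. It is an illustration only, not Flux’s internal implementation; the demo at the end minimizes a simple quadratic instead of a neural network loss.

```julia
# One ADAM update step (Kingma & Ba, 2015). Not Flux's internal code,
# just the published formulas in plain Julia.
function adam_step!(θ, g, m, v, t; η = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    @. m = β1 * m + (1 - β1) * g      # 1st moment: running mean of gradients
    @. v = β2 * v + (1 - β2) * g^2    # 2nd moment: running mean of squared gradients
    m̂ = m ./ (1 - β1^t)               # bias correction for the zero-initialized moments
    v̂ = v ./ (1 - β2^t)
    @. θ -= η * m̂ / (sqrt(v̂) + ϵ)     # per-parameter adaptive step
    return θ
end

# Demo: minimize f(θ) = θ₁² + θ₂² (the gradient is 2θ).
θ = [5.0, -3.0]
m, v = zeros(2), zeros(2)
for t in 1:2000
    adam_step!(θ, 2 .* θ, m, v, t; η = 0.01)
end
# θ is now close to the minimum at (0, 0)
```

Note how the step size for each parameter is scaled by its own gradient history (m̂ and v̂), which is what lets ADAM make progress where a fixed learning rate gets stuck.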
As in the previous analyses, we will apply all three GD variants for training: batch GD, stochastic GD, and mini-batch GD.
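In Flux, switching the optimizer just means passing a different optimizer object to the training loop; the three GD variants then differ only in how the data iterator is built. A rough sketch in the implicit-parameters style used at the time of these articles (model, loss, and train_data are placeholders for the networks and data from the earlier posts):

```julia
using Flux

opt = ADAM(0.001)          # instead of e.g. Descent(0.1)
ps  = Flux.params(model)   # `model` is one of the three networks

# batch GD: one entry containing all data; stochastic GD: one sample
# per entry; mini-batch GD: one mini-batch per entry
for (x, y) in train_data
    gs = gradient(() -> loss(x, y), ps)
    Flux.update!(opt, ps, gs)
end
```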
Batch Gradient Descent
4LS-Model
Applied to the 4LS-model using batch GD, ADAM leads to a dramatic improvement: with only 500 iterations we get an accuracy of 74.16%, increasing to over 91% with more iterations. In contrast, the standard Descent optimizer got stuck at 11.35%.
The learning curves illustrate this behaviour: on the left we have the standard optimizer up to iteration 500, and on the right ADAM with 1,000 iterations. ADAM reduces the loss very quickly to a value below 0.1, then stagnates a bit until about iteration 300, where it starts to decrease again. Descent also gets below 0.1 quite quickly (please note the different scaling on the y-axis) and then stagnates. Moreover, the decrease in loss with Descent doesn’t translate into an increase in accuracy (on the test dataset).
3LS-Model
We get a similar result for the 3LS-model. ADAM also improves the results significantly here: at 500 iterations we get an accuracy of 95.66%, which improves slightly to over 96% at 1,000 iterations. The standard optimizer, on the other hand, only produced accuracies below 20%.
The learning curves show this too: both optimizers decrease the loss quite quickly to a value below 0.1 (note again the different scaling on the y-axis), with ADAM being even faster than Descent and then decreasing significantly further.
Compared to the 4LS-model, ADAM stagnates here much sooner, but at that point it has already achieved a better accuracy than on the 4LS-model.
2LR-Model
On the 2LR-model we already got good results with the standard optimizer, and we reach the same level of accuracy (above 96%) with ADAM. But ADAM gets there already at iteration 500 (then declining slightly), whereas the standard optimizer needs 4,000 iterations to reach that level.
The learning curves are in this case quite similar, apart from ADAM’s faster decrease at the beginning:
Stochastic Gradient Descent
The standard optimizer delivered quite good results on the 4LS- and 3LS-models with stochastic GD. So let’s see if ADAM can do any better.
4LS-Model
The learning curves for the 4LS-model have the same basic shape for both optimizers, but ADAM results in a stronger oscillation.
ADAM gives us higher accuracies with fewer iterations, but in the end we don’t quite get the quality we received from Descent (94.16% with Descent vs. 92.12% with ADAM at 2 million iterations).
3LS-Model
With the 3LS-model we get almost the same situation: basically similar curves, where ADAM converges faster to lower loss values at the beginning (and oscillates much more along the way):
This is reflected in the accuracies we can achieve. And again, ADAM doesn’t quite reach the level of the standard optimizer (93.73% with ADAM vs. 97.64% with Descent at 2 million iterations):
2LR-Model
Training the 2LR-model with the standard optimizer was a failure: we never got an accuracy above 20%. ADAM is a real improvement in this case, achieving a bit more than 87% accuracy at about 0.5 million iterations (before getting worse again). But we don’t get into the region above 95% as we did with other configurations.
The learning curve using ADAM shows heavy oscillations, and it doesn’t really fit the numbers in the table above, since the loss seems to increase the more iterations we do.
A 100-point moving average over the first 128,000 iterations gives a better picture of what happens in that part: at least over the first 40,000 iterations the loss decreases significantly. But this still doesn’t explain why the accuracy improves in the range up to about 0.5 million iterations.
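Such a moving average is easy to compute; here is a small plain-Julia helper (the window length of 100 matches the plot above; losses is dummy data, not the recorded values):

```julia
# n-point moving average: mean of each window of n consecutive values
function moving_average(xs, n)
    [sum(xs[i-n+1:i]) / n for i in n:length(xs)]
end

losses = [abs(sin(i / 50)) + 0.5 for i in 1:1_000]   # dummy loss curve
ma = moving_average(losses, 100)                     # 901 smoothed values
```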
Mini-batch Gradient Descent
The mini-batch GD achieved very good results on all three models, so there isn’t much room left for ADAM to improve. Let’s have a look at the differences.
4LS-Model
As seen with other configurations above, ADAM converges faster at the beginning (92.46% accuracy at 500 iterations) but doesn’t quite reach the level of the standard optimizer (94.49% with Descent vs. 93.28% with ADAM at 2 million iterations):
The learning curves reflect these findings (and again, ADAM leads to more oscillation):
3LS-Model
The same holds for the 3LS-model: ADAM achieves an accuracy of 94.47% already at 500 iterations, but is slightly worse in the end at 2 million iterations (97.68% with Descent vs. 96.17% with ADAM).
This is reflected in the learning curves in an analogous way:
2LR-Model
With the 2LR-model the situation is a bit different: here ADAM isn’t even faster to converge at the beginning. And again it doesn’t reach the same level of accuracy as the standard optimizer (97.0% with Descent vs. 91.33% with ADAM).
The learning curve here shows (as with stochastic GD) a significant oscillation:
Conclusions
Descent vs. ADAM
The following diagram gives an overview of the best results achieved with the standard optimizer (left) and ADAM (right), showing accuracy in relation to training time. Here we can see that ADAM generally does well: almost all of its results are in the upper left corner, i.e. a high accuracy has been achieved with relatively little effort. The standard optimizer delivered a broader range of results, but in the end it produced some of the best ones.
Top Performers
The next diagram shows the top performers (accuracy > 90%). The best results were achieved by 3LS/Descent (bright blue), followed by 2LR/Descent (bright yellow). The 4LS-variants (greenish colors) didn’t do so well in general.
Effort
And finally, as in the previous analysis, I calculated a performance indicator called “effort”: the ratio of training time to accuracy, describing how much training time is necessary to achieve a certain level of accuracy. So you can see the top performers in that respect too:
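Expressed as code, the indicator is just a ratio (the numbers below are illustrative, not the measured results):

```julia
# "effort" = training time divided by accuracy: how much training time
# is spent per percentage point of accuracy (lower is better)
effort(training_time, accuracy) = training_time / accuracy

effort(120.0, 96.0)   # 1.25 seconds per percentage point
```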
Summary
Interestingly, we got significantly better results with ADAM in all three cases where the standard optimizer failed. But in all the other cases ADAM didn’t quite reach the levels we obtained with Descent. In those cases it mostly converged faster, reaching good accuracies with fewer iterations, but this didn’t help in the end. So we can conclude that the two optimizers complement each other well, but neither of them can be recommended for all purposes.