Implementation of Optimization for Deep Learning Highlights in 2017 (feat. Sebastian Ruder)

Jae Duk Seo
Towards Data Science
10 min read · May 24, 2018



Sebastian Ruder is a PhD student in Natural Language Processing and a research scientist at AYLIEN, and he runs one of the most interesting and informative blogs on NLP and machine learning. (I read it all the time and highly recommend it to anyone!)

I covered the optimization algorithms I learned from his blog in this post. This second post covers more advanced optimization techniques.

Network Architecture / Benchmark to Compare

Network Architecture of ELU Network, from this blog post

I recently covered ‘Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)’ (click here to read the blog post), and unfortunately the model we implemented there suffered from over-fitting to the training images. So let’s take a look at how well each of these methods is able to improve the performance of the model, in other words how well it can generalize. I focus on this because all of these methods are, in one way or another, tackling the problem of generalization.

Please note that, for a fair comparison (and because I wanted to know how these methods improve the network as it is), I did not add any additional layers such as batch normalization or any data preprocessing. Also, all of the networks were trained using some variation of the Adam optimizer.

The image above shows how the same network performed when trained with automatic differentiation (Adam optimizer) and L2 regularization. So, in summary, the benchmark we aim to beat with the new optimization methods is 64 percent accuracy on the test images.
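For reference, below is a rough sketch of what that benchmark setup looks like in TensorFlow 1.x; the tiny single-layer stand-in network and the 1e-4 L2 coefficient are my own placeholders for illustration, not the actual model from the original notebook:

    import tensorflow as tf

    # Minimal stand-in network so the snippet is self-contained:
    # flattened 32x32x3 inputs, 10 classes (CIFAR-10 style).
    x = tf.placeholder(tf.float32, [None, 32 * 32 * 3])
    y = tf.placeholder(tf.float32, [None, 10])
    w = tf.Variable(tf.random_normal([32 * 32 * 3, 10], stddev=0.01))
    b = tf.Variable(tf.zeros([10]))
    logits = tf.matmul(x, w) + b

    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))

    # L2 regularization: penalize the squared magnitude of the weights.
    loss = cross_entropy + 1e-4 * tf.nn.l2_loss(w)  # 1e-4 is an assumed coefficient

    # The benchmark optimizer: plain Adam minimizing the regularized loss.
    train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)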

Fixing Weight Decay Regularization in Adam

The paper above is one of the main references Sebastian used in his blog post, so I thought it would be a good idea to link it here as well.

Case 1) Decoupling weight decay / Results

Left Image → Regular Adam Optimizer
Right Image → Adam with Decoupling Weight Decay
Red Box → Weight Decay Term added

The first method is very simple: when updating the weights, we add a weight decay term (a value less than 1) multiplied by both the weights and the learning rate. When implemented in Python it looks something like below.

Red Line → Added Line for weight decay regularization
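Since the screenshot does not carry over here, below is a minimal NumPy re-statement of that update (my own sketch of the decoupled weight decay rule, not the notebook's TensorFlow code); w is a weight tensor, g its gradient, and wd the weight decay term:

    import numpy as np

    def adamw_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
                   eps=1e-8, wd=0.512):
        # Standard Adam moment estimates with bias correction (t starts at 1).
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Decoupled weight decay: the decay term (wd * w) is scaled by the
        # learning rate and applied directly to the weights, instead of being
        # folded into the gradient as L2 regularization would be.
        w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
        return w, m, v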

I set the weight decay rate to 0.512 and halved it at iterations 10, 50, 100, and 150.

Left Image → Train Accuracy / Cost Over Time
Right Image → Test Accuracy / Cost Over Time

Because the cost rises up to 80, the graph on the right is skewed; however, don’t worry, I have attached the image below to show the final accuracy of the model.

As seen above, using this method we were able to achieve 56 percent accuracy on the test images. That is not great considering that plain Adam with L2 regularization can achieve 64 percent accuracy.

Case 2) Fixing the exponential moving average / Results

Red Box → Max between the old and new v value

The image above shows the new update rule for Adam (this algorithm is also known as AMSGrad). Also, please note that for this rule I changed the beta 2 value to 0.9 rather than using the default value of 0.999 (as seen below).

Finally, when implemented in TensorFlow, it can look something like below.

Following Filip KRZN’s implementation

Red Box → Max between the old and new v value
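Because the TensorFlow screenshot does not carry over either, here is a minimal NumPy re-statement of the AMSGrad update (my own sketch, not Filip KRZN's implementation); the only change from plain Adam is the running maximum kept in v_hat:

    import numpy as np

    def amsgrad_step(w, g, m, v, v_hat, lr=0.001, beta1=0.9, beta2=0.9, eps=1e-8):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        # The fix: the denominator uses the maximum of all second-moment
        # estimates seen so far, so the effective learning rate never grows.
        v_hat = np.maximum(v_hat, v)
        w = w - lr * m / (np.sqrt(v_hat) + eps)
        return w, m, v, v_hat

Note that beta2 defaults to 0.9 here to match the change described above.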

I personally like this method the most, since it is very simple to implement. I understand that using reduce sum to compare the tensors might not be the best approach; another option would be to compare their Euclidean norms. But now let’s see how this method improved the model’s performance.

Left Image → Train Accuracy / Cost Over Time
Right Image → Test Accuracy / Cost Over Time

As seen above, this method did not fully prevent the model from over-fitting, but it did perform better than our benchmark.

66 percent accuracy on the test images and 99 percent accuracy on the training images, with no other form of regularization except for the change we made in back propagation.

Case 3) Tuning the learning rate / Results

Red Box → Changed Hyper Parameters for Adam
Blue Box → Calculated Number of Parameters of the model
Purple Box → Equation to choose our learning rate for current step

For this method we first calculate the number of parameters in our network. (If anyone wants to know how to do that, please click this link.) Below is a simple example of the calculated parameter values for VGG 16.

Image from this website

Below is an example of calculating the parameters for our network.

Red Box → Parameter Calculation
Blue Box → Network architecture for reference
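As a quick illustration of how those numbers come about: a convolutional layer has (kernel_height * kernel_width * channels_in + 1) * channels_out parameters and a fully connected layer has (units_in + 1) * units_out, where the +1 accounts for the bias. A tiny helper (my own sketch, not the notebook's code) might look like this:

    def conv_params(k_h, k_w, c_in, c_out):
        # Each of the c_out filters has k_h * k_w * c_in weights plus one bias.
        return (k_h * k_w * c_in + 1) * c_out

    def dense_params(n_in, n_out):
        # Weight matrix plus one bias per output unit.
        return (n_in + 1) * n_out

    # Example: the first convolution of VGG 16 (3x3 kernels, 3 -> 64 channels).
    print(conv_params(3, 3, 3, 64))  # 1792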

Now that we have the number of parameters ready, let’s take a look at the equation for calculating the learning rate.

With a simple tf.cond() we can choose the minimum of step_num^(-0.5) and step_num*warmup_steps^(-1.5). Finally, let’s take a look at the result.
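Putting it together, a minimal sketch of the schedule is shown below; this is my reading of the purple box, assuming the parameter count computed above plays the role of the model-size scaling term, and the warmup_steps default of 4000 is an assumed value rather than what the notebook necessarily used:

    def warmup_lr(step_num, num_params, warmup_steps=4000):
        # Linear warmup for the first warmup_steps iterations, then decay
        # proportional to 1 / sqrt(step_num); num_params scales the whole curve.
        step_num = max(step_num, 1)  # avoid 0 ** -0.5 on the very first step
        return (num_params ** -0.5) * min(step_num ** -0.5,
                                          step_num * warmup_steps ** -1.5)

    # Usage with a hypothetical parameter count:
    # lr = warmup_lr(step_num=200, num_params=100000)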

Left Image → Train Accuracy / Cost Over Time
Right Image → Test Accuracy / Cost Over Time

As seen above, using this method the model was not even able to perform well on the training images. I suspect the learning rate could have been set higher for this method.

Please note that I interpreted the step number as the iteration number (i.e. the current iteration); if this is wrong, please comment below.

The final accuracy for this model was around 23 percent on both the training and the test images.

Case 4) Adam with Restarts / Results

Above is the equation for stochastic gradient descent (SGD) with restarts. In summary, the learning rate gets reset every Ti iterations. When we plot the learning rate over time it looks something like below.

As seen above, the learning rate is set back to 1 at iteration 1, again at iteration 300, then at 700, and so on. Now let’s take a look at how to apply this method to Adam.

Red Box → How to apply this method to Adam optimizer

As seen above, we first need to fix the weight decay, which we have already seen in action (this is the method we used in Case 1). Now let’s take a look at the implementation.

Red Box → Variable Place Holder for Ti and Tcur
Blue Box → Equation for calculating the learning rate
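For reference, below is a minimal sketch of that learning rate calculation (my reading of the cosine annealing with restarts equation, not the exact notebook code); eta_max is 0.001 as mentioned below, Tcur counts iterations since the last restart, and doubling Ti after each restart reproduces the 10 / 30 / 70 / 150 restart points:

    import numpy as np

    def restart_lr(t_cur, t_i, eta_min=0.0, eta_max=0.001):
        # Cosine anneal from eta_max down to eta_min over t_i iterations.
        return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t_cur / t_i))

    # Plot the schedule for 200 iterations with restarts at 10, 30, 70, 150.
    schedule, t_cur, t_i = [], 0, 10
    for step in range(200):
        if t_cur >= t_i:  # restart: reset Tcur and double the cycle length
            t_cur, t_i = 0, t_i * 2
        schedule.append(restart_lr(t_cur, t_i))
        t_cur += 1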

There are different ways to implement restarts, but I chose the simplest method LOL. Also, please note that the authors of the original paper recommend setting the maximum learning rate to 1; I set it to 0.001. When we plot how the learning rate changes over time (for 200 iterations), it looks something like below.

The restart iterations are set to 10, 30, 70, and 150 (when the first restart iteration is set to 10). Now let’s take a look at the performance of the model.

Left Image → Train Accuracy / Cost Over Time
Right Image → Test Accuracy / Cost Over Time

Well… no doubt the model did horribly. I actually spent many hours trying to get this model to work, but it always seemed to learn something only at the beginning of training, and at some point it appeared to overshoot and stop learning altogether.

Since I REALLY want to know what I did wrong, please comment below if you have any recommendations for this model.

The final accuracy on both the training and the test images was 9 percent….

Interactive Code / Transparency

For Google Colab, you need a Google account to view the code, and you can’t run read-only scripts in Google Colab, so make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy coding! Also, for transparency, I uploaded all of the logs from training.

To access the code for Case 1 please click here, to access the logs click here.
To access the code for Case 2 please click here, to access the logs click here.
To access the code for Case 3 please click here, to access the logs click here.
To access the code for Case 4 please click here, to access the logs click here.

Final Words

These methods are amazing work done by multiple smart researchers; however, FOR THIS EXPERIMENT none of them seemed to be the magic bullet that makes the model generalize extremely well.

If you find any errors, please email me at jae.duk.seo@gmail.com. If you wish to see the list of all of my writing, please visit my website here.

Meanwhile, follow me on Twitter here, and visit my website or my YouTube channel for more content. I also implemented Wide Residual Networks; please click here to view the blog post.

Reference

  1. Optimization for Deep Learning Highlights in 2017. (2017). Sebastian Ruder. Retrieved 7 May 2018, from http://ruder.io/deep-learning-optimization-2017/
  2. Only Numpy: Implementing and Comparing Gradient Descent Optimization Algorithms + Google Brain’s…. (2018). Towards Data Science. Retrieved 7 May 2018, from https://towardsdatascience.com/only-numpy-implementing-and-comparing-gradient-descent-optimization-algorithms-google-brains-8870b133102b
  3. Clevert, D., Unterthiner, T., & Hochreiter, S. (2015). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). Arxiv.org. Retrieved 7 May 2018, from https://arxiv.org/abs/1511.07289
  4. [ICLR 2016] Fast and Accurate Deep Networks Learning By Exponential Linear Units (ELUs) with…. (2018). Towards Data Science. Retrieved 7 May 2018, from https://towardsdatascience.com/iclr-2016-fast-and-accurate-deep-networks-learning-by-exponential-linear-units-elus-with-c0cdbb71bb02
  5. Regularization with TensorFlow. (2018). ritchieng.github.io. Retrieved 7 May 2018, from http://www.ritchieng.com/machine-learning/deep-learning/tensorflow/regularization/
  6. tf.cond | TensorFlow. (2018). TensorFlow. Retrieved 7 May 2018, from https://www.tensorflow.org/api_docs/python/tf/cond
  7. ValueError: Shape must be rank 0 but is rank 1 for ‘cond_11/Switch’ (op: ‘Switch’). (2018). Stack Overflow. Retrieved 8 May 2018, from https://stackoverflow.com/questions/47739707/valueerror-shape-must-be-rank-0-but-is-rank-1-for-cond-11-switch-op-switch
  8. How to get PI in tensorflow?. (2018). Stack Overflow. Retrieved 8 May 2018, from https://stackoverflow.com/questions/45995471/how-to-get-pi-in-tensorflow
  9. How to calculate the number of parameters of convolutional neural networks?. (2018). Stack Overflow. Retrieved 8 May 2018, from https://stackoverflow.com/questions/28232235/how-to-calculate-the-number-of-parameters-of-convolutional-neural-networks
  10. Loshchilov, I., & Hutter, F. (2017). Fixing Weight Decay Regularization in Adam. arXiv preprint arXiv:1711.05101.
  11. Examples: Basics — imgaug 0.2.5 documentation. (2018). Imgaug.readthedocs.io. Retrieved 8 May 2018, from http://imgaug.readthedocs.io/en/latest/source/examples_basics.html
  12. [ICLR 2016] Fast and Accurate Deep Networks Learning By Exponential Linear Units (ELUs) with…. (2018). Towards Data Science. Retrieved 8 May 2018, from https://towardsdatascience.com/iclr-2016-fast-and-accurate-deep-networks-learning-by-exponential-linear-units-elus-with-c0cdbb71bb02
  13. Loshchilov, I., & Hutter, F. (2017). Fixing Weight Decay Regularization in Adam. Arxiv.org. Retrieved 8 May 2018, from https://arxiv.org/pdf/1711.05101.pdf
