Understanding RMSprop — faster neural network learning

Vitaly Bushaev
Towards Data Science
Sep 2, 2018


Disclaimer: I presume basic knowledge of neural network optimization algorithms. In particular, familiarity with SGD and SGD with momentum will be very helpful for understanding this post.

I. Introduction

RMSprop is an unpublished optimization algorithm designed for neural networks, first proposed by Geoff Hinton in lecture 6 of the online course “Neural Networks for Machine Learning” [1]. RMSprop lies in the realm of adaptive learning rate methods, which have been growing in popularity in recent years but have also drawn some criticism [6]. It’s famous for not being published, yet being very well known; most deep learning frameworks include an implementation of it out of the box.

There are two ways to introduce RMSprop. The first is to look at it as an adaptation of the rprop algorithm for mini-batch learning, which was the initial motivation for developing it. The other is to look at its similarities with Adagrad [2] and view RMSprop as a way to deal with Adagrad’s radically diminishing learning rates. I’ll try to cover both angles so that it’s clearer why the algorithm works.

II. RPROP

Let’s start by understanding rprop, an algorithm used for full-batch optimization. Rprop [3] tries to resolve the problem that gradients may vary widely in magnitude: some gradients may be tiny and others may be huge, which makes it very difficult to find a single global learning rate for the algorithm. If we use full-batch learning, we can cope with this problem by only using the sign of the gradient. With that, we can guarantee that all weight updates are of the same size. This adjustment helps a great deal with saddle points and plateaus, as we take big enough steps even with tiny gradients. Note that we can’t achieve this just by increasing the learning rate, because the steps we take with large gradients would become even bigger, which would result in divergence. Rprop combines the idea of only using the sign of the gradient with the idea of adapting the step size individually for each weight. So, instead of looking at the magnitude of the gradient, we look at the step size defined for that particular weight. That step size adapts individually over time, so that we accelerate learning in the direction we need. To adjust the step size for some weight, the following algorithm is used:

  1. First, we look at the signs of the last two gradients for the weight.
  2. If they have the same sign, we’re going in the right direction and should accelerate by a small fraction, meaning we increase the step size multiplicatively (e.g. by a factor of 1.2). If the signs differ, it means we took too large a step and jumped over a local minimum, so we decrease the step size multiplicatively (e.g. by a factor of 0.5).
  3. Then we limit the step size between two bounds. These bounds depend on your application and dataset, but 50 and a millionth are good defaults to start with.
  4. Finally, we apply the weight update.

Note that there are different versions of the rprop algorithm defined by the authors; I’ve presented the simplest one to get you familiar with it. While I usually like to define optimization algorithms formally with equations, this one is better expressed through code, so the simplest version of the rprop update rule can look as follows:
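Here is a minimal NumPy sketch of that rule (the function and variable names are my own, and the growth/shrink factors and step-size bounds follow the defaults mentioned above):

```python
import numpy as np

def rprop_update(w, dw, prev_dw, step, eta_plus=1.2, eta_minus=0.5,
                 step_min=1e-6, step_max=50.0):
    """One rprop update for a weight array w.

    dw: gradient on the current full batch, prev_dw: gradient from the
    previous step, step: per-weight step sizes, adapted in place."""
    same_sign = dw * prev_dw > 0
    diff_sign = dw * prev_dw < 0
    # Grow the step where the gradient kept its sign, shrink it where the sign flipped.
    step[same_sign] *= eta_plus
    step[diff_sign] *= eta_minus
    np.clip(step, step_min, step_max, out=step)
    # Move each weight by its own step size, using only the sign of the gradient.
    w -= np.sign(dw) * step
    return w, step
```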

III. Rprop to RMSprop

Rprop doesn’t really work when we have very large datasets and need to perform mini-batch weight updates. Why doesn’t it work with mini-batches? Well, people have tried it, but found it hard to make it work. The reason is that it violates the central idea behind stochastic gradient descent: when the learning rate is small enough, the updates average the gradients over successive mini-batches. Consider a weight that gets a gradient of 0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch. What we’d like is for those gradients to roughly cancel each other out, so that the weight stays approximately the same. But that’s not what happens with rprop. With rprop, we increment the weight nine times and decrement it only once, so the weight grows much larger.
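To put numbers on that example: averaged over the ten mini-batches the gradient is (9 × 0.1 − 0.9) / 10 = 0, so with a small learning rate SGD leaves the weight roughly where it started. Rprop, however, only looks at the sign, so the weight takes nine steps up and one step down of comparable size, drifting upward by roughly eight step sizes.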

To combine the robustness of rprop (using just the sign of the gradient), the efficiency we get from mini-batches, and the averaging over mini-batches that combines gradients in the right way, we have to look at rprop from a different perspective. Rprop is equivalent to using the gradient but also dividing by the magnitude of the gradient, so we get the same update size no matter how big or small that particular gradient is. The problem with mini-batches is that we divide by a different gradient every time, so why not force the number we divide by to be similar for adjacent mini-batches? The central idea of RMSprop is to keep a moving average of the squared gradients for each weight, and then divide the gradient by the square root of that mean square, which is why it’s called RMSprop (root mean square). Written as equations, the update rule looks like this:
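In the notation defined below, the rule can be written as (in practice a small constant is usually added under the square root for numerical stability):

E[g²](t) = beta · E[g²](t−1) + (1 − beta) · (dC/dw)²
w(t) = w(t−1) − (n / sqrt(E[g²](t))) · dC/dw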

E[g²] is the moving average of squared gradients, dC/dw is the gradient of the cost function with respect to the weight, n is the learning rate, and beta is the moving average parameter (a good default value is 0.9).

As you can see from the equation above, we adapt the learning rate by dividing by the root of the squared gradient, but since we only have an estimate of the gradient on the current mini-batch, we need to use its moving average instead. A default value of 0.9 for the moving average parameter works well for most applications. In code the algorithm might look like this:
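A minimal NumPy sketch of that update (the names and the default learning rate are illustrative, and a small eps is added under the root for numerical stability, as most implementations do):

```python
import numpy as np

def rmsprop_update(w, dw, avg_sq_grad, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop step. avg_sq_grad holds the moving average of squared
    gradients for each weight and is updated in place across mini-batches."""
    # Update the moving average of the squared gradient.
    avg_sq_grad[:] = beta * avg_sq_grad + (1 - beta) * dw ** 2
    # Divide the gradient by the root mean square before taking the step.
    w -= lr * dw / (np.sqrt(avg_sq_grad) + eps)
    return w, avg_sq_grad
```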

IV. Similarity with Adagrad

Adagrad [2] is an adaptive learning rate algorithm that looks a lot like RMSprop. Adagrad adds element-wise scaling of the gradient based on the historical sum of squares in each dimension: we keep a running sum of squared gradients and then adapt the learning rate by dividing by that sum. In code we can express it like this:
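A sketch along the same lines as the RMSprop snippet above (again with an illustrative eps term):

```python
import numpy as np

def adagrad_update(w, dw, sum_sq_grad, lr=0.01, eps=1e-8):
    """One Adagrad step. sum_sq_grad accumulates squared gradients over the
    whole course of training and never decays."""
    # Accumulate the squared gradient for each weight.
    sum_sq_grad[:] = sum_sq_grad + dw ** 2
    # Scale the step down by the root of the accumulated sum.
    w -= lr * dw / (np.sqrt(sum_sq_grad) + eps)
    return w, sum_sq_grad
```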

What does this scaling do when we have a high condition number? If we have two coordinates, one with consistently large gradients and one with small gradients, we end up dividing by a correspondingly large or small number: we accelerate movement along the dimension with small gradients, while in the direction where gradients are large we slow down, since we divide by a large number.

What happens over the course of training? The steps get smaller and smaller, because the accumulated squared gradients keep growing, so we divide by a larger number every time. In convex optimization this makes a lot of sense, because as we approach the minimum we want to slow down. In the non-convex case it’s bad, as we can get stuck at a saddle point. We can look at RMSprop as an algorithm that addresses that concern a little bit.

With RMSprop we still keep that estimate of squared gradients, but instead of letting that estimate continually accumulate over training, we keep a moving average of it.

V. Results

I found these awesome visualizations for different optimization algorithms that show how they behave in different situations.

Source: https://imgur.com/a/Hqolp#NKsFHJb

As you can see, in the saddle point case RMSprop (black line) goes straight down; it doesn’t really matter how small the gradients are, because RMSprop scales the learning rate so the algorithm gets through the saddle point faster than most.

Source: https://imgur.com/a/Hqolp#NKsFHJb

In this case, the algorithms start at a point with very large initial gradients. RMSprop (black line) follows nearly the optimal path, while the momentum methods overshoot a lot. Adagrad briefly becomes unstable.

VI. Conclusion (Learn more)

RMSprop is a good, fast, and very popular optimizer. Andrej Karpathy’s “A Peek at Trends in Machine Learning” [4] shows that it’s one of the most popular optimization algorithms used in deep learning; its popularity is surpassed only by Adam [5]. If you want to learn more about optimization in deep learning, you should check out some of the following sources:

  1. Sebastian Ruder’s blog has several articles concerning trends in deep learning optimization.
  2. fast.ai is a great course about deep learning; it goes through the most popular optimization algorithms, explaining them in Excel sheets.
  3. The second course of Andrew Ng’s Deep Learning Specialization on Coursera also covers the popular algorithms, explaining them in a very clear way.

VII. References

[1] Geoffrey Hinton. Neural Networks for Machine Learning, online course. https://www.coursera.org/learn/neural-networks/home/welcome

[2] Adagrad: Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159. Retrieved from http://jmlr.org/papers/v12/duchi11a.html

[3] Christian Igel and Michael Hüsken (2000). Improving the Rprop Learning Algorithm. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.1332

[4] Andrej Karpathy (2017). A Peek at Trends in Machine Learning. https://medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106

[5] Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13

[6] Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, Benjamin Recht (2017) The Marginal Value of Adaptive Gradient Methods in Machine Learning. arXiv:1705.08292v2
