
Introduction and Implementation of Adagradient & RMSprop

Adaptive Learning Rate on Different Dimensions

In the last post, we introduced stochastic gradient descent and the momentum term, where SGD adds some randomness into traditional gradient descent and momentum helps to accelerate the process. However, both of these methods use a fixed learning rate:
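In one common formulation, the momentum update with a single fixed learning rate lr shared by every dimension looks roughly like:

```latex
\begin{aligned}
v_t      &= \gamma\, v_{t-1} + \nabla_\theta L(\theta_{t-1}) \\
\theta_t &= \theta_{t-1} - lr \cdot v_t
\end{aligned}
```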

The above shows gradient descent with a momentum term, where the learning rate lr is fixed for all parameters across dimensions. This works fine for parameters associated with frequently appearing features, which get updated often through the iterations, but for parameters associated with infrequent features it results in too few updates to reach optimality.

Parameters associated with infrequent features only receive meaningful updates whenever these features occur. Given a decreasing learning rate, we might end up in a situation where the parameters for common features converge rather quickly to their optimal values, whereas for infrequent features we still have not observed them often enough to determine their optimal values. In other words, the learning rate either decreases too slowly for frequent features or too quickly for infrequent ones.

Adagradient

Adagradient was developed to resolve the problem described above: it updates different parameters at different paces, determined by the gradients and frequencies of each parameter. Let's have a look at the formula:
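In a common notation, with g_t the gradient at step t and the square taken element-wise, the Adagrad update reads:

```latex
\begin{aligned}
s_t      &= s_{t-1} + g_t^2 \\
\theta_t &= \theta_{t-1} - \frac{lr}{\sqrt{s_t + \epsilon}} \cdot g_t
\end{aligned}
```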

As you might have noticed, the key difference is the added s_t term. For infrequent features, or features that come with small gradients, s_t stays small, so lr/sqrt(s_t + ϵ) stays large, which results in larger update steps. This gives different parameters different update speeds and resolves the disadvantages brought by a single, uniform learning rate.
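To make this concrete, a quick check with an arbitrary lr = 0.1 and ϵ = 1e-8 (illustrative values only) shows how the effective step size shrinks as s_t grows:

```python
import numpy as np

lr, eps = 0.1, 1e-8  # illustrative values, not taken from any specific setup
for s in [0.01, 1.0, 100.0]:  # small vs. large accumulated squared gradients
    print(f"s = {s:>6}: effective lr = {lr / np.sqrt(s + eps):.4f}")
# as s grows, lr / sqrt(s + eps) shrinks, so frequently updated parameters take smaller steps
```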

Now let's implement it and apply it to an actual problem.

The problem that we are going to work on is the same as the one we stated before, where we try to find the optimal values of a, b that minimise the loss between y and f(x), using the gradients of the loss with respect to a and b.
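For illustration, suppose f(x) is a simple linear model fitted with mean squared error; then the loss and gradients would be:

```latex
\begin{aligned}
f(x) &= a\,x + b \\
L(a, b) &= \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2 \\
\frac{\partial L}{\partial a} &= \frac{2}{N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right) x_i \\
\frac{\partial L}{\partial b} &= \frac{2}{N} \sum_{i=1}^{N} \left( f(x_i) - y_i \right)
\end{aligned}
```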

With the gradients of a, b above, the implementation of adagradient would be:
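A minimal sketch for this illustrative setup (the function name, learning rate, ϵ, and epoch count here are arbitrary choices):

```python
import numpy as np

def adagrad(X, Y, lr=0.1, epochs=100, epsilon=1e-8):
    """Fit f(x) = a*x + b with Adagrad; hyperparameter defaults are assumptions."""
    a, b = 0.0, 0.0          # parameters to learn
    s_a, s_b = 0.0, 0.0      # accumulated squared gradients, initialised to 0
    a_hist, b_hist, lr_a_hist, lr_b_hist = [], [], [], []  # kept for plotting

    for _ in range(epochs):
        err = a * X + b - Y              # f(x) - y
        grad_a = 2 * np.mean(err * X)    # dL/da for mean squared error
        grad_b = 2 * np.mean(err)        # dL/db

        s_a += grad_a ** 2               # accumulate squared gradients
        s_b += grad_b ** 2
        a -= lr / np.sqrt(s_a + epsilon) * grad_a
        b -= lr / np.sqrt(s_b + epsilon) * grad_b

        a_hist.append(a)
        b_hist.append(b)
        lr_a_hist.append(lr / np.sqrt(s_a + epsilon))  # effective learning rates
        lr_b_hist.append(lr / np.sqrt(s_b + epsilon))

    return a, b, a_hist, b_hist, lr_a_hist, lr_b_hist
```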

where the input values of X, Y are generated by:
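For example, X, Y could be generated with assumed true values a = 2, b = 5 and a little noise:

```python
import numpy as np

np.random.seed(0)                              # for reproducibility (assumption)
X = np.random.rand(100)                        # 100 random inputs in [0, 1)
Y = 2 * X + 5 + np.random.randn(100) * 0.1     # y = a*x + b with small noise; a=2, b=5 assumed
```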

The implementation should be straightforward: we initialise s_a, s_b to 0, and in order to plot the learning process, the intermediate values are kept in lists.

The parameter update process would look like:

Parameters Update

And the learning rates would look like:

Learning Rate Update

Here the lr/sqrt(s + ϵ) is considered the modified learning rate.

RMSprop

One problem with adagrad is that the term s can grow without bound as gradients accumulate over time, which results in a drastic decrease in the step size and causes the parameters to barely move in the end.

RMSprop was proposed to solve this issue by turning the running sum into an exponentially weighted moving average of the squared gradients:
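In the commonly cited form, with γ the decay factor:

```latex
\begin{aligned}
s_t      &= \gamma\, s_{t-1} + (1 - \gamma)\, g_t^2 \\
\theta_t &= \theta_{t-1} - \frac{lr}{\sqrt{s_t + \epsilon}} \cdot g_t
\end{aligned}
```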

Notice that the past accumulation s_{t-1} is now discounted by an extra parameter γ, which normally takes the value 0.9.

The implementation is similar to adagrad except for the update of s, as sketched below.
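A minimal sketch, reusing the same illustrative problem and assumed hyperparameters as the Adagrad code above:

```python
import numpy as np

def rmsprop(X, Y, lr=0.1, gamma=0.9, epochs=100, epsilon=1e-8):
    """Same loop as the Adagrad sketch above, with a decayed accumulation of squared gradients."""
    a, b = 0.0, 0.0
    s_a, s_b = 0.0, 0.0
    for _ in range(epochs):
        err = a * X + b - Y
        grad_a = 2 * np.mean(err * X)
        grad_b = 2 * np.mean(err)

        # the only change from Adagrad: an exponentially weighted moving average instead of a running sum
        s_a = gamma * s_a + (1 - gamma) * grad_a ** 2
        s_b = gamma * s_b + (1 - gamma) * grad_b ** 2

        a -= lr / np.sqrt(s_a + epsilon) * grad_a
        b -= lr / np.sqrt(s_b + epsilon) * grad_b
    return a, b
```

With this change in place, the parameter update process would be like: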

Parameter Update

And the learning rates look like:

Learning Rate Update

Here we see that the learning rates for a, b oscillate instead of decreasing monotonically (you can check the full implementation here).

So far, we have learnt several techniques applied in optimisation processes. Next up, let's combine them and derive another powerful tool: Adam.


