The algorithm discussed here will routinely let you reach cost function values about 10¹⁰ times lower than those obtained with the Adam optimization algorithm.
Keep on reading to see why it’s like "driving an old car"!
Introduction
There are many computations in science and technology in which one has to minimize what is known as a cost function (CF). People use CFs because they're easy to construct, and because their minimum corresponds to what they seek to know. They are used throughout machine learning (ML), physics, and many other fields. As it turns out, the Gradient Descent (GD) algorithm is especially convenient for finding minima of a CF. Recall that GD operates on a function f(θ) by updating its argument from θ_old to θ_new via
θ_new = θ_old − α ∇f(θ_old)
where α is the learning rate, and the gradient is ∇f = ∂f/∂θ. This update is repeated until a stopping condition is met. As the formula shows, one only needs to compute the gradient and set α to a suitably small number.
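To make this concrete, here's a minimal sketch of plain GD in Python/NumPy; the quadratic CF, the starting point, and the value of α are just placeholders for illustration:
import numpy as np

def gradient_descent(grad_f, theta, alpha=0.01, num=100):
    # Repeat the update theta_new = theta_old - alpha * grad_f(theta_old)
    for _ in range(num):
        theta = theta - alpha * grad_f(theta)
    return theta

# Placeholder example: minimize f(theta) = theta_1² + theta_2²
theta_min = gradient_descent(lambda th: 2 * th, np.array([3.0, -2.0]))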
However, finding the "best" value for α turns out to be an issue. Often, people will do several trial optimization runs using different α values to see which seems to lead to the best behavior, and then just select that value. Yet even the mere idea of using a single value of α for an entire optimization run is flawed. I argue that it's comparable to trying to drive between two cities at a single speed. For example, would the best speed to drive between Chicago and St. Louis be 10 mph, 40 mph, or 70 mph? Certainly, for different parts of the trip, different speeds would be preferred. It's a similar situation with finding the best α for a whole optimization run.
Here’s the plan for this article: First, the diagnostic ρ is introduced, which will give insight into whether α is too large or small. Next, it’ll be demonstrated on an example, just to build intuition. After that, some pseudocode is given to show how you can compute ρ with practically no additional cost. Finally, it will be shown how to use ρ on everyone’s favorite test problem: classifying digits! The algorithm that uses it (NeogradM) outperforms Adam, reaching a lower CF value by many orders of magnitude.
The ρ Diagnostic
Let's use the figure below to explain ρ.

The blue curve is the CF, and the two parameter values (θ_old and θ_new) are marked with red dots. The corresponding CF values at those points are f_old and f_new. In addition, define f_est as the estimated CF value based on the GD update:
f_est = f_old + ∇f(θ_old)·dθ
where dθ = (θ_new − θ_old). Note that this is really just a linear equation, like the familiar "y = mx + b", except here y = f_est, m = ∇f, x = dθ, and b = f_old. The range of values leading up to f_est is shown as a green line in the figure, as dθ varies from 0 to (θ_new − θ_old).
However, what is really of interest to us is the orange line in that figure. It represents the gap, or "deviation", between the estimate f_est and the correct value f_new. In the figure, this deviation indicates whether α is too large or small. To make it really useful, though, it should be dimensionless. Here, that's achieved by dividing it by the size of the vertical drop from f_old to f_est. This leads to the ρ diagnostic measure:
ρ = |f_new − f_est| / |f_old − f_est|
where absolute values are used because we only care about the magnitude. Note this measure is invariant with respect to translating and scaling f (while keeping dθ constant), which is exactly what is needed. Thus, ρ serves as a sort of "universal measure" of the appropriateness of the size of α.
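As a quick illustrative example (the numbers are made up): if an update gives f_old = 4.0, f_est = 3.0, and f_new = 3.2, the linear model predicted a drop of 1.0 but missed the true value by 0.2, so ρ = |3.2 − 3.0| / |3.0 − 4.0| = 0.2.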
How to think about ρ
You’re probably wondering, what value should I use for ρ? Certainly, values that are very small will lead to inefficient updates, while those that are very large will lead to updates that are relatively erratic and uncontrolled.
However, before giving a direct answer on a "best value" for ρ, first consider this analogy. Imagine someone with an old car who's trying to drive as fast as possible. He won't even be looking at the speedometer; he only pays attention to when the car begins physically shaking at high speed. This driver reasons, "if the car is shaking only a little, I can go faster, but if it's shaking a lot, I should slow down". It's the same situation with GD, except now ρ is the shaking and α is the gas pedal. In both cases, the "shaking" acts as a feedback signal for the variable we wish to control, whether it be the gas pedal or the learning rate α.
To summarize, the control algorithm becomes: if ρ is too large, decrease α, and if ρ is too small, increase α. Of course, the ideal is to keep ρ at a fixed value. In the example in the next section, it will be shown how to do just that. This is called the "constant ρ ansatz", and an α that achieves that is called the ideal learning rate.
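As a crude illustration of that control loop, here is a sketch of the idea in Python; this is just multiplicative feedback, not the actual Neograd update rule, which is derived later in the article:
def adjust_alpha(alpha, rho, rho_target=0.1, factor=1.1):
    # Crude multiplicative feedback: shrink alpha when rho overshoots the
    # target ("too much shaking"), grow it when rho undershoots.
    if rho > rho_target:
        return alpha / factor
    if rho < rho_target:
        return alpha * factor
    return alpha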
An Example
A good example for demonstrating the usefulness of ρ is the 1-dimensional quartic CF. The variables we’ll need to compute ρ are:
f_old = θ⁴,  ∇f = 4θ³,  dθ = −4αθ³
f_est = θ⁴ − 16αθ⁶,  f_new = (θ − 4αθ³)⁴
A straightforward calculation reveals
ρ = 6q − 16q² + 16q³
where q = αθ². The most important feature to take note of is that ρ now depends on θ. This means that a single α will have a different effect depending on the nearness to the minimum at θ=0. Even more interesting is that this equation can be easily inverted in favor of α. Assuming small q,
α ≈ ρ / (6θ²)
Although simple, this result is actually quite remarkable. We now have a formula by which to set α given a desired value of ρ. With each iteration of the GD algorithm, as θ changes, this formula guarantees that a target value of ρ is maintained. This is called an "Ideal GD" algorithm, and is illustrated in the figure below; it shows log f and log ρ vs. iterations for Adam and the Ideal GD.

In the figure, the target value was ρ = 0.1. So, at each iteration of Ideal GD, α was set to 0.1/(6θ²), according to the above formula. (The initial θ was −3 for both algorithms.) This graph shows that the CF value reached with Ideal GD is smaller than that from Adam by more than a factor of 10¹¹ after only 150 iterations. Obviously, the longer the run, the larger this factor becomes. In addition, the figure shows that with Adam, the stage at which it begins to plateau coincides with where its ρ drops; here, that occurs near the 50th iteration. This feature was seen repeatedly with Adam. Since ρ is not controlled with Adam, it is free to evolve; in this case it becomes very small, leading to inefficient updates. Essentially, Adam has built-in regularization, since its ρ naturally becomes very small. It's akin to early stopping, and is the reason it tends to "plateau". In addition, running Adam was a challenge because a number of trial runs had to be done to choose a decent value for α. With the Ideal GD algorithm, the main issue is that it makes the CF so small that one has to worry about machine-precision errors.
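For reference, here is a minimal sketch of how such an Ideal GD run on the quartic could be set up; the starting point, iteration count, and target ρ follow the figure, and the collected CF values could then be plotted on a log scale:
rho_target = 0.1
theta = -3.0                              # same starting point as in the figure
cf_values = []
for i in range(150):
    alpha = rho_target / (6.0 * theta**2) # ideal learning rate for f(θ) = θ⁴
    grad = 4.0 * theta**3
    theta = theta - alpha * grad
    cf_values.append(theta**4)            # CF values, for plotting on a log scale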
Efficiently computing ρ
In the previous section, we gave an example of a simple cost function and determined an exact relation between ρ and α. Using that, it was possible to set α as a function of a target value for ρ. In a real application, the CF becomes extremely complicated, and computing such a relationship is out of the question. Instead, a measurement of ρ will be used to set α. As it stands, the formula for ρ makes it appear that two evaluations of the CF are needed per iteration. However, it's easy to measure ρ with only one evaluation per iteration. First, let's begin with some bare-bones pseudocode for GD:
INIT: θ, num
FOR: i = 1 to num+1
f = f(θ)
g = ∇f(θ)
dθ = -αg
θ = θ + dθ
RETURN: θ
It should be self-explanatory, given the discussion so far. Now, the most important thing about this version is the order of the following operations within the FOR-loop: (1) evaluate f, (2) evaluate g, (3) update θ. This "f-g-update" sequence is now modified so that there's an initial "evaluate f" before the loop, and inside the loop the order is "g-update-f". This rewriting allows access to values of f both before and after an update, and no extra computation is needed. With these changes, the pseudocode appears as
INIT: θ_old, num
f_old = f(θ_old)
FOR: i = 1 to num
g = ∇f(θ_old)
dθ = -αg
θ_new = θ_old + dθ
f_new = f(θ_new)
f_est = f_old + g.dθ
ρ = get_rho( f_old, f_new, f_est )
f_old = f_new
θ_old = θ_new
RETURN: θ_new
where get_rho is an implementation of our equation for ρ. Finally, if this rewriting doesn't appeal to you, there are other ways to go about it. For example, you could evaluate ρ inside a conditional such as "IF i > 1", to ensure that both f_old and f_new are available.
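For concreteness, here is a direct Python translation of that pseudocode; the CF and its gradient are passed in as functions, and only the bookkeeping around ρ matters here:
import numpy as np

def get_rho(f_old, f_new, f_est):
    # rho = |f_new - f_est| / |f_old - f_est|
    return abs(f_new - f_est) / abs(f_old - f_est)

def gd_with_rho(f, grad_f, theta_old, alpha, num):
    f_old = f(theta_old)                  # one CF evaluation before the loop
    rhos = []
    for i in range(num):
        g = grad_f(theta_old)
        dtheta = -alpha * g
        theta_new = theta_old + dtheta
        f_new = f(theta_new)
        f_est = f_old + np.dot(g, dtheta) # linear estimate of the CF
        rhos.append(get_rho(f_old, f_new, f_est))
        f_old, theta_old = f_new, theta_new
    return theta_new, rhos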
Approximating α
With that out of the way, the next step is understanding how to use the original ρ formula in order to set α. It turns out it can be used in an approximate manner, again with no significant overhead. With the definition of ρ in mind, define the quantities A and B via
|f_new − f_est| = A α²
|f_old − f_est| = B α
This is done to pull out the leading-order dependence on α and make it explicit. As a result, A and B are constants to leading order in α. However, note that in general A retains some α-dependence, while B has none. Using these equations and the definition of ρ, an expression for α easily follows:
α = (B/A) ρ
Note that after an update, this equation holds exactly. However, the value of ρ will probably be too large or too small. So, as a first approximation (cf. Neograd_v0 in Sec. 6.3 of [1]) we'll obtain a new α by instead substituting in the target value for ρ. This works well provided those two ρ values (the target and the current value) don't differ too much. Also, the reason it's an approximation is that A has some α-dependence in general. Finally, note that the trick is to treat α as the dependent variable and ρ as the independent variable, when in reality it's the other way around.
Let's consider an example to drive this home. Suppose after the 7th iteration that α = 0.1, B = 5, A = 10, and ρ = 0.2. Observe that these values exactly satisfy the above equation for α, as they must (i.e., 0.1 = (5/10)(0.2)). Now, we'd like that ρ value to be closer to the target value of 0.1, so we plug 0.1 into the equation, along with the same A and B, and compute a new value of α as 0.05. This new value will be used in the 8th iteration. Of course, this value of 0.05 probably won't be exactly ideal, since we ignored the α-dependence in A, but so long as the CF doesn't change too rapidly it works well. As a side note, since this technique always sets α using information from the previous iteration, the expression "a day late and a dollar short" comes to mind. Nevertheless, it still works well.
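In code, the corresponding bookkeeping is just a few lines; the numbers below reproduce the worked example (in practice A and B would be measured from the actual CF values, via A = |f_new − f_est| / α² and B = |f_old − f_est| / α):
# Reproduce the worked example above (illustrative values):
alpha, A, B = 0.1, 10.0, 5.0
rho = (A / B) * alpha              # = 0.2, the measured rho after this update
rho_target = 0.1
alpha_new = (B / A) * rho_target   # = 0.05, used in the next iteration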
The next level of approximation is, rather than substituting the ρ target value directly, to substitute a value of ρ that isn't too different from the current one. This is called Neograd_v1 in my paper. Finally, we can get really fancy and combine this algorithm with existing ones. The one I found to work best combines these ideas with momentum; it's called NeogradM.
Here is an example of the application of NeogradM to the problem of digit recognition (with data from Scikit-Learn). The CF was a cross-entropy penalty constructed between the output of a neural net and the training label. As can be seen in the figure

the values of the CF reached using NeogradM are about 10⁸ times smaller than with Adam. In addition, as shown in my paper [1] in Fig. 13, the value of ρ mainly stays near the target; for Adam the values of ρ are lower by a factor of about 100. Once again, the situation with Adam is that ρ becomes very small, and the CF profile has a plateau.
Final Comments
The purpose of this article has been to introduce you to the main concepts from the full paper on the Neograd family of algorithms. Perhaps the most important takeaway is that the ρ diagnostic is easy to implement and is an effective guide for setting the learning rate. If you go on to implement it in your own gradient descent programs, that is half the battle; you may well find that your program runs faster. At that point, perhaps you'll be willing to take the next step and implement NeogradM; sample code is on my GitHub.
In addition to what has been reviewed here, the full paper includes a number of other useful results, such as additional metrics and another derivation of GD.
Hope you enjoyed this article!
References
[1] M.F. Zimmer, Neograd: Gradient Descent with a Near-Ideal Learning Rate (2020), arXiv preprint arXiv:2010.07873. (Submitted for publication)