Machine Learning Research

In this article, we will take a look at a recent paper by Atılım Güneş Baydin, Barak A. Pearlmutter, Don Syme, Frank Wood, and Philip Torr, published on February 17, 2022 – Gradients without Backpropagation. You can find it here.
This paper is still under review, so it might change a bit later on. Still, the core idea, which is very neat, should remain intact. Let’s read the abstract from the paper:
Using backpropagation to compute gradients of objective functions for optimization has remained a mainstay of machine learning. Backpropagation, or reverse-mode differentiation, is a special case within the general family of automatic differentiation algorithms that also includes the forward mode. We present a method to compute gradients based solely on the directional derivative that one can compute exactly and efficiently via the forward mode. We call this formulation the forward gradient, an unbiased estimate of the gradient that can be evaluated in a single forward run of the function, entirely eliminating the need for backpropagation in gradient descent. We demonstrate forward gradient descent in a range of problems, showing substantial savings in computation and enabling training up to twice as fast in some cases.
Sounds good so far, right? Now, I will give you some intuition on how and why this approach actually works. I will also explain some of the important mathematical concepts along the way, for the folks who feel a bit insecure about them. Let’s dig right in.
Prerequisites
Before we start our journey, let us recap a few things that are important to know to put the whole topic into perspective. Let us assume that we want to optimize some objective function f, just imagine minimizing the loss function in the case of a neural network.
f has a lot of parameters θ, potentially billions, so this task is quite hard. Still, smart people managed to invent many algorithms to solve this problem, among them first-order methods like Gradient Descent, but also higher-order optimization algorithms such as Newton’s method.
However, since the authors focused on the former, we will do the same.
The Need for Gradients
Chances are that you know the usual way of doing gradient descent. By this, I mean algorithms such as Stochastic Gradient Descent (SGD), RMSProp, Adam, Adamax, etc. that are readily available in software libraries like TensorFlow, PyTorch, or scikit-learn.
All of these different optimization algorithms rely on one common ingredient: the gradient of the objective function f, denoted as ∇f(θ). The difference between the aforementioned algorithms is how they use the gradient. For example, the simplest algorithm, SGD, starts with a random θ and then updates it with the simple update rule θ ← θ − α·∇f(θ) until convergence, i.e. until θ or f(θ) do not change much anymore. Here, α is the learning rate.
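To see the update rule in action, here is a minimal sketch of plain gradient descent in NumPy. It assumes we already have a function grad_f that returns ∇f(θ); the quadratic example function is just my own choice for illustration.
import numpy as np

def gradient_descent(grad_f, theta, alpha=0.1, n_steps=100):
    # repeat the update rule θ ← θ − α·∇f(θ) for a fixed number of steps
    for _ in range(n_steps):
        theta = theta - alpha * grad_f(theta)
    return theta

# example: f(θ₁, θ₂) = θ₁² + θ₂² has the gradient ∇f(θ) = 2θ
gradient_descent(lambda theta: 2 * theta, np.array([3.0, -1.0]))
# Output: something very close to the minimum at (0, 0)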
RMSProp, Adam, and co. take the gradient as well but do more elaborate computations, which very often result in faster convergence. Consider the following animation, which is an oldie but goldie:

Backpropagation
Let us stick with neural networks now. The usual way of computing the gradient ∇f(θ) there is a technique called backpropagation, which is essentially the chain rule from calculus. Don’t worry, we won’t dwell on it for too long, and I will spare you lengthy calculations. Just look at the following simple example function: f(θ₁, θ₂, θ₃) = (θ₁ + θ₂) · θ₃.
Computing the gradient using backpropagation works like this:
- Forward pass: Plug θ into f, receive f(θ).
- Backward pass: Starting from f(θ), compute the gradient going backward.
Let us plug in the numbers θ = (3, -1, 2) for the forward pass and then see what happens in our toy network:


The forward pass computes an output of 4 from left to right, very easy. The backward pass then computes all the gradients from right to left. We are only interested in the last gradients though, in our case ∇f(θ) = (2, 2, 2). The result is correct, since direct calculation gives us ∇f(θ) = (θ₃, θ₃, θ₁ + θ₂).
Note: Computing the gradient directly is a luxury that we do not have when dealing with billions of parameters in a convoluted formula.
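If you prefer code over pictures, here is a tiny sketch of the two passes for our toy function, written out by hand (the intermediate variable a is my own naming):
# forward pass for f(θ₁, θ₂, θ₃) = (θ₁ + θ₂) · θ₃ at θ = (3, -1, 2)
theta1, theta2, theta3 = 3, -1, 2
a = theta1 + theta2          # a = 2
f_value = a * theta3         # f(θ) = 4

# backward pass: chain rule, from right to left
df_da = theta3               # ∂f/∂a = θ₃ = 2
df_dtheta3 = a               # ∂f/∂θ₃ = a = 2
df_dtheta1 = df_da * 1       # ∂a/∂θ₁ = 1, so ∂f/∂θ₁ = 2
df_dtheta2 = df_da * 1       # ∂a/∂θ₂ = 1, so ∂f/∂θ₂ = 2

(df_dtheta1, df_dtheta2, df_dtheta3)
# Output:
# (2, 2, 2)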
So far so good, but this means that we have to go through the network twice. Can we compute the function output as well as the gradient in a single forward pass?
Well, the authors of the paper propose a simple way of doing just that! But we still need one ingredient to understand their method.
Directional Derivatives
This might sound scary if you have never dealt with them before, but let me explain it in a simple way. From school, you know how to take derivatives of functions in one variable, right? The definition for when θ is a scalar is as follows:
f’(θ) = lim_{h→0} [f(θ + h) − f(θ)] / h
This value f’(θ) is then just the slope of the tangent to the graph at the point θ, i.e. a single real number.
Now, when dealing with neural networks, our function f does not have a single input, but potentially billions, remember? Thus, we have to find a more general definition for the derivative in this case. Of course, mathematicians got us covered already. The directional derivative of f with respect to a vector v at a point θ is defined as
∇ᵥf(θ) = lim_{h→0} [f(θ + hv) − f(θ)] / h
Why do we need a v? Well, in higher dimensions, we have a lot of options to define tangents of graphs. Take the paraboloid as an example: at the black dot on the surface of this shape, we can define as many tangents as we like.

If we choose some v = (v₁, v₂), the tangent will point in the same direction, just with an additional z-coordinate, i.e. (v₁, v₂, s). This slope s happens to be ∇ᵥf(θ), by the way.
Anyway, if you’re not into geometry, just remember that ∇ᵥf(θ) is a scalar as well, just like f’(θ). For each v, it is a different number, indicating the slope of f in the direction of v. One can now prove the following:
∇ᵥf(θ) = ∇f(θ) · v
Here, ∇f(θ) is the usual gradient of f. Easy to remember, right? From left to right, the small subscript v just jumps to the side. A quick sanity check: the left side is a scalar, and so is the right side, because it is the dot product of two vectors.
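A quick numerical sanity check of this identity, using a finite-difference approximation of the directional derivative (the test function, the point, and the direction are an arbitrary choice of mine):
import numpy as np

f = lambda theta: theta[0]**2 + 3 * theta[1]          # a small test function
grad_f = lambda theta: np.array([2 * theta[0], 3.0])  # its gradient, computed by hand

theta = np.array([2.0, 1.0])
v = np.array([0.5, -1.0])
h = 0.00001

(f(theta + h * v) - f(theta)) / h  # ∇ᵥf(θ), approximately -1.0
grad_f(theta) @ v                  # ∇f(θ) · v, exactly -1.0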
Great, now we have everything to understand the paper!
The Forward Gradient
So, given a function f, the authors define another function gᵥ that they call a forward gradient.
Note: The authors omit the subscript v in their paper, but I find it useful to put it there. It reminds us that the forward gradient depends on some v.
gᵥ(θ) = ∇ᵥf(θ) v = (∇f(θ) · v) v
We can see that gᵥ takes a vector θ and outputs another vector again, some scaled version of v (remember: ∇ᵥf(θ) is a scalar). We also need θ and v to have the same dimension.
But this looks more complicated than before; it even introduces another vector v! How do we choose v to make gᵥ useful? The answer is simple: independently sample each component of v from a standard Gaussian N(0, 1). Now, you might wonder why this is useful, because it just creates a lot of random vectors. The authors showed the following using some nice-to-follow math:
Eᵥ[gᵥ(θ)] = ∇f(θ)
This is fancy language for the following:
If you sample a lot of v, compute the corresponding forward gradients, and then average these forward gradients, you end up with a pretty good approximation of the true gradient.
Let us verify this statement with some simple code. We define the function f(θ₁, θ₂) = θ₁² + θ₂², for which we want to compute the gradient at the point θ = (2, 1).
import numpy as np
f = lambda theta: theta[:, 0]**2 + theta[:, 1]**2
theta = np.array([[2, 1]])
Note: I made the function take several inputs at once, i.e. batches of inputs.
So far, so good. Before we continue, let us compute the gradient ourselves. We have ∇f(θ₁, θ₂) = (2θ₁, 2θ₂), so ∇f(2, 1) = (4, 2).
We can also verify this by computing directional derivatives using the canonical basis vectors, i.e. vectors that are all zeroes except for a single 1.
h = 0.00001
v = np.array([[1, 0], [0, 1]])
(f(theta + h*v) - f(theta)) / h # approximate derivative
# Output:
# array([4.00001, 2.00001])
Now, let us sample a lot of random vectors v and compute the corresponding forward gradients gᵥ:
np.random.seed(0) # set a seed to keep it reproducible
v = np.random.randn(1_000_000, 2) # a million vectors of size 2
grad_v_f = ((f(theta + h*v) - f(theta)) / h) # ∇ᵥf(θ)
g = grad_v_f.reshape(-1, 1) * v # scale each v with ∇ᵥf(θ) to compute gᵥ
Important note: The way I calculate ∇ᵥf(θ) here is not that good. It is just an approximation to the true directional derivative and even numerically unstable. As the authors say, there is a way to compute ∇ᵥf(θ) together with f(θ) in a single forward pass using forward-mode automatic differentiation, but unfortunately, they don’t provide a source for this claim.
Hussein Abdulrahman pointed out that PyTorch uses dual numbers, for example. Thanks!
I will not go into too much detail, but dual numbers are similar to complex numbers a + bi: instead of an imaginary number i with the property i² = −1, we deal with numbers a + bɛ, where ɛ has the property that ɛ² = 0 but ɛ ≠ 0. For these numbers, we have f(a + bɛ) = f(a) + f’(a)bɛ for well-behaved functions f with a single input. This naturally extends to functions with multiple inputs, and in that case we receive (after renaming the variables) f(θ + ɛv) = f(θ) + ɛ (∇f(θ) · v), which is exactly what we need.
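To make this a bit more tangible, here is a minimal toy implementation of dual numbers (not the authors’ code and not how PyTorch does it internally, just my own sketch) that computes f(θ) and ∇ᵥf(θ) for our example f(θ₁, θ₂) = θ₁² + θ₂² in a single forward pass:
class Dual:
    # a dual number a + bɛ with ɛ² = 0, supporting just + and *
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b  # real part and ɛ-coefficient
    def __add__(self, other):
        return Dual(self.a + other.a, self.b + other.b)
    def __mul__(self, other):
        # (a + bɛ)(c + dɛ) = ac + (ad + bc)ɛ, since ɛ² = 0
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

# evaluate f at θ = (2, 1) in the direction v = (1, 0)
x1 = Dual(2.0, 1.0)  # θ₁ + ɛv₁
x2 = Dual(1.0, 0.0)  # θ₂ + ɛv₂
y = x1 * x1 + x2 * x2
(y.a, y.b)
# Output:
# (5.0, 4.0), i.e. f(θ) and ∇ᵥf(θ) in one pass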
A lot of forward gradients are now stored in g, and according to the theorem, their average should be close to ∇f(2, 1) = (4, 2). Let’s see.
g.mean(axis=0)
# Output:
# array([3.99433963, 2.00797197])
Looks good! But we didn’t expect anything less from mathematics. 😎 We can even visualize the forward gradients gᵥ:

We can see that there is quite a spread around the true value (4, 2), but still, the algorithm seems to work nicely, as demonstrated by the experiments the authors conducted. But before we come to those, let me copy the complete algorithm from the paper, just because it is nice and short.

This is basically vanilla gradient descent, just using another approach to compute the gradient. And, as the authors mention, this is just the beginning: it is straightforward to create forward gradient versions of Adam, RMSProp, and any other algorithm that uses gradients. So, stay tuned for the Fadam algorithm at some point.
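To illustrate, here is a hedged sketch of such a forward gradient descent loop for our toy function. I again use the crude finite-difference approximation of ∇ᵥf(θ) instead of proper forward-mode automatic differentiation, so treat it as an illustration of the idea rather than the authors’ implementation:
import numpy as np

def forward_gradient_descent(f, theta, alpha=0.1, n_steps=200, h=0.00001):
    for _ in range(n_steps):
        v = np.random.randn(*theta.shape)             # sample v ~ N(0, I)
        grad_v_f = (f(theta + h * v) - f(theta)) / h  # ∇ᵥf(θ), approximated
        g = grad_v_f * v                              # forward gradient gᵥ(θ)
        theta = theta - alpha * g                     # usual descent step
    return theta

f = lambda theta: theta[0]**2 + theta[1]**2  # non-batched version of our example
forward_gradient_descent(f, np.array([2.0, 1.0]))
# Output: a point close to the minimum at (0, 0)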
The only interesting thing that the scientific community still has to figure out is whether the noise introduced by the forward gradients breaks the more advanced algorithms such as Adam. For SGD, it works quite well, so let’s check out the experiments.
Experiments
Disclaimer: I copy all images from the paper now.




This definitely looks promising. Especially for the more complicated neural networks, forward gradients yield similar performance in less time than their backpropagation counterparts. Just be careful, because there is always the possibility that the authors did not show us a case where their algorithm does not work well.
Conclusion
In this article, I reviewed the recent paper Gradients without Backpropagation with you. The authors show how to do gradient descent using only a single forward pass instead of backpropagating through the network. This saves time and does not increase memory consumption.
The authors implemented their approach in PyTorch; however, I am still waiting for them to publish it. All in all, I think it is a nicely written paper with a beautiful and neat idea that is easy to understand and even implement.
I hope that you learned something new, interesting, and useful today. Thanks for reading!
As the last point, if you
- want to support me in writing more about Machine Learning and
- plan to get a Medium subscription anyway,
why not do it via this link? This would help me a lot! 😊
To be transparent, the price for you does not change, but about half of the subscription fees go directly to me.
Thanks a lot, if you consider supporting me!
If you have any questions, write me on LinkedIn!