Convergence of Gradient Descent under Approximate Gradients

Exploring Gradient Descent with bounded gradient errors

Christian Howard
Towards Data Science


Optimization is a fascinating area with many applications, especially these days with Machine Learning (ML). As those involved with ML know, gradient descent variants have been some of the most common optimization techniques employed for training models of all kinds. For very large data sets, stochastic gradient descent has been especially useful, but at the cost of more iterations to reach convergence. Stochastic gradient descent pays this convergence penalty because it forms an inaccurate gradient estimate, which sends the model parameters along a non-optimal trajectory as they make their way toward a local minimum.

Meme from: https://www.reddit.com/r/ProgrammerHumor/comments/eez8al/gradient_descent/

Of course, stochastic gradient descent is not the only place where a gradient may be computed with non-trivial error. For instance, someone could approximate the gradient of some black-box quantity using a finite difference approximation, because finding a closed-form gradient expression is hard or impossible, and still arrive at a non-trivial amount of error in their gradient result. So a question we might ask ourselves is:

How is the convergence error impacted when gradient descent is run with approximate gradients that have bounded errors?

This is a cool problem to think about, with a few potential areas of usefulness! For anyone excited about this stuff, I will go over my analysis and proof of the convergence error of gradient descent in this situation, for a particularly nice objective function. Time to dive into the math!

Meme from: https://giphy.com/gifs/reaction-BmmfETghGOPrW

So first, let us take a look at the recursive gradient descent approach in question. Consider the steepest descent method with bounded gradient error:

x_{k+1} = x_k - s ( ∇f(x_k) + ε_k )

where s is a positive constant stepsize, ε_k is an error term satisfying ||ε_k|| ≤ δ for all k, and f is the positive definite quadratic form defined as

f(x) = (1/2) x^T Q x - b^T x

for a symmetric, positive definite matrix Q, so that ∇f(x) = Q x - b and the unique minimizer is x* = Q^{-1} b.
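To make the iteration concrete, here is a minimal NumPy sketch of a single update (the function name and the way the noise vector is drawn are illustrative assumptions of mine, not part of the original analysis):

import numpy as np

def noisy_descent_step(x, Q, b, s, delta, rng):
    # Exact gradient of f(x) = 0.5 * x^T Q x - b^T x
    grad = Q @ x - b
    # Draw a random error vector and rescale it so that ||eps|| <= delta
    eps = rng.standard_normal(x.shape)
    eps *= delta * rng.uniform() / np.linalg.norm(eps)
    # Steepest descent step taken with the corrupted gradient
    return x - s * (grad + eps)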

Let us then define a parameter q to be

q = max{ |1 - s λ_{min}(Q)|, |1 - s λ_{max}(Q)| }

where λ_{min}(Q) and λ_{max}(Q) denote the smallest and largest eigenvalues of Q,

and assume that q < 1. Using the above gradient descent approach, I will show that for all k, we can bound the distance between the kth iterate and the local optimum by the following:

||x_k - x*|| ≤ q^k ||x_0 - x*|| + s δ / (1 - q)

where x_0 is the starting point for the iterate sequence and x* is the local optimum. I will prove this result by first proving two useful lemmas and then using them to prove the main result.
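Before proving anything, here is a small NumPy sketch (my own illustration; the helper name is hypothetical) showing how q and the right-hand side of this bound can be computed for a given Q, s, and δ:

import numpy as np

def descent_error_bound(Q, s, delta, x0, x_star, k):
    # Eigenvalues of the symmetric matrix Q, in ascending order
    lam = np.linalg.eigvalsh(Q)
    q = max(abs(1 - s * lam[0]), abs(1 - s * lam[-1]))
    assert q < 1, "stepsize s is too large: the bound requires q < 1"
    # q^k * ||x_0 - x*|| + s * delta / (1 - q)
    return q**k * np.linalg.norm(x0 - x_star) + s * delta / (1 - q)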

Lemma 1

Given a value 0 ≤ c < 1 and a sequence of iterates satisfying

||x_{k+1} - x*|| ≤ c ||x_k - x*|| + s δ

for all k, we can bound ||x_k - x*|| by

||x_k - x*|| ≤ c^k ||x_0 - x*|| + s δ (1 - c^k) / (1 - c) ≤ c^k ||x_0 - x*|| + s δ / (1 - c)

Proof

Unroll the recursion down to the initial iterate and sum the resulting geometric series:

||x_k - x*|| ≤ c ||x_{k-1} - x*|| + s δ
≤ c^2 ||x_{k-2} - x*|| + (1 + c) s δ
≤ …
≤ c^k ||x_0 - x*|| + s δ (1 + c + … + c^{k-1})
= c^k ||x_0 - x*|| + s δ (1 - c^k) / (1 - c)

Since 0 ≤ c < 1 implies 0 < 1 - c^k ≤ 1, the last expression is at most c^k ||x_0 - x*|| + s δ / (1 - c). ∎

Lemma 2

For any symmetric, positive definite matrix A and positive scalar s, the following inequality holds:

|| I - sA ||_2 ≤ max{ |1 - s λ_{min}(A)|, |1 - s λ_{max}(A)| }

Proof

Recall that any symmetric, positive definite matrix can be decomposed as A = U^T Λ U, where U is an orthogonal matrix of eigenvectors and Λ is a diagonal matrix with positive eigenvalues. Recall as well that the matrix 2-norm of some matrix B satisfies

||B||_2 = max_{||y||_2 = 1} ||B y||_2

which, for a symmetric matrix B, equals the largest absolute eigenvalue of B.

With that, we can proceed with the proof using the following steps:

|| I - sA ||_2 = || U^T (I - sΛ) U ||_2 = || I - sΛ ||_2 = max_i |1 - s λ_i(A)| ≤ max{ |1 - s λ_{min}(A)|, |1 - s λ_{max}(A)| }

The second equality holds because the 2-norm is invariant under orthogonal transformations, and the final inequality holds because 1 - s λ is monotone in λ, so |1 - s λ_i(A)| is maximized at one of the endpoints λ_{min}(A) or λ_{max}(A). ∎
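As a quick numerical sanity check of Lemma 2 (an illustration of mine with an arbitrary random construction for A, not code from the original article):

import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)    # a random symmetric, positive definite matrix
s = 0.01

lam = np.linalg.eigvalsh(A)    # eigenvalues in ascending order
lhs = np.linalg.norm(np.eye(n) - s * A, 2)             # matrix 2-norm
rhs = max(abs(1 - s * lam[0]), abs(1 - s * lam[-1]))
print(lhs <= rhs + 1e-12)      # True; equality holds here up to rounding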

Proof of main result

To arrive at our final result, we will make use of some typical analysis inequalities that will give us an opening to use Lemma 1 and Lemma 2. With that said, we can proceed with this final proof as follows:

x_{k+1} - x* = x_k - s ( ∇f(x_k) + ε_k ) - x*
= x_k - s (Q x_k - b) - s ε_k - x*
= (I - sQ)(x_k - x*) - s ε_k

where the last equality uses Q x* = b. Taking norms, applying the triangle inequality, and then using Lemma 2 along with ||ε_k|| ≤ δ gives

||x_{k+1} - x*|| ≤ ||I - sQ||_2 ||x_k - x*|| + s ||ε_k|| ≤ q ||x_k - x*|| + s δ

Since we assume we choose s to be small enough such that q < 1, we can use Lemma 1 (with c = q) to further show that

||x_k - x*|| ≤ q^k ||x_0 - x*|| + s δ / (1 - q)

for all k, which is exactly the claimed bound. ∎

This main result is interesting for a few reasons. First, since the q^k ||x_0 - x*|| term decays geometrically, it is clear that very quickly our convergence error becomes limited by the gradient approximation term s δ / (1 - q) instead of the initial guess error. However, we can somewhat overcome this by choosing a sufficiently small value for the step size s, something that practitioners of stochastic gradient descent are likely used to doing.

Anyway, now that the theoretical work is done, we can move on to see how well this inequality holds up against actual numerical experiments. Time to crunch some numbers!

Meme from: https://makeameme.org/meme/crunch-the-numbers

Numerical Results

I thought it could be cool to do a numerical experiment to see how the bound compares to the convergence in practice. For this experiment, a noise vector ε satisfying ||ε|| ≤ δ, for some specified δ > 0, was added onto the exact gradient of the quadratic form for a random, positive definite Q. Multiple sequences were run with different starting random seeds, and the plot below visualizes the convergence results against the bound.
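A minimal NumPy sketch of this experiment might look like the following (the matrix construction, noise model, and parameter values are my own assumptions; the original code was not published with the article):

import numpy as np

def run_trial(n=10, s=0.05, delta=0.1, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((n, n))
    Q = B @ B.T / n + np.eye(n)        # random symmetric, positive definite Q
    b = rng.standard_normal(n)
    x_star = np.linalg.solve(Q, b)     # exact minimizer: Q x* = b

    lam = np.linalg.eigvalsh(Q)
    q = max(abs(1 - s * lam[0]), abs(1 - s * lam[-1]))
    assert q < 1, "stepsize s is too large for the bound to apply"

    x = rng.standard_normal(n)         # random starting point x_0
    e0 = np.linalg.norm(x - x_star)
    errors, bounds = [], []
    for k in range(iters):
        errors.append(np.linalg.norm(x - x_star))
        bounds.append(q**k * e0 + s * delta / (1 - q))
        eps = rng.standard_normal(n)
        eps *= delta * rng.uniform() / np.linalg.norm(eps)   # ||eps|| <= delta
        x = x - s * ((Q @ x - b) + eps)
    return np.array(errors), np.array(bounds)

errors, bounds = run_trial()
print(np.all(errors <= bounds))   # the proven bound holds at every iterate: True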

[Figure: ||x_k - x*|| per iteration for several random seeds, plotted alongside the theoretical bound.]

Based on the figure above, it looks like the bound works out nicely! It is certainly a bit conservative relative to the actual experiments, but that is okay. Some observations to note are the following. As q → 1, the term of the bound that is independent of k, s δ / (1 - q), becomes quite large, and the convergence of the q^k term to 0 slows down. q → 1 also implies that the eigenvalues of sQ sit inside an interval of radius 1 centered at 1, with the extreme values s λ_{min} and s λ_{max} near the boundary of that interval. Finally, q → 1 means we are approaching a situation where the iterates may diverge as the number of iterations approaches infinity, so it makes sense that the bound loosens and convergence slows when q is close to 1.

Conclusion

In any event, we have seen that if we approximate our gradient in a way that keeps the error bounded (say, using finite differences, or mini-batch estimates in Machine Learning), it is possible to bound our convergence error and get a clearer idea of what to expect in terms of run-time, long-term accuracy, and potentially more. This is a pretty neat result and something to keep in mind!

Based on article originally published at https://christianjhoward.me on March 15, 2018.
