Mathematical Intuition behind Gradient Descent
Mathematical derivation of the update rule in Gradient Descent — the most popular optimization algorithm in Machine Learning and Deep Learning
Gradient descent is an iterative optimization algorithm for finding the minimum of a function, most commonly used in Machine Learning and Deep Learning.
Introduction
If you’ve ever come across the term ‘Gradient Descent’, you have almost certainly seen the following update equation:

θ = θ − α·(∂L/∂θ)

along with a graph of a bowl-shaped loss curve, with arrows stepping down its slope toward the minimum.
In the above equation, L is a loss function (or cost function) and θ is any parameter the loss depends on. In Neural Networks (or Deep Learning), these parameters are the weights (W) and biases (b).
The goal is to find the global minimum of the loss function. The parameters are updated during each iteration of the training algorithm until we reach a minimum value of the loss.
In the context of Deep Learning (or Neural Networks), we write the above equation in terms of weights and biases:

w = w − α·(∂L/∂w)
b = b − α·(∂L/∂b)
This is a fundamental step in the Gradient Descent optimization algorithm, executed during each iteration of training.
Let’s derive this equation mathematically (don’t panic! all you need is basic high school calculus). After doing this, whenever you encounter the parameter update step you will know its origins, and feel more empowered!
General Learning Algorithm
Let’s first quickly review the general learning algorithm:

initialize w, b
iterate till satisfied:
    w = w + Δw
    b = b + Δb

Here ‘till satisfied’ is a subjective condition and could be one of many stopping criteria, such as reaching a threshold loss value or iterating a fixed number of times.
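The loop above can be sketched in plain Python. The toy loss, its gradients, and the stopping threshold are illustrative choices, and the step Δw = −α·∂L/∂w anticipates the update rule we are about to derive:

```python
# A minimal sketch of the general learning loop described above.
# The quadratic loss below is a toy choice with its minimum at w=2, b=-1.

def loss(w, b):
    return (w - 2.0) ** 2 + (b + 1.0) ** 2

def gradients(w, b):
    # Partial derivatives dL/dw and dL/db of the toy loss
    return 2.0 * (w - 2.0), 2.0 * (b + 1.0)

def train(w=0.0, b=0.0, alpha=0.1, threshold=1e-6, max_iters=10_000):
    """Repeat the update step 'till satisfied': here, until the loss
    falls below a threshold or a fixed iteration budget is spent."""
    for _ in range(max_iters):
        if loss(w, b) < threshold:
            break
        dw, db = gradients(w, b)
        w, b = w - alpha * dw, b - alpha * db  # parameter update step
    return w, b

w, b = train()
```

Running this drives (w, b) close to the toy loss’s minimum at (2, −1).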
Parameter Update Step
Note that the update step involves adding some change Δw, Δb to w, b. We will soon find that these are in fact the negative partial derivatives of the loss w.r.t. w and b, i.e. −∂L/∂w and −∂L/∂b, where L = f(w, b).
Let us formulate this mathematically:

Δw = −∂L/∂w,  Δb = −∂L/∂b

Note that we have not introduced the learning rate α yet. Let’s first understand the need for it.
Need for Learning Rate
We know that θ is a vector, and so is Δθ.
Let us consider the sum of these two vectors (picture θ and Δθ placed head to tail):

θ_new = θ + Δθ

The resultant vector can be quite large compared to the individual vectors, because we are taking a large step Δθ. We need to take a small step in that direction so that the resultant is small. This matters because if we make such large updates Δθ to the parameters θ, we might overshoot the global minimum of the loss L. Hence we introduce a learning rate, which limits the size of the update we make to θ.
θ_new = θ + α·Δθ

Note how, with the help of a learning rate α < 1, we limit the size of the update we make to θ.
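To make this concrete, here is a tiny sketch with the toy loss L(θ) = θ² and hypothetical step values, showing how the learning rate controls the step size (the step Δθ = −dL/dθ anticipates the derivation below):

```python
# Effect of the learning rate on the update θ ← θ + α·Δθ.
# Toy loss L(θ) = θ² with gradient 2θ, starting from θ = 1.

def step(theta, alpha):
    grad = 2.0 * theta          # dL/dθ for L(θ) = θ²
    return theta + alpha * (-grad)

# A small learning rate takes a careful step toward the minimum at 0 ...
small = step(1.0, 0.1)   # 1.0 - 0.1 * 2.0 = 0.8
# ... while an overly large one overshoots and makes the loss worse.
large = step(1.0, 1.5)   # 1.0 - 1.5 * 2.0 = -2.0, loss grew from 1 to 4
```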
Let us now find the value of Δθ that will reduce the loss.
Before we move ahead, let me introduce the (in)famous Taylor series, which we will use to find Δw and Δb, and hence Δθ.
Taylor Series and its application in Gradient Descent
Taylor’s series is used to find the value of a function at a small distance Δx from a point x, given the derivatives of the function at that point:

f(x + Δx) = f(x) + Δx·f′(x) + (Δx²/2!)·f″(x) + (Δx³/3!)·f‴(x) + …
Let’s find the value of Δw using Taylor’s series.
In this case, the function f will be the loss function L, and we will expand the series for L(w + α·Δw):

L(w + α·Δw) = L(w) + α·Δw·(∂L/∂w) + (α²·Δw²/2!)·(∂²L/∂w²) + …

We have to find a value of Δw such that L(w + α·Δw) < L(w). At this step we can infer that, for the new loss to be smaller than the old loss, the sum of the terms from the second one onwards must be negative.
But the loss L(θ) is a multivariate function: it depends not just on the weight w but also on the bias b, and we have represented them as the vector θ = [w, b]. So we need the vector form of the Taylor series to find Δθ:

L(θ + α·Δθ) = L(θ) + α·Δθᵀ·∇L(θ) + (α²/2!)·Δθᵀ·∇²L(θ)·Δθ + …
Here ∇L(θ) represents the first-order gradient of the loss w.r.t. θ.
The gradient is simply the vector of partial derivatives of the function w.r.t. each of its parameters. Similarly, ∇²L(θ) is the matrix of second-order partial derivatives (the Hessian), and so on.
In practice the learning rate α is very small (0.001, 0.0001, etc.), so α², α³, … are tiny and the corresponding terms contribute negligibly to the loss. They can therefore be dropped from the equation, which then becomes

L(θ + α·Δθ) = L(θ) + α·Δθᵀ·∇L(θ)
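We can check this first-order approximation numerically. The quadratic toy loss and the chosen direction Δθ below are illustrative:

```python
# For small α the higher-order Taylor terms are negligible, so
# L(θ + α·Δθ) ≈ L(θ) + α · Δθᵀ∇L(θ). Toy loss and values for illustration.

def L(w, b):
    return w ** 2 + 3.0 * b ** 2

def grad_L(w, b):
    return (2.0 * w, 6.0 * b)

w, b = 1.0, -0.5
dw, db = -2.0, 3.0           # an arbitrary direction Δθ = (Δw, Δb)
alpha = 1e-4

exact = L(w + alpha * dw, b + alpha * db)
dLdw, dLdb = grad_L(w, b)
approx = L(w, b) + alpha * (dw * dLdw + db * dLdb)

error = abs(exact - approx)  # of order α², hence tiny
```

Here the dot product Δθᵀ∇L(θ) happens to be negative, so the exact new loss is also lower than L(θ), previewing the argument that follows.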
Finding best value for Δθ
Since we want the updated loss L(θ + α·Δθ) to be less than the previous loss L(θ), and α is positive, the second term in the above equation has to be negative. So we need a value of Δθ that makes the dot product in the second term negative, i.e. we need

Δθᵀ·∇L(θ) < 0
We know from the definition of the dot product that

Δθᵀ·∇L(θ) = |Δθ| · |∇L(θ)| · cos β

where β is the angle between Δθ and ∇L(θ). We also know that cos β lies between −1 and 1, i.e. −1 ⩽ cos β ⩽ +1. Therefore, writing k = |Δθ| · |∇L(θ)|,

−k ⩽ Δθᵀ·∇L(θ) ⩽ +k
Now we want the dot product to be as negative as possible, so that the loss decreases as much as possible. As the inequality above shows, the most negative value it can take is −k.
For the dot product to equal −k, cos β must be −1, which corresponds to β = 180°: Δθ points in the direction exactly opposite to the gradient. From vector properties, a vector opposite to ∇L(θ) is

Δθ = −∇L(θ),  i.e.  Δw = −∂L/∂w and Δb = −∂L/∂b
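A small numerical sketch (toy loss and step size of my choosing) confirms that, among directions of the same length, the one at β = 180° to the gradient lowers the loss the most:

```python
import math

# Toy loss L = w² + b². We try steps of equal length at several angles β
# to the gradient and compare the resulting loss values.

def L(w, b):
    return w ** 2 + b ** 2

w, b = 3.0, 4.0
gw, gb = 2.0 * w, 2.0 * b          # ∇L = (6, 8)
alpha = 0.01

losses = {}
for deg in (0, 90, 180, 270):
    # Rotate the gradient by β degrees (keeping its length) to get Δθ
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    dw = c * gw - s * gb
    db = s * gw + c * gb
    losses[deg] = L(w + alpha * dw, b + alpha * db)

# The β = 180° direction (i.e. Δθ = -∇L) yields the lowest new loss.
best = min(losses, key=losses.get)
```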
Final Parameter Update Step
Now if we substitute these values into the parameter update step of the general learning algorithm, it becomes

w = w − α·(∂L/∂w)
b = b − α·(∂L/∂b)

And that equation, my friend, is exactly what we set out to derive.
Each time you update your parameters w and b using this rule, the loss on your training set decreases, until it cannot decrease any further, i.e. when the slopes (partial derivatives) become 0.
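As a sketch of the full rule in action, here is plain-Python gradient descent fitting a line y = w·x + b to toy data. The data, learning rate, and iteration count are illustrative choices:

```python
# Applying the derived rule w ← w − α·∂L/∂w, b ← b − α·∂L/∂b to fit
# y = w·x + b on toy data generated from w = 2, b = 1.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # exactly y = 2x + 1

w, b, alpha = 0.0, 0.0, 0.05

def loss(w, b):
    # Mean squared error over the training set
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

for _ in range(2000):
    # Partial derivatives of the MSE w.r.t. w and b
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w, b = w - alpha * dw, b - alpha * db
```

After training, w and b land very close to the generating values 2 and 1.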
Gradient Descent Rule for multiple weights and biases (vectorized notation)
We have derived the update rule for a single weight and bias.
In reality a deep neural network has a lot of weights and biases, represented as matrices (or tensors), so the update rule is applied to all of them simultaneously:

W = W − α·(∂L/∂W)
b = b − α·(∂L/∂b)

where ∂L/∂W and ∂L/∂b now have the same shapes as W and b themselves.
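A minimal NumPy sketch of the vectorized update follows. The layer shapes and the stand-in gradient values are assumptions for illustration; in practice the gradients come from backpropagation:

```python
import numpy as np

# Vectorized update for a whole layer: W is a weight matrix, b a bias
# vector, and dW, db hold ∂L/∂W and ∂L/∂b (toy values here).

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))    # weights of one layer (4 inputs, 3 units)
b = np.zeros(3)                # biases of the layer
dW = rng.normal(size=(4, 3))   # stand-in gradients ∂L/∂W
db = rng.normal(size=(3,))     # stand-in gradients ∂L/∂b
alpha = 0.01

# One gradient-descent step updates every weight and bias at once:
W_new = W - alpha * dW
b_new = b - alpha * db
```

Because NumPy broadcasts the scalar α over the whole array, a single expression updates every parameter of the layer simultaneously.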
Note that some deep learning texts use the notation Δ instead of ∇ to represent the gradient in equations; both denote the same quantity here.
Outro
That is all for this one. I hope that after reading this you have developed a more intuitive understanding of the gradient descent algorithm.