Mathematical Intuition behind Gradient Descent
Mathematical derivation of the update rule in Gradient Descent — the most popular optimization algorithm in Machine Learning and Deep Learning
Gradient descent is an iterative optimization algorithm for finding the minimum of a function, most commonly used in Machine Learning and Deep Learning.
Introduction
If you’ve ever come across the term ‘Gradient Descent’, you have almost certainly seen the following update equation:

θ = θ − α·(∂L/∂θ)

along with a graph of a bowl-shaped loss curve, with arrows stepping down its slope toward the minimum.
In the above equation, L is a loss function (or cost function) and θ is any parameter the loss depends on. In Neural Networks (or Deep Learning), these parameters are the weights (W) and biases (b).
The goal is to find the global minimum of the loss function. The parameters are updated during each iteration of the training algorithm until we reach a minimum value of the loss.
In the context of Deep Learning (or Neural Networks), we write the above equation in terms of weights and biases:

w = w − α·(∂L/∂w)
b = b − α·(∂L/∂b)
This is a fundamental step in the Gradient Descent optimization algorithm, executed during each iteration of training.
Let’s derive this equation mathematically (don’t panic! all you need is basic high school calculus). After doing this, whenever you encounter the parameter update step you will know its origins, and feel more empowered!
General Learning Algorithm
Let’s first quickly review the general learning algorithm:

initialize w, b
iterate till satisfied:
    w = w + Δw
    b = b + Δb

Here ‘till satisfied’ is a subjective condition and could be one of many stopping criteria, such as reaching a threshold loss value or iterating a fixed number of times.
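The loop above can be sketched in plain Python. The toy loss, its gradients, and the stopping threshold are illustrative choices, and the step Δw = −α·∂L/∂w anticipates the update rule we are about to derive:

```python
# A minimal sketch of the general learning loop described above.
# The quadratic loss below is a toy choice with its minimum at w=2, b=-1.

def loss(w, b):
    return (w - 2.0) ** 2 + (b + 1.0) ** 2

def gradients(w, b):
    # Partial derivatives dL/dw and dL/db of the toy loss
    return 2.0 * (w - 2.0), 2.0 * (b + 1.0)

def train(w=0.0, b=0.0, alpha=0.1, threshold=1e-6, max_iters=10_000):
    """Repeat the update step 'till satisfied': here, until the loss
    falls below a threshold or a fixed iteration budget is spent."""
    for _ in range(max_iters):
        if loss(w, b) < threshold:
            break
        dw, db = gradients(w, b)
        w, b = w - alpha * dw, b - alpha * db  # parameter update step
    return w, b

w, b = train()
```

Running this drives (w, b) close to the toy loss’s minimum at (2, −1).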
Parameter Update Step
Note that the update step involves adding some change Δw, Δb to w, b. We will soon find that these are in fact the negative partial derivatives of the loss w.r.t. w and b, i.e. −∂L/∂w and −∂L/∂b, where L = f(w, b).
Let us formulate this mathematically:

Δw = −∂L/∂w,  Δb = −∂L/∂b

Note that we have not introduced the learning rate α yet. Let’s first understand the need for it.
Need for Learning Rate
We know that θ is a vector, and so is Δθ.
Let us consider the sum of these two vectors (picture θ and Δθ placed head to tail):

θ_new = θ + Δθ

The resultant vector can be quite large compared to the individual vectors, because we are taking a large step Δθ. We need to take a small step in that direction so that the resultant is small. This matters because if we make such large updates Δθ to the parameters θ, we might overshoot the global minimum of the loss L. Hence we introduce a learning rate, which limits the size of the update we make to θ.
θ_new = θ + α·Δθ

Note how, with the help of a learning rate α < 1, we limit the size of the update we make to θ.
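To make this concrete, here is a tiny sketch with the toy loss L(θ) = θ² and hypothetical step values, showing how the learning rate controls the step size (the step Δθ = −dL/dθ anticipates the derivation below):

```python
# Effect of the learning rate on the update θ ← θ + α·Δθ.
# Toy loss L(θ) = θ² with gradient 2θ, starting from θ = 1.

def step(theta, alpha):
    grad = 2.0 * theta          # dL/dθ for L(θ) = θ²
    return theta + alpha * (-grad)

# A small learning rate takes a careful step toward the minimum at 0 ...
small = step(1.0, 0.1)   # 1.0 - 0.1 * 2.0 = 0.8
# ... while an overly large one overshoots and makes the loss worse.
large = step(1.0, 1.5)   # 1.0 - 1.5 * 2.0 = -2.0, loss grew from 1 to 4
```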
Let us now find the value of Δθ that will reduce the loss.
Before we move ahead, let me introduce the (in)famous Taylor series, which we will use to find Δw and Δb, and hence Δθ.
Taylor Series and its application in Gradient Descent
Taylor’s series is used to find the value of a function at a small distance Δx from a point x, given the derivatives of the function at that point:

f(x + Δx) = f(x) + Δx·f′(x) + (Δx²/2!)·f″(x) + (Δx³/3!)·f‴(x) + …
Let’s find the value of Δw using Taylor’s series.
In this case, the function f will be the loss function L, and we will expand the series for L(w + α·Δw):

L(w + α·Δw) = L(w) + α·Δw·(∂L/∂w) + (α²·Δw²/2!)·(∂²L/∂w²) + …

We have to find a value of Δw such that L(w + α·Δw) < L(w). At this step we can infer that, for the new loss to be smaller than the old loss, the sum of the terms from the second one onwards must be negative.
But the loss L(θ) is a multivariate function: it depends not just on the weight w but also on the bias b, and we have represented them as the vector θ = [w, b]. So we need the vector form of the Taylor series to find Δθ:

L(θ + α·Δθ) = L(θ) + α·Δθᵀ·∇L(θ) + (α²/2!)·Δθᵀ·∇²L(θ)·Δθ + …
Here ∇L(θ) represents the first-order gradient of the loss w.r.t. θ.
The gradient is simply the vector of partial derivatives of the function w.r.t. each of its parameters. Similarly, ∇²L(θ) is the matrix of second-order partial derivatives (the Hessian), and so on.
In practice the learning rate α is very small (0.001, 0.0001, etc.), so α², α³, … are tiny and the corresponding terms contribute negligibly to the loss. They can therefore be dropped from the equation, which then becomes

L(θ + α·Δθ) = L(θ) + α·Δθᵀ·∇L(θ)
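We can check this first-order approximation numerically. The quadratic toy loss and the chosen direction Δθ below are illustrative:

```python
# For small α the higher-order Taylor terms are negligible, so
# L(θ + α·Δθ) ≈ L(θ) + α · Δθᵀ∇L(θ). Toy loss and values for illustration.

def L(w, b):
    return w ** 2 + 3.0 * b ** 2

def grad_L(w, b):
    return (2.0 * w, 6.0 * b)

w, b = 1.0, -0.5
dw, db = -2.0, 3.0           # an arbitrary direction Δθ = (Δw, Δb)
alpha = 1e-4

exact = L(w + alpha * dw, b + alpha * db)
dLdw, dLdb = grad_L(w, b)
approx = L(w, b) + alpha * (dw * dLdw + db * dLdb)

error = abs(exact - approx)  # of order α², hence tiny
```

Here the dot product Δθᵀ∇L(θ) happens to be negative, so the exact new loss is also lower than L(θ), previewing the argument that follows.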
Finding best value for Δθ
Since we want the updated loss L(θ + α·Δθ) to be less than the previous loss L(θ), and α is positive, the second term in the above equation has to be negative. So we need a value of Δθ that makes the dot product in the second term negative, i.e. we need

Δθᵀ·∇L(θ) < 0
We know from the definition of the dot product that

Δθᵀ·∇L(θ) = |Δθ| · |∇L(θ)| · cos β

where β is the angle between Δθ and ∇L(θ). We also know that cos β lies between −1 and 1, i.e. −1 ⩽ cos β ⩽ +1. Therefore, writing k = |Δθ| · |∇L(θ)|,

−k ⩽ Δθᵀ·∇L(θ) ⩽ +k
Now we want the dot product to be as negative as possible, so that the loss decreases as much as possible. As the inequality above shows, the most negative value it can take is −k.
For the dot product to equal −k, cos β must be −1, which corresponds to β = 180°: Δθ points in the direction exactly opposite to the gradient. From vector properties, a vector opposite to ∇L(θ) is

Δθ = −∇L(θ),  i.e.  Δw = −∂L/∂w and Δb = −∂L/∂b
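A small numerical sketch (toy loss and step size of my choosing) confirms that, among directions of the same length, the one at β = 180° to the gradient lowers the loss the most:

```python
import math

# Toy loss L = w² + b². We try steps of equal length at several angles β
# to the gradient and compare the resulting loss values.

def L(w, b):
    return w ** 2 + b ** 2

w, b = 3.0, 4.0
gw, gb = 2.0 * w, 2.0 * b          # ∇L = (6, 8)
alpha = 0.01

losses = {}
for deg in (0, 90, 180, 270):
    # Rotate the gradient by β degrees (keeping its length) to get Δθ
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    dw = c * gw - s * gb
    db = s * gw + c * gb
    losses[deg] = L(w + alpha * dw, b + alpha * db)

# The β = 180° direction (i.e. Δθ = -∇L) yields the lowest new loss.
best = min(losses, key=losses.get)
```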
Final Parameter Update Step
Now if we substitute these values into the parameter update step of the general learning algorithm, it becomes

w = w − α·(∂L/∂w)
b = b − α·(∂L/∂b)

And that equation, my friend, is exactly what we set out to derive.
Each time you update your parameters w and b using this rule, the loss on your training set decreases, until it cannot decrease any further, i.e. when the slopes (partial derivatives) become 0.
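As a sketch of the full rule in action, here is plain-Python gradient descent fitting a line y = w·x + b to toy data. The data, learning rate, and iteration count are illustrative choices:

```python
# Applying the derived rule w ← w − α·∂L/∂w, b ← b − α·∂L/∂b to fit
# y = w·x + b on toy data generated from w = 2, b = 1.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # exactly y = 2x + 1

w, b, alpha = 0.0, 0.0, 0.05

def loss(w, b):
    # Mean squared error over the training set
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

for _ in range(2000):
    # Partial derivatives of the MSE w.r.t. w and b
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w, b = w - alpha * dw, b - alpha * db
```

After training, w and b land very close to the generating values 2 and 1.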
Gradient Descent Rule for multiple weights and biases (vectorized notation)
We have derived the update rule for a single weight and bias.
In reality a deep neural network has a lot of weights and biases, represented as matrices (or tensors), so the update rule is applied to all of them simultaneously:

W = W − α·(∂L/∂W)
b = b − α·(∂L/∂b)

where ∂L/∂W and ∂L/∂b now have the same shapes as W and b themselves.
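A minimal NumPy sketch of the vectorized update follows. The layer shapes and the stand-in gradient values are assumptions for illustration; in practice the gradients come from backpropagation:

```python
import numpy as np

# Vectorized update for a whole layer: W is a weight matrix, b a bias
# vector, and dW, db hold ∂L/∂W and ∂L/∂b (toy values here).

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))    # weights of one layer (4 inputs, 3 units)
b = np.zeros(3)                # biases of the layer
dW = rng.normal(size=(4, 3))   # stand-in gradients ∂L/∂W
db = rng.normal(size=(3,))     # stand-in gradients ∂L/∂b
alpha = 0.01

# One gradient-descent step updates every weight and bias at once:
W_new = W - alpha * dW
b_new = b - alpha * db
```

Because NumPy broadcasts the scalar α over the whole array, a single expression updates every parameter of the layer simultaneously.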
Note that some deep learning texts use the notation Δ instead of ∇ to represent the gradient in equations; both denote the same quantity here.
Outro
That is all for this one. I hope that after reading this you have developed a more intuitive understanding of the gradient descent algorithm.