Wondering why you subtract the gradient in the gradient descent algorithm?

An easy-to-understand vector calculus insight featuring partial derivatives, gradients, and directional derivatives

Mayank Mishra
Towards Data Science


Photo by Jeswin Thomas on Unsplash

In machine learning, one of the most basic approaches to a supervised learning problem is to define a cost function and then minimize it to get the best output. Gradient descent is one of the most widely used algorithms for doing so. A basic mathematical representation of the gradient descent algorithm looks like this:

θ := θ − α ∇J(θ)

Basic representation of the gradient descent algorithm

Here, θ is the vector containing the parameters, J is the cost function, and α is the learning rate. The assignment operator indicates that the values in the vector θ are updated after every iteration. The learning rate, which can be set manually, defines the size of the steps the algorithm takes. But why do we subtract the gradient to reach minimal loss? To understand that, let’s go over a few basic concepts of vector calculus.
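For concreteness, here is a minimal Python sketch of that update rule (the names gradient_descent and grad_J are illustrative, not from any particular library):

```python
import numpy as np

def gradient_descent(grad_J, theta0, learning_rate=0.1, n_iters=100):
    """Repeatedly apply the update theta := theta - learning_rate * grad_J(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - learning_rate * grad_J(theta)  # step against the gradient
    return theta

# Example: J(theta) = theta_0**2 + theta_1**2, whose gradient is 2 * theta
theta_min = gradient_descent(lambda theta: 2 * theta, theta0=[2.0, 3.0])
print(theta_min)  # close to [0, 0], the minimizer
```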

Gradient

A gradient of a function f, written as ∇f, is a vector that contains all the partial derivatives of f. Let’s look at it with an example.

Consider the function f(x,y) = x³ sin(y). We first need to find the partial derivatives of f. To find the partial derivative of a function with respect to a variable, treat all the other variables as constants.

∂f/∂x = 3x² sin(y)
∂f/∂y = x³ cos(y)

Partial derivatives of the function f(x,y) = x³ sin(y)

Now the gradient of f, denoted by ∇f, is simply the vector collecting all the partial derivatives:

∇f = [3x² sin(y), x³ cos(y)]

Gradient of the function f(x,y) = x³ sin(y)

Therefore, to generalise, the gradient of a function J of parameters θ₁, …, θₙ can be written as:

∇J = [∂J/∂θ₁, ∂J/∂θ₂, …, ∂J/∂θₙ]

General representation of a gradient
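If you want to verify these partial derivatives symbolically, a short SymPy sketch (assuming SymPy is available) does the job:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 * sp.sin(y)

# The gradient collects all the partial derivatives of f
grad_f = [sp.diff(f, x), sp.diff(f, y)]
print(grad_f)  # [3*x**2*sin(y), x**3*cos(y)]
```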

The partial derivative of a multivariable function with respect to a particular variable simply tells us how the output of the function changes if we change that variable very slightly while keeping the other variables constant.

For example, let’s take a very simple function:

f(x,y) = x²y. At any point, say (2,3), the partial derivative of the function with respect to x is:

∂f(x,y)/∂x = 2xy = 2*2*3 = 12.

The value (output) of the function f(x,y) at the point (2,3) is 2²*3 = 12.

Now, let us increase the value of x very slightly, say by 0.001, while keeping y constant. The point is now (2.001, 3). The value (output) of the function at this point is 2.001² * 3 = 12.012003.

Let’s compute the rate of change in the output when x is changed, i.e. ∂f(x,y)/∂x:

(f(2.001,3) − f(2,3)) / (2.001 − 2) = 0.012003/0.001 ≈ 12

This value is equal to the partial derivative of the function with respect to x, confirming that the partial derivative measures the change in the output of a function when one input variable is changed slightly while the others are held constant. The same can be done for y, keeping x constant.
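This numerical check is easy to reproduce in code; a quick sketch with the same function and step size:

```python
def f(x, y):
    return x**2 * y

h = 0.001
# Finite-difference approximation of df/dx at the point (2, 3)
approx = (f(2 + h, 3) - f(2, 3)) / h
exact = 2 * 2 * 3  # analytic partial derivative 2xy evaluated at (2, 3)
print(approx, exact)  # ~12.003 vs 12
```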

Graph of f(x,y) = x²+y²

Let us now visualise the function f(x,y) = x² + y² for better understanding; its graph is shown above. If this function were our loss function, we can see that the loss would be minimal near the origin and would increase as we move further away. Let us plot a vector field of gradients on the x-y plane. The gradient of this function is [2x, 2y].

(a) Vector field of gradients on the x-y plane | (b) Bottom-to-top view

We see that the vectors in the vector field point away from the origin for this function. That is because if we take any point (a,b) on the graph, project it onto the x-y plane, and draw the vector from the origin to that point, the gradient at that point is the vector [2a,2b], which points even further away from the origin. Also notice that points close to the origin have shorter gradient vectors than points further away. This is not surprising: since the gradient vector is [2x,2y], any point further from the origin yields a longer vector than points with small values of x and y.

To visualise this better, given below are gradients at particular points on the graph.

Gradient vectors at various points on the graph
Bottom-up view of the gradient vectors at different points on the graph (they point away from the origin, and the further you go from the origin, the longer the vectors are)
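A vector field like the one above can be reproduced with Matplotlib’s quiver; here is a minimal sketch (the grid range is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of points on the x-y plane
xs, ys = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))

# Gradient of f(x, y) = x**2 + y**2 is [2x, 2y] at each point
plt.quiver(xs, ys, 2 * xs, 2 * ys)
plt.title('Gradient field of f(x, y) = x^2 + y^2')
plt.show()
```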

Imagine you are standing at some point on the graph and want to climb up. You want to find the direction in which to move so as to increase the altitude the fastest. That direction is given by the gradient vector at that particular point. Thus, the gradient gives us the direction of steepest ascent. In the gradient descent algorithm, we aim to minimize the value of the loss function, so at any given point we need to move in the direction in which the value of the function decreases the most, i.e. in the direction opposite to the gradient, which is the direction of steepest descent. That is why we subtract the gradient in the algorithm.

But it is still not clear why the gradient points in the direction of steepest ascent. To understand that, we need an overview of directional derivatives.

Directional Derivative

Consider a function of two variables x and y. We saw earlier that the partial derivatives of a function f(x,y) tell us the change in the output if we move slightly in the x-direction (keeping y constant) or in the y-direction (keeping x constant). But what if we move in a direction that is not parallel to either axis? Say we move in the direction of the vector [2,3], i.e. 2 units in the x-direction and 3 units in the y-direction. How will the output of the function change now? This is the information the directional derivative gives us.

The directional derivative is mathematically represented by the dot product of the gradient of the function with 𝑤⃗, the vector in the direction we are moving:

∇f · 𝑤⃗ = a(∂f/∂x) + b(∂f/∂y)

Directional derivative (considering 𝑤⃗ as [a,b])

Take the earlier function f(x,y) = x² + y². Suppose you are standing at the point (2,3) and you move slightly in the direction of the vector (3,1). What is the change in the output?

Since we were standing at (2,3), the directional derivative at that point is:

∇f(2,3) · (3,1) = [4,6] · [3,1] = 4·3 + 6·1 = 18

This means that if at (2,3) we move slightly in the direction of the vector (3,1), the output changes at a rate of 18. The reason we take the dot product is also intuitive. When we move only in the x-direction, the change in the output is ∂f/∂x, and when we move only in the y-direction, the change in the output is ∂f/∂y. So when we move in the direction of a vector 𝑤⃗ = [a,b], we go a units in the x-direction and b units in the y-direction, and the change in the output is a(∂f/∂x) + b(∂f/∂y).
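The same arithmetic takes a couple of lines of NumPy:

```python
import numpy as np

point = np.array([2.0, 3.0])
w = np.array([3.0, 1.0])  # direction in which we move

grad = 2 * point  # gradient [2x, 2y] of f(x, y) = x**2 + y**2, here [4, 6]
print(np.dot(grad, w))  # 4*3 + 6*1 = 18.0
```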

Now, at any given point we can move in many directions. Suppose we wish to move in the direction 𝑤⃗ in which the directional derivative is the largest, i.e. in which the output of the function increases the most. That direction is the direction of steepest ascent.

Thus we can say we are looking for:

max ∇f · 𝑤⃗ over all unit vectors 𝑤⃗ (‖𝑤⃗‖ = 1)

A point to note here is that the vectors in this equation have length 1, i.e. they are unit vectors. This rules out choosing a vector pointing in some completely different direction just because its large magnitude inflates the dot product.

We know that given any two vectors, the dot product can be written as:

𝑎⃗ · 𝑏⃗ = ‖𝑎⃗‖ ‖𝑏⃗‖ cos(θ)

Dot product of two vectors

Here θ is the angle between the two vectors. For the dot product to be maximal, cos(θ) must be maximal, which happens when the angle between the two vectors is 0.

In our case, the magnitude of the vectors is 1, so only the angle between them matters. The directional derivative is therefore highest when our vector 𝑤⃗ (the direction in which we are moving) points in the same direction as the gradient of the function.

Thus we can say that the gradient at any point of a function points in the direction of steepest ascent: if I were to climb the function as quickly as possible, I would choose the direction of the gradient.
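You can also confirm this numerically: among many unit vectors, the one that maximizes the directional derivative lines up with the normalized gradient. A small sketch:

```python
import numpy as np

grad = np.array([4.0, 6.0])  # gradient of f(x, y) = x**2 + y**2 at (2, 3)

# Sample unit vectors in all directions and evaluate the directional derivative
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
dots = directions @ grad

best = directions[np.argmax(dots)]
print(best)                         # ~[0.55, 0.83]
print(grad / np.linalg.norm(grad))  # ~[0.55, 0.83], the same direction
```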

In the gradient descent algorithm, we wish to move in the direction in which the function (the loss function) decreases the most, so we move in the direction opposite to the gradient, i.e. the direction of steepest descent. Therefore, we subtract the gradient in the gradient descent algorithm.

Thank you for reading!

If you wish to read more on core topics of deep learning, feel free to follow and stay updated.

All the graphical representations and images in the story are by the author.
