
Gradient Descent, the Easy Way

A gentle and intuitive way to understand Gradient Descent.

Gradient Descent is at the core of all Deep Learning. This article explains, in simple terms, the technical details of this method.

Update: the best way to learn and practice Reinforcement Learning is at http://rl-lab.com

Consider a system that takes an input, applies some computations based on given parameters, and then outputs a result.

The output is then compared to some expected result, and we would like the output to be as close to that expected result as possible.

Take, for example, the analogy of an old TV set or radio: a signal comes in, and using a dial you fine-tune the device to get the best audio/video with little or no noise.

So how can this be done mathematically?

The difference between the output and the expected result is described by a function that we call the loss function (𝓛).

To keep things simple, we will assume that the loss function of the system is a parabola of the form:

𝓛 = aπœƒ Β² + b where πœƒ is the fine tuning parameter, a and b are some values.

This loss function looks like the following graph:

The graph above gives the amount of error in the output as a function of the parameter πœƒ. The aim is to minimize this error; judging from the graph, it is easy to spot that the minimal error is at the bottom of the curve (the point πœƒ = 0, 𝓛 = 1).
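To make this concrete, here is a minimal sketch of the loss function in Python; the values a = 1 and b = 1 are assumptions chosen purely for illustration, so that the minimum matches the point above.

```python
# A hypothetical instance of the parabolic loss, with a = 1 and b = 1
# chosen for illustration (the minimum is then L = 1 at theta = 0).
def loss(theta, a=1.0, b=1.0):
    return a * theta ** 2 + b

print(loss(0.0))   # 1.0, the bottom of the curve
print(loss(2.0))   # 5.0, a larger error away from the minimum
```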

How to find the minimum error?

As usual, an analytical solution is a possibility, but it is not always easy to obtain when the loss function gets complicated; for this reason, we fall back on iterative solutions.

The way we proceed is rather intuitive: we vary πœƒ and check whether 𝓛 decreases. We keep doing that until we find the value of πœƒ for which 𝓛 is minimal.

As we can see when πœƒ moves from -2 to -1.4, 𝓛 drops which means πœƒ is moving in the right direction. On the other hand when πœƒ moves from +1.5 to +2, 𝓛 increases which means πœƒ is moving in the wrong direction, it must go in the opposite direction as shown by the arrow.

The question now is: how do we find the πœƒ that minimizes 𝓛?

Obviously we need to vary πœƒ, but by how much and in which direction ? As we have seen above that as we vary πœƒ from the lowest values to the highest values (ex: -∞ to ﹒∞), we track the variation of 𝓛. If 𝓛 keep decreasing we keep increasing πœƒ, otherwise if 𝓛 starts increasing we start decreasing πœƒ until we find the right value. This solves the direction, but what about the amount by which the πœƒ should vary ? A simple solution is to take a fraction of the variation of 𝓛, we call it 𝛂.

Mixing all of these features together we get the following formula:

πœƒπ‘›β‚Šβ‚ = πœƒπ‘› -𝛂 . βˆ†π“›

This formula says that the next value of πœƒ is the previous value minus a fraction of the variation of 𝓛. If 𝓛 is decreasing, then βˆ†π“› < 0 and −𝛂 Β· βˆ†π“› > 0, so πœƒ will increase. If 𝓛 is increasing, then βˆ†π“› > 0 and −𝛂 Β· βˆ†π“› < 0, so πœƒ will decrease.
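To see this sign logic with concrete numbers, here is a minimal sketch; the values of πœƒ, 𝛂, and the small step h used to measure the variation of 𝓛 are all assumptions made for illustration.

```python
# A tiny illustration of the sign logic above (hypothetical values).
# The variation of L around theta is approximated by a small finite difference.
def loss(theta):
    return theta ** 2 + 1              # example loss with a = 1, b = 1

alpha = 0.9
h = 0.1                                # small step used to measure the variation of L

for theta in (-2.0, +2.0):
    delta_L = loss(theta + h) - loss(theta)   # variation of L around theta
    theta_next = theta - alpha * delta_L      # gradient descent update
    print(theta, delta_L, theta_next)
# At theta = -2 the variation is negative, so theta increases (moves right);
# at theta = +2 the variation is positive, so theta decreases (moves left).
```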

How to compute βˆ†π“›?

Of course, the most direct way to do this is βˆ†π“› = 𝓛(end) − 𝓛(start), but this requires some bookkeeping. A better way is to compute the derivative, which gives the variation of 𝓛 at each point. For example, the derivative of 𝓛 = aπœƒΒ² + b is d𝓛/dπœƒ = 2aπœƒ, so by plugging in a value of πœƒ we get the variation of 𝓛 at that point. The minimum of 𝓛 is reached when d𝓛/dπœƒ becomes zero for a certain πœƒ. The target now becomes finding the πœƒ that gives d𝓛/dπœƒ = 0.
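Here is a minimal sketch comparing the two approaches on the example loss; the values of a, b, and πœƒ are assumptions made for illustration.

```python
# Comparing the "variation" approach with the derivative,
# using the example loss L = a*theta**2 + b.
a, b = 1.0, 1.0

def loss(theta):
    return a * theta ** 2 + b

def dL(theta):
    return 2 * a * theta               # analytic derivative dL/dtheta

h = 1e-6                               # tiny step for a numerical check
theta = 1.5
numerical = (loss(theta + h) - loss(theta)) / h   # finite-difference estimate
print(numerical, dL(theta))            # both are ~3.0; the derivative needs no bookkeeping
```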

How to choose 𝛂?

The value of 𝛂 is usually decided based on experience; several values should be tried to find the one that makes convergence faster. But there is still a catch here, which is the constancy of this parameter (more on that later).

Example

Consider 𝛂 = 0.9, 𝓛 = πœƒΒ², and a tolerance Ι› = 0.00001 that tells how small βˆ†π“› should become before we consider it to be zero. After iterating over the gradient descent formula (πœƒπ‘›β‚Šβ‚ = πœƒπ‘› − 𝛂 Β· βˆ†π“›) we get the following table.

As we can see, it took 77 iterations for βˆ†π“› to become smaller than 0.00001 and for us to find an acceptable value of πœƒ. That is somewhat long, and it is because we have a constant 𝛂! We are taking the same size of step even when we are close to the minimum, while what we should do is take smaller steps in order to fine-tune the descent.
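A minimal sketch of this experiment is shown below; the starting value πœƒ = 2 and the exact form of the stopping test are assumptions, so the iteration count may differ slightly from the table above.

```python
# Constant-alpha gradient descent on L = theta**2, stopping when the
# update term becomes smaller than epsilon. The starting point theta = 2
# is an assumption; the exact iteration count depends on it and on how
# the stopping test is defined.
def dL(theta):
    return 2 * theta                   # derivative of L = theta**2

alpha = 0.9
epsilon = 0.00001
theta = 2.0

iterations = 0
while abs(alpha * dL(theta)) > epsilon:
    theta = theta - alpha * dL(theta)  # constant-step gradient descent update
    iterations += 1

print(iterations, theta)
```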

Schematically, gradient descent does something like the following graph (note: this is not an accurate representation, but it gives an intuition).

However, what is really needed is that, as we approach the minimum, we adapt our steps in order to pinpoint that minimum and avoid zig-zagging around it. So we need something like the following:

One way of achieving this is to decrease 𝛂 as we go: for example, at each iteration i compute a factor f = 1 + (i/10), then divide 𝛂 by f in the gradient descent formula:

πœƒπ‘›β‚Šβ‚ = πœƒπ‘› -𝛂/f . βˆ†π“›

This modification gives a much improved result: we reach a solution in just 8 iterations, as shown in the following table:
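Here is the same sketch with the decaying 𝛂; again, the starting point πœƒ = 2 is an assumption, so the iteration count is indicative rather than exact.

```python
# Same experiment as before, but dividing alpha by f = 1 + i/10 at each iteration i.
def dL(theta):
    return 2 * theta                   # derivative of L = theta**2

alpha = 0.9
epsilon = 0.00001
theta = 2.0

i = 0
while True:
    f = 1 + i / 10                     # decay factor grows with the iteration index
    step = (alpha / f) * dL(theta)
    if abs(step) <= epsilon:
        break
    theta = theta - step               # gradient descent with a shrinking step
    i += 1

print(i, theta)                        # converges in only a handful of iterations
```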

Finally, here is Python code that computes gradient descent for any loss function, provided that the right derivative is given in the function dF(x).
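A minimal sketch of such a routine could look like the following; the function name gradient_descent, its default values, and the use of the decaying-𝛂 scheme described above are assumptions made for illustration.

```python
# Generic gradient descent: runs for any loss F, given its derivative dF,
# using the decaying-alpha scheme described above.
def gradient_descent(dF, theta0, alpha=0.9, epsilon=0.00001, max_iterations=1000):
    theta = theta0
    for i in range(max_iterations):
        f = 1 + i / 10                 # decay factor for the learning rate
        step = (alpha / f) * dF(theta) # fraction of the variation of the loss
        if abs(step) <= epsilon:       # treat very small updates as "zero"
            break
        theta = theta - step           # gradient descent update
    return theta, i

# Example usage with the loss F = theta**2, whose derivative is dF(x) = 2x.
def dF(x):
    return 2 * x

theta_min, iterations = gradient_descent(dF, theta0=2.0)
print(theta_min, iterations)
```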

Conclusion

Gradient descent is a handy tool for finding the minimum of a loss function, and it is very useful in the field of Deep Learning.

