
Gradient Descent Demystified in 5 Minutes

Learn how Gradient Descent works with just a tiny bit of math – and a lot of common sense.

"I was on a date with this girl. Everything was great until I started talking about gradient descent. It all went downhill from there."

Well, tough luck, but it’s her loss.

This didn’t make any sense to you? Great, keep reading.

Photo by Stephen Leonardi on Unsplash

If you haven’t been living under a rock, you’ve most likely heard about an optimization algorithm called gradient descent. But just because you’ve heard of it doesn’t mean you understand it. Try to explain it in a couple of sentences without using technical jargon.

Not so easy, right? Then it’s a good thing you’re here. Let’s dive in.

Gradient descent is an optimization algorithm used to find the parameter values that minimize a cost function. It might sound fancy, but try to think of it visually. You have a large bowl, just like the one you eat cereal from in the morning. That bowl is the cost function. Now take any random position on the surface of the bowl – those are the current values of your coefficients. The idea is to somehow get to the bottom of the bowl – ergo to get to the minimum of the function – ergo to find the optimal parameters.

The idea is to keep trying out different values for your coefficients, evaluate the cost for each, and then select new coefficients that are slightly better – ones with a lower cost. Repeating this process enough times will lead you to the bottom of the bowl, and you will discover the best values of the coefficients – best meaning the ones that result in the minimum cost.


But what is a Cost Function?

In data science and machine learning, the cost function is used to estimate how good or bad your model is. In the simplest terms possible, a cost function is a measure of how wrong your model is in terms of its ability to estimate the relationship between the features (X) and the target variable (y).

It doesn’t matter all that much what you use as your cost function, but some of the most popular are:

  • Sum of Squared Residuals (SSR/SSE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)

It’s also sometimes called a loss function or an error function. Your objective is to find the optimal parameters for the model, ergo the ones that minimize the cost function.
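If code speaks to you more than definitions, here’s a minimal sketch of the three in Python (the function names are mine, purely for illustration):

```python
import numpy as np

def ssr(y_true, y_pred):
    """Sum of Squared Residuals - squared differences, added up."""
    return np.sum((y_true - y_pred) ** 2)

def mse(y_true, y_pred):
    """Mean Squared Error - the average squared difference."""
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root Mean Squared Error - MSE brought back into the units of y."""
    return np.sqrt(mse(y_true, y_pred))
```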

Now that you understand this, let’s continue with gradient descent.


Gradient Descent Step by Step

The algorithm starts off by setting initial values for the coefficients – you are free to set the values to whatever you like (just not a string or a boolean), but the common practice is to set them to 0. If I have two coefficients, let’s say beta 0 and beta 1, I would set them to zero initially:
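$$\beta_0 = 0, \qquad \beta_1 = 0$$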

Now, just to keep things simple, let’s say I’m dealing with a linear regression task, and those betas are my coefficients (beta 0 being the intercept, or bias term). I would then define my cost function – SSR, for example:
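$$\text{Cost} = \text{SSR} = \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$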

This can be further simplified:
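$$\text{Cost} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $\hat{y}_i = \beta_0 + \beta_1 x_i$ is the prediction for the i-th instance.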

It’s quite simple to read. You make a prediction, then subtract that prediction from the actual value, and you take the square of that. Then, you sum all of those squares for every instance in the dataset.

Now comes the part where you should know a bit of calculus to fully understand what’s going on. You need to calculate partial derivatives for each of the coefficients, so the coefficients can be updated later. The topics you should be familiar with are:

  • Power rule
  • Chain rule
  • Multivariate differentiation

Some time ago I wrote an article on taking derivatives in Python, and it covers those topics to a degree:

Taking Derivatives in Python

Back to the point. As my model has only two coefficients, I need to calculate two partial derivatives, one with respect to beta 0, and the other with respect to beta 1. Here’s how:
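$$\frac{\partial \text{SSR}}{\partial \beta_0} = \sum_{i=1}^{n} -2\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)$$

$$\frac{\partial \text{SSR}}{\partial \beta_1} = \sum_{i=1}^{n} -2x_i\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)$$

If you’d rather not differentiate by hand, here’s a quick sanity check with SymPy – just a sketch, assuming SymPy is installed, and working with a single instance for brevity:

```python
from sympy import symbols, diff

b0, b1, x, y = symbols('b0 b1 x y')

# Squared residual for a single instance of the dataset
residual_sq = (y - (b0 + b1 * x)) ** 2

# Partial derivatives with respect to each coefficient
print(diff(residual_sq, b0))  # mathematically equal to -2*(y - (b0 + b1*x))
print(diff(residual_sq, b1))  # mathematically equal to -2*x*(y - (b0 + b1*x))
```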

Now comes the part in which you take those two functions and do something known as an epoch – just a fancy word for a single iteration through the dataset. You’re going through the dataset, minding your own business, and keeping track of the cost for the coefficients – let’s denote them as:
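Let’s call them $D_{\beta_0}$ and $D_{\beta_1}$ – the two partial derivatives from above, accumulated over every instance in the dataset (the $D$ names are arbitrary, just a shorthand for this article):

$$D_{\beta_0} = \sum_{i=1}^{n} -2\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr), \qquad D_{\beta_1} = \sum_{i=1}^{n} -2x_i\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)$$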

Before you know it you’ve reached the end of the dataset, and it’s time to recalculate the coefficients. Let’s introduce one more term – the learning rate (LR). It determines to what extent newly acquired information overrides old information. You set it to some constant when initializing the coefficients, and it’s usually somewhere between 0.0001 and 0.1:
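For example – an arbitrary, purely illustrative choice:

$$\text{LR} = 0.001$$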

Setting the learning rate to too small a value will mean a lot of epochs are needed to find the optimal parameters, while setting it too high can cause the algorithm to ‘skip’ over the optimal values. The learning rate is needed to calculate the step size – a number which tells the algorithm how far to go in the next iteration.

If the function is steep at the given point, the step size will be larger, and as you approach the optimal parameters the step size will decrease, allowing the algorithm to hit the sweet spot.
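In symbols, the step size for each coefficient is just the learning rate multiplied by the corresponding partial derivative:

$$\text{step size} = \text{LR} \times \frac{\partial \text{Cost}}{\partial \beta}$$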

Once you’ve calculated those, you can update the coefficients as follows:
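$$\beta_0 = \beta_0 - \text{LR} \times D_{\beta_0}, \qquad \beta_1 = \beta_1 - \text{LR} \times D_{\beta_1}$$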

And that’s one epoch in a nutshell. Repeat this process, say, 10,000 times and you’ll get the optimal values for the coefficients.
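To make the whole loop concrete, here’s a minimal sketch of the procedure in Python – the function name, learning rate, and the tiny dataset are all made up for illustration, and it assumes the simple linear regression setup and SSR derivatives from above:

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.001, epochs=10_000):
    """Fit y = b0 + b1 * x with gradient descent on the SSR cost."""
    b0, b1 = 0.0, 0.0  # initialize both coefficients to zero

    for _ in range(epochs):
        y_pred = b0 + b1 * x  # predictions for every instance

        # Partial derivatives of SSR, accumulated over the dataset (one epoch)
        d_b0 = -2 * np.sum(y - y_pred)
        d_b1 = -2 * np.sum(x * (y - y_pred))

        # Step size = learning rate * derivative; subtract it to move downhill
        b0 -= learning_rate * d_b0
        b1 -= learning_rate * d_b1

    return b0, b1


# Tiny made-up dataset, roughly following y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

b0, b1 = gradient_descent(x, y)
print(b0, b1)  # should land close to the least-squares solution (roughly 1 and 2)
```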


Before You Leave

This should have been a manageable amount of text and math for understanding one of the most fundamental topics in machine learning and deep learning. If you still feel like you don’t understand it, grab a pen and paper, follow through the equations, and calculate everything by hand.

If that still sounds somewhat unmanageable to you, fear not – in a couple of days I’ll publish an article on using gradient descent to find optimal coefficient values for linear regression, and I’ll use precisely the equations listed above.

Is the concept clearer to you now? Please let me know.


Loved the article? Become a Medium member to continue learning without limits. I’ll receive a portion of your membership fee if you use the following link, at no extra cost to you.

Join Medium with my referral link – Dario Radečić

