Machine Learning 101: An Intuitive Introduction to Gradient Descent

Thalles Silva
Towards Data Science
8 min read · Mar 6, 2018


Gradient descent is, without a doubt, the heart and soul of most Machine Learning (ML) algorithms. I definitely believe you should take the time to understand it. Once you do, for starters, you will better comprehend how most ML algorithms work. Besides, understanding basic concepts is key to developing intuition about more complicated subjects.

To understand Gradient Descent at its heart, let's use a running example. The task is an old one in the field: predict house prices using some historical data as prior knowledge.

But our goal here is to talk about Gradient Descent. To do that, let’s make the example simple enough so we can concentrate on the good parts.

But before we go ahead, you can get the code here.

Basic Concepts

Suppose you want to climb a very tall hill. Your goal is to get to the top of the hill as fast as possible. You look around and realize you have more than one path to start off. Since you are at the bottom, all of these options seem to take you somewhat closer to the summit.

But you want to get to the top in the fastest way possible. So, how can you do that? How can you take a step that takes you as close as possible to the summit?

Up to this point, it is not clear how to take this step. That is where the Gradient can help you.

As stated in this Khan Academy video, the gradient captures all the partial derivatives of a multi-variable function.

Let’s go step by step and see how it works.

In simpler terms, the derivative is the rate of change or the slope of a function at a given point.

Take the function f(x) = x² as an example. The derivative of f(x) is another function, f'(x) = 2x, that computes the slope of f(x) at a given point x. In this situation, for x = 2, the slope of f(x) = x² is f'(2) = 2·2 = 4.
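We can sanity-check this slope numerically. The sketch below approximates the derivative of f(x) = x² at x = 2 with a small finite difference (the function and helper names are illustrative, not from the original code):

```python
# Check that the slope of f(x) = x**2 at x = 2 is 4,
# using a central finite-difference approximation.
def f(x):
    return x ** 2

def numerical_slope(func, x, h=1e-6):
    # Central difference: (f(x + h) - f(x - h)) / (2h)
    return (func(x + h) - func(x - h)) / (2 * h)

slope = numerical_slope(f, 2.0)
print(round(slope, 4))  # 4.0
```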

The slope of f(x)=x² at different points.

Simply put, the derivative points in the direction of steepest ascent. And the good thing is, the gradient is exactly the same thing, with one difference: the Gradient is a vector-valued function that stores partial derivatives. In other words, the gradient is a vector, and each of its components is a partial derivative with respect to one specific variable.

Take the function, f(x, y) = 2x² + y² as another example.

Here, f(x, y) is a multi-variable function. Its gradient is a vector, containing the partial derivatives of f(x, y). The first with respect to x, and the second with respect to y.

If we calculate the partials of f(x, y), we get:

∂f/∂x = 4x and ∂f/∂y = 2y

So the gradient is the following vector:

∇f(x, y) = [4x, 2y]

Note that each component indicates the direction of steepest ascent with respect to one of the function's variables. Put differently, the gradient points in the direction in which the function increases the most.
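To make this concrete, a small sketch can compare the analytic gradient [4x, 2y] against a finite-difference approximation at a sample point (the helper names here are illustrative):

```python
# Gradient of f(x, y) = 2x**2 + y**2: analytically [4x, 2y].
def f(x, y):
    return 2 * x ** 2 + y ** 2

def gradient(x, y):
    # Analytic partials: [df/dx, df/dy]
    return [4 * x, 2 * y]

def numerical_gradient(func, x, y, h=1e-6):
    # Central differences in each variable separately.
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return [dfdx, dfdy]

print(gradient(1.0, 3.0))               # [4.0, 6.0]
print(numerical_gradient(f, 1.0, 3.0))  # approximately [4.0, 6.0]
```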

Back to the hill climbing example, the gradient points you to the direction that takes you to the peak of the mountain the fastest. In other words, the gradient points to the higher altitudes of a surface.

In the same way, if we get a function with 4 variables, we would get a gradient vector with 4 partial derivatives. Generally, an n-variable function results in an n-dimensional gradient vector.

For Gradient descent, however, we do not want to maximize f as fast as we can, we want to minimize it.

But let’s define our task first and things will look much cleaner.

Predicting House Prices

We are going to solve the problem of predicting house prices based on historical data. To build a Machine Learning model, we often need at least 3 things: a task T, a performance measure P, and an experience E from which our model will learn patterns.

To solve task T, we are going to use a simple Linear Regression Model. This model will learn from experience E, and after training, it will be able to generalize its knowledge to unseen data.

The Linear Model is an excellent model to learn. It’s the foundation of many other ML algorithms like Neural Networks and Support Vector Machines.

For this example, the experience E is the HOUSES Dataset. The HOUSES dataset contains a collection of recent real estate listings in and around San Luis Obispo County.

The collection contains 781 data records and is available for download in CSV format here. Among the 8 available features, for simplicity, we are going to focus on only two of them: the Size and the Price. For each of the 781 records, the Size, in square feet, will be our input feature, and the Price our target value.

Besides, to check if our model is properly learning from experience E, we need a mechanism to measure its performance. To do that, we take the mean of squared errors (MSE) as our performance measure.

MSE has been a standard for linear regression for many years, but in theory any other error measure, like the absolute error, would work. One of the benefits of MSE is that it penalizes larger errors more heavily than the absolute error does.
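This penalty difference is easy to see on a toy set of residuals, where one error is much larger than the others (the numbers below are made up for illustration):

```python
# MSE vs. mean absolute error on the same residuals: the squared
# term lets the single large error (4.0) dominate the MSE.
errors = [0.5, 0.5, 4.0]

mse = sum(e ** 2 for e in errors) / len(errors)
mae = sum(abs(e) for e in errors) / len(errors)

print(mse)            # 5.5  (16.0 of it comes from the 4.0 error)
print(round(mae, 2))  # 1.67
```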

Now that we have formalized our learning algorithm, let’s dive into the code.

First, we load the data in Python using Pandas and separate the Size and Price features. Then we normalize the data to prevent features with larger ranges from dominating the others. Also, Gradient Descent is known to converge much faster on normalized data.
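A minimal sketch of this step might look like the following. The CSV filename and column names ('Size', 'Price') are assumptions about the dataset's layout, so check them against your copy of the file:

```python
import pandas as pd

def min_max_normalize(series):
    # Rescale values to the [0, 1] interval.
    return (series - series.min()) / (series.max() - series.min())

# Assumed loading step (filename and column names may differ):
# df = pd.read_csv('RealEstate.csv')
# x = min_max_normalize(df['Size']).values
# y = min_max_normalize(df['Price']).values

# Demonstration on a small stand-in series of house sizes:
sizes = pd.Series([1100.0, 1500.0, 2300.0])
print(min_max_normalize(sizes).tolist())  # [0.0, 0.333..., 1.0]
```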

Below, you can see the distribution of house prices by size in square feet.

Distribution of house prices by Size. The data is normalized to the [0,1] interval.

A Linear Regression model works by drawing a line on the data. As such, our model is represented by a simple line equation.

Line equation. m and b are the slope and the y-intercept respectively. The x variable is the placeholder for the input values.

For a Linear Model, the two free parameters are the slope m and the y-intercept b. These two variables are the knobs we are going to turn in order to find the best line equation.

Iteratively, we are going to make slight changes to them, so the model follows the direction of steepest descent on the error surface. After each iteration, these weight changes will refine our model so that it better represents the trends of the dataset.

Before going further, remember that for Gradient Descent, we want to take the direction opposite to the gradient.

You can think of Gradient Descent as a ball rolling down a valley. We want it to sit at the deepest place of the mountains; however, it is easy to see that things can go wrong.

In this analogy, the deepest valley is the optimal global minimum, and that is the place we aim for.

Depending on where the ball starts rolling, it may come to rest at the bottom of a valley that is not the lowest one. This is called a local minimum, and in the context of our model, the error surface plays the role of the valley.

Note that, in the analogy, not all local minima are bad. Some of them are actually almost as low (good) as the lowest (global) one. In fact, for high dimensional error surfaces, it is most common to settle in one of these (not so bad) local minima.

Similarly, the way we initialize our model weights may lead the model to rest in a local minimum. To avoid that, we initialize the two weights with values drawn from a random normal distribution with zero mean and low variance.

At each iteration, we take a random subset of our dataset, called a mini-batch, and linearly combine it with our weights. After the linear combination, we feed the resulting vector into the MSE function to calculate the current error.
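One mini-batch step of this kind can be sketched as below. The data here is synthetic, standing in for the normalized Size/Price columns, and the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(y_true, y_pred):
    # Mean of squared errors over the mini-batch.
    return np.mean((y_true - y_pred) ** 2)

# Toy normalized data standing in for the Size (x) and Price (y) columns.
x = rng.random(100)
y = 0.8 * x + 0.1

# Sample a mini-batch of 16 examples without replacement.
idx = rng.choice(len(x), size=16, replace=False)
x_batch, y_batch = x[idx], y[idx]

# Linear combination with the current weights:
# W1 is the slope, W0 the y-intercept.
W0, W1 = 0.0, 0.0
y_pred = W1 * x_batch + W0
print(mse(y_batch, y_pred))  # current error for this mini-batch
```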

With this error signal, we can calculate the partial derivatives of the error and get the Gradient.

First, we get the partial derivative with respect to W0:

∂Err/∂W0 = −(2/N) ∑ (y − (W1·x + W0))

Second, we do the same, but this time taking W1 as the variable:

∂Err/∂W1 = −(2/N) ∑ x · (y − (W1·x + W0))

With the two partials, we have the gradient vector:

∇Err = [ ∂Err/∂W0 , ∂Err/∂W1 ]

Where Err is the MSE error function: Err = (1/N) ∑ (y − (W1·x + W0))².

Having that, our next step is to update the weight vectors W0 and W1, using the gradients, to minimize the error.

We want to update the weights so they push the error down in the next iteration. To do that, we make them follow the direction opposite to their respective gradient components, taking small steps of size η in that direction.

The step size η is the learning rate, and it controls how large each update is. Empirically, a good starting point is 0.1. In the end, the update rule is set as:

W0 := W0 − η · ∂Err/∂W0
W1 := W1 − η · ∂Err/∂W1

In code, the complete model looks like this. Look at the minus sign in front of both gradients, DW0 and DW1: it guarantees that we take steps in the direction opposite to the gradient.
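A self-contained sketch of the full training loop follows. It uses synthetic data in place of the HOUSES dataset (true slope 0.7, intercept 0.2), and the variable names mirror the text rather than the original gist:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the normalized Size (x) and Price (y) columns.
x = rng.random(200)
y = 0.7 * x + 0.2 + rng.normal(0.0, 0.01, size=200)

# Initialize the two weights from a zero-mean, low-variance normal.
W0, W1 = rng.normal(0.0, 0.1, size=2)

lr = 0.1          # learning rate (eta)
batch_size = 32
epochs = 500

for _ in range(epochs):
    # Sample a random mini-batch.
    idx = rng.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]

    # Residuals of the current line on the mini-batch.
    error = yb - (W1 * xb + W0)

    # Partial derivatives of the MSE with respect to W0 and W1.
    DW0 = -2.0 * np.mean(error)
    DW1 = -2.0 * np.mean(xb * error)

    # Step against the gradient: note the minus sign.
    W0 = W0 - lr * DW0
    W1 = W1 - lr * DW1

print(round(W1, 2), round(W0, 2))  # close to the true 0.7 and 0.2
```

Note how DW0 and DW1 match the two partials derived above, term by term; the update step simply scales them by η and subtracts.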

After updating the weights, we repeat the process with another random mini-batch. And that is it.

Step by step, each weight update shifts the line slightly towards its best representation. In the end, when the error variance is small enough, we can stop learning.

Linear model convergence over time. The first weight updates cause the line to rapidly reach an ideal representation.

This version of Gradient Descent is called Mini-Batch Stochastic Gradient Descent. In this version, we use a small subset of the training data to calculate the gradient. Each mini-batch gradient offers an approximation to the optimal direction. Even though the gradients do not point in the exact optimal direction, in practice the method converges to very good solutions.

Error signal by epoch. Note that after decreasing the error signal very fast the model slows down and converges.

If you look closely at the error/epoch graph, you will notice that, in the beginning, learning occurs at a faster pace.

After some epochs, however, it slows down and plateaus. This happens because, in the beginning, the gradient vectors pointing down the steepest descent are large in magnitude. As a result, the two weights W0 and W1 undergo more drastic changes.

However, as they get closer to the bottom of the error surface, the gradient gradually gets smaller and smaller, which causes only very small changes to the weights.

In the end, the learning curve stabilizes, and the process is done.

Enjoy!
