Gradient Descent in Python

Sagar Mainkar
Towards Data Science
7 min read · Aug 25, 2018


When you venture into machine learning, one of the fundamental things to understand is gradient descent. Gradient descent is the backbone of many machine learning algorithms. In this article I am going to explain the fundamentals of gradient descent using Python code. Once you get hold of gradient descent, things start to become clearer and it is easier to understand different algorithms. Much has already been written on this topic, so this is not going to be ground-breaking. To follow along and build your own gradient descent you will need some basic Python packages, viz. NumPy, plus Matplotlib to visualize.

Let us start with some data; even better, let us create some data. We will create linear data with some random Gaussian noise.

import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

Next, let’s visualize the data.
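The original plot isn’t reproduced here, but a minimal sketch of generating and plotting the data could look like this (the fixed random seed is my assumption; the article does not set one):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)  # assumption: fixed seed for reproducibility
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Scatter plot of the noisy linear data
plt.scatter(X, y, s=10)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear data with Gaussian noise")
plt.show()
```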

It is evident that y has a nice linear relationship with X. This data is very simple and has only one independent variable, X.

You may have studied at college that a line can be expressed as

y = m·x + b

And then you can solve the equation for b and m as follows:

m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

b = ȳ − m·x̄

This is called the analytical method of solving the equation. Nothing wrong with it; however, remember that machine learning is about programming in matrices. For the sake of machine learning I can express the equation for a line in a different way: I would call y my hypothesis and represent it as h(theta), and call b theta0 and m theta1. I can write the same equation as:

h(Theta) = Theta0 + Theta1 · x

To solve for Theta0 and Theta1 analytically, I would have to write the following program:

# Normal equation: Theta = (XᵀX)⁻¹ Xᵀ y
theta_best = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

Remember that I have added a bias unit to X that is 1 for every vector in X. This is for ease of matrix multiplication when solving for Theta.

You can see that as the number of features in X increases, the load on the CPU/GPU to do the matrix multiplication increases; and if the number of features is really huge, like a million features, it would become almost infeasible for your computer to solve. This is where gradient descent comes to the rescue.

To explain gradient descent briefly: imagine that you are on a mountain, blindfolded, and your task is to come down to the flat land without assistance. The only assistance you have is a gadget which tells you your height above sea level. What would your approach be? You would start to descend in some random direction and then ask the gadget for the height. If the gadget reports a height greater than the initial one, you know you started in the wrong direction. You change direction and repeat the process. In this way, after many iterations, you finally descend successfully.

Well here is the analogy with machine learning terms now:

Size of the steps taken in any direction = Learning rate

Gadget telling you the height = Cost function

The direction of your steps = Gradients

Looks simple, but how can we represent this mathematically? Here is the maths:

J(Theta) = (1 / 2m) · Σᵢ₌₁ᵐ (h(Theta)ᵢ − yᵢ)²

where m = number of observations and h(Theta)ᵢ is the prediction for observation i

I am taking the example of linear regression. You start with a random Theta vector and predict h(Theta), then derive the cost using the above equation, which stands for the Mean Squared Error (MSE). Remember, to minimize cost you need to know your next step (i.e. the next Theta). The partial derivative is what helps find the Theta for the next iteration.

But wait: currently we need to calculate Theta0 and Theta1, and what if we had multiple features? Then we would have multiple Thetas. Don’t worry, here is a generalized form to calculate Theta:

Thetaⱼ := Thetaⱼ − (alpha / m) · Σᵢ₌₁ᵐ (h(Theta)ᵢ − yᵢ) · xᵢⱼ

where alpha = learning rate and xᵢⱼ is the value of feature j for observation i

All right, we are all set to write our own gradient descent. Although it might look overwhelming to begin with, with matrix programming it is just a piece of cake, trust me.

What are the things we need? A cost function which calculates the cost, and a gradient descent function which calculates the new Theta vector. That’s it, really, that’s it.

Let’s start with the cost function; here is the code:

Cost function

I will share my GitHub gist at the end of this article so you can download and run the code, but for now let us understand the cost function. It takes theta, X and y, where theta is a vector, X is a matrix of row vectors and y is a vector. This is how your data generally looks: X is a matrix of row vectors while y is a vector.
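The gist has the exact code; a cost function along these lines would match the description above (the shapes assume the bias column of ones has already been prepended to X):

```python
import numpy as np

def cal_cost(theta, X, y):
    """Mean squared error cost for linear regression.

    theta: (n, 1) parameter vector
    X:     (m, n) matrix of row vectors (with a bias column of ones)
    y:     (m, 1) target vector
    """
    m = len(y)
    predictions = X.dot(theta)
    return (1 / (2 * m)) * np.sum(np.square(predictions - y))
```

Note the 1/2 factor in front of the sum: it cancels when taking the partial derivative, which keeps the gradient expression clean.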

Remember, you do not need to call this function explicitly; our gradient descent method will call it internally. So let’s head to our gradient descent function.

Gradient Descent function

It takes three mandatory inputs: X, y and theta. You can adjust the learning rate and the number of iterations. As I said previously, we call cal_cost from within the gradient_descent function.
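Again, the gist is authoritative, but a batch gradient descent consistent with that description (mandatory X, y, theta; adjustable learning rate and iterations; calling cal_cost internally) might look like this sketch; the theta_history bookkeeping is my own addition for the convergence plots later:

```python
import numpy as np

def cal_cost(theta, X, y):
    m = len(y)
    return (1 / (2 * m)) * np.sum(np.square(X.dot(theta) - y))

def gradient_descent(X, y, theta, learning_rate=0.01, iterations=100):
    """Batch gradient descent: update theta using all observations each step."""
    m = len(y)
    cost_history = np.zeros(iterations)
    theta_history = np.zeros((iterations, theta.shape[0]))
    for it in range(iterations):
        prediction = X.dot(theta)
        # Gradient of the MSE cost: (1/m) * X^T (X.theta - y)
        theta = theta - learning_rate * (1 / m) * X.T.dot(prediction - y)
        theta_history[it, :] = theta.T
        cost_history[it] = cal_cost(theta, X, y)
    return theta, cost_history, theta_history
```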

Let us try to solve the problem we defined earlier using gradient descent.

We need to find theta0 and theta1, but we need to pass some initial theta vector to gradient descent. We can start with random values of theta from a Gaussian distribution, maybe 1,000 iterations and a learning rate of 0.01. The code snippet is self-explanatory.
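The snippet from the gist isn’t pasted here, so here is a hedged reconstruction; the seed and the print formatting are my assumptions, and cal_cost/gradient_descent are repeated in compact form so the snippet runs standalone:

```python
import numpy as np

def cal_cost(theta, X, y):
    m = len(y)
    return (1 / (2 * m)) * np.sum(np.square(X.dot(theta) - y))

def gradient_descent(X, y, theta, learning_rate=0.01, iterations=100):
    m = len(y)
    cost_history = np.zeros(iterations)
    for it in range(iterations):
        theta = theta - learning_rate * (1 / m) * X.T.dot(X.dot(theta) - y)
        cost_history[it] = cal_cost(theta, X, y)
    return theta, cost_history

np.random.seed(42)                       # assumption: the article does not fix a seed
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

lr = 0.01
n_iter = 1000
theta = np.random.randn(2, 1)            # random start from a Gaussian
X_b = np.c_[np.ones((len(X), 1)), X]     # prepend the bias column of ones

theta, cost_history = gradient_descent(X_b, y, theta, lr, n_iter)
print("Theta0: {:.3f}\nTheta1: {:.3f}".format(theta[0, 0], theta[1, 0]))
```

The exact numbers depend on the random draws, but with these settings theta lands close to the true (4, 3).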

We get theta0 = 4.11 and theta1 = 2.899, which is very close to our actual values of 4 and 3 for theta0 and theta1 respectively. But do we need to iterate 1,000 times and use a learning rate of 0.01? To answer this we need to look at how the cost varies with iterations, so let’s plot cost_history against iterations.

Looking at this graph, it is apparent that after around 180 iterations the cost does not reduce any further, which means we could use only 200 iterations. If we zoom in on the graph we can see this.

Also, we can notice that the cost reduces faster initially and then slows down.

You can try and play with different learning rate and iteration combinations.

It would be great to see how gradient descent actually converges to the solution with different learning rates and iteration counts, and it would make more sense if we could see everything in one go.

So why wait? Let’s plot the convergence graphs and cost vs. iterations for four combinations of iterations and learning rates:

it_lr = [(2000, 0.001), (500, 0.01), (200, 0.05), (100, 0.1)]

For brevity I am not pasting the code here but only the graphs, please feel free to check out the full code on my GitHub link.

Check how with small learning rates it takes long to converge to the solution, whereas with larger learning rates it is quicker.

A note of caution: the actual learning rate for your problem depends on the data, and there is no general formula to set it right. However, there are sophisticated optimization algorithms which start with a larger learning rate and then slowly reduce it as we approach the solution, e.g. the Adam optimizer.
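Adam itself is more involved than this, but the idea of shrinking the step size over time can be sketched with a simple exponential decay schedule; the function name and decay_rate value here are my own illustrative choices:

```python
import numpy as np

def decayed_lr(initial_lr, iteration, decay_rate=0.005):
    """Exponentially shrink the learning rate as iterations progress."""
    return initial_lr * np.exp(-decay_rate * iteration)

# Large steps early, small careful steps near the solution:
for it in [0, 100, 500, 1000]:
    print(it, round(decayed_lr(0.1, it), 5))
```

You would call decayed_lr(lr, it) inside the gradient descent loop instead of using a constant learning rate.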

Stochastic Gradient Descent (SGD)

You may have heard of this term and may be wondering what it is. It is very simple to understand: in our gradient descent algorithm we used all the observations to compute the gradient at every step, whereas in stochastic gradient descent we use one randomly chosen observation per update. It is called stochastic because the samples are selected randomly (or shuffled) instead of as a single group (as in standard gradient descent) or in the order they appear in the training set.
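A sketch of what that looks like in code, reusing the shapes from before (the per-pass cost bookkeeping and the fixed seed are my own choices, not from the gist):

```python
import numpy as np

def stochastic_gradient_descent(X, y, theta, learning_rate=0.01, iterations=10):
    """SGD: each update uses one randomly chosen observation."""
    m = len(y)
    cost_history = np.zeros(iterations)
    for it in range(iterations):
        cost = 0.0
        for _ in range(m):                  # m random draws per pass
            i = np.random.randint(0, m)
            X_i = X[i:i + 1]                # keep the 2-D shape (1, n)
            y_i = y[i:i + 1]
            prediction = X_i.dot(theta)
            theta = theta - learning_rate * X_i.T.dot(prediction - y_i)
            cost += ((prediction - y_i).item() ** 2) / 2
        cost_history[it] = cost / m
    return theta, cost_history

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
theta, cost_history = stochastic_gradient_descent(X_b, y, np.random.randn(2, 1),
                                                  learning_rate=0.01, iterations=50)
```

Because each step sees only one noisy observation, the cost bounces around instead of decreasing smoothly, but on average it still heads toward the minimum.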

Mini Batch Gradient Descent

In actual practice we use an approach called mini-batch gradient descent. This approach uses random samples, but in batches. What this means is that we do not calculate the gradient for each observation individually but for a group of observations, which results in faster optimization. A simple way to implement it is to shuffle the observations, create batches, and then proceed with gradient descent using those batches.
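The shuffle-then-batch recipe above can be sketched like this; the batch_size default and the seed are my assumptions for illustration:

```python
import numpy as np

def minibatch_gradient_descent(X, y, theta, learning_rate=0.05,
                               iterations=100, batch_size=20):
    """Shuffle the observations, slice them into batches, then descend."""
    m = len(y)
    cost_history = np.zeros(iterations)
    for it in range(iterations):
        cost = 0.0
        shuffled = np.random.permutation(m)          # new shuffle each pass
        for start in range(0, m, batch_size):
            batch = shuffled[start:start + batch_size]
            X_i, y_i = X[batch], y[batch]
            prediction = X_i.dot(theta)
            # Gradient over the batch rather than a single observation
            theta = theta - learning_rate * (1 / len(batch)) * X_i.T.dot(prediction - y_i)
            cost += float(np.sum(np.square(prediction - y_i)) / (2 * len(batch)))
        cost_history[it] = cost
    return theta, cost_history

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
theta, cost_history = minibatch_gradient_descent(X_b, y, np.random.randn(2, 1))
```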

We implemented gradient descent for linear regression, but you can do it for logistic regression or any other algorithm. What would change is the cost function and the way you calculate the gradients, so we would need to define those for each algorithm.

This was a simplified explanation of gradient descent; in practice you do not need to write your own, as there are numerous sophisticated implementations available.

Here is the link to the GitHub gist.

If you want to see a running example, please check it out on Google Colab here.

