An overview of the Gradient Descent algorithm

The subtle yet powerful algorithm that optimizes parameters

Nishit Jain
Towards Data Science


Optimizing parameters is the ultimate goal of every machine learning algorithm. You want to get the optimum value of the slope and the intercept to get the line of best fit in linear regression problems. You also want to get the optimum value for the parameters of a sigmoidal curve in logistic regression problems. So what if I told you that Gradient Descent does it all?

Before learning how it works, let’s first make clear the meaning of the words Gradient and Descent along with some other key terms.

Terminology

Loss Function: A function that returns the cost associated with the model, measuring how well our model is doing on the training data. If the cost is too high, it means that the predictions of our model deviate too much from the observed data. In any machine learning algorithm, our ultimate mission is to minimize the loss function. Commonly used loss functions include the following (a short NumPy sketch of these losses follows the lists):

Regression Losses:

  1. L1 Loss / Mean Absolute Error
  2. L2 Loss / Mean Squared Error
  3. Root Mean Squared Error

Classification Losses:

  1. Log Loss (Cross-Entropy Loss)
  2. SVM Loss (Hinge Loss)
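
As a quick illustration, here is a minimal NumPy sketch of these five losses (the function names are my own, not from any particular library):

import numpy as np

def mae(y_true, y_pred):
    # L1 loss: mean of the absolute residuals
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    # L2 loss: mean of the squared residuals
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    # Root mean squared error: square root of the L2 loss
    return np.sqrt(mse(y_true, y_pred))

def log_loss(y_true, p_pred, eps=1e-15):
    # Cross-entropy for binary labels in {0, 1}; clipping avoids log(0)
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def hinge_loss(y_true, scores):
    # SVM hinge loss for labels in {-1, +1}
    return np.mean(np.maximum(0, 1 - y_true * scores))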

Learning Rate: This is the hyperparameter that determines the size of the steps the gradient descent algorithm takes. Gradient descent is very sensitive to the learning rate: if it is too big, the algorithm may overshoot the minimum and bounce past it; if it is too small, the total computation time can grow to a very large extent. We will see the effect of the learning rate in depth later in the article.
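
To make this concrete, here is a small self-contained sketch (a toy example of my own, not the article's dataset) that minimizes f(x) = x², whose gradient is 2x, with different learning rates:

def minimize_quadratic(alpha, x0=5.0, iterations=20):
    # Gradient descent on f(x) = x^2; the update is x <- x - alpha * 2x
    x = x0
    for _ in range(iterations):
        x = x - alpha * 2 * x
    return x

for alpha in (0.01, 0.1, 0.9, 1.1):
    print(alpha, minimize_quadratic(alpha))
# alpha = 0.01: slow, x is still far from the minimum at 0
# alpha = 0.1:  converges smoothly towards 0
# alpha = 0.9:  overshoots 0 on every step but still converges
# alpha = 1.1:  diverges, |x| grows with every iteration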

Gradient: Basically, it is a measure of the steepness of a slope. Technically, the gradient is the vector of first-order partial derivatives of a function, one per variable, and it points in the direction of steepest ascent. For example, in linear regression we have two parameters to optimize, the slope and the intercept, so we calculate the partial derivative of the cost w.r.t. the slope and w.r.t. the intercept, and together these make up the gradient.
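
A minimal sketch with a toy function of my own, showing the gradient as a vector of partial derivatives:

import numpy as np

def f(theta):
    # Toy function of two variables: f(a, b) = (a - 1)^2 + (b + 2)^2
    a, b = theta
    return (a - 1) ** 2 + (b + 2) ** 2

def gradient(theta):
    # Vector of first-order partial derivatives: [df/da, df/db]
    a, b = theta
    return np.array([2 * (a - 1), 2 * (b + 2)])

print(gradient(np.array([0.0, 0.0])))  # [-2.  4.]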

Descent: To optimize the parameters, we need to minimize the error. The aim of the gradient descent algorithm is to reach a minimum of the function. (We always aim for the global minimum; for a convex cost like the one in linear regression the only local minimum is the global minimum, but on non-convex functions, once gradient descent settles into a local minimum it is very hard for it to escape to the global one.) The algorithm accomplishes this through an iterative process, calculating a step at every iteration, and this iterative stepping down to a minimum is known as the descent (enough of that going down the hill example).

Implementation

Now, let’s get into the technicalities of the algorithm. We will be using first-order derivatives to calculate the gradient. For the demonstration, we will be using NumPy to apply gradient descent on a linear regression problem.

Let’s first generate a randomized dataset using NumPy’s random module and plot it to visualize the distribution with a scatter plot.

# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
# Generating a randomized dataset
X = 3*np.random.rand(100,1)
y = 9 + 2*X + np.random.rand(100,1)
# Scatter plot (the figure must be created before plotting into it)
plt.figure(figsize=(15,25))
plt.scatter(X,y)
plt.xlabel('X')
plt.ylabel('y')
plt.title('X vs y')
plt.show()

Scatter Plot: X vs y

In the above code, we have taken two NumPy arrays for independent and dependent variables, that is X and y respectively. It is clear by looking at the graph that the relationship is linear. We know that a linear relation can be put into a function as follows:

y = \theta_0 + \theta_1 x

Hypothesis function for our linear problem

Analytical Method

We can calculate the values for theta_0 & theta_1 using the analytical method as shown below:
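
These are the standard least squares estimates:

\theta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}}, \qquad \theta_0 = \bar{y} - \theta_1 \bar{x}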

# Means of the X and y vectors
mean_x = X.mean()
mean_y = y.mean()
# Calculating cross-deviation about x and y, and deviation about x
ssxy = (X*y).sum() - (len(X)*mean_x*mean_y)
ssxx = ((X - mean_x)**2).sum()
# Calculating the regression coefficients
theta_1 = ssxy/ssxx
theta_0 = mean_y - (theta_1*mean_x)
# Printing both values
print('Theta 0: {:0.3f}'.format(theta_0))
print('Theta 1: {:0.3f}'.format(theta_1))

Output:
Theta 0: 9.482
Theta 1: 1.998

Then why do we need gradient descent if we can calculate the optimum parameters analytically? Because the analytical route becomes computationally expensive as the problem grows: the general closed-form solution (the normal equation) requires inverting a matrix whose size grows with the number of features, and it must process the entire dataset at once. Gradient descent gives us very similar results while scaling far better to large problems. Let’s solve for theta_0 and theta_1 using gradient descent and see for ourselves.
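
For context, here is a minimal sketch of that general analytical route (the normal equation; X_bias, also defined later in the article, is the feature matrix with a leading column of ones):

# Normal equation: theta = (X^T X)^(-1) X^T y
# Inverting X^T X is the step that becomes expensive as features grow.
X_bias = np.c_[np.ones((len(X), 1)), X]
theta_best = np.linalg.inv(X_bias.T.dot(X_bias)).dot(X_bias.T).dot(y)
print(theta_best)   # column vector [theta_0, theta_1]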

Mathematically, the cost function and the gradient can be represented as follows:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

Gradient & cost function for our problem, where h_\theta(x) = \theta_0 + \theta_1 x and m is the number of training examples

Intuition Behind the Cost Function

Our goal here is to bring the cost function as close to zero as possible. A high negative residual is just as bad as a high positive one, so to keep every term of the cost >= 0 we square the residuals. We could have done the same using absolute values, but there are two main reasons why we do not.

  1. Using absolute values would penalize the model in direct proportion to the residual, whereas we want a weighted penalizing rule in which data points with larger residuals are penalized disproportionately more and smaller ones less. Squaring the residuals/errors achieves exactly that (see the short numeric sketch after this list).
  2. Also, under the Gaussian noise model, there are two kinds of error, systematic and random. Simply stated, a systematic error is one that follows a certain direction; it is consistent in nature and predictable. A random error, on the other hand, is noise in our distribution. Once we’ve taken care of the systematic component, the best predictor is obtained when the random noise is minimized. Put another way, the best predictor is the one with the tightest distribution (smallest variance) around the predicted value, and minimizing the squared loss is the same thing as minimizing that variance. That explains why the squared loss works for a wide range of problems: the underlying noise is very often Gaussian, because of the CLT (Central Limit Theorem), and minimizing the squared error then turns out to be the right thing to do.
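
A quick numeric illustration of point 1 (toy residuals of my own choosing): a residual of 4 costs 4 under the absolute rule but 16 after squaring, so large residuals dominate the squared loss.

import numpy as np

residuals = np.array([0.5, 1.0, 4.0])
# Absolute penalties grow linearly with the residual...
print(np.abs(residuals))       # [0.5  1.  4.]
# ...while squared penalties weight large residuals much more heavily:
print(np.square(residuals))    # [0.25  1.  16.]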

We put in the ½ so that the 2 produced when differentiating the squared term cancels out, making the gradient more concise. In any case, a constant factor does not matter when calculating theta’s value: it scales the gradient but does not move the location of the minimum.

Also, we average the function over m, the total number of data points in the training dataset. This keeps the scale of the cost and the gradient independent of the dataset size, so the values stay comparable even when new data points come in.

Implementation With Gradient Descent

Before we implement gradient descent, knowing the intuition behind it is a must. Since we are now done with that part, let’s dive into the actual implementation of it.

We want to calculate the values of theta_0 and theta_1, but in general we can have multiple features (>= 2). In that case, the general formula for calculating consecutive step sizes is

\theta_j := \theta_j - \alpha \, \frac{\partial J}{\partial \theta_j}

The general formula for getting consecutive theta values

where α is the learning rate. We can now see from the formula that α strongly influences the size of each step, since the gradient is multiplied by it at every iteration.

Enough of mathematics, let us start with the actual code. To begin, we initialize random values for the parameters to be optimized, theta_0 and theta_1. We then write two functions: one to calculate the cost and one to run gradient descent iteratively, storing the cost and theta of every iteration in two distinct NumPy arrays. The above-mentioned formulas are used for calculating the cost and the consecutive values of theta from the gradient.

def cost(theta, X, y):
    '''
    Calculates the cost of the model.
    X & y have their usual meaning.
    theta - vector of coefficients.
    '''
    m = len(y)
    # Calculating cost: note the parentheses, 1/(2*m), not (1/2)*m
    c = (1/(2*m)) * np.sum(np.square(X.dot(theta) - y))
    return c

def gradient_descent(X, y, theta, alpha, iterations):
    '''
    Returns the final theta, the array of thetas, and the cost of every iteration.
    X - feature matrix with added bias column.
    y - target variable matrix.
    theta - matrix of regression coefficients.
    alpha - learning rate.
    iterations - number of iterations to be run.
    '''
    # Getting the number of observations.
    m = len(y)

    # Initializing the cost and theta arrays with zeroes.
    thetas = np.zeros((iterations, 2))
    costs = np.zeros(iterations)

    # Calculating theta for every iteration.
    for i in range(iterations):
        theta = theta - (1/m)*alpha*(X.T.dot(X.dot(theta) - y))
        thetas[i, :] = theta.T
        costs[i] = cost(theta, X, y)

    return theta, thetas, costs

# Learning rate
alpha = 0.01
# Number of iterations
iterations = 3000
# Initializing random values to give the algorithm a starting point.
theta = np.random.randn(2, 1)
# Adding a bias column of ones to the features array.
X_bias = np.c_[np.ones((len(X), 1)), X]
# Running gradient descent
theta, thetas, costs = gradient_descent(X_bias, y, theta, alpha, iterations)
# Printing the final values.
print('Final Theta 0 value: {:0.3f}\nFinal Theta 1 value: {:0.3f}'.format(theta[0][0], theta[1][0]))
print('Final Cost Value: {:0.3f}'.format(costs[-1]))

Output:
Final Theta 0 value: 9.448
Final Theta 1 value: 2.015
Final Cost Value: 0.041

In the code, we can see that we ran 3000 iterations, but that number was passed to the function explicitly. In real-life scenarios we do not do that: gradient descent should actually stop when the step size, or equivalently the improvement in the cost, approaches zero.
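
As a sketch of what that looks like (the tolerance value and function name are my own; this reuses the cost function defined above):

def gradient_descent_tol(X, y, theta, alpha, tol=1e-9, max_iterations=100000):
    '''
    Runs gradient descent until the improvement in cost between two
    consecutive iterations drops below tol, instead of running for a
    fixed number of iterations.
    '''
    m = len(y)
    prev_cost = cost(theta, X, y)
    for i in range(max_iterations):
        theta = theta - (1/m)*alpha*(X.T.dot(X.dot(theta) - y))
        current_cost = cost(theta, X, y)
        if abs(prev_cost - current_cost) < tol:
            break   # improvement is effectively zero, stop descending
        prev_cost = current_cost
    return theta, i + 1

With this variant, the number of iterations becomes an output of the algorithm rather than a guess supplied up front.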

Analysis

Let us check how the L2 Loss reduces along with increasing iterations by plotting a graph.

# Plotting a line plot of the number of iterations vs the cost
plt.plot(range(iterations), costs)
plt.xlabel('Number of Iterations')
plt.ylabel('Cost (L2 Loss)')
plt.title('Cost vs Iterations Analysis')
plt.show()

Number of Iterations vs Cost: The Elbow Method

In the above graph, we can see that the error drops significantly during the early iterations, but as the iterations increase there is little further reduction. After a certain point it nearly stabilizes, decreasing only negligibly.

That is the beauty of gradient descent: as it approaches the minimum, the gradient itself shrinks, so the steps (the learning rate times the gradient) automatically become very small and the algorithm does not overstep. Far away from the minimum, the gradient is large, so it takes bigger steps and gets there faster.

So, to select a good value for the number of iterations from this graph, we use the elbow method: the point at which the cost starts to decrease only negligibly (the elbow of the graph) is our value for the number of iterations.
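
A rough programmatic version of the same idea (the 0.1% threshold is an arbitrary choice of mine):

# Relative improvement of the cost between consecutive iterations
rel_improvement = -np.diff(costs) / costs[:-1]
# First iteration at which the cost improves by less than 0.1%
elbow = np.argmax(rel_improvement < 1e-3)
print('Elbow at iteration:', elbow)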

There are other types of gradient descent as well. What we did above is known as Batch Gradient Descent. The other types, sketched briefly after this list, are:

  1. Stochastic Gradient Descent.
  2. Mini Batch Gradient Descent.
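
As a rough sketch of the difference (a minimal mini-batch variant of the function above, with a batch size of my own choosing; stochastic gradient descent is the special case batch_size=1):

def minibatch_gradient_descent(X, y, theta, alpha, iterations, batch_size=16):
    '''
    Like batch gradient descent, but each update uses only a random
    mini-batch of the data; batch_size=1 gives stochastic gradient descent.
    '''
    m = len(y)
    for _ in range(iterations):
        idx = np.random.randint(0, m, batch_size)   # sample a mini-batch
        X_b, y_b = X[idx], y[idx]
        theta = theta - (1/batch_size)*alpha*(X_b.T.dot(X_b.dot(theta) - y_b))
    return theta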

Conclusion

Gradient descent can be used to optimize the parameters of every algorithm whose loss function can be formulated and has at least one minimum. Along the way, we also cleared up our understanding of the gradient descent cost function and its update rule.

We learned how to use a visual method to estimate the number of iterations required. Finally, we got to know the actual workings behind the scenes of the famous gradient descent algorithm.
