A Tour to Machine Learning and Deep Learning

An Introduction to Gradient Descent

From theory to practice, implement Batch Gradient Descent, Mini-Batch Gradient Descent and Stochastic Gradient Descent on a dataset

Yang
Towards Data Science
8 min readMay 13, 2019

--

This blog will cover following questions and topics:

1. What is a gradient?

2. What is gradient descent?

3. Three common gradient descents

4. Implementation in Python

1. Gradient

Gradient is a vector that is tangent of a function and points in the direction of greatest increase of this function. Gradient is zero at a local maximum or minimum because there is no single direction of increase. In mathematics, gradient is defined as partial derivative for every input variable of function. For example, we have a function:

The figure of function is shown below and we can see that the minimum of function is (0, 0).

In this case, x-component of gradient is the partial derivative respect to x and y-component of gradient is the partial derivative respect to y. The gradient of function above is:

If we want to find the direction that increases the function at most at point (1, 2), we can plug (1, 2) into the formula above and get:

2. Gradient Descent

As gradient is a vector pointing at the greatest increase of a function, negative gradient is a vector pointing at the greatest decrease of a function. Therefore, we can minimize a function by iteratively moving a little bit in the direction of negative gradient. That is the logic of gradient descent.

Given a starting point:

We can construct an iterative procedure:

The parameter alpha in the formula above is called learning rate, which is a small constant in most cases and ranges from 0 to 1. The iterative procedure will not stop until it converges. Taking previous example, we have already known that the gradient is:

Therefore, iterative procedure of gradient descent can be written as:

Then we can get:

Finally, assuming alpha is smaller than 1, then we can get:

The conclusion is same as what we observe in the figure above.

Another intuitive way to explain gradient descent is that ‘Consider the 3-dimensional graph below in the context of a function. Our goal is to move from the mountain in the top right corner to the dark blue sea in the bottom left. The arrows represent the direction of steepest descent (negative gradient) from any given point–the direction that decreases the function as quickly as possible.’ [3]

Image Source: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

In machine learning, we focus more on the relationship between cost function and parameters, instead of dependent and independent variables. The general idea of gradient descent in machine learning is to tweak parameters iteratively in order to minimize a cost function.

Image Source: https://saugatbhattarai.com.np/what-is-gradient-descent-in-machine-learning/

In some cases, we can apply a closed-form equation to calculate the parameters directly that best fit the model to training dataset. For example, to minimize MSE for linear regression, parameters can be written as:

However, in other cases we do not have closed-form equation, such as logistic regression. Therefore, an iterative optimization approach like gradient descent is applied.

An important parameter in gradient descent is learning rate, which determine the size of each step. When learning rate is too big, gradient descent may jump across the valley and end up on the other side. This will lead to the cost function diverge. On the other hand, when learning rate is too small, it will take the algorithm long time to converge. Therefore, proper learning rate is needed before gradient descent starts.

Image Source: https://towardsdatascience.com/gradient-descent-in-a-nutshell-eaf8c18212f0

Normalization plays important role to Gradient Descent. If features are not normalized, features with large scale will dominate the update, so the algorithm will generate a zigzag learning path. It takes many unnecessary steps and longer time to reach to the minimum. After all features are normalized, the cost function is a more spherical shape. Gradient Descent algorithm goes straight towards minimum. One way to perform normalization is to minus mean and divide by standard deviation. You can also apply StandardScaler function in Scikit-Learn directly.

Image Source: https://www.jeremyjordan.me/batch-normalization/

3. Three Typical Gradient Descents

We will look into several variants of gradient descent that are widely used in machine learning: Batch Gradient Descent, Mini-batch Gradient Descent, and Stochastic Gradient Descent.

Batch Gradient Descent

Batch Gradient Descent uses the whole batch of training data at every step. It calculates the error for each record and takes an average to determine the gradient. The advantage of Batch Gradient Descent is that the algorithm is more computational efficient and it produces a stable learning path, so it is easier to convergence. However, Batch Gradient Descent takes longer time when the training set is large.

Stochastic Gradient Descent

In the other extreme case, Stochastic Gradient Descent just picks one instance from training set at every step and update gradient only based on that single record. The advantage of Stochastic Gradient Descent is that the algorithm is much faster at every iteration, which remediate the limitation in Batch Gradient Descent. However, the algorithm produces less regular and stable learning path compared to Batch Gradient Descent. Instead of decreasing smoothly, the cost function will bounce up and down. After rounds of iterations, the algorithm may find a good parameter, but the final result is not necessary global optimal.

Mini-Batch Gradient Descent

Mini-Batch Gradient Descent combines the concept of Batch and Stochastic Gradient Descent. At each step, the algorithm compute gradient based on a subset of training set instead of full data set or only one record. The advantage of Mini-Batch Gradient Descent is that the algorithm can take advantage of matrix operation during calculation and cost function can decrease more smoothly and stable than Stochastic Gradient Descent.

Image Source: https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3

4. Implementation in Python

In this part, I will use famous data set iris to show how gradient decent works in logistic regression.

First, import the package.

from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.lines as mlines

Next, load data. Note that for simplicity, I only choose 2 kinds of iris.

# Load data
iris = datasets.load_iris()
X=iris.data[0:99,:2]
y=iris.target[0:99]
# Plot the training points
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
plt.figure(2, figsize=(8, 6))
plt.clf()
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

Batch Gradient Descent

# Function for batch gradient decent    
def Batch_GD (Learning_Rate,num_iterations,X,y):
#Step 1: Initial Parameter
N=len(X)
w=np.zeros((X.shape[1],1))
b=0
costs=[]
# Starting Loop
for i in range(num_iterations):
#Step 2: Apply Sigmoid Function and get y prediction
Z=np.dot(w.T,X.T)+b
y_pred=1/(1+1/np.exp(Z))

#Step 3: Calculate Loss Function
cost=-(1/N)*np.sum(y*np.log(y_pred)+(1-y)*np.log(1-y_pred))

#Step 4: Calculate Gradient
dw=1/N*np.dot(X.T,(y_pred-y).T)
db=1/N*np.sum(y_pred-y)

#Step 5: Update w & b
w = w - Learning_Rate * dw
b = b - Learning_Rate * db

# Records cost
if i % 1000 == 0:
costs.append(cost)
#print(cost)
return(w,b,costs)
# Run a function
Result_BatchGD=Batch_GD(Learning_Rate=0.01,num_iterations=100000,X=X,y=y)

Stochastic Gradient Descent

# Function for Stochastic Gradient Descent       
def Stochastic_GD (Learning_Rate,num_iterations,X,y):
# Step 1: Initial Parameter
N=len(X)
w=np.zeros((X.shape[1],1))
b=0
costs=[]
# Starting two layer of loops
for i in range(num_iterations):
for j in range(N):
# Choose 1 record
XX=X[j,:]
yy=y[j]
# Step 2: Apply Sigmoid Function and get y prediction
Z=np.dot(w.T,XX.T)+b
y_pred=1/(1+1/np.exp(Z))
#Step 3: Calculate Loss Function
cost=-(yy*np.log(y_pred)+(1-yy)*np.log(1-y_pred))
#Step 4: Calculate Gradient
dw=np.multiply(XX,(y_pred-yy)).reshape((2,1))
db=y_pred-yy
#Step 5: Update w & b
w = w - Learning_Rate * dw
b = b - Learning_Rate * db

#Step 6: Calculate Loss Function
Z_full=np.dot(w.T,X.T)+b
y_pred_full=1/(1+1/np.exp(Z_full))
cost=-(1/N)*np.sum(y*np.log(y_pred_full)+(1-y)*np.log(1-y_pred_full))
#Records cost
if i % 100 == 0:
costs.append(cost)
#print(cost)

return(w,b,costs)
# Run a function
Result_Stoc_GD=Stochastic_GD(Learning_Rate=0.01,num_iterations=2000,X=X,y=y)

Mini-batch Gradient Descent

# Function for mini batch Gradient Descent
def Minibatch_GD (Learning_Rate,num_iterations,X,y,Minibatch):
# Part 1: Mini Batch
np.random.seed(1000)
N=len(X)
mini_batches=[]

#Step 1: Shuffle (X,y)
permutation=list(np.random.permutation(N))
shuffled_X=X[permutation,:]
shuffled_y=y[permutation]

#Step 2: Partition
num_complete_minibatches=int(np.floor(N/Minibatch))

for i in range(num_complete_minibatches):
mini_batch_X=shuffled_X[i*Minibatch:(i+1)*Minibatch,:]
mini_batch_y=shuffled_y[i*Minibatch:(i+1)*Minibatch]

mini_batch = (mini_batch_X, mini_batch_y)
mini_batches.append(mini_batch)

if N % Minibatch !=0:
mini_batch_X=shuffled_X[N-Minibatch:N,:]
mini_batch_y=shuffled_y[N-Minibatch:N]

mini_batch = (mini_batch_X, mini_batch_y)
mini_batches.append(mini_batch)

# Part 2: Gradient Descent
w=np.zeros((X.shape[1],1))
b=0
costs=[]

for i in range(num_iterations):
for j in range(num_complete_minibatches+1):
#Select Minibatch
XX=mini_batches[j][0]
yy=mini_batches[j][1]
#Step 2: Apply Sigmoid Function and get y prediction
Z=np.dot(w.T,XX.T)+b
y_pred=1/(1+1/np.exp(Z))

#Step 3: Calculate Gradient
dw=1/Minibatch*np.dot(XX.T,(y_pred-yy).T)
db=1/Minibatch*np.sum(y_pred-yy)
#Step 4: Update w & b
w = w - Learning_Rate * dw
b = b - Learning_Rate * db

#Step 5: Calculate Loss Function
Z_full=np.dot(w.T,X.T)+b
y_pred_full=1/(1+1/np.exp(Z_full))
cost=-(1/N)*np.sum(y*np.log(y_pred_full)+(1-y)*np.log(1-y_pred_full))

if i % 1000 ==0:
costs.append(cost)
#print(cost)

return(w,b,costs)
# Run a function
Result_MiniGD=Minibatch_GD(Learning_Rate=0.01,num_iterations=100000,X=X,y=y,Minibatch=50)

Visualization Results

# Plot linear classification
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,edgecolor='k')
line_B_GD=mlines.Line2D([0,7],[-0.5527,4.1577],color='red')
line_Mini_GD=mlines.Line2D([0,7],[-0.56185,4.1674],color='blue')
line_Sto_GD=mlines.Line2D([0,7],[-0.5488,4.1828],color='green')
ax.add_line(line_B_GD)
ax.add_line(line_Mini_GD)
ax.add_line(line_Sto_GD)
ax.set_xlabel('Sepal length')
plt.show()

From figure above, we can see that three kinds of gradient descents generate similar linear decision boundry.

Summary

In this post, you learned a lot about gradient descent. You now know the basic mathematics behind gradient and you have an understanding how the algorithm works behind the scenes. Second, you learned why learning rate and normalization are so important to the success of algorithm. Finally, you learned about the most common types of gradient descent and how to implement those algorithms in python. This knowledge enables you to better understand machine learning and deep learning. You can read more blogs by clicking on the following link:

Reference

[1] Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, 2018

[2] Ian Goodfellow, Yoshua Bengio, Aaron Courville, (2017) Deep Learning

[3] https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

--

--