This article will look at both programming assignments 3 and 4 on neural networks from Andrew Ng's Machine Learning Course. These are also the first complex non-linear algorithms we have encountered so far in the course. I do not know about you, but there was definitely a steep learning curve in this assignment for me. Neural networks form the basis of deep learning, which has widespread applications such as computer vision and natural language processing. As such, it is important to get the fundamentals right, and coding these assignments in Python is one way to ensure that.
Before getting into Neural Networks, let’s complete the last section for logistic regression – Multi-class Logistic Regression.
This series of exercises makes use of a handwritten digits dataset that consists of 5000 training examples, where each example is a 20 pixel by 20 pixel grayscale image of a digit.
Loading the dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
# Use loadmat to load matlab files
mat=loadmat("ex3data1.mat")
X=mat["X"]
y=mat["y"]
Since the dataset was given in .mat format instead of the usual .txt format, I had to make use of scipy's loadmat function for the job. The official documentation can be found here. Since loadmat loads .mat files as a dictionary with variable names as keys, assigning X and y is as simple as accessing the dictionary with the variables' keys.
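If you are ever unsure which variables a .mat file contains, listing the dictionary keys first is a handy optional step; loadmat also adds a few metadata entries of its own:
print(mat.keys()) # data variables such as 'X' and 'y', plus metadata keys like '__header__', '__version__' and '__globals__'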
X.shape, y.shape
To better understand the dataset, it helps to look at its shape, which tells us the dimensions of the data. X has a shape of (5000, 400), which corresponds to 5000 training examples, each with 400 features from its 20 x 20 pixels. y has a shape of (5000, 1), in which each training example has a label ranging from 1 to 10 (the '0' digit is labeled as '10' in this dataset).
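A quick optional sanity check of the distinct label values confirms this encoding:
print(np.unique(y)) # labels 1 to 10, with 10 standing in for the digit '0'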
Visualizing the data
import matplotlib.image as mpimg
fig, axis = plt.subplots(10,10,figsize=(8,8))
for i in range(10):
    for j in range(10):
        axis[i,j].imshow(X[np.random.randint(0,5000),:].reshape(20,20,order="F"), cmap="hot") # reshape back to 20 pixel by 20 pixel
        axis[i,j].axis("off")
The code block above constructs 100 subplots and randomly visualizes 100 out of the 5000 training examples using plt.imshow. Take note that we have to reshape each training example back to 20 x 20 pixels before we can visualize it, and passing order="F" as a parameter to the reshape function ensures that the orientation of the image is upright.
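Here is a tiny toy example (not from the assignment) showing the difference the order parameter makes: the default C order fills the array row by row, while Fortran order fills it column by column, which matches how the pixel values were stored in MATLAB.
demo = np.arange(6)
print(demo.reshape(2,3))           # C order (row by row):        [[0 1 2] [3 4 5]]
print(demo.reshape(2,3,order="F")) # Fortran order (column-wise): [[0 2 4] [1 3 5]]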

Computing the cost function and gradient
def sigmoid(z):
    """
    return the sigmoid of z
    """
    return 1 / (1 + np.exp(-z))

def lrCostFunction(theta, X, y, Lambda):
    """
    Takes in numpy array of theta, X, y, and float lambda to compute the regularized logistic cost function
    """
    m = len(y)
    predictions = sigmoid(X @ theta)
    error = (-y * np.log(predictions)) - ((1-y) * np.log(1-predictions))
    cost = 1/m * sum(error)
    regCost = cost + Lambda/(2*m) * sum(theta[1:]**2)
    
    # compute gradient
    j_0 = 1/m * (X.transpose() @ (predictions - y))[0]
    j_1 = 1/m * (X.transpose() @ (predictions - y))[1:] + (Lambda/m) * theta[1:]
    grad = np.vstack((j_0[:,np.newaxis], j_1))
    return regCost[0], grad
This is similar to the cost function we used back in the Logistic Regression assignment.
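For reference, these are the regularized cost and gradient that lrCostFunction implements (the bias term is excluded from the regularization). The test values that follow check the implementation against the expected output given in the assignment.

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log\big(h_\theta(x^{(i)})\big) - \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_0^{(i)}$$

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\theta_j \qquad (j \ge 1)$$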
theta_t = np.array([-2,-1,1,2]).reshape(4,1)
X_t =np.array([np.linspace(0.1,1.5,15)]).reshape(3,5).T
X_t = np.hstack((np.ones((5,1)), X_t))
y_t = np.array([1,0,1,0,1]).reshape(5,1)
J, grad = lrCostFunction(theta_t, X_t, y_t, 3)
print("Cost:",J,"Expected cost: 2.534819")
print("Gradients:n",grad,"nExpected gradients:n 0.146561n -0.548558n 0.724722n 1.398003")

Now for the classification task. Since we have more than one class, we will have to train multiple logistic regression classifiers using the one-vs-all classification method (one classifier for each class).
def gradientDescent(X,y,theta,alpha,num_iters,Lambda):
    """
    Take in numpy array X, y and theta and update theta by taking num_iters gradient steps
    with learning rate of alpha
    return theta and the list of the cost of theta during each iteration
    """
    m = len(y)
    J_history = []
    for i in range(num_iters):
        cost, grad = lrCostFunction(theta,X,y,Lambda)
        theta = theta - (alpha * grad)
        J_history.append(cost)
    return theta, J_history

def oneVsAll(X, y, num_labels, Lambda):
    """
    Takes in numpy array of X,y, int num_labels and float lambda to train multiple logistic regression classifiers
    depending on the number of num_labels using gradient descent.
    Returns a matrix of theta, where the i-th row corresponds to the classifier for label i
    """
    m, n = X.shape[0], X.shape[1]
    initial_theta = np.zeros((n+1,1))
    all_theta = []
    all_J = []
    
    # add intercept terms
    X = np.hstack((np.ones((m,1)),X))
    
    for i in range(1,num_labels+1):
        theta, J_history = gradientDescent(X,np.where(y==i,1,0),initial_theta,1,300,Lambda)
        all_theta.extend(theta)
        all_J.extend(J_history)
    return np.array(all_theta).reshape(num_labels,n+1), all_J
The gradientDescent function is the usual optimizing function that we implemented previously. As for oneVsAll, it iterates through all the classes and trains a set of theta for each class using gradient descent (the fmincg function was used in the assignment). all_theta then captures all the optimized theta values in a list and returns them as a numpy array, reshaped into a matrix of theta where the i-th row corresponds to the classifier for label i. np.where comes in handy here to get a vector of y with 1/0 for each class to conduct our binary classification task within each iteration.
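With the helper functions in place, the classifiers are trained by calling oneVsAll on the full training set (num_labels = 10). The regularization value of 0.1 below is just a reasonable starting point and worth experimenting with:
all_theta, all_J = oneVsAll(X, y, 10, 0.1) # Lambda = 0.1 here is my own choice; adjust as needed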
Plotting the cost function, just to ensure that gradient descent works as intended:
plt.plot(all_J[0:300])
plt.xlabel("Iteration")
plt.ylabel("$J(Theta)$")
plt.title("Cost function using Gradient Descent")

To make a prediction, the probability of x(i) for each class is computed, and the prediction is the class with the highest probability.
def predictOneVsAll(all_theta, X):
    """
    Using all_theta, compute the probability of X(i) for each class and predict the label
    return a vector of prediction
    """
    m = X.shape[0]
    X = np.hstack((np.ones((m,1)),X))
    
    predictions = X @ all_theta.T
    return np.argmax(predictions,axis=1)+1
pred = predictOneVsAll(all_theta, X)
print("Training Set Accuracy:",sum(pred[:,np.newaxis]==y)[0]/5000*100,"%")
The print statement prints: Training Set Accuracy: 91.46 %.
Finally, time for neural networks. With the same dataset, we aim to achieve higher accuracy using a more complex algorithm: the neural network. For the first part of the exercise, the optimized theta values are given to us, and we are supposed to implement feedforward propagation to obtain the predictions and the model accuracy.
Loading the optimized theta values
mat2=loadmat("ex3weights.mat")
Theta1=mat2["Theta1"] # Theta1 has size 25 x 401
Theta2=mat2["Theta2"] # Theta2 has size 10 x 26
Using feedforward propagation for prediction
def predict(Theta1, Theta2, X):
    """
    Predict the label of an input given a trained neural network
    """
    m = X.shape[0]
    X = np.hstack((np.ones((m,1)),X))
    
    a1 = sigmoid(X @ Theta1.T)
    a1 = np.hstack((np.ones((m,1)), a1)) # hidden layer
    a2 = sigmoid(a1 @ Theta2.T) # output layer
    
    return np.argmax(a2,axis=1)+1
pred2 = predict(Theta1, Theta2, X)
print("Training Set Accuracy:",sum(pred2[:,np.newaxis]==y)[0]/5000*100,"%")
The print statement prints: Training Set Accuracy: 97.52 %. A much higher accuracy compared to multi-class logistic regression!
In assignment 4, we work towards implementing a neural network from scratch, starting off by computing the cost function and the gradient of theta.
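For reference, the regularized cost that nnCostFunction computes is the logistic cost summed over all K = 10 output units, plus a penalty on every non-bias weight in both layers (the sums over the Theta terms exclude the bias columns):

$$J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[-y_k^{(i)}\log\big((h_\Theta(x^{(i)}))_k\big) - \big(1-y_k^{(i)}\big)\log\big(1-(h_\Theta(x^{(i)}))_k\big)\Big] + \frac{\lambda}{2m}\Big[\sum_{j,k}\big(\Theta^{(1)}_{j,k}\big)^2 + \sum_{j,k}\big(\Theta^{(2)}_{j,k}\big)^2\Big]$$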
def sigmoidGradient(z):
    """
    computes the gradient of the sigmoid function
    """
    sigmoid = 1/(1 + np.exp(-z))
    return sigmoid * (1-sigmoid)

def nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, Lambda):
    """
    nn_params contains the parameters unrolled into a vector
    compute the cost and gradient of the neural network
    """
    # Reshape nn_params back into the parameters Theta1 and Theta2
    Theta1 = nn_params[:((input_layer_size+1) * hidden_layer_size)].reshape(hidden_layer_size,input_layer_size+1)
    Theta2 = nn_params[((input_layer_size+1) * hidden_layer_size):].reshape(num_labels,hidden_layer_size+1)
    
    m = X.shape[0]
    J = 0
    X = np.hstack((np.ones((m,1)),X))
    y10 = np.zeros((m,num_labels))
    
    a1 = sigmoid(X @ Theta1.T)
    a1 = np.hstack((np.ones((m,1)), a1)) # hidden layer
    a2 = sigmoid(a1 @ Theta2.T) # output layer
    
    for i in range(1,num_labels+1):
        y10[:,i-1][:,np.newaxis] = np.where(y==i,1,0)
    for j in range(num_labels):
        J = J + sum(-y10[:,j] * np.log(a2[:,j]) - (1-y10[:,j])*np.log(1-a2[:,j]))
    
    cost = 1/m * J
    reg_J = cost + Lambda/(2*m) * (np.sum(Theta1[:,1:]**2) + np.sum(Theta2[:,1:]**2))
    
    # Implement the backpropagation algorithm to compute the gradients
    grad1 = np.zeros((Theta1.shape))
    grad2 = np.zeros((Theta2.shape))
    
    for i in range(m):
        xi = X[i,:]   # 1 X 401
        a1i = a1[i,:] # 1 X 26
        a2i = a2[i,:] # 1 X 10
        d2 = a2i - y10[i,:]
        d1 = Theta2.T @ d2.T * sigmoidGradient(np.hstack((1,xi @ Theta1.T)))
        grad1 = grad1 + d1[1:][:,np.newaxis] @ xi[:,np.newaxis].T
        grad2 = grad2 + d2.T[:,np.newaxis] @ a1i[:,np.newaxis].T
    
    grad1 = 1/m * grad1
    grad2 = 1/m * grad2
    
    grad1_reg = grad1 + (Lambda/m) * np.hstack((np.zeros((Theta1.shape[0],1)),Theta1[:,1:]))
    grad2_reg = grad2 + (Lambda/m) * np.hstack((np.zeros((Theta2.shape[0],1)),Theta2[:,1:]))
    
    return cost, grad1, grad2, reg_J, grad1_reg, grad2_reg
The assignment walks through the whole process step by step: computing the cost first, followed by the regularized cost, the gradient, and finally the regularized gradient. If you want to follow along, I modified the code so that it returns values for the intermediate steps, as long as you use the right indexing.
input_layer_size = 400
hidden_layer_size = 25
num_labels = 10
nn_params = np.append(Theta1.flatten(),Theta2.flatten())
J,reg_J = nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, 1)[0:4:3]
print("Cost at parameters (non-regularized):",J,"nCost at parameters (Regularized):",reg_J)

The flatten() function here collapses each array into one dimension, and np.append "unrolls" the parameters into a single vector.
The issue of symmetry for the initial theta values was discussed in the lecture. To break that symmetry, random initialization is needed.
def randInitializeWeights(L_in, L_out):
    """
    randomly initializes the weights of a layer with L_in incoming connections and L_out outgoing connections.
    """
    epi = 6**0.5 / (L_in + L_out)**0.5 # epsilon_init = sqrt(6) / sqrt(L_in + L_out)
    W = np.random.rand(L_out, L_in+1) * (2*epi) - epi
    return W
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size)
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels)
initial_nn_params = np.append(initial_Theta1.flatten(),initial_Theta2.flatten())
Finally, it’s our turn to optimize the theta values using feedforward propagation and backpropagation. The optimizing algorithm I am using is once again the same old gradient descent.
def gradientDescentnn(X,y,initial_nn_params,alpha,num_iters,Lambda,input_layer_size, hidden_layer_size, num_labels):
    """
    Take in numpy array X, y and theta and update theta by taking num_iters gradient steps
    with learning rate of alpha
    return theta and the list of the cost of theta during each iteration
    """
    Theta1 = initial_nn_params[:((input_layer_size+1) * hidden_layer_size)].reshape(hidden_layer_size,input_layer_size+1)
    Theta2 = initial_nn_params[((input_layer_size+1) * hidden_layer_size):].reshape(num_labels,hidden_layer_size+1)
    
    m = len(y)
    J_history = []
    
    for i in range(num_iters):
        nn_params = np.append(Theta1.flatten(),Theta2.flatten())
        cost, grad1, grad2 = nnCostFunction(nn_params,input_layer_size, hidden_layer_size, num_labels,X, y,Lambda)[3:]
        Theta1 = Theta1 - (alpha * grad1)
        Theta2 = Theta2 - (alpha * grad2)
        J_history.append(cost)
    
    nn_paramsFinal = np.append(Theta1.flatten(),Theta2.flatten())
    return nn_paramsFinal, J_history
nnTheta, nnJ_history = gradientDescentnn(X,y,initial_nn_params,0.8,800,1,input_layer_size, hidden_layer_size, num_labels)
Theta1 = nnTheta[:((input_layer_size+1) * hidden_layer_size)].reshape(hidden_layer_size,input_layer_size+1)
Theta2 = nnTheta[((input_layer_size +1)* hidden_layer_size ):].reshape(num_labels,hidden_layer_size+1)
A warning to those executing the code: it will take a fair bit of time to compute depending on your computing power, and even longer if you are optimizing your alpha and num_iters values. I used 0.8 for alpha and 800 for num_iters, but I believe it can reach a better accuracy with more tuning.
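If you want to monitor convergence while tuning alpha and num_iters, plotting the recorded cost history works the same way as it did for the one-vs-all classifiers (an extra check, not part of the assignment):
plt.plot(nnJ_history)
plt.xlabel("Iteration")
plt.ylabel(r"$J(\Theta)$")
plt.title("Cost function using Gradient Descent (neural network)")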
pred3 = predict(Theta1, Theta2, X)
print("Training Set Accuracy:",sum(pred3[:,np.newaxis]==y)[0]/5000*100,"%")

With this, I am halfway through the series. The Jupyter notebook will be uploaded to my GitHub at (https://github.com/Benlau93/Machine-Learning-by-Andrew-Ng-in-Python).
For other Python implementations in the series:
- Linear Regression
- Logistic Regression
- Regularized Logistic Regression
- Support Vector Machines
- Unsupervised Learning
- Anomaly Detection
Thank you for reading.