A Tour to Machine Learning and Deep Learning

Build up a Neural Network with Python

Use NumPy to realize forward propagation, backward propagation

Yang
Towards Data Science
7 min read · Jul 22, 2019


Figure 1: Neural Network

The purpose of this blog is to use the NumPy package in Python to build up a neural network. Although well-established packages like Keras and TensorFlow make it easy to build a model, it is still worthwhile to code forward propagation, backward propagation, and gradient descent yourself, because doing so helps you understand the algorithm better.

Overview

Figure 2: Overview of forward propagation and backward propagation

The figure above shows how information flows when a neural network model is trained. After the input Xn is entered, a linear combination with weights W1 and bias B1 is applied to Xn. Next, an activation function is applied as a non-linear transformation to get A1. A1 is then fed as input to the next hidden layer, and the same logic is applied to generate A2 and A3. The procedure that generates A1, A2, and A3 is called forward propagation. A3, which is also the output of the neural network, is compared with the response variable y to calculate the cost. The derivative of the cost function is then calculated to get dA3, and from dA3 the partial derivatives with respect to W3 and B3 give dW3 and dB3. The same logic is applied to get dA2, dW2, dB2, dA1, dW1, and dB1. The procedure that generates this list of derivatives is called backward propagation. Finally, gradient descent is applied, the parameters are updated, and a new iteration starts with the updated parameters. The algorithm stops only when it converges.

Create Testing Data

Create a small set of testing data to verify the functions created at each step.
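The exact testing data used in the blog is not shown here, so below is a minimal sketch of how such a toy set might be created; the shapes (2 features, 3 records, one binary label per record) match the dimensions used in the forward-propagation example later.

```python
import numpy as np

np.random.seed(1)  # fix the seed so the test values are reproducible

# A tiny test set: 3 records with 2 features each.
# Each COLUMN is one record, matching the convention used in this blog.
X_test = np.random.randn(2, 3)
# Binary response, one label per record.
Y_test = np.array([[1, 0, 1]])

print(X_test.shape)  # (2, 3)
print(Y_test.shape)  # (1, 3)
```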

Initialize Parameters

At the parameter-initialization stage, weights are initialized as small random values near zero. “If weights are near zero, then the operative part of sigmoid is roughly linear, and hence the neural network collapses into an approximately linear model.” [1] The gradient of the sigmoid function around zero is steep, so parameters can be updated rapidly by gradient descent. Do not use exactly zero or large weights; either choice leads to poor solutions.
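Below is a minimal sketch of how the parameters might be initialized, assuming the layer sizes are passed in as a list such as [2, 5, 3, 1]; the 0.01 scaling factor keeps the weights near zero as discussed above, and setting the biases to zero is one common choice.

```python
import numpy as np

def initialize_parameters(layer_dims, seed=1):
    """Initialize W as small random values near zero and B as zeros.

    layer_dims: e.g. [2, 5, 3, 1] -> 2 inputs, two hidden layers, 1 output unit.
    """
    np.random.seed(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        # Small random weights keep the sigmoid in its steep, nearly linear region.
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters["B" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

parameters = initialize_parameters([2, 5, 3, 1])
print(parameters["W1"].shape, parameters["B1"].shape)  # (5, 2) (5, 1)
```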

I manually calculated one training iteration of the neural network in Excel, which helps verify the accuracy of the functions created at each step. Here is the output of parameter initialization.

Table 1: Parameters Initialization Testing Result

Forward Propagation

In the neural network, the inputs Xn are entered and information flows forward through the whole network. The inputs Xn provide the initial information that propagates through the hidden units of each layer and finally produces the prediction. This procedure is called forward propagation, and it consists of two steps. The first step is a linear combination of the weights with the output from the previous layer (or the inputs Xn) to generate Z. The second step is to apply an activation function to obtain a non-linear transformation.

Table 2: Matrix Calculation in forward propagation

In the first step, you need to pay attention to the dimensions of the inputs and outputs. Suppose you have an input matrix X of dimension [2, 3], where each column of the matrix represents a record. There are 5 hidden units in the hidden layer, so the dimension of the weight matrix W is [5, 2] and the dimension of the bias B is [5, 1]. By applying matrix multiplication, we get an output matrix Z of dimension [5, 3]. Details of the calculation can be seen in the table above.
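As a quick check of the dimensions described above (the X, W, and B below are just hypothetical values):

```python
import numpy as np

X = np.random.randn(2, 3)   # 3 records, each a column with 2 features
W = np.random.randn(5, 2)   # 5 hidden units, 2 inputs per unit
B = np.random.randn(5, 1)   # one bias per hidden unit

Z = np.dot(W, X) + B        # B is broadcast across the 3 columns
print(Z.shape)              # (5, 3)
```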

Table 3: How activation is applied in forward propagation

The table above shows how the activation function is applied to each component of Z. The reason to use an activation function is to introduce a non-linear transformation; without it, no matter how many hidden layers the model has, it is still a linear model. Several activation functions are popular and commonly used, including ReLU, Leaky ReLU, sigmoid, and tanh. Formulas and figures for these activation functions are shown below.

Figure 3: Activation Function
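A minimal sketch of these activation functions in NumPy (the formulas in the comments are the standard definitions):

```python
import numpy as np

def sigmoid(Z):
    # sigmoid(z) = 1 / (1 + e^(-z)), squashes values into (0, 1)
    return 1 / (1 + np.exp(-Z))

def relu(Z):
    # relu(z) = max(0, z)
    return np.maximum(0, Z)

def leaky_relu(Z, alpha=0.01):
    # leaky_relu(z) = z if z > 0, otherwise alpha * z
    return np.where(Z > 0, Z, alpha * Z)

def tanh(Z):
    # tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
    return np.tanh(Z)
```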

First, define the sigmoid and ReLU functions. Then create a function for single-layer forward propagation. Finally, the functions created in the previous steps are nested inside a full forward propagation function. For simplicity, the ReLU function is used in the first N-1 hidden layers and the sigmoid function is used in the last layer (the output layer). Note that the sigmoid function is used for binary classification problems, whereas the softmax function is used for multi-class classification problems. Save the Z and A calculated in each layer into caches, which will be used in backward propagation.
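Here is a minimal sketch of single-layer and full forward propagation, reusing the sigmoid and relu functions sketched above; the parameter names (W1, B1, ...) follow the initialization sketch, and the cache layout is one possible choice rather than the blog's exact implementation.

```python
import numpy as np

def single_layer_forward(A_prev, W, B, activation="relu"):
    # Step 1: linear combination of the previous layer's output
    Z = np.dot(W, A_prev) + B
    # Step 2: non-linear transformation
    A = sigmoid(Z) if activation == "sigmoid" else relu(Z)
    return A, Z

def full_forward_propagation(X, parameters):
    L = len(parameters) // 2          # number of layers
    caches = {"A0": X}
    A = X
    for l in range(1, L + 1):
        # ReLU for the first L-1 layers, sigmoid for the output layer
        activation = "sigmoid" if l == L else "relu"
        A, Z = single_layer_forward(A, parameters["W" + str(l)],
                                    parameters["B" + str(l)], activation)
        caches["Z" + str(l)] = Z      # saved for backward propagation
        caches["A" + str(l)] = A
    return A, caches
```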

Here is the function output on testing data.

Table 4: Forward Propagation Testing Result

Cost Function

The output of forward propagation is the predicted probability of the binary event. This probability is compared with the response variable to calculate the cost. Cross entropy is used as the cost function for classification problems, while mean squared error is used for regression problems. The formula for cross entropy is shown below.
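For binary classification, the cross-entropy cost is J = -(1/m) * Σ [y·log(a) + (1 − y)·log(1 − a)], where a is the predicted probability, y is the true label, and m is the number of records. A minimal sketch of the cost function:

```python
import numpy as np

def compute_cost(A_last, Y):
    """Cross-entropy cost for binary classification.

    A_last: predicted probabilities, shape (1, m)
    Y:      true labels (0 or 1), shape (1, m)
    """
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(A_last) + (1 - Y) * np.log(1 - A_last)) / m
    return np.squeeze(cost)
```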

Here is the function output on testing data.

Table 5: Cost Function Testing Result

Backward Propagation

During training, forward propagation continues onward until it produces a cost. Backward propagation then calculates the derivatives of the cost function and flows this information back through each layer, using the chain rule from calculus.

Suppose y = g(x) and z = f(y) = f(g(x)). Then z depends on x through the intermediate variable y, and the chain rule states that

dz/dx = (dz/dy) · (dy/dx)
The derivatives for the activation functions are shown below. For the sigmoid function, σ'(z) = σ(z)(1 − σ(z)); for ReLU, the derivative is 1 when z > 0 and 0 when z < 0.

The structure is similar to forward propagation. First, create functions for the derivatives of sigmoid and ReLU. Then define a function for single-layer backward propagation, which calculates dW, dB, and dA_prev; dA_prev is used as the input for backward propagation in the previous layer. Finally, the functions created in the previous steps are nested inside a full backward propagation function. To align with forward propagation, the first N-1 hidden layers use the ReLU function and the last layer (the output layer) uses the sigmoid function. You can modify the code and add more activation functions as you wish. Save dW and dB into another cache, which will be used to update the parameters.
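A minimal sketch of backward propagation, reusing the sigmoid function and the cache layout from the forward-propagation sketch above; this is one possible implementation, not necessarily identical to the blog's exact code.

```python
import numpy as np

def sigmoid_backward(dA, Z):
    s = sigmoid(Z)
    return dA * s * (1 - s)            # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))

def relu_backward(dA, Z):
    return dA * (Z > 0)                # relu'(z) = 1 if z > 0 else 0

def single_layer_backward(dA, Z, A_prev, W, activation="relu"):
    m = A_prev.shape[1]
    dZ = sigmoid_backward(dA, Z) if activation == "sigmoid" else relu_backward(dA, Z)
    dW = np.dot(dZ, A_prev.T) / m
    dB = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)          # flows back to the previous layer
    return dA_prev, dW, dB

def full_backward_propagation(A_last, Y, parameters, caches):
    grads = {}
    L = len(parameters) // 2
    # derivative of the cross-entropy cost with respect to the final activation
    dA = -(np.divide(Y, A_last) - np.divide(1 - Y, 1 - A_last))
    for l in reversed(range(1, L + 1)):
        activation = "sigmoid" if l == L else "relu"
        dA, dW, dB = single_layer_backward(
            dA, caches["Z" + str(l)], caches["A" + str(l - 1)],
            parameters["W" + str(l)], activation)
        grads["dW" + str(l)] = dW
        grads["dB" + str(l)] = dB
    return grads
```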

Here is the function output on testing data.

Table 6: Backward Propagation Testing Result

Update Parameters

Once the gradients are calculated from backward propagation, update the current parameters by subtracting learning rate * gradients. The updated parameters are then used in a new round of forward propagation.
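A minimal sketch of the update step, following the parameter and gradient naming used above:

```python
def update_parameters(parameters, grads, learning_rate=0.01):
    L = len(parameters) // 2
    for l in range(1, L + 1):
        # gradient descent step: parameter = parameter - learning_rate * gradient
        parameters["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
        parameters["B" + str(l)] -= learning_rate * grads["dB" + str(l)]
    return parameters
```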

Here is the function output on testing data.

Table 7: Parameter Update Testing Result

An explanation of gradient descent can be found in my other blog.

Stack functions together

To train a neural network model, the functions created in the previous steps are stacked together; a sketch of the resulting training loop follows the table. A summary of the functions is provided in the table below.

Table 8: Functions Summary
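Below is one way the functions sketched above might be stacked into a training loop; the function name train and its arguments are illustrative, not necessarily the blog's exact signature.

```python
def train(X, Y, layer_dims, learning_rate=0.01, epochs=10000):
    parameters = initialize_parameters(layer_dims)
    costs = []
    for i in range(epochs):
        # forward propagation -> cost -> backward propagation -> parameter update
        A_last, caches = full_forward_propagation(X, parameters)
        cost = compute_cost(A_last, Y)
        grads = full_backward_propagation(A_last, Y, parameters, caches)
        parameters = update_parameters(parameters, grads, learning_rate)
        if i % 100 == 0:
            costs.append(cost)       # track the cost over time
    return parameters, costs
```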

Run Model

First, use the make_moons function to create two interleaving half-circles of data. A visualization of the data is provided below.

Figure 4: Training Data
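A sketch of how the data might be generated, assuming scikit-learn's make_moons; the sample size and noise level here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons

# Two interleaving half circles with a little noise added.
X, y = make_moons(n_samples=500, noise=0.2, random_state=1)
X = X.T               # shape (2, 500): one column per record
Y = y.reshape(1, -1)  # shape (1, 500)
```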

Then run the function to train a neural network model. The training process is visualized in the figures below. The cost converges after 8,000 epochs and the model accuracy converges to 0.9.
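A usage sketch of the training call, assuming the hypothetical train function and the X and Y generated above (layer sizes and hyper-parameters are illustrative):

```python
parameters, costs = train(X, Y, layer_dims=[2, 5, 3, 1],
                          learning_rate=0.1, epochs=10000)
```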

Figure 5: Cost over Time
Figure 6: Accuracy over Time

Next Step

Figures 5 and 6 indicate a potential overfitting problem. You can use methods such as early stopping, dropout, and regularization to remediate this issue. You can also play with the model by adding activation functions other than ReLU and sigmoid. Batch gradient descent is used in this blog, but there are many improved gradient descent algorithms such as Momentum, RMSprop, and Adam.

Summary

Though I had taken online courses and read the relevant book chapters before, it was not until I did the coding and wrote this blog myself that I fully understood the method. As an old saying goes, teaching is the best way to learn. I hope you can benefit from reading this blog. Please read my other blogs if you are interested.

Reference

[1] Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2008), The Elements of Statistical Learning

[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2017), Deep Learning

[3] https://www.coursera.org/specializations/deep-learning?

[4] https://en.wikipedia.org/wiki/Activation_function

[5] https://explained.ai/matrix-calculus/index.html

