How to Build a Neural Network from Scratch

Arseny Turin
Towards Data Science
8 min read · May 25, 2020


Photo by Annie Spratt

A step-by-step tutorial on how to build a simple neural network from scratch

Introduction

In this post, we will build our own neural network from scratch, with one hidden layer and a sigmoid activation function. We will take a closer look at derivatives and the chain rule to get a clear picture of how backpropagation is implemented. Our network will be able to solve a linear regression task with the same accuracy as a Keras analog.

You can find the code for this project in this GitHub repository.

What is a Neural Network?

We are used to seeing neural networks as interconnected layers of neurons, with an input layer on the left, hidden layers in the middle and an output layer on the right. That picture is easier to digest visually, but ultimately a neural network is just one big function that takes other functions as input; and depending on the depth of the network, those inner functions can themselves take other functions as input, and so on. Those inner functions are in fact the “layers”. Let's take a look at the diagram of the network we will build:

Neural Network Architecture

It has an input layer with two features, a hidden layer with three neurons and an output layer. Each neuron in the hidden layer is a sigmoid activation function that takes the input values (x1, x2), weights (w1,…,w6) and biases (b1, b2, b3) as input and produces a value ranging from 0 to 1 as output.

In the beginning, we assign random values between 0 and 1 to the weights and biases.
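As a rough sketch of that initialization (the names w1 through w9 and b1 through b4 simply match the feedforward code later in this post):

import numpy as np

# Draw all weights and biases from a uniform [0, 1) distribution
w1, w2, w3, w4, w5, w6, w7, w8, w9 = np.random.rand(9)
b1, b2, b3, b4 = np.random.rand(4)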

First Neuron in the Hidden Layer

The output layer contains just one neuron that does a similar job to the neurons in the hidden layer. You might’ve guessed that ŷ is in fact that big function I mentioned earlier.

The type of network we described above is called a dense network, because the neurons are fully connected to the elements of the previous layer, in our case the input.

How Do We Train a Neural Network?

The most crucial thing to understand is that a neural network is trained only by adjusting its weights and biases to minimize the output error. The training process consists of feedforward and backpropagation. Feedforward predicts the output, and backpropagation adjusts the weights and biases to minimize the error of that output, i.e. the difference between the predicted and true values.

Feedforward

When we want to predict the output, we use the feedforward function. This function takes the inputs x1 and x2; these values go into the neurons of the hidden layer along with the weights and biases, and each neuron returns a value in the range [0, 1]; the output layer then takes those values and produces the output.

Let's take a closer look at the first neuron in the hidden layer and understand what it’s actually doing.

Feedforward Process

As I mentioned earlier, each neuron is just a sigmoid function:
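For reference, it is the same formula that appears in the code below:

sigmoid(x) = 1 / (1 + e^(-x))

which squashes any input into a value between 0 and 1.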

Similar to linear regression, where we have parameters such as slopes and an intercept to make predictions, in a neural network we have weights and biases.

First Neuron in the Hidden Layer

Each neuron produces a value in [0, 1] which we then use as input to the output layer. In some neural networks the output layer does not have to be a sigmoid or any other activation function; it could simply be a sum of the values from the previous layer.

Neuron in the Output Layer

The code for the feedforward pass looks like this:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.e**-x)

def feedforward(x1, x2):
    n1 = sigmoid(x1 * w1 + x2 * w2 + b1)
    n2 = sigmoid(x1 * w3 + x2 * w4 + b2)
    n3 = sigmoid(x1 * w5 + x2 * w6 + b3)
    y_hat = sigmoid(n1 * w7 + n2 * w8 + n3 * w9 + b4)
    return y_hat

After we have predicted a value, we can compare it to the true value using the Mean Squared Error (MSE).
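A minimal sketch of that comparison (the function name mse is just for illustration, and np is the NumPy import from the code above):

def mse(y_true, y_pred):
    # Average of the squared differences between true and predicted values
    return np.mean((y_true - y_pred) ** 2)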

Why Use Bias?

The role of the bias is to give the neuron an additional parameter that is not affected by the previous layers, because it is not connected to them in any way.

Backpropagation

At first, our network does a terrible job at predicting, because the weights and biases are just random numbers. This is where backpropagation comes into play to help us train these parameters.

Each time the network makes a prediction, we use the MSE to compare it with the true value, and then we go back and adjust each weight and bias to reduce the error ever so slightly.

MSE

There are several elements we should know about to understand backpropagation:

  1. Derivatives — what direction to change each weight and bias
  2. Chain Rule — how to access each weight
  3. Gradient Descent — an iterative process of adjusting weights and biases

Derivatives

Thanks to derivatives, we always know in which direction to change each parameter (make it slightly bigger or smaller). Let's say we want to adjust one weight (w = 0.75) in order to make the MSE a bit smaller. To do that, we take the partial derivative of the function with respect to this weight. Then we plug numbers into the derived function and get a number (say 0.05), positive or negative. Then we subtract that number from our weight (w -= 0.05). That's how the adjustment is done, and it happens to every weight in the network.
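As a toy sketch of that single adjustment, using the numbers from the example above (the learning rate is left out here, just as in the text):

w = 0.75         # current value of the weight
dmse_dw = 0.05   # partial derivative of MSE with respect to w (assumed value)
w -= dmse_dw     # nudge the weight in the direction that lowers the MSE; w is now 0.70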

Because networks are ultimately very complex functions, it's quite difficult to find the derivative of a parameter that is buried inside a myriad of other functions. Luckily, the chain rule simplifies that process for us.

Chain Rule

If we need to find a derivative of a function that contains another function, we use the chain rule.

This rule says that we take the derivative of the outside function, keep the inside function untouched, and then multiply everything by the derivative of the inside function.
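Written out, the rule is:

d/dx f(g(x)) = f'(g(x)) · g'(x)

For example, the derivative of (y-ŷ)² with respect to ŷ is 2(y-ŷ) · (-1): the outer square gives 2(y-ŷ), and the inner function (y-ŷ) contributes its own derivative, -1.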

Gradient Descent

Once we know all the derivatives, we gradually adjust each weight and bias every time we run backpropagation. What's important to know here is that gradient descent has a learning rate parameter, usually a small number, that we multiply the derivative by in order to slow down or speed up training.
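In the same spirit, one gradient descent step for any weight w looks like this (the same rule applies to every bias):

w ← w - learning_rate · ∂MSE/∂w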

Example of Training Weight 1

Let's see in practice how to take the partial derivative of the MSE with respect to w1. Once we understand how it works for one weight, it will be easy to do for the others, because for the most part it is almost exactly the same process.

Backpropagation

Now, to take the partial derivative with respect to w1, we should start with the MSE function. It's the root function that contains all the other functions in the network.

Training one of the weights

As we know, MSE is the squared difference between the true and predicted values, (y-ŷ)². If we unfold the entire network to see where w1 sits inside, it looks like this:

Matryoshka of the neural network

We should remember by now that the network is a function that contains other functions, and to adjust each parameter of this function we need to use the chain rule:

Chain rule representation

All that's left is to take the derivatives of each function. For w1 that means four derivatives, but for w8 it would be just three, because w8 sits directly in ŷ and we don't need to go as deep to reach it.
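Spelled out with the same counting (four factors for w1), the chain looks like this:

∂MSE/∂w1 = ∂(y-ŷ)²/∂(y-ŷ) · ∂(y-ŷ)/∂ŷ · ∂ŷ/∂n1 · ∂n1/∂w1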

Derivative with respect to w1

In the end, we get this scary-looking equation that explains how w1 affects the MSE. Let's elaborate on what's happening at each step.

First, we took the derivative of MSE, (y-ŷ)², which is 2(y-ŷ): the derivative of x² is 2x, and the same rule applies here. We didn't touch what's inside the squared function, as the chain rule requires.

Then we took the partial derivative of (y-ŷ) with respect to ŷ, which resulted in (0 - sigmoid'(…)); remember that ŷ is a sigmoid function, and we differentiate with respect to ŷ because that is the path leading to the weight w1. Next, we took the partial derivative of another sigmoid function with respect to n1, and finally the last derivative, that of x1*w1, which is x1, because the derivative of w1 is 1 and the coefficient x1 stays the same. Then we multiply all these derivatives together and we're good to go.

To put everything together, we get:

Backpropagation of the weight 1

Here the learning rate is a small number, usually ranging from 0.01 to 0.05, though it could be larger. We apply the same logic to find the updates for all the other weights and biases.
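A minimal sketch of that update in code, assuming x1, n1, y_hat and y are available from the feedforward step, a learning_rate value has been chosen (e.g. 0.01), and using the fact that sigmoid'(z) = sigmoid(z) · (1 - sigmoid(z)):

# Chain rule for w1: dMSE/dw1 = dMSE/dy_hat * dy_hat/dn1 * dn1/dw1
d_mse = -2 * (y - y_hat)            # derivative of (y - y_hat)**2 with respect to y_hat
d_y_hat = y_hat * (1 - y_hat) * w7  # derivative of the output sigmoid with respect to n1
d_n1 = n1 * (1 - n1) * x1           # derivative of the hidden sigmoid with respect to w1

w1 -= learning_rate * d_mse * d_y_hat * d_n1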

You can imagine that if we had hundreds of neurons, we would have thousands of weights and biases, so for illustration purposes we have kept the number of neurons very small. Please see GitHub for the rest of the code.

Testing Network on Real Data

Our network did a very good job of predicting home prices based on two features: median income and average number of rooms. The data was taken from the “california_housing” dataset in the sklearn library. The network converged pretty fast, in just over 6 epochs, and resulted in an MSE of 0.028, exactly the same result I got from the Keras analog.
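As a rough sketch of how such data could be loaded (the exact preprocessing in the repository may differ; the min-max scaling here is just one reasonable assumption):

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
cols = [data.feature_names.index(name) for name in ("MedInc", "AveRooms")]
X = data.data[:, cols]   # median income and average number of rooms
y = data.target          # median house value

# Scale features and target to [0, 1] (assumed preprocessing)
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
y = (y - y.min()) / (y.max() - y.min())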

Learning process

Conclusion

Our network is great for educational purposes, but it has limitations: we cannot change the number of neurons in the hidden layer or add another layer to the network. We have only one activation function, and our network can solve only simple tasks such as linear regression. If we wanted to use it for classification problems, we would need to find the derivatives for a Cross-Entropy or Softmax loss function. All of these changes could be made within the current setup.

Let me know in the comments if you have any questions or what was hard to understand.

Thank you for reading.
