How to Build a Neural Network from Zero
No frameworks, just Python
“What I cannot create, I do not understand” (Richard Feynman)
This sentence was found on Richard Feynman’s blackboard at the time of his death. I use to think there is something corny about quoting Feynman: but, when a quote perfectly illustrates how you feel, resorting to cliché might be justified.
Let me get something out of the way: I know that there are dozens of excellent articles, tutorials, and videos on the topic of this article. Do we need another?I think my perspective is unusual. I am 51 years old, and my background in either programming, calculus, linear algebra, or engineering is zero. I started teaching myself programming and Artificial Neural Networks roughly twelve months ago. After a non-negligible number of tutorials and courses on Deep Learning, most of which leverage existing frameworks such as PyTorch, sci-kit learn, and TensorFlow, I still felt something was amiss in my understanding of certain concepts. When I don’t understand something thoroughly, I always do the same thing: build it from the ground up.
Hence, I decided to implement a simple Artificial Neural Network (ANN or NN) without resorting to any framework. This article illustrates the results of my attempt. Some will find it an exercise in futility. Others, hopefully, will find it helpful — perhaps even learn something from it.
(Please note that this is not an introduction to neural networks, nor do I explain the mathematics underpinning the algorithms. I assume my readers to be familiar with the basic concepts behind Deep Neural Networks, such as backpropagation. If you are not, you can do a lot worse than watching 3blue1brown’s series on YouTube).
I welcome criticism and suggestions. If you spot a mistake, feel free to get in touch! (You can find the code in this GitHub repo if you are interested.)
Getting started
First of all, let’s import some libraries. Since we will not use any deep learning or machine learning framework, the number of imports is limited.
I decided to train and test the network using the MNIST_784 dataset.¹ As you probably know if you are reading this, MNIST is considered the “Hello, world” of Deep Learning. It’s a dataset consisting of 70.000 grayscale images of handwritten digits; in the words of its authors:
“It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.”
We can download MNIST as a python
dictionary (using e.g. scikit-learn
) as follows (we store the dataset itself in data
and the labels in labels
:
Let’s explore the dataset. We pick a random data point to get acquainted with an image and its shape:
So we know that our image is a vector of shape 784: a flattened 28 x 28 matrix. If we split our dataset into a training set with, e.g., 60.000 data points and a test set with 10.000 data points, we will end up with the following: input layer size = m*n, where m is the number of samples and n the number of features.
Hence, the network’s input layer has size (60.000 x 784). Assuming we represent the training dataset with features as rows and examples as columns, we’d end with something like the following:
Let’s start by thinking about a simple, fully connected NN with just one hidden layer. Let’s say the hidden layer has 512 nodes. The first question is: how do we control the mapping from the input layer to the hidden layer? In other words, what should be the size of the weight matrix? Each node in the input layer will connect to each input in the hidden layer. So we will need 784 x 512 = 401,408 weights or trainable parameters. And since each connection to the hidden layer will add a bias to each weighted sum, we will need 512 biases. As a general rule:
(Getting dimension rights always drives me berserk: this, courtesy of Andrew Ng, is what I use when I get lost.) Eventually, this is what we’d get for a NN with a single deep layer consisting of 4 neurons:
(But remember: we need to add biases to each node in the hidden layer.)
Setting the stage
Let’s start by defining some commonly used activation functions:
In this article, I will only use the Rectified Linear Unit (ReLU) function, but the NN should be flexible enough to accommodate other choices. We’ll remember that when we implement the NN.
For the output layer, we will use softmax, given that this is a multiclass classification problem.
Before we can train our network, we are going to need some preprocessing functions:
- a normalize function to scale the inputs to the [0, 1] range and
- a one-hot-encode function, which will turn the array of labels from an n-sized vector (where n is the number of samples) to an n x m array (where m is the number of possible outputs).
There are various ways of scaling features: in our case, we will use min-max scaling (we could also just divide each feature by the number of the possible values a feature can take — in our case, 255):
To perform one-hot-encoding on the labels, we can create an identity matrix (i.e., a matrix consisting of 1s in the diagonal and 0s elsewhere) having the same shape as our labels vector and subsequently index it with the labels vector itself, like this:
label encoded
------- -------------------------------
5 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
0 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
4 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
1 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
9 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
2 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
1 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
3 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
1 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
4 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
Because raw data can be organized differently, I decided, as a matter of design choice, to leave the appropriate reshaping and the one-hot-encoding to the user: i.e., it will be up to them to preprocess the training dataset and the test dataset so that features as passed as rows and examples as columns and to pass the labels to the one_hot_encode
function. The scaling will instead be performed by the network itself.
Finally, we will need derivatives of the activation functions to perform gradient descent:
Implementing the neural network
We now have everything we need for our deep neural network, which we will implement as a class
. So we need to think about:
- an
init
method; - a forward propagation method;
- how to implement backpropagation;
- how to train the network;
- how to compute the cost function;
- methods to predict, and, finally:
- a way to implement metrics such as accuracy.
My thought process was, roughly, the following. We can pass the following arguments to the constructor:
- the training set with the corresponding labels
- the test set with its labels
- an activation function choice (as a string)
- the number of classes we need to predict
- the architecture — which we represent as a list, where each element of the list corresponds to a deep layer and indicates the number of neurons in that layer.
Then we use a dictionary to store the layers, and name them with strings to order to make keeping track easier (a trick I also picked up from Andrew Ng’s course on Machine Learning).
Notice a bunch of assert
statements. I cannot emphasize this enough: if you don’t want to spend hours in debugging hell, think about how to test your network and sprinkle your methods liberally with conditions to be met.
When the constructor gets called to instantiate a NN, it performs a bunch of actions:
- it normalizes the features in the training set X;
- it adds a layer with size m where m is the number of features (which it infers from the shape of the first training example) at the beginning of the
architecture
list and another layer with size n (the number of classes) at the end; - it initializes an empty
dict
called parameters to store biases and weights.
Once we have our init
, we can think about a method to initialize these parameters. For that, we proceed as follows:
- we iterate over the number of deep layers (hence, starting at 1 and ending at L-1, where L is the length of the
architecture
list). - for each layer, we add to the
parameters
dictionary a key-value pair for weights and another for biases - to keep things simple, we initialize the weights by randomly picking them from a normally distributed sample of shape: (number of nodes in layer i, number of nodes in layer i+1) and scaling them by a factor of 100. A better choice might be to make the weight initialization dependent on the activation function of choice (see, e.g., this article)
- the biases are a vector of zeros: one for each node in the layer. We store each weight matrix in the dictionary using the key w{i} and each bias vector with key b{i}.
Feedforward, backpropagation, and fitting
Feedforward
Now we get to the methods for forward- and backpropagation.
I want both methods to take no parameters as inputs (except for the class itself, obviously) and to modify the model in place. Why? Because we will pass the data to the fit method, and the calls to the forward and the backward passes will happen there. (We do, however, need each forward pass to return a cost value which we will use when fitting the model during gradient descent.)
So, here’s how we proceed to implement the forward
method. You might remember we initialized two dictionaries: one for the layers and one for the parameters (which we populate using the initialize_parameters
method above). For each node in each of the layers except the input and outputs layer, we will need an input (the weighted sum of inputs) and an activation output. We can store those in our layers
dictionary as z{i} and a{i} respectively for each of the i deep layers in the model’s architecture. Hence:
- we iterate over the range (1, length of the architecture-1);
- to obtain
layers[zi]
for each i in that range we calculate the dot product ofparameters[wi]
andlayers[a-i]
to which we addparameters[bi];
- Finally, to obtain the activation output
layers[ai]
, we apply the activation function of choice tolayers[zi]
. We are storing the values of the layers because we will need them when we perform backpropagation.
As an implementation note: I was driven quite mad by overflows in the cost function computation until I realized that it was due to possible divisions by zero. Hence the 0.00000001 in the cost calculation.
Backpropagation
The implementation of the backpropagation algorithm is relatively straightforward in structure. It closely parallels that of the forward pass (I won’t be going into the math behind the backpropagation algorithm here, as it is way beyond the scope of the article). We initialize an empty derivatives
dictionary to store the gradients and iterate over the layers from right to left. For each layer, we compute dZ
and use it to calculate that layer’s dW
and dB
(the updates in the weights and biases, respectively). Here is where the layers
dictionary we previously defined comes in handy, as we need to access each layer’s input z
and activation a
. First, let’s compute the gradients for the last (the output) layer.
For the outmost layer, dz
is simply the result of output — ground truths
. (Note: what shape is dz
? Well, there will be as many rows as there are labels and as many columns as there are training examples — hence, we’ll need to transpose it when we dot it with the values stored in layers
. Keep this in mind.)
We can then compute dW
and db
as follows:
- the gradients for the weights are the dot product of
dz
and the transpose of the activation (see the above note) from the previous layer (averaged of the number of examples) - the gradients for the biases are just the averaged
dz
(note that we need tokeepdims=True
as otherwise, the array will be squished to shape (1, ) as a rank-1 array which might lead to ambiguous results. - We store the gradients for that layer’s activation in a variable
dAPrev
.
We then loop over the remaining layers, except this time we compute dz
as the product of the layer’s activation function with the n+1th
layer’s activation’s gradients dAPrev
.
Finally, we return the derivatives
dictionary.
Fitting, accuracy, and predictions
We now put it all together to train the network. We pass a learning rate and a number of epochs to the fit
method. This will train the model by calling initialize_parameters
and for each epoch running a forward pass, followed by a backpropagation which returns the gradients. These will, in turn, be used to update the trainable parameters. During the training, costs will be stored in a list for future reference. Performance on the training and test sets (here defined as accuracy
) will also be stored as well as displayed.
The last two pieces are the predict
and the accuracies
methods.
I have also defined a bunch of plotting functions that I am omitting here, but you can find them on the Github repo if you are interested.
Training the network and results
Finally, we can train and test our network! Let’s split the data into a training and a test set and perform training with ReLu (but we could use any other of the defined activation functions).
We can then instantiate a NN with the appropriate initialization parameters: here, I have chosen a 2-layers architecture of 128 and 32 neurons, respectively, which I will train across 200 epochs with a learning rate of 0.03.
Let’s see the results!
This seems not bad at all!
I hope you learned something on the way.
References
[1] License: Yann LeCun and Corinna Cortes hold the copyright of MNIST dataset, which is a derivative work from original NIST datasets. MNIST dataset is made available under the terms of the Creative Commons Attribution-Share Alike 3.0 license.