How to Build a Neural Network from Zero

No frameworks, just Python

Bernardino Sassoli

Published in

Towards Data Science

11 min readFeb 2, 2022

Inspirational image of scrabble tiles spelling the word foundation — Photo by Brett Jordan on Unsplash

“What I cannot create, I do not understand” (Richard Feynman)

This sentence was found on Richard Feynman’s blackboard at the time of his death. I use to think there is something corny about quoting Feynman: but, when a quote perfectly illustrates how you feel, resorting to cliché might be justified.

Let me get something out of the way: I know that there are dozens of excellent articles, tutorials, and videos on the topic of this article. Do we need another?I think my perspective is unusual. I am 51 years old, and my background in either programming, calculus, linear algebra, or engineering is zero. I started teaching myself programming and Artificial Neural Networks roughly twelve months ago. After a non-negligible number of tutorials and courses on Deep Learning, most of which leverage existing frameworks such as PyTorch, sci-kit learn, and TensorFlow, I still felt something was amiss in my understanding of certain concepts. When I don’t understand something thoroughly, I always do the same thing: build it from the ground up.

Hence, I decided to implement a simple Artificial Neural Network (ANN or NN) without resorting to any framework. This article illustrates the results of my attempt. Some will find it an exercise in futility. Others, hopefully, will find it helpful — perhaps even learn something from it.

(Please note that this is not an introduction to neural networks, nor do I explain the mathematics underpinning the algorithms. I assume my readers to be familiar with the basic concepts behind Deep Neural Networks, such as backpropagation. If you are not, you can do a lot worse than watching 3blue1brown’s series on YouTube).

I welcome criticism and suggestions. If you spot a mistake, feel free to get in touch! (You can find the code in this GitHub repo if you are interested.)

Getting started

First of all, let’s import some libraries. Since we will not use any deep learning or machine learning framework, the number of imports is limited.

Importing libraries

I decided to train and test the network using the MNIST_784 dataset.¹ As you probably know if you are reading this, MNIST is considered the “Hello, world” of Deep Learning. It’s a dataset consisting of 70.000 grayscale images of handwritten digits; in the words of its authors:

“It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.”

We can download MNIST as a python dictionary (using e.g. scikit-learn) as follows (we store the dataset itself in data and the labels in labels:

Loading MNIST

Let’s explore the dataset. We pick a random data point to get acquainted with an image and its shape:

Exploring an MNIST data point

So we know that our image is a vector of shape 784: a flattened 28 x 28 matrix. If we split our dataset into a training set with, e.g., 60.000 data points and a test set with 10.000 data points, we will end up with the following: input layer size = m*n, where m is the number of samples and n the number of features.

Hence, the network’s input layer has size (60.000 x 784). Assuming we represent the training dataset with features as rows and examples as columns, we’d end with something like the following:

Let’s start by thinking about a simple, fully connected NN with just one hidden layer. Let’s say the hidden layer has 512 nodes. The first question is: how do we control the mapping from the input layer to the hidden layer? In other words, what should be the size of the weight matrix? Each node in the input layer will connect to each input in the hidden layer. So we will need 784 x 512 = 401,408 weights or trainable parameters. And since each connection to the hidden layer will add a bias to each weighted sum, we will need 512 biases. As a general rule:

Rule for the shape of the weights matrix

(Getting dimension rights always drives me berserk: this, courtesy of Andrew Ng, is what I use when I get lost.) Eventually, this is what we’d get for a NN with a single deep layer consisting of 4 neurons:

Layers’ shapes

(But remember: we need to add biases to each node in the hidden layer.)

Setting the stage

Let’s start by defining some commonly used activation functions:

Activations

In this article, I will only use the Rectified Linear Unit (ReLU) function, but the NN should be flexible enough to accommodate other choices. We’ll remember that when we implement the NN.

For the output layer, we will use softmax, given that this is a multiclass classification problem.

Softmax

Before we can train our network, we are going to need some preprocessing functions:

a normalize function to scale the inputs to the [0, 1] range and
a one-hot-encode function, which will turn the array of labels from an n-sized vector (where n is the number of samples) to an n x m array (where m is the number of possible outputs).

There are various ways of scaling features: in our case, we will use min-max scaling (we could also just divide each feature by the number of the possible values a feature can take — in our case, 255):

Scaling

To perform one-hot-encoding on the labels, we can create an identity matrix (i.e., a matrix consisting of 1s in the diagonal and 0s elsewhere) having the same shape as our labels vector and subsequently index it with the labels vector itself, like this:

label                            encoded
-------  -------------------------------
      5  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
      0  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      4  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
      1  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
      9  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
      2  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
      1  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
      3  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
      1  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
      4  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]

One -hot-encoding for labels

Because raw data can be organized differently, I decided, as a matter of design choice, to leave the appropriate reshaping and the one-hot-encoding to the user: i.e., it will be up to them to preprocess the training dataset and the test dataset so that features as passed as rows and examples as columns and to pass the labels to the one_hot_encode function. The scaling will instead be performed by the network itself.

Finally, we will need derivatives of the activation functions to perform gradient descent:

Derivatives

Implementing the neural network

We now have everything we need for our deep neural network, which we will implement as a class. So we need to think about:

an init method;
a forward propagation method;
how to implement backpropagation;
how to train the network;
how to compute the cost function;
methods to predict, and, finally:
a way to implement metrics such as accuracy.

My thought process was, roughly, the following. We can pass the following arguments to the constructor:

the training set with the corresponding labels
the test set with its labels
an activation function choice (as a string)
the number of classes we need to predict
the architecture — which we represent as a list, where each element of the list corresponds to a deep layer and indicates the number of neurons in that layer.

Then we use a dictionary to store the layers, and name them with strings to order to make keeping track easier (a trick I also picked up from Andrew Ng’s course on Machine Learning).

Neural network __init__ method.

Notice a bunch of assert statements. I cannot emphasize this enough: if you don’t want to spend hours in debugging hell, think about how to test your network and sprinkle your methods liberally with conditions to be met.

When the constructor gets called to instantiate a NN, it performs a bunch of actions:

it normalizes the features in the training set X;
it adds a layer with size m where m is the number of features (which it infers from the shape of the first training example) at the beginning of the architecture list and another layer with size n (the number of classes) at the end;
it initializes an empty dict called parameters to store biases and weights.

Once we have our init, we can think about a method to initialize these parameters. For that, we proceed as follows:

we iterate over the number of deep layers (hence, starting at 1 and ending at L-1, where L is the length of the architecture list).
for each layer, we add to the parameters dictionary a key-value pair for weights and another for biases
to keep things simple, we initialize the weights by randomly picking them from a normally distributed sample of shape: (number of nodes in layer i, number of nodes in layer i+1) and scaling them by a factor of 100. A better choice might be to make the weight initialization dependent on the activation function of choice (see, e.g., this article)
the biases are a vector of zeros: one for each node in the layer. We store each weight matrix in the dictionary using the key w{i} and each bias vector with key b{i}.

Parameter initialization

Feedforward, backpropagation, and fitting

Feedforward

Now we get to the methods for forward- and backpropagation.

I want both methods to take no parameters as inputs (except for the class itself, obviously) and to modify the model in place. Why? Because we will pass the data to the fit method, and the calls to the forward and the backward passes will happen there. (We do, however, need each forward pass to return a cost value which we will use when fitting the model during gradient descent.)

So, here’s how we proceed to implement the forward method. You might remember we initialized two dictionaries: one for the layers and one for the parameters (which we populate using the initialize_parameters method above). For each node in each of the layers except the input and outputs layer, we will need an input (the weighted sum of inputs) and an activation output. We can store those in our layers dictionary as z{i} and a{i} respectively for each of the i deep layers in the model’s architecture. Hence:

we iterate over the range (1, length of the architecture-1);
to obtain layers[zi] for each i in that range we calculate the dot product of parameters[wi] and layers[a-i] to which we add parameters[bi];
Finally, to obtain the activation output layers[ai], we apply the activation function of choice to layers[zi]. We are storing the values of the layers because we will need them when we perform backpropagation.

forward propagation

As an implementation note: I was driven quite mad by overflows in the cost function computation until I realized that it was due to possible divisions by zero. Hence the 0.00000001 in the cost calculation.

Backpropagation

The implementation of the backpropagation algorithm is relatively straightforward in structure. It closely parallels that of the forward pass (I won’t be going into the math behind the backpropagation algorithm here, as it is way beyond the scope of the article). We initialize an empty derivatives dictionary to store the gradients and iterate over the layers from right to left. For each layer, we compute dZ and use it to calculate that layer’s dW and dB (the updates in the weights and biases, respectively). Here is where the layers dictionary we previously defined comes in handy, as we need to access each layer’s input z and activation a. First, let’s compute the gradients for the last (the output) layer.

For the outmost layer, dz is simply the result of output — ground truths. (Note: what shape is dz? Well, there will be as many rows as there are labels and as many columns as there are training examples — hence, we’ll need to transpose it when we dot it with the values stored in layers. Keep this in mind.)

We can then compute dW and db as follows:

the gradients for the weights are the dot product of dz and the transpose of the activation (see the above note) from the previous layer (averaged of the number of examples)
the gradients for the biases are just the averaged dz (note that we need to keepdims=True as otherwise, the array will be squished to shape (1, ) as a rank-1 array which might lead to ambiguous results.
We store the gradients for that layer’s activation in a variable dAPrev.

We then loop over the remaining layers, except this time we compute dz as the product of the layer’s activation function with the n+1th layer’s activation’s gradients dAPrev.

Finally, we return the derivatives dictionary.

Backprop

Fitting, accuracy, and predictions

We now put it all together to train the network. We pass a learning rate and a number of epochs to the fit method. This will train the model by calling initialize_parameters and for each epoch running a forward pass, followed by a backpropagation which returns the gradients. These will, in turn, be used to update the trainable parameters. During the training, costs will be stored in a list for future reference. Performance on the training and test sets (here defined as accuracy) will also be stored as well as displayed.

Fit model.

The last two pieces are the predict and the accuracies methods.

Predict and accuracy.

I have also defined a bunch of plotting functions that I am omitting here, but you can find them on the Github repo if you are interested.

Training the network and results

Finally, we can train and test our network! Let’s split the data into a training and a test set and perform training with ReLu (but we could use any other of the defined activation functions).

We can then instantiate a NN with the appropriate initialization parameters: here, I have chosen a 2-layers architecture of 128 and 32 neurons, respectively, which I will train across 200 epochs with a learning rate of 0.03.

Let’s see the results!

This seems not bad at all!

I hope you learned something on the way.

References

[1] License: Yann LeCun and Corinna Cortes hold the copyright of MNIST dataset, which is a derivative work from original NIST datasets. MNIST dataset is made available under the terms of the Creative Commons Attribution-Share Alike 3.0 license.