
A Neural Network implementation (Part I)

Feedforward implementation using only Numpy!

Figure 1. My notes on backward propagation. Photo by me.

How to build a Neural Network from Scratch with Numpy! (Part I)

Neural Networks are the heart of Deep Learning. They constitute the foundational building block for groundbreaking models such as BERT or GPT-3. To get even a superficial understanding of what transformers do, I think it’s essential to learn how to build a neural network from scratch, using only the most basic mathematical tools out there.

Based on this, I decided to invest some time and code my own neural network implementation. You can have a look at the full project on my GitHub page (neuralnet). Yeah… I know, it’s not a very original name 👻 !

I will cover the topic in several different blog posts.

Theory first

The first thing I wanted to get right was the exact math behind forward and backward propagation. I didn’t want to look at or copy other posts’ implementations, so I picked up Introduction to Deep Learning by Sandro Skansi (published by Springer), which seemed just about right for the task.

Figure 2. Cover of "Introduction to Deep Learning" by Sandro Skansi. Image from Amazon

There’s an excellent explanation of shallow feedforward neural networks in chapter 4. I studied it and derived the equations by hand first so that I wouldn’t have to rack my brain while coding. (Actually, the image at the top of the post shows my real notes; I spent a nice afternoon getting the whole algorithm right.) It’s better to start coding when you have a solid idea of what you’re trying to create, even more so when you’re doing scientific programming.

You’ll note that there is a clear "Keras API-based" bias in my project. I borrowed some of its concepts to encapsulate elements such as the Layers abstraction. In the end, it’s the most reasonable way to expose the building of neural networks.

Basic Concepts

At this point, I’m going to sum up the topics that will be covered in this series of posts while implementing the network. Each topic comes with an explanation, the corresponding mathematical expressions, and the associated code.

The topics that will be covered in this part are:

Feedforward Propagation

  • Layer Weights
  • Activation Functions
  • Vectorized Bias Absorption
  • Full forward propagation

The following points will be explained in the upcoming Part II! 😄

  • Optimization: Gradient Descent
  • Loss Functions
  • Backward Propagation and Chain Rule
  • Weights initialization

Forward Propagation

From this step onwards, we’re going to focus on a binary classification problem, using a basic feedforward architecture as the basis for the code. We’ll use this setting because it’s one of the most common problems in Machine Learning and one of the easiest for explaining these concepts.

As shown in Figure 3 below, our network consists of 2 neurons in the input layer, which means that the network accepts vectors with 2 components, then 3 neurons in the hidden layer and 1 in the output layer. We’ll see later that, by means of activation functions, the output neuron returns the probability of the input observation belonging to one class or the other.

Figure 3. Shallow Feedforward Neural Network

Layer Weights

Each of the lines and arrows represents the weight matrices that relate the neurons of one layer to those of the next. These weights determine "how much" of the previous input or layer output is transferred to the next layer. We could write those matrices mathematically as shown below.

The matrix that represents the relationship of weights between the input and hidden layer will be denoted as ΘIH, and the weights between the hidden and output layer as ΘHO.

Figure 4. Weights matrices for each layer.
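In NumPy terms, and ignoring the bias for a moment (we’ll deal with it below through bias absorption), these are just two arrays whose shapes mirror the layer sizes of Figure 3. A quick illustrative sketch (the variable names are made up for the example):

import numpy as np

# Shapes mirror Figure 3: 2 input neurons -> 3 hidden neurons -> 1 output neuron
theta_ih = np.random.randn(2, 3)   # weights between input and hidden layer
theta_ho = np.random.randn(3, 1)   # weights between hidden and output layer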

Then, to explicitly define the transformations that take place from each layer to the next, we need to define what an activation function is.

Activation Functions

It’s quite common to read a definition of activation functions based on the biologically inspired model, but let’s partially set that aside, as I think it’s somewhat confusing.

Some notes on the biological and philosophical foundations can be found in the paper "On the Origins of Deep Learning", which explains these early conceptualizations with a very incisive and systematic approach.

I prefer to look at activation functions as non-linear transformations (in practice, always non-linear except in perceptrons) that squash their outputs into a certain range (approximating a probability distribution) and that ultimately allow Neural Networks to act as universal function approximators. This property explains why neural networks exhibit such high representational power.

A very cool explanation of the Universal Approximation Theorem can be read here.

In my implementation, I’ve chosen the sigmoid activation function, which was very much the standard activation function for a long time (although it has recently been superseded by ReLU and all of its variants).

Figure 5. Sigmoid activation function.

The code implementation is shown below.
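Stripped to its essentials, it’s little more than a NumPy one-liner; a minimal version looks like this (the derivative will only be needed for backpropagation in Part II):

import numpy as np

def sigmoid(z):
    # Squash any real-valued input into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # Needed later for backpropagation (Part II)
    s = sigmoid(z)
    return s * (1.0 - s)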

Bias Absorption

Now we have almost all the tools to see how forward propagation is computed. Let’s have a first look at what the expressions look like for a 2-layer feedforward architecture:

Figure 6. Forward propagation on a shallow network.

As you can clearly see, the form of forward propagation is quite simple. It’s just a composition of functions.

*In this post we’ve inherited the tradition of presenting neural networks in terms of neurons and their links, but in the end, if you look at the expression above, it’s quite straightforward to understand what the network boils down to. The output of the expression, applied to an input X, is the probability of that input belonging to the first class.

Why do ΘIH and ΘHO have the b subscript? → Because we’re implementing forward and backward propagation with bias absorption. This is not a very widespread concept, but I think it’s more efficient in computational terms, as it allows a fully vectorized implementation.

Bias absorption basically means treating the bias term as another input to each of the layers, forced to be 1. An extra weight is then stacked into the layer weights so that the bias is also estimated when fitting the data.

*The reason we need the bias term is that it gives the model the flexibility to adapt to different data points.

Let’s explain this with a simple example:

Figure 7. Single Layer Perceptron

Imagine having the architecture defined above in Fig. 7:

If we apply bias absorption to the input vector and weights,

Then, carrying this idea to our original problem and network architecture,

Note that we’ve added a new row to both matrices to handle the dot product that we’ll explain in the next step.
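To make this concrete, here’s a quick NumPy check (the variable names are made up for the example) showing that absorbing the bias as an extra row of weights gives exactly the same result as the usual weights-plus-bias formulation:

import numpy as np

x = np.array([[0.5, -1.2]])        # one observation, shape (1, 2)
theta_ih = np.random.randn(2, 3)   # input -> hidden weights
b_ih = np.random.randn(1, 3)       # hidden-layer biases

# Classic formulation: weights and bias handled separately
hidden_classic = x @ theta_ih + b_ih

# Bias absorption: stack a constant 1 next to the inputs and the bias
# as an extra row on top of the weights
x_b = np.hstack([np.ones((1, 1)), x])       # shape (1, 3)
theta_ihb = np.vstack([b_ih, theta_ih])     # shape (3, 3)
hidden_absorbed = x_b @ theta_ihb

assert np.allclose(hidden_classic, hidden_absorbed)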

Definition of Forward Propagation

Now we have everything we need to compute forward propagation. Let’s have a look at the code.
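Stripped of most of the project’s boilerplate, a simplified sketch of the layer looks roughly like this (the attribute and helper names follow the ones discussed below and may differ slightly from the repo):

import numpy as np

def _addbias(x):
    # Stack a column of 1s onto the inputs so the bias is absorbed into the weights
    return np.hstack([np.ones((x.shape[0], 1)), x])

class FullyConnectedLayer:
    def __init__(self, shape, activation):
        # `shape` is the bias-absorbed weight shape, e.g. (3, 3) or (4, 1)
        self.shape = shape
        self.activation = activation
        self.weights = None   # initialized later by the higher-level NeuralNet class

    def forward(self, x):
        # Dot product of the bias-augmented inputs with the layer weights,
        # followed by the non-linear activation
        return self.activation(np.dot(_addbias(x), self.weights))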

The function _addbias basically stacks a column of 1s onto the given vector to allow bias absorption within the layer. Then, in the FullyConnectedLayer class, we have the activation attribute, which is an encapsulation of the sigmoid function defined before, and the weights attribute, which is supposed to be initialized by a higher-level class, NeuralNet (covered later), by using the method _initializeweights. For now, just assume that the weights are initialized randomly from a normal distribution.

*Although I’ve removed some boilerplate code in the gist above, you’ll notice that there are some traces of it. This is because the code is supposed to handle different initialization techniques and activation functions while maintaining the same interface for all of them. Go here for the full code.

Then we have the forward method, which performs the propagation for the layer it belongs to. Translating that line of code into a mathematical expression leads to something like:
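yFCL = σ(x · ΘIHb)

where x already carries the extra column of 1s stacked by _addbias.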

The last expression, denoted yFCL (just shorthand for the FullyConnectedLayer output), does exactly what the forward method shows: it applies the dot product of the inputs x and the weights matrix ΘIHb.

*You may have noticed that the expression is not exactly the same as in Figure 6 with regard to the order of the weights and the inputs. This is totally valid as long as your vectors and matrices accommodate the matrix product properly.

Taking care of the shapes here is quite an important point for a correct implementation. You should know the required shape of your inputs and how they are going to be multiplied and transformed across the layers of the network.

This only explains the forward pass for a single layer, but how do we handle the end-to-end propagation of the input signal through the whole network? Let’s recall the figure showing the mathematical expression for full forward propagation on a 2-layer network:

Figure 6. Forward propagation on a shallow network.

It’s an easy operation to code and effortless to generalize to N layers. This is how it looks in Python.
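Stripped down to its essentials, the class looks roughly like this, reusing the sigmoid and FullyConnectedLayer pieces sketched above and the set_bias_as_weight helper shown just below; the method names are approximations, and the full version (with a proper LayersChain and He Normal initialization) is on GitHub:

import numpy as np

class NeuralNet:
    def __init__(self, layer_shapes):
        # Convert every shape into its bias-absorbed version and build the layers
        self.layer_shapes = [set_bias_as_weight(shape) for shape in layer_shapes]
        self.layers = self._build_layers()

    def _build_layers(self):
        # The repo wraps the layers in a LayersChain; a plain list keeps the sketch short.
        # Plain Gaussian weights stand in here for the He Normal initialization.
        layers = []
        for shape in self.layer_shapes:
            layer = FullyConnectedLayer(shape, activation=sigmoid)
            layer.weights = np.random.randn(*shape)
            layers.append(layer)
        return layers

    def forward(self, x):
        # Propagate the (possibly batched) input through every layer in turn
        for layer in self.layers:
            x = layer.forward(x)
        return x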

To give a brief summary of what we’re doing here, the __buildlayers method is responsible for setting up the layers defined when instantiating the NeuralNet class. You would do something like:

nn = NeuralNet(
    layer_shapes=(
        (2, 3),
        (3, 1)
    )
)

By doing so, you’re just defining the initial shapes of the weights between each pair of layers. Then, in the init, those shapes are converted into their bias-absorbed versions, that is, a new row is added to each initial shape (just as we’ve explained before!).

def set_bias_as_weight(shape):
    # Absorb the bias: each layer gets one extra row of weights
    return shape[0] + 1, shape[1]

After getting the proper weight shapes, the weights are initialized using a technique called He Normal initialization. Each layer is integrated into what I’ve called a LayersChain (similar to the Sequential class that Keras offers), which is basically an enhanced list object with some functionality to ease operations across layers.
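For reference, He Normal initialization just draws each weight from a zero-mean Gaussian whose standard deviation scales with the number of incoming connections. A rough sketch (the exact handling of the absorbed bias row may differ in the repo):

import numpy as np

def he_normal(shape):
    # Zero-mean Gaussian scaled by sqrt(2 / fan_in), where fan_in is the number
    # of incoming connections (the first dimension of the weight matrix)
    fan_in = shape[0]
    return np.random.randn(*shape) * np.sqrt(2.0 / fan_in)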

You can see in the NeuralNet class how we use this LayersChain to loop over all of the layers and perform the full propagation. Also, to clarify, this implementation is fully vectorized. The underlying reason is that training a neural network-based model is done in a batched, iterative way through Gradient Descent, which gives us room to leverage parallel processing and matrix calculus to optimize code efficiency. So, although we’ve been working with a single observation throughout the post, if you stack N observations into the same matrix x, all of the calculations still work.
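For instance, reusing the nn instance defined above (and the forward method sketched earlier), a whole batch goes through exactly the same code path:

import numpy as np

# Hypothetical usage: a batch of 5 observations with 2 features each
X = np.random.randn(5, 2)
probs = nn.forward(X)   # shape (5, 1): one probability per observation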

The output of the forward method will be used in backpropagation to calculate the error we’re producing with respect to the targets, but also for inference once we’ve trained our neural network.

And that’s basically all for Part I! Hope that you liked it and that everything is clear! See you on Part II.

