Coding A Neural Network From Scratch in NumPy

Deepen your understanding of AI by building an ANN from scratch in NumPy

Photo by Possessed Photography on Unsplash

Introduction

In this article, I will walk through the development of an artificial neural network from scratch using NumPy. The architecture of this model is the most basic of all ANNs: a simple feed-forward network. I will also show the Keras equivalent of this model, as I tried to make my implementation ‘Keras-esque’. Although the feed-forward architecture is basic compared to other neural networks such as transformers, these core concepts can be extrapolated to build more complex ANNs. These topics are inherently technical. For a more conceptual article on AI, please check out my other post, Demystifying Artificial Intelligence.

Table of Contents

  • Architecture Overview

    • Forward Pass
    • Backward Pass
  • NumPy Implementation – Data

    • Construct Layers
    • Construct Network
    • Network: The Forward Pass
    • Layers: The Forward Pass
    • Perform Forward Pass / Do Sanity Check
    • Network: The Backward Pass
    • Layers: The Backward Pass
    • Perform Backward Pass / Do Sanity Check – Train Model
  • Conclusion

Term Glossary

  • X = inputs
  • y = labels
  • W = weights
  • b = bias
  • Z = dot product of X and W plus b
  • A = activation(Z)
  • k = number of classes
  • Lower-case letters denote vectors; upper-case letters denote matrices

Architecture

Forward Pass

The dot product

First, we compute the dot product of our inputs & weights and add a bias term.

Dot Product + Bias. Image by author.
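In NumPy, this first step is a single matrix product. A minimal sketch, assuming the (neurons × features) weight layout used later in this article, where W and b stand for one layer’s parameters:

# X: (n_samples, n_features), W: (n_neurons, n_features), b: (1, n_neurons)
Z = np.dot(X, W.T) + b   # one weighted sum per neuron, per sample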

Second, we put the weighted sum obtained in step one through an activation function.

Both of these operations are straightforward: the dot product is a standard matrix operation, and the activation is applied element-wise, so I will not go into depth here. More on dot products and activation functions here: Dot Product, Activation Functions. These computations take place in every neuron of each hidden layer.

Neuron Computation Overview. Image by author.
Function Notation – Model Prediction. Image by author.

The Activation Function

In my implementation we use ReLU activation in the hidden layers because it is easy to differentiate, and Softmax activation in the output layer (more on this below). In future versions, I will build it out to be more robust and enable any of these activation functions.

Commonly used activation functions:

Sigmoid Activation Function. Image by author.
Tanh Activation Function. Image by author.
ReLU Activation Function. Image by author.
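For reference, here is a minimal sketch of these three activations as element-wise NumPy functions (only ReLU is used in the hidden layers of this implementation):

import numpy as np

def sigmoid(Z):
    # squashes each element into (0, 1)
    return 1 / (1 + np.exp(-Z))

def tanh(Z):
    # squashes each element into (-1, 1)
    return np.tanh(Z)

def relu(Z):
    # keeps positive elements, zeroes out the rest
    return np.maximum(0, Z)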

Backward Pass

The Loss Function

We start by calculating the loss, also referred to as the error. This is a measure of how incorrect the model is.

The loss is a differentiable objective function that we will train the model to minimize. Depending on the task you’re trying to perform, you may choose a different loss function. In my implementation we use categorical cross-entropy loss, shown below, because this is a multi-class classification task. For a binary classification task you could use binary cross-entropy loss, and for a regression task, mean squared error.

Categorical Cross-Entropy Loss. Image by author.

This caused some confusion for me, so I would like to expand on what is happening here. The formula above implies the labels are one-hot encoded. Keras expects the labels to be one-hot encoded; however, my implementation does not. Here is an example of computing cross-entropy loss, and of why it is not necessary to one-hot encode the labels.

Given the following data from a single sample, the one-hot encoded labels (y) and our model’s prediction (ŷ), we compute the cross-entropy loss.

y = [1, 0, 0]

ŷ = [3.01929735e-07, 7.83961013e-09, 9.99999690e-01]

Computing Cross Entropy Loss w/ one-hot encoded labels. Image by author.

As you can see, the correct class for this sample was class zero, indicated by a one at the zero index of the y array. We multiply the negative log of each output probability by the corresponding label for that class, and sum across all classes.

You may have already noticed that every term except the one at the zero index comes out to zero, because anything multiplied by zero is zero. What this boils down to is simply the negative log of the predicted probability at the index of the correct class. Here the correct class was zero, so we take the negative log of our probability at the zero index.

Negative log of probabilities at the zero index. Image by author.

The total loss is the average over all samples, denoted by m in the equation. To get this figure we repeat the computation above for each sample, compute the sum and divide by the total number of samples.
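As a sketch, here is this loss computation in NumPy with integer (not one-hot) labels, which is how the implementation below stores y; the function name is just illustrative:

def categorical_cross_entropy(probs, y):
    # probs: (m, k) softmax output, y: (m,) integer class labels
    m = probs.shape[0]
    # negative log of the predicted probability at each correct class
    correct_logprobs = -np.log(probs[np.arange(m), y])
    return np.sum(correct_logprobs) / m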

Stochastic Gradient Descent

Now that we have calculated the loss, it is time to minimize it. We start by computing the gradient of the loss with respect to the output scores, and backpropagate the gradients to the parameters at each layer.

At each layer we perform computations similar to the forward pass, except instead of computing only Z and A, we have to compute a gradient for each quantity (dZ, dW, db, dA), as shown below.

Hidden Layer

Hidden Layer dZ. Image by author.
Weight Gradient @ Current Layer. Image by author.
Bias Gradient @ Current Layer – Sum Across Samples. Image by author.
Activation Gradient – this is dA[L] for the next layer. Image by author.

There is a special case of dZ in the output layer, because we are using softmax activation. This is explained in depth later on in this article.


NumPy Implementation

Data

I will be using the simple iris dataset for this model.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
def get_data(path):
    data = pd.read_csv(path, index_col=0)
    cols = list(data.columns)
    target = cols.pop()
    X = data[cols].copy()
    y = data[target].copy()
    y = LabelEncoder().fit_transform(y)
    return np.array(X), np.array(y)
X, y = get_data("<path_to_iris_csv>")

Initialize Layers
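Here is a minimal sketch of the DenseLayer class, consistent with how it is used throughout this article: it only stores its neuron count, and its forward and backward helpers are added in the sections below (see the repo linked at the end for the full class).

class DenseLayer:
    # A fully connected layer: it only knows how many neurons it has.
    # The weights themselves are created and stored by the Network object.
    def __init__(self, neurons):
        self.neurons = neurons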

Initialize Network
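And a minimal sketch of the Network container, limited to the attributes the rest of the article relies on: the layer list, the architecture description, the trainable parameters, a forward-pass cache (memory), and the gradients. Its methods are built up one at a time below.

class Network:
    def __init__(self):
        self.network = []        # the DenseLayer objects
        self.architecture = []   # dicts: input_dim / output_dim / activation
        self.params = []         # dicts: W / b, one per layer
        self.memory = []         # cached inputs and Z from the forward pass
        self.gradients = []      # dicts: dW / db, one per layer

    def add(self, layer):
        # stack layers in the order they are added
        self.network.append(layer)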


Network – The Forward Pass

Architecture

Let’s start by dynamically initializing the network architecture. This means we can initialize our network architecture for an arbitrary number of layers and neurons.

We start by creating a mapping from our dimensionality (number of features) to the number of neurons in the first layer. From there it’s pretty straightforward: the input dimension of a new layer is the number of neurons in the previous layer, and the output dimension is the number of neurons in the current layer.
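Here is a sketch of the _compile method that produces the architecture list below: the first layer takes the feature dimension of the data, each later layer chains the previous layer’s neuron count, and the final layer gets softmax.

def _compile(self, data):
    # method of Network
    for idx, layer in enumerate(self.network):
        if idx == 0:
            # first layer: feature dimension -> neurons
            self.architecture.append({'input_dim': data.shape[1],
                                      'output_dim': layer.neurons,
                                      'activation': 'relu'})
        elif idx < len(self.network) - 1:
            # hidden layers: previous neurons -> current neurons
            self.architecture.append({'input_dim': self.network[idx - 1].neurons,
                                      'output_dim': layer.neurons,
                                      'activation': 'relu'})
        else:
            # output layer: softmax over the classes
            self.architecture.append({'input_dim': self.network[idx - 1].neurons,
                                      'output_dim': layer.neurons,
                                      'activation': 'softmax'})
    return self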

model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
model.add(DenseLayer(10))
model.add(DenseLayer(3))
model._compile(X)
print(model.architecture)
Out -->
[{'input_dim': 4, 'output_dim': 6, 'activation': 'relu'},
 {'input_dim': 6, 'output_dim': 8, 'activation': 'relu'},
 {'input_dim': 8, 'output_dim': 10, 'activation': 'relu'},
 {'input_dim': 10, 'output_dim': 3, 'activation': 'softmax'}]

Parameters

Now that we’ve created a network, we need to once again dynamically initialize our trainable parameters (W, b) for an arbitrary number of layers/neurons.

As you can see, we are creating a weight matrix at each layer.

This matrix contains a vector for each neuron, and a dimension for each input feature.

There is one bias vector with a dimension for each neuron in a layer.

Also notice we are setting np.random.seed() to get consistent results each time. Try commenting out this line of code to see how it affects your results.
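And a sketch of _init_weights. The exact initialization scheme and seed value shown here are illustrative, but the shapes match the printout below: W is (output_dim, input_dim) and b is (1, output_dim).

def _init_weights(self, data):
    # method of Network
    self._compile(data)
    np.random.seed(99)   # fixed seed for reproducible runs; try commenting this out
    for layer in self.architecture:
        self.params.append({
            'W': np.random.uniform(low=-1, high=1,
                                   size=(layer['output_dim'], layer['input_dim'])),
            'b': np.zeros((1, layer['output_dim']))})
    return self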

model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
model.add(DenseLayer(10))
model.add(DenseLayer(3))
model._init_weights(X)
print(model.params[0]['W'].shape, model.params[0]['b'].shape)
print(model.params[1]['W'].shape, model.params[1]['b'].shape)
print(model.params[2]['W'].shape, model.params[2]['b'].shape)
print(model.params[3]['W'].shape, model.params[3]['b'].shape)
Out -->
(6, 4) (1, 6)
(8, 6) (1, 8)
(10, 8) (1, 10)
(3, 10) (1, 3)

Forward Propagation

A function that performs one full forward pass through the network.

We are passing the output from the previous layer as input to the next, denoted by A_prev.

We are storing the inputs and weighted sum in the model memory. This is needed to perform the backward pass.
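A sketch of _forwardprop: each layer’s forward method (defined in the next section) returns A_curr and Z_curr, and both the layer’s input and Z are cached in self.memory for the backward pass.

def _forwardprop(self, data):
    # method of Network
    self.memory = []          # reset the cache on every pass
    A_curr = data
    for i in range(len(self.params)):
        A_prev = A_curr
        A_curr, Z_curr = self.network[i].forward(
            inputs=A_prev,
            weights=self.params[i]['W'],
            bias=self.params[i]['b'],
            activation=self.architecture[i]['activation'])
        self.memory.append({'inputs': A_prev, 'Z': Z_curr})
    return A_curr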


Layers – The Forward Pass

Activation Functions

Remember, these are element-wise functions.

ReLU

Used in the hidden layers. The function and its graph were shown in the overview section. Here is the per-element logic behind np.maximum(0, x), written out as a small standalone helper.

def relu_element(x):
    # what np.maximum(0, x) computes for each element
    if x > 0:
        return x
    return 0

Softmax

Used in the final layer. This function takes an input vector of k real values and converts it to a vector of k probabilities that sum to one.

Function Notation. Image by author.

Where:

Get un-normalized values. Image by author.
Normalize for each element to obtain probability distribution. Image by author.

Single Sample Example:

Input vector = [ 8.97399717, -4.76946857, -5.33537056]

Softmax Computation Following Formula. Image by author.
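Here is a minimal softmax sketch. Subtracting the row-wise maximum before exponentiating is a standard numerical-stability trick and does not change the result; running it on the sample vector above yields roughly the probabilities shown in the forward-pass sanity check later on.

def softmax(Z):
    # Z: (m, k) scores -> (m, k) probabilities that sum to one per row
    exp_scores = np.exp(Z - np.max(Z, axis=1, keepdims=True))
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

z = np.array([[8.97399717, -4.76946857, -5.33537056]])
print(softmax(z))   # approx. [[9.99998e-01, 1.0747e-06, 6.103e-07]]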

Single Layer Forward Propagation

Where:

  • inputs = A_prev
  • weights = weight matrix of current layer
  • bias = bias vector of current layer
  • activation = activation function of current layer

We call this function in the _forwardprop method of the network and pass the parameters of the network as input.
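A sketch of this per-layer step as a DenseLayer method, reusing the relu and softmax helpers sketched earlier:

def forward(self, inputs, weights, bias, activation):
    # weighted sum: (m, input_dim) x (input_dim, output_dim) + (1, output_dim)
    Z_curr = np.dot(inputs, weights.T) + bias
    A_curr = relu(Z_curr) if activation == 'relu' else softmax(Z_curr)
    return A_curr, Z_curr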


Perform Forward Pass

model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
model.add(DenseLayer(10))
model.add(DenseLayer(3))
model._init_weights(X)
out = model._forwardprop(X)
print('SHAPE:', out.shape)
print('Probabilities at idx 0:', out[0])
print('SUM:', sum(out[0]))
Out -->
SHAPE: (150, 3)
Probabilities at idx 0: [9.99998315e-01, 1.07470169e-06, 6.10266912e-07]
SUM: 1.0

Perfect. Everything is coming together! We have 150 instances mapped to our 3 classes, and a probability distribution for each instance that sums to 1.


Network – The Backward Pass

Backpropagation

A function that performs one full backward pass through the network.

We start by computing the gradient on our scores, denoted by dscores. This is the special case of dZ mentioned in the overview section.
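Here is a sketch of _backprop, consistent with the gradient shapes and layer ordering shown later in this section. Each layer’s backward method (sketched in the next section) returns the gradient to pass on, plus dW and db for that layer.

def _backprop(self, predicted, actual):
    # method of Network
    num_samples = len(actual)
    # gradient on the scores: softmax output, minus 1 at the correct class,
    # averaged over the batch (the CS231n derivation quoted below)
    dscores = predicted.copy()
    dscores[range(num_samples), actual] -= 1
    dscores /= num_samples
    # walk from the output layer back to the input layer
    dA_prev = dscores
    self.gradients = []
    for idx in reversed(range(len(self.network))):
        dA_curr = dA_prev
        A_prev = self.memory[idx]['inputs']
        Z_curr = self.memory[idx]['Z']
        W_curr = self.params[idx]['W']
        activation = self.architecture[idx]['activation']
        dA_prev, dW_curr, db_curr = self.network[idx].backward(
            dA_curr, W_curr, Z_curr, A_prev, activation)
        self.gradients.append({'dW': dW_curr, 'db': db_curr})
    return self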

Per CS231n:

"We now wish to understand how the computed values inside z should change to decrease the loss Li that this example contributes to the full objective. In other words, we want to derive the gradient ∂Li/∂zk.

The loss Li is computed from p, which in turn depends on z. It’s a fun exercise to the reader to use the chain rule to derive the gradient, but it turns out to be extremely simple and interpretable in the end, after a lot of things cancel out:"

Derivation of dscores per Stanford’s CS231n. Source: Stanford CS231n

Where:

Introducing the term Li to denote the loss; I went in depth on this in the overview section. Source: Stanford CS231n
Introducing the term pk to denote the normalized probabilities (softmax output). Source: Stanford CS231n

Single Sample Example:

For each sample we find the index of the correct class and subtract one. Pretty simple! This is the double-indexed dscores line in the code block above. Since dscores is a matrix, we can double index using the sample and the corresponding class label.

Input vector = [9.99998315e-01, 1.07470169e-06, 6.10266912e-07]

Output vector = [-1.68496861e-06, 1.07470169e-06, 6.10266912e-07]

Here the correct index is zero, so we subtract 1 from the zero index.

Notice we start at the output layer and move to the input layer.


Layers – The Backward Pass

Activation Derivative

Function Notation of ReLU Derivative. Image by author.
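As a sketch, the ReLU derivative lets the incoming gradient through only where the forward pass was positive (the helper name here is illustrative):

def relu_derivative(dA, Z):
    # dZ = dA where Z > 0, and 0 everywhere else
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ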

Single Layer Backpropagation

This function backpropagates the gradients to each parameter in a layer.

Showing these formulas again because they are so important: these are the backward-pass computations.

Special Case of dZ – Softmax Layer. Image by author.
Hidden Layer dZ. Image by author.
dW – Consistent In All Layers. Image by author.
db – Consistent In All Layers. Image by author.
dA[L-1] – Consistent In All Layers. Image by author.

As you can see, besides the calculation of dZ, the steps are the same in each layer.
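A sketch of this per-layer step as a DenseLayer method, mirroring the formulas above. Because the softmax output layer already receives dscores, its dZ needs no extra activation derivative.

def backward(self, dA_curr, W_curr, Z_curr, A_prev, activation):
    if activation == 'softmax':
        dZ = dA_curr                        # special case: dA_curr is already dscores
    else:
        dZ = relu_derivative(dA_curr, Z_curr)
    dW = np.dot(A_prev.T, dZ)               # (input_dim, output_dim)
    db = np.sum(dZ, axis=0, keepdims=True)  # (1, output_dim)
    dA_prev = np.dot(dZ, W_curr)            # gradient handed to the previous layer
    return dA_prev, dW, db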


Perform Backward Pass

model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
model.add(DenseLayer(10))
model.add(DenseLayer(3))
model._init_weights(X)
out = model._forwardprop(X)
model._backprop(predicted=out, actual=y)
print(model.gradients[0]['dW'].shape, model.params[3]['W'].shape)
print(model.gradients[1]['dW'].shape, model.params[2]['W'].shape)
print(model.gradients[2]['dW'].shape, model.params[1]['W'].shape)
print(model.gradients[3]['dW'].shape, model.params[0]['W'].shape)
Out -->
(10, 3) (3, 10)
(8, 10) (10, 8)
(6, 8) (8, 6)
(4, 6) (6, 4)

Wow, beautiful. Remember, the gradients are computed starting from the output layer, moving backward to the input layer.


Train Model

To understand what is happening here, please revisit the overview section, where I went into depth on calculating the loss and gave an example.

Now it is time to perform a parameter update after each iteration (epoch). Let’s implement the _update method.
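Here is a sketch of _update. Remember that self.gradients was filled from the output layer backward, and that dW is stored transposed relative to W (compare the shapes in the sanity check above); the learning rate shown here is illustrative.

def _update(self, lr=0.01):
    # method of Network
    for idx, grad in enumerate(self.gradients):
        # gradients[0] belongs to the last layer, so index params from the end
        layer_idx = len(self.params) - 1 - idx
        self.params[layer_idx]['W'] -= lr * grad['dW'].T
        self.params[layer_idx]['b'] -= lr * grad['db']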

Finally, it is time to put it all together and train the model!
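And a sketch of the train loop tying everything together, with small illustrative helpers for the accuracy and the cross-entropy loss described in the overview:

def _get_accuracy(self, predicted, actual):
    # fraction of samples where the highest-probability class matches the label
    return np.mean(np.argmax(predicted, axis=1) == actual)

def _calculate_loss(self, predicted, actual):
    # average negative log probability of the correct class
    m = len(actual)
    return -np.sum(np.log(predicted[np.arange(m), actual])) / m

def train(self, X_train, y_train, epochs):
    self.loss = []
    self.accuracy = []
    self._init_weights(X_train)
    for i in range(epochs):
        yhat = self._forwardprop(X_train)
        self.accuracy.append(self._get_accuracy(predicted=yhat, actual=y_train))
        self.loss.append(self._calculate_loss(predicted=yhat, actual=y_train))
        self._backprop(predicted=yhat, actual=y_train)
        self._update()
        if i % 20 == 0:
            print(f'EPOCH: {i}, ACCURACY: {self.accuracy[-1]}, LOSS: {self.loss[-1]}')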

NumPy Model

model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
model.add(DenseLayer(10))
model.add(DenseLayer(3))
model.train(X, y, 200)
Out -->
EPOCH: 0, ACCURACY: 0.3333333333333333, LOSS: 8.40744716505373
EPOCH: 20, ACCURACY: 0.4, LOSS: 0.9217739285797661
EPOCH: 40, ACCURACY: 0.43333333333333335, LOSS: 0.7513140371257646
EPOCH: 60, ACCURACY: 0.42, LOSS: 0.6686109548451099
EPOCH: 80, ACCURACY: 0.41333333333333333, LOSS: 0.6527102403575207
EPOCH: 100, ACCURACY: 0.6666666666666666, LOSS: 0.5264810434939678
EPOCH: 120, ACCURACY: 0.6666666666666666, LOSS: 0.4708499275871513
EPOCH: 140, ACCURACY: 0.6666666666666666, LOSS: 0.5035542867669844
EPOCH: 160, ACCURACY: 0.47333333333333333, LOSS: 1.0115020349485782
EPOCH: 180, ACCURACY: 0.82, LOSS: 0.49134888468425214

Examine Results

Image by author.

Try a Different Architecture

model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
# model.add(DenseLayer(10))
model.add(DenseLayer(3))
model.train(X, y, 200)
Image by author.

Keras Equivalent

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
ohy = tf.keras.utils.to_categorical(y, num_classes=3)
model2 = Sequential()
model2.add(Dense(6, activation='relu'))
model2.add(Dense(10, activation='relu'))
model2.add(Dense(8, activation='relu'))
model2.add(Dense(3, activation='softmax'))
model2.compile(optimizer=SGD(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])
model2.fit(x=X, y=ohy, epochs=30)

Examine Results

Image by author.

Keras does not fix a random seed, so you will get different results from run to run. If you remove the np.random.seed() call from the _init_weights method, the NumPy network behaves the same way.


Conclusion

If you’ve made it this far, then congratulations. These are some mind-bending topics; that is precisely why I had to put this article together: to validate my own understanding of neural networks, and hopefully to pass the knowledge on to other devs!

Full code here: NumPy-NN/GitHub

My LinkedIn: Joseph Sasson | LinkedIn

My Email: [email protected]

Please do not hesitate to get in touch, and call out any errors / bugs you may come across in the code or math!

Thank you for reading.

