
Introduction
In this article, I will walk through the development of an artificial neural network from scratch using NumPy. The architecture of this model is the most basic of all ANNs – a simple feed-forward network. I will also show the Keras equivalent of this model, as I tried to make my implementation ‘Keras-esque’. Although the feed-forward architecture is basic compared to other Neural Networks such as transformers, these core concepts can be extrapolated to build more complex ANNs. These topics are inherently technical. For a more conceptual article on AI please check out my other post, Demystifying Artificial Intelligence.
Table of Contents
- Architecture Overview
- Forward Pass
- Backward Pass
- NumPy Implementation – Data
- Construct Layers
- Construct Network
- Network: The Forward Pass
- Layers: The Forward Pass
- Perform Forward Pass / Do Sanity Check
- Network: The Backward Pass
- Layers: The Backward Pass
- Perform Backward Pass / Do Sanity Check
- Train Model
- Conclusion
Term Glossary
- X = inputs
- y = labels
- W = weights
- b = bias
- Z = dot product of X and W plus b
- A = activation(Z)
- k = number of classes
- Lower-case letters denote vectors; upper-case letters denote matrices
Architecture
Forward Pass
The Dot Product
First, we compute the dot product of our inputs & weights and add a bias term.

Second, we put the weighted sum obtained in step one through an activation function.
Both of these operations are straightforward, so I will not go into depth here. For more on dot products and activation functions, see Dot Product and Activation Functions. These computations take place in every neuron in each hidden layer.
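To make the two steps concrete, here is a tiny NumPy sketch for a single layer; the array values, shapes, and names are purely illustrative:
import numpy as np
X = np.array([[5.1, 3.5, 1.4, 0.2]])       # one sample with 4 features
W = np.random.uniform(-1, 1, size=(6, 4))  # a layer of 6 neurons, one weight per feature
b = np.zeros((1, 6))                       # one bias per neuron
Z = X @ W.T + b                            # step one: weighted sum (dot product plus bias)
A = np.maximum(0, Z)                       # step two: activation (ReLU here)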


The Activation Function
In my implementation we use ReLU activation in the hidden layers because it is easy to differentiate, and Softmax activation in the output layer (more on this below). In future versions, I will build it out to be more robust and enable any of these activation functions.
Commonly used activation functions:



Backward Pass
The Loss Function
We start by calculating the loss, also referred to as the error. This is a measure of how incorrect the model is.
The loss is a differentiable objective function that we will train the model to minimize. Depending on the task you’re trying to perform, you may choose a different loss function. In my implementation we use categorical cross-entropy loss, shown below, because this is a multi-class classification task. For a binary classification task you could use binary cross-entropy loss; for a regression task, mean squared error.
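For reference, the standard form of categorical cross-entropy over m samples and k classes, with one-hot labels y and predicted probabilities ŷ, is:
$$L = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} y_{ij}\,\log(\hat{y}_{ij})$$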

This caused some confusion for me, so I would like to expand on what is happening here. The formula above implies the labels are one-hot encoded. Keras expects the labels to be one-hot, but my implementation does not. Here is an example of computing cross-entropy loss, which also shows why it is not necessary to one-hot encode the labels.
Given the following data from a single sample, the one-hot encoded labels (y) and our model's prediction (ŷ), we compute the cross-entropy loss.
y = [1, 0, 0]
ŷ = [3.01929735e-07, 7.83961013e-09, 9.99999690e-01]

As you can see, the correct class for this sample was zero, indicated by a one at the zero index of the y array. We multiply the negative log of each output probability by the corresponding label for that class, and sum across all classes.
You may have already noticed that every term besides the one at the zero index works out to zero, because anything multiplied by zero is zero. What this boils down to is simply the negative log of the probability at the index of the correct class. Here the correct class was zero, so we take the negative log of our probability at the zero index.

The total loss is the average over all samples, denoted by m in the equation. To get this figure we repeat the computation above for each sample, compute the sum and divide by the total number of samples.
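Here is a small NumPy sketch of that computation when the labels are kept as integer class indices, which is exactly why one-hot encoding is not necessary; the variable names are illustrative rather than the author's exact code:
import numpy as np
probs = np.array([[3.01929735e-07, 7.83961013e-09, 9.99999690e-01]])  # the sample from above
y = np.array([0])                                   # integer label: correct class index
m = len(y)
correct_class_probs = probs[np.arange(m), y]        # probability assigned to each correct class
loss = np.mean(-np.log(correct_class_probs))        # average of -log over all m samples
print(loss)                                         # ~15.01, large because the model is badly wrong here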
Stochastic Gradient Descent
Now that we have calculated the loss, it is time to minimize it. We start by computing the gradient of the loss with respect to the output scores, and backpropagate the gradients to the parameters at each layer.
At each layer we perform computations similar to the forward pass, except that instead of computing only Z and A, we compute a gradient for each quantity (dZ, dW, db, dA), as shown below.
Hidden Layer



![Activation Gradient - this is dA[L] for the next layer. Image by author.](https://towardsdatascience.com/wp-content/uploads/2022/04/1cba69YG7qqrEh8LaGv674Q.png)
There is a special case of dZ in the output layer, because we are using softmax activation. This is explained in depth later on in this article.
NumPy Implementation
Data
I will be using the simple iris dataset for this model.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def get_data(path):
    data = pd.read_csv(path, index_col=0)
    cols = list(data.columns)
    target = cols.pop()                        # the last column holds the labels
    X = data[cols].copy()
    y = data[target].copy()
    y = LabelEncoder().fit_transform(y)        # encode the string labels as integers 0..k-1
    return np.array(X), np.array(y)
X, y = get_data("<path_to_iris_csv>")
Initialize Layers
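The layer object itself can stay very light. Here is a minimal, hypothetical sketch; the real class may carry more state, and the activation is assigned later when the network compiles its architecture:
class DenseLayer:
    def __init__(self, neurons):
        self.neurons = neurons                 # number of neurons in this layer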
Initialize Network
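The network is essentially a container of layers plus the methods used throughout the rest of this article (_compile, _init_weights, _forwardprop, _backprop, _update, train). A bare-bones, hypothetical skeleton might look like this:
class Network:
    def __init__(self):
        self.layers = []          # DenseLayer objects, in order
        self.architecture = []    # dicts of input_dim / output_dim / activation
        self.params = []          # W and b for each layer
        self.memory = []          # cached inputs and Z from the forward pass
        self.gradients = []       # dW and db for each layer

    def add(self, layer):
        self.layers.append(layer)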
Network – The Forward Pass
Architecture
Let’s start by dynamically initializing the network architecture. This means we can initialize our network architecture for an arbitrary number of layers and neurons.
We start by creating a matrix which maps our dimensionality (# of features), to the number of neurons in the input layer. From there it’s pretty straightforward – the input dimension of a new layer is the number of neurons in the previous layer, the output dimension is the number of neurons in the current layer.
model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
model.add(DenseLayer(10))
model.add(DenseLayer(3))
model._compile(X)
print(model.architecture)
Out -->
[{'input_dim': 4, 'output_dim': 6, 'activation': 'relu'},
{'input_dim': 6, 'output_dim': 8, 'activation': 'relu'},
{'input_dim': 8, 'output_dim': 10, 'activation': 'relu'},
{'input_dim': 10, 'output_dim': 3, 'activation': 'softmax'}]
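Under the hood, a compile step along the following lines would produce that list. This is a standalone, hypothetical sketch rather than the author's exact method:
def build_architecture(X, layer_sizes):
    arch = []
    input_dim = X.shape[1]                     # dimensionality of the data (4 features for iris)
    for i, size in enumerate(layer_sizes):
        activation = 'softmax' if i == len(layer_sizes) - 1 else 'relu'
        arch.append({'input_dim': input_dim, 'output_dim': size, 'activation': activation})
        input_dim = size                       # the next layer's input is this layer's output
    return arch

build_architecture(X, [6, 8, 10, 3])           # reproduces the list printed above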
Parameters
Now that we’ve created a network, we need to once again dynamically initialize our trainable parameters (W, b), for an arbitrary number of layers/neurons.
As you can see, we are creating a weight matrix at each layer.
This matrix contains a vector for each neuron, and a dimension for each input feature.
There is one bias vector with a dimension for each neuron in a layer.
Also notice we are setting a np.random.seed(), to get consistent results each time. Try commenting out this line of code to see how it affects your results.
model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
model.add(DenseLayer(10))
model.add(DenseLayer(3))
model._init_weights(X)
print(model.params[0]['W'].shape, model.params[0]['b'].shape)
print(model.params[1]['W'].shape, model.params[1]['b'].shape)
print(model.params[2]['W'].shape, model.params[2]['b'].shape)
print(model.params[3]['W'].shape, model.params[3]['b'].shape)
Out -->
(6, 4) (1, 6)
(8, 6) (1, 8)
(10, 8) (1, 10)
(3, 10) (1, 3)
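A standalone sketch of an initialization routine that would produce those shapes; the exact random distribution and seed value used in the author's _init_weights may differ:
import numpy as np

def init_params(architecture, seed=42):
    np.random.seed(seed)                       # fixed seed, so every run starts from the same weights
    params = []
    for layer in architecture:
        params.append({
            'W': np.random.uniform(-1, 1, size=(layer['output_dim'], layer['input_dim'])),
            'b': np.zeros((1, layer['output_dim'])),
        })
    return params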
Forward Propagation
A function that performs one full forward pass through the network.
We pass the output of the previous layer as input to the next, denoted by A_prev.
We store the inputs and the weighted sum in the model’s memory. These cached values are needed to perform the backward pass.
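Since the method itself lives on the Network class, here is a standalone sketch of the same idea, assuming W is stored as (neurons, input_dim) and b as (1, neurons) like the shapes printed above; the function and variable names are hypothetical:
import numpy as np

def forward_prop(X, params, architecture):
    memory = []                                          # cache for the backward pass
    A_curr = X
    for layer, p in zip(architecture, params):
        A_prev = A_curr
        Z_curr = A_prev @ p['W'].T + p['b']              # weighted sum for this layer
        if layer['activation'] == 'relu':
            A_curr = np.maximum(0, Z_curr)
        else:                                            # softmax in the output layer
            exps = np.exp(Z_curr - Z_curr.max(axis=1, keepdims=True))
            A_curr = exps / exps.sum(axis=1, keepdims=True)
        memory.append({'inputs': A_prev, 'Z': Z_curr})
    return A_curr, memory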
Layers – The Forward Pass
Activation Functions
Remember, these are element-wise functions.
ReLU
Used in the hidden layers. The function and graph were mentioned in the overview section. Here is what’s happening when we call np.maximum().
# this is what np.maximum(0, x) does to each element x of the input array
def relu_single(x):
    if x > 0:
        return x
    return 0
Softmax
Used in the final layer. This function takes an input vector of k real values and converts it to a vector of k probabilities that sum to one.

Where:


Single Sample Example:
Input vector = [ 8.97399717, -4.76946857, -5.33537056]
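A quick sketch of that computation in NumPy; subtracting the max before exponentiating is a common numerical-stability trick and may or may not appear in the author's code, but it does not change the result:
import numpy as np
z = np.array([8.97399717, -4.76946857, -5.33537056])
exp_z = np.exp(z - np.max(z))              # exponentiate (shifted by the max for stability)
probs = exp_z / exp_z.sum()                # normalize so the values sum to one
print(probs)                               # ~[9.999983e-01, 1.0747e-06, 6.1027e-07]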

Single Layer Forward Propagation
Where:
- inputs = A_prev
- weights = weight matrix of current layer
- bias = bias vector of current layer
- activation = activation function of current layer
We call this function in the _forwardprop method of the network and pass the parameters of the network as input.
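Broken out as its own function, the per-layer step might look like the following sketch, using the parameters listed above (hypothetical names; it is the same computation performed inside the network-level loop sketched earlier):
import numpy as np

def single_layer_forward(inputs, weights, bias, activation='relu'):
    Z_curr = inputs @ weights.T + bias                   # weighted sum
    if activation == 'relu':
        A_curr = np.maximum(0, Z_curr)
    else:                                                # 'softmax'
        exps = np.exp(Z_curr - Z_curr.max(axis=1, keepdims=True))
        A_curr = exps / exps.sum(axis=1, keepdims=True)
    return A_curr, Z_curr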
Perform Forward Pass
model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
model.add(DenseLayer(10))
model.add(DenseLayer(3))
model._init_weights(X)
out = model._forwardprop(X)
print('SHAPE:', out.shape)
print('Probabilities at idx 0:', out[0])
print('SUM:', sum(out[0]))
Out -->
SHAPE: (150, 3)
Probabilities at idx 0: [9.99998315e-01, 1.07470169e-06, 6.10266912e-07]
SUM: 1.0
Perfect. Everything is coming together! We have 150 instances mapped to our 3 classes, and a probability distribution for each instance that sums to 1.
Network – The Backward Pass
Backpropagation
A function that performs one full backward pass through the network.
We start by computing the gradient on our scores, denoted by dscores. This is the special case of dZ mentioned in the overview section.
Per CS231n:
"We now wish to understand how the computed values inside z should change to decrease the loss Li that this example contributes to the full objective. In other words, we want to derive the gradient ∂Li/∂zk.
The loss Li is computed from p, which in turn depends on z. It’s a fun exercise to the reader to use the chain rule to derive the gradient, but it turns out to be extremely simple and interpretable in the end, after a lot of things cancel out:"

Where:


Single Sample Example:
For each sample we find the index of the correct class and subtract one. Pretty simple! This is line 9 in the code block above. Since dscores is a matrix we can double index using the sample and corresponding class label.
Input vector = [9.99998315e-01, 1.07470169e-06, 6.10266912e-07]
Output vector = [-1.68496861e-06, 1.07470169e-06, 6.10266912e-07]
Here the correct index is zero, so we subtract 1 from the zero index.
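Here is that step as a small NumPy sketch for the single sample above; a full implementation would typically also average over the number of samples m at some point, which is omitted here:
import numpy as np
probs = np.array([[9.99998315e-01, 1.07470169e-06, 6.10266912e-07]])
y = np.array([0])                          # correct class index for this sample
dscores = probs.copy()
dscores[np.arange(len(y)), y] -= 1         # subtract 1 at each sample's correct class
print(dscores)                             # -> [-1.68496861e-06, 1.07470169e-06, 6.10266912e-07], matching the output vector above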
Notice we start at the output layer and move to the input layer.
Layers – The Backward Pass
Activation Derivative

Single Layer Backpropagation
This function backpropagates the gradients to each parameter in a layer.
Showing these again because they are so important. These are the backward pass computations.




![dA[L-1] - Consistent In All Layers. Image by author.](https://towardsdatascience.com/wp-content/uploads/2022/04/1uxbYJ5AcrTpNYN7sXq-IKA.png)
As you can see, besides the calculation of dZ, the steps are the same in each layer.
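As a standalone sketch, a single layer's backward step could look like this. Note that the shape check in the next section shows the author's code storing dW as the transpose of W and collecting gradients from the output layer backward; the version below follows that transposed convention, and the exact batch scaling in the original may differ:
import numpy as np

def single_layer_backward(dA_curr, W_curr, Z_curr, A_prev, activation='relu'):
    m = A_prev.shape[0]
    if activation == 'relu':
        dZ_curr = np.where(Z_curr > 0, dA_curr, 0)   # ReLU derivative: pass the gradient only where Z > 0
    else:
        dZ_curr = dA_curr                            # softmax output layer: dZ (dscores) is computed from probs and labels
    dW_curr = A_prev.T @ dZ_curr / m                 # shape (prev_neurons, curr_neurons): the transpose of W
    db_curr = dZ_curr.sum(axis=0, keepdims=True) / m
    dA_prev = dZ_curr @ W_curr                       # gradient flowing back to the previous layer
    return dA_prev, dW_curr, db_curr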
Perform Backward Pass
model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
model.add(DenseLayer(10))
model.add(DenseLayer(3))
model._init_weights(X)
out = model._forwardprop(X)
model._backprop(predicted=out, actual=y)
print(model.gradients[0]['dW'].shape, model.params[3]['W'].shape)
print(model.gradients[1]['dW'].shape, model.params[2]['W'].shape)
print(model.gradients[2]['dW'].shape, model.params[1]['W'].shape)
print(model.gradients[3]['dW'].shape, model.params[0]['W'].shape)
Out -->
(10, 3) (3, 10)
(8, 10) (10, 8)
(6, 8) (8, 6)
(4, 6) (6, 4)
Wow, beautiful. Remember, the gradients are computed starting from the output layer, moving backward to the input layer.
Train Model
To understand what is happening here, please revisit the overview section where I went into depth on calculating the loss & gave an example.
Now it is time to perform a parameter update after each iteration (epoch). Let’s implement the _update method.
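The update itself is plain stochastic gradient descent. A generic sketch for one layer is shown below; because this article's code stores dW transposed relative to W (see the shape check above), the real _update also has to transpose and match up the reversed gradient list, and the learning rate value here is only an assumption:
def sgd_update(W, b, dW_T, db, lr=0.01):
    W -= lr * dW_T.T                        # step against the gradient (dW stored transposed)
    b -= lr * db
    return W, b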
Finally time to put it all together and train the model!
NumPy Model
model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
model.add(DenseLayer(10))
model.add(DenseLayer(3))
model.train(X, y, 200)
Out -->
EPOCH: 0, ACCURACY: 0.3333333333333333, LOSS: 8.40744716505373
EPOCH: 20, ACCURACY: 0.4, LOSS: 0.9217739285797661
EPOCH: 40, ACCURACY: 0.43333333333333335, LOSS: 0.7513140371257646
EPOCH: 60, ACCURACY: 0.42, LOSS: 0.6686109548451099
EPOCH: 80, ACCURACY: 0.41333333333333333, LOSS: 0.6527102403575207
EPOCH: 100, ACCURACY: 0.6666666666666666, LOSS: 0.5264810434939678
EPOCH: 120, ACCURACY: 0.6666666666666666, LOSS: 0.4708499275871513
EPOCH: 140, ACCURACY: 0.6666666666666666, LOSS: 0.5035542867669844
EPOCH: 160, ACCURACY: 0.47333333333333333, LOSS: 1.0115020349485782
EPOCH: 180, ACCURACY: 0.82, LOSS: 0.49134888468425214
Examine Results


Try a Different Architecture
model = Network()
model.add(DenseLayer(6))
model.add(DenseLayer(8))
# model.add(DenseLayer(10))
model.add(DenseLayer(3))
model.train(X, y, 200)


Keras Equivalent
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
ohy = tf.keras.utils.to_categorical(y, num_classes=3)
model2 = Sequential()
model2.add(Dense(6, activation='relu'))
model2.add(Dense(8, activation='relu'))
model2.add(Dense(10, activation='relu'))
model2.add(Dense(3, activation='softmax'))
model2.compile(SGD(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])
model2.fit(x=X, y=ohy, epochs=30)
Examine Results


Keras does not set a random seed by default, so you will get slightly different results on each run. If you remove the np.random.seed() line from the _init_weights method, the NumPy network behaves the same way.
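If you want more repeatable Keras runs, you can seed the relevant generators before building the model; this is a quick sketch, and fully deterministic training may require additional settings:
import numpy as np
import tensorflow as tf
np.random.seed(42)         # NumPy-side randomness
tf.random.set_seed(42)     # TensorFlow weight initialization and ops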
Conclusion
If you’ve made it this far, then congratulations. These are some mind-bending topics – that is precisely why I had to put this article together. To validate my own understanding of neural networks, and hopefully pass on the knowledge to other devs!
Full code here: NumPy-NN/GitHub
My LinkedIn: Joseph Sasson | LinkedIn
My Email: [email protected]
Please do not hesitate to get in touch, and call out any errors / bugs you may come across in the code or math!
Thank you for reading.