
Demystifying Neural Networks Pt.2

Implementing hidden layers in NN with Python and NumPy

Photo by Hasan Almasi on Unsplash

In a previous article, we built a Logistic Regression from scratch as an example of a shallow Neural Network. Now let’s increase the complexity a bit and add one hidden layer. Even though the math gets slightly more complicated (I hope you enjoy derivatives!), we can still code the whole network with just Python and NumPy.


Logistic Regression worked well when the data was linearly separable, but it doesn’t work as well once the data is no longer linearly separable. As an example, let’s use scikit-learn again to generate some data.
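The exact dataset isn’t reproduced here, so as a minimal sketch let’s assume two interleaving half-moons from make_moons (the article’s generator and parameters may differ):

```python
import numpy as np
from sklearn.datasets import make_moons

# Assumption: two interleaving half-moons as a non-linearly-separable toy dataset
X, y = make_moons(n_samples=500, noise=0.1, random_state=42)
y = y.reshape(-1, 1)  # column vector of labels, shape (500, 1)
```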

Image by author

Obviously, no straight line will ever perfectly fit the data above and we expect Logistic Regression to perform poorly in such a scenario. Let’s apply the Logistic model we coded in the previous article to this data.

As we can see from the decision boundary, quite a few points are misclassified by the model. It’s clear we need a model able to come up with non-linear decision boundaries.

Image by author

1-Hidden Layer Neural Network

Let’s add one hidden layer to our network.

We now have one input layer, one hidden layer and one output layer. We will use the sigmoid as the activation function for both the hidden and output layers, and cross-entropy as the loss, just like we did for the Logistic Regression. We can then re-use the sigmoid and cross-entropy functions we already defined.
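For reference, a minimal version of those two helpers (the original gist is not reproduced here, so treat this as a sketch):

```python
def sigmoid(z):
    # maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_hat, eps=1e-12):
    # average binary cross-entropy; eps avoids log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```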

As we did for LR, let’s implement the building blocks of our Neural Network: Forward Propagation and Back Propagation.

Forward Propagation

This step is pretty much the same as in the Logistic Regression example, except that we now have two weight matrices: one for the hidden layer and one for the output layer. I have added the expected dimensions next to the weights of each layer for this particular example; I find it helps when working out what they should be.
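Here is a sketch of the forward pass. I’m assuming no bias terms and a (n_features, n_hidden) / (n_hidden, 1) orientation for the two weight matrices, which may differ from the original gist; the expected shapes are noted in the comments:

```python
def forward(X, W_h, W_o):
    # hidden layer: X (m, 2) @ W_h (2, n_hidden) -> a_h (m, n_hidden)
    a_h = sigmoid(X @ W_h)
    # output layer: a_h (m, n_hidden) @ W_o (n_hidden, 1) -> a_o (m, 1)
    a_o = sigmoid(a_h @ W_o)
    return a_h, a_o
```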

Back Propagation

We can reuse the code from the previous section but with some significant differences in the gradient calculation. The partial derivatives of the loss (cross-entropy) with respect to the weights have different functional forms depending on the layer, so we need to have two different gradient functions.

The gradient for the output layer is the same as the one we defined for the Logistic Regression (see here for the complete derivation).
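With the conventions above, a sketch of the output-layer gradient looks like this (a_h plays the role X played in the Logistic Regression):

```python
def grad_output(a_h, a_o, y):
    # dL/dW_o = a_h^T (a_o - y) / m, same functional form as in the Logistic Regression
    m = y.shape[0]
    return a_h.T @ (a_o - y) / m
```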

The gradient of the cross-entropy with respect to Wₕ can be obtained using the δ rule. If l is the output layer, δₗ = σ(zₗ)−y. It can be demonstrated (see here for the derivation) that δₗ₋₁ = δₕ = ((wₗ₋₁)ᵗδₗ) ⊙ σ′(zₗ₋₁), where ⊙ is the element-wise product. The gradient is then the dot product (δₗ₋₁)ᵗ⋅x. Note that σ′(zₗ₋₁) = σ(zₗ₋₁)(1−σ(zₗ₋₁)) but, by definition, σ(zₗ₋₁) is the hidden-layer activation aₗ₋₁. Consequently, σ′(zₗ₋₁) = aₗ₋₁(1−aₗ₋₁).
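In code, a sketch of the hidden-layer gradient might look as follows; here W_o is the matrix connecting the hidden layer to the output, i.e. the wₗ₋₁ appearing in the formula above:

```python
def grad_hidden(X, a_h, a_o, y, W_o):
    m = y.shape[0]
    delta_o = a_o - y                                # delta for the output layer: sigma(z) - y
    delta_h = (delta_o @ W_o.T) * a_h * (1 - a_h)    # delta rule: (w^T delta) element-wise sigma'(z)
    return X.T @ delta_h / m                         # dL/dW_h
```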

We can now implement the gradient-descent update as usual. The only difference is that we have to update both the output-layer and the hidden-layer weights.
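Inside a training loop, a single gradient-descent step could then look like this (the learning rate lr is an assumption):

```python
# compute both gradients before touching the weights, then update them together
a_h, a_o = forward(X, W_h, W_o)
dW_o = grad_output(a_h, a_o, y)
dW_h = grad_hidden(X, a_h, a_o, y, W_o)
W_o -= lr * dW_o
W_h -= lr * dW_h
```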

Note that the δ rule works quite nicely even when we add more than one layer, and it can easily be coded with a loop: at every layer we only need the weights and the activation of that layer (which we already have), plus the δ calculated at the previous step of the backward pass and the activation feeding into the layer.

Let’s see how that works. The derivative for a generic layer l−n, with n ∈ [1,..,N], is simply σ′(zₗ₋ₙ) = aₗ₋ₙ(1−aₗ₋ₙ), where aₗ₋ₙ is the activation of that layer. δₗ₋ₙ is the element-wise product of this derivative and (wₗ₋ₙ)ᵗδₗ₋ₙ₊₁, where wₗ₋ₙ are the weights for layer l−n and δₗ₋ₙ₊₁ is the δ calculated at the previous step (remember that in backward propagation the order is reversed, so layer l−n+1 comes before layer l−n). Finally, the gradient is the dot product of δᵗ and the input to the layer: X for the layer that receives the data directly, and the activation coming from the layer below for all the others.
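For illustration, here is one way such a loop could be written for a stack of sigmoid layers without biases; this generalisation is a sketch, not the article’s exact code:

```python
def backward(X, activations, weights, y):
    # Generic delta-rule sketch (assumed, not the article's original implementation).
    # weights[n] maps the input of layer n to activations[n]; activations[-1] is the network output.
    m = y.shape[0]
    grads = [None] * len(weights)
    delta = activations[-1] - y                      # delta for the output layer
    for n in reversed(range(len(weights))):
        # input to layer n: the data X for the first layer, the previous activation otherwise
        a_in = X if n == 0 else activations[n - 1]
        grads[n] = a_in.T @ delta / m
        if n > 0:
            # propagate delta one layer back: (delta w^T) element-wise a (1 - a)
            delta = (delta @ weights[n].T) * activations[n - 1] * (1 - activations[n - 1])
    return grads
```

For the 1-hidden-layer network, backward(X, [a_h, a_o], [W_h, W_o], y) returns exactly the two gradients defined above.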

Piecing everything together

Let’s put everything together.
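A minimal training loop built from the pieces sketched above; the hidden-layer size, learning rate and number of epochs are illustrative assumptions, not necessarily the values used in the article:

```python
rng = np.random.default_rng(0)
n_hidden, lr, n_epochs = 4, 1.0, 5000

# small random initial weights for both layers
W_h = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
W_o = rng.normal(scale=0.5, size=(n_hidden, 1))

losses = []
for epoch in range(n_epochs):
    a_h, a_o = forward(X, W_h, W_o)
    losses.append(cross_entropy(y, a_o))
    dW_o = grad_output(a_h, a_o, y)
    dW_h = grad_hidden(X, a_h, a_o, y, W_o)
    W_o -= lr * dW_o
    W_h -= lr * dW_h

print(f"1-Hidden Layer NN Min Loss: {min(losses):.2f}")
```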

Let’s compare the loss for the two models. The 1-Hidden Layer Neural Network achieves a smaller loss than LR.

Logistic Regression Min Loss: 0.28
1-Hidden Layer NN Min Loss: 0.18

Visualising the result

We need to define a prediction function for our model in order to use it later on.
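A simple thresholded prediction function, assuming the usual 0.5 cut-off:

```python
def predict(X, W_h, W_o, threshold=0.5):
    # class 1 whenever the output activation exceeds the threshold
    _, a_o = forward(X, W_h, W_o)
    return (a_o >= threshold).astype(int)
```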

Let’s look at the decision boundary again. Our 1-Hidden Layer classifier performs a lot better than the Logistic Regression: the non-linear decision boundary correctly classifies all the red points. There are still some misclassified blue points but, overall, the classifier does a good job of separating the two classes.
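As a sketch, the boundary can be drawn by evaluating the classifier on a grid of points (the matplotlib styling here is illustrative):

```python
import matplotlib.pyplot as plt

# evaluate the classifier on a dense grid covering the data
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
zz = predict(grid, W_h, W_o).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3, cmap="coolwarm")
plt.scatter(X[:, 0], X[:, 1], c=y.ravel(), cmap="coolwarm", edgecolors="k")
plt.show()
```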

Image by author

The complete code is available here. Let me know in the comments what you think!

