Neural Networks Overview

Math, Code, Drawings, Plots, Analogies, and Mind Maps

Mehdi Amine
Towards Data Science


Picture of my desk. This work was fueled by love, walnuts, raisins, and tea

My intent is to walk with you through the main concepts of Neural Networks using analogies, math, code, plots, drawings, and mind maps. We focus on the building block of Neural Networks: Perceptrons.

Throughout the article, we will confront the intimidating math and implement it using Python code with Numpy. We will also look at the equivalent implementation using Scikit-learn. Our results will be visualized using Matplotlib and Plotly. And at the end of each concept, we will structure our understanding using mind maps.

The Structure of a Perceptron

The input layer of a perceptron is a placeholder. It contains as many nodes as there are features in the training dataset. Each of these nodes is connected to the output node by an edge. We attribute weights to the edges and a bias to the output node.

A good analogy is to think of a perceptron as a squid. It has an input layer with many arms. The number of arms is equal to the number of inputs it needs to feed from. In this analogy, let’s think of our dataset as containing three types of ingredients: salty, sour, and spicy. Our squid needs three arms to grab one ingredient from each type. The arms are connected to the head, which is the output node where the squid mixes the ingredients and gives a score for how good they taste.

Having lived all its life in the sea, the squid can hardly notice the salty ingredients, so they don’t impact the overall taste. Towards sourness and spiciness however, it can be a real snob. The weights in a perceptron can be understood as representing how much our types of ingredients contribute towards the final taste. The bias can be understood as a factor that influences the squid’s palate, like its mood or appetite.

The input is multiplied by the corresponding weights, then summed together with the bias. This mixing of the ingredients with their respective weights and the addition with the bias is an affine function: z=𝑤x+𝑏
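A minimal sketch of this affine function in NumPy (the numbers are illustrative, not taken from any dataset):

```python
import numpy as np

x = np.array([0.2, 0.5, 0.1])  # one serving of ingredients: salty, sour, spicy
w = np.array([0.0, 0.8, 0.6])  # weights: how much each ingredient matters to the squid
b = 0.1                        # bias: the squid's mood or appetite

z = np.dot(w, x) + b           # the affine function z = wx + b
print(z)                       # 0.56
```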

After the mixing, the squid outputs a score for its impression of the taste. This score is referred to as an activation and is calculated using an activation function. The activation could simply be the result z as it is; in this case, we can use the identity function. It could be a number between -1 and 1; in this case, we can use the hyperbolic tangent function. It could also be a number between 0 and 1; in this case, we can use the sigmoid function. Or a number between 0 and ∞; in this case, we can use the rectified linear unit (ReLU) function. Finally, the squid may also be asked to give multiple scores for the same input, each score between 0 and 1 based on different criteria. In this final case, we may be interested in making all the scores add up to 1, and the softmax function is ideal for this task.

Mind Map 1: Activation Functions
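For reference, here is a minimal NumPy sketch of these activation functions (a sketch, not a definitive implementation):

```python
import numpy as np

def identity(z):
    return z                            # output unchanged

def tanh(z):
    return np.tanh(z)                   # output between -1 and 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # output between 0 and 1

def relu(z):
    return np.maximum(0.0, z)           # output between 0 and infinity

def softmax(z):
    exps = np.exp(z - np.max(z))        # shift for numerical stability
    return exps / np.sum(exps)          # outputs add up to 1
```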

The choice depends on the task and the interval of output that serves you best. Here is an example calculating the sigmoid activation 𝑎′ from the input vector 𝑎 with the weights 𝑤 and bias 𝑏:

Equation 1: sigmoid activation
Equation 2: sigmoid function
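A minimal NumPy sketch of equations 1 and 2, with illustrative values for 𝑎, 𝑤, and 𝑏:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # equation 2

a = np.array([0.2, 0.5, 0.1])            # input activations
w = np.array([0.0, 0.8, 0.6])            # weights
b = 0.1                                  # bias

a_prime = sigmoid(np.dot(w, a) + b)      # equation 1
print(a_prime)                           # ~0.636
```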

Terminology:

  • It can be confusing to see the input vector represented as an 𝑎 in equation 1 and an x in z=𝑤x+𝑏. The reason is that the nodes in the input layer are also referred to as activations. In the case of Multilayer Perceptrons, there are more than two layers, and every layer is considered an input layer for the one that comes after it.
  • The weights and bias are said to be the parameters of 𝑎′. It is possible to merge the weights and bias into a single parameter vector. This is done by prepending a 1 to the vector of inputs and prepending the bias to the vector that initially contains only the weights, as sketched below.
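A short NumPy sketch of that merge (the values are illustrative):

```python
import numpy as np

x = np.array([0.2, 0.5, 0.1])            # inputs
w = np.array([0.0, 0.8, 0.6])            # weights
b = 0.1                                  # bias

x_merged = np.concatenate(([1.0], x))    # prepend a 1 to the inputs
theta = np.concatenate(([b], w))         # prepend the bias to the weights

z = np.dot(theta, x_merged)              # identical to np.dot(w, x) + b
```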

Note on the activation functions: compositions of affine functions are themselves affine, so a network built only from affine functions cannot represent nonlinear datasets. Neural Networks are considered universal function approximators thanks to the nonlinearity introduced by the activation functions.

Training a Perceptron

De gustibus non est disputandum (in matters of taste, there can be no disputes)

We are unsatisfied with the output of our friend Squid. It seems that the parameters on which it is operating are random. Sure enough, the bias and weights have been initialized as such:
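The original initialization snippet is not reproduced here; a plausible sketch with NumPy might look like this:

```python
import numpy as np

rng = np.random.default_rng()        # illustrative setup; the exact seed is an assumption

bias = rng.standard_normal()         # one random bias for the output node
weights = rng.standard_normal(3)     # one random weight per ingredient type

print(bias, weights)
```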

We would like to train Squid into acquiring better taste. Our standard for accurate taste is a vector y containing the actual score for each row in our ingredients dataset. The evaluation of the performance of Squid will be with respect to the scores in y.

The purpose of evaluating the performance of Squid is to measure its error with respect to the targets y. There are different functions to calculate this error:

  • Mean Squared Error (MSE): a good choice if the task is regression and the dataset does not contain outliers.
  • Mean Absolute Error (MAE): a good choice if the task is regression and the dataset contains outliers.
  • Huber Loss: a combination of MSE and MAE.
  • Cross-Entropy (log loss): a good choice if the task is classification, where the output of the perceptron is a probability distribution.
Mind Map 2: Cost Functions
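Minimal NumPy sketches of these cost functions, assuming predictions a and targets y are arrays of the same shape:

```python
import numpy as np

def mse(a, y):
    # some texts include a factor of 1/2 so the derivative simplifies to (a - y)
    return np.mean((a - y) ** 2)

def mae(a, y):
    return np.mean(np.abs(a - y))

def cross_entropy(a, y):
    eps = 1e-12                      # avoid log(0)
    a = np.clip(a, eps, 1 - eps)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
```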

To evaluate our perceptron we will use the mean squared error function:

Equation 3: Mean Squared Error

Training is about minimizing the error 𝐶 by tweaking the parameters w and b. This is best visualized with the analogy of being on a mountain, trying to descend back home while it is too dark to see. Home down below represents the error 𝐶 at its minimum. Calculating the square root of the MSE gives you the straight-line distance between you and home. Knowing this distance, however, is of no help to you in the dark. What you want to know instead is the direction to take for your next step.

Directions in a 3D world account for three coordinates x, y, and z. Hence the question “where is home?” has to be answered with respect to x, y, and z. Similarly, the question “where is the minimum error 𝐶?” has to be answered with respect to the parameters w and b. The mathematical representation of these directions is the gradient of 𝐶. More specifically, the negative gradient of 𝐶: -∇𝐶.

∇𝐶 is a vector containing the partial derivatives of C with respect to each parameter. For MSE, we start by differentiating Equation 3:

Equation 4: The gradient of MSE is equal to the activation at layer L minus y

We have merely differentiated MSE; there is more work to do before we get the partial derivatives of 𝐶 with respect to the parameters. In equation 4, 𝑎 depends on the output of the activation function. In the case of the sigmoid activation, 𝑎 is equivalent to 𝑎′ in equation 1. Next comes the gradient of 𝐶 with respect to z (recall that z=𝑤x+𝑏):

Equation 5: The gradient of C with respect to z is the Hadamard product of the gradient of C with respect to a and the output of the derivative of sigmoid of z
Equation 5 is intimidating until you see the equivalent in Python code
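That snippet is not reproduced here, but a minimal sketch of equations 4 and 5 in NumPy looks like this (the values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the sigmoid

z = np.array([0.56])                  # output of the affine function
a = sigmoid(z)                        # activation, equation 1
y = np.array([0.8])                   # target score

nabla_a = a - y                       # equation 4: gradient of C with respect to a
nabla_z = nabla_a * sigmoid_prime(z)  # equation 5: element-wise (Hadamard) product
```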

We now have the gradient of 𝐶 with respect to z. Since the bias in z is multiplied by 1, the partial derivative of C with respect to the bias is:

Equation 6: The partial derivative of C with respect to b

And since the weights are multiplied by the input x, the partial derivatives of C with respect to the weights are:

Equation 7: The partial derivatives of C with respect to w
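Continuing the previous sketch, equations 6 and 7 in NumPy:

```python
import numpy as np

x = np.array([0.2, 0.5, 0.1])    # the input that produced z
nabla_z = np.array([-0.038])     # gradient of C with respect to z, from the previous sketch

nabla_b = nabla_z                # equation 6: the bias is multiplied by 1
nabla_w = nabla_z * x            # equation 7: each weight is multiplied by its input
```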

With the partial derivatives of the cost 𝐶 with respect to the parameters, we now have the direction to take for the next step towards home. Next, we need to know how wide to make the step. Choosing a good step size is important. If your step is too narrow, you won’t be able to jump over obstacles in your way. If your step is too wide, you could overshoot your whole town down below and end up on another mountain. A good step size lies somewhere in between and is obtained by multiplying the partial derivatives (equations 6 and 7) by a chosen value called the learning rate, or eta: 𝜂.

Now we can step down the slope of the mountain. This is equivalent to updating our coordinates/parameters:

Equation 8: Updating the bias
Equation 9: Updating the weights
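Equations 8 and 9 in NumPy, with an illustrative learning rate:

```python
import numpy as np

w = np.array([0.0, 0.8, 0.6])                    # current weights
b = 0.1                                          # current bias
nabla_w = np.array([-0.0076, -0.019, -0.0038])   # from equation 7 (illustrative)
nabla_b = -0.038                                 # from equation 6 (illustrative)

eta = 0.01                                       # learning rate: controls the step size
b = b - eta * nabla_b                            # equation 8: update the bias
w = w - eta * nabla_w                            # equation 9: update the weights
```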

This concludes Gradient Descent: the process of calculating the direction and size of the next step before updating the parameters. With Gradient Descent we can train Squid to acquire better taste. We do this by making Squid feed on some input and output a score using equation 1: this is referred to as Feedforward. The score is plugged as 𝑎 into equation 4, and the result is plugged as the gradient of 𝐶 with respect to 𝑎 into equation 5, which gives us the gradient of 𝐶 with respect to z. Finally, we compute the gradients of 𝐶 with respect to the parameters using equations 6 and 7, and we update the initially random parameters of Squid with equations 8 and 9. This process is referred to as Back-propagation, as it propagates the error backwards from the output layer to the input layer.

Gradient Descent is iterative. It stops when one of these conditions is met:

  • The defined maximum number of iterations has been reached.
  • The gradient reaches 0, or comes within some defined tolerance of it.
  • The validation error has reached a minimum and begins to rise. This is called early stopping.
Mind Map 3: Gradient Descent

Putting The Pieces Together

A full implementation of a Perceptron can be built from the pieces of code we have looked at. For the sake of not submerging this article with code, here is a link to a full implementation of a Perceptron.

To see our Perceptron at work, let’s make a very simple dataset. We will randomly generate two columns with a hundred rows of integers. Then we will make a third column to store our labels. The labels will be equal to the first column added to half the values in the second column.

Generating our dataset
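The original snippet is not shown here; a plausible reconstruction with NumPy (the integer range and seed are assumptions):

```python
import numpy as np

np.random.seed(42)                               # illustrative seed

x_1 = np.random.randint(0, 10, size=(100, 1))    # first feature column
x_2 = np.random.randint(0, 10, size=(100, 1))    # second feature column
y = x_1 + 0.5 * x_2                              # labels: first column plus half the second

X = np.hstack((x_1, x_2))                        # training inputs, shape (100, 2)
```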

We know that in order to reach the targets, our perceptron will have to start with random parameters and optimize them to have a bias equal to 0, the first weight equal to 1, and the second weight equal to 0.5. Let’s put it to the test:
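The linked Perceptron implementation is not reproduced here; as a compact, illustrative stand-in (not the article’s own code), we can run plain stochastic gradient descent on the dataset generated above:

```python
import numpy as np

# X and y come from the dataset generated above.
rng = np.random.default_rng(0)
w = rng.standard_normal(2)        # random initial weights
b = rng.standard_normal()         # random initial bias
eta = 0.001                       # learning rate

for epoch in range(100):
    for x_i, y_i in zip(X, y.ravel()):
        z = np.dot(w, x_i) + b    # feedforward with the identity activation
        nabla_z = z - y_i         # gradient of the squared error for one example
        b -= eta * nabla_z        # equation 8
        w -= eta * nabla_z * x_i  # equation 9

print(b, w)                       # should approach 0 and [1.0, 0.5]
```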

Great! Our perceptron has successfully optimized all the parameters.

In Scikit-learn

We started from the most basic perceptron. As it is performing regression, it does not need a nonlinear activation function. All it does so far is stochastic gradient descent. In Scikit-learn this can be achieved using the SGDRegressor class. While Scikit-learn includes a Perceptron class, it does not serve our current purpose, as it is a classifier and not a regressor.

Training before inspecting the optimized parameters of an SGDRegressor
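A sketch of what that could look like (the hyperparameters are illustrative):

```python
from sklearn.linear_model import SGDRegressor

# X and y come from the dataset generated above; SGDRegressor expects a 1-D target.
sgd = SGDRegressor(eta0=0.001, max_iter=1000, tol=1e-6)
sgd.fit(X, y.ravel())

print(sgd.intercept_)    # should approach 0
print(sgd.coef_)         # should approach [1.0, 0.5]
```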

Visualizing Gradient Descent

We can plot the steps taken by our Perceptron to see the path it took to reach the ideal parameters. Here is the code to plot our Gradient Descent using Matplotlib:

Gradient Descent visualized using Matplotlib
Output of the previous code: Gradient Descent visualized using Matplotlib
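A minimal sketch of such a plot, assuming the training loop recorded its parameter path in lists named w1_history, w2_history, and cost_history (these names are assumptions):

```python
import matplotlib.pyplot as plt

# Assumes the training loop stored its path: w1_history, w2_history, cost_history.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(w1_history, w2_history, cost_history, marker='.')  # the path taken by Gradient Descent
ax.set_xlabel('weight 1')
ax.set_ylabel('weight 2')
ax.set_zlabel('cost')
plt.show()
```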

Here is the same plot from a different angle and using Plotly this time:

Gradient Descent visualized using Plotly
Output of the previous code: Gradient Descent visualized using Plotly

I wanted you to see that our Perceptron’s descent led it home. But it was far from the straightest of paths. We can improve this with feature scaling.

Feature Scaling

It is generally the case that Machine Learning algorithms perform better with scaled numerical input. Without scaling, Gradient Descent takes longer to converge. In a 2D world where you are still trying to descend from a mountain in the dark to reach home, you need to reduce the vertical and horizontal distances separating you from home. If the two distances are on different ranges, you will spend more time reducing the distance with the larger range.

For example, if the vertical distance you need to reduce is in the thousands while the horizontal distance is in the ones, your descent will mainly be about climbing down. By the time you get close to the minimum horizontal distance, you will still need to reduce the vertical one.

Scaling the two distances to have equal ranges makes your steps affect both distances simultaneously, which enables you to travel in a straight path directly towards home.

Simplistic drawing to illustrate the effect of scaling on Gradient Descent

The two most common ways to scale data are Normalization and Standardization. We are going to implement both of them and visualize their effect on Gradient Descent.

Normalization

Also referred to as min-max scaling, normalization is a method that rescales the data to lie between 0 and 1:

Equation 10: Min-Max Scaling or Normalization
Normalizing our training dataset using Numpy
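A minimal sketch of equation 10 with NumPy, applied to the training inputs X from earlier:

```python
import numpy as np

# X comes from the dataset generated earlier.
x_min = X.min(axis=0)                         # per-feature minimum
x_max = X.max(axis=0)                         # per-feature maximum
X_normalized = (X - x_min) / (x_max - x_min)  # equation 10
```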

Normalization in Scikit-learn:

Normalizing our training dataset using Scikit-learn
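A sketch using MinMaxScaler:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)   # fit on the training data, then transform it
```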

Standardization

Also referred to as z-score normalization, standardization is a method that centers the data around 0 with a standard deviation equal to 1, where 𝜇 is the mean and 𝜎 is the standard deviation:

Equation 11: Z-Score Normalization or Standardization
Standardizing our training dataset using Numpy
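A minimal sketch of equation 11 with NumPy:

```python
import numpy as np

# X comes from the dataset generated earlier.
mu = X.mean(axis=0)                # per-feature mean
sigma = X.std(axis=0)              # per-feature standard deviation
X_standardized = (X - mu) / sigma  # equation 11
```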

Standardization in Scikit-learn:

Standardizing our training dataset using Scikit-learn
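A sketch using StandardScaler:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)   # fit on the training data, then transform it
```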
Mind Map 4: Feature Scaling

The Effect of Feature Scaling on Gradient Descent

To investigate the effect of feature scaling, we are going to train two more Perceptrons. The purpose is to compare the convergence of the parameters in Gradient Descent with and without scaling.

Our first Perceptron was trained on an unscaled dataset. The second one will be trained on normalized data, and the third one on standardized data.

Training two perceptrons with normalized and standardized data
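The original snippet uses the linked Perceptron implementation; the sketch below assumes a hypothetical Perceptron class with eta and epochs arguments (the names are assumptions):

```python
# Hypothetical API: the Perceptron class and its eta/epochs arguments are assumptions.
perceptron_normalized = Perceptron(eta=0.1, epochs=20)
perceptron_normalized.fit(X_normalized, y)

perceptron_standardized = Perceptron(eta=0.1, epochs=20)
perceptron_standardized.fit(X_standardized, y)
```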

We can now visualize the paths taken by our three Perceptrons. The code below uses Plotly:

Visualizing the three Gradient Descents
Output of the previous code: Click here to interact with the figure
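A sketch of such a comparison with Plotly, assuming each run recorded its parameter path (the history variable names are assumptions):

```python
import plotly.graph_objects as go

# Assumes each training run stored its path as (w1, w2, cost) lists.
fig = go.Figure()
fig.add_trace(go.Scatter3d(x=w1_raw, y=w2_raw, z=cost_raw,
                           mode='lines+markers', name='unscaled'))
fig.add_trace(go.Scatter3d(x=w1_norm, y=w2_norm, z=cost_norm,
                           mode='lines+markers', name='normalized'))
fig.add_trace(go.Scatter3d(x=w1_std, y=w2_std, z=cost_std,
                           mode='lines+markers', name='standardized'))
fig.update_layout(scene=dict(xaxis_title='weight 1',
                             yaxis_title='weight 2',
                             zaxis_title='cost'))
fig.show()
```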

The Perceptrons trained on scaled data took more direct paths to converge. Direct paths made their descents faster, with wider steps (made possible by increasing the learning rate 𝜂) and fewer steps (made possible by decreasing the number of epochs).

In Summary

Mind Map 5: Summary of Concepts

References

A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2019)

I. Goodfellow, Y. Bengio and A. Courville, Deep Learning (2016)

M. Nielsen, Neural Networks and Deep Learning (2019)

M. Amine, Neural Networks Overview: Jupyter notebook and playground

M. Amine, Neural Networks Overview: Code snippets

XMind, Mind Mapping Software
