Dismantling Neural Networks to Understand the Inner Workings with Math and Pytorch

Simplified math with examples and code to shed light inside black boxes

Mehdi Amine
Towards Data Science


Motivation

As a child, you might have dismantled a toy in a moment of frenetic curiosity. You were drawn perhaps towards the source of the sound it made. Or perhaps it was a tempting colorful light from a diode that called you forth, moved your hands into cracking the plastic open.

Sometimes you may have felt deceived that the inside was nowhere close to what the shiny outside led you to imagine. I hope you have been lucky enough to open the right toys. Those filled with enough intricacies to make breaking them open worthwhile. Maybe you found a futuristic looking DC-motor. Or maybe a curious looking speaker with a strong magnet on its back that you tried on your fridge. I am sure it felt just right when you discovered what made your controller vibrate.

We are going to do exactly the same. We are dismantling a neural network with math and with Pytorch. It will be worthwhile, and our toy won’t even break. Maybe you feel discouraged. That’s understandable. There are so many different and complex parts in a neural network. It is overwhelming. It is the rite of passage to a wiser state.

So to help ourselves, we will need a reference, some kind of Polaris to ensure we are on the right course. The pre-built functionalities of Pytorch will be our Polaris. They will tell us the output we must get, and it will fall upon us to find the logic that leads us to that output. If differentiation sounds like a forgotten stranger you were once acquainted with, fret not! We will make introductions again, and it will all be mighty jovial.
I hope you will enjoy it.

Linearity

The value of a neuron depends on its inputs, weights, and bias. To compute this value for all neurons in a layer, we calculate the dot product of the matrix of inputs with the matrix of weights, and we add the bias vector. We represent this concisely when we write:

Z = X·W + b: the values of all neurons in one layer, with inputs X, weights W, and bias b.

Conciseness in mathematical equations however, is achieved with abstraction of the inner workings. The price we pay for conciseness is making it harder to understand and mentally visualize the steps involved. And to be able to code and debug such intricate structures as Neural Networks we need both deep understanding and clear mental visualization. To that end, we favor verbosity:

z = w0·x0 + w1·x1 + w2·x2 + b: the value of one neuron with three inputs, three weights, and a bias.

Now the equation is grounded with constraints imposed by a specific case: one neuron, three inputs, three weights, and a bias. We have moved away from abstraction to something more concrete, something we can easily implement:
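
Here is a minimal sketch of what that implementation can look like in Pytorch (the tensor values are only illustrative):

```python
import torch

# One neuron: three inputs, three weights, and a bias (values are arbitrary).
x = torch.tensor([0.9, 0.5, 0.3])
w = torch.tensor([0.2, 0.1, 0.4])
b = torch.tensor(0.1)

# Verbose version: multiply each input by its weight, then add the bias.
z = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b

# Concise version: the dot product plus the bias gives the same value.
z_concise = torch.dot(w, x) + b

print(z, z_concise)  # tensor(0.4500) tensor(0.4500)
```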

To calculate z, we have moved forward from a layer of inputs to the next layer of neurons. When a neural network steps all the way forward through its layers and acquires knowledge, it needs to know how to go backwards to adjust its previous layers. We can achieve this backward propagation of knowledge through derivatives. Simply put, if we differentiate z with respect to each of its parameters (the weights and the bias), we can get the values of the input layer x.

If you have forgotten how to differentiate, rest assured: you won’t be told to go brush up on an entire branch of calculus. We will recall differentiations rules as we need them. The partial derivative of z with respect to a parameter tells you to consider that parameter as a variable, and all other parameters as constants. The derivative of a variable is equal to its coefficient. And the derivative of a constant is equal to zero:

∂z/∂w0 = x0. Here w0 is the variable and all else is a constant. The coefficient of w0 is x0.

Similarly, you can differentiate z with respect to w1, w2, and b (with b having the invisible coefficient of 1). You will find that every partial derivative of z is equal to the coefficient of the parameter with respect to which it is differentiated. With this in mind, we can use Pytorch Autograd to evaluate the correctness of our math.
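
The following is a sketch of that check with Autograd, reusing the same illustrative values of x, w, and b:

```python
import torch

x = torch.tensor([0.9, 0.5, 0.3])
w = torch.tensor([0.2, 0.1, 0.4], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

z = torch.dot(w, x) + b  # one step forward
z.backward()             # Autograd differentiates z with respect to w and b

print(w.grad)  # tensor([0.9000, 0.5000, 0.3000]) -> the coefficients of the weights, i.e. x
print(b.grad)  # tensor(1.) -> the invisible coefficient of b
```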

Non-linearity

We introduce non-linearity with activation functions. This enables neural networks to be universal function approximators. There are various types of activations, each one fulfills a different purpose and produces a different effect. We will go through the formula and differentiation of ReLU, Sigmoid, and Softmax.

ReLU

The Rectified Linear Unit function compares the value of a neuron with zero and outputs the maximum. We can think of ReLU labeling all nonpositive neurons as equally inactive.

ReLU(z) = max(0, z): all non-negative values stay the same, while negative values are replaced by zero.

To implement our own ReLU, we could compare z with 0 and output whichever is greater. But the clamp method provided in the Torch package can already do this for us. In Numpy, the equivalent function is called clip. The following code implements a clamp-based ReLU, before using Pytorch’s relu to evaluate its output.
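
A minimal sketch of such a comparison:

```python
import torch

def relu(z):
    # Clamp replaces every value below zero with zero.
    return z.clamp(min=0)

z = torch.tensor([-1.5, 0.0, 2.3])

print(relu(z))        # tensor([0.0000, 0.0000, 2.3000])
print(torch.relu(z))  # Pytorch's relu produces the same output
```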

The differentiation of ReLU is straightforward:

ReLU’ is either 1 or 0, depending on z.
  • For all positive z, the output of ReLU is z. Therefore the differentiation is the coefficient of z, which is equal to one.
  • For all non-positive z, the output of ReLU is equal to zero. Therefore the differentiation is also equal to zero.

Let’s translate our understanding into Python code. We will implement our own ReLU’(z) before comparing it with the automatic differentiation of Pytorch’s ReLU.
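
A sketch of what this can look like (the helper name relu_prime is just an illustrative choice):

```python
import torch

def relu_prime(z):
    # ReLU'(z): 1 for positive values of z, 0 otherwise.
    return (z > 0).float()

z = torch.tensor([-1.5, 0.0, 2.3], requires_grad=True)

torch_relu = torch.relu(z)
# torch_relu is not a scalar, so backward() needs a tensor of ones of the same shape.
torch_relu.backward(torch.ones_like(torch_relu))

print(relu_prime(z))  # tensor([0., 0., 1.])
print(z.grad)         # tensor([0., 0., 1.]) -> Autograd agrees
```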

Why are we giving a tensor of ones to backward()?
backward() defaults to the case of being called on a single scalar, where it uses the default argument torch.tensor(1.). This was previously the case when we called z.backward(). Since torch_relu is not a single scalar, we need to explicitly provide a tensor of ones equal in shape to torch_relu.

Sigmoid

The sigmoid activation function produces the effect of mapping z from ℝ to the range [0,1]. When performing binary classification, we typically label instances belonging to the target class with the value 1, and all else with the value 0. We interpret the output of sigmoid as the probability that an instance belongs to the target class.

σ(z) = 1 / (1 + e^(-z)): the sigmoid activation function maps z from ℝ to the range [0,1].

Quiz: The task of a neural network is to perform binary classification. The output layer of this network consists of a single neuron with a sigmoid activation equal to 0.1. Among the following interpretations, which one(s) are correct?

  1. There is a 0.1 probability that the instance belongs to class 1, the target class.
  2. There is a 0.1 probability that the instance belongs to class 0.
  3. There is a 0.9 probability that the instance belongs to class 0.

Solution: only 1 and 3 are correct. It is important to understand that a sigmoid-activated neuron with some output p is implicitly giving an output of 1-p for the non-targeted class. It is also important to keep in mind that p is the probability associated with the target class (usually labeled 1), while 1-p is the probability associated with the non-targeted class (usually labeled 0).

Observe that the sum of p and (1-p) is equal to 1. This seems too obvious to point out at this stage, but it will be useful to keep in mind when we discuss Softmax.

Once again, we translate the math in Python then we check our results with the Pytorch implementation of sigmoid:
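
A minimal sketch of that check:

```python
import torch

def sigmoid(z):
    # Maps any real value to the range [0, 1].
    return 1 / (1 + torch.exp(-z))

z = torch.tensor([-2.0, 0.0, 3.0])

print(sigmoid(z))        # tensor([0.1192, 0.5000, 0.9526])
print(torch.sigmoid(z))  # Pytorch's sigmoid produces the same output
```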

Sigmoid differentiation.

There is something graceful about the differentiation of sigmoid. It does, however, take a sinuous path to reach its grace. Once we recall a few differentiation rules, we will have all that we need to saunter our way down the sinuous path.

Detailed differentiation of sigmoid.
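
For reference, here is a condensed version of that derivation, applying the chain rule to (1 + e^(-z))^(-1):

```latex
\sigma(z) = (1 + e^{-z})^{-1}

\sigma'(z) = -(1 + e^{-z})^{-2} \cdot (-e^{-z})
           = \frac{e^{-z}}{(1 + e^{-z})^{2}}
           = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
           = \sigma(z) \, (1 - \sigma(z))
```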

Having understood how to differentiate sigmoid, we can now implement the math and evaluate it with Pytorch’s Autograd.
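
A sketch of that evaluation, reusing the sigmoid() defined above (sigmoid_prime is the name the note below refers to):

```python
import torch

def sigmoid(z):
    return 1 / (1 + torch.exp(-z))

def sigmoid_prime(z):
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    return sigmoid(z) * (1 - sigmoid(z))

z = torch.tensor([-2.0, 0.0, 3.0], requires_grad=True)

torch_sigmoid = torch.sigmoid(z)
torch_sigmoid.backward(torch.ones_like(torch_sigmoid))

print(sigmoid_prime(z.detach()))  # tensor([0.1050, 0.2500, 0.0452])
print(z.grad)                     # the same values from Autograd
```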

Note: sigmoid_prime() relies on the same sigmoid() implemented earlier.

Nowadays, ReLU has been widely adopted as a replacement for sigmoid. But sigmoid is still lingering around, hiding under the name of its more generalized form: Softmax.

Softmax

We think of sigmoid for binary classification, and softmax for multi-class classification. This association, while correct, misleads many of us into thinking that sigmoid and softmax are two different functions. The impression is reinforced by the fact that, at first glance, the equations of sigmoid and softmax do not seem to share any apparent link.

A softmax-activated neuron is the exponential of its value divided by the sum of the exponentials of all the neurons sharing the same layer.

Once again, the abstraction of the formula makes it anything but intuitive at first glance. An example will make it more concrete. We take a case of two output neurons, the first one (z0) outputs the probability that instances belong to a class labeled 0, the second one (z1) outputs the probability that instances belong to a class labeled 1. In fewer words, for z0 the target class is labeled 0, and for z1 the target class is labeled 1. To activate z0 and z1 with softmax we compute:

Softmax(z0) = e^(z0) / (e^(z0) + e^(z1)) and Softmax(z1) = e^(z1) / (e^(z0) + e^(z1)). Softmax is applied to each neuron in the output layer. In addition to mapping all neurons from ℝ to the range [0,1], it makes their values add up to 1.

Now we can remediate the seeming lack of an apparent link between sigmoid and softmax. We will do this by simply rewriting sigmoid:

Another way of writing sigmoid, showing that it’s actually softmax with two classes.
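
In symbols, multiplying the numerator and the denominator of sigmoid by e^z gives:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}
          = \frac{e^{z}}{e^{z} + 1}
          = \frac{e^{z}}{e^{z} + e^{0}}
```

which is exactly softmax over the two values z and 0, with the score of the non-target class fixed at 0.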

It is more common to see the first mentioned version of sigmoid than it is to see the second one. This is because the latter version is more expensive computationally. Its advantage, however, remains in helping us understand softmax.

With only two neurons in the output layer, and given the fact that softmax makes all output neurons sum up to 1: we always know that Softmax(z0) is going to be equal to 1-Softmax(z1). Hence for binary classification, it makes sense to consider z0 equal to 0, and to only compute the activation of z1 using sigmoid.

The following code implements softmax and tests it with an example of three output neurons. Then it compares our result with the result of Pytorch’s softmax.
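
A minimal sketch of that implementation (the three neuron values are illustrative; they reappear in the cross-entropy example further down):

```python
import torch

def softmax(z):
    # Exponentiate each neuron and divide by the sum of the exponentials.
    return torch.exp(z) / torch.exp(z).sum()

z = torch.tensor([0.1, 0.4, 0.2])  # three output neurons

print(softmax(z))               # tensor([0.2894, 0.3907, 0.3199])
print(torch.softmax(z, dim=0))  # Pytorch's softmax produces the same output
print(softmax(z).sum())         # the activations add up to 1
```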

We differentiate softmax activations with respect to each neuron. Keeping the same example of an output layer with two neurons, we get four softmax differentiations:

The Jacobian matrix of softmax.

Regardless of the number of output neurons, there are only two formulas for softmax differentiation. The first formula is applied when we differentiate the softmax of a neuron with respect to itself (top left and bottom right differentiations in the Jacobian). The second formula is applied when we differentiate the softmax of a neuron with respect to some other neuron (top right and bottom left differentiations in the Jacobian).
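
In symbols, writing s_i for Softmax(z_i), the two formulas are:

```latex
\frac{\partial s_i}{\partial z_i} = s_i \, (1 - s_i)
\qquad\qquad
\frac{\partial s_i}{\partial z_j} = -\, s_i \, s_j \quad (i \neq j)
```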

To understand the steps involved in the differentiation of softmax, we need to recall one more differentiation rule:

The division rule: (f/g)' = (f'g - fg') / g².

The following differentiations contain detailed steps. And although they might seem intimidating because they look dense, I assure you that they are much easier than they look, and I encourage you to redo them on paper.

Detailed partial differentiations of softmax.

The implementation of the softmax differentiation requires us to iterate through the list of neurons and differentiate with respect to each neuron. Hence two loops are involved. Keep in mind that the purpose of these implementations is not to be performant, but rather to explicitly translate the math and arrive at the same results achieved by the built-in methods of Pytorch.
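
A sketch of that double loop, checked against the Jacobian that Autograd computes for Pytorch's softmax (torch.autograd.functional.jacobian is available in recent versions of Pytorch):

```python
import torch

def softmax(z):
    return torch.exp(z) / torch.exp(z).sum()

def softmax_prime(z):
    # Build the Jacobian: one formula on the diagonal, another off the diagonal.
    s = softmax(z)
    n = len(z)
    jacobian = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            if i == j:
                jacobian[i, j] = s[i] * (1 - s[i])
            else:
                jacobian[i, j] = -s[i] * s[j]
    return jacobian

z = torch.tensor([0.1, 0.4, 0.2])

print(softmax_prime(z))
print(torch.autograd.functional.jacobian(lambda t: torch.softmax(t, dim=0), z))
```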

Cross-Entropy Loss

In the sequence of operations involved in a neural network, softmax is generally followed by the cross-entropy loss. In fact, the two functions are so closely connected that in Pytorch the method cross_entropy combines both functions in one.

I remember my first impression when I saw the formula for the cross-entropy loss. It was close to admiring hieroglyphs. After deciphering it, I hope you will share my awe towards how simple ideas can sometimes have the most complex representations.

Cross-entropy loss function.

The variables involved in calculating the cross-entropy loss are Z, p, y, m, and K. Both i and k are used as counters to iterate from 1 to m and K respectively.

  • Z: an array where each row holds the output neurons of one instance.
  • m: the number of instances.
  • K: the number of classes.
  • p: the probability, computed with softmax, that instance i belongs to class k.
  • y: the label of instance i for class k. It is either 1 or 0 depending on whether instance i belongs to class k or not.
  • log: the natural logarithm.
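
Putting these variables together, the loss reads (the superscript (i) marks the i-th instance and is only a notational choice):

```latex
J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_{k}^{(i)} \, \log\bigl(p_{k}^{(i)}\bigr)
```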

Let’s say we are performing a multi-class classification task where the number of possible classes is three (K=3). Each instance can only belong to one class. Therefore each instance is assigned a vector of labels with two zeros and a one. For example, y=[0,0,1] means that the corresponding instance belongs to class 2. Similarly, y=[1,0,0] means that the instance belongs to class 0. The index of the 1 refers to the class to which the instance belongs. We say that the labels are one-hot encoded.

Now let’s take two instances (m=2). We calculate their z values and we find: Z = [[0.1, 0.4, 0.2], [0.3, 0.9, 0.6]]. Then we calculate their softmax probabilities and find: Activations = [[0.29, 0.39, 0.32], [0.24, 0.44, 0.32]]. We know that the first instance belongs to class 2, and the second instance belongs to class 0, because: y = [[0,0,1],[1,0,0]].

To calculate cross-entropy:

  1. We take the log of the softmax activations: log(activations) = [[-1.24, -0.94, -1.14], [-1.43, -0.83, -1.13]].
  2. We multiply by -1 to get the negative log: -log(activations) = [[1.24, 0.94, 1.14], [1.43, 0.83, 1.13]].
  3. Multiplying -log(activations) by y gives: [[0., 0., 1.14], [1.43, 0., 0.]].
  4. The sum over all classes gives: [[0.+0.+1.14], [1.43+0.+0.]] = [[1.14], [1.43]]
  5. The sum over all instances gives: [1.14+1.43] = [2.57]
  6. The division by the number of instances gives: [2.57 / 2] = [1.285]

Observations:

  • Steps 3 and 4 are equivalent to simply retrieving the negative log of the target class.
  • Steps 5 and 6 are equivalent to calculating the mean.
  • The loss is equal to 1.14 when the neural network predicted that the instance belongs to the target class with a probability of 0.32.
  • The loss is equal to 1.43 when the neural network predicted that the instance belongs to the target class with a probability of 0.24.
  • We can see that in both instances the network failed to give the highest probability to the correct class. But compared to the first instance, the network assigned an even lower probability to the correct class of the second instance. Consequently, it was penalized with the higher loss of 1.43.

We combine the above steps and observations in our implementation of cross-entropy. As usual, we will also go through the Pytorch equivalent method, before comparing both outputs.
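
A sketch of that comparison, with the same Z and y as the worked example. Note that Pytorch's cross_entropy takes the raw values of Z, since it applies softmax itself:

```python
import torch
import torch.nn.functional as F

def softmax(Z):
    return torch.exp(Z) / torch.exp(Z).sum(dim=-1, keepdim=True)

def cross_entropy(Z, y):
    activations = softmax(Z)
    # Integer array indexing: pick the probability of the target class of each instance.
    neg_logs = -torch.log(activations[torch.arange(len(y)), y])
    return neg_logs.mean()

Z = torch.tensor([[0.1, 0.4, 0.2], [0.3, 0.9, 0.6]])
y = torch.tensor([2, 0])  # indices of the target classes

print(cross_entropy(Z, y))    # tensor(1.2841), close to the 1.285 we got with rounded values
print(F.cross_entropy(Z, y))  # Pytorch's combined softmax + cross-entropy agrees
```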

Note: Instead of storing the one-hot encoding of the labels, we simply store the index of the 1. For example, the previous y becomes [2,0]. Notice that at index 0 the value of y is 2, and at index 1 the value of y is 0. Using the indices of y and their values, we can directly retrieve the negative logs for the target classes. This is done by accessing -log(activations) at row 0 column 2, and at row 1 column 0. This allows us to avoid the wasteful multiplications and additions of zeros in steps 3 and 4. This trick is called integer array indexing and is explained by Jeremy Howard in his Deep Learning From The Foundations Lesson 9, at 34:57.

If going forward through the layers of a neural network can be seen as its journey to acquire some kind of knowledge, this is the place where that knowledge is found. The differentiation of the loss function tells the neural network how much it erred on each instance. Taking this error backwards, the neural network can adjust itself.

Cross-entropy differentiation.

We go through the differentiation steps of cross-entropy after we recall a couple of differentiation rules:

Recall these two differentiation rules. Also recall that ln is the same as log base e; base e is assumed throughout the article.
Cross-entropy differentiation steps.

We will not be able to evaluate the following implementation with the output of Pytorch Autograd just yet. The reason goes back to Pytorch’s cross_entropy combining softmax with cross-entropy. Consequently, using backward would also involve the differentiation of softmax in the chain rule. We discuss and implement this in the next section, Backpropagation. For now, here is our implementation of cross-entropy’:
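
A sketch of cross-entropy', differentiated with respect to the softmax activations so that we can chain it with softmax' in the next section (the function name cross_entropy_prime is an illustrative choice; the activations and one-hot labels are the ones from the worked example):

```python
import torch

def cross_entropy_prime(activations, y_one_hot):
    # Derivative of the loss with respect to the softmax activations: -y / (m * p)
    m = activations.shape[0]
    return -y_one_hot / (m * activations)

activations = torch.tensor([[0.29, 0.39, 0.32], [0.24, 0.44, 0.32]])
y_one_hot = torch.tensor([[0., 0., 1.], [1., 0., 0.]])

print(cross_entropy_prime(activations, y_one_hot))
# Zero for the non-target classes, -1.5625 and -2.0833 for the two target classes.
```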

Backpropagation

With every function we discussed, we made one step forward in the layers of a neural network, and we also made the equivalent step backward using the differentiation of the functions. Since neural networks move all the way forward before retracing their steps all the way backward, we need to discuss how to link our functions.

Forward

Going all the way forward, a neural network with one hidden layer starts by feeding input to a linear function, then feeds its output to a non-linear function, then feeds its output to a loss function. The following is an example with an instance x, its corresponding label y, three linear neurons z, each neuron computed using its three weights w and a bias b, followed by a softmax activation layer, and a cross-entropy loss.
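
A sketch of that forward pass; the values of x and y, and the randomly initialized W and b, are only illustrative:

```python
import torch

def softmax(z):
    return torch.exp(z) / torch.exp(z).sum()

# One instance with three features, and its one-hot label.
x = torch.tensor([0.9, 0.5, 0.3])
y = torch.tensor([0., 0., 1.])

torch.manual_seed(0)
W = torch.randn(3, 3)  # three neurons, each with three weights
b = torch.randn(3)     # one bias per neuron

z = W @ x + b                               # linear layer
activations = softmax(z)                    # softmax layer
loss = -(y * torch.log(activations)).sum()  # cross-entropy for a single instance
print(loss)
```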

Backward

Going all the way backward, the same neural network starts by taking the same input given to the loss function, and feeds it instead to the derivative of that loss function. The output of the derivative of the loss function is the error, what we called the acquired knowledge. To adjust its parameters, the neural network has to carry this error another step backwards to the non-linear layer, and from there another step backwards to the linear layer.

The next step backward is not as simple as feeding the error to the derivative of the non-linear function. We need to use the chain rule (which we previously recalled in the differentiation of sigmoid), and we also need to pay attention to the input we should give to each derivative.

Key rules for feedforward and backpropagation (the sketch after this list puts them into practice):

  • Functions and their derivatives take the same input.
  • Functions send their output forward to be the input of the next function.
  • Derivatives send their output backward to multiply the output of the previous derivative.
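
The following sketch puts these three rules into practice on the same small network as the forward pass above, and uses Autograd as our Polaris one last time:

```python
import torch

def softmax(z):
    return torch.exp(z) / torch.exp(z).sum()

def softmax_prime(z):
    s = softmax(z)
    return torch.diag(s) - torch.outer(s, s)  # the same Jacobian as our two loops

def cross_entropy_prime(activations, y):
    return -y / activations  # a single instance, so m = 1

x = torch.tensor([0.9, 0.5, 0.3])
y = torch.tensor([0., 0., 1.])

torch.manual_seed(0)
W = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

# Forward: linear -> softmax -> cross-entropy.
z = W @ x + b
activations = softmax(z)
loss = -(y * torch.log(activations)).sum()

# Backward by hand: each derivative takes the same input as its function,
# and multiplies the output of the previous derivative.
dloss_da = cross_entropy_prime(activations.detach(), y)  # error at the loss
dloss_dz = softmax_prime(z.detach()) @ dloss_da          # back through softmax
dloss_dW = torch.outer(dloss_dz, x)                      # back through the linear layer
dloss_db = dloss_dz

loss.backward()
print(torch.allclose(dloss_dW, W.grad), torch.allclose(dloss_db, b.grad))  # True True
```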
May your machine’s output always be in accordance with your math.

Conclusion

My impression is that a lot of people with backgrounds from different disciplines are curious and enthusiastic about Machine Learning. Unfortunately, there is a justifiable trend towards acquiring the know-how while keeping away from the intimidating math. I consider this unfortunate because I believe many people are actually eager to deepen their understanding, if only they could find more resources that acknowledge that they come from different backgrounds and might need a little reminder and a little encouragement here and there.

This article was my attempt at writing reader-friendly math. By which I mean math that reminds the reader of the rules required to follow along. And by which I also mean math with equations that avoid skipping so many steps that we are left to ponder what happened between one line and the next. Because sometimes we really need someone to take our hand and walk with us through the fields of unfamiliar concepts. My sincere hope is that I was able to reach your hand.

References

M. Amine, The Inner Workings of Neural Networks, My Colab Notebooks (2020).
M. Amine, The Inner Workings of Neural Networks, My Gists (2020).
A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2019).
S. Gugger, A simple neural net in numpy (2018).
J. Howard, Fast.ai: Deep Learning from the Foundations, Lesson 9 (2019).
The Pytorch documentation.
