Nothing but NumPy: Understanding & Creating Binary Classification Neural Networks with Computational Graphs from Scratch
Nothing but NumPy is a continuation of my neural network series. This post continues from Understanding and Creating Neural Networks with Computational Graphs from Scratch; refer to that post for the previous installment or for a refresher on neural networks.
It’s easy to feel lost when you have twenty browser tabs open trying to understand a complex concept and most of the writeups you come across regurgitate the same shallow explanations. In this second installment of Nothing but NumPy, I’ll again strive to give the reader a deeper understanding of neural networks as we delve deeper into a specific kind of neural network called a “Binary Classification Neural Network”. If you’ve read my previous post then this will seem very familiar.
Understanding “Binary Classification” lets us lay down the major concepts behind many of the choices we make in multi-class classification, which is why this post will also serve as a prelude to “Understanding & Creating Softmax Layer with Computational Graphs from Scratch”.
This blog post is divided into two parts: the first covers the basics of a Binary Classification Neural Network, and the second comprises the code for implementing everything learned in the first.
Part Ⅰ: Understanding Binary Classification
Let’s dig in🍽️
Binary classification is a common machine learning task. It involves predicting whether a given example is part of one class or the other. The two classes can be arbitrarily assigned either a “0” or a “1” for mathematical representation, but more commonly the object/class of interest is assigned a “1”(positive label) and the rest a “0”(negative label). For example:
- Is the given picture of a cat(1) or not-a-cat(0)?
- Given a patient’s test results, is the tumor benign(0; harmless) or malignant(1; harmful)?
- Given a person’s information (e.g. age, education level, marital status, etc.) as features, predict whether they make less than $50K(0) or more than $50K(1) a year.
- Is the given email spam(1) or not-spam(0)?
In all the examples above the object/class of interest is assigned a positive label(1).
Most of the time it will be fairly obvious whether a given machine learning problem requires binary classification or not. A general rule of thumb is that binary classification helps us answer yes(1)/no(0) questions.
Now let’s build a simple 1-layer neural network(input and output layers only) and solve it by hand to get a better picture. (We’ll make a neural network the same as the one elaborated in my previous post, but with one key difference: the output of the neural network is interpreted as a probability instead of a raw value.)
Let’s expand this neural network out to reveal its intricacies.
For those not familiar with all the different parts of a neural network I’ll go over each of them briefly. (A more detailed explanation is provided in my previous post)
- Inputs: x₁ and x₂ are the input nodes for two features that represent an example we want our neural network to learn from. Since input nodes form the first layer of the network they are collectively referred to as the “input layer”.
- Weights: w₁ & w₂ represent the weight values that we associate with the inputs x₁ & x₂, respectively. Weights control the influence each input has in the calculation of the next node. A neural network “learns” these weights to make accurate predictions. Initially, weights are randomly assigned.
- Linear Node(z): The “z” node creates a linear function out of all the inputs coming into it i.e z = w₁x₁+w₂ x₂+b
- Bias: “b” represents the bias node. The bias node inserts an additive quantity into the linear function node(z). As the name suggests the bias sways the output so that it may better align with our desired output. The value of the bias is initialized to b=0 and is also learned during the training phase.
- Sigmoid Node: This σ node, called the Sigmoid node, takes the input from a preceding linear node(z) and passes it through the following activation function, called the Sigmoid function(because of its S-shaped curve), also known as the Logistic function: σ(z) = 1/(1+e⁻ᶻ)
Sigmoid is one of the many “activation functions” used in neural networks. Activation functions are non-linear functions(not simple straight lines). They add non-linearity to a neural network by expanding its dimensionality, in turn, helping it learn complex things(for more details please refer to my previous post). Since the Sigmoid node is the last node in our neural network, it is the output of the neural network and is, therefore, called the “output layer”.
A linear node(z) combined with a bias node(b) and an activation node, such as the sigmoid node(σ), forms a “neuron” in an artificial neural network.
In neural network literature, every neuron in an artificial neural network is assumed to have a linear node along with its corresponding bias, hence the linear node and bias nodes are not shown in neural network diagrams, as in Fig.1. To get a deeper understanding of the computations in a neural network I will continue to show expanded versions of neural networks in this blog post, as in Fig.2.
The use of a single Sigmoid/Logistic neuron in the output layer is the mainstay of a binary classification neural network. This is because the output of a Sigmoid/Logistic function can be conveniently interpreted as the estimated probability(p̂, pronounced p-hat) that the given input belongs to the “positive” class. How? Let’s delve a bit deeper.
The Sigmoid function squashes any input into the output range 0<σ<1. So, for example, if we were creating a neural network-based “cat(1) vs. not-cat(0)” detector, given images as input examples, our output layer will still be a single Sigmoid neuron, converting all the calculations from previous layers into p̂, a simple 0–1 output range.
We can then simply interpret p̂ as “What is the probability that the given input image is of a cat?”, where “cat” is the positive label. If p̂≈0, then it is highly unlikely that the input image is of a cat; on the other hand, if p̂≈1, then it is very likely that the input image is of a cat. Simply put, p̂ is how confident our neural network model is in predicting that the input is a cat i.e the positive class(1).
This can be mathematically summarized simply as a conditional probability:
Since every binary classification neural net architecture has a single Sigmoid neuron in the output layer, as shown in Fig.6 above, the output of the Sigmoid (estimated probability) depends on the output of the linear node(z) associated with the neuron. If the value of the linear node(z) is :
- Greater than zero(z>0) then the output of the Sigmoid node is greater than 0.5(σ(z)>0.5), which can be interpreted as “The probability that the input image is of a cat is greater than 50%”.
- Less than zero(z<0) then the output of the Sigmoid node is less than 0.5(σ(z)<0.5), which can be interpreted as “The probability that the input image is of a cat is less than 50%”.
- Equal to zero(z=0) then the output of the Sigmoid node equals 0.5(σ(z)=0.5), which means that “The probability that the input image is of a cat is exactly 50%”.
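The three cases above are easy to check numerically. Here is a minimal sketch, assuming the standard logistic formula σ(z) = 1/(1+e⁻ᶻ); the helper name `sigmoid` and the sample z values are illustrative choices of mine, not something fixed by this post:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Sample values of the linear node z, chosen only for illustration
for z in (-2.0, 0.0, 2.0):
    p_hat = sigmoid(z)
    side = "above" if p_hat > 0.5 else ("below" if p_hat < 0.5 else "exactly")
    print(f"z = {z:+.1f} -> p_hat = {p_hat:.3f} ({side} 0.5)")
```

The sign of z alone decides which side of 0.5 the estimated probability lands on, which is what the decision boundary discussion later relies on.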
Now that we know what everything represents in our neural network let’s see what calculations our binary classification neural network performs given the following data set:
The data above represents the AND logic gate, where the output is given a positive label(1) only when both the inputs are x₁=1 and x₂=1, all other cases are assigned a negative label(0). Each row of the data represents an example we want our neural network to learn from and then classify. I have also plotted the points on a 2-D plane so that it is easy to visualize(red dots represent points where the class(y) is 0 and the green cross represents the point where the class is 1). This data set also happens to be linearly separable i.e. we can draw a straight line to separate the positive labeled examples from the negative ones.
The blue line shown above, called the decision boundary, separates our two classes. Above the line is our positive labeled example(green cross) and below the line are our negative labeled examples(red crosses). Behind the scenes, this blue line is formed by the z(linear function) node. We’ll later see how the neural network learns this decision boundary.
Like my previous blog post, first, we will perform Stochastic Gradient Descent, which is training a neural network using just one example from our training data. Then we’ll generalize our learnings from the stochastic process to Batch Gradient Descent(preferred method) where we train the neural network using all the examples in the training data.
Stochastic Gradient Descent
Computations in a neural network move from left to right; this is called forward propagation. Let’s go through all the forward computations our neural network will perform when provided with just the first training example x₁ = 0 and x₂ = 0. Also, we’ll randomly initialize the weights to w₁=0.1 and w₂=0.6 and bias to b=0.
So, the prediction of the neural network is p̂=0.5. Recall, this is a binary classification neural network, p̂ here represents the estimated probability that the input example, with features x₁=0 & x₂=0, belongs to the positive class(1). Our neural network currently thinks that there is a 0.5(or 50%) chance that the first training example belongs to the positive class (recall from the probability equation this equates to P(1∣ x₁, x₂; w,b)=p̂=0.5).
Yikes! This is kinda poor 😕, especially since the negative label is associated with the first example, i.e y=0. The estimated probability should be around p̂≈0; it should be very unlikely that the first example belongs to the positive class, so that the chance of belonging to the negative class is high (i.e P(0∣ x₁, x₂; w,b)=1-p̂≈1).
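The forward pass on the first training example can be reproduced in a few lines. This is only a sketch of the computation described above, using the initial values from the text (w₁=0.1, w₂=0.6, b=0); the variable names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# First training example of the AND gate and its label
x1, x2, y = 0.0, 0.0, 0.0

# Initial parameters from the text
w1, w2, b = 0.1, 0.6, 0.0

z = w1 * x1 + w2 * x2 + b   # linear node
p_hat = sigmoid(z)          # sigmoid node: estimated P(class = 1)

print(f"z = {z:.1f}, p_hat = {p_hat:.1f}, label y = {y:.0f}")
```

Because both inputs are zero, the weights have no say in this prediction: z is just the bias, and σ(0) is 0.5.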
If you’ve read my previous post then you know that at this point we need a Loss function to help us out. So, what Loss function should we use to tell a binary classification neural network to correct its estimated probability? In comes the Binary Cross-Entropy Loss Function to our rescue.
Binary Cross-Entropy Loss Function
Note: in most programming languages “log” is the natural logarithm(log with base-e), denoted in mathematics as “ln”. For consistency between code and equations consider “log” as natural logarithm and not as “log₁₀”(log with base-10).
The Binary Cross-Entropy(BCE) Loss function is defined as follows: Loss(p̂, y) = -[y·log(p̂) + (1-y)·log(1-p̂)]
All Loss functions essentially tell us how far our predicted output is from our desired output, for one example only. Simply put, a Loss function computes the error between prediction and actual value. Keeping that in view, the Binary Cross-Entropy(BCE) Loss function computes a different Loss when the associated label of a training example is y=1(positive) and a different Loss when the label is y=0(negative). Let’s see:
Now it’s apparent that the BCE Loss function in Fig.12 is just an elegantly compressed version of the piecewise equation.
Let’s plot the above piecewise function to visualize what’s going on underneath.
So, the BCE Loss function captures the intuition that the neural network should pay a high penalty(Loss→∞) when the estimated probability, with respect to the training example’s label, is completely wrong. On the other hand, the Loss should equal zero(Loss=0) when the estimated probability, with respect to the training example’s label, is correct. Simply put, the BCE Loss should equal zero in only two instances:
- if the example is positively labeled(y=1) the neural network model should be completely sure that the example belongs to the positive class i.e p̂=1.
- if the example is negatively labeled(y=0) the neural network model should be completely sure that the example does not belong to the positive class i.e p̂=0.
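To make the penalty behaviour concrete, here is a small sketch of the BCE Loss in both its compressed and piecewise forms. It assumes the standard definition Loss = -[y·log(p̂) + (1-y)·log(1-p̂)] with the natural logarithm; the function names and the sample probabilities are hypothetical:

```python
import numpy as np

def bce_loss(p_hat, y):
    """Compressed form: -[y*log(p_hat) + (1-y)*log(1-p_hat)], natural log."""
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def bce_loss_piecewise(p_hat, y):
    """Piecewise form: -log(p_hat) if y == 1, else -log(1 - p_hat)."""
    return -np.log(p_hat) if y == 1 else -np.log(1 - p_hat)

for y in (1, 0):
    for p_hat in (0.01, 0.5, 0.99):
        loss = bce_loss(p_hat, y)
        same = np.isclose(loss, bce_loss_piecewise(p_hat, y))
        print(f"y={y} p_hat={p_hat:.2f} loss={loss:.3f} forms agree: {same}")
```

The Loss is small when p̂ sits close to the label and grows quickly as p̂ moves toward the wrong end of the 0-1 range.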
In neural networks, the gradient/derivative of the Loss function dictates whether to increase or decrease the weights and bias of a neural network. So let’s see what the derivative of the Binary Cross-Entropy(BCE) Loss function looks like: ∂Loss/∂p̂ = -y/p̂ + (1-y)/(1-p̂)
We can also split the derivative into a piecewise function and visualize its effects:
A positive derivative would mean decrease the weights and negative would mean increase the weights. The steeper the slope(gradient) the more incorrect the prediction was. Let’s take a moment to make sure we understand this statement:
- If the gradient is negative that would mean we are looking at the first Loss curve, where the actual label for the example is positive(y=1). The only way to drive Loss to zero would be to move in the opposite direction of the slope(gradient), from negative to positive. Therefore, we need to increase the weights and bias so that z = w₁x₁+w₂x₂+b > 0 (recall Fig.8) and in turn estimated probability of belonging to the positive class is p̂≈σ(z)≈1.
- Similarly, when the gradient is positive we are looking at the second Loss curve where the actual label for the example is negative(y=0). The only way to drive the Loss to zero would again be to move in the opposite direction of the slope(gradient), this time from positive to negative. In this instance, we would need to decrease the weights and bias so that z = w₁x₁+w₂x₂+b < 0 and consequently estimated probability of belonging to the positive class p̂≈σ(z)≈0.
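The sign argument in the two bullets above can be checked with a short sketch. It assumes the derivative ∂Loss/∂p̂ = -y/p̂ + (1-y)/(1-p̂), which follows from differentiating the BCE Loss; the function name and sample values are illustrative only:

```python
def bce_loss_grad(p_hat, y):
    """dLoss/dp_hat for the Binary Cross-Entropy Loss."""
    return -(y / p_hat) + (1 - y) / (1 - p_hat)

for y in (1, 0):
    for p_hat in (0.1, 0.5, 0.9):
        g = bce_loss_grad(p_hat, y)
        print(f"y={y} p_hat={p_hat:.1f} dLoss/dp_hat={g:+.3f}")
```

For y=1 the gradient is always negative and for y=0 it is always positive, and in both cases its magnitude grows the further p̂ is from the label.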
The explanation provided for BCE Loss up till now is sufficient for all intents and purposes, but the curious among you might be wondering where did this Loss function even come from and why not just use the Mean Squared Error Loss function like in the previous post? More on this later.
Now that we know the purpose of a Loss function and how the Binary Cross-Entropy Loss function works let’s calculate the BCE Loss on our current example(x₁ = 0 and x₂ = 0), for which our neural network estimated that the probability for belonging to the positive class is p̂=0.5 while its label(y) is y=0:
The Loss is about 0.693(rounded to 3 decimal places). We can now use the derivative of the BCE Loss function to check if we need to increase or decrease the weights and bias, using the process called backpropagation. It is the opposite of forward propagation: we track backward from output to input. Backpropagation allows us to figure out how much of the Loss each part of the neural network was responsible for, so that we can then adjust those parts of the neural network accordingly.
As shown in my previous post, we’ll employ the following graph technique for propagating the gradients back from the output layer to the input layer of the neural network:
At each node, we only have our local gradient computed(partial derivatives of that node). Then during backpropagation, as we are receiving numerical values of gradients from upstream, we multiply upstream gradients with local gradients and pass them on to their respective connected nodes. This is a generalization of the chain rule from calculus.
Let’s go over backpropagation step by step:
For the next calculation, we’ll need the derivative of the Sigmoid function which forms the local gradient at the red node. The derivative of the Sigmoid function is(derived in detail in my previous post): ∂σ/∂z = σ(z)·(1-σ(z))
Now let’s use the derivative of the Sigmoid node and backpropagate the gradient further:
Gradients should not propagate back to the input nodes(i.e red arrows should not travel towards the green nodes) as we do not want to change our input data; we only intend to change the weights associated with them.
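The backward pass for this single example can be sketched as follows. Each line multiplies the upstream gradient by the local gradient of one node, as described above. The variable names are mine, and the derivative formulas are the standard ones for the BCE Loss and the Sigmoid function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.0, 0.0])   # first example (x1, x2)
y = 0.0                    # its label
w = np.array([0.1, 0.6])   # initial weights from the text
b = 0.0                    # initial bias

# Forward pass
z = w @ x + b
p_hat = sigmoid(z)
loss = -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Backward pass: upstream gradient times local gradient at each node
dloss_dp = -(y / p_hat) + (1 - y) / (1 - p_hat)   # BCE Loss node
dp_dz = p_hat * (1 - p_hat)                       # Sigmoid node
dloss_dz = dloss_dp * dp_dz                       # gradient at the linear node
dloss_dw = dloss_dz * x                           # dz/dw = x
dloss_db = dloss_dz * 1.0                         # dz/db = 1

print(f"loss={loss:.3f} dLoss/dz={dloss_dz:.2f} dLoss/dw={dloss_dw} dLoss/db={dloss_db:.2f}")
```

Note that the weight gradients come out as zero here because both inputs are zero, so only the bias receives a non-zero gradient from this example.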
Finally, we can update the parameters(weights and bias) of the neural network by performing gradient descent.
Gradient Descent
Gradient descent is adjusting the parameters of the neural network by moving in the negative direction of the gradient i.e away from a sloping region to a flatter region.
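The general equation for gradient descent is: new parameter = old parameter - α·(gradient of the Loss with respect to that parameter), for example w₁ ← w₁ - α·∂Loss/∂w₁ and b ← b - α·∂Loss/∂b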
The Learning Rate, α(pronounced alpha), is used to control the step size down the Loss curve(Fig. 21). The learning rate is a hyper-parameter of the neural network, which means it can’t be learned through the backpropagation of gradients and must be set by the creator of the neural network, ideally after some experimentation. For more information about the effects of the learning rate, you may refer to my previous post.
Notice that the gradient descent steps (blue arrows) keep getting smaller and smaller. That’s because as we move away from the sloping region to a flatter region, near the minimum point, the magnitude of the gradient also decreases, resulting in progressively smaller steps.
We’ll set the learning rate(α) to α=1.
Now that we have updated the weights and bias(actually we were only able to update our bias in this training iteration) let’s do a forward propagation on the same example and calculate the new Loss to check if we’ve done the right thing.
Now the estimated probability for the 1ˢᵗ example belonging to the positive class(p̂) is down from 0.5 to approximately 0.378(rounded to 3 d.p) and consequently, the BCE Loss has reduced a bit, too, down from 0.693 to around 0.474(to 3 d.p).
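The whole training iteration (forward pass, backward pass, gradient descent update, and a second forward pass to check the result) can be sketched in one place. This assumes the usual update rule parameter ← parameter - α·gradient with α=1, as set above; the names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(p_hat, y):
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

x, y = np.array([0.0, 0.0]), 0.0     # first AND-gate example and its label
w, b = np.array([0.1, 0.6]), 0.0     # initial parameters from the text
alpha = 1.0                          # learning rate

# Forward and backward pass
p_hat = sigmoid(w @ x + b)
dloss_dz = (-(y / p_hat) + (1 - y) / (1 - p_hat)) * p_hat * (1 - p_hat)

# Gradient descent: parameter <- parameter - alpha * gradient
w_new = w - alpha * dloss_dz * x
b_new = b - alpha * dloss_dz

# Forward pass again with the updated parameters
p_hat_new = sigmoid(w_new @ x + b_new)
loss_new = bce_loss(p_hat_new, y)

print(f"w={w_new}, b={b_new:.1f}, p_hat={p_hat_new:.3f}, loss={loss_new:.3f}")
```

Only the bias moves (from 0 to -0.5), yet that is enough to pull the estimated probability down to about 0.378 and the Loss down from about 0.693 to roughly 0.47.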
Up till now, we have performed stochastic gradient descent. We have used only one example(x₁=0 and x₂=0), from our AND gate dataset of four examples, to perform a single training iteration(each training iteration is forward propagation, calculating Loss, followed by backward propagation and updating the weights through gradient descent).
We can continue on this path of updating the weights just by learning from one example at a time, but ideally, we’d like to learn from multiple examples at a time and reduce our Loss across all of them.
Batch gradient descent
In batch gradient descent(also called full batch gradient descent) we use all the training examples in a dataset during each training iteration. (If batch gradient descent is not possible for some reason, e.g. size of all the training data is too big to fit into RAM or GPU, we may use a subset of the dataset in each training iteration, this is called mini-batch gradient descent.)
A batch is just a vector/matrix full of training examples.
Before we proceed with processing multiple examples we need to define a Cost function.
Binary Cross Entropy Cost Function
For batch gradient descent we need to adjust the Binary Cross Entropy(BCE) Loss function to accommodate not just one example but all the examples in a batch. This adjusted Loss function is called the Cost function(also represented by the letter J in neural network literature and sometimes also called the objective function).
Instead of calculating the Loss on one example, the Cost function calculates average Loss across ALL the examples in the batch.
When performing batch gradient descent(or mini-batch gradient descent) we take the derivative with respect to the Cost function instead of the Loss function. So next, we’ll see how to take the derivative of the Binary Cross-Entropy Cost function, using a simple example and then generalizing from there.
The derivative of Binary Cross-Entropy Cost function
In vectorized form our BCE Cost function looks as follows:
As expected the Cost is just the average of the Loss of the two examples, but all our calculations are vectorized, allowing us to compute the Binary Cross-Entropy Cost for a batch in one go. We prefer to use vectorized computations in neural networks as computer hardware(CPU and GPU) is better suited to batch computations in vectorized form. (Note: if we had just one example in the batch the BCE Cost would simply be calculating the BCE Loss, just like the stochastic gradient descent example we went through earlier)
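A vectorized Cost function along these lines can be sketched as below. The two-example batch is made up purely for illustration (the probabilities 0.8 and 0.3 are not taken from this post), and I am assuming the examples are laid out as columns so that P̂ and Y both have shape (1, m):

```python
import numpy as np

def bce_cost(P_hat, Y):
    """Average BCE Loss over a batch; P_hat and Y have shape (1, m)."""
    m = Y.shape[1]
    losses = -(Y * np.log(P_hat) + (1 - Y) * np.log(1 - P_hat))
    return np.sum(losses) / m

# Hypothetical batch of two examples: one positive, one negative
P_hat = np.array([[0.8, 0.3]])
Y = np.array([[1.0, 0.0]])

cost = bce_cost(P_hat, Y)
by_hand = (-np.log(0.8) - np.log(1 - 0.3)) / 2   # average of the two Losses
print(f"cost = {cost:.4f}, averaged by hand = {by_hand:.4f}")
```

With a batch of one example the same function reduces to the plain BCE Loss, which matches the note above.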
Next, let’s derive the partial derivatives of this vectorized Cost function.
From this, we can generalize the partial derivative of the Binary Cross-Entropy Cost function.
A very important consequence of the Cost function is that since it calculates the average Loss across a batch of examples it also calculates the average of the gradient across the batch of examples, this helps in figuring out a less noisy general direction in which Loss across all examples decreases. In contrast, stochastic gradient descent(batch with only one example) gives a very noisy estimate of the gradients because it uses only one example per training iteration to guide gradient descent.
For vectorized(batched) computations, we need to adjust the linear node(z) of the neural network so that it accepts vectorized inputs, and, for the same reason, use the Cost function instead of the Loss function.
Z node now computes the dot-product between appropriately sized weight matrix(W) and training data(X). The output of the Z node is now also a vector/matrix.
Now we can set up our data(X, W, b & Y) for vectorized computation.
We are now finally ready to perform forward and backward propagation using Xₜᵣₐᵢₙ, Yₜᵣₐᵢₙ, W, and b.
(NOTE: All the results below are rounded to 3 decimal places, just for brevity)
Through vectorized computations, we have performed forward propagation; calculating all the estimated probabilities for every example in the batch in one go.
Now we can calculate the BCE Cost on these output estimated probabilities(P̂ ). (Below, for legibility, I have highlighted the portions of the Cost function that are calculating the Loss on positive examples in blue and the negative examples in red)
So, the Cost with our current weights and bias is approximately 0.720. Our goal now is to reduce this Cost using backpropagation and gradient descent. Let’s go through backpropagation step-by-step.
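The vectorized forward pass and the Cost can be sketched as follows. I am assuming the layout in which each column of X is one example (rows are x₁ and x₂) and the same initial values as before, W = [0.1, 0.6] and b = 0; the array names are my own:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

# AND gate: columns are the examples (0,0), (0,1), (1,0), (1,1)
X_train = np.array([[0.0, 0.0, 1.0, 1.0],    # x1
                    [0.0, 1.0, 0.0, 1.0]])   # x2
Y_train = np.array([[0.0, 0.0, 0.0, 1.0]])

W = np.array([[0.1, 0.6]])   # shape (1, 2)
b = np.array([[0.0]])        # shape (1, 1), broadcast across the batch

Z = W @ X_train + b          # shape (1, 4): one z per example
P_hat = sigmoid(Z)           # estimated probabilities for the whole batch

m = Y_train.shape[1]
cost = -np.sum(Y_train * np.log(P_hat) + (1 - Y_train) * np.log(1 - P_hat)) / m

print("P_hat =", np.round(P_hat, 3))
print("cost  =", round(float(cost), 3))
```

All four predictions come out of a single matrix product, and the Cost rounds to 0.720 as stated above.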
And just like that, we have computed all the gradients with respect to the Cost function in one go for our entire batch of training examples, using vectorized computations. We can now perform gradient descent to update the weights and bias.
(For those confused with how ∂Cost/∂W and ∂Cost/∂b are being calculated in the last backpropagation step please refer to my previous blog where I break down this computation, more specifically why derivatives of dot products result in transposed matrices)
To check if we have done the right thing we can use the new weights and bias to perform another forward propagation and calculate the new Cost.
With one training iteration, we have reduced the Binary Cross Entropy Cost from 0.720 to around 0.618. We will need to perform multiple training iterations before we can converge to good weight and bias values that result in an overall low BCE Cost.
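One full batch iteration can be sketched like this. It follows the same chain of gradients as the single-example case, only averaged over the batch, and assumes α=1 and the column-per-example layout; the helper and variable names are hypothetical:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def bce_cost(P_hat, Y):
    return -np.sum(Y * np.log(P_hat) + (1 - Y) * np.log(1 - P_hat)) / Y.shape[1]

X_train = np.array([[0.0, 0.0, 1.0, 1.0],
                    [0.0, 1.0, 0.0, 1.0]])
Y_train = np.array([[0.0, 0.0, 0.0, 1.0]])
W, b, alpha = np.array([[0.1, 0.6]]), np.array([[0.0]]), 1.0
m = Y_train.shape[1]

# Forward propagation
P_hat = sigmoid(W @ X_train + b)
cost_before = bce_cost(P_hat, Y_train)

# Backpropagation
dP = (-(Y_train / P_hat) + (1 - Y_train) / (1 - P_hat)) / m   # dCost/dP_hat
dZ = dP * P_hat * (1 - P_hat)                                 # through the Sigmoid node
dW = dZ @ X_train.T                                           # shape (1, 2)
db = np.sum(dZ, axis=1, keepdims=True)                        # shape (1, 1)

# Gradient descent update, then a fresh forward pass
W_new, b_new = W - alpha * dW, b - alpha * db
cost_after = bce_cost(sigmoid(W_new @ X_train + b_new), Y_train)

print(f"cost before = {cost_before:.3f}, cost after = {cost_after:.3f}")
```

The transpose in `dZ @ X_train.T` is what sums each example's contribution into the gradient of the corresponding weight.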
At this point, if you’d like to give it a go and perform the next backpropagation step yourself, as an exercise, here are the approximate gradients of Cost w.r.t weights(W) and bias(b) you should get(rounded to 3 d.p):
- ∂Cost/∂W = [-0.002, 0.027]
- ∂Cost/∂b =[0.239]
After about 5000 Epochs (an epoch is complete when the neural net goes through all the training examples in a training iteration) the Cost steadily decreases to about 0.003, our weights settle to around W = [10.678, 10.678], and the bias resolves to around b = [-16.186]. We see by the Cost Curve below that the network has converged to a good set of parameters(i.e W & b):
The Cost Curve(or Learning Curve) shows a neural network model’s performance over time. It is the Cost plotted after every few training iterations(or epochs). Note how quickly the Cost decreases initially but then flattens out. Recall from Fig 21 that this is because the magnitude of the gradient is high initially, but as we descend to the flatter region near the minimum Cost the magnitude of the gradient decreases, and further training only slightly improves the neural network parameters.
After the neural net has been trained for 5000 epochs the predicted output probabilities(p̂) on Xₜᵣₐᵢₙ are:
[[9.46258077e-08, 4.05463814e-03, 4.05463814e-03, 9.94323194e-01]]
Let’s break this down:
- for x₁=0, x₂=0, the predicted output is p̂≈9.46×10⁻⁸≈0.0000000946
- for x₁=0, x₂=1, the predicted output is p̂≈4.05×10⁻³≈0.00405
- for x₁=1, x₂=0, the predicted output is p̂≈4.05×10⁻³≈0.00405
- for x₁=1, x₂=1, the predicted output is p̂≈9.94×10⁻¹≈0.994
Recall that the labels are y = [0, 0, 0, 1]. So, only for the last example is the neural network 99.4% confident that it belongs to the positive class; for the rest, it’s less than 1% confident. Also, remember the probability equations from Fig.7? P(1)=p̂ and P(0)=1-p̂, so the predicted probabilities(p̂) confirm our neural network knows what it is doing 👌.
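The training run described above can be sketched as a plain loop over batch iterations. I am assuming the same starting point (W = [0.1, 0.6], b = 0), a learning rate of α=1 and 5000 epochs; the exact numbers you get may differ slightly from the ones quoted, depending on such details:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def bce_cost(P_hat, Y):
    return -np.sum(Y * np.log(P_hat) + (1 - Y) * np.log(1 - P_hat)) / Y.shape[1]

X_train = np.array([[0.0, 0.0, 1.0, 1.0],
                    [0.0, 1.0, 0.0, 1.0]])
Y_train = np.array([[0.0, 0.0, 0.0, 1.0]])
W, b = np.array([[0.1, 0.6]]), np.array([[0.0]])
alpha, epochs = 1.0, 5000
m = Y_train.shape[1]

for epoch in range(epochs):
    # Forward propagation on the full batch
    P_hat = sigmoid(W @ X_train + b)
    # Backpropagation
    dP = (-(Y_train / P_hat) + (1 - Y_train) / (1 - P_hat)) / m
    dZ = dP * P_hat * (1 - P_hat)
    # Gradient descent update
    W = W - alpha * (dZ @ X_train.T)
    b = b - alpha * np.sum(dZ, axis=1, keepdims=True)

P_hat = sigmoid(W @ X_train + b)
cost = bce_cost(P_hat, Y_train)
print("W =", np.round(W, 3), "b =", np.round(b, 3))
print("cost =", round(float(cost), 4))
print("P_hat =", P_hat)
```

Since the full batch is used in every iteration, one iteration here is also one epoch.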
Now that we know that the neural network’s predicted probabilities are correct we need to define when the predicted class should be 1 and when it should be 0 i.e. classify the examples based on these probabilities. For this, we need to define a classification threshold (also called decision threshold). What’s that? Let’s get into it.
Classification Threshold
In binary classification tasks, it is common to classify all the predictions of a neural network to the positive class(1) if the estimated probability(p̂ ) is greater than a certain threshold, and similarly, to the negative class(0) if the estimated probability is below the threshold.
This can be mathematically written as follows:
The value of the threshold defines how stringent our model is in assigning an input to the positive class. If the threshold is thresh=0, then all the input examples will be assigned to the positive class i.e predicted class(ŷ) will always be ŷ=1. Similarly, if thresh=1 then all the input examples will be assigned to the negative class i.e predicted class(ŷ) will always be ŷ=0. (Recall that the sigmoid activation function asymptotes at both ends, so it may come very close to 0 or 1 but will never output exactly 0 or 1)
The Sigmoid/Logistic function provides a natural threshold value for us. Recall Fig.8 from earlier.
So, with the natural threshold of 0.5 the classes can be predicted as follows:
How do we interpret this? Well, if the neural network is at least 50%(0.5) confident that the input belongs to the positive class(1) then we’ll assign it to the positive class(ŷ=1), otherwise we’ll assign it to the negative class(ŷ=0).
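Turning probabilities into classes is a one-liner. The sketch below assumes the rule ŷ=1 when p̂ ≥ thresh and ŷ=0 otherwise, and reuses the trained probabilities quoted earlier; the function name `predict_class` is hypothetical:

```python
import numpy as np

def predict_class(P_hat, thresh=0.5):
    """Return 1 where p_hat is at least thresh, else 0."""
    return (P_hat >= thresh).astype(int)

# Predicted probabilities after training, as quoted above
P_hat = np.array([[9.46258077e-08, 4.05463814e-03, 4.05463814e-03, 9.94323194e-01]])

print(predict_class(P_hat))          # natural threshold of 0.5
print(predict_class(P_hat, 0.999))   # a much stricter threshold
print(predict_class(P_hat, 0.0))     # thresh=0 assigns everything to class 1
```

With the natural threshold the predictions match the AND gate labels [0, 0, 0, 1]; with a threshold of 0.999 even the last example is no longer confident enough to be called positive.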
Recall how we predicted in Fig.10 the neural network could separate the two classes in the AND gate dataset by drawing a line that separates the positive class(green cross) and negative class(red crosses). Well, the location of that line is defined by our threshold value. Let’s see:
Recall after training our weights and bias converged to around W = [10.678, 10.678] and b = [-16.186], respectively. Let’s plug these into the inequality derived in Fig. 43, above.
Further, realize this inequality gives us an equation of a line that separates our two classes:
This equation of a line marked in Fig.45 forms the Decision Boundary. The Decision Boundary is the line along which the neural network changes its prediction from positive to negative class and vice versa. All points(x₁,x₂) that fall on the line have the estimated probability of exactly 50% i.e p̂=0.5, all points above it have estimated probabilities of greater than 50% i.e p̂ >0.5, and all points that fall below the line have estimated probabilities of less than 50% i.e p̂<0.5.
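The line itself can be computed directly from the trained parameters. The sketch below uses the fact that σ(z) = thresh exactly when z = log(thresh/(1-thresh)), which is 0 for thresh=0.5, and then solves w₁x₁+w₂x₂+b = z for x₂. The function name and the generalisation to arbitrary thresholds are my own additions:

```python
import numpy as np

def decision_boundary(w1, w2, b, thresh=0.5):
    """Slope and intercept of the line x2 = slope*x1 + intercept where p_hat == thresh."""
    z_thresh = np.log(thresh / (1.0 - thresh))   # value of z at which sigmoid(z) == thresh
    return -w1 / w2, (z_thresh - b) / w2

# Trained parameters quoted in the text
w1, w2, b = 10.678, 10.678, -16.186

slope, intercept = decision_boundary(w1, w2, b)
print(f"thresh=0.50: x2 = {slope:.3f}*x1 + {intercept:.3f}")

slope_hi, intercept_hi = decision_boundary(w1, w2, b, thresh=0.95)
print(f"thresh=0.95: x2 = {slope_hi:.3f}*x1 + {intercept_hi:.3f}")
```

Raising the threshold keeps the slope fixed and only shifts the line, which is why the threshold controls the location of the boundary while the weights and bias control its orientation.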
We can visualize the decision boundary by shading the area green where the neural network predicts the positive class(1) and red where the neural net predicts the negative class(0).
In most cases, we can set a threshold value of 0.5 in binary classification problems. So, what’s the take away after going this deep into understanding the threshold value? Should we just set it to 0.5 and forget about it? NO! In some cases, you’d want the threshold value to be high, for example, if you’re creating a cancer detection neural network model you’d want your neural network to be very confident, maybe at least 95%(0.95) or even 99%(0.99), that the patient has cancer, because if they don’t they may have to go through toxic chemotherapy for nothing. On the other hand, a cat-detector neural net model may be set to a low threshold, around 0.5 or so, because even if the neural net misclassifies a cat, it’s just a funny accident, no harm no foul.
Now to drive home the concept of classification threshold let’s visualize its effect on the location of the decision boundary and the resultant accuracy of the neural network model:
After training the neural network in the above four figures I have plotted the decision boundary(left), the shaded decision boundary(middle) and the shortest distance of each point from the decision boundary(right) with the classification threshold ranging from 0.000000001 to 0.9999.
The classification threshold is also a hyperparameter of the neural network model which needs to be tuned according to the problem at hand. The classification threshold doesn’t affect the neural network directly(it does not change the weights and bias); it is only used to convert the output probabilities back to binary representations for our classes i.e back to 1’s and 0's.
On a final note, the decision boundary is not the property of the dataset, its shape(straight, curved, etc.) is the result of the weights and bias of the neural network and its location is the result of the value of the classification threshold.
We’ve learned a lot up till now, right?😅 For the most part, we know almost everything about binary classification problems and how to solve them through neural networks. Unfortunately, I’ve got some bad news: our Binary Cross-Entropy Loss function has a serious computational flaw. It is very unstable in its current form😱.
Don’t worry! With some simple maths, we’ll be able to solve this problem.
Implementation of Binary Cross-Entropy Function
Let’s take another look at the Binary Cross-Entropy(BCE) Loss function:
Note from the piecewise equation that all the characteristics for the Binary Cross-Entropy Loss function are dependent on the “log” function(recall, “log” here is the natural logarithm).
Let’s plot the log function and visualize its characteristics:
The log function in Binary Cross-Entropy Loss defines when the neural network pays a high penalty (Loss→∞) and when the neural network is correct (Loss→0). The domain of the log function is 0<x<∞ and its range is unbounded, -∞<log(x)<∞. More importantly, as x gets closer and closer to zero(x → 0) the value of log(x) tends to negative infinity(log(x) → -∞). So, small changes in values near zero have an extreme impact on the result of the Binary Cross-Entropy Loss function. Further, our computers can store numbers only to a certain floating-point precision, so functions that tend to infinity cause numerical overflow(overflow is when a number is too big in magnitude to be stored in computer memory and underflow is when it is too close to zero to be represented). It turns out the Binary Cross-Entropy function’s strength, the log function, is also its weakness, making it unstable when its argument is close to zero.
This has a dire effect on the calculation of the gradients, too. As the values get closer and closer to zero the gradient tends to approach infinity making the gradient calculations also unstable.
Consider the following example:
Similarly, when calculating the gradients for the above example:
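The failure is easy to trigger in NumPy. The values below are a hypothetical case of my own (z=40 with label y=0), chosen because σ(40) rounds to exactly 1.0 in 64-bit floating point; `np.errstate` is used only to silence the resulting warnings:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

Z = np.array([40.0])   # a very confident positive prediction...
Y = np.array([0.0])    # ...for an example whose true label is negative

with np.errstate(divide="ignore", invalid="ignore"):
    P_hat = sigmoid(Z)                                           # rounds to exactly 1.0
    loss = -(Y * np.log(P_hat) + (1 - Y) * np.log(1 - P_hat))    # involves log(0)
    grad = -(Y / P_hat) + (1 - Y) / (1 - P_hat)                  # involves 1/0

print("p_hat =", P_hat[0])
print("loss  =", loss[0])
print("grad  =", grad[0])
```

Mathematically the Loss here is finite (about 40), but the naive formula reports infinity for both the Loss and its gradient, and that would poison every subsequent weight update.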
Now let’s see how we can fix this:
We have successfully taken the natural logarithm(log) function out of the danger zone! The value of “1+e⁻ᶻ” is always greater than 1 (i.e 1+e⁻ᶻ>1), so the output of the “log” function in the BCE Loss is always greater than 0 (i.e log(1+e⁻ᶻ)>0). The overall Binary Cross-Entropy function is no longer critically unstable.
We can stop here but let’s go one step further and simplify the Loss function even more:
We’ve significantly simplified the Binary Cross-Entropy(BCE) expression, but there is a problem with it. Can you guess it, looking at the curve for “1+e⁻ᶻ” from Fig.53?
The expression “1+e⁻ᶻ” tends to infinity for large negative values of z (i.e 1+e⁻ᶻ →∞ as z→-∞)! So, unfortunately, this simplified expression overflows when a large negative value is encountered. Let’s try to fix this.
Now with this “eᶻ+1” expression, we have solved the problem of the log function being unstable at negative values. Unfortunately, now we face the opposite problem: the new Binary Cross-Entropy Loss function is unstable for large positive values 😕 because “eᶻ+1” tends to infinity for large positive values of z (i.e eᶻ+1 →∞ as z→∞)!
Let’s visualize the two exponential expressions:
We need to somehow combine these two simplified functions (in Fig.54 & 56) into one Binary Cross-Entropy (BCE) function so that the overall Loss function is stable across all values, positive and negative.
Let’s confirm that it is doing the right calculation on negative and positive values:
Take a moment to understand this and try to piece it together with the piecewise stable Binary Cross-Entropy Loss function from Fig.58.
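Here is a hedged NumPy sketch of the same check. It assumes the combined expression has the form max(z, 0) - z·y + log(1 + e^(-|z|)), which is the formula TensorFlow documents for its sigmoid cross-entropy on logits; the function names and test values are hypothetical:

```python
import numpy as np

def stable_bce_loss(y, z):
    """Stable BCE on logits: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

def naive_bce_loss(y, z):
    """Textbook BCE: squash z through a sigmoid first, then take logs."""
    p_hat = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Illustrative logits, including extreme values where the model is badly wrong
z = np.array([-800.0, -5.0, 0.0, 5.0, 800.0])
y = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

stable = stable_bce_loss(y, z)
with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
    naive = naive_bce_loss(y, z)

print("stable:", stable)  # finite everywhere
print("naive: ", naive)   # infinite at the two extreme logits
```

For moderate values of z the two versions agree, while at the extremes only the stable version returns a finite loss (roughly |z|).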
So, with some simple high-school-level math, we have solved the numerical flaw in the basic Binary Cross-Entropy function and created a Stable Binary Cross-Entropy Loss and Cost function.
Note that the previous “unstable” Binary Cross-Entropy Loss function took as inputs label(y) and probabilities from the last sigmoid node(p̂ ) but the new Stable Binary Cross-Entropy Loss function takes as input label(y) and the values from the last linear node(z). The same goes for the stable Cost function.
Now that we have a stable BCE Loss function and its corresponding BCE Cost function how do we find the stable gradient of the Binary Cross-Entropy function?
That answer has been in plain sight all along!
Recall the derivative of the Binary Cross Entropy Loss function(Fig.15):
Also recall that during backpropagation this derivative flows into the Sigmoid node and multiplies with the local gradient at the sigmoid node, which is just the derivative of the Sigmoid function(Fig.19.b.):
Some beautiful mathematics takes place as we multiply the two derivatives:
So, to calculate the derivative ∂Loss/∂z, we don’t even need to calculate the derivative of the Loss function or the derivative of the Sigmoid node; instead, we can just bypass the Sigmoid node and pass “p̂-y” as the upstream gradient to the last linear node (z)!
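A quick numerical sanity check of this result, as a small sketch with made-up logits and labels: multiplying the two derivatives by hand should give the same numbers as the “p̂-y” shortcut.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logits and labels, kept moderate so the long route stays stable
z = np.array([-3.0, -0.5, 0.0, 2.0])
y = np.array([0.0, 1.0, 1.0, 0.0])
p_hat = sigmoid(z)

# Long route: derivative of the BCE loss times the local gradient of the sigmoid
dloss_dp = -(y / p_hat) + (1 - y) / (1 - p_hat)
dp_dz = p_hat * (1 - p_hat)
chained = dloss_dp * dp_dz

# Shortcut: bypass the sigmoid node entirely
shortcut = p_hat - y

print(np.allclose(chained, shortcut))
```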
This optimization has two great benefits:
- We no longer have to use the unstable derivative of the Binary Cross-Entropy function.
- We also avoid multiplying with the saturating gradients of the Sigmoid function.
What is a saturating gradient? Recall the Sigmoid function curve:
At either end the Sigmoid curve becomes flat. This becomes a huge problem in neural networks when the weights increase or decrease by a large amount such that the output of the associated linear node (z) becomes a very large positive or a very large negative number. In these cases, the gradient (i.e. the local gradient at the Sigmoid node) becomes zero or very close to zero. So, when an incoming upstream gradient is multiplied with a very small or a zero local gradient at the Sigmoid node, little or none of the upstream gradient is able to pass through.
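The effect is easy to see numerically. In this small sketch (the z values are arbitrary picks), the local gradient of the Sigmoid peaks at 0.25 for z=0 and collapses towards zero at either end:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary logits, from strongly negative to strongly positive
z = np.array([-40.0, -10.0, 0.0, 10.0, 40.0])

# Local gradient at the sigmoid node: sigmoid(z) * (1 - sigmoid(z))
local_grad = sigmoid(z) * (1 - sigmoid(z))

for zi, gi in zip(z, local_grad):
    print(f"z={zi:6.1f}  local gradient={gi:.3e}")
```

Any upstream gradient multiplied by the values at the two ends is practically wiped out.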
On a final note for this section, we could have found the derivative of the stable Binary Cross-Entropy function and reached the same conclusion, but I like the above explanation better as it helps us understand why we can bypass the last sigmoid node when backpropagating gradients in a binary classification neural network. For the sake of completeness, I’ve also derived that below:
Now let’s apply all that we’ve learned to the slightly more complicated XOR gate data, where we’d need a multilayer neural network (a deep neural network), as a simple straight line from a single-layer neural network won’t cut it (view my previous post for more information on this phenomenon):
To classify the data points of the XOR dataset we’ll use the following neural network architecture:
A layer in a neural network is any set of nodes at the same depth with tunable weights. The above neural network has two layers with tunable weights: the middle (hidden) layer and the last (output) layer.
Let’s expand out this 2-layer neural network before we proceed with forward and backward propagation:
Now we are ready to perform batch gradient descent, starting with forward propagation:
We can now calculate the stable Cost:
After the calculation of Cost, we can now move on to backpropagation and improving the weights and biases. Recall, we can bypass the last Sigmoid node with our optimization technique.
Man, that was a lot!😅 But now we know everything in-depth about a Binary Classification Neural Network. Finally, let’s move on to gradient descent and update our weights.
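For readers who prefer code to diagrams, one full training iteration on the XOR data can be sketched as follows. Treat the details as assumptions: the hidden layer size (3), the sigmoid hidden activation, the random initial weights and the learning rate are all placeholders, so the numbers will not match the ones in the figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stable_bce_cost(Y, Z):
    """Mean of the stable BCE loss over all examples (one example per column)."""
    m = Y.shape[1]
    return np.sum(np.maximum(Z, 0) - Z * Y + np.log1p(np.exp(-np.abs(Z)))) / m

def forward(params, X):
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1        # hidden linear node
    A1 = sigmoid(Z1)        # hidden sigmoid node
    Z2 = W2 @ A1 + b2       # last linear node (logits)
    return Z1, A1, Z2

def backward(params, X, Y, A1, Z2):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    dZ2 = (sigmoid(Z2) - Y) / m          # bypass the last sigmoid: (p_hat - y) / m
    dW2 = dZ2 @ A1.T
    db2 = np.sum(dZ2, axis=1, keepdims=True)
    dA1 = W2.T @ dZ2
    dZ1 = dA1 * A1 * (1 - A1)            # local gradient of the hidden sigmoid
    dW1 = dZ1 @ X.T
    db1 = np.sum(dZ1, axis=1, keepdims=True)
    return [dW1, db1, dW2, db2]

# XOR data, one example per column
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float)

rng = np.random.default_rng(0)           # placeholder initialisation
n_hidden = 3                             # assumed hidden layer size
initial_params = [rng.standard_normal((n_hidden, 2)), np.zeros((n_hidden, 1)),
                  rng.standard_normal((1, n_hidden)), np.zeros((1, 1))]
learning_rate = 1.0                      # assumed learning rate

# One iteration: forward pass, stable Cost, backward pass, weight update
Z1, A1, Z2 = forward(initial_params, X)
cost_before = stable_bce_cost(Y, Z2)
grads = backward(initial_params, X, Y, A1, Z2)
params = [p - learning_rate * g for p, g in zip(initial_params, grads)]
cost_after = stable_bce_cost(Y, forward(params, X)[2])

print("Cost before update:", cost_before)
print("Cost after update: ", cost_after)
```

Repeating the last block in a loop gives batch gradient descent.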
At this point, if you would like to perform the next training iteration yourself and further your understanding, the following are the approximate gradients you should get(rounded to 3 d.p):
So, after 5000 epochs the Cost steadily decreases to about 0.0017 and we get the following Learning Curve and Decision Boundary when the classification threshold value is set to 0.5 (in the coding section you can play around with the threshold value and see how it affects the decision boundary):
Before I conclude this section I want to answer some remaining questions, that might be bugging you:
1- Isn’t this just Logistic Regression?
Yes, a neural network with just one sigmoid neuron and no hidden layers, as in Fig.1, is logistic regression. A single-sigmoid-neuron neural net/logistic regression can classify simpler datasets that can be separated with just a straight line (like AND gate data). For a complicated dataset(such as XOR) feature engineering needs to be performed, by hand, to make a single-sigmoid-neuron neural net/logistic regression work adequately(explained in the previous post).
A multilayer neural network with multiple hidden layers and multiple neurons is called a deep neural network. A deep neural network can capture much more information about a dataset than a single neuron and can make classifications on complex datasets with little to no human intervention. The only caveat is that it needs much more training data than a simpler classification model such as a single-sigmoid-neuron neural net/logistic regression.
Further, the Binary Cross-Entropy Cost function for a single-sigmoid-neuron neural net/logistic regression is convex (u-shaped) with a guaranteed global minimum point. For a deep neural network, on the other hand, the Binary Cross-Entropy Cost function is generally not convex, so gradient descent is not guaranteed to reach a global minimum. In practice, this does not have a serious effect on training deep neural nets, and research has shown this can be mitigated with more training data.
2- Can we use the raw output probabilities as-is?
Yes, raw probabilities from a neural network can also be used, depending on the type of problem you are trying to solve. For example, you train a binary classification model to predict the probability of a car accident at a junction per day, P(accident ∣ day). Suppose the probability is P(accident ∣ day)=0.08. So in a year at that junction, we can expect:
P(accident ∣ day) × 365 = 0.08 × 365 = 29.2 accidents
3- How to find the optimal classification threshold?
Accuracy is one metric to figure out the classification threshold. We would want a classification threshold that maximizes the accuracy of our model.
Unfortunately, in many real-world cases accuracy alone is a poor metric. This is especially evident in cases where the classes are skewed in a dataset (in simple terms, there are more examples of one class than the other). The AND gate we saw earlier also suffered from this problem: only one example of the positive class and the rest of the negative class. If you go back and look at Fig.47.d., where we set the classification threshold so high (0.9999) that the model predicted the negative class for all our examples, you’ll see that the model’s accuracy is still 75%! This sounds pretty acceptable, but looking at the data it isn’t.
Consider another case where you are training a cancer detection model, but your 1000 patient dataset has only one example of a patient with cancer. Now if the model always outputs a negative class(i.e not-cancer, 0), regardless of input, you’d have a classifier that has a 99.9% accuracy on the dataset!
So, to deal with real-world problems many data scientists use metrics that employ the use of Precision and Recall.
Precision: How many of the positive predictions did the classifier get correct? (True Positives / Total number of Predicted Positives)
Recall: What proportion of positive examples was the classifier able to identify? (True Positives / Total number of Actual Positives)
Both these metrics can be visualized through a 2×2 matrix called the “confusion matrix”:
Tuning the classification threshold is a tug of war between Precision and Recall. Raising the classification threshold generally increases Precision and lowers Recall, and vice versa. Understanding the Precision vs. Recall trade-off is a topic that is beyond the scope of this post and will be the topic of a future Nothing but NumPy blog.
One common metric that many data scientists employ for tuning the classification threshold, which combines both Precision and Recall, is the F1 score.
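As a rough sketch of how these metrics can be computed from predicted probabilities, here is a small NumPy helper. The function name classification_metrics, the “>=” threshold convention and the choice to report 0 when a ratio is undefined are assumptions of this sketch. F1 is the harmonic mean of Precision and Recall, 2·P·R/(P+R).

```python
import numpy as np

def classification_metrics(y_true, p_hat, threshold=0.5):
    """Accuracy, precision, recall and F1 at a given classification threshold."""
    y_pred = (p_hat >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives

    accuracy = (tp + tn) / y_true.size
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# The cancer example from above: 1000 patients, a single positive case,
# and a model that outputs the negative class regardless of input
y_true = np.zeros(1000, dtype=int)
y_true[0] = 1
always_negative = np.zeros(1000)

metrics = classification_metrics(y_true, always_negative)
print(metrics)
```

The accuracy looks excellent, while Recall and F1 expose that the classifier never finds the one patient who matters.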
For the sake of brevity, the following questions have been given their own short post and serve as a supplement to our discussion(click/tap on the question to go to its respective post)
4- Where did this Binary Cross-Entropy Loss Function Come from?
5- Why not just use the Mean Squared Error(MSE) function like your last blog? After all, you were able to solve the same examples using MSE.
6- How do Tensorflow and Keras implement Binary Classification and the Binary Cross-Entropy function(Bonus)?
This concludes Part Ⅰ.
Part Ⅱ: Coding a Modular Binary Classification Neural Network
This implementation builds upon the code from the previous post(for more details you may review the coding section of the last post or read the documentation in the code).
The code for the Linear Layer class remains the same.
The code for the Sigmoid Layer class also remains the same:
The Binary Cross-Entropy (BCE) Cost function (and its variants) is the main new addition to the code from last time.
First, let’s look at the “unstable” Binary Cross-Entropy Cost function compute_bce_cost(Y, P_hat), which takes as arguments the true labels (Y) and the probabilities from the last Sigmoid layer (P_hat). This simple version of the Cost function returns the unstable version of the Binary Cross-Entropy Cost (cost) and its derivative with respect to the probabilities (dP_hat):
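A minimal sketch consistent with that description is shown below. It assumes examples are stored one per column, so Y and P_hat have shape (1, m), and it is not guaranteed to match the repository's code line for line:

```python
import numpy as np

def compute_bce_cost(Y, P_hat):
    """Unstable BCE Cost and its derivative with respect to P_hat.

    Assumes Y and P_hat have shape (1, m), one example per column.
    """
    m = Y.shape[1]  # number of examples

    # Average the per-example BCE loss to get the Cost
    cost = (1 / m) * np.sum(-Y * np.log(P_hat) - (1 - Y) * np.log(1 - P_hat))
    cost = np.squeeze(cost)  # turn a 0-d/1-element result into a plain scalar

    # Derivative of the Cost with respect to the probabilities
    dP_hat = (1 / m) * (-(Y / P_hat) + (1 - Y) / (1 - P_hat))
    return cost, dP_hat

# Tiny made-up example with two predictions
Y = np.array([[1.0, 0.0]])
P_hat = np.array([[0.9, 0.2]])
cost, dP_hat = compute_bce_cost(Y, P_hat)
print(cost, dP_hat)
```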
Now, let’s look at the stable version of the Binary Cross-Entropy Cost function, compute_stable_bce_cost(Y, Z), which takes as arguments the true labels (Y) and the output from the last Linear layer (Z). This Cost function returns the stable version of the Binary Cross-Entropy Cost (cost), as calculated by TensorFlow, and the derivative with respect to the last linear layer (dZ_last):
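Again as a hedged sketch rather than the repository's exact code: the version below uses the max(z, 0) - z·y + log(1 + e^(-|z|)) form and returns (P_hat - Y)/m as the derivative. Computing the sigmoid through np.logaddexp is a choice made for this sketch to avoid overflow warnings.

```python
import numpy as np

def compute_stable_bce_cost(Y, Z):
    """Stable BCE Cost on logits and its derivative with respect to Z.

    Assumes Y and Z have shape (1, m), one example per column.
    """
    m = Y.shape[1]  # number of examples

    # Stable per-example loss: max(z, 0) - z*y + log(1 + exp(-|z|))
    cost = (1 / m) * np.sum(np.maximum(Z, 0) - Z * Y
                            + np.log1p(np.exp(-np.abs(Z))))

    # sigmoid(Z) computed as exp(-log(1 + exp(-Z))) so it cannot overflow
    P_hat = np.exp(-np.logaddexp(0, -Z))

    # Derivative with respect to the last linear node, bypassing the sigmoid
    dZ_last = (1 / m) * (P_hat - Y)
    return np.squeeze(cost), dZ_last

# Tiny made-up example, including logits that break the unstable version
Y = np.array([[1.0, 0.0, 1.0, 0.0]])
Z = np.array([[2.0, -1.5, -800.0, 800.0]])
cost, dZ_last = compute_stable_bce_cost(Y, Z)
print(cost, dZ_last)
```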
Finally, let’s also look at the way Keras implements the Binary Cross-Entropy Cost function. compute_keras_like_bce_cost(Y, P_hat, from_logits=False) takes as arguments the true labels (Y) and the output from either the last Linear layer (Z) or the last Sigmoid layer (P_hat), depending on the optional argument from_logits. If from_logits=False (default), we assume P_hat contains probabilities that need to be converted to logits for computing the stable cost function. If from_logits=True, we assume P_hat contains the output from the Linear node (Z) and the stable cost function can be computed directly. This function returns the Cost (cost) and the derivative with respect to the last linear layer (dZ_last).
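The sketch below follows that description. Two details are assumptions of this sketch: probabilities are clipped to [epsilon, 1 - epsilon] before being converted to logits, and epsilon defaults to 1e-7, which mirrors the Keras backend's default fuzz factor.

```python
import numpy as np

def compute_keras_like_bce_cost(Y, P_hat, from_logits=False, epsilon=1e-7):
    """Keras-style BCE Cost: convert probabilities to logits, then use the stable form.

    Assumes Y and P_hat have shape (1, m), one example per column.
    """
    if from_logits:
        Z = P_hat  # P_hat already holds the output of the last Linear layer
    else:
        # Clip so the logit conversion never sees exactly 0 or 1 (assumed epsilon)
        P_clipped = np.clip(P_hat, epsilon, 1 - epsilon)
        Z = np.log(P_clipped / (1 - P_clipped))  # probabilities -> logits

    m = Y.shape[1]  # number of examples

    # Stable Cost: mean of max(z, 0) - z*y + log(1 + exp(-|z|))
    cost = (1 / m) * np.sum(np.maximum(Z, 0) - Z * Y
                            + np.log1p(np.exp(-np.abs(Z))))

    # Derivative with respect to the last linear node: (sigmoid(Z) - Y) / m
    dZ_last = (1 / m) * (np.exp(-np.logaddexp(0, -Z)) - Y)
    return np.squeeze(cost), dZ_last

# Made-up probabilities, including a confidently wrong 0.0 and 1.0
Y = np.array([[1.0, 0.0, 1.0, 0.0]])
P_hat = np.array([[0.9, 0.2, 0.0, 1.0]])
cost, dZ_last = compute_keras_like_bce_cost(Y, P_hat)
print(cost, dZ_last)
```

Thanks to the clipping, even probabilities of exactly 0 or 1 produce a large but finite Cost.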
At this point, you should open up the 1_layer_toy_network_on_Iris_petals notebook from this repository in a separate window and go over this blog and the notebook side-by-side.
We will use the Iris flower dataset, which happens to be one of the first datasets created for statistical analysis. The Iris dataset contains 150 examples of Iris flowers belonging to 3 species: Iris-setosa, Iris-versicolor, and Iris-virginica. Each example has 4 features: petal length, petal width, sepal length, and sepal width.
For our first Binary Classification neural network, we will create a 1-layer neural network, as in Fig.1, to discriminate between Iris-virginica vs. others, using only petal length and petal width as input features. So let’s build our neural network layers:
Now we can move on to training our neural network:
Notice that we are passing the derivative, dZ1, directly into the Linear layer with Z1.backward(dZ1), bypassing the Sigmoid layer, A1, because of the optimization we came up with earlier.
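If you don't have the notebook open, the shape of that training loop can be sketched in a self-contained way. Everything here is a stand-in: the LinearLayer and SigmoidLayer classes are simplified, hypothetical versions of the layer classes, the learning rate is assumed, and the AND gate data from Part Ⅰ replaces the Iris petals to avoid loading a dataset.

```python
import numpy as np

class LinearLayer:
    """Hypothetical, simplified linear layer: Z = W @ A_prev + b."""
    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
        self.b = np.zeros((n_out, 1))

    def forward(self, A_prev):
        self.A_prev = A_prev
        self.Z = self.W @ A_prev + self.b

    def backward(self, upstream_grad):
        self.dW = upstream_grad @ self.A_prev.T
        self.db = np.sum(upstream_grad, axis=1, keepdims=True)
        self.dA_prev = self.W.T @ upstream_grad

    def update_params(self, learning_rate):
        self.W -= learning_rate * self.dW
        self.b -= learning_rate * self.db

class SigmoidLayer:
    """Hypothetical sigmoid layer; only the forward pass is needed here."""
    def forward(self, Z):
        self.A = np.exp(-np.logaddexp(0, -Z))  # overflow-safe sigmoid

def compute_stable_bce_cost(Y, Z):
    m = Y.shape[1]
    cost = np.sum(np.maximum(Z, 0) - Z * Y + np.log1p(np.exp(-np.abs(Z)))) / m
    dZ_last = (np.exp(-np.logaddexp(0, -Z)) - Y) / m
    return cost, dZ_last

# AND gate data (stand-in for the Iris petal features), one example per column
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 0, 0, 1]], dtype=float)

Z1 = LinearLayer(n_in=2, n_out=1)
A1 = SigmoidLayer()
learning_rate = 1.0  # assumed value
costs = []

for epoch in range(5000):
    # Forward propagation
    Z1.forward(X)
    A1.forward(Z1.Z)

    # Stable Cost, computed from the Linear layer's output
    cost, dZ1 = compute_stable_bce_cost(Y, Z1.Z)
    costs.append(cost)

    # Backpropagation: dZ1 goes straight into the Linear layer, skipping A1
    Z1.backward(dZ1)
    Z1.update_params(learning_rate)

predictions = (A1.A >= 0.5).astype(int)
print("Final cost:", costs[-1])
print("Predictions:", predictions)
```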
After running the loop for 5000 epochs, in the notebook, we see that the Cost steadily decreases to about 0.080.
Cost at epoch#4700: 0.08127062969243247
Cost at epoch#4800: 0.08099585868475366
Cost at epoch#4900: 0.08073032792428664
Cost at epoch#4999: 0.08047611054333165
Resulting in the following Learning Curve and Decision Boundary:
Our model’s accuracy on the training data is:
The predicted outputs of first 5 examples:
[[ 0. 0. 1. 0. 1.]]
The predicted prbabilities of first 5 examples:
[[ 0.012 0.022 0.542 0. 0.719]]
The accuracy of the model is: 96.0%
Check out the other notebooks in the repository. We’ll be building upon the things we learned in this blog in future Nothing but NumPy blogs. Therefore, it would behoove you to create the layer classes (if you haven’t before) and the Binary Cross-Entropy Cost functions from memory as an exercise, and to try recreating the AND gate example from Part Ⅰ.
This concludes the blog🙌🎉. Thank you for taking the time to read this post. I hope you enjoyed it.
For any questions feel free to reach out to me on Twitter @RafayAK
This blog would not have been possible without the following resources and people:
- TensorFlow Documentation and GitHub(especially this)
- Keras Documentation and GitHub(especially this and this)
- Sait Celebi’s blogs
- Google’s ML crash-course
- James D. McCaffrey’s blog
- Will Wolf’s( @willwolf_) amazing post on deriving functions through MLE
- Andrej Karpathy’s(@karpathy) Stanford course
- Christopher Olah’s(@ch402) blogs
- Andrew Ng(@AndrewYNg) and his Coursera courses on deep learning and machine learning
- Ian Goodfellow(@goodfellow_ian) and his amazing book
- Reddit and StackExchange
- Berkeley CS294 lecture notes
- Stanford CS229 lecture notes
- Stanford Probabilistic Graphical Models lecture
- Finally, Hassan-uz-Zaman(@OKidAmnesiac) and Hassaan Tauqeer(@_hassaantauqeer) for invaluable feedback.