
Deep Learning Illustrated, Part 3: Convolutional Neural Networks

An illustrated and intuitive guide on the inner workings of a CNN

Welcome to Part 3 of our illustrated journey through Deep Learning. If you’ve missed the previous articles, definitely go back to read them. They lay the groundwork for what we’re about to dive into today.

Deep Learning, Illustrated

To quickly recap, we previously discussed the inner workings of neural networks by building a simple model to predict the daily revenue of an ice cream shop. We found that neural networks can handle complex problems by harnessing the combined power of several neurons. This allows them to uncover patterns in data that might otherwise be hard to recognize. We also learned that neural networks primarily solve two types of problems: Regression or Classification.

Just as we built a revenue prediction model, we can create models to address diverse problems by modifying the structure. Convolutional Neural Networks (CNNs) are specialized models designed for image recognition tasks. However, they rely on the same fundamental principles as the models we have encountered thus far (plus a few more steps). Today, we will explore the inner workings of a CNN and understand exactly what is happening behind the scenes.

For our first-ever CNN, let’s build an X-or-not-X model. This model should determine whether an image represents an X or not.

Groundbreaking, I know. For fellow Silicon Valley watchers, this model is very much inspired by my boy Jian Yang’s brilliant hotdog-not-hotdog app.

In our revenue model, we used two inputs – temperature and day of the week – to predict revenue. These were easy to input because they were numerical. But how do we input images into a neural network instead of numerical values?

The answer is rather straightforward. When we zoom into an image, we see that it’s basically just a bunch of pixels:

Since our X is a simple black and white image, let’s designate each pixel as either a 1 (representing a black pixel) or a 0 (representing a white pixel). These pixels are stored as a matrix of 0s and 1s.

We can convert this 5×5 matrix into a column:

And this column of 25 (5×5) 1s and 0s can now be our inputs into the neural network:
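If you'd like to see this in code, here's a minimal NumPy sketch; the exact pixel values of the 5×5 'X' are an illustrative assumption:

```python
import numpy as np

# A hypothetical 5x5 black-and-white 'X': 1 = black pixel, 0 = white pixel
image = np.array([
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
])

# Flatten the 5x5 matrix into a single column of 25 values
column = image.reshape(-1, 1)
print(column.shape)  # (25, 1)
```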

From the previous article, we also know that a trained neural network comes with weight and bias terms. Assuming this network is trained, we have 25 inputs to this neuron, each with its own weight, plus one bias term. If we want to create a more complex neural network (as images typically require), we need to add more neurons and/or layers. However, doing so dramatically increases the number of weight and bias terms that need to be optimized, requiring significant computational power.

Despite this, it may still be feasible for very small images, such as our 5×5 pixel image. However, a 256×256 pixel image will result in 65,536 (256×256) input weights plus 1 bias term… for a neural network with just 1 neuron! More complex images would require even more neurons and layers (!!). As a result, this method of feeding in raw pixel values may not scale effectively.
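To make that concrete, here's the quick parameter count:

```python
# Number of parameters for a single neuron fed raw pixel values
width, height = 256, 256
weights = width * height   # one weight per pixel = 65,536
bias = 1
print(weights + bias)      # 65,537 parameters for just one neuron
```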

Another concern is that images may not always look as expected. For instance, we could have this ideally centered, beautiful little ‘X’:

Or a wonky one like this:

Or an off-centered one like this:

All the images are of ‘X’, but each ‘X’ looks slightly different. If we train our neural network using a perfectly centered ‘X’, it may not perform well with other ‘X’ images. This is because the network only recognizes a perfectly-centered ‘X’. It cannot identify an off-center or distorted ‘X’. It knows only one pattern. This is not practical for real-world applications, as images are rarely that straightforward. Therefore, we need to adapt our neural network to handle situations where the ‘X’ isn’t perfectly centered.

We need to be more creative with our approach in constructing this neural network, perhaps by understanding the underlying patterns in all the images instead of just the pattern of one kind of image.

And if you think about it – our minds recognize images in a similar way, focusing on the features of an image and piecing them together. Given the vast amount of information we encounter, our brains excel at identifying features and discarding unnecessary information.

So we need to address two issues: reducing the inputs we feed into the neural network and finding a way to detect patterns in images.

Filters

Let’s start by finding some consistent pattern in all the ‘X’ images. For instance, one possible pattern can be:

same pattern across all 3 ‘X’ images

And then we can determine that the image is of an ‘X’ by confirming that this pattern exists in the image.

This pattern is called a filter. A filter captures a critical characteristic of ‘X’, so even if the image is rotated, shrunken, or distorted, the essence of the image is preserved.

These filters are typically small square matrices, most commonly 3×3 pixels, although the size can vary.

To apply a filter to an image for pattern detection, we slide the 3×3 filter over the image and calculate the dot product of the filter and the section it covers. So for the first section, we place the filter over the top-left 3×3 patch of the image…

…and then multiply together each overlapping pixel value in the filter and matrix…

…and then add the products:

By computing the dot product between the image and the filter, we say that the filter is convolved with the image, and that’s what gives convolutional neural networks their name.
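As a minimal sketch (the 3×3 filter values here are a hypothetical choice that picks out the diagonal pattern of an ‘X’, and `image` is the 5×5 array from the earlier sketch), the dot product for the first section looks like this:

```python
# Hypothetical 3x3 filter for the diagonal 'X' pattern
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

# First section: the top-left 3x3 patch of the image
section = image[0:3, 0:3]

# Multiply each overlapping pixel value, then add the products
dot_product = np.sum(section * kernel)
print(dot_product)
```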

We now repeat this for all the sections by sliding the filter across the image according to something called the stride, which we can set. The stride dictates how many cells over we move the filter at each step. So if our stride = 1, we move it over to the next section like this…

…and if stride = 2, we move it over like this:

A stride of 2 is a common choice, but in our case let’s set it to 1.

With stride = 1, if we store all the dot products in a matrix, we get:

We then add a bias term to this output matrix…

…which results in something called a feature map.

It’s important to note that the larger our strides, the smaller our feature map will be. In our example, we used stride = 1, resulting in a relatively large feature map.

When working with actual images, we may need to increase our strides. After all, we are dealing with a 5×5 input image in our example, but real-world images are usually much larger and more complex.

Typically, each value in this feature map is passed through the ReLU activation function. And as a quick reminder from the first article, here is the formula for ReLU:

The function outputs the input as is if it’s greater than 0, and outputs 0 if the input is less than or equal to 0 (that is, ReLU(x) = max(0, x)). Thus, by passing the feature map through the ReLU function, we obtain the following updated feature map:

In this scenario, all cells are set to 0, except for the one cell in the middle.
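Putting the whole convolution step together in NumPy (continuing with the `image` and `kernel` arrays from the earlier sketches, and using a hypothetical bias of -3 so the numbers work out to exactly this kind of feature map):

```python
def convolve(image, kernel, stride=1):
    """Slide `kernel` across `image`, taking the dot product at each position."""
    k = kernel.shape[0]
    out_size = (image.shape[0] - k) // stride + 1
    output = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            section = image[i * stride:i * stride + k, j * stride:j * stride + k]
            output[i, j] = np.sum(section * kernel)
    return output

bias = -3                                      # hypothetical bias term
feature_map = convolve(image, kernel) + bias   # convolve, then add the bias
feature_map = np.maximum(feature_map, 0)       # ReLU: negative values become 0
print(feature_map)
# [[0. 0. 0.]
#  [0. 2. 0.]
#  [0. 0. 0.]]
```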

I know that was a lot of steps, but to summarize the convolutional process, we started with an input image of the X…

…and then a filter was applied to it, also known as convolving the filter with the image…

…subsequently, a bias term was added to the convolved matrix to create a feature map…

…and finally, we typically pass this feature map through the ReLU function to obtain an updated feature map:

The primary purpose of the convolution step is to reduce the input size (from the whole image to a feature map) to simplify processing. A valid question that arises is whether we’re losing a significant amount of information due to the reduced number of values in the resulting feature map matrix. Indeed, we do have fewer values, but the filters are designed to detect certain integral parts or features of the images and eliminate all unnecessary information. And, as we discussed earlier, this is similar to how the human eye discerns objects, often ignoring irrelevant details. We don’t examine every single pixel, but rather look at distinct features. The focus is on preserving these essential features.

Similar to the previously mentioned filter, we can use additional filters to detect other features. For instance, we could use this filter…

…that can detect the following patterns:

So if we apply multiple filters using the same process as above, we’ll obtain a collection of feature maps derived from the same input image.

input image -> feature maps

A crucial question is, how do we determine the filters needed to detect features? This is determined during the training process, which we will discuss shortly.

Pooling

With our feature map now ready, we can move on to the next step – Pooling. This step is quite straightforward. We simply scan the previously created feature map, selecting small 2×2 sections, and choose the maximum value from each section. Here’s what our first step looks like:

max pooling – step 1

These 2×2 sections we take do not overlap, so here’s what our next step will look like:

max pooling – step 2

In this step, you’ll see we don’t have a full 2×2 section, but that’s okay because the sections at the edges don’t need to be a perfect 2×2. We then move to the next step:

max pooling – step 3

And finally:

max pooling – step 4

We call this method max pooling because it takes the maximum value from each section. Alternatively, we could use mean pooling, which calculates the average value for each region. The result would look like this:

2×2 matrix from mean pooling

Note: Sum pooling is another option, which, as the name suggests, sums up the values in each region. However, max pooling is the most commonly used method.

Max pooling is primarily used to further reduce noise in an image. Its effectiveness becomes more apparent with larger images, as it identifies the area where the filter best matches the input image. Just as in the convolution step, the creation of the pooled feature map discards extraneous information. In this process, approximately 75% of the values in the feature map are discarded, as we retain only the maximum value from every set of four pixels. These are unnecessary details that, when removed, enable the network to function more efficiently. Extracting the maximum value, which is the key point of the pooling step, also helps the network account for small distortions in the image.
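Here's a minimal max pooling sketch, applied to the 3×3 feature map from the convolution sketch above; the slicing naturally handles the incomplete 2×2 sections at the edges:

```python
def max_pool(feature_map, size=2):
    """Take the maximum value from each non-overlapping size x size section."""
    rows = -(-feature_map.shape[0] // size)   # ceiling division
    cols = -(-feature_map.shape[1] // size)
    pooled = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            section = feature_map[i * size:(i + 1) * size, j * size:(j + 1) * size]
            pooled[i, j] = section.max()      # swap in section.mean() for mean pooling
    return pooled

pooled = max_pool(feature_map)   # 3x3 feature map -> 2x2 pooled map
print(pooled)
# [[2. 0.]
#  [0. 0.]]
```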

Whew. That was quite a journey, but we’re not at the actual neural network part yet! But don’t worry, if you’ve read the previous articles, the rest will be pretty straightforward. All the work we’ve done so far has prepared us to use a real neural network. We’ll use the results from the pooling step as inputs for the neural network.

Flattening

To feed these values into a neural network, we first need to flatten the pooled feature map matrices, since we can’t input a matrix as is. For instance, if we have four filters, they would result in four feature maps. These, in turn, would lead to four 2×2 matrices from the max pooling step. And this is what they will look like flattened:

max pooling -> flattening
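As a quick sketch, assuming four hypothetical pooled maps like the one above:

```python
# Suppose four filters gave us four 2x2 pooled maps (reusing one map as a stand-in)
pooled_maps = [pooled, pooled, pooled, pooled]

# Flatten each 2x2 map and join them into one column of 16 input values
flattened = np.concatenate([p.reshape(-1) for p in pooled_maps])
print(flattened.shape)   # (16,)
```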

Neural Network (Finally)

All the features we’ve talked about so far are preserved in this flattened output, which allows us to use it as the input to a neural network.

These features alone already provide a good basis for classifying images, but we want to improve the model’s complexity and precision. The job of the artificial neural network is to take these features and combine them to produce a better image classification, which is the main reason we’re creating a convolutional neural network.

So we take these inputs and plug them into a fully connected neural network.

Note: This is called a fully connected neural network because every neuron in one layer is connected to every neuron in the next layer.

Let’s set our neural network architecture to be: 1 hidden layer with 3 neurons and 1 output neuron:

Now, we need to select our activation functions. In the previous article, we used the ReLU activation function for all neurons in our neural network for ice cream sales. The ReLU activation function remains a good choice for the inner layer. However, for the outer neuron, it’s not suitable due to the different nature of the problem we are trying to solve.

Previously, we were trying to answer: given the day of the week and temperature, what will the revenue of the ice cream store be? Now, our question is: given an image, is it the letter X or not? The nature of the problems and the answers we are seeking are significantly different, which means we need to adjust the processing of the outer neuron.

The first scenario was a regression problem, while the current one is a classification problem. We can approach our current problem by calculating a probability. For instance, given an input image, we can determine how likely it is that the image represents the ‘X’. Here, we’ll want the neural network to output values in the range of 0–1, where 1 indicates a high likelihood of being ‘X’, and 0 indicates it’s probably not an ‘X’.

To achieve this type of output, recall from our discussion of activation functions that the sigmoid function is a good choice.

The sigmoid function takes any input and squishes it along an S-shaped curve into a value between 0 and 1, which is perfect for predicting probabilities. Given this, here is what the neural network would look like:

Let’s assume this neural network is trained. Then we know that each input in a trained neural network has associated weight and bias terms. This network subsequently outputs values between 0 and 1.

So if we input our flattened example into this trained neural network and the output is 0.98, that indicates there’s a 98% probability the image is an ‘X’.
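Here's a minimal sketch of that forward pass, using the 16 flattened inputs from earlier; the weights and biases are random stand-ins for the values a trained network would actually have learned:

```python
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 16)), rng.normal(size=3)   # hidden layer: 3 neurons
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)    # output layer: 1 neuron

hidden = np.maximum(W1 @ flattened + b1, 0)   # ReLU in the hidden layer
output = sigmoid(W2 @ hidden + b2)            # sigmoid squashes the output into (0, 1)
print(output)   # a value like 0.98 would mean "probably an 'X'"
```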

To recap once again, let’s see visually what we have done so far. We start with an input image:

Then convolve this image by applying filters to it…

Add bias terms to the output and pass them through the ReLU function to get feature maps:

Next, we perform max pooling on the feature maps:

We then take these outputs, flatten them, and pass them through our neural network…

…to get a prediction of 0.98!

Okay, this is great. But now we need a way to check how good this 0.98 prediction is. In this case, we know our original image is an ‘X’, so we can say "the CNN did a good job here!", but we need something that tells us the same thing in math-y terms.

In the previous article, we used the Mean Squared Error (MSE) cost function to evaluate the accuracy of our predictions and used it to drive the training process. Similarly, we need a cost function here. But as we discussed earlier, since the kinds of predictions are different, we can’t use the MSE.

In this case, we’ll use something called a Log Loss function, which will sound familiar if you read the article on Logistic Regression. In Logistic Regression we’re trying to check the accuracy of a similar kind of output. Even though a CNN is way more complex than a Logistic Regression model, we’re trying to answer the same type of question.

A Log Loss cost function looks like this:

Here, y = 1 if the image is an ‘X’ and 0 if it’s not, p_hat is the predicted probability, and the sigma simply sums the values across all the image predictions we want to evaluate. So for this example, y = 1 (because we know the image is an ‘X’), the predicted probability p_hat = 0.98, and n = 1 because we are evaluating the output of just one image:

Here, we see the cost function is very close to 0, which is good. The lower the cost function, the better. So this in mathematical terms is saying what we said earlier – "the CNN did a good job here!"
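Here's that same calculation as a quick sketch:

```python
def log_loss(y, p_hat):
    """Binary cross-entropy (log loss), averaged over n predictions."""
    y, p_hat = np.asarray(y, dtype=float), np.asarray(p_hat, dtype=float)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Our single example: the image really is an 'X' (y = 1), predicted probability 0.98
print(log_loss([1], [0.98]))   # ~0.02, very close to 0
```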

Training

NOTE: We won’t go into detail about the training process in this article because we covered it extensively in the previous one. So make sure you read that before you do this section!

Remember from the previous article that a neural network learns the optimal weights and bias through the training process using gradient descent. This involves running the training set through the network, making predictions, and calculating costs. We keep doing this until we get the optimal values. The same process happens when we train our Convolutional Neural Network, but with two changes.

First, instead of using the MSE cost function, we use the Log Loss. Second, besides finding the best weights and biases in the neural network, we also look for the best filters and bias terms in the convolution step. The filters are just 3×3 matrices of numbers. So the goal is to find the optimal values for all of these elements – the filters and bias terms in the convolution step, and the weight and bias terms in the neural network.
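If you're curious how these pieces line up in code, here's a minimal Keras sketch of the same overall architecture (layer sizes are illustrative; the TensorFlow article linked at the end builds this out properly):

```python
import tensorflow as tf

# Illustrative sketch: filters, weights, and biases are all learned during training
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(4, (3, 3), activation="relu", input_shape=(5, 5, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Log loss goes by the name binary cross-entropy here; gradient descent optimizes
# the convolution filters and the dense-layer weights and biases together
model.compile(optimizer="adam", loss="binary_crossentropy")
```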

If you want to dive deeper into the math behind the training process, this video does a great job.


And that’s about it! This was a pretty meaty article, so it might be helpful to read through it and the previous two articles a couple of times and work through some of the logic on your own to let the concepts sink in.

Check out this article that brings this X-or-not-X classifier to life by building a CNN from scratch in TensorFlow!

Implementing Convolutional Neural Networks in TensorFlow


Part 4 on Recurrent Neural Networks is now live!

Deep Learning Illustrated, Part 4: Recurrent Neural Networks

NOTE: All images are illustrated by the author unless indicated otherwise.

As always, feel free to connect with me on LinkedIn if you have any questions/comments!

