
Understanding Convolutional Neural Networks

By taking a look under the hood

Image by Cecbur on Wikimedia Commons

In this article, I will train a deep neural network to classify images, but I will also give you an understanding of what is happening inside the neural network itself and how convolutions really work.

We will explore the following sections.

· Convolutions: Filters, Pooling
· The Network: Preprocessing, The model, Training the network
· Predicting
· What does the neural network "see"?

I will show you the network in action by displaying what it is actually "seeing" when it makes its decisions. We will dive into its "brain" (the layers of the network) and together try to understand how it determines the important features of the images.


When you see an image of a cat, how do you (your brain, that is) recognize that there is a cat in the image? And how do you tell the difference between a cat and a dog even though you haven’t seen the specific animals before?

These are tough questions to answer. Maybe we should start with a simpler one…

How would you program a computer to recognize a cat in an image? That is, an arbitrary cat placed anywhere in the image.

You might try to capture the essence of a cat by writing down general cat-like features, like the length and color of the fur or the shape of the ears.

But there are several big problems with that approach.

First of all, cats look very different depending on breed, and you won’t capture all their features. Secondly, it is not straightforward to make an algorithm that can detect a cat independently of where that cat is in the image.

If you look at the pixels, an image containing a cat on the left-hand side looks very different from an image containing a cat on the right-hand side.

How would you detect a cat independently of its position and size? That is a very tough, if not impossible, problem to solve by means of classical programming.

Convolutions

The idea of convolutions is to solve the above problem by detecting features of cats in a very clever way.

Instead of programming them explicitly, we use a supervised Machine Learning algorithm.

That is, we take a neural network with a certain architecture. Then we show it a lot of images with corresponding labels (like cat and dog), and through training, it will learn which features are most important when we want to classify images of cats and dogs.

A standard neural network, however, does not immediately solve the "size and position" problem above.

This is where convolutions come in. You see, a convolution highlights certain shapes by making some pixel values greater and others smaller. It thus transforms the image.

Filters

It works as follows. We take filters, which are just small 2-dimensional arrays of numbers, and we pass them over the pixels of the image.

For each pixel of the image, you center the filter "above" that pixel, multiply each number of the filter by the pixel value directly "below" it, and sum up the results. This sum becomes a pixel value in a new image.
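To make this concrete, here is a minimal sketch of this operation in plain NumPy (the image and filter values here are purely illustrative):

import numpy as np

# A tiny 6x6 grayscale "image" and a 3x3 filter that highlights edges.
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[-1.0, -1.0, -1.0],
                   [-1.0,  8.0, -1.0],
                   [-1.0, -1.0, -1.0]])

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the filter with the patch of pixels "below" it and sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

print(convolve2d(image, kernel))  # a 4x4 image with certain shapes highlighted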

Image by Vincent Dumoulin, Francesco Visin on Wikimedia Commons

The results of these operations are passed to the next layer of the network.

This is similar to the response of a neuron in the visual cortex to a specific stimulus and this architecture is therefore inspired by the biological brain.

- Wikipedia

Speaking at a lower level: for each filter, the resulting image is a transformation of the input image in which, depending on the filter, certain specific shapes have been highlighted.

Pooling

The second ingredient in CNNs is the concept of pooling.

Pooling downsamples an image by mapping each n×n block of pixels down to 1 pixel. Again, you should imagine an n×n array moving across the image, but this time it visits non-intersecting squares, so each pixel is mapped only once.

Image by Andreas Maier on Wikimedia Commons

This process preserves the important features but shrinks the image. For example, if we use a 2×2 array for pooling, the resulting image shrinks by a factor of 4 (a factor of 2 along each side).
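Here is a minimal sketch of 2×2 max pooling, the most common variant, in NumPy (assuming the image has even side lengths):

import numpy as np

def max_pool_2x2(image):
    h, w = image.shape
    # Split the image into non-intersecting 2x2 blocks and keep
    # only the largest pixel value of each block.
    blocks = image[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(image))  # a 2x2 image: 4 times fewer pixels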

The Network

In this section, I will show you how to work with image data that has not already been preprocessed!

The first thing to do is to get some data.

In this tutorial, I will download images of dogs and cats from Kaggle. You can get them here.

I have unpacked the train.zip and placed the folder in a directory called data. That is, I have the path data/train where all the training images are.

If an image starts with "cat", then it is an image of a cat and if it starts with "dog", then it is an image of a dog. I am going to use this pattern to sort the training data into dogs and cats.

Preprocessing

Now we need to do some preprocessing. I will create two functions that will do this work for me.
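The original functions are embedded as a gist and not reproduced here, but a minimal sketch could look like this (the function names, directory layout and 90/10 split are my assumptions):

import os
import random
import shutil

def sort_images(source_dir, cats_dir, dogs_dir):
    # Use the filename prefix ("cat..." or "dog...") to sort the images.
    for fname in os.listdir(source_dir):
        target_dir = cats_dir if fname.startswith("cat") else dogs_dir
        shutil.copyfile(os.path.join(source_dir, fname),
                        os.path.join(target_dir, fname))

def split_data(source_dir, train_dir, val_dir, split=0.9):
    # Shuffle the images and split them into a training and a validation set.
    fnames = os.listdir(source_dir)
    random.shuffle(fnames)
    cutoff = int(len(fnames) * split)
    for fname in fnames[:cutoff]:
        shutil.copyfile(os.path.join(source_dir, fname),
                        os.path.join(train_dir, fname))
    for fname in fnames[cutoff:]:
        shutil.copyfile(os.path.join(source_dir, fname),
                        os.path.join(val_dir, fname))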

Now simply make some directories and run these functions with the appropriate parameters.
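With hypothetical directory names (the labels will later be read off the folder names), that could look like:

for d in ["data/cats", "data/dogs",
          "data/training/cats", "data/training/dogs",
          "data/validation/cats", "data/validation/dogs"]:
    os.makedirs(d, exist_ok=True)

sort_images("data/train", "data/cats", "data/dogs")
split_data("data/cats", "data/training/cats", "data/validation/cats")
split_data("data/dogs", "data/training/dogs", "data/validation/dogs")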

You can do a quick check to make sure that it worked by running the following:
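For example, using the hypothetical layout from above:

print(len(os.listdir("data/training/cats")))
print(len(os.listdir("data/training/dogs")))
print(len(os.listdir("data/validation/cats")))
print(len(os.listdir("data/validation/dogs")))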

The output should be

11250
11250
1250
1250

Now that we have done the preprocessing part, the fun is about to begin.

The model

Let us build the architecture of the neural network that we are going to train and save it to a variable called model.
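The original model definition is embedded as a gist; a sketch consistent with the summary discussed below could look like this (the filter counts, the dense layer size and the optimizer are my assumptions):

import tensorflow as tf

model = tf.keras.models.Sequential([
    # Three convolution-pooling pairs that extract features.
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu",
                           input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The classification stage.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(loss="binary_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])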

Now if we run

model.summary()

we get a layer-by-layer summary of the network: each layer’s type, output shape and number of parameters.

Let’s try to understand the architecture of the neural network from this summary. The first thing you should notice is that the transformed images after the first convolution have shape 148×148 even though we fed the network images of shape 150×150.

The reason for this is simple. When the first convolution happens, we use a 3×3 array to slide over the image. Since each value of the array needs to sit above some pixel of the image, the center of the array can never be above a pixel on the edge of the image, because then some of the neighboring numbers in the array would have no pixels under them.

Therefore we cut off the edges, leaving us with a 148×148 image. In general, convolving an n×n image with a k×k filter and no padding yields an (n-k+1)×(n-k+1) output; here, 150 - 3 + 1 = 148.

Moreover, notice that the output from the pooling layer is 4 times smaller (half of each side) because we used a 2×2 array.

We have three pairs of convolution-pooling layers before going into the second stage of the network.

The second stage is where the actual classification happens. After the flatten layer, the neural network looks at the features extracted by the convolutions and tries to classify the image as a dog or a cat.

You may notice by looking at the code that the last layer consists of a single neuron. The activation function for this neuron is the sigmoid function, which is

sigmoid(x) = 1/(1 + e^(-x)).

When x is a large negative number, sigmoid(x) is close to 0, and when x is a large positive number, sigmoid(x) is close to 1.

The neural network can thus learn to classify cats and dogs because of this binary behavior of the sigmoid function: outputs close to 0 correspond to one class and outputs close to 1 to the other.

Now that we have designed the model, we need to train it. First, let’s create some generators.
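A sketch using Keras’ ImageDataGenerator (the exact augmentation parameters and the batch size are my assumptions):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale and augment the training images on the fly.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)
# The validation images are only rescaled, never augmented.
validation_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = train_datagen.flow_from_directory(
    "data/training",          # labels are inferred from the subfolder names
    target_size=(150, 150),
    batch_size=32,
    class_mode="binary",
)
validation_generator = validation_datagen.flow_from_directory(
    "data/validation",
    target_size=(150, 150),
    batch_size=32,
    class_mode="binary",
)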

These generators will, in turn, feed data in batches into the neural network. The labels will be inherited from the folder names of the images that we created in the preprocessing phase.

We will also do image augmentation, which is a way of making sure that our network "sees" more data. If the training data contains no climbing cats photographed at an angle, the trained network might fail to classify such a cat correctly. By rotating, shifting, shearing and flipping the images in different ways before feeding them to the network, we make sure that the training data captures as many features as possible.

You might notice that we are rescaling the images as well. This is standard practice, known as normalization. The idea is that it is the relations between the pixel values that matter, not the pixel values themselves.

This is not only true for image data of course.

The cool thing about this generator is that the actual transformed images are not saved to disk; your training data stays as it is and where it is. The augmentation happens dynamically in RAM, right before the images are fed to the network for training.

We also need to set the target size to 150×150 since this is what we told the model that the input shape would be.

Training the network

Let us train the model.
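A sketch of the training call (the number of epochs and the log directory are my assumptions):

import tensorflow as tf

# Optional: log metrics so you can inspect the training run in TensorBoard.
tensorboard = tf.keras.callbacks.TensorBoard(log_dir="logs")

history = model.fit(
    train_generator,
    epochs=15,
    validation_data=validation_generator,
    callbacks=[tensorboard],
)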

You don’t need the TensorBoard callback, but if you know what it is, feel free to play with it. I will address this in another article.

When I ran this, I got a validation accuracy of about 0.76, which means that the model correctly classified about 76% of the images it hadn’t seen before. This is, of course, before tweaking the model’s hyperparameters and so on.

If we were to get more training data, the accuracy would likely increase. Another way to get a much better model is to use transfer learning, where you build your network on top of a large pre-trained model. I will show you how to do this in another article. In this way, you can get an accuracy in the high 90s.

Predicting

Let’s take the following image of a cat.

Image by Alvesgaspar on Wikimedia Commons

We can ask our newly trained neural network if this is a cat or a dog.
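A sketch of such a prediction (the filename is hypothetical; with alphabetically ordered folder names, cats map to 0 and dogs to 1):

import numpy as np
from tensorflow.keras.preprocessing import image

# Load and preprocess the image exactly like the training data.
img = image.load_img("cat.jpg", target_size=(150, 150))
x = image.img_to_array(img) / 255.0
x = np.expand_dims(x, axis=0)      # the model expects a batch dimension

score = model.predict(x)[0][0]     # sigmoid output between 0 and 1
print("dog" if score > 0.5 else "cat")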

The output when I run this code is

cat

Pretty good.

What does the neural network "see"?

Let’s try to take a deep look inside and try to figure out what features the network thinks are important.

Let us take a look at one of the images from the training data and see how it gets transformed through the network. The following images are outputs from the convolution layers of the trained network, giving us an idea of which filters the network has learned are most important.
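One way to produce images like the ones below is to build a second Keras model that exposes the intermediate layer outputs (a standard technique; the number of layers and feature maps shown is my choice):

import matplotlib.pyplot as plt
from tensorflow.keras import models

# A model that returns the output of every convolution and pooling layer.
layer_outputs = [layer.output for layer in model.layers[:6]]
activation_model = models.Model(inputs=model.input, outputs=layer_outputs)

# x is a preprocessed image batch, as in the prediction example above.
activations = activation_model.predict(x)
first_conv = activations[0]        # shape (1, 148, 148, 16)

# Display a few feature maps from the first convolution layer.
for i in range(4):
    plt.subplot(1, 4, i + 1)
    plt.imshow(first_conv[0, :, :, i], cmap="viridis")
    plt.axis("off")
plt.show()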

Image by author
Image by author
Image by author

If you look closely, you can see that already after the first convolution, the cat’s whiskers and ears light up, which indicates that those are important features for classifying it as a cat.

If you look at the third image, it seems that the eyes are also important.

This shows how a convolutional neural network can identify shapes regardless of their size and position: during training, it learns which filters highlight the most important features. A simple neural network with a sigmoid output then calibrates its weights through backpropagation to classify the images based on the features extracted by the convolutions.

This short article shows some of the powerful APIs that TensorFlow has to offer. Hopefully, it also gives you an intuition about what convolution really is.


If you have any questions, comments or concerns, please message me on LinkedIn.

Kasper Müller – Senior Consultant, Data and Analytics, FS, Technology Consulting – EY | LinkedIn

