The Most Intuitive and Easiest Guide for Convolutional Neural Network

Jiwon Jeong
Towards Data Science
9 min read · Jan 24, 2019


Demystifying CNN for complete starters

[Animated GIF: a convolution filter sliding over an image]

Computer vision is one of the hottest topics in the AI industry. Its power is so revolutionary that we are now witnessing technical advances in image and video processing services almost every day. Moreover, convolutional neural networks are showing huge potential not only in computer vision but also in areas like natural language processing and speech recognition.

This is the second part of the ‘The Most Intuitive and Easiest Guide’ series for neural networks. The complete series is as follows:

  1. The Easiest Guide for Artificial Neural Network
  2. The Easiest Guide for Convolutional Neural Network (this post)
  3. The Easiest Guide for Recurrent Neural Network

This post assumes that you already know the basics of neural networks. If you are new to ANNs, it’ll be great to check out the previous post first. Today’s keywords are convolution, filters, padding, pooling, and flattening. The core of CNNs lies in understanding what each of these steps does and how the dimensions of the data change along the way. Are you ready to become a pixel of an image and take a trip through a neural network? Let’s get it done.

How do we recognize images?

Before jumping into convolutional networks, let’s talk about us, about how we perceive images. This will make CNNs much easier to understand. Take a look at this picture. What can you see?

Is it a man showing the right side of his face? Or is it a man looking straight at you? If your attention is on the nose or the right contour, you’ll probably say the former. If you focus on the ear or the left contour, you’ll say the latter. What do you think? Let’s try again with another one.

What do you see? Just a lady walking behind a tree? If so, lean back a little from your screen, or take one step back from your computer. Can you see it now? Did you notice that there is a man with a shiny forehead?

These two pictures show us how we recognize images. The contours and edges of an image affect our perception. You might not have noticed, but our brain captures the patterns in figures to classify an object. And the trick of CNNs lies here: our machine will be trained to classify images by detecting patterns, just like us. In the case of the ‘cats and dogs’ problem, for example, the patterns could be the shape of the ears, the eyes, the colors, the fur, and so on.

So what is Convolution?

In the terminology of convolutional neural networks, these patterns are called ‘kernels,’ ‘filters’ or ‘feature detectors.’ In my opinion, the last one is the most intuitive. So what a CNN does is detect the desired features in the image data using the corresponding filters, extracting the significant features for prediction.

Let’s try to understand the concept of convolution with a simple 1-dimensional case first. Suppose our training image is a 1D array of numbers like the one below. We want to detect the point where the value changes from 0 to 1. There are other possible filters, but I’ll use a simple one: [-1, 1].

Sliding the detector from left to right, we get a convolution value at each position. As you can see, the outcome captures exactly the pattern we wanted.
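To make this concrete, here is a minimal NumPy sketch of the same idea. The array values are my own illustrative choice, and note that deep learning libraries slide the kernel without flipping it (strictly speaking, cross-correlation):

```python
import numpy as np

# Toy signal (illustrative values): we want to find where it jumps from 0 to 1.
signal = np.array([0, 0, 0, 1, 1, 1, 0, 0])
kernel = np.array([-1, 1])  # the simple change detector from the text

# Slide the kernel across the signal and take a dot product at each position.
output = np.array([
    signal[i:i + len(kernel)] @ kernel
    for i in range(len(signal) - len(kernel) + 1)
])
print(output)  # [ 0  0  1  0  0 -1  0] -> the 1 marks the 0-to-1 transition
```

Then let’s try a 2-dimensional case.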

I think you’re now getting the idea of what convolution is. By definition, convolution is a mathematical operation on two objects that produces an outcome expressing how the shape of one is modified by the other. Through this computation, we detect a particular feature in the input image, and the result carries information about that feature; it’s called a ‘feature map.’ Applied to a real image, the outcome looks like the example below.
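Before we scale up, here is a hypothetical 2D version of the same sliding-window computation, written as a plain NumPy loop so the idea stays visible (the toy image and the 2x2 edge-detecting kernel are my own illustrative choices):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel (a sketch)."""
    h = image.shape[0] - kernel.shape[0] + 1
    w = image.shape[1] - kernel.shape[1] + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kernel.shape[0],
                                     j:j + kernel.shape[1]] * kernel)
    return out

# A toy 5x5 image with a vertical edge between columns 1 and 2.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
kernel = np.array([[-1, 1]] * 2, dtype=float)  # 2x2 vertical-edge detector

feature_map = conv2d(image, kernel)
print(feature_map.shape)  # (4, 4)
print(feature_map)        # every row is [0. 2. 0. 0.]: the 2 marks the edge
```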

Now we can consider the same process in a 3-dimensional case. But it’s simple, because you can think of it as stacking the 2-dimensional data.

As you know, image data consists of three basic colors: red, green and blue. Say our image has 7x7 pixels in RGB; this means the data has a volume of 7x7x3. If we want to detect certain features with 4 filters, the convolution computation happens once per filter. Please pay attention to the volume of the outcome: how the height, the width and the depth change. If we use bigger filters, the height and the width of the outcome get smaller, and the number of filters determines the depth of the outcome.
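We can verify these shape rules with a quick sketch using the Keras API (the 7x7x3 input and the 4 filters come from the example above; everything else is a standard layer call):

```python
import tensorflow as tf

# 7x7 RGB input, 4 filters of size 3x3, no padding.
inputs = tf.keras.Input(shape=(7, 7, 3))
conv = tf.keras.layers.Conv2D(filters=4, kernel_size=3, padding="valid")(inputs)
print(conv.shape)  # (None, 5, 5, 4): 7 - 3 + 1 = 5, and depth = number of filters
```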

Convolution with padding and stride

There are other tricks we can apply to the convolution, one of which is padding. You might have already noticed that the pixels of the image aren’t all processed the same number of times: the pixels at the corners are covered less often than those in the middle, which means the pixels don’t get the same amount of weight. Additionally, if we just keep applying convolutions, the data shrinks too fast. Padding is the trick that fixes these problems. As its name suggests, padding adds extra pixels at the boundary of the data.

The first example in the picture above shows what we did in the previous section. The input image has 4x4 pixels and the filter is 3x3. There is no padding, which is called ‘valid’ padding. The result is 2x2 pixels (4 - 3 + 1 = 2); we can see that the output data is downsized.

Now let’s look at the third example. There is one layer of padding with blank pixels. The input image has 5x5 pixels and the filter is 3x3, so the result is 5x5 pixels (5 + 2*1 - 3 + 1 = 5), the same size as the input image. We call this ‘same’ padding. We could even make the outcome bigger than the input data, but these two cases are used the most.

By the way, does a filter always have to move one pixel at a time? Of course not. We can also make it move two or three steps at a time, both horizontally and vertically. This step size is called the ‘stride,’ as the sketch below shows.
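A short Keras sketch makes the effect of ‘valid’, ‘same’ and stride easy to see on a hypothetical 5x5 input (the sizes here are my own illustrative choices):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(5, 5, 1))

# 'valid': no padding, so the output shrinks.
valid = tf.keras.layers.Conv2D(1, kernel_size=3, padding="valid")(inputs)
print(valid.shape)    # (None, 3, 3, 1)

# 'same': zero padding keeps the output the same size as the input.
same = tf.keras.layers.Conv2D(1, kernel_size=3, padding="same")(inputs)
print(same.shape)     # (None, 5, 5, 1)

# A stride of 2 moves the filter two pixels at a time, roughly halving the size.
strided = tf.keras.layers.Conv2D(1, kernel_size=3, padding="same", strides=2)(inputs)
print(strided.shape)  # (None, 3, 3, 1)
```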

Now we can ask one question. Suppose we have an N x N input image and Nc filters of size f x f. We add p layers of padding and move the filters with a stride of S. Please take a moment to think and pull together everything we’ve discussed so far. What will be the size of the result?
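If you want to check your answer: each spatial dimension of the output comes out to floor((N + 2p - f) / S) + 1, and the depth of the output equals the number of filters Nc. A quick sanity check against the two examples above:

```python
import math

def conv_output_size(n, f, p, s):
    """Spatial size of the convolution output: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(n=4, f=3, p=0, s=1))  # 2, the 'valid' example above
print(conv_output_size(n=5, f=3, p=1, s=1))  # 5, the 'same' example above
```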

Pooling and flattening

What we’ve been walking through so far is the convolution layer. But there are additional layers beyond the convolution layer: the pooling layer and the flattening layer. Let’s talk about the first one. Pooling is a process of merging, so it’s basically for the purpose of reducing the size of the data.

Sliding a window over the data, we take only the maximum value inside the box in the left-hand case. This is ‘max pooling.’ We can also take the average values, as in the picture on the right; this is ‘average pooling.’ And we can tune the stride here too, just like in the convolution layer.
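Here is a minimal NumPy sketch of max pooling with a 2x2 window and a stride of 2 (the input values are made up for illustration):

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling with a 2x2 window and stride 2 (a minimal sketch)."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    # Group the array into 2x2 blocks, then take the max inside each block.
    return x[:h * 2, :w * 2].reshape(h, 2, w, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 5, 0],
              [1, 0, 2, 2],
              [3, 1, 0, 4]])
print(max_pool_2x2(x))
# [[4 5]
#  [3 4]]
```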

But isn’t this throwing away valuable data? Why are we reducing the size? At first glance it may look like losing information, but it’s closer to keeping the more ‘meaningful’ data than to losing it. By removing some of the noise in the data and extracting only the significant features, we can reduce overfitting and speed up the computation.

Flattening and fully-connected layers are what we have at the last stage of a CNN, which means you’re almost there! Great. By the way, you didn’t forget why we are doing all this, right? What are we doing? Image processing. For what? 🙄🙄 Classifying ‘the cats and dogs.’ We are building a classification model, which means the processed data should be a good input for that model. It needs to be in the form of a 1-dimensional vector; rectangular or cubic shapes can’t be direct inputs. And this is why we need flattening and fully-connected layers.

Flattening converts the data into a 1-dimensional array so it can be fed into the next layer. We flatten the output of the convolutional layers to create a single long feature vector, which is then connected to the final classification model through what is called a fully-connected layer. In other words, we put all the pixel data in one line and connect it to the final layer. And once again: what is the final layer for? Classifying ‘the cats and dogs.’
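In Keras terms, this last stage could look like the sketch below, where the 5x5x4 feature-map shape is just an assumed example and the single sigmoid unit handles the binary ‘cats and dogs’ case:

```python
import tensorflow as tf

# Assumed output shape of the convolution/pooling layers: 5x5 maps, 4 filters.
features = tf.keras.Input(shape=(5, 5, 4))
flat = tf.keras.layers.Flatten()(features)              # 5 * 5 * 4 = 100-dim vector
output = tf.keras.layers.Dense(1, activation="sigmoid")(flat)  # cat or dog
print(flat.shape, output.shape)  # (None, 100) (None, 1)
```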

In one shot

Now let’s see what we’ve walked through in one shot.

Adding multiple convolutional layers and pooling layers, the image is processed for feature extraction, and fully-connected layers then lead to a softmax (for a multi-class case) or sigmoid (for a binary case) output. I didn’t mention the ReLU activation step, but it’s no different from the activation step in an ANN.
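Putting the whole pipeline together, a minimal Keras sketch might look like this. The input size, filter counts and layer depths are illustrative assumptions, not the one true architecture:

```python
import tensorflow as tf

# A small end-to-end CNN for a hypothetical 64x64 RGB, binary 'cats vs. dogs' task.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(2),       # halve the spatial size
    tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),             # feature maps -> one long vector
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # use softmax for multi-class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```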

As the layers go deeper, the features the model deals with become more complex. For example, at the early stages a ConvNet looks for oriented line patterns, and then for simple figures. At the deeper stages, it can capture the specific forms of objects and is finally able to detect the object in an input image.

Understanding CNNs means understanding how the image data is processed: how to turn images into data and extract the features for prediction. From this basic structure, many modified and extended versions have grown; ResNet, AlexNet, VGG-16 and the Inception networks are some of them. Want to know more about these models? Check out the additional resources below!

And you might also find these series interesting as well! 😍👍

  • OpenCV tutorials for image processing
  • Advanced networks of convolutional neural networks

Thank you for reading, and I hope you found this post interesting. I’m always open to a chat, so feel free to leave comments below and share your thoughts. I also post other valuable DS resources weekly on LinkedIn, so please follow and reach out. I’ll come back with another exciting story. Until then, happy deep learning!
