Demystifying Convolutional Neural Networks

Ravindra Parmar
Towards Data Science
6 min read · Oct 24, 2018


Convolutional Neural Networks

Brief History

In the past decade, the advances made in the field of computer vision have been truly amazing and unprecedented. Machines can now recognize images and frames in videos with an accuracy (98 percent) surpassing that of humans (97 percent). The machinery behind this feat is inspired by the functioning of the human brain.

Neurologists conducting experiments on cats discovered that similar parts of an image cause similar parts of the cat's brain to become active. In other words, when a cat looks at a circle, the alpha zone of its brain is activated; when it looks at a square, the beta zone is activated. Their findings suggested that an animal's brain contains zones of neurons that react to specific characteristics of an image, i.e., it perceives the environment through a layered architecture of neurons, and every image passes through a kind of feature extractor before travelling further into the brain.

Inspired by this functioning of the brain, mathematicians rushed to create a system that could simulate different groups of neurons firing for different aspects of an image and communicating with each other to form a bigger picture.

Feature Extractor

They translated the idea of a group of neurons activating for a specific input into the mathematical notion of a multi-dimensional matrix acting as a detector for a specific set of features, also known as a filter or kernel. Each such filter serves the purpose of detecting something specific in the image, e.g., a filter for detecting edges. The features learned this way are then passed through another set of filters designed to detect higher-level features, e.g., eyes, nose, etc.

Convolution of an image with a Laplacian filter to detect edges

Mathematically, we perform a convolution operation between the given input image, represented as a matrix of pixel intensities, and a filter to produce the so-called feature map. This feature map then serves as the input to another layer of filters.
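
To make this concrete, here is a minimal sketch of that operation in Python (the tiny image values and the use of `scipy.signal.convolve2d` are illustrative assumptions, not part of the article): convolving a small grayscale image with a Laplacian kernel produces a feature map that highlights edges.

```python
import numpy as np
from scipy.signal import convolve2d

# A tiny grayscale "image" as a matrix of pixel intensities.
image = np.array([
    [10, 10, 10, 10, 10],
    [10, 50, 50, 50, 10],
    [10, 50, 90, 50, 10],
    [10, 50, 50, 50, 10],
    [10, 10, 10, 10, 10],
], dtype=float)

# A 3x3 Laplacian kernel: it responds strongly where intensity changes, i.e. at edges.
laplacian = np.array([
    [0,  1, 0],
    [1, -4, 1],
    [0,  1, 0],
], dtype=float)

# Convolving the image with the kernel yields a feature map highlighting edges.
feature_map = convolve2d(image, laplacian, mode="valid")
print(feature_map)
```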

Why Convolution?

Convolution is a process where the network tries to label the input signal by referring to what it has learned in the past. If the input signal looks like previous cat images it has seen, the “cat” reference signal will be convolved with, or mixed into, the input signal. The resulting output signal is then passed on to the next layer. (Here, the input signal is a 3-D representation of the input image in terms of RGB pixel intensities, whereas the “cat” reference signal is the kernel learned for recognizing cats.)

Convolution operation of image and filter. Source

One nice property of the convolution operation is that it is translation invariant. Each convolution filter represents a specific feature set, e.g., eyes or ears, and the CNN algorithm excels at learning which feature sets combine to form the final reference, e.g., a cat. The output signal strength does not depend on where the features are located, only on whether they are present. Hence, a cat could be sitting in different positions, and the CNN algorithm would still be able to recognize it.

Pooling

Following the trajectory laid down by the biological functioning of the brain, researchers were able to set up the mathematical apparatus needed for feature extraction. However, after understanding the number of levels and features they would need to analyse in order to track complex geometrical shapes, they realized they would fall short on memory to hold all of that data. The computing power needed to process it all would also explode with the number of features. They soon devised a technique known as pooling to solve this problem. Its core idea is very simple:

If an area contains strongly expressed features, we can avoid searching for other features in that area.

Demonstration of max pooling

This pooling operation, in addition to reducing memory and computing requirements, also helps remove noise from the images.
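
As a rough sketch (the helper name `max_pool2d` and the example values are mine, not from the article), max pooling over 2×2 windows keeps only the strongest activation in each region, shrinking the feature map by a factor of four:

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling: keep only the strongest activation in each window."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

feature_map = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 2],
    [3, 1, 0, 2],
    [8, 2, 4, 9],
], dtype=float)

print(max_pool2d(feature_map))   # -> [[6. 5.], [8. 9.]]
```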

Fully Connected Layer

So far so good, but what would a network be useful for if it only ends up detecting sets of features in an image? We need a way for the network to classify the given image into some set of categories. This is where the traditional neural network setup can be utilized. In particular, we can add a fully connected layer from the feature maps detected by the earlier layers to the number of labels we have for categorization. This final layer assigns probabilities to each class in the output, and based on these probabilities we can finally classify the image into one of the output categories.

Fully Connected Layer. Source
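
A minimal sketch of that final step (the feature size, the random weights, and the three-class setup here are illustrative assumptions): flatten the feature maps, apply one fully connected layer, and turn the resulting scores into class probabilities with a softmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: 32 flattened features and 3 output categories.
rng = np.random.default_rng(0)
flattened_features = rng.random(32)           # output of the last conv/pool layer, flattened
weights = rng.standard_normal((3, 32)) * 0.1  # one row of weights per class
bias = np.zeros(3)

# Fully connected layer: every feature is connected to every class score.
scores = weights @ flattened_features + bias
probabilities = softmax(scores)
print(probabilities, probabilities.argmax())  # class probabilities and the predicted label
```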

Final Architecture

The only thing remaining is to combine all of these concepts into a single framework that we call a Convolutional Neural Network, aka CNN. Essentially, a CNN consists of a series of convolution layers, optionally combined with pooling layers, to generate a feature map, which is then fed into a bunch of fully connected layers to produce class probabilities. By back-propagating the resulting errors, we can train this setup to generate accurate results.
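
To tie the pieces together, here is a hedged sketch of such a network in PyTorch (the specific layer sizes, the 32×32 RGB input, and the 10-class output are assumptions made for illustration, not details from the article):

```python
import torch
import torch.nn as nn

# A small CNN in the spirit described above: conv -> pool -> conv -> pool -> fully connected.
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution layer
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)      # class scores (logits)

model = SimpleCNN()
logits = model(torch.randn(1, 3, 32, 32))   # one fake 32x32 RGB image
print(logits.shape)                          # torch.Size([1, 10])
```

Training this with a cross-entropy loss and back-propagation is what actually tunes the filters and the fully connected weights.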

Now, with the functional perspective properly laid down, let's dive a bit into the operational aspects of a CNN.

Convolutional Neural Networks

Convolution layer

The convolution layer is the main building block of a CNN. Each such layer comprises a set of independent filters, each looking for a different feature set in the given image.

Convolution operation. Source

Mathematically, we take a filter of fixed size and slide it over the complete image, taking the dot product between the filter and each chunk of the input image along the way. The result of each dot product is a scalar that goes into the resulting feature map. We then slide the filter to the right and perform the same operation, adding that result to the feature map as well. After convolving the complete image with the filter, we end up with a feature map representing a distinct feature set, which again serves as input to further layers.
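
Here is a bare-bones sketch of that sliding-window procedure (the helper name `convolve2d_naive` and the averaging kernel are illustrative; like most CNN frameworks it skips the kernel flip of a textbook convolution). It also previews the stride parameter discussed next.

```python
import numpy as np

def convolve2d_naive(image, kernel, stride=1):
    """Slide the kernel over the image; each dot product becomes one cell of the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)   # dot product of filter and image chunk
    return feature_map

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0          # a simple averaging filter, just for illustration

print(convolve2d_naive(image, kernel).shape)            # (4, 4) with stride 1
print(convolve2d_naive(image, kernel, stride=2).shape)  # (2, 2) with stride 2
```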

Strides

The amount by which the filter shifts is the stride. In the image above, we slide the filter by a factor of 1. That might not be what we always need. The intuition behind using a stride of more than 1 is that neighboring pixels are strongly correlated (especially in the lowest layers), so it makes sense to reduce the size of the output by applying an appropriate stride. However, a large stride can lead to high information loss, so we must be careful when choosing the stride.

Example with a stride of 2. Source

Padding

Padding of one layer. Source

One undesirable effect of repeatedly applying convolutions, especially with strides, is that the feature map keeps shrinking, and shrinking here means information loss as well. To see why this is the case, notice the difference in the number of times the filter is applied to a cell in the middle versus a cell at the corner. Clearly, the information from cells in the middle gains more importance than that from cells at the edges for no particular reason. To retain the useful information from the earlier layers, we can surround the given matrix with layers of zeros.
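
The standard output-size formula makes the trade-off explicit (a small sketch; the 32×32 input and 3×3 filter are just example numbers): an n×n input convolved with an f×f filter, padding p, and stride s produces a map of side (n + 2p − f)/s + 1.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Spatial size of the feature map for an n x n input and an f x f filter."""
    return (n + 2 * padding - f) // stride + 1

print(conv_output_size(32, 3))                       # 30: the map shrinks with no padding
print(conv_output_size(32, 3, padding=1))            # 32: "same" padding preserves the size
print(conv_output_size(32, 3, padding=1, stride=2))  # 16: stride 2 halves the map
```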

Parameter sharing

Why use a CNN when we already have a good setup with deep neural networks? Interestingly, if we were to use a plain, fully connected deep neural network for image classification, the number of parameters at each layer would be thousands of times more than with a CNN.

Parameter sharing in CNN
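
A quick back-of-the-envelope sketch shows why (the 224×224×3 input, the 1000 hidden units, and the 64 filters are assumed numbers, not from the article): because a filter's weights are shared across every position in the image, a convolution layer needs orders of magnitude fewer parameters than a fully connected layer over the same input.

```python
# Rough parameter-count comparison for a 224 x 224 x 3 input (illustrative numbers).

# Fully connected: every input pixel connects to every one of, say, 1000 hidden units.
fc_params = (224 * 224 * 3) * 1000          # ~150 million weights

# Convolutional: 64 filters of size 3 x 3 x 3, shared across all image positions.
conv_params = 64 * (3 * 3 * 3 + 1)          # ~1.8 thousand weights (+1 for each bias)

print(fc_params, conv_params)               # 150,528,000 vs 1,792
```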

Please let me know through your comments any modifications/improvements this article could accommodate.
