A Conceptual Explanation of Convolutional Neural Networks (CNNs)

A gentle introduction to the inner workings of a popular algorithm used for computer vision

Vishnu Nair
Towards Data Science


Photo by Bud Helisson on Unsplash

Computer vision: the branch of machine learning that seems complicated, is complicated, and will always be complicated. As Mike Krieger, the co-founder of Instagram, says, “Computer vision and machine learning have really started to take off, but for most people, the whole idea of what is a computer seeing when it’s looking at an image is relatively obscure”. Hopefully, by the end of this article, you will have a sense of what is going on under the hood from the standpoint of convolutional neural networks.

There is a large demand for programmers who know the ins and outs of computer vision due to the sheer breadth of optical problems it is capable of tackling. Using computer vision, we can perform facial recognition, analyze medical images, enable autonomous driving, and, most importantly, use neat filters on Snapchat. Many organizations are using such algorithms to accomplish amazing feats. Veo Robotics is a rapidly growing startup that has created Veo FreeMove, a 3D safeguarding system that uses computer vision to enable safer collaboration between robots and humans in the manufacturing workplace. Tesla's Autopilot feature, which we regard with a mix of apprehension and fascination, relies on advanced object-detection algorithms. A research group at Stanford University is using X-ray images to quantify the severity of knee osteoarthritis in patients with, you guessed it, convolutional neural networks! The point is that this seemingly niche branch of machine learning has applications in a plenitude of fields, including social media, healthcare, research, and even manufacturing. In this blog, let us home in on one popular algorithm employed for image classification: convolutional neural networks (CNNs).

In mathematics, convolution is an operation that combines two functions to produce a new one. In the field of computer vision, however, convolution is the process of filtering an input image to get a more meaningful output. Now that we have defined convolution, let us consider Mike’s concern about what a computer actually sees when given an image. Look at the wallpaper on your phone/desktop. Assuming it is not a blank screen, you can see edges that define the outer outlines and inner details of people, animals, landscapes, or any object for that matter. Your brain uses the collection of these edges, and the colors filling the spaces between them, to recognize objects. A CNN essentially does the same thing. When we give a CNN an input image, it sees an array of numbers that represent the pixel intensities of that image. Pixel intensities range from 0 (black) to 255 (white). We have the option of working with RGB, grayscale, or black/white (binary) images. For image classification, using binary images would save a lot of computational expense, but it is usually not ideal since many edges would not be represented. Depending on the project you are working on, it is important to decide whether to keep your images in color or convert them to grayscale; the latter saves computational cost because a color image stores three bytes per pixel, one for each color channel (red, green, blue), whereas a grayscale image stores only one.
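To make the "array of numbers" idea concrete, here is a minimal numpy sketch (my own toy values, not from the article) showing that an RGB image is three intensities per pixel, and that the standard luminosity weights collapse it to a single grayscale value per pixel:

```python
import numpy as np

# A tiny 2x2 "image": each pixel holds three intensities (R, G, B), 0-255.
rgb = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [128, 128, 128]],
], dtype=np.uint8)

# Standard luminosity weights for converting RGB to grayscale.
weights = np.array([0.299, 0.587, 0.114])
gray = (rgb @ weights).astype(np.uint8)

print(rgb.shape)   # (2, 2, 3) -> three values per pixel
print(gray.shape)  # (2, 2)    -> one value per pixel
```

This is why grayscale is cheaper: the network only has one channel of numbers to convolve instead of three.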

An example of Gaussian blurring being applied to the Lenna test image widely used in the field of image processing. Photo by Nagy Arnold on Unsplash

Gaussian Blurring

Before getting into the CNN, let’s discuss image pre-processing a little more. When working with images, it is common for the objects inside them to have unclear boundaries. This leads to the CNN seeing many so-called ‘fake edges’ that ultimately hurt performance in classification, detection, etc. Thus, it is sometimes necessary to smooth the pixel intensities (blur the image). A common way to perform this de-noising step is Gaussian blurring, available, for example, through OpenCV’s GaussianBlur function.
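Gaussian blurring is itself just a convolution with a bell-shaped kernel. The following numpy sketch (an illustration with my own toy image; in practice you would reach for a library such as OpenCV) softens a sharp step edge:

```python
import numpy as np

# A 3x3 Gaussian kernel, normalized so its weights sum to 1.
kernel = np.array([[1, 2, 1],
                   [2, 4, 2],
                   [1, 2, 1]], dtype=float) / 16.0

def gaussian_blur(image):
    """Blur a 2D grayscale image by convolving it with the kernel above."""
    padded = np.pad(image, 1, mode="edge")  # replicate border pixels
    h, w = image.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # Weighted average of each pixel's 3x3 neighborhood.
            out[i, j] = np.sum(padded[i:i+3, j:j+3] * kernel)
    return out

# A sharp step edge: blurring softens the abrupt 0 -> 255 transition.
img = np.zeros((4, 4))
img[:, 2:] = 255
blurred = gaussian_blur(img)
```

After blurring, pixels next to the edge take intermediate values instead of jumping straight from 0 to 255, which is exactly the smoothing effect described above.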

Filters and Feature Maps

The CNN uses filters to scan the image horizontally and vertically to recognize edges. For future reference, a kernel is a single 2D array of weights; a filter is composed of one or more kernels (one per input channel). The gradients, or differences in pixel intensities, are captured by these filters, where larger-magnitude gradients correspond to a stronger presence of an edge. The Sobel filter is a commonly used example because it can scan both horizontally and vertically; vertical scans reveal horizontal edges and vice versa. Think of a filter as a weight matrix that is multiplied element-wise with pixel intensities to form a new image that captures features of the original. These products are then summed to obtain an output value in the corresponding pixel position of the output image. This new image is called a feature map. Through back-propagation, the CNN updates the weights contained within these filters, allowing the most information-capturing feature maps to be created. Once the weights are learned, they are shared across all of the neurons in that layer. In the convolutional layers of the neural net, these filters are applied and the feature maps are created. In Keras, the first argument when building a convolutional layer is the number of filters; powers of 2 are a common convention, though not a requirement.
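To illustrate the multiply-and-sum operation, here is a small numpy sketch (my own toy example, not the article's code) that applies a Sobel kernel to an image containing a single vertical edge; the feature map responds strongly only where the edge sits:

```python
import numpy as np

# Sobel kernel sensitive to horizontal intensity changes, i.e. vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def convolve2d(image, kernel):
    """Slide a 3x3 kernel over the image (no padding, stride 1)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            # Element-wise multiply the patch by the kernel, then sum.
            out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
    return out

# Image with a vertical edge: dark on the left, bright on the right.
img = np.zeros((5, 5))
img[:, 3:] = 255
feature_map = convolve2d(img, sobel_x)
```

Positions whose 3x3 window lies entirely in the flat dark region produce 0, while windows straddling the dark-to-bright transition produce a large value, which is the "larger-magnitude gradient means stronger edge" idea in action.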

Stride and Padding

But how exactly does the filter scanning process work? Glad you asked. Conceptually, each kernel is applied to successive portions of the input image (in practice this scanning is done by many kernels at once). Note that the kernel reads overlapping portions of the image as it shifts, in this case one pixel at a time. The magnitude of this shift is called the stride, and in Keras it is another argument to the layer. As we increase the stride, the feature map becomes more condensed. The pixels at the very edges of the input image, however, are covered by fewer kernel positions than the interior pixels, so they contribute less to the output. Padding is done to circumvent this issue: it is the addition of artificial values, usually zeros, around the outermost edges of the input image so that the filter captures the data at those edges to a higher degree.
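The effect of stride and padding on output size follows a simple formula: floor((n + 2p − f) / s) + 1, where n is the input size, f the kernel size, p the padding, and s the stride. A quick sketch (illustrative numbers of my own choosing):

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Spatial size of a conv output: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1

# A 3x3 kernel on a 28x28 image with no padding and stride 1 shrinks it,
print(conv_output_size(28, 3))                       # 26
# while padding of 1 pixel preserves the 28x28 size,
print(conv_output_size(28, 3, padding=1))            # 28
# and a stride of 2 roughly halves each dimension.
print(conv_output_size(28, 3, padding=1, stride=2))  # 14
```

This makes the two trade-offs visible at a glance: a larger stride condenses the feature map, and padding keeps border pixels from being under-represented while preserving spatial size.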

Dissection of the CNN

Now that we have defined all of the niche terms, let’s put them all in context. One can easily construct a CNN using Keras. In your model, you will need convolutional layers, which, as stated before, apply the weights contained in the filters to the input image, generating feature maps. A common activation function to use is ReLU. After each convolutional layer, it is common practice to add a pooling layer to reduce the dimensionality of the feature map; this condenses the most important features the model captured in each localized region of the input image. Additionally, we can use batch normalization layers after each layer containing an activation function. This reduces internal covariate shift and thus allows for faster, more stable training of the neural network. It is also good practice to include dropout layers in your model to avoid overfitting. These layers discard a portion of the outputs from the preceding layer, replacing them with zeros and thereby changing the input to the subsequent layer. Finally, a flattening layer can be applied (based on your needs) to get a one-dimensional array to feed into the output layer of your model. For classification, the softmax function can be used to produce a vector of probabilities representing how likely the image is to belong to each class.
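Two of the steps above, pooling and the softmax output, are easy to sketch directly in numpy (toy values of my own, not the article's code):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 region."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    exps = np.exp(logits - np.max(logits))  # subtract max for stability
    return exps / exps.sum()

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 2.],
                 [2., 0., 1., 3.]])
pooled = max_pool_2x2(fmap)   # 4x4 -> 2x2, keeping the strongest activations
probs = softmax(np.array([2.0, 1.0, 0.1]))
```

Max pooling halves each spatial dimension while retaining the strongest activation in each local region, and softmax turns the final layer's raw scores into the class probabilities the article describes.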

Graphic created by author

Keep in mind that all of this information merely scratches the surface of a CNN; we are looking at the model from a bird's-eye view. There are many moving parts to consider, and tuning the model for optimal parameters will be time- and cost-intensive depending on the scale of your project. I highly recommend doing an image classification project of your own to better understand the workings of this model. As always, feel free to reach out if you have any questions!


Data Science Masters student @ UPenn. Passionate about using data science and AI for health & environmental avenues