Introduction
Each time you unlock your smartphone using Face ID or use real-time Google Translate with your camera, something insane is going on behind the scenes! CNNs are the backbone of many amazing applications and tools that we use all the time. This post will explain the intuition behind how CNNs work, without delving into complex probability functions and math equations. Everyone should have an opportunity to learn the basics of these tools, given how deeply ingrained they are in our lives now. For the nerdy folks, here is one of the best explanations provided by Stanford University.
Convolutional Neural Networks (CNNs) are the go-to solution for most image classification and object detection tasks today. Looking at ImageNet performance over time, we see that top-1 classification accuracy has really taken off, from about 63% with AlexNet in 2012 to over 88% in 2020!

What are Images?
CNNs can be split into 2 main parts: a feature extractor and a classifier. To understand this, we first have to know what makes up an image. The figure below shows a very simple image of the handwritten digit ‘8’. We can see that an image is basically a grid of pixels; in this case, it is an 18 x 18 grid. Each pixel takes a value from 0 to 255, indicating how ‘bright’ it is. Pixels with a value of 0 are completely black, while white pixels have a value of 255. Essentially, we can represent this image with just 18 * 18 = 324 numbers!
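To make the "grid of numbers" idea concrete, here is a minimal sketch using NumPy. The pixel values below are made up for illustration (a single white stroke, not the actual digit from the figure); only the 18 x 18 size and the 0 to 255 range come from the text.

```python
import numpy as np

# Hypothetical 18 x 18 grayscale image: 0 = black, 255 = white.
image = np.zeros((18, 18), dtype=np.uint8)

# Draw a white vertical stroke (illustrative only, not a real digit).
image[4:14, 8] = 255

# The whole image is just this many numbers.
num_values = image.size

print(image.shape)  # (18, 18)
print(num_values)   # 324
```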

From this, we can also extend to colored and more complex images. The main differences would be: a larger grid size due to the higher resolution, and there would be 3 data values for each pixel corresponding to Red, Green and Blue ‘brightness’. Here you can read more about how we use these 3 basic colors to represent nearly every color that we see on our screens!
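As a rough sketch of the same idea for color images, the array simply gains a third axis holding the Red, Green and Blue values. The 64 x 64 resolution and the two example pixels are assumptions chosen for illustration.

```python
import numpy as np

# Assumed resolution, purely for illustration.
height, width = 64, 64

# One (R, G, B) triple per pixel, each value from 0 to 255.
color_image = np.zeros((height, width, 3), dtype=np.uint8)

color_image[0, 0] = (255, 0, 0)    # pure red
color_image[0, 1] = (255, 255, 0)  # red + green shows up as yellow

# Three numbers per pixel instead of one.
total_values = color_image.size

print(color_image.shape)  # (64, 64, 3)
print(total_values)       # 12288
```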
CNN Basics
Now that we know what makes up an image, let’s look at an overview of what a CNN does. The feature extractor processes the image and outputs N features. These are just numbers describing certain characteristics of the image, and typically cannot be easily understood by humans. For the task of recognizing handwritten digits, some possible features could be the straightness or ‘curviness’ of the edges, the number of loops, and so on. The digit 8 would score strongly on curvy-edge features, but weakly on straight-edge features.

After the feature extractor, the features are fed into the classifier. This is simply a linear model that decides which digit is most probable based on the strength of each feature. In this case, an image with a large number of loops and ‘curvy’ edges would be more likely to be the digit 8 than the digit 4.
Of course, when we move on to more complex images such as classifying pictures of cats vs dogs, there can be hundreds of features! These could then describe the pointedness of the ears, the size of the nose, and texture of the object. Most likely, we would not be able to definitively understand what each feature represents, but hopefully the example above gives an intuition of what the CNN does.
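The classifier step for the digit example can be sketched in a few lines. Everything here is hypothetical: the three feature names, their values, and the hand-picked weights are assumptions for illustration, whereas a real CNN learns both the features and the weights from data.

```python
import numpy as np

# Hypothetical features from the extractor:
# [curviness, straightness, number of loops]
features = np.array([0.9, 0.1, 2.0])

# Hypothetical hand-picked weights, one row per class.
classes = ["4", "8"]
weights = np.array([
    [-1.0,  2.0, -0.5],  # "4": rewards straight edges
    [ 1.5, -1.0,  1.0],  # "8": rewards curvy edges and loops
])
bias = np.zeros(2)

# Linear model: one score per class.
scores = weights @ features + bias

# Softmax turns the scores into probabilities that sum to 1.
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()

predicted = classes[int(np.argmax(probs))]
print(predicted)  # 8
```

With curvy edges and two loops, the score for ‘8’ comes out far higher than the score for ‘4’, so the model picks ‘8’.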
What are the Convolutions in CNN?
It would be weird to explain CNNs without mentioning convolutions, since they are in the name of the technique itself. Without going too deep into the math or details (you can check out a great article here), convolutions are basically template-matching operations. They involve sliding a filter, or kernel (usually 3×3 in size), across the input image and calculating the ‘match’ at each position! By using different types of kernels, we are able to detect the positions of many different shapes, edges and textures.
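Here is a minimal sketch of that sliding operation in NumPy, assuming no padding and a step of one pixel. The function name `slide_kernel`, the toy 6 x 6 image, and the vertical-edge kernel are all made up for illustration. Strictly speaking, this computes a cross-correlation (the kernel is not flipped), which is what deep learning libraries usually call a convolution.

```python
import numpy as np

def slide_kernel(image, kernel):
    """Slide the kernel over the image and record the 'match' at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Multiply element-wise and sum: large when the patch resembles the kernel.
            output[i, j] = np.sum(patch * kernel)
    return output

# Toy 6 x 6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A kernel that responds to dark-to-bright vertical edges.
kernel = np.array([
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
])

response = slide_kernel(image, kernel)
print(response[0])  # [0. 3. 3. 0.]
```

The response is zero over the flat regions and peaks where the kernel straddles the edge, which is the sense in which a kernel ‘detects’ a pattern.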

The power of convolutions in image analysis lies in the fact that information in images is highly localized. Looking at the figure above, we see that each convolutional layer captures only a small visual field (3×3 pixels), but the effective receptive field can be expanded by stacking multiple layers. The localized nature of images allows us to look at only a small portion of the image at a time and still extract valuable information from it.
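To see how stacking layers widens the view, here is a small sketch that computes the effective receptive field. It assumes every layer uses the same kernel size with a step of one pixel and no pooling or dilation; the function name `receptive_field` is made up for illustration.

```python
def receptive_field(num_layers, kernel_size=3):
    """Width (in input pixels) seen by one output value after stacking layers.

    Assumes stride 1, no dilation, and no pooling between layers.
    """
    size = 1
    for _ in range(num_layers):
        # Each extra layer widens the view by (kernel_size - 1) pixels.
        size += kernel_size - 1
    return size

for n in (1, 2, 3):
    print(n, receptive_field(n))
# 1 3
# 2 5
# 3 7
```

Under these assumptions, three stacked 3×3 layers already see a 7×7 patch of the original image, even though each layer only ever looks at a 3×3 neighborhood.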
Conclusion
The intuitive idea behind CNNs is quite simple: stack up multiple filters, each with a small receptive field, to extract important features from images, then use a linear model to classify the image. CNNs are great because they strike a good balance between computational cost and task accuracy. Recently, Transformers have been introduced for image processing tasks, which could threaten the reign of CNNs! However, transformer-based methods still require immense computing power that is not available to feeble individuals like you and me. Feel free to read more about it [here](https://medium.com/swlh/an-image-is-worth-16×16-words-transformers-for-image-recognition-at-scale-brief-review-of-the-8770a636c6a8) and here.