When Computers Can See & Perceive!!!

It has been well established and documented that the computational capabilities of computers far exceed those of human beings. Hence, since the middle of the 20th century, computation-heavy activities have been gradually automated by leveraging computers. With the advent of Machine Learning (ML) and the increasing proficiency in its application, predictive models have taken precedence. Some examples of areas where ML has taken over are:
- Customer retention by proactive identification of customers likely to terminate their subscription
- Forecasting of sales, demand, revenue, etc. for any industry
- Identification of customers for cross-selling, in any industry
- Identification of customers likely to default on, say, mortgage payments
- Preventive maintenance of machines by identification of machines likely to break down
However, problems that had historically required human perception to arrive at a solution were beyond the reach of classical (static) ML algorithms. The good news is that with the massive breakthroughs in Deep Learning (DL) since the early 2010s, near-human-like performance is being observed in solving such problems that traditionally required human intuition. The aforementioned perceptual problems require the ability to process sound and images. So the ability to sense and process the world through hearing and seeing seemed to be the major skill that could facilitate the solution of such problems. These skills come naturally and intuitively to humans but had long been elusive for machines. Some examples of perceptual problems solved through the application of DL are given below:
- Near-human-level image classification and object detection
- Near-human-level speech recognition
- Near-human-level handwriting transcription
- Improved machine translation
- Improved text-to-speech conversion
- Digital assistants such as Google Now and Amazon Alexa
- Near-human-level autonomous driving
- Ability to answer natural-language questions
It must, however, be said that this is just the beginning and we are merely scratching the surface of what can be achieved. There is ongoing research to foray into the space of formal reasoning as well. If successful, this may help human beings in fields such as science, psychology, and software development.
The question that comes to mind is: how does the machine do this, i.e. gain the perceptual power to solve problems requiring human intuition? Number crunching, i.e. solving problems dealing with numbers and/or categories through supervised (or unsupervised) learning, was already within the realm of machines with huge computational capabilities. Being able to see and recognize the components of an image is a new skill, and this article's focus is to bring to attention the process that is now giving computers the ability to see and perform like humans in solving problems involving images. The contents of this article include the following:
- A brief introduction to Artificial Intelligence (AI) and Deep Learning (DL)
- Image recognition by application of Fully Connected Networks (FCN), i.e. Multi-Layer Perceptrons (MLP)
- Introduction to Convolutional Neural Networks (CNN)
- Why CNNs are used over MLPs for solving image-related problems
- Working of a CNN
- Comparison of performance between MLP and CNN
A brief introduction to Artificial Intelligence (AI) and DL
AI is a field that enables computers to mimic human intelligence through the application of logic, if-then rules, decision trees, and ML (including DL). The Artificial Neural Network (ANN) is a branch of AI. The pictorial representation below further elucidates the subsets of AI, i.e. ML and DL. Data Science cuts across all layers of AI.

The foundation of any ANN architecture starts with the premise that the model needs to go through multiple iterations, making allowance for mistakes, learning from those mistakes, and reaching saturation when it has learnt enough, thereby producing results equal or similar to reality. The picture below gives us a brief idea of what an ANN looks like and introduces the Deep Learning Neural Network.

An ANN is essentially a structure consisting of multiple layers of processing units (i.e. neurons) that take input data and process it through successive layers to derive meaningful representations. The word 'deep' in Deep Learning stands for this idea of successive layers of representation; the number of layers that contribute to a model of the data is called the depth of the model. The diagram above illustrates the structure: a simple ANN with only one hidden layer and a Deep Learning Neural Network (DNN) with multiple hidden layers. Thus a DNN is an ANN with multiple hidden layers.
Readers of this article are assumed to be familiar with the concepts that constitute the foundations of Supervised Learning and with how DL leverages them to learn the patterns in data (in this case, patterns in images) to achieve accurate results. A conceptual understanding of the items below will be required to better understand the rest of the article:
- Supervised Learning
- Deep Learning
- Loss Function
- Gradient Descent
- Forward and Backward Propagation
For a refresher, the reader can explore freely available material on these topics on the internet or go through my published article, Foundations of Deep Learning. Going through each of the above concepts in depth is beyond the scope of this article.
Image Recognition by Application of Fully Connected Networks, i.e. Multi-Layer Perceptrons
Let's now delve into a practical example of how an ANN can be leveraged to see images and classify them into appropriate categories. We will start with the application of an FCN on a real-world dataset and gauge the performance achieved. The reader is assumed to have a deeper understanding of what an FCN is and how it operates. At a very high level, an FCN or MLP is an ANN where each element of every layer is connected to each element of the following layer. Figure 3 below illustrates what an FCN looks like. For more details, please refer to my published article, Foundations of Deep Learning.
The problem we will try to solve involves the recognition of digits from images of house numbers taken from streets. The dataset is called Street View House Numbers (SVHN). SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirements on data formatting, but it comes from a significantly harder, unsolved, real-world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.
The goal we are going to achieve is to take an image from the SVHN dataset and determine what that digit is. This is a multi-class classification problem with 10 possible classes, one for each digit 0–9. Digit '1' has label 1, '9' has label 9, and '0' has label 10. Although there are close to 600,000 images in this dataset, we have extracted 60,000 images (42,000 training and 18,000 test images) for this project. The data comes in an MNIST-like format of 32-by-32 RGB images centered on a single digit (many of the images do contain some distractors at the sides).
We will use raw pixel values as input to the network. The images are matrices of size 32×32. So, we reshape each image matrix into an array of size 1024 (32 × 32) and feed this array to an ANN as shown below.

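The snippet below is a minimal sketch of this flattening step; the pixel values are randomly generated stand-ins, not actual SVHN data.

```python
import numpy as np

# Hypothetical stand-in for a single 32x32 grayscale SVHN digit image
image = np.random.randint(0, 256, size=(32, 32))

# Flatten the 32x32 matrix into a 1024-element vector for the fully connected network
flat = image.reshape(1024)
print(image.shape, "->", flat.shape)   # (32, 32) -> (1024,)
```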
Link to the dataset is here.
Below is the code to load the data and visualize a few of the images.

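Since the exact project code lives in the linked GitHub repository, the snippet below is only a minimal sketch of this step; it assumes the 60,000 extracted grayscale images have already been saved as NumPy arrays under the hypothetical file names shown.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file names; replace with the paths to your extracted SVHN arrays
X_train = np.load("svhn_X_train.npy")   # expected shape: (42000, 32, 32)
y_train = np.load("svhn_y_train.npy")   # expected shape: (42000,)
X_test = np.load("svhn_X_test.npy")     # expected shape: (18000, 32, 32)
y_test = np.load("svhn_y_test.npy")     # expected shape: (18000,)

# Visualize the first five training images along with their labels
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for ax, img, label in zip(axes, X_train[:5], y_train[:5]):
    ax.imshow(img, cmap="gray")
    ax.set_title(str(label))
    ax.axis("off")
plt.show()
```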
Multiple MLPs were built with different numbers of hidden layers and different numbers of neurons in each layer. The results from the models were compared, and the best result was achieved with three hidden layers. The model architecture is given below; it achieved an accuracy of ~80%.
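As a stand-in for the exact architecture, here is a hedged Keras sketch of a three-hidden-layer MLP; the layer widths and optimizer are illustrative assumptions, not necessarily the settings behind the reported ~80% accuracy.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Three hidden layers on the flattened 1024-pixel input; widths are assumptions
mlp = Sequential([
    Dense(256, activation="relu", input_shape=(1024,)),
    Dense(128, activation="relu"),
    Dense(64, activation="relu"),
    Dense(10, activation="softmax"),   # one output per digit class
])
mlp.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
mlp.summary()
```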
A visualization of how the loss gradually decreased and the accuracy increased over the iterations is depicted below.

Predictions were made for a few of the images in the test data (which were not exposed to the training process). A visualization of these predictions and a comparison with their original labels are given below.

We tested three images from the test set for visualization and see that the classification is accurate, which gives us a sense of achievement and a feeling of comfort. The accuracy achieved on the entire test set was 80%, as mentioned earlier. Since the test set consists of 18,000 images, an accuracy of 80% is fairly acceptable performance. However, is this the best approach, and is this the best result that can be achieved? Or are there better approaches? Let's transition to the next section with these questions in mind.
The entire code for the end-to-end project can be found in my GitHub account. The link to the code can also be found here.
Introduction to Convolutional Neural Networks (CNN)
While the Fully Connected Network (FCN) explained above achieved an acceptable accuracy, it can be improved further. Besides the FCN, there is another class of neural networks, called Convolutional Neural Networks (CNN), whose internal operations are more compatible with the processing requirements of visual images. CNNs have been found to give much better performance in use cases involving images.
CNNs are much better than FCNs because they are shift-invariant (or space-invariant) artificial neural networks, owing to their shared-weights architecture and translation-invariance characteristics. We will delve into this in depth in the following section.
Before we get into the details of how CNNs work and why they are better than FCNs, let's understand the primary underlying operation in a CNN: the convolution operation.
Convolution is a simple mathematical operation which is fundamental to many common image processing operators. Convolution provides a way of "multiplying together" two arrays of numbers, generally of different sizes but of the same dimensionality, to produce a third array of numbers of the same dimensionality. This can be used in image processing to implement operators whose output pixel values are simple linear combinations of certain input pixel values.
In an image processing context, one of the input arrays is normally just a pixel representation of the image. The second array is usually much smaller, and generally two-dimensional (although it may be just a single pixel thick), and is known as the kernel. Below is a pictorial representation of an image and a kernel that can be used for convolution.

The convolution is performed by sliding the kernel over the image, generally starting at the top left corner, so as to move the kernel through all the positions where the kernel fits entirely within the boundaries of the image. (Note that implementations differ in what they do at the edges of images, as explained below.) Each kernel position corresponds to a single output pixel, the value of which is calculated by multiplying together the kernel value and the underlying image pixel value for each of the cells in the kernel, and then adding all these numbers together.
So, in our example, the value of the bottom right pixel in the output image will be given by:
O(3,3) = I(3,3)·W(1,1) + I(3,4)·W(1,2) + I(4,3)·W(2,1) + I(4,4)·W(2,2)
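To make the sliding-window computation concrete, here is a small NumPy sketch; the image and kernel values are made up for illustration, and the indices in the final comment are 1-based to match the formula above.

```python
import numpy as np

# Hypothetical 4x4 image I and 2x2 kernel W (values chosen only for illustration)
I = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12],
              [13, 14, 15, 16]])
W = np.array([[1, 0],
              [0, -1]])

# Slide the kernel over every position where it fits entirely inside the image
out_h = I.shape[0] - W.shape[0] + 1
out_w = I.shape[1] - W.shape[1] + 1
O = np.zeros((out_h, out_w))
for r in range(out_h):
    for c in range(out_w):
        O[r, c] = np.sum(I[r:r + 2, c:c + 2] * W)

# The bottom-right output pixel O(3,3) equals
# I(3,3)W(1,1) + I(3,4)W(1,2) + I(4,3)W(2,1) + I(4,4)W(2,2)
print(O)
```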
Image classification done by a CNN works by taking an input image, processing it, and classifying it under certain categories (e.g., 0 through 9 in our SVHN use case). A computer sees an input image as an array of pixels. Based on the image resolution, it will see h × w × d (h = height, w = width, d = depth, i.e. the number of channels).

Convolution is the first layer used to extract features from an input image. Convolution preserves the relationship between pixels by learning image features using small squares of input data. It is a mathematical operation that takes two inputs: an image matrix and a filter or kernel.
The diagram below illustrates how a convolution (or correlation) operation is performed. Here I represents the image and W represents the kernel that convolves over the input image.

The above can be further clarified with a practical example of a kernel acting as a filter on an image represented by its pixel values. The formula above is applied to the image input and the kernel weights to arrive at the kernel or filter output.

So, when we have a bigger image and a (typically) smaller kernel, the kernel slides over the image as can be seen below, and the output is computed through the convolution operation at every position. The yellow square that slides over the green square is the kernel, the green square is the image represented by its pixel values, and the pink square on the right is the convolution output, which is essentially the learned feature.
Why CNNs are used over MLPs for solving image-related problems
Now that the convolution operation has been addressed, let's try to answer the above question. The major reasons are:
- A simple FCN requires a humongous number of parameters to train the model
- In images, spatial correlation is local
- The number of images generally available is limited
- Translational invariance is a key necessity when dealing with images
A simple FCN requires a humongous number of parameters to train the model
FCNs (or MLPs) use one perceptron for each input (e.g. each pixel in an image), and the number of weights rapidly becomes unmanageable for large images. The network includes too many parameters because it is fully connected: each node is connected to every node in the next and previous layers, forming a very dense web. For the sake of argument, if we had an ANN as seen in Figure 3, the number of parameters with only two hidden layers would be more than 200 million, and this is for an image whose dimension is only 32×32. For mid-sized images with dimensions of 200×200 pixels, the number of parameters that would need to be trained goes through the roof. Thus, using FCNs (or MLPs) for solving complicated image-related problems is not an optimal method from the point of view of computational capacity. Additionally, if the number of available images is small, training such a high number of parameters will lead to overfitting.
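A quick back-of-the-envelope calculation illustrates the explosion; the hidden-layer width of 1,000 units is an assumption made purely for illustration.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases for a single fully connected layer."""
    return n_in * n_out + n_out

hidden = 1000  # assumed hidden-layer width, for illustration only

# 32x32 grayscale input (the SVHN case)
print(dense_params(32 * 32, hidden))        # 1,025,000 parameters in the first layer alone

# 200x200 RGB input (a mid-sized colour image)
print(dense_params(200 * 200 * 3, hidden))  # 120,001,000 parameters in the first layer alone
```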
In images, spatial correlation is local; CNNs help achieve translational invariance
Another common problem is that MLPs react differently to an input image and its shifted version; they are not translation invariant. For example, if a picture of a space shuttle appears in the top left of the image in one picture and in the bottom right of another picture, the MLP will try to correct itself and assume that a space shuttle will always appear in that section of the image. Refer to Figure 11 below for an illustration. Hence, MLPs are not the best choice for image processing. One of the main problems is that spatial information is lost when the image is flattened (matrix to vector) before being fed into an MLP. In an FCN the correlation between pixels is not local, i.e. the components of the space shuttle are related to the background or to any other object present in the image (for example, the car in the first section of Figure 11). Hence, if the picture shifts, the learning in the model suffers. This is addressed in CNNs because they depend on the convolution operation, and the activations remain the same even if the picture shifts. The activations are localized as the kernel slides over the image to produce the activations at each position, and the weights are shared.

Working of a CNN
The image below is a representation of a CNN for an input image of size 32×32. We have 20 kernels of dimension 5×5 which perform the convolution operation as explained earlier. The dimension of the feature maps resulting from the first convolution operation on the image is 28×28. The CONV layer in the figure below represents the same, i.e. 20 feature maps of dimension 28×28. The mathematical formula for calculating the resulting dimension of any convolution operation is given later in this section.

The new concept introduced here that has not been discussed earlier is the pooling operation represented by the POOL layer in the image above.
The pooling operation involves sliding a two-dimensional filter over each channel of the feature map and summarizing the features lying within the region covered by the filter. The mathematical calculation for a pooling layer is very similar to that of a convolution operation. The convolution layer helps with learning the features of the image and produces as many feature maps as the number of kernels used. The pooling layer performs a down-sampling operation and summarizes the features in the feature maps generated by the convolution layer. This helps expand the effective receptive field of the network. A typical CNN architecture generally has a number of convolution and pooling layers stacked one after the other.
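The sketch below checks the dimensions described above in Keras; the single-channel input, ReLU activation, and 2×2 pooling window are assumptions made for the sake of illustration.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.models import Model

# CONV -> POOL stage: 20 kernels of size 5x5 on a 32x32 single-channel input
inputs = Input(shape=(32, 32, 1))
conv = Conv2D(20, kernel_size=(5, 5), activation="relu")(inputs)  # -> 28 x 28 x 20
pool = MaxPooling2D(pool_size=(2, 2))(conv)                       # -> 14 x 14 x 20
Model(inputs, pool).summary()
```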
Convolutional layers in a CNN systematically apply learned filters to input images in order to create feature maps that summarize the presence of those features in the input.
Convolutional layers prove very effective, and stacking convolutional layers in deep models allows layers close to the input to learn low-level features (e.g. lines) and layers deeper in the model to learn high-order or more abstract features, like shapes or specific objects.
A limitation of the feature map output of convolutional layers is that it records the precise position of features in the input. This means that small shifts in the position of a feature in the input image will result in a different feature map. This is addressed by the introduction of pooling layers. The pooling layer summarizes the features present in a region of the feature map generated by a convolution layer, so further operations are performed on summarized features instead of the precisely positioned features generated by the convolution layer. This makes the model more robust to variations in the position of the features in the input image. The two major pooling methods are given below, with a small numerical sketch of both after the list.
- Average Pooling: Calculate the average value for each patch of the feature map.

- Maximum Pooling (or Max Pooling): Calculate the maximum value for each patch of the feature map.

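Here is the promised numerical sketch of both pooling methods; the 4×4 feature map is hypothetical, pooled with a 2×2 window and a stride of 2.

```python
import numpy as np

# Hypothetical 4x4 feature map
feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 8, 9, 4],
                        [3, 1, 2, 6]])

# Split the map into four non-overlapping 2x2 patches
patches = feature_map.reshape(2, 2, 2, 2).swapaxes(1, 2)

print(patches.max(axis=(2, 3)))   # max pooling     -> [[6 5] [8 9]]
print(patches.mean(axis=(2, 3)))  # average pooling -> [[3.5 2.5] [4.75 5.25]]
```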
The last thing I want to explain before moving back to the running case study is the formula for calculating the dimension of the feature maps when the convolution layer is applied to the input image. The same formula applies in the subsequent convolution and pooling layers.
Output image dimension = (W - F + 2P)/S + 1
where
W = size of the input image
F = size of the filter (kernel)
P = padding
S = stride
Stride denotes how many steps the kernel moves at each step of the convolution; by default it is one. With a convolution of stride 1, we can observe that the size of the output is smaller than that of the input. To keep the dimension of the output the same as the input, we use padding. Padding is the process of adding zeros to the input matrix symmetrically. A more detailed explanation of the two can be found here, amongst much other freely available content on the internet.
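A tiny helper makes the formula easy to sanity-check; the example calls use the dimensions from the running SVHN case and assume 'valid' convolutions unless padding is specified.

```python
def conv_output_size(w: int, f: int, p: int = 0, s: int = 1) -> int:
    """Output dimension of a convolution or pooling layer: (W - F + 2P)/S + 1."""
    return (w - f + 2 * p) // s + 1

print(conv_output_size(32, 5))       # 28: a 5x5 kernel on a 32x32 image
print(conv_output_size(28, 2, s=2))  # 14: 2x2 pooling with stride 2
print(conv_output_size(32, 5, p=2))  # 32: padding of 2 keeps the size unchanged
```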
Comparison of performance between MLP and CNN
Let's continue with the digit recognition problem we were trying to solve on the SVHN dataset. We had previously achieved an accuracy of 80% using an FCN (or MLP). We will now experiment with applying CNNs to the dataset.
The code for loading the data and pre-processing it (normalizing the input and one-hot encoding the labels) is provided below.
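As before, the snippet is a hedged sketch rather than the project's exact code; the file names are hypothetical, and the label remapping assumes the raw SVHN convention in which digit '0' carries label 10.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# Hypothetical file names; replace with the paths to your extracted SVHN arrays
X_train = np.load("svhn_X_train.npy").astype("float32")
y_train = np.load("svhn_y_train.npy")
X_test = np.load("svhn_X_test.npy").astype("float32")
y_test = np.load("svhn_y_test.npy")

# Normalize pixel values to [0, 1] and add a channel axis for the convolution layers
X_train = (X_train / 255.0).reshape(-1, 32, 32, 1)
X_test = (X_test / 255.0).reshape(-1, 32, 32, 1)

# Map label 10 (digit '0') to 0, then one-hot encode the 10 classes
y_train = to_categorical(y_train % 10, num_classes=10)
y_test = to_categorical(y_test % 10, num_classes=10)
```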
The model architecture for the first CNN attempted is given below.
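The layout below is an illustrative Keras sketch of such a CNN; the filter counts, dropout rate, and dense width are assumptions and may differ from the architecture actually used in the project.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

cnn = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.3),
    Dense(10, activation="softmax"),
])
cnn.summary()
```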
The model is then compiled and the data passed through it for training. The model gives an accuracy of 90% on the validation data, which is already a 10-percentage-point improvement over the performance achieved with the MLP. The code to compile, train, and evaluate performance on the validation data is given below.
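Again, treat this as a hedged sketch: the optimizer, epoch count, batch size, and validation split are assumptions, and the arrays and model come from the snippets above.

```python
cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

history = cnn.fit(X_train, y_train,
                  validation_split=0.2,   # hold out part of the training data for validation
                  epochs=20,
                  batch_size=128)

print("Validation accuracy: {:.0%}".format(history.history["val_accuracy"][-1]))
```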
The output from the print statement above is 90%.
A visual representation of how the model learns the features iteratively and gradually improves in performance is given below. We see a trend of decreasing loss and increasing accuracy with each epoch.

After a few experiments with different model architectures, the best result was achieved with an ensemble of 7 CNNs, which reached an accuracy of 95%. The confusion matrix from the model is given below. The link to the entire code is here on GitHub.

Conclusion
In this article we were introduced to two different methods of solving image classification problems, i.e. FCNs and CNNs. We witnessed the difference in performance between the two methods, owing to the superior modus operandi of the CNN. The advantages CNNs bring to image classification are built upon further for solving much more sophisticated problems like object detection and semantic segmentation. Both are key components of sophisticated solutions like self-driving cars, where the machine has to identify the different objects in front of the car and their structures in order to steer the car in the appropriate direction and at the appropriate angle. The opening image in this article is that of my study table, followed by the identification of all the objects on it. All of these achievements stand on the underlying concepts of the CNN. In the next few articles we will explore a new concept called transfer learning, which helps us deal with scenarios where we may not have enough training images. Stay tuned!!!