Understanding Convolutional Neural Networks

Published in

Towards Data Science

7 min read · May 24, 2018

In this blog post, we’re going to explore (or at least attempt to) the intuition behind Convolutional Neural Networks, one of the most important deep learning techniques in machine vision and image recognition. We’re also going to work through an example in recognizing different shapes using Convolutional Neural Networks.

Some Background

As AI breakthroughs continue to captivate the general public, terms such as “artificial intelligence”, “machine learning” and “deep learning” have been used interchangeably. It is definitely worthwhile to understand the difference between each term in order to have a better picture of where the AI landscape is heading within the next few years.

The concentric circles of Artificial Intelligence

We can think of these three terms as concentric circles with artificial intelligence encompassing machine learning, and machine learning encompassing deep learning. We can delve deeper into the history of artificial intelligence and how this progress came about — but much has already been discussed on the topic.

In short — artificial intelligence is the development of computer systems that perform tasks usually reserved for human cognition. For example, despite being a hard coded system, a calculator is a form of artificial intelligence. The techniques used to develop such systems are at the crux of these concentric circles.

Machine learning revolves around creating systems that learn useful patterns from large data sets and provide useful insights as a consequence. Machine learning itself is divided into three main categories. The first is supervised learning, which entails creating systems that learn the relationship between a set of data points (context) and their labels (outcome), and can thus predict outcomes for unlabeled data points. Examples include systems that classify whether a loan application will result in a default, or systems that predict future stock prices. Unsupervised learning, by contrast, is about constructing systems that identify meaningful patterns in a data set based purely on similar features or characteristics. An example would be clustering customers based on similar shopping behaviors. Finally, reinforcement learning is a branch of machine learning that places an intelligent agent in a well-defined environment, with a set of possible actions and an objective function (reward) to be maximized. For example, we can think of a self-driving car (the agent) driving on a highway (the environment) whose sole objective is to avoid accidents (the reward).

Finally, deep learning is a technique within machine learning that uses vast amounts of data and neural networks with multiple layers (an excellent walk through of neural networks can be found here) to learn patterns within a data set. The recent explosion of AI breakthroughs in computer vision and speech recognition, among other verticals, almost all leads back to deep learning research and, more importantly, to the commoditization of computing power (an interesting blog post on the use of computing power in AI research throughout the years can be found here).

Intuitively speaking, think of machine learning as an attempt to model the brain of a child. A child learns from the actions of others (supervised learning), tries to discern similarities between different objects in the world, such as grouping similarly shaped Lego blocks together (unsupervised learning), and navigates difficult environments such as a jungle gym with no direct input (reinforcement learning). Deep learning is a technique in machine learning that is at the root of recent breakthroughs in Artificial Intelligence.

What are Convolutional Neural Networks?

Within deep learning, a plethora of architectures and techniques have emerged that enable numerous use cases, chief among them Convolutional Neural Networks. Convolutional Neural Networks were inspired by research on the visual cortex of mammals and how it perceives the world using a layered architecture of neurons. Think of this model of the visual cortex as groups of neurons designed specifically to recognize different shapes. Each group of neurons fires at the sight of an object and communicates with the other groups to develop a holistic understanding of the perceived object.

Different groups of neurons in the brain learn to recognize different groups of characteristics given an input stimulus

The system can be explained as hierarchical clusters of neurons that detect low level characteristics of an input stimulus and communicate with each other within that hierarchy to build up a high level detection of objects.
Think of the hierarchy as the following:

The first cluster recognizes low level features (e.g. the contour of a face)
The second cluster recognizes colors and shapes (skin color or jaw lines)
The third cluster recognizes details (ears, nose and eyes)
The final cluster recognizes the entire object holistically (the face and the person attached to the face)

In simple terms, given a sight of an object, the system has different groups of neurons that fire for different aspects of the object and communicate with each other to form the big picture.

Yann LeCun drew inspiration from this hierarchical model of the visual cortex and developed Convolutional Neural Networks to encompass the following:

Local Connections: Each neuron in a layer (or cluster) looks only at a small local region of the previous layer, and the features learned there are passed on from one layer to the next.
Layering: There is a clear hierarchy between the different layers (or clusters), which is analogous to saying there is a hierarchy in learning from low level features (e.g. ears, eyes) to high level features (the face, the person in question, etc.).
Spatial Invariance: A shift in the input results in an equally shifted output. If an object moves within the input image, the model’s response moves with it, so the object is still detected. (Humans can recognize an object even if it is upside down or shifted in a variety of ways.)
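To make the spatial invariance idea concrete, here is a minimal one-dimensional sketch using NumPy. The signal, the two-element filter and the shift of 2 are made-up values chosen purely for illustration. It shows that moving a pattern in the input moves the filter's response by the same amount, and that taking the maximum over the whole response then gives the same answer wherever the pattern sits.

```python
import numpy as np

# Hypothetical 1D "image": a small bright pattern on a dark background.
signal = np.zeros(10)
signal[3:5] = 1.0

# The same pattern moved 2 positions to the right. np.roll wraps around,
# which is harmless here because the end of the signal is all zeros.
shifted_signal = np.roll(signal, 2)

# Hypothetical filter that responds to changes in brightness (edges).
kernel = np.array([1.0, -1.0])

# Slide the filter over both inputs (no padding).
response = np.correlate(signal, kernel, mode="valid")
response_shifted = np.correlate(shifted_signal, kernel, mode="valid")

# A shifted input gives an equally shifted output.
print(np.array_equal(np.roll(response, 2), response_shifted))

# Taking the max over all positions no longer depends on where the pattern was.
print(response.max() == response_shifted.max())
```

Both checks print True for this toy input. Real images are two-dimensional, but the same reasoning applies along each axis.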

Thus the Convolutional Neural Network architecture corresponds to something like this:

A typical Convolutional Neural Network Architecture

The input data takes the form of a 4D matrix with the following dimensions: the number of samples (number of images), the height of each sample (height of each image), the width of each sample (width of each image) and the number of channels. The number of channels refers to the color specification of each image: a colored image is made up of Red (R), Green (G) and Blue (B) pixels, so it has 3 channels (think of this as three 2-dimensional matrices stacked on top of each other, each holding the intensity of one of the R, G and B colors), whereas a gray image has only 1 channel. In our example we will be using gray images only.
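As a quick illustration of that 4D layout, the sketch below builds two empty batches with NumPy. The batch size of 4 is arbitrary, the 72 x 72 size is borrowed from the working example later in this post, and the ordering (samples, height, width, channels) assumes the "channels last" convention.

```python
import numpy as np

num_samples, height, width = 4, 72, 72  # arbitrary batch size, 72 x 72 images

# Gray images: a single channel per pixel.
gray_batch = np.zeros((num_samples, height, width, 1), dtype=np.float32)

# Colored images: three stacked 2D matrices (R, G and B) per image.
rgb_batch = np.zeros((num_samples, height, width, 3), dtype=np.float32)

print(gray_batch.shape)  # (4, 72, 72, 1)
print(rgb_batch.shape)   # (4, 72, 72, 3)

# The red intensities of the first image form a plain 2D matrix.
red_channel = rgb_batch[0, :, :, 0]
print(red_channel.shape)  # (72, 72)
```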
Our input data is connected to a hidden convolution layer, which slides a chosen number of filters of a chosen size (typically 3 x 3 or 5 x 5) over our image. Think of a filter as a small flashlight of dimension 3 x 3 (or 5 x 5) that scans our input image and draws a feature map. From the feature map the algorithm can pick up local features in our data (eyes, ears, etc.) regardless of their position (translation invariance). We can see the convolution operation illustrated here:

A convolution trying to capture low level features in a picture of buildings
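For readers who prefer code to pictures, here is a minimal NumPy sketch of the flashlight idea. The function name `convolve2d_valid`, the 6 x 6 toy image and the vertical-edge filter are all illustrative assumptions, not taken from any library or from the notebook linked in this post. Like most deep learning libraries, it slides the filter without flipping it, which is strictly speaking a cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, step of 1) and return the feature map."""
    kernel_h, kernel_w = kernel.shape
    out_h = image.shape[0] - kernel_h + 1
    out_w = image.shape[1] - kernel_w + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The patch currently lit up by the "flashlight".
            patch = image[i:i + kernel_h, j:j + kernel_w]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Toy image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hypothetical 3 x 3 filter that responds to dark-to-bright vertical edges.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

feature_map = convolve2d_valid(image, vertical_edge)
print(feature_map.shape)  # (4, 4)
print(feature_map[0])     # [0. 3. 3. 0.]
```

The feature map is non-zero only where the filter straddles the edge. In a real network the filter values are learned during training rather than written by hand.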

Pooling is a sub-sampling operation that reduces the dimensionality of the extracted feature maps by sliding a window of a chosen size over them (the step by which the window moves is called the stride) and extracting either the sum, the max or the average of each window, depending on the specification of the user. In this case we will be using max pooling: for each 2 x 2 window in the feature map we extract the highest value. This technique helps us reduce dimensionality while preserving the most salient information. We can take a look at this operation below:
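A minimal NumPy sketch of 2 x 2 max pooling follows. The helper name `max_pool2d` and the 4 x 4 feature map values are made up for illustration, and the stride of 2 is an assumption chosen so that the windows do not overlap.

```python
import numpy as np

def max_pool2d(feature_map, pool_size=2, stride=2):
    """Keep the largest value in each pool_size x pool_size window."""
    out_h = (feature_map.shape[0] - pool_size) // stride + 1
    out_w = (feature_map.shape[1] - pool_size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = feature_map[top:top + pool_size, left:left + pool_size]
            pooled[i, j] = window.max()
    return pooled

# Made-up 4 x 4 feature map.
feature_map = np.array([[1.0, 3.0, 2.0, 0.0],
                        [4.0, 6.0, 1.0, 1.0],
                        [0.0, 2.0, 9.0, 8.0],
                        [1.0, 1.0, 7.0, 5.0]])

pooled = max_pool2d(feature_map)
print(pooled)
# [[6. 2.]
#  [2. 9.]]
```

The 4 x 4 map shrinks to 2 x 2, and each output value is the strongest response found in its window.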

Finally, a traditional fully connected layer, which produces a Softmax output, takes in the learned representations of the convolutional and max pooling layers and outputs a prediction. In short, a fully connected layer is a layer in a neural network that contains nodes which “light up” when a certain pattern is observed. A much more detailed breakdown of Convolutional Neural Networks and the math behind them can be found here.
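As a rough sketch of what that last step computes, the NumPy snippet below pushes a flattened feature vector through one fully connected layer and a Softmax. The feature values and weights are random placeholders (a trained network would have learned its weights), and the choice of 8 features and 3 classes is arbitrary.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()

rng = np.random.default_rng(0)

# Placeholder for the flattened output of the convolution and pooling layers.
features = rng.random(8)

# Placeholder weights and biases of a fully connected layer with 3 output nodes.
weights = rng.normal(size=(8, 3))
bias = np.zeros(3)

# Each output node "lights up" according to how well the features match its weights.
logits = features @ weights + bias
probabilities = softmax(logits)

print(probabilities.shape)   # (3,)
predicted_class = int(np.argmax(probabilities))
```

The class with the highest probability is taken as the prediction.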

Intuitively speaking, Convolutional Neural Nets take in images as input and attempt to decipher different small features (local connections) of the images regardless of their position (spatial invariance), using a series of mathematical operations (layering, pooling) in order to understand the full picture of what’s happening. These mathematical operations rely on modeling an image as a series of numbers, with each number representing pixel intensity (the intensity of the color at a specific position in the picture).

A Working Example

A more detailed notebook with all utility functions and results can be found here.

Our data set is a set of geometric shapes (triangles, circles and rectangles) that can be positioned anywhere in a 72 x 72 grid. These pictures only have 1 channel since they are grayscale. We can see examples of these images below:

Random triangles, circles and rectangles that can have any position in the grid. Utility functions to produce them can be found in the GitHub link above.
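To give a feel for what such an image looks like as numbers, here is a small NumPy sketch that draws one random rectangle on a 72 x 72 grid. This is not the utility code from the linked notebook; the function `random_rectangle`, its size limits and the fixed seed are assumptions made purely for illustration.

```python
import numpy as np

def random_rectangle(size=72, rng=None):
    """Return a size x size grayscale image containing one filled rectangle."""
    rng = np.random.default_rng() if rng is None else rng
    image = np.zeros((size, size), dtype=np.float32)  # black background
    # Pick a top-left corner that leaves at least 10 pixels of room.
    top, left = rng.integers(0, size - 10, size=2)
    # Pick a height and width (at least 5 pixels) that stay inside the grid.
    rect_height = rng.integers(5, size - top)
    rect_width = rng.integers(5, size - left)
    image[top:top + rect_height, left:left + rect_width] = 1.0  # white rectangle
    return image

image = random_rectangle(rng=np.random.default_rng(42))
print(image.shape)  # (72, 72)

# Add the sample and channel axes to get the 4D shape a convolution layer expects.
batch = image[np.newaxis, :, :, np.newaxis]
print(batch.shape)  # (1, 72, 72, 1)
```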

We will use the Keras package in Python (a detailed documentation of Keras’s implementation of Convolutional Neural Networks can be found here) to develop a Convolutional Neural Network that is able to classify each shape with up to 98% accuracy.
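The exact model lives in the notebook linked above. The sketch below is only an assumed minimal architecture that strings together the pieces discussed in this post (one convolution layer, one max pooling layer and a fully connected Softmax layer). The number of filters, the optimizer and the loss are illustrative choices, and it assumes Keras is installed as part of TensorFlow.

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 3  # triangle, circle, rectangle

model = keras.Sequential([
    keras.Input(shape=(72, 72, 1)),                   # 72 x 72 grayscale images
    layers.Conv2D(16, (3, 3), activation="relu"),     # 16 filters of size 3 x 3 (assumed)
    layers.MaxPooling2D(pool_size=(2, 2)),            # 2 x 2 max pooling
    layers.Flatten(),                                 # feature maps to one long vector
    layers.Dense(num_classes, activation="softmax"),  # fully connected Softmax output
])

model.compile(
    optimizer="adam",                 # assumed optimizer
    loss="categorical_crossentropy",  # expects one-hot encoded labels
    metrics=["accuracy"],
)

model.summary()
```

Training would then be a call to `model.fit` on the generated images and their one-hot labels. The accuracy such a sketch reaches will depend on the data and the training settings.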

Ultimately, Convolutional Neural Networks represent a major breakthrough in image recognition. Self driving cars, facial recognition systems and medical diagnostics represent a few of the use cases Convolutional Neural Nets can enable. However, it is important to note that there is still room for growth, with new techniques appearing on the horizon.

Written by Adel Nehme