Understand the architecture of CNN

Kousai Smeda
Towards Data Science
8 min read · Oct 31, 2019


Architecture of VGG16 (a CNN model)

In 2012, a revolution occurred: during the annual ILSVRC computer vision competition, a new deep learning algorithm shattered the records! It was a convolutional neural network called AlexNet.

Convolutional neural networks follow a methodology similar to that of traditional supervised learning methods: they receive input images, detect features in each of them, and then train a classifier on those features.

However, the features are learned automatically! CNNs carry out all the tedious work of feature extraction and description themselves: during the training phase, the classification error is minimized in order to optimize the parameters of the classifier AND the features!

What is a CNN?

Convolutional neural networks are a sub-category of neural networks, so they have all the characteristics of neural networks. However, CNNs are specifically designed to process input images. Their architecture is accordingly more specific: it is composed of two main blocks.

The first block is what makes this type of neural network distinctive, since it functions as a feature extractor. To do this, it performs template matching by applying convolution filtering operations. The first layer filters the image with several convolution kernels and returns “feature maps”, which are then normalized (with an activation function) and/or resized.

This process can be repeated several times: we filter the feature maps obtained with new kernels, which gives us new feature maps to normalize and resize, and we can filter again, and so on. Finally, the values of the last feature maps are concatenated into a vector. This vector defines the output of the first block and the input of the second.

The first block is encircled in black
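To make this concrete, here is a minimal sketch of such a feature-extraction block, assuming PyTorch; the filter counts, kernel sizes, and input resolution are illustrative choices, not taken from any particular architecture.

```python
import torch
import torch.nn as nn

# A minimal feature-extraction block: two rounds of
# convolution -> activation -> resizing (pooling), then flattening.
feature_extractor = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),                      # normalization via an activation function
    nn.MaxPool2d(kernel_size=2),    # resizing
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),                   # concatenate the feature maps into a vector
)

x = torch.randn(1, 3, 32, 32)       # one 32x32 RGB image
v = feature_extractor(x)
print(v.shape)                      # torch.Size([1, 2048]): input of the second block
```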

The second block is not characteristic of a CNN: it is in fact found at the end of all the neural networks used for classification. The input vector values are transformed (with several linear combinations and activation functions) to produce a new vector at the output. This last vector contains as many elements as there are classes: element i represents the probability that the image belongs to class i. Each element is therefore between 0 and 1, and they sum to 1. These probabilities are calculated by the last layer of this block (and therefore of the network), which uses a logistic function (binary classification) or a softmax function (multi-class classification) as an activation function.

As with ordinary neural networks, the parameters of the layers are determined by gradient backpropagation: the cross-entropy is minimized during the training phase. But in the case of a CNN, these parameters refer in particular to the image features.

The second block is encircled in black
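As a sketch of this second block, again assuming PyTorch with illustrative sizes, a stack of fully-connected layers ends in a softmax over the classes, and the cross-entropy between the prediction and the true label is what gets minimized during training:

```python
import torch
import torch.nn as nn

num_classes = 10                       # illustrative
classifier = nn.Sequential(
    nn.Linear(2048, 128),              # linear combination
    nn.ReLU(),                         # activation function
    nn.Linear(128, num_classes),       # one output per class
)

v = torch.randn(1, 2048)               # vector produced by the first block
logits = classifier(v)
probs = torch.softmax(logits, dim=1)   # each element in [0, 1]
print(probs.sum())                     # sums to 1

# Training minimizes the cross-entropy by gradient backpropagation
# (CrossEntropyLoss applies the softmax internally to the raw logits).
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3]))
```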

The different layers of a CNN

There are four types of layers for a convolutional neural network: the convolutional layer, the pooling layer, the ReLU correction layer and the fully-connected layer.

The convolutional layer

The convolutional layer is the key component of convolutional neural networks, and is always at least their first layer.

Its purpose is to detect the presence of a set of features in the images received as input. This is done by convolution filtering: the principle is to “drag” a window representing the feature on the image, and to calculate the convolution product between the feature and each portion of the scanned image. A feature is then seen as a filter: the two terms are equivalent in this context.

The convolutional layer thus receives several images as input, and calculates the convolution of each of them with each filter. The filters correspond exactly to the features we want to find in the images.

For each pair (image, filter), we get a feature map, which tells us where the feature is located in the image: the higher the value, the more the corresponding place in the image resembles the feature.

Convolutional layer (source: https://www.embedded-vision.com)

Unlike traditional methods, features are not pre-defined according to a particular formalism (for example SIFT), but learned by the network during the training phase! Filter kernels refer to the convolution layer weights. They are initialized and then updated by backpropagation using gradient descent.
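Here is a minimal NumPy sketch of the filtering operation itself (strictly speaking a cross-correlation, as in most deep learning libraries), with no padding and a step of 1; the hand-written edge filter is purely illustrative, since in a real CNN the kernel values are learned:

```python
import numpy as np

def feature_map(image, kernel):
    """Slide `kernel` over `image` (step 1, no padding) and return the map
    of convolution products. High values mark places resembling the filter."""
    H, W = image.shape
    f = kernel.shape[0]
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image = np.random.rand(6, 6)
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])   # illustrative; learned in practice
print(feature_map(image, vertical_edge).shape)   # (4, 4)
```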

The pooling layer

This type of layer is often placed between two layers of convolution: it receives several feature maps and applies the pooling operation to each of them.

The pooling operation consists of reducing the size of the images while preserving their important characteristics.

To do this, we cut the image into regular cells, then we keep the maximum value within each cell. In practice, small square cells are often used to avoid losing too much information. The most common choices are 2x2 adjacent cells that don’t overlap, or 3x3 cells, separated from each other by a step of 2 pixels (thus overlapping).

We get the same number of feature maps at the output as at the input, but they are much smaller.

The pooling layer reduces the number of parameters and calculations in the network. This improves the efficiency of the network and avoids overfitting.

The maximum values are located less precisely in the feature maps obtained after pooling than in those received as input, and that is actually a big advantage! For example, when you want to recognize a dog, its ears do not need to be located as precisely as possible: knowing that they are roughly next to the head is enough!
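A minimal NumPy sketch of max pooling, with the common choice of F=2, S=2 (non-overlapping cells):

```python
import numpy as np

def max_pool(fmap, F=2, S=2):
    """Cut `fmap` into F x F cells separated by a step of S pixels
    and keep the maximum value within each cell."""
    H, W = fmap.shape
    Hp, Wp = (H - F) // S + 1, (W - F) // S + 1
    out = np.zeros((Hp, Wp))
    for i in range(Hp):
        for j in range(Wp):
            out[i, j] = fmap[i*S:i*S+F, j*S:j*S+F].max()
    return out

fmap = np.random.rand(8, 8)
print(max_pool(fmap).shape)   # (4, 4): same map, a quarter of the values
```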

The ReLU correction layer

ReLU (Rectified Linear Unit) refers to the real non-linear function defined by ReLU(x) = max(0, x). Visually, it is zero over the negative half-axis and the identity over the positive half-axis.

The ReLU correction layer replaces all negative values received as inputs by zeros. It acts as an activation function.
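In code, the whole layer is a single elementwise operation; a minimal NumPy illustration:

```python
import numpy as np

x = np.array([[-2.0, 0.5],
              [ 3.0, -1.0]])   # a feature map with negative values
relu = np.maximum(0, x)        # negatives replaced by zeros
print(relu)                    # [[0.  0.5]
                               #  [3.  0. ]]
```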

The fully-connected layer

The fully-connected layer is always the last layer of a neural network, convolutional or not — so it is not characteristic of a CNN.

This type of layer receives an input vector and produces a new output vector. To do this, it applies a linear combination and then possibly an activation function to the input values received.

The last fully-connected layer classifies the image given as input to the network: it returns a vector of size N, where N is the number of classes in our image classification problem. Each element of the vector indicates the probability that the input image belongs to the corresponding class.

To calculate these probabilities, the fully-connected layer multiplies each input element by a weight, sums the results, and then applies an activation function (logistic if N = 2, softmax if N > 2). This is equivalent to multiplying the input vector by the matrix containing the weights. The fact that each input value is connected to all the output values explains the term fully-connected.

The convolutional neural network learns weight values in the same way as it learns the convolution layer filters: during the training phase, by backpropagation of the gradient.

The fully-connected layer determines the relationship between the position of features in the image and a class. Indeed, since the input vector comes from the previous layer, it corresponds to a feature map for a given feature: high values indicate the location (more or less precisely, depending on the pooling) of that feature in the image. If the location of a feature at a certain point in the image is characteristic of a certain class, then the corresponding value in the vector is given a significant weight.
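A minimal NumPy sketch of this computation, with illustrative sizes; the weight matrix W and bias b are what the network learns:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

n_inputs, n_classes = 2048, 10     # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(n_classes, n_inputs))   # learned weights
b = np.zeros(n_classes)                      # learned biases

v = rng.normal(size=n_inputs)      # vector from the previous layer
probs = softmax(W @ v + b)         # linear combination, then activation
print(probs.sum())                 # 1.0
```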

The parametrization of the layers

One convolutional neural network differs from another in the way its layers are stacked, but also in the way they are parametrized.

The convolution and pooling layers indeed have hyperparameters, that is, parameters whose values you must define beforehand.

The size of the output feature maps of the convolution and pooling layers depends on the hyperparameters.

Each image (or feature map) has dimensions W×H×D, where W is its width in pixels, H its height in pixels, and D the number of channels (1 for a black-and-white image, 3 for a colour image).

The convolutional layer has four hyperparameters:

1. The number of filters K

2. The size F of the filters: each filter has dimensions F×F×D pixels.

3. The step S with which you drag the window corresponding to the filter over the image. For example, a step of 1 means moving the window one pixel at a time.

4. The zero-padding P: a contour of P pixels' thickness, filled with zeros, is added around the input image of the layer. Without this contour, the output dimensions are smaller. Thus, the more convolutional layers are stacked with P=0, the smaller the feature maps become. We quickly lose a lot of information, which makes the task of extracting features difficult.

For each input image of size W×H×D, the convolution layer returns a volume of dimensions Wc×Hc×Dc, where:

Wc = (W - F + 2P)/S + 1
Hc = (H - F + 2P)/S + 1
Dc = K

Choosing P=(F-1)/2 and S=1 gives feature maps of the same width and height as those received as input.
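These relations are easy to check in code; a small helper using the notation above:

```python
def conv_output_size(W, H, D, K, F, S, P):
    """Output dimensions Wc x Hc x Dc of a convolution layer."""
    Wc = (W - F + 2 * P) // S + 1
    Hc = (H - F + 2 * P) // S + 1
    Dc = K                      # one feature map per filter
    return Wc, Hc, Dc

# P = (F - 1) / 2 with S = 1 preserves width and height:
print(conv_output_size(224, 224, 3, K=64, F=3, S=1, P=1))   # (224, 224, 64)
```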

The pooling layer has two hyperparameters:

1. The size F of the cells: the image is divided into square cells of size F×F pixels.

2. The step S: cells are separated from each other by S pixels.

For each input image of size W×H×D, the pooling layer returns a volume of dimensions Wp×Hp×Dp, where:

Wp = (W - F)/S + 1
Hp = (H - F)/S + 1
Dp = D

As with the stacking of layers, the choice of hyperparameters follows a classic scheme:

  • For the convolution layer, the filters are small and dragged over the image one pixel at a time. The zero-padding value is chosen so that the width and height of the input volume are not changed at the output. In general, we then choose F=3, P=1, S=1 or F=5, P=2, S=1
  • For the pooling layer, F=2 and S=2 is a wise choice. This eliminates 75% of the input pixels. We can also choose F=3 and S=2: in this case, the cells overlap. Choosing larger cells causes too much loss of information and gives worse results in practice (see the sketch after this list)
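A quick check of these choices on an illustrative 224×224 map with 64 channels, using the pooling-size formula above:

```python
def pool_output_size(W, H, D, F, S):
    """Output dimensions Wp x Hp x Dp of a pooling layer."""
    return (W - F) // S + 1, (H - F) // S + 1, D

# After a convolution layer with F=3, P=1, S=1, width and height are
# unchanged; pooling then shrinks them.
W, H, D = 224, 224, 64
print(pool_output_size(W, H, D, F=2, S=2))   # (112, 112, 64): 75% of values removed
print(pool_output_size(W, H, D, F=3, S=2))   # (111, 111, 64): overlapping cells
```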

Here, you have the basics to build your own CNN! But don't rush to do so: there are already many architectures adapted to the majority of applications.

In practice, I strongly advise you not to create a convolutional neural network from scratch to solve your problem: the most effective strategy is to take an existing network that already classifies a large collection of images (such as ImageNet) well and apply Transfer Learning.
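For instance, here is a minimal transfer-learning sketch, assuming PyTorch and a recent torchvision; the choice of ResNet-18 and the number of classes are illustrative:

```python
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet and freeze its feature extractor.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace only the last fully-connected layer to match our own classes;
# only this new layer is then trained on our data.
num_classes = 5   # illustrative
model.fc = nn.Linear(model.fc.in_features, num_classes)
```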
