
Understanding Convolutional Neural Networks (CNNs)

A gentle introduction to one of the most powerful deep learning tools and its building blocks

Original by Andrew Schultz on Unsplash

This article covers the main aspects of Convolutional Neural Networks (CNNs): how they work and the main building blocks of the technique. The references used in this article can be found in my GitHub repository.

Convolutional Neural Networks (CNNs), or simply convolutional networks, are a kind of neural network that uses the convolution operation instead of matrix multiplication in at least one of its layers.

This kind of network is effective in applications in which the data elements have some relationship with their neighbors, as in images (represented by two-dimensional arrays of pixels) or in time series and audio files (represented by a one-dimensional sequence of data points sampled at regular intervals).

Because they are effective at extracting image features, CNNs are widely used in tasks such as object detection, facial recognition, semantic segmentation, and image processing and manipulation. Some cameras also have smart filters based on CNNs, autonomous vehicles use these networks to navigate and detect obstacles, and many other systems rely on this method.

Convolution operation

Convolution is a linear mathematical operation between two functions. Given the functions x(t) and w(a), respectively called the input and the kernel, the convolution of x and w over a, for a given t, is:
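
$$ s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da $$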

The asterisk in (x ∗ w) denotes the convolution operator, and the output s(t) is usually called the feature map.

This operator can also be defined for discrete functions, which is the case in one-dimensional CNNs (such as the ones used for time series):
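
$$ s(t) = (x * w)(t) = \sum_{a} x(a)\, w(t - a) $$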

In the case of a 2D convolution, widely used for image processing, with I as the input image and K as the two-dimensional kernel, the convolution can be represented by:
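
$$ S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n) $$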

Usually the kernels are much smaller than the images, and the convolution uses a single kernel to process the whole image. Because of this shared, reduced representation, each convolutional kernel (or layer) needs to learn only a small number of parameters.

On the other hand, since standard ANNs (Artificial Neural Networks) use weight matrices to connect the neurons of a given layer to the outputs of the previous one, each weight represents a single connection, and consequently an ANN would need to learn many more parameters than the equivalent CNN for the same task.
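
To make the difference concrete, here is a rough sketch in PyTorch comparing the parameter counts of a small convolutional layer and a fully connected layer. The layer sizes are arbitrary choices for illustration, not values from the text.

    # Sketch: parameter count of a convolutional layer vs. a fully connected layer.
    import torch.nn as nn

    conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=5)  # 4 kernels of 5x5
    dense = nn.Linear(in_features=64 * 64, out_features=64 * 64)    # fully connects a 64x64 input to a 64x64 output

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(conv))   # 4 * (5*5*1) weights + 4 biases = 104 parameters
    print(count(dense))  # 4096*4096 weights + 4096 biases, roughly 16.8 million parameters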

Convolution on images

As mentioned before, convolution is widely used for image processing. By changing the kernel it is possible to blur or sharpen the image, change its style, or detect its edges. The figure below shows the application of a few different kernels to a base image.

Use of different kernels on a base image. The kernels are the matrices represented below their respective images. Image by author.
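
If you want to reproduce this kind of experiment, a minimal sketch with NumPy and SciPy looks like the following. The kernels are common textbook choices and the file name is just a placeholder.

    # Sketch: applying hand-crafted kernels to a grayscale image.
    import numpy as np
    from PIL import Image
    from scipy.ndimage import convolve

    img = np.asarray(Image.open("dog.jpg").convert("L"), dtype=float)  # placeholder file name

    box_blur = np.ones((3, 3)) / 9.0                                       # averages each 3x3 neighborhood
    sharpen  = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], float)      # boosts the center pixel
    edges    = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], float)  # Laplacian-style edge detector

    blurred   = convolve(img, box_blur)
    sharpened = convolve(img, sharpen)
    edge_map  = convolve(img, edges)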

The kernel is applied at each pixel of the image, and the result of the convolution operation between the kernel and the affected region of the image becomes the new pixel in the output image, as illustrated by the image below. The kernel, however, cannot be applied to the border pixels, because part of the kernel would fall outside the image; as a consequence, the resulting image is slightly smaller than the original.

Bidimensional convolution on images. The kernel (blue) is applied over an image (red). The result of the operation becomes the new pixel of the output image (green). Since the kernel cannot be applied outside the real image, the green area is where the convolution happens, resulting in a slightly smaller image. The kernel slides over the original image during the convolution process. Image by author.

In the figure above, the convolution of a 5×5 kernel (blue) over a 16×12 image (red) will result in a 12×8 image (green). The new width and height can be calculated with the equations below:
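
$$ w_{out} = w_{in} - w_{kernel} + 1 \qquad h_{out} = h_{in} - h_{kernel} + 1 $$

For the figure above: 16 − 5 + 1 = 12 and 12 − 5 + 1 = 8.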

In CNNs the convolution is treated as a layer of the network, and each layer can have more than one kernel. The output will then have as many feature maps as there are kernels.

For example, a convolutional layer with 4 kernels that receives a grayscale image with dimensions (w_in × h_in × 1) will produce a matrix with dimensions (w_out × h_out × 4) as output. This matrix is no longer considered an image, but a set of four feature maps that correspond to the features detected by the four kernels of this layer.
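
In a framework such as PyTorch, this example looks roughly like the sketch below; the 64×48 input size is an arbitrary choice.

    # Sketch: a convolutional layer with 4 kernels applied to a grayscale input.
    import torch
    import torch.nn as nn

    layer = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=5)

    x = torch.randn(1, 1, 64, 48)  # a batch with one 64x48 grayscale image
    y = layer(x)
    print(y.shape)                 # torch.Size([1, 4, 60, 44]) -> four 60x44 feature maps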

By chaining feature maps and kernels across the layers that constitute the network, CNNs can capture a diversity of features present in the image, and this is the process that allows these networks to learn to recognize objects and other patterns.

Multikernel filter

For a multichannel input, such as an RGB image or the feature maps produced by the inner layers of the network, the convolution happens with one kernel per channel. All these kernels together form a filter, and each filter has the same number of kernels as the input has channels.

Example of a multikernel filter. Image by author.

Each filter produces a single feature map: the outputs of its individual kernels are aggregated, by summing or averaging them, into a single output for that filter.

A layer with n filters will produce n feature maps as the output of the layer.
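
The sketch below (PyTorch, with an arbitrary choice of n = 8 filters) shows that each filter indeed holds one kernel per input channel, and that the layer outputs one feature map per filter.

    # Sketch: a layer with 8 filters on a 3-channel (RGB) input.
    import torch
    import torch.nn as nn

    layer = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
    print(layer.weight.shape)      # torch.Size([8, 3, 3, 3]) -> 8 filters, each with 3 kernels of 3x3

    x = torch.randn(1, 3, 32, 32)  # one 32x32 RGB image
    print(layer(x).shape)          # torch.Size([1, 8, 30, 30]) -> 8 feature maps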

Padding

To avoid the problem of convolving the kernel over the pixels at the edge of the image, we can apply padding to the input, simply adding extra pixels around the border to enlarge it.

The image below illustrates three types of padding: zero padding (a), which inserts the value 0 around the perimeter of the input image; reflection padding (b), which reflects the inner pixels over the border; and constant padding (c), which fills the new border with a constant value (of which zero padding is a special case).

The size of the padding border can be chosen by the designer of the network, but usually we pick a value that eliminates the reduction effect on the output image. We can calculate the new dimensions with the equations below:
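
Writing p_w and p_h for the number of pixels added to each side of the width and height, the output dimensions become:

$$ w_{out} = w_{in} - w_{kernel} + 2p_w + 1 \qquad h_{out} = h_{in} - h_{kernel} + 2p_h + 1 $$

For an odd kernel width, choosing p_w = (w_kernel − 1) / 2 keeps w_out = w_in, which is the usual choice to eliminate the reduction effect.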

Pooling

Even as the network learns the best filters to extract the relevant features from the images, each filter will also generate noise and information that is not relevant to the following layers. We can see in the convolution examples that the edge detector found some pixels that are not part of the dog.

The pooling technique replaces groups of pixels with a value that represents them. The figure below illustrates a size-2 MaxPool, which groups the pixels into 2×2 squares and replaces each group with the value of its most intense pixel. Similarly, we could perform an AvgPool, in which the resulting pixel is the average intensity of the group.

Max pooling operation. Image by author.

As an immediate consequence of the pooling layer, the information is condensed, keeping only what is most important for the following layers and reducing noise. The output is also reduced by a factor that depends on the pooling size (in the example, the output has half the width and height of the input), so each subsequent convolution has fewer pixels to process and the network becomes more efficient.
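
As a quick sketch (PyTorch, with an arbitrary feature-map size), both pooling operations halve the width and height when the pooling size is 2:

    # Sketch: size-2 max pooling and average pooling.
    import torch
    import torch.nn as nn

    fmap = torch.randn(1, 4, 60, 44)           # 4 feature maps of 60x44

    max_pooled = nn.MaxPool2d(kernel_size=2)(fmap)
    avg_pooled = nn.AvgPool2d(kernel_size=2)(fmap)
    print(max_pooled.shape, avg_pooled.shape)  # both torch.Size([1, 4, 30, 22])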

The image below illustrates the application of MaxPool and AvgPool to the result of the edge detector applied to the dog image.

Application of pooling techniques on an edge image. Image by author.

Another important aspect of MaxPool is that it makes the network more robust to small translations of the input. Since a group of neighboring pixels is replaced by the highest value among them, if the translation keeps the most important pixel inside the same group, the result remains unaltered.

Stride

We can also use strided convolution to reduce the dimensionality of the image (or feature maps) through the layers. The stride parameter represents how many pixels (or matrix elements) the kernel moves at each step.

In the previous examples we used stride = 1, which means the kernel moves to the next pixel and, when a row is finished, proceeds to the next row. The top half of the following image illustrates this scenario by applying a 5×5 kernel to the red input. The blue squares represent the center of the kernel at each step.

The bottom half of the picture represents a case with stride = 2. Here the kernel skips a pixel as it moves along the row and, when the row ends, it also skips an entire row before continuing. As we can see, the kernel stops at fewer pixels than in the previous case, so the resulting image is smaller.

Comparing convolutions with stride = 1 (top) and stride = 2 (bottom). The resulting outputs differ in size. Image by author.

Dimensionality reduction is important when we want an output smaller than the input, as in an encoder that compresses the image or a binary classifier that has a single value as output.

We can define different strides for width (w_stride) and height (h_stride), so the expressions below can be used to calculate the resulting dimensions.
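
Keeping the padding terms p_w and p_h from before, a common way to write these expressions is:

$$ w_{out} = \left\lfloor \frac{w_{in} - w_{kernel} + 2p_w}{w_{stride}} \right\rfloor + 1 \qquad h_{out} = \left\lfloor \frac{h_{in} - h_{kernel} + 2p_h}{h_{stride}} \right\rfloor + 1 $$

With stride = 1 they reduce to the padding equations shown earlier.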

Transposed convolution

In other applications we might want to increase the dimensions of the output of a layer. Some generative methods start from a compressed representation (such as a latent vector) and increase the dimensions through the layers until the output reaches the size of the final image.

One way to do that is by using the transposed convolution (often mistaken for a deconvolution), which consists of the combined use of padding and stride to enlarge the input image before it passes through a "traditional" convolution, so that the output is larger than the input.

The following picture illustrates the process that generates a 5×5 image from a 3×3 input, by using a transposed convolution with kernel 3×3 and stride = 1.

Transposed convolution. Image by author.

A "traditional" convolution with kernel 3×3 and stride = 1 in a 5×5 image would generate a 3×3 feature map as output, so we need to perform a transposed convolution with the same parameters to do the opposite.

It happens in two steps:

  1. The image receives insertions of zero-valued pixels (gray) between its original pixels (red) and also around the border.
  2. The convolution is then applied to the enlarged image, with a 3×3 kernel and stride = 1.
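
As a minimal sketch, PyTorch's ConvTranspose2d reproduces the 3×3 → 5×5 example above with the same parameters:

    # Sketch: transposed convolution turning a 3x3 input into a 5x5 output.
    import torch
    import torch.nn as nn

    up = nn.ConvTranspose2d(in_channels=1, out_channels=1, kernel_size=3, stride=1)

    x = torch.randn(1, 1, 3, 3)  # 3x3 input
    print(up(x).shape)           # torch.Size([1, 1, 5, 5]) -> 5x5 output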

To calculate the dimensions of the output after a transposed convolution we can use the following equations:
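
With the same notation as before (p_w and p_h for the padding on each side), the output dimensions of a transposed convolution are:

$$ w_{out} = w_{stride}\,(w_{in} - 1) + w_{kernel} - 2p_w \qquad h_{out} = h_{stride}\,(h_{in} - 1) + h_{kernel} - 2p_h $$

In the example above, with stride = 1 and no padding: 1 · (3 − 1) + 3 = 5.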

The transposed convolution is not the only way to increase the dimensions of a layer's output. Other architectures use classical resizing techniques, such as nearest-neighbor interpolation, followed by traditional convolutional layers.

Checkerboard Artifacts

A downside of the transposed convolution is that the kernel overlaps with itself as it passes over the image, reinforcing the importance of the pixels in these areas. As a result, the generated image (or feature map) will have checkerboard artifacts that appear periodically. This mechanism can be seen in the image below.

Mechanism that creates checkerboard artifacts. Image by author.

By using other techniques, such as the resize-convolution mentioned above, we can avoid this effect.
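
A minimal resize-convolution sketch in PyTorch looks like the following; the channel counts and scale factor are arbitrary choices for illustration.

    # Sketch: nearest-neighbor upsampling followed by a regular convolution.
    import torch
    import torch.nn as nn

    resize_conv = nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),                           # 2x nearest-neighbor resize
        nn.Conv2d(in_channels=16, out_channels=8, kernel_size=3, padding=1),   # "same"-size convolution
    )

    x = torch.randn(1, 16, 8, 8)
    print(resize_conv(x).shape)  # torch.Size([1, 8, 16, 16])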


Conclusion and remarks

There is a lot to learn about convolution, but the first step is to understand the basics, and I hope this article may be helpful for many.

This text was adapted from my master’s degree dissertation.

A special thanks to Bruce, my dog, who kindly allowed me to take his picture and use it for examples. He is the best boy.

If you like this post…

Support me with a coffee!

Buy me a coffee!

And read this awesome post

5 tips to start a career in data

References

Odena, A., Dumoulin, V., Olah, C.: Deconvolution and Checkerboard Artifacts. Distill (2016)

Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning, Chapter 9: Convolutional Networks. MIT Press (2016)

LeCun, Y.: Generalization and network design strategies. Technical Report CRG-TR-89-4 (1989)

Dumoulin, V., Visin, F.: A guide to convolution arithmetic for deep learning. arXiv:1603.07285 (2016)

Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434 (2015)

