Deep-dive into Convolutional Networks

From the building-blocks to the most advanced architectures, touching on interpretability, and bias.

Antonino Ingargiola
Towards Data Science
16 min read · Mar 20, 2019


By the end of this post you will understand this diagram. Image courtesy of FloydHub.

Introduction

Convolutional Networks (ConvNets) are a class of efficient neural networks that achieve impressive performance on perceptual tasks such as object recognition. Their architecture is loosely inspired by the visual cortex. In 2012, AlexNet, a type of ConvNet, won the ILSVRC competition by a large margin, starting the huge wave of interest in deep learning that continues today. In 2019, state-of-the-art architectures for visual recognition are still based on ResNet, which is a type of ConvNet.

In this article, I assume some familiarity with standard fully-connected neural networks (also known as multi-layer perceptrons, MLPs). After a high-level view of ConvNet activations, I will deep-dive into the concept of convolution and the other building blocks (pooling, batch normalization, 1x1 filters, etc.). Next, I will briefly illustrate some advanced architectures that achieve state-of-the-art results (Inception, ResNet). In the final part, I will touch on the topics of interpretability and bias. Each section contains a list of references and links for further study. Some concepts are generally applicable to deep neural networks, but I will illustrate them in the context of ConvNets.

If you are new to ConvNets, it will take some time to digest the material: take your time and read multiple sources. You can use this post as a quick reference for the ideas surrounding ConvNets. If you find any errors, or if there are other topics you think I missed, let me know in the comments section.

An overview of ConvNets

Convolutional Neural Nets (ConvNets) are a class of neural networks specialized for image processing. Like other neural networks, they transform the input into the output through many layers. In ConvNets, each layer has a convolution step, an optional pooling step and a non-linear activation. Each layer transforms the input tensor into an output tensor through linear and non-linear operations. All these intermediate tensors (including the network input and output) are called activations, and they are all different representations of the input.

Figure 1. Activation tensors in a convolutional neural net.

I like to start illustrating ConvNets by visualizing the shape of the activations as we go from input to output (see Figure 1). Each layer transforms the activation through both linear and non-linear operations (we will see the details in the next section). As we go through the layers, the spatial dimensions of the activations shrink while the depth increases. The last part of the ConvNet transforms the 3D activation into a 1D one, typically by average pooling (see the Pooling section). Finally, one or two fully-connected layers project the activation into the output space for the final classification. In this post, I use classification as an example of the final task. Some architectures avoid the final dense layers by directly generating a 1D activation with length matching the number of categories.

The flow of activations shows how the input is represented in an increasingly “rich” feature space (increased depth) while trading off spatial information (decreased height/width). The last fully-connected layers forego any spatial information in order to achieve the final classification task. As we go through the layers, the features not only increase in number (depth) but also in complexity, each being a combination of features from the previous layer. In other words, the network builds a hierarchical representation of the input: the first layer represents the input in terms of elementary features such as edges, the second layer in terms of more complex features such as corners, and so on. A deeper layer can recognize abstract features such as an eye or even a face. The striking part is that the ConvNet learns this hierarchy of features autonomously during training.

During training, the network will learn a representation that is good for solving the assigned task. With a large and diverse dataset such as ImageNet (millions of images classified into 1000 categories), the learned representations will be general enough to be useful for many other visual perception tasks, even on different or very specific domains. This is the foundation of transfer learning: training a model once on a big dataset, then fine-tuning it on a new domain-specific (and potentially small) dataset. This allows a pre-trained network to be adapted to new problems quickly and with high accuracy.
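As a minimal sketch (assuming PyTorch and torchvision are available, and a hypothetical target task with 10 classes), transfer learning can be as simple as loading a pretrained ResNet, freezing its convolutional backbone and replacing the final fully-connected layer:

import torch.nn as nn
import torchvision.models as models

# Load a ResNet34 pretrained on ImageNet.
model = models.resnet34(pretrained=True)

# Freeze the convolutional backbone so its learned representations are preserved.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully-connected layer to match a hypothetical 10-class task.
model.fc = nn.Linear(model.fc.in_features, 10)

# Train only model.fc on the new dataset; earlier layers can be unfrozen later
# for full fine-tuning if needed.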

Convolution step

Let’s now zoom into one convolutional layer. Keep in mind that what we call convolution in neural nets is a bit different from the classical 2D convolution of signal processing. While the broad idea is similar, it is not mathematically the same operation.

Figure 1.1 Convolution of a 5x5 input (blue) with 3x3 kernel (grey) with a stride of 2 and padding of 1. The 3x3 output is in green (source).

Both classical and deep-learning convolution compute the output by applying a kernel to an input array. Each output pixel is the sum of the element-by-element product between the input patch and the kernel (a dot product). By shifting the kernel over the input, we obtain the different output pixels. The number of pixels we shift at each step (1 or more) is called the stride.

One fundamental difference is in the shape of the input and output tensors: in neural nets we have additional dimensions.

If you are not familiar with 2D convolutions, have a look at this great interactive demonstration to gain some intuition.
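To make the sliding dot product concrete, here is a minimal NumPy sketch of the operation (single-channel input, single kernel, no padding); it is meant for intuition, not as an efficient implementation:

import numpy as np

def conv2d(x, kernel, stride=1):
    # Slide the kernel over the input: each output pixel is the dot product
    # of the kernel with the input patch under it. Note that the kernel is
    # not flipped, matching the deep-learning convention.
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.random.rand(5, 5)
k = np.random.rand(3, 3)
print(conv2d(x, k, stride=2).shape)    # (2, 2): 5x5 input, 3x3 kernel, stride 2, no padding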

Figure 2. A single convolution layer. The convolution output is a tensor with increased depth. Each spatial position in the output (yellow “rod”, middle) depends on a portion of the input (the “receptive field”, left) and on a bank of filters (kernels).

Differences with classical 2D convolution

In 2D ConvNets, the convolution has the following properties:

  1. Input and output activations (also called feature maps) are 3D arrays (height, width, depth). The first layer input depth is 3 (RGB). The depth increases as we go into the deeper layers. Note that, when considering a mini-batch, the input is actually 4D.
  2. Like input and output, the kernel is also 3D. The spatial size is usually 3x3, 5x5 or 7x7 and depth is equal to the input depth. A kernel is also called a filter.
  3. Each layer has multiple kernels called a filter bank. The number of kernels determines the depth of the output (which is typically larger than input depth).
  4. Unlike classical convolution, in ConvNets we compute many convolutions as a single step (one for each kernel in a layer).
  5. Unlike classical convolution, the kernel is not flipped along spatial dimensions before multiplication (this makes the convolution non-commutative, but this property is irrelevant for neural nets).
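These shape conventions are easy to verify in PyTorch (the numbers below are arbitrary, chosen only to show the shapes):

import torch
import torch.nn as nn

# A filter bank of 64 kernels, each 3x3 spatially, operating on a 3-channel (RGB) input.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

x = torch.randn(8, 3, 224, 224)    # a mini-batch: (batch, depth, height, width)
y = conv(x)

print(conv.weight.shape)   # torch.Size([64, 3, 3, 3]): the 4D filter bank
print(y.shape)             # torch.Size([8, 64, 224, 224]): output depth = number of kernels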

Receptive Field

The receptive field is the 3D region of the input contributing to the output pixel (see the yellow cuboid in Figure 2). Note that one output pixel has many “values”, one for each kernel (64 in Figure 2). By arranging all the output vectors corresponding to different receptive fields we obtain the full 3D activation.

Typically, the receptive fields of two adjacent output locations partially overlap. They do not overlap only when the stride is equal to (or larger than) the kernel size.

Number of parameters

All the kernels in a layer (64 in Figure 2) can be arranged in a single 4D tensor of shape

(# kernels, kernel size, kernel size, input_depth)

The parameters include all the weights in the kernels plus the 1D bias vector.

The bias introduces one additional parameter per kernel. Like kernels, the bias is the same for each spatial position, so there are as many bias parameters as the number of kernels (or output depth).

Putting bias and weights together, the total parameters in a layer sum up to:

(# kernels x kernel size x kernel size x input_depth) + # kernels
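A quick sanity check of this formula in PyTorch, for a hypothetical layer with 64 kernels of size 3x3 on a 32-channel input:

import torch.nn as nn

conv = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3)

n_weights = conv.weight.numel()    # 64 * 3 * 3 * 32 = 18432
n_biases = conv.bias.numel()       # 64: one bias per kernel
print(n_weights + n_biases)        # 18496 parameters in total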

Mini-batches

In practice, the activations of Figure 1 are not computed for a single image but for a mini-batch. In this case, all the activations will have an additional dimension of size batch_size. The batch size must be taken into account because it directly influences the RAM needed to train and evaluate the model. Typically we use the largest batch size that can fit in the GPU RAM.

Batch normalization

Batch normalization (BatchNorm) is one of the most important advances in deep learning in recent years. BatchNorm speeds up and stabilizes training on virtually any neural net architecture, including ConvNets.

Figure 3. The batch normalization algorithm from the original paper.

Before entering the ReLU, each activation is normalized over the mini-batch to zero mean and unit standard deviation. Then, each activation is scaled and shifted using two learned parameters (gamma and beta). This last addition turns out to be the critical step that makes BatchNorm so effective. The scale-shift transform allows the optimization to directly control the scale of each activation through a single parameter (gamma); without it, a change of scale could only be achieved through a coordinated change of the many weights contributing to that activation.
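As an illustration, here is a minimal sketch of the training-time forward pass described above, written for a 2D convolutional activation (it omits the running statistics that BatchNorm tracks for use at inference time):

import torch

def batchnorm2d_train(x, gamma, beta, eps=1e-5):
    # x: (batch, channels, height, width); gamma, beta: (channels,)
    # Per-channel mean and (biased) variance over batch and spatial dimensions.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = ((x - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
    # Normalize to zero mean and unit standard deviation...
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # ...then scale and shift with the two learned parameters.
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)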

Curiously, the original BatchNorm authors attributed the improved performance to a reduction in the “internal covariate shift”. But it was recently discovered that BatchNorm instead smooths the optimization landscape, allowing larger learning rates to converge quickly to more accurate solutions. Just a reminder that theory, even if compelling or “intuitive”, must always be empirically validated.

  • “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, Sergey Ioffe, Christian Szegedy, arXiv:1502.03167 (2015)
  • “Batch Normalization”, I. Goodfellow, Y. Bengio, A. Courville, Deep Learning Book Ch. 8.7.1 (2016)
  • “How Does Batch Normalization Help Optimization?”, Santurkar et al. arXiv:1805.11604 (2018)

Other normalizations

BatchNorm is unquestionably the most popular normalization method in deep learning, but it is not the only one. Research in this area is very active and we may see new advances in the near future. The motivation is two-fold. On one hand, BatchNorm is difficult to apply to recurrent networks because of its reliance on the mini-batch mean and standard deviation. On the other hand, the mechanism behind BatchNorm is not fully understood, and more research into how it helps optimization can lead to even better normalization approaches.

For brevity, I will only mention one alternative normalization scheme: weight normalization. In this scheme, instead of normalizing the activations, we normalize the weights. In particular, we normalize each kernel (all the weights contributing to a single activation) to have unit norm. Then, to preserve the model expressiveness, we also add a scale parameter for each activation. In principle, this should help training in a similar way as BatchNorm, by providing a single direct “knob” to change each activation, thus providing an “easier” (i.e. smoother) path toward the minimum.
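A sketch of the reparameterization for a single kernel: the weight tensor w is expressed as a direction v, normalized to unit norm, times a learned scale g (PyTorch also provides a ready-made torch.nn.utils.weight_norm wrapper):

import torch

def weight_norm_kernel(v, g):
    # Reparameterize one kernel as w = g * v / ||v||:
    # v gives the direction, the scalar g directly controls the norm.
    return g * v / v.norm()

v = torch.randn(3, 3, 3)     # e.g. a 3x3 kernel over a 3-channel input
g = torch.tensor(1.5)
w = weight_norm_kernel(v, g)
print(w.norm())              # ~1.5: the scale is controlled by the single parameter g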

Figure 4. Different approaches to normalizing the activations in a mini-batch. (source)

Many other normalization methods have been proposed, each with its own pros and cons. For an excellent overview please see “An Overview of Normalization Methods in Deep Learning” by Keita Kurita.

Pooling

Figure 5. Example of a max-pooling block.

Convolutional blocks are oftentimes followed by a pooling block to reduce the spatial dimensions of the activation. Pooling helps in reducing memory consumption in deeper layers. It is also an important step for converting spatial information into features. According to the Deep Learning Book by Ian Goodfellow et al.,

pooling helps to make the representation approximately invariant to small translations of the input.

There are different pooling strategies. The most common are max-pooling and average pooling. In all cases, pooling reduces an input “block” (receptive field) to a 1x1 output block, while keeping the depth unchanged. The reduction is done by selecting the maximum input activation (max-pooling) or by taking an average (average-pooling). Similar to convolution, a pooling block maps a receptive field to a single “pixel” in the output. For this reason, we can define a pooling spatial size (2x2, 3x3, etc.) and stride. Usually, the stride is chosen so that receptive fields do not overlap, in order to achieve a reduction in spatial size. Oftentimes, the last pooling layer is an average over the whole spatial extent of the activation (global average pooling, or GAP), resulting in a 1x1 output activation (the 1D activation of size 512 in Figure 1). Unlike convolution, pooling has no parameters and the number of output features (depth) is always the same as the input.
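The common pooling variants are one-liners in PyTorch; note that the depth (64 in this example) is untouched and that global average pooling collapses the spatial dimensions entirely:

import torch
import torch.nn as nn

x = torch.randn(8, 64, 56, 56)    # (batch, depth, height, width)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)   # non-overlapping 2x2 windows
gap = nn.AdaptiveAvgPool2d(output_size=1)          # global average pooling (GAP)

print(max_pool(x).shape)   # torch.Size([8, 64, 28, 28])
print(gap(x).shape)        # torch.Size([8, 64, 1, 1]): flatten to get a 1D feature vector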

Pooling layers with a “learnable structure” have been proposed but have enjoyed limited popularity so far (Jia et al 2012).

  • “Network in Network”, Min Lin, Qiang Chen, Shuicheng Yan, arXiv:1312.4400 (2013)
  • “Pooling”, I. Goodfellow, Y. Bengio, A. Courville, Deep Learning Book Ch. 9.3 (2016)
  • “Beyond spatial pyramids: Receptive field learning for pooled image features”, Yangqing Jia, Chang Huang, Trevor Darrell doi: 10.1109/CVPR.2012.6248076 (2012)

1x1 convolutions

Some architectures use a 1x1 filter. In this case, the filter maps an input of shape

(num_filters_i, height_i, width_i)

to an output of shape:

(num_filters_o, height_i, width_i)

Note that only the number of features changes, while height and width remain the same. Each output pixel is a vector of num_filters_o features that depends only on one input pixel (a vector of size num_filters_i). In other words, each output feature is a (different) linear combination of the input features of the same pixel, i.e. of a receptive field of size 1x1.

The 1x1 filter is used to reduce the number of features, thus reducing the computational cost while keeping the spatial dimensions unchanged. For example, the Inception network uses 1x1 filters to reduce the number of features and create “bottlenecks” that make the architecture more computationally affordable. However, if the bottleneck is too tight it may end up hurting the network's performance.
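As a sketch (with arbitrary channel counts), a 1x1 bottleneck that compresses 256 input features to 64 before a more expensive 3x3 convolution could look like this:

import torch
import torch.nn as nn

x = torch.randn(8, 256, 28, 28)    # a deep, spatially small activation

bottleneck = nn.Conv2d(256, 64, kernel_size=1)          # 1x1: mixes features per pixel
conv3x3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)   # now operates on 64 channels only

y = conv3x3(bottleneck(x))
print(y.shape)    # torch.Size([8, 64, 28, 28]): spatial size unchanged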

When the size of the convolution kernel is larger than 1x1, each output feature is still a linear combination of all the input features in the receptive field, which in this case is more than 1 pixel wide.

The 1x1 convolution was called Network in Network in the original paper by Lin et al., which described it as a “mini” fully-connected layer between the 1x1 input and output features. Note that the same fully-connected layer (the same weights) is applied at each spatial position.

  • “Network in Network”, Min Lin, Qiang Chen, Shuicheng Yan, arXiv:1312.4400 (2013)

Inception

The ILSVRC 2014 winner was the GoogLeNet architecture by Szegedy et al., which introduced the inception module shown below.

Figure 6. Inception module, the building block of the GoogLeNet architecture.

In ConvNets, an important choice is the spatial size of the convolution kernel. The size is typically 7x7 for the first layer and 3x3 for all the following layers. Instead of choosing a single size, the inception module performs many convolutions in parallel. Figure 6 shows the inception block as proposed in the inception v1 paper. Convolutions of size 1x1, 3x3 and 5x5 (blue blocks), as well as a max-pooling (red block), are performed on the same input. Additional 1x1 convolutions (yellow blocks) reduce the depth in order to heavily reduce the memory requirements. The parallel paths produce output tensors with the same spatial size, which are concatenated along the depth to form the layer output.
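A simplified sketch of such a module in PyTorch (the filter counts are illustrative; the real GoogLeNet uses different numbers at each stage and adds ReLU activations after each convolution):

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Four parallel branches concatenated along the depth dimension.
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(                       # 1x1 bottleneck, then 3x3
            nn.Conv2d(in_ch, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(                       # 1x1 bottleneck, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(                   # 3x3 max-pool, then 1x1
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x),
                    self.branch5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)                   # concatenate along depth

y = InceptionBlock(192)(torch.randn(1, 192, 28, 28))
print(y.shape)   # torch.Size([1, 256, 28, 28]): 64 + 128 + 32 + 32 features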

Since the first paper, many updates to the inception architecture have been proposed including inception v2, v3, v4, and inception-resnet. The latter combines the inception idea of multiple convolutions with skip-connections (from ResNets, see next section).

ResNet

One known problem in neural networks with many layers is the vanishing gradient. In essence, during back-propagation the gradient at each layer is obtained by multiplying together the local derivatives of all the layers above it. By the time we reach the first layers, the gradient can become vanishingly small or can explode (overflow). This effect makes it hard to train deep neural nets, including ConvNets. To address this problem, Kaiming He et al. introduced the “skip connection”, which forms the building block of the ResNet architecture.

Figure 7. Illustration of the skip-connection, the building block of the ResNet architecture. (source)

In ResNet, the output of a layer is fed not only to the next layer but also, through a skip connection, two layers ahead, where it is added to the output of those layers before being passed on.
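A minimal residual block sketch (identity shortcut, with matching input and output depth), showing how the input is added back after two convolutions:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # y = relu(F(x) + x), where F is two 3x3 convolutions with batch normalization.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x                 # the skip connection: add the input back
        return self.relu(out)

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)   # torch.Size([1, 64, 56, 56])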

Since winning the ILSVRC 2015 competition, ResNet has remained a state-of-the-art ConvNet architecture. A pretrained ResNet34 or ResNet50 is the de facto standard in transfer learning for specialized applications, spanning from medical imaging to teddy-bear detectors.

Interpretability

In order to build trust in intelligent systems and move towards their meaningful integration into our everyday lives, it is clear that we must build “transparent” models that explain why they predict what they predict.

from Grad-CAM arXiv:1610.02391.

Understanding how a neural network reaches a decision is notoriously difficult. Interpreting neural network results is not only important as a scientific endeavor but also a requirement for many applications. Neural net interpretability is an active research topic involving several network visualization techniques.

Broadly speaking, there are two families of interpretability techniques. The first, called attribution, aims to find the regions of the input image the network used to reach its decision. The second, called feature visualization, aims to visualize which features of an input image activate a specific neuron or group of neurons.

Figure 8. Two approaches in interpreting ConvNets: feature visualization and attribution. Source Olah et al. 2017.

On some architectures, attribution can be done by overlaying the spatial activations of a hidden layer on the input image, plotting a so-called saliency map (Figure 8, right panel). Saliency maps have the same spatial resolution as the last 3D activation, which is low but oftentimes sufficient. An extension of this approach, applicable to any architecture, is Grad-CAM, where the saliency map is a weighted mean of the last spatial activations (the last activation with a 3D shape). The weights in this weighted mean are computed from the gradient of the class score (or, more generally, of the output of interest) with respect to each activation. Grad-CAM can be applied to any network, even for non-classification tasks.
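A hedged sketch of the Grad-CAM recipe using PyTorch hooks, assuming a torchvision ResNet whose last 3D activation is produced by model.layer4 (other architectures require a different hook point, and a real preprocessed image would replace the random tensor):

import torch
import torchvision.models as models

model = models.resnet34(pretrained=True).eval()

activations, gradients = {}, {}
model.layer4.register_forward_hook(
    lambda module, inp, out: activations.update(value=out.detach()))
model.layer4.register_backward_hook(
    lambda module, grad_in, grad_out: gradients.update(value=grad_out[0].detach()))

x = torch.randn(1, 3, 224, 224)    # stand-in for a preprocessed input image
score = model(x)[0].max()          # score of the predicted class
score.backward()                   # back-propagate to obtain the gradients

# Weight each feature map by the spatial mean of its gradient, sum and clip.
weights = gradients['value'].mean(dim=(2, 3), keepdim=True)    # (1, 512, 1, 1)
cam = torch.relu((weights * activations['value']).sum(dim=1))  # (1, 7, 7)
print(cam.shape)   # low-resolution saliency map, to be upsampled over the image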

Figure 9. Feature visualization. The top row shows the “optimization objective”: single neuron, channel, layer, a class before soft-max, a class after soft-max. The bottom row shows an input image resulting from optimizing the corresponding objective. Source Olah et al. 2017.

For feature visualization, we can plot the kernel weights in each layer. Each kernel shows the pattern it detects in the layer input. Interpretation is easy for the first layer but becomes harder for deeper layers. Another simple approach is plotting the activations for a given input.

A more nuanced approach is generating (via optimization) an input image that maximally activates a neuron, a channel, a layer or a class (Figure 9). This allows building atlases of “features” that visually represent what the network responds to in each layer. The approach suffers from a lack of diversity in the generated images, which may not represent the full set of spatial features the network responds to. For further information, the papers published by Distill.pub are both insightful and graphically stunning.
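A bare-bones sketch of the idea: gradient ascent on the input to maximize the mean activation of one (arbitrarily chosen) channel of an intermediate layer. Real feature-visualization work adds regularization and image transformations to obtain the clean pictures of Figure 9:

import torch
import torchvision.models as models

model = models.resnet34(pretrained=True).eval()

# Capture the activation of an intermediate layer (layer3 here, arbitrarily).
feats = {}
model.layer3.register_forward_hook(lambda m, i, o: feats.update(value=o))

img = torch.randn(1, 3, 224, 224, requires_grad=True)    # start from random noise
optimizer = torch.optim.Adam([img], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    model(img)
    loss = -feats['value'][0, 42].mean()    # maximize channel 42 (arbitrary choice)
    loss.backward()
    optimizer.step()

# img now (roughly) shows the pattern that channel 42 of layer3 responds to.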

  • “Grad-CAM”, R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra (2016), arXiv:1610.02391
  • “Understanding CNN”, Andrej Karpathy, CS231n Course
  • Feature Visualization, Chris Olah, Alexander Mordvintsev, Ludwig Schubert (2017), Distill.pub, doi:10.23915/distill.00007
  • “Exploring Neural Networks with Activation Atlases”, Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson, Chris Olah, Distill.pub, doi:10.23915/distill.00015

Bias

Figure 10. Bias in state-of-the-art face-recognition systems (source).

A discussion of ConvNets cannot be complete without mentioning the issue of bias. Bias in machine learning comes from bias in datasets and/or algorithms, which in turn reflects the biases of the people creating the system. While bias is a serious problem throughout machine learning, ConvNet applications offer some glaring examples of how it can affect people's lives.

Remember that a network will “learn” whatever representations are useful to solve its task. For example, if the goal is recognizing faces, the ideal dataset should be as diverse as possible in order to represent all ethnic, age, and gender groups in a balanced way. In practice, the most popular datasets over-represent white males. As researcher Joy Buolamwini found, this leads to heavy biases in current state-of-the-art commercial face-recognition systems. In these systems, the faces of women of color are recognized with error rates orders of magnitude higher than those of white males (Figure 10). Such systems have been, or will be, deployed to identify crime suspects, for example. Unfortunately, if you are a dark-skinned woman, you may be misidentified as a criminal at a hundredfold higher rate than a white man!

As machine learning practitioners, we cannot forgo our moral responsibilities. We know that the systems we create can disrupt people's lives at an unprecedented scale, so we must take steps to overcome this bias. Rachel Thomas, one of Forbes' “20 Incredible Women in AI”, has written about the bias issue at length and her posts are an excellent source of information.

Other topics

The topics covered here are far from exhaustive. Here are a few topics I have not covered:

  • Optimization: training requires the use of one of the numerous optimization methods.
  • Convolution arithmetic: the effects of stride and padding form a topic called “convolution arithmetic”. A fractional stride defines the transposed convolution (also improperly called “deconvolution”), which is used in generative models to generate images.
  • Adversarial attacks: ConvNets can be easily fooled by minimal adversarial perturbations. A carefully chosen perturbation of an image (invisible to the human eye) can change the network output. Research to make ConvNets robust to adversarial examples is underway.

Conclusion

In this post, I touched on several fundamental aspects of ConvNets. Even the most advanced architectures are based on the basic building block of the convolutional layer.

ConvNets may have “solved” the image classification problem, but many open problems remain. Despite recent progress, interpreting results is still a challenge, an issue impeding applications in some fields. Better generalization from smaller datasets would also greatly expand the class of tractable problems. But, most importantly, we need to acknowledge and try to overcome social bias. Given the dramatic implications for individuals and communities, it is paramount to strive for more fairness in these systems.

References

Here are some general references on ConvNets. References for specific topics are at the end of each section.

  1. “Convolutional Networks”, I. Goodfellow, Y. Bengio, A. Courville, Deep Learning Book Ch. 9 (2016).
  2. “Lesson 6: Regularization; Convolutions; Data ethics”, Fast.ai, Practical Deep Learning for Coders v3
  3. “Ch. 6: Deep Learning”, Neural Networks and Deep Learning, Michael Nielsen (2015) https://neuralnetworksanddeeplearning.com/
  4. “CS231n: Convolutional Neural Networks for Visual Recognition”, Andrej Karpathy, Stanford CS231n Lectures
  5. “A guide to convolution arithmetic for deep learning”, Vincent Dumoulin, Francesco Visin (2016), arXiv:1603.07285 (see also their animations)

Other blog posts:

  1. “A Beginner’s Guide To Understanding Convolutional Neural Networks” by Adit Deshpande
  2. “Intuitively Understanding Convolutions for Deep Learning” by Irhum Shafkat

Header image from Topological Visualisation of a Convolutional Neural Network.
