An intuitive introduction to different variations of the glamorous CNN layer
Just a brief intro
Convolution uses a ‘kernel’ to extract certain ‘features’ from an input image. Let me explain. A kernel is a matrix, which is slid across the image and multiplied with the input such that the output is enhanced in a certain desirable manner. Watch this in action below.

For example, the kernel used above is useful for sharpening the image. But what is so special about this kernel? Consider the two input image arrangements shown in the example below. For the first image, the output at the center is 3*5 + 2*(-1) + 2*(-1) + 2*(-1) + 2*(-1) = 7. The value 3 got increased to 7. For the second image, the output is 1*5 + 2*(-1) + 2*(-1) + 2*(-1) + 2*(-1) = -3. The value 1 got decreased to -3. Clearly, the contrast between 3 and 1 is increased to 7 and -3, which in turn sharpens the image.
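To make the arithmetic concrete, here is a minimal NumPy sketch of this sliding-window operation (strictly speaking it is cross-correlation, which is what deep learning frameworks implement under the name ‘convolution’); the 3*3 sharpening kernel is the one assumed in the worked example above.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and
    sum the elementwise products at every position."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The sharpening kernel assumed in the worked example above.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# A 3 surrounded by 2s: 3*5 + 4 * (2 * -1) = 7
patch = np.array([[2, 2, 2],
                  [2, 3, 2],
                  [2, 2, 2]])
print(conv2d(patch, sharpen))  # [[7.]]
```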

Instead of using manually crafted kernels for feature extraction, with deep CNNs we can learn these kernel values, which can then extract latent features. For further reading on the working of conventional CNNs, I would suggest this blog.
Kernel vs Filter
Before we dive into it, I just want to make the distinction between the terms ‘kernel’ and ‘filter’ very clear, because I have seen a lot of people use them interchangeably. A kernel is, as described earlier, a matrix of weights which is multiplied with the input to extract relevant features. The dimensionality of the kernel matrix is how the convolution gets its name. For example, in 2D convolutions, the kernel matrix is a 2D matrix.
A filter, however, is a concatenation of multiple kernels, each kernel assigned to a particular channel of the input. Filters are always one dimension higher than the kernels. For example, in 2D convolutions, filters are 3D matrices (essentially a concatenation of the 2D kernel matrices). So for a CNN layer with kernel dimensions h*w and k input channels, the filter dimensions are k*h*w.
A common convolution layer actually consists of multiple such filters. For the sake of simplicity in the discussion to follow, assume the presence of only one filter unless specified otherwise, since the same behavior is replicated across all the filters.
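As a sanity check on these shapes, here is a small PyTorch sketch (the framework is my own choice; the blog doesn’t pin one down). A Conv2d layer stores its weights as one filter per output channel, and each filter stacks one kernel per input channel:

```python
import torch.nn as nn

# 16 filters, each made of 3 kernels (one per input channel) of size 3x3.
layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# (out_channels, in_channels, h, w) -> each filter is k*h*w = 3*3*3
print(layer.weight.shape)  # torch.Size([16, 3, 3, 3])
```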
1D, 2D and 3D Convolutions
1D convolutions are commonly used for time series data analysis (since the input in such cases is 1D). As mentioned earlier, a 1D input can have multiple channels. The filter can move in one direction only, and thus the output is 1D. See below an example of a single-channel 1D convolution.
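As a quick complement to the animation, a shape check (again a PyTorch sketch of my own): a multi-channel 1D input still produces a 1D output per filter.

```python
import torch
import torch.nn as nn

# A time series with 4 channels and 100 steps (batch of 1).
x = torch.randn(1, 4, 100)

# One filter: 4 kernels of width 5, one per input channel.
conv = nn.Conv1d(in_channels=4, out_channels=1, kernel_size=5)

print(conv(x).shape)  # torch.Size([1, 1, 96]) -- the output is 1D
```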

We already saw an example of single-channel 2D convolution at the start of the post, so let’s visualize a multi-channel 2D convolution and try to wrap our heads around it. In the diagram below, the kernel dimensions are 3*3 and there are multiple such kernels in the filter (marked yellow), because there are multiple channels in the input (marked blue) and we have one kernel corresponding to every channel of the input. Clearly, here the filter can move in 2 directions, and thus the final output is 2D. 2D convolutions are the most common convolutions and are heavily used in Computer Vision.
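To spell out the multiply-and-sum that the diagram depicts, here is a minimal NumPy sketch (my own illustration): each kernel multiplies its own channel, and the per-channel results are summed into a single 2D map.

```python
import numpy as np

def multichannel_conv2d(image, filt):
    """`image` is (channels, H, W); `filt` is (channels, kh, kw):
    one kernel per input channel. The per-channel products are
    summed into a single 2D output map."""
    c, kh, kw = filt.shape
    oh, ow = image.shape[1] - kh + 1, image.shape[2] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[:, i:i + kh, j:j + kw] * filt)
    return out

image = np.random.rand(3, 7, 7)  # 3 channels, 7x7
filt = np.random.rand(3, 3, 3)   # 3 kernels of size 3x3 (one filter)

print(multichannel_conv2d(image, filt).shape)  # (5, 5) -- a 2D output
```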

It is difficult to visualize a 3D filter (since it is a 4D matrix), so we will discuss single-channel 3D convolution here. As you can see from the image below, in 3D convolutions a kernel can move in 3 directions, and thus the output obtained is also 3D.
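The same shape check in PyTorch (my framework assumption) for the 3D case:

```python
import torch
import torch.nn as nn

# A single-channel 3D volume, e.g. 16x16x16 voxels (batch of 1).
x = torch.randn(1, 1, 16, 16, 16)

# A 3x3x3 kernel that can slide along all three axes.
conv = nn.Conv3d(in_channels=1, out_channels=1, kernel_size=3)

print(conv(x).shape)  # torch.Size([1, 1, 14, 14, 14]) -- a 3D output
```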

Most of the work on modifying and customizing CNN layers has focused on 2D convolutions, so from this point forward I will discuss these variations only in the context of 2D convolutions.
Transposed Convolution (Deconvolution)
The GIF below nicely captures how a 2D convolution decreases the dimensions of the input. But sometimes we need to process the input so as to increase its dimensions (also called ‘upsampling’).

To achieve this using convolutions, we use a modification known as transposed convolution or deconvolution (although it does not truly ‘reverse’ a convolution operation, so a lot of people prefer not to use the latter term). The dotted blocks in the GIF below represent padding.

I think these animations give a good intuition of how different up-sampled outputs can be created from the same input, based on the padding pattern. Such convolutions are very commonly used in modern CNNs, mainly because of their ability to increase the image dimensions.
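For a concrete shape check, here is a small PyTorch sketch (framework assumed by me) of a transposed convolution upsampling a 2x2 input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 2)  # a tiny 2x2 single-channel input

# Transposed convolution: output size = (in - 1) * stride + kernel
up = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                        kernel_size=3, stride=2)

print(up(x).shape)  # torch.Size([1, 1, 5, 5]) -- upsampled from 2x2
```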

Separable Convolution
Separable convolution refers to breaking down the convolution kernel into lower-dimensional kernels. Separable convolutions are of 2 major types. First are spatially separable convolutions; see below for an example.
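As a concrete instance (my own choice of kernel), the classic 3x3 Sobel edge kernel factors into a 3x1 column and a 1x3 row, so convolving with the two 1D kernels one after the other gives the same result as convolving with the full 3x3 kernel:

```python
import numpy as np

col = np.array([[1], [2], [1]])  # 3x1 kernel
row = np.array([[-1, 0, 1]])     # 1x3 kernel

# Their outer product reconstructs the full 3x3 Sobel kernel.
sobel = col @ row
print(sobel)
# [[-1  0  1]
#  [-2  0  2]
#  [-1  0  1]]

# Per position, 9 multiplications become 3 + 3 = 6.
```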


However, spatially separable convolutions are not that common in deep learning. Depthwise separable convolutions, on the other hand, are widely used in lightweight CNN models and provide really good performance. See below for an example.


But why use separable convolutions? Efficiency! Using separable convolutions can significantly decrease the number of parameters required. With the increasing complexity and tremendous size of the deep learning networks we have today, being able to provide similar performance with fewer parameters is definitely a requirement.
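Here is a minimal PyTorch sketch of a depthwise separable convolution (depthwise via the `groups` argument, followed by a 1x1 pointwise convolution), together with the parameter savings. This mirrors the MobileNets building block [5], though the exact layer sizes are my own:

```python
import torch.nn as nn

c_in, c_out = 128, 128

# Standard convolution: one 3x3 kernel per (input, output) channel pair.
standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

# Depthwise separable: one 3x3 kernel per channel, then a 1x1
# "pointwise" convolution to mix the channels.
depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in)
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

def n_params(*layers):
    return sum(p.numel() for l in layers for p in l.parameters())

print(n_params(standard))              # 147584 (3*3*128*128 + biases)
print(n_params(depthwise, pointwise))  # 17792  (3*3*128 + 128*128 + biases)
```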
Dilated (Atrous) Convolution
As you have seen, all the convolution layers above (without exception) process all the neighboring values together. However, sometimes it might be in the best interest of the pipeline to skip certain input values, and this is how dilated convolutions (also called atrous convolutions) were introduced. Such a modification allows the kernel to increase its range of view without increasing the number of parameters.

Clearly one can notice from the animation above that the kernel is able to process a wider neighborhood with the same 9 parameters as earlier. This also means a loss of information, because the kernel cannot process fine-grained detail (since it is skipping certain values). However, the overall effect seems to be positive in certain applications.
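A quick PyTorch sketch of this (framework assumed): with `dilation=2`, a 3x3 kernel keeps its 9 weights but covers a 5x5 neighborhood.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)

# dilation=2 spreads the 3x3 taps over a 5x5 window.
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)

print(dilated.weight.numel())  # 9 -- same parameter count as before
print(dilated(x).shape)        # torch.Size([1, 1, 3, 3]) -- as if 5x5
```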
Deformable Convolution
Convolutions are very rigid in terms of the shape of feature extraction. That is, the kernel shapes are squares/rectangles (or some other shape that has to be manually decided), and thus they can only work on such patterns. What if the shape of the convolution itself was learnable? This is the core idea behind the introduction of deformable convolutions.

The implementation of a deformable convolution is actually very straightforward. Every kernel is represented by two different branches. The first branch learns to predict the ‘offset’ from the origin; this offset indicates which inputs around the origin will be processed. Since each offset is predicted independently, the sampled points don’t need to form any rigid shape between themselves, thus allowing the deformable nature. The second branch is simply the convolution branch, whose input is now the values at these offsets.
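A minimal sketch using torchvision’s deformable convolution op (the library choice is my assumption; [4] describes the operator itself): a small convolution predicts 2 offsets (dy, dx) per kernel tap, and the main convolution samples the input at those offset locations.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

x = torch.randn(1, 3, 8, 8)
kh, kw = 3, 3

# Branch 1: predicts 2 offsets (dy, dx) per kernel tap at every position.
offset_branch = nn.Conv2d(3, 2 * kh * kw, kernel_size=kh, padding=1)
offsets = offset_branch(x)  # (1, 18, 8, 8)

# Branch 2: ordinary convolution weights, applied at the offset locations.
weight = torch.randn(16, 3, kh, kw)

out = deform_conv2d(x, offsets, weight, padding=(1, 1))
print(out.shape)  # torch.Size([1, 16, 8, 8])
```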

What’s next?
There have been multiple variations of CNN layers, used independently or in combination with each other to create successful and complex architectures. Each variation was born out of an intuition about how feature extraction should work. So I believe that while these deep CNNs learn weights we cannot explain, the intuitions involved in forming them are very important to their performance, and further work in that direction is important for the success of highly complex CNNs.
This blog is a part of an effort to create simplified introductions to the field of Machine Learning. Follow the complete series here
Or simply read the next blog in the series
References
[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
[2] Dumoulin, Vincent, and Francesco Visin. "A guide to convolution arithmetic for deep learning." arXiv preprint arXiv:1603.07285 (2016).
[3] Chen, Liang-Chieh, et al. "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs." IEEE Transactions on Pattern Analysis and Machine Intelligence 40.4 (2017): 834–848.
[4] Dai, Jifeng, et al. "Deformable convolutional networks." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[5] Howard, Andrew G., et al. "MobileNets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).
[6] https://github.com/vdumoulin/conv_arithmetic