Transposed convolutions are a key building block for applications like image segmentation and super-resolution, but they can be a little tricky to understand. In this post, I will try to demystify the concept and make it easier to follow.

Introduction
The computer vision domain has been going through a transition since Convolutional Neural Networks (CNNs) gained popularity. The revolution started with AlexNet winning the ImageNet challenge in 2012, and since then CNNs have dominated image classification, object detection, image segmentation, and many other image- and video-related tasks.
The convolution operation reduces the spatial dimensions as we go deeper into the network and creates an abstract representation of the input image. This property of CNNs is very useful for tasks like image classification, where you just have to predict whether a particular object is present in the input image. But it causes problems for tasks like object localization and segmentation, where the spatial extent of the object in the original image is needed to predict the output bounding box or segment the object.
To address this problem, various techniques are used, such as fully convolutional networks that preserve the input dimensions using 'same' padding. Though this technique solves the problem to a great extent, it also increases the computational cost, since the convolution operation now has to be applied at the original input dimensions throughout the network.
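To make the dimension bookkeeping concrete, here is a small sketch of the standard output-size formula for a convolution (the layer sizes below are illustrative, not from the post):

```python
def conv_output_size(n, k, s=1, p=0):
    # Standard formula for the output spatial size of a convolution:
    # floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

# A 'valid' 3x3 conv with stride 2 shrinks a 224x224 input:
print(conv_output_size(224, k=3, s=2))       # 111
# 'same' padding (p = k // 2) with stride 1 preserves the size:
print(conv_output_size(224, k=3, s=1, p=1))  # 224
```

The second call shows why 'same' padding keeps the computation at full input resolution in every layer.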

Another approach used for image segmentation is to divide the network into two parts: a downsampling network followed by an upsampling network. In the downsampling network, simple CNN architectures are used to produce abstract representations of the input image. In the upsampling network, these abstract representations are upsampled using various techniques until their spatial dimensions equal those of the input image. This kind of architecture is famously known as an Encoder-Decoder network.

Upsampling Techniques
The downsampling network is intuitive and well known to all of us, but much less is said about the various techniques used for upsampling.
The most widely used techniques for upsampling in Encoder-Decoder Networks are:
- Nearest Neighbors: As the name suggests, we take an input pixel value and copy it to its K nearest neighbors in the output, where K depends on the expected output size.

- Bi-Linear Interpolation: We take the four nearest pixel values of the input pixel and compute a weighted average based on the distances to those four cells, which smooths the output.

- Bed of Nails: We copy the value of the input pixel to the corresponding position in the output image and fill zeros in the remaining positions.

- Max-Unpooling: The max-pooling layer in a CNN takes the maximum among all values in the kernel window. To perform max-unpooling, the index of the maximum value is first saved for every max-pooling layer during the encoding step. The saved index is then used during the decoding step, where the input pixel is mapped back to the saved index and zeros are filled everywhere else.
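Two of these fixed upsampling schemes can be sketched in a few lines of NumPy; the function names and the factor-of-2 example are illustrative:

```python
import numpy as np

def nearest_neighbor_upsample(x, factor=2):
    # Repeat each pixel 'factor' times along both spatial axes.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def bed_of_nails_upsample(x, factor=2):
    # Place each input pixel at the top-left of its output block
    # and fill the remaining positions with zeros.
    h, w = x.shape
    out = np.zeros((h * factor, w * factor), dtype=x.dtype)
    out[::factor, ::factor] = x
    return out

x = np.array([[1, 2],
              [3, 4]])
print(nearest_neighbor_upsample(x))
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
print(bed_of_nails_upsample(x))
# [[1 0 2 0]
#  [0 0 0 0]
#  [3 0 4 0]
#  [0 0 0 0]]
```

Note that neither function has any parameters to learn: the output is fully determined by the input and the fixed rule.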

All the above-mentioned techniques are predefined and do not depend on the data, which makes them task-specific. Since they do not learn from data, they are not a generalized upsampling technique.
Transposed Convolutions
Transposed Convolutions are used to upsample the input feature map to a desired output feature map using some learnable parameters. The basic operation that goes in a transposed convolution is explained below:
- Consider a 2×2 encoded feature map which needs to be upsampled to a 3×3 feature map.


- We take a kernel of size 2×2 with unit stride and zero padding.

- Now we take the upper left element of the input feature map and multiply it with every element of the kernel as shown in figure 10.

- Similarly, we do it for all the remaining elements of the input feature map as depicted in figure 11.

- As you can see, some of the elements of the resulting upsampled feature map overlap. To resolve this, we simply add the values at the overlapping positions.

- The resulting output will be the final upsampled feature map having the required spatial dimensions of 3×3.
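The steps above can be sketched in NumPy as a naive scatter-add; the all-ones 2×2 kernel here is only for illustration, since in practice the kernel values are learned:

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=1):
    # Naive transposed convolution: scatter each input element times
    # the kernel into the output, summing where regions overlap.
    h, w = x.shape
    kh, kw = kernel.shape
    out_h = (h - 1) * stride + kh
    out_w = (w - 1) * stride + kw
    out = np.zeros((out_h, out_w))
    for i in range(h):
        for j in range(w):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * kernel
    return out

x = np.array([[1., 2.],
              [3., 4.]])
k = np.ones((2, 2))  # illustrative kernel; learned in a real network
y = transposed_conv2d(x, k, stride=1)
print(y.shape)  # (3, 3)
print(y)
# [[ 1.  3.  2.]
#  [ 4. 10.  6.]
#  [ 3.  7.  4.]]
```

The center value 10 is where all four scaled kernel copies overlap, which is exactly the overlap-addition step described above.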
Transposed convolution is also known as deconvolution, which is not an appropriate name: deconvolution implies removing the effect of a convolution, which is not what we are aiming to achieve.
It is also known as upsampled convolution, which is intuitive given the task it performs, i.e. upsampling the input feature map.
It is also referred to as fractionally strided convolution, because a stride over the output is equivalent to a fractional stride over the input. For instance, a stride of 2 over the output corresponds to a stride of 1/2 over the input.
Finally, it is also referred to as backward strided convolution, because the forward pass of a transposed convolution is equivalent to the backward pass of a normal convolution.
Problems with Transposed Convolutions:
Transposed convolutions suffer from checkerboard artifacts, as shown below.

The main cause is the uneven overlap of the kernel at some positions of the output, which produces the artifacts. This can be fixed or reduced by choosing a kernel size divisible by the stride, e.g. a kernel size of 2×2 or 4×4 with a stride of 2.
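The uneven-overlap claim can be checked directly by counting, for a 1-D transposed convolution, how many kernel applications touch each output position (the helper below is a sketch written for this post):

```python
import numpy as np

def overlap_map(n, k, stride):
    # Count how many kernel applications contribute to each output
    # position of a 1-D transposed convolution with n input elements.
    out = np.zeros((n - 1) * stride + k)
    for i in range(n):
        out[i*stride:i*stride + k] += 1
    return out

# Kernel size 3, stride 2: interior positions alternate between
# 1 and 2 contributions -> uneven overlap, checkerboard pattern.
print(overlap_map(4, k=3, stride=2))  # [1. 1. 2. 1. 2. 1. 2. 1. 1.]

# Kernel size 4 (divisible by stride 2): every interior position
# receives exactly 2 contributions -> even overlap, no pattern.
print(overlap_map(4, k=4, stride=2))  # [1. 1. 2. 2. 2. 2. 2. 2. 1. 1.]
```

The alternating 1/2 counts in the first case are precisely the periodic intensity variation seen as a checkerboard in 2-D outputs.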
Applications of Transposed Convolution:
- Super-Resolution:

- Semantic Segmentation:

Conclusion:
Transposed convolutions are the backbone of modern segmentation and super-resolution algorithms. Because their kernels are learned from data, they provide the most general way to upsample abstract representations. In this post, we explored the various upsampling techniques in use and then dug into an intuitive understanding of transposed convolutions. I hope you liked the post; if you have any doubts, queries or comments, please feel free to connect with me on Twitter or LinkedIn.