
Understand Transposed Convolutions

Get to know the concepts of transposed convolutions and build your own transposed convolutional layers from scratch

Image From Unsplash by NeONBRAND

The generative adversarial network (GAN) is one of the most advanced artificial neural networks for new data generation. It is widely used in photograph generation, photograph editing, face aging, and more. The core of a GAN consists of two networks: the generator and the discriminator. The two play an adversarial game in which the generator learns to fool the discriminator by making fake images that look very much like real images, while the discriminator learns to get better at detecting fake images. Ideally, at the end of training, the generator produces fake images so realistic that the discriminator can no longer tell them apart from real ones.

While convolutional layers play an important role in the discriminator, transposed convolutional layers are the primary building blocks of the generator. Thanks to the TensorFlow Keras API, building a GAN becomes a very convenient process. However, setting the right values for parameters such as kernel size, strides, and padding requires us to understand how transposed convolutions work. In this notebook, I would like to share some of my personal understanding of transposed convolutions, and hopefully help you reveal the mystery. Throughout the notebook, I will use convolutions as a comparison to better explain transposed convolutions. I will also show you how I implement these understandings to build my own convolutional and transposed convolutional layers, which act like naive versions of the Conv2D and Conv2DTranspose layers from Keras. The notebook consists of three sections:

  1. What is the transposed convolution?
  2. What are the parameters (kernel size, strides, and padding) in Keras Conv2DTranspose?
  3. Build my own Conv2D and Conv2DTranspose layers from scratch

Section 1: What Is The Transposed Convolution?

I understand the transposed convolution as the opposite of the convolution. In a convolutional layer, we use a special operation named cross-correlation (in machine learning, the operation is more often called convolution, and thus the layers are named "convolutional layers") to calculate the output values. This operation adds all the neighboring numbers in the input layer together, weighted by a convolution matrix (the kernel). For example, in the image below, the output value 55 is calculated by element-wise multiplication between the 3×3 part of the input layer and the 3×3 kernel, summing all the results together:

One Convolution Operation With 3×3 Part of the Input Layer and the 3×3 Kernel (Image by Author)

Without any padding, this operation transforms a 4×4 matrix into a 2×2 matrix. It looks as if someone were casting light from left to right, projecting an object (the 4×4 matrix) through a hole (the 3×3 kernel), and yielding a smaller object (the 2×2 matrix). Now, our question is: what if we want to go backward from a 2×2 matrix to a 4×4 matrix? Well, the intuitive way is to just cast the light backward! Mathematically, instead of multiplying two 3×3 matrices, we can multiply each value in the input layer by the 3×3 kernel to yield a 3×3 matrix. Then, we just combine all of them together according to their initial positions in the input layer, and sum the overlapped values:

Multiply Each Element in the Input Layer by Each Value in the Kernel
Combine All Four Resulting Layers Together And Sum the Overlapped Values (Image by Author)

In this way, the output of the transposed convolution operation always has exactly the same shape as the input of the previous convolution operation, because we performed exactly the reverse. However, you may notice that the original numbers are not restored. Therefore, a completely different kernel has to be used to restore the initial input matrix, and this kernel can be determined through training.

To demonstrate that my results are not just some random numbers, I built the convolutional neural networks in Keras using the setup described above, and the outputs are exactly the same.
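Since the Keras comparison itself is not reproduced here, the following is a small NumPy sketch of the scatter-and-sum operation described above. The 2×2 input values and the all-ones kernel are my own toy example, not the article's original figures:

```python
import numpy as np

def naive_transposed_conv(x, k):
    """Multiply each input value by the whole kernel, place the result
    at the corresponding position, and sum the overlapping values."""
    n, m = x.shape[0], k.shape[0]
    out = np.zeros((n + m - 1, n + m - 1))
    for i in range(n):
        for j in range(n):
            out[i:i + m, j:j + m] += x[i, j] * k
    return out

x = np.array([[1., 2.],
              [3., 4.]])     # a 2x2 input (e.g. the output of a convolution)
k = np.ones((3, 3))          # illustrative 3x3 kernel
y = naive_transposed_conv(x, k)
print(y.shape)               # (4, 4) -- back to the pre-convolution size
```

Note how the center of the output accumulates contributions from all four input values, exactly as in the "sum the overlapped values" picture.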

Why "Transposed"?

Now you may be wondering: hey, this looks just like a reversed convolution. Why is it called a "transposed" convolution?

To be honest, I don’t know why I had to struggle with this question, but I did. I believed that it is named "transposed" convolution for a reason. To answer this question, I read many online resources about transposed convolution. An article named "Up-sampling with Transposed Convolution" helped me a lot. In this article, the author Naoki Shibuya expresses the convolution operation using a zero-padded convolution matrix instead of a normal square kernel matrix. Essentially, instead of expressing the above kernel as a 3×3 matrix, when performing the convolutional transformation, we can express it as a 4×16 matrix. And instead of expressing the above input as a 4×4 matrix, we can express it as a 16×1 vector:

Express 3×3 Kernel as 4×16 Zero-Padded Convolution Matrix (Image by Author)

The reason it is a 4×16 matrix:

  • 4 rows: in total, we can perform four convolutions by sliding the 3×3 kernel to four positions on the 4×4 input matrix;
  • 16 columns: the input matrix will be flattened into a 16×1 vector, so the convolution matrix must have 16 columns to perform the matrix multiplication.
Express 4×4 Input Matrix as 16×1 Vector (Image by Author)

In this way, we can directly perform the matrix multiplication to get an output layer. The reshaped output layer will be exactly the same as the one derived by the general convolution operation.

Matrix Multiplication Between 4×16 Convolution Matrix and 16×1 Input Vector (Image by Author)
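To make this concrete, here is a NumPy sketch that builds the zero-padded convolution matrix for a 3×3 kernel on a 4×4 input and checks it against the direct cross-correlation. The particular kernel and input values are my own, not those in the figures:

```python
import numpy as np

def conv_matrix(k, n):
    """Rewrite an m x m kernel as an (o*o) x (n*n) zero-padded
    convolution matrix acting on a flattened n x n input."""
    m = k.shape[0]
    o = n - m + 1                         # output is o x o
    C = np.zeros((o * o, n * n))
    for i in range(o):
        for j in range(o):
            window = np.zeros((n, n))
            window[i:i + m, j:j + m] = k  # kernel placed at this position
            C[i * o + j] = window.ravel()
    return C

k = np.arange(1., 10.).reshape(3, 3)
x = np.arange(16.).reshape(4, 4)
C = conv_matrix(k, 4)                     # shape (4, 16)
y = (C @ x.ravel()).reshape(2, 2)         # matrix-multiplication version

# direct cross-correlation for comparison
direct = np.array([[(x[i:i + 3, j:j + 3] * k).sum() for j in range(2)]
                   for i in range(2)])
print(np.array_equal(y, direct))          # True
```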

Now comes the most interesting part! When we perform the transposed convolution operation, we simply transpose the zero-padded convolution matrix and multiply it by the input vector (which was the output of the convolutional layer). In the picture below, the four colored vectors in the middle stage represent the intermediate step of the matrix multiplication:

Matrix Multiplication Between 16×4 Transposed Convolution Matrix and 4×1 Input Vector (Image by Author)

If we rearrange the four vectors in the middle stage, we get four 4×4 matrices that contain exactly the same numbers as the 3×3 matrices we obtained by multiplying the 3×3 kernel with each individual element in the input layer, with the extra slots filled with zeros. These four matrices can then be combined to get the final 4×4 output matrix:

Rearrange 16×1 Intermediate Vectors Into 4×4 Intermediate Matrices (Image by Author)
Combine All Four Intermediate Matrices Together To Yield The Final Result (Image by Author)

Thus, the operation is called "transposed" convolution because we performed exactly the same operation except that we transposed the convolution matrix!
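The same check can be run in NumPy: multiplying by the transposed convolution matrix gives exactly the scatter-and-sum result from earlier. The kernel and input values below are my own illustration:

```python
import numpy as np

def conv_matrix(k, n):
    """Zero-padded convolution matrix for an m x m kernel on an n x n input."""
    m = k.shape[0]
    o = n - m + 1
    C = np.zeros((o * o, n * n))
    for i in range(o):
        for j in range(o):
            window = np.zeros((n, n))
            window[i:i + m, j:j + m] = k
            C[i * o + j] = window.ravel()
    return C

k = np.arange(1., 10.).reshape(3, 3)
C = conv_matrix(k, 4)                 # 4 x 16
z = np.array([1., 2., 3., 4.])        # a flattened 2x2 input

up = (C.T @ z).reshape(4, 4)          # 16 x 4 matrix times 4 x 1 vector

# equivalent: scatter each value through the kernel and sum the overlaps
scatter = np.zeros((4, 4))
for idx, v in enumerate(z):
    i, j = divmod(idx, 2)
    scatter[i:i + 3, j:j + 3] += v * k
print(np.array_equal(up, scatter))    # True
```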


Section 2: What are the parameters (kernel size, strides, and padding) in Keras Conv2DTranspose?

1. Kernel Size

In convolutions, the kernel size affects how many numbers in the input layer you "project" to form one number in the output layer. The larger the kernel size, the more numbers you use, and thus each number in the output layer is a broader representation of the input layer and carries more information from it. But at the same time, using a larger kernel gives you an output with a smaller size. For example, a 4×4 input matrix with a 3×3 kernel yields a 2×2 output matrix, while a 2×2 kernel yields a 3×3 output matrix (if no padding is added):

(Image by Author)
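The shape arithmetic behind this example is easy to verify (no padding, stride 1):

```python
def conv_output_size(n, m):
    """Output size of a convolution with an m x m kernel on an n x n
    input, with no padding and a stride of 1."""
    return n - m + 1

print(conv_output_size(4, 3))  # 2 -> a 3x3 kernel yields a 2x2 output
print(conv_output_size(4, 2))  # 3 -> a 2x2 kernel yields a 3x3 output
```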

In transposed convolutions, when the kernel size gets larger, we "disperse" every single number from the input layer to a broader area. Therefore, the larger the kernel size, the larger the output matrix (if no padding is added):

(Image by Author)
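In shape arithmetic, the transposed case simply mirrors the convolutional one (again no padding, stride 1):

```python
def tconv_output_size(n, m):
    """Output size of a transposed convolution with an m x m kernel
    on an n x n input, with no padding and a stride of 1."""
    return n + m - 1

print(tconv_output_size(2, 3))  # 4 -> a 2x2 input grows back to 4x4
print(tconv_output_size(2, 4))  # 5 -> a larger kernel disperses further
```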

2. Strides

In convolutions, the strides parameter indicates how fast the kernel moves along the rows and columns of the input layer. If the stride is (1, 1), the kernel moves one row/column for each step; if the stride is (2, 2), the kernel moves two rows/columns for each step. As a result, the larger the strides, the faster you reach the end of the rows/columns, and therefore the smaller the output matrix (if no padding is added). Setting a larger stride also decreases the repetitive use of the same numbers.

(Image by Author)

In transposed convolutions, the strides parameter indicates how fast the kernel moves on the output layer, as explained by the picture below. Notice that the kernel always moves only one number at a time on the input layer. Thus, the larger the strides, the larger the output matrix (if no padding is added).

(Image by Author)
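Putting the stride into the shape formulas makes the opposite behaviors explicit (no padding in either case):

```python
def conv_out(n, m, s):
    """Convolution: a larger stride shrinks the output."""
    return (n - m) // s + 1

def tconv_out(n, m, s):
    """Transposed convolution: a larger stride enlarges the output."""
    return (n - 1) * s + m

print(conv_out(5, 3, 1), conv_out(5, 3, 2))    # 3 2
print(tconv_out(2, 3, 1), tconv_out(2, 3, 2))  # 4 5
```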

3. Padding

In convolutions, we often want to maintain the shape of the input layer, which we achieve through zero-padding. In Keras, the padding parameter can be one of two strings: "valid" or "same". When padding is "valid", no zero-padding is applied. When padding is "same", the input layer is padded with zeros so that the output shape equals the input shape divided by the stride (rounded up). When the stride equals 1, the output shape is the same as the input shape.

(Image by Author)

In transposed convolutions, the padding parameter can also take the two strings "valid" and "same". However, since we expand the input layer in transposed convolutions, choosing "valid" yields an output shape larger than the input shape. If "same" is used, the output shape is forced to become the input shape multiplied by the stride; if this is smaller than the "valid" output shape, only the very middle part of the output is kept.

(Image by Author)

An easier way to remember "valid" and "same" in both convolutions and transposed convolutions is:

  • "valid": no extra operation is performed. The output stays what it is meant to be.
  • "same": the output shape is the input shape divided by the stride (convolutions) or multiplied by the stride (transposed convolutions). When the stride is equal to 1, the output shape is always the same as the input shape.

Section 3: Build My Own Conv2D and Conv2DTranspose Layers From Scratch

Up to now, I have explained all the concepts about transposed convolutional layers and their important parameters. They may still feel very abstract, and I totally understand, because I also struggled a lot to understand how transposed convolutional layers work. But don’t worry, now we can get our hands dirty and build our own convolutional and transposed convolutional layers using the concepts we learned – this will definitely reveal the mystery of the transposed convolutional layers!

Let’s first start with Conv2D:
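Here is a minimal single-channel NumPy sketch of such a layer, following the steps described below. Restricting to square inputs, square kernels, and no bias are simplifications of mine:

```python
import numpy as np

def my_conv2d(X, W, strides=(1, 1), padding="valid"):
    """Naive Conv2D: slide the kernel W over the input X and take the
    element-wise product-and-sum at every position."""
    n, m, s = X.shape[0], W.shape[0], strides[0]
    if padding == "same":
        # pad with zeros so the output shape becomes ceil(n / s)
        o = -(-n // s)                       # ceiling division
        p = max((o - 1) * s + m - n, 0)
        X = np.pad(X, ((p // 2, p - p // 2), (p // 2, p - p // 2)))
        n = X.shape[0]
    o = (n - m) // s + 1
    out = np.zeros((o, o))
    for i in range(o):
        for j in range(o):
            # subset of the input with the same shape as the kernel
            X_sub = X[i * s:i * s + m, j * s:j * s + m]
            out[i, j] = (X_sub * W).sum()
    return out

X = np.arange(16.).reshape(4, 4)
W = np.ones((3, 3))
print(my_conv2d(X, W).shape)                    # (2, 2)
print(my_conv2d(X, W, padding="same").shape)    # (4, 4)
```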

Let’s go through my home-made Conv2D layer:

  • First, I defined the number of zero-paddings I need to add. If the padding is "valid", then I don’t need to add any padding. If the padding is "same", I calculate the number of paddings on each side of the input layer based on the formula:

p = ((o − 1) × s + m − n) / 2

Where:

  • o is the output size
  • s is the strides
  • m is the kernel size
  • n is the input size
  • p is the padding number on each side of the original input layer

This formula is derived from the formula for calculating the output shape:

o = (n + 2p − m) / s + 1

With the output shape of o = n / s.

  • Then, I padded the input by constructing a larger matrix filled with zeros and putting the original input in the middle.
  • After that, I calculated the output using the convolution operation. The convolution operation is performed between the kernel W and the subset of the input X_sub (which has the same shape as the kernel). The output indices i, j range from 0 to the last index at which the kernel still fits. For example, if X_padded has a shape of 4×4 and the kernel has a shape of 3×3, then the last index at which the kernel fits is 1 (a kernel placed at index 1 covers indices 1 through 3). A graphical explanation of the process can be found below:
Calculating the Output Using a 4×4 Input Matrix and a 3×3 Kernel. The Number Updated at Each Step in the Output is Bolded (Image by Author)

I also compared the results using my Conv2D with Keras Conv2D. The results are the same!

  • "valid" padding

  • "same" padding

Now let’s build the transposed convolutional layer:
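In the same spirit, here is a single-channel NumPy sketch of the transposed layer, again with my own simplifications (square shapes, no bias). For even crop amounts the "same" trimming below matches Keras; when the crop is odd, Keras may remove the extra row/column from a specific side, which this sketch does not reproduce exactly:

```python
import numpy as np

def my_conv2d_transpose(X, W, strides=(1, 1), padding="valid"):
    """Naive transposed convolution: disperse every input value through
    the kernel W, stepping over the output by the stride."""
    n, m, s = X.shape[0], W.shape[0], strides[0]
    o = (n - 1) * s + m                       # "valid" output size
    out = np.zeros((o, o))
    for i in range(n):                        # i, j walk along the input ...
        for j in range(n):
            i_prime, j_prime = i * s, j * s   # ... i_prime, j_prime the output
            out[i_prime:i_prime + m, j_prime:j_prime + m] += X[i, j] * W
    if padding == "same":
        # keep only the middle n*s x n*s block of the output
        lo = (o - n * s) // 2
        out = out[lo:lo + n * s, lo:lo + n * s]
    return out

X = np.array([[1., 2.],
              [3., 4.]])
W = np.ones((3, 3))
print(my_conv2d_transpose(X, W).shape)                  # (4, 4)
print(my_conv2d_transpose(X, W, padding="same").shape)  # (2, 2)
```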

Let’s again break down the code:

  • The output shape is first defined by the formula below:

o = (n − 1) × s + m

If you compare this to the formula for calculating the output shape of Conv2D, you will notice that in Conv2DTranspose both the strides and the kernel size have the opposite effect on the output shape.

  • To calculate the output, I use two pairs of indices: i, j move along the input, and i_prime, j_prime move along the output. When i, j change, i_prime, j_prime change with step sizes given by the strides. For example, if i changes from 0 to 1 and strides = (2, 2), then i_prime changes from 0 to 2. Each value in the input matrix is multiplied by all values in the kernel, and the results are recorded in the output matrix.
  • Then, I defined the amount of padding. Again, if the padding is "valid", I don’t need to modify anything. If the padding is "same", then the output shape has to be the input shape multiplied by the stride. That is:

desired output size = n × s

The padding has to convert the original output shape to the desired output shape:

o − 2p = n × s

And therefore an easy way to set the values of padding is:

p = (o − n × s) / 2

  • Lastly, if the padding is "same", the output is trimmed by selecting only the middle part that has the desired shape (the input shape multiplied by the stride).

A graphical explanation for the process of calculating the output is shown below:

(Image by Author)

Now we can verify our Conv2DTranspose function by comparing the results with Conv2DTranspose in Keras:

  • "valid" padding:

  • "same" padding:

The results are exactly the same!


Conclusion

Congratulations, you’ve made it to the end! As mentioned before, this notebook is based on my personal understanding, and I am self-taught in GANs and all other machine learning topics. Therefore, if you spot any mistakes, I’ll be glad to know!

