Revision of Deep Image Inpainting and Review: Patch-Based Image Inpainting with Generative Adversarial Networks

Welcome back guys :) Today, I would like to give a revision of the deep image inpainting methods we have talked about so far. I also want to review another image inpainting paper to consolidate our knowledge of deep image inpainting. Let’s learn and enjoy!

Recall

Here, let’s first briefly recall what we have learnt from the previous posts.

Context Encoder (CE) [1] is the first GAN-based inpainting algorithm in the literature. It emphasizes the importance of understanding the context of the entire image for the task of inpainting, and a (channel-wise) fully-connected layer is used to achieve this. You can click here to go to the previous post for details.

Multi-scale Neural Patch Synthesis (MNPS) [2] can be regarded as an improved version of CE. It consists of two networks, namely a content network and a texture network. The content network is a CE, and the texture network is a VGG-19 pre-trained for the object classification task. The idea of employing a texture network comes from the recent success of neural style transfer. Simply speaking, the neural responses in a network pre-trained for high-level vision tasks (e.g. object classification) contain information about image style. By encouraging similar neural responses inside and outside the missing regions, we can further enhance the texture details of the generated pixels, and hence the completed images look more realistic. Interested readers are highly recommended to skim through the post here for details.

Globally and Locally Consistent Image Completion (GLCIC) [3] is a milestone in the task of deep image inpainting. The authors adopt a Fully Convolutional Network (FCN) with Dilated Convolution (DilatedConv) as the framework of their proposed model. The FCN allows various input sizes, and DilatedConv replaces the (channel-wise) fully-connected layer previously used to understand the context of the entire image. In addition, two discriminators are used to distinguish completed images from real images at two scales: a global discriminator looks at the entire image while a local discriminator focuses on the local filled image patch. I highly recommend readers to take a look at the post here, especially for the dilated convolution in CNNs.


Today, we are going to review the paper, Patch-Based Image Inpainting with Generative Adversarial Networks [4]. It can be regarded as a variant of GLCIC, so we can also revise this typical network structure along the way.

Motivation

The authors of this paper would like to take advantage of residual connections and a PatchGAN discriminator to further improve their inpainting results.

Deep Residual Learning for Image Recognition (ResNet) [5] has achieved remarkable success in Deep Learning. By employing residual blocks (residual connections), we are able to train very deep networks and many papers have shown that residual learning is useful for obtaining better results.

PatchGAN [6] has also achieved great success in image-to-image translation. Compared to the discriminator in a typical GAN, the PatchGAN discriminator (refer to Figure 1 below) outputs a matrix (2D array) instead of just a single value. Simply speaking, the output of a typical GAN discriminator is a single value ranging from 0 to 1. This means that the discriminator looks at the entire image and decides whether this image is real or fake. If the image is real, it should give 1; if the image is fake (i.e. a generated image), it should give 0. This formulation focuses on the entire image, and hence local texture details of the image may be neglected. On the other hand, the output of the PatchGAN discriminator is a matrix, and each element in this matrix ranges from 0 to 1. Note that each element represents a local region in the input image, as shown in Figure 1. So, this time, the discriminator looks at multiple local image patches and has to judge whether each patch is real or not. By doing this, the local texture details of the generated images can be enhanced. This is the reason why PatchGAN is widely used in image generation tasks (a minimal code sketch is given after Figure 1).

Figure 1. PatchGAN discriminator. The output is a matrix and each element in the matrix represents a local region in the input image. If the local region is real, we should get 1, else 0. Extracted from [4]
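To make this concrete, below is a minimal PyTorch sketch of a PatchGAN-style discriminator, my own illustration with hypothetical layer widths rather than the architecture from the paper: a stack of strided convolutions ends in a 1-channel convolution, so the output is a grid of patch scores instead of a single scalar.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Toy PatchGAN-style discriminator: outputs a grid of real/fake scores."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1),   # H -> H/2
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),      # H/2 -> H/4
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),  # H/4 -> H/8
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 4, 1, 4, stride=2, padding=1),         # 1-channel score map
        )

    def forward(self, x):
        # Each element of the output map scores one local receptive field of x.
        return torch.sigmoid(self.net(x))

d = PatchDiscriminator()
scores = d(torch.randn(1, 3, 256, 256))
print(scores.shape)  # torch.Size([1, 1, 16, 16]) -> 256 patch scores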

Introduction

Image Inpainting can be regarded as a kind of image generation task. We would like to fill in the missing regions in an image (i.e. generate the missing pixels) such that the image is complete and realistic-looking.

To generate realistic-looking images, GANs are commonly used for different image generation tasks, including image inpainting. A typical GAN discriminator looks at the entire image and judges whether the input is real or not with a single value in [0, 1]. This kind of GAN discriminator is called a global GAN (G-GAN) in this paper.

On the other hand, PatchGAN looks at multiple local regions in the input and decides the realness of each local region independently, as mentioned in the previous section. Researchers have shown that the use of PatchGAN can further improve the visual quality of the generated images by focusing on more local texture details.

Solution (in short)

  • Residual blocks with dilated convolution (dilated residual blocks) are employed in the generator. (The authors expect that the inpainting results can be enhanced by residual learning.)
  • A mixture of PatchGAN and G-GAN discriminators (PGGAN) is proposed to encourage the output completed images to be both globally and locally realistic-looking. (Same intention as in GLCIC, which employs two discriminators, one global and one local.)

Contributions

  • Combination of PatchGAN and G-GAN discriminators (PGGAN) in which the early convolutional layers are shared. Their experimental results show that it can further enhance the local texture details of the generated pixels.
  • Dilated and interpolated convolutions are used in the generator network. The inpainting results have been improved by the use of the dilated residual blocks.

Approach

Figure 2. The proposed Generative ResNet architecture and PGGAN Discriminator. Extracted from [4]
Figure 3. The proposed architecture of GLCIC. Extracted from [3]

Figures 2 and 3 show the proposed network structure of this paper and of GLCIC, respectively. It is obvious that they are similar. The two main differences are that i) dilated residual blocks are used in the generator, and ii) the global and local discriminators in GLCIC are modified.

In GLCIC, the global discriminator takes the entire image as input while the local discriminator takes a sub-image around the filled region as input. The outputs of the two discriminators are concatenated, and then a single value is returned to indicate whether the input is real or fake (one adversarial loss). From this point of view, the local discriminator focuses on the local filled image patch, hence the local texture details of the filled patch can be enhanced. One main drawback is that the input to the local discriminator depends on the missing regions, and the authors assume a single rectangular missing region during training.

For the PGGAN discriminator, we have a few shared early convolutional layers, as shown in Figure 2. Then, we have two branches: one gives a single value as output (G-GAN) and one gives a matrix as output (PatchGAN). Note that 1×256 is a reshaped version of a 16×16 matrix. As mentioned, this is also a way to let the discriminator focus on both global (entire image) and local (local image patches) information when distinguishing completed images from real images. Note that we will have two adversarial losses, as we have two branches in this case. A rough sketch of this two-branch design is given below.
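Here is a PyTorch sketch of a PGGAN-style discriminator with a shared trunk, a global branch that outputs one value, and a patch branch that outputs a 16×16 score map (256 values). The layer counts and widths are my own assumptions; the exact configuration is in the paper.

```python
import torch
import torch.nn as nn

class PGGANDiscriminator(nn.Module):
    """Sketch of a PGGAN-style discriminator: shared trunk + global and patch branches."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        # Shared early convolutional layers (hypothetical depths/widths).
        self.shared = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1),   # 256 -> 128
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),      # 128 -> 64
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),  # 64 -> 32
            nn.LeakyReLU(0.2, inplace=True),
        )
        # PatchGAN branch: 1-channel score map, one score per local region.
        self.patch_branch = nn.Conv2d(base * 4, 1, 4, stride=2, padding=1)  # 32 -> 16
        # Global (G-GAN) branch: collapse to a single real/fake score.
        self.global_branch = nn.Sequential(
            nn.Conv2d(base * 4, base * 8, 4, stride=2, padding=1),  # 32 -> 16
            nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(base * 8, 1),
        )

    def forward(self, x):
        h = self.shared(x)
        patch_scores = torch.sigmoid(self.patch_branch(h))    # (N, 1, 16, 16) -> 256 values
        global_score = torch.sigmoid(self.global_branch(h))   # (N, 1) -> single value
        return global_score, patch_scores

d = PGGANDiscriminator()
g, p = d(torch.randn(2, 3, 256, 256))
print(g.shape, p.shape)  # torch.Size([2, 1]) torch.Size([2, 1, 16, 16])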

Dilated Residual Block

In [my previous post](https://towardsdatascience.com/a-milestone-in-deep-image-inpainting-review-globally-and-locally-consistent-image-completion-505413c300df), I introduced Dilated Convolution in CNNs. As a quick recap, dilated convolution increases the receptive field without adding extra parameters by skipping consecutive spatial locations. Readers who have forgotten this concept may want to revisit my previous post first.

Figure 4. Residual block types. From top to bottom: standard residual block, dilated residual block with dilated convolution first, dilated residual block with dilated convolution second. Extracted from [4]

Figure 4 shows different types of residual blocks. I would like to briefly describe the basic residual block, shown at the top of Figure 4, to ease our further discussion.

Simply speaking, a residual block can be formulated as Y = X + F(X), where Y is the output, X is the input, and F is a sequence of a few layers. In the basic residual block in Figure 4, F is Conv-Norm-ReLU-Conv. This means that we feed X to a convolutional layer followed by a normalization layer, a ReLU activation layer, and finally another convolutional layer to get F(X). One main point is that the input X is directly added to the output Y, and this is why we call it a skip connection. As there are no trainable parameters along this path, we can ensure that enough gradient is passed to the early layers during back-propagation. Therefore, we can train a very deep network without encountering the vanishing gradient problem.
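As a concrete illustration, here is a minimal PyTorch sketch of such a basic residual block; the channel width and the use of instance normalization are my own choices, not necessarily those of the paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: Y = X + F(X), with F = Conv-Norm-ReLU-Conv."""
    def __init__(self, channels=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        # The identity path carries x straight to the output, so gradients
        # can always flow back to earlier layers during back-propagation.
        return x + self.f(x)

y = ResidualBlock()(torch.randn(1, 64, 64, 64))
print(y.shape)  # torch.Size([1, 64, 64, 64])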

Why Residual Block?

You may wonder about the advantage of using residual blocks. Some of you may already know the answer. Let me give my views below.

Let’s compare Y = X + F(X) and Y = F(X). For Y = X + F(X), what we actually learn is F(X) = Y – X, the difference between Y and X. This is so-called residual learning, and X can be regarded as a reference for the residual learning. On the other hand, for Y = F(X), we directly learn to map the input X to the output Y without a reference. So, people think that residual learning is relatively easy. More importantly, many papers have shown that residual learning can bring better results!

As dilated convolution is useful for increasing the receptive field, which is important for the task of inpainting, the authors replace one of the two standard convolutional layers with a dilated convolutional layer, as shown in Figure 4. There are two types of dilated residual block: i) the dilated convolution is placed first, and ii) the dilated convolution is placed second. In this paper, the dilation rate is doubled starting from 1, based on the number of dilated residual blocks employed. For example, if there are 4 dilated residual blocks, the dilation rates are 1, 2, 4, 8. A sketch of the "dilation first" variant is given below.
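Below is a hedged PyTorch sketch of the "dilated convolution first" variant, together with the doubling dilation schedule (1, 2, 4, 8) mentioned above; again, the channel width and normalization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Dilated residual block ("dilation first" variant): DilatedConv-Norm-ReLU-Conv + skip."""
    def __init__(self, channels=64, dilation=1):
        super().__init__()
        self.f = nn.Sequential(
            # Dilated conv: padding = dilation keeps the spatial size with a 3x3 kernel.
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.f(x)

# Dilation rate doubles with each block: 1, 2, 4, 8 for four blocks.
blocks = nn.Sequential(*[DilatedResidualBlock(64, dilation=2 ** i) for i in range(4)])
print(blocks(torch.randn(1, 64, 64, 64)).shape)  # torch.Size([1, 64, 64, 64])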

Interpolated Convolution

To address the artifacts caused by standard deconvolution (i.e. transposed convolution), the authors adopt interpolated convolution in this work. For interpolated convolution, the input is first resized to the desired size using a typical interpolation method such as bilinear or bicubic interpolation. Then, a standard convolution is applied. Figure 5 below shows the difference between transposed convolution and interpolated convolution.

Figure 5. Visual comparison of results obtained by using transposed convolution (top) and interpolated convolution (bottom). Extracted from [4]
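To show the two up-sampling routes side by side, here is a minimal PyTorch sketch: an interpolated convolution (resize with bilinear interpolation, then a standard convolution) versus a plain transposed convolution, both doubling the spatial resolution. Kernel sizes and channel counts are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpolatedConv(nn.Module):
    """Up-sample by interpolation, then apply a standard convolution."""
    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)

    def forward(self, x):
        # Resize first (bilinear here; bicubic also works), then convolve.
        x = F.interpolate(x, scale_factor=self.scale, mode="bilinear", align_corners=False)
        return self.conv(x)

x = torch.randn(1, 128, 64, 64)
up_interp = InterpolatedConv(128, 64)(x)                                 # 64 -> 128
up_transposed = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)(x)   # 64 -> 128
print(up_interp.shape, up_transposed.shape)  # both torch.Size([1, 64, 128, 128])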

In my opinion, both types of convolution have similar performance. Sometimes transposed convolution is better, and sometimes interpolated convolution is better.

Discriminator Network

We have talked about the PGGAN discriminator used in this paper. To recall, the discriminator has two branches: one branch gives a single value, just like a global GAN (G-GAN), and the other branch gives 256 values, where each value represents the realness of a local region in the input.

Focusing on the realness of multiple local regions in the input is useful for improving the local texture details of the completed images.

Objective Function

Actually, the loss function (i.e. objective function) used in this paper is more or less the same as in the papers we have covered before.

Reconstruction loss: this loss ensures pixel-wise reconstruction accuracy. We usually employ the L1 or L2 (Euclidean) distance for this loss. This paper uses the L1 loss as its reconstruction loss,

N is the number of images in a training batch. W, H, and C are the width, height, and number of channels of the training images. x and y are the ground truth and the completed image given by the model, respectively.
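The equation image is not reproduced here, but from the definitions above the L1 reconstruction loss takes roughly the following form (my reconstruction; the exact normalization in the paper may differ slightly):

\mathcal{L}_{rec} = \frac{1}{N \cdot W \cdot H \cdot C} \sum_{n=1}^{N} \left\lVert x_n - y_n \right\rVert_1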

Adversarial loss: I think most of you are familiar with this typical adversarial loss by now.

x is the ground truth, so we want D(x) to return 1, and 0 otherwise. Note that D is just the function form of the discriminator.
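For reference (the original equation image is omitted), the usual GAN adversarial loss looks like the following, where G denotes the generator and z its input (for inpainting, the masked image); this is the standard formulation rather than a copy of the paper's exact equation:

\mathcal{L}_{adv} = \mathbb{E}_{x}\left[\log D(x)\right] + \mathbb{E}_{z}\left[\log\left(1 - D(G(z))\right)\right]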

Joint loss:

Equation 3 is their joint loss function. λ1, λ2, and λ3 are used to balance the importance of each loss. The g-adv term denotes the adversarial loss from the global (G-GAN) branch, while the p-adv term denotes the adversarial loss from the PatchGAN branch. Note that λ1, λ2, and λ3 are set to 0.995, 0.0025, and 0.0025 respectively in their experiments.
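Putting the pieces together, the joint objective described above can be written as follows (a sketch based on the description, with λ1 = 0.995, λ2 = 0.0025, λ3 = 0.0025):

\mathcal{L} = \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{g\text{-}adv} + \lambda_3 \mathcal{L}_{p\text{-}adv}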

Experimental Results

Three datasets were used in their experiments. i) Paris StreetView [7] contains 14,900 training images and 100 testing images. ii) Google StreetView has 62,058 high-resolution images divided into 10 parts. The first and tenth parts were used for testing, the ninth part for validation, and the rest for training; in total, there were 46,200 training images. iii) Places consists of more than 8 million training images. This dataset was used for testing only, to show generalizability.

To compare the performance of the typical residual block and the dilated residual block, the authors trained two models, namely PGGAN-Res and PGGAN-DRes. For PGGAN-Res, basic residual blocks and 3 sub-sampling blocks were used, meaning the input is down-sampled by a factor of two three times. For PGGAN-DRes, dilated residual blocks and 2 sub-sampling blocks were used, meaning the input is down-sampled by a factor of two twice.

Figure 6. Results from training the same generator network with different discriminator structures. Extracted from [4]

Figure 6 shows the inpainting results from training the same generator network with different discriminator structures. From the last column in Figure 6, poor local texture details of the window are observed if only the G-GAN discriminator is used. Compared to G-GAN, PatchGAN gives better local texture details of the window, but the corner of the window looks incoherent with the global structure. Overall, PGGAN offers the results with the best visual quality.

Table 1. Quantitative comparison on 256×256 images from Paris StreetView. Extracted from [4]
Table 2. Quantitative comparison on 512×512 images from Paris StreetView. Extracted from [4]

Tables 1 and 2 show the quantitative comparison of different approaches on the Paris StreetView dataset at two resolutions, 256×256 and 512×512. Note that CE is Context Encoder [1], NPS is Multi-scale Neural Patch Synthesis (MNPS) [2], and GLGAN is Globally and Locally Consistent Image Completion (GLCIC) [3]. We have covered all of these approaches in the previous posts.

From Tables 1 and 2, it is obvious that PGGAN offers an improvement in all these measures. But remember that visual quality is much more important than these objective evaluation metrics.

Figure 7. Perceptual comparison of completed images by using different approaches. Extracted from [4]

The authors performed a perceptual evaluation of the approaches, as shown in Figure 7. Twelve voters were asked to score the naturalness of the original images and the inpainting results of the various methods. Each voter was randomly assigned 500 images from the Paris StreetView dataset. Note that CE is trained on 128×128 images and hence performs poorly on 256×256 testing images. The other methods have similar performance in this perceptual evaluation.

Figure 8. Qualitative comparison on 256×256 Paris StreetView dataset. Extracted from [4]
Figure 9. Qualitative comparison on 512×512 Paris StreetView dataset. Extracted from [4]

Figures 8 and 9 show the inpainting results for images of size 256×256 and 512×512, respectively. I recommend readers zoom in for a better view of the results. In my opinion, PGGAN-DRes and PGGAN-Res generally give results with better local texture details; see, for example, the 4th row in Figure 8 and the 3rd row in Figure 9.

Conclusion

First, the concept of residual learning is embedded in the generator network in the form of dilated residual blocks. Their experimental results show that residual learning is useful for boosting the inpainting performance.

Second, the concept of PatchGAN discriminator is combined with the traditional GAN discriminator (G-GAN) to encourage both better local texture details and global structure consistency.

Takeaways

As before, I would like to list some useful points in this section. If you have followed my previous posts, you should find this post relatively simple.

Actually, most of the things in this paper are similar to GLCIC [3]. Two new concepts are embedded in the network architecture to further enhance the inpainting results, namely the residual block and the PatchGAN discriminator.

I hope that you are now familiar with this typical network architecture for image inpainting. The networks proposed in later inpainting papers are more or less the same.

You should also notice that the reconstruction loss and the adversarial loss are two fundamental losses for the image inpainting task. The methods proposed in later inpainting papers almost always include an L1 loss and an adversarial loss.

What’s Next?

This is my fourth post related to deep image inpainting. Up to now, we have covered almost all the basics of deep image inpainting, including the objective of image inpainting, the typical network architecture for inpainting, loss functions, difficulties in general image inpainting, and techniques for obtaining better inpainting results.

Starting from the next post, we will dive into more inpainting papers in which more specific techniques are designed for image inpainting. Assuming you already know the basics, I can spend much more time explaining those inpainting techniques. Enjoy! 🙂

References

  1. Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros, "Context Encoders: Feature Learning by Inpainting," Proc. Computer Vision and Pattern Recognition (CVPR), 27–30 Jun. 2016.
  2. Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li, "High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis," Proc. Computer Vision and Pattern Recognition (CVPR), 21–26 Jul. 2017.
  3. Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa, "Globally and Locally Consistent Image Completion," ACM Trans. on Graphics, Vol. 36, No. 4, Article 107, Publication date: July 2017.
  4. Ugur Demir and Gozde Unal, "Patch-Based Image Inpainting with Generative Adversarial Networks," arXiv preprint arXiv:1803.07422, 2018. https://arxiv.org/pdf/1803.07422.pdf
  5. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition," Proc. Computer Vision and Pattern Recognition (CVPR), 27–30 Jun. 2016.
  6. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," Proc. Computer Vision and Pattern Recognition (CVPR), 21–26 Jul. 2017.
  7. C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros. "What makes Paris look like Paris?," ACM Trans. on Graphics, Vol. 31, No. 4, Article 101, Publication date: July 2012.

Thanks for reading my post! If you have any questions, please feel free to ask or leave comments here. See you next time! 🙂

