
A Practical Generative Deep Image Inpainting Approach

Review: Free-Form Image Inpainting with Gated Convolution

Figure 1. Some free-form inpainting results by using DeepFill v2. Note that optional user sketch input is allowed for interactive editing. Image by Jiahui Yu et al. from their paper [1]

Hello guys! Welcome back! Today, we are going to dive into a very practical generative deep image inpainting approach named DeepFill v2. As mentioned in my previous post, this paper can be regarded as an enhanced version of DeepFill v1, Partial Convolution, and EdgeConnect. Simply speaking, the Contextual Attention (CA) layer proposed in DeepFill v1 and the concept of user guidance (optional user sketch input) introduced in EdgeConnect are embedded in DeepFill v2. Also, Partial Convolution (PConv) is upgraded to Gated Convolution (GConv), in which the rule-based mask update is replaced by a learnable gating that feeds the next convolutional layer. With these ideas, DeepFill v2 achieves higher-quality free-form inpainting than previous state-of-the-art methods. Figure 1 shows some free-form inpainting results produced by DeepFill v2; please zoom in for a better view. Let’s see how they combine all these techniques to achieve the state of the art.

Source code is available at [here].

Motivation

Recall that Partial Convolution (PConv) was proposed to separate valid and invalid pixels so that convolution results depend only on valid pixels, and that an edge generator was proposed to estimate the skeleton inside the missing region(s) as user guidance to further improve inpainting performance. The authors of this paper merge these techniques with their Contextual Attention (CA) layer to further enhance the inpainting results.

First, PConv employs a rule-based mask update to separate valid and invalid pixels. This rule-based update is hand-crafted and not learnable. Readers can refer to my previous post for a short review of PConv. Since the mask update in PConv is not learnable, the most straightforward way to improve it is to make it learnable.

Second, previous methods usually feed the masked image and the mask image into the generator network for completion. What if we also allow user sketch input as an additional condition for the task? Does the generator know how to distinguish the user sketch input from the mask image input? The simple answer: Gated Convolution (a learnable version of PConv) will do!

Introduction

Again, I assume that readers have a basic understanding of deep image inpainting from my previous posts. Actually, the network architecture and loss functions used by DeepFill v2 have been covered previously. If you need a refresher, please feel free to skim through my previous posts. In this introduction, I will briefly cover the points I think are less important; interested readers can refer to the paper for more details. This leaves more room to introduce the most important idea, Gated Convolution, in the following sections.

Network architecture. This paper (DeepFill v2) is an improved version of the authors’ previous work (DeepFill v1), so the network architecture is very similar, except that the standard convolutions are replaced by the proposed gated convolutions. We have introduced DeepFill v1 previously; I highly recommend you take a look at it [here]. Note that the most important idea of DeepFill v1 is the Contextual Attention (CA) layer, which allows the generator to use information from distant spatial locations to reconstruct local missing pixels. DeepFill v2 also follows a two-stage coarse-to-fine network structure: the first generator network is responsible for a coarse reconstruction, while the second generator network refines the coarse filled image.

Loss functions. Interestingly, only the two most standard loss terms are used to train the network, namely the L1 loss and the GAN loss. This is one of the selling points of this paper, as other state-of-the-art inpainting papers employ up to 5–6 loss terms to train their networks. I will talk about the GAN loss used in this paper very soon.

Solution (in short)

To further improve Partial Convolution for handling irregular masks, the authors propose Gated Convolution, which can be regarded as a learnable version of Partial Convolution. Apart from Gated Convolution, optional user sketch input is allowed to enhance the interactive editing ability of the proposed model. Lastly, similar to EdgeConnect, which we introduced in my last post, Spectral Normalization (SN) [2] is applied to the discriminator to stabilize the training process.

Approach

Figure 2. Overview of the network architecture of the proposed model for free-form image inpainting. Image by Jiahui Yu et al. from their paper [1]

Figure 2 shows the network architecture of the proposed DeepFill v2. As you can see, this is a two-stage coarse-to-fine network with gated convolutions. The coarse generator takes the masked image, the mask image, and an optional user sketch image as input for a coarse reconstruction of the missing regions. Then, the coarse filled image is passed to the second generator network for refinement. Note that the Contextual Attention (CA) layer proposed in DeepFill v1 is used in this refinement network. A minimal sketch of this two-stage data flow is given below.
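To make the data flow concrete, here is a minimal PyTorch-style sketch of the two-stage forward pass. The names coarse_net and refine_net and the 5-channel input layout (RGB masked image + binary mask + optional sketch) are my own simplifications for illustration, not the exact configuration of the official code:

```python
import torch

def two_stage_forward(coarse_net, refine_net, image, mask, sketch=None):
    # mask convention: 1 marks missing pixels, 0 marks known pixels
    if sketch is None:
        sketch = torch.zeros_like(mask)            # sketch channel is optional
    masked = image * (1.0 - mask)                  # zero out the holes
    x1 = torch.cat([masked, mask, sketch], dim=1)  # B x 5 x H x W
    coarse = coarse_net(x1)                        # stage 1: coarse reconstruction
    # paste the coarse prediction into the holes and refine it (stage 2, with CA layer)
    x2 = torch.cat([coarse * mask + masked, mask, sketch], dim=1)
    refined = refine_net(x2)
    # keep the known pixels and take the predicted pixels only inside the holes
    return refined * mask + image * (1.0 - mask)
```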

For the discriminator, the authors of this paper employ the famous PatchGAN structure proposed by [3]. We have also covered the idea of PatchGAN previously. You may skim through it again for a review [here]. Besides the employment of PatchGAN, the authors also apply Spectral Normalization (SN) [2] to each standard convolutional layer of the discriminator for the sake of training stability.
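As a rough illustration, an SN-PatchGAN-style discriminator can be built from strided convolutions wrapped in spectral normalization, ending in a map of patch scores rather than a single scalar. The channel widths and depth below are my own assumptions, not the exact configuration from the paper:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch):
    # strided convolution with spectral normalization applied to its weights
    return nn.Sequential(
        spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2)),
        nn.LeakyReLU(0.2, inplace=True),
    )

class SNPatchDiscriminator(nn.Module):
    def __init__(self, in_ch=5):  # e.g., image + mask + optional sketch channels
        super().__init__()
        self.body = nn.Sequential(
            sn_conv(in_ch, 64),
            sn_conv(64, 128),
            sn_conv(128, 256),
            sn_conv(256, 256),
            sn_conv(256, 256),
        )

    def forward(self, x):
        # output is a B x 256 x h x w map; every spatial position scores a local patch
        return self.body(x)
```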

Gated Convolution

Figure 3. Graphical illustration of Partial Convolution (left) and Gated Convolution (right). Image by Jiahui Yu et al. from their paper [1]

Figure 3 shows the difference between Partial Convolution and the proposed Gated Convolution. Simply speaking, a standard convolutional layer followed by a sigmoid activation function is used to update the mask, instead of the rule-based mask update in PConv. Note that after the sigmoid activation, all values fall into [0, 1], which indicates the importance (or validity) of each local area. For readers who want to know more about the rule-based mask update, please visit my previous post [here] for details. The output of a gated convolution is computed as

Gating(y, x) = ΣΣ W_g · I
Feature(y, x) = ΣΣ W_f · I
O(y, x) = φ(Feature(y, x)) ⊙ σ(Gating(y, x))

where σ is the sigmoid function, φ is any activation function (e.g., ELU), and ⊙ denotes element-wise multiplication. In other words, the output is the element-wise multiplication of the outputs of two standard convolutional layers, one followed by any activation function and the other followed by a sigmoid activation function.

Very straightforward! A standard convolutional layer followed by a sigmoid function acts as a soft gate that weights the output of the current convolutional layer before it is fed to the next one. Note that hard gating uses only 0 or 1 for the weighting, whereas soft gating uses values from 0 to 1, which is more flexible, and this operation is learnable.

So, you can see that the idea of Gated Convolution is very simple and easy to implement.
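To make this concrete, here is a minimal PyTorch sketch of a gated convolutional layer written from the description above; it is not copied from the official implementation, and the choice of ELU is just one possible activation:

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # two parallel standard convolutions: one for features, one for the gate
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        self.activation = nn.ELU()

    def forward(self, x):
        gating = torch.sigmoid(self.gate(x))        # learnable soft mask in [0, 1]
        return self.activation(self.feature(x)) * gating
```

In practice, the two convolutions can also be fused into a single convolution with twice the output channels that is split afterwards, which is a more efficient but equivalent formulation.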

Loss Function

The loss function used to train the model consists of two terms: a pixel-wise L1 reconstruction loss (L1 loss) and the SN-PatchGAN loss. Note that the two terms are balanced with a simple 1:1 weighting.

The SN-PatchGAN loss for the generator is very simple: it is the negative mean of the output of the SN-PatchGAN discriminator,

L_G = −E_z [ D^sn(G(z)) ],

while the discriminator itself is trained with a hinge loss,

L_D = E_x [ ReLU(1 − D^sn(x)) ] + E_z [ ReLU(1 + D^sn(G(z))) ].

This hinge formulation is also commonly used in many other GAN frameworks.
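Under the same assumptions as the earlier sketches, the hinge-style SN-PatchGAN losses and the 1:1 combination with the L1 term could look like the following, where d_real and d_fake are the patch score maps returned by the discriminator:

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    # discriminator: push real patch scores above +1 and fake patch scores below -1
    return torch.mean(F.relu(1.0 - d_real)) + torch.mean(F.relu(1.0 + d_fake))

def g_hinge_loss(d_fake):
    # generator: negative mean of the discriminator's patch scores on fake samples
    return -torch.mean(d_fake)

def inpainting_loss(pred, target, d_fake, l1_weight=1.0, gan_weight=1.0):
    # simple 1:1 balance between the pixel-wise L1 term and the SN-PatchGAN term
    return l1_weight * F.l1_loss(pred, target) + gan_weight * g_hinge_loss(d_fake)
```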

Experiments

Free-Form Mask Generation and Edge Map as User Sketch Input

The authors propose a method to generate free-form masks on the fly during training. I think the simplest way is to use their code directly at [here]. Interested readers can refer to their paper for details; a rough sketch of the idea is shown below.
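For readers who just want a feel for what such masks look like, here is a rough free-form mask generator in the spirit of the paper's random brush strokes. All the ranges below are my own choices for illustration; use the official code to reproduce the paper exactly:

```python
import numpy as np
import cv2

def random_free_form_mask(height=256, width=256, max_strokes=4,
                          max_vertices=8, max_length=60, max_width=20):
    # 1 marks the missing region, 0 marks known pixels
    mask = np.zeros((height, width), dtype=np.uint8)
    for _ in range(np.random.randint(1, max_strokes + 1)):
        x, y = int(np.random.randint(width)), int(np.random.randint(height))
        brush = int(np.random.randint(5, max_width))
        for _ in range(np.random.randint(1, max_vertices + 1)):
            # walk in a random direction and draw a thick line segment
            angle = np.random.uniform(0, 2 * np.pi)
            length = np.random.randint(10, max_length)
            nx = int(np.clip(x + length * np.cos(angle), 0, width - 1))
            ny = int(np.clip(y + length * np.sin(angle), 0, height - 1))
            cv2.line(mask, (x, y), (nx, ny), 1, brush)
            x, y = nx, ny
    return mask
```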

For the optional user sketch input, the authors make use of the HED edge detector [4] to generate an edge map as the sketch input. Note that the sketch input is optional. For readers who are interested in the user sketch input for interactive editing, I highly recommend reading their paper.

Similar to previous inpainting papers, the authors evaluate their model on the Places2 and CelebA-HQ datasets. These two datasets are commonly used in the task of deep image inpainting.

Quantitative Comparison

Table 1. Quantitative Comparison of various approaches on Places2 dataset with both rectangular mask and free-form mask. Data by Jiahui Yu et al. from their paper [1]

Table 1 lists two objective evaluation metrics for readers’ reference. As I mentioned in my previous posts, there is no good quantitative evaluation metric for the task of deep image inpainting. Hence, these numbers are only for reference, and you should focus on the visual quality of the filled images. As you can see, the proposed model offers the lowest L1 and L2 errors. Note that Global&Local, ContextAttention, and PartialConv have been introduced in my previous posts and readers may refer to them for a recap.

Qualitative Comparison

Figure 4. Qualitative comparison of various methods on Places2 and CelebA-HQ datasets. Image by Jiahui Yu et al. from their paper [1]

Figure 4 shows a qualitative comparison of different deep image inpainting approaches. Please zoom in for a better view of the quality of the filled images. It is obvious that the proposed model (GatedConv) outperforms all the other methods in terms of visual quality. You can see that the proposed method produces inpainting results without obvious color inconsistency.

Figure 5. An example of showing the advantage of using the user sketch input. Image by Jiahui Yu et al. from their paper [1]

Figure 5 shows that the proposed model can understand and make use of the user sketch input to further improve the inpainting result, compared to a previous method that does not allow user sketch input. Again, you can see that the proposed method gives a better inpainting result without color inconsistency.

Figure 6. An example of object removal by using different existing approaches. Image by Jiahui Yu et al. from their paper [1]

Figure 6 shows an object removal example and you can see that the proposed method can completely remove the objects with better visual quality.

Figure 7. Examples of inpainting results with user guidance. Image by Jiahui Yu et al. from their paper [1]

Figure 7 shows some examples of image inpainting with user sketch input as guidance. You can see how interactive editing can be achieved by the proposed method.

Similar to previous inpainting papers, the authors also did a user study to evaluate which method provides results with better visual quality. Interested readers can refer to the paper for the details.

Ablation Study of SN-PatchGAN

Figure 8. Ablation study of SN-PatchGAN. From left to right: Original image, Masked image, Results with one global GAN, Results with SN-PatchGAN. Image by Jiahui Yu et al. from their paper [1]

Figure 8 shows the ablation study of SN-PatchGAN. Compared to a standard single global GAN, using SN-PatchGAN brings better inpainting results. The authors claim that a simple combination of the L1 loss and the SN-PatchGAN loss can produce realistic inpainting results.

Conclusion

The main idea of this paper is Gated Convolution, a learnable version of Partial Convolution. We can implement Gated Convolution by using an extra standard convolutional layer followed by a sigmoid function, as shown in Figure 3. The use of Gated Convolution and SN-PatchGAN significantly improves inpainting results compared to existing inpainting methods. The authors also show how interactive editing can be achieved by allowing optional user sketch input. With the user sketch input, better and more meaningful inpainting results can be achieved. Hence, this is a very practical deep image inpainting approach.

Takeaways

I hope that you now understand Gated Convolution, the most important idea of this paper.

To all readers, we have gone through nearly all the common techniques for deep image inpainting, such as coarse-to-fine networks, contextual attention, gated convolution, partial convolution, PatchGAN, perceptual loss, style loss, etc. We have also covered both regular and irregular masks. As you can see, image inpainting has many possible applications. However, it is still very difficult to fill images with complicated scene structures and large missing areas. Hence, Extreme Image Inpainting will be a promising direction. Let’s learn and read more together :)

What’s Next?

I think it’s time to recall what we have covered previously. Perhaps, I will write a short summary of the papers we have read before for a revision! Thanks a lot.

References

[1] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas Huang, "Free-Form Image Inpainting with Gated Convolution," Proc. International Conference on Computer Vision (ICCV), 2019.

[2] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida, "Spectral Normalization for Generative Adversarial Networks," Proc. International Conference on Learning Representations (ICLR), 2018.

[3] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," Proc. Computer Vision and Pattern Recognition (CVPR), 2017.

[4] Saining Xie, and Zhuowen Tu, "Holistically-nested edge detection," Proc. International Conference on Computer Vision (ICCV), 2015.

Thanks for reading:) If you have any questions, please feel free to leave comments here or even send me an email. I am happy to hear from you and any suggestions are welcome. Hope to see you again next time!

