Pushing the Limits of Deep Image Inpainting Using Partial Convolutions

Image Inpainting for Irregular Holes Using Partial Convolutions

Chu-Tak Li
Towards Data Science


Figure 1. Some inpainting results by using Partial Convolutions. Image by Guilin Liu et al. from their paper [1]

Hi. Today, I would like to talk about a good deep image inpainting paper that overcomes some limitations of previous inpainting work. In short, most previous papers assume that the missing region(s) is/are regular (i.e. a rectangular hole at the center or multiple small rectangular holes), and this paper proposes a Partial Convolutional (PConv) layer to deal with irregular holes. Figure 1 shows some inpainting results produced with the proposed PConv. Are they good? Let’s grasp the main idea of PConv together!

Motivation

First of all, previous deep image inpainting approaches treat missing pixels and valid pixels the same: they fill all the missing pixels in an image with a fixed value (e.g. 255 before normalization or 1 after normalization) and then apply standard convolutions to the whole input image. There are two problems here. i) Is it appropriate to fix the values of the missing pixels to a pre-defined constant? ii) Is it suitable to convolve the input image regardless of the validity of the pixels? It may therefore be a better option to perform operations only on valid pixels.

Figure 2. Visual comparisons of previous deep inpainting methods trained by using regular masked images and the proposed Partial Conv. Image by Guilin Liu et al. from their paper [1]

Secondly, existing approaches assume that the missing region(s) is/are regular/rectangular. Some of them employ a local discriminator to distinguish generated content from real content. For the case of a fixed-size missing hole at the center, they feed the filled center region to the local discriminator to enhance the local texture details. But what about irregular missing areas? Is it possible to obtain fine local details without employing a discriminator? If you have tried applying existing approaches to irregularly masked images, you will find that the inpainting results are not satisfactory, as shown in Figure 2. A practical inpainting method should be able to handle irregularly masked images.

Introduction

Similar to my previous posts, I assume that readers have a basic understanding of deep image inpainting, such as network architectures, loss functions (e.g. L1 and adversarial losses), and related terminology (e.g. valid pixels, missing pixels, etc.). If needed, please have a quick look back at my previous posts. In this section, I will briefly go through the parts that I think are less important, so that we can spend more time on the most important idea of this paper, Partial Convolution, in the following sections.

The authors of this paper employ a U-Net-like network with skip connections in which all standard convolutional layers are replaced by the proposed partial convolutional layers. If you are interested in their network architecture, you may refer to the paper, which provides detailed tables describing the model.

Interestingly, no discriminator is used in this work. Apart from standard L1 loss and total variation loss (TV loss), the authors adopt two high-level feature losses to complete the masked images with fine textures. I will introduce these two losses in detail later on.

Solution (spotlight)

As mentioned in Motivation, the key idea is to separate the missing pixels from the valid pixels during convolutions such that the results of convolutions only depend on the valid pixels. This is the reason why the proposed convolution is named partial convolution. The convolution is partially performed on the input based on a binary mask image that can be updated automatically.

Approach

Partial Convolutional Layer

Let W and b be the weights and bias of a convolution filter. X represents the pixel values (or feature activation values) in the current convolution window and M is the corresponding binary mask, which indicates the validity of each pixel/feature value (0 for missing pixels and 1 for valid pixels). The proposed partial convolution is computed as,
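Since the equation image is not reproduced here, this is the formula from the paper written out in LaTeX, using the notation defined above:

```latex
x' =
\begin{cases}
W^{T}\,(X \odot M)\,\dfrac{\operatorname{sum}(\mathbf{1})}{\operatorname{sum}(M)} + b, & \text{if } \operatorname{sum}(M) > 0 \\[6pt]
0, & \text{otherwise}
\end{cases}
```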

where ⊙ denotes element-wise multiplication and 1 is a matrix of ones with the same shape as M. From this equation, you can see that the result of the partial convolution depends only on the valid input values (through X ⊙ M). The factor sum(1)/sum(M) rescales the result to compensate for the varying number of valid input values in each convolution window.

The binary mask is updated after each partial convolutional layer. The proposed update rule is quite simple: if the result of the current convolution is conditioned on at least one valid input value, the corresponding location is regarded as valid for the next partial convolutional layer.
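Again, written out in place of the missing equation image, the updated mask value at each location is:

```latex
m' =
\begin{cases}
1, & \text{if } \operatorname{sum}(M) > 0 \\
0, & \text{otherwise}
\end{cases}
```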

As you can see above, the update rule is simple to understand.
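To make the mechanics concrete, here is a minimal PyTorch sketch of a partial convolutional layer with the mask update. This is my own simplified reading of the paper (a single-channel mask and a square kernel are assumed), not the authors’ official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Simplified partial convolution: convolve X ⊙ M, rescale by
    sum(1)/sum(M), and update the binary mask."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel, used only to count valid pixels per window.
        self.register_buffer("ones_kernel",
                             torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding
        self.window_size = kernel_size * kernel_size  # sum(1) per window

    def forward(self, x, mask):
        # x: (N, C, H, W); mask: (N, 1, H, W), 1 = valid pixel, 0 = hole.
        with torch.no_grad():
            valid_count = F.conv2d(mask, self.ones_kernel,
                                   stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)                      # W^T (X ⊙ M) + b
        bias = self.conv.bias.view(1, -1, 1, 1)
        scale = self.window_size / valid_count.clamp(min=1)
        out = (out - bias) * scale + bias              # rescale, keep bias
        out = out * (valid_count > 0).float()          # zero out empty windows
        new_mask = (valid_count > 0).float()           # mask update rule
        return out, new_mask
```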

Figure 3. Graphical illustration of the proposed partial convolution. Image by author

Figure 3 shows a simple example illustrating the proposed partial convolution. We consider a simple 5×5 input with its corresponding 5×5 binary mask (1 for valid pixels and 0 for hole pixels) and a 3×3 filter W with fixed weights. Assume that we want the output to keep the same 5×5 size as the input, so we apply zero padding before performing the convolution. Let’s consider the top-left (orange-bounded) window first. X and M for this convolution are clearly shown in the figure and the number of valid input values is 3 (so the scaling factor is 9/3 = 3), hence the output at this location is -9+b. Also, the value at the corresponding location in the updated binary mask is 1, since there are 3 valid input values.

Considering the middle (purple-bounded) window, this time, as you can see, there is no valid input value for this convolution, so the result contains no contribution from the input (0+b in the figure; the paper’s formulation simply sets it to 0), and the updated mask value is also 0. The bottom-right (blue-bounded) window is another example that shows the role of the scaling factor: thanks to it, the network can distinguish a -3 computed from 3 valid input values from a -3 computed from 5 valid input values, since the two are rescaled differently.

For readers’ information, the updated binary mask after this partial convolutional layer is shown in the top-right corner of Figure 3. You can see that there are fewer zeros in the updated binary mask. As we perform more and more partial convolutions, the binary mask will eventually be updated to contain all ones. This means that we can control what information is passed forward inside the network, regardless of the size and shape of the missing regions.
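As a quick sanity check of this “holes eventually disappear” behaviour, the toy loop below (hypothetical sizes and hole location, reusing the PartialConv2d sketch above) shows the mask becoming all ones after enough layers:

```python
with torch.no_grad():
    x = torch.randn(1, 3, 64, 64)
    mask = torch.ones(1, 1, 64, 64)
    mask[:, :, 16:48, 16:48] = 0                  # a 32x32 hole in the middle
    layer = PartialConv2d(3, 3, kernel_size=3, padding=1)
    for step in range(30):
        x, mask = layer(x, mask)
        if mask.min() == 1:
            print(f"mask is fully valid after {step + 1} layers")
            break
```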

Loss Functions

In total, there are 4 loss terms in their final loss function, namely L1 loss, Perceptual loss, Style loss, and TV loss.

L1 loss (per-pixel loss)

This loss is for ensuring the pixel-wise reconstruction accuracy.
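The corresponding equation images are not reproduced here; written out, the two L1 terms from the paper are:

```latex
L_{hole} = \frac{1}{N_{I_{gt}}} \left\| (1 - M) \odot (I_{out} - I_{gt}) \right\|_{1},
\qquad
L_{valid} = \frac{1}{N_{I_{gt}}} \left\| M \odot (I_{out} - I_{gt}) \right\|_{1}
```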

where I_out and I_gt are the output of the network and the ground truth respectively, and M is the binary mask (0 for holes, 1 for valid pixels). N_I_gt is the total number of pixel values in the image, which equals C×H×W, where C is the number of channels (3 for an RGB image) and H and W are the height and width of I_gt. You can see that L_hole and L_valid are the L1 losses over the hole pixels and the valid pixels respectively.

Perceptual loss (VGG loss)

The perceptual loss was proposed by Gatys et al. [2] and we have introduced this loss previously. Simply speaking, we want the filled image and the ground truth image to have similar feature representations as computed by a pre-trained network such as VGG-16. Specifically, we feed the ground truth image and the filled image to a pre-trained VGG-16 to extract features, and then calculate the L1 distance between their feature maps at one or several layers.
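The equation image is not reproduced here; as far as I can reconstruct it from the paper, the perceptual loss sums the (normalized) L1 feature distances over the selected layers, for both the raw output and the composited output:

```latex
L_{perceptual} = \sum_{p} \frac{\left\| \Psi_{p}^{I_{out}} - \Psi_{p}^{I_{gt}} \right\|_{1}}{N_{\Psi_{p}^{I_{gt}}}}
               + \sum_{p} \frac{\left\| \Psi_{p}^{I_{comp}} - \Psi_{p}^{I_{gt}} \right\|_{1}}{N_{\Psi_{p}^{I_{gt}}}}
```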

In the above equation, I_comp is the same as I_out except that the valid pixels are directly replaced by the ground truth pixels. Ψ^I_p denotes the feature maps at the p-th layer of a pre-trained VGG-16 given the input I, and N_Ψ^I_p is the number of elements in Ψ^I_p. According to Gatys et al. [2], this perceptual loss is small when the completed image is semantically close to its ground truth image. This is presumably because deeper (higher-level) layers capture more semantic information about an image, so similar high-level feature representations indicate better semantic correctness of the completion. For readers’ information, the VGG-16 pool1, pool2, and pool3 layers are used to compute the perceptual loss.
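Below is a rough PyTorch sketch of this loss. It uses torchvision’s VGG16 and slices its `features` module at the first three pooling layers as stand-ins for pool1/pool2/pool3; the exact layer choices, the omitted ImageNet input normalization, and the equal weighting of layers are my assumptions for illustration:

```python
import torch.nn as nn
from torchvision import models

class VGGFeatures(nn.Module):
    """Extracts VGG16 features after pool1, pool2 and pool3 (frozen weights)."""
    def __init__(self):
        super().__init__()
        # Requires a recent torchvision (>= 0.13) for the weights enum.
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        # Indices 4, 9 and 16 are the first three MaxPool layers of VGG16.
        self.slices = nn.ModuleList([vgg[:5], vgg[5:10], vgg[10:17]])

    def forward(self, x):
        feats = []
        for s in self.slices:
            x = s(x)
            feats.append(x)
        return feats

def perceptual_loss(vgg_feats, i_out, i_comp, i_gt):
    # L1 distance between feature maps, for both I_out and I_comp vs I_gt.
    loss = 0.0
    for f_out, f_comp, f_gt in zip(vgg_feats(i_out), vgg_feats(i_comp), vgg_feats(i_gt)):
        n = f_gt.numel()
        loss = loss + (f_out - f_gt).abs().sum() / n + (f_comp - f_gt).abs().sum() / n
    return loss
```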

Style loss

Apart from the perceptual loss, the authors also adopt a style loss, which is again computed using the feature maps given by a pre-trained VGG-16. This time, we first calculate the auto-correlation of each feature map, which is called the Gram matrix in [2]. According to [2], the Gram matrix contains the style information of an image, such as textures and colours, which is why this loss is named the style loss. We then compute the L1 distance between the Gram matrices of the completed image and the ground truth image. Note that Ψ^I_p has shape (H_p×W_p)×C_p, so its Gram matrix has shape C_p×C_p. K_p is a normalizing factor that depends on the spatial size of the feature maps at the p-th layer.
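A sketch of the Gram matrix and the resulting style loss is shown below, reusing the VGGFeatures module from the previous snippet; the exact choice of normalization for K_p is an assumption on my part:

```python
import torch

def gram_matrix(feat):
    # feat: (N, C, H, W) -> Gram matrix of shape (N, C, C).
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    # Normalize by C*H*W (playing the role of K_p); other choices are possible.
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(vgg_feats, i_out, i_comp, i_gt):
    # L1 distance between Gram matrices, for both I_out and I_comp vs I_gt.
    loss = 0.0
    for f_out, f_comp, f_gt in zip(vgg_feats(i_out), vgg_feats(i_comp), vgg_feats(i_gt)):
        g_gt = gram_matrix(f_gt)
        loss = loss + (gram_matrix(f_out) - g_gt).abs().mean() \
                    + (gram_matrix(f_comp) - g_gt).abs().mean()
    return loss
```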

Total Variation loss

The last loss term in their final loss function is the TV loss. We have talked about this loss in my previous posts. Simply speaking, this loss is adopted to ensure the smoothness of the completed images. This is also a common loss in many image processing tasks.
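The equation image is not reproduced here; to the best of my reconstruction from the paper, the TV term penalizes differences between neighbouring pixels of I_comp within a 1-pixel dilation of the hole region P:

```latex
L_{tv} = \sum_{(i,j)\in P,\,(i,j+1)\in P} \frac{\left\| I_{comp}^{i,j+1} - I_{comp}^{i,j} \right\|_{1}}{N_{I_{comp}}}
       + \sum_{(i,j)\in P,\,(i+1,j)\in P} \frac{\left\| I_{comp}^{i+1,j} - I_{comp}^{i,j} \right\|_{1}}{N_{I_{comp}}}
```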

where N_I_comp is the total number of pixel values in I_comp.

Final loss

This is the final loss function used to train the proposed model. The hyper-parameters controlling the importance of each loss term were set based on experiments on 100 validation images.
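For reference, and as far as I can tell from the paper, the chosen weights combine the terms as follows (with separate style terms for I_out and I_comp):

```latex
L_{total} = L_{valid} + 6\,L_{hole} + 0.05\,L_{perceptual}
          + 120\,\left(L_{style(out)} + L_{style(comp)}\right) + 0.1\,L_{tv}
```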

Ablation study

Figure 4. Inpainting results using different loss terms. (a) Input image (b) Result without style loss (c) Result using full loss (d) ground truth (e) Input image (f) Result using a small style loss weight (g) Results using full loss (h) ground truth (i) Input image (j) Result without perceptual loss (k) Result using full loss (l) ground truth. Please zoom in for a better view. Image by Guilin Liu et al. from their paper [1]

The authors carried out experiments to show the effects of the different loss terms; the results are shown in Figure 4 above. First, Figure 4(b) shows the inpainting result without the style loss. They found that the style loss is necessary in their model for generating fine local textures. However, the hyper-parameter for the style loss has to be carefully selected: as you can see in Figure 4(f), a small weight for the style loss causes obvious artifacts compared to the result using the full loss (Figure 4(g)). Apart from the style loss, the perceptual loss is also important. They found that it reduces grid-shaped artifacts; please see Figure 4(j) and (k) for its effect.

In fact, the use of high-level feature losses has not been fully studied. We cannot say with certainty that the perceptual loss or the style loss is always useful for image inpainting, so we have to run our own experiments to check the effectiveness of different loss terms for our desired applications.

Experiments

Figure 5. Some examples of mask images. 1, 3, 5 are with border constraint while 2, 4, 6 are without border constraint. Image by Guilin Liu et al. from their paper [1]

In their experiments, all mask, training, and testing images are of size 512×512. The authors divided the testing images into two groups: i) masks with holes close to the image border, and ii) masks without holes close to the border. An image is put into the second group if all of its holes are at least 50 pixels away from the border. Figure 5 shows some examples of these two groups of masks. Furthermore, the authors generated 6 categories of masks according to the hole-to-image area ratio: (0.01, 0.1], (0.1, 0.2], (0.2, 0.3], (0.3, 0.4], (0.4, 0.5], and (0.5, 0.6]. This means that the largest masks remove up to 60% of the original image content.

Training data. Similar to previous work, the authors evaluated their model on 3 publicly available datasets, namely, ImageNet, Places2 and CelebA-HQ datasets.

Figure 6. Visual comparisons of different approaches on ImageNet. (a) Input image (b) PatchMatch (c) GLCIC (d) Contextual Attention (e) PConv (f) Ground truth. Image by Guilin Liu et al. from their paper [1]
Figure 7. Visual comparisons of different approaches on Places2. (a) Input image (b) PatchMatch (c) GLCIC (d) Contextual Attention (e) PConv (f) Ground truth. Image by Guilin Liu et al. from their paper [1]

Figures 6 and 7 show visual comparisons of different approaches on ImageNet and Places2 respectively. PatchMatch is a state-of-the-art conventional approach, while GLCIC and Contextual Attention are two state-of-the-art deep learning approaches we have introduced before. As you can see, GLCIC (c) and Contextual Attention (d) cannot offer inpainting results with good visual quality. This may be due to the fact that these two previous deep learning approaches were trained on regularly masked images rather than irregularly masked ones. If you are interested, please zoom in for a better view of the inpainting results.

Figure 8. Visual comparisons of different approaches on CelebA-HQ. (a) Input image (b) Contextual Attention (c) PConv (d) Ground truth. Image by Guilin Liu et al. from their paper [1]

Figure 8 shows the inpainting results on CelebA-HQ dataset. You may zoom in for a better view of the results.

Table 1. Quantitative comparisons of various methods. The 6 column groups represent the 6 mask ratio ranges. N means no border constraint (i.e. holes can be close to the border), B means border constraint (i.e. no holes close to the border). Data by Guilin Liu et al. from their paper [1]

Table 1 lists several objective evaluation metrics for readers’ reference. Clearly, the proposed PConv gives the best numbers in nearly all cases. Note that IScore is the Inception score, used here as an estimate of visual quality; the lower it is, the better the estimated visual quality.

Apart from the qualitative and quantitative comparisons, the authors also conducted a human subjective study to evaluate the visual quality of different approaches. Interested readers may refer to the paper for the study.

Limitation

Figure 9. Inpainting results by PConv when the missing holes are larger and larger. Image by Guilin Liu et al. from their paper [1]
Figure 10. Some failure cases especially when the scenes are much more complicated. Image by Guilin Liu et al. from their paper [1]

At the end of the paper, the authors also mention some limitations of current deep image inpainting approaches. First, it is difficult to complete an image with a very large missing area, as shown on the right of Figure 9. Second, when the image contains complex structures, it is also difficult to complete it with good visual quality, as shown in Figure 10. There is still no comprehensive method for handling extremely large masks and complicated scenes, so you may try to propose a good solution to this extreme image inpainting problem. :)

Conclusion

Obviously, Partial Convolution is the main idea of this paper. I hope my simple example explains clearly how the partial convolution is performed and how the binary mask is updated after each partial convolutional layer. With Partial Convolution, the result of a convolution depends only on valid pixels, hence we can control what information is passed inside the network, and this can be useful for image inpainting (at least the authors provide evidence that partial convolution is useful in their case). Apart from image inpainting, the authors have also extended partial convolution to the task of super-resolution, as it shares similarities with image inpainting. Interested readers are highly recommended to refer to their paper.

Takeaways

Without a doubt, I hope you now understand what partial convolution is. Starting from this paper, later deep image inpainting methods can deal with both regular and irregular masks. Together with my previous posts related to image inpainting, I also hope you have a better understanding of the field, including common techniques such as dilated convolution and the contextual attention layer, as well as the remaining challenges. For example, it is still difficult to fill in an image when the hole(s) are too large or the image has complex structure.

What’s Next?

Next time, we will look at another paper that makes use of additional information to help fill in the masked images. Hope you enjoy it! Let’s learn together! :)

References

[1] Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro, “Image Inpainting for Irregular Holes Using Partial Convolutions,” Proc. European Conference on Computer Vision (ECCV), 2018.

[2] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, “A Neural Algorithm of Artistic Style,” arXiv preprint arXiv:1508.06576, 2015.

Thanks for reading my post! If you have any questions, please feel free to send me an email or leave comments here. Any suggestions are welcome. Thank you very much again and hope to see you next time! :)


