
What if multiple receptive fields are used for Image Inpainting?

Review: Image Inpainting via Generative Multi-column Convolutional Neural Networks

Hello guys! Long time no see! Today, we are going to talk about another inpainting paper called Image Inpainting via Generative Multi-column CNNs (GMCNN). The network architecture used in this paper is similar to those of the papers we have introduced before. The main contributions of this paper are several modifications to the loss function.

Short Recall

As mentioned in my previous posts, how to make use of the information given by the remaining pixels in an image is crucial to superior Image Inpainting. A very straightforward approach to image inpainting is to directly copy the most similar image patches found within the image itself and paste them onto the missing areas. Interestingly, we should realize that there is no "correct" answer for the missing areas in practice. In reality, given a corrupted/masked image, you cannot know the original image (ground truth) for comparison. So, there are many plausible answers for the missing areas.

Introduction and Motivation

From previous inpainting papers, we know that the receptive field is important to the task of image inpainting. For a 3×3 kernel, we can adjust the dilation rate to control its receptive field. If the dilation rate is 1, we have a 3×3 receptive field. If the dilation rate is 2, we have a 5×5 receptive field by skipping one neighboring pixel, and so on (see the short sketch below). You may refer to my previous post to find more details. Here, what if we employ 3×3, 5×5, and 7×7 kernels with dilated convolutions? This is defined as a multi-column structure in this paper.
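As a quick illustration, here is a minimal PyTorch sketch (my own, not from the paper) of how the kernel size and dilation rate together determine the single-layer receptive field:

```python
import torch
import torch.nn as nn

# Minimal sketch: kernel size and dilation both enlarge the receptive field.
x = torch.randn(1, 3, 64, 64)  # dummy input: batch, channels, H, W

conv3_d1 = nn.Conv2d(3, 8, kernel_size=3, dilation=1, padding=1)  # 3x3 receptive field
conv3_d2 = nn.Conv2d(3, 8, kernel_size=3, dilation=2, padding=2)  # 5x5 receptive field
conv7_d2 = nn.Conv2d(3, 8, kernel_size=7, dilation=2, padding=6)  # 13x13 receptive field

for conv in (conv3_d1, conv3_d2, conv7_d2):
    k, d = conv.kernel_size[0], conv.dilation[0]
    rf = d * (k - 1) + 1  # effective receptive field of one layer
    print(conv(x).shape, f"single-layer receptive field: {rf}x{rf}")
```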

In my previous post on the contextual attention layer, the search for the image patches most similar to the missing areas is embedded in the generator network (i.e. it is used in both the training and testing stages). In this work, this search is needed only during training, through a newly designed loss term.

Due to the fact that there is no "correct" answer to the missing areas, a pixel-wise reconstruction accuracy loss term (i.e. the _L_1 loss) seems inappropriate for image inpainting. The authors proposed to weight the _L_1 loss term based on the spatial locations of the missing pixels. Spatial locations that are close to the valid (remaining) pixels should have higher weights in the _L_1 loss, as they have more reasonable references for the reconstruction, and vice versa.

Solution (in short) and Contributions

Figure 1. Some inpainting results given by the proposed method. Image by Yi Wang et al. from their paper [1]

In my opinion, this paper follows the trend in image inpainting that we have covered previously. First, the authors adopt a multi-branch CNN with dilated convolutions instead of a single-branch one. Three different kernel sizes are used in the three branches to achieve various receptive fields and to extract features at different resolutions.

Second, two new loss terms are introduced to train the network, namely the confidence-driven reconstruction loss and the implicit diversified Markov Random Field (ID-MRF) loss. The confidence-driven reconstruction loss is a weighted _L_1 loss, while the ID-MRF loss compares feature patches computed by a pre-trained VGG network. We have talked about the MRF loss in a previous post; you may refer to it for a brief recall.

Figure 1 shows some inpainting results by the proposed method. You may zoom in for a better view of these high quality results.

Approach

Figure 2. The proposed network architecture. Image by Yi Wang et al. from their paper [1]

Figure 2 shows the network architecture of the proposed Generative Multi-column Convolutional Neural Networks (GMCNN). As you can see, there is one multi-column generator network, two discriminators (global and local), and a pre-trained VGG19 for calculating the ID-MRF loss.

There are three columns in the generator network, and each column uses filters of a different size, namely 3×3, 5×5 and 7×7. Note that the outputs of the three columns are concatenated and fed to two further convolutional layers to obtain the completed image. A rough sketch of this multi-column design is given below.
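To make the idea concrete, here is a minimal PyTorch sketch of a multi-column generator. It is only an illustration under my own assumptions (channel counts, ELU activations, a 4-channel input of the masked image concatenated with the mask), not the authors' exact architecture, which uses deeper encoder-decoder branches with dilated convolutions:

```python
import torch
import torch.nn as nn

class MultiColumnGenerator(nn.Module):
    """Sketch of the multi-column idea: three branches with different kernel sizes,
    concatenated and fused by two more convolutional layers."""
    def __init__(self, in_ch=4, feat=32):  # 4 = masked RGB image + binary mask (assumed)
        super().__init__()
        # one branch per kernel size; padding keeps the spatial resolution fixed
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, feat, k, padding=k // 2), nn.ELU(),
                nn.Conv2d(feat, feat, k, padding=k // 2), nn.ELU(),
            )
            for k in (3, 5, 7)
        ])
        # the concatenated branch outputs go through two more conv layers
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * feat, feat, 3, padding=1), nn.ELU(),
            nn.Conv2d(feat, 3, 3, padding=1),  # completed RGB image
        )

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(feats)

out = MultiColumnGenerator()(torch.randn(1, 4, 256, 256))
print(out.shape)  # torch.Size([1, 3, 256, 256])
```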

ID-MRF Regularization

Figure 3. Inpainting results using different similarity measures to search the nearest neighbors. (a) Inpainting results using cosine similarity (b) Inpainting results using the proposed relative similarity (c) Ground truth image (red rectangle highlights the filled region). Image by Yi Wang et al. from their paper [1]

Simply speaking, for the MRF objective, we would like to minimize the difference between the generated feature patches and their nearest-neighbor feature patches from the ground truth, both computed by a pre-trained network. In most previous work, the cosine similarity measure has been employed to search for the nearest neighbors (you may read my previous post for a review of the cosine similarity measure). However, this similarity measure tends to assign the same nearest neighbor to many different generated feature patches, which results in blurry inpainting as shown in Figure 3(a).

To avoid blurry completed images that may be caused by the cosine similarity measure, the authors adopt a relative distance measure; the inpainting results are shown in Figure 3(b). You can see that the completed image has finer local textures.

Let’s talk about how they compute the relative distance measure. Let Y(hat)_g be the generated content for the missing areas, and let Y(hat)^L_g and Y^L be the features of the generated content and of the ground truth at the L-th layer of a pre-trained network. For feature patches v and s extracted from Y(hat)^L_g and Y^L respectively, the relative similarity from v to s is computed as follows.
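(The equation is shown as an image in the original post; the following is my own transcription from the paper and its exact form may differ slightly.)

$$RS(v, s) = \exp\!\left(\left(\frac{\mu(v, s)}{\max_{r \in \rho_s(Y^L)} \mu(v, r) + \epsilon}\right) \Big/ h\right), \qquad \rho_s(Y^L) = Y^L \setminus \{s\}$$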

where mu(·, ·) is the cosine similarity, r ranges over Y^L excluding s, and h and epsilon are positive constants. Clearly, if v is more similar to s than to the other feature patches, RS(v, s) will be large. Conversely, if v has two similar patches s and r, then RS(v, s) will be small. This encourages the generated patches to find distinctive similar patches outside the missing regions.

Then, RS is normalized as follows.
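(Again reconstructing from the paper, with the same hedging and notation as above.)

$$\overline{RS}(v, s) = \frac{RS(v, s)}{\sum_{r \in \rho_s(Y^L)} RS(v, r)}$$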

Finally, the proposed ID-MRF loss between Y(hat)^L_g and Y^L is calculated as follows.
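(As before, this is my reconstruction of the equation from the paper and should be read as a sketch rather than the exact formula.)

$$\mathcal{L}_M(L) = -\log\left(\frac{1}{Z}\sum_{s \in Y^L} \max_{v \in \hat{Y}^L_g} \overline{RS}(v, s)\right)$$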

where max RS(bar)(v, s) means that s is the nearest neighbor of v, and Z is a normalization factor. If we consider the extreme case in which all generated feature patches are close to one particular feature patch s, then max RS(bar)(v, r) will be small for the other patches r, and thus the ID-MRF loss will be large.

On the other hand, if each r in Y^L has its own nearest neighbor in Y(hat)^L_g, then max RS(bar)(v, r) will be large and thus the ID-MRF loss will be small. The main idea is to guide the generated feature patches towards different nearest neighbors, so that the generated features exhibit richer local textures.

As in previous work, the authors use a pre-trained VGG19 to calculate the ID-MRF loss. Note that the middle layers conv3_2 and conv4_2 represent the structural and semantic features respectively.
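For readers who want to play with this, here is a minimal PyTorch/torchvision sketch of extracting the conv3_2 and conv4_2 features. The layer indices (12 and 21) are my assumption based on torchvision's layer ordering for VGG19 without batch norm, not something taken from the paper's code:

```python
import torch
import torchvision.models as models

# Sketch: extract conv3_2 / conv4_2 features from a fixed pre-trained VGG19
# (on older torchvision versions use models.vgg19(pretrained=True) instead).
vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # the VGG is fixed; only the generator would be trained

def extract_features(x, layer_ids=(12, 21)):  # 12 ~ conv3_2, 21 ~ conv4_2 (assumed indices)
    feats = {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layer_ids:
            feats[i] = x
        if i >= max(layer_ids):
            break  # no need to run the deeper layers
    return feats

feats = extract_features(torch.randn(1, 3, 256, 256))
print({i: tuple(f.shape) for i, f in feats.items()})
```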

The authors point out that this loss involves nearest-neighbor searching only during the training stage. This is different from methods which search for the nearest neighbors during the testing stage as well.

Spatial Variant Reconstruction Loss

The proposed spatially variant reconstruction loss is actually a weighted _L_1 loss. There are many ways to decide the weights; the authors convolve the mask with a Gaussian filter to create a weight mask for calculating the weighted _L_1 loss. Interested readers may refer to the paper for details. The main idea of the weighted _L_1 loss is that missing pixels close to the valid pixels are more strongly constrained than missing pixels far away from the valid pixels. Hence, the missing pixels located at the center of the missing region should have lower _L_1 loss weights (i.e. they are less constrained). A rough sketch of one possible implementation is given below.
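This is only my own rough sketch of the idea, assuming a binary mask where 1 marks missing pixels; the exact weighting scheme in the paper differs in its details:

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=17, sigma=4.0):
    """Simple 2-D Gaussian kernel for blurring the mask."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)

def confidence_weights(mask, iterations=5):
    """mask: (N,1,H,W), 1 inside the hole, 0 for valid pixels.
    Repeatedly blur the confidence of the valid region so that it propagates inward;
    missing pixels near the hole boundary end up with larger weights."""
    k = gaussian_kernel()
    weights = torch.zeros_like(mask)
    known = 1.0 - mask                       # confidence of the valid region
    for _ in range(iterations):
        known = F.conv2d(known, k, padding=k.shape[-1] // 2)
        weights = weights + known * mask     # only accumulate inside the hole
        known = known * mask + (1.0 - mask)  # keep valid pixels at confidence 1
    return weights

def weighted_l1(pred, target, mask):
    w = confidence_weights(mask)
    return (w * (pred - target).abs()).mean()

# toy usage
pred, target = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[:, :, 16:48, 16:48] = 1.0   # a square hole in the middle
print(weighted_l1(pred, target, mask).item())
```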

Adversarial Loss

Similar to the previous work, the authors employ the improved WGAN loss and both local and global discriminators. Again, interested readers are highly recommended to read the paper.

Final Loss Function

The final loss function used to train the proposed model combines the three loss terms above. Similar to most inpainting papers, the weight of the weighted _L_1 loss (the first loss term) is 1, and lambda_mrf and lambda_adv are the parameters that control the importance of the ID-MRF regularization and the adversarial training.
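(The loss function is shown as an image in the original post; based on the description above it should take roughly this form, where L_rec denotes the confidence-driven reconstruction loss.)

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda_{mrf}\,\mathcal{L}_{mrf} + \lambda_{adv}\,\mathcal{L}_{adv}$$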

Experiments

The authors evaluate their method on 5 public datasets, namely the Paris StreetView, Places2, ImageNet, CelebA, and CelebA-HQ datasets. During training, all the images are resized to 256×256 with the largest center hole of size 128×128. For your information, their generator network has 12.562M parameters. At test time, it takes around 49.37 ms and 146.11 ms per image on GPU for images of size 256×256 and 512×512 respectively.

Figure 4. Qualitative comparisons on Paris StreetView (top) and ImageNet (bottom). (a) Input image (b) Context Encoder (c) MSNPS (d) Contextual Attention (e) Proposed method. Image by Yi Wang et al. from their paper [1]

Figure 4 shows the qualitative comparisons on Paris StreetView and ImageNet datasets. Please zoom in for a better view of the inpainting results. It is clear that the proposed method, GMCNN, gives the inpainting results with the best visual quality. If you are interested in more inpainting results, please refer to the paper or their project website.

Table 1. Quantitative results on the five datasets. Data by Yi Wang et al. from their paper [1]

As mentioned in my previous posts and at the beginning of this post, PSNR is related to the pixel-wise reconstruction accuracy which may not be appropriate for evaluating image inpainting. Researchers still report PSNR and SSIM for readers’ reference as these numerical metrics are fundamental for all Image Processing tasks. As you can see in Table 1, the proposed method achieves comparable or even better PSNR and SSIM on the five datasets.

Ablation Study

Table 2. Quantitative results of different network structures on the Paris StreetView dataset. Data by Yi Wang et al. from their paper [1]
Figure 5. Qualitative comparisons of different network structures on the Paris StreetView dataset. (a) Input image (b) Single encoder-decoder (c) coarse-to-fine (d) GMCNN with fixed receptive field in all the 3 branches (e) GMCNN with varied receptive fields. Image by Yi Wang et al. from their paper [1]

The authors evaluate the performance of different network structures used in the task of image inpainting. We have covered the encoder-decoder structure and the coarse-to-fine structure. For the coarse-to-fine structure in their experiments, no contextual attention is employed. For the GMCNN with a fixed receptive field in all three branches, they employ 5×5 filters in every branch. For the GMCNN with varied receptive fields, 3×3, 5×5, and 7×7 filters are used in the three branches respectively. The quantitative and qualitative results are shown in Table 2 and Figure 5 respectively. Obviously, the GMCNN with varied receptive fields provides the best inpainting results.

Apart from the choice of the network architecture and the employment of multiple receptive fields, the authors also study the effectiveness of the two proposed loss terms, namely confidence-driven reconstruction loss and the ID-MRF loss.

Figure 6. Qualitative comparisons of different reconstruction losses on the Paris StreetView dataset. (a) Input image (b) Spatial discounted loss (c) Proposed confidence-driven reconstruction loss. Image by Yi Wang et al. from their paper [1]

Figure 6 shows the visual comparison of two reconstruction losses, namely the spatial discounted loss and the proposed confidence-driven reconstruction loss. Note that the spatial discounted loss derives the weight mask directly from the spatial locations of the missing pixels, while the proposed confidence-driven reconstruction loss derives it by convolving the mask image multiple times with a Gaussian filter. The authors claim that their confidence-driven reconstruction loss works better. From my own experience, the two reconstruction losses perform quite similarly. Perhaps you may give it a try yourself.

Table 3. Quantitative results of using different lambda_mrf on the Paris StreetView dataset. Data by Yi Wang et al. from their paper [1]
Figure 7. Qualitative comparisons of using ID-MRF loss or not on the Paris StreetView dataset. (a) Input image (b) Inpainting results using ID-MRF loss (c) Inpainting results without using ID-MRF loss. Image by Yi Wang et al. from their paper [1]
Figure 8. Qualitative comparisons of using ID-MRF loss with different lambda_mrf on the Paris StreetView dataset. (a) Input image (b) lambda_mrf = 2 (c) lambda_mrf = 0.2 (d) lambda_mrf = 0.02 (e) lambda_mrf = 0.002. Image by Yi Wang et al. from their paper [1]

More importantly, the ID-MRF loss term is the strongest claim of this paper. The authors therefore demonstrate the importance of this loss term, with quantitative results listed in Table 3. Figure 7 shows the difference between models trained with and without the ID-MRF loss. We can see that the ID-MRF loss enhances the local details of the generated pixels. Additionally, Figure 8 shows the effect of using different values of lambda_mrf to control the importance of the ID-MRF loss. You can zoom in for a better view of the results; personally, I find the inpainting results quite similar. From Table 3, lambda_mrf = 0.02 offers a good balance between PSNR and visual quality.

Conclusion

To conclude, the key novelty of this paper is the ID-MRF loss term, which further enhances the local details of the generated content. The main idea of this loss is to guide the generated feature patches to find their nearest neighbors outside the missing areas as references, and these nearest neighbors should be diverse so that more local details can be reproduced.

The use of multiple receptive fields (multi-column or multi-branch) stems from the fact that the size of the receptive field is important to the task of image inpainting. As the local neighboring pixels are missing, we have to borrow information from distant spatial locations to fill in the missing pixels. I think this idea is not difficult to understand if you have followed my previous posts.

The use of the weighted _L_1 loss is also due to the fact that there is no "correct" answer to the missing regions. Missing pixels closer to the boundary of the missing areas are more strongly constrained by the nearby valid pixels, hence they should be assigned higher _L_1 loss weights. On the other hand, missing pixels located at the center of the missing areas should be less constrained by the _L_1 loss.

Takeaways

Referring to my conclusion above, I hope that you can understand the meaning of the proposed ID-MRF loss, as this is the key idea of this paper. As for the other two ideas in this paper, namely the multi-column structure and the weighted _L_1 loss, I think you can understand the reasoning behind them well if you have followed my previous posts. I would say that the concept of multiple/various receptive fields is a common practice in deep image inpainting.

For the weighted _L_1 loss, from my own experience, I don’t think it brings an obvious improvement in inpainting performance. Of course, there are many ways to implement a weighted _L_1 loss. If you are interested in this, you may give it a try. I will also keep doing experiments on this! 🙂

What’s Next?

In my next post, I will talk about how to deal with irregular masks. So far we have introduced several famous deep image inpainting methods. However, they mainly focus on regular masks (usually a large rectangular mask at the center, or sometimes multiple small rectangular masks). So, let’s see how researchers have dealt with irregular masks recently.

If you are interested in deep generative models for image inpainting, I highly recommend you skim all of my previous posts. Hope you guys enjoy 🙂

References

[1] Yi Wang, Xin Tao, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia, "Image Inpainting via Generative Multi-column Convolutional Neural Networks," Proc. Neural Information Processing Systems, 2018.

Again, many thanks for reading my post! If you have any questions, please feel free to send me an email or leave comments here. Any suggestions are welcome. It is extremely important for us to learn systematically. Thank you very much and see you next time! 🙂

