A Breakthrough in Deep Image Inpainting
Review: Generative Image Inpainting with Contextual Attention
Welcome back guys! Happy to see you guys :) Last time, we saw how copy-and-paste can be embedded in CNNs for deep image inpainting. Did you get the main idea? If yes, good! If no, don't worry! Today, we are going to dive into a breakthrough in deep image inpainting, for which contextual attention was proposed. With contextual attention, we can effectively borrow information from distant spatial locations to reconstruct the local missing pixels. This idea is more or less the same as copy-and-paste. Let's see how the authors make it work!
Recall
In my previous post, I introduced the shift-connection layer, in which features from known regions act as references for the generated features inside the missing regions, allowing us to further refine those generated features for better inpainting results. There, we assume that the generated features are reasonable estimates of the ground truth, and suitable references are selected according to the similarity between features from the known regions and the generated features inside the missing regions.
Motivation
For the task of image inpainting, standard CNN structures cannot effectively model the long-range correlations between the missing regions and the information given at distant spatial locations. If you are familiar with CNNs, you know that the kernel size and the dilation rate control the receptive field of a convolutional layer, and that the network has to go deeper and deeper to see the entire input image. This means that if we want to capture the context of an image, we have to rely on deeper layers, but we then lose spatial information because deeper layers always have feature maps with a smaller spatial size. So, we have to find a way to borrow information from distant spatial locations (i.e. to understand the context of an image) without going too deep into the network.
If you remember what dilated convolution is (we covered it in a previous post), you will know that dilated convolution is one way to increase the receptive field of early convolutional layers without adding extra parameters. However, dilated convolution has its limitations: it skips consecutive spatial locations in order to enlarge the receptive field, and those skipped locations can also be crucial for filling in the missing regions.
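As a quick, hypothetical illustration (the layer sizes below are toy values, not from the paper), a single 3×3 convolution with dilation rate 2 in PyTorch already covers a 5×5 input window while keeping only nine weights per input-output channel pair:

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation 2 samples a 5x5 window of the input, so the
# receptive field grows without any extra parameters; padding=2 keeps the
# spatial size unchanged. The sampled positions skip every other pixel,
# which is exactly the limitation mentioned above.
x = torch.randn(1, 64, 32, 32)                                # toy feature map
dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)
print(dilated(x).shape)                                       # torch.Size([1, 64, 32, 32])
```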
Introduction
This work shares a similar network architecture, loss functions, and related techniques with the works we have covered before. For the architecture, the proposed framework consists of two generator networks and two discriminator networks. Both generators are fully convolutional networks with dilated convolutions; one is for coarse reconstruction and the other for refinement. This is the standard coarse-to-fine network structure. The two discriminators look at the completed images both globally and locally: the global discriminator takes the entire image as input, while the local discriminator takes only the filled region as input.
For the loss function, simply speaking, they also employ an adversarial loss (GAN loss) and an L1 loss (for pixel-wise reconstruction accuracy). For the L1 loss, they use a spatially discounted L1 loss in which a weight is assigned to each pixel difference, based on the distance from that pixel to its nearest known pixel. For the GAN loss, they use a WGAN-GP loss instead of the standard adversarial loss we introduced before. They point out that this Wasserstein adversarial loss is also based on an L1 distance measure, so the network is easier to train and the training process is more stable.
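To make the spatial discounting concrete, here is a minimal sketch of how such a weight mask could be computed. The function name and the use of SciPy's distance transform are my own choices; gamma is the discount factor (the paper uses a value close to 1, namely 0.99):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def spatial_discount_mask(hole_mask, gamma=0.99):
    """Per-pixel weights for a spatially discounted L1 loss.

    hole_mask: binary array, 1 inside the missing region, 0 for known pixels.
    Each missing pixel is weighted by gamma**l, where l is its distance to
    the nearest known pixel; known pixels keep a weight of 1 (gamma**0).
    """
    # distance to the nearest zero entry, i.e. to the nearest known pixel
    distance = distance_transform_edt(hole_mask)
    return gamma ** distance

# usage: weight the per-pixel L1 difference between prediction and ground truth
# loss = (spatial_discount_mask(mask) * np.abs(pred - gt)).mean()
```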
In this post, I would like to focus on the proposed contextual attention mechanism. Therefore, I have only briefly covered the coarse-to-fine network architecture, the WGAN adversarial loss, and the weighted L1 loss above. Interested readers can refer to my previous posts and the paper for further details.
Solution (in short)
The contextual attention mechanism is proposed to effectively borrow contextual information from distant spatial locations for reconstructing the missing pixels. It is applied in the second, refinement network, while the first, coarse reconstruction network is responsible for a rough estimate of the missing regions. As before, global and local discriminators are used to encourage better local texture details in the generated pixels.
Contributions
![Figure 1. Some examples of inpainting results by the proposed model on natural scene, face, and texture images. Image by Jiahui Yu et al. from their paper [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1CpWLaX-KOrj7t1klMt64Lw.png)
The most important idea in this paper is the contextual attention layer, which allows us to make use of information from distant spatial locations to reconstruct local missing pixels. Second, the use of the WGAN adversarial loss and the weighted L1 loss improves training stability. Also, the proposed inpainting framework achieves high-quality inpainting results on various datasets, such as natural scenes, faces, and textures, as shown in Figure 1 above.
Approach
![Figure 2. Network architecture of the proposed inpainting framework. Image by Jiahui Yu et al. from their paper [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1sGAZM3GZMUG0cRRYSZTP0g.png)
Figure 2 shows the network architecture of the proposed inpainting framework. As mentioned, it consists of two generators and two discriminators. If you have read my previous posts, you will recognize this as a typical network architecture for deep image inpainting.
Contextual Attention
Here is the main focus of this post. Let's see how the proposed contextual attention layer is designed to borrow feature information from known regions at distant spatial locations in order to generate the features inside the missing regions.
![Figure 3. Graphical illustration of the proposed contextual attention layer. Image by Jiahui Yu et al. from their paper [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1M-3oTs0df0T5_jppYpSJTQ.png)
Figure 3 shows a graphical illustration of the proposed contextual attention layer. The operation is differentiable and fully convolutional.

Figure 4 is a more detailed example of the proposed contextual attention layer. As in Figure 3, the foreground means the generated features inside the missing region, while the background means the features extracted from the known region. Similar to the copy-and-paste approach, we first want to match the generated features inside the missing region to the features outside it.
Taking Figure 4 as an example, the generated features inside the missing region have a size of 64×64×64, and the features outside the missing region are divided into 128 small feature patches of size 64×3×3 (the channel size of the features in this example is 64). We then perform convolution on the generated features inside the missing region using the 128 small feature patches as kernels, and obtain feature maps of size 128×64×64. In the paper, this operation is described as

s_{x,y,x',y'} = \left\langle \frac{f_{x,y}}{\lVert f_{x,y} \rVert}, \frac{b_{x',y'}}{\lVert b_{x',y'} \rVert} \right\rangle

where {f_{x,y}} are the foreground patches (generated feature patches inside the missing region) and {b_{x',y'}} are the background patches (feature patches extracted outside the missing region). s_{x,y,x',y'} is the similarity between the generated patch centered at (x, y) inside the missing region and the known patch centered at (x', y') in the known region. This is in fact a standard cosine similarity measure.
Looking along the channel dimension, the 128 elements represent the similarities between all the known patches and a particular location inside the missing region; they reflect the contributions of the 128 known patches to that location. We then apply softmax normalization to the feature maps along the channel dimension, as shown in the blue-colored region in Figure 4. After the softmax normalization, the values at each location sum to 1 along the channel dimension.
Compared with the Shift-Net covered in my previous post, you can see that this time we assign a weight to each known feature patch to indicate its importance for reconstructing each feature location inside the missing region (soft assignment), instead of keeping only the most similar known feature patch for each location (hard assignment). This is also why the proposed contextual attention is differentiable.
Finally, we reconstruct the generated features inside the missing region by means of deconvolution (transposed convolution), using the attention feature maps as input and the known patches as kernels. Readers interested in the actual implementation can visit the authors' GitHub project page for further details.
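To make the matching, softmax, and deconvolution steps concrete, here is a minimal PyTorch sketch of the mechanism under some simplifying assumptions: it ignores the masking of invalid background patches, the attention propagation described next, and the stride/downsampling tricks, and it normalizes only the background patches (dividing by the foreground norm would only change the softmax sharpness). The function name and the softmax temperature are my own choices, so treat this as an illustration rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def contextual_attention(foreground, background, patch_size=3, scale=10.0):
    """Sketch of the contextual attention idea.

    foreground: (1, C, H, W) generated features inside the missing region.
    background: (1, C, Hb, Wb) features extracted from the known region.
    Returns the reconstructed foreground features and the attention maps.
    """
    C = background.shape[1]
    pad = patch_size // 2
    # 1) extract all overlapping background patches and stack them as kernels
    patches = F.unfold(background, kernel_size=patch_size, padding=pad)       # (1, C*k*k, N)
    kernels = patches.transpose(1, 2).reshape(-1, C, patch_size, patch_size)  # (N, C, k, k)
    # 2) L2-normalize each kernel so the convolution measures cosine similarity
    norms = kernels.flatten(1).norm(dim=1).clamp_min(1e-8).view(-1, 1, 1, 1)
    similarity = F.conv2d(foreground, kernels / norms, padding=pad)           # (1, N, H, W)
    # 3) softmax over the "which background patch" channel dimension,
    #    so the N scores at every foreground location sum to 1
    attention = F.softmax(scale * similarity, dim=1)
    # 4) paste the raw background patches back with a transposed convolution,
    #    weighted by their attention scores; dividing by k*k roughly averages
    #    the overlapping contributions
    output = F.conv_transpose2d(attention, kernels, padding=pad) / (patch_size ** 2)
    return output, attention

# toy usage: here every background location yields one patch (4096 of them),
# whereas the 128 patches in the Figure 4 example come from a sparser extraction
fg = torch.randn(1, 64, 64, 64)
bg = torch.randn(1, 64, 64, 64)
out, attn = contextual_attention(fg, bg)
print(out.shape, attn.shape)  # torch.Size([1, 64, 64, 64]) torch.Size([1, 4096, 64, 64])
```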
Attention Propagation
Attention propagation can be regarded as a fine-tuning of the attention feature maps. The key idea is that neighboring pixels usually have similar values, so the attention scores of neighboring locations are used to adjust each attention score,

\hat{s}_{x,y,x',y'} = \sum_{i \in \{-k, \ldots, k\}} s_{x+i,\,y,\,x'+i,\,y'}

For example, if we consider the attention values of the left and right neighbors, we can update the current attention value using the equation above. Note that k controls the number of neighbors taken into account.
The authors claim that this further improves the inpainting results, and that the propagation can be implemented as a convolution with identity matrices as kernels.
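As a rough sketch of what the propagation equation does, written as an explicit loop over shifts rather than the identity-kernel convolution mentioned by the authors (the tensor layout and function name are my own choices):

```python
import numpy as np

def propagate(scores, k=1):
    """Attention propagation along one spatial axis (the other axis is analogous).

    scores[x, y, xb, yb] is the similarity between foreground location (x, y)
    and the background patch centered at (xb, yb). Each score is replaced by
    the sum over jointly shifted foreground/background positions,
    i.e. s_hat[x, y, xb, yb] = sum_i scores[x+i, y, xb+i, yb] for i in [-k, k].
    """
    H, _, Hb, _ = scores.shape
    out = np.zeros_like(scores)
    for i in range(-k, k + 1):
        # keep only the rows where both x+i and xb+i stay inside the arrays
        fs, fe = max(0, -i), min(H, H - i)
        bs, be = max(0, -i), min(Hb, Hb - i)
        out[fs:fe, :, bs:be, :] += scores[fs + i:fe + i, :, bs + i:be + i, :]
    return out
```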
One more point about the attention mechanism: two techniques are used to control the number of extracted known feature patches, i) extracting known feature patches with a larger stride to reduce the number of kernels, and ii) downsampling the features before the operation and upsampling the attention maps afterwards.
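A quick, hypothetical illustration of these two tricks (the tensor sizes are toy values, not from the paper):

```python
import torch
import torch.nn.functional as F

background = torch.randn(1, 64, 64, 64)                              # toy known-region features

# (i) extract 3x3 patches with stride 2 instead of 1: roughly 4x fewer kernels
dense   = F.unfold(background, kernel_size=3, padding=1)             # 4096 patches
strided = F.unfold(background, kernel_size=3, padding=1, stride=2)   # 1024 patches
print(dense.shape[-1], strided.shape[-1])

# (ii) or match on downsampled features and upsample the attention maps afterwards
small = F.interpolate(background, scale_factor=0.5, mode="bilinear", align_corners=False)
print(small.shape)                                                   # torch.Size([1, 64, 32, 32])
```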
Attention in Network
![Figure 5. Illustration of embedding the contextual attention layer in the second refinement network. Image by Jiahui Yu et al. from their paper [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1tpIs6Fs1B23QDYozBSlJHA.png)
Figure 5 shows how the authors integrate the proposed contextual attention layer into the second refinement network. You can see that one more branch is introduced to apply the contextual attention, and the two branches are then concatenated to produce the final inpainting result. A color coding is used to visualize the attention map: for example, white at the center means a pixel attends to itself, pink means it attends to the bottom-left region, green to the top-right region, and so on. In this example, the attention map is mostly pink, which means the filled region borrows much of its information from the bottom-left region.
Experiments
The authors first compared their model with the previous state-of-the-art approach that we introduced before.
![Figure 6. Comparison of the proposed baseline model and the GLCIC [2]. From left to right, input image, result by GLCIC, and result by the baseline. Image by Jiahui Yu et al. from their paper [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1dbttFfPFmsZTiIEpYAAD-A.png)
Figure 6 shows the inpainting results of the proposed baseline model and the previous state-of-the-art, GLCIC [2]. The proposed baseline model is the model shown in Figure 2 without the contextual attention branch. It is clear that the baseline model is better than GLCIC in terms of local texture details. Please zoom in for a better view.
![Figure 7. Visual comparison of inpainting results by the baseline and the full model. From left to right, ground truth, input image, results by baseline, results by full model, attention maps by full model. Image by Jiahui Yu et al. from their paper [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1zMcnRHcLxAIxWOR1YZtpHg.png)
Figure 7 shows qualitative results of the baseline model and the full model (with contextual attention) on the Places2 dataset. It is obvious that the full model offers better inpainting results with finer local texture details. This reflects that the contextual attention layer can effectively borrow information from distant spatial locations to help reconstruct the missing pixels. Please zoom in for a better view, especially for the attention maps.
![Table 1. Quantitative comparison of different approaches on Places2 dataset. Table by Jiahui Yu et al. from their paper [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1kRENr5OdlH-IxWaQiFK22Q.png)
Table 1 lists some objective evaluation metrics for reference. As mentioned before, these metrics cannot fully reflect the quality of the inpainting results, as there are many plausible ways to fill in the missing regions. You can see that the proposed full model gives the best l1 error, l2 error, and PSNR. For total variation (TV) loss, PatchMatch gives a lower value because it directly copies raw image patches to fill in the holes.
For reference, the proposed full model has 2.9M parameters. It takes 0.2 seconds per image on GPU and 1.5 seconds per image on CPU for images of size 512×512.
Ablation Study
The attention mechanism is not a new idea, and there are several attention modules in the literature. The authors ran experiments with different attention modules.
![Figure 8. Inpainting results by using different attention modules. From left to right: Input, results using Spatial Transformer Network, results using Appearance Flow, and results using the proposed contextual attention. Image by Jiahui Yu et al. from their paper [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1uDC1T3ACuuPIYSRLqo_mpA.png)
The authors compared with two well-known attention modules in the literature, namely the Spatial Transformer Network [3] and Appearance Flow [4]. Simply speaking, for appearance flow, a convolutional layer is used in place of the contextual attention layer to directly predict 2D pixel offsets as attention; in other words, a convolutional layer predicts how known pixels should be shifted into the missing region. In Figure 8, you can see that the results using appearance flow (middle) produce similar attention maps for different test images, which means the attention maps are not providing the "attention" we want. You can also observe that the Spatial Transformer Network (left) cannot offer meaningful attention maps for the task of image inpainting. One possible reason is that the spatial transformer network predicts the parameters of a global affine transformation, which is not sufficient for filling in missing regions that also require local information. I have not gone too deep into the different attention modules here; interested readers may refer to the respective papers for further details.
Choice of GAN loss for image inpainting. The authors experimented with different GAN losses, such as the WGAN loss, the standard adversarial loss, and the least-squares GAN loss. They empirically found that the WGAN loss provides the best inpainting results.
Essential reconstruction loss. The authors also trained the refinement network without the L1 loss. They found that the L1 loss is necessary to ensure pixel-wise reconstruction accuracy, even though it tends to make the inpainting results blurry. So, the L1 loss is crucial for ensuring a better content structure in the completed images.
Perceptual loss, style loss, and TV loss. We will cover perceptual loss and style loss very soon. A simple conclusion here is that these three losses do not bring obvious improvements to the inpainting results. So, their model is trained using only the weighted L1 loss and the WGAN loss.
Conclusion
Clearly, the key idea of this paper is the contextual attention mechanism. The contextual attention layer is embedded in the second refinement network. Note that the role of the first, coarse reconstruction network is to provide a rough estimate of the missing region; this estimate is then used by the contextual attention layer. By matching the generated features inside the missing region with the features outside it, we know how much each outside feature contributes to each location inside the missing region. Note also that the contextual attention layer is differentiable and fully convolutional. With the proposed contextual attention, the authors achieve state-of-the-art inpainting results.
Takeaways
You may find that we are going deeper and deeper into the field of deep image inpainting. In my previous post, the shift-connection layer was introduced, which embeds the concept of copy-and-paste into CNNs in the form of a hard assignment. This paper formulates a contextual attention layer in the form of a soft assignment, so that the layer is differentiable and can be learned end-to-end without modifying the computation of the gradient.
I hope that you can grasp the key idea of the contextual attention layer proposed in this paper, especially its formulation as shown in Figures 3 and 4. For readers who want to know more about the network architectures and loss functions, please refer to the paper.
What’s Next?
In future posts, we will look into more task-specific inpainting techniques. I hope we can learn together and enjoy!
References
[1] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang, "Generative Image Inpainting with Contextual Attention," Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
[2] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa, "Globally and Locally Consistent Image Completion," ACM Transactions on Graphics, Vol. 36, No. 4, Article 107, July 2017.
[3] M. Jaderberg, K. Simonyan, A. Zisserman, et al., "Spatial Transformer Networks," In Advances in Neural Information Processing Systems, pp. 2017–2025, 2015.
[4] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, "View synthesis by appearance flow," Proc. European Conference on Computer Vision (ECCV), 2016.
Again, many thanks for reading my post! If you have any questions, please feel free to send me an email or leave comments here.
Actually, I have tried to shorten the post and focus on just one key idea of the paper, assuming that readers have basic knowledge of deep image inpainting from my previous posts. By the way, I must keep improving my writing skills to express my understanding of the papers more effectively. Any suggestions are welcome. It is extremely important for us to learn systematically. Thank you very much and see you next time!