How ‘Copy-and-Paste’ is embedded in CNNs for Image Inpainting — Review: Shift-Net: Image Inpainting via Deep Feature Rearrangement

Chu-Tak Li
Towards Data Science
16 min read · Oct 16, 2020


Hello everyone:) Welcome back!! Today, we will dive into a more specific deep image inpainting technique, Deep Feature Rearrangement. This technique combines the advantages of modern data-driven CNNs with those of the conventional copy-and-paste inpainting method. Let’s learn and enjoy together!

Recall

This is my fifth post related to deep image inpainting. In my first post, I introduced the objective of image inpainting and the first GAN-based image inpainting method. In my second post, we went through an improved version of that method, in which a texture network is employed to enhance the local texture details. In my third post, we dived into a milestone in deep image inpainting whose network architecture can be regarded as a standard design for image inpainting. In my fourth post, we had a revision and skimmed through a variant/improved version of the standard inpainting network. If you are new to this topic, I highly recommend reading the previous posts first, so that you have a full picture of the progress in recent deep image inpainting. I have tried my best to tell the story:)

Motivation

Figure 1. Qualitative comparison of inpainting results by different methods. (a) Input (b) Conventional method (based on copy-and-paste) (c) First GAN-based method, Context Encoder (d) Proposed method. Image by Zhaoyi Yan et al. from their paper [1]

As I mentioned in my previous posts, a conventional way to fill in the missing parts of an image is to search for the most similar image patches and then directly copy and paste these patches onto the missing parts (i.e. the copy-and-paste method). This method offers good local details, as we directly paste other image patches onto the missing parts. However, the patches may not perfectly fit the context of the entire image, which may lead to poor global consistency. Please see Figure 1(b) as an example: the local texture details of the filled region are good, but the region is not consistent with the non-missing parts (i.e. valid pixels).

On the other hand, deep learning-based methods focus on the context of the entire image. Fully-connected layers or dilated convolutional layers are used to capture the context of the entire image, and the models are trained with an L1 loss to ensure pixel-wise reconstruction accuracy. Therefore, filled images offered by deep learning approaches have better global consistency. However, the L1 loss leads to blurry inpainting results, even though an adversarial loss (GAN loss) can be used to enhance the sharpness of the filled pixels. Please see Figure 1(c) as an example: the filled region is more consistent with the non-missing region, but it is blurry.

So, the authors of this paper would like to combine the advantages of the conventional “Copy-and-Paste” method (good local details) and the modern deep learning approach (good global consistency).

Introduction

In image inpainting, we want a completed image with good visual quality. Therefore, we need both a correct global semantic structure and fine detailed textures. A correct global semantic structure means that the generated pixels and the valid pixels should be consistent; in other words, when we fill in an image, its context has to be maintained. Fine detailed textures mean that the generated pixels should be realistic-looking and as sharp as possible.

In the previous section, we mentioned that conventional “Copy-and-Paste” methods can offer fine detailed textures, while recent deep learning approaches can provide a much more correct global semantic structure. So, this paper introduces a shift-connection layer to achieve deep feature rearrangement with the concept of “Copy-and-Paste” inside the network. Figure 1(d) shows the inpainting results offered by the proposed method.

Solution (in short)

A guidance loss is proposed to encourage their network (Shift-Net) to learn to fill in the missing parts during the decoding process. Apart from that, a shift-connection layer is suggested to match the decoded feature inside the missing region to the encoded feature outside the missing region, and then each matched location of the encoded feature outside the missing region is shifted to the corresponding location inside the missing region. This captures the information about the most similar local image patches found outside the missing region and this information is concatenated to the decoded feature for further reconstruction.

Contributions

As mentioned, a shift-connection layer is proposed to embed the concept of copy-and-paste in modern CNNs such that their proposed model can offer inpainting results with both correct global semantic structure and fine detailed textures.

Apart from the standard L1 and adversarial losses, they also propose a guidance loss to train their Shift-Net in an end-to-end, data-driven manner.

Approach

Figure 2. Network architecture of Shift-Net. The shift-connection layer is added at resolution of 32×32. Image by Zhaoyi Yan et al. from their paper [1]

Figure 2 shows the network architecture of Shift-Net. Without the shift-connection layer, this is a very standard U-Net structure with skip connections. Note that the encoded feature is concatenated to the corresponding layer of the decoded feature. This kind of skip connection is useful for low-level vision tasks, including image inpainting, in terms of both better local visual details and reconstruction accuracy.
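As a rough illustration of such a skip connection (a minimal sketch with made-up layer sizes, not the actual Shift-Net configuration), the encoded feature can be concatenated to the decoded feature like this in PyTorch:

```python
import torch
import torch.nn as nn

# Minimal U-Net-style sketch: the point is only the skip connection, where the
# encoded feature e1 is concatenated to the decoded feature d2 before the last
# decoder layer. Depth and channel sizes are illustrative, not Shift-Net's.
class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1), nn.LeakyReLU(0.2))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(base * 2, in_ch, 4, 2, 1)  # takes the concatenated features

    def forward(self, x):
        e1 = self.enc1(x)                        # encoded feature (layer l)
        e2 = self.enc2(e1)
        d2 = self.dec2(e2)                       # decoded feature (layer L-l)
        d1 = self.dec1(torch.cat([d2, e1], 1))   # skip connection: channel-wise concatenation
        return torch.tanh(d1)
```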

Guidance loss

The guidance loss is proposed to train their Shift-Net. Simply speaking, this loss calculates the difference between the decoded feature of the input masked image inside the missing region and the encoded feature of the ground truth inside the missing region.

Let’s define the problem first. Let Ω be the missing region and Ω(bar) be the valid region (i.e. the non-missing region). For a U-Net with L layers, ϕ_l(I) represents the encoded feature at the l-th layer and ϕ_L-l(I) represents the decoded feature at the (L-l)-th layer. Our final objective is to recover I^gt (the ground truth), thus we can expect that ϕ_l(I) and ϕ_L-l(I) together contain almost all the information in ϕ_l(I^gt). If we consider y ∈ Ω, (ϕ_l(I))_y should be 0 (i.e. the encoded feature of the missing region in an input masked image at the l-th layer is zero). So, (ϕ_L-l(I))_y should contain the information of (ϕ_l(I^gt))_y (i.e. the decoded feature of the missing region in an input masked image at the (L-l)-th layer should approximate the encoded feature of the missing region in the ground truth image at the l-th layer). This means that the decoding process should fill in the missing region.

Equation 1 shows the relationship between (ϕ_L-l(I))_y and (ϕ_l(I^gt))_y. Note that for x ∈ Ω(bar) (i.e. the non-missing region), the authors assume that (ϕ_l(I))_x is almost the same as (ϕ_l(I^gt))_x. Hence, the guidance loss is only defined on the missing region. By concatenating ϕ_l(I) and ϕ_L-l(I) as shown in Figure 2, almost all the information in ϕ_l(I^gt) can be obtained.
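Putting this into a formula, the guidance loss should take roughly the following form (my own reconstruction from the description above; see the paper for the exact definition):

L_g = Σ_{y ∈ Ω} ||(ϕ_L-l(I))_y - (ϕ_l(I^gt))_y||²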

Figure 3. Visualisation of features learned by Shift-Net. (a) Input (the lighter region indicates the missing region) (b) visualisation of (ϕ_l(I^gt))_y (c) visualisation of (ϕ_L-l(I))_y (d) visualisation of (ϕ^shift_L-l(I))_y Image by Zhaoyi Yan et al. from their paper [1]

To further show the relationship between (ϕ_L-l(I))_y and (ϕ_l(I^gt))_y, the authors visualise the features learned by their Shift-Net, as shown in Figure 3. Comparing Figure 3(b) and (c), we can see that (ϕ_L-l(I))_y can be a reasonable estimation of (ϕ_l(I^gt))_y, but the estimation is too blurry. This leads to blurry inpainting results without fine texture details. This problem is solved by their proposed shift-connection layer, and the result is shown in Figure 3(d). So, let’s talk about the shift operation.

For readers who are interested in their visualisation method, please refer to their paper or their GitHub page. The visualisation method is only used to show the learned features, so I will not cover it here.

Shift-connection layer

Personally, I would say this is the core idea of this paper. Recall that ϕ_l(I) and ϕ_L-l(I) are assumed to have almost all information in ϕ_l(I^gt). From the previous section, we can see that (ϕ_L-l(I))_y can be a reasonable estimation of (ϕ_l(I^gt))_y but it is not sharp enough. Let’s see how the authors make use of the feature outside the missing region to further enhance the blurry estimation inside the missing region.

Simply speaking, Equation 4 finds, for each decoded feature inside the missing region, the most similar encoded feature outside the missing region. This is a cosine similarity operation. For each (ϕ_L-l(I))_y with y ∈ Ω, we find its nearest neighbour among (ϕ_l(I))_x with x ∈ Ω(bar). The output x*(y) represents the coordinates of the matched feature location, and we can obtain a shift vector u_y = x*(y) - y. Note that this shift operation can be formulated as a convolutional layer. I will talk about this in detail in my next post.
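Equation 4 is essentially a cosine-similarity nearest-neighbour search: roughly, x*(y) = argmax_{x ∈ Ω(bar)} cos⟨(ϕ_L-l(I))_y, (ϕ_l(I))_x⟩. Below is a minimal PyTorch sketch of this matching step; the function name, tensor shapes, and masking details are my own assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def match_features(dec_feat, enc_feat, mask):
    """Sketch of the cosine-similarity nearest-neighbour search (Eq. 4).

    dec_feat: decoded feature phi_{L-l}(I), shape (C, H, W)
    enc_feat: encoded feature  phi_l(I),    shape (C, H, W)
    mask:     binary mask (1 = missing region), shape (H, W)
    Returns, for every missing location y, the flat index of its best match
    x*(y) among the valid locations; the shift vector is u_y = x*(y) - y.
    """
    C, H, W = dec_feat.shape
    dec = F.normalize(dec_feat.reshape(C, -1), dim=0)  # unit-length feature per location
    enc = F.normalize(enc_feat.reshape(C, -1), dim=0)
    sim = dec.t() @ enc                                 # (H*W, H*W) cosine similarities
    valid = (mask.reshape(-1) == 0)                     # locations outside the hole
    sim[:, ~valid] = float('-inf')                      # never match inside the hole
    nn_idx = sim.argmax(dim=1)                          # x*(y) for every location y
    missing = (mask.reshape(-1) == 1)
    return nn_idx[missing]                              # keep only y inside the hole
```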

After getting the shift vector, we can rearrange the spatial locations of (ϕ_l(I))_x and then concatenate it to ϕ_l(I) and ϕ_L-l(I) to further enhance the estimation of (ϕ_l(I^gt))_y. The spatial rearrangement of (ϕ_l(I))_x is as follows,

Verbally, for each decoded feature inside the missing region, after finding the most similar encoded feature outside the missing region, we form another set of feature maps based on the shift vector. This set of feature maps contains the information about the nearest encoded features outside the missing region to the decoded features inside the missing region. All the related information is then combined as shown in Figure 2 for further reconstruction.
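To make the rearrangement concrete, here is a minimal sketch of how the matched features could be pasted into a shifted feature map (continuing the matching sketch above; the handling of locations outside the hole and the function name are my assumptions):

```python
import torch

def apply_shift(enc_feat, nn_idx, mask):
    """Sketch of the rearrangement step (Eq. 5): for each missing location y,
    paste the encoded feature from its matched location x*(y) outside the hole.

    enc_feat: encoded feature phi_l(I), shape (C, H, W)
    nn_idx:   flat indices of the matches, e.g. from match_features() above
    mask:     binary mask (1 = missing region), shape (H, W)
    """
    C, H, W = enc_feat.shape
    enc_flat = enc_feat.reshape(C, -1)
    shifted = enc_flat.clone()
    missing = (mask.reshape(-1) == 1).nonzero(as_tuple=True)[0]
    shifted[:, missing] = enc_flat[:, nn_idx]  # copy-and-paste in the feature domain
    return shifted.reshape(C, H, W)

# The shifted map phi^shift is then concatenated with the decoded and encoded
# features along the channel dimension (Figure 2), e.g.:
# fused = torch.cat([dec_feat, enc_feat, apply_shift(enc_feat, nn_idx, mask)], dim=0)
```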

Here I would like to highlight some points about the shift-connection layer. i) The conventional “Copy-and-Paste” method operates in the pixel or image patch domain, while the shift-connection layer operates in the deep feature domain. ii) The deep features are learned from a large amount of training data, and all the components are learned in an end-to-end, data-driven manner. Hence, the advantages of both “Copy-and-Paste” and CNNs are inherited.

Loss Function

Their loss function is very standard. As mentioned, apart from the proposed guidance loss introduced above, they also employ the L1 loss and a standard adversarial loss. The overall loss function is as follows,
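Roughly, the overall objective takes the form below (my reconstruction based on the description; the exact formulation is in the paper):

L = L_ℓ1 + λ_g · L_g + λ_adv · L_adv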

λ_g and λ_adv are used to control the importance of the guidance loss and the adversarial loss, respectively. In their experiments, these two hyper-parameters are set to 0.01 and 0.002, respectively.

If you are familiar with the training process of CNNs, you may notice that the shift operation is a kind of manual modification of the feature maps. Therefore, we have to modify the calculation of the gradient with respect to the l-th layer feature F_l = ϕ_l(I). Based on Equation 5, the relationship between ϕ^shift_L-l(I) and ϕ_l(I) can be written as follows,
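In rough form (my notation; the paper gives the precise definition), this relationship is a linear rearrangement of feature locations:

ϕ^shift_L-l(I) = P ϕ_l(I)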

where P is a {0, 1} shift matrix with exactly one element equal to 1 in each row. The position of that 1 indicates the location of the nearest neighbour. Therefore, the gradient with respect to ϕ_l(I) is computed as,
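Roughly, since F_l is used by the next encoder layer, by the skip connection, and by the shift operation, the gradient is the sum of three terms (my sketch, not the paper's exact equation):

∂L/∂F_l = (∂L/∂F_l)_encoder + ∂L/∂F^skip_l + Pᵀ ∂L/∂ϕ^shift_L-l(I)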

where F^skip_l represents F_l after the skip connection, and F^skip_l = F_l. All three terms can be directly computed, except that we have to multiply the last term by the transpose of the shift matrix P in order to ensure that the gradient is correctly back-propagated.

Perhaps you may find this part a bit difficult to understand, as we have to modify the computation of the gradient. For readers who are interested in how the authors actually do the implementation, I highly recommend visiting their GitHub page. If you do not understand this part, it doesn’t matter as long as you catch the core idea of the shift operation. Here, their shift operation is a kind of hard assignment. This means that each decoded feature in the missing region can only have one single nearest neighbour outside the missing region. This is why the shift matrix P is in the form of {0, 1} and why we have to modify the computation of the gradient. Later on, a similar idea of the shift operation was proposed with soft assignment. In that case, all neighbours outside the missing region are assigned weights indicating their closeness to each decoded feature inside the missing region, and we do not need to modify the computation of the gradient, as the operation is completely differentiable. I will talk about this in detail in my next post:)

Experiments

The authors evaluate their model on two datasets, namely Paris StreetView [2] and six scenes from Places365-Standard [3]. Paris StreetView contains 14,900 training images and 100 testing images. For Places365, there are 1.6 million training images from 365 scenes. Six scenes are selected for the evaluation; each scene has 5,000 training images, 900 testing images, and 100 validation images. For both datasets, they resize each image such that the smallest dimension is 350 and then randomly crop a 256×256 sub-image as the input to their model.

For training, they use the Adam optimiser with a learning rate of 0.0002 and beta_1 = 0.5. The batch size is set to 1 and the total number of training epochs is 30. Note that flipping is adopted as data augmentation. They claim that around one day is required to train their Shift-Net on an Nvidia Titan X Pascal GPU.
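As a small sketch of the training configuration described above (the interpolation mode, flip probability, and the stand-in module are my assumptions, not the authors' code):

```python
import torch
from torchvision import transforms

# Preprocessing: resize so the shorter side is 350 pixels, then a random
# 256x256 crop; horizontal flipping as data augmentation.
preprocess = transforms.Compose([
    transforms.Resize(350),
    transforms.RandomCrop(256),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Optimiser: Adam with lr = 0.0002 and beta_1 = 0.5, as reported by the authors.
# `generator` is only a stand-in module to make this snippet runnable.
generator = torch.nn.Conv2d(3, 3, 3, padding=1)
optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
```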

Figure 4. Visual comparison of inpainting results on Paris StreetView dataset. (a) Input (b) Content-Aware Fill (copy-and-paste method) (c) Context Encoder (d) Multi-scale Neural Patch Synthesis (MNPS) (e) Shift-Net. Image by Zhaoyi Yan et al. from their paper [1]

Figure 4 shows a visual comparison of state-of-the-art approaches on the Paris StreetView dataset. Content-Aware Fill (Figure 4(b)) is the conventional method which utilises the concept of copy-and-paste. You can see it offers fine local texture details but a wrong global semantic structure. Figure 4(c) and (d) are the results of Context Encoder and Multi-scale Neural Patch Synthesis, respectively. We have reviewed these two methods previously. You can see that the results of Context Encoder have a correct global semantic structure but are blurry. MNPS provides better results than Context Encoder, but we can still easily observe some artifacts in the filled region. In contrast, Shift-Net can offer inpainting results with both a correct global semantic structure and fine local texture details. The results are shown in Figure 4(e); please zoom in for a better view.

Figure 5. Visual comparison of inpainting results on Places dataset. (a) Input (b) Content-Aware Fill (copy-and-paste method) (c) Context Encoder (d) Multi-scale Neural Patch Synthesis (MNPS) (e) Shift-Net. Image by Zhaoyi Yan et al. from their paper [1]

Figure 5 shows the qualitative comparison of state-of-the-art approaches on the Places dataset. Similar observations can be made; please zoom in for a better view of the local texture details.

Table 1. Quantitative comparison of state-of-the-art approaches. Table by Zhaoyi Yan et al. from their paper [1]

Table 1 lists quantitative evaluation results on the Paris StreetView dataset. It is obvious that the proposed Shift-Net offers the best PSNR, SSIM and mean ℓ2 loss. As mentioned in my previous posts, these numbers are related to pixel-wise reconstruction accuracy (objective evaluation); they cannot fully reflect the visual quality of the inpainting results.

Figure 6. Examples of filling random regions. From top to bottom: Input, Content-Aware Fill, and Shift-Net. Image by Zhaoyi Yan et al. from their paper [1]

Figure 6 shows some examples of filling random regions using Content-Aware Fill and the proposed Shift-Net. Shift-Net is able to handle randomly cropped regions with good visual quality. Please zoom in for a better view of the local texture details.

Ablation study

The authors also conducted ablation studies to show the effectiveness of the proposed guidance loss and the shift-connection layer.

Figure 7. The effect of the proposed guidance loss in standard U-Net and the proposed Shift-Net. Image by Zhaoyi Yan et al. from their paper [1]

Figure 7 shows the inpainting results of U-Net and Shift-Net with and without the proposed guidance loss. It is clear that the guidance loss is useful for reducing the visual artifacts.

Figure 8. The effect of different values of λ_g for the guidance loss. Image by Zhaoyi Yan et al. from their paper [1]

Figure 8 shows the inpainting results of Shift-Net with different values of λ_g. We can see that better inpainting results are obtained when λ_g = 0.01, so they empirically set λ_g = 0.01 for their experiments.

Figure 9. The effect of doing shift operation at different layers L-l. Image by Zhaoyi Yan et al. from their paper [1]

Figure 9 shows the effect of performing the shift operation at different layers. Recall that the shift operation is performed on the deep feature maps at the (L-l)-th layer with the use of the feature at the l-th layer. When l is smaller, the feature map size is larger, hence the shift operation at this layer is more computationally expensive. When l is larger, the feature map size is smaller, hence the time cost is lower, but more spatial information is lost. This may also lead to poor inpainting results. In Figure 9, we can see that both (L-3) (c) and (L-2) (d) give good inpainting results (maybe (L-2) is a bit better). Note that (L-2) takes around 400 ms to process an image while (L-3) takes around 80 ms. So, to balance the time cost and the performance, the authors decided to perform the shift operation at the (L-3)-th layer.

Figure 10. The effect of zeroing out (b) ϕ_L-l(I), (c) ϕ_l(I), and (d) ϕ^shift_L-l(I) at the (L-l+1)-th layer. (e) shows the results of the full Shift-Net, using all of (b), (c), and (d). Image by Zhaoyi Yan et al. from their paper [1]

Recall from Figure 2 that three different feature maps are concatenated after the shift-connection layer, namely ϕ_L-l(I), ϕ_l(I) and ϕ^shift_L-l(I). The authors examine the importance of these feature maps for the final reconstruction, and the results are shown in Figure 10. It is clear that the decoded feature ϕ_L-l(I) is extremely important for the final reconstruction. If we zero out this decoded feature, the reconstruction completely fails, as shown in Figure 10(b). So, we know that the decoded feature ϕ_L-l(I) contains the information about the main structure and content of the missing region.

Figure 10(c) shows the result of removing the encoded feature ϕ_l(I). We can see that the main structure can still be reconstructed, but the visual quality is poorer than that of the full model shown in Figure 10(e). This means that the guidance loss is useful not just for encouraging the relationship between (ϕ_L-l(I))_y and (ϕ_l(I^gt))_y, but also the relationship between (ϕ_L-l(I))_x and (ϕ_l(I^gt))_x.

Finally, if we remove ϕ^shift_L-l(I), as shown in Figure 10(d), there are obvious artifacts in the filled missing region. Therefore, we know that ϕ^shift_L-l(I) helps refine the filled missing region by providing the nearest-neighbour features found outside the missing region as a reference for refinement.

Figure 11. From top to bottom: the inpainting results of Shift-Net with a random shift-connection and with the nearest neighbour searching. Image by Zhaoyi Yan et al. from their paper [1]

To further show the effectiveness of ϕ^shift_L-l(I), the authors compare a random shift-connection with the nearest neighbour searching, as shown in Figure 11. We can see that, compared to nearest neighbour searching, a random shift-connection does not help refine the inpainting results towards better global semantic consistency. So, we can say that a correct shift operation is important for obtaining visually good inpainting results.

To conclude the use of the shift-connection layer, I think the most important idea is that we provide a reference for the generated features inside the missing region (assuming that these generated features are already a good estimation of the missing region) and refine them based on that reference, where the reference for each generated feature is its most similar feature obtained from the non-missing region. Hence, we can borrow the structure and texture of the features in the non-missing region to refine the features in the missing region.

Other points to be noted

For interested readers, there are three more points for further study. First, you may be interested in how the masked region is defined in the feature maps. Actually, we only have the mask at the input image resolution, so the masked region in the feature maps has to be derived. Simply speaking, the authors use a simple CNN with the same architecture as the encoder to obtain the masked region inside the network; simple convolutions and a thresholding technique are used to get the masked region in the feature maps. If you are interested in this, please read the paper. I do not go through it here, as we will introduce a learnable way to obtain the mask inside the network very soon!
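As a rough illustration of this idea (the pooling operation, kernel size, and threshold here are my own assumptions, not the authors' exact scheme), the image-space mask can be propagated down to a feature-map resolution by repeated downsampling and thresholding:

```python
import torch
import torch.nn.functional as F

def mask_at_layer(mask, num_downsamples, threshold=0.5):
    """Rough sketch: propagate a binary image mask (1 = missing) down to the
    resolution of a deeper feature map by repeated 2x downsampling followed by
    thresholding, mimicking the stride-2 encoder convolutions."""
    m = mask.float()[None, None]               # (1, 1, H, W)
    for _ in range(num_downsamples):
        m = F.avg_pool2d(m, kernel_size=2)     # 2x spatial downsampling
        m = (m > threshold).float()            # re-binarise the mask
    return m[0, 0]
```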

Second, regarding the detailed architecture of the generator and discriminator: as I mentioned before, there is a standard design for image inpainting networks, and the network in this paper is also quite standard. They use a U-Net with the shift-connection layer as the generator, and their discriminator is just the PatchGAN discriminator we discussed previously. Again, interested readers can refer to the paper for the complete description of the architecture.

Third, more comparisons on the Paris StreetView and Places datasets can be found in the paper. You can have a look at them to get a brief idea of the visual quality of today’s deep image inpainting algorithms.

Conclusion

There are two main points in this paper. First, the proposed guidance loss encourages the decoded feature of the missing region (given an input masked image) to be close to the encoded feature of the missing region (given the ground truth image). So, the decoding process should be able to fill in the missing region with a reasonable estimation of the missing region in the ground truth. Second, the proposed shift-connection layer with the shift operation can effectively borrow information from the nearest neighbours outside the missing region to further enhance both the global semantic structure and the local texture details.

Takeaways

This post may be a bit advanced for some of you guys. I think the most important idea of this paper is the shift operation, which can be regarded as a kind of attention technique applied to the task of image inpainting.

Verbally, the shift operation allows the generated features of the missing region to borrow information from the most similar features outside the missing region. As the features outside the missing region have a good global semantic structure and fine local texture details, we can take them as references to refine the generated features. I hope you can get this main idea.

What’s Next?

In the coming post, we will cover another paper that also makes use of the idea of borrowing information from non-missing regions so as to maintain the context of the entire image. I will also talk about how the nearest neighbour searching can be done in the form of a convolutional layer. Really hope that you guys enjoy my posts:)

References

[1] Zhaoyi Yan, Xiaoming Li, Mu Li, Wangmeng Zuo, and Shiguang Shan, “Shift-Net: Image Inpainting via Deep Feature Rearrangement,” Proc. European Conference on Computer Vision (ECCV), 2018.

[2] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros, “What Makes Paris Look Like Paris?,” ACM Transactions on Graphics, 2012.

[3] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba, “Places: A 10 Million Image Database for Scene Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Again, many thanks for reading my post! If you have any questions, please feel free to send me an email or leave comments here. See you next time! :)


DO IT FIRST. ONLY U CAN DEFINE YOURSELF. I started my PhD journey accidentally. To know more about me, visit: https://chutakcode.wixsite.com/website