Welcome back, guys! I hope the previous posts have aroused your curiosity about deep generative models for image inpainting. If you are new here, I highly recommend skimming through the previous posts [here](https://medium.com/@ronct/review-high-resolution-image-inpainting-using-multi-scale-neural-patch-synthesis-4bbda21aa5bc) and here. As announced in the previous post, today we will dive into another milestone in deep image inpainting! Are you ready? Let’s start 🙂
*Image Inpainting and Image Completion represent the same task
Recall
Here is a short recap of what we have learnt previously.
- For image inpainting, texture details of the filled pixels are important. The valid pixels and the filled pixels should be consistent and the filled images should look realistic.
- Roughly speaking, researchers adopt a pixel-wise reconstruction loss (i.e. L2 loss) to ensure that the missing parts are filled in with the "correct" structure. On the other hand, a GAN loss (i.e. adversarial loss) and/or a texture loss is used to obtain filled images with sharper texture details in the generated pixels.
Motivation
![Figure 1. An example to show the need of generating novel fragments for the task of image inpainting. Extracted and modified from [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1FoABBYYiYcwNguXG9dVZzw.png)
- For patch-based methods, one strong assumption is that similar patches can be found outside the missing regions and that these patches are useful for filling in the missing regions. This assumption may hold for natural scenes, as sky and lawn can have many similar patches within an image. But what if there are no similar patches outside the missing regions, as in the face image inpainting case shown in Figure 1? In that case, we cannot find any eye patches to fill in the corresponding missing parts. Therefore, robust inpainting algorithms should be able to generate novel fragments.
- Existing GAN-based inpainting approaches make use of a discriminator (adversarial loss) to enhance the sharpness of the filled region by feeding the filled region to the discriminator (i.e. trying to fool it). Some also compare the local neural responses inside and outside the missing regions using a pre-trained network, to ensure similar texture details for local patches inside and outside the missing regions. What if we consider both the local and global information of an image to enforce local and global consistency? Will we obtain better completed images? Let’s see.
![Figure 2. Network architecture of Context Encoder. Extracted from [2]](https://towardsdatascience.com/wp-content/uploads/2020/10/0EQoA5NuL9Tym2OKb.png)
- How do we deal with high-resolution images? We have previously talked about the first GAN-based inpainting approach, Context Encoder [here]. It assumes that the test images are always 128×128 with a 64×64 center missing hole. We have also covered an improved version of Context Encoder, called Multi-Scale Neural Patch Synthesis, in the previous post. It suggests a multi-scale way to handle test images up to 512×512 with a 256×256 center missing hole. In short, it employs three networks for images at three scales, namely 128×128, 256×256, and 512×512. As a result, speed is the bottleneck of that method: it takes roughly 1 minute to fill in a 512×512 image on a Titan X GPU. So, here is an interesting question: how can we deal with high-resolution images with just one single forward pass through our network? I’ll give you a few seconds to think about it; you may find some hints in the architecture shown in Figure 2 (pay attention to the middle layer). The quick answer is to remove the middle fully-connected layer and employ a Fully Convolutional Network! You will see how and why very soon.
Introduction
- Most existing methods assume that similar image patches can be found in the same image to fill in the missing parts. This is not always true; see Figure 1 for such a case. To be more accurate, we should look at the entire image to understand its context, and then fill in the missing parts based on that context.
- If a fully-connected layer is employed, the input image size has to be fixed, so the network cannot handle test images at different resolutions in one single forward pass. Recall that a fully-connected layer connects all neurons between two layers; it is therefore sensitive to changes in the output sizes of the previous layers, which means the test image size must be fixed. A convolutional layer, on the other hand, has no such full connection between neurons: a smaller input feature map simply results in a smaller output feature map. So, if a network consists of only convolutional layers, it can handle input images of various sizes. We call this kind of network a Fully Convolutional Network (FCN). A small sketch illustrating this property follows this list.
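To make this size-flexibility concrete, here is a minimal PyTorch sketch (my own toy example, not the paper’s architecture): a tiny conv-only network happily accepts inputs of different resolutions, whereas appending a fully-connected layer ties the network to one fixed input size.

```python
import torch
import torch.nn as nn

# A toy fully convolutional network: only conv layers, so no fixed input size.
fcn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, kernel_size=3, padding=1), nn.Sigmoid(),
)

for size in (128, 256, 512):
    x = torch.randn(1, 3, size, size)
    y = fcn(x)
    print(size, "->", tuple(y.shape))   # output spatial size follows the input

# If we flatten and add a fully-connected layer, its weight matrix is sized
# for one specific input resolution, so other sizes raise a shape error.
fc_head = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, 10))
fc_head(torch.randn(1, 3, 128, 128))        # works
# fc_head(torch.randn(1, 3, 256, 256))      # would raise a shape mismatch error
```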
Solution (in short)
- Employ dilated convolutions instead of a fully-connected layer, so that we can still capture the context of an image while building a Fully Convolutional Network (FCN) that handles different image sizes.
- Employ two discriminators to ensure the local and global consistency of the completed (filled) images. One discriminator looks at the whole image in a global sense, while the other looks at a sub-image around the filled region in a local sense.
- Employ simple post-processing. It is sometimes easy to spot the difference between the generated pixels and the valid pixels. In order to further enhance the visual quality, the authors adopt two conventional techniques, namely the Fast Marching method and Poisson image blending. These two techniques are out of the scope of this post; interested readers may click the hyperlinks to learn more (a rough OpenCV illustration follows this list). Later on, this post-processing step was, to some extent, embedded into the network itself in the form of a refinement network. We will cover that in later posts.
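For curious readers, both classical tools are available in OpenCV. The sketch below is my own illustration of how they can be called; it is not the authors’ exact post-processing pipeline, and the file names are hypothetical.

```python
import cv2
import numpy as np

completed = cv2.imread("completed.png")                    # network output (hypothetical files)
original = cv2.imread("input_with_hole.png")
mask = cv2.imread("hole_mask.png", cv2.IMREAD_GRAYSCALE)   # 255 inside the hole, 0 outside

# Fast Marching inpainting (Telea's method) over the masked region.
fm_result = cv2.inpaint(completed, mask, 3, cv2.INPAINT_TELEA)

# Poisson image blending: seamlessly blend the completed content back into the original.
ys, xs = np.where(mask > 0)
center = (int(xs.mean()), int(ys.mean()))                  # rough center of the hole
blended = cv2.seamlessClone(completed, original, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("post_processed.png", blended)
```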
Contributions
- Propose a Fully Convolutional Network with dilated convolutions for image inpainting. It allows the network to understand the context of an image without using fully-connected layers, so the trained network can be applied to images of diverse sizes. This architecture actually forms the basis of later deep learning-based image inpainting approaches, which is why I consider this paper a milestone in inpainting.
- Suggest using two discriminators (one local and one global). Multi-scale discriminators seem to provide better texture details in the completed images at various scales.
- Emphasize the importance of generating novel fragments in the task of Image Inpainting. Actually, the training data is extremely important. Simply speaking, you cannot generate something you haven’t seen before!
Approach
![Figure 3. Overview of the proposed architecture. Extracted from [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/17c9-z1DNVzQ4L265bB7OJg.png)
- Figure 3 shows the overall proposed network architecture. It consists of three networks, namely the Completion Network (i.e. the generator, used in both training and testing), the Local Discriminator and the Global Discriminator (used only during training as auxiliary networks for learning). A quick recap of this GAN framework: the generator is responsible for completing images so as to fool the discriminators, while the discriminators are responsible for distinguishing completed images from real images.
Dilated Convolution in CNNs
The concept of dilated convolution is important for understanding the network design of this paper, so I will try my best to explain it for readers who are not familiar with it. For readers who know it very well, please treat this as a quick review.

- In the paper, the authors spend half a page describing CNNs and standard and dilated convolution, and they also provide the convolution equation for reference. One point I have to clarify is that dilated convolution was not proposed by the authors of this paper; they employ it for image inpainting.
- Here, I just want to use a simple figure to illustrate the difference between standard and dilated convolution.
- Figure 4(a) is a standard convolution layer with a 3×3 kernel, stride = 1, padding = 1 and dilation rate = 1. In this setting, an 8×8 input gives an 8×8 output, and each output element is computed from its 9 neighbouring input locations.
- Figure 4(b) is also a standard convolution layer. This time we use a 5×5 kernel, stride = 1, padding = 2 (to keep the input and output sizes the same) and dilation rate = 1. In this case, each output element is computed from 25 neighbouring input locations. This means that for each value at the output, we look at more of the input. We usually refer to this as a larger receptive field. With a larger receptive field, features from more distant spatial locations are taken into account to compute each output value.
- However, for the case in Figure 4(b), we use a larger kernel (5×5) to attain the larger receptive field. This means that more parameters have to be learnt (5×5 = 25 compared to 3×3 = 9). Is there any way to increase the receptive field without adding more parameters? The answer is dilated convolution.
- Figure 4(c) is a dilated convolution layer with a 3×3 kernel, stride = 1, padding = 2 and dilation rate = 2. Comparing the coverage of the kernels in Figure 4(b) and (c), we can see that both cover a 5×5 local spatial region at the input. A 3×3 kernel can attain the same receptive field as a 5×5 kernel by skipping consecutive spatial locations, and the size of the skip is determined by the dilation rate. For example, a 3×3 kernel with dilation rate = 2 gives a 5×5 receptive field, a 3×3 kernel with dilation rate = 3 gives a 7×7 receptive field, and so on. In short, dilated convolution increases the receptive field without adding parameters, by skipping consecutive spatial locations. The advantage is a larger receptive field with the same number of parameters; the disadvantage is that we skip some locations (and may lose some information because of this). A small sketch comparing the two is given right after this list.
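Here is a minimal PyTorch sketch (my own toy example) comparing a standard 5×5 convolution with a 3×3 dilated convolution: both cover a 5×5 neighbourhood per layer, but the dilated one uses far fewer parameters. The effective kernel size follows k_eff = k + (k − 1)(d − 1).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)

# Standard 5x5 convolution: 5x5 receptive field per layer, 16*16*5*5 weights.
conv5 = nn.Conv2d(16, 16, kernel_size=5, stride=1, padding=2, dilation=1)

# Dilated 3x3 convolution with rate 2: also a 5x5 receptive field per layer,
# but only 16*16*3*3 weights (plus bias). Effective kernel = 3 + (3-1)*(2-1) = 5.
dil3 = nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=2, dilation=2)

print(conv5(x).shape, dil3(x).shape)                 # both keep the 64x64 spatial size
print(sum(p.numel() for p in conv5.parameters()))    # 6416 parameters
print(sum(p.numel() for p in dil3.parameters()))     # 2320 parameters
```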
Why Dilated Convolution?
After reviewing the concept of dilated convolution, let’s talk about why the authors employ it in their model. Some of you may have already guessed the reasons. Let’s check!
- As mentioned before, understanding the context of the entire image is important for image inpainting. Previous approaches employ a fully-connected layer as the middle layer in order to capture this context. Remember that a standard convolution layer operates on local regions, while a fully-connected layer connects all the neurons (i.e. each output value depends on all the input values). However, a fully-connected layer imposes a limitation on the input image size and introduces many more learnable parameters.
- To overcome these limitations, dilated convolution is used to construct a fully convolutional network that accepts inputs of various sizes. At the same time, by adjusting the dilation rate of a standard kernel (usually 3×3), we get larger receptive fields at different layers, which helps the network understand the context of the entire image.
![Figure 5. Effect of different sizes of receptive field. Extracted and modified from [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1-dsUUjOr9UNiAw_iMszL9Q.png)
- Figure 5 is an example to show the usefulness of dilated convolution. You may think that (a) is standard convolution with a 3×3 kernel (smaller receptive field) and (b) is dilated convolution with a 3×3 kernel and dilation rate≥2 (larger receptive field). Locations p1 and p2 are inside the hole region where p1 is close to the boundary and p2 is roughly at the center point. For (a), you can see that the receptive field (influencing region) at location p1 can cover the valid region. This means that valid pixels can be used to help filling in the pixel at location p1. On the other hand, the receptive field at location p2 cannot cover the valid region, hence no information from valid region can be used for generation.
- For (b), we use dilated convolution to increase the receptive field. This time, the receptive fields at both locations can cover the valid region. Readers can now see the effectiveness of dilated convolution.
Completion Network
Let’s go back to the structure of the Completion Network as shown in Figure 3.
![Table 1. Structure of the Completion Network. Each convolution layer is followed by ReLU except for the last one which is followed by a Sigmoid [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1Oljl1QfV4vKVEA7e7YTPSg.png)
- The Completion Network is a Fully Convolutional Network which accepts input images of different sizes.
- The network downsamples the input twice, each time by a factor of 2. This means that if the input is 256×256, the feature maps at the middle layers are 64×64.
- In order to make full use of the valid pixels and ensure pixel-wise accuracy, the authors replace the output pixels outside the hole region(s) with the original valid pixels (a rough sketch of this network pattern and the paste-back step follows this list).
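The sketch below is my heavily simplified PyTorch reading of Table 1 (far fewer channels and layers than the real network, and details such as filling the hole with zeros are my own choices). It shows the overall pattern: downsample twice, process with dilated convolutions in the middle, upsample back, and finally paste the valid pixels back outside the hole.

```python
import torch
import torch.nn as nn

class TinyCompletionNet(nn.Module):
    """Heavily simplified stand-in for the completion network (not the paper's exact layers)."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(                       # downsample by 2, twice
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.middle = nn.Sequential(                       # dilated convs to grow the receptive field
            nn.Conv2d(64, 64, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=4, dilation=4), nn.ReLU(),
        )
        self.decode = nn.Sequential(                       # upsample back to the input resolution
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, image, mask):
        # mask: 1 inside the hole, 0 outside; the mask is given as an extra input channel.
        x = torch.cat([image * (1 - mask), mask], dim=1)
        out = self.decode(self.middle(self.encode(x)))
        # Paste back: keep the original valid pixels, use generated pixels only inside the hole.
        return mask * out + (1 - mask) * image

net = TinyCompletionNet()
img = torch.rand(1, 3, 256, 256)
msk = torch.zeros(1, 1, 256, 256); msk[:, :, 96:160, 96:160] = 1.0
print(net(img, msk).shape)   # torch.Size([1, 3, 256, 256])
```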
Context Discriminators
Let’s talk about the local and global discriminators. Nothing special here; they work just like the usual single-discriminator case. The only difference is that we have two of them this time.
![Table 2. Structure of the Local and Global discriminators. FC stands for Fully-Connected layer. The final FC at Concatenation layer (c) is followed by a Sigmoid [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1ZevmwSWQjQU-hBjRNqx2JA.png)
- The local and global discriminators share almost the same architecture. The global discriminator takes the entire 256×256 image as input (for global consistency), while the local discriminator takes a 128×128 patch around the center of the missing region (for local consistency).
- One point to note is that during training there is always a single missing region, whereas during testing there can be multiple missing regions in an image. Apart from that, for real images fed to the local discriminator, a 128×128 patch is selected at random, since real images have no filled region (a rough sketch of the two-discriminator setup follows this list).
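As described in Table 2, the feature vectors from the two discriminators are concatenated and passed through a final fully-connected layer with a sigmoid to produce a single real/fake score. A minimal sketch of that wiring (simplified; the channel counts and layer sizes are my own, not the paper’s exact configuration) might look like this:

```python
import torch
import torch.nn as nn

def conv_stack(in_ch, n_layers):
    """A stack of stride-2 convolutions halving the spatial size at each layer (simplified)."""
    layers, ch = [], in_ch
    for _ in range(n_layers):
        layers += [nn.Conv2d(ch, 64, 5, stride=2, padding=2), nn.ReLU()]
        ch = 64
    return nn.Sequential(*layers)

class ContextDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.global_d = conv_stack(3, 6)            # 256 -> 4 after six stride-2 convs
        self.local_d = conv_stack(3, 5)             # 128 -> 4 after five stride-2 convs
        self.global_fc = nn.Linear(64 * 4 * 4, 512)
        self.local_fc = nn.Linear(64 * 4 * 4, 512)
        self.head = nn.Sequential(nn.Linear(1024, 1), nn.Sigmoid())

    def forward(self, full_image, local_patch):
        g = self.global_fc(self.global_d(full_image).flatten(1))
        l = self.local_fc(self.local_d(local_patch).flatten(1))
        return self.head(torch.cat([g, l], dim=1))  # single real/fake probability

d = ContextDiscriminator()
score = d(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 128, 128))
print(score.shape)   # torch.Size([1, 1])
```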
Training Strategy and Loss Function
- Same as before, two loss functions are used to train the network, namely an L2 loss and an adversarial (GAN) loss, roughly written out below.
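Written out (my reconstruction from the paper’s description, using M_c for the hole mask and ⊙ for element-wise multiplication), the reconstruction loss looks roughly like:

```latex
L(x, M_c) = \left\lVert M_c \odot \big( C(x, M_c) - x \big) \right\rVert^2
```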

- C(x, M_c) denotes the completion network as a function, where x is the input image and M_c is a binary mask indicating the missing region (1 for the hole region, 0 for the outside region). You can see that the L2 loss is computed only within the hole region. Note that the pixels of the completed images outside the hole are directly replaced by the valid pixels.
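Again as a rough reconstruction from the paper’s description (D takes an image plus a mask M_d selecting the patch for the local discriminator), the adversarial objective is the standard GAN min-max game:

```latex
\min_{C} \max_{D} \; \mathbb{E} \Big[ \log D(x, M_d) + \log \big( 1 - D(C(x, M_c), M_c) \big) \Big]
```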

- D(x, M_d) denotes the two discriminators as a single function, where M_d is a random mask selecting the image patch for the local discriminator. This is a standard GAN loss: we want the discriminators to be unable to distinguish completed images from real images, so that the completed images acquire realistic texture details.
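Putting the two together (again my reconstruction, with α weighting the GAN terms against the L2 term), the joint objective is roughly:

```latex
\min_{C} \max_{D} \; \mathbb{E} \Big[ L(x, M_c) + \alpha \log D(x, M_d) + \alpha \log \big( 1 - D(C(x, M_c), M_c) \big) \Big]
```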

- This is the joint loss function used to train the network. α (alpha) is a weighting hyper-parameter that balances the L2 loss and the GAN loss.
![Algorithm 1. The proposed training procedure [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1v_o4tmsJANoN4hQxtMrT_Q.png)
- The authors split training into three phases: i) train the completion network with just the L2 loss for T_C iterations; ii) fix the completion network and train the discriminators using the GAN loss for T_D iterations; iii) train the completion network and the discriminators alternately until the end of training.
- For stable training, Batch Normalization (BN) is employed after all convolutional layers except for the last layers of the completion network and the discriminators.
- To generate training data, they randomly resize each image so that its smallest edge lies in the [256, 384] pixel range. Then, they randomly crop a 256×256 patch as the input image. For the mask, they randomly generate a hole whose width and height each range from 96 to 128 pixels (a small sketch of this data generation appears after this list).
- Simple post-processing: as mentioned, the authors also employ conventional Fast Marching method followed by Poisson image blending to further enhance the visual quality of the completed images.
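Here is a minimal sketch of that data preparation (my own illustration using torchvision-style operations; details such as interpolation are assumptions):

```python
import random
import torch
import torchvision.transforms.functional as TF

def make_training_pair(img):
    """img: a 3xHxW tensor in [0, 1]. Returns a 256x256 crop and a random hole mask."""
    # Randomly resize so the smallest edge falls in [256, 384].
    target = random.randint(256, 384)
    img = TF.resize(img, target)                      # resizes the smaller edge to `target`
    # Random 256x256 crop.
    _, h, w = img.shape
    top, left = random.randint(0, h - 256), random.randint(0, w - 256)
    crop = img[:, top:top + 256, left:left + 256]
    # Random rectangular hole with each edge in [96, 128].
    hole_h, hole_w = random.randint(96, 128), random.randint(96, 128)
    y, x = random.randint(0, 256 - hole_h), random.randint(0, 256 - hole_w)
    mask = torch.zeros(1, 256, 256)
    mask[:, y:y + hole_h, x:x + hole_w] = 1.0         # 1 inside the hole, 0 outside
    return crop, mask

crop, mask = make_training_pair(torch.rand(3, 300, 400))
print(crop.shape, mask.shape)   # torch.Size([3, 256, 256]) torch.Size([1, 256, 256])
```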
Experiments
- The authors train their network using 8,097,967 training images from the Places2 dataset [3]. The alpha weighting hyper-parameter in the joint loss function is set to 0.0004 and the batch size is 96.
- From the paper, the completion network is trained for T_C = 90,000 iterations; the discriminators are then trained for T_D = 10,000 iterations; finally, all the networks are jointly trained for 400,000 iterations. The authors report that the entire training procedure takes around 2 months on a machine with 4 K80 GPUs.
![Table 3. Timing of the proposed method [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1mbG5cMhVX9ws11nPkhl0TA.png)
- They evaluate on both CPU and GPU, using an Intel Core i7-5960X CPU @ 3.00 GHz with 8 cores and an NVIDIA GeForce TITAN X GPU. The method is actually quite fast: slightly more than half a second to complete a 1024×1024 image.
![Figure 6. Comparisons with existing inpainting approaches [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1s8fTWH5KhWJ6jqqtx8KCGA.png)
- Figure 6 shows the comparisons with some existing approaches. Overall, patch-based methods are able to fill the hole with locally consistent image patches, but the results may not be globally consistent with the entire scene. The recent GAN-based method, Context Encoder (5th row), tends to produce blurry completed images. The proposed method offers both locally and globally consistent completed images.
![Figure 7. Comparison with Context Encoder (CE) for center missing hole. Both CE and Ours (CM) are trained using the same 100k subset of training images from ImageNet for center hole filling [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1GE45VoCRfHwdbn4d5qXOtw.png)
- To compare with the state-of-the-art GAN-based inpainting approach, the authors perform center region completion and the results are shown in Figure 7. It can be seen that CE performs better for center region completion than arbitrary region completion (Figure 6). In my opinion, CE and the proposed method have similar performance in Figure 7. Readers may zoom in to see the differences.
![Figure 8. The effect of different discriminator settings [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1zkcHrivN8xNmsOY-s3qKgw.png)
- The authors provide an ablation study on the two discriminators. From Figure 8(b) and (c), when the local discriminator is not used, the completed regions look more blurry. On the other hand, in (d), using only the local discriminator gives locally consistent texture details, but global consistency is not guaranteed. The full method in (e) achieves results with both local and global consistency.
![Figure 9. The effect of the simple post-processing [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1n8bYEi7EiiHIjcaoVUaTog.png)
- Figure 9 shows the effect of the simple post-processing. For Figure 9(b), we can easily observe the boundary.
![Figure 10. Inpainting results of training with different datasets [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1Xexe65J1OGWdbCrvC8k8fA.png)
- Figure 10 shows the inpainting results from models trained on different datasets. Note that Places2 consists of roughly 8 million training images of diverse scenes, while ImageNet contains around 1 million training images for object classification. We can see that the results from the model trained on Places2 are slightly better than those from the model trained on ImageNet.
![Figure 11. Examples of object removal by the proposed method [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/16UnpFUnkIpsK9a3PPSNf_w.png)
- One potential application of image inpainting is object removal. Figure 11 shows some examples of object removal by using the proposed method.
![Figure 12. Results for more specific datasets, namely for faces and facades [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1ss18mM9gieY8p1rMfl312g.png)
- The authors also consider domain-specific image inpainting. They fine-tuned their pre-trained model on the CelebA dataset [4] (for face inpainting) and the CMP Facade dataset [5] (for facade inpainting), which consist of 202,599 and 606 images respectively. Starting from the completion network pre-trained on Places2, they trained the discriminators from scratch on the new dataset, and then trained the completion network and the discriminators alternately together.
- Figure 12 shows some inpainting results given by the proposed method for domain specific image inpainting. For face inpainting, the proposed method is able to generate novel fragments such as eyes and mouths. For facade inpainting, the proposed method is also able to generate fragments like windows that are locally and globally consistent with the entire image.
- The authors also conducted a user study on the completed face images. The results show that 77.0% of the faces completed by the proposed method were regarded as real by the 10 users, while 96.5% of the real faces were correctly identified as real.
Limitations and Discussion
Here are some points about the limitations and future directions listed by the authors.
![Figure 13. Failure cases i) The mask is at the border of the image ii) The scene is complicated. Extracted from [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1ngN4_ij5MWwFR6AespUOkw.png)
- For the case on the left of Figure 13, the missing part is at the border of the upper image. The authors note that less information can be borrowed from neighbouring locations in such a case, hence the GAN-based methods (3rd and 4th rows) perform worse than the conventional patch-based method (2nd row). Another reason is that this example is a natural scene, where patch-based methods can work well.
- For the case on the right of Figure 13, the scene is much more complicated. According to the mask, we want to remove a person, so we have to fill in details of the buildings to complete this complicated scene. In such a case, none of the methods works properly. Filling in missing parts in complicated scenes therefore remains challenging.
![Figure 14. Examples to show the importance of generating novel fragments and we can only generate what we have seen before during training. Extracted from [1]](https://towardsdatascience.com/wp-content/uploads/2020/10/1k2EXh95B5sEwmohkxZUT3Q.png)
- The authors provide additional examples to emphasize two more points. i) The importance of generating novel fragments such as eyes, noses, mouths. ii) The importance of the training dataset.
- For cases where we cannot find similar image patches to fill in the missing parts, patch-based methods (2nd and 3rd rows) do not work properly, as shown in Figure 14. So, a robust inpainting algorithm must be able to generate novel fragments.
- To further show the importance of the choice of training dataset, the authors compare two models trained on Places2 (a general dataset, (d)) and CelebA (a face dataset, (e)). It is obvious that (d) cannot fill in the missing parts with reasonable facial details, as it is trained on Places2 which does not contain aligned face images. On the other hand, (e) works well as it is trained on CelebA, a dataset with many aligned face images. Therefore, we can only generate what we have seen during training; robust general inpainting still has a long way to go.
Conclusion
- The proposed architecture acts as the basis of most later inpainting papers. A Fully Convolutional Network with dilated convolutions allows us to understand the context of an image without using fully-connected layers, so the network can take input images of various sizes.
- Multi-scale discriminators (here we have two discriminators; some later works even use three!) are useful for enhancing the texture details of the completed images at different scales.
- It is still challenging to fill in the missing parts when the scene is complicated. On the other hand, natural scenes are relatively easy to complete.
Takeaways
Here, I would like to list out some points that are useful for the future posts.
- Remember that the Fully Convolutional Network with dilated convolutions is a typical network structure for image inpainting. It allows inputs of different sizes and serves a similar function to fully-connected layers (i.e. helping the network understand the context of an image). If you want, you may jump to [here] for a review of a recent inpainting paper to see a variation of this typical structure.
- Actually, face image inpainting is relatively simpler than general image inpainting. This is because for face inpainting we always train a model on a face dataset consisting of many aligned face images. For general image inpainting, we have to train on a much more diverse dataset such as Places2, which contains millions of images from various categories such as urban scenes, buildings, and many others. It is much more difficult for a model to learn to generate all of these with good visual quality. Anyway, there is still a long way to go.
What’s Next?
- So far, we have dived into three very good early inpainting papers. Next time, I would like to do a short revision: I will talk about another paper which employs this Fully Convolutional Network with dilated convolutions. I hope that you can see the development of deep learning-based image inpainting. Enjoy! 🙂
References
- Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa, "Globally and Locally Consistent Image Completion," ACM Trans. on Graphics, Vol. 36, No. 4, Article 107, Publication date: July 2017.
- Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros, "Context Encoders: Feature Learning by Inpainting," Proc. Computer Vision and Pattern Recognition (CVPR), 27–30 Jun. 2016.
- Places2 dataset, http://places2.csail.mit.edu/download.html
- Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang, "Deep Learning Face Attributes in the Wild," Proc. Computer Vision and Pattern Recognition (CVPR), 2015.
- Radim Tyleček, and Radim Šára, "Spatial Pattern Templates for Recognition of Objects with Regular Structure," Proc. German Conference on Pattern Recognition, 2013.
Thanks for spending your time on this post. If you have any questions, please feel free to ask or leave a comment here. See you next time! 🙂