
Introduction
We have all heard about Generative Adversarial Networks (GANs) and the amazing things that they can do. If you haven’t, be sure to check out this incredibly interesting paper by Ian J. Goodfellow and co-authors that introduced GANs to the world: Generative Adversarial Networks.
One of the many applications of GANs is facial inpainting. In this post, we will go over a rather interesting architecture discussed in an IEEE paper titled Face Inpainting via Nested Generative Adversarial Networks, which can be found here.
Note: Prior knowledge of deep learning concepts like Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) will be required to fully grasp the contents of this article.
Facial inpainting (or face completion) is the task of generating plausible facial structures for missing pixels in a face image. The goal is to produce a legible, visually realistic face image from an input that has a masked region or missing content.
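To make the task concrete, here is a minimal NumPy sketch of how a corrupted input is typically constructed: a region of the face is blanked out and a binary mask records where the hole is. (The square mask and its size are arbitrary choices for illustration, not anything prescribed by the paper.)

```python
import numpy as np

def corrupt(image: np.ndarray, top: int, left: int, size: int):
    """Zero out a square region of an RGB image; return the corrupted
    image plus a binary mask marking the missing pixels (1 = hole)."""
    corrupted = image.copy()
    mask = np.zeros(image.shape[:2], dtype=np.float32)
    corrupted[top:top + size, left:left + size, :] = 0.0
    mask[top:top + size, left:left + size] = 1.0
    return corrupted, mask

# Example: a 128x128 RGB image with a 32x32 hole near the center
face = np.random.rand(128, 128, 3).astype(np.float32)
corrupted, mask = corrupt(face, top=48, left=48, size=32)
```

An inpainting model receives `corrupted` (and often `mask` as well) and must fill the hole with plausible facial structure.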

Nested Generative Adversarial Networks
GANs have a huge number of interesting applications: super-resolution, music generation, 3-D object generation, text generation, text-to-image translation, image-to-image translation, and even domains like cybersecurity. But the most popular and fun application of GANs by far is image generation, since GANs can produce very realistic images. We will look at a generative convolutional model with a nested structure: it consists of a total of 2 generators and 3 discriminators that work together, nesting one GAN inside another to increase performance. The network also uses dilated convolutions and novel residual connections to improve training.
This is what the overall model looks like:

[Figure: overall architecture of the Nested GAN, from the paper]

- The Sub-Confrontation Generation Network –
The Sub-Confrontation Generation Network consists of the code generator and the code discriminator. It takes the corrupted image as input and outputs encoded code information that is decoded in the next stage. This network extracts the robust features of the image and locates the missing or corrupted area. Its output is a blurry version of the image, with the missing area partly fixed and the rest of the image almost intact.
Notice that the code generator and the code discriminator form a GAN by themselves: the two networks form an adversarial pair in which the code generator generates an image and the code discriminator classifies it, and both networks are trained alternately, just like a regular GAN.
In the given paper, the following structure is applied:
The code generator has 5 convolutional and 5 additional dilated convolutional layers. Dilated convolutional layers are used to increase the receptive field of the network without increasing the number of parameters. Novel residual connections are also used here to avoid loss of information between layers.
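As a quick aside on why dilation widens the view: a k × k convolution with dilation rate d has the same receptive field as an ordinary kernel of effective size

$$k_{\text{eff}} = k + (k - 1)(d - 1)$$

so a 3 × 3 kernel with d = 2 covers a 5 × 5 window while still using only nine weights. (The paper's specific dilation rates are not restated in the text quoted here; this is just the general rule.)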
The code discriminator has three convolutional layers and one fully-connected layer.
This is what the structure of the Sub-Confrontation Generation Network looks like:

[Figure: structure of the Sub-Confrontation Generation Network, from the paper]
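The quoted layer counts leave the channel widths, kernel sizes, and dilation rates unspecified, so the following PyTorch sketch should be read as one plausible realization of "5 convolutional + 5 dilated convolutional layers with residual connections" and "3 convolutional layers + 1 fully connected layer", not as the authors' exact configuration:

```python
import torch
import torch.nn as nn

class CodeGenerator(nn.Module):
    """5 plain conv layers, then 5 dilated conv layers with residual
    additions, then a projection back to RGB. This yields the blurry,
    partly repaired image described above. Widths are illustrative."""
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch if i == 0 else ch, ch, 3, padding=1)
             for i in range(5)])
        # padding = dilation keeps the spatial size fixed for 3x3 kernels
        self.dilated = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, padding=d, dilation=d)
             for d in (2, 4, 8, 16, 2)])
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        for conv in self.convs:
            x = self.act(conv(x))
        for dconv in self.dilated:
            x = self.act(dconv(x)) + x          # residual connection
        return torch.sigmoid(self.to_rgb(x))    # blurry reconstruction

class CodeDiscriminator(nn.Module):
    """3 strided conv layers + 1 fully connected layer -> P(real)."""
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, ch, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.LeakyReLU(0.2))
        self.fc = nn.LazyLinear(1)   # the single fully connected layer

    def forward(self, x):
        return torch.sigmoid(self.fc(self.features(x).flatten(1)))
```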

- The Parent-Confrontation Generation Network –
The Parent-Confrontation Generation Network consists of 2 parts: the generation part and the discrimination part.
The generation part includes the Sub-Confrontation Generation Network, which passes its partly fixed image to a second image generator that constructs the final fixed image from it. In this way, two different generators fix the image in two stages, which improves the robustness of the model.
The discrimination part has 2 different discriminators: a global discriminator and a local discriminator. The global discriminator looks at the complete image to judge whether it is genuine as a whole, while the local discriminator focuses only on the corrupted area of the image. The outputs of the two discriminators are combined by a fully connected layer to decide whether the image is real or fake; a rough sketch of this design follows the figure below.
This is what the structure of the Parent-Confrontation Generation Network looks like:

[Figure: structure of the Parent-Confrontation Generation Network, from the paper]
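Again as a hedged sketch rather than the paper's exact design: the global and local branches can each be a small convolutional stack whose pooled features are concatenated and passed through the shared fully connected layer. The stage-2 image generator refining the blurry output is left abstract here (any encoder-decoder would slot in), and the names in the usage comment are hypothetical:

```python
import torch
import torch.nn as nn

class GlobalLocalDiscriminator(nn.Module):
    """The global branch sees the whole image; the local branch sees only
    the repaired region. Their features meet in one fully connected
    layer, as described above. Depths and widths are assumptions."""
    def __init__(self, ch=64):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, ch, 4, 2, 1), nn.LeakyReLU(0.2),
                nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.2),
                nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.LeakyReLU(0.2),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.global_branch, self.local_branch = branch(), branch()
        self.fc = nn.Linear(ch * 4 * 2, 1)    # joint real/fake decision

    def forward(self, full_image, local_patch):
        feats = torch.cat([self.global_branch(full_image),
                           self.local_branch(local_patch)], dim=1)
        return torch.sigmoid(self.fc(feats))

# Two-stage generation at a glance (code_gen / image_gen hypothetical):
#   blurry   = code_gen(corrupted)     # Sub-Confrontation stage
#   restored = image_gen(blurry)       # Parent-Confrontation stage
#   score    = disc(restored, crop_of_repaired_region)
```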

Let us now look at the loss functions and other mathematics used here –
A total of 5 loss functions are used for the 5 networks of the model: The code generator, the code discriminator, the image generator, the global discriminator, and the local discriminator.
This is how the losses are defined in the paper:
- The Code Generator Loss :
"The code generator loss is the entropy deviation of information between the input and output of the code generator. It comes from the reconstruction loss of the structure of code generator, coding loss and the loss of GAN with code discriminator" :

"Where: X is the ground truth, X’ the reconstructed image, C() is the output of the code generator, D-code() is the output of the code discriminator, and MSE() is the pixel-wise mean square error between the two images."
So the code generator is essentially optimizing its prediction of the image's strong features and of the location of the missing area by comparing the real image with the one it generated. It also receives feedback from the code discriminator, which is trained with the generated images labelled as fake. The code generator's objective is therefore to fool the code discriminator to the point where it can no longer tell the real and generated images apart.
- The Code Discriminator Loss :
"The Code Discriminator loss is the distance between the synthesized image and ground truth. It is the discriminator loss in the GAN:"

The code discriminator aims to correctly classify each input image as real or fake. The loss function used here is binary cross-entropy, also known as log loss. When a real image is passed to the discriminator, the coefficient of the second term becomes 0, so the loss reduces to the negative log of the predicted probability that the image is real. Similarly, when a fake image is passed in, the coefficient of the first term becomes 0, and the loss reduces to the negative log of the predicted probability that the image is fake.
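Written out, the binary cross-entropy just described is:

$$\mathcal{L}(y, \hat{y}) = -\big[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\big]$$

where $y$ is the true label (1 for a real image, 0 for a fake one) and $\hat{y}$ is the discriminator's predicted probability that the image is real. Setting $y = 1$ removes the second term and setting $y = 0$ removes the first, exactly as described above.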
To understand this loss function better check out this great article here: Understanding binary cross-entropy / log loss: a visual explanation
- The Image Generator Loss :
"Image generator loss is the loss in the coding reconstruction process and generator loss of the generative adversarial neural network:"

"Where: G() is the output of image generator, Y is the reconstruction result of the broken image, D-GL() is the sum result of the global and local discriminators."
The image generator loss is the overall generator loss of the NGAN, since it is this generator's output image that is the final output of the NGAN. It works just like the code generator loss: the comparison is between the image produced by the image generator and the ground truth, and the adversarial feedback comes from the global and local discriminators of the Parent-Confrontation Generation Network.
- The Global and Local Discriminator Losses :
"Global discriminator loss and local discriminator loss compute the accuracy of distinguishing synthesized image and ground truth. Global discriminator calculates based on the whole image while local discriminator calculates only based on reconstructed area:"


"Where: x′ is the missing area of the corrupted image, y is the corresponding region in the ground truth, D-G(⋅) is the result of the global discriminator and D-L() is the result of the local discriminator."
Again, the global and local discriminators work just like the code discriminator, using the binary cross-entropy function, except that the fake images they receive are the ones generated by the image generator, and the local discriminator looks only at the reconstructed area.
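To tie the five losses together, here is a sketch of what one alternating training step could look like in PyTorch. Everything below is a simplification under stated assumptions: `nets` and `opts` are hypothetical containers holding the five networks and their optimizers, the global and local discriminators are treated as independent scalar classifiers, and term weights, masking details, and data loading are omitted:

```python
import torch
import torch.nn.functional as F

def train_step(x, mask, nets, opts):
    """One alternating NGAN update (sketch). `x`: ground-truth faces,
    `mask`: 1 inside the corrupted region, 0 elsewhere."""
    corrupted = x * (1 - mask)

    # --- 1. update the three discriminators (real -> 1, fake -> 0) ---
    with torch.no_grad():                     # freeze the generators
        blurry = nets.code_gen(corrupted)     # stage-1 (blurry) output
        restored = nets.image_gen(blurry)     # stage-2 (final) output
    for disc, opt, real, fake in [
        (nets.code_disc,   opts.code_disc,   x,        blurry),
        (nets.global_disc, opts.global_disc, x,        restored),         # whole image
        (nets.local_disc,  opts.local_disc,  x * mask, restored * mask),  # hole only
    ]:
        p_real, p_fake = disc(real), disc(fake)
        loss_d = F.binary_cross_entropy(p_real, torch.ones_like(p_real)) + \
                 F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
        opt.zero_grad(); loss_d.backward(); opt.step()

    # --- 2. update the two generators (reconstruct + fool discriminators) ---
    blurry = nets.code_gen(corrupted)
    restored = nets.image_gen(blurry)
    p_code, p_glob = nets.code_disc(blurry), nets.global_disc(restored)
    loss_cg = F.mse_loss(blurry, x) + \
              F.binary_cross_entropy(p_code, torch.ones_like(p_code))
    loss_ig = F.mse_loss(restored, x) + \
              F.binary_cross_entropy(p_glob, torch.ones_like(p_glob))
    opts.code_gen.zero_grad(); opts.image_gen.zero_grad()
    (loss_cg + loss_ig).backward()
    opts.code_gen.step(); opts.image_gen.step()
```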
Conclusion
- NGAN makes use of a nested architecture with one generative network inside another to increase performance.
- It uses a total of 5 networks: 2 generators and 3 discriminators.
- The Sub-Confrontation Generation Network takes in the corrupted image as an input and outputs a blurry image. It has a code generator and a code discriminator.
- The Parent-Confrontation Generation Network takes the blurry image as input and passes it to the image generator, which produces the fixed image; that image is then evaluated by the global and local discriminators.
- The model also implements dilated convolution and novel residual connections to improve training.
I hope you understood and enjoyed all of the concepts explained in this post. Please feel free to reach out with any questions or doubts.
Thanks for reading!
Sources
- Z. Li, H. Zhu, L. Cao, L. Jiao, Y. Zhong and A. Ma, "Face Inpainting via Nested Generative Adversarial Networks," in IEEE Access, vol. 7, pp. 155462–155471, 2019, doi: 10.1109/ACCESS.2019.2949614.