An Intuitive Look at GANs

Learn the intuition behind how GANs work, without the need for complex math equations.

Photo by Mario Gogh on Unsplash

Introduction

GANs (Generative Adversarial Networks) have taken the world of Deep Learning and computer vision by storm since they were introduced by Goodfellow et al. at NIPS in 2014. The main idea of GANs is to train two models simultaneously: a generator model G that generates samples from random noise, and a discriminator model D that determines whether a sample is real or was generated by G.

This post will introduce the intuition behind how GANs work, without delving too deeply into loss functions, probability distributions, and math. The focus is on building a solid top-level understanding of how GANs function. Given the increasing popularity of GANs, it is important that anyone is able to start off their Deep Learning journey without frontloading too much complex information! For the full explanation, Joseph Rocca has a great article about it!

Training a GAN framework is similar to playing a two-player minimax game. G continually improves at generating images that are more realistic and of higher quality, while D improves at determining whether an image was created by G. Training a GAN can be done entirely with backpropagation, which greatly simplifies the training process. Typically, training alternates regularly between G and D in order to prevent a large performance gap from opening up between the two models.
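To make this concrete, below is a minimal sketch of one alternating training step in PyTorch. The toy 1-D data, the tiny MLPs standing in for G and D, and the optimizer settings are all illustrative placeholder choices, not taken from any particular paper:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: G maps 8-d noise to a 1-d sample,
# D maps a 1-d sample to a real/fake probability.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(64, 1) + 4.0  # a batch of 'real' samples, here ~ N(4, 1)
z = torch.randn(64, 8)           # latent noise fed to the generator
fake = G(z)

# Discriminator turn: push D(real) towards 1 and D(fake) towards 0.
opt_d.zero_grad()
loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
loss_d.backward()
opt_d.step()

# Generator turn: update G so that D labels its fakes as real (1).
opt_g.zero_grad()
loss_g = bce(D(fake), torch.ones(64, 1))
loss_g.backward()
opt_g.step()
```

Note the detach() in the discriminator step: it stops the discriminator’s loss from updating G, so each model is only improved on its own turn of the game.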

Generator Model

The generator model usually consists of a series of up-sampling and convolution layers. One common architecture is the DC (Deep Convolutional)-GAN network, presented at ICLR 2016 by Alec Radford et al. The DCGAN framework is shown below. If you have seen other common CNN frameworks, the generator’s structure will look very similar to a standard CNN classifier, just ‘flipped’ horizontally.

DCGAN architecture from https://arxiv.org/abs/1511.06434
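As a rough sketch of what this ‘flipped’ structure looks like in code, here is a minimal DCGAN-style generator in PyTorch. The layer sizes follow the commonly used DCGAN recipe (a 100-dimensional latent vector up-sampled to a 64 x 64 x 3 image), but the class name and exact details are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, z_dim=100, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            # z is treated as a (z_dim, 1, 1) 'image' and up-sampled in stages:
            nn.ConvTranspose2d(z_dim, ch * 8, 4, 1, 0, bias=False),   # -> 4x4
            nn.BatchNorm2d(ch * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 8, ch * 4, 4, 2, 1, bias=False),  # -> 8x8
            nn.BatchNorm2d(ch * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1, bias=False),  # -> 16x16
            nn.BatchNorm2d(ch * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1, bias=False),      # -> 32x32
            nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1, bias=False),           # -> 64x64
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

g = DCGANGenerator()
print(g(torch.randn(2, 100)).shape)  # torch.Size([2, 3, 64, 64])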

The input supplied to the generator network is labelled ‘100z’ in the image. This means that 100 points are sampled, creating a latent vector of length 100. The ‘z’ also indicates that the points are sampled from a unit normal distribution. Hence, we can view the generator network as a function that maps the latent space onto the training data.

We can imagine the 100-dimensional latent space as a fixed distribution based on the Gaussian distribution. The generator network samples random points from this latent space and maps them to the image space (64 x 64 x 3). In the space of all possible images, there exists a smaller subspace that describes the images found in the training data. Through an adversarial loss function, the generator is penalized by the discriminator for creating images that do not belong to the training data distribution (i.e. are not ‘realistic’).

Mapping function of Generator, Image by Author
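Sampling from this latent space is straightforward: each latent vector is simply a draw from a 100-dimensional unit normal distribution, as in this tiny illustrative snippet:

```python
import torch

# A batch of 16 latent vectors, each a 100-dimensional point z drawn from
# the unit normal distribution. A DCGAN-style generator maps each of these
# points to one 64 x 64 x 3 image.
z = torch.randn(16, 100)
print(z.shape)  # torch.Size([16, 100])
```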

Discriminator Model

The discriminator typically has a framework similar to standard CNN classifiers, such as VGG. Its aim is to learn to classify input images as real or fake, depending on whether an image came from the training data or was generated by G. In the diagram below, the discriminator’s aim is to learn the red dotted line. It can then classify real and fake images according to this input data distribution: if a supplied image lies outside the red region, it will be classified as ‘fake’.

Discriminator learning, Image by Author
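As a companion to the generator sketch above, here is a matching minimal DCGAN-style discriminator in PyTorch: a standard down-sampling CNN that squashes a 64 x 64 x 3 image into a single real/fake probability. Again, the layer sizes and class name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DCGANDiscriminator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1, bias=False),                    # -> 32x32
            nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1, bias=False),               # -> 16x16
            nn.BatchNorm2d(ch * 2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1, bias=False),           # -> 8x8
            nn.BatchNorm2d(ch * 4), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 4, ch * 8, 4, 2, 1, bias=False),           # -> 4x4
            nn.BatchNorm2d(ch * 8), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 8, 1, 4, 1, 0, bias=False),                # -> 1x1
            nn.Sigmoid(),  # probability that the input image is real
        )

    def forward(self, x):
        return self.net(x).view(-1, 1)

d = DCGANDiscriminator()
print(d(torch.randn(2, 3, 64, 64)).shape)  # torch.Size([2, 1])
```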

G and D in tandem

In a GAN framework, the G and D models have to be trained together. Improvements in either model eventually lead to better, more realistic generated images. A good discriminator model captures the training data distribution accurately. This gives the generator a good ‘reference’ space, since the generator’s training is highly dependent on the discriminator’s output.

If the discriminator captures the training data distribution poorly, generated images that do not resemble the training images will be classified as ‘real’, and model performance suffers!

Limitations

It is evident that this simple GAN framework is only able to produce images similar to the training data distribution. Hence, a large amount of training data is required! There are also many hurdles in GAN training. One common problem is mode collapse, whereby the generator learns to map multiple latent vectors to one single image. This severely limits the diversity of the generated samples.

Conclusion

In recent years, many GAN variations and developments have addressed these problems. These include improved loss functions and specialized frameworks customized for specific tasks such as super-resolution or image-to-image translation.

