
Generative Adversarial Networks – Learning to Create

A peek into the design, training, loss functions and arithmetic behind GANs

Let’s say we have a dataset of images of bedrooms and an image classifier CNN that was trained on this dataset to tell us whether a given input image is a bedroom or not. Let’s say the images are of size 16×16 and each pixel can have 256 possible values. That gives an astronomically large number of possible inputs (256²⁵⁶, i.e. 256 raised to the power 16×16, or roughly 10⁶¹⁶ possible combinations). Viewed this way, our classifier model is really a high dimensional probability distribution function that gives the probability that a given input from this huge input space is a bedroom.
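To make the counting concrete, here is the arithmetic for a single-channel 16×16 image:

```latex
256^{16 \times 16} = 256^{256} = \left(2^{8}\right)^{256} = 2^{2048} \approx 10^{616}
```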

So, if a classifier can learn this high dimensional data distribution of bedroom images for classification purposes, shouldn’t we be able to leverage the same knowledge to generate completely new bedroom images as well? As it turns out, yes, we can.

While there are multiple approaches to generative modeling, we will explore Generative Adversarial Networks in this post. The original GAN paper was published in 2014, and deep convolutional generative adversarial networks (DCGANs), introduced in a follow-up paper, have been a popular reference since. This post is based on a study of these two papers and aims to provide a good introduction to GANs.

A GAN is a network in which two models, a generative model G and a discriminative model D, are trained simultaneously. The generative model is trained to produce new bedroom images by capturing the data distribution of the training dataset. The discriminative model is trained to correctly classify a given input image as real (i.e. coming from the training dataset) or fake (i.e. a synthetic image produced by the generative model). Simply put, the discriminative model is a typical CNN image classifier, or more specifically, a binary image classifier.

The generative model is a bit different from the discriminative model. Its goal is not classification but generation. While a discriminative model takes an input image and outputs a vector of activations representing different classes, the generative model does the reverse.

Generative vs Discriminative Model

It can be thought of as a reverse CNN: it takes a vector of random numbers as input and produces an image as output, while a normal CNN does the opposite, taking an image as input and producing a vector of numbers or activations (corresponding to different classes) as output.

But how do these different models work together? The image below illustrates the network. First, a random noise vector is fed as input to the generative model, which produces an image as output. We’ll call these generated images fake or synthetic images. The discriminative model then takes both fake images and real images from the training dataset as inputs and classifies each image as fake or real.

An illustration of Generative Adversarial Network
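Here is a minimal sketch of the two roles in PyTorch (not the DCGAN architecture from the paper, which we will see later); the layer sizes, the 64×64 image size and the 100-dimensional noise vector are assumptions for illustration only:

```python
import torch
import torch.nn as nn

latent_dim = 100          # size of the random noise vector z
img_shape = (3, 64, 64)   # assumed image size for this sketch
img_pixels = int(torch.prod(torch.tensor(img_shape)))

# Generator: noise vector in, image out
G = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, img_pixels),
    nn.Tanh(),            # pixel values scaled to [-1, 1]
)

# Discriminator: image in, probability of "real" out
D = nn.Sequential(
    nn.Flatten(),
    nn.Linear(img_pixels, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),
)

z = torch.randn(16, latent_dim)          # a batch of 16 noise vectors
fake_images = G(z).view(-1, *img_shape)  # generated ("fake") images
p_real = D(fake_images)                  # D's estimate that each image is real
```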

Training and optimizing the parameters of this two-model network becomes a two-player minimax game. The goal of the discriminative model is to maximize the correct classification of images as real vs. fake. Conversely, the goal of the generative model is to minimize how often the discriminator correctly classifies a fake image as fake.
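In the notation of the original paper, this game is captured by a single value function V(D, G) that D tries to maximize and G tries to minimize:

```latex
\min_{G} \max_{D} V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] +
\mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```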

Backpropagation is used to train the network parameters just as in a regular CNN, but the fact that there are two models with opposing goals makes its application different. More specifically, the loss functions involved and the number of iterations performed on each model are two key areas where GANs differ.

The loss function of the discriminative model is nothing but the regular cross entropy loss of a binary classifier. Depending on the input image, one of the two terms in the loss is zero, and the result is the negative log of the model’s predicted probability of the correct class. In our context, "y" is 1 for real images (so "1−y" is 1 for fake images), "p" is the predicted probability that the image is real, and "1−p" is the predicted probability that the image is fake.

Cross Entropy Loss for a Binary Classifier
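Written out, the binary cross entropy loss referred to above is:

```latex
L = -\big[\, y \log(p) + (1 - y)\log(1 - p) \,\big]
```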

"p", the probability above, can be represented as D(x), i.e. probability as estimated by discriminator D that image "x" is a real image. Rewritten, it looks like below:

Discriminator’s loss function

Based on how we assigned the labels, for real images the first part of the equation is active and the second part is zero; for fake images it is the other way around. Keeping this in mind, the image "x" in the second part can be replaced by "G(z)". In other words, the fake image is represented as the output of model G given z as input, where "z" is nothing but the random noise vector fed to G to produce G(z). Leaving aside the rest of the notation, this is the same as the loss function for the discriminator D presented in the GAN paper. The signs were confusing at first look, but the algorithm in the paper provides clarity by updating the discriminator by "ascending" its stochastic gradient, which is the same as minimizing the loss function described above. Here’s a snapshot of the function from the paper:

From here
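As a sketch in code (reusing the hypothetical G, D and z from the earlier snippet), the discriminator’s loss is just binary cross entropy applied to a real batch with target 1 and a fake batch with target 0:

```python
import torch
import torch.nn.functional as F

real_images = torch.rand(16, 3, 64, 64)  # stand-in for a batch from the training set

d_real = D(real_images)                          # D(x)
d_fake = D(G(z).view(-1, 3, 64, 64).detach())    # D(G(z)); detach so G is not updated here

# -[log D(x) + log(1 - D(G(z)))], averaged over the batch
d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
         F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
```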

Getting back to the generator G, the loss function for G would be to do the reverse, i.e. to maximize D’s loss function. But the first part of the equation does not have any meaning to the generator, so what we are really saying is that the second part should be maximized. So the loss function of G will be the same as D’s loss function with the sign flipped and first term ignored.

Generator’s loss function

Here’s a snapshot of the generator loss function from the paper:

From here

I was also curious to learn a bit more about the generative model’s internals, since it does something that is intuitively the reverse of a typical image-classifying CNN. As shown in the DCGAN paper, this is achieved through a combination of reshaping and transposed convolutions. Here’s a representation of the generator from the paper:

DCGAN Generator from here
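Below is a sketch of that generator in PyTorch, following the layer sizes in the figure (a 100-dimensional z projected and reshaped to 4×4×1024, then upsampled through stride-2 transposed convolutions to a 64×64×3 image). The kernel sizes and normalization details are assumptions for illustration, not a line-by-line reproduction of the paper’s implementation:

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            # project z into a 4x4x1024 feature map
            nn.ConvTranspose2d(latent_dim, 1024, kernel_size=4, stride=1, padding=0),
            nn.BatchNorm2d(1024), nn.ReLU(),
            # 4x4 -> 8x8
            nn.ConvTranspose2d(1024, 512, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(512), nn.ReLU(),
            # 8x8 -> 16x16
            nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(),
            # 16x16 -> 32x32
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            # 32x32 -> 64x64 with 3 channels; tanh keeps pixels in [-1, 1]
            nn.ConvTranspose2d(128, 3, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, z):
        # z arrives as (batch, latent_dim); reshape to (batch, latent_dim, 1, 1)
        return self.net(z.view(z.size(0), -1, 1, 1))

img = DCGANGenerator()(torch.randn(1, 100))  # -> shape (1, 3, 64, 64)
```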

A transposed convolution is not the inverse of a convolution: it does not recover the input given the output of the original convolution, it only reverses the change in shape. Below is an illustration of the math behind the generator model above, particularly the CONV layers.

Illustration of a regular convolution used in CNNs followed by two examples of up-sampling achieved through transposed convolutions. The result of the first example is used as input in the second and third examples with the same kernel to demonstrate transposing is not the same as deconvolution and is not meant to recover an original input.
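To see the shape arithmetic concretely, the sketch below runs a stride-2 convolution followed by a stride-2 transposed convolution with the same kernel size; the transposed convolution restores the spatial size but, as the illustration above shows, not the original values:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)

down = nn.Conv2d(1, 1, kernel_size=4, stride=2, padding=1)           # 8x8 -> 4x4
up   = nn.ConvTranspose2d(1, 1, kernel_size=4, stride=2, padding=1)  # 4x4 -> 8x8

y = down(x)
x_up = up(y)

print(y.shape)     # torch.Size([1, 1, 4, 4])
print(x_up.shape)  # torch.Size([1, 1, 8, 8]) -- same shape as x, but not the same values
```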

There are some additional interesting points to note from the papers. One is the inner for-loop in the algorithm proposed in the original paper. For k > 1, this means we perform multiple training iterations of the discriminator D for every iteration of G. This ensures that D is sufficiently trained and learns more early on compared to G; we need a good D for G to fool.

From here
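Here is a sketch of that training schedule, reusing the hypothetical G, D and latent_dim from the earlier snippets and assuming a data_loader that yields batches of real 64×64 images:

```python
import torch
import torch.nn.functional as F

k = 1  # discriminator steps per generator step; the paper allows k > 1

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

for real_images in data_loader:  # hypothetical iterable of real image batches
    n = real_images.size(0)

    # --- k discriminator updates per generator update ---
    for _ in range(k):
        z = torch.randn(n, latent_dim)
        fake_images = G(z).view(-1, 3, 64, 64).detach()
        d_loss = F.binary_cross_entropy(D(real_images), torch.ones(n, 1)) + \
                 F.binary_cross_entropy(D(fake_images), torch.zeros(n, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- one generator update: minimize log(1 - D(G(z))) as described above ---
    z = torch.randn(n, latent_dim)
    g_loss = torch.log(1.0 - D(G(z).view(-1, 3, 64, 64))).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```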

The other relevant highlight is the issue of the generator possibly memorizing input examples, which the DCGAN paper addresses with a 3072-128-3072 de-noising dropout-regularized ReLU autoencoder, basically a reduce-and-reconstruct mechanism used to de-duplicate the training data and thereby minimize memorization.

The DCGAN paper also highlights how the generator behaved when it was manipulated to "forget" certain objects within the bedroom images it was generating. The authors did so by dropping the feature maps corresponding to windows from the second-highest convolution layer and showed how the network replaced the window space with other objects.

Additional manipulations based on arithmetic on the noise vector "Z" given as input to the generator were also demonstrated. For example, when the vector that produced a "Neutral Woman" was subtracted from the vector for a "Smiling Woman" and the result added to a "Neutral Man" vector, the resulting vector generated a "Smiling Man" image, putting into perspective the relationship between the input and output spaces and the probability distribution mapping happening between the two.
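A sketch of that arithmetic with hypothetical z vectors (in the paper, each concept vector was an average over several z’s that produced the same visual concept):

```python
import torch

# hypothetical averaged noise vectors for each visual concept
z_smiling_woman = torch.randn(100)
z_neutral_woman = torch.randn(100)
z_neutral_man   = torch.randn(100)

# "smiling woman" - "neutral woman" + "neutral man" ~= "smiling man"
z_smiling_man = z_smiling_woman - z_neutral_woman + z_neutral_man

# feeding the result to the generator would produce the corresponding image:
# img = G(z_smiling_man.unsqueeze(0))
```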

While there are other variants of the algorithms and loss functions seen above, this hopefully provided a reasonable introduction to this fascinating topic.

