Fulfillment lies in creating something.

This blog post aims to provide an intuitive understanding of Generative Adversarial Networks, the mathematics that goes with them, and a TensorFlow implementation.

Abhinav Singh
Towards Data Science


Note: The reader should have a basic understanding of deep learning.

In 2014, Ian Goodfellow, then a Ph.D. student at the University of Montreal, introduced a paper titled Generative Adversarial Nets, together with his mentors at Mila, including Yoshua Bengio and Aaron Courville. The paper has gathered well over 5,000 citations and has been discussed by all the major pioneers of deep learning. Read below what Yann LeCun has to say about GANs.

Generative Adversarial Networks is the most interesting idea in machine learning in the last ten years — Yann LeCun (Facebook AI Director)

This is a clear indication that Generative Adversarial Networks do something really interesting. From now on, we will refer to Generative Adversarial Networks as GANs. GANs are interesting because they can learn to approximate a data distribution (i.e., mimic real data), which is powerful: they can generate images, audio clips, and videos that do not exist in reality. In a sense, you can think of GANs as artists. One famous quote by Richard Feynman captures the intuition behind GANs.

What I cannot create, I do not understand — Richard Feynman( American theoretical physicist)

The results generated by GANs are pretty powerful. In October 2018, Christie's auction house sold a portrait (fig 1.0), generated by a GAN built by the French collective Obvious, for $432,500.

fig 1.0 Portrait generated by a GAN. source: here

Why Generative adversarial models?

This section summarizes the OpenAI blog post on generative models. The idea is that if GANs become better at creating data, they will also come to understand the data present in the world better than any other algorithm, because in order to generate convincing samples they must learn the underlying representations of that data. GANs belong to the class of generative models and are based on the approach of differentiable generator networks, so they are one way to achieve this; some other popular ways are:

  1. Variational Autoencoders.
  2. RNNs such as LSTMs, or PixelRNN.
  3. RNN-RBMs for sequential output.
  4. Convolutional Boltzmann machines, and many more.

More on these models later, but the idea remains the same: all of these architectures try to learn the underlying representations that define the data. Generative models belong to an active area of research called representation learning, i.e., learning representations from unlabeled data. All of the above work differently, but each of them nonetheless tries to learn a probability distribution.

How Do GANs Work?

The idea behind GANs is intuitive: they consist of two networks, a Discriminator, which estimates the probability that a sample is real, and a Generator, which tries to generate data close to the real data. Over time the Discriminator gets better at discriminating and the Generator gets better at producing samples that look more like real ones (see fig 1.1 for an illustration). The goals of the two models are opposed: the Generator tries to maximize the probability of the Discriminator making a mistake, while the Discriminator tries its best to predict the correct label, real or fake. It becomes even more interesting when you realize that each model is helping the other get better.

fig 1.1 Block diagram of a GAN. source: here

Consider fig 1.2. Here p(x) (the blue curve on the right) is the true distribution of the images, and pθ(x) (the green curve) is the distribution of points sampled from a Gaussian and passed through our neural network. Initially the generated points are completely random, but as training proceeds the model parameters θ are optimized to minimize the difference between the two distributions (using KL divergence or some other measure).

fig 1.2 source: here

The Generator and Discriminator can share the same architecture or use different ones. In the original GAN paper, a multilayer perceptron was used as the generator, taking random noise as input, and another multilayer perceptron was used as the discriminator, detecting whether a sample is generated or real. In a later section of this article we will build our own generative network using a convolution-based architecture known as DCGAN (Deep Convolutional Generative Adversarial Networks), but first let's put the mathematics behind GANs in its correct place. I will try to keep the math as simple as possible, but make sure your basics are in order.

Formulating the objective function

The goal is to learn the generator's distribution Pg, which models the real data x, by passing random noise z through a neural network G(z; θg), where θg are the parameters of the generator network (think of any neural network as a universal function approximator). We also define another neural network D(x; θd), the discriminator, which outputs a probability D(x) (between 0 and 1) that x came from the real data rather than from Pg. G is trained to minimize log(1 − D(G(z))), while D is trained to predict the correct label. The minimization term for G, log(1 − D(G(z))), makes sense because initially D(G(z)) will be low: the generator has barely started generating samples, so it is easy for the discriminator to tell real from fake, and log(1 − D(G(z))) is large. So while G's objective is to minimize this quantity, D's objective is to maximize it; the whole situation is a min-max game over a value function V(D, G).

fig 1.3 Loss function of GANs. source: here
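Written out, the objective shown in fig 1.3 is the following min-max value function (this is the eq. (1) referred to in the training steps later in the post):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big] \qquad (1)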

Breaking down the above equation:

First, Ex∼pdata(x) [log D(x)] — the expectation, over real samples x, of the log-probability that the discriminator classifies x as real. The discriminator wants to maximize this term.

Second, Ez∼pz(z) [log(1 − D(G(z)))] — the expectation, over noise samples z, of the log-probability that the discriminator correctly labels the generated sample G(z) as fake. The discriminator wants to maximize this term as well, while the generator wants to minimize it.

The above equation is essentially a min-max game, where one player tries to minimize the objective and the other tries to maximize it. A loose analogy is L2 regularization in machine learning: the data-fitting term pulls the weights toward whatever values fit the training data, while the regularization term pushes them toward zero, and the optimizer has to balance these two opposing pressures, which generally results in small weight values. The difference is that in GANs both the minimization and the maximization are carried out explicitly by optimizers, one for each network.

Ideally, at the end of training the best the discriminator can do is output D(x) = 0.5, meaning it can no longer tell real samples from fake ones. Formally, for a fixed generator the optimal discriminator is

D∗(x) = Pdata(x) / (Pdata(x) + Pg(x)),

which equals 0.5 everywhere when Pg = Pdata.
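This expression for D∗(x) comes from maximizing the value function pointwise for a fixed generator, as in the original paper. In LaTeX notation:

V(G, D) = \int_x \Big[\, p_{data}(x)\,\log D(x) + p_g(x)\,\log\big(1 - D(x)\big) \Big]\, dx

For each fixed x the integrand has the form a \log y + b \log(1 - y), which is maximized over y \in [0, 1] at y = a/(a + b); substituting a = p_{data}(x) and b = p_g(x) gives D∗(x) above, and when Pg = Pdata this optimum is exactly 1/2.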

The idea of simultaneously training D(x; θd) and G(z; θg) looks good in theory, but in practice we usually train D(x; θd) for k steps for every single training step of G(z; θg), so that the discriminator stays stronger than the generator.

Training Generative Adversarial Networks

Consider the figure given below.

fig 1.4 source here

The idea is simple: train the network for some number of iterations until the results look satisfactory. A minimal training-loop sketch follows the steps below.

Repeat the following steps for N iterations:

Step 1: in each outer iteration, repeat k times:
1. Sample a mini-batch of noise samples (z1, ..., zm) from p(z).
2. Sample a mini-batch of real data samples (r1, ..., rm).
3. Update the discriminator using the min-max loss function [eq. (1)] over the m samples.

Step 2: then, once per outer iteration:
1. Sample a mini-batch (z1, ..., zm) from p(z).
2. Update the generator by minimizing log(1 − D(G(z))), i.e., by pushing the discriminator toward making a mistake.
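To make the loop concrete, below is a minimal training-step sketch in TensorFlow 2 / Keras style. This is not the tensorlayer code used later in the post; the names generator, discriminator, real_images, batch_size, latent_dim and k are placeholders, and the generator update uses the common non-saturating variant (maximize log D(G(z))) rather than literally minimizing log(1 − D(G(z))).

import tensorflow as tf

# Binary cross-entropy on logits implements both expectation terms of eq. (1).
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
d_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
g_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)

def train_step(generator, discriminator, real_images, batch_size, latent_dim, k=1):
    # Step 1: update the discriminator k times on real and generated mini-batches.
    for _ in range(k):
        z = tf.random.normal([batch_size, latent_dim])
        with tf.GradientTape() as tape:
            fake_images = generator(z, training=True)
            real_logits = discriminator(real_images, training=True)
            fake_logits = discriminator(fake_images, training=True)
            # Real samples should be labelled 1, generated samples 0.
            d_loss = (bce(tf.ones_like(real_logits), real_logits)
                      + bce(tf.zeros_like(fake_logits), fake_logits))
        grads = tape.gradient(d_loss, discriminator.trainable_variables)
        d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))

    # Step 2: update the generator once, pushing D(G(z)) toward the "real" label.
    z = tf.random.normal([batch_size, latent_dim])
    with tf.GradientTape() as tape:
        fake_logits = discriminator(generator(z, training=True), training=True)
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(grads, generator.trainable_variables))
    return d_loss, g_loss

The outer loop over epochs and mini-batches simply calls train_step repeatedly, exactly as in the step list above.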

The original paper used SGD for training, but it is generally recommended to use more powerful optimizers such as Adam, Adagrad, or RMSProp.

What is DCGAN?

DCGAN, or Deep Convolutional Generative Adversarial Network, is a big improvement over earlier GANs. Although the GAN architecture looks simple in theory, in practice it is really hard to train one: the gradients have to flow through both the discriminator and the generator, and because of this the training of GANs tends to be unstable. We will cover these issues later in this article, but first let's try to understand the DCGAN architecture.

DCGANs are a class of deep convolutional architectures consisting of two convolutional neural networks with the following constraints (as per the original paper, read: here). These constraints were arrived at after an extensive model search.

  1. Replace all pooling layers with strided convolutions. With this approach, instead of relying on pooling layers (max pool or average pool) to perform spatial downsampling, the network learns its own downsampling.
  2. Remove fully connected layers to create deeper architectures, as fully connected layers hurt the convergence of the network.
  3. Use batch normalization to stabilize the flow of gradients between deep layers.
  4. Use ReLU activations for all layers in the generator except the output layer, which uses tanh; in the discriminator, use a sigmoid (or softmax) output.

The architecture proposed in the original DCGAN paper contains an input layer that takes the noise vector, followed by four deconvolution (transposed convolution) layers in the generator, and four convolution layers plus an output layer in the discriminator.

fig 1.5 DCGAN architecture. source here

If you are not aware of transposed convolutions, check out this blog post by Naoki Shibuya.

Building a Deep Convolutional Generative Adversarial Network

The code for this demonstration is taken from the tensorlayer implementation of DCGAN, with changes to work with the Sully Chen dataset and with layer-wise visualizations added to really understand how the network is learning. The dataset was built to train a self-driving car; it contains around 63k images of road driving with the corresponding steering angle and other values such as the gas pedal position. The main idea behind using this dataset is to see how well GANs can mimic real driving scenes: other pedestrians, vehicles, and so on. If you look at the bigger picture, this experiment is interesting because instead of relying on a camera attached to the front of your car to capture images of roads, you could generate an endless amount of data while sitting in a room. How good an idea it is to train a self-driving car on generated images, though, I'll leave for further discussion.

Note: All the weights of the model are initialized using the Glorot initializer. A typical Glorot (normal) initialization looks something like this:

step 1: std_dev = sqrt(2.0 / (fan_in + fan_out))
step 2: weights ~ Normal(mean = 0.0, std_dev)

If you want to learn more about the Glorot initializer and why it is used, click here.
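As a quick illustrative sketch (not the post's own code), the same two steps written in TensorFlow, where fan_in and fan_out are the number of input and output units of a layer:

import tensorflow as tf

def glorot_normal_weights(fan_in, fan_out):
    # step 1: compute the Glorot standard deviation
    std_dev = (2.0 / (fan_in + fan_out)) ** 0.5
    # step 2: draw the weights from a zero-mean normal with that std dev
    return tf.random.normal([fan_in, fan_out], mean=0.0, stddev=std_dev)

# In practice the built-in initializer does the same thing:
init = tf.keras.initializers.GlorotNormal()
w = init(shape=(256, 128))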

TensorFlow Implementation of the Generator

Let's look at the code for the generator. The flow of the code is as follows (a condensed sketch of the architecture appears after the list):

1. Define the generator parameters such as the output image size, filter parameters, and the gamma initializer.

2. The input layer is 100-dimensional, i.e., 100 random numbers sampled from a uniform distribution with mean = 0 and standard deviation = 1.0.

3. The input layer is connected to a dense layer with 64*8*64*64 units.

4. The dense layer output is reshaped to height = 64, width = 64, depth = 512 (= 64*8).

5. Batch normalization with gamma sampled from a normal distribution with mean = 1.0, std_dev = 0.02, followed by ReLU activation.

6. DeConv2d (transposed convolution) layer with kernel = (5,5), stride = (h=2, w=2), depth = 64*4, with zero padding.

7. Batch normalization followed by ReLU activation.

8. DeConv2d (transposed convolution) layer with kernel = (5,5), stride = (h=2, w=2), depth = 64*4, with zero padding.

9. Batch normalization followed by ReLU activation.

10. A final DeConv2d (transposed convolution) layer with kernel = (5,5), stride = (h=2, w=2) and zero padding, with tanh activation, producing the output image.
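For concreteness, here is a minimal tf.keras sketch of the layer flow described above. It is an approximation, not the tensorlayer code this post is based on: the filter counts of the intermediate transposed convolutions (256 and 128) and the 3-channel RGB output of the last layer are my assumptions, since the list above only specifies the overall flow.

import tensorflow as tf
from tensorflow.keras import layers

def build_generator(z_dim=100):
    # Batch-norm gamma initialized as described above (mean 1.0, std 0.02).
    gamma_init = tf.keras.initializers.RandomNormal(mean=1.0, stddev=0.02)
    w_init = tf.keras.initializers.GlorotNormal()

    model = tf.keras.Sequential(name="generator")
    # 100-dim noise -> dense layer with 64*8*64*64 units, reshaped to (64, 64, 512).
    model.add(layers.Dense(64 * 8 * 64 * 64, kernel_initializer=w_init, input_shape=(z_dim,)))
    model.add(layers.Reshape((64, 64, 512)))
    model.add(layers.BatchNormalization(gamma_initializer=gamma_init))
    model.add(layers.ReLU())

    # Transposed convolutions with 5x5 kernels and stride 2, each doubling the spatial size.
    # Filter counts here (256, 128) are assumptions; the list above only says "64*4".
    for filters in (256, 128):
        model.add(layers.Conv2DTranspose(filters, kernel_size=5, strides=2,
                                         padding="same", kernel_initializer=w_init))
        model.add(layers.BatchNormalization(gamma_initializer=gamma_init))
        model.add(layers.ReLU())

    # Final transposed convolution produces a 3-channel image with tanh activation.
    model.add(layers.Conv2DTranspose(3, kernel_size=5, strides=2,
                                     padding="same", activation="tanh",
                                     kernel_initializer=w_init))
    return model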

TensorFlow Implementation of the Discriminator

The discriminator follows a similar flow (a condensed sketch appears after the list):

  1. Define the model parameters such as the output image size, filter parameters, and the gamma initializer.
  2. Convolution layer with kernel size = (5,5), stride = (h=2, w=2), ReLU activation, and zero padding.
  3. Convolution layer with kernel size = (5,5), stride = (h=2, w=2), ReLU activation, and zero padding.
  4. Batch normalization with gamma ~ Normal(mean = 1.0, std_dev = 0.02).
  5. Convolution layer with kernel size = (5,5), stride = (h=2, w=2), ReLU activation, and zero padding.
  6. Batch normalization with gamma ~ Normal(mean = 1.0, std_dev = 0.02).
  7. Convolution layer with kernel size = (5,5), stride = (h=2, w=2), ReLU activation, and zero padding.
  8. Batch normalization with gamma ~ Normal(mean = 1.0, std_dev = 0.02).
  9. Flatten layer.
  10. Output layer with 1 unit and sigmoid activation.
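And a matching minimal tf.keras sketch of the discriminator flow, under the same caveats: the filter counts (64, 128, 256, 512) and the input image size are my assumptions, and the original implementation uses tensorlayer rather than Keras layers.

import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(image_shape=(512, 512, 3)):
    gamma_init = tf.keras.initializers.RandomNormal(mean=1.0, stddev=0.02)
    w_init = tf.keras.initializers.GlorotNormal()

    model = tf.keras.Sequential(name="discriminator")
    # First strided 5x5 convolution (no batch norm), halving the spatial size.
    model.add(layers.Conv2D(64, kernel_size=5, strides=2, padding="same",
                            kernel_initializer=w_init, input_shape=image_shape))
    model.add(layers.ReLU())

    # Three more strided convolutions, each followed by batch normalization.
    for f in (128, 256, 512):  # assumed filter counts
        model.add(layers.Conv2D(f, kernel_size=5, strides=2, padding="same",
                                kernel_initializer=w_init))
        model.add(layers.ReLU())
        model.add(layers.BatchNormalization(gamma_initializer=gamma_init))

    # Flatten and map to a single probability that the sample is real.
    model.add(layers.Flatten())
    model.add(layers.Dense(1, activation="sigmoid", kernel_initializer=w_init))
    return model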

Let's compare our generated results with the real ones

I have to say the results look more promising than expected. Before that, let's see what the real data images look like (the samples below are drawn at random from the real dataset; the idea is to see how good GANs are at understanding the data).

fig 1.6 Random samples from train dataset source

Now the generated samples:

Epochs 1 to 3
Epochs 4 to 7
fig 1.7 Incremental progress, shown using epochs as progress points.

As you can see, in samples from the 1st epoch the model is just able to capture the essence of the data, but as the number of epochs increases the output gets better, and by the 10th epoch it becomes really good. The network was trained for 30 epochs; let's visualize those. Images are arranged from epoch 11 to epoch 30, one epoch per row.

fig 1.8 Generated samples from 11th epoch to 30th(some results skipped)

Approaching the last epochs, the results start to capture even small structures like tyres, bushes, etc. In spite of the good results, the generator is still sometimes unable to capture blurry boundaries, for example between the road and the mountains, resulting in a fusion of the two. Still, the results are perfectly acceptable.

Generator and discriminator loss.

The convergence of GANs is unstable, and this is directly reflected in the variance of the loss values. Below is a glimpse of the loss values of our GAN architecture.

Epoch: [ 2/30] [ 139/ 997] time: 0.4634, d_loss: 0.67223823, g_loss: 0.71121359
Epoch: [ 2/30] [ 140/ 997] time: 0.4640, d_loss: 1.28069568, g_loss: 1.98622787
Epoch: [ 2/30] [ 141/ 997] time: 0.4628, d_loss: 1.44974589, g_loss: 0.46058652
Epoch: [ 2/30] [ 142/ 997] time: 0.4819, d_loss: 1.02387762, g_loss: 1.69937968
Epoch: [ 2/30] [ 143/ 997] time: 0.4781, d_loss: 0.59786928, g_loss: 1.81390572
Epoch: [ 2/30] [ 144/ 997] time: 0.4632, d_loss: 0.96302533, g_loss: 0.62419045
Epoch: [ 2/30] [ 145/ 997] time: 0.4622, d_loss: 0.62077224, g_loss: 1.15416789
Epoch: [ 2/30] [ 146/ 997] time: 0.4726, d_loss: 0.57695013, g_loss: 1.11101508
Epoch: [ 2/30] [ 147/ 997] time: 0.4843, d_loss: 0.64481205, g_loss: 1.35732222
Epoch: [ 2/30] [ 148/ 997] time: 0.4617, d_loss: 0.46775422, g_loss: 1.74343204
Epoch: [ 2/30] [ 149/ 997] time: 0.4668, d_loss: 0.60213166, g_loss: 0.84854925
Epoch: [ 2/30] [ 150/ 997] time: 0.4637, d_loss: 0.75188828, g_loss: 1.56600714
Epoch: [ 2/30] [ 151/ 997] time: 0.4644, d_loss: 0.80763638, g_loss: 0.80851054
Epoch: [ 2/30] [ 152/ 997] time: 0.4685, d_loss: 0.66286641, g_loss: 0.79960334
Epoch: [ 2/30] [ 153/ 997] time: 0.4619, d_loss: 0.64668310, g_loss: 1.32020211
Epoch: [ 2/30] [ 154/ 997] time: 0.4742, d_loss: 0.46545416, g_loss: 1.55003786
Epoch: [ 2/30] [ 155/ 997] time: 0.4970, d_loss: 0.94472808, g_loss: 0.49848381
Epoch: [ 2/30] [ 156/ 997] time: 0.4768, d_loss: 0.78345346, g_loss: 1.03955364

We can clearly observe how much the discriminator and generator loss values bounce up and down. The problems associated with training GANs are discussed later in this post.

Interpreting the GAN using layer-wise visualization of both the Generator and the Discriminator

In most applications we see involving neural networks, the network acts as a black box: we feed data in from one side and get output from the other. We usually don't understand what is going on inside the network even at an abstract level, so I decided to take a step in this direction and try to understand what is happening at the output of each layer in both the generator and the discriminator, and how the network generates or discriminates the data.

Note: All the layer-wise outputs were obtained before normalization, after training the network for 30 epochs on the Sully Chen dataset.
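The visualizations below were produced with the tf_cnnvis library listed in the references. As a rough, generic illustration of the idea (not the code used for these figures), one way to grab every intermediate layer output of a Keras model looks like this; discriminator and image_batch are placeholders:

import tensorflow as tf

def layer_outputs(model, images):
    # Build a probe model that returns every intermediate layer's output.
    probe = tf.keras.Model(inputs=model.inputs,
                           outputs=[layer.output for layer in model.layers])
    return probe(images)  # list of feature maps, one entry per layer

# Usage sketch: feed a batch of images through the (placeholder) discriminator
# and inspect, e.g., the first convolution layer's feature maps.
# activations = layer_outputs(discriminator, image_batch)
# first_conv_maps = activations[0]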

Generator

Layer 1: As expected, initially the network only generates some noise.

fig 1.9 1st Layer output

Layer 2: The network starts generating some patterns.

fig 1.10 2nd Layer output.

Layer 3: Useful patterns emerge.

fig 1.11 3rd Layer output

Layer 4: The final patterns, to which the tanh activation will be applied.

fig 1.12 4th Layer Output

Discriminator

Layer 1: We know the lower layers extract simple features such as horizontal or vertical edges; the same is observed in our experiment.

fig 1.13 1st Layer output.

Layer 2: The output of the 2nd layer builds on the output of the 1st. Not much of it is interpretable to us, but all of it matters to the network itself.

fig 1.14 2nd Layer output.

Layer 3: The feature maps get smaller and the representations become meaningful only to the network.

fig 1.15 3rd Layer output.

Layer 4: It's hard to make any interpretation as the network gets deeper, but this is how the network does its job.

fig 1.16 4th Layer output.

These visualizations give us some basic understanding of how the network predicts and what is happening at the output of each layer, but we have only scratched the surface; there is much more to learn about what these networks are doing. The full code for this work can be found on my GitHub here.

Problems with GANs

In theory, the training of GANs sounds straightforward and it seems it should converge to the global minimum, but in practice, when both the generator G and the discriminator D are represented as neural networks, the chance of convergence drops dramatically, and in many cases training stalls at a saddle point.

The following problems are commonly observed while training GANs. A small numerical illustration of the first one follows the list.

  1. Nash equilibrium problem: this problem arises from the min-max formulation of the GAN objective. Both players can only reach their optimum at a single point, and away from it training remains unstable. The following example is taken from the Deep Learning book: consider a function F(u, v) = uv, where one player has to minimize over u and the other has to maximize over v; the only equilibrium of F(u, v) is at u = v = 0.
  2. Mode collapse: in this situation the GAN fails to capture all modes (classes) of the data and can only generate certain types of patterns. This problem does not always occur, but when it does it can be severe.
  3. Diminishing gradients: this situation is quite common, where the gradients of either the discriminator or the generator become so small that learning in the other network almost stops. It mostly happens with the discriminator: as it gets much better, the gradients passed to the generator become so small that the generator fails to learn the distribution. To avoid this problem, it is advisable not to pre-train the discriminator before training the whole GAN.
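To make the Nash equilibrium example concrete, here is a tiny standalone simulation (my own illustration, not from the original post) of simultaneous gradient descent on u and gradient ascent on v for F(u, v) = uv. Instead of settling at the equilibrium u = v = 0, the iterates orbit it and slowly spiral outward, which is exactly the kind of instability described in point 1.

# Simultaneous gradient steps on F(u, v) = u * v:
# u is updated to minimize F (dF/du = v), v is updated to maximize F (dF/dv = u).
u, v, lr = 1.0, 1.0, 0.1
for step in range(1, 201):
    u, v = u - lr * v, v + lr * u  # simultaneous update
    if step % 50 == 0:
        print(f"step {step:3d}: u = {u:+.3f}, v = {v:+.3f}, F = {u * v:+.3f}")
# The magnitude of (u, v) grows slowly instead of converging to (0, 0).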

Stabilizing GANs is still an active area of research in the deep learning community, and many academic institutes and organizations are working on it. If you are interested in exploring more about the current problems associated with GANs, check out this excellent blog post by Jonathan Hui.

Applications

Although GANs were built with the goal of synthesizing data and, essentially, making models learn more about the representations that constitute real-world data for good, unfortunately many of the current uses of GANs are not so friendly and some are even disastrous. One famous example is deepfakes, where the idea is to create fake images or videos of people saying or doing things they never did. Such attempts can cause serious damage to both society and countries. In spite of this, there are some cool applications of GANs:

  1. In the online clothing industry, customers can see how clothes look on them before actually buying them, by trying them on artificially generated virtual versions of themselves.
  2. In generating cartoon characters, rather than relying on expensive software.
  3. In content generation, rather than relying on studios. A recent example comes from China, where the state news agency Xinhua uploaded a clip in which an AI news anchor reports instead of a real human. They did not reveal what their model is, but I am making an educated guess here.

Summary

Generative Adversarial Networks are one of the most promising techniques in deep learning right now. They are built upon an idea that is extremely simple and feels very general. Ours seems to have been a good first experiment with GANs; I'll be writing more on this in my upcoming posts. The complete code for this post can be found here.

References

Original GAN paper: https://arxiv.org/abs/1406.2661

DCGAN paper: https://arxiv.org/abs/1511.06434

Tensorlayer DCGAN implementation: https://github.com/tensorlayer/dcgan

tf_cnnvis (visualization library): https://github.com/InFoCusp/tf_cnnvis

Generative Models blog post by OpenAI

Problems with GANs: https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b

DCGAN TensorFlow implementation: github.com/carpedm20/DCGAN-tensorflow

AppliedAI Course

Glorot initializer: andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization

Deep Learning book: www.deeplearningbook.org
