
From the above image, can you tell which hand-written digits are synthesized by the machine and which are made by humans? The answer is: all of them are synthesized by the machine! In fact, this image comes from the scientific paper "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets", in which the authors developed a special Generative Adversarial Network named InfoGAN (Information Maximizing Generative Adversarial Network) and used it to synthesize MNIST hand-written digits. As you may know, GANs are widely used for synthesizing new data, especially images. However, one drawback of normal GANs is that we have no control over the images they produce. For instance, a GAN trained on hand-written digits may generate very realistic digit images, but we have no control over which number it generates. InfoGAN solves this problem: the network learns to produce images with specific categorical features (such as digits 0 to 9) and continuous features (such as the rotational angle of the digits), in an unsupervised manner. In addition, because the learning is unsupervised, InfoGAN can find patterns hidden among the images and generate images that follow these hidden patterns. Sometimes the model learns very interesting patterns that are beyond your imagination (for example, one of my models learned to transition from number 2 to number 8. You will see it later!). In this notebook, I will introduce how InfoGAN achieves this control over the images being produced, and how to build an InfoGAN from scratch to synthesize feature-specific MNIST hand-written digits, just like the image above.
The Structure of InfoGAN
A normal GAN has two fundamental elements: a generator that accepts random noise and produces fake images, and a discriminator that accepts both fake and real images and identifies whether an image is real or fake. During training, the generator gets "penalized" if the discriminator successfully detects that the image it produces is fake. Therefore, the generator learns to produce fake images that are more and more similar to the real ones to "fool" the discriminator.
In InfoGAN, to control the types of images produced, we need to feed additional information on top of the random noise to the generator and force it to use that information when making the fake images. The additional information we feed should relate to the types of features we want the images to have. For example, if we want to produce specific MNIST digits, we need to feed a categorical vector encoding an integer from 0 to 9; if we want to produce MNIST digits with different rotational angles, we may want to feed a float number randomly selected between -1 and 1.
Feeding additional information is easy, as we just need to add extra inputs to the generator model. But how can we ensure that the generator will use the information instead of completely ignoring it? If we still train the generator simply based on the response of the discriminator, the generator won't use the additional information, since that information does not help it create more realistic images (it is only helpful for generating specific features of the images). Thus, we need to apply extra "penalties" on the generator if it does not use the additional information. One way is to add an additional network (often named the auxiliary network and denoted as Q) that takes fake images and reproduces the additional information we fed into the generator. In this way, the generator is forced to use the additional information, because if it doesn't, there is no way the auxiliary network can correctly reproduce that information, and the generator will be "penalized". The picture below summarizes the structures of GAN (left) and InfoGAN (right).
Note: in the paper, theoretically, the generator should be trained by maximizing mutual information. However, mutual information is intractable to compute in practice. Therefore, the authors approximated it with a variational lower bound, and in practice the approximation boils down to the cross-entropy (i.e. the difference) between the additional information fed to the generator and the information reproduced by the auxiliary network. If you are interested in mutual information and how it is approximated, you can check the original paper, or two very good Medium articles by Zak Jost and Jonathan Hui.
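For reference, the overall objective from the paper looks roughly like this, where V(D, G) is the standard GAN value function, λ is a weighting hyper-parameter, and L_I(G, Q) is the variational lower bound that replaces the intractable mutual information I(c; G(z, c)):

min over G, Q; max over D:  V_InfoGAN(D, G, Q) = V(D, G) − λ · L_I(G, Q),  with  L_I(G, Q) = E[ log Q(c | x) ] + H(c)

In words: the usual GAN game, plus an extra reward for the auxiliary network Q assigning high likelihood to the codes c that were actually fed to the generator.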

Build InfoGAN
After understanding the structure of InfoGAN, let's get our hands dirty and build an InfoGAN to produce feature-specific MNIST digits! As seen from the above image, InfoGAN contains three models: the generator (G), the discriminator (D), and the auxiliary model (Q). The input of the generator includes three parts: a noise vector of size 62, a categorical vector of size 10 (representing 10 digits), and a continuous vector of size 1. The noise vector is sampled from a normal distribution, the categorical vector is generated by picking an integer from 0 to 9 (one-hot encoded), and the continuous vector is generated by picking a float value uniformly between -1 and 1.
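As a rough sketch of how these inputs might be sampled (the function and variable names here are my own, for illustration only):

```python
import numpy as np

def create_gen_input(batch_size, noise_dim=62, n_classes=10):
    """Sample one batch of generator inputs: noise, one-hot category, continuous code."""
    noise = np.random.normal(0, 1, size=(batch_size, noise_dim)).astype("float32")
    labels = np.random.randint(0, n_classes, size=batch_size)
    categorical = np.eye(n_classes, dtype="float32")[labels]               # one-hot, shape (batch, 10)
    continuous = np.random.uniform(-1, 1, size=(batch_size, 1)).astype("float32")
    return noise, categorical, continuous
```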
The generator in InfoGAN has exactly the same structure as the generator in a normal GAN. It first contains two fully-connected layers that expand the input to 6272 units. These 6272 units are then reshaped into 128 feature maps of size 7×7. Afterward, the reshaped feature maps are processed through three transposed-convolutional layers to form the final 28×28-pixel image (if you are unfamiliar with the transposed-convolutional layer, I have an article explaining it).
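Here is a minimal Keras sketch of such a generator. The layer sizes follow the description above, but the exact kernel sizes, activations, and batch-normalization placement are my assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_generator(input_dim=73):  # 62 noise + 10 categorical + 1 continuous
    inputs = keras.Input(shape=(input_dim,))
    x = layers.Dense(1024, activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Dense(7 * 7 * 128, activation="relu")(x)        # expand to 6272 units
    x = layers.BatchNormalization()(x)
    x = layers.Reshape((7, 7, 128))(x)                          # 128 feature maps of 7x7
    x = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(x)   # 14x14
    x = layers.BatchNormalization()(x)
    x = layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu")(x)   # 28x28
    outputs = layers.Conv2DTranspose(1, 4, strides=1, padding="same", activation="tanh")(x)  # 28x28x1
    return keras.Model(inputs, outputs, name="generator")
```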
The discriminator is also the same as in normal GANs. It contains two convolutional layers and two fully-connected layers. The last fully-connected layer uses a "sigmoid" activation function to output whether an image is real (1) or fake (0).
The auxiliary model shares all layers of the discriminator except the last fully-connected layer, and thus the two models are defined together. The auxiliary model has additional fully-connected layers to recover the additional information. Since we have two additional inputs for our generator (a categorical vector and a continuous vector), we also need two different outputs from the auxiliary model. Thus, I set one fully-connected layer with a "softmax" activation function for the categorical output, and two fully-connected layers to represent "mu" (mean) and "sigma" (standard deviation) of a Gaussian distribution for the continuous output:
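A sketch of how this shared definition might look in Keras (filter counts, hidden sizes, and the way sigma is kept positive are my assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_discriminator_and_auxiliary(n_classes=10):
    inputs = keras.Input(shape=(28, 28, 1))
    # Shared convolutional trunk
    x = layers.Conv2D(64, 4, strides=2, padding="same")(inputs)
    x = layers.LeakyReLU(0.1)(x)
    x = layers.Conv2D(128, 4, strides=2, padding="same")(x)
    x = layers.LeakyReLU(0.1)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024)(x)
    x = layers.LeakyReLU(0.1)(x)

    # Discriminator head: real (1) vs fake (0)
    d_out = layers.Dense(1, activation="sigmoid")(x)
    discriminator = keras.Model(inputs, d_out, name="discriminator")

    # Auxiliary head: categorical label plus Gaussian parameters for the continuous code
    q = layers.Dense(128)(x)
    q = layers.LeakyReLU(0.1)(q)
    q_cat = layers.Dense(n_classes, activation="softmax")(q)    # which digit
    q_mu = layers.Dense(1)(q)                                    # mean of the Gaussian
    q_sigma = layers.Dense(1, activation="exponential")(q)       # standard deviation, kept positive
    auxiliary = keras.Model(inputs, [q_cat, q_mu, q_sigma], name="auxiliary")

    return discriminator, auxiliary
```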

Note: as our continuous vector is a float number randomly selected from a uniform distribution between -1 and 1, there are infinitely many possible values. Therefore, it is not practical to ask the auxiliary model to predict the exact number that the generator received. Instead, we let it predict a Gaussian distribution and train the model to maximize the likelihood of the continuous vector under that distribution.
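In code, "maximizing the likelihood" becomes minimizing the negative log probability density of the continuous code under the predicted Gaussian. A minimal sketch (the small epsilon that keeps the log stable is my own addition):

```python
import math
import tensorflow as tf

def continuous_loss(c_true, mu, sigma, eps=1e-8):
    """Negative log-likelihood of c_true under a Gaussian N(mu, sigma^2)."""
    sigma = tf.maximum(sigma, eps)
    log_pdf = (-0.5 * math.log(2.0 * math.pi)
               - tf.math.log(sigma)
               - 0.5 * tf.square((c_true - mu) / sigma))
    return -tf.reduce_mean(log_pdf)
```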
After defining the three models, we can construct our InfoGAN network! I used Keras for network construction:
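Below is a minimal sketch of the InfoGAN_Continuous class. It follows the step-by-step description that comes next, but exact layer wiring and hyper-parameters may differ from what you end up using, and the step described as _trainstep below is written as Keras's standard train_step override so that model.fit works:

```python
import math
import tensorflow as tf
from tensorflow import keras

class InfoGAN_Continuous(keras.Model):
    def __init__(self, discriminator, generator, auxiliary, noise_dim=62, n_classes=10):
        super().__init__()
        self.discriminator = discriminator
        self.generator = generator
        self.auxiliary = auxiliary
        self.noise_dim = noise_dim
        self.n_classes = n_classes

    def compile(self, d_optimizer, g_optimizer, q_optimizer):
        super().compile()
        self.d_optimizer = d_optimizer
        self.g_optimizer = g_optimizer
        self.q_optimizer = q_optimizer
        self.bce = keras.losses.BinaryCrossentropy()
        self.cce = keras.losses.CategoricalCrossentropy()

    def _create_geninput(self, batch_size):
        # Noise ~ N(0, 1), categorical code one-hot, continuous code ~ U(-1, 1)
        noise = tf.random.normal((batch_size, self.noise_dim))
        labels = tf.random.uniform((batch_size,), 0, self.n_classes, dtype=tf.int32)
        categorical = tf.one_hot(labels, self.n_classes)
        continuous = tf.random.uniform((batch_size, 1), -1.0, 1.0)
        return noise, categorical, continuous

    def _concatinputs(self, noise, categorical, continuous):
        return tf.concat([noise, categorical, continuous], axis=1)   # 62 + 10 + 1 = 73

    def _continuous_loss(self, c_true, mu, sigma, eps=1e-8):
        # Negative log-likelihood of the continuous code under N(mu, sigma^2)
        sigma = tf.maximum(sigma, eps)
        log_pdf = (-0.5 * math.log(2.0 * math.pi) - tf.math.log(sigma)
                   - 0.5 * tf.square((c_true - mu) / sigma))
        return -tf.reduce_mean(log_pdf)

    def train_step(self, real_images):
        batch_size = tf.shape(real_images)[0]
        half = batch_size // 2

        # 1) Train the discriminator on a half-batch of real and a half-batch of fake images
        noise, cat, cont = self._create_geninput(half)
        fake_images = self.generator(self._concatinputs(noise, cat, cont))
        with tf.GradientTape() as tape:
            real_pred = self.discriminator(real_images[:half])
            fake_pred = self.discriminator(fake_images)
            d_loss = (self.bce(tf.ones_like(real_pred), real_pred)
                      + self.bce(tf.zeros_like(fake_pred), fake_pred))
        d_grads = tape.gradient(d_loss, self.discriminator.trainable_variables)
        self.d_optimizer.apply_gradients(zip(d_grads, self.discriminator.trainable_variables))

        # 2) Train the generator and the auxiliary model on a full batch of fake images.
        #    The discriminator's variables are excluded below, so they stay frozen here.
        noise, cat, cont = self._create_geninput(batch_size)
        gen_input = self._concatinputs(noise, cat, cont)
        with tf.GradientTape(persistent=True) as tape:
            fake_images = self.generator(gen_input)
            fake_pred = self.discriminator(fake_images)
            q_cat, q_mu, q_sigma = self.auxiliary(fake_images)
            cat_loss = self.cce(cat, q_cat)
            cont_loss = self._continuous_loss(cont, q_mu, q_sigma)
            q_loss = cat_loss + 0.1 * cont_loss                       # 0.1 weight on the continuous loss
            g_loss = (self.bce(tf.ones_like(fake_pred), fake_pred)    # fool the discriminator
                      + cat_loss + 0.1 * cont_loss)                   # and satisfy the auxiliary model
        g_grads = tape.gradient(g_loss, self.generator.trainable_variables)
        self.g_optimizer.apply_gradients(zip(g_grads, self.generator.trainable_variables))
        d_refs = {v.ref() for v in self.discriminator.trainable_variables}
        q_vars = [v for v in self.auxiliary.trainable_variables if v.ref() not in d_refs]
        q_grads = tape.gradient(q_loss, q_vars)
        self.q_optimizer.apply_gradients(zip(q_grads, q_vars))
        del tape
        return {"d_loss": d_loss, "g_loss": g_loss, "q_loss": q_loss}
```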
I know it is quite a long piece of code, so let's digest it step by step:
- InfoGAN_Continuous is a Keras model class that is initialized with the discriminator, the generator, the auxiliary model, the size of the noise vector, and the number of classes for the categorical vector.
- The compile function compiles the InfoGAN_Continuous model (I used Adam for all three optimizers).
- _create_geninput generates the random inputs for the generator, as described earlier.
- _concatinputs concatenates the three input vectors (size 62, size 10, size 1) into one vector of size 73.
- The _trainstep function defines the training step. It only takes batches of real images. First, the discriminator is trained on a half-batch of real images and a half-batch of fake images. The discriminator loss is the sum of the losses from discriminating real and fake images, and its weights are updated by gradient descent based on this loss. Then, the generator and the auxiliary model are trained using a full batch of fake images. The auxiliary model loss contains a categorical loss and a continuous loss: the categorical loss is the categorical cross-entropy between the predicted label and the input categorical vector, and the continuous loss is the negative log probability density of the continuous vector input under the predicted Gaussian distribution. Minimizing the negative log probability density is the same as maximizing the likelihood of our continuous vectors under the predicted Gaussian distribution, which is what we want. The generator loss contains both the loss from the discriminator and the loss from the auxiliary model; by doing so, the generator learns to produce more realistic images with more specific features. Notice that we set the variables in the discriminator to be non-trainable, as we do not want to modify the discriminator's weights when training the generator and the auxiliary model.
Now, you simply need to train it with a few lines of code!!!
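For example, training might look like this, assuming the builder functions and the InfoGAN_Continuous class sketched above (batch size, epochs, and learning rates are purely illustrative):

```python
import numpy as np
from tensorflow import keras

# Load MNIST and scale to [-1, 1] to match the generator's tanh output
(x_train, _), _ = keras.datasets.mnist.load_data()
x_train = (x_train.astype("float32") - 127.5) / 127.5
x_train = np.expand_dims(x_train, -1)                      # shape (60000, 28, 28, 1)

generator = build_generator()
discriminator, auxiliary = build_discriminator_and_auxiliary()

infogan = InfoGAN_Continuous(discriminator, generator, auxiliary, noise_dim=62, n_classes=10)
infogan.compile(d_optimizer=keras.optimizers.Adam(2e-4),
                g_optimizer=keras.optimizers.Adam(2e-4),
                q_optimizer=keras.optimizers.Adam(2e-4))
infogan.fit(x_train, batch_size=128, epochs=50)
```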
Results
Now, let’s see some very interesting results I got from the InfoGAN model!
Varying Categorical Vector
First, if you change the categorical vector input, you can generate different numbers. However, the model won't know that label 0 corresponds to number 0! It only knows that different labels correspond to different numbers (believe me, it somehow took me an entire day to realize this; I kept thinking I had failed because feeding label 0 generated number 9 for me).
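As an illustration, here is a sketch of how you might sweep over the labels after training (it relies on the hypothetical generator built above, with the noise and continuous code held fixed):

```python
import numpy as np
import matplotlib.pyplot as plt

noise = np.tile(np.random.normal(0, 1, (1, 62)), (10, 1)).astype("float32")  # same noise for every label
categorical = np.eye(10, dtype="float32")                                     # labels 0..9, one-hot
continuous = np.zeros((10, 1), dtype="float32")                               # continuous code fixed at 0

images = generator.predict(np.concatenate([noise, categorical, continuous], axis=1))

fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for label, ax in enumerate(axes):
    ax.imshow(images[label, :, :, 0], cmap="gray")
    ax.set_title(f"label {label}")
    ax.axis("off")
plt.show()
```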

So you may want to rearrange the labels:

Varying Continuous Vector
Varying the continuous vector generates the same number with different shapes. For example, for number 1, the number rotates in a clockwise direction as the continuous vector value increases. Note that even though we trained the model using values from -1 to 1, feeding -1.5 and 1.5 still gives meaningful results!
Note that the labels are not the same as the ones above. This is because I trained the model again, and the model always maps labels to numbers randomly.

Number 5 is more interesting. It seems that the model is trying to rotate the number, but because of its more complex shape, the number is sheared.
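To reproduce sweeps like these, you can fix the label and vary only the continuous value (again using the hypothetical generator from above; values such as -1.5 and 1.5 fall outside the training range but are perfectly fine to feed at inference time):

```python
import numpy as np
import matplotlib.pyplot as plt

label = 1                                             # pick whichever label maps to the digit you want
values = np.linspace(-1.5, 1.5, 7).astype("float32")

noise = np.tile(np.random.normal(0, 1, (1, 62)), (len(values), 1)).astype("float32")
categorical = np.tile(np.eye(10, dtype="float32")[label], (len(values), 1))
continuous = values.reshape(-1, 1)

images = generator.predict(np.concatenate([noise, categorical, continuous], axis=1))

fig, axes = plt.subplots(1, len(values), figsize=(12, 2))
for value, img, ax in zip(values, images, axes):
    ax.imshow(img[:, :, 0], cmap="gray")
    ax.set_title(f"c = {value:.1f}")
    ax.axis("off")
plt.show()
```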

Increasing The Weight For Continuous Loss
You may notice that I applied a weight of 0.1 to the continuous loss, both when adding it to the total generator loss and when adding it to the total auxiliary loss. This is to avoid confusing the model: if the continuous loss is weighted similarly to the categorical loss (i.e. with a ratio close to 1), the model will treat it as another factor that determines the type of the number. For example, the images below were generated when I used a ratio of 0.5:


As you can see, even though I use the same label, by varying the continuous vector value the number gradually changes from 2 to 4, or from 2 to 8! This also means that the computer can actually find similarities between number 2 and number 4, and between number 2 and number 8, and knows how to convert from one to another. Isn't it amazing?
Conclusion
InfoGAN is a very powerful GAN that can learn patterns among images in an unsupervised manner and produce images that follow those patterns. It is also very fun to play with, as you can generate lots of variations of images by manipulating the input variables. With the code presented in this story, you can build your own InfoGAN and see what amazing images you can produce!
References:
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel, InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets (2016), Cornell University
Predicting Probability Distributions Using Neural Networks – Taboola Tech Blog
How to Develop an Information Maximizing GAN (InfoGAN) in Keras – Machine Learning Mastery