
Differentiable Generator Networks: an Introduction

An introduction to VAEs, GANs and their challenges

Deep Learning Fundamentals

Images from https://thispersondoesnotexist.com/ (open source)

Introduction

Training generative models is a lot harder than training classifiers or regressors. When training a classifier in a supervised learning problem, we know the desired mapping between the inputs and the outputs of our training examples. Training generative models, however, involves optimizing criteria that are often intractable.

In this article, I go through the fundamentals of generative models and the functioning of two core differentiable generator networks: Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).

Generative Models

Before going into differentiable generator networks, let’s recap generative models.

The ultimate goal of generative models is to generate synthetic data that is as similar to the training data as possible. In summary, it’s all about modeling a probability density function over our training data, p(x). Given a set of training data, generative models attempt to construct a function that tells us how likely any data point is to belong to the training distribution. This density allows generative models to do interesting things, like evaluate how probable a newly generated data point is under p(x), or sample from the density to generate new data.

Generative Models: GMMs

Gaussian Mixture Models (GMMs) are a very simple generative method. GMMs learn through expectation maximization, iteratively updating their parameters to maximize the log-likelihood of the training data. In a future article, I will go into more detail about their learning.

GMMs assume that the distribution of the data consists of a set of Gaussians. They approximate the training data space by fitting K Gaussian PDFs to the training data, a little bit like k-means clustering, but fitting a Gaussian distribution over each cluster instead of only assigning cluster memberships. Approximating the training data space through a set of Gaussian distributions gives us the following advantages:

  • You can express the probability of a new sample as the weighted sum of your K Gaussians
  • You can sample the distributions to generate data belonging to each cluster

The whole point of generative models is to construct a density function over our training data space, p(x). GMMs do this by approximating the data distribution with a set of Gaussian densities. This is advantageous because it allows us to evaluate any new sample in this space, which we can use to understand whether a new data sample fits well within our existing data. Moreover, with a fully defined density one can sample each of the clusters, generating new synthetic data.
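
As a hedged illustration of these two uses (density evaluation and sampling), here is a minimal sketch with scikit-learn’s GaussianMixture; the toy data and number of components are placeholders:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Toy training data: two clusters in 2-D (placeholder for real data).
    rng = np.random.default_rng(0)
    X_train = np.vstack([
        rng.normal(loc=[0.0, 0.0], scale=0.5, size=(500, 2)),
        rng.normal(loc=[3.0, 3.0], scale=0.5, size=(500, 2)),
    ])

    # Fit K Gaussian components via expectation maximization.
    gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X_train)

    # Advantage 1: evaluate how well new points fit the learned density p(x).
    log_density = gmm.score_samples(np.array([[0.1, -0.2], [10.0, 10.0]]))

    # Advantage 2: sample the mixture to generate new synthetic data.
    X_new, cluster_ids = gmm.sample(n_samples=5)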

Differentiable Generator Networks

When producing generator networks we want differentiable models that learn the mapping from a latent variable Z (drawn from a simple random distribution) to the distribution of our data X.

Differentiable generator networks assume that the information carried by each sample can be represented in a simpler space than the data’s space. Even if the data lives in a high-dimensional space (like a high-resolution image), the information in the image can be expressed on a much simpler manifold. Say we have pictures of handwritten digits: even though the training data lives in a high-dimensional space (the images could be 32×32 pixels), the underlying information behind each image is much smaller. The space of all possible handwritten digits is a lot smaller than the space of all possible 32×32 pixel images.

Generative models construct a low-dimensional latent space over a latent variable Z, which they then use to approximate the training data space. There are two main ways to utilize this latent space to generate new data:

  • To transform a latent variable Z into a new sample X (approximate samples from p(x))
  • To transform Z into a distribution over X (generate a function to approximate p(x) then sample from it)

So generative models either generate new samples (generate X from Z), or generate distributions where all samples in that sub-space are representative of the training data (generate a distribution over X from Z).

Depending on what generative model you use, the model will generate data in one of the two ways described above. In the following sections of the article, I will talk about specific model architectures and how these can be used to generate synthetic data.
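
To make the distinction concrete, here is a small, purely illustrative PyTorch sketch of the two modes: one network maps z directly to a sample x, while the other maps z to the parameters of a distribution over x (all layer sizes are placeholders):

    import torch
    import torch.nn as nn

    latent_dim, data_dim = 8, 64  # illustrative sizes

    # Mode 1: transform a latent sample z directly into a data sample x.
    direct_generator = nn.Sequential(
        nn.Linear(latent_dim, 128), nn.ReLU(),
        nn.Linear(128, data_dim),
    )

    # Mode 2: transform z into the parameters of a distribution over x
    # (here a Gaussian mean and log-variance), then sample from it.
    class DistributionGenerator(nn.Module):
        def __init__(self):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU())
            self.mean = nn.Linear(128, data_dim)
            self.log_var = nn.Linear(128, data_dim)

        def forward(self, z):
            h = self.body(z)
            return self.mean(h), self.log_var(h)

    z = torch.randn(16, latent_dim)            # z drawn from N(0, I)
    x_direct = direct_generator(z)             # samples of X
    mu, log_var = DistributionGenerator()(z)   # a distribution over X
    x_from_dist = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)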

Variational Autoencoders

Optimizing VAEs

Both GMMs and VAEs attempt to build a distribution over the training data X that can then be sampled to generate new data (generate a distribution over X from Z). However, VAEs don’t construct a direct PDF over our training data X. Instead, they exploit the assumption stated earlier and learn the mapping between the distribution over our latent variable Z and our training data X (where the latent space of Z is a lot simpler than the space of our data X).

In VAEs we choose a very simple distribution for our latent variable Z, usually a normal distribution of the form N(0, I). By doing this, VAEs assume that the variation in the data is Gaussian. Take the handwritten digit example from earlier: if you think of all possible handwritten "3s", the variation between them will be roughly Gaussian. In practice, this is often a good assumption to make, as the noise found in sensor data, for example, is often Gaussian.

We want to find the distribution over our data p(x) that maximizes the probability of our training data. From the product and sum rules of probability, we can express p(x) as follows:
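
    p(x) = \int p(x \mid z, \theta) \, p(z) \, dz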

where p(x|z, θ) is the likelihood (the probability of our training data X given our latent samples Z and our model parameters θ), and p(z) is the prior distribution, which we have assumed is Gaussian with a mean of zero and an identity covariance matrix.

The goal is to find the set of weights θ that maximizes this probability. The integral itself is intractable, so we introduce an approximate posterior q(z|x) and maximize a lower bound on log p(x) instead:
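
    \log p(x) \ge \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z, \theta)\right] - D_{\mathrm{KL}}\left(q(z \mid x) \,\|\, p(z)\right)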

The resulting expression is known as the ELBO (Evidence Lower BOund), and understanding its terms is important.

The first term is the expected log-likelihood of our data given our latent variable, which we want to maximize. It tells us whether the model is producing good reconstructions of X given Z. Maximizing this term is actually equivalent to minimizing the mean squared error of autoencoders when the likelihood p(x|z, θ) is assumed to be Gaussian.
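
To see why: if p(x|z, θ) is assumed to be a Gaussian with identity covariance whose mean is the decoder output x̂, then

    \log p(x \mid z, \theta) = -\tfrac{1}{2} \lVert x - \hat{x} \rVert^2 + \text{const},

so maximizing the expected log-likelihood is the same as minimizing the squared reconstruction error.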

The second term is the Kullback-Leibler divergence between the approximate posterior q(z|x) and the prior p(z). Minimizing this term essentially attempts to bring the posterior as close as possible to a Gaussian distribution with zero mean and unit variance. Again, we are assuming that the variation in our data is Gaussian, and therefore we try to push the posterior towards this shape, whilst still producing good samples representative of our training data.

Architecture of VAEs

Image by Author

The structure of VAEs is divided into two sections, the encoder and the decoder. The encoder reduces the dimensionality of the input data down to a small set of latent parameters, the mean and variance that parametrize the normal distribution over our latent space. This distribution can then be sampled, and the samples are used as inputs to the decoder. The decoder attempts to reconstruct a data point in the training data space that corresponds to the sample in the latent space.

So VAEs map the input data to a latent space over Z. The decoder then maps this latent space back to the image space over X. Sampling from the latent distribution and passing the samples through the decoder essentially allows us to sample from p(x). This was our goal!

Since VAEs are differentiable functions from beginning to end, one can use the ELBO defined above as the loss function and update the weights of the network like any other neural network architecture, through backpropagation and your favorite optimizer.
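
As a hedged sketch (not the exact architecture described above), here is a minimal PyTorch VAE using the reparameterization trick and the negative ELBO as the loss; the layer sizes, data dimensions, and training batch are placeholders:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        def __init__(self, data_dim=784, latent_dim=2):
            super().__init__()
            # Encoder: maps x to the mean and log-variance of q(z|x).
            self.enc = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU())
            self.enc_mu = nn.Linear(256, latent_dim)
            self.enc_log_var = nn.Linear(256, latent_dim)
            # Decoder: maps a latent sample z back to the data space.
            self.dec = nn.Sequential(
                nn.Linear(latent_dim, 256), nn.ReLU(),
                nn.Linear(256, data_dim), nn.Sigmoid(),
            )

        def forward(self, x):
            h = self.enc(x)
            mu, log_var = self.enc_mu(h), self.enc_log_var(h)
            # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
            z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
            return self.dec(z), mu, log_var

    def negative_elbo(x, x_hat, mu, log_var):
        # Reconstruction term (squared error, i.e. a Gaussian likelihood).
        recon = F.mse_loss(x_hat, x, reduction="sum")
        # Analytic KL divergence between q(z|x) = N(mu, sigma^2) and p(z) = N(0, I).
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon + kl

    model = VAE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.rand(32, 784)  # placeholder batch of flattened images
    optimizer.zero_grad()
    x_hat, mu, log_var = model(x)
    loss = negative_elbo(x, x_hat, mu, log_var)
    loss.backward()
    optimizer.step()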

Generative Adversarial Networks

GANs Theory

So far we have seen that VAEs are capable of generating a distribution over the training data p(x) from the latent distribution p(z). GANs generate data differently. Instead of generating a distribution, they generate samples that are likely to belong to the distribution.

Image by Author

GANs consist of a generator and a discriminator. These are two differentiable functions that together work as one larger model. The two networks are adversarial: each one tries to achieve the opposite of the other, so they are essentially competing. The generator takes a random variable Z and maps it to a generated sample in our data space X. The discriminator takes the images from the training data and the images produced by the generator, and attempts to classify which ones belong to the training set and which ones are fake. The goal of the generator is to fool the discriminator, and the goal of the discriminator is to correctly identify the generated images.

As you can see, this way of generating images is fundamentally different from the way VAEs generate images. We are not producing a density function p(x), we are instead producing samples that are similar to those in p(x).

These are both differentiable functions and therefore can be optimized together. They are trained like any other neural network through backpropagation. I won’t go through their loss functions in this article, but maybe in a future one.
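
Without going through the derivation of the loss, here is a minimal, illustrative sketch of the adversarial training loop in PyTorch, using the standard binary cross-entropy formulation; the architectures and the "real" data batch are placeholders:

    import torch
    import torch.nn as nn

    latent_dim, data_dim = 16, 64  # illustrative sizes

    # Placeholder generator and discriminator (real ones would be much deeper).
    G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
    D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

    bce = nn.BCEWithLogitsLoss()
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

    for step in range(1000):
        real = torch.randn(32, data_dim)   # placeholder for a batch of real data
        z = torch.randn(32, latent_dim)    # latent samples, z ~ N(0, I)
        fake = G(z)

        # Discriminator step: label real data 1 and generated data 0.
        d_loss = (bce(D(real), torch.ones(32, 1))
                  + bce(D(fake.detach()), torch.zeros(32, 1)))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator step: try to make the discriminator output 1 for fakes.
        g_loss = bce(D(fake), torch.ones(32, 1))
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()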

However, GANs have some issues:

Images from https://thispersondoesnotexist.com/ (open source)

GANs are known to be prone to unstable training, resulting in outputs that don’t make sense. Look at the pictures above: these people do not exist; they are samples generated by a GAN. Although the model gets close to matching real faces, I’m sure you can spot a few things that are wrong with each image.

GANs are difficult to train because of four main obstacles:

  • Your GAN might never converge to an optimal solution. This is because the gradients between the discriminator and the generator may have opposite signs, and therefore your parameters can get stuck at a saddle point.
  • There is nothing stopping the generative network from learning to produce only one type of data, a failure known as mode collapse. For example, if generating pictures of animals, the model could learn to only generate pictures of dogs and nothing else.
  • Because GANs generate samples directly instead of generating a distribution and then sampling from it, we lose the explicit distribution over X. This may make GANs less useful than VAEs in some situations.
  • Nothing stops the generator from learning to simply replicate the training data.

Deep Convolutional GANs (DCGANs) solve some of these problems by suggesting the following:

  • Do not use pooling layers; strided convolutions (in the discriminator) and fractionally-strided convolutions (in the generator) are found to perform better.
  • Use batch normalization after convolutional layers.
  • Avoid fully connected hidden layers; instead, use more convolutional layers to make the model deeper.
  • The ReLU activation function should be used in the generator and LeakyReLU in the discriminator.

The DCGAN paper reaches these conclusions after a series of experiments. DCGANs have since been commonly used to generate data in a wide variety of fields.
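
As a rough, hypothetical sketch of these guidelines in PyTorch (the channel counts and image size are illustrative, not an exact reproduction of the paper): fractionally-strided convolutions instead of pooling, batch normalization after the convolutions, ReLU in the generator, and LeakyReLU in the discriminator.

    import torch
    import torch.nn as nn

    latent_dim = 100

    # Generator: fractionally-strided (transposed) convolutions, batch norm, ReLU.
    generator = nn.Sequential(
        nn.ConvTranspose2d(latent_dim, 256, kernel_size=4, stride=1, padding=0, bias=False),
        nn.BatchNorm2d(256), nn.ReLU(),
        nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(128), nn.ReLU(),
        nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(64), nn.ReLU(),
        nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1, bias=False),
        nn.Tanh(),  # 32x32 single-channel image in [-1, 1]
    )

    # Discriminator: strided convolutions instead of pooling, LeakyReLU.
    discriminator = nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1, bias=False),
        nn.LeakyReLU(0.2),
        nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
        nn.Conv2d(128, 1, kernel_size=8, stride=1, padding=0, bias=False),  # one logit per image
    )

    z = torch.randn(8, latent_dim, 1, 1)   # latent vectors as 1x1 feature maps
    images = generator(z)                  # shape: (8, 1, 32, 32)
    logits = discriminator(images)         # shape: (8, 1, 1, 1)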

Conclusion

Generative models are extremely difficult to train since the objectives involved in training are often intractable. Generative models either learn to produce new data samples or learn to produce distributions over the training data. In this article, I go through VAEs (which generate distributions) and GANs (which generate data samples). I outline how these work, and how they are fundamentally different methods of generating data. Understanding this will enable you to make better decisions when choosing which model to apply.

Support me

Hopefully this helped you; if you enjoyed it, you can follow me!

You can also become a medium member using my referral link, get access to all my articles and more: https://diegounzuetaruedas.medium.com/membership

Other articles you might enjoy

VAEs: Indirect Sampling from Latent Image Distribution

Kernel Methods: A Simple Introduction

Kalman Filtering: A Simple Introduction

