“Reparameterization” trick in Variational Autoencoders

Sayak Paul
Towards Data Science
8 min read · Apr 6, 2020


In this article, we are going to learn about the “reparameterization” trick that makes Variational Autoencoders (VAE) an eligible candidate for Backpropagation. First, we will discuss Autoencoders briefly and the problems that come with their vanilla variants. Then we will jump straight to the crux of the article — the “reparameterization” trick.

Note: This article is not a guide to teach you about Autoencoders, so I will be brief about them when needed. If you want to know more about Autoencoders, then you can check these articles out.

A Variational Autoencoder (Source)

Autoencoders: What do they do?

Autoencoders are a class of neural networks closely related to generative models. They allow us to compress a large input feature space into a much smaller one that can later be reconstructed. Compression, in general, is closely tied to the quality of learning.

We humans have amazing compression capabilities: we learn neat things and later expand on them when needed. For example, you often don't need to remember all the nitty-gritty of a particular concept; you just remember a few specific points about it and later reconstruct the rest with the help of those points.

So, if we are able to represent high-dimensional data in a much lower-dimensional space and reconstruct it later, it can be very useful for a number of different scenarios like data compression, low-dimensional feature extraction, and so on.

Vanilla Autoencoders

It’s easier to explain it with some code —

A shallow Autoencoder network

We have a very basic network here where we are:

  • Feeding a 784-d vector to the network.
  • Compressing the vector to a 32-d one (encoder).
  • Reconstructing the original 784-d vector from the 32-d one (decoder).
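Since the embedded code snippet may not render here, the network described above can be sketched in Keras as follows. This is a minimal, hypothetical reconstruction assuming the 784 → 32 → 784 layout from the bullets; the activations are illustrative choices, not necessarily those of the original gist.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Encoder: compress the 784-d input down to a 32-d latent vector.
inputs = layers.Input(shape=(784,))
encoded = layers.Dense(32, activation="relu")(inputs)

# Decoder: reconstruct the original 784-d vector from the 32-d code.
decoded = layers.Dense(784, activation="sigmoid")(encoded)

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```

Training this on MNIST amounts to flattening each 28×28 image into a 784-d vector and fitting the model with the images as both inputs and targets.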

The below figure might make this idea more clear —

Schematic diagram of a shallow Autoencoder network

Talking in terms of results, this network, when trained on the good old MNIST dataset, can yield the following result (notebook available here):

Predictions from the above network when trained on MNIST images

The outputs do not look that bad, but this network is prone to some problems —

  • The encoder of the network has no constraint on how it should encode input data points into latent variables (read: compress them). As a result, the latent representation need not care much about the structure of the input data points. Of course, there is a loss function at the end of the network (typically L2) that tells it how far off the predictions are from the original data points, but it still does not account for how the input data points should be compressed. Consequently, a very small change in the latent variables might cause the decoder to produce very different outputs.
  • Determining the dimension of the latent variables is another consideration. In this case, we used 32-d. With a higher-dimensional vector representing the latent variables, we can improve the quality of the generated images, but only up to a certain extent. The first problem would still be present.

The first problem is more serious than the second, because we can always experiment with different dimensions and observe the quality of the predictions. So, what can be done to resolve it?

Variational Autoencoders: Encode, Sample, Decode, and Repeat

Semantics of a VAE (Source)

To alleviate the issues present in a vanilla Autoencoder, we turn to Variational Autoencoders. The first change they introduce is that instead of directly mapping the input data points to latent variables, the input data points get mapped to a multivariate normal distribution. This distribution limits the free rein the encoder had when encoding the input data points into latent variables. At the same time, it introduces stochasticity into the network, because we are now sampling points from a probability distribution.

A normal distribution is parameterized by a mean (𝜇) and a variance (𝜎²), and that is exactly (with some “variations”) what is done in the case of a Variational Autoencoder. So, step by step —

  • Each data point in a VAE would get mapped to mean and log_variance vectors which would define the multivariate normal distribution around that input data point.
  • A point is sampled from this distribution and is returned as the latent variable.
  • This latent variable is fed to the decoder to produce the output.

This constrains the network to learn a smoother representation. It also ensures that a small change in the latent variables does not cause the decoder to produce wildly different outputs, because we are now sampling from a continuous distribution. On the other hand, since this sampling process is random by nature, the decoder outputs become more varied.

The following is a schematic diagram of a shallow VAE —

Schematic diagram of a shallow VAE

As you can see in the figure, we introduced another intermediate dense layer in the network. In a VAE, the mean and log-variance are learnable parameters. The Lambda layer in the above diagram represents the sampling operation, and it is defined as follows:

Coding the sampling operation of a VAE in TensorFlow
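The code embed may be missing here; a minimal version of this sampling operation in TensorFlow could look like the following. The function name `sampling` is illustrative, not necessarily what the original gist used.

```python
import tensorflow as tf

def sampling(args):
    """Sample z = z_mean + sigma * epsilon, with sigma = exp(z_log_var / 2)."""
    z_mean, z_log_var = args
    # epsilon is drawn from a standard normal distribution; it carries all
    # the randomness, so z_mean and z_log_var stay on a differentiable path.
    epsilon = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(z_log_var / 2) * epsilon
```

Inside the network, this would be wired up as something like `z = layers.Lambda(sampling)([z_mean, z_log_var])`.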

So, if an input data point is to be mapped into a latent variable 𝑧 via sampling (after getting passed through a neural network), it has to follow the following equation:

z = z_mean + 𝜎 · 𝜀, where 𝜎 = exp(z_log_var / 2).

By taking the logarithm of the variance, we allow the network's output to range over all real numbers rather than just positive values (a variance can only be positive). This allows for smoother representations of the latent space.

You might be wondering about the little term epsilon: what is its significance here? We will attend to it in a moment.

Now, before we can finally discuss the “reparameterization” trick, we need to review the loss function used to train a VAE. This is because it is ultimately the gradients of the loss function that we backpropagate, and the “reparameterization” trick is what makes backpropagation possible in a VAE.

VAE Loss

Recall from the above section that a VAE is trying to learn a distribution for the latent space. So, besides accounting for the reconstructed outputs produced by the decoder, we also need to make sure the distribution of the latent space is well-formed. From Deep Learning with Python (by François Chollet) (Page 300, 1st Edition) —

The parameters of a VAE are trained via two loss functions: a reconstruction loss that forces the decoded samples to match the initial inputs, and a regularization loss that helps learn well-formed latent spaces and reduce overfitting to the training data.

The regularization loss is handled with the Kullback–Leibler divergence. An excellent interpretation of KL divergence is available in GANs in Action (by Jakub Langr and Vladimir Bok) (Page 29, 1st Edition) —

[…] the Kullback–Leibler divergence (KL divergence), aka relative entropy, is the difference between cross-entropy of two distributions and their own entropy. For everyone else, imagine drawing out the two distributions, and wherever they do not overlap will be an area proportional to the KL divergence.

For a more mathematically rigorous treatment of the choice of KL divergence here, you may find this lecture useful.

For the reconstruction loss, I have mostly seen the following two options —

  • L2 loss
  • Binary cross-entropy (comparing each feature of a data point to the corresponding value in the reconstructed output)
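As a rough sketch of how these two terms combine, here is one way to write the VAE loss, assuming binary cross-entropy for the reconstruction term and the standard closed-form KL divergence between N(𝜇, 𝜎²) and N(0, 1). The function names are hypothetical, not from the article's notebook.

```python
import tensorflow as tf

def kl_divergence(z_mean, z_log_var):
    """Closed-form KL divergence between N(z_mean, sigma^2) and N(0, 1)."""
    return -0.5 * tf.reduce_mean(
        tf.reduce_sum(
            1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var),
            axis=-1))

def vae_loss(x, x_reconstructed, z_mean, z_log_var):
    # Reconstruction term: per-feature binary cross-entropy, batch-averaged.
    reconstruction = tf.reduce_mean(
        tf.keras.losses.binary_crossentropy(x, x_reconstructed))
    # Regularization term: pull the learned distribution toward N(0, 1).
    return reconstruction + kl_divergence(z_mean, z_log_var)
```

Note that the KL term vanishes exactly when the network outputs z_mean = 0 and z_log_var = 0, i.e. a standard normal distribution.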

We can now proceed to what we have been waiting to discuss: the “reparameterization” trick.

Training the VAE with backpropagation

To be able to update the parameters of a VAE using backpropagation, we need to account for the sampling node inside it, which is stochastic in nature. Gradients cannot flow through a purely random node, so we need a way to compute the gradients of the sampling operation with respect to the mean and log-variance vectors (both of which are used in the sampling layer).

A small portion from our VAE network

Remember the little fella in the sampling layer, epsilon? That is what reparameterizes our VAE network. It allows the mean and log-variance vectors to remain the learnable parameters of the network while the stochasticity of the entire system is confined to epsilon.

VAE network with and without the “reparameterization” trick (Source)

where 𝜙 represents the parameters of the distribution the network is trying to learn.

Epsilon remains a random variable (sampled from a standard normal distribution) with no learnable parameters of its own, so it carries all the stochasticity without blocking gradient flow. It is safe to reiterate here that the distribution 𝜙 (parameterized by the mean and log-variance vectors) is still being learned by the network. This idea is what allows a VAE to be trained end-to-end, and it was proposed by Kingma et al. in their paper Auto-Encoding Variational Bayes.
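A quick way to convince yourself that gradients flow after reparameterization is to compute them directly with tf.GradientTape. This is a toy sketch: the downstream loss here is a stand-in for whatever the decoder and reconstruction loss would contribute.

```python
import tensorflow as tf

# Learnable parameters of the latent distribution (a 2-d latent space here).
z_mean = tf.Variable([[0.5, -0.5]])
z_log_var = tf.Variable([[0.0, 0.0]])

with tf.GradientTape() as tape:
    # epsilon is sampled outside the learnable path; as far as the gradient
    # computation is concerned, it is just a constant input.
    epsilon = tf.random.normal(shape=tf.shape(z_mean))
    z = z_mean + tf.exp(z_log_var / 2) * epsilon
    loss = tf.reduce_sum(tf.square(z))  # stand-in downstream loss

# Both gradients are well-defined: backprop flows through the deterministic
# expression for z, while the randomness lives entirely in epsilon.
grads = tape.gradient(loss, [z_mean, z_log_var])
```

Without the trick, z would be drawn directly from a distribution whose parameters we want to learn, and no such gradients would exist.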

Conclusion

So, that’s it for the article, and thank you for reading it! Autoencoders are first-class members of the generative-model family, even finding applications in developing GANs (BEGAN). Disentangled VAEs are also quite relevant in the field of reinforcement learning (DARLA: Improving Zero-Shot Transfer in Reinforcement Learning). VAEs were a one-of-a-kind discovery that married Bayesian inference with Deep Learning, encouraging many different research directions.

Following are the references I used to write it; you should definitely check them out if you are interested in learning more about Autoencoders in general —

You can connect with me via Twitter (@RisingSayak).
