On Distribution of Z’s in VAE
Do we use the generative part correctly?
Motivation
The motivation for this post arose while learning the theory of VAE and following its common implementations. As we know, a VAE is constructed of two networks: one (the encoder) is trained to map real data into a Gaussian distribution, aiming to minimize its KL distance from a given distribution (typically the standard Normal dist.), and the other (the decoder) is trained to map samples of this Gaussian distribution (Z's) back into real data. The final objective is that the decoder will be able to generate data from a generic distribution, but the training has dual objectives:
- Getting as close as possible to the standard normal dist.
- Generating good data samples by the decoder
VAE commonly uses ELBO as a loss function. This function was introduced to solve Variational Inference (VI) problems, and appeared earlier in thermodynamics. It aims to study the distribution of latent variables given observed data.
While ELBO seems to be the right function for generating data given existing real data (this is actually the classical definition of VI), it is unclear why one should expect ELBO to converge to the normal dist.
In order to begin the analytical discussion, let's present the ELBO formula.
Our objective is to find the optimal distribution Q of the latent variables (Z's) that minimizes its KL distance from P(Z|X), where X is the real data.
We can write KL as follows:

KL(Q ‖ P(Z|X)) = E_Q[log Q(Z)] − E_Q[log P(Z|X)]

Which leads to the following decomposition:

log P(X) = KL(Q ‖ P(Z|X)) + E_Q[log P(X|Z)] − KL(Q ‖ P(Z))
P(Z) is the prior distribution of Z and is assumed to be the standard Normal distribution.
ELBO is the second part of this decomposition: E_Q[log P(X|Z)] − KL(Q ‖ P(Z)). Since log P(X) does not depend on Q, maximizing ELBO is equivalent to minimizing the KL divergence.
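To make the decomposition concrete, here is a minimal sketch that verifies log P(X) = ELBO + KL(Q ‖ P(Z|X)) on a toy conjugate Gaussian model (the model, the data point x and the variational parameters below are illustrative choices, not part of the post):

```python
import math

def kl_gauss(m1, s1, m2, s2):
    # KL( N(m1, s1^2) || N(m2, s2^2) ), closed form for 1-D Gaussians
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Toy conjugate model: P(Z) = N(0,1), P(X|Z) = N(z,1)
# => marginal P(X) = N(0,2) and posterior P(Z|X) = N(x/2, 1/2)
x = 0.7
m, s = 0.3, 0.8  # an arbitrary variational distribution Q = N(m, s^2)

log_px = -0.5 * math.log(2 * math.pi * 2) - x**2 / (2 * 2)
# E_Q[log P(x|Z)] = -0.5*log(2*pi) - 0.5*E[(x - Z)^2]
expected_loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m)**2 + s**2)
elbo = expected_loglik - kl_gauss(m, s, 0.0, 1.0)
gap = kl_gauss(m, s, x / 2, math.sqrt(0.5))  # KL(Q || P(Z|x))

# the decomposition: log P(X) = ELBO + KL(Q || P(Z|X))
```

Because the model is conjugate, every term has a closed form and the identity holds to machine precision.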
Considering the dual objectives of VAE and the algebraic structure of the ELBO’s formula, there are two questions:
- Does the ELBO's KL term (between Q and the prior P) drive Q to the standard normal dist.?
- Assuming that the answer for the first question is affirmative, what about the entire ELBO?
In the following sections I will study both questions and present some numerical experiments.
KL Convergence
In order to justify the convergence of a function to a certain point, we have to show that this point is an extremum and that there exists a time dynamics in which this point is stable. We will show that the point (0, 1) (i.e., µ = 0, σ = 1) satisfies these conditions for the KL term of the ELBO. Since a common method to calculate the loss function is the Monte Carlo estimator, we will show in the next sections that (0, 1) is a minimum for both the analytical solution and the Monte Carlo estimator. In the following section we will show the stability of this point.
Gaussian Analytical Formula
Assume that we are interested in the KL distance between two Gaussian distributions, one denoted by P and the other being the standard normal distribution (S).
For a K-dimensional Gaussian with diagonal covariance, the KL divergence follows this formula:

KL(P ‖ S) = ½ Σ_k (σ_k² + µ_k² − 1 − log σ_k²)
(Remark: in VAE we always assume that the covariance matrix is diagonal. Since every covariance matrix is similar to a diagonal matrix, and similar matrices share their trace and determinant, the diagonal assumption is sufficient KL-wise.)
For simplicity we can write this formula in a different way:

KL(P ‖ S) = ½ Σ_k [F(µ_k) + G(σ_k)], where F(µ) = µ² and G(σ) = σ² − 1 − log σ²

F and G are non-negative functions, with a unique zero at 0 (for F) and at 1 (for G). In order to illustrate, we present the planar space (µ, σ).
The KL graph has this form
We can see that when we follow the analytical approach, the standard normal distribution is a unique minimum.
However, this is a minimum of the KL term, not of the ELBO.
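A quick numerical check of this claim: the sketch below (the grid ranges and resolution are arbitrary choices) scans the (µ, σ) plane with the analytical formula and locates the minimum:

```python
import numpy as np

def kl_diag_gauss_vs_std(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# scan a grid in the (mu, sigma) plane for a 1-D latent
mus = np.linspace(-2.0, 2.0, 41)    # step 0.1
sigmas = np.linspace(0.2, 3.0, 29)  # step 0.1
grid = [(kl_diag_gauss_vs_std([m], [s]), m, s) for m in mus for s in sigmas]
best_kl, best_mu, best_sigma = min(grid)
```

The minimum of the scan lands at µ = 0, σ = 1 with KL = 0, matching the analytical claim.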
Commonly we don't solve VAE with the analytical form but with a Monte Carlo estimator. We discuss this in the next section.
Monte Carlo
Let's denote, as before:
P(Z) - the prior distribution (the standard normal distribution)
Q(Z|X) - the encoder's Gaussian distribution, with parameters µ, σ
X - the data
We have the ELBO formula:

ELBO = E_Q[log P(X|Z)] − KL(Q(Z|X) ‖ P(Z))
Since we are interested in the convergence of KL, we focus on µ, σ. The dependency of the decoder on Z is intractable, since it is obtained through the hidden layers of the decoder, so we observe the KL term alone.
The common method to calculate the KL divergence is by assuming a Gaussian distribution and using a Monte Carlo estimator:

KL(Q ‖ P) ≈ (1/L) Σ_l [log Q(z_l|X) − log P(z_l)], where z_l = µ + σ·ε_l, ε_l ~ N(0, I)
If we write these terms explicitly for each data item, we have the following (per latent dimension):

log Q(z_l|X) = −½ log(2π) − log σ − (z_l − µ)² / (2σ²)
log P(z_l) = −½ log(2π) − z_l² / 2

When we average over the batch size and the samples of Z, we obtain, in expectation:

E[log Q(Z|X) − log P(Z)] = ½ (σ² + µ² − 1 − log σ²)

Recall the analytical formula from the previous section.
We obtained the same result as in the analytical solution. It is pretty clear that if the KL divergence converges to a point (a Gaussian distribution), it will be the standard normal distribution. It is still not clear why this convergence takes place.
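The agreement between the two forms is easy to check numerically. The following sketch (the sample size and the (µ, σ) point are arbitrary) compares the Monte Carlo estimator with the analytical formula for a 1-D Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_monte_carlo(mu, sigma, n_samples=200_000):
    # Reparameterization: z = mu + sigma * eps, eps ~ N(0, 1)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps
    # log q(z) - log p(z), with q = N(mu, sigma^2) and p = N(0, 1)
    log_q = -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * eps**2
    log_p = -0.5 * np.log(2 * np.pi) - 0.5 * z**2
    return float(np.mean(log_q - log_p))

def kl_analytical(mu, sigma):
    return 0.5 * (sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

mc = kl_monte_carlo(0.5, 1.5)
exact = kl_analytical(0.5, 1.5)
```

With 200,000 samples the two estimates agree to within the Monte Carlo noise.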
Dynamical System Approach
In order to create an intuition regarding the process during the learning, let's consider the (µ, σ) plane. A potential learning process may follow the red or the blue curve (sorry, spirals are cooler).
In order to see why the unique minimum is a stable fixed point, we have to observe the KL divergence. Given a distribution function, KL is a non-negative function that equals 0 only for identical distributions. This makes it a good candidate for a Lyapunov function.
Now we have the following gradient dynamics on the parameters:

µ̇ = −∂KL/∂µ, σ̇ = −∂KL/∂σ

If we consider continuous time, we can calculate the time derivative of the KL divergence along the flow:

dKL/dt = (∂KL/∂µ)·µ̇ + (∂KL/∂σ)·σ̇ = −(∂KL/∂µ)² − (∂KL/∂σ)² ≤ 0

Hence KL is non-negative and decreasing along the flow, namely (0, 1) is a stable fixed point.
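We can illustrate the Lyapunov argument with a discretized gradient flow on the (µ, σ) plane (the starting point, step size and horizon below are arbitrary choices). For the 1-D KL, the gradients are ∂KL/∂µ = µ and ∂KL/∂σ = σ − 1/σ:

```python
import numpy as np

def kl(mu, sigma):
    # KL( N(mu, sigma^2) || N(0, 1) )
    return 0.5 * (sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# Discretized gradient flow starting away from the fixed point
mu, sigma, lr = 1.5, 2.5, 0.01
values = [kl(mu, sigma)]
for _ in range(2000):
    mu = mu - lr * mu                           # dKL/dmu    = mu
    sigma = sigma - lr * (sigma - 1.0 / sigma)  # dKL/dsigma = sigma - 1/sigma
    values.append(kl(mu, sigma))
```

The trajectory converges to (0, 1), and the KL values decrease monotonically along the way, exactly the Lyapunov behavior described above.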
Intermediate Summary
We have seen, using two methodologies, that (0, 1) is a stable fixed point of ELBO's KL term. If we assume that the entire ELBO is needed for training, it leaves two possible options:
- We need the entire ELBO since its conditional probability term P(X|Z) also achieves its optimum at (0, 1) and improves the entire system
- ELBO does not attain its minimum at (0, 1), but still improves the VAE
On the other hand, if we can explain the convergence to (0, 1) without the decoder and preserve the overall results, perhaps we don't need the conditional distribution term.
Experiments
Mean Field
In this section I present some of the trials that I performed in order to answer the question from the section above. I used the mean field algorithm for this.
This graph represents the KL decay for four frameworks:
Blue - the currently used method (full ELBO)
Green - Monte Carlo estimator, where training is solely on the encoder
Red - analytical formula of two Gaussian distributions, where training is solely on the encoder
Black - KL divergence of random samples drawn from torch.randn (one can think of it as a sort of GAN input)
It can be seen that under KL-only training the Z's distribution is closer to the standard normal distribution than it is when we use the entire ELBO. This graph is aligned with our analytical calculations from the sections above. Furthermore, our KL-only training achieved an even better distribution than simply sampling from torch.randn.
However, these results provide no information about the decoder. We can see that our encoder is trained to have a distribution Q that is close, KL-wise, to the standard Gaussian, but how well does the decoder work? We can use one of the following methodologies:
- Train the encoder and then the decoder (Separated)
- Train them both (Simultaneous)
- Follow the current method (train the entire system together)
The first and the second methods will provide the same KL. However, it is not obvious that the decoder's performances are similar. The third option will provide a higher KL (a Z distribution with a bigger distance from the standard Gaussian).
We will estimate the decoder's performance using two "KPIs":
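The three training regimes above differ only in which loss drives which network. They can be sketched as three loss modes in PyTorch (a minimal sketch; the MNIST-like layer sizes and the toy batch are illustrative, not the post's exact model):

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy encoder/decoder; the 784/20 sizes are illustrative, not the post's exact model
enc = nn.Linear(784, 2 * 20)                      # outputs mu and log-variance
dec = nn.Sequential(nn.Linear(20, 784), nn.Sigmoid())

def kl_term(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()

def step(x, mode):
    mu, logvar = enc(x).chunk(2, dim=1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
    bce = nn.functional.binary_cross_entropy(dec(z), x)
    if mode == "joint":           # current method: full ELBO (BCE + KL)
        return bce + kl_term(mu, logvar)
    if mode == "encoder_only":    # KL-only step for the encoder (Separated / Simultaneous)
        return kl_term(mu, logvar)
    return bce                    # decoder-only step of the separated scheme

x = torch.rand(8, 784)            # toy batch standing in for real images
loss = step(x, "joint")
```

Separated training calls the "encoder_only" mode until convergence and only then the decoder mode; simultaneous training alternates between them; the current method always uses "joint".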
- Likelihood scores
- Generated images
The following two images were created by the current algorithm.
We used the first methodology (namely, train the encoder and only afterwards the decoder). The following image was created using this method with the analytical loss.
This image was created from randomly sampled Z's.
We can observe the decoder's scores for the Separated method.
It can be seen, both through the scores and through the images, that the decoder works slightly better for the current method.
The following graph compares the difference between simultaneous and separated learning.
Images that were created using the simultaneous methodology, for both the analytical and the Monte Carlo loss, are given:
Summary
- We saw that, as the mathematical intuition suggests, training the encoder independently in a VAE provides better KL results (namely, the Z's obtained are closer to the standard normal dist.)
- Training the VAE in a "tandem manner" (first the encoder and then the decoder) is similar to training them simultaneously (namely, take a KL step with the encoder and then take a BCE step with the decoder and the Z's)
- On the other hand, removing the decoder from the encoder's training distorts the decoder's performance. This makes the entire training less coherent mathematically, since it reveals that good Z's (good = closer to the standard normal dist.) are not enough for a good decoder.
Corollary
Recall that VAE uses ELBO to achieve two objectives:
- Train the encoder to map the data into a Gaussian distribution which is close to the standard Normal distribution KL-wise (the encoder's objective)
- Use samples of this distribution to train the decoder to generate data (the decoder's objective)
We saw that ELBO indeed provides the optimal results for the decoder's objective, whereas the encoder's objective achieves better results when we use only the KL divergence.
One can claim that the essential part is the quality of the generated data; since ELBO works well for this purpose, we are fine. However, since we wish the decoder to play the role of an image generator, this claim encounters some obstacles: we saw that the optimal images are obtained by Z's that are drawn from a non-standard distribution.
Theory
The VAE is a variational inference problem: we have observed data that was created using latent variables, and we wish to find the optimal distribution for generating these latent variables. At the inference stage we use this distribution to generate new data. What we saw is that this distribution cannot be assumed to be the standard Normal distribution.
We can assume that for every data set X and a class of latent variables' distributions P, there exists a distribution q that optimizes the decoder. However, if one wishes to generate images using this distribution, one must either use the encoder for this process, or measure q's parameters and follow them to sample the decoder's inputs. We cannot assume that the decoder uses a generic distribution.
The code of this study can be found here, and it is based on the code of Professor Altosaar that you can find here.
Acknowledgements
I wish to thank Uri Itai and Shlomo Kashani for their review and assistance during this work.
Bibliography
https://mr-easy.github.io/2020-04-16-kl-divergence-between-2-gaussian-distributions/
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
https://arxiv.org/pdf/1312.6114.pdf
https://www.math24.net/method-lyapunov-functions/ [Lyapunov]
http://www.wisdom.weizmann.ac.il/~vered/CourseDS2019/tut8-Lyapunov%20functions.pdf [Lyapunov]
https://stanford.edu/class/ee363/lectures/lyap.pdf
https://chrisorm.github.io/VI-ELBO.html
Image Link
https://unsplash.com/photos/Hba7In7vnoM
Python Code
https://github.com/lyeoni/pytorch-mnist-VAE