Variational Inference & Derivation of the Variational Autoencoder (VAE) Loss Function: A True Story

Dr Stephen Odaibo
The Blog of RETINA-AI Health, Inc.
13 min read · Feb 9, 2020

--

VAE Illustration by Stephen G. Odaibo, M.D.

When I was in graduate school in computer science at Duke around 2007/2008, the then DGS of statistics (Merlise Clyde, I believe, now Chair) attempted to recruit me to leave the computer science department and join the statistics department. I had an emerging appreciation for statistics, but it was not yet fully formed. Her offer surprised me and sounded interesting, but I declined it. Had I foreseen how ML and Statistics would merge today*, perhaps I would have obliged. But the Spirit of the LORD had not led me there at the time. Stats is superbly interesting and so much fun now. But CS is also still lots of fun. So who knows? Maybe I should have “quadruple majored”. I digress. One powerful example of statistical machine learning is the Variational Autoencoder.

Variational Autoencoders (VAEs) are a fascinating class of models that combine Bayesian statistics with deep neural networks. VAEs wear many hats and bridge many different worlds. They are, at least in part and all at once, each of the following:

  • Deep neural networks,
  • Bayesian statistical machines,
  • Latent variable models,
  • Maximum likelihood estimators,
  • Dimensionality reducers, and
  • Generative models.

Due to this profound intersectionality, to acquire a deep understanding of VAEs in theory and practice is to acquire a deep understanding of much of data science.

Autoencoders have been around a long time, but they suffer from not being truly generative models, in that their latent space is not truly continuous. That problem was addressed by Diederik Kingma and Max Welling around 2013 in their paper titled Auto-Encoding Variational Bayes. Their model modified standard autoencoders by having the encoder output a distribution rather than just a brittle vector of numbers. They then sampled from that distribution during the forward pass, and used the re-parametrization trick to allow backpropagation through the sampling step. Of note, the re-parametrization trick itself had been around longer, but gained greater popularity through its application to VAEs. Similarly, the ideas that are now collectively known as variational inference have been around for a few decades (Peterson & Anderson 1987; Jordan et al. 1999) and are the main subject of this tutorial.

*The merger between ML and stats had really already occurred even back in my grad school days, and more so many decades prior to that. What we are witnessing today is essentially just a rebranding as opposed to a merger or the creation of a new hybrid field.

Variational Inference: Brief Intro

In Bayesian statistics, the goal is often to determine the posterior distribution p(z|x) of a latent variable z given some observed data (evidence) x. However, determining this posterior distribution is typically computationally intractable, because according to Bayes,

$$p(z|x) = \frac{p(x|z)\,p(z)}{p(x)} = \frac{p(x|z)\,p(z)}{\int p(x|z)\,p(z)\,dz},$$
which is intractable because it involves an integral over the entire latent space z, and hence requires knowledge or computation of the entire evidence distribution p(x). To circumvent this intractability one instead approximates the posterior with some other distribution q(z|x), chosen to minimize a similarity measure between the true posterior and the approximation q. Here we use the Kullback-Leibler divergence, D_KL:

$$D_{KL}\big(q_\theta(z|x)\,\big\|\,p(z|x)\big) = \int q_\theta(z|x)\,\log\frac{q_\theta(z|x)}{p(z|x)}\,dz$$
Do not be intimidated by the above expression, the Kullback-Leibler divergence. We will break it into simple parts and derive it from scratch later in this tutorial. For now, just know that we use it as a similarity measure between the true posterior p(z|x) and the approximate posterior q_θ(z|x). Also note that in VAEs the approximate posterior q_θ(z|x) is modeled by a deep neural network, the encoder, which outputs the statistics (typically the mean and variance of a gaussian) of a distribution over the latent space. A sample from this distribution is passed into the decoder part of the model during the forward pass at training, and this sampling step is repeated with each iteration of the algorithm. This is the variational part of variational autoencoders. It is what distinguishes VAEs from AEs (regular autoencoders), and it is what confers continuity on the latent space, making VAEs truly generative models.
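
To make the sampling and re-parametrization steps concrete, here is a minimal sketch in Python/PyTorch, assuming an encoder that outputs the mean and log-variance of a diagonal gaussian; the layer sizes and names are illustrative assumptions, not the architecture of any particular paper:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input x to the parameters (mu, log_var) of a diagonal
    gaussian approximate posterior q_theta(z|x). Sizes are illustrative."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

def reparametrize(mu, log_var):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, I).
    The randomness lives in eps, so gradients can flow through mu and
    log_var -- this is the re-parametrization trick."""
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(sigma)
    return mu + sigma * eps
```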

OK, now back to our similarity measure, the Kullback-Leibler (KL) divergence. Manipulating it yields the following equation,

$$\log(p(x_i)) = D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z|x_i)\big) + \Big(\mathbb{E}_{q_\theta(z|x_i)}\big[\log(p(x_i|z))\big] - D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z)\big)\Big)$$
And since the KL divergence is non-negative, it follows that,

$$\log(p(x_i)) \ge \mathbb{E}_{q_\theta(z|x_i)}\big[\log(p(x_i|z))\big] - D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z)\big)$$
The term on the right of the above inequality is called the variational lower bound or the evidence lower bound (ELBO). This is because it serves as a lower bound on the log evidence, log(p(x_i)). Note from the equation that maximizing the ELBO pushes up the log likelihood of our data. And for a fixed datapoint the log likelihood is a constant, hence maximizing the ELBO is synonymous with minimizing the KL divergence, since the two terms add up to that constant. This is the core strategy of variational inference: we use maximization of the ELBO as a proxy for minimization of the KL divergence, which in turn optimizes our approximation of the true posterior by the approximate posterior.

In the following tutorial, we will derive each of the above equations in clear stepwise detail.

OBJECTIVES of this Tutorial

Let us reiterate and summarize the above introduction, and say a bit more about what we will be doing in this tutorial. In Bayesian machine learning the posterior distribution is typically computationally intractable, hence variational inference is often required. In this approach, an evidence lower bound on the log likelihood of the data is maximized during training. Variational Autoencoders (VAEs) are one important example where variational inference is utilized. In this tutorial, we will derive the variational lower bound loss function of the standard variational autoencoder. We will do so in the instance of a gaussian latent prior and a gaussian approximate posterior, under which assumptions the Kullback-Leibler term in the variational lower bound has a closed form solution. We will derive essentially everything we use along the way; everything from Bayes’ theorem to the Kullback-Leibler divergence.

Bayes Theorem

Bayes theorem is a way to update one’s belief as new evidence comes into view. The probability of a hypothesis, z, given some new data x, is denoted, p(z|x), and is given by

$$p(z|x) = \frac{p(x|z)\,p(z)}{p(x)}$$
Equation (1)

where p(x) is the probability of the data x, p(x|z) is the probability of the data given a hypothesis z, and p(z) is the probability of that hypothesis z. While Bayes theorem by itself can appear nonintuitive or at least difficult to intuit, the key to understanding it is to derive it. It arises directly out of the conditional probability axiom, which itself arises out of the definition of the joint probability. The probability of an event X and an event Y occurring jointly is,

$$p(X \cap Y) = p(X|Y)\,p(Y)$$
Equation (2)

And since the ‘AND’ is commutative, we have,

$$p(X \cap Y) = p(Y \cap X)$$
Equation (3)
$$p(X|Y)\,p(Y) = p(Y|X)\,p(X)$$
Equation (4)

Dividing both sides of Equation (4) by p(Y) yields Bayes theorem

$$p(X|Y) = \frac{p(Y|X)\,p(X)}{p(Y)}$$
Equation (5)
Table 1: Bayesian Statistics Glossary
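
As a concrete illustration of Equation (5), here is a small Python sketch with hypothetical numbers (a 1% prior prevalence, a 95% true-positive rate, and a 5% false-positive rate) showing how a positive test result updates the belief in a hypothesis:

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem (Equation 5).
p_z = 0.01              # prior probability of the hypothesis z (e.g., disease prevalence)
p_x_given_z = 0.95      # likelihood of the evidence x given z (test sensitivity)
p_x_given_not_z = 0.05  # probability of the evidence given not-z (false-positive rate)

# Evidence p(x) by total probability, summing over the hypothesis being true or false.
p_x = p_x_given_z * p_z + p_x_given_not_z * (1.0 - p_z)

# Bayes' theorem: posterior = likelihood * prior / evidence.
p_z_given_x = p_x_given_z * p_z / p_x
print(f"p(z|x) = {p_z_given_x:.3f}")  # ~0.161: a positive test raises a 1% prior to ~16%
```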

Kullback-Leibler Divergence

When comparing two distributions, as we often do in density estimation, the central task of generative models, we need a measure of similarity between them. The Kullback-Leibler divergence is a commonly used similarity measure for this purpose. It is the expectation of the information difference between the two distributions. But first, what is information? To understand what information is and to see its definition, consider the following: the higher the probability of an event, the lower its information content. This makes intuitive sense in that if someone tells us something ‘obvious’, i.e. highly probable, i.e. something we and almost everyone else already knew, then that informant has not increased the amount of information we have. Hence the information content of a highly probable event is low. Another way to say this is that information is inversely related to the probability of an event. And since log(p(x)) is directly related to p(x), it follows that −log(p(x)) is inversely related to p(x), and this is how we model information:

$$I(p(x)) = -\log(p(x))$$
Equation (6)
$$I(q(x)) = -\log(q(x))$$
Equation (7)

The difference of information between q(x) and p(x) is therefore:

$$\Delta I = I(p(x)) - I(q(x)) = \log(q(x)) - \log(p(x)) = \log\frac{q(x)}{p(x)}$$
Equation (8)

And the Kullback-Leibler is the expectation of the above difference, and is given by,

$$D_{KL}\big(q(x)\,\big\|\,p(x)\big) = \mathbb{E}_{q(x)}\big[\Delta I\big] = \int q(x)\,\log\frac{q(x)}{p(x)}\,dx$$
Equation (9)

Similarly,

$$D_{KL}\big(p(x)\,\big\|\,q(x)\big) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx$$
Equation (10)

Note that the Kullback-Leibler (KL) divergence is not symmetric, i.e.,

$$D_{KL}\big(q(x)\,\big\|\,p(x)\big) \ne D_{KL}\big(p(x)\,\big\|\,q(x)\big)$$
Equation (11)

In D_KL(q(x)||p(x)) we take the expectation of the information difference with respect to the q(x) distribution, while in D_KL(p(x)||q(x)) we take the expectation with respect to the p(x) distribution. Hence the Kullback-Leibler is called a ‘divergence’ and not a ‘metric’, as metrics must be symmetric. There have recently been a number of symmetrization devices proposed for the KL divergence which have been shown to improve generative fidelity [Chen et al. (2017); Arjovsky et al. (2017)]. Note that the KL divergence is always non-negative, i.e.,

$$D_{KL}\big(q(x)\,\big\|\,p(x)\big) \ge 0$$
Equation (12)

To see this, note that as depicted in Figure (1),

$$\log(t) \le t - 1$$
Equation (13)

Therefore

$$-D_{KL}\big(q(x)\,\big\|\,p(x)\big) = \int q(x)\,\log\frac{p(x)}{q(x)}\,dx \le \int q(x)\left(\frac{p(x)}{q(x)} - 1\right)dx = \int p(x)\,dx - \int q(x)\,dx = 0$$
Equation (14)

We have just shown,

$$-D_{KL}\big(q(x)\,\big\|\,p(x)\big) \le 0$$
Equation (15)

which implies,

$$D_{KL}\big(q(x)\,\big\|\,p(x)\big) \ge 0$$
Equation (16)
Figure 1: log(t) ≤ t − 1
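
As a quick numerical sanity check of Equations (9) through (12), the following NumPy sketch, using two arbitrary illustrative discrete distributions, computes both directions of the KL divergence; both come out non-negative, and they differ, confirming the asymmetry:

```python
import numpy as np

# Two arbitrary discrete distributions over 3 outcomes (illustrative values only).
q = np.array([0.7, 0.2, 0.1])
p = np.array([0.4, 0.4, 0.2])

def kl(a, b):
    """D_KL(a || b) = sum_x a(x) * log(a(x)/b(x)), cf. Equation (9)."""
    return float(np.sum(a * np.log(a / b)))

print(kl(q, p))  # ~0.18  (non-negative)
print(kl(p, q))  # ~0.19  (non-negative, but different: KL is not symmetric)
```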

VAE Objective

Consider variational autoencoders [Kingma & Welling (2013)]. They have many applications, including finer characterization of disease [Odaibo (2019)]. The encoder portion of a VAE yields an approximate posterior distribution q(z|x), and is parametrized by a neural network with weights collectively denoted θ. Hence we more properly write the encoder as q_θ(z|x). Similarly, the decoder portion of the VAE yields a likelihood distribution p(x|z), and is parametrized by a neural network with weights collectively denoted φ. Hence we more properly denote the decoder portion of the VAE as p_φ(x|z). The outputs of the encoder are the parameters of the latent distribution, which is sampled to yield the input to the decoder. A VAE schematic is shown in Figure (2).

Figure 2: VAE
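
The following is a minimal sketch of the architecture in Figure (2), continuing the illustrative Encoder and reparametrize function sketched earlier; the Bernoulli-style decoder output and the dimensions are assumptions for illustration, not a prescribed design:

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Maps a latent sample z to the parameters of the likelihood p_phi(x|z);
    here, Bernoulli means over pixels as an illustrative choice."""
    def __init__(self, z_dim=20, h_dim=256, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z)

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()   # q_theta(z|x): outputs (mu, log_var)
        self.decoder = Decoder()   # p_phi(x|z): outputs reconstruction parameters

    def forward(self, x):
        mu, log_var = self.encoder(x)
        z = reparametrize(mu, log_var)   # stochastic draw, repeated every forward pass
        x_recon = self.decoder(z)
        return x_recon, mu, log_var
```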

The KL divergence between the approximate and the real posterior distributions is given by,

$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z|x_i)\big) = \int q_\theta(z|x_i)\,\log\frac{q_\theta(z|x_i)}{p(z|x_i)}\,dz$$
Equation (17)

Applying Bayes’ theorem to the above equation yields,

$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z|x_i)\big) = \int q_\theta(z|x_i)\,\log\frac{q_\theta(z|x_i)\,p(x_i)}{p(x_i|z)\,p(z)}\,dz$$
Equation (18)

This can be broken down using laws of logarithms, yielding,

$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z|x_i)\big) = \int q_\theta(z|x_i)\left[\log\frac{q_\theta(z|x_i)}{p(x_i|z)\,p(z)} + \log(p(x_i))\right]dz$$
Equation (19)

Distributing the integrand then yields,

$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z|x_i)\big) = \int q_\theta(z|x_i)\,\log\frac{q_\theta(z|x_i)}{p(x_i|z)\,p(z)}\,dz + \int q_\theta(z|x_i)\,\log(p(x_i))\,dz$$
Equation (20)

In the above, we note that log(p(x_i)) is a constant and can therefore be pulled out of the second integral, yielding,

$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z|x_i)\big) = \int q_\theta(z|x_i)\,\log\frac{q_\theta(z|x_i)}{p(x_i|z)\,p(z)}\,dz + \log(p(x_i))\int q_\theta(z|x_i)\,dz$$
Equation (21)

And since q_θ(z|x_i) is a probability distribution, it integrates to 1 in the above equation, yielding,

$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z|x_i)\big) = \int q_\theta(z|x_i)\,\log\frac{q_\theta(z|x_i)}{p(x_i|z)\,p(z)}\,dz + \log(p(x_i))$$
Equation (22)

Since the left hand side is a KL divergence, it is non-negative by Equation (16). Carrying the integral over to the other side of the resulting inequality, we get,

$$\log(p(x_i)) \ge -\int q_\theta(z|x_i)\,\log\frac{q_\theta(z|x_i)}{p(x_i|z)\,p(z)}\,dz$$
Equation (23)

Applying rules of logarithms, we get,

$$\log(p(x_i)) \ge \int q_\theta(z|x_i)\,\log\frac{p(x_i|z)\,p(z)}{q_\theta(z|x_i)}\,dz$$
Equation (24)

Recognizing the right hand side of the above inequality as an expectation, we write,

$$\log(p(x_i)) \ge \mathbb{E}_{q_\theta(z|x_i)}\left[\log\frac{p(x_i|z)\,p(z)}{q_\theta(z|x_i)}\right]$$
Equation (25)
$$\log(p(x_i)) \ge \mathbb{E}_{q_\theta(z|x_i)}\big[\log(p(x_i|z))\big] + \mathbb{E}_{q_\theta(z|x_i)}\left[\log\frac{p(z)}{q_\theta(z|x_i)}\right]$$
Equation (26)

From Equation (23) it also follows that:

$$\log(p(x_i)) \ge \int q_\theta(z|x_i)\,\log\frac{p(z)}{q_\theta(z|x_i)}\,dz + \int q_\theta(z|x_i)\,\log(p(x_i|z))\,dz$$
Equation (27)
$$\log(p(x_i)) \ge \mathbb{E}_{q_\theta(z|x_i)}\big[\log(p_\phi(x_i|z))\big] - D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z)\big)$$
Equation (28)

The right hand side of the above inequality is the Evidence Lower Bound (ELBO), also known as the variational lower bound. It is so termed because it bounds from below the log likelihood of the data, which is the quantity we seek to maximize. Therefore maximizing the ELBO maximizes the log probability of our data by proxy. This is the core idea of variational inference, since maximizing the log probability directly is typically computationally intractable. The Kullback-Leibler term in the ELBO is a regularizer, because it is a constraint on the form of the approximate posterior. The second term is called a reconstruction term, because it is a measure of the likelihood of the reconstructed data output at the decoder.

Notably, we have some liberty to choose the structure of our latent variables. We can obtain a closed form for the loss function if we choose a gaussian representation for the latent prior p(z) and the approximate posterior q_θ(z|x_i). In addition to yielding a closed form loss function, the gaussian model enforces a form of regularization in which the approximate posterior has variation or spread (like a gaussian).

Closed Form VAE Loss: Gaussian Latents

Say we choose:

$$p(z) = \mathcal{N}(0,\,1)$$
Equation (29)

and

$$q_\theta(z|x_i) = \mathcal{N}(\mu,\,\sigma^2)$$
Equation (30)

then the KL or regularization term in the ELBO becomes:

$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z)\big) = \int q_\theta(z|x_i)\,\log\frac{q_\theta(z|x_i)}{p(z)}\,dz = \int \mathcal{N}(z;\,\mu,\sigma^2)\,\log\frac{\mathcal{N}(z;\,\mu,\sigma^2)}{\mathcal{N}(z;\,0,1)}\,dz$$
Equation (31)

Evaluating the term in the logarithm simplifies the above into,

$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z)\big) = \int \mathcal{N}(z;\,\mu,\sigma^2)\left[-\log(\sigma) - \frac{(z-\mu)^2}{2\sigma^2} + \frac{z^2}{2}\right]dz$$
Equation (32)

This further simplifies into,

$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z)\big) = -\log(\sigma)\int \mathcal{N}(z;\,\mu,\sigma^2)\,dz - \frac{1}{2\sigma^2}\int \mathcal{N}(z;\,\mu,\sigma^2)\,(z-\mu)^2\,dz + \frac{1}{2}\int \mathcal{N}(z;\,\mu,\sigma^2)\,z^2\,dz$$
Equation (33)

which further simplifies into,

$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z)\big) = -\log(\sigma) - \frac{1}{2\sigma^2}\,\mathbb{E}_{q_\theta}\big[(z-\mu)^2\big] + \frac{1}{2}\,\mathbb{E}_{q_\theta}\big[z^2\big]$$
Equation (34)
$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z)\big) = -\frac{1}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\,\mathbb{E}_{q_\theta}\big[(z-\mu)^2\big] + \frac{1}{2}\,\mathbb{E}_{q_\theta}\big[z^2\big]$$
Equation (35)

And since the variance σ^2 is the expectation of the squared distance from the mean, i.e.,

$$\sigma^2 = \mathbb{E}_{q_\theta}\big[(z-\mu)^2\big]$$
Equation (36)

it follows that,

$$-\frac{1}{2\sigma^2}\,\mathbb{E}_{q_\theta}\big[(z-\mu)^2\big] = -\frac{1}{2}$$
Equation (37)

Recall that,

$$\mathbb{E}_{q_\theta}\big[z^2\big] = \sigma^2 + \mu^2$$
Equation (38)

therefore,

$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z)\big) = -\frac{1}{2}\log(\sigma^2) - \frac{1}{2} + \frac{1}{2}\big(\sigma^2 + \mu^2\big)$$
Equation (39)
$$D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z)\big) = -\frac{1}{2}\Big(1 + \log(\sigma^2) - \sigma^2 - \mu^2\Big)$$
Equation (40)
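
As a sanity check on Equation (40), the following sketch, with illustrative values of μ and σ for a single latent dimension, compares the closed form against a Monte Carlo estimate of the integral in Equation (31):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8   # illustrative encoder outputs for one latent dimension

# Closed form, Equation (40): -(1/2) * (1 + log(sigma^2) - sigma^2 - mu^2)
kl_closed = -0.5 * (1.0 + np.log(sigma**2) - sigma**2 - mu**2)

# Monte Carlo estimate of Equation (31): E_q[ log q(z) - log p(z) ], z ~ N(mu, sigma^2)
z = rng.normal(mu, sigma, size=1_000_000)
log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2)
log_p = -0.5 * np.log(2 * np.pi) - z**2 / 2
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # both ~1.17, agreeing up to Monte Carlo error
```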

Recall the ELBO, Equation (28),

$$\log(p(x_i)) \ge \mathbb{E}_{q_\theta(z|x_i)}\big[\log(p_\phi(x_i|z))\big] - D_{KL}\big(q_\theta(z|x_i)\,\big\|\,p(z)\big)$$

From which it follows that the contribution of a given datum x_i towards the objective to be maximized, applying Equation (40) to each of the J latent dimensions and using a single stochastic draw z_i from q_θ(z|x_i), is

$$G(\theta,\phi;\,x_i,z_i) = \frac{1}{2}\sum_{j=1}^{J}\Big(1 + \log(\sigma_j^2) - \sigma_j^2 - \mu_j^2\Big) + \log\big(p_\phi(x_i|z_i)\big)$$
Equation (41)

and, averaging the reconstruction term over L such draws,

$$G(\theta,\phi;\,x_i) = \frac{1}{2}\sum_{j=1}^{J}\Big(1 + \log(\sigma_j^2) - \sigma_j^2 - \mu_j^2\Big) + \frac{1}{L}\sum_{l=1}^{L}\log\big(p_\phi(x_i|z_{i,l})\big)$$
Equation (42)

where J is the dimension of the latent vector z, and L is the number of samples stochastically drawn according to the re-parametrization trick.

Because the objective function we obtain in Equation (42) is to be maximized during training, we can think of it as a ‘gain’ function as opposed to a loss function. To obtain the loss function, we simply take the negative of G:

$$\mathcal{L}(\theta,\phi;\,x_i) = -G(\theta,\phi;\,x_i) = -\frac{1}{2}\sum_{j=1}^{J}\Big(1 + \log(\sigma_j^2) - \sigma_j^2 - \mu_j^2\Big) - \frac{1}{L}\sum_{l=1}^{L}\log\big(p_\phi(x_i|z_{i,l})\big)$$
Equation (43)

Therefore, to train the VAE is to seek the optimal network parameters (θ*, φ*) that minimize this loss:

$$(\theta^*,\,\phi^*) = \arg\min_{\theta,\phi}\;\mathcal{L}(\theta,\phi)$$
Equation (44)
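
Putting Equations (42) through (44) into code, here is a minimal sketch of the resulting training loss in Python/PyTorch, continuing the illustrative VAE sketched earlier, with a Bernoulli decoder and a single stochastic draw (L = 1), as is common in practice:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO for one minibatch, cf. Equations (42)-(43), with L = 1.

    Reconstruction term: log p_phi(x|z) for a Bernoulli decoder is the
    negative binary cross-entropy between x and the reconstructed means.
    Regularization term: closed-form KL of Equation (40), summed over the
    J latent dimensions.
    """
    recon_log_lik = -F.binary_cross_entropy(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - log_var.exp() - mu.pow(2))
    gain = recon_log_lik - kl        # the ELBO 'gain' G of Equation (42)
    return -gain                     # the loss of Equation (43), to be minimized

# Illustrative training step, assuming `model` is the VAE sketched earlier:
# x_recon, mu, log_var = model(x)
# loss = vae_loss(x, x_recon, mu, log_var)
# loss.backward(); optimizer.step()
```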

Conclusion

We have done a step-by-step derivation of the VAE loss function. We illustrated the essence of variational inference along the way, and derived the closed form loss in the special case of gaussian latents.

BIO

Dr. Stephen G. Odaibo is CEO & Founder of RETINA-AI Health, Inc, and is on the Faculty of the MD Anderson Cancer Center, the #1 Cancer Center in the world. He is a Physician, Retina Specialist, Mathematician, Computer Scientist, and Full Stack AI Engineer. In 2017 he received UAB College of Arts & Sciences’ highest honor, the Distinguished Alumni Achievement Award. And in 2005 he won the Barrie Hurwitz Award for Excellence in Neurology at Duke Univ School of Medicine where he topped the class in Neurology and in Pediatrics. He is author of the books “Quantum Mechanics & The MRI Machine” and “The Form of Finite Groups: A Course on Finite Group Theory.” Dr. Odaibo Chaired the “Artificial Intelligence & Tech in Medicine Symposium” at the 2019 National Medical Association Meeting. Through RETINA-AI, he and his team are building AI solutions to address the world’s most pressing healthcare problems. He resides in Houston Texas with his family.

REFERENCES:
Odaibo SG. retina-VAE: Variationally Decoding the Spectrum of Macular Disease. arXiv:1907.05195. 2019 Jul 11.
Kingma DP, Welling M. Auto-Encoding Variational Bayes. arXiv:1312.6114. 2013 Dec 20.
Odaibo SG. Tutorial: Deriving the Standard Variational Autoencoder (VAE) Loss Function. arXiv:1907.08956. 2019 Jul 21.
Chen L, Dai S, Pu Y, Li C, Su Q, Carin L. Symmetric Variational Autoencoder and Connections to Adversarial Learning. arXiv:1709.01846. 2017 Sep 6.
Arjovsky M, Bottou L. Towards Principled Methods for Training Generative Adversarial Networks. arXiv:1701.04862. 2017 Jan 17.
Peterson C, Anderson J. A mean field theory learning algorithm for neural networks. Complex Systems. 1987;1(5):995–1019.
Jordan MI, Ghahramani Z, Jaakkola T, Saul L. Introduction to variational methods for graphical models. Machine Learning. 1999;37:183–233.
Blei DM, Kucukelbir A, McAuliffe JD. Variational inference: A review for statisticians. Journal of the American Statistical Association. 2017;112(518):859–877.
