Thoughts and Theory

Variational Inference with Normalizing Flows on MNIST

Understanding normalizing flows and how they can be used in generative modeling

Mohammadreza (Reza) Salehi
Towards Data Science
9 min read · Apr 2, 2021


Introduction

In this post, I will explain what normalizing flows are and how they can be used in variational inference and in designing generative models. The material in this article mostly comes from [Rezende and Mohamed, 2015], which I believe is the first paper to introduce the concept of flow-based models (the title of this article is almost identical to the paper’s title). Many interesting papers have since followed up on this one and used flow-based models to solve other tasks. However, in this post my focus is just on the first paper and on covering the most fundamental concepts.

I have tried to explain how these models work step by step, and there is a PyTorch code snippet accompanying each step. The code for the model and the visualizations is all available in this Github repo.

Before getting into normalizing flows, it is helpful to review what variational inference is and how normalizing flows relate to it.

What is variational inference?

Assume that we have a set of observations x¹, x², …, xⁿ that are i.i.d. samples from a distribution p(x) which we do not know (these samples are not necessarily 1-D and can be multidimensional). Also, assume that there is a set of hypothetical variables z¹, z², …, zⁿ behind the generation of these data samples. That is, to generate a data sample, first a latent variable is sampled from a distribution p(z), and then it is used to compute a data sample x which we observe. As the z’s are not observable in the real world, they are called latent variables.

Note: This is just our own model of the world, and the latent variables merely exist in our own imagination! It is quite likely that the z’s and the above-mentioned process of generating data samples do not exist in the real world.

We can mathematically specify our model with the joint distribution of x and z, i.e. P(x, z). This is a sufficient representation of our model, and one can compute P(x) and P(z) from this joint distribution. We usually like to parameterize distributions with some parameters θ, where most of the time θ are the parameters of a neural network. Therefore, we can make the dependence on the parameters explicit in our notation by writing the distribution as P(x, z ; θ). Now, as in most machine learning problems, our objective is to learn the parameters of the model. The most common way of doing so is to maximize the log-likelihood of the observed data (i.e. the x’s) under our model. To accomplish this, we need to be able to compute the marginal likelihood P(x ; θ), and for that we have to marginalize the latent variables out by summing/integrating over all their possible values. But that is usually prohibitively expensive, as there are far too many possible z’s! Hmmm… if it is not possible to compute a simple likelihood, then how can we do maximum likelihood training?

It turns out there is one possible answer to this question:

1) Introduce an auxiliary distribution, such as q(z|x ; ϕ), which we can easily compute and work with. This distribution serves as an approximation to the intractable posterior p(z|x ; θ) and is thus called the approximate posterior or variational distribution.

2) Instead of maximizing the log-likelihood directly, we maximize a lower bound of it, called the ELBO, which is given by the following formula:

F(x) = E_{q(z|x ; ϕ)} [ log q(z|x ; ϕ) − log p(x, z ; θ) ]

Eq. 15 of [1]

The ELBO is -F(x) in the above equation. It can be shown that the ELBO is a lower bound on the log-likelihood, and that the gap between them is exactly the KL divergence between q(z|x ; ϕ) and the true posterior p(z|x ; θ); so maximizing the ELBO pushes the log-likelihood up while also tightening the bound (equivalently, minimizing the negative ELBO is a tractable surrogate for minimizing the negative log-likelihood). Our new objective is therefore this:

Learn the parameters of the model, i.e. θ, and the parameters of the variational distribution, i.e. ϕ, jointly by maximizing the ELBO.
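To make the claim above concrete, here is the standard identity relating the log-likelihood, the ELBO, and the true posterior (written in the same notation as above):

log p(x ; θ) = E_{q(z|x ; ϕ)} [ log p(x, z ; θ) − log q(z|x ; ϕ) ] + KL( q(z|x ; ϕ) ‖ p(z|x ; θ) )
             = −F(x) + KL( q(z|x ; ϕ) ‖ p(z|x ; θ) ),   with KL(·‖·) ≥ 0.

Since the KL term is non-negative and the left-hand side does not depend on ϕ, maximizing the ELBO over ϕ is equivalent to minimizing the KL divergence to the true posterior, while maximizing it over θ pushes the log-likelihood up.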

In other words, the variational parameters enable us to learn the model’s parameters too. One elegant way of doing this is using variational autoencoders (VAEs). However, if you are familiar with VAEs, you know that they suffer from one drawback: the variational family is in most cases too simple and not expressive enough (usually a diagonal-covariance Gaussian) to approximate arbitrarily complex posteriors. This is where normalizing flows come in and help us overcome this problem.

What is a Normalizing Flow?

Normalizing flows are models that start from a simple distribution and approximate a complex one. They do this by transforming the initial distribution multiple times with a series of functions until the resulting distribution is complex enough. To transform a distribution we can use an invertible function f. From the change-of-variables formula, the pdf of the transformed variable z′ = f(z) can be computed as follows (Eq. 5 of [1]):

q(z′) = q(z) |det ∂f⁻¹/∂z′| = q(z) |det ∂f/∂z|⁻¹

Now, if we stack k such invertible transforms sequentially, the density of the last variable can be derived as follows (Eq. 6 and 7 of [1]):

zₖ = fₖ ∘ … ∘ f₂ ∘ f₁(z₀)

ln qₖ(zₖ) = ln q₀(z₀) − Σᵢ ln |det ∂fᵢ/∂zᵢ₋₁|,  where the sum runs over i = 1, …, k

Eq. 6 and 7 of [1]

We will see the log probability of this last variable in the loss we define for our generative model later.

One caveat here is that these transformations must be efficient and easy to compute, especially given that there is a determinant term in the density formula (determinants are usually expensive to compute, O(D³) in general). Fortunately, [1] proposes two families of transformations whose determinants can be computed easily, in time linear in the dimension. They are also powerful enough that we can start from a simple distribution and create a very complex distribution using just these two transforms.

  1. Planar flow: The formula for this transform is as follows:

f(z) = z + u h(wᵀz + b)

where u, w ∈ ℝᴰ and b ∈ ℝ are the parameters of the layer and h is a smooth element-wise non-linearity (tanh in the paper). Its log-determinant is ln |1 + uᵀψ(z)|, with ψ(z) = h′(wᵀz + b) w, which can be computed in O(D) time.

This transform takes a D-dimensional vector and expands/contracts it in the direction perpendicular to the hyperplane specified by the weight w and bias b. Planar flow transforms can be implemented in PyTorch as follows:
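(A minimal sketch of such a layer; the class name PlanarFlow, the small random initialization, and the 1e-8 numerical stabilizer are my own choices, and the invertibility constraint on u from the paper’s appendix is omitted for brevity, so the snippet in the linked repo may differ in the details.)

import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """Planar flow layer: f(z) = z + u * tanh(w^T z + b)."""

    def __init__(self, dim):
        super().__init__()
        # Small random initialization keeps the layer close to the identity at the start.
        self.u = nn.Parameter(torch.randn(1, dim) * 0.01)
        self.w = nn.Parameter(torch.randn(1, dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        # z has shape (batch_size, dim)
        lin = z @ self.w.t() + self.b              # w^T z + b, shape (batch_size, 1)
        f_z = z + self.u * torch.tanh(lin)         # transformed samples
        # psi(z) = h'(w^T z + b) * w, with h = tanh and h'(a) = 1 - tanh(a)^2
        psi = (1 - torch.tanh(lin) ** 2) * self.w  # shape (batch_size, dim)
        # log |det df/dz| = log |1 + u^T psi(z)|
        log_det = torch.log(torch.abs(1 + psi @ self.u.t()) + 1e-8)
        return f_z, log_det.squeeze(-1)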

To better understand how this layer performs, I have visualized the input and output of a simple flow layer in the 2D space with ten different values for |u|. Here is the code for the visualization:
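(A sketch of how such a visualization could be produced, reusing the PlanarFlow sketch above; the grid layout, the chosen range of |u|, and the fixed w = (1, 1), b = 0 are illustrative choices.)

import matplotlib.pyplot as plt

torch.manual_seed(0)
z0 = torch.randn(100, 2)                           # samples from the initial 2-D standard Gaussian

fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for ax, scale in zip(axes.flat, torch.linspace(0.5, 5.0, 10)):
    flow = PlanarFlow(dim=2)
    with torch.no_grad():
        flow.w.copy_(torch.tensor([[1.0, 1.0]]))   # the hyperplane w.x + b = 0 is the line x + y = 0
        flow.b.zero_()
        flow.u.copy_(scale * torch.tensor([[1.0, 1.0]]))
        zk, _ = flow(z0)
    a, b = z0.numpy(), zk.numpy()
    ax.scatter(a[:, 0], a[:, 1], c="blue", s=5)    # initial points
    ax.scatter(b[:, 0], b[:, 1], c="red", s=5)     # transformed points
    for p, q in zip(a, b):                         # connect each point to its image
        ax.plot([p[0], q[0]], [p[1], q[1]], c="gray", linewidth=0.3)
    ax.set_title(f"|u| = {scale.item() * 2 ** 0.5:.1f}")
plt.tight_layout()
plt.show()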

The result is shown in the following figure, where the blue points represent the initial distribution and the red points are their transformed versions. Also, each point is connected to its transformed counterpart with a solid line. We can easily see that the lines are all parallel to each other and perpendicular to the hyperplane w·x + b = 0, which here is simply the line x + y = 0.

Visualization of Planar-flow outputs for different vectors u

2. Radial flow: The second family of transforms is called radial flow and can be described with the following formula:

f(z) = z + β h(α, r) (z − z₀),  where r = ‖z − z₀‖ and h(α, r) = 1 / (α + r)

Here z₀ is the center of the transform, and α > 0 and β are scalar parameters.

This transform expands or contracts the initial distribution around a single point in the space and can be implemented in PyTorch as follows:
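(Again a minimal sketch; the class name RadialFlow, the parameterization α = exp(log_alpha), and the 1e-8 stabilizer are my own choices, and the β ≥ −α invertibility constraint from the paper is omitted.)

class RadialFlow(nn.Module):
    """Radial flow layer: f(z) = z + beta * h(alpha, r) * (z - z0), with r = ||z - z0||."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.z0 = nn.Parameter(torch.zeros(1, dim))    # center of the transform
        self.log_alpha = nn.Parameter(torch.zeros(1))  # alpha = exp(log_alpha) > 0
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        alpha = torch.exp(self.log_alpha)
        diff = z - self.z0                             # shape (batch_size, dim)
        r = diff.norm(dim=1, keepdim=True)             # shape (batch_size, 1)
        h = 1.0 / (alpha + r)                          # h(alpha, r)
        f_z = z + self.beta * h * diff
        # log |det df/dz| = (D - 1) * log|1 + beta*h| + log|1 + beta*h + beta*h'*r|,
        # with h'(alpha, r) = -1 / (alpha + r)^2
        h_prime = -1.0 / (alpha + r) ** 2
        log_det = (self.dim - 1) * torch.log(torch.abs(1 + self.beta * h) + 1e-8) \
                  + torch.log(torch.abs(1 + self.beta * h + self.beta * h_prime * r) + 1e-8)
        return f_z, log_det.squeeze(-1)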

Again, we can plot a visualization of this transform in 2-D space with different values of β and the origin (0, 0) as the center of the transform. As in the previous visualization, blue points represent the initial distribution and red points represent the points after applying the transform (note that the lines all meet at the origin, which is the center of the transform):

Visualization of radial-flow outputs with different values of β

We can stack a sequence of these two kinds of layers to transform a simple initial distribution (e.g. a multivariate Gaussian) into a complex one (e.g. a multimodal distribution).

Implementation of Flow-based Generative Models

To implement a generative model based on normalizing flows, we can use the following architecture proposed by the paper (Fig. 2 of [1]):

Architecture of the flow-based generative model (Fig. 2 of [1])

This model consists of the following three modules and we will implement them one by one in PyTorch.

  1. Encoder: First, there is an encoder which takes the observed input x and outputs the mean μ and log-std log(σ) of the first variable in the flow of random variables, i.e. Z₀. It is worth noting that this encoder is similar to the encoder of a variational autoencoder, and it can be implemented as a small fully connected network:
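(A minimal sketch; the layer sizes input_dim=784, hidden_dim=256 and latent_dim=40 are illustrative choices, not values taken from the paper or the repo.)

class Encoder(nn.Module):
    """Maps a flattened image x to the mean and log-std of the initial distribution q0(z0|x)."""

    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=40):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.log_sigma = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_sigma(h)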

2. Flow-model: After the encoder, there is a stack of flow layers which transforms the samples from the first distribution (i.e. Z₀ drawn from q₀) into samples Zₖ from the complex distribution qₖ. Note that Zₖ is the main latent variable that generates the data, and we have used the encoder plus the flow layers only to infer Zₖ. Therefore, as specified in the figure of the architecture, we can call these two modules together the inference network. Following is the implementation of the flow-model:
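(A sketch under the same assumptions as above; I stack only planar layers here for simplicity, although radial layers, or a mix of the two, would work in exactly the same way.)

from torch.distributions import Normal

class FlowModel(nn.Module):
    """Reparameterizes (mu, log_sigma) into z0 and pushes it through a stack of flow layers."""

    def __init__(self, latent_dim=40, n_flows=10):
        super().__init__()
        self.flows = nn.ModuleList([PlanarFlow(latent_dim) for _ in range(n_flows)])

    def forward(self, mu, log_sigma):
        sigma = torch.exp(log_sigma)
        eps = torch.randn_like(sigma)                        # reparameterization trick
        z = mu + sigma * eps                                 # z0 ~ q0 = N(mu, sigma^2)
        log_q0 = Normal(mu, sigma).log_prob(z).sum(dim=1)    # log q0(z0)

        sum_log_det = torch.zeros(z.shape[0], device=z.device)
        for flow in self.flows:                              # z0 -> z1 -> ... -> zK
            z, log_det = flow(z)
            sum_log_det = sum_log_det + log_det              # accumulate the log-determinants

        log_p_zk = Normal(0.0, 1.0).log_prob(z).sum(dim=1)   # log p(zK) under the standard Gaussian prior
        return z, log_q0, log_p_zk, sum_log_det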

As seen in the above code, we have utilized the reparameterization trick in the implementation of the flow-model: we move the stochasticity into a sample ε from a standard Gaussian distribution and then use μ + σ*ε as the initial sample Z₀. With this trick, the gradient can flow back to the encoder network so that it gets trained too.

Also, we accumulate the log-determinants computed by each layer in the log_det variable. The model outputs this variable, which will later be used in computing the loss. The same holds for the log probabilities of Z₀ and Zₖ, which also appear in the loss we calculate.

3. Decoder: Finally, the decoder takes the latent variable Zₖ and models P(x|zₖ) (or rather its unnormalized version, sometimes called the logits). Again, it is similar to the decoder of a variational autoencoder and can be implemented as a fully connected network:
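(A minimal sketch with the same illustrative layer sizes as the encoder above.)

class Decoder(nn.Module):
    """Maps z_K to per-pixel Bernoulli logits of p(x|z_K)."""

    def __init__(self, latent_dim=40, hidden_dim=256, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),   # unnormalized logits; the sigmoid is applied in the loss
        )

    def forward(self, z):
        return self.net(z)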

Now, let’s specify what θ and ϕ are here. θ denotes the parameters of our model of the real world, which consist of the parameters of the prior p(z) and the parameters of P(x|zₖ). Our prior is a standard multivariate Gaussian and has no parameters, so the only parameters of our model are those of P(x|zₖ), i.e. the decoder. ϕ denotes the variational parameters, which consist of all the parameters that help us approximate the true posterior from the data x: the parameters of the encoder and of the flow-model.

Loss

One of the most important parts of training any machine learning model is the loss function. As stated in the paper, we will optimize the negative ELBO (i.e. the free energy F(x)) as our objective, which is computed via the following formula (Eq. 15 of [1]):

F(x) = E_{q₀(z₀)}[ ln q₀(z₀) ] − E_{q₀(z₀)}[ log p(x, zₖ) ] − E_{q₀(z₀)}[ Σᵢ ln |det ∂fᵢ/∂zᵢ₋₁| ]

The term inside the first expectation, ln q₀(z₀), is the log probability of the initial variational samples, which was one of the outputs of the flow-model. The term in the second expectation, log p(x, zₖ), can also be written as log p(x|zₖ) + log p(zₖ). In this equivalent expression, the first term is the Bernoulli log-likelihood of x under the decoder’s output (we normalize the decoder’s logits with the sigmoid function, so in practice this is a binary cross-entropy), and the second term is the log probability of zₖ under the prior, which was another output of the flow-model. Finally, the term inside the third expectation is the log_det output of the flow-model. Note that we can estimate the expectations stochastically by averaging over a mini-batch. Here is the computation of the final loss in PyTorch (D is the dimension of the random variables):
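(A sketch of such a loss written as a standalone function; the name free_energy_loss and the exact signature are my own choices, and the inputs are assumed to be the outputs of the modules sketched above.)

import torch.nn.functional as F

def free_energy_loss(x, logits, log_q0, log_p_zk, sum_log_det):
    """Negative ELBO of Eq. 15, averaged over the mini-batch.

    x:            binarized images, shape (batch, D)
    logits:       decoder outputs, shape (batch, D)
    log_q0:       log q0(z0) of the initial samples, shape (batch,)
    log_p_zk:     log p(zK) under the prior, shape (batch,)
    sum_log_det:  accumulated log-determinants of the flow layers, shape (batch,)
    """
    # log p(x | zK): Bernoulli log-likelihood, summed over the D pixels
    log_p_x_given_z = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=1)
    # F(x) = E[ln q0(z0)] - E[log p(x, zK)] - E[sum_i log|det df_i/dz|]
    return (log_q0 - (log_p_x_given_z + log_p_zk) - sum_log_det).mean()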

Train the model on Binarized MNIST Dataset

Finally, we can train a generative model with the objective defined above on the binarized MNIST dataset. Here are some samples from this dataset:

Binarized MNIST dataset

The only thing we have to do is to define a model, load the data in mini-batches, define an optimizer (for which we use Adam), and write a training loop:
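(A sketch of how everything could be wired together, under the same illustrative names and hyperparameters as above; the binarization threshold of 0.5, the batch size, the learning rate, and the number of epochs are all my own choices.)

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

latent_dim = 40
encoder = Encoder(latent_dim=latent_dim)
flow_model = FlowModel(latent_dim=latent_dim, n_flows=10)
decoder = Decoder(latent_dim=latent_dim)

params = list(encoder.parameters()) + list(flow_model.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

# Binarize MNIST on the fly by thresholding the pixel intensities at 0.5.
transform = transforms.Compose([transforms.ToTensor(), lambda t: (t > 0.5).float()])
train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transform),
    batch_size=128, shuffle=True,
)

for epoch in range(20):
    for x, _ in train_loader:
        x = x.view(x.size(0), -1)                              # flatten each image to (batch, 784)
        mu, log_sigma = encoder(x)
        z_k, log_q0, log_p_zk, sum_log_det = flow_model(mu, log_sigma)
        logits = decoder(z_k)
        loss = free_energy_loss(x, logits, log_q0, log_p_zk, sum_log_det)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.2f}")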

Done! That was the implementation of a simple flow-based generative model in PyTorch, from start to finish!

Acknowledgements

[1] Rezende and Mohamed 2015, Variational Inference with Normalizing Flows.

[2] https://github.com/tonyduan/normalizing-flows

[3] https://github.com/karpathy/pytorch-normalizing-flows
