
Variational Inference – Where Old Physics Solves New Bayesian Problems

A Novel Analytical Approach


https://unsplash.com/photos/fBMpqsBYc3A

What is VI?

Variational Inference (VI) is a method for solving the most common Bayesian problem: given observed data, find the probability functions that govern its generation. While the problem and its solution appear common today, when VI first appeared it was considered extremely innovative, since it was the very first analytical solution to the Bayesian problem, in contrast to the traditional sampling methods. In this post I will cover the early appearance of VI, describe its mathematical foundation, and present this framework's links to earlier science.

The problem

We begin by describing, in a coherent way, the problem that VI aims to resolve. We are dealing with statistics, so it is reasonable to expect a data sample. Let us therefore take a sample:

x¹, x², x³, …, x¹⁰⁰, … which is our observed data. The problem is generic, namely the sample can be any type of data: images, voice utterances, numbers, bits, or vectors.

It is pretty straightforward to assume that this sample has been generated by a machine, a stochastic one. As researchers, we wish to reveal the statistical behavior of this machine. Mathematically, we assume that our observed data X is generated by hidden variables Z¹, Z², … that govern the generation of the sample. We therefore wish to find the distribution of these Z given the data, namely to find P(Z|X).
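As a toy illustration of such a stochastic machine (everything here, from the standard-normal prior to the unit-variance likelihood and the sample size, is an assumption made for the example, not something specified in the post), a single hidden variable Z can generate the whole observed sample X:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "stochastic machine": a hidden variable Z generates the observed sample X.
# The specific choices (standard-normal prior, unit-variance likelihood, n=100)
# are illustrative assumptions.
n = 100
z = rng.normal(loc=0.0, scale=1.0)           # hidden variable Z ~ N(0, 1)
x = rng.normal(loc=z, scale=1.0, size=n)     # observed data X | Z ~ N(Z, 1)

# As researchers we only see x; the inference task is to recover P(Z | X).
print(f"hidden z = {z:.3f}, sample mean of x = {x.mean():.3f}")
```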

Following Bayes formula we have:

$$P(Z \mid X) = \frac{P(X \mid Z)\,P(Z)}{P(X)}$$

The LHS is called the posterior distribution. This is the function we wish to find given the observed data. Can we use the RHS?

The numerator can be written as the product of our prior knowledge and the obtained likelihood, thus it is known. Our problem therefore focuses on the denominator. In some cases it is easily tractable. However, consider data drawn from a GMM of K Gaussians. Our hidden variables are the Gaussians' parameters, their weights, and the assigned clusters. The probability in the denominator will be the following:

$$P(X) = \int\!\!\int\!\!\int P(\pi)\,P(\mu)\,P(\sigma)\,\prod_{i=1}^{n}\sum_{k=1}^{K}\pi_k\,\mathcal{N}\!\left(x_i \mid \mu_k, \sigma_k^2\right)\,d\pi\,d\mu\,d\sigma$$

Pretty "elegant", isn't it? Such integrals are commonly intractable, and one needs a method to overcome this. There is, however, good news as well: P(X) is independent of Z. We can therefore look at Bayes' formula differently:

$$P(Z \mid X) \propto P(X \mid Z)\,P(Z)$$

The intractable term is not an obstacle anymore. We still need to cleverly handle the RHS.

Until 1999, the common approach for solving such problems was to use sampling methods such as Metropolis-Hastings, Gibbs sampling, and HMC.

These algorithms are bias-free but suffer from the usual "diseases" of sampling: they come with huge variance.
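To make the sampling alternative concrete, here is a minimal random-walk Metropolis-Hastings sketch for the toy model above; it only ever evaluates the unnormalized posterior P(X|Z)P(Z), so the intractable P(X) never appears. The proposal width, iteration count, and burn-in are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model from before: Z ~ N(0, 1), X | Z ~ N(Z, 1); x is the observed sample.
x = rng.normal(loc=0.5, scale=1.0, size=100)

def log_unnormalized_posterior(z):
    # log P(X | Z) + log P(Z), dropping constants; log P(X) is never required.
    return -0.5 * np.sum((x - z) ** 2) - 0.5 * z ** 2

# Random-walk Metropolis-Hastings: propose, then accept with probability min(1, ratio).
z_current, samples = 0.0, []
for _ in range(5000):
    z_proposal = z_current + rng.normal(scale=0.3)   # symmetric proposal
    log_ratio = (log_unnormalized_posterior(z_proposal)
                 - log_unnormalized_posterior(z_current))
    if np.log(rng.uniform()) < log_ratio:
        z_current = z_proposal
    samples.append(z_current)

print("posterior mean estimate:", np.mean(samples[1000:]))   # drop burn-in
```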

In 1999, the famous Michael Jordan retired. In the same year, a less famous Michael Jordan published a [paper](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) in which he offered a novel approach for solving Bayesian problems. The idea was to develop a solution that relies on analytical methods. By deciding in advance on the family of distributions (a step that adds bias to the process), we can construct a solution that is resilient to the obstacles of sampling. In 2003 this approach got a massive boost when Blei published his paper on topic extraction and used this method for the LDA process. He claimed that the optimal way to work is "using VI while waiting for Gibbs to converge".

In the next sections I will give a brief description of the mathematical steps that led to the development of this method.

Finding an Optimal Function

In order to provide an analytical approximation for the posterior distribution we need two tools:

  • A metric that measures distance between two distribution functions
  • An analytical tool for finding extrema in the space of these functions.

The first item is pretty obvious: we will use the KL divergence. Finding extrema in a space of functions requires tools from the calculus of variations. This domain handles problems such as finding the shortest path between two points through their analytical representations. I believe the readers can guess that this is the reason for the term "Variational Inference". A fundamental tool in this domain is the Euler-Lagrange equation.

$$J[u] = \int_{x_1}^{x_2} F\left(x, u(x), u'(x)\right)\,dx$$

For F a real-valued function, x the independent variable, and u a differentiable function of x, we can find the u that minimizes J as follows:

$$\frac{\partial F}{\partial u} - \frac{d}{dx}\frac{\partial F}{\partial u'} = 0$$
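As a quick illustration, take the shortest-path problem mentioned above: the length of a curve u(x) between two points is the functional with F = √(1 + u′²), and since F does not depend on u itself, the Euler-Lagrange equation reduces to

$$J[u] = \int_{x_1}^{x_2} \sqrt{1 + u'(x)^2}\,dx, \qquad
-\frac{d}{dx}\left(\frac{u'}{\sqrt{1 + u'^2}}\right) = 0 \;\Rightarrow\; u'(x) = \mathrm{const},$$

so the minimizer is a straight line.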

Now that we have this tool, we can start approximating the posterior function.

Since we wish to approximate P(Z|X), we search for a function Q(Z) that minimizes its KL divergence from the real posterior. Clearly, when we say "the Q that minimizes", we mean that we optimize the parameters of Q; hence we have to define which family of distributions we will use. In most applied tasks we use exponential-family distributions for Q.

We can write the KL divergence between P(Z|X) and Q(Z) as follows:

$$KL\left(Q(Z)\,\|\,P(Z \mid X)\right) = \log P(X) - \left(\mathbb{E}_{Q}\left[\log P(X, Z)\right] - \mathbb{E}_{Q}\left[\log Q(Z)\right]\right)$$

where the second term on the RHS (the expression in parentheses) is denoted by:

$$\mathrm{ELBO}(Q) = \mathbb{E}_{Q}\left[\log P(X, Z)\right] - \mathbb{E}_{Q}\left[\log Q(Z)\right]$$

ELBO stands for Evidence Lower Bound: minimizing the KL divergence is equivalent to maximizing the ELBO, since log P(X) does not depend on Q.

The physicists among the readers may find this formula similar to the Helmholtz free energy.
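Before moving to optimization, the identity log P(X) = ELBO + KL can be checked numerically on a toy conjugate model; the model, the chosen Q, and the Monte Carlo sample size below are all illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(2)

# Toy conjugate model (an assumption for the example): Z ~ N(0, 1), x_i | Z ~ N(Z, 1).
n = 20
x = rng.normal(loc=0.7, scale=1.0, size=n)

# Exact quantities available in this conjugate case.
post_var = 1.0 / (n + 1)                     # posterior variance of Z | x
post_mean = post_var * x.sum()               # posterior mean of Z | x
log_evidence = multivariate_normal(mean=np.zeros(n),
                                   cov=np.eye(n) + np.ones((n, n))).logpdf(x)

# An arbitrary variational distribution Q(Z) = N(m, s^2).
m, s = 0.3, 0.5

# Monte Carlo estimate of the ELBO = E_Q[log P(X, Z) - log Q(Z)].
z = rng.normal(m, s, size=100_000)
log_joint = norm(z[:, None], 1.0).logpdf(x).sum(axis=1) + norm(0, 1).logpdf(z)
elbo = np.mean(log_joint - norm(m, s).logpdf(z))

# Analytic KL(Q || posterior) between two univariate Gaussians.
kl = (np.log(np.sqrt(post_var) / s)
      + (s**2 + (m - post_mean)**2) / (2 * post_var) - 0.5)

# log P(X) = ELBO + KL should hold up to Monte Carlo noise.
print(log_evidence, elbo + kl)
```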

Techniques for Solving the ELBO

In order to maximize the ELBO we need some assumptions and numerical tools.

Mean Field Theory

The idea of MFT was borrowed from the Ising model for magnetic spins. This model is fairly involved and beyond the scope of this post. However, the general idea is that we have an observed sequence of random variables X¹, X², … that take the values 1 and -1. We aim to estimate a function that depends on these observed variables; it consists of linear terms and pairwise product terms. It can be shown that if we assume the random variables are uncorrelated, we can estimate the function using their means.

We can migrate this idea to VI: rather than considering uncorrelated pairs, we will assume independence. For example, consider data drawn from a Gaussian; our Z is the pair (mean, variance). We will assume that these parameters are independent, thus Q can be written as follows:

$$Q(Z) = Q(\mu)\,Q(\sigma^2)$$

The idea of writing Q as a product of independent functions simplifies the ELBO optimization process.
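More generally, the mean-field assumption factorizes Q over groups of hidden variables, and a standard result of the variational calculus above is that the ELBO is maximized one factor at a time by averaging the log joint over all the other factors:

$$Q(Z) = \prod_{j} Q_j(Z_j), \qquad \log Q_j^{*}(Z_j) = \mathbb{E}_{i \neq j}\left[\log P(X, Z)\right] + \mathrm{const}.$$

This is exactly the update that CAVI, described next, iterates over.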

Numerical Recipes

There are two common methods for optimizing the ELBO:

Coordinate Ascent VI (CAVI) – This algorithm is described in detail [here](https://brunomaga.github.io/Variational-Inference-GMM) and here. The idea is to update, at each step, a single factor j while conditioning on all the Z's that are not j (namely, at each step we update only one Z given the others). The solution at each step is proportional to the exponent of the average, taken over the other factors, of the log of the P term (the joint probability in the ELBO), as in the mean-field formula above; a concrete sketch follows below.
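Here is a minimal CAVI sketch for the Gaussian example above (unknown mean and precision under a Normal-Gamma prior), following the classic textbook treatment of this model; the data, hyperparameters, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data, assumed drawn from N(2.0, 1.5^2).
x = rng.normal(loc=2.0, scale=1.5, size=500)
n, x_bar = x.size, x.mean()

# Assumed Normal-Gamma prior: mu | tau ~ N(mu0, (lambda0*tau)^-1), tau ~ Gamma(a0, rate=b0).
mu0, lambda0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Variational factors: q(mu) = N(mu_N, 1/lambda_N), q(tau) = Gamma(a_N, rate=b_N).
E_tau = a0 / b0                    # initial guess for E[tau]
a_N = a0 + (n + 1) / 2             # this update does not depend on the other factor

for _ in range(50):                # CAVI: alternate the two coordinate updates
    # Update q(mu) given the current expectation of tau.
    lambda_N = (lambda0 + n) * E_tau
    mu_N = (lambda0 * mu0 + n * x_bar) / (lambda0 + n)

    # Update q(tau) given the current moments of mu.
    E_sq_dev_data = np.sum((x - mu_N) ** 2) + n / lambda_N    # E_q(mu)[sum (x_i - mu)^2]
    E_sq_dev_prior = (mu_N - mu0) ** 2 + 1.0 / lambda_N       # E_q(mu)[(mu - mu0)^2]
    b_N = b0 + 0.5 * (E_sq_dev_data + lambda0 * E_sq_dev_prior)
    E_tau = a_N / b_N

print("posterior mean estimate:", mu_N)
print("posterior precision estimate:", E_tau, "(true precision:", 1 / 1.5**2, ")")
```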

Stochastic VI – This algorithm uses stochastic optimization (Robbins and Monro) to solve VI. It allows mini-batch processing, which does not exist in CAVI, thus making it more practical, and it uses derivatives of the ELBO with respect to the variational parameters.
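Schematically (λ here denotes the variational parameters and ρ_t the step size; the notation is mine, not the post's), SVI takes noisy gradient steps computed on mini-batches, with step sizes satisfying the Robbins-Monro conditions:

$$\lambda_{t+1} = \lambda_t + \rho_t\,\widehat{\nabla_{\lambda}\,\mathrm{ELBO}}(\lambda_t),
\qquad \sum_{t}\rho_t = \infty,\quad \sum_{t}\rho_t^{2} < \infty,$$

where the gradient estimate is computed on a random mini-batch of the data.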

Summary

We presented the mathematical foundations of VI. There are some applications that already use this algorithm; the most famous are:

  • GMM
  • Topic Modeling

Of course, it is an essential part of the development of the VAE.

Readers who are interested in hands-on projects can look into Edward, Pyro, and PyMC3.

Regarding DL, there are already examples in which DL applications are used to solve VI problems (e.g., finding the VI parameters from the data using a NN). However, I believe (and this is my main motivation for writing this post) that the combination of Bayesian inference in general and DL may offer a huge amount of problems, opportunities, and mathematical riddles. I think that problems with less common distributions, in which high-order cumulants are trained by DL, can be extremely intriguing.

