Understanding the Mathematical Concepts behind Langevin Dynamics

In this post I aim to summarize a fairly "old" paper by Max Welling and Yee Whye Teh. It presents the concept of Stochastic Gradient Langevin Dynamics (SGLD), a method that is used increasingly nowadays.
My motivation is to present the mathematical concepts that pushed SGLD forward. If you find the rigor tedious, you can simply take the methodology and skip the rigorous parts.
Zeitgeist
As I mentioned, the paper was written about 10 years ago. At that time the world of ML was both similar to and different from today's:
- Similar – the massive amounts of data provided by the web and modern computation boosted ML significantly, and many researchers became enthusiastic about it
- Different – DL was not yet common
ML research topics were therefore slightly different:
Researchers considered neural networks a promising tool, but still just another tool alongside decision trees or Naïve Bayes. Bayesian methods (this is essential for understanding the authors' motivation) were a mandatory part of a great deal of academic training.
For a wider view of ML research in the first decade of the 21st century, I recommend books such as Duda or Bishop.
As an anecdote, the lead author, Max Welling, was a Bayesian researcher in academia and wrote papers on this topic. But DL came to town, and in 2013 he co-authored the paper Auto-Encoding Variational Bayes, which became the foundation of the VAE. It illustrates how DL annexed everyone.
Mathematical Tools
To understand this paper well, one has to become familiar with some mathematical tools:
- Bayesian methods
- Metropolis Hastings
- Stochastic Optimization
- Langevin Equation
I will briefly describe these tools below.
Bayesian methods
Clearly this is a wide topic that appears in many domains. However, the Bayesian problem is easy to describe. We have a set of observations (images, numbers or categorical outcomes of an experiment), denoted X. We assume that the observations are generated by a distribution P whose shape (e.g. Gaussian or Uniform) is known, but whose hyperparameters H are not.
Our objective is therefore to find H. In more formal terminology, we have Bayes' formula:

P(H|X) = P(X|H) P(H) / P(X)

Since

P(X) = ∫ P(X|H) P(H) dH

We have

P(H|X) = P(X|H) P(H) / ∫ P(X|H) P(H) dH

As you can see, the denominator is independent of H. In addition, P(X) is often intractable; consider, e.g., a GMM when you wish to estimate the probability of a single observation value.
We can therefore re-write:

P(H|X) ∝ P(X|H) P(H)
- P(H) – Prior distribution: the distribution we believe the hyperparameters follow
- P(X|H) – Likelihood: the probability of the observations given a set of hyperparameter values
- P(H|X) – Posterior distribution
The Bayesian problem is therefore about finding the posterior distribution, where we assume:
- Posterior has a certain shape
- Hyper parameters are drawn from a given Prior
Examples
- When we study Gaussian distributions, we often assume that the mean itself is drawn from a Gaussian and the standard deviation from a certain Gamma
- For a binary sequence, we assume a Binomial distribution whose parameter p is drawn from a Beta distribution.
The solution that maximizes the posterior distribution is called the Maximum A-Posteriori (MAP) estimate.
For readers familiar with MLE solutions, MAP differs from MLE only in the prior term.
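To make the MLE/MAP distinction concrete, here is a minimal sketch of the coin-flip (Beta-Binomial) example from the list above; the counts and the Beta(a, b) hyperparameters are invented for illustration:

```python
# MLE vs. MAP for a Binomial likelihood with a Beta(a, b) prior on p.
# The numbers are illustrative, not from the paper.
k, n = 7, 10          # observed successes, total trials
a, b = 2.0, 2.0       # Beta prior hyperparameters

p_mle = k / n                          # maximizes the likelihood alone
p_map = (k + a - 1) / (n + a + b - 2)  # maximizes likelihood * prior (mode of the Beta posterior)
print(p_mle, p_map)                    # 0.7 vs ~0.667 -- the prior pulls the estimate toward 0.5
```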
Metropolis Hastings
I once wrote a post on this method (the right to brag); clearly there are better sources on the web. It is an MCMC sampling method that is often used for finding posterior distributions, since it relies on the Markov chain convergence property to obtain samples from the required distribution.
- In every step we draw a sample from a proposal distribution whose parameters typically depend on the previous sample. This distribution is often "simple", such as a Gaussian.
- The accept-reject step – we compare the new sample's score with its predecessor's; if their ratio exceeds a uniform random draw we accept the new sample, otherwise we keep the previous one.
This is a bias-free algorithm, but one with huge variance (like many other sampling algorithms). Its main disadvantage is that one must iterate over the entire dataset many times; in particular, tricks such as mini-batches cannot be used.
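As an illustration of these two steps, here is a minimal random-walk Metropolis-Hastings sketch with a Gaussian proposal; the target density (a standard normal) and the proposal width are arbitrary choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    """Unnormalized log-density of the target (here a standard normal)."""
    return -0.5 * x ** 2

x = 0.0
samples = []
for _ in range(10000):
    x_new = x + rng.normal(0.0, 1.0)         # Gaussian proposal centered at the previous sample
    log_ratio = log_target(x_new) - log_target(x)
    if np.log(rng.random()) < log_ratio:     # accept with probability min(1, ratio)
        x = x_new
    samples.append(x)                        # on rejection we keep the previous sample
```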
Stochastic Optimization
An algorithm published in the early 50s by Robbins & Monro that offers a solution to the following problem:
Let f be a monotone function and let b be a real number, and suppose there is a solution θ*:

f(θ*) = b

We cannot measure the values of f directly, but we can measure a noisy function g such that

E[g(θ)] = f(θ)

The following iteration converges to the required solution:

θ_{t+1} = θ_t + ε_t (b − g(θ_t))

under the conditions:

Σ_t ε_t = ∞,  Σ_t ε_t² < ∞

Connecting this with the subsection on Bayesian methods, we can guess that stochastic optimization can be a good method for finding the MAP solution.
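A minimal sketch of the Robbins-Monro iteration; the function f, the noise model and the target value b are invented just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta):
    """The unknown monotone function (known here only so we can check the answer)."""
    return 2.0 * theta + 1.0

def g(theta):
    """Noisy measurement with E[g(theta)] = f(theta)."""
    return f(theta) + rng.normal(0.0, 1.0)

b = 5.0          # we look for theta* with f(theta*) = b, i.e. theta* = 2
theta = 0.0
for t in range(1, 10001):
    eps = 1.0 / t                      # satisfies sum(eps) = inf, sum(eps^2) < inf
    theta = theta + eps * (b - g(theta))
print(theta)                           # approaches 2.0
```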
Langevin Equation
The Langevin equation describes the forces acting on pollen grains; it is an evolution of the study that began with Brown's work in 1827.
It can be written as:

m a = −λ v + F + ε

Where
m – mass
a – acceleration
v – velocity
F – force
λ – friction coefficient
ε – stochastic term with zero mean and variance proportional to time
Since the mass is very small, the LHS can be neglected. Recall also that velocity is the time derivative of position and that force is the (negative) gradient of a potential:

v = dx/dt,  F = −∇U(x)

The Langevin equation then becomes:

λ dx/dt = −∇U(x) + ε

Now let's discretize the time derivative (absorbing λ and the noise scale into a step size ε_t):

x_{t+1} = x_t − (ε_t/2) ∇U(x_t) + η_t,  η_t ~ N(0, ε_t)

If we ignore the last term, we obtain something that every data scientist has met before: plain gradient descent.
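Written as code, the discretized Langevin update is literally gradient descent plus Gaussian noise. A minimal sketch, assuming a quadratic potential U(x) = x²/2 (so the stationary density is a standard normal):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_U(x):
    """Gradient of the potential U(x) = x**2 / 2."""
    return x

eps = 0.01       # step size
x = 0.0
samples = []
for _ in range(50000):
    # without the noise term this is plain gradient descent on U
    x = x - 0.5 * eps * grad_U(x) + rng.normal(0.0, np.sqrt(eps))
    samples.append(x)
# a histogram of `samples` approximates exp(-U(x)), i.e. the standard normal
```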
Is it interesting?
Yes. We know from physics that the stationary (steady-state) solution of this equation allows us to sample from the target (posterior) distribution. I hinted earlier that Bayesian methods suffer from the absence of mini-batch techniques.
If we have a tool that:
- Can be written in a nice "gradientable" way (gradients are good with mini batches)
- Knows where to converge
Maybe we will be able to find a nice way to do Bayesian inference.
Before we move forward, one can also notice that the update

x_{t+1} = x_t − (ε/2) ∇U(x_t) + η_t,  η_t ~ N(0, ε)

can be considered as a Metropolis-Hastings proposal with the mean-variance pair:

( x_t − (ε/2) ∇U(x_t),  ε )

This reframes the problem as "we know that it may work, we just need to study the accept-reject mechanism".
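Viewed this way, the update is the proposal of the Metropolis-adjusted Langevin algorithm (MALA), and the accept-reject mechanism can be written explicitly. A minimal sketch on the same assumed standard-normal target:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):
    """Log of the target density (up to a constant), with U(x) = x**2 / 2."""
    return -0.5 * x ** 2

def grad_U(x):
    return x

eps = 0.1

def log_q(x_to, x_from):
    """Log-density of the Langevin proposal N(x_from - (eps/2)*grad_U(x_from), eps)."""
    mean = x_from - 0.5 * eps * grad_U(x_from)
    return -((x_to - mean) ** 2) / (2 * eps)

x = 0.0
for _ in range(10000):
    x_new = x - 0.5 * eps * grad_U(x) + rng.normal(0.0, np.sqrt(eps))
    log_alpha = (log_p(x_new) + log_q(x, x_new)) - (log_p(x) + log_q(x_new, x))
    if np.log(rng.random()) < log_alpha:   # this is the accept-reject mechanism to be studied
        x = x_new
```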
Let’s Paper
First Section
As I described in the Zeitgeist section, the authors give a high-level survey of ML at the time. They mention the flood of data that raises the need for good ML, and the progress that is indeed taking place in this domain. According to the authors (and they are right), ML is everywhere: vision, language and speech processing. This progress raises the need for algorithms with two significant properties:
- They can show robust convergence in optimization problems
- They handle well huge amounts of data
Stochastic optimization does both: since it relies on noisy error measurements, it is an optimization algorithm that can work with mini-batches.
Now the authors move on to discuss Bayesian methods. These methods are not part of the "ML progress party": they are good neither at online processing nor at handling massive amounts of data. At first glance one might say it is time to say goodbye, that Bayesian methods are no longer needed. However, these methods are endowed with two significant properties:
- They handle overfitting well
- They can quantify model uncertainty
Since we work with huge datasets, overfitting is nearly always a problem, and as models become more complex, the need to estimate their risk becomes mandatory. So don't throw Bayesian methods away yet.
The paper’s idea
Let's combine stochastic optimization and Langevin dynamics in order to use Bayesian methods: the former will take us to the MAP solution and the latter will supply a stable posterior sampling mechanism.
Second Section
In this section the authors describe the implementation of Stochastic optimization for the Bayesian problem.
Recall that X = {x_i}_{i=1}^N is the observation data, and denote the parameter set by

θ

The update step of the optimization problem is then:

Δθ_t = (ε_t/2) [ ∇ log q(θ_t) + (N/n) Σ_{i=1}^n ∇ log p(x_{ti} | θ_t) ]

where n is the mini-batch size and N the total amount of data. The step sizes ε_t satisfy the Robbins-Monro conditions, so this iteration converges to the MAP solution. A good solution, but one that provides no uncertainty estimates and is clearly vulnerable to overfitting. The next step is to modify stochastic optimization towards the Langevin equation:

Δθ_t = (ε_t/2) [ ∇ log q(θ_t) + (N/n) Σ_{i=1}^n ∇ log p(x_{ti} | θ_t) ] + η_t,  η_t ~ N(0, ε_t)

where q is the prior. The last term is a zero-mean noise term whose variance is to be studied. Note that this equation is the discrete form of the Langevin equation (which is a stochastic differential equation – SDE), whose stationary solution is the posterior distribution. We don't yet know whether this discretization preserves the required convergence. However, since this update is analogous to a Metropolis-Hastings proposal, we know that if we control the decay of the ε_t well, we may get a rejection rate low enough to satisfy the requirement.
One can see that, from a data science point of view, the paper simply suggests:
"when you do SGD, simply add Gaussian noise, just check the variance carefully".
Third Section
So we discussed the discrete Langevin equation

Δθ_t = (ε_t/2) [ ∇ log q(θ_t) + Σ_{i=1}^N ∇ log p(x_i | θ_t) ] + η_t,  η_t ~ N(0, ε_t)

Now we wish to check whether it holds for a mini-batch solution. Namely, can it be approximated by

Δθ_t = (ε_t/2) [ ∇ log q(θ_t) + (N/n) Σ_{i=1}^n ∇ log p(x_{ti} | θ_t) ] + η_t

Where:

η_t ~ N(0, ε_t)
Note: In the paper, it is written

It is a mistake.
Now we can combine Stochastic optimization with Langevin Dynamics
- We construct a sequence of step sizes ε_t that satisfies the Robbins-Monro conditions, e.g. decaying as

ε_t = a (b + t)^{-γ},  γ ∈ (0.5, 1]

- In the early part of the run the stochastic-gradient term is dominant, and we simply add a zero-mean noise term to it (as Robbins-Monro allows)
- The system moves towards MAP solution
- As we approach MAP:
- The ratio between two adjacent samples converges to 1 (hence the rejection rate goes to 0)
- The ε_t decay, hence the gradient part becomes negligible and the Langevin (noise) term governs.
Why is that? Since we are approaching the MAP solution, the gradient term becomes small; this means the injected noise provides most of the fluctuations (simply because we are near a fixed point).
At this stage we can simply start collecting samples from the Langevin dynamics, knowing both from MH and from physics that we are sampling from the posterior.
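A minimal sketch of the kind of polynomial step-size schedule discussed above; the constants a, b and γ are whatever you would tune for your problem:

```python
def step_size(t, a=1e-2, b=1.0, gamma=0.55):
    """Polynomial Robbins-Monro schedule eps_t = a * (b + t) ** (-gamma), with 0.5 < gamma <= 1."""
    return a * (b + t) ** (-gamma)

# early iterations: large eps_t, the stochastic gradient dominates (optimization phase)
# late iterations:  small eps_t, the injected noise dominates (posterior sampling phase)
```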

Let’s get Rigorous -Elective
We now wish to show that the mini-batch solution converges to the full-data discrete solution.
Namely, that

Δθ_t = (ε_t/2) [ ∇ log q(θ_t) + (N/n) Σ_{i=1}^n ∇ log p(x_{ti} | θ_t) ] + η_t

behaves like

Δθ_t = (ε_t/2) [ ∇ log q(θ_t) + Σ_{i=1}^N ∇ log p(x_i | θ_t) ] + η_t
We can take the "batched" Langevin equation:

Δθ_t = (ε_t/2) [ ∇ log q(θ_t) + (N/n) Σ_{i=1}^n ∇ log p(x_{ti} | θ_t) ] + η_t

and re-write it as:

Δθ_t = (ε_t/2) [ g(θ_t) + h_t(θ_t) ] + η_t

Where:

g(θ) = ∇ log q(θ) + Σ_{i=1}^N ∇ log p(x_i | θ), and h_t(θ) is the zero-mean noise coming from the mini-batch subsampling,

and the variance of the first (stochastic-gradient) term is

(ε_t²/4) V(θ_t)

for finite V.
As the ε_t decay to 0, the variance of the injected Langevin noise is of order ε_t, whereas the stochastic-gradient term has a variance of order ε_t². Hence the Langevin term becomes dominant.
The final part uses Markov chain properties to show that the samples converge to the required posterior distribution.
Fourth Section
As described, we frequently use sampling algorithms for Bayesian problems (namely, finding the posterior). We have demonstrated a sampler, which raises two questions:
- When does the transition point occur?
- What is the expectation of an MC estimator?
Transition time
The transition time is the point where the Langevin term begins to govern and we can assume that the outputs reflect the required posterior distribution well. Since the gradient may have a different scale in different directions, we introduce a preconditioning matrix M that scales the gradient directions (one can think of M as an analog of batch normalization). We can write the update step as follows:

Δθ_t = (ε_t/2) M [ ∇ log q(θ_t) + (N/n) Σ_{i=1}^n ∇ log p(x_{ti} | θ_t) ] + η_t,  η_t ~ N(0, ε_t M)

Recall that the variance of the injected noise term is

ε_t M

Following the CLT, we expect the error of the stochastic gradient h around the true gradient g to follow a normal distribution. A sequence of pretty trivial calculations gives that this variance is approximately

V(θ_t) ≈ (N²/n) V_s

where V_s is the empirical covariance matrix of the per-example gradients. We require the condition

α = (ε_t N² / 4n) · λ_max(M^{1/2} V_s M^{1/2}) ≪ 1

where λ_max(A) is the largest eigenvalue of A = M^{1/2} V_s M^{1/2}. This means that we need to choose a step size satisfying the α condition in order to verify that we have reached the Langevin regime.
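To check numerically whether we are already in the Langevin regime, one can estimate V_s from the per-example gradients of the current mini-batch and evaluate α. A minimal sketch; the preconditioner M and the gradient array are assumed inputs, and the formula follows the α condition as reconstructed above:

```python
import numpy as np

def alpha_condition(per_example_grads, eps, N, M):
    """Estimate alpha = (eps * N**2 / (4 * n)) * lambda_max(M^{1/2} Vs M^{1/2}); we want alpha << 1.
    per_example_grads: (n, d) array of gradients of log p(x_i | theta) at the current theta."""
    n = per_example_grads.shape[0]
    Vs = np.cov(per_example_grads, rowvar=False)   # empirical covariance of the scores
    M_half = np.linalg.cholesky(M)                 # any square root of M gives the same spectrum
    A = M_half.T @ Vs @ M_half
    lam_max = np.linalg.eigvalsh(A).max()
    return eps * N ** 2 / (4 * n) * lam_max
```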
Posterior Expectation
We generate a sequence of samples

θ_1, θ_2, …, θ_T

This sequence is expected to behave as if it were drawn from the posterior distribution. For a function of interest f we have

E[f(θ)] ≈ (1/T) Σ_{t=1}^T f(θ_t)

Since the ε_t decay, we expect different behavior than in regular Markov chains: here we may have stronger correlations in the later stages. To mitigate this we may weight the MC estimator by the step sizes:

E[f(θ)] ≈ Σ_{t=1}^T ε_t f(θ_t) / Σ_{t=1}^T ε_t
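A minimal sketch of the two estimators above, given the collected samples and their step sizes; f stands for any function whose posterior expectation we want:

```python
import numpy as np

def posterior_expectation(f, samples, step_sizes=None):
    """MC estimate of E[f(theta)] from SGLD samples.
    With step_sizes given, each sample is weighted by its eps_t as suggested in the paper."""
    values = np.array([f(s) for s in samples])
    if step_sizes is None:
        return values.mean()                      # plain average (1/T) * sum f(theta_t)
    w = np.asarray(step_sizes, dtype=float)
    return (w * values).sum() / w.sum()           # sum eps_t * f(theta_t) / sum eps_t
```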
Examples
Gaussian
Let's sample as follows (the paper's two-component Gaussian mixture example):

θ_1 ~ N(0, σ_1²),  θ_2 ~ N(0, σ_2²)

x_i ~ (1/2) N(θ_1, σ_x²) + (1/2) N(θ_1 + θ_2, σ_x²)

σ_1² = 10,  σ_2² = 1,  σ_x² = 2
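A minimal sketch of this experiment, assuming the hyperparameters above and the paper's setup of 100 observations drawn with θ = (0, 1); the step-size constants and batch size are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
sig1_sq, sig2_sq, sigx_sq = 10.0, 1.0, 2.0
theta_true = np.array([0.0, 1.0])
N = 100
comp = rng.random(N) < 0.5                   # which mixture component each point comes from
x = np.where(comp, rng.normal(theta_true[0], np.sqrt(sigx_sq), N),
                   rng.normal(theta_true[0] + theta_true[1], np.sqrt(sigx_sq), N))

def grad_log_like(theta, xb):
    """Sum over the batch of gradients of log p(x | theta) for the two-component mixture."""
    t1, t2 = theta
    n1 = np.exp(-(xb - t1) ** 2 / (2 * sigx_sq))
    n2 = np.exp(-(xb - t1 - t2) ** 2 / (2 * sigx_sq))
    p = 0.5 * n1 + 0.5 * n2                  # unnormalized; shared constants cancel below
    d1 = (0.5 * n1 * (xb - t1) + 0.5 * n2 * (xb - t1 - t2)) / (sigx_sq * p)
    d2 = (0.5 * n2 * (xb - t1 - t2)) / (sigx_sq * p)
    return np.array([d1.sum(), d2.sum()])

def grad_log_prior(theta):
    return -theta / np.array([sig1_sq, sig2_sq])

# SGLD loop with the polynomial schedule eps_t = a * (b + t) ** (-gamma)
a, b, gamma, n = 0.01, 1.0, 0.55, 10
theta = np.zeros(2)
samples = []
for t in range(10000):
    eps = a * (b + t) ** (-gamma)
    idx = rng.choice(N, n, replace=False)
    grad = grad_log_prior(theta) + (N / n) * grad_log_like(theta, x[idx])
    theta = theta + 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps), 2)
    samples.append(theta.copy())
```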
The obtained plots

This graph compares the true posterior with the samples.

This graph shows the phase transitions
Logistic Regression
We present the problem of logistic regression and solve it with SGLD.

p(y_i = ±1 | x_i, β) = σ(y_i β^T x_i)

∇_β log p(y_i | x_i, β) = σ(−y_i β^T x_i) y_i x_i

The β parameter has a scale-1 Laplace prior distribution, whose log-density gradient is −sign(β).
We have

Δβ_t = (ε_t/2) [ −sign(β_t) + (N/n) Σ_{i=1}^n σ(−y_{ti} β_t^T x_{ti}) y_{ti} x_{ti} ] + η_t,  η_t ~ N(0, ε_t)
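A minimal sketch of one SGLD update for this model; the data arrays, the step size and the batch handling are placeholder choices, and σ denotes the logistic sigmoid:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgld_logreg_step(beta, X_batch, y_batch, eps, N):
    """One SGLD step for logistic regression with a scale-1 Laplace prior on beta.
    X_batch: (n, d) features, y_batch: (n,) labels in {-1, +1}."""
    n = X_batch.shape[0]
    margins = y_batch * (X_batch @ beta)
    grad_like = (sigmoid(-margins) * y_batch) @ X_batch   # sum_i sigma(-y_i b^T x_i) y_i x_i
    grad_prior = -np.sign(beta)                           # gradient of the Laplace(0, 1) log-prior
    grad = grad_prior + (N / n) * grad_like
    return beta + 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps), size=beta.shape)
```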
A trial was run on the a9a dataset, and the following graphs were obtained:

Summary
This algorithm was described as "the most accurate per computational unit" (I'm not sure this is still true). The main motivation for further research was developing mini-batch gradient methods for MCMC. The authors emphasize the need for a coherent theory that will cover details such as the convergence rate and the accept-reject steps. Clearly SGLD is expected to be a compatible tool for dynamical systems and derived sampling methods such as HMC.
Where is it used today?
There are quite a few papers, such as those mentioned at the beginning of the post, in which people add noise terms to the gradients in different ways.
One topic in which SGLD is often discussed is BNNs (Bayesian neural networks), where the posterior distribution is the training objective; the concept of using mini-batches for this purpose is appealing, and SGLD is among the common tools alongside variational inference.
Acknowledgment
This post was written as part of the amazing #deepnightlearners project, which takes place in Israel and is led by Mike Erlikhson and Avraham Raviv.
In addition, I wish to thank Amit Miller and Uri Itai for fruitful discussions and for providing relevant papers.