What can we learn from posterior distributions?

A frequentist interpretation of the Bayesian posterior

Alireza Modirshanechi
Towards Data Science


Assume we have observed N independent and identically distributed (iid) samples X = (x1, … , xN) from an unknown distribution q. A typical question in statistics is “what does the set of samples X tell us about the distribution q?”.

Parametric statistical methods assume that q belongs to a parametric family of distributions and that there exists a parameter θ such that q(x) is equal to the parametric distribution p(x|θ) for all x; for example, p(.|θ) can be a normal distribution with unit variance, where θ indicates its mean. In this setting, the question “what does X tell us about q?” is translated to “what does X tell us about the parameter θ for which we have q = p(.|θ)?”.

The Bayesian approach to answering this question is to use the rules of probability theory and assume that θ is itself a random variable with a prior distribution p(θ). The prior distribution p(θ) is a formalization of our assumptions and guesses about θ prior to observing any samples. In this setting, we can write the joint probability distribution of the parameter and data together:

p(θ, X) = p(θ) · p(x1|θ) · … · p(xN|θ)

Using this formulation, all the information that X captures about θ can be summarized in the posterior distribution

p(θ|X) = [ p(θ) · p(x1|θ) · … · p(xN|θ) ] / p(X)

Equation 1. Posterior distribution
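For readers who prefer code to formulas, here is a minimal Julia sketch of Equation 1 evaluated on a grid of parameter values. The helper name grid_posterior and the specific prior and likelihood (a standard-normal prior and the unit-variance Gaussian likelihood used later in the article) are illustrative choices of this sketch, not part of the theory.

```julia
using Distributions, Random

# A sketch of Equation 1 on a grid of parameter values:
# posterior(θ) ∝ prior(θ) · Π_i likelihood(x_i|θ).
function grid_posterior(x, θgrid; prior, likelihood)
    logpost = [logpdf(prior, θ) + sum(logpdf.(likelihood(θ), x)) for θ in θgrid]
    w = exp.(logpost .- maximum(logpost))   # subtract the maximum to avoid underflow
    return w ./ sum(w)                      # normalized over the grid
end

# Example: standard-normal prior, unit-variance Gaussian likelihood.
Random.seed!(1)
x = rand(Normal(1.0, 1.0), 100)
θgrid = range(-3, 3; length = 601)
post = grid_posterior(x, θgrid; prior = Normal(0, 1), likelihood = θ -> Normal(θ, 1))
```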

Bayesian statistics is beautiful, self-consistent, and elegant: Everything is derived naturally using probability theory's rules, and the assumptions are always explicit and clear. However, it often looks mysterious and puzzling: (i) what can we really learn from the posterior distribution p(θ|X) about the underlying distribution q? And (ii) how reliable is this information if our assumptions do not hold, e.g., if q does not belong to the parametric family we consider?

In this article, my goal is to gain some intuition about these two questions. I first analyze the asymptotic form of the posterior distribution when the number of samples N is large — this is a frequentist approach to studying Bayesian inference. Second, I show how the general theory applies to a simple case of a Gaussian family. Third, I use simulations and analyze, for three case studies, how posterior distributions relate to the underlying distribution of data and how this link changes as N increases¹.

Theory: Asymptotic case for large N

The logarithm of the posterior distribution in Equation 1 can be reformulated as

log p(θ|X) = log p(θ) + [ log p(x1|θ) + … + log p(xN|θ) ] − log p(X)

Equation 2. Log-posterior distribution

The constants (with respect to θ) in Equation 2 matter only for normalizing the posterior probability distribution and do not affect how it changes as a function of θ. For large N, we can use the law of large numbers and approximate the 2nd term in Equation 2 (the sum of log-likelihoods) by

log p(x1|θ) + … + log p(xN|θ) ≈ N · E-q[ log p(x|θ) ] = − N · ( D-KL[q || p(.|θ)] + H[q] )

where E-q denotes the expectation with respect to q, D-KL is the Kullback-Leibler divergence and measures a pseudo-distance between the true distribution q and the parametric distribution p(.|θ), and H[q] is the entropy of q (yet another constant with respect to θ). It is important to note, however, that the approximation works only if the mean and the variance (with respect to q) of log p(x|θ) are finite for some parameter θ. We will further discuss the importance of this condition in the next sections.
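As a quick sanity check of this approximation, the following sketch compares the empirical average of the log-likelihoods with −(D-KL + H[q]) for one concrete, arbitrary choice: q = Normal(1, 1), a unit-variance Gaussian p(.|θ), and θ = 0.3, for which D-KL[q || p(.|θ)] = (θ−1)²/2 and H[q] = (log(2π)+1)/2.

```julia
using Distributions, Random, Statistics

# Sanity check of the law-of-large-numbers step, assuming q = Normal(1, 1),
# a unit-variance Gaussian p(.|θ), and θ = 0.3. For this choice,
# D-KL[q || p(.|θ)] = (θ - 1)^2 / 2 and H[q] = (log(2π) + 1) / 2.
Random.seed!(0)
x = rand(Normal(1.0, 1.0), 100_000)
θ = 0.3
lhs = mean(logpdf.(Normal(θ, 1.0), x))         # (1/N) · Σ_i log p(x_i|θ)
rhs = -((θ - 1)^2 / 2 + (log(2π) + 1) / 2)     # -(D-KL + H[q])
println(lhs, " ≈ ", rhs)                       # both ≈ -1.66
```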

If p(θ) has full support over the parameter space (i.e., it is non-zero everywhere), then log p(θ) is always finite, and the dominant θ-dependent term in Equation 2, for large N, is −N times D-KL[q || p(.|θ)]. This implies that increasing the number of samples N makes the posterior distribution p(θ|X) closer and closer to the distribution

p*(θ; N) = (1/Z) · exp( − N · D-KL[q || p(.|θ)] )

Equation 3

where Z is the normalization constant. p*(θ; N) is an interesting distribution: Its maximum is where the divergence D-KL[q || p(.|θ)] is minimal (i.e., where p(.|θ) is as close as possible to q)², and its sensitivity to D-KL[q || p(.|θ)] grows with the number of samples N (i.e., it becomes more “narrow” around its maximum as N increases).

When the assumptions are correct

When the assumptions are correct and there exists a θ* for which we have q = p(.|θ*), then

D-KL[q || p(.|θ)] = D-KL[p(.|θ*) || p(.|θ)]

where D-KL[p(.|θ*) || p(.|θ)] is a pseudo-distance between θ and θ*. Hence, as N increases, the posterior distribution concentrates around the true parameter θ*, giving us all the information we need to fully identify q — see footnote³.

When the assumptions are wrong

When there is no θ for which we have q = p(.|θ), then we can never identify the true underlying distribution q — simply because we are not searching in the right place! We emphasize that this problem is not limited to Bayesian statistics; it extends to any parametric statistical method.

Although we can never fully identify q in this case, the posterior distribution is still informative about q: If we define θ* as the parameter of the pseudo-projection of q onto the parametric family,

θ* = argmin over θ of D-KL[q || p(.|θ)]

then, as N increases, the posterior distribution concentrates around θ*, giving us enough information to identify the best candidate in the parametric family for q — see footnote⁴.
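In practice, θ* can be approximated directly from samples: up to a constant that does not depend on θ, D-KL[q || p(.|θ)] equals the cross-entropy −E-q[log p(x|θ)], which we can estimate by a sample average. Here is a rough sketch, using (purely for illustration) a Laplace q and the unit-variance Gaussian family that appears later in the article.

```julia
using Distributions, Random, Statistics

# Estimating θ* = argmin over θ of D-KL[q || p(.|θ)] from samples: up to a
# θ-independent constant, D-KL equals the cross-entropy -E-q[log p(x|θ)],
# which we approximate by a sample average (Laplace q, unit-variance Gaussian
# family, both chosen only for illustration).
Random.seed!(0)
x = rand(Laplace(1.0, 1.0), 100_000)
θgrid = range(-2, 4; length = 601)
crossentropy = [-mean(logpdf.(Normal(θ, 1.0), x)) for θ in θgrid]
θstar = θgrid[argmin(crossentropy)]    # ≈ 1.0, the mean of q
```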

Intermediate summary

As N increases, the posterior distribution concentrates around the parameter θ* that corresponds to the closest distribution (in the sense of the KL divergence) in the parametric family to the actual distribution q. If q belongs to the parametric family, then the closest distribution to q is q itself.

Example: Gaussian distribution

In the previous section, we studied the general form of posterior distributions for a large number of samples. Here, we study a simple example to see how the general theory applies to specific cases.

We consider a simple example where our parametric distributions are Gaussian distributions with unit variance and a mean equal to θ:

p(x|θ) = Normal(x|μ=θ, σ²=1) = (1/√(2π)) · exp( −(x−θ)²/2 )

For simplicity, we consider a standard normal distribution as the prior p(θ). Using Equation 1, it is easy to show that the posterior distribution is

p(θ|X) = Normal(θ|μ=θ-hat-N, σ²=σ²-N)

with

θ-hat-N = (x1 + … + xN)/(N+1)   and   σ²-N = 1/(N+1)

Now, we can also identify p*(θ; N) (see Equation 3) and compare it to the posterior distribution: As long as the mean and the variance of the true distribution q are finite, we have

D-KL[q || p(.|θ)] = (1/2) · ( θ − E-q[x] )² + constant (with respect to θ)

As a result, we can write (using Equation 3)

p*(θ; N) = Normal(θ|μ=θ*, σ²=1/N)

with

θ* = E-q[x], i.e., the mean of the true distribution q.

Equation 4

As expected from the general theory, we can approximate p(θ|X) by p*(θ; N) for large N, because

θ-hat-N = (x1 + … + xN)/(N+1) → θ* = E-q[x] (by the law of large numbers)   and   σ²-N = 1/(N+1) ≈ 1/N.

To summarize, p(θ|X) concentrates around the true mean of the underlying distribution q — if it exists.
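Here is a small Julia sketch that checks this numerically; the helper gaussian_posterior, the particular q, and the random seed are illustrative choices of the sketch, not part of the derivation.

```julia
using Distributions, Random

# Closed-form posterior p(θ|X) for a standard-normal prior and unit-variance
# Gaussian likelihood: Normal(θ-hat-N, σ-N) with θ-hat-N = Σx/(N+1), σ²-N = 1/(N+1).
gaussian_posterior(x) = Normal(sum(x) / (length(x) + 1), sqrt(1 / (length(x) + 1)))

Random.seed!(1)
q = Normal(0.5, 2.0)                       # any q with finite mean and variance
x = rand(q, 10_000)

post = gaussian_posterior(x)                           # p(θ|X)
approx = Normal(mean(q), sqrt(1 / length(x)))          # p*(θ; N) = Normal(E-q[x], 1/N)
println((mean(post), std(post)), " ≈ ", (mean(approx), std(approx)))
```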

Simulations: The good, the bad, and the ugly

Our theoretical analyses relied on two crucial assumptions: (i) N is large, and (ii) the mean and the variance (with respect to q) of log p(x|θ) are finite for some θ. In this section, we use simulations to study how robust our findings are when these assumptions do not hold.

To do so, we consider the simple setting of our example in the previous section, i.e., a Gaussian family of distributions with unit variance. We then consider three different choices of q and analyze the evolution of the posterior p(θ|X) as N increases.

We also look at how the Maximum A Posteriori (MAP) estimate q-MAP-N = p(.|θ-hat-N) of q evolves as N increases, where θ-hat-N is the maximizer of p(θ|X). This helps us understand how precisely we can identify the true distribution q by looking at the maximizer of the posterior distribution⁵.
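The following is a rough sketch of this simulation protocol (not the original analysis code linked at the end of the article): it draws samples from q one at a time and tracks the posterior mean and standard deviation for the unit-variance Gaussian family with a standard-normal prior.

```julia
using Distributions, Random

# Draw samples from q one at a time and track the posterior over θ for the
# unit-variance Gaussian family with a standard-normal prior. After N samples,
# the posterior is Normal(θ-hat-N, σ-N) and the MAP estimate of q is Normal(θ-hat-N, 1).
function track_posterior(q, Nmax)
    x = rand(q, Nmax)
    θhat = cumsum(x) ./ (2:(Nmax + 1))      # posterior means for N = 1, …, Nmax
    σpost = 1 ./ sqrt.(2:(Nmax + 1))        # posterior standard deviations
    return θhat, σpost
end

Random.seed!(42)
θhat, σpost = track_posterior(Normal(1.0, 1.0), 10_000)   # the "good" case below
```

Swapping the Normal(1.0, 1.0) above for a Laplace or a Cauchy distribution reproduces the setups of the second and third case studies below.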

The good: Gaussian distribution

For the 1st case, we consider the best-case scenario, in which q belongs to the parametric family and all assumptions are satisfied:

q(x) = Normal(x|μ=1, σ²=1)

Equation 5. Equivalent to q(x)=p(x|θ=1)

We drew 10'000 samples from q and computed the posterior distribution p(θ|X=(x1,…,xN)) and the MAP estimate q-MAP-N, adding the drawn samples one by one for N = 1 to 10'000 (Figure 1). We observe that p(θ|X) concentrates around the true parameter as N increases (Fig. 1, left) and that the MAP estimate converges to the true distribution q (Fig. 1, right)⁶.

Figure 1. Gaussian distribution as q. Left: The mean (solid black curve) and the standard deviation (shaded grey area) of the posterior distribution as functions of N. The dashed black line shows the true parameter of q=p(.|θ=1). The posterior distribution converges to the true parameter. Vertical colored lines show N=2, 10, 100, and 1000. Right: MAP estimate of q for N=2, 10, 100, and 1000 (colored curves). The dashed black curve shows the true distribution q.

The bad: Laplace distribution

For the 2nd case, we consider a Laplace distribution with unit mean as the true distribution:

q(x) = (1/(2b)) · exp( −|x−1|/b ), i.e., a Laplace distribution with mean 1 and scale b

In this case, q does not belong to the parametric family, but it still has a finite mean and variance. Hence, according to the theory, the posterior distribution should concentrate around the parameter θ* of the pseudo-projection of q on the parametric family. For the example of the Gaussian family, θ* is always the mean of the underlying distribution, i.e., θ* = 1 (see Equation 4).

Our simulations show that p(θ|X) indeed concentrates around θ* = 1 as N increases (Fig. 2, left). The MAP estimate, however, converges to a distribution that is systematically different from the true distribution q (Fig. 2, right) — just because we were searching among Gaussian distributions for a Laplace distribution! This is essentially a problem of any parametric statistical method: If you search in the wrong place, you cannot find the right distribution!

Figure 2. Laplace distribution as q. Left: The mean (solid black curve) and the standard deviation (shaded grey area) of the posterior distribution as functions of N. The dashed black line shows the parameter corresponding to the pseudo-projection of q on the parametric family, i.e., θ*=1 (see Equation 4). The posterior distribution converges to θ*. Vertical colored lines show N=2, 10, 100, and 1000. Right: MAP estimate of q for N=2, 10, 100, and 1000 (colored curves). The dashed black curve shows the true distribution q.

The ugly: Cauchy distribution

For our 3rd and last case, we go for the worst possible scenario and consider a Cauchy distribution (a famous heavy-tailed distribution) as the true distribution:

q(x) = 1 / ( π·γ·( 1 + ((x−1)/γ)² ) ), i.e., a Cauchy distribution with location 1 and scale γ

In this case, q does not belong to the parametric family, but the more crucial problem is that the Cauchy distribution does not have a well-defined mean or a finite variance: All of the theory’s assumptions are violated!

Our simulations show that p(θ|X) does not converge to any distribution (Fig. 3, left): The standard deviation of p(θ|X) goes to zero and it concentrates around its mean, but the mean itself does not converge and jumps from one value to another. The problem is fundamental: The KL divergence between a Cauchy distribution and a Gaussian distribution is infinite, independently of their parameters! In other words, according to KL divergence, all Gaussian distributions are equally (and infinitely) far from q, so there is no preference for which one to pick as its estimate!

Figure 3. Cauchy distribution as q. Left: The mean (solid black curve) and the standard deviation (shaded grey area) of the posterior distribution as functions of N. The dashed black line shows the median of q: If q had a mean, then this mean would be equal to 1 because of symmetry. The posterior distribution does not converge to any distribution, and its mean jumps from one value to another. Vertical colored lines show N=2, 10, 100, and 1000. Right: MAP estimate of q for N=2, 10, 100, and 1000 (colored curves). The dashed black curve shows the true distribution q.
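This instability is easy to reproduce: for our Gaussian family, the posterior mean equals N/(N+1) times the running sample mean, and the running sample mean of Cauchy samples never settles down. A minimal sketch (with an arbitrary unit scale for the Cauchy):

```julia
using Distributions, Random

# The running sample mean of Cauchy samples keeps jumping instead of converging,
# and the posterior mean is N/(N+1) times this running mean.
Random.seed!(7)
x = rand(Cauchy(1.0, 1.0), 100_000)
running_mean = cumsum(x) ./ (1:length(x))
println(running_mean[[10, 1_000, 100_000]])   # no sign of settling down
```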

Conclusion

If our assumed parametric family of distributions is not too different from the true distribution q, then the posterior distribution always concentrates around a parameter that is, in one way or another, informative about q. If q is not in the parametric family, however, this information may be only marginal and not particularly useful. The worst-case scenario is when q is too different from every distribution in the parametric family: In this case, the posterior distribution carries essentially no information about q.

Acknowledgment

I am grateful to Johanni Brea, Navid Ardeshir, Mohammad Tinati, Valentin Schmutz, Guillaume Bellec, and Kian Kalhor for numerous discussions on related topics.

Code:

All code (in Julia language) for the analyses can be found here.

Footnotes:

¹ There are multiple excellent open-access textbooks where interested readers can find more information on Bayesian statistics: [1] “Computer Age Statistical Inference” by Bradley Efron and Trevor Hastie, [2] “Bayesian Data Analysis” by Andrew Gelman et al., and [3] “Information Theory, Inference and Learning Algorithms” by David MacKay. Also, see footnotes 3 and 4 and this online course.

² An interesting exercise is to estimate how different the maximizers of p(θ|X) and p*(θ; N) are. As a hint, note that you can use Taylor expansions of different distributions around the maximizer of p*(θ; N). A curious reader may also want to think about what these results tell us about the maximum likelihood estimate of the parameter θ.

³ To read more about the asymptotic consistency of Bayesian inference, you can also look at Barron et al. (1999) and Walker (2004) as two well-known examples.

⁴ To read more about statistical inference under model misspecification, one can look at Kleijn and van der Vaart (2006), Kleijn and van der Vaart (2012), and Wang and Blei (2019) as a few well-known examples.

⁵ We could also study how the posterior predictive distribution q-PPD-N(x) = p(x|X) = ∫ p(x|θ) p(θ|X) dθ evolves as N increases. In our setting with both Gaussian prior and Gaussian likelihood, we have q-MAP-N(x) = Normal(x|μ=θ-hat-N, σ²=1) and q-PPD-N(x) = Normal(x|μ=θ-hat-N, σ²=1+1/(N+1)). Hence, all our qualitative observations for q-MAP-N are also true for q-PPD-N.

⁶ Fun exercise for interested readers: Can you calculate the speed of convergence as a function of the variance of q? Does what you see in the figure make sense with respect to your calculations?
