When I was a statistics rookie trying to learn Bayesian statistics, I often found it extremely confusing to get started: most online content began with Bayes' formula and then jumped straight to an R/Python implementation of Bayesian inference, without much intuition about how we get from Bayes' theorem to probabilistic inference. I hope this article fills the gap between the simple formula and the rest of Bayesian study, with as little math as possible. Specifically, I want to touch on:
- The intuition behind a classic Bayes formula example, and why it is NOT useful for understanding Bayesian inference
- How Bayesian inference works in practice
- The difference between a probability distribution and a likelihood, the two main components of Bayesian inference (and therefore the difference between the prior and the likelihood)
- How, in practice, we can usually ignore the calculation of both the numerator and the denominator of Bayes' formula, with a simple real-life example: a posterior model for estimating stock market returns.
Intuition behind a classic Bayes formula example
I want to reiterate that I will focus on the intuition rather than the math here.
Introductions to Bayes' theorem usually start with an example about cancer detection, and this article is no exception. When you study this formula in high school or as an undergraduate, you are always given the probabilities of some events and asked to calculate the probabilities of others.

For example, we are told the probability of someone having cancer in advance, say P(cancer)=0.1. It means that if we randomly sample a person from this country, there is a 10% chance that they have cancer. A new cancer-testing technology can provide more accurate information and update our belief about whether a person has cancer or not. With Bayes' formula, we can quantify how much more certain we become about whether a person has cancer.
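For reference, the textbook form of Bayes' formula is P(A|B) = P(B|A) × P(A) / P(B). In the cancer example, A is "has cancer" and B is "tests positive": the prior P(cancer) gets multiplied by how compatible a positive result is with having cancer, and divided by the overall chance of seeing a positive result.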

Previously, if we randomly sampled a person, we were only 10% sure that they had cancer; this is our prior belief. With the new testing technology and its data (the likelihood), we increase our certainty to 66%. This serves as a very good starter for gaining intuition about Bayes' theorem, but chances are you will never pick this exact formula up again, because it is basically useless in real life, for two reasons.
1. There is uncertainty in every part of the formula
First of all, the prior, the likelihood, and the denominator are all uncertain in real life. We never know for sure what percentage of the population has cancer; the best we can do is make an educated guess.
In statistics, whenever we model uncertainty we call the process probabilistic (as opposed to deterministic), and we model the event with a probability distribution.
Therefore, for the prior (i.e. how many people have cancer), in order to quantify our belief we place values not on the probability itself, but on the parameters that generate a probability distribution.
Which probability distribution we use depends on the variable / parameter we are trying to estimate. If we are estimating the probability that something happens (a discrete variable), we can use a binomial or Poisson distribution. On the other hand, if we are estimating a stock return (a continuous variable), we can use a normal or Student-t distribution.
Therefore in real life, instead of saying P(stock_up)=0.1, in Bayesian statistics we say: "I, as a financial expert, make an educated guess that the stock return over the next 12 months follows a normal distribution with a mean of 30% and a variance of 5%, and I hope to update this educated belief (the prior) with the new data we acquire later."
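As a minimal sketch of what "placing values on the parameters" looks like in code (assuming Python with SciPy installed; the 30% / 5% figures are just the illustrative numbers above):

```python
from scipy import stats

# Prior belief on the 12-month stock return: a normal distribution
# with mean 30% and variance 5% (so standard deviation of roughly 22%).
prior = stats.norm(loc=0.30, scale=0.05 ** 0.5)

# Instead of a single number like P(stock_up)=0.1, we can now query
# the whole distribution, e.g. the prior probability of a positive return:
print(1 - prior.cdf(0.0))
```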
Then, you are off to have a great start in understanding Bayesian inference.
2. Bayesian statistics is about multiplying probability functions, not real numbers
We established that the prior is always modeled as a probability distribution, and a probability distribution always has a probability mass function (for discrete variables) or a probability density function (for continuous variables). In our stock return example, the prior is a normal distribution whose known mean and variance serve as our educated guess.

In real-life Bayesian statistics, we often ignore the denominator (P(B) in the formula above) not because it's not important, but because it's impossible to calculate most of the time. I will skip the discussion of why it's so difficult to calculate; just remember that we have other ways to calculate or estimate the posterior even without the denominator.
This leaves our "real life" Bayesian statistics with only two components: the likelihood and the prior.
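In other words, the posterior is proportional to the product of the two: p(parameter | data) ∝ p(data | parameter) × p(parameter), where the first factor is the likelihood and the second is the prior.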

The likelihood function follows a similar fashion to the prior. We also assume a probability distribution for the likelihood, but instead of guessing the parameters ourselves, we let the data determine the parameter values used to generate the likelihood function. After generating the likelihood function, we can multiply it with the prior function, resulting in a posterior function.

In real life, unlike the textbook cancer example, instead of having a fixed value for our likelihood probability, in Bayesian statistics we say: "I, as a data analyst, collected a lot of data from the stock market and concluded that the stock return follows a normal distribution. According to the data I collected, it is most probable that the mean is 10 and the variance is 2. We will use this to update our prior belief."
To pause and summarize: in practice, we multiply two probability functions together and obtain a posterior probability function, which can be proven to be a probability distribution and represents an updated version of our prior belief. In the stock market return example, this means we have an initial belief (the prior) about the parameters of the stock market return (the mean and variance). With the extra data we collect and quantify as the likelihood, we further update our belief about the mean and variance (the posterior).
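To make the multiplication concrete, here is a minimal numerical sketch (assuming SciPy, treating the variance as known so that only the mean return is unknown, and using a single made-up observed return of 10%): we evaluate the prior and the likelihood on a grid of candidate means, multiply them, and normalize numerically at the end instead of deriving the denominator analytically.

```python
import numpy as np
from scipy import stats

# Grid of candidate values for the unknown mean stock return.
theta = np.linspace(-0.5, 1.0, 1001)

# Prior density (educated guess: mean 30%, variance 5%) on the grid.
prior = stats.norm.pdf(theta, loc=0.30, scale=0.05 ** 0.5)
# Likelihood of one hypothetical observed return of 10% (variance 10%).
likelihood = stats.norm.pdf(0.10, loc=theta, scale=0.10 ** 0.5)

# Numerator of Bayes' formula, then a numerical normalization step.
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * (theta[1] - theta[0]))

print(theta[np.argmax(posterior)])  # posterior mode of the mean return
```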
But when I studied this subject back then, the above explanation raised more questions than answers. What is the difference between a likelihood and a probability distribution? The formulas above seem very complicated: how can I derive the formula / probability density function for the posterior distribution? Does that mean I need to be a probability expert to implement even a basic Bayesian model?
I assure you that we are almost there. Let’s get on with it.
Difference between probability and likelihood
Let's take a step back and talk about high school math again. When we start to learn about probability, we are given a coin flip example. Assuming we have a fair coin, P(head)=0.5. This implies a Bernoulli distribution, with the parameter p known to be 0.5.
This example fits perfectly into our prior component, where we assume certain values for the parameters (in a normal distribution the parameters are the mean and variance, in case you haven't connected the two yet).
In high school math, we know the parameter and we try to calculate the probability of certain data occurring. In real life, we most likely want to do the reverse: usually we know nothing about the parameter, we have a handful of data, and we want to make an inference (estimate) about the parameter value. The likelihood function is designed exactly for this purpose.
Let's use the coin flip as our example again. Instead of knowing in advance whether the coin is fair, we assume nothing and flip the coin 100 times. It turns out that exactly 50 flips come up heads and 50 come up tails. We then conclude that setting P(head)=0.5 maximizes the probability of such a data pattern occurring.
Intuitively, with probability we fix the parameter and estimate the probability of the data, which is variable. With likelihood we fix the data and vary the parameter until we find the value that provides the best model for generating such a dataset.
That's why probability and likelihood look similar in their functional form, but in mathematical terms they have entirely different properties.
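Here is a minimal sketch of the "fix the data, vary the parameter" idea for the coin flip example (assuming Python with NumPy and SciPy; the 50-heads-out-of-100 data is the example above):

```python
import numpy as np
from scipy import stats

# The data is fixed: 50 heads observed in 100 flips.
heads, flips = 50, 100

# Vary the parameter p and evaluate the likelihood of this fixed data.
p_grid = np.linspace(0.01, 0.99, 99)
likelihood = stats.binom.pmf(heads, flips, p_grid)

# The parameter value that maximizes the likelihood of the observed data.
print(p_grid[np.argmax(likelihood)])  # 0.5
```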
To pause and summarize again: in practice, we have a prior whose parameters are fixed by us, and a likelihood whose parameters are entirely determined by (1) the probability distribution we choose and (2) the data we fit into that distribution.
What's more awesome about Bayesian statistics is that we can multiply a discrete (continuous) prior with a continuous (discrete) likelihood. The probability distributions we choose for the prior and the likelihood can be picked flexibly and adjusted to our use case, thanks to the fact that all of them have a probability mass or density function that we can multiply and solve, either on paper or with computational statistics.
Speaking of solving the math on paper, let's revisit our stock return example. We have a univariate normal prior and a univariate normal likelihood, and when we multiply their functions together there is actually a nice closed-form solution: when both the mean and the variance are unknown, the result is a normal-inverse-chi-squared distribution, whose updated mean and variance parameters can be written down directly.
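As a sketch of the simplest case, where the variances are treated as known and only the mean is uncertain, the posterior for the mean is again normal, with

posterior_variance = 1 / (1/prior_variance + 1/likelihood_variance)
posterior_mean = posterior_variance × (prior_mean/prior_variance + likelihood_mean/likelihood_variance)

i.e. a precision-weighted average of the prior mean and the data mean.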
If we visit Wikipedia, we can find a list of the possible combinations of prior and likelihood distributions, along with the probability distribution that models the resulting posterior for each. Basically, we just take the posterior distribution listed there, plug in the parameter values from our likelihood and prior, and an updated belief, the posterior distribution, is generated. This technique is known as using a conjugate prior, and it saves a load of time solving the math on paper.
But conjugate priors can only handle the most basic probability models. What if our prior is a multivariate normal distribution over 100 variables? In such cases it is practically impossible to derive a closed-form solution for the posterior.
How we can ignore calculating both the numerator and the denominator of Bayes' formula in practice
It turns out we don't really need to work out the multiplication in the numerator analytically, nor compute the denominator at all.
Because we have everything we need to evaluate the prior and the likelihood given a certain dataset (X = x1, x2 … xt), we can draw a random candidate value for the parameter, evaluate the prior and the likelihood at that value, and multiply them to obtain one random sample of the posterior density.
As a reminder, I briefly mentioned that the posterior is guaranteed to be a probability distribution (well-proven math, which I don't fully understand to be honest, but hey, it's proven). So when we say we calculate the posterior, we are referring to calculating a probability density, which describes how likely a parameter value (e.g. the mean stock return) is given a dataset.
If we repeat this sampling a huge number of times, the posterior density values we generate form a very good approximation of the true posterior probability distribution.
This family of sampling techniques is known as Markov chain Monte Carlo (MCMC) methods, and there are several implementations of MCMC, including random-walk Metropolis and Hamiltonian Monte Carlo. Similar to deep learning, where certain activation functions are usually superior to others, in MCMC the No-U-Turn Sampler (which is derived from Hamiltonian Monte Carlo) is widely regarded as the best general-purpose implementation.
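To make the idea tangible, here is a toy random-walk Metropolis sketch for the stock return model (assuming NumPy and SciPy; the observed returns are made up for illustration). Note how only the unnormalized numerator, prior times likelihood, is ever evaluated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical observed 12-month returns standing in for real data.
data = np.array([0.12, 0.05, 0.08, 0.15, 0.10])

def log_numerator(mu):
    # log prior (educated guess) + log likelihood of the data given mu.
    log_prior = stats.norm.logpdf(mu, loc=0.30, scale=0.05 ** 0.5)
    log_lik = stats.norm.logpdf(data, loc=mu, scale=0.10 ** 0.5).sum()
    return log_prior + log_lik

samples, mu = [], 0.0
for _ in range(20000):
    proposal = mu + rng.normal(scale=0.05)   # random-walk proposal
    # Accept with probability min(1, posterior ratio); the denominator cancels.
    if np.log(rng.uniform()) < log_numerator(proposal) - log_numerator(mu):
        mu = proposal
    samples.append(mu)

print(np.mean(samples[5000:]))  # approximate posterior mean of mu
```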
Therefore, MCMC doesn't solve every question we have in Bayesian statistics; it simply gives us a realization of the posterior distribution when we are unable to calculate it analytically. For a practical use case, we still need to choose the probability distributions for the prior and the likelihood, and fit the data ourselves.
This is actually perceived as an advantage of Bayesian statistics: we make our assumptions very clear in advance by stating the probability distributions we choose for the prior and the likelihood, and these assumptions help us better understand model uncertainty.
What's even better is that the resulting posterior is a probability distribution, meaning that instead of using confusing evaluation tools like confidence intervals, we can take it as literally as possible and conclude that a certain event has a certain probability of occurring in the future.
In Python, both PyStan and PyMC aim to run the MCMC for us. The only things we need to provide are the prior and the likelihood, along with the prior parameters and the data for likelihood estimation.
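For illustration, here is a minimal PyMC sketch of the stock return model, under the same assumptions as above (known variances, made-up observed returns); it is a sketch of the workflow, not a production model:

```python
import numpy as np
import pymc as pm

# Hypothetical observed 12-month returns standing in for the analyst's data.
returns = np.array([0.12, 0.05, 0.08, 0.15, 0.10])

with pm.Model():
    # Prior: the financial expert's educated guess about the mean return.
    mu = pm.Normal("mu", mu=0.30, sigma=0.05 ** 0.5)
    # Likelihood: observed returns assumed normal around that mean.
    pm.Normal("obs", mu=mu, sigma=0.10 ** 0.5, observed=returns)
    # PyMC picks the No-U-Turn Sampler automatically for continuous models.
    trace = pm.sample(2000, tune=1000)

print(float(trace.posterior["mu"].mean()))  # approximate posterior mean
```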
A complete walk-through of the stock return example
Again, let's assume that I, as a financial expert, know that in previous years the overall stock market return over the next 12 months always followed a normal distribution with a mean of 30% and a variance of 5%. However, with COVID-19 this assumption is shaky, and my colleague, a data expert, disagrees with my belief but mildly acknowledges that my prior is useful for estimating the stock return in the 12 months after the COVID situation stabilizes.
My colleague spent days collecting stock return data from different sectors, such as banking and technology, and he also uses a normal distribution to describe stock returns under COVID-19, with a mean of 10% and a variance of 10%.
With either a conjugate prior or MCMC (say we used a multivariate normal that further breaks the stock return down by sector; then MCMC would be a necessity), we can calculate the posterior probability distribution.
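As a rough back-of-the-envelope check, plugging these numbers into the simplified known-variance update from the conjugate-prior section gives a posterior variance of 1 / (1/0.05 + 1/0.10) ≈ 3.3% and a posterior mean of (0.30/0.05 + 0.10/0.10) / (1/0.05 + 1/0.10) = 7/30 ≈ 23.3%.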

From the resulting posterior, we can see that after the COVID situation stabilizes, the mean return is around 22–23%.
Since the posterior is also a probability distribution, we can make an educated guess, with a given probability, about the range of values for both the mean and the variance. Your boss can then account for different scenarios, each with a different chance of occurring, and be prepared for possible future downside risk.
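For instance, continuing the hedged sketch above, an interval can be read straight off the posterior (assuming SciPy; the numbers are the illustrative ones from this example):

```python
from scipy import stats

# Posterior over the mean return from the back-of-the-envelope update:
# mean of roughly 23.3%, variance of roughly 3.3%.
posterior = stats.norm(loc=7 / 30, scale=(1 / 30) ** 0.5)

# A 90% interval for the mean return, read directly from the posterior.
print(posterior.interval(0.9))
```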
I hope this concludes nicely the intuition behind Bayesian statistics, and covers all the prep work you need before using it in practice or studying it in more detail.