To start with, I'm a deep learning guy, and I've been quite happy with that until recently. But then I became the victim of a vicious small-sample-size, high-dimensional problem! So I had to turn towards algorithms that are good at "learning from little" and can tell me "how confident I can be about that learning".
This was quite a sudden change for me, because I've been away from Bayesian "stuff" for so long. But the time has come for things to change, so I had to learn (almost) from scratch. I was bombarded with terms like priors, posteriors, likelihoods, KL-divergence and the Evidence Lower BOund (ELBO) (hopefully I'll write another post about the last two some other day!). Then I realized, "hey, it's not that different from frequentist-type algorithms". So it's time to untangle this big ball of mess into structured knowledge, hopefully without falling down any rabbit holes.
Let me give a quick tour of what's going to happen. We're going to start with a frequentist-type regression example. Then we will see why we might want to move on to solving our example with a more promising technique like Bayesian linear regression (the uncertainty information, obviously). Thereafter we will state the Bayes rule, followed by a quick note on how we can adapt the Bayes rule to find a good model for the given data. With this we can define the prior, likelihood and posterior terms (I mean define, not understand). Then we will take our discussion down a long path (it's long, but quite enlightening), discussing each of these terms. While we discuss each of them, we will look at the results we get in our example to build some context. And finally we will do a quick evaluation by comparing what we started with (the prior) to what we found to be the best (the posterior).
The Problem We Want to Solve: Y = β1 X + β0 + ε
The first assumption in linear regression is that the data has the following form.
$$Y = \beta_1 X + \beta_0 + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$
And data generated from such a model would look like the plot below.

Ordinary Least Squares (OLS) Linear Regression
Say we got this dataset and we need to fit a line. We do this by formulating the problem as follows.

Our goal is to find β1 and β0 that give the minimum RMSE (root mean squared error) on the data, which mathematically is,
$$\beta_1^*, \beta_0^* = \underset{\beta_1, \beta_0}{\operatorname{argmin}} \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(y_i - (\beta_1 x_i + \beta_0)\big)^2}$$
First let’s fit a simple line with linear regression.
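Here's a rough sketch of what that looks like in code (the actual code is in the notebook linked at the end; the data-generating values 1.5 and -2.5 are assumed here just to roughly match the example):

```python
import numpy as np

# Toy data assumed to roughly match the example: y = 1.5*x - 2.5 + noise
rng = np.random.RandomState(42)
x = rng.uniform(-5, 5, size=50)
y = 1.5 * x - 2.5 + rng.normal(0.0, 2.0, size=50)

# Ordinary least squares via the design matrix [x, 1]
X = np.column_stack([x, np.ones_like(x)])
(beta1_hat, beta0_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta1_hat, beta0_hat)   # should land near 1.5 and -2.5
```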

So, What Does this Result Say?
It doesn't look bad; in fact, it's almost the correct one. But can I really rely on the answer linear regression gives in regions where there is little data? I don't think so. I would like a measure that basically says,
Hey, I’ve seen a lot of data in this area before, so I’m pretty confident about my prediction.
or
Umm, this point lies somewhere I haven’t seen much data. The answer is probably this, but I’m not so sure.
Something like this.

You can see how the confidence bounds widen (and thus the uncertainty of the answer increases) where there is no data. But linear regression can't give you this, and this is why we need Bayesian linear regression. So how can we use the superior "Bayesian" healing to fix this problem?
First things first! The holy grail of all the Bayesian "stuff"
Bayes Rule
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
The probability of event A happening given that event B happened (the posterior – the thing you're interested in but don't know) is given by the probability of event B happening given that A happened (the likelihood), multiplied by the probability of A happening (the prior), divided by the probability of B happening (the evidence).
How Does Bayes' Theorem Relate to this Problem?
That's nice and we've heard this a million times! What I'd like to know is how this relates to machine learning. For that, let A be the parameters of the learning model (i.e. β0 and β1), denoted by θ, and let B be your data, D. Now let's interpret each entity in the Bayes rule.
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
By solving this we get a joint distribution over all the parameters in θ (β0 and β1) given the data, which is exactly what we need. In other words, P(θ|D) will tell us things like: given the data, β0=0.5 and β1=2.0 with probability 0.5, and β0=1.5 and β1=3.0 with probability 0.1. So we know that β0=1.5, β1=3.0 is not completely out of this world (whereas the frequentist method effectively commits to a single point estimate)! This is called the posterior distribution.
And we calculate this by calculating
- P(D|θ): How likely for us to see the desired data if we have parameters θ in our model
- P(θ): Our prior belief about where the parameters θ might lie. The closer this is to the true posterior, the better, and the quicker you will find the correct posterior
- P(D): This is a constant value and represents the probability of observing data
Prior P(θ): Our Belief of What the Parameters Look Like
We need to start with some parameter values, right? But in a Bayesian setting you don't specify things with a single value; instead, you speak in terms of distributions (e.g. a Gaussian/normal distribution). In our example we might say,
Specify Parameters with Probability Distributions
Hey, I believe the parameter β0 can be represented by a normal with mean 0 and standard deviation 3. That is,
$$\beta_0 \sim \mathcal{N}(0, 3^2)$$
Similarly, for β1 you might say,

If we sample many values of beta, we get closer to the true normal distribution. As you can see, for both β0 and β1 a lot of values have been sampled close to 0, but β1 is more squashed than β0 (i.e. β1 is more concentrated near 0 than β0 is).
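As a quick sketch of what sampling from these priors might look like in code (note: the standard deviation of 1.0 used for β1 is my assumption, picked only so that its prior is "more squashed" than β0's):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 15000   # same number of samples used for the posterior later on

# beta0 ~ N(0, 3^2) as stated above; the std of beta1 (1.0) is assumed here,
# chosen to be smaller so its prior is narrower than beta0's
beta0_samples = rng.normal(loc=0.0, scale=3.0, size=n_samples)
beta1_samples = rng.normal(loc=0.0, scale=1.0, size=n_samples)

print(beta0_samples.std(), beta1_samples.std())   # roughly 3.0 and 1.0
```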

Why Normal Distributions?
Why do we use normal distributions, you ask? The normal distribution has very nice analytical properties; in fact, with a normal prior and Gaussian noise, the posterior itself is normal, so we have an analytical solution for its mean and variance.
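For completeness, here is the standard conjugate result, stated without derivation (we will not actually use it below). With a Gaussian prior θ ~ N(m₀, S₀) over θ = (β1, β0), a design matrix X (a column of x values and a column of ones) and Gaussian noise of known variance σ², the posterior is again Gaussian:

$$P(\theta \mid D) = \mathcal{N}(\theta;\, m_N,\, S_N), \qquad S_N = \left(S_0^{-1} + \frac{1}{\sigma^2} X^\top X\right)^{-1}, \qquad m_N = S_N\left(S_0^{-1} m_0 + \frac{1}{\sigma^2} X^\top y\right)$$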
Why is a good prior (P(θ)) so important?
Because your posterior depends on it. That is, the closer your prior is to the posterior, the quicker you'll reach the true posterior. To see why, say your prior and posterior are identical; then, when you sample from the prior, you'll actually be sampling from the posterior (which is what we need).
Now off to the next important bit; the likelihood of seeing data.
Likelihood P(D|θ): How Likely to See Data for a Given θ
How can we calculate the likelihood of our data for a given set of parameters? In other words, we want to calculate,
P(Y|X,θ)
Below, we will do a step-by-step derivation to arrive at an equation we can actually compute for this entity.
Calculating the Likelihood: P(Y|X,θ)
1. Write Dataset (Y, X) in terms of Individual Data Samples (y and x)
Now let’s try to develop an equation for the above term.
$$P(Y \mid X, \theta) = \prod_{i=1}^{N} P(y_i \mid x_i, \beta_1, \beta_0)$$
This is true because of the assumption,
- Data points (xi,yi) are i.i.d (independent and identically distributed)
2. Write y in terms of x, β1, β0 and ε
Now let’s recall what our data looks like,
$$y_i = \beta_1 x_i + \beta_0 + \varepsilon$$
We can plug this into the likelihood equation to get,
$$P(Y \mid X, \theta) = \prod_{i=1}^{N} P(\beta_1 x_i + \beta_0 + \varepsilon \mid x_i, \beta_1, \beta_0)$$
3. Using the Conditioned Variables to Isolate ε
We have x, β1 and β0 all given, so we can remove them from inside the probability to get,
$$P(Y \mid X, \theta) = \prod_{i=1}^{N} P(\varepsilon \mid x_i, \beta_1, \beta_0)$$
Think about this transformation this way: if I need P(A+B|B), it is essentially the same as P(A|B), because B is "given". Once we condition on B, it is a known constant, so adding it introduces no extra randomness; it only shifts where we evaluate the density, leaving P(A|B) unharmed.
4. Assumption: Noise is Independent of Data
Now we are off to the next important assumption,
- Noise is independent of the data
Which gives us,
$$P(Y \mid X, \theta) = \prod_{i=1}^{N} P(\varepsilon)$$
5. Write Probability with PDF
Remember that our noise is normally distributed and we can get the probability density function (PDF) of a normal as,
$$P(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(\varepsilon - \mu)^2}{2\sigma^2}\right)$$
6. Assumption: Zero-Mean Constant Variance Noise
In linear regression, we make two important assumptions about the noise component. That is, it is normally distributed and,
- It’s zero-mean
- It has a constant variance
With these two assumptions, and plugging in the values of noise to the equation we get,
$$P(Y \mid X, \theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right)$$
And we have that,
$$\varepsilon = y_i - (\beta_1 x_i + \beta_0)$$
7. Final Likelihood Equation
So let’s go and plug that in,
$$P(Y \mid X, \theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\big(y_i - (\beta_1 x_i + \beta_0)\big)^2}{2\sigma^2}\right)$$
Yup, all done! It's a simple series of transformations, coupled with assumptions, that gets us to a computable expression for the likelihood function.
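To make that concrete, here is a minimal sketch of the likelihood in code. In practice we sum log-probabilities instead of multiplying raw probabilities (a product of many small numbers underflows), and the noise standard deviation σ = 2.0 is just an assumed value:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(x, y, beta1, beta0, sigma=2.0):
    """log P(Y | X, beta1, beta0): each y_i ~ N(beta1*x_i + beta0, sigma^2).

    Summing log-densities is the numerically safe version of the product above;
    sigma (the noise standard deviation) is assumed known here.
    """
    mu = beta1 * np.asarray(x) + beta0
    return norm.logpdf(np.asarray(y), loc=mu, scale=sigma).sum()
```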
Making Notation Simpler
To make things simpler from this point onwards we say,

so that,

Intuition: How Does the Likelihood Look for β1 if y = β1 x?
We have two parameters in our example. So to make things simple, forget about β0 for the moment. Say we have y and x being generated according to
$$y = 4x$$
Now we try to approximate this by trying a bunch of different β1 values, each producing an estimate of y,
$$\hat{y} = \beta_1 x$$
for a range of β1 values. The likelihood P(y|x,β1) would look as follows.

You can see that close to β1 = 4 we have a high likelihood of seeing the data, which is what we expect.
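A small sketch of how such a curve can be produced (the data here is generated with an assumed true β1 = 4 and unit-variance noise, just to match the intuition above):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.RandomState(1)
x = rng.uniform(-3, 3, size=100)
y = 4.0 * x + rng.normal(0.0, 1.0, size=100)   # assumed: true beta1 = 4

# Evaluate log P(y | x, beta1) on a grid of candidate beta1 values
beta1_grid = np.linspace(0.0, 8.0, 201)
log_liks = np.array([norm.logpdf(y, loc=b1 * x, scale=1.0).sum() for b1 in beta1_grid])

print(beta1_grid[log_liks.argmax()])   # peaks close to 4
```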
More Intuition: Likelihood for Our Previous Example
And you can generalize this to any number of β values (β1 and β0 values in our example). This is the graph we get in the example.

What does the Above Graph Say?
The above graph says that when β0 is close to -2.5 and β1 is close to 1.5, we get the best chance of seeing the data X, Y. Also note how the likelihood for β1 is a sharp spike, whereas for β0 it is more spread out. Now, as an exercise, think about what this means in terms of the impact each parameter has on the accuracy of the model.
Hopefully, this clarified things enough. Let's move on to the posterior.
Posterior P(θ | D): What We Ultimately Need?
The posterior is the probability density function of the parameters (i.e. β0 and β1) given that you observed your data. Quickly recall that this is the final result we want (the left side of the Bayes rule). To calculate it we need P(Y|X,β1,β0), P(β1,β0) and P(X,Y), which are put together to produce,
$$P(\beta_1, \beta_0 \mid X, Y) = \frac{P(Y \mid X, \beta_1, \beta_0)\, P(\beta_1, \beta_0)}{P(X, Y)}$$
Now the entity P(X,Y) is a constant: whatever data we have stays the same no matter what we do with our model, so it won't change. So I'm going to replace P(X,Y) with a constant Z, and we'll discuss how to calculate Z later.
$$P(\beta_1, \beta_0 \mid X, Y) = \frac{1}{Z}\, P(Y \mid X, \beta_1, \beta_0)\, P(\beta_1, \beta_0)$$
In other words,
The posterior is just a weighted prior, where the weight is the likelihood of seeing the data for a given parameter value drawn from the prior
So there are two ways of solving this.
- Get the analytical solution of the posterior
- Solve the equation by sampling many β1 and β0 values and approximating the posterior
We Are Going to Use Sampling, Because We Want to Learn Things Hands-On
We are interested in how the posterior behaves with respect to the likelihood and the prior, so getting the answer from the analytical solution would be cheating for us. So let us calculate it by sampling.
So we will formulate the above equation to fit sampling.

Then we approximate the full posterior by taking 15,000 different values for β0 and β1, drawn from their respective priors, and calculating the likelihood for each of those (β0, β1) combinations. More concretely, for our example,

So we solve,

we have the posterior. Since the data is i.i.d. and β1 and β0 are independent, we can write the above entity as,
$$P(\beta_1, \beta_0 \mid X, Y) = \frac{1}{Z} \left[\prod_{i=1}^{N} P(y_i \mid x_i, \beta_1, \beta_0)\right] P(\beta_1)\, P(\beta_0)$$
And by applying the notation simplification we discussed, we get,

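Putting the pieces together, here is a sketch of the unnormalised posterior evaluated at every sampled (β1, β0) pair. It reuses the hypothetical `log_likelihood` helper, the toy `x`, `y` data and the `beta0_samples`/`beta1_samples` arrays from the earlier sketches (including the assumed prior standard deviations), works in log space for numerical stability, and leaves the normalising constant Z for the next section:

```python
import numpy as np
from scipy.stats import norm

# Unnormalised log-posterior at each sampled pair:
# log P(D | beta1, beta0) + log P(beta1) + log P(beta0)
log_unnorm_posterior = np.array([
    log_likelihood(x, y, b1, b0, sigma=2.0)
    + norm.logpdf(b1, loc=0.0, scale=1.0)   # assumed prior std for beta1
    + norm.logpdf(b0, loc=0.0, scale=3.0)   # prior std for beta0 as stated
    for b1, b0 in zip(beta1_samples, beta0_samples)
])
```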
Evidence (Z = P(D)): Making the Posterior Sum up to 1
Z = P(D) = P(X,Y)
P(D) is sometimes called the evidence and denotes the probability of observing the data. It is given by,
$$P(D) = P(X, Y) = \int_{\theta} P(Y \mid X, \theta)\, P(\theta)\, d\theta$$
How do we calculate Z?
How can we approximate this? We just sum P(D|β1,β0)P(β1,β0) (i.e. the top part of the Bayes rule) over all the different sampled combinations of β1 and β0. The more we sample, the closer we get to the true probability of observing the data, P(D). More specifically, we can calculate Z by,
$$Z \approx \sum_{i=1}^{M} \left[\prod_{j=1}^{N} P(y_j \mid x_j, \beta_1^{(i)}, \beta_0^{(i)})\right] P(\beta_1^{(i)})\, P(\beta_0^{(i)})$$
where M is the number of sampled (β1, β0) pairs. This is nothing to be scared of; it is just adding up the top part of the Bayes rule, evaluated at each sampled β1(i) and β0(i), for every one of the M samples. And Z is there so that the area under the posterior adds up to 1 (you know, because it is a probability distribution).
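Continuing the sketch above, Z is just the sum of those unnormalised values over all sampled pairs. Subtracting the maximum before exponentiating is only for numerical stability; it rescales Z but leaves the normalised posterior untouched:

```python
# Exponentiate safely and normalise so the discrete posterior sums to 1
w = np.exp(log_unnorm_posterior - log_unnorm_posterior.max())
Z = w.sum()
posterior_weights = w / Z

print(posterior_weights.sum())   # 1.0
```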
The Prior, Likelihood, Posterior Together for Our Example
Below we show the prior, likelihood and posterior all in one place to show how they look.

What does the Graph Say?
I can make two important observations.
- You can see that since our prior is slightly off from the true posterior, the likelihood actually pulls the posterior away from the prior. You will see this pulling force diminish as the prior gets closer to the true posterior
- The next observation is that the parameter β1 seems to be more sensitive than β0. In other words, a change in β1 can impact the accuracy of the results more than the same change in β0.
And finally, we compare lines sampled from the prior distribution (only the first 500 samples) with lines sampled from the posterior distribution. I'll leave it up to you to judge which is better.

How do I get the Answer for a New Data Point?
This is quite straightforward now that we have a posterior distribution over β1 and β0. You just sample different β1 and β0 values from the posterior and compute y (i.e. y = β1 x + β0) for the given x, for each of those β1 and β0. With this set of candidate y values for x, you can calculate the mean and the standard deviation of y, as sketched below.
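As a sketch (reusing the hypothetical `posterior_weights`, `beta1_samples` and `beta0_samples` from above), one way to do this is to resample parameter pairs according to their posterior weights and look at the spread of the resulting y values:

```python
import numpy as np

def predict(x_new, beta1_samples, beta0_samples, posterior_weights, n_draws=2000, seed=0):
    """Mean and std of y = beta1*x_new + beta0 under the (approximate) posterior."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(posterior_weights), size=n_draws, p=posterior_weights)
    y_candidates = beta1_samples[idx] * x_new + beta0_samples[idx]
    return y_candidates.mean(), y_candidates.std()

# e.g. predictive mean and uncertainty at x = 2.0
mean_y, std_y = predict(2.0, beta1_samples, beta0_samples, posterior_weights)
```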
Jupyter notebook is available here:
https://github.com/thushv89/exercises_thushv_dot_com/blob/master/bayesian_linear_regression.ipynb
Final Remarks
So this was a bit long, but hopefully it demystified some of the "dark secrets" Bayesian methods hold. We saw how Bayesian linear regression works through an example task: Y = β1 X + β0 + ε.
We first saw that the frequentist approach solves the problem but ignores some critical information: the uncertainty of an answer. The Bayesian method, on the other hand, not only gives the most likely answer but also tells us how uncertain it is about that answer.