Bayesian Statistics Overview and your first Bayesian Linear Regression Model

A brief recap of Bayesian learning, followed by the implementation of a Bayesian Linear Regression model on the NYC Airbnb open dataset

Akashkadel
Towards Data Science


Hello there! Welcome to my first article where I will talk briefly about Bayesian Statistics and then walk you through a sample Bayesian Linear Regression model.

Currently, most of the Machine Learning problems we work on are addressed by frequentist solutions, and applications of Bayesian learning are relatively rare. Hence, in this post, I wanted to take up a very common topic in ML, Linear Regression, and show how it can be implemented using the Bayesian approach.

When I first started researching this, I had many questions: when is it beneficial to use the Bayesian approach, how does the output differ from its non-Bayesian (frequentist) counterpart, how do you define a prior distribution, are there existing libraries in Python for estimating the posterior distribution, and so on. I attempt to answer all of these questions in this post, while keeping it brief.

1. Bayesian Recap

1.1 What is Bayesian Learning and how is it different from Frequentist statistics

Frequentist and Bayesian are two different schools of statistics. Frequentist is the more classical version, which, as the name suggests, relies on the long-run frequency of events (data points) to calculate the variable of interest. Bayesian, on the other hand, can work without a large number of events (in fact, it can work with even one data point!). The cardinal difference between the two is that frequentist statistics gives you a point estimate, whereas Bayesian statistics gives you a distribution.

Having a point estimate means that "we are certain that this is the output for this variable of interest". Having a distribution, on the other hand, can be interpreted as "we believe the mean of the distribution is a good estimate for this variable of interest, but there is uncertainty too, in the form of the standard deviation".

So, when is it useful to go Bayesian? The Bayesian approach is useful in ML tasks where you care about both the estimate and the certainty around it. For example, if you want to know whether it is going to rain today, an output like "it might rain with 60% probability" is a more intuitive response than just "it will rain". The latter response doesn't convey how confident the model is in its prediction.

1.2 Bayesian Philosophy:

The main formula underlying the Bayesian approach is Bayes' theorem. It is a simple formula that helps us calculate the conditional probability of an event A given an event B:

    P(A|B) = P(B|A) · P(A) / P(B)

1.2.1 Terminology:

P(A|B) is called the posterior probability: the distribution that we wish to compute

P(B|A) is called the likelihood: assuming that event A has occurred, how likely is event B to occur?

P(A) is called the prior: our initial guess of the variable of interest

P(B) is called the evidence: the likelihood of event B occurring. Note: this is generally intractable to calculate and is usually not computed while estimating the posterior probability

1.2.2 Applications

Some common situations where Bayes' theorem can be applied are:

  • What is the probability that it will rain today (event A) given that it is cloudy today (event B)
  • What is the probability that Manchester United will win today (event A) given that Ronaldo is not playing (event B)
  • What is the probability that the coin is biased (event A) given that we have seen 3 heads and 1 tail in 4 coin tosses (event B).

1.3 Frequentist vs Bayesian: a simple example:

Consider the most common example used to contrast Frequentist and Bayesian statistics: evaluating bias in a coin toss.

In the Bayesian formulation, this would translate as:

Event A = coin is biased (say towards heads) or not

Event B = sample coin tosses (empirical data)

P(A|B) = to be computed!

P(A) = prior probability (here, let's say we assume nothing about the bias of the coin, so it is uniformly distributed between 0 and 1)

Using the above terminology and this assumption for the prior, let us calculate the posterior probability over 500 coin tosses. The posterior probability is computed after every coin toss, and the output can be seen in the figure below. The posterior calculated at each iteration becomes the prior for the next one. The dotted red line is the frequentist output after every trial.

Image by Author

From the above plot, you can see that the distribution quickly transforms from uniform to Gaussian, with mean around 0.5. Another point to note is that, with every iteration, the bell curve only becomes thinner, indicating a reduction in variance. This means that it still encodes a small amount of uncertainty, whereas the frequentist approach gives only a single value.
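
Because the Uniform(0, 1) prior is a Beta(1, 1) distribution, this particular posterior even has a closed form: after h heads in n tosses it is Beta(1 + h, 1 + n - h). Below is a minimal sketch that reproduces the behavior in the plot; the simulated fair coin is an assumption:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    tosses = rng.random(500) < 0.5          # simulate 500 fair-coin tosses

    # Uniform(0, 1) prior == Beta(1, 1); the posterior after h heads in
    # n tosses is Beta(1 + h, 1 + n - h)
    for n in (1, 10, 100, 500):
        h = tosses[:n].sum()
        posterior = stats.beta(1 + h, 1 + n - h)
        print(f"n={n:3d}  posterior mean={posterior.mean():.3f}  "
              f"sd={posterior.std():.3f}  frequentist estimate={h / n:.3f}")

The shrinking standard deviation in the printout mirrors the thinning bell curve above.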

2. Bayesian Linear Regression:

While I was exploring Bayesian alternatives to common ML solutions for regression tasks, the application that struck me the most was Bayesian Linear Regression. I have been applying it to some of my ML problems and thought I would share a glimpse of it here. Before diving deep, below is a short recap of ordinary Linear Regression.

2.1 Linear Regression recap:

Linear Regression tries to establish a linear relationship between the response variable and the input variables. This is best described by the formula below:
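
    y = β_0 + β_1·x_1 + β_2·x_2 + … + β_n·x_n + ε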

In the above equation, the x_i's are the input features and the β_i's are the associated coefficients. ε corresponds to the statistical noise in the equation (whatever is not accounted for by the linear relationship between x and β). The equation above is the most general form of regression. The coefficients are calculated by minimizing a loss function (usually the L2 loss), which is why this is also called the Ordinary Least Squares (OLS) algorithm. Since we have only a single estimate for the coefficients, we end up with only one value for the response variable (y).

2.2 Bayesian Linear Regression:

From the Bayesian perspective, the linear regression equation is written in a slightly different way, such that there is no single estimate for the coefficients. This mainly comprises two steps:

  • Computing the posterior distribution of β (i.e., P(β|X, y))
  • Computing the response variable using the posterior distribution

2.2.1 Computing Posterior distribution of β

Taking inspiration from Bayes' theorem, we have:
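
    P(β|X, y) = P(X, y|β) · P(β) / P(X, y)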

Since the evidence term is intractable to calculate, this can be written as:
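
    P(β|X, y) ∝ P(X, y|β) · P(β)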

Looking at the above equation, we can see that we need to supply two things: the likelihood and the prior. This is not very straightforward.

2.2.1.1 Prior selection:

Usually, the selection of the prior distribution is based on domain knowledge. The prior, in literal terms, is what we believe about the unknowns. In most cases, we have some knowledge about the unknowns, which makes it easier to assign a particular distribution to them. In the worst case, we can extract some knowledge from a literature survey, or use a uniform distribution. In the coin toss example, we used a uniform prior, assuming we didn't know anything about the bias in the coin. This prior would be given as:
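
    bias ~ Uniform(0, 1)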

Similarly, we would assign priors for the other unknowns as well (ε, σ).

2.2.1.2 Likelihood:

The likelihood term is given by P(X, y|β). Since X is constant and doesn't depend on β, the likelihood term can be re-written as P(y|β). The most common approach is to say that y follows a normal distribution with the OLS prediction as its mean. Hence the likelihood term is given by:
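
    y ~ Normal((β.T * X) + ε, σ²)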

Now, if we plug both the prior and the likelihood terms into the Bayes equation, things start to look really complicated: computing a posterior out of the product of multiple probability density functions can be very daunting. However, researchers have developed various techniques to solve this. One very popular and successful family of methods is Markov Chain Monte Carlo (MCMC) algorithms. MCMC computes an approximation to the actual posterior distribution, so when you draw samples from this approximate distribution, you are essentially drawing from the true (or near-true) distribution of the unknowns (in our case β, ε, σ). We can use these drawn samples to calculate other metrics of interest like the mean, standard deviation, etc. I won't go in depth into the algorithm, but I have included links in the references if you are interested in learning more about it.
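
To make the recipe concrete, here is a bare-bones random-walk Metropolis sampler for the coin-bias posterior from section 1.3. This is only an illustration of the general idea, not what pymc3 does internally (pymc3 defaults to more sophisticated samplers such as NUTS):

    import numpy as np

    rng = np.random.default_rng(0)
    tosses = rng.random(100) < 0.5          # 100 simulated fair-coin tosses
    h, n = int(tosses.sum()), tosses.size

    def log_post(p):
        # log Uniform(0, 1) prior + log Bernoulli likelihood, up to a constant
        if not 0.0 < p < 1.0:
            return -np.inf
        return h * np.log(p) + (n - h) * np.log(1.0 - p)

    p, samples = 0.5, []
    for _ in range(10000):
        proposal = p + rng.normal(0, 0.05)   # symmetric random-walk proposal
        # accept with probability min(1, posterior ratio)
        if np.log(rng.random()) < log_post(proposal) - log_post(p):
            p = proposal
        samples.append(p)

    samples = np.array(samples[1000:])       # discard burn-in
    print(samples.mean(), samples.std())     # approximate posterior mean / sd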

2.2.2 Computing the response variable using the posterior distribution

As mentioned before, the output of the Bayesian Linear Regression model is a distribution rather than a single estimate. Hence, one way to think about it is as a Gaussian distribution with mean (β.T * X) + ε and some variance σ². Again, this is represented as:
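
    y ~ Normal((β.T * X) + ε, σ²)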

[Note that we will use the mean of the posterior distribution calculated in the previous step for all the unknowns]

Now, to make a prediction for a given data point x_i, we can draw samples from the above normal distribution by replacing X with x_i.

3. Implementing Bayesian Linear Regression on NYC Airbnb dataset:

[Note: the code for this entire section is available here github / colab ]

Let's try to implement all that we have learnt on a public dataset: the NYC Airbnb data. This dataset contains pricing information for renting out an entire apartment in different boroughs of New York. A small preview of the dataset is shown below:
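
One way to load it and peek at the relevant columns; the file name AB_NYC_2019.csv and the column names are assumptions based on the public Kaggle release of this dataset (the exact preprocessing is in the linked notebook):

    import pandas as pd

    # Assumption: the Kaggle "New York City Airbnb Open Data" file
    df = pd.read_csv("AB_NYC_2019.csv")
    # keep only entire-home/apt listings, as described above
    df = df[df["room_type"] == "Entire home/apt"]
    print(df[["neighbourhood_group", "neighbourhood", "price"]].head())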

The distribution of prices in the different boroughs is:

Image by Author

You can see that the price varies a lot within each borough. Unsurprisingly, Manhattan has much higher variance. This is because some neighborhoods, like the Upper West Side and West Village, can see very different (more expensive) pricing for their apartments (shown below):

Image by Author

Now, let's say you own an apartment in Chinatown, Manhattan, and you want to put it up on Airbnb. What price would you quote for it? (Assuming that Airbnb isn't quoting the price for you.)

There could be many solutions to this, but one of them is Bayesian Linear Regression, as it gives us a range of values. Having this range helps you understand the distribution of prices well enough to make an informed selection.

3.1 OLS Prediction:

Using scikit-learn to implement OLS gives us an estimate of $227 for this apartment in Chinatown, Manhattan; a minimal sketch of the fit is shown below. Let's also compare this estimate with other neighborhoods:
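
This sketch assumes the neighborhood is one-hot encoded as the only feature; the exact feature engineering is in the linked notebook:

    from sklearn.linear_model import LinearRegression
    import pandas as pd

    # Assumption: one-hot encoded neighborhoods as the only features
    X = pd.get_dummies(df["neighbourhood"])
    y = df["price"]

    ols = LinearRegression().fit(X, y)

    # single point estimate for a Chinatown listing
    chinatown = X[df["neighbourhood"] == "Chinatown"].head(1)
    print(ols.predict(chinatown))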

[To keep things simple, let's focus on only a few of the neighborhoods: Midtown, East Harlem, Chinatown, Upper West Side, NoHo]

Image by Author

Looking at the above graph, we can see that Chinatown has significant variance in its prices (between $65 and $1,500). Other neighborhoods, like the Upper West Side and Midtown, have even higher variance than Chinatown. Hence, relying on just a single estimate might not seem totally right.

3.2 Coding Bayesian Linear Regression:

For coding the Bayesian Linear Regression, I will be using the pymc3 package.

Just to recall, the response variable is defined as:
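
    y ~ Normal((β.T * X) + ε, σ²)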

Now, to get started, we first have to assign prior distributions for the three unknowns: β, ε, σ.

As I have normalized the data, I will use normal distributions with mean 0 and a slightly larger standard deviation for these parameters.

The code using pymc3 will be:
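
A sketch of the model definition; the variable names and prior widths are illustrative, and a HalfNormal prior is used for σ since it must be positive (the exact code is in the linked notebook):

    import numpy as np
    import pymc3 as pm

    # X: normalized feature matrix, y: normalized prices (see earlier sketch)
    X_data = np.asarray(X, dtype=float)
    y_data = np.asarray(y, dtype=float)

    with pm.Model() as model:
        # zero-mean normal priors with a generous standard deviation
        beta = pm.Normal("beta", mu=0, sigma=10, shape=X_data.shape[1])
        epsilon = pm.Normal("epsilon", mu=0, sigma=10)
        # sigma must be positive, hence a HalfNormal prior
        sigma = pm.HalfNormal("sigma", sigma=10)

        # expected value of the response: (beta.T * X) + epsilon
        mu = pm.math.dot(X_data, beta) + epsilon

        # likelihood: y ~ Normal(mu, sigma)
        y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y_data)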

Now, to utilize the built-in inference algorithms in pymc3 for estimating the posterior distribution, run the code below:
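
A minimal version; the numbers of draws and tuning steps here are illustrative:

    with model:
        # 2,000 posterior draws after 1,000 tuning steps
        trace = pm.sample(2000, tune=1000)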

The ‘sample’ function is responsible for automatically assigning an appropriate algorithm: for example, NUTS (a Hamiltonian Monte Carlo method) for continuous variables, Metropolis for discrete variables, Binary Metropolis for binary variables, etc. You can find more details on the inference step here.

You can now plot the posterior distribution of the parameters using the below code:
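
For example, pymc3's traceplot shows the marginal density and the sampling trace for each unknown:

    with model:
        pm.traceplot(trace)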

Image by Author

The β parameter has multiple plots, corresponding to the marginal distribution in each dimension. (To have a detailed look at all the features, please refer to the code [github] [colab].)

Next, you can use the computed distributions of β, ε, and σ to plot a normal distribution for a particular data point x_i.
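
A sketch of this step, using the posterior means from the trace (assuming pymc3's default MultiTrace return); x_i is a placeholder for the (normalized) feature vector of the listing:

    import numpy as np

    # posterior means of the unknowns
    beta_mean = trace["beta"].mean(axis=0)
    eps_mean = trace["epsilon"].mean()
    sigma_mean = trace["sigma"].mean()

    # y ~ Normal((beta.T * x_i) + epsilon, sigma^2): draw price samples
    price_samples = np.random.normal(
        loc=np.dot(x_i, beta_mean) + eps_mean,
        scale=sigma_mean,
        size=5000,
    )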

Using the above formula, the distribution of prices for listing the apartment in the Chinatown neighborhood of Manhattan will look like:

Image by Author

The distribution will look different for different neighborhoods. Also, the variance will shrink or grow depending on how much data we have. The above distribution was computed using only ~3K data points.

Having a distribution like above would help us understand the price market better and help us to make an informed decision.

3.3 What happens when N is large:

Let N denote the number of instances of evidence we possess. As we collect more and more evidence, i.e., as N → ∞, our Bayesian results (often) align with frequentist results. Hence, for an infinitely large N, statistical inference is similar under both the Bayesian and the frequentist approach. For small N, on the other hand, inference is much less stable: frequentist estimates have higher variance. This is where Bayesian results have an advantage.

With our ~3,000 data points, the frequentist estimate and the mean of the Bayesian posterior are, accordingly, not very different.

Conclusion:

This post was just an introduction to Bayesian Linear Regression, along with a recap of some Bayesian concepts. As a Data Scientist, it is good to know different approaches to the same problem and how they compare with each other. I was personally thrilled when I started exploring Bayesian solutions to some of the common ML tasks we work on, and I felt compelled to share some of that here.

Secondly, this post should not be perceived as claiming Bayesian > Frequentist. They are different approaches, and it would be incorrect to claim that one is better than the other; it mainly depends on your use case. Cassie Kozyrkov goes in depth into this comparison, and I would highly recommend her article if you are interested.

Thanks for Reading!

I greatly appreciate you reading through the entire article. If you have any questions or thoughts, or want to share constructive criticism, you are more than welcome to reach out to me at @Akashkadel. This is my first article, and I truly wish to improve.

References to learn more:

  1. Bayesian priors: https://fukamilab.github.io/BIO202/05-B-Bayesian-priors.html
  2. Comparison of posterior output based on different Prior selection: https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading15b.pdf
  3. Introduction to Markov Chain Monte Carlo methods: https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50
  4. Markov Chain Monte Carlo Video explanation: https://www.youtube.com/watch?v=yApmR-c_hKU&t=612s
  5. Another good article on Bayesian Linear Regression: https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7
  6. Pymc3 guide: https://docs.pymc.io/en/v3/pymc-examples/examples/getting_started.html
