Linear Regression Model Selection through Zellner’s g prior

Miguel Trejo
Towards Data Science
Jul 27, 2020


Credit: Baggeb on Pixabay

Linear Regression is a building block for more complex models, widely used because of its simplicity and interpretability.

Often the classical approach to model selection in linear regression comes down to choosing the model with the highest goodness-of-fit score, or to finding the right balance between complexity and goodness of fit through the Akaike Information Criterion. By contrast, Bayesian inference relies heavily on distributions: every estimate comes with a whole probability distribution rather than a single number.

In this article, we’ll use Zellner’s g prior to perform model selection. Additionally, while it is not common to build an ML model with both R and Python, we’ll take advantage of the R packages BAS and LearnBayes to illustrate the computations.

Unraveling the Bayesian regression model

In multiple linear regression, we’re interested in describing the variability of an objective variable y through a set of covariates x₁, …, xₙ. Each of these covariates will normally have a different contribution, given by the weight of its coefficient β₁, …, βₙ, that is,

y = α + β₁x₁ + … + βₙxₙ + ε,    ε ~ Normal(0, σ²),

where α is the intercept (the value of y when all covariates are zero) and we hold the assumption of equal variances σ². Under the classical least-squares approach, one estimates the values of the unknown parameters (α, βᵢ, σ²) through Maximum Likelihood Estimation. A Bayesian approach, however, lets you treat each of the coefficients as a random variable. The goal is then to obtain a probability model for the values these coefficients could take, with the advantage of also quantifying the uncertainty we have about those values.

To achieve this, the Bayesian approach is very flexible. First, one sets a prior belief on what these values could be; that is, before seeing the data y we express our knowledge about the values of the parameters. For example, let’s assume that we know nothing about them; then we can consider the noninformative prior

p(α, β, σ²) ∝ 1/σ²,

which reads: the joint distribution of (α, β, σ²) is proportional to the inverse of the variance σ². This is called the prior distribution. Notice that β is a vector with components β₁, …, βₙ.

Then, after observing the data y, we specify how it is described given the parameters and the covariates. Because we’re holding the linear regression assumptions, we have

y | x, α, β, σ² ~ Normal(α + βx, σ²),

that is, given x, α, β, σ², the data plausibly follow a normal distribution with mean α + βx and variance σ². This is called the likelihood. Notice that x is a vector with components x₁, …, xₙ.

Finally, Bayes’ Theorem states that the posterior distribution is proportional to the product of the likelihood and the prior distribution,

p(α, β, σ² | y) ∝ p(y | x, α, β, σ²) · p(α, β, σ²).

Knowing these components, you’ll probably notice that choosing the prior is a little bit subjective. For this reason, Zellner introduced a way of assessing how certain one is about the prior selection.

Zellner’s approach

In 1986, Arnold Zellner proposed a simple way of introducing subjective information into a regression model. The idea is that the user specifies a location for the beta coefficients; for instance, assume that the beta coefficients are all equal to one,

β₀ = (1, 1, …, 1).

One’s confidence in this guess is then reflected by a constant g that controls the amount of information in the prior relative to the data: Zellner’s prior places a normal distribution on the coefficients, centered at the guess and scaled by g,

β | σ² ~ Normal(β₀, g σ² (XᵀX)⁻¹),

where X is the matrix of covariates. Choosing a smaller value of g therefore means a stronger belief in the guess. Conversely, choosing larger values of g has an effect similar to the noninformative prior for (α, β, σ²): as g goes to infinity, the influence of the prior guess vanishes, see [2] for more details.

Model Selection through Zellner’s g prior

If we have n predictors for the response variable y, there are 2ⁿ possible regression models. Zellner’s g prior can be used to select the best model among the 2ⁿ candidates.
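
To make the size of this model space concrete, here is a quick Python sketch that enumerates every subset of four placeholder covariates (x1, …, x4 are just illustrative names):

from itertools import combinations

covariates = ["x1", "x2", "x3", "x4"]

# Every subset of the covariates (including the empty set) defines one candidate model.
models = [subset
          for size in range(len(covariates) + 1)
          for subset in combinations(covariates, size)]

print(len(models))  # 2**4 = 16 candidate regression models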

For this, assume that all the models are treated symmetrically: every model receives the same prior probability, and within each model we assign the beta coefficients a prior guess of 0 and the same value of g. Then, we can compare the regression models by calculating the prior predictive distribution (the distribution of the data y averaged over all possible values of the parameters),

p(y) = ∫ p(y | α, β, σ²) p(α, β, σ²) dα dβ dσ².

Notice that this integral can be hard to compute if it does not have a closed form or if it involves several dimensions. One approach to approximating the result is the Laplace method, see [6] and chapter 8 of [1] for more details.
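
To give a feel for what the Laplace method does, here is a small generic sketch in Python using scipy (an illustration of the idea only, not the routine LearnBayes uses internally): it maximizes the log integrand and then integrates the Gaussian obtained from a second-order expansion around the mode.

import numpy as np
from scipy.optimize import minimize

def laplace_log_integral(log_f, theta_init):
    # Approximate log ∫ exp(log_f(θ)) dθ by expanding log_f to second order around its mode.
    result = minimize(lambda t: -log_f(t), theta_init)  # BFGS also estimates an inverse Hessian
    mode, inv_hessian = result.x, result.hess_inv
    d = len(mode)
    _, logdet = np.linalg.slogdet(inv_hessian)
    return log_f(mode) + 0.5 * d * np.log(2 * np.pi) + 0.5 * logdet

# Sanity check on an unnormalised two-dimensional Gaussian kernel,
# where the exact answer is log(2π) ≈ 1.8379.
log_f = lambda t: -0.5 * np.sum(t ** 2)
print(laplace_log_integral(log_f, np.ones(2)))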

Once we have computed the value p(y) for each model m, we need a way to compare the models. For example, when comparing just two of them, we can calculate the ratio of their prior predictive densities,

BFⱼₖ = pⱼ(y) / pₖ(y).

This ratio is known as the Bayes factor. The larger its value, the stronger the support for model j over model k; you can read more about Bayes factors in chapter 8 of [1] and chapter 6 of [5].

Since we are comparing 2ⁿ models that all share the same prior probability, the posterior probability of each model is simply its prior predictive density normalized over all the candidates,

p(Mⱼ | y) = pⱼ(y) / Σₖ pₖ(y).
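
As a toy illustration with made-up numbers (not results from the dataset used later), this is how a handful of log prior predictive values would be normalized into posterior model probabilities under equal prior weights:

import numpy as np

# Hypothetical values of log p_m(y) for four candidate models (illustrative only).
log_marginals = np.array([-12.3, -10.8, -11.5, -14.0])

# With equal prior model probabilities, each posterior probability is the model's
# marginal likelihood divided by the sum; subtracting the max keeps the exponentials stable.
weights = np.exp(log_marginals - log_marginals.max())
posterior_probs = weights / weights.sum()

print(posterior_probs.round(3))  # the second model carries the largest share of posterior mass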

Finally, we will choose the model with the highest posterior probability. In the following sections, we’ll see this procedure in action with the aid of the R packages BAS and LearnBayes.

Code Example

To run R and Python, we’ll use the Docker image datascience-notebook from Jupyter. For an alternative approach using Anaconda, check this Stack Overflow question.

After installing Docker, run in your terminal

docker run -it -p 8888:8888 -p 4040:4040 -v D:/user/your_user_name:/home/jovyan/work jupyter/datascience-notebook

If this is the first time running this command, Docker will automatically pull the image datascience-notebook. Notice that one should enable file sharing to mount a local directory as a volume for the Docker container, see more here.

Inside Jupyter, we will be using the R packages LearnBayes and BAS, called from Python through rpy2 [7], to compute the linear regression models with g priors. Additionally, we’re using a Kaggle dataset on car crashes, which ships as one of seaborn’s example datasets.

Following the example in chapter 9 of [1], we will be using the bayes.model.selection function from the LearnBayes package (exposed as bayes_model_selection when imported through rpy2) to compute the posterior probability of each model.

Our objective variable y will be total, “the number of drivers involved in fatal collisions per billion miles”. For simplicity, we’ll take the first four variables as covariates.

The bayes_model_selection function is very intuitive in this case because we just need to provide the target variable, the covariates, and the belief g in the prior guess of β.

Notice that we’re choosing a value of g=100 because we’re not quite sure about the prior guess. You can freely play with this value, keeping in mind that the higher it is, the closer the estimates will be to the least-squares estimates.
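
The snippet below is a sketch of how this can be wired up from Python with rpy2, assuming the LearnBayes R package is installed in the container; small details of the rpy2 conversion API may differ between versions.

import seaborn as sns
import rpy2.robjects as ro
from rpy2.robjects.packages import importr

learnbayes = importr("LearnBayes")  # R's bayes.model.selection becomes bayes_model_selection

# Load the car crashes dataset shipped with seaborn.
crashes = sns.load_dataset("car_crashes")
covariates = ["speeding", "alcohol", "not_distracted", "no_previous"]

# Build plain R objects by hand: the response vector y and the covariate matrix X.
y = ro.FloatVector(crashes["total"].tolist())
X = ro.r["matrix"](
    ro.FloatVector(crashes[covariates].to_numpy().flatten().tolist()),
    nrow=len(crashes),
    byrow=True,
)
X = ro.r["colnames<-"](X, ro.StrVector(covariates))

# Posterior probabilities of all 2**4 models, with a prior guess of 0 for the betas and g = 100.
fit = learnbayes.bayes_model_selection(y, X, 100)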

By accessing mod.prob on the fitted model, we can visualize the results. In the code below, we’re sorting by decreasing posterior probability.
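
Continuing from the previous snippet, the mod.prob component can be pulled back into pandas and sorted; the exact column names are set by LearnBayes, so here the last column is assumed to hold the posterior probability.

import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# Convert the R data frame fit$mod.prob into a pandas DataFrame.
with localconverter(ro.default_converter + pandas2ri.converter):
    mod_prob = ro.conversion.rpy2py(fit.rx2("mod.prob"))

# Sort the 16 candidate models by decreasing posterior probability.
prob_column = mod_prob.columns[-1]
print(mod_prob.sort_values(by=prob_column, ascending=False).head())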

From the table, we see that the most probable model includes the covariates not_distracted and no_previous. Also, the individual variables with the highest contribution are no_previous and speeding.

What if there is a large number of covariates?

When dealing with a large number of covariates, and hence a large number of models, the Bayesian adaptive sampling algorithm [3] is a great alternative. It works by sampling models without replacement from the space of possible models. To illustrate it, we now consider all six numeric variables in the dataset as candidate covariates, which gives 2⁶ = 64 possible models to choose from.
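
Below is a sketch of the same workflow with the BAS package, again through rpy2 (bas.lm is exposed as bas_lm); we’re assuming the same g = 100 for the g prior and a uniform prior over models.

import seaborn as sns
import rpy2.robjects as ro
from rpy2.robjects import Formula, pandas2ri
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.packages import importr

bas = importr("BAS")

# Keep only the numeric columns: the response plus the six candidate covariates.
crashes = sns.load_dataset("car_crashes").drop(columns=["abbrev"])

# Hand the data frame over to R.
with localconverter(ro.default_converter + pandas2ri.converter):
    r_crashes = ro.conversion.py2rpy(crashes)

# Bayesian adaptive sampling over the 2**6 candidate models with a g prior on the coefficients.
fit = bas.bas_lm(
    Formula("total ~ ."),
    data=r_crashes,
    prior="g-prior",
    alpha=100,                # the g parameter
    modelprior=bas.uniform(),
    method="BAS",
)

print(ro.r["summary"](fit))  # top models and marginal inclusion probabilities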

Again, the model with the highest posterior probability (equivalently, the highest marginal likelihood, since every model has the same prior probability) is the most likely one: the model with the covariates alcohol, not_distracted, and no_previous.

Key Findings

To summarize, let’s outline some key findings:

  • While in classical statistics one gets a point estimate, in Bayesian Statistics one gets a probability distribution of the possible values the parameter can take.
  • The prior is the belief, the likelihood the evidence, and the posterior the final knowledge.
  • Zellner's g prior reflects the confidence one places in a prior guess.
  • When you have a large number of models to choose from, consider using the BAS algorithm.

Finally, we’ve seen that a Bayesian approach to model selection is as intuitive and easy to implement as the classical approach, while also giving you greater insight into the inner workings of your model.

Got any questions? Leave a comment. Thanks for reading and feel free to share if you like the article.

References

[1] Albert, Jim. Bayesian Computation with R 2nd Edition. Springer, 2009.

[2] Marin, J.M.; Robert, Ch. Regression and Variable Selection. URL: https://www.ceremade.dauphine.fr/~xian/BCS/Breg.pdf

[3] Clyde, Merlise; Ghosh, Joyee; Littman, Michael. Bayesian Adaptive Sampling for Variable Selection and Model Averaging. URL: https://homepage.divms.uiowa.edu/~jghsh/clyde_ghosh_littman_2010_jcgs.pdf

[4] Zellner, Arnold. On Assessing Prior Distributions and Bayesian Regression Analysis with g-Prior Distribution. 1986

[5] Ghosh, Jayanta; Delampady, Mohan; Samanta, Tapas. An Introduction to Bayesian Analysis: Theory and Methods. Springer, 2006.

[6] Easy Laplace approximation of Bayesian models in R. URL: https://www.r-bloggers.com/easy-laplace-approximation-of-bayesian-models-in-r/

[7] Rpy2 documentation. URL: https://rpy2.github.io/doc/latest/html/index.html

[8] The Ultimate Guide to Evaluation and Selection of Models in Machine Learning. Neptune.ai. URL: https://neptune.ai/blog/the-ultimate-guide-to-evaluation-and-selection-of-models-in-machine-learning
