How to Build a Bayesian Ridge Regression Model with Full Hyperparameter Integration

How do we handle the hyperparameter that controls regularization strength?

Ryan Burn
Towards Data Science


In this blog post, we’ll describe an algorithm for Bayesian ridge regression where the hyperparameter representing regularization strength is fully integrated over. An implementation is available at github.com/rnburn/bbai.

Let θ = (σ², w) denote the parameters for a linear regression model with weights w and normally distributed errors of variance σ².

If X represents an n×p matrix of full rank with p regressors and n rows, then θ specifies a probability distribution over possible target values y.

Suppose we observe y and assume y is generated from a linear model of unknown parameters. A Bayesian approach to inference seeks to quantify our belief in the unknown parameters θ given the observation.

Applying Bayes’ theorem, we can express the posterior in terms of the likelihood and the prior.
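Written out explicitly, with the marginal likelihood P(y) acting as a normalizing constant, Bayes’ theorem here reads

$$
P(\theta \mid y) \;=\; \frac{P(y \mid \theta)\,P(\theta)}{P(y)},
\qquad
P(y) \;=\; \int P(y \mid \theta)\,P(\theta)\,d\theta,
$$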

where we refer to

  • P(θ|y) as the posterior distribution
  • P(y|θ) as the likelihood function
  • P(θ) as the prior distribution

The prior distribution describes our belief of θ before observing y and the posterior distribution describes our updated belief after observing y.

Suppose the prior distribution can be expressed as

where h(⋅, η) denotes a family of probability distributions parameterized
by what we call a hyperparameter η.

Traditional approaches to Bayesian linear regression have used what are called conjugate priors. A family of priors h(⋅, η) is conjugate if the posterior also belongs to the family

Conjugate priors are mathematically convenient as successive observations can be viewed as making successive updates to the parameters of a family of distributions, but requiring h(⋅, η) to be conjugate is a strong assumption.

Note: See Appendix A for a more detailed comparison to other Bayesian algorithms.

We’ll instead describe an algorithm where

  1. Priors are selected to shrink w, reflecting the prior hypothesis that w is not
    predictive, and to be approximately noninformative for the other parameters.
  2. We fully integrate over hyperparameters so that no arbitrary choice of η is required.

Let’s first consider what it means for a prior to be noninformative.

How to Pick Noninformative Priors?

Note: The presentation of noninformative priors here closely follows chapter 2 of Box & Tiao’s book Bayesian Inference in Statistical Analysis.

Suppose y is data generated from a normal distribution of mean 0 but unknown variance. Let σ denote the standard deviation and let ℓ denote the likelihood function

Suppose we impose a uniform prior over σ so that the posterior is

and the cumulative distribution function is

where N is some normalizing constant.
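Concretely, for n observations drawn from a zero-mean normal distribution with standard deviation σ, these quantities follow directly from the normal model. The likelihood is

$$
\ell(\sigma \mid y) \;\propto\; \sigma^{-n}\exp\!\left(-\frac{y^\top y}{2\sigma^2}\right),
$$

so with a uniform prior over σ the posterior is proportional to the same expression, and the cumulative distribution function has the form

$$
F(\sigma_0) \;=\; N\int_0^{\sigma_0} \sigma^{-n}\exp\!\left(-\frac{y^\top y}{2\sigma^2}\right) d\sigma .
$$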

But now let’s suppose that instead of standard deviation we parameterize over variance. Making the substitution u=σ² into the cumulative distribution function gives us

Thus, we see that choosing a uniform prior over σ is equivalent to choosing the improper prior

over variance. In general, suppose u=φ(θ) is an alternative way of parameterizing the likelihood function, where φ is some monotonic, one-to-one, onto function. Then a uniform prior over u is equivalent to choosing a prior

over θ.
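To make the change of variables explicit: with a uniform prior over σ and u = σ², the induced density over u is

$$
p_u(u) \;=\; p_\sigma(\sigma)\left|\frac{d\sigma}{du}\right| \;\propto\; \frac{1}{2\sqrt{u}} \;\propto\; u^{-1/2},
$$

which is the improper prior over variance referred to above. More generally, a uniform prior over u = φ(θ) corresponds to the prior

$$
p(\theta) \;\propto\; \left|\varphi'(\theta)\right|
$$

over θ.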

So, when does it make sense to use a uniform prior if the choice is sensitive to parameterization?

Let’s consider how the likelihood is affected by changes in the observed value y^⊤y.

Likelihood function for the standard deviation of normally distributed data with zero mean, n=10, and different values of y^⊤y.

As we can see from the figure, as y^⊤y is increased the shape of the likelihood function changes: it becomes less peaked and more spread out.

Observe that we can rewrite the likelihood function as

where

Thus, in the parameter log σ, the likelihood has the form

where

We say that the likelihood function is data translated in log σ because everything about the likelihood curve is known except its location; the observed value serves only to shift the curve.
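One way to write this out explicitly (following Box & Tiao’s treatment, with s² = y^⊤y/n as the sample scale) is

$$
\ell(\sigma \mid y) \;\propto\; \exp\!\big(-n(\log\sigma - \log s)\big)\,
\exp\!\Big(-\tfrac{n}{2}\,e^{-2(\log\sigma - \log s)}\Big),
$$

which depends on σ and the data only through the difference log σ − log s.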

Likelihood function for the log standard deviation of normally distributed data with zero mean, n=10, and different values of y^⊤y.

When the likelihood function is data translated in a parameter, it makes sense to use a uniform function for a noninformative prior. Before observing data, we have no reason to prefer one range of parameters [a, b] over another range of the same width [t + a, t + b], because the two ranges will be equivalently placed relative to the likelihood curve whenever the observed data translates the curve by t.

Now, let’s return to picking our priors.

Picking Regularized Bayesian Linear Regression Priors

For the parameter σ, we use the noninformative prior

which is equivalent to using a uniform prior over the parameter log σ. For w, we want an informative prior that shrinks the weights, reflecting a prior belief that weights are non-predictive. Let η denote a hyperparameter that controls
the degree of shrinkage. Then we use the spherical normal distribution with covariance matrix (σ/λ_η)² I

Note that we haven’t yet described how η parameterizes λ. Since we will also be integrating over η, we additionally need a prior P(η) for η, called a hyperprior.
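Collecting the pieces: the noninformative prior over σ is

$$
P(\sigma) \;\propto\; \frac{1}{\sigma},
$$

the shrinkage prior over the weights is

$$
w \mid \sigma, \eta \;\sim\; \mathcal{N}\!\left(0,\; (\sigma/\lambda_\eta)^2 I\right),
$$

and, as a natural reading of the setup (the exact factorization isn’t spelled out here), the joint prior factors as

$$
P(w, \sigma, \eta) \;=\; P(w \mid \sigma, \eta)\,P(\sigma)\,P(\eta).
$$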

Our goal is for the prior P(η) to be noninformative. So we want to know: in what parameterization would P(y|η) be data translated?

How to Parameterize the Shrinkage Prior?

Before thinking about how to parameterize λ, let’s characterize the likelihood function for λ.

Expanding the integrand, we have

where

and

Observe that the first factor is independent of w, so the integral with respect to w is equivalent to integrating over an unnormalized Gaussian.

Note: Recall the formula for a normalized Gaussian integral
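Specifically, for any p-dimensional vector m and any positive-definite p×p matrix A,

$$
\int_{\mathbb{R}^p} \exp\!\left(-\tfrac{1}{2}(w - m)^\top A\,(w - m)\right) dw
\;=\; (2\pi)^{p/2}\,\lvert A\rvert^{-1/2}.
$$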

Thus,

Next let’s consider the integral over σ.

Put

Then we can rewrite

After making a change of variables, we can express the integral with respect to σ in terms of the Gamma function.

Consider an integral

Put

Then

And

where Γ denotes the Gamma function
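defined by

$$
\Gamma(z) \;=\; \int_0^\infty t^{\,z-1} e^{-t}\,dt .
$$

As a concrete instance of how such integrals reduce to Γ (the exponents a and b below are generic placeholders, not the exact values appearing in the derivation): for a, b > 0, the substitution t = b/(2σ²) gives

$$
\int_0^\infty \sigma^{-(a+1)} \exp\!\left(-\frac{b}{2\sigma^2}\right) d\sigma
\;=\; \frac{1}{2}\left(\frac{2}{b}\right)^{a/2}\Gamma\!\left(\frac{a}{2}\right).
$$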

Thus, we can write

Let U, Σ, and V denote the singular value decomposition of X
so that

Let ξ₁, ξ₂, … denote the non-zero diagonal entries of Σ. Put Λ=Σ^⊤Σ. Then Λ is a diagonal matrix with entries ξ₁², ξ₂², … and

Note: To implement our algorithm, we will only need the matrix V and the nonzero diagonal entries of Σ, which can be efficiently computed with a partial singular value decomposition. See the LAPACK function gesdd.
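As a rough sketch (this is not the bbai implementation), the pieces needed here, namely V and the nonzero singular values, could be computed with SciPy, whose svd routine uses the gesdd driver by default:

```python
import numpy as np
from scipy.linalg import svd

def thin_svd(X, tol=1e-12):
    """Return V and the nonzero singular values of X.

    These are the only pieces of the decomposition the algorithm above needs.
    scipy.linalg.svd dispatches to LAPACK's gesdd driver by default.
    """
    # full_matrices=False gives the thin decomposition X = U @ diag(s) @ Vt
    _, s, Vt = svd(X, full_matrices=False, lapack_driver='gesdd')
    keep = s > tol * s.max()           # drop numerically zero singular values
    return Vt[keep].T, s[keep]

# Example on a random 20x10 design matrix
X = np.random.randn(20, 10)
V, xi = thin_svd(X)
Lambda_diag = xi ** 2                  # diagonal entries of Sigma^T Sigma
```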

Put

Then we can rewrite h and g as

And

Here we adopt the terminology

Put

Then r is a monotonically increasing function that ranges from 0 to 1,
and we can think of r as the average shrinkage factor across the eigenvectors.

Now, let’s make an approximation by replacing individual eigenvector shrinkages with the average:

Substituting the approximation into the likelihood then gives us

We see that the approximated likelihood can be expressed as a function of

and it follows that the likelihood is approximately data translated in log r(λ).

Thus, we can achieve an approximately noninformative prior if we
let η represent the average shrinkage, put

and use the hyperprior

Note: To invert r, we can use a standard root-finding algorithm.

Differentiating r(λ) gives us

Using the derivative and a variant of Newton’s algorithm, we can then quickly iterate to a solution of r⁻¹(η).
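As a sketch of the inversion step, here is one way to invert a monotone shrinkage function numerically. The particular r used below is only a stand-in with the qualitative properties described above (monotonically increasing from 0 to 1), not the exact definition of r, and SciPy's derivative-free brentq is used in place of the Newton variant:

```python
import numpy as np
from scipy.optimize import brentq

def make_r(xi):
    """A stand-in average-shrinkage function.

    r(lam) = mean(lam / (lam + xi_i^2)) increases monotonically from 0 to 1,
    matching the qualitative description above, but it is NOT the exact r
    used by the algorithm.
    """
    xi2 = np.asarray(xi, dtype=float) ** 2
    return lambda lam: float(np.mean(lam / (lam + xi2)))

def shrink_inverse(r, eta, lo=1e-12, hi=1e12):
    """Solve r(lam) = eta for lam with a bracketing root-finder.

    The algorithm described above uses a Newton variant with r'; brentq is a
    simple derivative-free alternative that works for any monotone r.
    """
    return brentq(lambda lam: r(lam) - eta, lo, hi)

# Example: find the lambda whose average shrinkage is 0.5
r = make_r([3.0, 2.0, 1.0, 0.5])
lam = shrink_inverse(r, 0.5)
print(lam, r(lam))                     # r(lam) is approximately 0.5
```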

Making Predictions

The result of fitting a Bayesian model is the posterior distribution P(θ|y). Let’s consider how we can use the distribution to make predictions given a new data point x’.

We’ll start by computing the expected target value

And

Note: To compute expected values of expressions involving η, we need to integrate over the posterior distribution P(η|y). We won’t have an analytical form for the integrals, but we can integrate efficiently and accurately with adaptive quadrature.
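To make the structure of these integrals concrete, here is a hedged sketch of averaging a per-η prediction over the hyperparameter posterior with an off-the-shelf quadrature routine. The callables posterior_weight and predict_given_eta are hypothetical placeholders for the quantities derived above; bbai's implementation uses its own adaptive quadrature:

```python
from scipy.integrate import quad

def posterior_mean_prediction(posterior_weight, predict_given_eta, x_new):
    """Approximate E[x_new^T w | y] by integrating over eta in (0, 1).

    posterior_weight(eta): unnormalized hyperparameter posterior P(eta | y)
    predict_given_eta(eta, x_new): conditional posterior mean given eta
    Both callables are hypothetical placeholders for the quantities above.
    """
    # Normalizing constant of the hyperparameter posterior
    z, _ = quad(posterior_weight, 0.0, 1.0)
    # Posterior-weighted average of the conditional prediction
    numerator, _ = quad(
        lambda eta: posterior_weight(eta) * predict_given_eta(eta, x_new),
        0.0, 1.0,
    )
    return numerator / z
```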

To compute variance, we have

and

Starting with E[σ²|y], recall that

And

Thus,

Note: The Gamma function has the property

So,

And

It follows that

For E[w w^T|y], we have

Outline of Algorithm

We’ve seen that computing statistics about predictions involves integrating over the posterior distribution P(η|y). We’ll briefly sketch out an algorithm for computing such integrals. We describe it only for computing the expected value of w; other expected values can be computed with straightforward modifications.

The procedure SHRINK-INVERSE applies Newton’s root-finding algorithm with r and r′ to compute r⁻¹.

To integrate over the hyperparameter posterior, we make use of an adaptive quadrature algorithm. A quadrature rule approximates an integral by a weighted sum of the integrand evaluated at different points.

In general, the more points of evaluation used, the more accurate the approximation will be. Adaptive quadrature algorithms compare the integral approximation at different levels of refinement to approximate the error and increase the number of points of evaluation until a desired tolerance is reached.

Note: We omit the details of the quadrature algorithm used and describe it only at a high level. For more information on specific quadrature algorithms, refer to Gauss-Hermite quadrature and tanh-sinh quadrature.
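As a toy illustration of the refinement idea (not the quadrature used by bbai), here is a minimal recursive adaptive Simpson rule:

```python
import math

def adaptive_simpson(f, a, b, tol=1e-8):
    """Minimal recursive adaptive Simpson integration, for illustration only."""
    def simpson(lo, hi):
        mid = 0.5 * (lo + hi)
        return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))

    def recurse(lo, hi, whole, tol):
        mid = 0.5 * (lo + hi)
        left, right = simpson(lo, mid), simpson(mid, hi)
        # Compare the refined estimate to the coarse one to estimate the error
        if abs(left + right - whole) < 15.0 * tol:
            return left + right
        return recurse(lo, mid, left, tol / 2.0) + recurse(mid, hi, right, tol / 2.0)

    return recurse(a, b, simpson(a, b), tol)

print(adaptive_simpson(math.exp, 0.0, 1.0))   # approximately e - 1
```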

Experimental Results

To better understand how the algorithm works in practice, we’ll set up a small experimental data set and compare a model fit with the Bayesian algorithm to an Ordinary Least Squares (OLS) model and to a ridge regression model whose regularization strength is chosen to minimize error on a leave-one-out cross-validation (LOOCV) of the data set.

Full source code for the experiment is available at github.com/rnburn/bbai/example/03-bayesian.py.

Generating the Data Set

We’ll start by setting up the data set. For the design matrix, we’ll randomly generate a 20-by-10 matrix X using a Gaussian with zero mean and covariance matrix K where

We’ll generate a weight vector with 10 elements from a spherical Gaussian with unit variance, and we’ll rescale the weights so that the signal variance is equal to 1.

Then we’ll set

where ε is a vector of 20 elements taken from a Gaussian with unit variance.

The Python code that sets up the data set is included in the linked example file.
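Based on the description above, a sketch of that setup might look like the following. The equicorrelated covariance matrix K and the interpretation of signal variance as w^⊤Kw are assumptions, not taken from the original code:

```python
import numpy as np

np.random.seed(0)
n, p = 20, 10

# Covariance matrix for the regressors. The actual K is not reproduced here,
# so an equicorrelated matrix is assumed purely for illustration.
rho = 0.5
K = rho * np.ones((p, p)) + (1.0 - rho) * np.eye(p)

# Design matrix: 20 rows drawn from a zero-mean Gaussian with covariance K
X = np.random.multivariate_normal(np.zeros(p), K, size=n)

# Weights from a spherical unit-variance Gaussian, rescaled so that the signal
# variance (interpreted here as w^T K w, which is an assumption) equals 1
w = np.random.randn(p)
w /= np.sqrt(w @ K @ w)

# Targets: signal plus unit-variance Gaussian noise
y = X @ w + np.random.randn(n)
```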

Fitting Models

Now, we’ll fit a Bayesian ridge regression model, an OLS model, and a ridge regression model with the regularization strength set so that mean squared error is minimized on a LOOCV.
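As a hedged sketch of this step (not the exact code from the linked example): OLS and a LOOCV-tuned ridge model can be fit with scikit-learn, and the Bayesian model with bbai, whose class name below is assumed rather than confirmed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV

# OLS baseline
ols = LinearRegression(fit_intercept=False).fit(X, y)

# Ridge regression with the regularization strength chosen by leave-one-out CV.
# RidgeCV (cv=None) uses an efficient LOO scheme over the supplied grid; this
# is a stand-in for an exact LOOCV-optimizing fit.
ridge = RidgeCV(alphas=np.logspace(-4, 4, 100), fit_intercept=False).fit(X, y)

# Bayesian ridge regression with the hyperparameter fully integrated over.
# The class name below is assumed; see example/03-bayesian.py in the bbai
# repository for the actual API.
from bbai.glm import BayesianRidgeRegression  # hypothetical import
bayes = BayesianRidgeRegression().fit(X, y)

# Out-of-sample error variance. The exact formula isn't reproduced here; under
# the data-generating process above, one natural measure is
#   E[(x'^T w_hat - y')^2] = (w_hat - w)^T K (w_hat - w) + 1,
# where w is the true weight vector and 1 is the noise variance.
def out_of_sample_error_variance(w_hat):
    d = w_hat - w
    return d @ K @ d + 1.0
```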

We can measure the out-of-sample error variance for each model using the equation

Doing this gives us

Out-of-sample performance of models

We see that both the Bayesian and ridge regression models are able to prevent overfitting and achieve better out-of-sample results.

Finally, we’ll compare the estimated noise variance from the Bayesian model to that from the OLS model.

This gives us

Model estimates for σ²

Conclusion

When applying Bayesian methods to ridge regression, we need to address: how do we handle the hyperparameter that controls regularization strength?

One option is to use a point estimate, where a value of the hyperparameter is chosen to optimize some metric (e.g. likelihood or a cross-validation). But such an approach goes against the Bayesian methodology of using probability distributions to express belief about parameters, particularly when the likelihood is not strongly peaked about a particular value of the hyperparameter.

Another option commonly used is to apply a known distribution for the hyperparameter prior (for example a half-Cauchy distribution). Such an approach gives us a posterior distribution for the hyperparameter and can work fairly well in practice, but the choice of prior distribution is somewhat arbitrary.

In this blog post, we showed a different approach where we selected a hyperparameter prior distribution so as to be approximately noninformative. We developed algorithms to compute standard prediction statistics given the noninformative prior, and we demonstrated on a small example that it compared favorably to using a point estimate of the hyperparameter selected to optimize a leave-one-out cross-validation.

Appendix A: Comparison with Other Bayesian Algorithms

Here, we’ll give a brief comparison of the algorithm presented to scikit-learn’s algorithm for Bayesian ridge regression.

Scikit-learn’s algorithm makes use of conjugate priors and, because of that, is restricted to Gamma priors, which require four hyperparameters that are arbitrarily chosen to be small values. Additionally, it requires initial values for the parameters α and λ, which are then updated from the data.

The algorithm’s performance can be sensitive to the choice of values for these parameters, and scikit-learn’s documentation provides a curve fitting example where the defaults perform poorly.
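For reference, these are the knobs that scikit-learn's BayesianRidge exposes (defaults shown; alpha_init and lambda_init are available in more recent scikit-learn releases):

```python
from sklearn.linear_model import BayesianRidge

# The four Gamma-prior hyperparameters default to small values, and the initial
# alpha/lambda (available in recent scikit-learn releases) seed the iterative
# update that scikit-learn performs.
model = BayesianRidge(
    alpha_1=1e-6, alpha_2=1e-6,    # shape and rate of the Gamma prior over alpha
    lambda_1=1e-6, lambda_2=1e-6,  # shape and rate of the Gamma prior over lambda
    alpha_init=None, lambda_init=None,
)
```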

In comparison, the algorithm we presented requires no initial parameters, and because the hyperparameter is integrated over, poorly performing values contribute little to the posterior probability mass.

We can see that our approach handles the curve-fitting example without requiring any tweaking.

The full curve-fitting comparison example is available at github.com/rnburn/bbai/blob/master/example/04-curve-fitting.

This blog post was originally published at buildingblock.ai/bayesian-ridge-regression.
