How to Build a Bayesian Ridge Regression Model with Full Hyperparameter Integration
How do we handle the hyperparameter that controls regularization strength?
In this blog post, we’ll describe an algorithm for Bayesian ridge regression where the hyperparameter representing regularization strength is fully integrated over. An implementation is available at github.com/rnburn/bbai.
Let θ = (σ², w) denote the parameters for a linear regression model with weights w and normally distributed errors of variance σ².
If X represents an n×p matrix of full rank with p regressors
and n rows, then θ specifies a probability distribution over possible
target values y
Suppose we observe y and assume y is generated from a linear model of unknown parameters. A Bayesian approach to inference seeks to quantify our belief in the unknown parameters θ given the observation.
Applying Bayes’ theorem, we can rewrite the probability as
where we refer to
- P(θ|y) as the posterior distribution
- P(y|θ) as the likelihood function
- P(θ) the prior distribution
The prior distribution describes our belief of θ before observing y and the posterior distribution describes our updated belief after observing y.
Suppose the prior distribution can be expressed as
where h(⋅, η) denotes a family of probability distributions parameterized
by what we call a hyperparameter η.
Traditional approaches to Bayesian linear regression have used what are called conjugate priors. A family of priors h(⋅, η) is conjugate if the posterior also belongs to the family
Conjugate priors are mathematically convenient as successive observations can be viewed as making successive updates to the parameters of a family of distributions, but requiring h(⋅, η) to be conjugate is a strong assumption.
Note: See Appendix A for a more detail comparison to other Bayesian algorithms.
We’ll instead describe an algorithm where
- Priors are selected to shrink w, reflecting the prior hypothesis that w is not
predictive, and be approximately noninformative for other parameters. - We fully integrate over hyperparameters so that no arbitrary choice of η is required.
Let’s first consider what it means for a prior to be noninformative.
How to Pick Non-informative Priors?
Note: The presentation here of non-informative priors closely follows chapter 2 of Box & Tiao’s book Bayesian Inference in Statistical Analysis.
Suppose y is data generated from a normal distribution of mean 0 but unknown variance. Let σ denote the standard deviation and let ℓ denote the likelihood function
Suppose we impose a uniform prior over σ so that the posterior is
and the cumulative distribution function is
where N is some normalizing constant.
But now let’s suppose that instead of standard deviation we parameterize over variance. Making the substitution u=σ² into the cumulative distribution function gives us
Thus, we see that choosing a uniform prior over σ is equivalent to choosing the improper prior
over variance. In general, if u=φ(θ) is an alternative way of parameterizing the likelihood function where φ is some monotonic, one-to-one, onto function. Then a uniform prior over u is equivalent to choosing a prior
over θ.
So, when does it make sense to use a uniform prior if the choice is sensitive to parameterization?
Let’s consider how the likelihood is affected by changes in the observed value y^⊤y.
As we can see from the figure, as y^⊤y is increased the shape of the likelihood function changes: it becomes less peaked and more spread out.
Observe that we can rewrite the likelihood function as
where
Thus, in the parameter log σ, likelihood has the form
where
And we say that the likelihood function is data translated in log σ because everything is known about the likelihood curve except the location and the value of the observation serves only to shift the curve.
When the likelihood function is data translated in a parameter, then it makes sense to use a uniform function for a noninformative prior. Before observing data, we have no reason to prefer one range of parameters [a, b] over another range of the same width [t + a, t + b] because they will be equivalently placed relative to the likelihood curve if the observed data translates by t.
Now, let’s return to picking our priors.
Picking Regularized Bayesian Linear Regression Priors
For the parameter σ, we use the noninformative prior
which is equivalent to using a uniform prior over the parameter log σ. For w, we want an informative prior that shrinks the weights, reflecting a prior belief that weights are non-predictive. Let η denote a hyperparameter that controls
the degree of shrinkage. Then we use the spherical normal distribution with covariance matrix (σ/λ_η)² I
Note that we haven’t described yet how η parameterizes λ and we’ll also be integrating over η so we additionally have a prior for η (called a hyperprior) so that
Our goal is for the prior P(η) to be noninformative. So we want to know: In what parameterization, would P(y|η) be data translated?
How to Parameterize the shrinkage prior?
Before thinking about how to parameterize λ, let’s characterize the likelihood function for λ.
Expanding the integrand, we have
where
and
Observe that #1 is independent of w, so the integral with respect to w is equivalent to integrating over an unnormalized Gaussian.
Note: Recall the formula for a normalized Gaussian integral
Thus,
Next let’s consider the integral over σ.
Put
Then we can rewrite
After making a change of variables, we can express the integral with respect to σ as a form of the Gamma function.
Consider an integral
Put
Then
And
where Γ denotes the Gamma function
Thus, we can write
Let U, Σ, and V denote the singular value decomposition of X
so that
Let ξ₁, ξ₂, … denote the non-zero diagonal entries of Σ. Put Λ=Σ^⊤Σ. Then Λ is a diagonal matrix with entries ξ₁², ξ₂², … and
Note: To implement our algorithm, we will only need the matrix V and the nonzero diagonal entries of Σ, which can be efficiently computed with a partial singular value decomposition. See the LAPACK function gesdd.
Put
Then we can rewrite h and g as
And
Here we adopt the terminology
Put
Then r is a monotonically increasing function that ranges from 0 to 1,
and we can think of r as the average shrinkage factor across the eigenvectors.
Now, let’s make an approximation by replacing individual eigenvector shrinkages with the average:
Substituting the approximation into the likelihood then gives us
We see that the approximated likelihood can be expressed as a function of
and it follows that the likelihood is approximately data translated in log r(λ).
Thus, we can achieve an approximately noninformative prior if we
let η represent the average shrinkage, put
and use the hyperprior
Note: To invert r, we can use a standard root-finding algorithm.
Differentiating r(λ) gives us
Using the derivative and a variant of Newton’s algorithm, we can then quickly iterate to a solution of r⁻¹(η).
Making Predictions
The result of fitting a Bayesian model is the posterior distribution P(θ|y). Let’s consider how we can use the distribution to make predictions given a new data point x’.
We’ll start by computing the expected target value
And
Note: To compute expected values of expressions involving η, we need to integrate over the posterior distribution P(η|y). We won’t have an analytical form for the integrals, but we can efficiently and accurately integrate with an adaptive Quadrature.
To compute variance, we have
and
Starting with E[σ²|y], recall that
And
Thus,
Note: The Gamma function has the property
So,
And
It follows that
For E[w w^T|y], we have
Outline of Algorithm
We’ve seen that computing statistics about predictions involve integrating over the posterior distribution P(η|y). We’ll briefly sketch out an algorithm for computing such integrals. We describe it only for computing the expected value of w. Other expected values can be computed with straightforward modifications.
The proceedure SHRINK-INVERSE applies Newton’s algorithm for root-finding with r and r’ to compute r⁻¹.
To compute the hyperparameter posterior integration, we make use of an adaptive quadrature algorithm. Quadratures approximate an integral by a weighted sum at different points of evaluation.
In general, the more points of evaluation used, the more accurate the approximation will be. Adaptive quadrature algorithms compare the integral approximation at different levels of refinement to approximate the error and increase the number of points of evaluation until a desired tolerance is reached.
Note: We omit the details of the quadrature algorithm used and describe only at a high level. For more information on specific quadrature algorithms refer to Gauss-Hermite Quadratures and Tanh-sinh Quadratures.
Experimental Results
To better understand how the algorithm works in practice, we’ll set up a small
experimental data set, and we’ll compare a model fit with the Bayesian algorithm to Ordinary Least Squares (OLS), and a ridge regression model fit so as to minimize error on a Leave-one-out Cross-validation (LOOCV) of the data set.
Full source code for the experiment is available at github.com/rnburn/bbai/example/03-bayesian.py.
Generating the Data Set
We’ll start by setting up the data set. For the design matrix, we’ll randomly generate a 20-by-10 matrix X using a Gaussian with zero mean and covariance matrix K where
We’ll generate a weight vector with 10 elements from a spherical Gaussian with unit variance, and we’ll rescale the weights so that the signal variance is equal to 1.
Then we’ll set
where ε is a vector of 20 elements taken from a Gaussian with unit variance.
Here’s the Python code that sets up the data set.
Fitting Models
Now, we’ll fit a Bayesian ridge regression model, an OLS model, and a ridge regression model with the regularization strength set so that mean squared error is minimized on a LOOCV.
We can measure the out-of-sample error variance for each model using the equation
Doing this gives us
We see that both the Bayesian and ridge regression models are able to prevent overfitting and achieve better out-of-sample results.
Finally, we’ll compare the estimated noise variance from the Bayesian model to that from the OLS model.
This gives us
Conclusion
When applying Bayesian methods to ridge regression, we need to address: how do we handle the hyperparameter that controls regularization strength?
One option is to use a point estimate, where a value of the hyperparameter is chosen to optimize some metric (e.g. likelihood or a cross-validation). But such an approach goes against the Bayesian methodology of using probability distributions expressing belief for parameters, particularly if the likelihood is not strongly peaked about a particular value of the hyperparameter.
Another option commonly used is to apply a known distribution for the hyperparameter prior (for example a half-Cauchy distribution). Such an approach gives us a posterior distribution for the hyperparameter and can work fairly well in practice, but the choice of prior distribution is somewhat arbitrary.
In this blog post, we showed a different approach where we selected a hyperparameter prior distribution so as to be approximately noninformative. We developed algorithms to compute standard prediction statistics given the noninformative prior, and we demonstrated on a small example that it compared favorably to using a point estimate of the hyperparameter selected to optimize a leave-one-out cross-validation.
Appendix A: Comparison with Other Bayesian Algorithms
Here, we’ll give a brief comparison of the algorithm presented to scikit-learn’s algorithm for Bayesian ridge regression.
Scikit-learn’s algorithm makes use of conjugate priors and because of that is restricted to use the Gamma prior which requires four hyperparameters chosen arbitrarily to be small values. Additionally, it requires initial values for parameters α and λ that are then updated from the data.
The algorithm’s performance can be sensitive to the choice of values for these parameters, and scikit-learn’s documentation provides a curve fitting example where the defaults perform poorly.
In comparison, the algorithm we presented requires no initial parameters; and because the hyperparameter is integrated over, poor performing values contribute little to the posterior probability mass.
We can see that our approach handles the curve-fitting example without requiring any tweaking.
The full curve-fitting comparison example is available at github.com/rnburn/bbai/blob/master/example/04-curve-fitting.
This blog post was originally published at buildingblock.ai/bayesian-ridge-regression.