Photo by Negative Space from Pexels

Hands-on Tutorials

Bayesian Generalized Linear Models with Pyro

Predicting House Prices with Linear Models and Pyro for a Fully Transparent Approach

One of the most common “first lines of attack” when faced with a predictive or analytical data project is the family of Generalized Linear Models (GLMs), most commonly linear or logistic regression.

GLMs seek to model a response variable, y, as a function of a linear combination of features, X. The linear combination is used largely for the sake of explainability: we want not only to predict y well, but also to explain the effect of each feature on our response.

Thus, we usually specify a GLM as our response being a combination of features and coefficients as follows:

y = f(β_0 + β_1·x_1 + β_2·x_2 + … + β_k·x_k)

where f is a function of our linear combination, and the coefficients are denoted with the Greek letter β.

While libraries such as sklearn offer a rich variety of regression models, they have one major drawback: they create point estimates of β without taking uncertainty into account. That is, while we can find the most likely value for each coefficient, we don’t estimate how likely other values are. Other packages such as statsmodels offer some measure of uncertainty, but do so under a series of implicit assumptions the analyst may not get to verify (or even know about).

In this article we’ll discuss how to leverage PyTorch and Pyro to produce GLMs which create uncertainty estimates both for the parameters and for the predictions, and do so with a set of very explicit assumptions.

Employing Scikit-learn’s Linear Regression

We’ll start by exploring a simple linear regression from sklearn, and see how it behaves on one of the built-in datasets, the California Housing dataset.

We’ll start by importing all our required libraries:
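Something along the following lines covers everything used below (the exact set of imports is a sketch):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS, SVI, Trace_ELBO, Predictive
from pyro.infer.autoguide import AutoDiagonalNormal
from pyro.optim import Adam
```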

Now let’s import the housing dataset, and explore its features:
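A sketch of the loading step. Note that sklearn reports the target in units of $100,000, so I convert it back into dollars to match the dollar amounts (and the $500,000+ censoring) discussed below; that conversion is my assumption:

```python
# Load the California Housing data into a dataframe
housing = fetch_california_housing()
feature_names = list(housing.feature_names)
X = pd.DataFrame(housing.data, columns=feature_names)

# sklearn expresses the median house value in units of $100,000; convert to dollars
# (assumption, to match the prices used throughout the article)
y = housing.target * 100_000

print(f"Data shape is {X.shape}")
print(f"Target shape is {y.shape}")
```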

>>> Data shape is (20640, 8)
>>> Target shape is (20640,)

The basic linear regression model in many libraries assumes the response is normally distributed around the predicted mean. However, house prices are generally not normally distributed. In this dataset too, the distribution of prices is not normal; it is closer to a gamma distribution. (In practice we won’t always see this directly in the data and might have to look at the residuals instead, but this simple alternative works in our case.) So it would be nice if we could build our model to take that into account.

First, let’s explore how well an sklearn linear regression performs on this data. We can split the data into train and test sets to get an estimate of how our algorithms will work:
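The shapes printed below imply an 80/20 split; the random seed here is arbitrary:

```python
# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```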

>>> (16512, 8) (16512,)
>>> (4128, 8) (4128,)

Let’s fit our linear regression:
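Fitting and printing the coefficients can look like this (a minimal sketch):

```python
# Ordinary least squares on the raw (unscaled) features
lr = LinearRegression()
lr.fit(X_train, y_train)

print(f"beta_intercept: {lr.intercept_:.2f}")
for name, coef in zip(feature_names, lr.coef_):
    print(f"beta_{name}: {coef:.2f}")
```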

>>> beta_intercept: -3664645.74
>>> beta_MedInc: 44223.44
>>> beta_HouseAge: 944.77
>>> beta_AveRooms: -11532.16
>>> beta_AveBedrms: 66340.59
>>> beta_Population: -0.16
>>> beta_AveOccup: -411.33
>>> beta_Latitude: -41808.42
>>> beta_Longitude: -43108.51

Let’s define a function to plot our predictions and the true values for the held out test set:
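A minimal version of such a function (the name plot_predictions and the styling are my own choices):

```python
def plot_predictions(y_true, y_pred, title=""):
    """Scatter predicted vs. true prices and report the R² of the predictions."""
    plt.figure(figsize=(8, 6))
    plt.scatter(y_true, y_pred, alpha=0.2)
    # A perfect model would place every point on this diagonal
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    plt.plot(lims, lims, color="red")
    plt.xlabel("True price")
    plt.ylabel("Predicted price")
    plt.title(f"{title} (R² = {r2_score(y_true, y_pred):.3f})")
    plt.show()

plot_predictions(y_test, lr.predict(X_test), "sklearn LinearRegression")
```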

There are some observations in our data which are censored. That is, their labels are capped at $500,000+. We can define another function which plots only the points that are not censored:
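For example, reusing the function above (the $500,000 ceiling comes from the censoring just described):

```python
def plot_non_censored(y_true, y_pred, title="", ceiling=500_000):
    """Same plot, but keeping only observations whose label is below the $500,000+ cap."""
    mask = y_true < ceiling
    plot_predictions(y_true[mask], y_pred[mask], title)

plot_non_censored(y_test, lr.predict(X_test), "sklearn LinearRegression (non-censored)")
```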

While this approach can produce satisfactory results, it suffers from a few main drawbacks:

  • First, the linear model ignores the fact that the prices come from a gamma distribution. Its estimate of the expected value for every point is predicated on the response being normally distributed around that mean.
  • Second, for each coefficient, we only get a point estimate of its most likely value. However, we might be interested in a range which accounts for uncertainty. For example, we might want to know what is the range of price increases we can expect for each additional bedroom.

To address these problems, we can employ Pyro and PyTorch to construct our own linear model which will address all the pain points just mentioned.

Reconstructing the Linear Model with Pyro

First, let’s try and replicate the findings of the simple linear regression with Pyro. This will give us an intuition for how the different Pyro primitives work:

We will define our model in Pyro. Pyro models are defined as functions (actually they are defined as callables, but the simplest callable is a function). The function will accept our features 𝑋, our target 𝑦, and also the feature names for easier naming of the priors:
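A sketch of such a model. The Normal(0, 10) priors on the coefficients and the HalfNormal(10) prior on the error term are illustrative weakly informative choices, not necessarily the exact priors used originally:

```python
def model_normal(X, y=None, feature_names=None):
    # One named prior per coefficient, so we can refer to "beta_MedInc" etc. later
    beta_intercept = pyro.sample("beta_intercept", dist.Normal(0., 10.))
    betas = torch.stack([
        pyro.sample(f"beta_{name}", dist.Normal(0., 10.))
        for name in feature_names
    ])
    # The error term must be positive
    sigma = pyro.sample("sigma", dist.HalfNormal(10.))

    # Linear combination of features and coefficients
    mean = beta_intercept + X @ betas

    # The response is modeled as Normal around the predicted mean
    with pyro.plate("data", X.shape[0]):
        return pyro.sample("obs", dist.Normal(mean, sigma), obs=y)
```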

In essence what we’ve done here is define our linear regression as the following linear combination of parameters

y ~ Normal(β_0 + β_1·x_1 + … + β_k·x_k, σ)

However, unlike traditional linear regression, we’ve defined each beta coefficient, as well as the error term, to be a distribution instead of a single value. That is, for each coefficient we can ask what range of values it can plausibly assume given the data we observed. We gave a name to each of those distributions (e.g. “beta_intercept”) for easy reference later.

We had to define priors on each coefficient. A prior is like our “best guess” for that value. Our chosen priors were:

β_j ~ Normal(0, s_β) for every coefficient β_j, with a wide scale s_β
σ ~ a weakly informative positive prior (e.g. a half-normal)

These are not very informative priors, but they are commonly used for regression coefficients and error terms.

One important point to notice is that we have to be explicit about these choices when building our model. That is, we have to be clear about what constitutes a reasonable prior for the coefficient values and the error term, and about how the observed values are distributed around our predicted value.

Once the priors are defined, we can ask Pyro to update them into better and better guesses through the magic of MCMC samplers:
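Running NUTS on the (unscaled) training data; the 3100 iterations in the progress bar below suggest something like 100 warmup steps plus 3000 samples, but the exact split is a guess:

```python
# Convert the training data to tensors for Pyro/PyTorch
X_train_t = torch.tensor(X_train.values, dtype=torch.float)
y_train_t = torch.tensor(y_train, dtype=torch.float)

# NUTS is a self-tuning Hamiltonian Monte Carlo sampler
mcmc = MCMC(NUTS(model_normal), num_samples=3000, warmup_steps=100)
mcmc.run(X_train_t, y_train_t, feature_names)
```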

>>> Sample: 100%|██████████| 3100/3100 [22:45,  2.27it/s, step size=7.72e-04, acc. prob=0.955]
>>> Inference ran for 22.75 minutes

If you are interested in a breakdown of what has happened here, I recommend that you check out my previous post which explores the use of MCMC methods to optimize a single parameter.

We can explore the estimates found by our MCMC sampler for each of our coefficients using the .summary() method:

>>>                        mean       std    median      5.0%     95.0%   n_eff   r_hat
 beta_intercept            1.58      0.01      1.58      1.56      1.59    9.48    1.21
 beta_MedInc               1.96      0.02      1.96      1.93      1.98    2.83    2.28
 beta_HouseAge             0.80      0.09      0.80      0.68      0.93    2.41    3.00
 beta_AveRooms            -1.88      0.02     -1.87     -1.90     -1.86    2.56    2.53
 beta_AveBedrms            1.63      0.01      1.64      1.61      1.65    4.59    1.45
 beta_Population          81.19     18.19     87.50     71.45     87.91    9.00    1.13
 beta_AveOccup            -0.56      0.01     -0.56     -0.57     -0.55    4.61    1.97
 beta_Latitude            -0.57      0.04     -0.58     -0.63     -0.52    2.39    3.23
 beta_Longitude           -2.37      3.30     -2.12     -7.11      1.80    2.44    2.82
 sigma                 15062.73    460.33  14933.00  14908.34  15057.25   12.98    1.09

Number of divergences: 0

Those don’t look quite right… The means seem very different from the point estimates found by the regression from sklearn.

Let’s grab the individual samples from our sampler and turn them into a dataframe (they are returned as a dictionary). We can take the mean of each distribution as a coefficient point estimate and then calculate a set of predictions for our data points, which we can compare to the known house prices:
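One way to do that (the variable names are mine):

```python
# The sampler returns a dict of site name -> tensor of draws; turn it into a dataframe
samples = {k: v.detach().numpy() for k, v in mcmc.get_samples().items()}
coef_df = pd.DataFrame(samples)

# Use the posterior mean of each coefficient as a point estimate and predict on the test set
coef_means = coef_df.mean()
beta_vector = coef_means[[f"beta_{name}" for name in feature_names]].values
y_pred_mcmc = coef_means["beta_intercept"] + X_test.values @ beta_vector
```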

Let’s plot the results and calculate the 𝑅² value for our predictions:

Well that looks like a disaster! What happened?

Let’s define a function that will draw the coefficients’ distributions for us when given a coefficient dataframe:
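A simple version of that helper (again, the name and layout are my own choices):

```python
def plot_coefficient_distributions(coef_df):
    """Plot a histogram of the sampled values for every column in the coefficient dataframe."""
    n_cols = 3
    n_rows = int(np.ceil(len(coef_df.columns) / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4 * n_rows))
    for ax, column in zip(axes.ravel(), coef_df.columns):
        ax.hist(coef_df[column], bins=50)
        ax.set_title(column)
    plt.tight_layout()
    plt.show()

plot_coefficient_distributions(coef_df)
```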

Those plots don’t look like they converged.

It turns out that, for this problem, MCMC methods have a hard time with the different scales of our data. They work much better when our features and target are scaled. Let’s explore the performance of the same methodology with the data scaled.

Scaling Our Data

We are going to scale our X data using a standard scaler from sklearn, and scale y by dividing it by its maximum value. The standard scaler transforms each feature in X so that it has a mean of 0.0 and a standard deviation of 1.0. To do so it employs each variable's mean x̅ and standard deviation 𝑠.

This means, instead of trying to find a set of coefficients on the original data

y = β_0 + β_1·x_1 + β_2·x_2 + … + β_k·x_k

we will find a set of coefficients, β′, on the scaled data:

y/y_max = β′_0 + β′_1·(x_1 − x̅_1)/s_1 + … + β′_k·(x_k − x̅_k)/s_k

where

y_max = max(y), and x̅_j and s_j are the mean and standard deviation of feature j.

That is, y is scaled to be between 0–1 and the columns of X have been standardized.
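A sketch of the scaling step and the second sampling run (same assumed sampler settings as before):

```python
# Standardize the features and bring the target to the 0-1 range
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
y_max = y_train.max()
y_train_s = y_train / y_max

X_train_s_t = torch.tensor(X_train_s, dtype=torch.float)
y_train_s_t = torch.tensor(y_train_s, dtype=torch.float)

# Re-run NUTS, now on the scaled data
mcmc_scaled = MCMC(NUTS(model_normal), num_samples=3000, warmup_steps=100)
mcmc_scaled.run(X_train_s_t, y_train_s_t, feature_names)
```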

>>> Sample: 100%|██████████| 3100/3100 [05:27,  9.46it/s, step size=1.71e-02, acc. prob=0.926]
>>> Inference ran for 5.46 minutes

Our algorithm ran much faster now, but if we recover the coefficients the algorithm found, those will be the coefficients on the scaled data. We would like to translate them back into the unscaled data so we can ask questions such as: “For each extra bedroom in the house, what will be the effect on the price?”

Luckily, we can manipulate our equation to retrieve the coefficients on the unscaled data. We begin with the equation on the scaled data

y/y_max = β′_0 + β′_1·(x_1 − x̅_1)/s_1 + … + β′_k·(x_k − x̅_k)/s_k

and we expand each fraction:

y/y_max = β′_0 − ∑_j β′_j·x̅_j/s_j + ∑_j (β′_j/s_j)·x_j

We can then rearrange the equation as follows:

y = y_max·(β′_0 − ∑_j β′_j·x̅_j/s_j) + ∑_j (y_max·β′_j/s_j)·x_j

Recalling that

y = β_0 + β_1·x_1 + … + β_k·x_k

we can finally read off the coefficients on the unscaled data:

β_0 = y_max·(β′_0 − ∑_j β′_j·x̅_j/s_j),   β_j = y_max·β′_j/s_j

We can create a function to translate the coefficients from the scaled data back to the unscaled data:
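A sketch of such a function, implementing the two formulas above (it assumes the StandardScaler and y_max from the scaling step):

```python
def unscale_coefficients(coef_df_scaled, scaler, y_max, feature_names):
    """Translate coefficient samples learned on scaled X and y back to the original units."""
    beta_cols = [f"beta_{name}" for name in feature_names]
    # beta_j = y_max * beta'_j / s_j
    unscaled = coef_df_scaled[beta_cols] * y_max / scaler.scale_
    # beta_0 = y_max * (beta'_0 - sum_j beta'_j * mean_j / s_j)
    correction = (coef_df_scaled[beta_cols] * scaler.mean_ / scaler.scale_).sum(axis=1)
    unscaled["beta_intercept"] = y_max * (coef_df_scaled["beta_intercept"] - correction)
    # sigma was estimated on the scaled target, so bring it back to dollars as well
    unscaled["sigma"] = coef_df_scaled["sigma"] * y_max
    return unscaled
```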

Then we just apply that function to the coefficients once we have retrieved them from the sampler’s dictionary again:
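For example:

```python
samples_scaled = {k: v.detach().numpy() for k, v in mcmc_scaled.get_samples().items()}
coef_df_scaled = pd.DataFrame(samples_scaled)
coef_df_unscaled = unscale_coefficients(coef_df_scaled, scaler, y_max, feature_names)
```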

Let’s compare the prediction results:

These values seem much closer to the values found by the linear regression from sklearn.

It seems like we got a comparable performance. The distributions look close to the values we got from scikit-learn. We also get a probability distribution for each coefficient so we can see our level of confidence in the value we found.

However, we can actually employ Pyro to do better! Recall that we saw our house prices are not normally distributed but in fact follow a gamma distribution. We can modify our code to reflect that in the model.

Improving our Predictions Using a Gamma Distribution

In order to better reflect the house distribution, we can employ a gamma distribution for our target values. Unlike the normal distribution which is defined by its mean and standard deviation, the gamma distribution is defined by two positive parameters which are the shape and the rate.

When constructing our model for a distribution other than normal, we need to employ a link function which will translate the linear combination of our parameters to the expected value, or the mean, of the distribution. We also would like to know the relationship between the mean and the distribution parameters. Luckily, for the gamma distribution this is defined as:

mean = shape / rate

However, if both the shape and rate parameters are positive, that means the mean must be positive as well. We need to make sure that our link function captures that. Therefore, I will use the following link function for the linear equation:

mean = e^{β_0 + β_1·x_1 + … + β_k·x_k}

or

log(mean) = β_0 + β_1·x_1 + … + β_k·x_k

Readers familiar with the statsmodels package will know it has a GLM module (for more details on statsmodels GLMs, I recommend this article) which can model the relationship proposed here by employing the statsmodels.genmod.families.family.Gamma family and the statsmodels.genmod.families.links.log link function. However, for the purposes of this article I wanted to show the process involved in reproducing this functionality in a Bayesian framework.

Interestingly enough, recovering the coefficients for the unscaled data works out fairly similarly, except for the constant term. Keeping in mind that on the scaled data the model is

mean(y/y_max) = e^{β′_0 + β′_1·(x_1 − x̅_1)/s_1 + … + β′_k·(x_k − x̅_k)/s_k}

and that

y_max = e^{log(y_max)}

we can find that our equation can be written as:

mean(y) = e^{log(y_max) + β′_0 − ∑_j β′_j·x̅_j/s_j + ∑_j (β′_j/s_j)·x_j}

or

β_0 = log(y_max) + β′_0 − ∑_j β′_j·x̅_j/s_j,   β_j = β′_j/s_j

Let’s define our model, but now using a gamma distribution:
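A sketch of the gamma version of the model (the priors are again illustrative choices):

```python
def model_gamma(X, y=None, feature_names=None):
    # Same style of wide priors on the intercept and coefficients as before
    beta_intercept = pyro.sample("beta_intercept", dist.Normal(0., 10.))
    betas = torch.stack([
        pyro.sample(f"beta_{name}", dist.Normal(0., 10.))
        for name in feature_names
    ])
    # The gamma rate parameter must be positive
    rate = pyro.sample("rate", dist.HalfNormal(10.))

    # Log link: exponentiate the linear combination to get a strictly positive mean
    mean = torch.exp(beta_intercept + X @ betas)
    # For a gamma distribution, mean = shape / rate, so shape = mean * rate
    shape = mean * rate

    with pyro.plate("data", X.shape[0]):
        return pyro.sample("obs", dist.Gamma(shape, rate), obs=y)
```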

You’ll notice our code is slightly different, but we are still calculating a linear combination of our X data and our coefficients, except now we take the exp of that combination to get the mean value for each data point. We also sample a rate parameter, and use the mean and rate to calculate the appropriate shape parameter.

Given our shape and rate parameters, we can define a gamma distribution and ask Pyro to find the coefficient and rate values that are most likely given our data.

Let’s optimize this new model and look at the results:
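Using the same (assumed) sampler settings as before, now on the scaled data:

```python
mcmc_gamma = MCMC(NUTS(model_gamma), num_samples=3000, warmup_steps=100)
mcmc_gamma.run(X_train_s_t, y_train_s_t, feature_names)
```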

>>> Sample: 100%|██████████| 3100/3100 [11:08,  4.64it/s, step size=1.47e-02, acc. prob=0.930]
>>> Inference ran for 11.14 minutes

We can grab the coefficients from this sampler as well. However, this time we have to treat the value of y_max slightly differently.
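A sketch of the adjusted unscaling, following the derivation above: under the log link, y_max enters the intercept as log(y_max) instead of multiplying every coefficient:

```python
def unscale_gamma_coefficients(coef_df_scaled, scaler, y_max, feature_names):
    """Translate gamma-model coefficients learned on scaled data back to the original units."""
    beta_cols = [f"beta_{name}" for name in feature_names]
    # beta_j = beta'_j / s_j
    unscaled = coef_df_scaled[beta_cols] / scaler.scale_
    # beta_0 = log(y_max) + beta'_0 - sum_j beta'_j * mean_j / s_j
    correction = (coef_df_scaled[beta_cols] * scaler.mean_ / scaler.scale_).sum(axis=1)
    unscaled["beta_intercept"] = np.log(y_max) + coef_df_scaled["beta_intercept"] - correction
    return unscaled

gamma_coefs_scaled = pd.DataFrame(
    {k: v.detach().numpy() for k, v in mcmc_gamma.get_samples().items()}
)
gamma_coefs = unscale_gamma_coefficients(gamma_coefs_scaled, scaler, y_max, feature_names)
```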

Let’s compare this model’s performance by calculating the predictions and comparing them to the observed values

How do we interpret these coefficients?

Well, they now modify our price by a factor. Recall that our equation is now:

mean price = e^{β_0 + β_1·x_1 + … + β_k·x_k}

which is equivalent to

mean price = e^{β_0}·e^{β_1·x_1}·…·e^{β_k·x_k}

So each increase of x_1 by one unit increases the house price by a factor of e^{β_1}.

For example, the mean value for beta_AveRooms is ~0.007, so each additional room will increase the house price, on average, by a factor of e^{0.007} = 1.007 (so if a house cost $250,000, adding an additional room would make its value roughly $251,756).

Other parameters seem to qualitatively agree between the two models, but the model employing the gamma distribution is giving us better predictions on the unseen data.

There is still the problem of run-time, however. In general, if we want to do inference in reasonable time, we have to turn to the method of variational inference.

Faster Run-time with SVI

Pyro implements Stochastic Variational Inference (SVI) for faster inference. Under the SVI approach, instead of trying to sample from the posterior distribution directly, we simply optimize the parameters of some pre-defined distribution to match our observed data.

For example, we can choose to represent all our coefficients, and rate, as normal distributions. This way SVI simply needs to find an appropriate mean and standard deviation for each normal distribution such that it agrees with our data as much as possible.
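A sketch of that setup, using one of Pyro's automatic guides; the Adam learning rate, the 5,000 steps, and the per-datapoint normalization of the loss are assumptions chosen to roughly match the trace below:

```python
pyro.clear_param_store()

# AutoDiagonalNormal represents every latent variable with an independent Normal
# (in unconstrained space), so SVI only has to learn a mean and a scale per latent
guide = AutoDiagonalNormal(model_gamma)
svi = SVI(model_gamma, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())

n_train = X_train_s_t.shape[0]
for i in range(5000):
    loss = svi.step(X_train_s_t, y_train_s_t, feature_names)
    if i % 250 == 0:
        print(f"iter: {i}, normalized loss:{loss / n_train:.2f}")
```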

>>> iter: 0, normalized loss:5.28
>>> iter: 250, normalized loss:-0.64
>>> iter: 500, normalized loss:-0.68
>>> iter: 750, normalized loss:-0.68
>>> iter: 1000, normalized loss:-0.68
>>> iter: 1250, normalized loss:-0.69
>>> iter: 1500, normalized loss:-0.69
>>> iter: 1750, normalized loss:-0.68
>>> iter: 2000, normalized loss:-0.68
>>> iter: 2250, normalized loss:-0.69
>>> iter: 2500, normalized loss:-0.69
>>> iter: 2750, normalized loss:-0.69
>>> iter: 3000, normalized loss:-0.69
>>> iter: 3250, normalized loss:-0.68
>>> iter: 3500, normalized loss:-0.69
>>> iter: 3750, normalized loss:-0.68
>>> iter: 4000, normalized loss:-0.68
>>> iter: 4250, normalized loss:-0.69
>>> iter: 4500, normalized loss:-0.68
>>> iter: 4750, normalized loss:-0.68
>>> Inference ran for 0.74 minutes

Grabbing the results from the SVI posterior is a bit more involved than with the MCMC samples, since the samples are returned as a dictionary of tensors that still carry gradients, and the dictionary also contains an additional key called obs which is not relevant for our analysis of the coefficients (it holds a set of estimates for the training data).
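One possible way to do the extraction, using Pyro's Predictive helper (the site list and sample count are mine):

```python
# Ask for the coefficient sites explicitly, then detach the tensors (they may still
# carry gradients) and drop the "obs" entries, which hold estimates for the training data
sites = ["beta_intercept"] + [f"beta_{name}" for name in feature_names] + ["rate", "obs"]
predictive = Predictive(model_gamma, guide=guide, num_samples=1000, return_sites=sites)
svi_samples = predictive(X_train_s_t, None, feature_names)

svi_coefs_scaled = pd.DataFrame({
    name: values.detach().squeeze().numpy()
    for name, values in svi_samples.items()
    if name != "obs"
})
svi_coefs = unscale_gamma_coefficients(svi_coefs_scaled, scaler, y_max, feature_names)
```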

And there we go. We now have the power to build custom GLMs with Pyro, using either MCMC sampling methods or SVI optimization methods.

One important feature of Pyro is that it forces the writer to be very explicit about their assumptions and their understanding of the data generating process. This can be extremely useful in promoting an in-depth exploration of the data before any modeling is done, contributing to a deeper understanding of all important factors.

For a version of this article as a Colab notebook, visit this link.

Result Summary

After finishing the notebook version of this post, I set out to profile the run-time and performance results I observed. By running the same code 40 times, each time with a different train/test split, I was able to gain confidence in the results shown throughout this article.

In terms of R² performance, the gamma-distribution-based models consistently outperform the normal-distribution-based models, as can be seen in the following two figures.

Image prepared using Plotly

It’s worth noting that the high standard deviation evident in the R² values for the Linear Regression model (sklearn's linear regression) and for the model produced after scaling the data and employing MCMC methods is the result of a single train/test split which causes a severe mis-estimation of the coefficients. The gamma-distribution-based approaches were more robust to that misleading train/test split.

Additionally, I have profiled the run time for each methodology and the results can be seen in the following figure:

As you can see, while sklearn's linear regression consistently runs faster than all the other methods, the SVI-based approach is a relatively fast solution which combines the flexibility of a custom-specified GLM with the fast run-time required for many problems.

Additionally, if we so desire we can also write our own custom guides which employ distributions other than the normal distribution for the coefficients.

Overall, we can see that Pyro offers a rich set of methodologies for building linear models for different problems, combining the explicitness involved in probabilistic programming with the explainability that has become a focus lately. I hope this post serves as a good guide for analysts and scientists looking to harness the power of an up-and-coming probabilistic programming language.


I received my PhD in computing science from Simon Fraser University in 2018, and have since been dedicated to life-long learning in data science and ML.