
Probabilistic Programming: Generalized Linear Models

An alternative, distribution-first approach to modelling data.

Photo by Carlos Muza on Unsplash

What is Probabilistic Programming?

There is uncertainty in the world. Data scientists aim to model the world using past data to predict the future effectively. This strategy works reasonably well across many domains and use cases. However, the uncertainty remains. It would be beneficial to predict a future outcome and also understand how ‘good’ the model is.

Probabilistic programming has been around for some time, well over ten years. However, many people have likely not seen, used, or even heard of it before. So what is it? Very basically, it is a way to create a model in terms of probabilities, and it is explicitly for those who want their outcomes expressed probabilistically.

Probabilistic programming is a paradigm that mixes programming frameworks with Bayesian statistical modelling, inference algorithms, and elements of machine learning. It is based on a Bayesian reasoning approach, which stands in contrast to the frequentist approach that most common models follow. Fortunately, the nuances between the two systems of reasoning are not required for this post.

Frameworks

There are many different options available for probabilistic programming.

  • ‘Stan’ is a well-established framework, well supported in R with ‘RStan’ and in Python with ‘PyStan’.
  • ‘Pyro’ is for deep probabilistic models with a PyTorch backend. It is developed and maintained by the Uber Engineering division.
  • ‘TensorFlow Probability’ is built on TensorFlow and backed by Google.
  • ‘PyMC3’ is a Python API for probabilistic programming.

For example, with probabilistic programming, linear regression is transformed into the composition of multiple distributions. At first, this change may seem like overkill.

Why should everything be in terms of probabilities?

Well, for starters, when terms are defined probabilistically, you have the tools available to understand the model statistically. For example, the original formula for linear regression can be written as

Linear regression: Y = Xβ + ϵ

where Y is the output we want to predict, X is our independent data, the coefficients are defined with β, and the error term with ϵ. For standard linear regression, the error term is assumed to be normally distributed.

With a probabilistic approach, the model is expressed in terms of probability distributions.

Probabilistic linear regression: Y ~ N(Xβ, σ²)

This means the data is assumed to follow a normal distribution: the model’s outputs are the mean values of the linear model, with the variance of the distribution given by σ².

In the second formulation of linear regression, the terms are defined as probabilities. This formulation means that the weights are represented as distributions and that you can measure how likely different values of β are. So, for example, when there is less data, these weights are more uncertain. But notably, that uncertainty is represented within the model.
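To make this concrete, here is a minimal sketch of this probabilistic formulation in PyMC3 (the library used later in this post). The synthetic data, variable names, and prior choices (Normal priors on the weights, a Half-Normal prior on σ) are all illustrative assumptions, not prescriptions.

import numpy as np
import pymc3 as pm

# Synthetic data for illustration: 100 observations, 3 features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

with pm.Model() as linear_model:
    # Priors: the weights themselves are distributions, not point values
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    beta = pm.Normal("beta", mu=0, sigma=10, shape=3)
    sigma = pm.HalfNormal("sigma", sigma=1)

    # Likelihood: Y ~ N(intercept + Xβ, σ²)
    mu = intercept + pm.math.dot(X, beta)
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)

    # Draw posterior samples of the weights
    trace = pm.sample(2000, cores=2)

The posterior samples in ‘trace’ are exactly the distributions over β described above: with little data they are wide, and they tighten as more data arrives.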


Generalized Linear Models (GLMs)

Linear regression is a sum of several variables and weights followed by an error term. This error term is assumed to come from a Gaussian distribution.

However, as you probably already considered, the error term does not need to follow a Gaussian distribution.

In generalized linear models, the distributions do not need to be Gaussian. Additionally, while the terms within these models are always added linearly, a non-linear function can also be applied to the linear terms. Both these reasons give rise to the name ‘generalized’ linear models.

Formulation

These properties are beneficial when the outcome isn’t linear. For example, logistic regression is a linear model. However, for logistic regression, the linear terms are passed through the logistic function. The model also assumes a Bernoulli distribution to produce the final binary outcome.
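As a sketch of what this looks like probabilistically, the linear terms can be passed through the sigmoid (logistic) function and fed into a Bernoulli likelihood. Again, the synthetic data and priors below are illustrative assumptions.

import numpy as np
import pymc3 as pm

# Synthetic binary-outcome data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
p_true = 1 / (1 + np.exp(-(X @ np.array([1.5, -1.0]))))
y = (rng.random(200) < p_true).astype(int)

with pm.Model() as logistic_model:
    intercept = pm.Normal("intercept", mu=0, sigma=5)
    beta = pm.Normal("beta", mu=0, sigma=5, shape=2)

    # Logistic link: squash the linear terms into probabilities
    p = pm.math.sigmoid(intercept + pm.math.dot(X, beta))

    # Bernoulli likelihood produces the binary outcome
    pm.Bernoulli("y_obs", p=p, observed=y)

    trace = pm.sample(2000, cores=2)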

The formulation of GLMs reflects these properties:

Generalized linear model: Y ~ E(μ), with g(μ) = Xβ

In this formula, g is known as a ‘link’ function, which can be non-linear. The term E represents a distribution from the exponential family of distributions, which keeps the formulation general. The exponential family consists of many different distributions defined by a set of parameters; the normal distribution, for example, is a member.
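Another member of the exponential family makes the same point: count data can be modelled with a Poisson likelihood and a log link. This is only a sketch; the data and priors are made-up assumptions.

import numpy as np
import pymc3 as pm

# Synthetic count data for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
y = rng.poisson(np.exp(0.3 + X @ np.array([0.5, -0.2])))

with pm.Model() as poisson_model:
    intercept = pm.Normal("intercept", mu=0, sigma=2)
    beta = pm.Normal("beta", mu=0, sigma=2, shape=2)

    # Inverse of the log link: exponentiate the linear terms
    lam = pm.math.exp(intercept + pm.math.dot(X, beta))

    # Poisson likelihood, a member of the exponential family
    pm.Poisson("y_obs", mu=lam, observed=y)

    trace = pm.sample(2000, cores=2)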

As you might have determined, generalized linear models involve understanding the distributions of the underlying data well and are perfect for probabilistic programming.


Probabilistic Programming in Python

In Python, a probabilistic programming package called PyMC3 allows users to fit Bayesian models using various numerical methods. Probabilistic programming with PyMC3 applies to a wide range of problems. Critically, the functionality includes summarizing outputs and model diagnostics.

PyMC3 has many different tutorials and walkthroughs covering the functionality it offers. These examples offer a quick start to many different models defined probabilistically. PyMC3 supports many different models, analyses, and methods, such as autoregressive models, survival analysis, normalizing flows, Markov chain Monte Carlo sampling, Gaussian processes, and hierarchical linear regression, to name a few.

The focus for this post is on generalized linear models only.

Experiments

For this post, I will be using the Boston housing dataset, which is composed of 506 records. The task is to predict the house price using features such as the per capita crime rate, the proportion of non-retail business acres per town, and the full-value property-tax rate, among others.

To generate GLMs in PyMC3, the setup is relatively straightforward. However, the structure becomes more involved with more complex models (e.g. custom link functions and interaction terms), as sketched after the basic model below.

First, the data is placed into a dictionary with keys ‘x’ and ‘y’ for the input and output data, respectively.

import arviz as az
import matplotlib.pyplot as plt
import pymc3 as pm
import pandas as pd
from sklearn.datasets import load_boston

# Load the Boston housing data into a DataFrame
# (load_boston is deprecated in recent scikit-learn versions)
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['target'] = boston.target

# Split into inputs and output, then package them for the GLM formula
X = df.drop(['target'], axis=1)
y = df['target'].astype(float)
data = {'x': X, 'y': y}

Next, with PyMC3, each model specification is wrapped in a ‘with’ statement. Then the GLM model is specified, and the data is passed in. This step ensures all the parameters are added to the model. Finally, the model is fitted by drawing many posterior samples.

with pm.Model() as model:
    # Build the GLM from a patsy-style formula; default priors are set automatically
    pm.glm.GLM.from_formula("y ~ x", data)
    trace = pm.sample(3000, cores=2)  # draw 3000 posterior samples
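As mentioned above, more complex GLMs follow the same pattern. The sketch below is illustrative only: it reuses two of the loaded columns under hypothetical names ‘x1’ and ‘x2’, adds an interaction term with the ‘x1:x2’ formula syntax, and swaps the default Normal likelihood for a Student-t family, one of the families PyMC3 provides.

# Illustrative sketch: an interaction term and a non-default family
glm_data = pd.DataFrame({'x1': X['CRIM'], 'x2': X['ZN'], 'y': y})

with pm.Model() as robust_model:
    pm.glm.GLM.from_formula(
        'y ~ x1 + x2 + x1:x2',
        glm_data,
        family=pm.glm.families.StudentT(),
    )
    trace_robust = pm.sample(3000, cores=2)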

Model Analysis

After the model is generated, we can analyze the model. First, the parameters are modelled with distributions, providing us with the full posterior distribution of likely parameters.

# Plot the marginal posteriors (left) and sampling chains (right) for each parameter
az.plot_trace(trace)
plt.tight_layout()

Only the first three features from the Boston housing dataset are shown, along with the intercept.

  1. CRIM – per capita crime rate by town
  2. ZN – proportion of residential land zoned for lots over 25,000 sq. ft.
  3. INDUS – proportion of non-retail business acres per town
Marginal Posterior for CRIM, ZN, INDUS, and CHAS from the Boston Housing Dataset (Photo by Author)

The left side shows our marginal posteriors – for each parameter value on the x-axis, we get a density on the y-axis that tells us how likely that parameter value is.

There is something to take note of here. The sampling chains for the individual parameters (right side) appear to be well converged and stationary. There don’t appear to be any large drifts or other anomalies.
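Beyond eyeballing the chains, convergence can also be checked numerically. A small sketch using ArviZ (imported earlier): the r-hat statistic compares the sampling chains to one another, and values close to 1.0 support the convergence observed above.

# Numerical diagnostics per parameter: posterior mean, standard
# deviation, and the r_hat convergence statistic (values near 1.0 are good)
summary = az.summary(trace)
print(summary[['mean', 'sd', 'r_hat']])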

Conclusion

Probabilistic programming is an exciting area of Data Science.

When large complex models are used, interpretability and simplicity can be lost. While linear regression certainly has its limitations, generalized linear models offer an alternative approach.

When using probabilistic programming for these models, the uncertainty around the model can be measured effectively. This property leads to models that predict in an interpretable linear fashion and provide measures of uncertainty about the models themselves.


If you’re interested in reading articles about novel data science tools and understanding Machine Learning algorithms, consider following me on Medium.

If you’re interested in my writing and want to support me directly, please subscribe through the following link. This link ensures that I will receive a portion of your membership fees.

Join Medium with my referral link – Zachary Warnes

