
Introduction to Generalized Linear Models

Expand your modelling skills beyond linear regression

Background

Linear regression is by far the most common algorithm we learn in Data Science. Every practitioner has heard of it and used it. However, for some problems it is not suitable, and we need to ‘generalise’ it. This is where generalized linear models (GLMs) come in: they give your regression modelling greater flexibility and are an invaluable tool for data scientists to know about.

What Are GLMs?

As we said above, GLMs ‘generalise’ ordinary linear regression, but what do we really mean by that?

Let’s consider the simpler linear regression model:
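
y = β₀ + β₁x + ε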

Where β are the coefficients, x is the explanatory variable and ε are the normally distributed errors.

Let’s say we want to model how many claim calls an insurance company gets in an hour. Would linear regression be a suitable model for this problem?

No!

The reasons are:

  • Linear regression assumes normally distributed errors, and the normal distribution can take on negative values. However, we can’t get a negative number of claim calls.
  • The normal distribution, and hence linear regression, is continuous, whereas claim calls are discrete integers; we can’t get 1.1 calls.

Therefore, the linear regression model can’t correctly handle this problem. However, we can generalise the regression model to a probability distribution that meets the requirements specified above. In this case, it would be the Poisson distribution (more on this later).

GLMs then simply provide a framework for how we can link our inputs to the desired outputs of the target distribution. They help unify many regression models under one ‘mathematical umbrella.’

Theoretical Framework

Overview

The basis of GLMs relies on three key components:

  • A linear predictor built from the explanatory variables.
  • A link function that connects the linear predictor to the mean of the target distribution.
  • A target distribution that belongs to the exponential family.

We will now run through what each of these components means.

Linear Predictors

This is the simplest one to understand. A linear predictor, η, just means we have a linear sum of the inputs (explanatory variables/covariates), x, multiplied by their corresponding coefficients, β:
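
η = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ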

Link Function

The link function, g, is literally responsible for ‘linking’ the linear predictor to the mean response of our target distribution, μ:
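
g(μ) = η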

Exponential Family

Overview

A requirement for GLMs is that the target distribution of the output needs to be part of the exponential family. This family contains many famous distributions that you have probably heard of, such as the Poisson, Binomial, Gamma, and Exponential distributions.

In the GLM framework we actually use the exponential dispersion model, which is a further generalisation of the exponential family.

In order to belong to the exponential family, the probability density function (PDF) or probability mass function (PMF) needs to be refactored and parameterised into the form shown below:
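
f(y; θ, ϕ) = exp[ (yθ − b(θ)) / a(ϕ) + c(y, ϕ) ]

Here b(θ), a(ϕ) and c(y, ϕ) are known functions that are specific to each distribution.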

This form is chosen for statistical convenience, but we don’t need to worry too much about why this is the case in this article.

Notice there are two parameters: θ, which is the natural or canonical parameter that relates the inputs to the outputs, and ϕ, which is the dispersion parameter.

Another cool fact is that the distributions in the exponential family all have conjugate priors. This makes them useful for Bayesian problems. If you want to learn more about conjugate priors, check out my article on it here:

Bayesian Conjugate Priors Simply Explained

Canonical link function

There is something called the canonical link function, which is given by:
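
g(μ) = θ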

So, if we can describe θ in terms of μ, then we have derived the natural link function for our target distribution!

Mean and Variance

It can be mathematically shown that the mean, E(Y), of the exponential family is given by the following:
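
E(Y) = b′(θ)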

Likewise, the variance, Var(Y), is given by:
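
Var(Y) = b″(θ) a(ϕ)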

If you want to see the proof of this derivation, refer to page 29 in the following linked book. In general, the result follows from taking derivatives of the log-likelihood function with respect to θ.

Poisson Regression Example

Poisson Distribution

The Poisson distribution is a famous discrete probability distribution that models the probability of an event happening a specific number of times, given a known mean rate of occurrence. Check out my previous post if you want to learn more about it here:

Predicting the Unpredictable: An Introduction to the Poisson Distribution

Its PMF is given by:
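
P(X = x) = λ^x · e^(−λ) / x!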

Where:

  • e: Euler’s number (~2.718)
  • x: Number of occurrences (≥ 0)
  • λ: Expected number of occurrences (≥ 0); this is also the mean, written μ in GLM notation

In Exponential Form

We can write the above Poisson PMF in exponential form by taking the natural log of both sides:
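
ln P(X = x) = x ln(λ) − λ − ln(x!)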

Then, we exponentiate both sides (raise e to the power of each side):
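
P(X = x) = exp[ x ln(λ) − λ − ln(x!) ]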

And voila, the Poisson PMF is now in exponential form!

By matching terms between the above equation and the exponential family form, we find the following:
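
  • θ = ln(λ), so b(θ) = e^θ = λ
  • a(ϕ) = 1, i.e. the dispersion parameter is ϕ = 1
  • c(x, ϕ) = −ln(x!)

(Here x plays the role of the response y in the exponential family form.)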

Therefore, the mean and variance of the Poisson distribution are:
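
E(X) = b′(θ) = e^θ = λ = μ

Var(X) = b″(θ) a(ϕ) = e^θ = λ = μ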

This is a known result for the Poisson distribution, and we have just derived it in a different way!

Poisson GLM

The canonical link function for the Poisson distribution is then given by:
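
g(μ) = ln(μ)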

Therefore, the Poisson regression equation is:
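
ln(μ) = β₀ + β₁x₁ + … + βₚxₚ

or, equivalently:

μ = exp(β₀ + β₁x₁ + … + βₚxₚ)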

We can verify that the output of this equation can only be positive, since the exponential function is always positive, so it satisfies the requirement of our problem of predicting the number of claim calls an insurance company receives.

You can then solve for the coefficient estimates through maximum likelihood estimation or iteratively reweighted least squares.
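
To make this concrete, below is a minimal sketch of fitting a Poisson GLM in Python with the statsmodels library. The data is synthetic and the variable names and simulated relationship are purely illustrative.

```python
# Minimal sketch: fitting a Poisson GLM with statsmodels on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# One illustrative explanatory variable, e.g. hour of the day
x = rng.uniform(0, 24, size=500)
X = sm.add_constant(x)  # adds the intercept column

# Simulate claim-call counts from a known log-linear relationship
true_beta = np.array([0.5, 0.1])
mu = np.exp(X @ true_beta)
y = rng.poisson(mu)

# Fit the Poisson GLM; the log link is the canonical (and default) link
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()  # fitted via iteratively reweighted least squares
print(results.summary())
```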

What’s The Point?

You may be wondering why I have just taken you through all this arduous maths. Well, let me quickly summarise the key take-home messages:

  • It is paramount to check the requirements of your problem and your target distribution to avoid nonsensical results.
  • GLMs provide a mathematical first-principles approach to how you can link your input to your desired output for that specific problem.

Summary & Further Thoughts

The standard linear regression model is powerful, but it is not suitable for all types of problems, such as those where the output must be non-negative. For these problems, we must use other distributions, like the Poisson, and GLMs provide a framework for carrying out this process. They do this by deducing a link function from first principles, which enables you to link your inputs to your desired target output distribution. GLMs are a powerful modelling tool that most data scientists should at least be aware of due to their versatility.

Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.

Dishing The Data | Egor Howell | Substack

Connect With Me!

References & Further Reading

