Generalized Linear Models Decomposed

Using Maximum Likelihood and Gradient Descent to fit GLMs from scratch in Python

Daniel Friedman
Towards Data Science

--

In ordinary linear regression, we treat our outcome variable as a linear combination of several input variables plus some random noise, typically assumed to be Normally distributed. While this modeling approach is easily interpreted, efficiently implemented, and capable of accurately capturing many linear relationships, it does come with several significant limitations. A Generalized Linear Model (GLM), a simple extension of the linear model, relaxes some of linear regression’s strictest assumptions. These assumptions include:

  1. Linearity between the outcome and input variables
  2. Normal distribution of error terms
  3. Constant variance of error terms

Relaxing these assumptions allows us to fit much more flexible models to much broader data types.

GLMs can be easily fit with a few lines of code in languages like R or Python, but to understand how a model works, it’s always helpful to get under the hood and code it up yourself. This article shows how to implement GLMs from scratch using only Python’s NumPy package. For more on the basics and intuition on GLMs, check out this article or this book.

GLM Structure

Fitting a GLM first requires specifying two components: a random distribution for our outcome variable and a link function between the distribution’s mean parameter and its “linear predictor”.

The Random Component

The first step to building our GLM is identifying the distribution of the outcome variable. If the data has a binary response, we might want to use the Bernoulli or Binomial distributions. If we are working with count data, a Poisson model might be more useful. This distribution is typically assumed to come from the Exponential Family of distributions, which includes the Binomial, Poisson, Negative Binomial, Gamma, and Normal.

Each of these models can be expressed in terms of its mean parameter, 𝜇 = E(Y). For instance, we specify a binomial model as Y ~ Bin(n, p), which can also be written as Y ~ Bin(n, 𝜇/n). Once we estimate 𝜇, we model Y as coming from a distribution indexed by 𝜇̂ and our predicted value of Y is simply 𝜇̂.

The Link Function

In a GLM, we estimate 𝜇 as a non-linear function of a “linear predictor” 𝜂, which itself is a linear function of the data. The non-linear function connecting 𝜇 to 𝜂 is called the link function, and we determine it before model-fitting. The link function is written as a function of 𝜇, e.g. 𝜂 = g(𝜇).

Suppose we have training data (x_1, y_1), … , (x_N, y_N), where each x_n is a D-dimensional vector of inputs.

We first write 𝜂 as a linear function of x for each observation n = 1, … , N: 𝜂_n = β^T x_n, where β is a vector of coefficients we will estimate.

Then we connect 𝜂 to 𝜇 with the link function: g(𝜇_n) = 𝜂_n, or equivalently 𝜇_n = g⁻¹(𝜂_n).
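
To make the link function concrete, here is a small illustration (my own addition, not from the original article) of the three link / inverse-link pairs that appear later in this post, written as plain Python functions of 𝜇 and 𝜂 respectively.

import numpy as np

# Each entry maps a family to its (link, inverse link) pair: eta = g(mu), mu = g_inv(eta).
links = {
    'identity': (lambda mu: mu,                  lambda eta: eta),                   # Normal
    'logit':    (lambda mu: np.log(mu/(1 - mu)), lambda eta: 1/(1 + np.exp(-eta))),  # Bernoulli
    'log':      (lambda mu: np.log(mu),          lambda eta: np.exp(eta)),           # Poisson
}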

Fitting a GLM

To fit the GLM, we are actually just finding estimates for the βs: from these, we obtain estimates of 𝜂, which leads immediately to an estimate for 𝜇, which then gives us an estimated distribution for Y! To estimate the βs, follow these steps:

  1. Specify the distribution of Y as a function of 𝜇.
  2. Specify the link function, 𝜂 = g(𝜇).
  3. Identify a loss function. Here, we use the negative log-likelihood, which decomposes into a sum over observations, one term for each linear predictor 𝜂_n and its corresponding true value y_n.
  4. Find the β values that minimize the loss function, either through a closed-form solution or with gradient descent (a generic version of this loop is sketched in code right after this list).
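
Before specializing to particular models, here is a minimal sketch (my own, not from the original article) of what steps 2 through 4 look like as one generic gradient-descent loop. It assumes the canonical link for each family, in which case the gradient of the negative log-likelihood takes the common form X^T(𝜇 − y) (up to the constant variance term in the Normal case); the function name fit_glm_gd is hypothetical.

import numpy as np

def fit_glm_gd(X, y, inv_link, lr=1e-4, nruns=10**5):
    # Gradient descent on the negative log-likelihood of a canonical-link GLM.
    beta_hat = np.random.randn(X.shape[1])
    for _ in range(nruns):
        mu_hat = inv_link(X @ beta_hat)  # step 2: mean from the linear predictor
        grad = X.T @ (mu_hat - y)        # steps 3-4: gradient of the negative log-likelihood
        beta_hat -= lr * grad
    return beta_hat

# e.g. Poisson regression with the log link: beta_hat = fit_glm_gd(X, y, np.exp)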

Linear Regression — A Special Case of the GLM

To reinforce our understanding of this structure, let’s first write out a typical linear regression model in GLM format. As step 1, let’s specify the distribution of Y. Recall that a typical linear model assumes

Y_n ~ N(β^T x_n, σ²) for n = 1, … , N,

where β is a length-D vector of coefficients (this assumes we’ve added a 1 to each x so the first element in β is the intercept term). Note that the mean of this distribution is a linear combination of the data, meaning we could write this model in terms of our linear predictor by letting 𝜂_n = β^T x_n, so that Y_n ~ N(𝜂_n, σ²).

Then for step 2, we need to find the function linking 𝜇 and 𝜂. In the case of linear regression, it’s simple. Since E(Y) = 𝜇 and the mean of our modeled Y is 𝜂, we have 𝜂 = g(𝜇) = 𝜇!

Step 3: let’s find the negative log-likelihood. For the Normal model above it is

NLL(β) = (N/2) log(2πσ²) + (1/(2σ²)) Σ_n (y_n − β^T x_n)².

Note that, apart from constants that do not depend on β, our loss function is proportional to the sum of the squared errors.

Finally for step 4, let’s see if we can minimize this loss function analytically. Taking the derivative with respect to β and setting it equal to 0 gives the normal equations X^T X β = X^T y, and therefore

β̂ = (X^T X)⁻¹ X^T y.

This is the closed-form solution we know and love from ordinary linear regression.

Now for the simple coding. Let’s randomly generate some normally-distributed Y values and fit the model. The scatterplot below shows that our fitted values for β are quite close to the true values.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
np.random.seed(123)

## GENERATE STUFF ##
# Generate X
N = 1000 # number of observations
D = 10 # dimension of each observation
X = np.random.randn(N, D-1) # (minus 1 because we have to add the intercept)
intercept = np.ones(N).reshape(N, 1)
X = np.concatenate((intercept, X), axis = 1)
# Generate true beta
beta = np.random.randn(D)
# Generate response as function of X and beta
y = X @ beta + np.random.randn(N)
## FIT STUFF ##
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
## PREDICT STUFF ##
y_hat = X @ beta_hat
## PLOT ##
fig, ax = plt.subplots()
sns.scatterplot(x = beta, y = beta_hat, ax = ax)
ax.set(xlabel = 'True Betas', ylabel = 'Estimated Betas', title = 'Beta vs. Beta Hats');
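
As an aside, explicitly inverting X^T X works here, but it can be numerically fragile when columns of X are nearly collinear. A least-squares solver gives the same estimates more stably; the line below is my own addition, not part of the original article.

beta_hat_ls = np.linalg.lstsq(X, y, rcond=None)[0]  # same closed-form estimates, computed more stably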

Logistic Regression — A GLM for Binary Data

In logistic regression, we model our outputs as independent Bernoulli trials. I.e. we assume Y_n ~ Bernoulli(p_n), where p_n is the success probability for observation n.

That completes step 1. For step 2, we must find a way to relate our linear predictor 𝜂 to our parameter p. Since p is between 0 and 1 and 𝜂 can be any real number, a natural choice is the log-odds: 𝜂 = g(p) = log(p / (1 − p)).

Inversely, we use the sigmoid function (which I will call S) to get from 𝜂 to p: p = S(𝜂) = 1 / (1 + e^(−𝜂)).

This wraps up step 2. For step 3, we find the negative log-likelihood: NLL(β) = −Σ_n [ y_n log(p_n) + (1 − y_n) log(1 − p_n) ], where p_n = S(β^T x_n).

This gives us our loss function and finishes step 3. For step 4, we find the values of β that minimize this loss. Unfortunately, in the logistic regression case, there is no closed-form solution, so we must use gradient descent. So, let’s find the derivative of the loss function with respect to β. Using the fact that S′(x) = S(x)(1 − S(x)), the gradient works out to

∇NLL(β) = Σ_n x_n (p_n − y_n) = X^T (b − a),

where a_n = y_n (1 − p_n) and b_n = (1 − y_n) p_n, so that b_n − a_n = p_n − y_n. Writing the gradient this way lets us compute it with a single matrix multiplication in Python.

Now let’s fit the model using gradient descent. Again, the scatterplot below shows that our fitted values for β are quite close to the true values.

np.random.seed(123)

## GENERATE Ys ##
# Generate response as a function of the same X and beta,
# only now it's a Bernoulli r.v.

def sigmoid(x):
    return 1/(1 + np.exp(-x))
p = sigmoid(X @ beta)
y = np.random.binomial(1, p)
## FIT WITH GD ##
nruns = 10**5
learning_rate = 0.0001
beta_hat = np.random.randn(D) # initialize beta_hat estimates
p_hat = sigmoid(X @ beta_hat) # initialize p_hats
for run in range(nruns):
    # get gradient of the negative log-likelihood
    a, b = y*(1-p_hat), (1-y)*p_hat
    grad = X.T @ (b-a)
    # adjust beta hats
    beta_hat -= learning_rate*grad
    # adjust p hats
    p_hat = sigmoid(X @ beta_hat)
## PLOT ##
fig, ax = plt.subplots()
sns.scatterplot(x = beta, y = beta_hat, ax = ax)
ax.set(xlabel = 'True Betas', ylabel = 'Estimated Betas', title = 'Beta vs. Beta Hats');
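
One practical sanity check, not shown in the original article, is to evaluate the loss from step 3 after (or during) training: if gradient descent is working, the negative log-likelihood should end up small. The helper below is my own; the small eps term simply guards against log(0). For very large |𝜂|, the np.exp call inside sigmoid can also overflow, in which case scipy.special.expit is a numerically stable drop-in replacement.

def bernoulli_nll(y, p, eps=1e-12):
    # negative log-likelihood of the Bernoulli model from step 3
    return -np.sum(y*np.log(p + eps) + (1 - y)*np.log(1 - p + eps))

print(bernoulli_nll(y, p_hat))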

Poisson Regression — A GLM for Count Data

The Poisson is a great way to model data that occurs in counts, such as accidents on a highway or deaths-by-horse-kick.

Step 1: Suppose we have Y_n ~ Poisson(λ_n), where λ_n > 0 is the expected count for observation n.

Step 2, we specify the link function. The link function must convert a non-negative rate parameter λ to the linear predictor 𝜂 ∈ ℝ. A common choice is the log link, 𝜂 = g(λ) = log(λ), which of course has inverse λ = exp(𝜂).

Now for step 3, the negative log-likelihood is

NLL(β) = Σ_n [ λ_n − y_n log(λ_n) + log(y_n!) ], where λ_n = exp(β^T x_n).

And for step 4, we optimize for β. Taking the derivative with respect to β gives the gradient ∇NLL(β) = Σ_n x_n (λ_n − y_n) = X^T (λ − y).

Sadly, there is no closed-form solution, so again we turn to gradient descent, as implemented below. Once again, the estimated parameters are plotted against the true parameters and once again the model does pretty well.

np.random.seed(123)

## GENERATE Ys ##
# Generate response as a function of the same X and a new, smaller beta,
# only now it's a Poisson r.v.
beta = np.random.randn(D)/2
lam = np.exp(X @ beta)
y = np.random.poisson(lam = lam)
## FIT WITH GD ##
nruns = 10**5
learning_rate = 0.00001
beta_hat = np.random.randn(D)/2 # initialize beta_hat estimates
lam_hat = np.exp(X @ beta_hat) # initialize lam_hats
for run in range(nruns):
    # get gradient of the negative log-likelihood
    c = lam_hat - y
    grad = X.T @ c
    # adjust beta hats
    beta_hat -= learning_rate*grad
    # adjust lambda hats
    lam_hat = np.exp(X @ beta_hat)
## PLOT ##
fig, ax = plt.subplots()
sns.scatterplot(x = beta, y = beta_hat, ax = ax)
ax.set(xlabel = 'True Betas', ylabel = 'Estimated Betas', title = 'Beta vs. Beta Hats');
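
As a final check (my own addition, not part of the original article), we can compare the gradient-descent estimates against what statsmodels produces for the same Poisson model; the two sets of coefficients should agree closely.

import statsmodels.api as sm

sm_poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(sm_poisson.params)  # should be close to beta_hat from the loop above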

When building GLMs in practice, R’s glm command and statsmodels’ GLM function in Python are easy to use and efficiently implemented. But becoming familiar with the structure of a GLM is essential for parameter tuning and model selection. When it comes to modeling, often the best way to understand what’s underneath the hood is to build the car yourself.
