And why you need to know about it…

Introduction
Linear Models are considered the Swiss Army Knife of models. There are many adaptations we can make so that the model performs well across a variety of conditions and data types.
Generalised Additive Models (GAMs) are an adaptation that allows us to model non-linear data while maintaining explainability.
Table of Contents
- What is a GAM?
- Dataset
- Estimating non-linear functions with Linear Regression
- How do GAMs work?
- Applying GAMs to the Bike Dataset
- How do GAMs Work Part II
- Implementing a Linear GAM using PyGAM
- Conclusion
What is a GAM?
A GAM is a linear model with one key difference compared to Generalised Linear Models such as Linear Regression: a GAM is allowed to learn non-linear relationships between each feature and the target.
GAMs relax the restriction that the relationship must be a simple weighted sum, and instead assume that the outcome can be modelled by a sum of arbitrary functions of each feature.
To do this, we simply replace beta coefficients from Linear Regression with a flexible function which allows nonlinear relationships (we’ll look at the maths later).
This flexible function is called a spline. Splines are flexible functions that allow us to model a non-linear relationship for each feature, and the sum of many splines forms a GAM. The result is a highly flexible model which still retains some of the explainability of a linear regression.
Let’s understand how we can model non-linear features without GAMs.
Dataset
This post will use the classic Bikeshare dataset which can be downloaded from Kaggle. This dataset tracks bike rental data in Washington D.C.
The notebook for this post is here:
Estimating non-linear functions with Linear Regression
The world is not linear. This means Linear Regression will not always represent what we see in reality. Sometimes a linear relationship is a good enough estimate but often it isn’t.
Let’s look at median bike rentals per hour of the day…
As expected, this is not linear. So what happens if we fit a simple Linear Regression?
Again, as expected, this line doesn’t make much sense. It doesn’t capture our relationship and we can’t really use this model.
It is worth noting that I did cheat a little bit: I didn’t encode hour as a categorical feature, meaning the model treated each value as increasing. However, I think my point still stands.
Let’s try and model this as a non-linear relationship. To do this, we create polynomial features using the hour variable, e.g. hour², hour³, etc. As the order of polynomials gets higher, we use more variables, so for an order of 5 we use x, x², x³, x⁴ and x⁵.

In this example I have trained the model on hours 0–21 and predicted hours 22-23.
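That expansion-and-fit step can be sketched with NumPy. The `median_rentals` values below are illustrative stand-ins for the real Bikeshare medians, not the actual data.

```python
import numpy as np

# Illustrative median rentals per hour (NOT the real Bikeshare values):
# two bumps, one for the morning commute and one for the evening.
hours = np.arange(24)
median_rentals = (
    200 * np.exp(-(hours - 8) ** 2 / 4)     # morning peak
    + 250 * np.exp(-(hours - 17) ** 2 / 6)  # evening peak
    + 20
)

train, test = hours[:22], hours[22:]  # train on hours 0-21, predict 22-23

def poly_features(x, order):
    """Basis expansion: columns x^1 ... x^order (scaled to keep powers stable)."""
    x = x / 23.0  # scale to [0, 1] so high powers stay numerically sane
    return np.column_stack([x ** p for p in range(1, order + 1)])

for order in (1, 3, 10):
    # Least-squares fit with an explicit intercept column
    X = np.column_stack([np.ones(len(train)), poly_features(train, order)])
    coefs, *_ = np.linalg.lstsq(X, median_rentals[:22], rcond=None)
    # Extrapolate to the held-out hours 22 and 23
    X_test = np.column_stack([np.ones(len(test)), poly_features(test, order)])
    preds = X_test @ coefs
    print(f"order {order}: predictions for hours 22-23 = {preds.round(1)}")
```

The higher-order fits track the in-sample curve more closely, but the extrapolated points are where the oscillation at the edges shows up.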
We can see that the x³ model does a better job than the linear model, and the higher-order models start to simulate our relationship quite well, picking out the peaks in the morning and afternoon.
However, if we look at the predicted points (hours 22 and 23)… Disaster! We’ve encountered Runge’s phenomenon.
Runge’s phenomenon tells us that when we use high-order polynomials, the edges of the function can oscillate to extreme values, meaning polynomial features don’t always lead to better predictions.
If we look at our x¹⁰ model, we can see that the orange line shoots into the stratosphere as soon as it predicts something. This is bad. Not only is it inaccurate, it highlights that we have no idea what this model will do on unseen data. This is a huge risk if we wanted to deploy this model and makes this simple model almost completely uninterpretable.
So how do we model this non-linear relationship with a simple model?
Enter the GAM.
How do GAMs Work?
From Lines to Splines
Let’s go back to our Linear Regression for a minute. The equation is defined by a linear combination of variables: each variable is given a weight, β, and the weighted terms are added together.
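The equation image from the original post is missing here; in standard notation, the linear regression just described is:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
```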

In GAMs, we drop the assumption that our target can be calculated from a linear combination of variables, and instead allow each variable to contribute through a non-linear function, denoted by s, for ‘smooth function‘.
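The corresponding GAM equation (reconstructing the missing image) replaces each weighted term with a smooth function of that variable:

```latex
y = \beta_0 + s_1(x_1) + s_2(x_2) + \dots + s_p(x_p)
```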

But what is s? We define it with the equation below. Here we see β coming back, and it represents the same thing: a weight. Our other term, b, is a basis expansion. A basis expansion is what we did earlier with polynomials, taking x⁰, x¹, x², etc. There are other basis functions, and they can be multi-dimensional.
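Reconstructing the missing equation image, each smooth function is a weighted sum of K basis functions:

```latex
s_j(x_j) = \sum_{k=1}^{K} \beta_{jk} \, b_k(x_j)
```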

The great thing about this is that we can have k weights and functions per variable in our equation. This is much more flexible and much less linear than our linear regression.
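This weighted sum of basis functions can be sketched directly in NumPy. The basis below is a simplified truncated power basis with hypothetical weights, just to build intuition; it is not the penalised B-spline basis PyGAM actually uses.

```python
import numpy as np

hours = np.linspace(0, 23, 100)

# A toy basis: global cubic terms plus truncated cubics that "switch on"
# at each knot, so each extra function only affects part of the range.
knots = [4, 8, 12, 16, 20]
basis = [np.ones_like(hours), hours, hours ** 2, hours ** 3]
basis += [np.clip(hours - k, 0, None) ** 3 for k in knots]
B = np.column_stack(basis)  # one column per basis function

# Hypothetical weights; a fitted GAM would learn these from the data.
rng = np.random.default_rng(0)
weights = rng.normal(size=B.shape[1])

# The model's output is just the weighted sum of the basis functions.
prediction = B @ weights
print(prediction.shape)
```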
This smooth function is also known as a spline. Unfortunately, splines are hard to pin down with a single definition; they are essentially polynomial functions that each cover a small range. Splines are easier to understand if we visualise them. Here’s an example of 4 splines, from the GAM we will fit shortly!

As you can see, these are smooth functions with a small range. They can vary in appearance and even be linear. Rather than go through more theory, let’s apply GAMs to our problem and try to understand splines.
"There is a confusing number of ways to define these spline functions. If you are interested in learning more about all the ways to define splines, I wish you good luck on your journey"
- Christoph Molnar, Interpretable Machine Learning
Check out Christoph’s book here
Applying GAMs to the Bike Dataset
Let’s go ahead and fit a GAM to the previous problem.
As you can see, the GAM does a much better job of estimating our function. It follows the curve, and on our predicted points (hours 22 and 23) there is no sign of Runge’s phenomenon.
So how does this work? We’ve used 12 splines. Remember the 4 splines we viewed above? We can look at all 12 across the feature space.
This doesn’t look like it relates to our curve, this is because each spline function also has a weight. We can multiply the function output by the coefficients to understand what the model is doing across our 24 hours.
Now this looks more like our curve! Hopefully, it is intuitive as to why. The curve is just the sum of our individual splines! Don’t forget our intercept term in green along the top.
This is a problem with a single variable, but GAMs can easily be applied to multiple variables. We can even select the number of splines per variable – we don’t have to use the same number for every one. We can also program interactions between variables manually.
How do GAMs Work Part II
There is a lot more to understand about GAMs, which I will tackle in a later article. We’ve only looked at one variable here; GAMs work just fine with many, but we’ll cover that later on. Here’s an outline of the main concepts.
Wiggliness
Wiggliness is literally how wiggly our line is – and yes, that is the correct term for it! The more splines we include, the more wiggly our line gets with respect to our feature. The issue is that the line will start to overfit our data. We need to find the right number of splines so the model can learn the problem but still generalise well.
Preventing Overfitting
Luckily, we don’t just have to guess the number of splines. We have another parameter called lambda, λ. This penalises our splines. The higher lambda is, the less wiggly our line will be, until it reaches a straight line.
In the diagram below, we can see using a lot of splines and low λ leads to a very wiggly line.

A general rule of thumb is to use a high number of splines and cross-validate over lambda (λ) values to find the model that generalises best. Remember, we can have different numbers of splines and different lambda values for every variable in our model.
Link Functions
Much like regular Generalised Linear Models, GAMs can use link functions to suit different problems: the logit function for classification problems, or log for a log transformation.
Distributions
We can also select different distributions, such as Poisson, Binomial or Normal.
Tensor Products
We can program interactions into our GAM. This is known as a tensor product. This way we can model how variables interact with each other, rather than just considering each variable in isolation.
Implementing a Linear GAM using PyGAM
From my research, it seems that the mgcv package in R is the best for GAMs. However, I prefer Python; the two best options are Statsmodels and PyGAM.
Here’s how to fit a GAM using PyGAM. This assumes your data has been cleaned and preprocessed, is ready to model, and is already split into training and test datasets.
import numpy as np
import pandas as pd
from pygam import GAM, LinearGAM, s, f, te

# your training and test datasets should be split as X_train, X_test, y_train, y_test
n_features = 1  # number of features used in the model
lams = np.logspace(-5, 5, 20) * n_features
splines = 12  # number of splines we will use

# linear GAM for regression; gridsearch cross-validates over the lambda grid
gam = LinearGAM(
    s(0, n_splines=splines)
).gridsearch(
    X_train.values,
    y_train.values,
    lam=lams,
)
gam.summary()
print(gam.score(X_test.values, y_test.values))
There is a lot more depth to PyGAM as it offers a variety of GAM types. Check out the PyGAM documentation to take a look at the other types of GAM (e.g. for classification) and the different plot types available.
Conclusion
And that is a whistle-stop tour of GAMs! Hopefully you now know what a spline is and how we can use them to model non-linear data. Because we can understand how our model will react to unseen data and we have to include interactions explicitly, GAMs are considered relatively interpretable.
GAMs are best when we need an interpretable model for non-linear data.
In my next post on GAMs, we will look at Wiggliness, Overfitting, Distributions, Link Functions and Tensor Products through the lens of a more complex classification problem.