Producing insights with Generalized Additive Models (GAMs)

Learn how to interpret GAMs and extract useful insights from data

Alvaro Peña
Towards Data Science


Photo by A. L. on Unsplash

Today we are going to learn how to use Generalized Additive Models to predict the number of bicycles rented in Washington D.C. between 2011 and 2012. The dataset was provided by the bike-sharing company Capital Bikeshare. Bike-sharing systems are a new generation of service that lets users pick up and drop off bicycles at convenient locations, promoting zero-emission transportation with positive effects on traffic, the environment, and health.

What are GAMs?

“A generalized additive model is a generalized linear model with a linear predictor involving a sum of smooth functions of covariates” (Wood, 2017).

GAMs work much like linear regression: the prediction is a sum of weighted terms built from the covariates. The biggest difference is that these terms can represent flexible, non-linear relationships instead of purely linear ones, and a link function connects that sum to the target variable.

We can use this type of model for many applications: modeling the spread of a disease over time with the Poisson distribution, predicting whether a patient has a disease from numeric/categorical variables with the Binomial distribution (logistic regression), or studying the spatial behavior of a species over an area using latitude and longitude data. GAMs are a versatile framework that can be applied to almost any field.

In our context, we could create a model that explains how the number of rented bikes changes as a flexible function of time, humidity, and temperature. We will also assume that the number of rented bikes follows a Normal distribution.

We can write our model formally as:

cnt = β0 + f1(time) + f2(hum) + f3(temp) + ε,  where ε ~ Normal(0, σ²)

The mathematical expression of an example GAM

The functions f1, f2, and f3 let us model flexible relationships between the target and the explanatory variables, and the sum of those estimated effects is the predicted number of rented bikes. Before training a model, we'll describe the dataset and run some exploratory data analysis to decide which explanatory variables to use.

Load libraries

# Basic wrangling functions
library("dplyr")
# Beautiful plots and data exploration
library("ggplot2")
# Comparing variables and data exploration
library("GGally")
# Library to fit gams
library("mgcv")
# Modern library for visualizing and assessing gams
library("gratia")

The dataset

Let’s download the dataset from the Interpretable Machine Learning book’s GitHub repository. The data has been preprocessed and is ready for us to work with. The variables are:

  • season: Season of the year.
  • holiday: Whether the day was a holiday.
  • workingday: Whether the day was a working day, i.e., neither a weekend nor a holiday.
  • weathersit: The weather situation of that day, in three categories.
  • temp: Temperature in degrees Celsius.
  • hum: Relative humidity percentage.
  • windspeed: Wind speed in km/h.
  • days_since_2011: A timestep variable to account for the passing of time.
  • cnt: The number of bicycles rented.
  • weekday: Day of the week.

We are going to use the function below and summarize the variables of interest.

get.bike.data = function(){
  url = "https://raw.githubusercontent.com/christophM/interpretable-ml-book/master/data/bike.csv"
  # Download the file and save it as bikes.csv in the current folder
  # (download.file returns a status code, so there is no need to assign it)
  download.file(url, "bikes.csv")

  # Read the file and return it as a data frame
  read.csv("bikes.csv", stringsAsFactors = TRUE)
}

# Relevant variables
variables.of.interest = c('season','holiday', 'workingday', 'weathersit', 'temp', 'hum', 'windspeed', 'days_since_2011', "cnt", "weekday")

# Read data and extract variables of interest (all_of() avoids dplyr's
# warning about selecting with a character vector)
bikes = get.bike.data() %>% dplyr::select(all_of(variables.of.interest))

# Summarise data
summary(bikes)
The output of the summary function

Summarized with a single function call, we can see the categorical and numerical variables recorded for each day. We have 730 entries in our dataset, which is about two years' worth of data.

Data exploration

Let’s start analyzing the numeric variables to check if they have any effect on the number of rented bikes.

# Compare each numeric variable against the others
ggpairs(bikes %>% select(c(temp, hum, windspeed, cnt))) +
  labs(subtitle = "Numeric variable exploration") +
  theme_minimal()  # the original my_theme() is a custom theme not defined in the article
Fig 1. Checking variable correlation

We used the ggpairs function to create this graphic where we can visualize what variables influence each other and spot patterns. In broad terms, we can see that temperature positively impacts the number of bikes rented (see the plot in the bottom-left corner). Let’s investigate the effect of temperature with the rest of the categorical variables that explain the weather situation and days.

Fig 2. Weather situation by season and its effect on the number of rented bikes

Regardless of the season, when the weather is good the number of rented bikes is higher than when the weather is bad. During winter, for instance, all three weather situations occur, but on average more bikes are rented when the weather is not bad. The number of rented bikes is also significantly higher during summer than during winter. Perhaps we should include the weather situation in our model.
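The article does not show the code behind Fig 2; a sketch that could produce a similar plot (assuming the bikes data frame built above) is:

```r
# Boxplots of rentals by weather situation, one panel per season
# (a sketch; not the article's original plotting code)
library(ggplot2)

ggplot(bikes, aes(x = weathersit, y = cnt, fill = weathersit)) +
  geom_boxplot() +
  facet_wrap(~ season) +
  labs(title = "Rented bikes by weather situation and season",
       x = "Weather situation", y = "Number of rented bikes")
```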

Fig 3. Effect of temperature within seasons and weather situations

The left plot confirms our findings, suggesting that good weather contributes significantly to the number of rented bikes. The right plot shows that temperature has a positive effect in colder seasons like fall and winter. Interestingly, we observe a slight negative trend when temperatures approach 30 degrees Celsius during summer, suggesting that some cyclists prefer not to ride in high heat.
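Again, the original code for Fig 3 is not shown; the two panels could be sketched along these lines, with geom_smooth supplying the trend curves:

```r
# Rentals vs. temperature with a smoother per weather situation (left panel)
# and per season (right panel); a sketch, not the article's original code
library(ggplot2)

ggplot(bikes, aes(x = temp, y = cnt, colour = weathersit)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  labs(x = "Temperature (Celsius)", y = "Number of rented bikes")

ggplot(bikes, aes(x = temp, y = cnt, colour = season)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  labs(x = "Temperature (Celsius)", y = "Number of rented bikes")
```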

Training a Generalized Additive Model (GAM)

To train/fit a GAM in R, we’ll use the mgcv library. It’s a powerful and well-maintained library.

M = gam(cnt ~ season + weathersit + s(days_since_2011, bs = "cr", k = 70) +
          s(temp, bs = "cr", by = season, k = 15), data = bikes)

Our model will predict the number of rented bikes taking into account the following:

  • weathersit: A separate intercept is estimated for each weather situation. What is an intercept? The average number of rented bikes attributable to a given weather situation when all other terms are zero.
  • s(days_since_2011, bs = "cr", k = 70): This term means we use a cubic regression spline (smooth function) to estimate how the number of rented bikes changes over time. The value of k is the basis dimension of the function, which sets an upper limit on how flexible it can be. This will become clear later.
  • s(temp, bs = "cr", by = season, k = 15): This term estimates the effect of temperature on the number of rented bikes separately for each season. Because k is smaller, it is not as flexible as the term above.
  • season: A different intercept for each season, needed because we estimate the effect of temperature by season.

The sum of all these terms will result in the predictions of our model.
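To see that sum in action, we can ask the fitted model for predictions on hypothetical days. This is only a sketch: the factor labels (e.g. "SUMMER", "GOOD") are assumed to match the levels in the downloaded CSV, and the numeric values are made up for illustration.

```r
# Predict rentals for two hypothetical days (illustrative values;
# factor labels assumed to match the levels in the downloaded data)
new.days = data.frame(
  season = factor(c("SUMMER", "WINTER"), levels = levels(bikes$season)),
  weathersit = factor(c("GOOD", "MISTY"), levels = levels(bikes$weathersit)),
  days_since_2011 = c(550, 700),
  temp = c(28, 5)
)
predict(M, newdata = new.days, se.fit = TRUE)
```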

Summarizing and checking our model

# Summarize model
summary(M)
Fig 4. Model summary

For starters, the model summary in Fig 4. tells us the distribution and link function of the target variable: in our case, Gaussian (Normal) with the identity link (the variable was unchanged). Next come the parametric coefficients, the intercepts we calculated for each season and each weather situation (also called parametric effects). In the following block, the effective degrees of freedom (edf) of our smooth functions are presented; these values tell us how flexible the fitted relationships are (essentially how much they differ from a linear relationship, which would be edf = 1). Finally, we have the deviance explained, which in our case is 88%: our model accounts for 88% of the variation in the data.
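These quantities can also be pulled out of the summary object directly, which is handy when comparing models:

```r
# Extract headline fit statistics from the summary object
s = summary(M)
s$dev.expl  # proportion of deviance explained (about 0.88 here)
s$r.sq      # adjusted R-squared
s$edf       # effective degrees of freedom of each smooth term
```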

# Checking k-value and edf
k.check(M)
Fig 5. Checking the model’s flexibility

The next bit of model checking in Fig 5. concerns the basis dimension of the smooth functions. For instance, the smooth that captures the relationship between the number of bikes and time was given k = 70; one basis function is absorbed by the identifiability constraint, leaving 69 to model the relationship. The edf tells us how much of that available flexibility the fitted function actually used. As a rule of thumb, we want the edf to sit comfortably below k' and the k-index to be close to or above 1, with a p-value that is not small. In our case the k-index and p-value fall short of those targets, but the edf show that we have given the functions ample flexibility.
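If k.check() did suggest the basis was too small (edf close to k', a k-index well below 1 with small p-values), the usual remedy is to refit with a larger k and compare. The refit below is hypothetical; M2 and the value k = 100 are just for illustration:

```r
# Refit with a more generous basis for the long-term trend, then recheck
M2 = gam(cnt ~ season + weathersit +
           s(days_since_2011, bs = "cr", k = 100) +
           s(temp, bs = "cr", by = season, k = 15),
         data = bikes)
k.check(M2)
AIC(M, M2)  # did the extra flexibility actually improve the model?
```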

Interpreting smooth effects

This section will show how the smooth functions model complex relationships between the explanatory variables and the number of rented bikes. The graphics shown below can be created using the following function:

# Plot smooth and parametric effects
draw(M, parametric = TRUE)
Fig 6. Non-linear relationship between time and rented bikes

One important thing to note is that all partial effect plots are centered on the mean, and the shaded area represents the 95% confidence interval. The increase or decrease shown on the y-axis is therefore relative to the average predicted number of rented bikes. For instance, one could interpret Fig 6. by stating that during the first half of the period the number of bikes was below average. We can also see that the relationship is quite flexible, thanks to the roughly 30 effective degrees of freedom the function used.
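gratia can also draw one term at a time, which is useful when a model has many smooths; select takes the smooth's label:

```r
# Plot only the long-term time trend from the fitted model
draw(M, select = "s(days_since_2011)")
```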

Fig 7. Effect of temperature on rented bikes during fall

The effect of temperature in Fig 7. does not have a lot of flexibility, but we can clearly spot some “wiggliness”. One important aspect to note is the rug plot on the x-axis, which shows where the data points lie. Where there are no data points, the confidence intervals grow considerably larger because we are not certain about the relationship. We can interpret this plot by stating that during the fall season an increase in temperature increases the number of rented bikes. More precisely, when the temperature is about 5 degrees Celsius, the partial effect is about −1200, that is, 1200 bikes below the average.

Fig 8. Effect of temperature on rented bikes during spring

During spring, temperature behaves differently: the predicted number of bikes rises sharply as the temperature goes from 4 to 20 degrees Celsius, but beyond that the predicted number starts to drop. We need to interpret the higher temperatures carefully; the rug plot indicates there are few data points there, so with more data that portion of the curve might well flatten out.

Fig 9. Effect of temperature on rented bikes during summer

Summer is very interesting because we see a decline during warmer days. In Fig 9. we can see that temperatures above 30 degrees Celsius have a negative effect on the predicted number of rented bikes, about 2500 below the predicted average. It could be that cyclists are not keen on cycling on really warm days.

Fig 10. Effect of temperature on rented bikes during winter

Finally, during the winter season, more bikes are predicted to be rented as the temperature rises. Note that Fig 10. shows an almost linear relationship, corroborated by the 1.24 edf in the model summary. Although it is certainly not as flexible as the other smooth functions, it seems appropriate to model temperature by season. We can interpret this plot by stating that at 15 degrees Celsius during winter the predicted number of bikes is about the same as the average.

Next, we are going to interpret the estimated intercepts, or parametric effects.

Interpreting parametric effects

Fig 11. Effect of season of the year on rented bikes

Fig 11. shows the estimated parametric effects with 95% confidence intervals. We can interpret this plot by stating that during spring the predicted number of bikes is 1489 lower than during fall, the reference season. We are confident about this statement because the confidence interval does not include zero, our point of comparison. We can make the same statement about the effect of winter on the predicted number of bikes. On the other hand, the effect of summer is not significant: its confidence interval is quite wide and includes zero, indicating that during summer our model predicts a quantity of rented bikes that it could equally predict in fall. This yields an insight: the number of rented bikes rises not because of the summer season itself, but because cold and mild temperatures get warmer.
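The numbers behind Fig 11 come from the parametric coefficient table, and we can inspect them with rough 95% intervals directly:

```r
# Parametric coefficients with approximate 95% confidence intervals
pt = summary(M)$p.table
cbind(estimate = pt[, "Estimate"],
      lower = pt[, "Estimate"] - 1.96 * pt[, "Std. Error"],
      upper = pt[, "Estimate"] + 1.96 * pt[, "Std. Error"])
```

An interval that excludes zero corresponds to a "significant" shift relative to the reference level, which is exactly how Fig 11 is read above.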

Fig 12. Effect of the weather situations on rented bikes

Finally, Fig 12. tells us that when the weather is misty or there is rain/snow/storm, the change in predicted bikes is significant (the confidence intervals do not include zero). In this case the effect is negative: the predicted number of rented bikes on a misty day is 690 lower than on a day with good weather.

Conclusion

We trained a Generalized Additive Model to predict the number of rented bikes in Washington D.C. based on variation over time, temperature effects within each season, and the overall weather situation. We learned how to read the model summary and check the effective degrees of freedom for a correct fit. Finally, we interpreted the smooth and parametric effects of the model to understand what drives the number of rented bikes, and obtained the following insights:

  • Cyclists tend to avoid renting bikes when temperatures get too warm during the summer (above ~25 degrees Celsius).
  • The rise in rented bikes occurs during mild seasons (fall and winter) as temperatures start to get warmer.
  • We have high confidence that rain/snow is a significant factor that negatively influences the number of rented bikes.

In conclusion, GAMs are a powerful machine learning framework that is highly interpretable and can be used to extract insights from data: turning noise into sentences.

All images unless otherwise noted are by the author.

References

Wood, S. N. (2017). Generalized Additive Models: An Introduction with R (2nd ed.). Chapman and Hall/CRC.

Code

Can be found in my GitHub repo:

https://github.com/alvarofps/Producing-Insights-with-GAMs
