
Introduction
I recently had to fix the fence in my back yard. It’s old, wooden, and has been threatening to topple over for a while now. Between curses, it struck me how many tools I needed to get the job done – and how sometimes you really do need more than one tool for the job.
What does this have to do with time series regression? In general, very little. In particular, quite a bit: today we’ll be diving into using mixed models for time series analysis and forecasting. Or in more DIY terms – using more model tools to get the forecasting job done.
So, without too much more rambling, we will get cracking with:
- Revisiting the big picture, and talking a bit about mixed models.
- Looking at some real-world data.
- Using a simple model to capture the trend in our time series.
- Seasonality three ways: decision trees, linear regression, and classic time series.
- Putting Humpty Dumpty back together to get a single time series prediction.
Aside: you’ll be pleased to know that my thumbs have made a full recovery following some hammer-related accidents.
The big(ish) picture
As always, we’re trying to build the most "accurate" model possible. In this case, we’re focusing on forecasting, so we’ll prioritise models which can produce the most accurate estimates of future time series values.
I’ve previously written at length about using a regression approach based on Meta’s Prophet methodology. In those articles, I explained why I chose to use the LASSO model: non-stationarity made using classic methods difficult, the need for sensible extrapolation ruled out tree-based approaches, and the desire for simplicity and explainability excluded any sort of neural network.
But I also hinted at using mixed model forms – that is, modelling each of the time series components using a different model form, and combining the output of the various models to produce a single forecast.
As we’ll soon see, breaking the modelling task down into smaller and more targeted problems allows us to use multiple tools, each better suited to its part of the task, rather than relying on one generalised tool to tackle the task in its entirety.
Data
We’ll be using some real-world data from the UK¹. In this case, it’s a summarised version of road traffic incidents, which looks something like this:

This series is a little messier than what you’d usually see in a demonstration – note the downward (changeable) trend, strong seasonality, and general noisiness inherent in the target.
We also discussed previously how the magnitude of the seasonal variation is linked to the magnitude of the trend, and so it’s more appropriate to specify the time series using a multiplicative model. Spoiler alert – this becomes important later!
Aside: if you need a refresher on the components of a time series, I go into some detail here:
A simple estimate of the trend
First up, we need to accurately model the trend component. We can use the seasonal_decompose functionality in statsmodels to get a naive estimate of each of the time series components – here’s what the decomposition’s trend estimate looks like:

That looks quite sensible to me:
- It’s a little bumpy but looks to be fairly centred through the series.
- It also looks like it captures (more or less) the change in trend around mid-2013.
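For reference, producing that naive estimate takes only a couple of lines. Here’s a minimal sketch, assuming y is our monthly accident series as a pandas Series with a DatetimeIndex:

# naive decomposition into trend, seasonal, and residual components
from statsmodels.tsa.seasonal import seasonal_decompose

# multiplicative model, monthly data (period of 12 observations)
decomposition = seasonal_decompose(y, model='multiplicative', period=12)
trend_estimate = decomposition.trend  # centred moving-average trend; NaN at the edges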
Let’s exploit the fact that the estimate looks like a set of linear curves, and actually fit a set of linear curves to the trend estimate.
For this, we’ll adapt this great answer from StackOverflow². It relies on the simple observation that points with similar-enough gradients probably belong to the same line segment. The approach to determining which points belong to which line is pretty slick:
- Calculate the gradient of the trend we’re trying to estimate. Calculus-phobes, fear not – we can use numpy for this.
- Use a decision tree to group points of similar gradients together. Constrain the decision tree to have as many terminal nodes (leaves) as the desired number of piece-wise curves.
- Identify the data points which belong to each terminal node and fit a linear curve through these points. (A sketch of the full routine follows at the end of this section.)
Let’s first take a look at the gradient of the points:

Note how the gradient is fairly stable until mid-2012, where it gets quite noisy through to about 2014 or so. It then settles down to a new stable pattern. That seems to correspond well to our observation of a change in trend around that time; it looks even better when we compare the actual series against the piece-wise linear curves we fit.

Aside: the original StackOverflow post didn’t keep the fitted lines. Since we’re interested in forecasting into the future, we’ll need to keep at least the last curve.
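Pulling the three steps together, here’s a minimal sketch of the routine, assuming trend_estimate is the decomposition’s trend with the NaN ends dropped (and two segments, to match this series):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

x = np.arange(len(trend_estimate)).reshape(-1, 1)

# step 1: numerical gradient of the trend estimate
gradient = np.gradient(trend_estimate)

# step 2: group points with similar gradients; one leaf per line segment
tree = DecisionTreeRegressor(max_leaf_nodes=2).fit(x, gradient)
leaf_ids = tree.apply(x)

# step 3: fit (and keep) a straight line through the points in each leaf
segments = {
    leaf: np.polyfit(x[leaf_ids == leaf, 0], trend_estimate[leaf_ids == leaf], deg=1)
    for leaf in np.unique(leaf_ids)
}

Keeping the slope and intercept of each segment (rather than just the fitted values) is what lets us extrapolate the final segment into the future.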
A not-so-simple estimate of the seasonality
Let’s move on to estimating the seasonal effects in the time series.
Seasonality
… is any regular and repeated pattern that we see in the series, which is not related to long-term change. Since we’re using a multiplicative model, we can remove the effect of trend by simply dividing the fitted trend out of the original time series. This leaves a fairly volatile – but stationary – "residual":

"Residual" is meant in the broadest sense here – what we’ve got after we account for trend is actually seasonality and noise (the real residual).
We’ll use another model to sift through and pick up any signal. In fact, we’ll use three approaches to capture the seasonal effect and then pick one. First up, a decision tree.
Rule-based decisions: the decision tree
Decision tree models are quite widely known and understood, so I won’t go into too much detail here other than to say that they are rule-based learners and often suffer from overfitting to the target data.
Before we get into the training, we need to do some thinking around what we want from the model, how we want to minimise overfitting and the fitting of noise, and what that means for the data we feed into the model.
One way we can counter overfitting is by controlling the number of terminal nodes (or leaves) in the tree. In the sklearn implementation I’m using, this is the max_leaf_nodes parameter.
Secondly, we want the model to produce a prediction of seasonality for each month in our data set. But we want the same seasonality prediction for each month in the year (i.e. we want the same seasonality to apply to January 2000 as to January 2001, and so on); since there are 12 months in the year, we want to restrict the model to outputting only 12 distinct values. So we use max_leaf_nodes=12, and train as usual.
Last – but not least – data. Since we care only that seasonality predictions vary by month, that’s all we need to feed into the model.
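Here’s a minimal sketch of that setup, assuming y has a DatetimeIndex and seasonal_residual is the detrended series from earlier:

from sklearn.tree import DecisionTreeRegressor

# month of year (1-12) is the only feature the model sees
X_month = y.index.month.values.reshape(-1, 1)

# 12 leaves: one seasonality prediction per calendar month
tree = DecisionTreeRegressor(max_leaf_nodes=12)
tree.fit(X_month, seasonal_residual)
seasonality_tree = tree.predict(X_month)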
This gives a fairly compact tree, which we can visualise like so (I’ve tidied the variable names up a little):

Notice how we have 12 terminal nodes, each containing ~8.3% – or 1/12th – of the data? It looks as if we get exactly what we want – a prediction for each month of the year, which is consistent across time. Let’s see if that’s the case, paying particular attention to whether the model starts to pick up noise:

That looks fairly decent to me. In fact, we see a repetitive prediction for each month, which is consistent over time and looks to be stationary. Magic!
Let’s take another approach, using a linear model.
Same-same-but-different: the LASSO model
We can also model seasonality using simpler, linear models. I’m going to borrow from previous work (which in turn borrowed from somewhere else), and model the seasonality using a Fourier series.
Now, I’ve no idea how many different sine and cosine waves to use, so I’m going to create a bunch and use the LASSO model to choose the most useful of the lot.
Let’s first create the waves – that’s fairly simple:
import numpy as np
import pandas as pd

# set periodicity of the seasonality (12 months)
P = 12

# empty feature frame, one row per observation in the target series y
X = pd.DataFrame(index=range(len(y)))

# seasonality features: paired sine and cosine waves of increasing frequency
for j in range(1, 12):
    X[f'sin_{j}'] = np.sin(X.index * 2 * np.pi * j / P)
    X[f'cos_{j}'] = np.cos(X.index * 2 * np.pi * j / P)
Aside: notice how you don’t actually need any date or time information to create these features – I’ve used the DataFrame index here. If you follow this approach, be sure that your time series is sorted chronologically, as order will be important!
Sine and cosine outputs all lie in the same range, so we don’t need to worry about standardisation. Doubly useful is scikit-learn’s LassoCV estimator, which uses cross-validation during model training to select the "optimal" alpha parameter so we don’t have to.
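Training is then only a few lines – a sketch, using plain k-fold cross-validation for simplicity:

from sklearn.linear_model import LassoCV

# LassoCV tries a range of alphas and keeps the one which cross-validates best
lasso = LassoCV(cv=5)
lasso.fit(X, seasonal_residual)
seasonality_lasso = lasso.predict(X)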
Checking model fit on the training data gives a fairly good fit:

Notice how complex the seasonality predictions are (aren’t Fourier series great!), and how the predictions look to be fairly immune to noise.
Let’s fit one more model before taking a step back to look at the big picture.
The classic: a SARIMA model
The Seasonal Auto-Regressive Integrated Moving Average (SARIMA) is an extension of a classical time series model and is itself a mixture of models; it can be decomposed (see what we did there!) into various components.
The auto-regressive³ (AR) element assumes that the current – or future – value of the time series is a function of previous values, while the moving average⁴ (MA) part of the model focuses on the regression error, and assumes that it is a linear combination of noise terms.
The I part of the model refers to integration – or more accurately, differencing⁶ – applied to the time series values. This operation is used to enforce stationarity on the time series (although that doesn’t always work).
Together, these three elements form the ARIMA model⁵. The ARIMA is quite a powerful time series model in and of itself but tends to fall short if the time series in question has strong seasonal effects⁷.
Enter the Seasonal ARIMA, or SARIMA. This is an extension to the classic which allows for repetitive signal to be incorporated into the mix.
Aside: I’ve actually used the SARIMAX model class available in statsmodels. This is itself an extension of an extension, as it allows the user to incorporate eXternal regressors into the time series model. We’re not doing that this time, but keep an eye out for forthcoming articles where external data will be useful!
The SARIMA model needs specification from the user, particularly around the seasonal period, and how many (seasonal) AR, MA, and I terms to include. Luckily for us, the seasonality "residual" we’re trying to model is stationary so we don’t have to fiddle around with the I parameter. We’ll go ahead and parameterise the model the old-fashioned way, with a little help from the autocorrelation and partial autocorrelation visuals.
Autocorrelation tells us how strongly related a given time series observation is to previous values of itself. The number of time steps between the given observation and the observation(s) used in the comparison is referred to as the "lag". It’s often useful to create an autocorrelation plot where lags are plotted along the x-axis and correlation is plotted along the y-axis – something like this:

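The chart comes straight out of statsmodels – a one-line sketch (the choice of 36 lags here is mine):

from statsmodels.graphics.tsaplots import plot_acf

# autocorrelation of the seasonality "residual", three years of lags
plot_acf(seasonal_residual, lags=36)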
statsmodels also includes confidence intervals to help us determine how significant each correlation estimate is – this is the blue shaded area.
For example, in the chart above there is a ~0.5 correlation between the current value of the time series and the value observed immediately before (i.e. lag 1). This estimate lies well outside of the confidence interval, so we can see that it is statistically significant.
We can see how wave-like the autocorrelation is – this indicates a very strong degree of seasonality in our time series. We already know this (in fact, we should hope there is a lot of seasonality in what we’re modelling!) so it’s not really news.
One of the shortcomings of autocorrelation is the concept of intervening pairs and their effect on the measurement of correlation.
Imagine that we’re estimating the correlation at lag k. Since our time series has fairly strong lag 1 correlation, we know that the observation at time t is strongly correlated with the value at time t – 1. Likewise, observation t – 1 is strongly related to observation t – 2, and so on, all the way to observation t – k. It’s clearly difficult for us to get a reliable estimate of the relationship between observation t and observation t – k without taking into account the observations in between.
This is where partial autocorrelation comes in handy, as it calculates autocorrelation while accounting for those pesky intervening observations. Like we did above, we can create a plot of partial autocorrelation.

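Again, statsmodels does the heavy lifting – the partial autocorrelation counterpart to the sketch above:

from statsmodels.graphics.tsaplots import plot_pacf

# partial autocorrelation of the seasonality "residual"
plot_pacf(seasonal_residual, lags=36)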
Now that was a cool little detour around Theory Town, but so what? Why do we need to look at the ACF and PACF charts?
As it turns out, these can actually help us parameterise our SARIMA model! I’ve used a rule set⁸ to get started, and then manually tuned a bit using the Akaike Information Criterion⁹ (AIC) to get to the "best" model (lower AIC indicates a "better" model). We can summarise the model as follows:

… and we can see that we ended up using 1 AR term, 2 MA terms, and 1 seasonal AR and MA term with a seasonality of 12. In more summarised notation, this is a (1, 0, 2) x (1, 0, 1)(12) model. This gives us an AIC of -814 (top right).
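For the record, here’s a sketch of that specification using the statsmodels SARIMAX class mentioned earlier:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# (p, d, q) x (P, D, Q, s): 1 AR and 2 MA terms, plus seasonal AR and MA terms
sarima = SARIMAX(seasonal_residual, order=(1, 0, 2), seasonal_order=(1, 0, 1, 12))
sarima_fit = sarima.fit()
print(sarima_fit.summary())  # includes the AIC used to compare candidate models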
Let’s take a look at the model fit.

We immediately see some clear blue water between this model and the previous two: predictions don’t look too rosy for the first 12 months, and the prediction changes over time.
This is largely due to the way these types of models work: they assemble predictions from previous observations, substituting in predictions when they run out of observations.
This means that these model predictions can look shaky to start with: it’s not unreasonable that in the first x months of a model which requires at least x months of seasonal data to get going, things are not going to work. That’s quite clear from the chart above.
It also means that predictions are likely to change over time – and not just in the desired seasonal way. Sure, we’ll likely get the expected peaks and troughs but it’s unlikely that we’ll always get the same peaks and troughs.
Of course, this may not be the end of the world especially if we’re more concerned with future predictions than we are with model fit. It’s just worth bearing in mind.
It’s time to put Humpty Dumpty back together by combining modelled trend and seasonality.
Putting it back together
Now that we’ve got our estimate of trend and various ways to predict seasonality, we can combine them back into a single series.
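Thanks to the multiplicative specification, recombination is just multiplication – a sketch, where trend_fit and seasonal_fit are assumed to be the aligned fitted components:

# multiplicative model: prediction = trend component x seasonal component
combined_prediction = trend_fit * seasonal_fit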
For demonstration’s sake, today we’ll pick the decision tree model of seasonality. The combined prediction looks like a good fit, even though these predictions are on training data only:

A few things to notice here:
- We still miss in a few areas – the start of 2009, 2010, and 2011 stand out as the model overpredicts actual experience.
- Even though the seasonal pattern per the decision tree was consistent over time, the seasonal variation in the combined time series prediction decreases over time. This is due to the multiplicative specification of the time series model – this is a really important point!
Overall, not too shabby!
Wrap up and ramble
That’s a lot of modelling.
As is tradition, a quick recap before a bit of a ramble.
The recap
We started off by looking at quite a messy real-world time series. We saw how the monthly number of UK road traffic accidents had a clear trend, with changing seasonality over time. We deduced that a multiplicative time series model would be most appropriate, and set off modelling each of the trend and seasonality components separately.
First came the trend. A seasonal decomposition of the series gave us the idea that the trend could be captured through a series of piecewise linear curves. With a little help from StackOverflow, we built these curves using decision trees and gradient calculations.
We then moved on to modelling the seasonality, which we defined as whatever remained after removing the modelled trend from the actual time series. We spent a good amount of time here, exploring various ways to model seasonality: starting off with a decision tree, we then used a linear model before moving on to a classic statistical time series model.
Being spoiled for choice, we picked the decision tree model of seasonality to use for demonstration purposes, and put the modelled time series back together before comparing to the original where we saw that we got a pretty decent fit.
The ramble
Firstly, some notes on statistical rigour. Did I perform statistical tests to ensure that the "seasonality residual" we were modelling was indeed stationary? No. Did I only include statistically significant terms in the SARIMA model? No. Did everything work out in the end, and will the sun still rise tomorrow? Yes. This time.
Of course, I’m being fairly facetious here. Some obvious improvements include being more statistically responsible, especially when it’s not that difficult to do (for example, statistical tests of stationarity are quite easily available¹⁰).
Secondly, a note on data science rigour. I’ve not used any train-test (or validation) data splits. Of course, this is just a demonstration, and doing so would be another obvious improvement.
Let’s spend some time talking about the models.
Modelling what looks to be a fairly linear trend using linear models was a no-brainer to me. I’m really grateful to the StackOverflow poster, as their implementation for the selection and fit of piecewise curves was extremely slick. However, eagle-eyed readers will notice that there is a small discontinuity where some of the curves join. It’s not the end of the world this time, but it might need some tightening up in other contexts.
We really glossed over this point, but from a conceptual standpoint, modelling a trend with a set of linear functions makes a lot of sense. This is because we can reasonably expect trend to change over time, so our model of trend needs to be able to change – or extrapolate – with it. Models like decision trees can’t extrapolate at all, and while other models like SARIMA can, it’s probably better if they don’t have to.
We could of course use higher-order polynomials to model the trend, but we need to be careful when we extrapolate these – things can get really large (or negative) really quickly.
We captured our trend pretty well, so the seasonality residual was – by eye – stationary. The decision tree model worked really well to get us to 12 unique monthly estimates with very few data points. A word of warning here: think about your features before you feed them into the model. I’ve come across instances where using the number of days in a month has led to different estimates of seasonality for February. Why? You guessed it – leap years.
That neatly segues into a slightly different topic: being able to vary the seasonality over time. We didn’t have to worry about it here but there’s no reason why seasonality can’t change over time, especially if a large event (like COVID-19) significantly changes the underlying drivers of the time series. That would obviously change the way we approach things, and potentially mean we’d need to feed more data into the model: here, date features above and beyond month of the year might become crucial.
Lastly, note how we don’t explicitly model "noise". Some seemingly random behaviour in the time series is likely to be truly random fluctuation, whilst some will be due to actual events.
- Unless you’re planning on baking in some allowance for an event into future forecasts or trying to exactly quantify the impact of the event, there’s almost no point in explicitly modelling it. For example, I wouldn’t lose any sleep over a model that didn’t fit well to a period where a volcano erupted, unless I was planning on allowing for potential future volcanic eruptions in forecasts.
- If these events can be reliably predicted, then they’re arguably not that random and should be modelled along with seasonality.
There we go: mixed models for time series regression. Hopefully you’ve enjoyed this deep-ish dive into the world of mixing and matching models. I’d love to hear about any experiences you might have had with mixed models, so please do share your thoughts.
Until next time.
References and useful resources
1. https://roadtraffic.dft.gov.uk/downloads – used under the Open Government Licence (nationalarchives.gov.uk)
2. https://stackoverflow.com/a/61733206/11637704
3. Autoregressive model – Wikipedia
4. Moving-average model – Wikipedia
5. Autoregressive integrated moving average – Wikipedia
6. Differencing (of Time Series) – Statistics.com
7. A Gentle Introduction to SARIMA for Time Series Forecasting in Python – MachineLearningMastery.com
8. Rules for identifying ARIMA models (duke.edu)
9. Akaike information criterion – Wikipedia
10. Statistical Tests to Check Stationarity in Time Series (analyticsvidhya.com)