
Some wisdom transcends the ages!
Introduction
This article provides an overview of time series analysis. Time series are an extremely common data type. A quick Google search yields many applications, including:
- Demand forecasting: electricity production, traffic management, inventory management
- Medicine: Time-dependent treatment effects, EKG
- Financial markets and economics: seasonal unemployment, price/return series, risk analysis
- Engineering/Science: signal analysis, analysis of physical processes
For this article I will cover:
- Basic properties of time series
- How to perform and understand decomposition of time series
- The ARIMA model
- Forecasting
References – a selection you can use to go deeper into time series analysis with Python:
- Time series in python: https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/
- Pandas time series docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
- ARIMA in python: http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/
- ARIMA models tutorial from the 2011 scipy conference: https://conference.scipy.org/scipy2011/slides/mckinney_time_series.pdf
- Cross-validating time series models: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
- Exponential smoothing forecasting: https://grisha.org/blog/2016/01/29/triple-exponential-smoothing-forecasting/
- Tuning hyperparameters in keras/scikit-learn: https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
- Introduction to autocorrelation and partial autocorrelation: https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/
- Tutorial for building MLOps pipeline project on Neptune.ai blog: https://neptune.ai/blog/mlops-pipeline-for-time-series-prediction-tutorial
Time series data is a little bit different from pretty much all the data that I have demonstrated so far. Up to now, we have had a table of data and analyzed it – either for counts or with regressions, which are a type of supervised learning.

We look at all of the columns as input columns, and one is an output column. That is typical in supervised learning. Even if I did not draw attention to it, we have always used a table, and in all of those tables the row order pretty much did not matter. Even when we were grouping by something, the row order in the original table did not matter. As you might surmise, in a time series the row order matters. Instead of an arbitrary row order, you have a time dimension – and the time dimension has one unique feature that is not necessarily found in any of the other dimensions: it is monotonically increasing and therefore implies an order. This is important!
So that makes time series data unique. The other thing is that in most cases the periodicity, or the spacing between the times of that time dimension, is equal. In the cases where it is not equal, one might redo that column with a little bit of interpolation – a little bit of playing around back and forth – to make the time periods equally spaced. This equal spacing matters because a lot of the analyses we want to do rely on it.
That is how time series data distinguishes itself from most other kinds of tabular data. So, what is a time series and when is it used? By itself, a time series only requires that you have a time dimension. You can have additional dimensions, but the one dimension that a time series must have is the time dimension. And the time dimension means you have a timestamp. That timestamp can be a date, it can be a time, it can be an integer that starts at zero and goes to one, two, three, four – and those time periods can then be translated to whatever time unit you desire. But you need to have that time dimension! It has to be called out, and it has to be handled specially in time series data. Additional columns that then form a more complex row of data, whereby one of the cells in that row is a timestamp, are attributes. A typical attribute could be sales volume; people like to look at sales volume over time.
Time series basic concepts
Important concepts to understand for time series analysis are:
- stationarity
- white noise
- residual
- seasonality
- trend
- ARIMA model: autoregressive (AR) integrated (I) moving average (MA)
The ARIMA model combines the capabilities of looking at a moving average and predicting or describing a time series. As an aside: sometimes predicting and describing mean sort of the same thing. Describing means, if you are sort of predicting your own data, the data that you already have; and predicting means that you can use that model then to go beyond the data that you have. So when I say the moving average can describe the data, it can describe a value that is not too distant in the past or can predict a value that is in the future.
Then you have a term called autoregressive. Autoregressive – the key term here to know is regression. Regression means you will be doing a linear regression – typically a least-squares linear regression (though not always) – on data that are more distant in the past, in order to explain data that are less distant in the past or possibly in the future.

For instance, I could use the highlighted values below to predict the circled point by using a regression:

We can come up with an estimate for the circled value. Or you can just take the moving average of these values, maybe some kind of weighted average. And that would be what the moving average does!
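As a minimal sketch of that idea (with made-up numbers), pandas can compute such a moving average directly; the three-point window here is an arbitrary choice:
import pandas as pd

values = pd.Series([3.0, 4.0, 5.0, 4.5, 5.5, 6.0])   # hypothetical time series values
ma3 = values.rolling(window=3).mean()                 # simple three-point moving average
next_estimate = ma3.iloc[-1]                          # naive estimate of the next (unseen) value
print(next_estimate)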
The integration part of ARIMA will be elaborated on later.
White noise is a concept that we need to deal with in order to make sense of the way ARIMA and other models are built. White noise is typically not predictable. What white noise means is that it is random – that is what the ‘noise’ part means. ‘White’ refers to it not being concentrated at any specific frequency or range; it is all over the place. The white part is analogous to light: white light contains all frequencies of light.
Okay, so a time series is a series of data points that occur in chronological order. Discrete values – something associated with a time – are measured, and they are measured at even intervals. If you do not have even intervals, then you will need to ‘fudge it’ and force the intervals to be even after the fact. You can do this through interpolation and a variety of other techniques.
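For example, one way (among several) to force unevenly spaced observations onto an even grid with pandas is to resample and interpolate; the values and dates below are made up for illustration:
import pandas as pd

# hypothetical, unevenly spaced observations
uneven = pd.Series([1.0, 1.4, 2.1, 2.0],
                   index=pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-05', '2021-01-09']))
# resample onto a daily grid and fill the gaps by linear interpolation
even = uneven.resample('D').mean().interpolate(method='linear')
print(even)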
Time series examples
- Capital markets: stock market prices
- Science: population growth, radioactive decay
- Demand forecasting: internet bandwidth
- Medicine: treatment response time
- Signal/Image processing: de-noising, de-blurring
- BTC price trading – check out this tutorial on using an MLOps pipeline to create a project that analyzes BTC price movements for paper trading
A classic example of what you can do with time series analysis is forecasting stock prices (also crypto). This is not new – maybe it was 40 years ago. Everyone does it now, so it is always ‘baked in’ to the current stock price; the predictions are already part of the market’s anticipation. You really cannot gain any money with these types of simpler time series analyses.
A lot of time series analyses are used in science. As noted above with population growth and radioactive decay but primarily, I would say that where you see a huge amount of work is in signal and image processing. Sensor readings of various types, including images, come generally at equally spaced intervals which lends itself to time series analysis.
Now that I have established what a time series is, I can hammer home its primary purpose: forecasting and predicting the next value.
N.B. I use forecasting and predicting interchangeably. Forecasting is usually reserved for time series analysis. For example, you would not say that you forecast whether or not someone is going to leave the company – this is usually called classification.
To data scientists and machine learning engineers, a forecast primarily means a time series analysis with its associated noise and trends. What we want to do is predict and forecast as much as we can, then subtract the forecast from the original data and see if what is left over is white noise. Remember that if we have white noise, we cannot predict anything. Therefore, when we have only white noise left over, we know we have extracted all the information we can gain.
Trend, seasonality, and noise
A time series generally has these three components: a trend, seasonality, and noise. The trend is something that is typically described by linear regression or polynomial regression. So a trend in its simplest form could just be a slope through the data points, but it can easily be much more complex, spanning multiple dimensions. The data could be anything – sales volume, for instance.
Seasonality means that you have something recurring. Typical seasonality looks something like this:

This is what seasonality is and can be added to the trend.
Noise is a little different. By definition, noise is unpredictable; noise is random. Random can still follow some rules, and it does. So, what we do at the end is try to demonstrate that all that is left is just noise. It is important that the peaks and troughs be randomly spaced, for if they were equally spaced that would be seasonality and thus useful information. The mean of the noise has to be constant, meaning no trend. Subsampling different regions of the data should result in the noise having the same mean – if this is not the case then there is a trend that can still be extracted. Lastly, the approximate magnitude of the noise has to be roughly equal over time. We encountered this before in regressions, where we say that we expect homoscedastic residuals; this noise is going to be our residuals, and we expect homoscedastic variance. So although the noise is random, and by the fact that it is random it is unpredictable, it does follow rules.
The key rules are that the mean is stable, that the variance is stable, and – the third rule for time series noise – that the peaks do not follow a pattern, or at least do not follow a periodicity or a frequency. Seasonality repeats itself at intervals; noise does not.
Data structures – DatetimeIndex
An important concept for time series analysis is the date-time index. What is the date-time index? Since we use the pandas library almost exclusively, you may have noticed that in your pandas dataframes, or in your pandas series, the row number is indicated: row zero, row one, row two, row three, row four; those are indices. These indices can be swapped out for other indices that are more appropriate. If you have a time series, then the more appropriate index is a date-time. So the point is, you swap out those row numbers, and instead of row numbers, you now have dates there.
Stationarity
So what is a stationary time series? A stationary time series is one whose mean, variance, and other statistical properties are constant over time. When your leftover white noise is stationary, it means your residuals are just random. A time series is typically not stationary – and if it were absolutely stationary, there would not be much that we could get out of it. The idea is that we extract information from the time series until what is left over is stationary.
White noise
So, that brings us to white noise and I mentioned this earlier but repetition is good here. So a time series is white noise if its variables are independent and identically distributed with a mean of zero. Earlier I said constant but typically, you want the mean of your residuals to be zero.
With white noise, a given variable over time does not change variance. So the point is, if your errors from a model are white noise, then that is good. And why is that good? Well, it is good in the sense that you have done your job, it is bad in the sense that there is no more left for you to do. If the errors in your model are not white noise, then there is probably still information in there. And you can probably take that information out of the remaining residuals, and still get some more and add that information to your model.
Time domain models – ARIMA
The parameters of the ARIMA model are:
- p: number of lag observations included in the model (lag order)
- d: number of times that the raw observations are differenced (degree of differencing)
- q: size of the moving average window (order of moving average)
We are working our way towards something called ARIMA: an acronym for autoregressive (AR), integrated (I, which we will get to), and moving average (MA). Many people do not know what a moving average is. With a moving average of three, the last three values in a time series are weighted and then used to predict the current value. Put another way, the value that I have now and the previous two values will be used to predict tomorrow’s value.
Actually, in ARIMA it is the moving average of the residuals. Any model will have an error term; those errors are your residuals, and it is those residuals that are then used for the prediction.
So then what is this autoregressive thing? To illustrate, say I have an autoregression of order three. This means I am going to go back three time points and use them as inputs to a regression, and thereby predict the current time point. Differently said, I am going to use today’s value, yesterday’s value, and the day before yesterday’s value in order to predict tomorrow’s value through a regression. This is an autoregression, or what the AR in ARIMA stands for.
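To make that concrete, here is a small sketch – not the ARIMA machinery itself, just an ordinary least-squares regression on hand-built lag columns – of an order-three autoregression on a toy series:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy series: a noisy sine wave
y = pd.Series(np.sin(np.arange(100) / 5.0) + 0.1 * np.random.randn(100))

# build lagged copies of the series to use as regression inputs
lags = pd.concat({'lag_1': y.shift(1), 'lag_2': y.shift(2), 'lag_3': y.shift(3)}, axis=1).dropna()
target = y.loc[lags.index]

# least-squares regression of the current value on the previous three values
ar3 = LinearRegression().fit(lags.values, target.values)
print(ar3.coef_)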
What does the integrated stand for? To understand that, we need to understand a little bit more about what a random walk is. For those who do not know, a random walk jumps around by a random amount, up or down, at each step. Whether you have that, or a trend that just keeps going up, in either case your value for tomorrow is going to be heavily based on your value today. And that means the value for the day after tomorrow is also going to be based on the value from today. These things accumulate.
This accumulation, this addition of the random values or random fluctuations, adds up, and it does not always add up to zero. It will be approximately zero, but over a period of time the sum is going to drift and can end up pretty far away. That leads either to something like a trend or to something like a random walk. In either case, your values no longer have a constant mean; there is going to be a trend. To account for this we need differencing – where we look only at the difference between the values at two time points. So when you talk about the value for tomorrow, you actually do not talk about the actual value of sales tomorrow, but rather the value of sales tomorrow minus the value of sales today. This is differencing, and it is the I, the integrated part, in ARIMA. I will touch more on this later.
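In symbols, first-order differencing replaces each value with its change from the previous time step; applied to a random walk it recovers the white noise increments:

$y'_t = y_t - y_{t-1}, \qquad x_t = x_{t-1} + w_t \;\Rightarrow\; x_t - x_{t-1} = w_t$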
ARIMA parameters

There are three variables that are important to us in an ARIMA model: p, d, and q. The three variables p, d, and q map to the autoregressive part, the integrated part (the differencing), and the moving average. The parameters p and q are what are called lags. Earlier I mentioned going three days back – taking today, yesterday, and the day before yesterday in order to predict tomorrow. Well, this is a lag of three. Why is it a lag of three? If the value I am focused on is tomorrow, then I am going back three steps in my table, going up three rows, going back in time three periods; I am lagging behind, hence the term lag. If I lag three for my regression, then my p value would be three. If I were to lag four on my moving average, then my q value would be four.
The d is different because it is not a lag. The d parameter basically has two flavors: zero and one. There are some rare cases where it is something else, and people do look for those, but for most cases d is zero or one. Often you do not difference at all, and sometimes you difference once.
So p and q are lags, whereas d amounts to a boolean switch, and we tune these parameters in the ARIMA model to derive forecastable results. Basically, in all of time series analysis – and in most predictive analytics in general – you learn from the past to predict values of interest, typically values in the future.
Analysis
- Identify trends – rolling average
- Examine Seasonality – subtract trend from the original signal
- Analyze periodicity – does it repeat itself at periodic intervals
You want to identify trends, and to identify trends in a time series you use a rolling average. The window size is the number of points you specify to average over. A moving average smooths things out; the prediction is always going to be smoother than reality.
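As a minimal sketch, a rolling mean in pandas exposes the trend of a noisy series; the 12-point window below is a hypothetical choice you would tune to your data:
import numpy as np
import pandas as pd

# hypothetical noisy series with an upward trend
ts_demo = pd.Series(0.05 * np.arange(200) + np.random.randn(200))
# rolling (moving) average over a 12-point window
rolling_trend = ts_demo.rolling(window=12, center=True).mean()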
How do we examine seasonality? Well, we need to get rid of any kind of trend first, and then we look for repeating structure. There are a variety of ways to do that which I will not go into in detail – many of you probably know how to do this. You can use convolutions or, more typically, a Fourier transform, and then you can usually identify the primary seasonal components in your data. Again, I am not going to go over that here; it is a tangent that is not strictly necessary for today’s topics.
Autocorrelation
Autocorrelation is another way to identify whether there is a cyclic component. What is a cyclic component? That is periodicity or seasonality. What you do is you take your data and you correlate it with itself. That sounds pretty pointless, because if you take that literally, the correlation always has to be perfect: if you correlate a series with itself, the correlation is one. So you only start off that way, and then you introduce a lag of one. You line up the first value with the second value, and the second value with the third value, which means you have two columns: instead of correlating x with x, you correlate x with x lagged one step behind. So if I have the following values:

then it is easy to see that each value in this series correlates with itself:

Now let me introduce a lag of one, I have shifted the identical bottom series to the left by one:

I am correlating this one vector versus the other vector. And what happens is, you are going to typically have a pretty good correlation with that because one value here is reasonably well related to the value next to it. Not perfectly, but it is reasonably well related.
That is how you start off, then you add a new lag and another lag and another lag. And then at some point, if you have periodicity, that correlation is going to go all the way down to zero. So it started high for a while, goes down to zero. And then the amazing thing is it is going to come back up because once I have moved along this vector far enough I am going to reach the beginning of a new cycle, which should have a very strong correlation:

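pandas can compute exactly this lag-by-lag correlation; a small sketch on a made-up series with a 12-point period:
import numpy as np
import pandas as pd

# hypothetical periodic series: one full cycle every 12 points
x = pd.Series(np.sin(np.arange(120) * np.pi / 6))
for lag in (1, 3, 6, 12):
    # correlation of the series with a copy of itself shifted by `lag` steps
    print(lag, round(x.autocorr(lag=lag), 2))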
Outliers
There are lots of ways to get rid of outliers, and I will only briefly mention this. If you have something that looks weird, there are simple methods you can use: for example, compute a local rolling mean and say that if a point drops off by more than two standard deviations (or some similar threshold) from the previous five data points, it is an outlier – you kick it out and interpolate over the gap.
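A minimal sketch of that idea, assuming a pandas series and treating the five-point window and two-standard-deviation threshold as tunable choices:
import pandas as pd

def remove_outliers(ts, window=5, n_std=2.0):
    # local statistics over the trailing window
    local_mean = ts.rolling(window, min_periods=1).mean()
    local_std = ts.rolling(window, min_periods=1).std()
    # flag points that deviate from the local mean by more than n_std standard deviations
    outliers = (ts - local_mean).abs() > n_std * local_std
    # kick them out and fill the holes by interpolation
    return ts.mask(outliers).interpolate()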
Sudden shifts
Another thing I really will not talk about, but in brief:

One way to deal with sudden shifts is to take one group of points and move it up or down to match the others. Sometimes we do not even need to do that, because we will be differencing anyway. If you are only looking at the differences between individual time points, the shift shows up as one really big, weird difference, and you just kick that one out because it is an obvious outlier.
Seasonal patterns vs cycles
There is such a thing as seasonal patterns and there is such a thing as cycles. Seasonal patterns are what we have been talking about. So far, when I have been talking about cyclical patterns I have meant seasonal patterns – seasonal is, if you will, a subset of cyclic data. Seasonal is the jargon you need to remember; seasonal is what we call this stuff. Just like seasons repeat, seasonal data repeats with a given frequency. Something seasonal could be a spike in the voltage every five milliseconds; we would call this a seasonal effect.
Seasonal refers to patterns repeating with a constant frequency; cycles do not repeat at regular intervals but still have distinct ups and downs – there is no constant spacing between successive peaks. Random noise goes up and down too, and that is not special. Seasonality is what we care about as data scientists and machine learning engineers.
Randomness

This is an important one: randomness. There is no pattern to be found. For randomness like the above to be called stationary, a few criteria must hold. Our randomness is not just generally random: it hovers around a constant, and we can adjust this constant to be zero (like the above). Most of the time we will say the randomness hovers around zero; another way of saying that is that the mean is zero.
The second thing is this randomness has roughly equal variance over time or consistent variance over time.

If you take a statistically significant sample from one time period and any statistically significant sample from another time period, you will have the same variance. Note also that the peaks are not evenly spaced – meaning no seasonality – the mean is constant – meaning no trend – and finally it is homoscedastic – constant variance.
Jupyter notebook python example
So far I have thrown a lot at you, and some of it might be more abstract and harder to grasp than the rest. Here is a kitten to mentally refresh and reward you for making it this far:

Now that we have been rejuvenated by the kitten – let us now "get our hands dirty" and go through some Python code examples in a Jupyter notebook that will illustrate the basic properties of time series and their decompositions.
We will begin by creating a pandas series, and remember that a series is basically a column in a pandas dataframe:
from math import sin
import pandas as pd
ts = pd.Series([sin(x/20.0) for x in range(365)])
ts.head(20)

This object is a series of floating-point values. Note that this is not yet a time series: we need to add an index! Let us add a new set of index values to the pandas series, a date_range:
ts.index = pd.date_range(start = '1-1-2020', periods = len(ts), freq = 'D') # daily frequency, one date per value (2020 is a leap year, so match the series length rather than a fixed end date)
ts.head(20)

An important part of a time series is equally spaced time intervals – this sample has equally spaced time periods of one day. Of course, it can be much more granular, on the order of milliseconds or years – depending on your data.
Now we plot the time series. The following code plots the values of the time series against the index. Notice that there is no need to explicitly specify the values for the x-axis as the index is implied:
%matplotlib inline
import matplotlib.pyplot as plt
ts.plot()
plt.title('A simple time series plot')
plt.ylabel('Value')
plt.xlabel('Date')

Pandas provides many methods to manipulate and transform time series. For example, one can subset a time series using a range of time values from the index. Let us take a subset of the time series by specifying a date range and displaying the plot:
ts['1/1/2020':'6/30/2020'].plot()
plt.title('A simple time series plot')
plt.ylabel('Value')
plt.xlabel('Date')

Basic time series properties
A quick review of the theory from earlier.
Properties of white noise
A white noise series is a random series of independent, identically distributed (IID) values drawn from a normal distribution. Since the series is IID there is no correlation from one value to the next. We can write a discrete white noise time series as:

$w_t \sim \mathrm{IID}\ \mathcal{N}(0, \sigma^2)$

White noise does not need to be normally distributed, but it often is. If the noise is not identically distributed it is no longer white noise – as an aside I can mention pink noise, where the power is skewed toward lower frequencies.
Notice that the standard deviation, and therefore the variance $\sigma^2$, of the series is constant in time. We say that a time series with a constant variance is stationary. The properties of a stationary time series do not vary with time – but such series rarely occur "in the wild".
Furthermore, the values of the time series are given at specific or discrete times, making this a discrete time series. In computational time series analysis we nearly always work with discrete time series. Some time series are inherently discrete, such as the unemployment rate averaged over a month or the daily closing price of a stock. Even if the underlying time series is continuous, we typically work with values sampled at discrete points in time. For example, temperature is a continuous variable, but we will generally work with sampled values, such as hourly measurements.
The following code creates a time series from an IID normal distribution with mean zero:
def plot_ts(ts, lab = ''):
    ts.plot()
    plt.title('Time series plot of ' + lab)
    plt.ylabel('Value')
    plt.xlabel('Date')

import numpy.random as nr
nr.seed(2021)
white = pd.Series(nr.normal(size = 730),
                  index = pd.date_range(start = '1-1-2018', end = '12-31-2019', freq = 'D'))
plot_ts(white, 'white noise')

Next, let us look at the distribution of the time series values. The code in the cell below plots the histogram and Q-Q normal plot of the values of the time series:
def dist_ts(ts, lab = '', bins = 40):
    import scipy.stats as ss
    # two subplots side by side
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 3))
    # histogram with labels
    ts.hist(ax = ax1, bins = bins, alpha = 0.5)
    ax1.set_xlabel('Value')
    ax1.set_ylabel('Frequency')
    ax1.set_title('Histogram of ' + lab)
    # plot the q-q plot on the other axes
    ss.probplot(ts, plot = ax2)

dist_ts(white, 'white noise')

The values of the white noise series are IID, so we do not expect the values to show any dependency over time.
Autocorrelation
In time series analysis we measure dependency using autocorrelation. Autocorrelation is the correlation of a series with itself lagged (offset in time) by some number of time steps (the lag). Autocorrelation at lag k can be written as:

$\rho_k = \frac{\operatorname{Cov}(y_t, y_{t-k})}{\operatorname{Var}(y_t)} = \frac{\gamma_k}{\gamma_0}$

Notice that for any series, $\rho_0 = 1$: the autocorrelation of a series at lag zero equals one.
We can also define the partial autocorrelation. The partial autocorrelation at lag k is the correlation that remains after removing the effect of any correlations due to the terms at smaller lags. I like relating the p parameter of ARIMA to the partial autocorrelation; the q is usually related to the autocorrelation:

The autocorrelation will show you periodicity. The effect highlighted in yellow would not appear unless you had periodicity. How do I have periodicity? Well, I have here three low values, and then three high values; then I have three low values. Then I have three high values, and I have three low values. The correlation will be high at first, go to almost zero, then rise again.
The partial autocorrelation is, well, I want to know how much effect the lag of four, highlighted in blue, has on my time. I want it to be considered in isolation and not with the earlier lags of 0, 1, 2, and 3. When I do the partial autocorrelation on the lag of four (L4) I am fishing out the coefficient D. Likewise the partial autocorrelation of a lag of three (L3) will yield the coefficient for c and so on until I have all the coefficients of linear regression.
Let us plot the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the white noise series:
import statsmodels.graphics.tsaplots as splt
splt.plot_acf(white, lags = 40)
splt.plot_pacf(white, lags = 40)
plt.show()

As expected, the white noise series has significant autocorrelation and partial autocorrelation values only at lag zero; there are no other significant values. The shaded blue area on these plots shows the 95% confidence interval.
N.B. The statsmodels package uses the engineering convention for displaying the partial autocorrelation: the value at lag 0, which must always be 1.0, is displayed. In many statistical packages, including R, this lag-0 value is not displayed. This difference in conventions can lead to a lot of confusion.
The following example will add white noise to a generated series:
from math import pi
nr.seed(2021)
dates = pd.date_range(start = '1-31-2001', end = '1-31-2021', freq = 'M')
periodic = pd.Series([2 * sin(pi*x/6) for x in range(len(dates))],
index = dates)
periodic = periodic + nr.normal(size = len(dates))
plot_ts(periodic, 'periodic series with noise')
splt.plot_acf(periodic, lags = 40)
splt.plot_pacf(periodic, lags = 40)
plt.show()


From the autocorrelation, counting the points between peaks, the periodicity appears to be 12 for a full cycle, six for a half cycle. Thus the correlation will be 1 with a lag of 12 and -1 with a lag of six and approximately 0 with lags of three and nine.
What the partial autocorrelation is saying is that your primary value, the value you are looking at, is heavily dependent on the next value, and not at all dependent on the third circled value. Thus we have the two coefficients to use for our autoregression.
Random Walks
A random walk is defined as the cumulative sum of a white noise series. In other words, the value of the random walk at any time is the cumulative sum of all the preceding white noise values.
But note that the variance of a random walk increases with time and is not bounded:

$x_t = \sum_{i=1}^{t} w_i, \qquad \operatorname{Var}(x_t) = t\,\sigma^2$
Therefore, the random walk is not stationary.
The following code simulates a random walk series:
nr.seed(2021)
def ran_walk(start = '1-1990', end = '1-2021', freq = 'M', sd = 1.0, mean = 0):
    dates = pd.date_range(start = start, end = end, freq = freq)
    walk = pd.Series(nr.normal(loc = mean, scale = sd, size = len(dates)),
                     index = dates)
    return(walk.cumsum())

walk = ran_walk()
plot_ts(walk, 'random walk')

The seed chosen here highlights how a random walk can drift far from 0, although many hover right around 0. What was done differently above is that the white noise values were accumulated (summed). If you subtract each value from the next – that is, difference the series – you recover a familiar white noise series.
walk.head(20)

dist_ts(walk, 'random walk')
splt.plot_acf(walk, lags = 40)
splt.plot_pacf(walk, lags = 40)


White noise series with trend
What happens when we add a trend to the white noise series? The following code adds a linear trend to a white noise series:
import pandas as pd
import numpy.random as nr
nr.seed(2021)
def trend(start = '1-1990', end = '1-2020', freq = 'M', slope = 0.02, sd = 1.0, mean = 0):
    dates = pd.date_range(start = start, end = end, freq = freq)
    trend = pd.Series([slope*x for x in range(len(dates))],
                      index = dates)
    trend = trend + nr.normal(loc = mean, scale = sd, size = len(dates))
    return(trend)

trends = trend()
plot_ts(trends, 'white noise with trend')

dist_ts(trends, lab = '\n trend + white noise')

Let us see whether adding a trend changes the ACF and PACF:
splt.plot_acf(trends, lags = 40)
splt.plot_pacf(trends, lags = 40)
plt.show()

Time series with a seasonal component
Real-world time series often include a seasonal component. A seasonal component is a periodic variation in the values of the time series. The periods can be measured in years, months, days, days of the week, hours of the day, etc. Some examples of seasonal components of time series include:
- Annual holidays which can affect transportation, utility use, shopping habits, etc
- Weekend vs. business days, which account for volumes of certain transaction behavior
- Option expiration dates in capital markets
- Month of the year which can affect employment patterns, weather, etc
Let us investigate the properties of a time series with seasonal, trend, and residual (noise) – commonly referred to as STL – components. The following example creates and plots a time series with a trend, a sinusoidal seasonal component with added white noise:
nr.seed(2021)
from math import pi
def seasonal_ts(start = '1-1990', end = '1-2021', freq = 'M', slope = 0.02, sd = 1.0, mean = 0):
    dates = pd.date_range(start = start, end = end, freq = freq)
    # trend component
    seasonal = pd.Series([slope*x for x in range(len(dates))],
                         index = dates)
    # noise component
    seasonal = seasonal + nr.normal(loc = mean, scale = sd, size = len(dates))
    # seasonal component
    seasonal = seasonal + [2.0*sin(pi*x/6) for x in range(len(dates))] + 5.0
    return(seasonal)

seasonal = seasonal_ts()
plot_ts(seasonal, 'seasonal data')

dist_ts(seasonal, '\n seasonal series with trend and noise')
splt.plot_acf(seasonal, lags = 40)
splt.plot_pacf(seasonal, lags = 40)


Decomposition of time series
We have looked at the properties of several types of time series.
- White noise series
- Random walks
- White noise series with trend
- White noise series with a seasonal component
Next, we have to look into methods for decomposing time series data into its trend, seasonal and residual components.
STL decomposition models
A direct decomposition model is the seasonal-trend decomposition, known as STL (Seasonal and Trend decomposition using LOESS), which splits the series into seasonal, trend, and residual components. This model does the following:
- The trend is removed using a LOESS regression model
- The seasonal component is removed using a regression on periodic components
- The remainder is known as the residual
N.B. LOESS is a nonparametric smoothing or interpolation method
A decomposition can be either additive or multiplicative. Remember that log(A) + log(B) equals log(A × B). Therefore, if we take the log of the values of our output and decompose that, it is the same as treating the trend, seasonal, and residual components as multiplicative with each other rather than additive. When you do a multiplicative decomposition, you are effectively just taking the log of the output.
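Written out, the additive and multiplicative decompositions and the log identity that links them are:

$y_t = T_t + S_t + R_t \quad \text{(additive)}, \qquad y_t = T_t \times S_t \times R_t \quad \text{(multiplicative)}, \qquad \log y_t = \log T_t + \log S_t + \log R_t$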
The following time series has a seasonal component, a trend, and a white noise residual component. We use the seasonal_decompose function to decompose the time series:
import statsmodels.tsa.seasonal as sts
def decomp_ts(ts, freq = 'M', model = 'additive'):
    res = sts.seasonal_decompose(ts, model = model) #, freq = freq)
    res.plot()
    return(pd.DataFrame({'resid': res.resid,
                         'trend': res.trend,
                         'seasonal': res.seasonal},
                        index = ts.index))

decomp = decomp_ts(seasonal)
print(decomp[:12])
decomp[-12:]



The residual does not look like white noise, meaning that there still might be some information encoded in it.
Let us plot the ACF and PACF of the residual component for the STL decomposition:
splt.plot_acf(decomp['1990-07-31':'2014-06-30'].resid, lags = 40)
splt.plot_pacf(decomp['1990-07-31':'2014-06-30'].resid, lags = 40)
plt.show()

ARIMA models for the residual series
Now that we have investigated the basic properties of time series and some decomposition methods, we can investigate models for dealing with the residuals.
The class of models we will investigate is known as the AutoRegressive Integrated Moving Average or ARIMA model. We will work our way through each component of this model in the next few subsections.
The ARIMA model and its relatives are linear in their coefficients. ARIMA models are in effect linear regression models. However, as you will see, ARIMA models are constructed to account for the serial correlations in time series data.
AR models are specifically for stationary time series. If the variance is not constant or the trend component has not been removed, AR models will not produce satisfactory results.
The following code:
- Defines an ARMA process of the specified orders – a primitive form of ARIMA
- Tests the stationarity and invertibility (stability) of the process
- Returns a pandas series with samples generated from the ARMA process. Note that the generate_sample function adds white noise with a standard deviation of 1.0 unless otherwise specified
nr.seed(2021)
import statsmodels.tsa.arima_process as arima
def ARMA_model(ar_coef, ma_coef, start = '1-2011', end = '1-2021'):
    dates = pd.date_range(start = start, end = end, freq = 'M')
    ts = arima.ArmaProcess(ar_coef, ma_coef)
    print('Is the time series stationary? ' + str(ts.isstationary))
    print('Is the time series invertible? ' + str(ts.isinvertible))
    return(pd.Series(ts.generate_sample(120), index = dates))

ts_series_ar2 = ARMA_model(ar_coef = [1, .75, .25], ma_coef = [1])

Let us now create a plot of this time series:
def plot_ts(ts, title):
    ts.plot()
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Value')

plot_ts(ts_series_ar2, title = 'Plot of AR(2) process time series')

splt.plot_acf(ts_series_ar2, lags = 40)
splt.plot_pacf(ts_series_ar2, lags = 40)
plt.show()

There is clearly serial correlation (information) in this series.
The AR time series model estimates the coefficients for the order of the model specified. The following code:
- Uses the ARIMA function from the statsmodels package to define the model. The order of the AR model is specified as (p,0,0).
- Fits the coefficient values using the fit method on the model object.
- Prints the output of the summary method, showing useful statistics to understand the model.
def model_ARIMA(ts, order):
    from statsmodels.tsa.arima_model import ARIMA
    # statsmodel package since version 0.8 strongly recommends to set start_ar_lags.
    # So added it here. Also altered the code to make it robust against compute problems.
    try:
        model = ARIMA(ts, order = order)
        model_fit = model.fit(disp=0,
                              method='mle',
                              trend='nc',
                              start_ar_lags=7,
                              )
        print(model_fit.summary())
        return(model_fit)
    except ValueError:
        print('This model does not properly compute!')

ar2_model = model_ARIMA(ts_series_ar2, order = (2,0,0))

Note the following about the AR model:
- The estimated AR coefficients have values fairly close to the values used to generate the data. Furthermore, the true values are within the standard errors and confidence intervals of the estimated coefficients. Notice the negative signs of the coefficient values
- The p-values are small and standard error ranges do not include 0 so the coefficient values are significant
Moving average model
For a moving average or MA model, the value of the time series at time t is determined by a linear combination of past white noise terms. In other words, the MA model accounts for serial correlation in the noise terms. We can write the MA(q) model as a linear combination of the last q white noise terms:

$x_t = w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \cdots + \theta_q w_{t-q}$

MA models are specifically for stationary time series. If the variance is not constant or the trend component has not been removed, MA models will not produce satisfactory results.
The following code computes an MA(1):
ts_series_ma1 = ARMA_model(ar_coef = [1], ma_coef = [1, .75])
plot_ts(ts_series_ma1, title = 'Plot of MA(1) process time series')

splt.plot_acf(ts_series_ma1, lags = 40)
splt.plot_pacf(ts_series_ma1, lags = 40)
plt.show()

Let us try to estimate the coefficients of the MA time series. The following code fits the MA(1) model to the time series. The model is specified as (0,0,q):
ma1_model = model_ARIMA(ts_series_ma1, order = (0,0,1))

Note the following about the MA model:
- The estimated MA coefficient has a value fairly close to the value used to generate the data. Further, the true value is within the standard error and confidence intervals of the estimated coefficient.
- The p-values are small and standard error ranges do not include 0 so the coefficient values are significant.
Autoregressive moving average model
We can combine the AR and MA models to create an autoregressive moving average or ARMA model. This model accounts for serial correlation in both noise terms and values.
The code in the cell below simulates and plots an ARMA(1,1) model. The model is specified by (p,0,q):
# the values after the p and q are the coefficients (.6 and .75)
ts_series_arma11 = ARMA_model(ar_coef = [1, .6], ma_coef = [1, .75])
plot_ts(ts_series_arma11, title = 'Plot of ARMA(1,1) process time series')

Let us now plot the ACF and PACF of this series, then fit the model and print its summary:
splt.plot_acf(ts_series_arma11, lags = 40)
splt.plot_pacf(ts_series_arma11, lags = 40)
plt.show()

arma11_model = model_ARIMA(ts_series_arma11, order = (1,0,1))

Autoregressive integrated moving average model
The autoregressive integrated moving average or ARIMA model adds an integrating term to the ARMA model. The integrating component performs differencing to model a random walk component. The integrating component models one of the non-stationary parts of a time series. The ARIMA model is defined by orders p, d, q. We have already looked at AR(p) and MA(q) models. The order of the differencing operator of the integrating term is defined by d. Since the integrating term is a differencing operator, there is no coefficient to estimate.
Earlier we simulated a random walk series and investigated its properties. The code in the cell below uses the pandas diff method to perform first-order differencing on the time series. This operation is an ARIMA(0,1,0) model:
walk_diff = walk.diff()
plot_ts(walk_diff, title = 'Plot of first order difference of random walk time series')

Let us investigate the statistical properties of the remainder series after the ARIMA(0,1,0) has been applied by plotting the distribution properties, the ACF and PACF of the remainder.
N.B. We need to remove the first value from the difference series since the difference operator cannot work on the first element of a time series.
print(walk_diff[0:10])
dist_ts(walk_diff[1:], lab = '\n after difference operator')
splt.plot_acf(walk_diff[1:], lags = 40)
splt.plot_pacf(walk_diff[1:], lags = 40)
plt.show()


Example
Now let us apply the models we have been working with to some real-world data. I will demonstrate with a famous data set which shows the production of chocolate, beer, and electricity (CBE) in Australia from 1958 to 1990:
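I am assuming the CBE data frame has already been read in; if it lives in a CSV file with choc, beer, and elec columns, loading it might look like this (the file name is hypothetical):
import pandas as pd
CBE = pd.read_csv('cbe.csv')   # hypothetical file name for the chocolate/beer/electricity data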
# need to add date index to make a time series
CBE.index = pd.date_range(start = '1-1-1958', end = '12-31-1990', freq = 'M')
print(CBE.head())
print(CBE.tail())

Plot the three time series:
f, (ax1, ax2, ax3) = plt.subplots(3, 1)
CBE.choc.plot(ax = ax1)
CBE.beer.plot(ax = ax2)
CBE.elec.plot(ax = ax3)
ax1.set_ylabel('Chocolate')
ax2.set_ylabel('Beer')
ax3.set_ylabel('Electric')
ax3.set_xlabel('Date')
ax1.set_title('Three Australian production time series')
plt.show()

Notice that for each of these time series the amplitude of the seasonal variation grows with time. This is a common situation with real-world data.
Looking at the electric production above, notice how the variation increases with time? This makes it a prime candidate for taking the log of the data. The following code performs the log transformation and plots the result for the electric production time series, which we will use for our STL decomposition:
import numpy as np
CBE['elec_log'] = np.log(CBE.elec)
plot_ts(CBE.elec_log, 'Log of Australian electric production')
CBE.columns

Notice the following properties about this time series:
- There is a significant trend
- There is a noticeable seasonal component.
- The magnitude of the seasonal component increases with the trend in the un-transformed time series
- The seasonal component of the log-transformed series has a nearly constant magnitude but decreases a bit with time
These results indicate that an STL decomposition is required.
STL decomposition of electric time series
Let us analyze the electric time series. The following code computes the STL decomposition of the time series using the decomp_ts function:
elect_decomp = decomp_ts(CBE.elec_log)
print(elect_decomp.head(12))
print(elect_decomp.tail(12))


Note the following about these results:
- The periodic component looks reasonable, but may not be stationary as evidenced by the remainder
- The removal of the trend component appears to be satisfactory
We can apply the Dickey-Fuller test to determine if the residual is stationary. The null hypothesis is that the time series is not stationary (it has a unit root).
Notice that the first and last 6 elements must be filtered out since they have NaN values.
The following code executes the DF test and prints some summary statistics on our earlier synthetic data:
from statsmodels.tsa.stattools import adfuller
def DF_Test(ts):
    stationary = adfuller(ts)
    ## print the results
    print('D-F statistic = ' + str(stationary[0]))
    print('p-value = ' + str(stationary[1]))
    print('number of lags used = ' + str(stationary[2]))
    print('Critical value at 5% confidence = ' + str(stationary[4]['5%']))
    print('Critical value at 10% confidence = ' + str(stationary[4]['10%']))

DF_Test(decomp.resid[6:-6])

DF_Test(elect_decomp.resid[6:-6])

Given the DF statistic and p-value, we can reject the null hypothesis that the residual is not stationary for the electricity data.
For the next step, compute and plot the ACF and PACF for the remainder series:
splt.plot_acf(elect_decomp.resid[6:-6], lags = 40)
splt.plot_pacf(elect_decomp.resid[6:-6], lags = 40)
plt.show()

There is still periodicity in the seemingly random residual data. We can use it to create an additional model that performs a regression on the residuals, and add that model to the original one.
Apply ARIMA model
Now that we have an STL decomposition of the electric use time series, we can compute an ARIMA model for the residual. For a starting point I will use an ARIMA(2,1,2) model:
arima_electric = model_ARIMA(elect_decomp.resid[6:-6], order = (2,1,2))

The standard error of the second AR coefficient is similar in magnitude to, or larger than, the coefficient itself. Furthermore, the confidence interval overlaps zero – indicating that the model is over-parameterized (overfit).
How do we find the ‘best’ ARIMA model for the residual? We need a criterion to compare models with different values of p, d, and q. The Bayesian Information Criterion or BIC is closely related to the Akaike Information Criterion (AIC). The BIC weights the number of parameters in the model by the log of the number of observations.
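For a model with k estimated parameters, n observations, and maximized likelihood $\hat{L}$, the two criteria are:

$\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L}), \qquad \mathrm{AIC} = 2k - 2 \ln(\hat{L})$

Lower is better for both; the $k \ln(n)$ term penalizes extra parameters more heavily than AIC does once n is larger than about 7.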
The following code iterates over a grid of p, d, and q values. For each p, d, q combination the BIC is computed and compared to the best previous model. In a bit more detail the function in the cell below does the following:
- Initialize a large value of BIC
- Iterate over the grid of specified p, d, and q values
- Compute an ARIMA model of order (p, d, q). This process is wrapped in a try block to prevent the function from crashing if the model is unstable and does not converge
- The BIC of each model is compared to the best (lowest) BIC found so far. If a better model is found, that model, its parameters, and the BIC are saved as the best model
- Once the loop has completed, the best BIC, the order of the best model, and the best model are all returned
def model_ARIMA_2(ts, order):
    from statsmodels.tsa.arima_model import ARIMA
    from statsmodels.tsa.arima_model import ARIMAResults
    model = ARIMA(ts, order = order)
    model_fit = model.fit(disp=0, method='mle', trend='nc')
    BIC = ARIMAResults.bic(model_fit)
    print('Testing model of order: ' + str(order) + ' with BIC = ' + str(BIC))
    return(BIC, order, model_fit)

def step_ARIMA(resid, p_max, d_max, q_max):
    from statsmodels.tsa.arima_model import ARIMAResults
    from statsmodels.tsa.arima_model import ARIMA
    best_BIC = 9999999999999999.0
    for p in range(p_max + 1):
        for d in range(d_max + 1):
            for q in range(q_max + 1):
                if(p > 0 or q > 0):
                    try:
                        order = (p, d, q)
                        BIC, order, model = model_ARIMA_2(resid, order)
                        if(BIC < best_BIC):
                            best_model = model
                            best_BIC = BIC
                            best_order = order
                    except:
                        pass
    return(best_BIC, best_order, best_model)
BIC, order, model = step_ARIMA(elect_decomp.resid[6:-6], 3, 3, 3)
print('***************************************')
print('Best model with BIC = ' + str(BIC) + ' and with order '+ str(order))

Forecasting Time Series
An important idea to ponder is this: you can decompose a calendar (months, days of the week, and so on) using one-hot encoding (OHE). This allows us to model seasonal effects with ordinary regression features, which is exactly what we will do below.
Now that we have explored these data, the next step is to compute and evaluate a forecast model. In this case, we will hold back the last 12 months of data before we train the model. The final 12 months of data can then be used to evaluate the model.
We can build the forecast in the following steps:
- Compute new features, the count of months from the start of the time series and the square of count of months. These features are used to model trend
- Normalize the numeric features
- Create new dummy (binary) variables for the months. These features are used to model the seasonal variation
- Coefficients for the trend and seasonal features are computed using a linear regression model
- An ARIMA model for the residuals is computed
- A prediction of the electric production is made for a 12 month period using the trend, seasonal and residual models
The code in the cell below performs the first three steps of the process:
# create new features, the count of months from the start of the
# series and the square of the count of months.
CBE.loc[:, 'Month_Cnt'] = [float(i + 1) for i in range(len(CBE.elec_log))]
CBE.loc[:, 'Month_Cnt2'] = [x**2 for x in CBE.Month_Cnt]
# normalize time features
from scipy.stats import zscore
CBE.loc[:, ['Month_Cnt', 'Month_Cnt2']] = CBE.loc[:, ['Month_Cnt', 'Month_Cnt2']].apply(zscore)
# create dummy variables for the months
years = int(len(CBE.elec_log)/12)
CBE.loc[:, 'Month'] = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'] * years
dummies = pd.get_dummies(CBE.loc[:, 'Month'])
CBE[list(dummies.columns)] = dummies
# print the head of the data frame to look at the dummy variables.
CBE.head(12)

The code in the cell below computes a linear model for coefficients of the trend and seasonal features. The steps are:
- The features are extracted into a numpy array
- The label is extracted into a numpy array
- The model is defined. Note that we do not use an intercept since we have the seasonal component, which is a categorical feature
- The model fit is computed
- Predictions of the trend and seasonal values are computed
- The residuals with respect to these predictions are computed
import sklearn.linear_model as lm
X = CBE.loc[:'1989-12-31', ['Month_Cnt', 'Month_Cnt2', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']].values
Y = CBE.loc[:'1989-12-31', 'elec_log'].values
lm_mod = lm.LinearRegression(fit_intercept = False)
# where the following magic happens (the red line)
mod_fit = lm_mod.fit(X, Y)
# and the predictions and the residual
CBE.loc[:'1989-12-31', 'scores'] = mod_fit.predict(X)
CBE.loc[:'1989-12-31', 'resids'] = CBE.loc[:'1989-12-31', 'scores'] - CBE.loc[:'1989-12-31', 'elec_log']
Let us now have a look at how these predicted trends and seasonal components fit the actual electric production time series. The following code plots the actual time series in red and the values predicted by the trend and seasonal model in blue:
def plot_mod_fit(df, col):
    import matplotlib.pyplot as plt
    fig = plt.figure(figsize=(8, 5)) # set plot area
    ax = fig.gca() # define axis
    df.loc[:, col].plot(color = 'r', ax = ax)
    df.loc[:, 'scores'].plot(ax = ax)
    ax.set_title('Actual ' + col + ' vs. the predicted values') # give the plot a main title
    ax.set_xlabel('Date') # set text for the x axis
    ax.set_ylabel(col) # set text for y axis

plot_mod_fit(CBE, 'elec_log')

To have a point of comparison, execute the code in the cell below to compute the root mean square error (RMSE) of the fit of the model for the last 12 months of the electric production time series:
def RMSE(ts, score):
    from math import sqrt
    # root of the mean of the squared errors
    return sqrt(np.mean(np.square(ts - score)))
## Make the forecast for the next year
X = CBE.loc['1989-12-31':, ['Month_Cnt', 'Month_Cnt2', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']].values
RMSE(CBE.loc['1989-12-31':, 'elec_log'].values, mod_fit.predict(X))

Now let us have a look at a time series plot and distribution plots of the residuals:
plot_ts(CBE.loc[:'1989-12-31', 'resids'], title = 'Residual time series of log electric production')
dist_ts(CBE.loc[:'1989-12-31', 'resids'], '\n residual of trend and seasonal model')


Testing for stationarity:
DF_Test(CBE.loc[:'1989-12-31', 'resids'])

The non-stationarity means that the residuals are not white noise, and thus still have data we can extract:
BIC, order, model_fit = step_ARIMA(CBE.loc[:'1989-12-31', 'resids'], 4, 3, 4)
print('Best order = ' + str(order) + ' best BIC = ' + str(BIC))

However, the best-order model does not compute with those parameters – it is unstable. Thus we use the second-best order:
arima_remainder = model_ARIMA(CBE.loc[:'1989-12-31', 'resids'], order = (2,1,3))
arima_remainder = model_ARIMA(CBE.loc[:'1989-12-31', 'resids'], order = (3,0,3))

N.B. Not all models with a given (p, d, q) will compute. Python might warn you about a Hessian inversion problem – the Hessian is a matrix of second derivatives computed during fitting, and its inverse is needed to assess the uncertainty of the coefficients; if it cannot be inverted, the fit is unstable.
start_index = len(CBE.loc[:'1989-12-31', 'resids'])
end_index = start_index + 12
model_prediction = model_fit.predict(start=start_index, end=end_index)
model_prediction
The numbers printed are for the last months in that time series – the months that were withheld, meaning they were not part of the training data set, the data used for fitting the model.
The reason we are doing this is that we can now compare that prediction to what really happened and see how good our model is. So we are now going to make a forecast using our linear regression model:

Now we combine this with our first model to make our forecast. The prediction is the combination of the trend, season, and residual (ARIMA) models:
# make the forecast for the next year
X = CBE.loc['1989-12-31':, ['Month_Cnt', 'Month_Cnt2', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']].values
# predictions for the forecast
CBE.loc['1989-12-31':, 'scores'] = mod_fit.predict(X) - model_prediction
plot_mod_fit(CBE, 'elec_log')

The forecast looks reasonable: the actual time series in red and the prediction in blue look very similar for the last 12 months. The real power of a model is whether it can predict the future. Well, as Yogi Berra said in the subheader, that is hard – but our model, trained on all but the last 12 months, tested on those held-out 12 months, and combined with the original regression model, gives us a fair test of exactly that.
Finally, let us compute the RMSE and compare these results to the model with only trend and a seasonal component:
RMSE(CBE.loc['1989-12-31':, 'elec_log'].values, CBE.loc['1989-12-31':, 'scores'])

This value is worse than earlier – adding the ARIMA component to the original regression created a worse model, with a higher RMSE. This is odd behavior worthy of dissection, because the model’s performance is expected to improve after accounting for the information encoded in the residuals. My hunch is that certain packages might be deprecated or might have been modified under the hood in some manner. In principle, feeding the residuals of the first model into a second model, and so on, is the framework for creating ever more accurate models.
Using the auto_arima function, the RMSE of the forecast is much better.
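The stepwise_fit object used below is presumably produced by the pmdarima package's auto_arima; a minimal sketch of how it might have been fit on the training portion of the log series (the settings here are assumptions, not the exact call used for the results above):
import pmdarima as pm
# stepwise search over non-seasonal (p, d, q) and seasonal orders; m = 12 for monthly data
stepwise_fit = pm.auto_arima(CBE.loc[:'1989-12-31', 'elec_log'],
                             seasonal = True, m = 12,
                             stepwise = True, suppress_warnings = True)
With a fitted stepwise_fit in hand, the 12-month forecast and its RMSE are: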
forecast_12 = stepwise_fit.predict(n_periods=12)
print(forecast_12)

RMSE(CBE.loc['1990-1-31':, 'elec_log'].values, forecast_12)

CBE.loc['1990-1-31':, 'scores'] = forecast_12
plot_mod_fit(CBE, 'elec_log')

As expected, the model did improve. Perhaps a problem in my earlier script prevented it from calculating the RMSE correctly, making the performance look worse than it was.
Models for non-stationary variance
The Autoregressive Conditional Heteroskedasticity or ARCH and Generalized Autoregressive Conditional Heteroskedasticity or GARCH model, and their many relatives, are specifically intended to deal with variance which changes with time. Robert Engle published the ARCH model in 1982 for which he was awarded the Nobel Prize in Economics in 2003.
These models are beyond the scope of this article. Additional information can be found in the references given earlier.
Summary
I have demonstrated the following:
- Explored basic properties of time series: white noise, random walk, white noise with trend, seasonal components
- Decomposed time series data into its trend, seasonal and residual components (STL)
- Investigated models for dealing with the residuals: auto-regressive, moving average, integrated (ARIMA)
- Introduced autocorrelation (ACF and PACF)
- Applied the time series models to real data for chocolate, beer, and electricity production to compute and evaluate a forecast model
I hope you have enjoyed this article! Whether you are new to these concepts or simply in need of a refresher, I want my readers to come away with a deep and complete intuition for these complex yet highly important topics in data science and machine learning.