
When I started to learn time series forecasting, I first read many blog posts and articles. Unfortunately, only a few of them were comprehensive enough for beginners. The others lacked completeness (e.g. they did not cover the analysis of the residuals) or, worse, applied the methods incorrectly. Fortunately, I found some very good books (see references) and had great sparring partners.
I came up with the idea of writing this article for two reasons. First, you do not have to go through the valley of tears and confusion as I did. Second, classical time series models like AR(I)MA come with many assumptions. Therefore it is essential to understand the idea behind time series modeling and its basics.
Depending on the demand and my time, I would like to write a series about time series forecasting, starting from the very basics and working up to advanced time series models that are also used in competitions like the famous Makridakis Competitions. But let us first start with this article, which covers the basic theoretical concepts and tools in time series modeling.
Before we start I would like to thank my colleague Justina Ivanauskaite for the great conversations and contribution to this article.
This article helps you to understand the following topics:
- The terminology
- Time Series components and how to decompose them
- White Noise and Random Walk models
- The concept of stationarity
- Autocorrelation and Partial Autocorrelation functions
Watch out for the 💡 key takeaways if you are in a hurry.
Terminology
We have to distinguish between a stochastic process (also called time series process or model) and a time series.
Stochastic process
A stochastic process is a set of random variables {Yₜ, t ∈ T} that are ordered in time and defined at a set of time points T, which may be continuous or discrete. T indicates the time points at which the process was, will, or can be observed. An example is the number of new COVID-19 infections on successive days for a given region.
In this article we focus on discrete-time (DT) stochastic processes and time series. This means t can take integer values 0, 1, 2, and so on. So when we talk about time series processes, models, or time series in the remainder of the text, we always assume t is discrete.
Time series
An (observed) time series is the realization of a time series process. It can be denoted with lowercase letters, y = (y₁, y₂, …, yₜ).
According to the COVID-19 example above, the time series can be the observed number of new COVID-19 infections per day for Texas from January 2021 till the end of March 2021.
Lags
When you work with time series, you will often hear the term lag. A lag can be seen as the time difference between two points. In the case of daily data, yₜ is the current or present value of a time series, while yₜ₋₁ (or lag 1) is the value from yesterday and yₜ₋₂ (or lag 2) the observation that was made the day before yesterday and so on.
💡 Key takeaways
- A time series process or time series model is the mathematical description of ordered, stochastic (also called random) processes.
- A time series is the realization of such a described process.
- Lags are the time difference between two observations or points.
Time Series Components
Time series are full of patterns. Therefore it is quite useful to split our time series into distinct components for a deeper analysis of its underlying structure:
- Trend-cycle Tₜ: Is a long-term increase or decrease in the data and does not always have to be linear. The worldwide increasing electricity consumption over the last 60 years can be an example of a trend.
- Seasonality Sₜ: Represents the changes with fixed and known periodicity (i.e. factors like time of the year). An example can be the increase in retail sales during the Christmas seasons and the decrease after the holidays.
- Remainder Rₜ: Is the irregular part of a time series that contains the noise (i.e. caused by measurement errors) or random movements. It is sometimes also called residuals. Although we cannot observe it directly, it is always present in a time series to some degree.
Depending on the literature, you will usually find three or four components. Some authors treat the cyclical behaviour as its own component, while others also introduce a level component, which describes the average value of the series. In this article I decided to stick with the three distinct components also mentioned by Makridakis (1998).
Decomposition
Now that we are familiar with a time series' components, we can think about how to split or decompose a time series into them. Basically, there are two ways to decompose a time series.
Additive decomposition
Yₜ = Sₜ + Tₜ + Rₜ
The additive decomposition is the most basic one. It is suitable if the magnitude (spike) of the seasonal fluctuations, or the variation around the trend cycle, does not vary with the level (mean) of the time series. So it is useful when the seasonal variation is relatively constant over time.
Multiplicative decomposition
Yₜ = Sₜ × Tₜ × Rₜ
The multiplicative decomposition is useful if you have an increasing trend and the amplitude of seasonal activity increases. In other words, if the seasonality’s magnitude is dependent on the level, use multiplicative decomposition.
Since definitions like this are always pretty dry, it is time for some examples. To get a clearer understanding of additive and multiplicative decomposition and when to use which, we will have a closer look at the following plots (fig. 1).

The plot on the left-hand side (a) shows an excerpt of the ausbeer data set. The seasonal variation of the data stays relatively constant over time. In other words: The height of the peaks is relatively constant. This means that an additive decomposition is the right one for this time series.
On the right-hand side (b) we see the plotted air passengers data set. Compared to the plot on the left, we can see that the seasonal variation is not constant over time. The peaks do not stay constant and change their heights (e.g. compare the peaks from 1959 with those from 1953). Therefore, a multiplicative decomposition is the appropriate choice for this time series.
If you want to test your knowledge about detecting additive and multiplicative seasonality, I can highly recommend the interactive tool "Additive and multiplicative seasonality – can you identify them correctly?" by Nikolaos Kourentzes.
Now that we have all the theoretical knowledge we need, we can run a time series decomposition in Python. The statsmodels package provides a function called seasonal_decompose. For the following example, we will decompose the air passengers data set.
The two most important parameters for this function are:
- x: The time series to decompose
- model: The type of decomposition.
Additive is the default value for the model parameter. So if your time series requires a multiplicative decomposition, you have to set the parameter value to "multiplicative".
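A minimal sketch of such a call could look like this (the file name, column name, and the air_passengers variable are illustrative assumptions; any monthly pandas Series with a DatetimeIndex works):

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Load the air passengers data as a pandas Series with a DatetimeIndex
# (file name and column name are illustrative)
air_passengers = pd.read_csv("AirPassengers.csv", index_col="Month", parse_dates=True).squeeze()

# The seasonal variation grows with the level, so we choose a multiplicative model;
# period=12 because the data is monthly
result = seasonal_decompose(air_passengers, model="multiplicative", period=12)

# Plot the observed series and its trend, seasonal, and remainder components
result.plot()
plt.show()
```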

The plotted output (fig. 2) shows in chronological order the decomposed time series, the trend, seasonal, and remainder (Resid) components. A plot like this helps us to visually analyze the components of our time series in more detail.
💡 Key takeaways
- A time series consists of three components: The Trend-cycle, the Seasonality and the Remainder (also called residuals).
- These components are combined either additively or multiplicatively.
- A visual analysis of the data can show us if the components are in an additive or multiplicative composition.
- If the variation around the trend-cycle varies with the mean or we see an increasing amplitude of seasonal activity, we can consider a multiplicative decomposition.
- Statsmodels' seasonal_decompose function helps us to decompose our time series into its components.
White Noise and Random Walk Models
Since we are familiar with the terminology, the components, and the decomposition of time series, it is time to talk about some basic but also important time series models.
White Noise
Let us first talk about the White Noise (fig. 3) model. A White Noise model can be defined as follows: Yₜ = εₜ
Where εₜ is the White Noise, the part that cannot be predicted based on the past history of the series.

White Noise comes with the following characteristics:
- It has a mean (μ) of 0 and a constant variance (also written as WN(0,σ²)).
- It does not follow any pattern and therefore it is completely random (that's why it is called "noise").
- It consists of independent identically distributed (also called i.i.d.) observations.
- Each observation has a zero correlation with the other observations in the series.
If the observations follow a normal distribution, we say it is Gaussian White Noise. Ideally, our forecast errors are (Gaussian) white noise. That would mean that we "caught" or modeled all the important effects with our time series model, and what is left is just unpredictable white noise (see the remainder Rₜ component).
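As a small illustration, Gaussian white noise can be simulated with numpy (a minimal sketch; the sample size of 720 is chosen so we can reuse the series for the Ljung-Box test further below):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Gaussian white noise: i.i.d. normal draws with mean 0 and constant variance
white_noise = rng.normal(loc=0.0, scale=1.0, size=720)

plt.plot(white_noise)
plt.title("Gaussian White Noise")
plt.show()
```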
Random Walk (with Drift)
Besides White Noise, there is also another very basic but important model called random walk (fig. 4). A random walk model can be defined as: Yₜ = Yₜ₋₁ + εₜ
Where Yₜ represents the current value, Yₜ₋₁ the value of Y one lag ago (e.g. yesterday's value of Y), and εₜ the random error (also called noise).

This means that the current observation is a random step away from the previous one, and all the steps are independently and identically distributed in size ("i.i.d."). Therefore, a random walk cannot be reasonably predicted. The best forecast for Yₜ is the previous value Yₜ₋₁ which is also called naive forecasting. A typical real-life example for a random walk series is the share price on successive days.
If a random walk series follows an up or down trend it contains a "drift". That is why we call it random walk with drift. It can be defined as: Yₜ= α + Yₜ₋₁ + εₜ where α stands for the constant or a drift.
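For illustration, both variants can be simulated by cumulatively summing white noise errors (a sketch; the drift value of 0.1 is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
eps = rng.normal(size=720)   # white noise errors
drift = 0.1                  # the constant alpha (arbitrary value)

# Y_t = Y_(t-1) + eps_t  ->  cumulative sum of the errors
random_walk = np.cumsum(eps)

# Y_t = alpha + Y_(t-1) + eps_t  ->  cumulative sum of drift + errors
random_walk_with_drift = np.cumsum(drift + eps)

plt.plot(random_walk, label="random walk")
plt.plot(random_walk_with_drift, label="random walk with drift")
plt.legend()
plt.show()
```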
As already mentioned, we ideally want our residuals to be white noise in the modeling step that comes later. That brings us to the question, how can we test if our time series is white noise? Guess what, there is a statistical test for it!
Test for white noise: Ljung-Box test
It is called the Ljung-Box (also called the portmanteau) test. The Ljung-Box test comes up with a null and an alternative hypothesis:
- H₀: The data are independently distributed
- Hₐ: The data are not independently distributed
In Python, the package statsmodels provides a function called acorr_ljungbox. Besides the data, we have to provide a number of lags, which brings us to the question: what is an appropriate number of lags?
In his blog post Hyndman recommends:
- For non-seasonal time series, use h = min(10, T/5)
- For seasonal time series, use h = min(2m, T/5)
where h is the number of lags to test, T is the number of observations, and m is the period of seasonality.
For the following example, we will use our white noise data with 720 observations. Following Hyndman's recommendation, this means we set the number of lags to min(10, 720/5) = min(10, 144) = 10.
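A minimal sketch, assuming the simulated white_noise series from above:

```python
from statsmodels.stats.diagnostic import acorr_ljungbox

# Non-seasonal series: h = min(10, T/5) = min(10, 720/5) = 10 lags
lb_result = acorr_ljungbox(white_noise, lags=10)
print(lb_result)
```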
In recent statsmodels versions, the function returns a DataFrame with the test statistics (lb_stat) and the p-values (lb_pvalue); older versions returned a tuple of arrays, with the test statistics first and the p-values second. If a p-value is below the 0.05 threshold, we can reject the null hypothesis, which means that the data is not white noise.
In our case, all p-values are above the 0.05 threshold, which means we cannot reject the null hypothesis and our data is white noise.
💡 Key takeaways
- A white noise time series has a mean (μ) of 0, a constant variance, and can not be predicted since it is completely random.
- The residuals of our fitted model should be white noise.
- The best forecast for a random walk is simply the previous value Yₜ₋₁ (naive forecasting); for a random walk with drift, we add the drift constant α.
- A random walk with a trend is called random walk with drift.
- To test a time series for white noise we can use the Ljung-Box test.
The concept of stationarity
Stationarity is an essential characteristic of time series: (classical) time series models like ARMA assume it, and applying them to non-stationary data can lead to incorrect results.
For the sake of comprehensiveness, we have to distinguish between strict stationarity and weak stationarity.
Strict stationarity
- We say a time series process is strictly stationary if its properties are unaffected by a change of time origin.
- The joint distribution of the observations yₜ, yₜ₊₁, …, yₜ₊ₙ is exactly the same as the joint probability distribution of the observations yₜ₊ₕ, yₜ₊ₕ₊₁, …, yₜ₊ₕ₊ₙ.
- Therefore it is unaffected by forward or backward shifts in time.
However, many real-life processes are not strictly stationary. But even if a process is strictly stationary, we usually have no explicit knowledge of the latent time series process. As strict stationarity is defined as a property of the process’s joint distributions, it is impossible to prove from an observed time series.
Weak stationarity
We define a time series process as weakly stationary if
- The expectation E(yₜ) is constant over time
- The variance Var(yₜ) is constant over time
- The covariance Cov(Yₜ,Yₜ₊ₕ) depends only on the lag h, Cov(Yₜ₁,Yₜ₂) = Cov(Yₜ₁₊ₕ,Yₜ₂₊ₕ)
Broadly speaking, a time series is said to be weakly stationary if there is no systematic change in the mean μ (i.e. no trend), no systematic change in the variance σ², and strictly periodic variations (i.e. seasonality) have been removed.
How are these definitions or types of stationarity used in practice?
In practice, people usually mean weak stationarity when they use the term stationarity, since strict stationarity is mostly a theoretical concept. To be consistent, from now on we will use stationarity when we mean weak stationarity.
Alright, so now that we know what the concept of stationarity means, let’s test our knowledge by observing the following graphs in figure 5 and check if they are stationary or not.

Plot (a) shows a clear trend (no constant mean over time) in the data, so the series cannot be stationary.
Plot (b) on the opposite does not show a trend or any seasonality in the data. That indicates that plot (b) is stationary.
Plot (c), like plot (a), has a clear trend in the data and is therefore non-stationary as well.
Last but not least, a confusing case by Hyndman and Athanasopoulos (2018). Plot (d) appears non-stationary due to its strong cycles. However, these cycles are aperiodic. When the lynx population becomes too large for the available feed, they stop breeding, and the population falls to low numbers. Then the regeneration of their food sources allows the population to grow again, and so on. This means that in the long term, the timing of these aperiodic cycles is not predictable. Because of this cyclic behavior with neither trend nor seasonality, the series is stationary.
I bet you are surprised that plot (d) turned out to be stationary. Sometimes it can be hard to tell if a time series is stationary visually. Thus we need a more robust way to check stationarity.
Luckily there are several statistical tests (also called unit-root tests) to check if a series is stationary or not.
Augmented Dickey Fuller (ADF) test
One of the most widely used is the Augmented Dickey Fuller (ADF) test.
There are many more unit root tests available like the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test or the Elliott–Rothenberg–Stock (ADF-GLS) test.
The ADF comes up with a null and an alternative hypothesis.
- H₀: The series has a unit root so it is non-stationary.
- Hₐ: The series has no unit root so it is stationary.
Broadly speaking, a series with a unit root behaves like a random walk with drift: it contains a stochastic trend that shows an unpredictable pattern. If you are interested in a deeper explanation, please see Makridakis (1998).
For this purpose, statsmodels provides us with the adfuller function.
For this example, I used the air passengers data set from above, which is not stationary. Since the data set clearly shows a trend, I set the function's regression parameter to "ct" (constant and trend).
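A sketch of such a call (using the air_passengers series from the decomposition example above):

```python
from statsmodels.tsa.stattools import adfuller

# regression="ct": include a constant and a linear trend in the test regression
adf_stat, p_value, usedlag, nobs, crit_values, icbest = adfuller(air_passengers, regression="ct")

print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
```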
The function returns several values; the second one is the calculated p-value. Since 0.55 is above the threshold of 0.05, we cannot reject H₀, so the data is not stationary.
Now that we know our data is not stationary, how can we make them stationary?
How to make data stationary
As already mentioned, a lot of (classical) models assume stationary time series. So what if we know that our data are not stationary? Two very common approaches to make data stationary are:
- Transforming the data (e.g. log and/or square root transformations)
- Differencing
Transforming the data is a very basic approach that is also used in other statistical fields like regression. One applies a log or square root transformation to the data to make it stationary.
Another very common approach is differencing. It can be applied alternatively or in addition to the transformation approach. We usually take first differences of the series, which is also called first-order differencing: we create a new time series of successive differences Yₜ − Yₜ₋₁. For example, if the original time series is Y₁, Y₂, Y₃, …, Yₙ, then applying first-order differencing leads to a new time series Y₂−Y₁, Y₃−Y₂, Y₄−Y₃, …, Yₙ−Yₙ₋₁.
If this does not work and we still have non-stationary data, then we can also consider a second-order differencing by taking the differences of the created differences (not to be confused with taking the second difference Yₜ -Yₜ₋₂).
If your time series is in data frame format, you can make use of pandas function .diff(). This function takes the number of periods you want to difference over as a parameter.
If we want to apply first-order differencing, we simply call the .diff() method on our data frame. In our example we apply the .diff() method to the ausbeer data set (table 1).

Since we take first differences of the series (Y₂-Y₁) we will lose one (the first) data point. That means if we apply second-order differencing, we lose the first two data points.
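A minimal sketch (the numeric values are only illustrative stand-ins for the ausbeer series):

```python
import pandas as pd

# Illustrative stand-in for the quarterly ausbeer series
ausbeer = pd.Series([284, 213, 227, 308, 262, 228, 236, 320])

ausbeer_diff = ausbeer.diff()          # first-order differencing: Y_t - Y_(t-1)
ausbeer_diff2 = ausbeer.diff().diff()  # second-order differencing (differences of the differences)

print(ausbeer_diff.head())  # the first value is NaN, so one data point is lost
```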
💡 Key takeaways
- (Weak) stationarity is a property many (classical) time series models assume.
- A time series is weakly stationary if its properties (mean, variance) are constant over time and the covariance depends only on the lag h.
- To test our time series for stationarity we can use the Augmented Dickey–Fuller test.
- To make a time series stationary we can use differencing or transforming techniques like log transformation.
Autocorrelation and Partial Autocorrelation
With the understanding of what stationarity means and how to make a time series stationary, we will now focus on two crucial tools for deeper Time Series Analysis before forecasting. The Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF). Both can be used to identify explanatory relationships between lagged values of a time series.
Autocorrelation Function (ACF)
- Calculates the autocorrelations, which measure the relationship between Yₜ (the current value of our variable) and Yₜ₋ₙ (multiple lagged versions of our variable). For example, for a lag-1 autocorrelation, the ACF calculates the correlation between Yₜ and Yₜ₋₁.
- Captures also indirect effects (e.g. yₜ could correlate with yₜ₋₂ only because both correlate with yₜ₋₁ rather than because of any new information contained in yₜ₋₂ that could contribute to forecasting yₜ).
The autocorrelation plot can be drawn with the statsmodels plot_acf function.
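A minimal sketch (assuming the full ausbeer data set is available as a pandas Series named ausbeer; the number of lags is chosen for illustration):

```python
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

# ausbeer: the full quarterly beer production series as a pandas Series (assumed available)
plot_acf(ausbeer, lags=20)
plt.show()
```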
There is no fixed rule for the number of lags. However, Montgomery (2015) recommends, "a good general rule of thumb is that at least 50 observations are required to give a reliable estimate of the ACF, and the individual sample autocorrelations should be calculated up to lag K, where K is about T/4".
Let us examine the ausbeer data set for a better understanding of how to interpret autocorrelation values (fig 6).

The left-hand side shows the autocorrelation plot of the ausbeer data set (a). On the right-hand side, you find the plotted ausbeer time series (b), its trend (c), and seasonal component (d).
The autocorrelation plot (a) shows us the value of the autocorrelation between an observation and its lagged values. It also tells us if the value at lag n is statistically significant or not. If the autocorrelation value lies within the blue area, the value is not significant. If it lies outside, it is significant.
When a time series has a trend, the autocorrelations for small lags tend to be large and positive. You can find this in the plot (a) at the lags 1 and 2.
The value at lag 0 will always be 1 and significant since it’s the autocorrelation of a time series with itself corr(Yₜ,Yₜ). It can be seen as a reference point.
When our time series is seasonal, the autocorrelation values will be larger for the seasonal lags than for other lags. That can be seen at lags 6, 12, 18, and so on.
Partial Autocorrelation Function (PACF)
As already mentioned, the ACF captures the indirect effects as well. A way to focus only on the direct relationship between Yₜ and its lagged version Yₜ₋ₙ is the Partial Autocorrelation Function (PACF). In general, the "partial" correlation between two variables is the amount of correlation between them which is not explained by their mutual correlations with a specified set of other variables.
Here, too, statsmodels provides a function, plot_pacf, to calculate and plot the partial autocorrelation values.
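A minimal sketch, applied here to the simulated white noise and random walk series from above:

```python
from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

# Partial autocorrelations of the simulated series
plot_pacf(white_noise, lags=20)
plot_pacf(random_walk, lags=20)
plt.show()
```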

Figure 7 shows two plots created with the plot_pacf function. The first one (a) applies the PACF to white noise data. Since white noise has no correlation between its values, we do not see any significant partial autocorrelations here (apart from lag 0, all values lie within the blue area).
The second plot (b) applies the PACF to random walk data. As already said, the best prediction for a random walk is the value from one lag ago. Accordingly, the plot shows a single large, significant spike at lag 1.
Depending on your purpose, it might be necessary to make your time series stationary before analyzing it with ACF and PACF plots (e.g. if you want to choose the p and q parameter values for your ARMA model).
Later, in time series modeling, the ACF and PACF play an important role not only in choosing the right parameter values but also in analyzing the residuals of our forecast model. The ACF and PACF of white noise do not contain any significant autocorrelations or partial autocorrelations. Therefore, we can later use the ACF and PACF to check that our residuals show no significant autocorrelations or partial autocorrelations.
💡 Key takeaways
- With the Autocorrelation Function (ACF) plot we are able to visually analyze the autocorrelations and their significance.
- Unlike the ACF, the Partial Autocorrelation Function (PACF) measures only the direct relationship between Yₜ and its lagged version Yₜ₋ₙ.
- Both are essential tools for choosing the right parameters for models like ARMA or ARIMA.
An article that explains the theoretical basics of a field always comes with the risk of getting too dry. I hope that was, for the most part, not the case here, and that you now have a better foundation for time series modeling. Check out the references below if you are interested in more information and great literature. As mentioned at the beginning, the initial idea was to make a series out of it. If there is demand and I have the time, I promise the following articles will be more hands-on!
References
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M. 2015. Time Series Analysis: Forecasting and Control.
Chatfield, C., and Xing, H. 2019. The Analysis of Time Series: An Introduction with R, CRC Press.
Dettling, M. (n.d.). Applied Time Series Analysis – Course 2020.
Hyndman, R. J., and Athanasopoulos, G. 2018. Forecasting: Principles and Practice.
Makridakis, S. G. 1998. Forecasting: Methods and Applications, (Third Edition), New York: Wiley.
Montgomery, D. C. 2015. Introduction to Time Series Analysis and Forecasting, (Second Edition.), Wiley Series in Probability and Statistics, Hoboken, NJ: Wiley.