Testing for Stationarity Before Running Forecast Models. With Python. And A Duckling Picture.
ADF, KPSS, OCSB, and CH tests for stationarity and for a stable seasonal pattern; and how to proceed when they return contradictory results.
To avoid a trap that could lead to a deficient forecast model, we will apply the ADF and the KPSS tests in parallel to check if the time series not only quacks like a duck, but also waddles like waterfowl is supposed to. We will also run the OCSB and CH tests to check if seasonal differencing is required.

Our source consists of 1200 months of historical temperature records for the small (and entirely fictional) town of Lower Tidmarsh, East Dakotahoma. The Lower Tidmarsh town archive was destroyed by a kitchen fire in the 1980s before (or, as some residents told us, because) the volunteer fire brigade came to the rescue. The temperature records had to be reconstructed by interviewing the two centenarian residents. The time series is synthetic, consisting of a sinusoidal seasonal component that mirrors the harsh winters and moderate summers in East Dakota; a global warming trend over the past century; and a white noise component representing the estimation uncertainty.
You can download the small Temp.csv source file (~33 kB) from Google Drive via the link shown above. The Jupyter notebook is available via the second link.
0. Dependencies
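The notebook's import cell is not reproduced here; below is a plausible set of dependencies for everything that follows. The library choices match the tests discussed in this article, but the exact cell is my reconstruction.

```python
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# pmdarima: stationarity tests and differencing-order helpers
from pmdarima.arima import ADFTest, KPSSTest
from pmdarima.arima.utils import ndiffs, nsdiffs

# statsmodels: counterpart tests and diagnostic plots
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
```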
1. Data Processing
Download the source data file Temp.csv.

Pandas imports .csv date columns as objects/strings. Therefore, we convert the dates to datetime, set an index, and derive year and month from the index.
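A minimal sketch of this step; the column names "Date" and "Temp" are my assumptions about the structure of Temp.csv.

```python
import pandas as pd

# pandas reads the date column as object/string by default
df = pd.read_csv("Temp.csv")  # assumed columns: "Date", "Temp"

# convert to datetime, set the index, and sort chronologically
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date").sort_index()

# derive year and month from the datetime index
df["year"] = df.index.year
df["month"] = df.index.month
```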

Let’s create a pandas pivot table to look at the source data in tabular form.
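A sketch of the pivot table, continuing from the snippet above: years as rows, months as columns, temperatures as cell values.

```python
# one row per year, one column per month, mean temperature as cell value
piv = pd.pivot_table(df, values="Temp", index="year",
                     columns="month", aggfunc="mean")
print(piv.head())
```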

We use the pivot table to compute the 10-year rolling average temperature that will iron out the short-term fluctuations of seasonal peaks and dips, and then create a chart to study the long-term trend, if there is any.
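One way to compute and plot that rolling average (a sketch; the 10-year window comes from the text, the chart labels are mine):

```python
import matplotlib.pyplot as plt

# annual mean across the 12 month columns, then a 10-year rolling window
annual_mean = piv.mean(axis=1)
rolling10 = annual_mean.rolling(window=10).mean()

ax = rolling10.plot(figsize=(10, 4), title="10-year rolling average temperature")
ax.set_xlabel("year")
ax.set_ylabel("temperature")
plt.show()
```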
The plot shows a rising trend – a first indication that our time series is not stationary.

Before we can feed the temperature into a forecast model such as SARIMA, we need to test it for stationarity.
We may be tempted to just kick off some kind of grid-search for suitable hyperparameters and then leave it to the auto-tuning process to identify the model with the lowest Akaike information criterion. But this can lead to the forecast quality trap mentioned earlier.
- The information criteria represent the objective we want to minimize with respect to the autoregressive AR and moving average MA terms;
- whereas the order of differencing must be determined in advance, by running tests for stationarity.
2. Testing for Stationarity
2.1 Stationarity and Differencing
Stationarity
"A stationary time series is one whose properties don’t depend on the time at which the series is observed." (Hyndman: 8.1 Stationarity and differencing | Forecasting: Principles and Practice (2nd ed) (otexts.com))
A time series is stationary if its mean, variance, and autocorrelation structure do not change over time. If these properties are not time-invariant, the properties we use today to prepare a forecast would differ from those we would observe tomorrow. A process that is not stationary eludes our methods for using past observations to predict future developments. The time series itself does not need to remain a flat, constant line in past and future periods to be deemed stationary – but the patterns that determine its changes over time need to be stable to make its future behavior predictable.
The time series needs to exhibit:
- time-invariant mean
- time-invariant variance
- time-invariant autocorrelations
Time series that are not stationary a priori can often be transformed to reach stationarity.
Inconstant Mean
A series that shows a robust upward or downward trend does not have a constant mean. But if its data points tend to revert to the trendline after disturbances, the time series is trend-stationary. A transformation such as de-trending may turn it into a stationary time series that can be used in the forecast model. If the trend follows a predictable pattern, we can fit a trendline to the observations and then subtract it before we feed the de-trended series into the forecast model. Alternatively, we may be able to insert a datetime index into the model as an additional independent variable.
If these de-trending measures do not suffice to realize a constant mean, we can investigate if the differences from one observation to the next have a constant mean.
By differencing the time series – taking the difference between an observation y(t) and an earlier observation y(t-n) – we could obtain a stationary (mean-reverting) series of the changes.
A time series in which each observation equals its immediate predecessor, plus or minus some random error, is called a random walk. The differences between observations then consist only of the error term, which by definition has a zero mean as long as it does not contain a signal with valuable information for the forecast. Random walks may exhibit long phases of apparent trends, up or down, followed by unpredictable changes of direction. A random walk requires one order of differencing.
A time series in which the differences between neighboring observations have a non-zero mean will tend to drift upwards (positive mean) or downwards (negative mean). We difference a time series with drift to get a series with constant mean.
Some time series require two rounds of differencing. The changes between observations are not constant (no constant "speed" between observations), but the change rate may be stable (constant "acceleration" or "deceleration"). If two rounds of differencing do not suffice to make a time series stationary, a third round is rarely justifiable. Rather, the properties of the time series should be investigated more closely.
A time series with seasonality will exhibit patterns that repeat after a constant number of periods: temperatures in January differ from those in July, but January temperatures will be at a similar level between years. Seasonal differencing takes the difference between an observation and its predecessor that is S lags removed, with S being the number of periods in a full season, like 12 months in a year or 7 days in a week.
If both the trend and the seasonal pattern are relatively time-invariant, the differenced time series (first-differenced with respect to the trend; and seasonally-differenced with respect to the seasonality) will have an approximately constant mean.
Inconstant Variance
If the time series takes on the shape of an expanding or narrowing funnel, then its observations fluctuate around its trend with an increasing or decreasing variance over time. Its variance is not time-invariant.
By taking the logarithm of the observations or their square root, or by applying a Box-Cox transformation, we may be able to stabilize the variance. After the forecast, we would reverse the transformation to return the predictions to the original scale.
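A sketch of a Box-Cox transformation with scipy. Note that Box-Cox requires strictly positive values, so a temperature series with sub-zero winters would first need a constant shift – an assumption made explicit below.

```python
from scipy.stats import boxcox

y = df["Temp"]  # continuing from the data-processing sketch

# shift the series into strictly positive territory if necessary
shift = abs(min(y.min(), 0)) + 1
y_bc, lam = boxcox(y + shift)
print(f"estimated Box-Cox lambda: {lam:.3f}")

# reverse later (for lam != 0): y = (y_bc * lam + 1) ** (1 / lam) - shift
```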
Inconstant Autocorrelation Structure
The autocorrelation structure is inconstant if the correlation and covariance between observations y(t) and y(t-n), for a given lag n, change over time. For stationarity, the autocorrelations should depend only on the lag n, not on the time t.
PSA #1: Determine Stationarity Before Fitting A Model
The required order of differencing is a parameter that should be determined in advance, before fitting a forecast model to the data. A tuning algorithm can test any combination of hyperparameters against a chosen benchmark such as the Akaike information criterion. But some of the hyperparameters may neutralize each other's effects: a hyperparameter search in a SARIMA model will trade autoregressive AR and moving-average MA terms for changes in the order of differencing.
"It is important to note that these information criteria tend not to be good guides to selecting the appropriate order of differencing (d) of a model, but only for selecting the values of p and q. This is because the differencing changes the data on which the likelihood is computed, making the AIC values between models with different orders of differencing not comparable. So we need to use some other approach to choose d, and then we can use the AICc to select p and q." (Hyndman, 8.6 Estimation and order selection | Forecasting: Principles and Practice (2nd ed) (otexts.com)).
Thus, if a hyperparameter search attempts to determine the order of differencing in parallel with the other parameters, we may obtain an inferior forecast model. The search would find an order of differencing that apparently minimizes AIC or BIC. But it may have missed a model that could lead to more accurate predictions despite its higher AIC. The search algorithm is unaware that its objective, the information criterion, cannot compare models with different orders of differencing.
Either the tuning algorithm should apply hypothesis tests to determine the appropriate order of differencing before it starts a grid search for the other hyperparameters; or the data scientist pins down the order of differencing and then limits the grid search to the remaining parameters such as the AR and MA terms.
PSA #2: Conduct Parallel Tests for Stationarity
To find out if differencing is required, we can run four tests to obtain objective results, which a visual inspection of charts may miss:
- Augmented Dickey-Fuller ADF
- Kwiatkowski-Phillips-Schmidt-Shin KPSS
- Osborn-Chui-Smith-Birchenhall OCSB for seasonal differencing
- Canova-Hansen CH for seasonal differencing
I will skip some other unit root tests, such as Phillips-Perron.
These tests may return contradictory results in quite a few cases. The following example will demonstrate that ADF and KPSS should be evaluated in parallel, not in isolation. Many of us – myself included, when I prepared my first forecasts – are used to relying on the ADF test as our default for stationarity tests; others prefer the KPSS test. Few among us, I suppose, routinely apply and then compare both tests to decide on differencing.
2.2 Augmented Dickey-Fuller Test (pmdarima) – Quacks Like A Duck?
- Null hypothesis: the series contains a unit root: it is not stationary.
- Alternative hypothesis: there is no unit root.
- Low p-values are preferable. If the test returns a p-value below the chosen significance level (e.g. 0.05), we reject the null and conclude that the series does not contain a unit root.
- If the ADF test does not find a unit root, but the KPSS test does, the series is difference-stationary: it still requires differencing.
The pmdarima tests, both ADF and KPSS, return two outputs: the p-value, and a Boolean value that answers the question: "Should we difference?"
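A sketch of the pmdarima ADF test on the temperature series (variable names are mine):

```python
from pmdarima.arima import ADFTest

adf_test = ADFTest(alpha=0.05)
p_val, should_diff = adf_test.should_diff(df["Temp"])
print(f"ADF p-value: {p_val:.4f} | should difference: {should_diff}")
```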
2.3 Kwiatkowski-Phillips-Schmidt-Shin Test (KPSS) (pmdarima) – But It Does Not Walk Like A Duck?
- Null hypothesis: the series is stationary around a deterministic trend (trend-stationary).
- Note that the KPSS test swaps the null hypothesis and alternative hypothesis, compared to the ADF test.
- Alternative hypothesis: the series has a unit root. It is non-stationary.
- High p-values are preferable. If the test returns a p-value above the chosen significance level (e.g. 0.05), we fail to reject the null and conclude that the series appears to be (at least trend-)stationary.
If the KPSS test does not find a unit root, but the ADF test does, the series is trend-stationary: it requires differencing (or other transformations such as de-trending) to remove the trend.
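The pmdarima KPSS test follows the same pattern (a sketch):

```python
from pmdarima.arima import KPSSTest

kpss_test = KPSSTest(alpha=0.05)
p_val, should_diff = kpss_test.should_diff(df["Temp"])
print(f"KPSS p-value: {p_val:.4f} | should difference: {should_diff}")
```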
2.4 Compare the ADF and KPSS Test Results (pmdarima)
Thus, the pmdarima tests return conflicting results.

2.5 Order of Differencing Recommended by ADF and KPSS
pmdarima also offers a method that returns the recommended order of first-differencing.
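That method is ndiffs; it applies the chosen test repeatedly and returns the estimated number of first-differences (a sketch):

```python
from pmdarima.arima.utils import ndiffs

d_adf = ndiffs(df["Temp"], test="adf")    # order recommended by ADF
d_kpss = ndiffs(df["Temp"], test="kpss")  # order recommended by KPSS
print(f"ADF recommends d = {d_adf}, KPSS recommends d = {d_kpss}")
```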
The recommendations are contradictory as well, because the same ADF and KPSS tests are used to derive them.
But we will come back to these orders of differencing later, when we wrap up our findings and decide how to proceed.

Let’s check with the statsmodels.tsa.stattools tests whether this is just a quirk in the pmdarima algorithm (hint: it is not).
2.6 Augmented Dickey-Fuller Test (stattools) – Quacks Like A Duck?
- We use the adfuller test from statsmodels.tsa.stattools to obtain additional information compared to the pmdarima tests.
- Null hypothesis: the series contains a unit root, it is not stationary.
- Alternative hypothesis: there is no unit root.
- Low p-values are preferable. If the test returns a p-value below the chosen significance level (e.g. 0.05), we reject the null and conclude that the series does not contain a unit root. It appears to be stationary.
If the ADF test does not find a unit root, but the KPSS test does, the series is difference-stationary: it requires differencing.
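A sketch of the statsmodels version, which also reports the test statistic, the lags used, and the critical values:

```python
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, used_lags, n_obs, crit_values, _ = adfuller(df["Temp"])
print(f"ADF statistic: {adf_stat:.4f}")
print(f"p-value:       {p_value:.4f}")
print(f"lags used: {used_lags}, observations: {n_obs}")
for level, cv in crit_values.items():
    print(f"critical value ({level}): {cv:.4f}")
```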
2.7 Kwiatkowski-Phillips-Schmidt-Shin Test (KPSS) (stattools) – Does Not Walk Like A Duck?
- Null hypothesis: the series is stationary around a deterministic trend (trend-stationary).
- Alternative hypothesis: the series has a unit root. It is non-stationary.
- High p-values are preferable. If the test returns a p-value above the chosen significance level (e.g. 0.05), we fail to reject the null and conclude that the series appears to be at least trend-stationary.
If the KPSS test does not find a unit root, but the ADF test does, the series is trend-stationary: it requires differencing or other transformations to remove the trend.
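And the statsmodels KPSS test (a sketch; regression="ct" tests the null of trend-stationarity described above, and statsmodels warns when the p-value falls outside its lookup table):

```python
from statsmodels.tsa.stattools import kpss

# regression="ct": null hypothesis of stationarity around a deterministic trend
kpss_stat, p_value, n_lags, crit_values = kpss(df["Temp"], regression="ct", nlags="auto")
print(f"KPSS statistic: {kpss_stat:.4f}")
print(f"p-value:        {p_value:.4f}")
for level, cv in crit_values.items():
    print(f"critical value ({level}): {cv:.4f}")
```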
2.8 Compare the ADF and KPSS results – ADF quacks like a duck, but KPSS does not walk like waterfowl

2.9 Difference or Don’t Difference?
- So the ADF test does not find a unit root even though the chart above shows a clear upward trend.
- The KPSS test reports that the series is not stationary.
How do we deal with the conflict? Is the KPSS test always correct?
2.10 Visual Plausibility Check: Decomposition
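A sketch of the decomposition and autocorrelation plots discussed below (the seasonal period of 12 and the lag count are my choices):

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# split the series into trend, seasonal, and residual components
decomposition = seasonal_decompose(df["Temp"], model="additive", period=12)
decomposition.plot()

# ACF and PACF over four full seasonal cycles
fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(df["Temp"], lags=48, ax=axes[0])
plot_pacf(df["Temp"], lags=48, ax=axes[1])
plt.show()
```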


- The trend chart does not show a constant mean, but rather an upward trend. The series cannot be stationary.
- The ACF and PACF plots show high and persistent autocorrelations with seasonal oscillations. The series cannot be stationary if it exhibits stable seasonality.
2.11 First-Difference: Reaching Stationarity
We apply the differencing method .diff() to the original time series; and then check for stationarity with both ADF and KPSS.
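A sketch of the differencing step and the re-run of both pmdarima tests:

```python
from pmdarima.arima import ADFTest, KPSSTest

# first-difference and drop the leading NaN created by .diff()
temp_diff = df["Temp"].diff(1).dropna()

adf_p, adf_should = ADFTest(alpha=0.05).should_diff(temp_diff)
kpss_p, kpss_should = KPSSTest(alpha=0.05).should_diff(temp_diff)
print(f"ADF:  p = {adf_p:.4f} | difference again: {adf_should}")
print(f"KPSS: p = {kpss_p:.4f} | difference again: {kpss_should}")
```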


ADF and KPSS agree that the differenced series is stationary. The differenced series not only quacks like a duck, it also walks like one.



2.12 Stationary – But What About The Seasonality?
We have applied first-differences and received favorable test results from ADF and KPSS. However, the ACF plot still shows seasonal fluctuations.
Let’s run the OCSB and CH tests to decide if we need a helping of seasonal differencing as well.
The pmdarima implementations of both tests return the recommended orders of seasonal differencing.
Osborn-Chui-Smith-Birchenhall OCSB Test:
- Null hypothesis: the series contains a seasonal unit root
- It uses a Dickey-Fuller type regression. (ocsb: OCSB test in seastests: Seasonality Tests (rdrr.io) )
Canova-Hansen Test for Seasonal Stability:
- Null hypothesis: the seasonal pattern is stable over time
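A sketch of both seasonal tests via pmdarima's nsdiffs, run on the original series and on the first-differenced series from chapter 2.11 (m=12 for monthly data):

```python
from pmdarima.arima.utils import nsdiffs

# temp_diff is the first-differenced series from the sketch in chapter 2.11
for name, series in [("original", df["Temp"]), ("first-differenced", temp_diff)]:
    D_ocsb = nsdiffs(series, m=12, test="ocsb")  # seasonal unit root test
    D_ch = nsdiffs(series, m=12, test="ch")      # seasonal stability test
    print(f"{name}: OCSB recommends D = {D_ocsb}, CH recommends D = {D_ch}")
```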
2.12a When we investigate the original data, we observe another conflict, this time about seasonal differencing:
- The OCSB test does not identify a need for seasonal differencing, similar to the ADF test for first-differencing.
- The CH test does recommend 1 order of seasonal differencing, similar to KPSS for first differencing.
2.12b When we run OCSB and CH on the first-differenced data we generated in chapter 2.11, both tests agree that first-differencing has obviated the need for any seasonal differencing.

Conversely, if OCSB or CH had recommended seasonal differencing, we would have created a seasonally differenced series by applying the .diff(12) method to the original series.
Syntax for differencing in pandas: If y is the variable that represents the series of undifferenced data, then:
- y.diff(1) for first-differencing
- y.diff(12) for seasonal differencing if the seasonality has a periodicity of 12 months. The argument tells pandas how many lags to jump back: .diff(12) subtracts from each observation the value recorded 12 months earlier. If the tests recommended a seasonal order greater than 1, we would apply .diff(12) that many times.
- y.diff(1).diff(12) or y.diff(12).diff(1) – for combining both first- and seasonal differencing in a one-liner. The sequence of first- and seasonal differencing is not relevant – the results would be the same.
- Rules for identifying ARIMA models (duke.edu):
- "Rule 12: If the series has a strong and consistent seasonal pattern, then you must use an order of seasonal differencing (otherwise the model assumes that the seasonal pattern will fade away over time). However, never use more than one order of seasonal differencing or more than 2 orders of total differencing (seasonal+nonseasonal)."
2.13 ADF and KPSS Conflicts – How Do We Deal With Them?
If the ADF and KPSS tests return conflicting results, how do we proceed: difference or don’t difference?
As a general rule:
- Neither the ADF test nor the KPSS test will confirm or disconfirm stationarity in isolation. Run both tests to decide if you should difference.
- If at least one of the tests claims to have found non-stationarity, you should difference. An unambiguous confirmation of duckiness (stationarity) requires that both tests confirm the quacking and the waddling.
A more specific explanation:
There are 4 possible combinations of KPSS and ADF test results
- If KPSS and ADF agree that the series is stationary (KPSS with high p-value, ADF with low p-value): Consider it stationary. No need to difference it.
- ADF finds a unit root; but KPSS finds that the series is stationary around a deterministic trend (ADF and KPSS with high p-values). Then, the series is trend-stationary and it needs to be detrended. Difference it. Alternatively, a transformation may rid it of its trend.
- ADF does not find a unit root; but KPSS claims that it is non-stationary (ADF and KPSS with low p-values). Then, the series is difference-stationary. Difference it.
- If KPSS and ADF agree that the series is non-stationary (KPSS with low p-value; ADF with high p-value): Consider it non-stationary. Difference it.
Let’s translate this heuristic to Python:
For first-differencing, we take the higher of the orders which ADF and KPSS recommend.
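A sketch of the "take the higher order" heuristic for first-differencing:

```python
from pmdarima.arima.utils import ndiffs

# the more conservative (higher) of the two recommendations wins
n_diffs = max(ndiffs(df["Temp"], test="adf"),
              ndiffs(df["Temp"], test="kpss"))
print(f"recommended order of first-differencing: {n_diffs}")
```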

For seasonal differencing, we take the higher of the orders which OCSB and CH recommend. To avoid over-differencing, we should check if first-order differencing already arrives at stationarity.
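And the same heuristic for seasonal differencing, applied after the recommended first-differencing so that we do not over-difference (a sketch; n_diffs comes from the previous snippet):

```python
from pmdarima.arima.utils import ndiffs, nsdiffs

# apply the recommended first-differencing before the seasonal check
y_checked = df["Temp"].copy()
for _ in range(n_diffs):
    y_checked = y_checked.diff().dropna()

# seasonal recommendation: the higher of OCSB and CH
n_sdiffs = max(nsdiffs(y_checked, m=12, test="ocsb"),
               nsdiffs(y_checked, m=12, test="ch"))
print(f"recommended order of seasonal differencing: {n_sdiffs}")

# re-run ADF and KPSS on the differenced series to rule out further differencing
print("further first-differencing needed:",
      max(ndiffs(y_checked, test="adf"), ndiffs(y_checked, test="kpss")))
```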


After one round of differencing, the code runs all four tests again – ADF, KPSS, OCSB, and CH – to check whether additional differencing is required. In our example, all four tests agree that the single order of first-differencing we applied in chapter 2.11 was enough to arrive at a stationary time series – which we can now hand over to a forecast model.
Duckling picture: Kerin Gedge, unsplash.com – "yellow and brown duckling photo – Free Animal Image on Unsplash"