Time Series Decomposition and Statsmodels Parameters

Alex Mitrani
Towards Data Science
6 min read · Jan 17, 2020


Note: This article was updated on July 11, 2020 to reflect new changes to the statsmodels Python module and to include results from multiplicative models.

Time series decomposition is the process of separating time series data into its core components. These components include a potential trend (an overall rise or fall in the mean), seasonality (a recurring cycle), and the remaining random residual. Nearly all time series that you will come across are not naturally stationary, meaning that the mean, variance, or covariance will be time dependent. This is why data scientists must identify and separate trends and seasonality from time series data before applying a model.

You can manually remove trends by applying transformations, subtracting rolling means, and differencing to make your data stationary, or you can use Python’s statsmodels library to identify trends and seasonality for you.
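As a quick illustration of the manual approach, here is a sketch using a synthetic daily series (the S&P 500 data is loaded later in the article); the window size and series values are illustrative only:

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a linear trend plus noise (illustrative only)
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-02", periods=300, freq="B")
prices = pd.Series(np.linspace(100, 160, 300) + rng.normal(0, 2, 300), index=idx)

# Option 1: first-order differencing removes a linear trend
diffed = prices.diff().dropna()

# Option 2: subtract a rolling mean to detrend on the original scale
detrended = (prices - prices.rolling(window=20).mean()).dropna()
```

Either transformed series can then be fed to a stationarity test or a model that assumes stationary input.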

Over the past few weeks, I’ve covered a few time series topics that I will be building upon here including OHLC visualizations, time series data EDA, trend analysis, and stationarity.

Decomposition

All time series data can be broken down into four core components: the average value, a trend (i.e. an increasing mean), seasonality (i.e. a repeating cyclical pattern), and a residual (random noise). Trends and seasonality are not always present in time dependent data. The residual is what’s left over after trends and seasonality are removed. Time series models assume that the data is stationary, and only the residual component satisfies the conditions for stationarity.

Python’s statsmodels library has a method for time series decomposition called seasonal_decompose(). I used historical daily average closing prices of the S&P 500 index over the last five years to illustrate time series decomposition. The data was obtained from the UniBit API (Note: in a later section I use only three years of prices due to limitations with the API).

S&P 500 Index Historical Prices Over The Previous Five Years — Courtesy of UniBit API

I stored my data in a pandas dataframe and set the index to the date column using the .set_index() method. I then ensured the data type of the date index column was a pandas datetime object. You need to ensure your data is in the proper format; the UniBit API provides dates in the format Year-Month-Day (i.e. 2015–01–20).
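In pandas terms the steps look roughly like this (the column names here are illustrative, not the API’s actual field names):

```python
import pandas as pd

# A few rows in the Year-Month-Day format the UniBit API returns
raw = pd.DataFrame({
    "date": ["2015-01-20", "2015-01-21", "2015-01-22"],
    "adj_close": [2022.55, 2032.12, 2063.15],
})

df = raw.set_index("date")
df.index = pd.to_datetime(df.index)  # make sure the index holds datetimes, not strings
```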

The seasonal_decompose() method can take up to six parameters. I focused on the data itself, the model type, and the frequency (period in the documentation). I used the adjusted closing prices column of the pandas dataframe, where the index is a datetime object. The model type parameter can be either additive or multiplicative; the choice depends on whether the amplitude of your data’s seasonality is level (mean) dependent. If the seasonality’s amplitude is independent of the level, you should use the additive model; if the seasonality’s amplitude depends on the level, you should use the multiplicative model.

Additive versus multiplicative time series data. Note the expanding variance of the seasonality in the multiplicative series on the right. Source “Additive and multiplicative seasonality — can you identify them correctly?” by Nikolaos Kourentzes

An initial visual inspection of a plot of your data can suggest which model type matches it; in this article I will test both models.

The Freq Parameter

The last parameter that I dealt with was the frequency parameter. (Note: the frequency parameter of statsmodels’ seasonal_decompose() method has been deprecated and replaced with the period parameter.) I did not expect to have to select this parameter because it is optional; the seasonal decompose method worked perfectly fine without it on the data from the tutorial that I initially used. When I applied the method to the S&P 500 data frame I received the following error:

The error raised when I applied the seasonal decompose method to the S&P 500 data

I searched for reasons why this error occurred: one GitHub conversation discussed the inferred_freq attribute of a pandas series, and another discussed a bug that had recently appeared in the seasonal_decompose() method. The error also appeared in the comment section of Jason Brownlee’s tutorial that I was following. It is possible that the issue arises with time series data where observations are not evenly spaced, i.e. weekends are missing from daily data. This is the only difference between the tutorial data that I initially used and the S&P 500 adjusted daily closing prices.

So, how do you select an appropriate value for the frequency parameter? In the comment sections of these GitHub conversations, several users specified a frequency that they could justify logically. Brownlee’s tutorial linked to the book “Forecasting: Principles and Practice” by Rob J. Hyndman and George Athanasopoulos, whose authors give similar logical justifications for selecting this parameter in their section on classical decomposition. I tested three frequencies for my time series data: 5, 20, and 253. Five because that is how many trading days there are in a week, 20 trading days per month, and 253 per year. (Note: you must have at least twice as many observations in your data as the frequency that you want to test; i.e. if you want to set the frequency to 253 then you need at least 506 observations.)

Additive Model

I compared the results of the Dickey-Fuller test for the additive models with the three frequency values; the model with a period of 5 days had the smallest p-value:

A seasonal decomposition of the S&P 500 data with an additive model and the period set to 5.

The seasonal decompose method broke the data down into its trend, seasonality, and random residual components. The residual component of the series with the period set to 5 is stationary because its p-value of 4.576743e-21 is far below 0.05.

Multiplicative Model

I then compared the results of the Dickey-Fuller test for the multiplicative models with the same three frequency values; the model with a period of 5 days once again had the smallest p-value:

A seasonal decomposition of the S&P 500 data with a multiplicative model and the period set to 5.

The residual component of the series with a period set to 5 is stationary because the p-value of 3.053408e-20 is far less than 0.05.

Comparison of Results

The p-value of the Dickey-Fuller test was smaller for the multiplicative model for two of the three tested periods: 20 and 253 days. The best performing model overall, however, was the additive model with a period of 5 days. Only the 253-day period produced p-values above 0.05; the results are below:

A comparison of the Dickey-Fuller test results for S&P 500 index data with different periods

Summary

Statsmodels’ seasonal decompose method neatly breaks time series data down into its components. Data scientists and statisticians must use logic to justify the period selected for additive and multiplicative seasonal decomposition models, because methods that automatically infer a series’ frequency are not always reliable. Nevertheless, statsmodels contains packages that greatly reduce the need for guesswork.

My repository for this project can be found here; the file used for this specific article is time_series_removing_trends_and_decomposition.ipynb.

