
Seasonality, as its name suggests, refers to the seasonal characteristics of time series data. It is a predictable pattern that repeats at a certain frequency within one year, such as weekly, monthly, or quarterly. The most straightforward example of seasonality is temperature data: in most places on Earth, we expect the temperature to be higher in the summer and lower in the winter.
Taking seasonality into consideration is very important in time series forecasting, such as demand forecasting. For example, we may expect ice cream sales to show seasonality, since sales are higher every summer. A model that accounts for these seasonal effects will forecast more accurately. In general, the goal of time series analysis is to take advantage of the data’s temporal nature to build more sophisticated models. To forecast properly, we need techniques to find and model the long-term trend, the seasonality, and the residual noise in our data. This article focuses on how to detect seasonality in the data and how to incorporate it into forecasting.
Detect Seasonality
Before putting seasonality into a model, we need to know how the data repeats and at what frequency. Detecting seasonality can be straightforward if you understand the context of the data well. For example, we know the temperature will be higher in the summer and lower in the winter each year. To discover seasonality in data whose context you are not familiar with, the simplest way is to plot the data and look for periodic signals along the time axis:

The graph above shows a pattern that repeats every year. Even though the peak each summer reaches differs from year to year, we can still see that the temperature is at its highest during the summer and at its lowest during the winter.
There is also seasonality within a day for temperature data, and we may not know exactly how temperature fluctuates over a day. We could plot the data at a lower frequency, or use an algorithm called the fast Fourier transform (FFT) to detect the repeated frequency.
Any periodic signal can be represented as the sum of several sine waves with varying amplitude, phase, and frequency. A time series can be converted into its frequency components with the mathematical tool known as the Fourier transform. The output of an FFT can be thought of as a representation of all the frequency components of your data. In some sense it is a histogram, with each "frequency bin" corresponding to a particular frequency in your signal. Each frequency component has both an amplitude and a phase and is represented as a complex number. Generally, we only care about the amplitude when determining which "frequency bin" is highest in your data. I won’t go into much mathematical detail here, but if you are interested, this article by Cory Maklin explains it very well.
To perform an FFT on a dataset, we can use the FFT module from SciPy. Take a temperature dataset (temps) containing 13 years of hourly temperatures as an example. The dataset looks like this:

Perform the following code on the dataset:
from scipy.fftpack import fft
import numpy as np
import matplotlib.pyplot as plt
fft = fft((temps.Temperature - temps.Temperature.mean()).values)
plt.plot(np.abs(fft))
plt.title("FFT of temperature data")
plt.xlabel('# Cycles in full window of data (~13 years)');
Note that in the first step we subtract the mean of the temperature to avoid a large zero-frequency component. Plotting the whole dataset, with 13 years of data, may not show the pattern well, as the graph shows:

The Fourier transform of a real signal (one with no imaginary part) is symmetric about the center of the frequency range, so plotting the FFT over the full range may not show the most relevant information. Thus we can zoom in to check the "frequency bins" at different frequency levels. We would expect the temperature to have yearly seasonality, so we zoom in to the low-frequency part of the "histogram", from 0 to 5 cycles per year:
plt.plot(1./13 * np.arange(len(fft)), np.abs(fft))
plt.title("FFT of temperature data (Low Frequencies zoom)")
plt.xlim([0,5])
plt.xlabel('Frequency ($y^{-1}$)');
The graph below shows the yearly seasonality: the "frequency bin" at 1 cycle per year is very high:

Similarly, we can zoom in to the frequencies around 365 cycles per year to check for daily seasonality:
plt.plot(1./13 * np.arange(len(fft)), np.abs(fft))
plt.title("FFT of temperature data (High Frequencies zoom)")
plt.ylim([0,120000])
plt.xlim([365 - 30, 365 + 30])
plt.xlabel('Frequency ($y^{-1}$)');
The graph below shows the daily seasonality, as the spectrum peaks at 365 cycles per year.

The FFT is an excellent tool for transforming time series data so that it can be plotted as a "frequency histogram". The plot shows the frequency of the data’s repeated pattern, and thus reveals the seasonality in the data.
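To make the detection step concrete, here is a minimal sketch of the idea. The `dominant_period` helper and the synthetic hourly series are hypothetical (not from the temps dataset); the helper returns the period corresponding to the strongest frequency bin:

```python
import numpy as np
from scipy.fftpack import fft

def dominant_period(values, sample_spacing=1.0):
    """Return the period (in sample units) of the strongest frequency component."""
    centered = np.asarray(values, dtype=float)
    centered = centered - centered.mean()  # drop the large zero-frequency component
    spectrum = np.abs(fft(centered))
    n = len(centered)
    # Search only the first half; a real signal's spectrum is mirrored
    peak_bin = int(np.argmax(spectrum[1 : n // 2])) + 1
    return n * sample_spacing / peak_bin

# Hourly samples over 30 days with a 24-hour cycle
t = np.arange(24 * 30)
series = 10 + 3 * np.sin(2 * np.pi * t / 24)
print(dominant_period(series))  # 24.0
```

The same idea underlies the zoomed FFT plots above: the bin with the largest amplitude tells you how often the pattern repeats.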
Add Seasonality
After detecting seasonality, there are several ways to incorporate it into a model to improve time series forecasting. This article introduces seasonal indicators, Fourier analysis, and the SARIMA model as ways to add seasonality to time series forecasting.
1. Add Seasonal Indicators
The most straightforward way of adding seasonality into the model is to add seasonal indicators. Seasonal indicators are categorical variables describing the "season" of each observation. Take temperature prediction as an example: to capture daily seasonality, you may use the "hour" of each observation as a feature when training the model. This feature captures the fixed effect of the hour of the day on the predicted temperature, just as any other feature does. To include yearly seasonality, we can add "month" and "quarter" as features with the same intuition. Using the dataset listed above, each observation should include the following features:

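As a sketch of how such indicators can be derived, assuming the observations carry a datetime index as in the temps dataset (the four-row index below is hypothetical), Pandas exposes them directly on the index:

```python
import pandas as pd

# Hypothetical hourly datetime index; extract calendar features as seasonal indicators
idx = pd.date_range("2020-01-01", periods=4, freq="h")
df = pd.DataFrame(index=idx)
df["Hour"] = df.index.hour
df["Month"] = df.index.month
df["Quarter"] = df.index.quarter
print(df["Hour"].tolist())  # [0, 1, 2, 3]
```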
The "Hour", "Month", and "Quarter" variables are the features we include to capture seasonality in the data. Even though they are in numerical form, they are actually categorical variables rather than numerical ones. Note that if we use linear models for prediction, we first need to transform these categorical variables with a transformer such as one-hot encoding. After transformation, the "Hour" variable should look like this:

The five observations of the "Hour" variable are transformed into four columns, each indicating a different hour value. Note that we do not include Hour_4 as a column here, to avoid multicollinearity: if Hour takes only five values, not falling in the first four automatically indicates that the observation takes the fifth value. A simple way to do one-hot encoding with Pandas is the get_dummies() function:
pandas.get_dummies(data, drop_first=True)
The function transforms categorical variables into dummy variables, ready to be put into linear models. These dummy variables are called seasonal indicators.
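A minimal sketch of the transformation (the five-row frame is hypothetical); note that get_dummies with drop_first=True drops the first category rather than the last, which serves the same purpose of avoiding multicollinearity:

```python
import pandas as pd

# Five observations of a categorical "Hour" variable
df = pd.DataFrame({"Hour": [0, 1, 2, 3, 4]})

# drop_first=True drops one category (Hour_0) to avoid multicollinearity
dummies = pd.get_dummies(df["Hour"], prefix="Hour", drop_first=True)
print(list(dummies.columns))  # ['Hour_1', 'Hour_2', 'Hour_3', 'Hour_4']
```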
If your prediction model does not assume linear relations, you do not need one-hot encoding. For example, if you are using tree models, simply put the categorical variables into the model as features. If a categorical variable is in textual form, assign a number to each category before putting it into the model (e.g., change "January" into 1).
2. Fourier Analysis
Any signal can be represented as a linear combination of sines and cosines of varying frequencies fn and amplitudes An and Bn:

y(t) = Σn [ An sin(2π fn t) + Bn cos(2π fn t) ]
The Fourier transform decomposes a signal into a set of frequencies, allowing us to determine the dominant frequencies that make up a time series. Take the temperature data as an example. From the plots above, we know that temperature is roughly sinusoidal, so a reasonable model might be:

y(t) = y0 sin(2π(t − t0)/T)
where y0 and t0 are parameters to be learned, and T is usually one year for seasonal variation. While this function is linear in y0, it is not linear in t0. However, using the angle-subtraction identity for sine, the above is equivalent to:

y(t) = A sin(2πt/T) + B cos(2πt/T), with A = y0 cos(2πt0/T) and B = −y0 sin(2πt0/T)
This function is linear in A and B. Thus, we can create a model containing sinusoidal terms on one or more time scales and fit it to the data using linear regression. The following code constructs yearly, half-yearly, and daily seasonalities as features and uses them in a linear regression model to predict temperature:
from sklearn.linear_model import LinearRegression
import numpy as np

# 'julian' holds the Julian date of each observation (a continuous day count)
df['sin(year)'] = np.sin(df['julian'] / 365.25 * 2 * np.pi)
df['cos(year)'] = np.cos(df['julian'] / 365.25 * 2 * np.pi)
df['sin(6mo)'] = np.sin(df['julian'] / (365.25 / 2) * 2 * np.pi)
df['cos(6mo)'] = np.cos(df['julian'] / (365.25 / 2) * 2 * np.pi)
df['sin(day)'] = np.sin(df.index.hour / 24.0 * 2 * np.pi)
df['cos(day)'] = np.cos(df.index.hour / 24.0 * 2 * np.pi)
features = ['sin(year)', 'cos(year)', 'sin(6mo)', 'cos(6mo)', 'sin(day)', 'cos(day)']
regress = LinearRegression().fit(X=train[features], y=train['temp'])
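The equivalence between the two sinusoidal forms above can be checked numerically; the parameter values below are arbitrary, chosen only for illustration:

```python
import numpy as np

# Hypothetical parameters: amplitude y0, phase offset t0, period T (days)
y0, t0, T = 15.0, 30.0, 365.25
t = np.linspace(0, 2 * T, 1000)

# Original form: linear in y0, nonlinear in t0
original = y0 * np.sin(2 * np.pi * (t - t0) / T)

# Equivalent form: linear in both A and B
A = y0 * np.cos(2 * np.pi * t0 / T)
B = -y0 * np.sin(2 * np.pi * t0 / T)
linear_form = A * np.sin(2 * np.pi * t / T) + B * np.cos(2 * np.pi * t / T)

print(np.allclose(original, linear_form))  # True
```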
The FFT figure above is very useful in determining which seasonalities should be included in the regression. Note that rather than the regular date, we use the Julian date as the t variable: the Julian date is the number of days since the beginning of the Julian period (January 1, 4713 BC), so it is a continuous number.
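Pandas can compute the Julian date directly from a datetime index via DatetimeIndex.to_julian_date(); a quick sketch with a hypothetical hourly index:

```python
import pandas as pd
import numpy as np

# Hypothetical hourly index; to_julian_date() gives a continuous day count
idx = pd.date_range("2020-01-01", periods=48, freq="h")
julian = idx.to_julian_date()

# Consecutive hours differ by exactly 1/24 of a day
print(np.allclose(np.diff(julian), 1 / 24))  # True
```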
3. SARIMA Model
Besides adding seasonal indicators or using Fourier analysis, we can also use the SARIMA model, which adds seasonal components to the ARIMA (Autoregressive Integrated Moving Average) model. This tutorial describes the SARIMA model in detail if you are interested.