Modeling Variable Seasonal Features with the Fourier Transform

Improve time series forecast performance with a technique from signal processing

Florin Andrei
Towards Data Science


Modeling time series data and forecasting it are complex topics. There are many techniques that could be used to improve model performance for a forecasting job. We will discuss here a technique that may improve the way ML forecasting models learn from time features, and generalize from them. The main focus will be the creation of the seasonal features that feed the time series forecasting model in training — there are easy gains to be made here if you include the Fourier transform in the feature creation process. I took inspiration from work I’ve done in the past with digital electronics and signal processing, and applied some of those concepts to feature engineering for time series forecasting.

This article assumes you are familiar with the basic aspects of time series forecasting — we will not discuss this topic in general, but only a refinement of one aspect of it. This is not about time series analysis or modeling. This is not about financial time series (stock prices). This is about machine learning forecasting of general time series. For an introduction to ML time series forecasting, see the Kaggle course on this topic — the technique discussed here builds on top of their lesson on seasonality.

We will also look at some prominent time series modeling approaches, like Facebook Prophet and ARIMA, and borrow from the techniques they use, which can be applied in a custom ML forecasting model.

ETS methods and autoregressive models

ETS methods (error, trend, seasonality) decompose the time series signal into several components:

y(t) = g(t) + s(t) + epsilon
  • y(t) is the signal (time series) you’re trying to predict
  • g(t) is the trend, a function that captures the non-periodic changes
  • s(t) is the seasonality, or the changes with strict periodicity
  • epsilon is the noise that cannot be predicted by the model

By correctly learning g(t) and s(t), the model should be able to make predictions, except for the random variations represented by epsilon (which are not actually part of the model). Facebook Prophet takes a similar approach, while adding other features as well (holidays, exogenous features).
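As a quick illustration (a sketch, not part of this article's workflow), statsmodels can produce such a decomposition directly; series here stands for any regularly sampled pandas Series, e.g. quarterly data:

from statsmodels.tsa.seasonal import seasonal_decompose

# `series` is assumed to be a regularly sampled pandas Series (quarterly here, so period=4)
decomposition = seasonal_decompose(series, model='additive', period=4)
trend, seasonal, residual = decomposition.trend, decomposition.seasonal, decomposition.resid
decomposition.plot()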

Autoregressive models essentially do linear regression, where the features are past data points in the series (the lags of the time series). One example is the AR(p) model, or the autoregressive part of ARIMA (which is essentially just a GLM — generalized linear model with only linear terms):

y(t) = b + phi(1)·y(t-1) + phi(2)·y(t-2) + ... + phi(p)·y(t-p) + epsilon
  • y(t) is the future value of the signal you’re trying to predict
  • b is the bias term
  • y(t-1) is the last known value (lag 1)
  • y(t-2) is the value before that (lag 2), etc.
  • the phi coefficients are the model weights
  • the noise term (epsilon) is not actually part of the AR(p) model, since it cannot be predicted

So the AR(p) model tries to predict future values from multiple lags of the past values.
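As a quick sketch (not the model used later in this article), an AR(p) fit with statsmodels could look like this; series and the lag order are placeholders:

from statsmodels.tsa.ar_model import AutoReg

# `series` is assumed to be a regularly sampled pandas Series; p is a hypothetical lag order
p = 4
ar_results = AutoReg(series, lags=p).fit()
print(ar_results.params)                                    # the bias term b and the phi coefficients
forecast = ar_results.predict(start=len(series), end=len(series) + 7)   # 8 steps ahead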

Various blends of ETS and autoregressive models are possible in machine learning forecasting. Exogenous variables can also be included — variables that are not part of the time series, but have an influence over it (like promotions may influence sales). SARIMAX is an example of a model that uses seasonal components, auto-regressive components, moving average components (of the error term), and exogenous features.

If you’re modeling a general time series (e.g. store sales), where the random walk hypothesis does not dominate, you do not care much about building a rigorous mathematical model, and the main focus is strictly the RMSE performance of the forecast, then you could build a model around a general-purpose machine learning algorithm: it could be some variation on the random forest concept, something as simple as linear regression, an ensemble of various ML models, or even a neural network. You would then engineer all the features you need and provide them to the model in training.

Think of a data frame containing the Xtrain features that will be used to fit the model. The target Ytrain is the time series itself, perhaps transformed in some ways (scaled, log, etc). The Xtrain data frame may have columns including:

  • trend features, representing the broad variations of the time series; these could be: a constant term (the value 1 repeated for each row), a function linear in time (literally the enumeration: 0, 1, 2, 3, …), a quadratic function (0, 1, 4, 9, …), a cubic, or may have other shapes. Prophet uses linear trend with inflection points computed from data.
  • seasonality, representing up-and-down variations that maintain the same shape and happen on a strict time schedule
  • autoregressive features, or the lags of the time series itself (the time series column from the target Ytrain, but shifted 1 or more rows down)
  • exogenous features, such as promotions, holidays, geolocation grouping, etc.

When you run model.fit(Xtrain, Ytrain) the model will then learn from these features, with the target Ytrain being the time series itself. This is a fairly general form of time series forecasting with machine learning models.
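A minimal sketch of this setup might look as follows; the column names and the single lag are illustrative, not a recipe:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# assume `y` is the target time series: a pandas Series with a quarterly PeriodIndex
X_train = pd.DataFrame(index=y.index)
X_train['const'] = 1.0                          # constant trend term
X_train['trend'] = np.arange(len(y))            # linear trend: 0, 1, 2, ...
X_train['lag_1'] = y.shift(1)                   # autoregressive feature (lag 1)
# seasonal features: one column per quarter (one-hot encoded)
X_train = X_train.join(pd.get_dummies(y.index.quarter, prefix='q').set_index(y.index))
X_train = X_train.dropna()                      # drop the first row, lost to the lag

model = LinearRegression(fit_intercept=False)
model.fit(X_train, y.loc[X_train.index])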

Seasonal features in detail

Facebook Prophet represents seasonal features as a Fourier series. In general, a periodic function can be represented as a series of sine/cosine pairs, where the period of each sine/cosine is a submultiple of some base period:

s(t) = Σ (n = 1 ... N) [ a(n)·sin(2πnt/P) + b(n)·cos(2πnt/P) ]
  • t is time
  • P is the base period of a seasonal feature — the period of the sine/cosine pair with the largest period
  • n, the index in the series, is a period demultiplier (frequency multiplier)
  • sine and cosine terms are weighted by the a(n) and b(n) parameters

In Pandas form, here are two sine/cosine pairs; the second pair has twice the frequency of the first:

        sin(1,freq=A-DEC)  cos(1,freq=A-DEC)  sin(2,freq=A-DEC)  cos(2,freq=A-DEC)
time
1956Q1           0.000000           1.000000           0.000000           1.000000
1956Q2           0.999963           0.008583           0.017166          -0.999853
1956Q3           0.017166          -0.999853          -0.034328           0.999411
1956Q4          -0.999963          -0.008583           0.017166          -0.999853
1957Q1           0.000000           1.000000           0.000000           1.000000
...                   ...                ...                ...                ...
2013Q1           0.000000           1.000000           0.000000           1.000000
2013Q2           0.999769           0.021516           0.043022          -0.999074
2013Q3           0.025818          -0.999667          -0.051620           0.998667
2013Q4          -0.999917          -0.012910           0.025818          -0.999667
2014Q1           0.000000           1.000000           0.000000           1.000000

[233 rows x 4 columns]

The Fourier theorem states that (loosely speaking) when representing a well-behaved periodic function this way, by adjusting the weights a(n) and b(n), the series can be made to converge to the function. So a finite Fourier sum might be a good way to approximate periodic signals (seasonal components of time series) — we’re only looking for an approximation, not full convergence, so we use a finite sum.
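As a sketch of how such features can be built by hand (statsmodels offers CalendarFourier for this, shown later in this article), assuming observations indexed 0, 1, 2, ... and a base period measured in observations:

import numpy as np
import pandas as pd

def fourier_features(n_obs, period, order):
    # sine/cosine pairs for a seasonality with base period `period`, measured in observations
    t = np.arange(n_obs)
    features = {}
    for n in range(1, order + 1):
        features[f'sin({n},{period})'] = np.sin(2 * np.pi * n * t / period)
        features[f'cos({n},{period})'] = np.cos(2 * np.pi * n * t / period)
    return pd.DataFrame(features)

# e.g. yearly seasonality on quarterly observations: base period of 4, two pairs
print(fourier_features(n_obs=8, period=4, order=2))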

Another way to model seasonality is via one-hot-encoded variables. If P is the base period of a seasonal feature (measured in observations), you can create P columns in the data frame, each containing either 0 or 1, where each column takes the value 1 exactly once per period and no two columns have a 1 in the same row. A linear model could then learn the seasonality as:

s(t) = Σ (p = 1 ... P) β(p)·Cp(t)
  • Cp is each column that encodes time
  • the beta parameters are the weights applied to each column — their relative values literally model the shape of the wave

In Pandas form, here is a seasonal feature one-hot encoded with a period of 4 observations:

        s(1,4)  s(2,4)  s(3,4)  s(4,4)
time
1956Q1     1.0     0.0     0.0     0.0
1956Q2     0.0     1.0     0.0     0.0
1956Q3     0.0     0.0     1.0     0.0
1956Q4     0.0     0.0     0.0     1.0
1957Q1     1.0     0.0     0.0     0.0
...        ...     ...     ...     ...
2013Q1     1.0     0.0     0.0     0.0
2013Q2     0.0     1.0     0.0     0.0
2013Q3     0.0     0.0     1.0     0.0
2013Q4     0.0     0.0     0.0     1.0
2014Q1     1.0     0.0     0.0     0.0

[233 rows x 4 columns]

If the model you use is pure linear regression (no penalty), and there is a bias term, one-hot encoding may run into the so-called dummy variable trap. You may also hear the term collinearity problem. In a nutshell, the Cp columns sum to 1 in every row, so a linear combination of them reproduces the constant column used for the bias term, and the features become perfectly collinear.

If that’s the case, dummy encoding (K-1 encoding) is the solution — it is literally the same thing as one-hot encoding, but with P-1 columns instead (drop one column):

s(t) = Σ (p = 1 ... P-1) β(p)·Cp(t)

You could also try to remove the bias term, or use a regularized model, or use gradient descent. In practice, test the model’s performance and make the right decision.
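For reference, here is a sketch of building these encodings directly with pandas (the rest of the article uses statsmodels' DeterministicProcess for the same purpose); the data frame df with a quarterly PeriodIndex is an assumption here:

import pandas as pd

# one column per quarter, 0/1 values
onehot = pd.get_dummies(df.index.quarter, prefix='s').set_index(df.index).astype(float)
# dummy (K-1) encoding: drop one column to avoid the dummy variable trap
dummy = onehot.iloc[:, 1:]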

Fourier series vs one-hot/dummy encoding

One-hot/dummy encoding

Let’s say you create a group of one-hot encoded columns, with P columns in the group. This group will be able to model any seasonality with a period of P. It will also be able to model seasonalities with submultiple periods of P, such as P/2, etc, so you do not need to explicitly model them.

This succession (P, P/2, P/4, etc.) is simply another way of describing Fourier series, which model any complex periodic signal as a sum of a base component and all its frequency-multiples / period-submultiples. This is key to understanding why the one-hot/dummy encoded time features perform the same task as Fourier series time features. They are different implementations of the same basic idea.

The lower limit for the periods modeled by one-hot/dummy time features is twice the sampling period of your time series: if the time series has daily observations, the shortest period modeled by your time dummies will be 2 days. This is literally the Nyquist-Shannon theorem stated in time series terms.

What one-hot/dummy encoding really does is model the waveform of the period-P component, like a scatterplot with P points. In doing so, because of how the Fourier transform works, it also captures all the P-submultiple components (P/2, etc).

One-hot encoded time features can learn arbitrarily complex waveforms very well. On the other hand, if P is large then the number of one-hot encoded columns is also large, and then you run into the curse of dimensionality problem. It’s best to use one-hot/dummy encoding for short periods of time, but this is not a strict rule — if you don’t have many features to begin with, then a large group of time dummies will be fine.

Fourier series encoding

Fourier series are concise and can express arbitrarily large periods P — they are well-suited for large-period seasonality. On the other hand, if the waveform is very complex, it may not be learned well without creating many sine/cosine pairs.

The clue that you should probably use Fourier series is in the periodogram (described below). If you notice your series has a strong, sharp seasonal component with a period of P, along with a strong, sharp P/2 component, and not much else in terms of a clear pattern, that indicates that two sine/cosine pairs, one with a period of P, the other with a period of P/2, may work well.

Keep in mind that the P/2 component and the P component model the same phenomenon: it’s not two different phenomena, but the same thing, with a base period of P. The P/2, P/4, etc components are just its Fourier harmonics scattered across the spectrum. This is simply because the base P seasonality does not have a perfect sine shape — if it did, then only the P component would show in the periodogram.

Never create Fourier components with a period shorter than twice the sampling period of your time series. If your time series has a daily sampling period, then the shortest seasonality you will ever be able to model is 2 days. The Nyquist-Shannon theorem places a hard limit there, like a brick wall. That’s all the data you have.

Perhaps even leave some safety margin to the Nyquist limit. The music you listen to in audio CD format has signal frequencies up to 20 kHz. The sampling frequency of audio CD is 44.1 kHz, so the Nyquist limit is 22.05 kHz, a little higher than the maximum recorded frequency.

Additive vs. multiplicative seasonality

Consider the Quarterly Australian Portland Cement production dataset, showing the total quarterly production of Portland cement in Australia (in millions of tonnes) from 1956:Q1 to 2014:Q1.

import pandas as pd

df = pd.read_csv('Quarterly_Australian_Portland_Cement_production_776_10.csv', usecols=['time', 'value'])
# convert time from year float (e.g. 1956.25) to a proper datetime format
df['time'] = df['time'].apply(lambda x: str(int(x)) + '-' + str(int(1 + 12 * (x % 1))).rjust(2, '0'))
df['time'] = pd.to_datetime(df['time'])
# switch to a quarterly PeriodIndex
df = df.set_index('time').to_period()
df.rename(columns={'value': 'production'}, inplace=True)
        production
time
1956Q1       0.465
1956Q2       0.532
1956Q3       0.561
1956Q4       0.570
1957Q1       0.529
...            ...
2013Q1       2.049
2013Q2       2.528
2013Q3       2.637
2013Q4       2.565
2014Q1       2.229

[233 rows x 1 columns]
cement production

This is time series data with a trend, seasonal components, and other attributes. The observations are made quarterly, spanning an interval of several decades.

The trend g(t) is nearly linear, with a few inflection points. The seasonal component s(t) has a simple waveform, repeated over and over to the end.

A linear model trying to represent this dataset only from trend and seasonality may combine these two groups of features additively (we ignore all other components in the model). This is called additive seasonality:

y(t) = g(t) + s(t)

To anticipate a result obtained later, here is the output of combining a quadratic trend (which happens to look nearly linear here) with an additive seasonality model, applied to this dataset:

trend with additive seasonality

The problem with additive seasonality here is obvious: the model has learned a fixed-amplitude seasonality, and it’s simply adding it to the trend. The model generates a pattern of waves with fixed amplitude, dictated by the weights of the time features.

One improvement is multiplicative seasonality. This is an option in Facebook Prophet. The assumption is that the amplitude of the seasonality is proportional to the trend. No doubt this is close to true for some datasets. The multiplicative seasonality formula is:

y(t) = g(t) · (1 + s(t))

That would model seasonality well, as long as the seasonal amplitude is indeed proportional to the overall trend. In that case, the trend itself models the envelope of the seasonal component. We will not demo that case here, but we will show later that this is not a great fit either.

But what if all that is not the case? Can we do better than simple addition or multiplication, when neither is a great fit? Can we have a model that learns a more general variation of the seasonal features, one that’s not constant, and is not bound by strict proportionality to the trend?

Fourier analysis

Let’s take a look at the seasonal components in the Australian Portland cement dataset, using the periodogram plot. This involves using the periodogram() function from scipy.signal (all code is included in the companion notebook, linked at the end).

What scipy.signal.periodogram() does is look at a periodic signal and estimate the strength (power density) of its frequency components: loosely, how large the weights in a Fourier-series-like representation of the signal would have to be (a representation somewhat more complex than the series formula shown above). These strengths can then be visualized in a plot.
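A minimal sketch of such a plot (the companion notebook's exact plotting code may differ); with fs=1.0 the frequencies are expressed in cycles per quarter, so their reciprocals are periods in quarters:

import matplotlib.pyplot as plt
from scipy.signal import periodogram

# `df` is the quarterly cement production data frame loaded above
frequencies, power = periodogram(df['production'], fs=1.0, detrend='linear')
periods = 1.0 / frequencies[1:]        # skip the zero-frequency bin
plt.semilogy(periods, power[1:])
plt.xlabel('period (quarters)')
plt.ylabel('power spectral density')
plt.show()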

This is the result of plotting the output of scipy.signal.periodogram() applied to the dataset:

periodogram

The periodogram shows the power density of spectral components (seasonal components). The strongest seasonal component in the dataset is the one with a period equal to 4 quarters, or 1 year. This confirms the visual impression that the strongest up-and-down variations in the graph happen about 10 times per decade. There is also a component with a period of 2 quarters — that’s the same seasonal phenomenon, and it simply means the 4-quarter periodicity is not a simple sine wave, but has a more complex shape. There are other components, too, with periods of 10 or more quarters, but we will ignore them.

The Fourier spectrogram

The periodogram will highlight all spectral components in the signal (all seasonal components in the data) and will provide an overview of their “strength”, but that strength is aggregated over the whole time interval. It says nothing about how the “strength” of each seasonal component may vary in time across the dataset.

To capture that variation, you have to use the Fourier spectrogram instead. It’s like the periodogram, but computed repeatedly over many time windows across the entire dataset. The spectrogram is also available as a function in the scipy library (scipy.signal.spectrogram).
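A sketch of the underlying idea with scipy.signal.spectrogram; the notebook's compute_spectrum() and plot_spectrogram() helpers presumably wrap something similar, and the window sizes below are arbitrary choices:

import numpy as np
from scipy.signal import spectrogram

# `df` is the cement production data frame; fs=1.0 means frequencies are per quarter
frequencies, times, Sxx = spectrogram(
    df['production'].to_numpy(), fs=1.0, nperseg=40, noverlap=36, detrend='linear'
)
# pick the row closest to the 4-quarter component (0.25 cycles per quarter)
idx = np.argmin(np.abs(frequencies - 0.25))
envelope_4q = np.sqrt(Sxx[idx, :])     # rough amplitude of the 4-quarter component over time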

Let’s plot the spectrogram for the seasonal components with periods of 2 and 4 quarters, mentioned above. As always, the full code is in the companion notebook linked at the end.

spectrum = compute_spectrum(df['production'], 4, 0.1)
plot_spectrogram(spectrum, figsize_x=10)
spectrogram

What this diagram shows is the “strength” of the 2-quarter and 4-quarter components over time, the maximum amplitude of their variation at different moments in time (in sound synthesis for digital music instruments this is called the envelope). They are pretty weak initially, but become very strong around 2010, which matches the variations you can see in the data set plot at the top of this article.

Beyond multiplicative seasonality

The envelope, as we will see below, will model the amplitude of seasonal components in a way that’s definitely better than simply adding a fixed-amplitude seasonality to the trend. It can also be more general than simply tying the amplitude of seasonality to the overall trend (like with multiplicative seasonality) — it fully decouples seasonality from trend.

In other words, with the envelope you can keep using the basic additive seasonality model, but the seasonal component itself has changed. It is no longer a linear combination of fixed-amplitude components. Instead, it follows the trends (the envelopes) of the various seasonal components detected in the signal. The overall model is additive w.r.t. seasonality, but seasonality is multiplied by the various envelopes.

Let’s say you take the envelopes F of the seasonal components from the Fourier spectrogram, and you smooth them — the smoothed versions are denoted by F-tilde. Then one-hot encoding, dummy encoding, and Fourier encoding for the seasonal component become:

s(t) = F̃(t) · Σ (p = 1 ... P) β(p)·Cp(t)    (one-hot encoding with the component envelope)
s(t) = F̃(t) · Σ (p = 1 ... P-1) β(p)·Cp(t)    (dummy encoding with the component envelope)
s(t) = Σ (n = 1 ... N) F̃n(t) · [ a(n)·sin(2πnt/P) + b(n)·cos(2πnt/P) ]    (Fourier series with component envelopes)

The beta(p), a(n) and b(n) will be the weights that your model will learn while fitting the time features. The F-tilde coefficients (which themselves are time series) are the component envelopes, and they will be multiplied into the time features before model training, as shown below.

For one-hot and dummy encoding, the formulae shown above use a single envelope — that of the base component with period P. For the Fourier sum, the formula suggests extracting a separate envelope for each component. There may be deviations from this pattern in practice, but that’s a separate discussion.

If the overall model that learns from all the features is linear, then you can still say that you’re using a form of “additive seasonality” (adjusted with the envelope):

y(t) = g(t) + F̃(t)·s(t)

If the overall model is a random forest, or some other non-linear model, then the simple formula shown above does not apply.

Seasonal features in code

Let’s create a few sample data frames containing seasonal features that could be used to model seasonality in a dataset, using methods from statsmodels.tsa.deterministic.

One-hot encoded features, one for each quarter, with a period of 1 year:

from statsmodels.tsa.deterministic import DeterministicProcess

seasonal_year = DeterministicProcess(index=df.index, constant=False, seasonal=True).in_sample()
print(seasonal_year)
        s(1,4)  s(2,4)  s(3,4)  s(4,4)
time
1956Q1     1.0     0.0     0.0     0.0
1956Q2     0.0     1.0     0.0     0.0
1956Q3     0.0     0.0     1.0     0.0
1956Q4     0.0     0.0     0.0     1.0
1957Q1     1.0     0.0     0.0     0.0
...        ...     ...     ...     ...
2013Q1     1.0     0.0     0.0     0.0
2013Q2     0.0     1.0     0.0     0.0
2013Q3     0.0     0.0     1.0     0.0
2013Q4     0.0     0.0     0.0     1.0
2014Q1     1.0     0.0     0.0     0.0

[233 rows x 4 columns]

If you want to use dummy encoding instead, simply drop one column. Also explore the drop=True option in DeterministicProcess(), which will check for perfect collinearity and will try to make the right decision (drop / no drop).
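For example (a sketch, using the same index as above):

seasonal_year_checked = DeterministicProcess(index=df.index, constant=False, seasonal=True, drop=True).in_sample()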

Yearly sine-cosine pairs of features:

from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

cfr = CalendarFourier(freq='Y', order=2)  # 2 sine/cosine pairs with a base period of 1 year
seasonal_year_trig = DeterministicProcess(index=df.index, seasonal=False, additional_terms=[cfr]).in_sample()
with pd.option_context('display.max_columns', None, 'display.expand_frame_repr', False):
    print(seasonal_year_trig)
        sin(1,freq=A-DEC)  cos(1,freq=A-DEC)  sin(2,freq=A-DEC)  cos(2,freq=A-DEC)
time
1956Q1           0.000000           1.000000           0.000000           1.000000
1956Q2           0.999963           0.008583           0.017166          -0.999853
1956Q3           0.017166          -0.999853          -0.034328           0.999411
1956Q4          -0.999963          -0.008583           0.017166          -0.999853
1957Q1           0.000000           1.000000           0.000000           1.000000
...                   ...                ...                ...                ...
2013Q1           0.000000           1.000000           0.000000           1.000000
2013Q2           0.999769           0.021516           0.043022          -0.999074
2013Q3           0.025818          -0.999667          -0.051620           0.998667
2013Q4          -0.999917          -0.012910           0.025818          -0.999667
2014Q1           0.000000           1.000000           0.000000           1.000000

[233 rows x 4 columns]

For the rest of this article, we will use one-hot encoding, despite the fact that the model we fit is LinearRegression(). For this dataset, DeterministicProcess(drop=True) checks the generated terms for perfect collinearity and does not deem it necessary to drop a column (it does not switch to dummy encoding). Of course, feel free to compare one-hot encoding with dummy encoding for your dataset and model.

Let’s see what happens when we fit a linear model on these seasonal features.

Fit a linear model

Let’s create trend features (the columns called const, trend, and trend_squared) and then join them with the seasonal_year features generated above:

trend_order = 2
# quadratic trend features: const, trend, trend_squared
trend_year = DeterministicProcess(index=df.index, constant=True, order=trend_order).in_sample()
X = trend_year.copy()
X = X.join(seasonal_year)
print(X)
        const  trend  trend_squared  s(1,4)  s(2,4)  s(3,4)  s(4,4)
time
1956Q1    1.0    1.0            1.0     1.0     0.0     0.0     0.0
1956Q2    1.0    2.0            4.0     0.0     1.0     0.0     0.0
1956Q3    1.0    3.0            9.0     0.0     0.0     1.0     0.0
1956Q4    1.0    4.0           16.0     0.0     0.0     0.0     1.0
1957Q1    1.0    5.0           25.0     1.0     0.0     0.0     0.0
...       ...    ...            ...     ...     ...     ...     ...
2013Q1    1.0  229.0        52441.0     1.0     0.0     0.0     0.0
2013Q2    1.0  230.0        52900.0     0.0     1.0     0.0     0.0
2013Q3    1.0  231.0        53361.0     0.0     0.0     1.0     0.0
2013Q4    1.0  232.0        53824.0     0.0     0.0     0.0     1.0
2014Q1    1.0  233.0        54289.0     1.0     0.0     0.0     0.0

[233 rows x 7 columns]

This is the X data frame (features) that will be used to train/validate the model. We’re modeling the data with quadratic trend features, plus the 4 time features needed for the yearly seasonality. The y data frame (target) will be just the cement production numbers.

Let’s carve a validation set out of the data, containing the year 2010 observations. We will fit a linear model on the X features shown above and the y target represented by cement production (the train portion of the dataset, what remains after carving out the validation data), and then we will evaluate model performance in validation.

We will also do all of the above with another, trend-only model that fits only the trend features and ignores seasonality, just to show what the overall trend is.
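The index_train and index_test variables used below are defined in the companion notebook. One plausible way to build them (the exact cutoff used in the notebook may differ) would be:

# hypothetical split: observations from 2010 onward go to validation
index_train = df.index[df.index.year < 2010]
index_test = df.index[df.index.year >= 2010]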

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error


def do_forecast(X, index_train, index_test, trend_order):
    X_train = X.loc[index_train]
    X_test = X.loc[index_test]

    y_train = df['production'].loc[index_train]
    y_test = df['production'].loc[index_test]

    # full model: trend features plus seasonal features
    model = LinearRegression(fit_intercept=False)
    _ = model.fit(X_train, y_train)
    y_fore = pd.Series(model.predict(X_test), index=index_test)
    y_past = pd.Series(model.predict(X_train), index=index_train)

    # trend-only model: fit only the const / trend / trend_squared columns
    trend_columns = X_train.columns.to_list()[0 : trend_order + 1]
    model_trend = LinearRegression(fit_intercept=False)
    _ = model_trend.fit(X_train[trend_columns], y_train)
    y_trend_fore = pd.Series(model_trend.predict(X_test[trend_columns]), index=index_test)
    y_trend_past = pd.Series(model_trend.predict(X_train[trend_columns]), index=index_train)

    RMSLE = mean_squared_log_error(y_test, y_fore, squared=False)
    print(f'RMSLE: {RMSLE}')

    # plot_params is defined in the companion notebook
    ax = df.plot(**plot_params, title='AUS Cement Production - Forecast')
    ax = y_past.plot(color='C0', label='Backcast')
    ax = y_fore.plot(color='C3', label='Forecast')
    ax = y_trend_past.plot(ax=ax, color='C0', linewidth=3, alpha=0.333, label='Trend Past')
    ax = y_trend_fore.plot(ax=ax, color='C3', linewidth=3, alpha=0.333, label='Trend Future')
    _ = ax.legend()


do_forecast(X, index_train, index_test, trend_order)
RMSLE: 0.03846449744356434
model validation

We’ve seen this before. Blue is train, red is validation. The full model is the sharp, thin line. The trend-only model is the wide, fuzzy line.

This is not bad for a trivial model, but there is one glaring issue: the model has learned a constant-amplitude yearly seasonality. Even though the yearly variation actually increases in time, the model can only stick to fixed-amplitude variations. Obviously, this is because we gave the model only fixed-amplitude seasonal features, and there are no other features (target lags, etc) to help it overcome this issue (this is by design, to highlight the importance of the technique described here). Additive seasonality with a fixed amplitude does not work well here.

Perhaps information about the amplitude of seasonal components obtained via Fourier analysis (envelopes) would improve performance?

Adjust the seasonal components

Let’s pick the 4-quarter seasonal component. As seen in the spectrogram above, its envelope is very noisy. To smooth it, we will fit a linear model (let’s call it the envelope model) on an order=2 trend of the envelope, using only the train data with model.fit(), and then extend that trend into the validation / testing range with model.predict():

# quadratic trend features used to smooth the envelope
envelope_features = DeterministicProcess(index=X.index, constant=True, order=2).in_sample()

# envelope of the 4-quarter component, computed on the train data only
# (compute_spectrum is defined in the companion notebook)
spec4_train = compute_spectrum(df['production'].loc[index_train], max_period=4)
print(spec4_train)

# the envelope model: fit a quadratic trend to the noisy envelope, then extend it over the whole index
spec4_model = LinearRegression()
spec4_model.fit(envelope_features.loc[spec4_train.index], spec4_train['4.0'])
spec4_regress = pd.Series(spec4_model.predict(envelope_features), index=X.index)

ax = spec4_train['4.0'].plot(label='component envelope', color='gray')
spec4_regress.loc[spec4_train.index].plot(ax=ax, color='C0', label='envelope regression: past')
spec4_regress.loc[index_test].plot(ax=ax, color='C3', label='envelope regression: future')
_ = ax.legend()
envelope fit

The gray line is the raw envelope: the strength of the variation of the 4-quarter seasonal component over the train data, and it is very noisy. The blue line is that envelope fitted (smoothed) as a quadratic trend (order=2). The red line is the same trend, extended (predicted) over the validation data. This is smoothing via regression.

You may notice that, while the general trend was modeled as being nearly linear by a quadratic model, this envelope is clearly non-linear. Multiplicative seasonality will likely not be the best fit for this data.

We have modeled the variation in time of the 4-quarter seasonal component. Let’s take the output from the envelope model (F-tilde in the envelope-adjusted Fourier series formula) and multiply it by the time features corresponding to the 4-quarter seasonal component:

# normalize the envelope so it changes the shape, not the average level, of the seasonal features
spec4_regress = spec4_regress / spec4_regress.mean()

season_columns = ['s(1,4)', 's(2,4)', 's(3,4)', 's(4,4)']
for c in season_columns:
    X[c] = X[c] * spec4_regress
print(X)
        const  trend  trend_squared    s(1,4)    s(2,4)    s(3,4)    s(4,4)
time
1956Q1    1.0    1.0            1.0  0.179989  0.000000  0.000000  0.000000
1956Q2    1.0    2.0            4.0  0.000000  0.181109  0.000000  0.000000
1956Q3    1.0    3.0            9.0  0.000000  0.000000  0.182306  0.000000
1956Q4    1.0    4.0           16.0  0.000000  0.000000  0.000000  0.183581
1957Q1    1.0    5.0           25.0  0.184932  0.000000  0.000000  0.000000
...       ...    ...            ...       ...       ...       ...       ...
2013Q1    1.0  229.0        52441.0  2.434701  0.000000  0.000000  0.000000
2013Q2    1.0  230.0        52900.0  0.000000  2.453436  0.000000  0.000000
2013Q3    1.0  231.0        53361.0  0.000000  0.000000  2.472249  0.000000
2013Q4    1.0  232.0        53824.0  0.000000  0.000000  0.000000  2.491139
2014Q1    1.0  233.0        54289.0  2.510106  0.000000  0.000000  0.000000

[233 rows x 7 columns]

The four time features are not either 0 or 1 anymore. They’ve been multiplied by the component envelope (F-tilde in the formula), and now their amplitude varies in time, just like the envelope.

Retrain the model

Let’s train the main model again, now using the modified time features.

do_forecast(X, index_train, index_test, trend_order)
RMSLE: 0.02546321729737165
model validation, adjusted time dummies

We’re not aiming for a perfect fit here, since this is just a toy model built to demo a technique, but the model clearly performs better: it follows the 4-quarter variations in the target much more closely. The performance metric has improved as well: the RMSLE dropped from about 0.038 to 0.025 (the previous model’s error was roughly 50% larger), which is not bad at all.

Clearly the overall trend is not modeled well. Higher order trends, or adding inflection points a la Facebook Prophet, may work better. Feel free to try it.

Even further improvements are definitely possible (e.g. an autoregressive model with plenty of time lags would fit much better) but that’s not the topic of this article — we’re only focusing here on one aspect of seasonality.

Final comments

Improving forecasting performance is a vast topic. The behavior of any model does not depend on a single feature, or on a single technique. However, if you’re looking to extract all you can out of a given model, you should probably feed it meaningful features. One-hot encoding or dummy encoding, or sine/cosine Fourier pairs, are more meaningful when they reflect the variation in time of the seasonality they are modeling.

Adjusting the seasonal feature’s envelope via the Fourier transform is particularly effective for linear models. Boosted trees do not benefit as much, but you can still see improvements almost no matter what model you use. If you’re using ensemble models, you may have a simpler model such as linear regression at the bottom of the stack, and you probably want to improve its performance before other models learn from its residuals (assuming the ensemble is configured that way).

It is also important that the envelope models you use to adjust seasonal features are trained only on the train data, and do not see any testing data during training, just like any other model. Once you adjust the envelope, the time features contain information from the target: they are no longer purely deterministic features that can be computed ahead of time over any arbitrary forecast horizon. At this point, information could leak from validation/testing back into the training data via the time features if you’re not careful (purely deterministic time features do not suffer from this issue).

Finally, the envelopes themselves are time series which might be subject to forecasting, since they model a real-world process or phenomenon. That’s a separate discussion.

Links

The data set used in this article is available here under the Public Domain (CC0) license:

The code used in this article can be found here:

A notebook submitted to the Store Sales — Time Series Forecasting competition on Kaggle, using ideas described in this article:

A GitHub repo containing the development version of the notebook submitted to Kaggle is here:

The Facebook Prophet paper — Forecasting at Scale:

https://peerj.com/preprints/3190.pdf

All images and code used in this article are created by the author.
