In predictive modeling, results are only as good as the inputs fed to the model. In time series, the principal issue in selecting model inputs is how best to represent a series’ history in order to predict its future. We can break a series down into its fundamental components: trend, seasonality, and residual, but with complex datasets, this can be difficult to get exactly right. Take the example of the sunspots dataset, available on Kaggle with a Public Domain license. It measures monthly observed sunspots since the 1700s and, when decomposed linearly into the aforementioned components, looks like this:

The top graph shows the original series, which is fine. The second graph down, however, shows its "trend," but it does not look like a trend I have ever seen or one that can easily be modeled. The third graph is even worse, displaying "seasonality" in a way that is almost impossible to interpret. Granted, this is a big series and a more complex decomposition may have been more appropriate, but the point remains: it is not easy to decide which factors to incorporate when attempting to predict this series.
One option is to add anything we think might be relevant, then, using L1 regularization or a technique similar to recursive feature elimination, narrow the variables down to only the most important subset. That is what is done in this example, which you can find on GitHub or Read the Docs. We will be using the scalecast package to do the time series analysis. Please give it a star on GitHub if you find it interesting. The ELI5 package is also leveraged for feature importance scoring and scikit-learn for modeling.
Preparation and Adding Variables
First, install the scalecast package:
pip install scalecast
Then, we create the Forecaster object, which is a model wrapper and results container:
import pandas as pd
from scalecast.Forecaster import Forecaster

df = pd.read_csv(
    'Sunspots.csv',
    index_col=0,
    names=['Date','Target'],
    header=0,
)
f = Forecaster(y=df['Target'],current_dates=df['Date'])
Autoregressive Terms
In time series, the most effective variables are usually the series’ own past values (lags). To evaluate how far in the series’ history we can likely find statistically significant lags, Autocorrelation Function and Partial Autocorrelation Function plots are often employed.
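For reference, similar plots can be produced directly with statsmodels; this is a minimal sketch, assuming the data loaded above and an arbitrary choice of 240 lags (20 years):
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# ACF and PACF of the raw series; 240 lags = 20 years of monthly observations
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6))
plot_acf(df['Target'], ax=ax1, lags=240)
plot_pacf(df['Target'], ax=ax2, lags=240)
plt.show()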

These are interesting plots that also reveal the cyclical nature of the dataset. They suggest that statistically significant lags may exist 15 or more years into the past. Let’s add the first 120 lags (10 years’ worth), as well as lags spaced one year apart going back up to 20 years.
f.add_ar_terms(120) # lags 1-120
f.add_AR_terms((20,12)) # yearly lags up to 20 years
Seasonal Cycles
Typically, seasonalities for monthly spaced data would include monthly and quarterly fluctuations. However, in this example, we are trying to predict a natural phenomenon that does not conform to the Gregorian calendar overlaid on our daily lives. The cycles this series exhibits are highly irregular and inconsistent in length. According to Wikipedia, "Solar cycles last typically about eleven years, varying from just under 10 to just over 12 years."
All that granted, we can still add monthly and quarterly seasonality, as well as cycles in one-year increments up to 288 months (24 years). This is certainly overkill, but that’s why we will use variable reduction techniques to separate the added features that are probably noise from those that carry signal. All these seasonal terms use Fourier transformations.
import numpy as np

f.add_seasonal_regressors('month','quarter',raw=False,sincos=True)
# 12-month cycles from 12 to 288 months
for i in np.arange(12,289,12):
    f.add_cycle(i)
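Conceptually, each added cycle is represented by a sine/cosine (Fourier) pair over the observation index. Here is a rough, illustrative sketch of what the terms for a single 132-month (roughly 11-year) cycle look like; this is not scalecast’s internal code:
t = np.arange(len(df))                          # observation index
cycle_length = 132                              # an ~11-year cycle, for illustration only
sin_term = np.sin(2 * np.pi * t / cycle_length)
cos_term = np.cos(2 * np.pi * t / cycle_length)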
Trends
For trends, we can keep it simple. We add the year variable, as well as a variable called "t" that will count from 1 through the length of the series for each observation, otherwise known as a time trend.
f.add_time_trend()
f.add_seasonal_regressors('year')
After adding all these variables, we are left with 184 predictors. The size of the dataset is 3,265. Sometimes, a good number of variables to aim for is the square root of the dataset’s size (around 57 in this case), but there is no hard-and-fast rule on this. Either way, it’s not likely that all the features we added will help with prediction (and it’s hard to know exactly which ones will and will not). All else equal, more parsimonious models (models with fewer inputs) are preferable.
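For reference, the square-root heuristic works out to about 57 variables here:
n_obs = len(df)               # 3,265 observations
print(int(n_obs ** 0.5))      # 57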
Eliminate Variables
Eliminate using L1 Regularization
The first technique we explore is a simple and common one: L1 regularization with a lasso regression model. L1 regularization adds a penalty, scaled by alpha, on the coefficients of a linear model, such that its cost function can be written as:
$$\frac{1}{2M}\sum_{i=1}^{M}\Big(y_i - \sum_{j=1}^{p} w_j x_{ij}\Big)^2 + \alpha \sum_{j=1}^{p} \lvert w_j \rvert$$
Where x is a variable, p is the total number of variables, w is a given variable’s coefficient, and M is the total number of observations. In layman’s terms, each coefficient is shrunk in proportion to alpha, which is set to whatever value best optimizes the model’s predictions. Under this framework, the least important variables in a linear model are regularized out and their coefficients receive a value of 0. We run this model in our analysis and examine which variables receive a coefficient of 0; the variables that retain non-zero coefficients become our reduced subset. The alpha parameter is set with a grid search on a validation slice of data, and all inputs are scaled by default, which is necessary when employing this technique.
f.reduce_Xvars(
    method = 'l1',
    overwrite = False,
    dynamic_testing = False,
)
lasso_reduced_vars = f.reduced_Xvars[:]
In our dataset, this reduces the variables from 184 to 34. That is quite a dramatic drop!
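For intuition, the selection happening under the hood is equivalent in spirit to fitting a lasso on scaled inputs and keeping the variables with non-zero coefficients. Here is a minimal standalone sketch with scikit-learn; X and y are hypothetical arrays standing in for the regressors and target, not objects created above:
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
# hypothetical regressor matrix X (n_obs x n_vars) and target y
X = np.random.rand(300, 20)
y = np.random.rand(300)
X_scaled = StandardScaler().fit_transform(X)   # scaling is required for L1-based selection
lasso = Lasso(alpha=0.1).fit(X_scaled, y)      # alpha would normally be tuned on validation data
kept = [i for i, coef in enumerate(lasso.coef_) if coef != 0]  # indices of surviving variables
print(kept)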
Eliminate using Permutation Feature Importance
The next elimination technique is more complex and involves permutation feature importance (PFI) from ELI5. The idea behind PFI is to train and score any scikit-learn estimator on out-of-sample data but, with each scoring iteration, scramble a single feature’s values so that they are random. This is repeated several times for each input in the model. The average change in accuracy as each input is randomized is then measured, and the features whose randomization most reduces the model’s accuracy are ranked highest in feature importance. The technique is agnostic to the particular algorithm fed to it, which makes it generalizable to many model classes. Its drawbacks include that it tests each variable one by one, so it misses interactions between variables when evaluating features. For the same reason, two variables that are highly correlated with one another can each receive low importance scores, even though both are important to the model. This happens especially often with time series, which is why scalecast makes adjustments to the raw scores before deciding which feature to drop next. See the documentation.
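As a standalone illustration of PFI scoring (independent of scalecast), ELI5 can be used roughly like this; the data and the choice of a gradient boosted tree model here are hypothetical placeholders:
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from eli5.sklearn import PermutationImportance
# hypothetical data standing in for the regressor matrix and target
rng = np.random.default_rng(20)
X_train, y_train = rng.random((300, 10)), rng.random(300)
X_valid, y_valid = rng.random((100, 10)), rng.random(100)
model = GradientBoostingRegressor().fit(X_train, y_train)
# shuffle each column several times and measure the average drop in out-of-sample score
perm = PermutationImportance(model, n_iter=10, random_state=20).fit(X_valid, y_valid)
print(perm.feature_importances_)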
Incorporating PFI for feature elimination, we use the following process:
1. Train a given model class with the full set of variables and tune its hyperparameters using a grid search on an out-of-sample validation slice of data
2. Retrain the model with the optimal hyperparameters from step 1 and monitor the error on the validation slice of data
3. Score the features using PFI
4. Drop the lowest-scoring feature
5. Repeat from step 2 until all but a certain number of variables have been dropped (a simplified sketch of this loop appears after the list)
6. Choose the reduced variable set that returned the lowest error on the validation data
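For intuition, here is a rough, simplified sketch of that loop using ELI5 directly, continuing from the synthetic data in the previous snippet; hyperparameter tuning and scalecast’s score adjustments are omitted, so this is not the exact routine scalecast runs:
from sklearn.metrics import mean_squared_error
features = list(range(X_train.shape[1]))                 # start with every column index
errors = {}
while len(features) > 5:                                 # stop once only a handful remain
    model = GradientBoostingRegressor().fit(X_train[:, features], y_train)
    preds = model.predict(X_valid[:, features])
    errors[tuple(features)] = mean_squared_error(y_valid, preds)
    perm = PermutationImportance(model, n_iter=10, random_state=20).fit(X_valid[:, features], y_valid)
    worst = int(np.argmin(perm.feature_importances_))    # position of the lowest-scoring feature
    features.pop(worst)                                  # drop it and repeat
best_subset = min(errors, key=errors.get)                # subset with the lowest validation error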

To speed up evaluation (since this is an iterative and computationally expensive technique), we can use a non-dynamic testing process, which means error metrics are equivalent to averages of one-step forecasts, and stop dropping once only about the square root of the dataset’s size (57 variables) remains. In the code, it looks like this:
f.reduce_Xvars(
    method = 'pfi',
    estimator = 'mlr',
    keep_at_least = 'sqrt',
    grid_search = True,
    dynamic_testing = False,
    overwrite = False,
)
Again, due to the flexibility of the ELI5 integration, any scikit-learn estimator can be used in this process. The above code uses multiple linear regression, but we also explore k-nearest neighbors, gradient boosted trees, and a support vector machine. Many other models are available if we choose to use them. We can plot the results by examining the out-of-sample error obtained after each variable is dropped:

For all of these estimators, the error decreased initially, then increased once too many variables had been dropped, except for the SVR, whose error kept decreasing until the end. Overall, the GBT and SVR estimators performed best, with the GBT slightly better than the SVR. The optimal number of variables each one found is listed below:
- the mlr model chose 146 variables
- the knn model chose 98 variables
- the gbt model chose 136 variables
- the svr model chose 61 variables
Validate Selection
How do these variable subsets generalize?
Since the GBT model class performed best overall on the validation data, we retrain that model class using each of the variable subsets found from the various techniques explored. From those results, we evaluate which we believe to be the best reduced variable set, based on performance on a separate out-of-sample dataset: the test set. This time, we dynamically test the models to get a better idea of how well they perform over a full out-of-sample period (throughout this example, we have used 120 observations in both the validation and test sets). A rough sketch of this retraining step follows, and the results appear after it:
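A rough sketch of how that retraining could look in scalecast; apart from lasso_reduced_vars, the subset names are hypothetical stand-ins for the lists produced by each PFI run, and tuned hyperparameters are omitted:
f.set_estimator('gbt')
# map each technique to its reduced variable list (PFI-run names are hypothetical)
subsets = {
    'lasso': lasso_reduced_vars,
    'mlr_pfi': mlr_reduced_vars,
    'knn_pfi': knn_reduced_vars,
    'gbt_pfi': gbt_reduced_vars,
    'svr_pfi': svr_reduced_vars,
}
for label, xvars in subsets.items():
    # dynamic testing this time, so test-set metrics reflect a full out-of-sample horizon
    f.manual_forecast(Xvars=xvars, call_me='gbt_' + label, dynamic_testing=True)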


The main takeaway from these results is that although the reduced dataset of 136 features with the GBT scored best on the validation set, the reduced variable set from the k-nearest neighbors model, with 98 features, performed best on the test set. This shows that a reduced variable subset obtained from one model class can be carried over to another successfully. We can save that reduced dataset for future forecasting tasks if we so please:
Xvars_df = f.export_Xvars_df()                # all regressors currently loaded into the Forecaster
Xvars_df.set_index('DATE',inplace=True)
reduced_dataset = Xvars_df[knn_reduced_vars]  # keep only the knn-reduced subset
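If we want to persist that reduced dataset to disk, a simple option is a csv export (the file name is just an example):
reduced_dataset.to_csv('sunspots_reduced_Xvars.csv')   # save for reuse in later projects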
Which variables were most frequently dropped?
Finally, an interesting exercise is to see which variables were most frequently dropped across the techniques explored. We can write some code to do that (a rough sketch follows), and this is what we get:
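One way to do that counting, reusing the hypothetical subsets dictionary from the retraining sketch above; this is a sketch, not the notebook’s exact code:
from collections import Counter
all_vars = Xvars_df.columns            # the full set of 184 regressors exported earlier
dropped = Counter(
    var for subset in subsets.values() for var in all_vars if var not in subset
)
print(dropped.most_common(10))         # variables dropped by the most techniques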
Most frequently dropped variables:
- quartercos (all 5 techniques)
- AR86 (all 5 techniques)
- AR46 (all 5 techniques)
One, but not both, of our quarter-seasonal variables was dropped every time, as were the 86th and 46th lags of the series. There are many other interesting insights from this part of the analysis. Check out the notebook linked at the beginning of the article for the full results.
Conclusion
Variable selection is an important part of time-series model optimization, but it is not always easy to know which variables are most important to include in any given model. In this example, we added more variables to a model than we thought necessary, then explored variable reduction with the lasso estimator and L1 regularization, as well as iterative feature elimination using four scikit-learn model classes leveraging permutation feature importance. Check out scalecast on GitHub and give it a star if you find it interesting!