An example of optimisation in a forecasting model
In Machine Learning (ML) projects that involve time series (TS), we usually look for the time series transformations that most improve the performance of the model.
This can be tedious when the dataset is large and there are many candidate transformations to try. I have implemented an approach that manages this task efficiently.
Estimation of Distribution Algorithms (EDAs) are a type of evolutionary algorithm in which each new generation is sampled from a probabilistic model learnt from the best individuals of the previous generation. Several EDA implementations are available in the Python package EDAspy (https://github.com/VicentePerezSoloviev/EDAspy ; https://pypi.org/project/EDAspy/). To install the package, run:
pip install EDAspy
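To make the idea of an EDA more concrete, here is a minimal sketch of a univariate EDA on binary vectors. It is not EDAspy's implementation: the function name `toy_binary_eda`, its parameters (`pop_size`, `selection_ratio`, `n_iter`) and the toy cost are all assumptions made for illustration. It only shows the sample, select, re-estimate loop described above.

```python
import numpy as np

def toy_binary_eda(cost, n_vars, pop_size=20, selection_ratio=0.5, n_iter=30, seed=0):
    """Minimise `cost` over binary vectors with a univariate EDA (illustrative only)."""
    rng = np.random.default_rng(seed)
    probs = np.full(n_vars, 0.5)  # initial model: each bit is 1 with probability 0.5
    best_ind, best_cost = None, np.inf
    n_selected = max(1, int(selection_ratio * pop_size))
    for _ in range(n_iter):
        # sample a generation from the current probabilistic model
        generation = (rng.random((pop_size, n_vars)) < probs).astype(int)
        costs = np.array([cost(ind) for ind in generation])
        # select the best individuals of this generation
        order = np.argsort(costs)
        selected = generation[order[:n_selected]]
        if costs[order[0]] < best_cost:
            best_cost = float(costs[order[0]])
            best_ind = generation[order[0]].copy()
        # re-estimate the model from the selected individuals,
        # clipped so that no probability collapses to exactly 0 or 1
        probs = np.clip(selected.mean(axis=0), 0.05, 0.95)
    return best_ind, best_cost

# toy cost: number of positions that differ from a hidden target vector
target = np.array([1, 0, 1, 1, 0, 0, 1, 0])
best_ind, best_cost = toy_binary_eda(lambda ind: int(np.sum(ind != target)), n_vars=len(target))
print(best_ind, best_cost)
```

In the forecasting example that follows, the role of the binary vector is played by the choice of variables and transformations, and the role of the toy cost by the forecasting error.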
A simple example is shown below. Note that for such a small example the improvement obtained is very small compared with what we could obtain with a larger dataset. Some simple time series transformations are implemented, but feel free to try more TS transformations.
First we load the needed libraries:
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
# EDAspy libraries
from EDAspy.timeseries import EDA_ts_fts as EDA
from EDAspy.timeseries import TS_transformations as TSTransformations
Then we load a small public dataset to use as an example (the macrodata dataset shipped with the statsmodels library) and take a look at the data:
mdata = sm.datasets.macrodata.load_pandas().data
df = mdata.iloc[:, 2:12]
df.head()

We list the variables, excluding the one we want to forecast ('pop'), and build the dataset with the time series transformations of the remaining variables. Some time series transformations are already available in TSTransformations. More transformations can be added with the following steps:
- Add the transformation postfix to the list of transformations
- Add to the dataset the respective variable with the name (name + postfix)
variables = list(df.columns)
variable_y = 'pop' # pop is the variable we want to forecast
variables = list(set(variables) - {variable_y})
TSTransf = TSTransformations(df)
transformations = ['detrend', 'smooth', 'log'] # postfix to variables, to denote the transformation
# build the transformations
for var in variables:
    transformation = TSTransf.de_trending(var)
    df[var + 'detrend'] = transformation
for var in variables:
    transformation = TSTransf.smoothing(var, window=10)
    df[var + 'smooth'] = transformation
for var in variables:
    transformation = TSTransf.log(var)
    df[var + 'log'] = transformation
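As an illustration of the two steps for adding a transformation, here is a minimal sketch that adds a first-difference transformation using plain pandas. The postfix 'diff', the toy DataFrame and the choice to fill the first row with 0 are assumptions made for this example; they are not part of EDAspy.

```python
import pandas as pd

# small stand-in dataset; in the example above this would be `df`
toy = pd.DataFrame({"realgdp": [100.0, 102.0, 105.0, 103.0],
                    "cpi": [30.0, 30.5, 31.0, 31.8]})
toy_variables = list(toy.columns)

postfix = "diff"  # hypothetical postfix denoting a first-difference transformation
for var in toy_variables:
    # first difference; the first row has no predecessor, so it is filled with 0
    toy[var + postfix] = toy[var].diff().fillna(0.0)

# step 1: the new postfix joins the list of transformations
toy_transformations = ["detrend", "smooth", "log", postfix]
print(toy.columns.tolist())  # ['realgdp', 'cpi', 'realgdpdiff', 'cpidiff']
```

Filling the missing first value matters because a model fitted on the dataset would otherwise receive a NaN in the first row.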
We must define a cost function. Here it takes a list of variables from the built dataset (including the time series transformations), fits a VAR model and returns the MAE of the forecast. The arguments nobs, maxlags and forecastings are its hyper-parameters:
def cost_function(variables_list, nobs=20, maxlags=15, forecastings=10):
    """
    variables_list: list of variables without the variable_y
    nobs: how many observations for validation
    maxlags: maximum number of previous lags considered when fitting
    forecastings: number of observations to predict
    return: MAE of the prediction with the real validation data
    """
    data = df[variables_list + [variable_y]]
    df_train, df_test = data[0:-nobs], data[-nobs:]
    model = VAR(df_train)
    results = model.fit(maxlags=maxlags, ic='aic')
    lag_order = results.k_ar  # lag order selected by AIC
    array = results.forecast(df_train.values[-lag_order:], forecastings)
    # keep only the forecast column that corresponds to variable_y
    variables_ = list(data.columns)
    position = variables_.index(variable_y)
    validation = [array[i][position] for i in range(len(array))]
    # MAE against the last `forecastings` observations of the validation set
    mae = mean_absolute_error(validation, df_test[variable_y][-forecastings:])
    return mae
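For readers less familiar with the metric, the following self-contained sketch shows a hold-out split and the MAE computed by hand with NumPy, which is the same quantity that sklearn's mean_absolute_error returns for one-dimensional inputs. The toy series and the naive forecast are assumptions made for illustration and have nothing to do with the VAR model above.

```python
import numpy as np

series = np.arange(1.0, 31.0)  # toy series 1, 2, ..., 30
nobs = 5                       # observations held out for validation
train, validation = series[:-nobs], series[-nobs:]

# naive forecast: repeat the last training value for every validation step
forecast = np.full(nobs, train[-1])

# MAE: mean of the absolute differences between real and predicted values
mae = np.mean(np.abs(validation - forecast))
print(mae)  # 3.0
```

The validation values are 26 to 30 and the forecast is always 25, so the absolute errors are 1, 2, 3, 4 and 5, and their mean is 3.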
We first take the original variables, without any time series transformation, and forecast the y variable with the same cost function. This value is stored so that it can be compared with the optimum solution found.
mae_pre_eda = cost_function(variables)
print('MAE without using EDA:', mae_pre_eda)
# MAE without using EDA: 5.091478009948458
Next we initialise the vector of statistics. Each variable starts with a 50% probability of being chosen:
vector = pd.DataFrame(columns=list(variables))
vector.loc[0] = 0.5
Run the algorithm. The code prints further information during execution:
eda = EDA(max_it=50, dead_it=5, size_gen=15, alpha=0.7, vector=vector,
array_transformations=transformations, cost_function=cost_function)
best_ind, best_MAE = eda.run(output=True)

The results are shown below. The left panel shows the best local cost found in each iteration, and the right panel shows the best global cost found up to that iteration. To plot the results:
hist = eda.historic_best
# running minimum: best global cost found up to each iteration
relative_plot = []
mx = 999999999
for i in range(len(hist)):
    if hist[i] < mx:
        mx = hist[i]
    relative_plot.append(mx)
print('Solution:', best_ind, '\nMAE post EDA: %.2f' % best_MAE, '\nMAE pre EDA: %.2f' % mae_pre_eda)
plt.figure(figsize = (14,6))
ax = plt.subplot(121)
ax.plot(list(range(len(hist))), hist)
ax.title.set_text('Local cost found')
ax.set_xlabel('iteration')
ax.set_ylabel('MAE')
ax = plt.subplot(122)
ax.plot(list(range(len(relative_plot))), relative_plot)
ax.title.set_text('Best global cost found')
ax.set_xlabel('iteration')
ax.set_ylabel('MAE')
plt.show()
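As a side note, the running best cost built with the loop above can also be obtained in one call with NumPy's np.minimum.accumulate. The values in `hist_example` are made up for illustration and are not results from the run above.

```python
import numpy as np

hist_example = [5.2, 4.9, 5.0, 4.7, 4.8]  # made-up per-iteration best costs
# cumulative minimum: best cost found up to and including each iteration
running_best = np.minimum.accumulate(hist_example)
print(running_best.tolist())  # [5.2, 4.9, 4.9, 4.7, 4.7]
```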

I hope this is useful for your future projects. Future Medium posts will share more real examples showing how to use EDAspy functionalities. Feel free to look for more examples in the notebooks section of the package (https://github.com/VicentePerezSoloviev/EDAspy/tree/master/notebooks).
