Data science in demand forecasting

In this story, I would like to give an overview of common data science techniques and frameworks for building a demand forecasting model.
First of all, let's define what demand forecasting is and what impact it has on business. Wikipedia says: "Demand forecasting is a field of predictive analytics which tries to understand and predict customer demand to optimize supply decisions by corporate supply chain and business management."
There are several types of demand forecast:
- Short-term (carried out for a period of 3 to 12 months; the seasonal pattern of demand and the effect of tactical decisions on customer demand are taken into consideration).
- Medium to long-term (typically carried out for 12 to 24 months in advance, or 36–48 months in certain businesses; long-term forecasting drives business strategy planning, sales and marketing planning, financial planning, capacity planning, capital expenditure, etc.).
- External macro-level (deals with the broader market movements that depend on the macroeconomic environment; carried out to evaluate the strategic objectives of a business, like product portfolio expansion, entering new customer segments, technological disruptions, a paradigm shift in consumer behavior, and risk mitigation strategies).
- Internal business level (deals with the internal operations of the business, such as product category, sales division, financial division, and manufacturing group; includes the annual sales forecast, estimation of COGS, net profit margin, cash flow, etc.).
- Passive (carried out for stable businesses with very conservative growth plans; simple extrapolations of historical data are made with minimal assumptions).
- Active (carried out for scaling and diversifying businesses with aggressive growth plans in terms of marketing activities, product portfolio expansion, and consideration of competitor activities and the external economic environment).
Why is demand forecasting so important, and which processes does it improve? Demand forecasting is the pivotal business process around which the strategic and operational plans of a company are devised. Based on the demand forecast, strategic and long-range plans of a business, like budgeting, financial planning, sales and marketing plans, capacity planning, risk assessment, and mitigation plans, are formulated. It also affects the following processes:
- Supplier relationship management.
- Customer relationship management.
- Order fulfillment and logistics.
- Marketing campaigns.
- Manufacturing flow management.
OK, let's tackle this problem and try different data science techniques and frameworks to make an accurate demand forecast.
For my experiment, I will use a dataset from a Kaggle competition. In every data science task, I follow CRISP-DM to cover all the necessary processes while working on the project.
The first phase of CRISP-DM is Business Understanding. The task is to forecast the total amount of products sold in every shop for the test set. What about the metric or success criteria? The forecast is evaluated by the root mean squared error (RMSE).
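For reference, over $N$ forecasts:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the actual value and $\hat{y}_i$ is the forecast.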
The second and third phases, Data Understanding and Data Preparation, are in my opinion among the most important. The first task when initiating a demand forecasting project is to provide the client with meaningful insights. The process includes the following steps:
- Gather available data
In our case, we have the following datasets:
- sales_train.csv – the training set. Daily historical data from January 2013 to October 2015.
- test.csv – the test set. You need to forecast the sales for these shops and products for November 2015.
- sample_submission.csv – a sample submission file in the correct format.
- items.csv – supplemental information about the items/products.
- item_categories.csv – supplemental information about the items categories.
- shops.csv – supplemental information about the shops.
Briefly review the data structure, accuracy, and consistency. Let’s make several tables and plots to analyze our dataset.
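As a minimal first-pass sketch with pandas (assuming sales_train.csv has been downloaded from the competition; the column names below are the ones it ships with):

import pandas as pd

sales_train = pd.read_csv('sales_train.csv')
print(sales_train.head())
# min/max of the numeric columns reveal, among other things, negative prices and sales
print(sales_train[['item_price', 'item_cnt_day']].agg(['min', 'max', 'mean']))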

This basic analysis helps us understand the main input parameters of the dataset. For example, we can see that our dataset contains negative values for Price, which could be mistakes, and negative values for Sales, which could be purchase returns.
The next table shows the distribution of the numeric columns:

Here we can see that half of our sales have a value equal to 1.
Let’s make a scatter plot to analyze the relationship between sales volume and price.
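A minimal matplotlib sketch, reusing the sales_train frame from above:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.scatter(sales_train['item_price'], sales_train['item_cnt_day'], alpha=0.3)
plt.xlabel('item_price')
plt.ylabel('item_cnt_day')
plt.title('Sales volume vs. price')
plt.show()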

This type of plot typically shows that we have items with a low price and a large sales volume, and items with an abnormally high price and a very low sales volume.
Let's analyze our data over time:
- The first plot shows a decreasing trend in the number of unique items in the assortment:

- The second plot shows a decreasing trend in sales quantity and two peaks, which could be seasonality or promo activity. This is something we need to learn both from the business and through quantitative analysis.

- The next step is to analyze the categories we will use to aggregate our dataset: the first dimension is item/category names and the second is shop IDs. This analysis shows the sales volume distribution across item categories and shops, and helps us understand which categories and shops are the most important for us and which ones have too short a history for analytics and forecasting.


EDA is an ongoing process that you can continue throughout the entire time you work on different experiments. Let's stop at this point and start creating models.
Before we start creating models, we need to split our dataset into training, validation, and testing sets. Keep in mind that we need to use the date column to filter our dataset; don't use a random split for time series.
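A minimal sketch of such a date-based split (the cutoff dates are hypothetical, and the date column is assumed to be parsed as datetime):

# Train on everything before September 2015, validate on September,
# and hold out October 2015 for testing (cutoffs are an assumption)
train = full_df[full_df['date'] < '2015-09-01']
valid = full_df[(full_df['date'] >= '2015-09-01') & (full_df['date'] < '2015-10-01')]
test = full_df[full_df['date'] >= '2015-10-01']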

Basic approach
In this part, I would like to explain and build baseline models.
The first one is the Smoothed Moving Average. The smoothed moving average (SMMA) is a demand forecasting model that can be used to gauge trends based on a series of averages from consecutive periods.
For example, the smoothed moving average from six months of sales could be calculated by taking the average of sales from January to June, then the average of sales from February to July, then March to August, and so on.
This model is called ‘moving’ because averages are continually recalculated as more data becomes available.
A moving average of order $m$ can be written as:

$$\hat{T}_t = \frac{1}{m} \sum_{j=-k}^{k} y_{t+j}$$

where $m = 2k+1$. That is, the estimate of the trend-cycle at time $t$ is obtained by averaging values of the time series within $k$ periods of $t$. Observations that are nearby in time are also likely to be close in value.
Smoothed Moving Average is useful for looking at overall sales trends over time and aiding long-term demand planning. Rapid changes caused by seasonality or other fluctuations are smoothed out, so you can analyze the bigger picture more accurately. The smoothed moving average model typically works well when you have a product that's consistently growing or declining over time. An important disadvantage of this approach is that it cannot forecast items without a sales history.
To implement it in Python, we can use pandas.DataFrame.shift to create a lag value,
full_df['sales_lag_n'] = full_df['sales'].shift(periods=n)
and then pandas.DataFrame.rolling to compute a rolling mean based on the created lag values.
full_df['sma'] = full_df['sales_lag_n'].rolling(n).mean()
The next model is Holt-Winters Exponential Smoothing. Holt (1957) and Winters (1960) extended Holt's method to capture seasonality. The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations: one for the level $\ell_t$, one for the trend $b_t$, and one for the seasonal component $s_t$, with corresponding smoothing parameters $\alpha$, $\beta^*$, and $\gamma$. We use $m$ to denote the frequency of the seasonality, i.e., the number of seasons in a year. For example, for quarterly data $m=4$, and for monthly data $m=12$.
There are two variations of this method that differ in the nature of the seasonal component. The additive method is preferred when the seasonal variations are roughly constant through the series, while the multiplicative method is preferred when the seasonal variations change proportionally to the level of the series. With the additive method, the seasonal component is expressed in absolute terms in the scale of the observed series, and in the level equation the series is seasonally adjusted by subtracting the seasonal component. Within each year, the seasonal component will add up to approximately zero. With the multiplicative method, the seasonal component is expressed in relative terms (percentages), and the series is seasonally adjusted by dividing through by the seasonal component. Within each year, the seasonal component will sum up to approximately $m$.
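For reference, the component form of the additive method (following Hyndman and Athanasopoulos) is:

$$\begin{aligned}
\hat{y}_{t+h|t} &= \ell_t + h b_t + s_{t+h-m(k+1)} \\
\ell_t &= \alpha (y_t - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta^* (\ell_t - \ell_{t-1}) + (1 - \beta^*) b_{t-1} \\
s_t &= \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1 - \gamma) s_{t-m}
\end{aligned}$$

where $k$ is the integer part of $(h-1)/m$, which ensures the seasonal estimates come from the final year of the sample.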
This method is more efficient than the previous one because it handles the seasonal component, but it has the same disadvantage: it doesn't handle new items in the assortment.
This method has an implementation in Python:
from statsmodels.tsa.holtwinters import ExponentialSmoothing as HWES
This model applies to a single shop-item pair, which means that we need to fit a new model for every pair.
for index, row in tqdm(df_test.iterrows()):
    # select the sales history for this shop-item pair
    tmp = df_train_aggr[(df_train_aggr['shop_id'] == row['shop_id']) & (df_train_aggr['item_id'] == row['item_id'])]
    model = HWES(tmp.item_cnt_day)
    model_fit = model.fit()
    forecast = model_fit.forecast(steps=n)
The last model in my basic approach is ARIMA. ARIMA, short for 'Auto-Regressive Integrated Moving Average', is actually a class of models that 'explains' a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that the equation can be used to forecast future values. Any non-seasonal time series that exhibits patterns and is not random white noise can be modeled with ARIMA models.
An ARIMA model is characterized by 3 terms, p, d, and q (see the equation after this list), where:
- p is the order of the AR term
- d is the number of differences required to make the time series stationary
- q is the order of the MA term
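In equation form, once the series has been differenced $d$ times into a stationary series $y'_t$, the model is:

$$y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t$$

where the $\phi_i$ are the AR coefficients, the $\theta_i$ are the MA coefficients, and $\varepsilon_t$ is white noise.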
If a time series has seasonal patterns, you need to add seasonal terms, and it becomes SARIMA, short for 'Seasonal ARIMA'. As with the previous model, I will build a separate model for each shop-item pair, so the main idea is to find the right parameters for our models. I will not write a long description of how to calculate each one, but you can find it here. In my case, I would like to use something like auto.arima, and I found an interesting implementation – pmdarima.
"pmdarima" brings R’s beloved auto.arima to Python, making an even stronger case for why you don’t need R for data science. pmdarima is 100% Python + Cython and does not leverage any R code, but is implemented in a powerful, yet easy-to-use set of functions & classes that will be familiar to scikit-learn users.
The code will be very similar to the previous one:
import pmdarima as pm
for index, row in tqdm(df_test.iterrows()):
    # select the sales history for this shop-item pair
    tmp = df_train_aggr[(df_train_aggr['shop_id'] == row['shop_id']) & (df_train_aggr['item_id'] == row['item_id'])]
    model = pm.auto_arima(tmp.item_cnt_day, start_p=1, start_q=1, max_p=3, max_q=3, m=12,
                          start_P=0, seasonal=False,
                          d=1, D=1, trace=False,
                          error_action='ignore',  # don't want to know if an order does not work
                          suppress_warnings=True,  # don't want convergence warnings
                          stepwise=True)
    forecast = model.predict(n_periods=n, return_conf_int=False)
ARIMA is quite a strong model and can give a good forecast, but it can be limited in forecasting extreme values. While the model is adept at modeling seasonality and trends, outliers are difficult for ARIMA to forecast, for the very reason that they lie outside the general trend captured by the model.
So, classical time series forecasting methods may be focused on linear relationships; nevertheless, they are sophisticated and perform well on a wide range of problems, assuming that your data is suitably prepared and the method is well configured.
Machine learning approach
One of the most common enterprise applications of machine learning, teamed with statistical methods, is predictive analytics. It allows not only estimating demand but also understanding what drives sales and how customers are likely to behave under certain conditions. The main idea in using machine learning models for demand forecasting is to generate a lot of useful features.
Feature engineering is the use of domain knowledge to create features that make machine learning models predict more accurately. It enables a deeper understanding of data and more valuable insights.
These features could include (see the sketch after this list):
- Product/Shop characteristics (information from items dictionaries)
- Internal information about promo activities and any price changes
- Different level target encoding of categorical variables
- Date features
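A minimal sketch of a few such features on a monthly aggregated frame (the frame and column names here are illustrative assumptions, not the exact ones from my experiments):

# Date features from a datetime column
full_df['month'] = full_df['date'].dt.month
full_df['year'] = full_df['date'].dt.year
# Lag features per shop-item pair
for lag in [1, 2, 3]:
    full_df[f'sales_lag_{lag}'] = full_df.groupby(['shop_id', 'item_id'])['sales'].shift(lag)
# Simple mean target encoding of the item category, fitted on training rows only
cat_means = full_df[full_df['is_train']].groupby('item_category_id')['sales'].mean()
full_df['item_category_enc'] = full_df['item_category_id'].map(cat_means)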
In my experiments, I will use the following Python libraries CatBoost, XGBoost, and H2O AML.
Let’s start with XGBoost.
from xgboost import XGBRegressor
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost cannot handle categorical features by itself; it only accepts numerical values, similar to Random Forest. Therefore, one has to perform various encodings like label encoding, mean encoding, or one-hot encoding before supplying categorical data to XGBoost. XGBoost splits up to the specified _max_depth_ hyperparameter and then starts pruning the tree backward, removing splits beyond which there is no positive gain. It uses this approach because a split with no loss reduction may sometimes be followed by a split with a loss reduction. XGBoost can also perform leaf-wise tree growth. Missing values are allocated to the side that reduces the loss in each split.
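For example, a minimal label-encoding pass with scikit-learn (the column list is an assumption):

from sklearn.preprocessing import LabelEncoder

for col in ['shop_id', 'item_id', 'item_category_id']:  # assumed categorical columns
    full_df[col] = LabelEncoder().fit_transform(full_df[col].astype(str))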
# Train
model = XGBRegressor(
    max_depth=8,
    n_estimators=1000,
    min_child_weight=300,
    colsample_bytree=0.8,
    subsample=0.8,
    eta=0.3,
    seed=42)
model.fit(
    X_train,
    Y_train,
    eval_metric="rmse",
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)],
    verbose=True,
    early_stopping_rounds=10)
XGBoost is a good and fast implementation of the gradient boosting algorithm for machine learning, but its main disadvantage, in my opinion, is that it can't use categorical features directly, so the model may lose some information.
The next library is CatBoost.
from catboost import CatBoostRegressor
CatBoost grows a balanced tree. In each level of such a tree, the feature-split pair that yields the lowest loss (according to a penalty function) is selected and used for all the level's nodes. It is possible to change this policy using the grow_policy parameter.
CatBoost has two modes for processing missing values, "Min" and "Max". In "Min", missing values are processed as the minimum value for a feature (they are given a value that is less than all existing values). This way, it is guaranteed that a split separating missing values from all other values is considered when selecting splits. "Max" works the same as "Min", only with maximum values.
CatBoost uses a combination of one-hot encoding and an advanced mean encoding. For features with a low number of categories, it uses one-hot encoding. The maximum number of categories for one-hot encoding can be controlled by the _one_hot_max_size_ parameter. For the remaining categorical columns, CatBoost uses an efficient method of encoding, which is similar to mean encoding but with an additional mechanism aimed at reducing overfitting. Using CatBoost's categorical encoding comes with the downside of a slower model.
# Train
model = CatBoostRegressor(iterations=100, depth=10, learning_rate=0.03, loss_function='RMSE')
# cat_features tells CatBoost which columns to encode internally
model.fit(X_train, Y_train, cat_features=categorical, eval_set=(X_valid, Y_valid))
CatBoost provides useful tools for easy work with highly categorized data. It shows solid results training on unprocessed categorical features.
The last one is H2O AutoML.
import h2o
from h2o.automl import H2OAutoML
h2o.init(nthreads = 7, max_mem_size = '45g')
H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles – one based on all previously trained models, another one on the best model of each family – will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top-performing models in the AutoML Leaderboard.
AutoML will iterate through different models and parameters trying to find the best. There are several parameters to specify, but in most cases, all you need to do is to set only the maximum runtime in seconds or a maximum number of models. You can think about AutoML as something similar to GridSearch but on the level of models rather than on the level of parameters.
# Run AutoML
aml = H2OAutoML(max_models=5, seed=1)
aml.train(y=y, training_frame=x_train_hf,
          validation_frame=x_valid_hf)
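Once training finishes, we can inspect the AutoML Leaderboard to see how the trained models rank on the validation metric:

# View the AutoML Leaderboard
lb = aml.leaderboard
print(lb.head(rows=lb.nrows))  # print the entire leaderboard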
The goal of AutoML is to automate repetitive tasks like pipeline creation and hyperparameter tuning so that data scientists can spend more of their time on the business problem at hand. AutoML also aims to make the technology available to everybody rather than a select few. AutoML and data scientists can work in conjunction to accelerate the ML process so that the real effectiveness of machine learning can be utilized. In my case, it gave the best result among the models so far.
Deep learning approach
Deep learning is a powerful class of machine learning algorithms that use artificial neural networks to understand and leverage patterns in data. Deep learning algorithms use multiple layers to progressively extract higher-level features from raw data; this reduces the amount of feature extraction needed in other machine learning methods. The deep learning algorithm learns on its own by recognizing patterns using many layers of processing. That is why the "deep" in "deep learning" refers to the number of layers through which the data is transformed. Multiple transformations automatically extract important features from raw data.
The main challenge in this task is handling categorical variables. In deep learning, we can use entity embeddings. Embeddings are a solution for dealing with categorical variables that avoids many of the pitfalls of one-hot encoding. An embedding is a mapping of a categorical variable into an n-dimensional vector. So, what does our neural network's architecture look like?

The first Python framework for this task is Keras. The main challenge here is to write the code for embedding every categorical feature.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense, Activation, Reshape, BatchNormalization, Dropout, concatenate, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.constraints import NonNeg  # needed for the output layer below
Keras’s implementation of this approach is rather bulky.
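The code below expects an emb_space list of (input_dim, output_dim) pairs, one per categorical feature. A minimal sketch for building it, using the common heuristic of capping the embedding width at 50 (the column list is an assumption):

# Assumed helper: cardinality-based embedding sizes for each categorical column
cat_feature = ['shop_id', 'item_id', 'item_category_id']  # illustrative columns
emb_space = [(X_train[c].nunique() + 1, min(50, (X_train[c].nunique() + 1) // 2))
             for c in cat_feature]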
model_inputs = []
model_embeddings = []
# one input + embedding per categorical feature
for input_dim, output_dim in emb_space:
    i = Input(shape=(1,))
    emb = Embedding(input_dim=input_dim, output_dim=output_dim)(i)
    model_inputs.append(i)
    model_embeddings.append(emb)
# one small dense block per continuous feature
con_outputs = []
for con in con_feature:
    con_input = Input(shape=(1,))
    con_output = Dense(10)(con_input)
    con_output = Activation("relu")(con_output)
    con_output = Reshape(target_shape=(1, 10))(con_output)
    model_inputs.append(con_input)
    con_outputs.append(con_output)
merge_embeddings = concatenate(model_embeddings, axis=-1)
if len(con_outputs) > 1:
    merge_con_output = concatenate(con_outputs)
else:
    merge_con_output = con_outputs[0]
merge_embedding_cont = concatenate([merge_embeddings, merge_con_output])
output_tensor = Dense(1000, name="dense1024")(merge_embedding_cont)
output_tensor = BatchNormalization()(output_tensor)
output_tensor = Activation('relu')(output_tensor)
output_tensor = Dropout(0.3)(output_tensor)
output_tensor = Dense(500, name="dense512")(output_tensor)
output_tensor = BatchNormalization()(output_tensor)
output_tensor = Activation("relu")(output_tensor)
output_tensor = Dropout(0.3)(output_tensor)
# non-negative output, since sales cannot be negative
output_tensor = Dense(1, activation='linear', name="output", kernel_constraint=NonNeg())(output_tensor)
optimizer = Adam(lr=10e-3)
nn_model = Model(inputs=model_inputs, outputs=output_tensor)
nn_model.compile(loss='mse', optimizer=optimizer, metrics=['mean_squared_error'])
reduceLr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=1, verbose=1)
checkpoint = ModelCheckpoint("nn_model.hdf5", monitor='val_loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint, reduceLr]
history = nn_model.fit(x=x_fit_train, y=y_train.reshape(-1, 1, 1),
                       validation_data=(x_fit_val, y_val.reshape(-1, 1, 1)),
                       batch_size=1024, epochs=10, callbacks=callbacks_list)
The machine learning features together with the entity embeddings approach showed slightly better results than the previous models, but more training time was spent. The advantage of using embeddings is that they can be learned, representing each category better than what other models can approximate.
So, we see that this approach is good, but its main disadvantage is the amount of code required.
Fastai is our solution. Fast.ai is a popular deep learning library that provides high-level components for obtaining state-of-the-art results in standard deep learning domains. Fast.ai allows practitioners to experiment, mix, and match to discover new approaches. In short, it facilitates hassle-free deep learning solutions. The library leverages the dynamism of the underlying Python language and the flexibility of the PyTorch library.
from fastai.tabular import *
Training a Deep Neural Network (DNN) is a difficult global optimization problem. Learning Rate (LR) is a crucial hyper-parameter to tune when training DNNs. A very small learning rate can lead to very slow training, while a very large learning rate can hinder convergence as the loss function fluctuates around the minimum, or even diverges.
Fastai implements the one-cycle policy in this framework. Super-convergence uses the CLR method, but with just one cycle – which contains two learning rate steps, one increasing and one decreasing – and a large maximum learning rate bound. The cycle's size must be smaller than the total number of iterations/epochs. After the cycle is complete, the learning rate should decrease even further for the remaining iterations/epochs, to several orders of magnitude less than its initial value.
# TabularList for validation
val = (TabularList.from_df(X_train.iloc[start_indx:end_indx].copy(), path=path, cat_names=cat_feature, cont_names=con_feature))
test = (TabularList.from_df(X_test, path=path, cat_names=cat_feature, cont_names=con_feature, procs=procs))
# TabularList for training
data = (TabularList.from_df(X_train, path=path, cat_names=cat_feature, cont_names=con_feature, procs=procs)
        .split_by_idx(list(range(start_indx, end_indx)))
        .label_from_df(cols=dep_var)
        .add_test(test)
        .databunch())
# Initializing the network
learn = tabular_learner(data, layers=[1024, 512], metrics=[rmse, r2_score])
# Exploring the learning rates
learn.lr_find()
learn.recorder.plot()
# Learn
learn.fit_one_cycle(10, 1e-02)
As a result, we get less code and a faster way to find an optimal learning rate. The result is very similar to the previous neural network architecture.
This architecture works well, but what if we would like to get some information from previous periods of sales without adding lag features? Then we need to add an LSTM or RNN layer to our architecture. In Keras, this would make the code even more cumbersome, and there is no implementation for Fastai.
I found the solution to this problem – PyTorch Forecasting.
PyTorch Forecasting aims to ease state-of-the-art time series forecasting with neural networks for both real-world cases and research alike. The goal is to provide a high-level API with maximum flexibility for professionals and reasonable defaults for beginners. Specifically, the package provides:
- A time-series dataset class that abstracts handling variable transformations, missing values, randomized subsampling, multiple history lengths, etc.
- A base model class which provides basic training of time series models along with logging in tensorboard and generic visualizations such as actual vs predictions and dependency plots
- Multiple neural network architectures for time series forecasting that have been enhanced for real-world deployment and come with in-built interpretation capabilities
- Multi-horizon time series metrics
- Ranger optimizer for faster model training
- Hyperparameter tuning with optuna
The package is built on PyTorch Lightning to allow training on CPUs, single and multiple GPUs out-of-the-box.
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_forecasting import Baseline, TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer
from pytorch_forecasting.metrics import SMAPE, PoissonLoss, QuantileLoss, RMSE
from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters
from pytorch_forecasting.data.encoders import NaNLabelEncoder
In my example, I used Temporal Fusion Transformer [2]. This is an architecture developed by Oxford University and Google that has beaten Amazon’s DeepAR by 36–69% in benchmarks.
The first step is to create data loaders and a special dataset object for our model.
max_prediction_length = 1
max_encoder_length = 6
training_cutoff = X_train["time_idx"].max() - max_prediction_length
training = TimeSeriesDataSet(
    X_train[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="log_sales",
    group_ids=["shop_id", "item_id"],
    min_encoder_length=max_encoder_length // 2,  # keep encoder length long (as it is in the validation set)
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=["shop_id", "item_id"],
    static_reals=['city_coord_1', 'city_coord_2'],
    time_varying_known_categoricals=["month"],
    time_varying_known_reals=["time_idx", "delta_price_lag"],
    time_varying_unknown_categoricals=["shop_category", "city_code", "item_category_id", "type_code", "subtype_code", "country_part"],
    categorical_encoders={
        "shop_id": NaNLabelEncoder(add_nan=True),
        "item_id": NaNLabelEncoder(add_nan=True),
        "shop_category": NaNLabelEncoder(add_nan=True),
        "city_code": NaNLabelEncoder(add_nan=True),
        "item_category_id": NaNLabelEncoder(add_nan=True),
        "type_code": NaNLabelEncoder(add_nan=True),
        "subtype_code": NaNLabelEncoder(add_nan=True),
        "country_part": NaNLabelEncoder(add_nan=True),
    },
    time_varying_unknown_reals=[
        'date_cat_avg_item_cnt_lag_1', 'date_shop_cat_avg_item_cnt_lag_1',
        'date_shop_type_avg_item_cnt_lag_1', 'date_shop_subtype_avg_item_cnt_lag_1',
        'date_city_avg_item_cnt_lag_1', 'date_item_city_avg_item_cnt_lag_1',
        'date_type_avg_item_cnt_lag_1', 'date_subtype_avg_item_cnt_lag_1',
        'item_shop_last_sale', 'item_last_sale',
        'item_shop_first_sale', 'item_first_sale',
    ],
    add_relative_time_idx=True,
    add_encoder_length=True,
    allow_missings=True
)
validation = TimeSeriesDataSet.from_dataset(training, X_train, min_prediction_idx=training.index.time.max() + 1, stop_randomization=True)
batch_size = 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=2)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=2)
The next step is to find an optimal learning rate.
pl.seed_everything(42)
trainer = pl.Trainer(
    gpus=1,
    # clipping gradients is a hyperparameter and important to prevent divergence
    # of the gradient for recurrent neural networks
    gradient_clip_val=0.1,
)
tft = TemporalFusionTransformer.from_dataset(
    training,
    # not meaningful for finding the learning rate but otherwise very important
    learning_rate=0.03,
    hidden_size=16,  # most important hyperparameter apart from learning rate
    # number of attention heads. Set to up to 4 for large datasets
    attention_head_size=1,
    dropout=0.1,  # between 0.1 and 0.3 are good values
    hidden_continuous_size=8,  # set to <= hidden_size
    output_size=1,  # the default is 7 quantiles; RMSE needs a single output
    loss=RMSE(),
    # reduce learning rate if no improvement in validation loss after x epochs
    reduce_on_plateau_patience=4,
)
print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")
# find optimal learning rate
torch.set_grad_enabled(False)
res = trainer.tuner.lr_find(
    tft,
    train_dataloader=train_dataloader,
    val_dataloaders=val_dataloader,
    max_lr=10.0,
    min_lr=1e-6,
)
print(f"suggested learning rate: {res.suggestion()}")
fig = res.plot(show=True, suggest=True)
fig.show()
Now we can configure our neural network and train it.
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min")
lr_logger = LearningRateMonitor()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # logging results to a tensorboard
trainer = pl.Trainer(
    max_epochs=30,
    gpus=1,
    weights_summary="top",
    gradient_clip_val=0.1,
    limit_train_batches=30,  # comment in for training, running validation every 30 batches
    # fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
)
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=4,
    dropout=0.1,
    hidden_continuous_size=8,
    output_size=1,  # the default is 7 quantiles; RMSE needs a single output
    loss=RMSE(),
    log_interval=10,  # log predictions every 10 batches
    reduce_on_plateau_patience=4,
)
print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")
# fit network
trainer.fit(
    tft,
    train_dataloader=train_dataloader,
    val_dataloaders=val_dataloader,
)
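After training, we can load the best checkpoint and generate predictions; this follows the standard PyTorch Forecasting flow and assumes Lightning's default checkpointing:

# load the best model according to the validation loss and predict
best_model_path = trainer.checkpoint_callback.best_model_path
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)
predictions = best_tft.predict(val_dataloader)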
Also, we could tune our model and find optimal hyperparameters.
import pickle

from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters
# create study
study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path="optuna_test",
    n_trials=200,
    max_epochs=50,
    gradient_clip_val_range=(0.01, 1.0),
    hidden_size_range=(8, 128),
    hidden_continuous_size_range=(8, 128),
    attention_head_size_range=(1, 4),
    learning_rate_range=(0.001, 0.1),
    dropout_range=(0.1, 0.3),
    trainer_kwargs=dict(limit_train_batches=30),
    reduce_on_plateau_patience=4,
    use_learning_rate_finder=False,  # use Optuna to find the ideal learning rate, or use the in-built learning rate finder
)
# save study results - we can also resume tuning at a later point in time
with open("test_study.pkl", "wb") as fout:
    pickle.dump(study, fout)
# show best hyperparameters
print(study.best_trial.params)
As a result, I got a well-performing model using all the features and approaches, which can be used for time series forecasting.
Ensemble
Stacking or Stacked Generalization is an ensemble machine learning algorithm. It uses a meta-learning algorithm to learn how to best combine the predictions from two or more base machine learning algorithms.
We can use stacking to combine several models and make new predictions.
The architecture of a stacking model involves two or more base models, often referred to as level-0 models, and a meta-model that combines the predictions of the base models, referred to as a level-1 model.
- Level-0 Models (Base-Models): Models fit on the training data and whose predictions are compiled.
- Level-1 Model (Meta-Model): Model that learns how to best combine the predictions of the base models.
The meta-model is trained on the predictions made by the base models on out-of-sample data. That is, data not used to train the base models is fed to them, predictions are made, and these predictions, along with the expected outputs, provide the input and output pairs of the training dataset used to fit the meta-model. The outputs of the base models used as input to the meta-model may be real values in the case of regression, and probability values, probability-like values, or class labels in the case of classification.
For stacking, we can use sklearn.ensemble.StackingRegressor or, as in the example below, StackingCVRegressor from mlxtend.
from mlxtend.regressor import StackingCVRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

# base models (level-0) and meta-model (level-1); the hyperparameters here are illustrative
svr = SVR(kernel='rbf')
lasso = Lasso()
rf = RandomForestRegressor(n_estimators=100, random_state=RANDOM_SEED)

stack = StackingCVRegressor(regressors=(svr, lasso, rf), meta_regressor=lasso, random_state=RANDOM_SEED)
Stacking regression is an ensemble learning technique to combine multiple regression models via a meta-regressor. The individual regression models are trained based on the complete training set; then, the meta-regressor is fitted based on the outputs – meta-features – of the individual regression models in the ensemble.
Conclusion
As a result of this analysis, we can see that time series forecasting doesn't stand still. Every day we can find new approaches and new frameworks. In this article, I tried to collect some of them and show how to implement them in a real case. From my experience, combination strategies have potential application in demand forecasting problems, outperform other state-of-the-art models on trending and stationary series, and have comparable accuracy to other models. It all depends on the input data and the business goal, but I hope these models help you create your own state-of-the-art approach for your business.
Thanks for reading.
You can find all the code in the Git repository – link.