Multivariable time series forecasting using Stateless Neural Networks

Arjeus A. Guevarra
Towards Data Science
12 min read · Aug 27, 2019


Forecasting a target variable with the aid of multiple supporting variables, using stateless deep learning.

Photo by Jon Tyson on Unsplash

Time series forecasting with assisting variables

Time series are datasets that carry a time element. Such data allow us to think about the combination of the following properties of time series:

  1. Seasonality — patterns in the data that tend to repeat at a specific interval of time.
  2. Trend — similar to regression, this captures the global pattern of the series.
  3. Recency — the relevance of data is centered on the present: recent observations carry greater influence, and the accuracy of future predictions is better the closer they are to the present (the principle of entropy).

Such a combination is what makes time series special. The data at a point in time depend on the previous data. Intuitively, this is the cause-and-effect nature of events: the words you are reading make sense based on the previous words.

Normally, machine learning only captures the trend, which makes training a typical machine learning model over time series expensive.

Forecasting a single variable with supporting variables

Normally, forecasting a single variable involves creating a model that is a function of previous time data. This is known as autoregression of a single variable.

Autoregression formula [1] (Wikipedia)

The image above is a sample of an autoregression formula. [1]
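
For reference, the standard AR(p) model described in [1] can be written as

X_t = c + \varphi_1 X_{t-1} + \varphi_2 X_{t-2} + \dots + \varphi_p X_{t-p} + \varepsilon_t

where c is a constant, the \varphi_i are the model parameters, and \varepsilon_t is a white-noise error term.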

We will not dive into the details of the formula, but notice the following variables:

The dependent variable — the value of the data at the present time (X_t).

The independent variables — the lagged values X_{t-1}, ..., X_{t-p}; note that the previous data provide the values for these variables.

The rest are additional parameters and variables that refine the accuracy of the model.

But what if we have additional variables that are correlated with the target variable? With proper modelling, these could increase the accuracy of the forecast for the target variable we care about.

Algorithms and Models

When you think of time series, you immediately think of LSTMs, GRUs and other recurrent neural networks. In this case, however, we will not use them and will instead rely on feedforward neural networks. Why? Let's illustrate both and explain why I chose the latter:

  1. Recurrent neural networks — these are graphs whose outputs are fed back to the input. As inference is done, the outputs of the network are stored as states, ready to be used at the next time step.
RNN and its states through time. Source: https://upload.wikimedia.org/wikipedia/commons/e/ee/Unfold_through_time.png

The recurrent network attempts to predict the next timestep through the following:

  1. At → Actual input value at time t.
  2. Xt → The previous state of the RNN or the modelled prediction

With the variables above, the network achieves its predictive power by minimizing the error between the modelled data and the actual data. However, notice that as the RNN passes through different temporal data, the error of the predictive model increases.

Continuous state transformation worsens the prediction through time.

LSTMs and GRUs appear to solve this problem, as they contain gates that accept or forget previous data considered irrelevant. However, these gates and data transformations cannot be explicitly controlled with respect to a critical aspect of time series analysis: how the various timesteps relate to one another.

In the image below, the timesteps become a vector fed into a cell with an activation function. Thus, the relationships between the timesteps and the target future step are captured directly.

Feedforward network that relates various timesteps vs the next timestep

We can now move to the implementation for the specifics of the pipeline and the model design.

IMPLEMENTATION

The complete code implementation is located on Github at https://github.com/arjay55/time_series_stateless.git

I. Libraries

The libraries mainly consist of the following:

  1. pandas → for data preparation
  2. Scikit-learn (sklearn) and scipy → for external loss function and for scaling
  3. seaborn → for heatmap visualization
  4. skopt → an optimization module which includes Bayesian optimization. We will use this for hyperparameter tuning during training.
  5. TensorFlow → an open-source library for deep learning. We used the latest version, tensorflow-beta1.
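
As a rough sketch, the import block implied by this stack looks like the following (the exact list in the repository may differ):

import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import MinMaxScaler
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
import tensorflow as tf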

II. Data Preparation

For data privacy, only the normalized dataset is shared, so we start by loading it.

Snip of the Dataframe. Note the missing data on inflation column

Upon examining the DataFrame, we have 3 columns:
1. mutual_unit → the trust fund unit price, convertible to the currency of investment. The higher its value, the greater the return on your investment.
2. stock_exchange → the index for an aggregate of stock prices from various companies in a country.
3. inflation → the inflation rate of a given country.

These variables were chosen because they are representative of the financial state of a given country.

Note that the inflation column has a lot of empty or null values. One technique is to fill the data with assumed values:
1. Linear interpolation for the gaps between the outermost non-null indices of the dataset.
2. ARIMA-based extrapolation beyond the outermost indices. Why? The outer portions are easier to model, since they only require two forecasting runs (one backward, one forward).

We saved the file to a CSV for consumption by another programming language, R. I chose R because its auto.arima function is popular and mature compared to the Python equivalents.

We will have 2 outputs:
1. backcasted[2] values for x steps → "backcasting" means forecasting backwards, or simply reversing the time series and performing forecasting.
2. forecasted values for y steps → the usual forecasting.

The x and y values are determined by the function get_outernan, and fill_data fills the missing data with assumed values.
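
As an illustration of the idea only (not the repository's exact get_outernan/fill_data code), the interior gaps can be linearly interpolated with pandas while the leading and trailing gaps are counted for the ARIMA step, assuming the DataFrame from the previous step is named df:

import pandas as pd

def count_outer_nans(series: pd.Series):
    # Number of leading and trailing NaNs, i.e. the steps to backcast and forecast.
    n_back = series.index.get_loc(series.first_valid_index())
    n_fore = len(series) - series.index.get_loc(series.last_valid_index()) - 1
    return n_back, n_fore

# Interior gaps: linear interpolation; the outer gaps are left for ARIMA in R.
df["inflation"] = df["inflation"].interpolate(method="linear", limit_area="inside")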

In RStudio, we load the interpolated inflation dataset and perform a simple ARIMA:

  • For forecasting, regular ARIMA is done.
  • For backcasting, notice the use of the rev() function before the ts() function, which reverses the data prior to converting it into a time series.
  • The get_outernan function returned (6, 15): the number of steps to backcast and forecast, respectively.

III. Data Exploration

Now that we have prepared the data, we can start exploring it to examine data quality and to simplify the data using domain expertise.

First, let's refine our objective. Remember that we intend to forecast a single variable: mutual_unit.

Our intent is to forecast mutual_unit with the aid of the two helper variables, stock_exchange and inflation.

Let’s plot the three variables:

Line plots of the three variables

Upon examining the graphs, there may be some correlation between mutual_unit and stock_exchange. For inflation, the values pivot from an upward to a downward trend.

As an additional analysis let’s perform a timestep vs timestep correlation analysis. We’ll use a heatmap to review the correlations.

The first snippet (correl.py) produces a heatmap.
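
A minimal way to draw such a heatmap with seaborn is sketched below; the repository's correl.py may compute the correlations differently (for example, per timestep):

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations of the three columns, drawn as a heatmap.
corr = df[["mutual_unit", "stock_exchange", "inflation"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()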

We can see that mutual_unit has high levels of correlation:

Correlation of mutual_unit to other variables

Note: as of this writing, inflation appears to have a negative correlation with mutual_unit. This could be because the unit rises later to recover from increases in inflation. You could try reducing the time window of interest; it may increase model accuracy.

In our model, we need to specify the weights of the variables. This is the second part of the code (correl2). Since we are using stock_exchange and inflation as helper variables, their weights will have these effects:

  1. Higher weights indicate higher correlation with the target variable.
  2. Lower weights are the opposite. If a weight is set too high, the variable's noise will cause overfitting or make the target forecast less accurate.

To be used as the model's weights, the correlations are scaled as follows:

weight(x) = ABS(r(x)) / SUM(ABS(r))

where r(x) is the correlation coefficient of variable x and SUM(ABS(r)) is the sum of the absolute values of all the variables' correlation coefficients. Note that absolute values are taken before any other computation.

computed initial weights of the variables

Note that the three weights are roughly in the same proportion as their correlations with mutual_unit.
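
In code, that scaling could look like the sketch below (not necessarily the repository's correl2 snippet), again assuming the DataFrame is named df:

# Correlations of each column against the target, normalized by the sum of
# their absolute values to produce the initial weights.
corrs = df.corr()["mutual_unit"]
weights = corrs.abs() / corrs.abs().sum()
print(weights)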

IV. Prediction Pipeline

This is the part where we apply machine learning to make predictions! It is not magic, of course; the process is still stochastic and mathematical. It is the way it processes information that makes its predictions impressive.

The type of prediction we make is regression, where we model continuous values; in classification, we would model discrete values. In the image below, a certain metric is represented both ways: as discrete classes, the values are mild, moderate or severe, while on the right axis the same metric is represented as continuous values.

Comparison between regression and classification.

Temporal processing of datasets

Our data is a time series, so apart from its feature values there is a forward-moving time axis. While our dataset contains trend and seasonality, it is important to truncate it into a rolling window. Why? Because using the entire history may overfit the model.

In our case, a window of one month, or 4 business weeks, may suffice. The assumption is that most of the series' properties can be modeled within this window.

Note: from this point on, I will mention the functions used in the implementations below.

We also need to define the scope of the rolling window. In our case, we will assume 3 months of training data to characterize the model.

Illustration period of training scope and model scope. The right side of the illustration is the recommended dataset.
Dataset creation using rolling window. Photo credit and source from Huaiqing Wang, Diagram of building up the dataset

We’ll now split the train and test datasets:

The code above shows the creation of the train and test datasets. The result is a TimeseriesGenerator that already produces the rolling-window samples.

We split the train/test data 80%/20%. Note that the test generator has a batch size of only 1 because, as we will see later, the test set is compared one step at a time to form the validation error.

The training window size is 20.
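
A sketch of how this split can be built with Keras's TimeseriesGenerator, assuming data is the normalized NumPy array of the three columns (the repository's exact arguments may differ):

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

window = 20                      # rolling window size from the text
split = int(len(data) * 0.8)     # 80%/20% train/test split

train_gen = TimeseriesGenerator(data, data, length=window,
                                batch_size=32, end_index=split)
test_gen = TimeseriesGenerator(data, data, length=window,
                               batch_size=1, start_index=split - window)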

Model Architecture

The model is Sequential in nature, meaning the layers are stacked into a single-input/single-output graph without any branching.

Model architecture from TensorBoard

The model starts at the "dense" layer and ends at the "dense_2" layer.

  1. The 1st layer ("dense") is simply a dot product, which is equivalent to linear vector autoregression (VAR) [3], with an adjustable number of parameters for added complexity. It does not contain stochastic parameters.
  2. The 2nd layer (“dense_1”) is an additional layer to learn further nonlinearities.
  3. The next layers are:
    a. Dropout layer — a layer that randomly disconnects certain weights during training. This helps the network learn nonlinearities in logical forms while reducing overfitting.
    b. Flatten layer — a layer that reshapes the previous hidden layer's output to feed the output layer, whose dimension equals the target variable's dimension.
Function: train_evaluate()
Dropout illustration. Source: Srivastava Nitish, et. al, Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014), Journal of Machine Learning Research 15
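
For orientation, here is a hedged sketch of the described stack in Keras; the layer sizes, activation, dropout rate, and regularization strength are illustrative placeholders, not the tuned values produced by train_evaluate():

from tensorflow.keras import Sequential, layers, regularizers

model = Sequential([
    layers.Dense(16, input_shape=(11, 3)),                        # "dense": linear, VAR-like layer
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l1(1e-4)),        # "dense_1": learns nonlinearities
    layers.Dropout(0.2),                                           # randomly drops units while training
    layers.Flatten(),                                              # (timesteps, features) -> flat vector
    layers.Dense(3),                                               # "dense_2": next-step values of the 3 variables
])
model.compile(optimizer="adam", loss="mae")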

Adding temporal training techniques on the input layer

Recall that recurrent networks only model the transition from the current timestep to the next. What if we instead want an autoregressive model that characterizes the next timestep directly from several previous timesteps? (See the figure above, "Feedforward network that relates various timesteps vs the next timestep".) We will then use a stateless network, as it simplifies the model.

Another important characteristic is input masking. Recall that our input has 20 timesteps. We will not use this full shape as the input layer's shape:

  1. Using all timesteps may introduce overfitting.
  2. As we are training with a rolling window to capture the seasonality, we use input skips to capture seasonality at different scales.
  3. Timesteps further in the past have less relevance to the present value.
Masking the inputs. The timesteps highlighted as gray are masked.

With these, we only have 11 inputs at the input layer. You can try different masking combinations.

Code snip for code on masking. For full code, you can refer to train_evaluate() function.
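
As an illustration only (the actual mask indices live in train_evaluate()), masking could be done by selecting a subset of the 20 timesteps before they reach the input layer:

import numpy as np

# Hypothetical choice of 11 timesteps out of 20: denser near the present,
# sparser further back, to capture seasonality at different scales.
keep_steps = np.array([0, 2, 4, 6, 8, 10, 12, 14, 15, 17, 19])

def mask_window(window):
    # window has shape (20, n_features); keep only the unmasked timesteps.
    return window[keep_steps]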

Hyperparameter tuning

Hyperparameters are mainly the adjustable arguments when instantiating or training a model.

As an example, the gist above instantiates a Dense layer. The entries that are part of the kwargs dictionary are hyperparameters:

  1. 'input_layer' is the number of neurons created. Adjusting it directly affects the model's capacity to fit nonlinearities (underfit or overfit).
  2. 'uni_reg' applies L1 regularization to that layer. Higher values penalize large weights more strongly, reducing the risk of overfitting.

Here are the model’s hyperparameters:

The gist above is created using the skopt.space module, where you specify the ranges and categories to optimize.
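
For illustration, a search space built with skopt.space might look like the following; the names 'input_layer' and 'uni_reg' come from the example above, while the ranges and the remaining dimensions are assumptions:

from skopt.space import Real, Integer, Categorical

search_space = [
    Integer(8, 64, name="input_layer"),                       # neurons in the first Dense layer
    Real(1e-6, 1e-2, prior="log-uniform", name="uni_reg"),    # L1 regularization strength
    Real(0.1, 0.5, name="dropout_rate"),                      # dropout probability
    Categorical(["relu", "tanh"], name="activation"),         # hidden-layer activation
]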

Afterwards, we wrap the entire pipeline into a function and run the optimization command on it. This is why we created one large function:

train_evaluate(normalized_data, est_period,loss_func,param_record, **kwargs)

It contains model creation, validation process, etc. Feel free to study it.

Now, to the hyperparameter optimization command.

We optimize the parameters using Bayesian optimization with Gaussian processes. It builds a probabilistic model of the optimization space and navigates a trade-off between exploitation (sampling where the surrogate expects good values) and exploration (sampling uncertain regions to create real data points) [4].

Continuous improvement using Bayesian optimization. Source: Gilles Louppe, Manoj Kumar, Bayesian optimization with skopt (2016)
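
A hedged sketch of the optimization call with skopt's gp_minimize, assuming train_evaluate() returns the validation error to minimize and search_space is defined as above (n_calls is illustrative):

from skopt import gp_minimize
from skopt.utils import use_named_args

@use_named_args(search_space)
def objective(**params):
    # train_evaluate is assumed to return the (discounted) validation MAE.
    return train_evaluate(normalized_data, est_period, loss_func, param_record, **params)

result = gp_minimize(objective, search_space, n_calls=50, random_state=0)
print("best validation error:", result.fun)
print("best hyperparameters:", result.x)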

Test/Validation

Testing this autoregressive problem has several components:

  1. Test sets during training → these are the test values for which the features at X(t+1) are predicted.

  2. Iterative prediction for multi-step forecasting.

First, let’s remind ourselves of our target variable, mutual_unit .

In order to validate this variable, all 3 variables need to be predicted at X(t+1). The input data are shifted to the left, and the latest timestep is replaced with the predicted data. The process is iterated to produce the multi-step forecast.

Process of how the model forecasts in multi-steps. Xn is a real data point while Hn is a predicted value.

The function that does this is

validation()

The validation function performs the iterative forecast and compares the result with the corresponding test data. Note that only mutual_unit is held out and compared.
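
A simplified sketch of the iterative loop (the repository's validation() also handles scaling and the input mask, which are omitted here):

import numpy as np

def iterative_forecast(model, last_window, n_steps):
    # last_window has shape (timesteps, n_features); predict X(t+1), shift the
    # window left, append the prediction, and repeat for n_steps.
    window = last_window.copy()
    preds = []
    for _ in range(n_steps):
        next_step = model.predict(window[np.newaxis, ...])[0]
        preds.append(next_step)
        window = np.vstack([window[1:], next_step])
    return np.array(preds)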

A code snip on comparing the predicted(y_pred) and actual values(y_true).

Now, the vectors y_pred and y_true are future values. Entropy or unpredictability increases as time steps are forecasted further. Therefore we will introduce a discount factor to these values. In our application, the discount factor is 0.9.

Code to create the coefficients for multiplying into future data.

After this, the mean absolute error is computed. This will be the main validation metric as measure of the model’s performance.
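
A minimal sketch of the discounted error described above (the notebook's exact implementation may differ):

import numpy as np

def discounted_mae(y_true, y_pred, discount=0.9):
    # Later forecast steps are weighted less, since uncertainty grows with the horizon.
    weights = discount ** np.arange(len(y_true))   # 1.0, 0.9, 0.81, ...
    return np.mean(weights * np.abs(np.asarray(y_true) - np.asarray(y_pred)))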

This MAE will serve as metric for Early stopping.

Early stopping is applied to our validation metric (MAE) to mitigate the risk of overfitting. Training stops once the validation error begins to increase. However, it carries the risk of not fully reaching optimality, since the validation error is assumed to be convex: once an inflection point in the error is found, training is halted.

Early stopping illustration. Source: Shubham Jain, An Overview of Regularization Techniques in Deep Learning (with Python code) (2018),

I used a hand-coded early stopper rather than a TensorFlow API because of the customized validation computation, and because it was easier to write one than to find an appropriate API. The object is

class CustomEarlyStopper:
def __init__(self, patience=3, delta=1e-4):
...

It has the familiar parameters, patience and delta, which act as a tolerance for increases in the validation error.
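
For illustration, such an early stopper could be written as the sketch below; the notebook's CustomEarlyStopper may differ in its details:

class SimpleEarlyStopper:
    # Hand-rolled early stopping on an arbitrary validation metric.
    def __init__(self, patience=3, delta=1e-4):
        self.patience = patience        # epochs tolerated without improvement
        self.delta = delta              # minimum change that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_error):
        if val_error < self.best - self.delta:
            self.best = val_error
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience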

Let's take a look at the predicted versus actual values. Notice the predicted values have a positive offset above the actual values.

Comparison between the predicted(‘mutual_unit_forecast’) and actual(‘mutual_unit’)

Finalization

We are now preparing the model for inferential use.

  1. Using the final hyperparameters, we retrain the model until it reaches the same validation error obtained during hyperparameter tuning. These parameters can be obtained by loading the optimization object.

In the notebook, you will notice the dump() and load() functions used to make objects persistent (save them to disk). They are there so that objects can be reloaded after a system interruption.

Note: you can refactor the code so that the model trained with the best hyperparameters is itself made persistent, to avoid re-training as above.

2. After this, capture the training loss from that last optimal training run. Note that it is the training loss we will reuse, NOT the validation loss. This step can also be refactored.

3. Finally, we train the model on the entire dataset. To reduce the risk of overfitting, we take the previous training loss from the last trained model and stop training once that target training loss is reached.
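
A sketch of that final step, assuming target_loss holds the training loss captured in step 2 and full_gen is a generator over the entire dataset (both names are hypothetical):

import tensorflow as tf

class StopAtTargetLoss(tf.keras.callbacks.Callback):
    # Halt training once the training loss reaches the recorded target.
    def __init__(self, target_loss):
        super().__init__()
        self.target_loss = target_loss

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("loss", float("inf")) <= self.target_loss:
            self.model.stop_training = True

model.fit(full_gen, epochs=500, callbacks=[StopAtTargetLoss(target_loss)])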

Final Remarks

Thank you for reading; I hope you learned the basics of time series forecasting, from data preparation to modelling. I had as much fun sharing this experience as I had learning, discovering and applying deep learning to simple time series forecasting.

References

[1] Autoregressive model, Wikipedia

[2] Rob J Hyndman, Backcasting in R (2014)

[3] Vector autoregression, Wikipedia

[4] Gilles Louppe, Manoj Kumar, Bayesian optimization with skopt (2016)


I am an Electronics Engineer turned Data Scientist and AI enthusiast, passionate about data-driven solutions and advancing artificial intelligence.