Global deep learning for joint time series forecasting

A few words on the hottest models in the field

Gabriele Orlandi
Towards Data Science


Photo by Wim van 't Einde on Unsplash

Machine Learning is a notoriously intricate field practised by academics and industries alike, constantly improving on its benchmarks and spawning interesting ideas and problem-solving approaches.

It has been deployed successfully in countless practical applications across many different fields, even before a proper theory explaining why it works has been developed.

For this reason, it can sometimes be a bit hard to keep up with the latest architectures; in this article, we are exploring the most recent successes in the field of time series forecasting, a class of prediction problems with its own particular status due to the time dimension.

More precisely, we’ll consider the so-called global models: architectures built to detect patterns across many related time series at once, learning a single representation capable of explaining and forecasting each series individually.

Everything, everywhere, all at once

A predictive model is called global when it is trained on many different datasets, each being the random outcome of its own stochastic process.

Models of this kind are becoming more common every day. They leverage the bounty of data that we nowadays have in many fields, and they tackle problems where we need particularised predictions for individual small-scale datasets, whereas previously we could only hope to predict at a higher, aggregate level.

An obvious advantage of this approach is that the model will benefit from having much more training data, which in this era of big data can amount to an increase of many orders of magnitude.

Moreover, if a generic pattern we wish to learn is present only in a subset of the datasets, while in others it just hasn’t happened yet, a global model can learn it once and apply it to all datasets.

On the other hand, if datasets come from different generating processes, by definition there must be at least some differences in their structures and their patterns, and it is therefore not always desirable for a model to just ignore these and predict everything homogeneously.

For example, sales data from different stores of the same company can be regarded as similar, because many of the variables that contribute to the data-generating processes are the same throughout (same products, prices, marketing…).

Nonetheless, many other variables (location, local customer habits…) differentiate each series in a way that can be hard for a model to detect, and that’s why we need to encode these variables as its input.

For this reason, global models are predisposed to process information about how datasets are similar and how they differ from one another. We usually convey this information by appending labels to datasets, one for each characteristic we want to trace; some other times, we just instruct the model to automatically cluster datasets accordingly.
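As a minimal sketch of what "appending labels" can look like in practice, the snippet below one-hot encodes two static labels per series and pairs them with every observation of that series. The store names, label keys and helper functions are all hypothetical and only serve as illustration; real models often use learned embeddings instead of one-hot vectors.

```python
# Hypothetical store-level sales series, each tagged with static labels.
series = {
    "store_a": {"region": "north", "format": "mall", "values": [12.0, 15.0, 14.0]},
    "store_b": {"region": "south", "format": "street", "values": [7.0, 9.0, 8.0]},
    "store_c": {"region": "north", "format": "street", "values": [20.0, 18.0, 21.0]},
}


def build_vocab(all_series, key):
    """Map each distinct label value to an integer index (sorted for determinism)."""
    labels = sorted({s[key] for s in all_series.values()})
    return {label: i for i, label in enumerate(labels)}


def one_hot(index, size):
    """Return a list of `size` floats with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


region_vocab = build_vocab(series, "region")
format_vocab = build_vocab(series, "format")


def static_features(name):
    """Concatenate the one-hot encodings of all static labels of one series."""
    s = series[name]
    return (one_hot(region_vocab[s["region"]], len(region_vocab))
            + one_hot(format_vocab[s["format"]], len(format_vocab)))


def training_rows(name):
    """Pair every observation with the (time-invariant) static feature vector."""
    feats = static_features(name)
    return [(feats, y) for y in series[name]["values"]]


rows = training_rows("store_b")
```

Every row of a given series carries the same static vector, which is what lets a single model tell the series apart while still sharing all its weights.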

The catch here is that the generating processes, and hence the resulting outcomes, must share enough information that the model is able to take advantage of cross-learning them all together.

The choice of which datasets to regard as related can be somewhat arbitrary, since we often don’t know in advance whether we can achieve this cross-learning advantage, and lumping together data that is too diverse can be damaging because it introduces too much variance and noise.

To sum up, the two most important prerequisites for training a global model are:

  • to have many related datasets to predict, all coming from similar processes and showing similar patterns;
  • to know a great deal about them, specifically in how they are similar and how they differ.

A few interesting models

Time series forecasting is a particularly well-suited problem for global models [1][2] since it’s not uncommon to have many related time series, often in a fixed relational structure: data from customers of a company, sensors in a system network, traffic at different locations…

At the same time, neural networks have proven to be ideal global models for their ability to handle any type of covariate feature and their overall penchant for huge complex datasets.

It’s therefore not surprising that the combination of deep learning, time series forecasting and big data has found such a fertile ground in recent years in academic and industry research alike.

Without pretence of completeness, here are some of the more interesting results of this astral conjunction.

Please note that results and benchmarking for each of these models can be found in the referenced papers; we are not so much interested in this aspect, but rather in the novelty and cleverness of any one of these approaches.

DeepAR

DeepAR [3] is a Deep Learning model based on recurrent networks, devoted to learning an autoregressive representation of the target time series.

More specifically, it is a multi-layer network of LSTM cells, with an encoder-decoder structure that serves the purpose of summarising at each training step the informational output of the cells in its past (called conditioning range) and making it available for predicting its future (prediction range).

It is a probabilistic model, which means that it outputs the parameters of a likelihood function (whose shape can be specified by the user) representing the probability distribution of the forecast values, given those same parameters.

Predicting a likelihood is a clever trick that allows for great flexibility: from such a function we can draw samples, generate quantile predictions and prediction intervals, and even simulate Monte Carlo sample paths that reach any number of steps into the future.
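To make this concrete, here is a minimal sketch of how sample paths and quantiles can be obtained once a model returns likelihood parameters. The function `predict_params` is a hypothetical stand-in for a trained network (it is not DeepAR), and the Gaussian likelihood is just one possible choice.

```python
import numpy as np


def predict_params(history):
    """Hypothetical stand-in for a trained network.

    Returns the Gaussian parameters (mu, sigma) of the next value,
    here simply derived from the mean of the last three observations.
    """
    recent = np.asarray(history[-3:], dtype=float)
    mu = recent.mean()
    sigma = 0.1 * abs(mu) + 0.5  # always strictly positive
    return mu, sigma


def sample_paths(history, horizon, n_paths, seed=0):
    """Draw sample paths by feeding each sampled value back as input."""
    rng = np.random.default_rng(seed)
    paths = np.empty((n_paths, horizon))
    for p in range(n_paths):
        window = list(history)
        for h in range(horizon):
            mu, sigma = predict_params(window)
            draw = rng.normal(mu, sigma)
            paths[p, h] = draw
            window.append(draw)  # the sample becomes part of the "past"
    return paths


history = [10.0, 12.0, 11.0, 13.0]
paths = sample_paths(history, horizon=5, n_paths=500)

# Per-step quantiles across all sampled paths: each has shape (horizon,).
q10, q50, q90 = np.quantile(paths, [0.1, 0.5, 0.9], axis=0)
```

The band between `q10` and `q90` is an 80% prediction interval for each future step; any other quantile can be read off the same set of paths.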

Schema of training (left) and inference (right) in DeepAR [3].

DeepAR is of course equipped for dynamic covariates, such as calendar features, and static ones: these are the very labels that we have to attach to each series if we want to train a single global model.

Target series and covariates [3].

For more details on the inner workings of this model, such as the clever way it handles series having different scales, we refer to the original paper [3].
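As a rough illustration of the scaling problem, the sketch below divides each series by a per-series factor before it reaches the network and multiplies the predicted parameters back afterwards. The specific factor used here (one plus the mean of the conditioning range, for non-negative series) and the helper names are our assumptions for illustration; see [3] for the actual scheme.

```python
import numpy as np


def scale_factor(conditioning_values):
    """Assumed heuristic: 1 + mean of the conditioning range.

    The +1 keeps the factor away from zero for series that are mostly zeros.
    Intended for non-negative series such as sales counts.
    """
    return 1.0 + float(np.mean(conditioning_values))


def to_model_scale(values, nu):
    """Bring a series to a common scale before feeding it to the network."""
    return np.asarray(values, dtype=float) / nu


def to_data_scale(mu, sigma, nu):
    """Map Gaussian parameters predicted on the common scale back to the data scale."""
    return mu * nu, sigma * nu


small = np.array([0.0, 1.0, 2.0, 1.0])
large = np.array([900.0, 1100.0, 1000.0, 1000.0])

nu_small, nu_large = scale_factor(small), scale_factor(large)
scaled_small = to_model_scale(small, nu_small)
scaled_large = to_model_scale(large, nu_large)
```

After scaling, both series take values of order one, so a single set of network weights can serve them both.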

StemGNN

Spectral Temporal Graph Neural Network [4] is an innovative and clever design for a global model that jointly learns inter-series and intra-series correlations.

In short, the idea is to use Fourier transforms on both dimensions, across time and across the series, in order to get representations that can then be learned by other neural network blocks such as convolutional and sequential modules.

While performing Fourier transforms on a series in the time domain is a classic and straightforward technique, we need to understand what it even means to do so across all series, at a fixed time.

Indeed, the first layer of StemGNN, called the latent correlation layer, takes the whole dataset and returns a graph, where nodes are time series and weighted edges are correlation relationships between them, inferred by the layer itself.

This layer is made of a GRU [9] and a self-attention mechanism, but it can also be removed if domain knowledge suggests a particular graph representation to be used instead; for example, one could link the series using the same labels that would be passed to other models as static covariates.

After this, the model applies several identical blocks in sequence. Each block performs Fourier transforms on the graph representation of the dataset, both in the graph domain (GFT) and in the time domain (DFT) of each series (which is a node in the graph); convolutional blocks then learn the information that these transformations exposed, before the transforms are inverted to return to the original domains.

The rather complex structure of StemGNN [4].

These blocks are the main novel part of this architecture, and they are aptly called StemGNN blocks. Their inner workings are actually very complex, so we refer to the paper [4] for details.

For our purposes, it suffices to say that applying spectral graph convolution after DFT and GFT allows us to jointly learn patterns that arise in the spectral representation of each series as well as in the spectral matrix representation of the correlation graph.
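To give an intuition of what a Fourier transform "across series" means, here is a minimal sketch of the two transforms and their inverses. It assumes the usual spectral graph construction (eigenvectors of the normalized graph Laplacian as the Fourier basis) and builds the graph from absolute correlations, whereas in StemGNN the graph is learned; the learnable convolutions are omitted entirely.

```python
import numpy as np


def graph_fourier_basis(W):
    """Eigen-decomposition of the normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.

    The columns of U play the role of Fourier modes on the graph.
    """
    W = (W + W.T) / 2.0                      # symmetrize so that eigh applies
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, U = np.linalg.eigh(L)
    return eigvals, U


rng = np.random.default_rng(0)
n_series, n_steps = 4, 16
X = rng.normal(size=(n_series, n_steps))     # rows are series, columns are time

# Hypothetical graph: edge weights are absolute correlations between series.
W = np.abs(np.corrcoef(X))
np.fill_diagonal(W, 0.0)

eigvals, U = graph_fourier_basis(W)

X_gft = U.T @ X                              # GFT: mixes information across series
X_spec = np.fft.rfft(X_gft, axis=1)          # DFT: along time, for each graph mode

# (A StemGNN block would apply learnable operations to X_spec here.)

# Inverting both transforms brings us back to the original domain.
X_back = U @ np.fft.irfft(X_spec, n=n_steps, axis=1)
round_trip_ok = np.allclose(X_back, X)
```

Since both transforms are invertible, no information is lost; they only re-express the data in a basis where cross-series and periodic patterns are easier to pick up.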

Finally, StemGNN blocks are applied in a residual fashion so that each block can go deeper into the patterns that the previous ones have left behind. This is a powerful technique, also prominently used in [4].

DSSM

Deep State Space Models [5] are a novel idea that combines Deep Learning and state space models, to leverage the advantages of both approaches.

State space models (SSMs) [7][8] are a family of classical statistical models for forecasting, including the well-known ARIMA and exponential smoothing methods.

They are particularly suitable when a time series is well-characterized, due to the possibility of making structural assumptions in the model and crafting it by selecting its components and features.

This tailored approach has its problems, though, in that it requires well-known series with enough history, and it scales poorly to the number of series that we nowadays often have. Moreover, as a per-series approach, it is simply unable to transfer learning across series.

The idea of DSSM’s authors is then to jointly learn SSM parameters through a recurrent neural network ingesting all series and their covariates: the term “jointly” here refers to the fact that, although each series will have a different SSM with its own parameters, the parameters of the neural network are shared across all series throughout the training process.

In other words, a single RNN trained across all series is learning to assign different SSM parameters to each series, as a meta-learning task.

Models like this, combining machine learning and classical statistical techniques, are called hybrid models in this context.

Operationally, at training, an LSTM-like network outputs SSM parameters which are fed to a likelihood function that depends on the particular SSM model of that series, together with known observations; the likelihood is then maximised to update the LSTM parameters.
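The likelihood in question can be computed exactly with a Kalman filter when the SSM is linear and Gaussian. The sketch below does so for the simplest such model, a local-level SSM, with per-step parameters (g, sigma) that a network would normally output; here they are fixed numbers chosen by hand, and the function name is hypothetical.

```python
import math


def local_level_loglik(z, params, mu0, var0):
    """Log-likelihood of observations z under a local-level SSM.

    Model: level_t = level_{t-1} + g_t * eps_t,   z_t = level_t + sigma_t * eta_t,
    with eps_t, eta_t standard normal and level_0 ~ N(mu0, var0).
    `params` holds one (g_t, sigma_t) pair per observation.
    Returns the log-likelihood and the filtered mean and variance of the last state.
    """
    mean, var = mu0, var0
    loglik = 0.0
    for obs, (g, sigma) in zip(z, params):
        var_pred = var + g ** 2             # predict: the level diffuses
        s = var_pred + sigma ** 2           # variance of the predicted observation
        resid = obs - mean
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + resid ** 2 / s)
        gain = var_pred / s                 # update: Kalman gain
        mean = mean + gain * resid
        var = (1.0 - gain) * var_pred
    return loglik, mean, var


z = [5.1, 4.9, 5.3, 5.0]
params = [(0.1, 0.5)] * len(z)

# A sensible prior on the initial level explains the data better than a poor one.
good, last_mean, last_var = local_level_loglik(z, params, mu0=5.0, var0=1.0)
bad, _, _ = local_level_loglik(z, params, mu0=0.0, var0=1.0)
```

In a DSSM-style setup, the negative of this quantity would serve as the loss, and its gradient would flow back through `params` into the network weights; the sketch only evaluates it for two hand-picked settings.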

Training in DSSM [5].

Analogously, at inference, DSSM computes the probability distribution of the SSM’s state at the last training step, by using the known values of the series over the training range and the parameters learned for it; this distribution represents the model’s knowledge, and its best estimate, of the state we are in. Starting from this distribution, the model then unrolls any number of predictions in a Monte Carlo approach, by using the SSM recursively with the parameters given by the RNN at inference.
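A minimal sketch of this Monte Carlo unrolling follows, again for a local-level SSM. The state distribution at the last training step is assumed to be Gaussian with a given mean and variance, and `future_params` stands in for what the RNN would output over the prediction range; all names and numbers are hypothetical.

```python
import random


def sample_forecasts(mean, var, future_params, n_paths=1000, seed=0):
    """Unroll a local-level SSM forward from a Gaussian state distribution.

    mean, var      : distribution of the level at the last training step
    future_params  : one (g, sigma) pair per future step
    Returns a list of n_paths sampled trajectories.
    """
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        level = rng.gauss(mean, var ** 0.5)          # sample the starting state
        path = []
        for g, sigma in future_params:
            level = level + g * rng.gauss(0.0, 1.0)  # state transition
            path.append(level + sigma * rng.gauss(0.0, 1.0))  # noisy observation
        paths.append(path)
    return paths


future_params = [(0.3, 0.5), (0.3, 0.5), (0.3, 0.5)]
paths = sample_forecasts(mean=5.0, var=0.2, future_params=future_params)

# Point forecast per step: the average over all sampled paths.
point_forecast = [sum(p[h] for p in paths) / len(paths) for h in range(3)]
```

Quantiles and prediction intervals can be read off the same set of paths, exactly as with any other sample-based probabilistic forecast.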

Inference in DSSM [5].

DeepGLO

Finally, another interesting hybrid model: DeepGLO [6], as the authors put it, is “a deep forecasting model which thinks globally and acts locally”.

It is a combination of a classical matrix factorization model, a temporal convolutional network (TCN) that regularizes it, and another, independent local TCN acting on each series and on the output of the first model.

The feature that sets DeepGLO apart from many competitors is using all time series together not only at training but also at inference.

To quote the authors again, “For instance, in stock market predictions, it might be beneficial to look at the past values of Alphabet, Amazon’s stock prices as well, while predicting the stock price of Apple. Similarly, in retail demand forecasting, past values of similar items can be leveraged while predicting the future for a certain item”.

The classical part of the algorithm, the matrix factorization model, consists of taking all time series as a single matrix and decomposing it as the product of two matrices, called factors.

By the rules of matrix multiplication, each factor shares one dimension with the original matrix: if the original matrix Y has shape (n, t + tau) (the time dimension is already split into training and test periods), then the factors F and X have shapes (n, k) and (k, t + tau) respectively, where k is some number that is usually much smaller than n.

See the following picture for a clearer depiction of decomposition dimensions.

Matrix factorization: in this schematic example we have n time series of length t + tau, where t is the training set period. As a whole, the resulting decomposition is Y = FX [6].

Effectively, X can be seen as encoding global information in k basis series which are as long as the original ones, and F contains coefficients that give the original series as linear combinations of the basis ones.
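The shapes and roles of the two factors can be checked with a small numerical experiment. The sketch below uses a truncated SVD to obtain a rank-k factorization of synthetic data; this is a swapped-in technique chosen for brevity, since DeepGLO learns its factors by optimisation together with the network rather than by SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, k = 20, 50, 3          # n series of length T, k basis series
time = np.arange(T)

# Synthetic data: every series is a noisy mix of the same three patterns.
basis_true = np.vstack([
    np.sin(2 * np.pi * time / 12),
    np.cos(2 * np.pi * time / 7),
    time / T,
])
Y = rng.normal(size=(n, k)) @ basis_true + 0.01 * rng.normal(size=(n, T))

# Rank-k factorization Y ~ F @ X from the truncated SVD.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
F = U[:, :k] * s[:k]         # (n, k): coefficients of each series
X = Vt[:k]                   # (k, T): basis series, as long as the originals

rel_err = np.linalg.norm(Y - F @ X) / np.linalg.norm(Y)
```

With only k = 3 basis series, the reconstruction error is tiny, because the twenty series really do share the same underlying patterns; this is the situation in which a global factorization pays off.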

The hybridization of this model with a convolutional network is rather complex, and we refer readers interested in the details to [6]; in short, the network is used as a regularizer, meaning that it encourages the factorization process to produce basis series that are close to what the network would predict.

How basis series get combined to form predictions for the actual target in a factorization model (a), and how global predictions from this model are combined with local series in the final TCN layer of the model [6].

The factorization and the convolutional network are trained jointly in alternate steps; at inference, the network predicts future values of the basis series by starting from those given by the factorization, and multiplying them with the coefficients matrix gives the final global predictions.
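The inference step can be sketched in a few lines: forecast the k basis series, then multiply by the coefficient matrix. In the snippet below a seasonal-naive forecaster is a hypothetical stand-in for the trained network, and the factors are synthetic.

```python
import numpy as np


def naive_seasonal_forecast(X_past, horizon, period):
    """Hypothetical stand-in for the network: repeat the last full period."""
    last = X_past[:, -period:]
    reps = -(-horizon // period)                 # ceiling division
    return np.tile(last, (1, reps))[:, :horizon]


rng = np.random.default_rng(1)
n, k, t, tau, period = 6, 2, 48, 12, 12
steps = np.arange(t)

# Synthetic factors: two periodic basis series and random coefficients.
X_past = np.vstack([
    np.sin(2 * np.pi * steps / period),
    np.cos(2 * np.pi * steps / period),
])                                               # (k, t)
F = rng.normal(size=(n, k))                      # (n, k)

X_future = naive_seasonal_forecast(X_past, tau, period)   # (k, tau)
Y_global = F @ X_future                          # (n, tau): global predictions
```

Note that only k series are forecast, however large n is; the n global predictions then come out of a single matrix product.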

Finally, another local per-series neural network acts by taking in as input past values of the series, global predictions from the previous models, and local covariate series.

Conclusion

There is little to no doubt that global models will take a bigger and bigger share of the forecasting scene, as we are witnessing an increase in data, knowledge of neural networks and business applications that doesn’t seem to be ending soon.

As you can see, these models are incredibly diverse, especially in forecasting: although all of them have neural networks at their core, each one brings its own bag of tricks, mainly focused on capturing interesting information and exposing it to the networks themselves.

Special mention for the hybrid models, which not only manage the coexistence of two learning engines but also their advantageous codependency, in order to get a result that is greater than the sum of its parts.

There is debate around which is fundamentally better: global versus local [1][2], and classical versus hybrid versus pure deep learning [10][11][12].

It could be argued that it all depends on the information we have about both the data-generating process and its actual realization (the time series); the first should guide us in picking the best class of functions for the problem, i.e. the algorithm, and the second should help it to find the best function within the class (the training outcome).

As usual, the quantity and quality of prior information should correlate with the complexity and specificity of the model.
