This year at the Neural Information Processing Systems (NeurIPS) conference, authors published a number of new papers on time series forecasting and classification. Here I will briefly review their major contributions, as well as discuss their implementations and our timeline for porting them to flow-forecast, our deep learning framework for time series forecasting.
Background
As the creator/maintainer of an open-source framework, my core contributors and I constantly have to weigh the time needed to add new models and methods against the benefits for our end users. Flow forecast serves both businesses and researchers, so we have a seemingly contradictory mission: we want to rapidly add the latest state-of-the-art deep learning research for time series forecasting/classification while simultaneously providing stability, ease of use, interpretability, robustness, and reliability to our end users (who are often unfamiliar with the latest research or how to effectively leverage it for their business problems). In other words, we want to constantly incorporate the most complex techniques while keeping our framework easy to use.
Therefore, deciding which papers to port to flow forecast while balancing other priorities is difficult. Here I break down the NeurIPS papers we are considering integrating. Of course, if any readers have time and want to contribute a port of one of these papers, it would be greatly appreciated. Please also note that these ratings are not an overall indication of how good a paper is. Rather, they evaluate how well each paper would fit into our framework based on its performance (including which datasets it was tested on), complexity to port, relevance to our users' use cases, and speed.
Benchmarking Deep Learning Interpretability in Time Series Predictions
Video Link (unfortunately, I believe you can only see the video links if you registered for the conference)
Summary: This is an interesting paper that discusses common flaws with saliency methods for deep learning time series models. The authors describe two major problems: saliency methods often break down when importance must be attributed across multiple time steps, and model architecture plays a big role in the quality of the resulting maps. To address these problems, the authors propose a method called Temporal Saliency Rescaling (TSR). TSR operates as follows:
(a) we first calculate the time-relevance score for each time step by computing the total change in saliency values if that time step is masked; then (b) in each time step whose time-relevance score is above a certain threshold, we calculate the feature-relevance score for each feature by computing the total change in saliency values if that feature is masked. The final (time, feature) importance score is the product of the associated time and feature relevance scores.
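To make the two-stage procedure concrete, here is a minimal sketch of TSR as I read it. `saliency_fn` is a stand-in for any Captum-style attribution method, and masking-by-zeros plus the relative threshold are my assumptions, not necessarily the authors' exact implementation:

```python
import torch

def temporal_saliency_rescaling(x, saliency_fn, mask_value=0.0, threshold=0.5):
    """Sketch of TSR. x: (time, features) input window; saliency_fn maps an
    input window to a (time, features) tensor of saliency values."""
    n_time, n_feat = x.shape
    base = saliency_fn(x)

    # (a) time-relevance: total change in saliency when a whole time step is masked
    time_scores = torch.zeros(n_time)
    for t in range(n_time):
        masked = x.clone()
        masked[t, :] = mask_value
        time_scores[t] = (base - saliency_fn(masked)).abs().sum()

    # (b) feature-relevance, computed only for time steps above the threshold
    importance = torch.zeros(n_time, n_feat)
    cutoff = threshold * time_scores.max()
    for t in range(n_time):
        if time_scores[t] < cutoff:
            continue
        for f in range(n_feat):
            masked = x.clone()
            masked[t, f] = mask_value
            feat_score = (base - saliency_fn(masked)).abs().sum()
            # final (time, feature) score = time relevance * feature relevance
            importance[t, f] = time_scores[t] * feat_score
    return importance
```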
Code/Quality: The authors provide an implementation of their code with the paper. The code for the most part seems to be built around Captum, a PyTorch-based framework for interpreting DL models, which is good. But it will definitely require a fair amount of refactoring/stylistic changes. I think it could be done in around two weeks of focused work and testing. Our framework already includes SHAP, but incorporating Captum along with their methods could also provide a great extension.
Relevance for our users: Interpretability, or the lack thereof, remains probably the most common criticism I hear of DL models for time series compared to more classical models. I can't tell you how many times I've heard the (IMO ignorant) line, "we would use deep learning, but we need to be able to explain our decisions to stakeholders. We can't have a black box…" Therefore, any method that increases interpretability is great. Similarly, on the research side of things, I think finding better visualizations of models is a budding area.
Performance on Datasets: The authors use synthetic datasets, where the important features are known in advance, to evaluate the quality of the saliency maps. In addition, they try their TSR method on several real-world datasets, such as fMRI classification data (a sequence of fMRI images). They find TSR performs better than the vanilla methods at producing interpretable/accurate saliency maps.
Final Verdict: Including better interpretability methods is a major focus area for our project. Making it easier to explain model decisions to third parties, and even allowing ML engineers themselves to debug models better, is a pressing problem, and I think this paper is a step in the right direction. I do wish it weren't limited in scope to classification, as we are primarily a forecasting repository, so I'm not sure how well their techniques will generalize to forecasting problems. That said, I think porting this over will help our framework, so at the end of the day I would say this is a high priority.
Adversarial Sparse Transformer for Time Series Forecasting
Video Link (none)
Summary: The paper addresses the error-accumulation problem in multi-step forecasting (i.e., when we append the model's own output to the real values and use it to forecast subsequent time steps). It also addresses creating more diverse forecasts that span multiple ranges of values. To address these problems, the authors propose using GANs; it is one of the first papers I have seen that uses GANs for forecasting. Here the GAN serves as a method of regularizing multi-step time series predictions. It works in conjunction with a sparse attention mechanism that uses an entmax activation function instead of softmax. This allows the network to better learn long-range dependencies across time steps and, in particular, which steps aren't important at all.
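To give a feel for what sparse attention buys you, below is a from-scratch sparsemax (the α = 2 special case of entmax) dropped into standard scaled dot-product attention. This is my own illustrative sketch, not the paper's code; in practice you would likely reach for the `entmax` package rather than a hand-rolled version:

```python
import torch

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Sparsemax over the last dimension: like softmax, but it can assign
    exactly zero weight to irrelevant positions (time steps)."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    cssv = z_sorted.cumsum(dim=-1) - 1.0
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    support = k * z_sorted > cssv                  # positions kept in the support
    k_support = support.sum(dim=-1, keepdim=True)
    tau = cssv.gather(-1, k_support - 1) / k_support.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)

def sparse_attention(q, k, v):
    """Scaled dot-product attention with sparsemax in place of softmax."""
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
    weights = sparsemax(scores)                    # many weights are exactly 0
    return weights @ v
```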
Code/Style: There is an implementation of the paper located here; I'm not sure whether it is the official implementation. The code quality appears decent at first glance. That said, I still think fully porting the model to our framework would be difficult at the moment. For one thing, we would have to modify the main training module in flow forecast to work with GAN-like architectures. Currently, our training module assumes we have a single model and loss function. Supporting this would require either adding if/else blocks or creating a function that maps model type to a training loop (if we envision many models with multiple losses); see the sketch below.
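Here is roughly what I mean by mapping model type to a training loop. The function and dictionary names are hypothetical, not flow forecast's actual API:

```python
import torch

def train_step_single(model, batch, opt, loss_fn):
    """Existing style: one model, one loss."""
    x, y = batch
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    return loss.item()

def train_step_gan(gen, disc, batch, g_opt, d_opt, adv_loss):
    """GAN style: alternate a discriminator step and a generator step."""
    x, y = batch
    # discriminator: real forecasts vs. generated forecasts
    d_opt.zero_grad()
    fake = gen(x).detach()
    real_pred, fake_pred = disc(y), disc(fake)
    d_loss = adv_loss(real_pred, torch.ones_like(real_pred)) + \
             adv_loss(fake_pred, torch.zeros_like(fake_pred))
    d_loss.backward()
    d_opt.step()
    # generator: try to fool the discriminator
    g_opt.zero_grad()
    fake_pred = disc(gen(x))
    g_loss = adv_loss(fake_pred, torch.ones_like(fake_pred))
    g_loss.backward()
    g_opt.step()
    return g_loss.item(), d_loss.item()

# dispatch on model type instead of if/else blocks in one big loop
TRAINING_STEPS = {"default": train_step_single, "gan": train_step_gan}
```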
Relevance for our users: Most of our users are interested in increasing the performance of their models; at the same time, even within the deep learning space, simpler models with fewer hyper-parameters are preferred. Given the difficulty of getting GANs to converge and the fact that many corporate data scientists likely have very limited experience with GANs, I don't see many using this model. Even within the ML time series research community, I don't see much work directly building on the GAN architecture, except perhaps reusing the sparse attention mechanism (though I have been wrong before).
Performance on Datasets: The model performs better than the Convolutional Transformer (which, BTW, we have implemented in flow forecast) as well as several other models on the traffic and electricity datasets. My major gripe here is that the authors don't directly compare performance with the Temporal Fusion Transformer (TFT). When you do compare their model with TFT, the improvements are in many cases negligible: on traffic, TFT achieves 0.095 versus their 0.093 +/- 0.01. Additionally, I don't like the overall trend of time series forecasting papers that only use these simple univariate datasets. At minimum, call your paper "AST for Univariate Time Series Forecasting," because there are no results that demonstrate broader applicability.
Final Verdict: Adding the training code, I think, would be very difficult, as I would have to modify several modules in flow forecast. Additionally, GANs are notoriously difficult to train, and the performance gains seem marginal. That said, I really do like the idea of utilizing sparse attention and of addressing the compounding error of multi-step predictions. I might implement sparse attention on its own and let people enable it as a swappable parameter. Altogether, though, I'd say adding the full model is relatively low priority for our team. That doesn't mean I wouldn't like it added to the repository, however; if anyone is interested in porting it, let me know!
Probabilistic Time Series Forecasting with Structured Shape and Temporal Diversity
Summary: This paper addresses a dilemma: common time series loss functions (MSE, RMSE, etc.) can produce accurate point predictions but fail to gauge uncertainty, causing the model to fall apart when the distribution shifts; on the other hand, loss functions such as Gaussian or quantile loss often fail to provide sharp enough predictions. To rectify this problem, the authors propose STRIPE. STRIPE is essentially able to produce a diverse set of potential forecasts, each of which has the added benefit of being sharper (i.e., you don't just get a huge confidence interval). This paper is by the same author as the DilateLoss function, which is already in our flow forecast library.
Performance on datasets: The authors compare STRIPE against a variational autoencoder and DeepAR. Compared with these, the model wrapper does seem to produce more diverse forecasts. It additionally performs better in terms of DilateLoss and MSE than other model-wrapping techniques on synthetic data and the traffic/electricity datasets. However, it is not clear whether these techniques would increase performance when paired with the state-of-the-art models on these datasets.
Relevance for our users: We already have several probabilistic methods. The main benefit of this one would seem to be a better range of forecast candidates. Depending on how the model performs on real-world multivariate datasets, I could see it being either very useful or not useful at all.
Current Implementation/Code Quality: There is code provided on the website, and it appears fairly similar in style to the DilateLoss code. If I remember correctly, the DilateLoss function took me around 3–4 days to port. This is a bit trickier, however, as it isn't a loss function but essentially a model wrapper. Realistically, I don't think I could port it in under one to two weeks.
Final Verdict: I think this is an interesting approach, and I like that it is in theory model-agnostic (though I need to explore whether it is easy to wrap our existing models with it). At first glance it appears to be built around an encoder-decoder approach, so I'm not sure how it will interact with our existing models. The implementation looks fairly clean to work with, though. I'd rate this overall as medium priority: nice to have, but I don't know how much it would truly help our users.
Summary: This paper describes learning causal relations in multivariate time series in the presence of unobserved variables (latent confounders); previous methods tend to recover incorrect links when such confounders are present.
Deep reconstruction of strange attractors from time series
Summary: This is an area I'm admittedly not that familiar with, but I'll do my best to summarize what I think the author is describing. Basically, some complex systems evolve in many dimensions, yet we often have only a 1-D measurement of them; for instance, an ECG is a low-dimensional measurement of the heart beating. To solve this problem, the author proposes using autoencoders to create an effective multi-dimensional representation of the system. This higher-dimensional representation should, hopefully, roughly mimic the actual system.
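My rough mental model of the setup, sketched below: a classical time-delay (Takens) embedding turns the 1-D signal into overlapping windows, and an autoencoder then learns low-dimensional latent coordinates that play the role of the reconstructed attractor. This is a simplification under my own assumptions; the paper's specific architecture and its regularizer are omitted:

```python
import torch
import torch.nn as nn

def delay_embed(x: torch.Tensor, dim: int, lag: int = 1) -> torch.Tensor:
    """Takens-style delay embedding: a 1-D series becomes (N, dim) windows."""
    n = len(x) - (dim - 1) * lag
    return torch.stack([x[i * lag : i * lag + n] for i in range(dim)], dim=-1)

class AttractorAE(nn.Module):
    """Autoencoder whose low-dimensional latent space serves as the
    reconstructed attractor coordinates."""
    def __init__(self, window: int, latent: int = 3):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(window, 64), nn.ReLU(), nn.Linear(64, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, window))

    def forward(self, w):
        z = self.enc(w)            # attractor coordinates
        return self.dec(z), z

# usage: windows = delay_embed(signal, dim=10); recon, coords = AttractorAE(10)(windows)
```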
Performance on Datasets: The author evaluates the model by visualizing the learned embeddings of well-known attractors and comparing them against the true attractors. They also find that their regularization method performs better at forecasting noisy time series data than non-regularized variants.
Relevance for our users: There isn't much immediate impact that I can see for our business users; at least out of the box, I don't think it would improve model performance without significant tweaks. However, it could be good for our researcher users, who might use it to eventually create more powerful models.
Current Implementation/Code Quality:
Final Verdict: This is interesting work in the broader category of time series analysis. That said, I don't really know how much it could directly enhance our forecasting methods. It could potentially be useful for generating embeddings of time series, which you could then use to find similar series for transfer learning, and it could serve as another tool in our visualization toolbox. The type of regularization the author proposes could also be useful to pair with our models.
Spectral Temporal Graph Neural Network for Multivariate Time-series Forecasting
Summary: This paper follows a fairly new trend of using graph neural networks to model the interactions between multivariate time series. The Spectral Temporal Graph Neural Network (StemGNN) claims to be the first model to accurately learn intra-series and inter-series correlations jointly in the spectral domain. A major component of the model is its use of the Graph Fourier transform to capture inter-series correlations. The model also seems good at learning spatial dependencies and correlations without a pre-defined topology. For instance, on COVID data the model does well at learning inter-country correlations; this also seems to hold on the traffic and electricity forecasting datasets.
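For readers unfamiliar with the spectral piece: the Graph Fourier transform projects the per-series signals onto the eigenvectors of the normalized graph Laplacian, so inter-series correlations become structure in the spectral domain. Below is a minimal sketch of the generic GFT; it is not StemGNN's full block, and in their case the adjacency matrix is itself learned rather than pre-defined:

```python
import torch

def graph_fourier_transform(x, adj):
    """x: (num_series, T) signals; adj: (num_series, num_series) edge weights.
    Projects each time step onto the eigenbasis of the normalized Laplacian."""
    deg = adj.sum(dim=-1)
    d_inv_sqrt = torch.diag(deg.clamp(min=1e-8).rsqrt())
    lap = torch.eye(adj.size(0)) - d_inv_sqrt @ adj @ d_inv_sqrt  # normalized Laplacian
    _, eigvecs = torch.linalg.eigh(lap)       # symmetric, so a real eigenbasis
    return eigvecs.T @ x, eigvecs             # GFT of x (inverse: eigvecs @ x_hat)
```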
Current Implementation/Code Quality: There is not currently an implementation of this paper as far as I can tell; neither a Google search nor a look at the authors' GitHub turned up anything.
Relevance to our users: Models that perform well on complex real-world datasets are always valuable to our end users. The fact that this one performs well at learning complex multivariate spatial-temporal dependencies makes me think it would be a great addition to our library. One question I do have is how it handles multivariate data from many locations: I believe the electricity, COVID, and other datasets they use are essentially univariate, just measured at multiple places, whereas a lot of the data we and other companies forecast looks more like (target, temporal_feat1, temporal_feat2, location).
Performance on Datasets: The model performs well on a wide range of multivariate time series datasets. Oddly, they compare results with neither the Temporal Fusion Transformer paper nor the Convolutional Transformer paper; instead, they mainly compare against other GNNs. Additionally, for reasons I don't understand, the transformer crowd of papers seems to report results in terms of quantile loss, whereas this paper reports MAPE and MSE, which makes it a pain to compare results even on the same datasets. However, based on the ablation studies, I do think the model performed well, and I'd guesstimate it would beat some of the other papers if we had metrics to compare them.
Final Verdict: This is definitely a paper I would like to add at some point. However, the lack of code and our team's own inexperience with graph convolutional models make it difficult. That said, we may need to benchmark against it for some of our own upcoming research, so we may try to implement it soon anyway. For now, I would consider it a medium priority item. I do hope to have at least a few GNN-based models in flow forecast by February or March (hopefully this will be one of them).
STLnet: Signal Temporal Logic Enforced Multivariate Recurrent Neural Networks
Summary: ML-based approaches to time series forecasting occasionally break down and give ridiculous values. For instance, when I've done river flow forecasting without a ReLU activation at the end, I frequently see the model predicting negative values (even though negative river flow is impossible). Similarly, a power plant forecasting model might predict the plant producing more power than is possible, or a traffic forecasting model might predict more traffic than can fit on the roads. To rectify this problem, the authors propose a logic-enforced neural network; a minimal sketch of the simple bounded-output case appears below.
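STLnet enforces much richer signal temporal logic specifications, but the simple non-negativity case I mention can already be handled at the output head. Here is a minimal sketch of hard-clamping forecasts to a physically valid range; the wrapper name and bounds are illustrative, not from the paper:

```python
import torch.nn as nn

class BoundedForecast(nn.Module):
    """Wrap any forecaster and clamp its outputs into a physically valid
    range, e.g. river flow >= 0 or power <= rated capacity."""
    def __init__(self, base: nn.Module, low: float = 0.0, high: float = float("inf")):
        super().__init__()
        self.base, self.low, self.high = base, low, high

    def forward(self, x):
        # hard constraint: predictions can never leave [low, high]
        return self.base(x).clamp(self.low, self.high)
```

Note that hard clamping kills the gradient outside the valid range; a softplus output or a penalty term in the loss is a smoother alternative, and STLnet's approach is far more general than this.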
Code Quality: I have not been able to find code for this model yet. I think implementing it from scratch would be challenging: the overall algorithm has several steps, and I don't think it would fit nicely into our training loop, so it would likely require its own separate training loop and class.
Performance on datasets: The authors find that their model performs better on the air quality dataset, both in terms of RMSE and in the number of times logic constraints are violated.
Relevance to our users: This is definitely relevant to our users, as real-world forecasting problems are often constrained to certain values, so training a model to follow these simple logical constraints would be very helpful. It would likely also increase model robustness, as models would spit out fewer problematic values. I especially like how the algorithm seems to operate in a model-agnostic fashion.
Final verdict: I like the constraints this paper imposes, and it demonstrates good real-world results. I definitely think it could be useful in our repository. However, unless I find an implementation, I don't think we will be adding it soon, simply due to the complexity of the algorithm.
Neural Controlled Differential Equations for Irregular Time Series
Summary: The authors introduce a neural controlled differential equation (neural CDE) model. This helps resolve an issue with neural differential equations, where the trajectory is determined entirely by the initial condition and does not learn from data that arrives afterwards. The authors use this new controlled model to effectively forecast time series with many missing values; the structure allows the model to learn much better than RNNs when many values are missing.
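Here is a minimal, hand-rolled sketch of the idea using a simple Euler discretization (my own simplification, not the authors' implementation): the hidden state z evolves as dz = f(z) dX, so new observations keep driving the dynamics through the increments dX of a path interpolated through the (possibly sparse) data:

```python
import torch
import torch.nn as nn

class NeuralCDE(nn.Module):
    """Euler-discretized neural controlled differential equation:
    dz = f(z) dX, where X is a path interpolated through the observations."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # f maps the hidden state to a (hidden_dim x input_dim) matrix
        self.func = nn.Sequential(
            nn.Linear(hidden_dim, 64), nn.Tanh(),
            nn.Linear(64, hidden_dim * input_dim),
        )
        self.hidden_dim, self.input_dim = hidden_dim, input_dim

    def forward(self, x, z0):
        # x: (T, input_dim) interpolated path; z0: (hidden_dim,) initial state
        z = z0
        for t in range(1, x.size(0)):
            dx = x[t] - x[t - 1]                       # increment of the control path
            f = self.func(z).view(self.hidden_dim, self.input_dim)
            z = z + f @ dx                             # Euler step: dz = f(z) dX
        return z
```

The real implementation uses an adaptive ODE solver rather than fixed Euler steps, and missing values are handled naturally because the solver only ever sees the interpolated path.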
Code Quality: The authors provide code for neural ODEs in the NeuralODE library, which is available here. As this is a library itself, the code quality is pretty high. At flow forecast, a goal of ours for a while has been to build a bridge that lets users use models from the NeuralODE library in our framework. However, this has proven complicated, both in terms of code structure and theoretical understanding. Probably the easiest approach would be general wrapper classes that could be used for either ODE models or STRIPE. My guess is this would take a long time to implement and test, probably at least 3–4 weeks.
Performance on Datasets: The model achieves superior performance to RNN-based methods on the CharacterTrajectories, PhysioNet sepsis prediction, and Speech Commands datasets. What I found particularly impressive was that on the PhysioNet sepsis prediction challenge, only around 10.3% of the values are observed, yet it achieved an AUC of 0.88. The model similarly outperforms GRU-ODE-Bayes and several other models across these varied datasets.
Relevance to our users: A lot of real-world data has missing values. For instance, our stream flow forecasts can at times have weeks of data missing during a station malfunction, and EHR data is another major area with constant gaps. Therefore, finding models that perform well on such data is crucial. Additionally, I think Neural ODE-based models might cope better with domain shift, since there is already a general underlying equation.
Final Verdict: This is a solid paper all around and relevant for our repo. I would rate this a Medium priority item and hope to have it ported to flow forecast by the summer months.
Other possibly relevant research
Of course, at a conference as large as NeurIPS, I can't give detailed descriptions of every paper related to time series. Below are some other related papers that I haven't had time to give more than a brief glance.
Normalizing Kalman Filters for Multivariate Time Series Analysis
This paper describes using the Kalman filter together with deep learning. I haven't had time to read through the paper yet, but it looks like another good example of a classical/deep learning synthesis model.
Deep Rao-Blackwellised Particle Filters for Time Series Forecasting
I'd categorize this paper as a hybrid classical time series/deep learning model. Unfortunately, I don't really know anything about the classical model or how it operates, so I'm not going to comment further right now. Feel free to take a look if you are interested, though.
Conclusion
In conclusion, a lot of interesting papers on time series analysis, forecasting, and classification came out of NeurIPS 2020. We hope to incorporate quite a few of them over the next several months and, hopefully, most of them over the next year. We will also be closely monitoring ICLR 2021 for other potential candidates to add. Make sure to watch our repository on GitHub, follow our Twitter for updates, or, better yet, contribute yourself.