Key takeaways from Kaggle’s most recent time series competition - Ventilator Pressure Prediction

What it takes to be competitive in time series prediction

Ignacio Oguiza
Towards Data Science



The importance of deep learning for time series prediction keeps growing.

The first time a neural network finished within the top 3 solutions in a Kaggle time series competition was in 2015 (Rossmann store sales). Since then, it has become increasingly common to see neural networks at the top of the leaderboard.

And the trend continues: in the most recent Kaggle time series competition, all of the top 15 teams used neural networks.

I created the tsai [1] deep learning library 2 years ago to make it easy to use state-of-the-art deep learning models and methods with time-series data. When the last Kaggle time series competition ended, I was eager to know how the top teams had achieved their excellent results. So I reviewed all the solutions posted by the 15 gold medal-winning teams. And here are some key findings.

Kaggle time series competition

During the last two months, Kaggle hosted the Google Brain - Ventilator Pressure Prediction competition. The goal was to simulate a ventilator connected to a sedated patient's lung. More concretely, participants had to predict the pressure in the lungs during the inspiratory phase of each breath.

The dataset consisted of about 125k simulated breaths, of which 60% were labeled (training data). Each breath contained 80 irregularly sampled time steps, with five features per time step. Each breath in the training set had an 80-step target sequence (pressure), and the goal was to predict such a sequence for each breath in the test data. The evaluation metric was mean absolute error (MAE).


Key findings

Task definition

The problem was a sequence-to-sequence task in which the input and target sequences were aligned in time (one output per input time step). Interestingly, the target was not fully continuous: each target step was a float with only 950 possible values.

As expected, most top teams approached the problem as a regression task.

However, some of the gold medal winners successfully handled the competition as a classification task. They predicted the probabilities for each of the 950 classes.

I was surprised to see the classification approach work so well. You may want to try it with your own datasets: if your target is continuous, you can always discretize it into bins, as in the sketch below.
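As an illustration, here is a minimal sketch of how a continuous target can be discretized into class indices and decoded back to values. The bin count matches the 950 values in this competition, but the data and the equal-width binning scheme are just assumptions for the example.

```python
import numpy as np

# Toy continuous target: a stand-in for your own data, not the competition data.
rng = np.random.default_rng(0)
y = rng.uniform(-2.0, 65.0, size=100_000)

n_bins = 950                                    # the competition target had 950 possible values
edges = np.linspace(y.min(), y.max(), n_bins + 1)
centers = (edges[:-1] + edges[1:]) / 2          # value used to decode each class

# Encode: continuous value -> class index in [0, n_bins - 1]
y_cls = np.digitize(y, edges[1:-1])

# Decode: class index (e.g. the argmax of the predicted probabilities) -> value
y_hat = centers[y_cls]
print(np.abs(y - y_hat).mean())                 # mean reconstruction error (at most half a bin width)
```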

Features

Top teams used three different approaches:

  • Original features only. Only the winning team used this approach successfully in one of their models. To compensate for the small number of features, they used a large number of epochs (2.5k).
  • Original plus handcrafted features. Most teams followed this route, usually adding a few dozen extra features. Well-designed handcrafted features can help a model converge faster, so fewer epochs are required, and they offer an opportunity to boost performance with expert domain knowledge.
  • Original plus handcrafted plus learned features. A few teams passed the input through a feature extractor with multiple convolution layers in parallel, each with a different kernel size, to learn new features (a minimal sketch of this idea follows the list).
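As a rough illustration of that last idea, here is a minimal sketch of a parallel multi-kernel convolutional feature extractor. The channel counts and kernel sizes are my own assumptions, not any team's exact values.

```python
import torch
import torch.nn as nn

class MultiKernelConvExtractor(nn.Module):
    """Parallel 1D convolutions with different kernel sizes, concatenated.

    A minimal sketch of the 'learned features' idea; channel and kernel sizes
    are illustrative assumptions.
    """
    def __init__(self, in_features: int, channels: int = 32, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_features, channels, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):                      # x: (batch, seq_len, in_features)
        x = x.transpose(1, 2)                  # -> (batch, in_features, seq_len)
        feats = [branch(x) for branch in self.branches]
        out = torch.cat(feats, dim=1)          # -> (batch, channels * n_branches, seq_len)
        return out.transpose(1, 2)             # -> (batch, seq_len, learned_features)

# Usage: concatenate the learned features with the original (and handcrafted)
# features before feeding them to the sequence model.
extractor = MultiKernelConvExtractor(in_features=5)
x = torch.randn(8, 80, 5)                      # 80 time steps, 5 features per step
learned = extractor(x)                         # (8, 80, 128)
augmented = torch.cat([x, learned], dim=-1)    # (8, 80, 133)
```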

The combination of original features, handcrafted features designed by domain experts, and learned features is truly powerful.

Models

LSTMs and transformers dominated this time series competition. CNNs and boosted trees were not competitive.

All top teams used neural networks (deep learning). Unlike in many other domains, boosted trees were not competitive: none of the best solutions included them.

Top models included:

  • Stacked bidirectional LSTMs (a type of recurrent neural network) dominated this competition. Almost all gold medal winners used predictions from an LSTM model as part of their final ensemble. As mentioned before, some teams built hybrid models by adding a feature extractor before the LSTM layers (a minimal sketch of such a stack appears at the end of this section).
Upstage team (3rd position) model architecture [2]
  • Transformer models were less common. Some teams mentioned that it was difficult to make them work as well as LSTMs, but a couple of top teams managed to get excellent results with them. These models used only the encoder part and no positional encoding, since time was already one of the input features. Two additional customizations were convolutional layers before the transformer blocks and skip connections (a simplified sketch follows this list).
Created by Ignacio Oguiza. Based on a Transformer architecture used by the UnderPressure team [3].
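The sketch below is a simplified take on that kind of encoder-only setup: a convolutional stem, no positional encoding, and one output per time step. It is an assumption based on the description above, not the UnderPressure team's actual code, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ConvTransformerEncoder(nn.Module):
    """Encoder-only Transformer with a convolutional stem and no positional
    encoding (time is already one of the input features). Dimensions are
    illustrative assumptions."""
    def __init__(self, in_features: int = 5, d_model: int = 128, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        self.stem = nn.Conv1d(in_features, d_model, kernel_size=5, padding=2)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            dropout=0.1, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)                   # one pressure value per time step

    def forward(self, x):                                   # x: (batch, 80, in_features)
        h = self.stem(x.transpose(1, 2)).transpose(1, 2)    # conv stem -> (batch, 80, d_model)
        h = self.encoder(h)                                 # no positional encoding added
        return self.head(h).squeeze(-1)                     # (batch, 80)

model = ConvTransformerEncoder()
pressure_pred = model(torch.randn(4, 80, 5))                # (4, 80)
```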

It’s also interesting to note that none of the top solutions relied on convolutional neural networks (CNNs) as their main model; convolutions only appeared as feature extractors in front of LSTM or Transformer blocks.
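Since stacked bidirectional LSTMs were the workhorse of this competition, here is a minimal sketch of such a model. The hidden size, depth, and dropout are illustrative assumptions, not any team's exact configuration.

```python
import torch
import torch.nn as nn

class StackedBiLSTM(nn.Module):
    """Minimal stacked bidirectional LSTM for sequence-to-sequence regression.
    Hidden size, depth, and dropout are illustrative assumptions."""
    def __init__(self, in_features: int = 5, hidden: int = 256, n_layers: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(in_features, hidden, num_layers=n_layers,
                            bidirectional=True, batch_first=True, dropout=0.1)
        self.head = nn.Linear(2 * hidden, 1)   # 2x because of bidirectionality

    def forward(self, x):                      # x: (batch, 80, in_features)
        h, _ = self.lstm(x)                    # (batch, 80, 2 * hidden)
        return self.head(h).squeeze(-1)        # (batch, 80): one pressure per time step

model = StackedBiLSTM()
pressure_pred = model(torch.randn(4, 80, 5))   # (4, 80)
```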

Custom loss functions: multi-task training

Custom loss functions made a big difference for top teams.

Submissions were scored using MAE (mean absolute error), so most participants used L1Loss or a related loss (HuberLoss, SmoothL1Loss, etc.). However, almost all top teams went further and used multi-task learning with auxiliary losses: they added extra targets to the original one to reduce overfitting and improve generalization.

As explained before, the target was a sequence of 80 pressure values per breath. Top teams, however, also had their models predict secondary targets such as:

  • the pressure difference between current and previous time steps (between one and four steps apart), or
  • the cumulative pressure at each time step, or
  • the pressure variance for each time step

Adding the pressure difference and the accumulated pressure forces the model to learn the target, its derivative, and its integral, which improves performance.

“The huge gap in public leaderboard may be due to this loss.” (team in 3rd position)

“[custom loss] help the model predict pressure and its derivative and its integral correctly. This boosts CV LB” (team in 13th position)

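A minimal sketch of this kind of multi-task L1 loss is shown below. Here the auxiliary terms are applied to transformed versions of the predictions, with arbitrary weights; some teams instead had their models predict the secondary targets directly with separate output heads. Treat it as an illustration, not any team's exact loss.

```python
import torch
import torch.nn.functional as F

def multitask_l1_loss(pred, target, w_diff=0.5, w_cum=0.5):
    """Main L1 pressure loss plus auxiliary L1 terms on the step-to-step
    difference (a discrete derivative) and the cumulative sum (a discrete
    integral). Weights are illustrative assumptions."""
    loss = F.l1_loss(pred, target)                                   # main target
    loss += w_diff * F.l1_loss(pred[:, 1:] - pred[:, :-1],
                               target[:, 1:] - target[:, :-1])       # derivative term
    loss += w_cum * F.l1_loss(torch.cumsum(pred, dim=1),
                              torch.cumsum(target, dim=1))           # integral term
    return loss

# pred and target are (batch, 80) sequences of pressures
loss = multitask_l1_loss(torch.randn(4, 80), torch.randn(4, 80))
```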

Frameworks & hardware

In this competition, teams used TensorFlow, PyTorch, or both. Most teams also used scikit-learn, mainly for preprocessing and cross-validation.

Access to TPUs or multiple fast GPUs was very important, as training a single model on a single fold took several hours.

Data augmentation

Data augmentation is one of the best strategies to reduce overfitting.

Few teams found a good way to augment this data, but those who did improved their performance significantly. Some useful data augmentation strategies used in this competition were (a small sketch of the first two follows the list):

  • Random shuffling of nearby time steps (based on a rolling window).
  • Random masking. During training, one of the categorical variables was set to zero.
  • Mixup
  • CutMix
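Here is a small sketch of what the first two augmentations could look like on a single breath. The window size, masking probability, and column index are illustrative assumptions, and the shuffling uses a simplified non-overlapping-window variant of the rolling-window idea.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_categorical(x, col, p=0.2):
    """With probability p, set one (assumed categorical) column to zero."""
    x = x.copy()
    if rng.random() < p:
        x[:, col] = 0
    return x

def shuffle_nearby_steps(x, y, window=3):
    """Shuffle time steps inside small non-overlapping windows, applying the
    same permutation to features and target (window size is illustrative)."""
    x, y = x.copy(), y.copy()
    for start in range(0, len(x) - window + 1, window):
        perm = rng.permutation(window) + start
        x[start:start + window] = x[perm]
        y[start:start + window] = y[perm]
    return x, y

x = rng.normal(size=(80, 5))       # one breath: 80 time steps, 5 features
y = rng.normal(size=80)            # 80-step pressure target
x_aug, y_aug = shuffle_nearby_steps(mask_categorical(x, col=1), y)
```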

Training

Top teams trained their models for a large number of epochs (usually somewhere between 150 and 300, although some used up to 2.5k!). All of them used a learning rate scheduler; Cosine Annealing and ReduceLROnPlateau were the most popular ones. At least one team claimed a significant performance boost from Cosine Annealing with Warm Restarts.
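In PyTorch, a cosine-annealing-with-warm-restarts schedule can be set up as in the sketch below. The model, number of epochs, and restart period are illustrative assumptions, not any team's actual training recipe.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# Illustrative setup only.
model = torch.nn.LSTM(5, 256, num_layers=4, bidirectional=True, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=50, T_mult=1, eta_min=1e-5)

for epoch in range(300):
    # ... forward/backward passes and optimizer.step() for one epoch go here ...
    scheduler.step()   # cosine decay, restarting the cycle every 50 epochs
```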

Ensembles

Model ensembles are frequent in Kaggle competitions, and they were particularly important in this one. The reason is that when predictions are evaluated with mean absolute error (MAE), it is usually better to combine them with the median than with the mean, and the median only becomes robust when you aggregate many predictions.

All top teams built one or more strong models and trained them:

  • on multiple folds (10–15+), or
  • on all of the training data with different seeds, or
  • with a combination of both
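Blending the per-fold (or per-seed) predictions with the element-wise median is then a one-liner; the shapes and random stand-in arrays below are purely illustrative.

```python
import numpy as np

# Stand-ins for per-fold (or per-seed) predictions of shape (n_breaths, 80).
rng = np.random.default_rng(0)
fold_preds = [rng.normal(size=(1000, 80)) for _ in range(10)]

# With MAE as the metric, the element-wise median is usually the better blend.
ensemble = np.median(np.stack(fold_preds), axis=0)
```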

Post-processing

Gold medal winners used three main techniques:

  • rounding predictions to the nearest of the 950 possible pressure values (see the sketch after this list),
  • always blending predictions with the median, or
  • choosing between the mean and the median of the predictions based on an algorithm
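The rounding step can be implemented as in the sketch below; the allowed values and predictions are random stand-ins, since in practice you would take the set of pressure values observed in the training targets.

```python
import numpy as np

rng = np.random.default_rng(0)
allowed = np.sort(rng.uniform(-2.0, 65.0, size=950))   # stand-in for the 950 training pressure values
preds = rng.uniform(0.0, 40.0, size=(1000, 80))        # stand-in for ensembled predictions

def round_to_allowed(pred, allowed):
    """Snap every prediction to the nearest of the allowed pressure values."""
    idx = np.clip(np.searchsorted(allowed, pred), 1, len(allowed) - 1)
    left, right = allowed[idx - 1], allowed[idx]
    return np.where(pred - left < right - pred, left, right)

rounded = round_to_allowed(preds, allowed)
```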

In addition to these, at least three teams discovered a very ingenious technique that gave them a huge advantage: they found a (legal) leak in the data and finished first, second, and fourth.

Pseudo-labels

Some of the best solutions also leveraged the unlabeled breaths: they used their models to generate pseudo-labels and added those samples to the training data. Improvements from this technique were modest compared to the others described above.
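For illustration only, here is a toy sketch of the pseudo-labeling loop, using a scikit-learn model and random stand-in data rather than the teams' deep learning models.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy stand-in data; not the competition data or models.
rng = np.random.default_rng(0)
x_lab, y_lab = rng.normal(size=(600, 5)), rng.normal(size=600)
x_unlab = rng.normal(size=(400, 5))                    # unlabeled samples

model = KNeighborsRegressor().fit(x_lab, y_lab)        # 1. train on the labeled data
pseudo_y = model.predict(x_unlab)                      # 2. pseudo-label the unlabeled data
x_all = np.concatenate([x_lab, x_unlab])
y_all = np.concatenate([y_lab, pseudo_y])
model = KNeighborsRegressor().fit(x_all, y_all)        # 3. retrain on the combined set
```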

Conclusion

The field of time series is following the path of computer vision and NLP, where neural networks dominate the landscape.

Neural networks plus expert domain knowledge can significantly improve the performance of your time series tasks.

Deep learning applied to time series has evolved rapidly in recent years and has matured to the point where now is a good time to start using it to solve time series problems.

Lastly, I would like to thank all the participants in the VPP competition, especially those who shared their write-ups, code, and ideas with the rest of the community. All of them contribute to improving time series prediction.

References:

[1] tsai (created by timeseriesAI): an open-source, state-of-the-art deep learning library for time series and sequential data, built on top of PyTorch/fastai.

[2] Upstage team model architecture (https://www.kaggle.com/c/ventilator-pressure-prediction/discussion/285330). I highly recommend reading their detailed solution; it is very well documented.

[3] Notebook created by Chris Deotte from the UnderPressure team (13th position): https://www.kaggle.com/cdeotte/tensorflow-transformer-0-112?scriptVersionId=79039122
