Time-Series Forecasting: No, LSTMs Are Not Dead!

If they are dead, why do they still win Kaggle competitions?

Photo by Ricardo L on Unsplash

Who has been closely following the machine learning field during the past decade?

Those who have have witnessed revolutionary scientific progress like no other. It is reminiscent of the beginning of the 20th century, when Einstein’s Annus mirabilis papers became a foundation of modern physics. Only this time, it was the AlexNet paper [1], an architecture that revolutionized computer vision and renewed people’s interest in machine learning (later rebranded as Deep Learning).

The caveat of this relentless growth is that it is difficult to assess every breakthrough correctly: before a new technique has even started gaining ground, another one emerges – more powerful, faster, or cheaper. This tremendous growth generates so much hype that it attracts many newcomers, often full of enthusiasm but short on experience.

One such misunderstood breakthrough in the field of Deep Learning is the family of recurrent neural networks. If you google phrases such as "LSTMs are dead" and "RNNs have died" you will find a ton of results, most of which are incorrect or do not give the full picture. This article will show you that recurrent networks are still relevant and find use in many practical scenarios.

But first, I will give you the historical context needed to understand why most people believe the opposite. Also, this article is not only about LSTMs and Transformers: you will also learn how to evaluate a concept in data science without bias.

I’ve launched AI Horizon Forecast, a newsletter focusing on time-series and innovative AI research. Subscribe here to broaden your horizons!

Enter LSTMs

Every big tech company embraced LSTMs; there was no NLP research without LSTMs.

Long Short-Term Memory networks – LSTMs [2] – started taking off in 2014, even though they were introduced back in 1997. They belong to the family of Recurrent Neural Networks – RNNs [3] – along with Gated Recurrent Units – GRUs [4].

With the increased accessibility of GPUs and the advent of the first Deep Learning frameworks, LSTMs became the state-of-the-art model that dominated the NLP domain. The arrival of word embeddings in 2013 was also instrumental in establishing transfer learning. In fact, the standard components of almost any NLP task back then were: a) pretrained word embeddings, b) LSTMs, and c) the sequence-to-sequence architecture [5].

Everyone who was a data scientist during that period can agree that LSTMs dominated the NLP landscape: they were used for speech recognition, text-to-speech synthesis, language modeling, and machine translation. Every big tech company embraced them; there was no NLP without LSTMs.

One of the best models created by Google for Machine Translation is shown in Figure 1:

Figure 1: The Google Neural Machine Translation – GNMT architecture (Source)

This complex model, introduced in [6], powered the Google Translate service. It reduced translation errors by 60% compared to its predecessor. As you can see, it makes heavy use of LSTMs, forming the famous encoder-decoder topology (including a bidirectional LSTM).

This implementation also makes use of Attention, a mechanism that allows the model to focus on the relevant parts of the input sequence as needed. This is shown in Figure 1, where the top vectors of the encoder are weighted using attention scores. To put it differently, each word at each time step is weighted with a learnable score that minimizes errors. For more information, read the original paper [5].
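To make this concrete, here is a minimal sketch of the core idea in PyTorch. It uses simple dot-product scoring purely for illustration – GNMT itself uses a learned (additive) scoring function, and the function name below is hypothetical:

```python
import torch
import torch.nn.functional as F

def attend(decoder_state: torch.Tensor, encoder_outputs: torch.Tensor):
    """Weight the encoder outputs by their relevance to the current decoder state.

    decoder_state:   (batch, hidden)          -- the query
    encoder_outputs: (batch, seq_len, hidden) -- one vector per input word
    """
    # One relevance score per source position (dot-product scoring for simplicity).
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)  # normalize scores into a distribution
    # Context vector: attention-weighted sum of the encoder outputs.
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights
```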

However, LSTMs have two main drawbacks:

  1. They are not easy to parallelize during training.
  2. Because of their recurrent nature, there is a limit on the length of the sequences they can model.

But more on that later.

Enter Transformers

RNNs are sequential models, meaning words are processed in order. But the Transformer processes all the words in parallel.

In 2017, Google introduced the Transformer [7] architecture, a milestone for the NLP ecosystem. This new model delves deeper into Attention by proposing the Multi-Head Attention mechanism, which:

  • Takes full advantage of self-attention, thus achieving superior performance.
  • Adopts a modular structure, making heavy matrix operations more parallelizable. In other words, it runs faster and has better scalability.

However, no LSTMs were used within the Transformer model. Even at the first layer, where contextual information matters (and LSTMs could be useful), the paper proposes a different mechanism called positional encoding. This also reveals the main difference between the two types of models: RNNs are sequential models, meaning words are processed in order, whereas the Transformer processes all the words in parallel. This cuts training time down significantly.
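For reference, here is a minimal sketch of the sinusoidal positional encoding proposed in [7], which injects order information without any recurrence (assuming an even d_model):

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding from [7]; added to the token embeddings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    # Geometrically spaced frequencies for each pair of dimensions.
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe
```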

Since then, the Transformer’s core philosophy has been the basis for further research in language processing, giving birth to new variations. These are shown in Figure 2.

Figure 2: The open-source Transformer Family (Source)

Don’t Forget the Time Series!

Both LSTMs and Transformers are great at modeling sequential information. Hence, they can also be applied to Time Series Forecasting cases.

If you are interested in Time-Series Forecasting, check my list of the Best Deep Learning Forecasting Models.

Traditional statistics win the first round

However, experimental results showed that neither could decisively outperform traditional statistical methods (e.g. ARIMA) in terms of accuracy. On the other hand, the combination of statistical and RNN-based methods proved more effective. One such example was the ES-RNN model, built at Uber, that eventually won the M4 competition: a hybrid model applying Exponential Smoothing on top of a dilated LSTM.

Naturally, the Transformer was also put to the test. For time-series forecasting, the most common approach was the following: use the original Transformer, and replace the positional encoding layer with a Time2Vec layer [8]. But even the Transformer was unable to surpass the statistical methods.
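For the curious, here is a minimal sketch of the Time2Vec idea from [8]: one linear component to capture trend, plus learnable sinusoidal components to capture periodic patterns (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Time2Vec [8]: a linear term plus k-1 learnable periodic (sine) terms."""
    def __init__(self, out_features: int):
        super().__init__()
        self.linear = nn.Linear(1, 1)                   # trend: w0 * t + b0
        self.periodic = nn.Linear(1, out_features - 1)  # sin(w_i * t + b_i)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, seq_len, 1) -- the raw time index or timestamp
        return torch.cat([self.linear(t), torch.sin(self.periodic(t))], dim=-1)
```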

Moreover, I want to clarify a few things:

  • This doesn’t mean that statistical methods were always better. For example, when plenty of data was available, LSTMs could perform better than ARIMA.
  • Statistical methods require more data pre-processing: this may include making the time series stationary (if it isn’t), removing seasonality, volatility, and so on. LSTMs can capture the natural characteristics of sequences more easily, with simpler pre-processing.
  • Also, statistical methods are less versatile: for instance, autoregressive methods can’t handle extra features whose future values are unknown.

The bottom line is that ML methods were not consistently better than statistical methods, in terms of forecasting ability.

Deep Learning wins the second round

It was not until 2018–2019 that research paid off and deep learning models started becoming more competitive in time-series forecasting tasks. For a more comprehensive analysis of Deep Learning in Time Series Forecasting, check this article.

Two state-of-the-art models are shown in Figure 3 and Figure 4. They depict the architectures of Google’s Temporal Fusion Transformer [9] and Amazon’s DeepAR [10], respectively. Notice anything interesting?

Figure 3: The Temporal Fusion Transformer (Source)
Figure 4: DeepAR model architecture (Source)

Well, many things are interesting about these models, but the most important one, which resonates with the topic of this article, is:

Both models utilize LSTMs! But how?

DeepAR is a complex time-series model that combines autoregressive and deep learning characteristics. The h_i,t vectors shown in Figure 4 are in fact the hidden states of LSTM cells. These hidden states are then used to calculate the μ and σ parameters of a Gaussian distribution. From this distribution, n samples are drawn, and their median represents the predicted value.
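In code, the idea looks roughly like the sketch below. The dimension sizes and layer names are hypothetical, not DeepAR’s actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, n_samples = 64, 200  # illustrative values
mu_layer = nn.Linear(hidden_size, 1)
sigma_layer = nn.Linear(hidden_size, 1)

def point_forecast(h_t: torch.Tensor) -> torch.Tensor:
    """Map an LSTM hidden state h_t to a point forecast via a Gaussian likelihood."""
    mu = mu_layer(h_t)                    # mean of the predictive distribution
    sigma = F.softplus(sigma_layer(h_t))  # softplus keeps the std positive
    # Draw n samples from the predicted Gaussian; the median is the forecast.
    samples = torch.distributions.Normal(mu, sigma).sample((n_samples,))
    return samples.median(dim=0).values
```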

The Temporal Fusion Transformer – TFT is a multi-layered, pure deep learning model for time series. The model features both an LSTM encoder-decoder and a novel Attention mechanism that provides interpretable forecasts. We won’t delve into the specifics of this model here – check this amazing article, which provides a thorough explanation.

The bottom line is that both deep learning models outperform the traditional statistical methods. Also, both models are more versatile, because they work with multiple time series and accept a richer set of features (with TFT being slightly superior).


How Recurrence and Attention are related

To show this, let’s focus on an excerpt from the TFT paper [9]:

To learn temporal relationships at different scales, TFT uses recurrent layers for local processing and interpretable self-attention layers for long-term dependencies.

Considering what we know so far, and given the above excerpt, we can now connect the dots:

Recurrent networks are excellent at capturing the local temporal characteristics of a sequence, while Attention is more adept at learning long-term dynamics.

This is not an arbitrary conclusion. The authors of the TFT paper demonstrated it by performing an ablation analysis: in this type of analysis, we remove or replace certain components of a complex machine learning system to measure each component’s contribution.

Among other components, the authors of TFT tested the LSTM encoder-decoder layer: they ablated it by replacing it with the standard positional encoding layer of the original Transformer. They found two things:

  1. The utilization of a sequence-to-sequence layer was instrumental in the model’s performance.
  2. In 4 out of 5 datasets where the benchmark was performed, the LSTM layer achieved higher performance.

Therefore, we can safely conclude that LSTM layers remain an invaluable component of time-series deep learning models. Moreover, they don’t antagonize the Attention mechanism; instead, they can be combined with an Attention-based component to further improve a model’s performance.


The hidden gem of LSTMs: Conditional Output

This is one of the most overlooked advantages of LSTMs, which many data science practitioners are still unaware of.

If you have been using vanilla recurrent networks, you know that they can only handle temporal data – data represented as sequences with various dependencies among their elements. However, they cannot directly model static metadata, i.e. time-invariant data.

In NLP, static metadata is not relevant. Instead, NLP models focus on a vocabulary of words, where each word is represented by an embedding – a unified concept across the whole model. The type of document a word comes from is not really important, as long as the model can learn the correct context-aware representation of each word. Remember, a particular word can have different embeddings depending on its meaning and its position in a sentence.

However, in a time-series model, time-invariant data can have a much larger impact. For example, let’s say we have a sales forecasting scenario involving a store’s products. The volume of a product’s sales can be modeled as a time sequence, but it is also influenced by external factors such as holidays. So, a good forecasting model should also consider those variables. That’s what TFT does (see Figure 5). But how is TFT able to achieve this?

Figure 5: Effect of external static variables on forecasting (Source)

TFT is expertly designed to integrate static metadata. It uses various techniques, which are described in the original paper. The most important one, however, has to do with LSTMs.

LSTMs can perform this task seamlessly, using a trick first introduced in [11]: instead of setting the LSTM’s initial hidden state h_0 and cell state c_0 to zeros (or random values), we initialize them with a specified vector/embedding of our choice. Alternatively, we can make those vectors trainable during fitting (that’s actually what TFT does). This way, the output of an LSTM cell is properly conditioned on the external variables, without affecting its temporal dependencies.
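Here is a minimal sketch of this trick in PyTorch. It is a simplification – TFT derives these initial states from a dedicated static covariate encoder – and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

static_dim, input_size, hidden_size = 8, 16, 64       # illustrative sizes
static_encoder = nn.Linear(static_dim, hidden_size)   # embeds the static metadata
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

def conditioned_forward(x_seq: torch.Tensor, static_features: torch.Tensor):
    """Condition the LSTM on static metadata through its initial states."""
    # (1, batch, hidden): use the static embedding as h_0 and c_0 instead of zeros.
    init = torch.tanh(static_encoder(static_features)).unsqueeze(0)
    output, _ = lstm(x_seq, (init, init))
    return output
```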


LSTMs vs TCNs

Before the advent of Attention and Transformers, there was another novel model that promised to change the landscape: the Temporal Convolutional Network – TCN.

TCNs use dilated convolutions which apply padding on the input sequences at various lengths – making them able to detect dependencies between items that are not only close to each other but also at entirely different positions.

First introduced in 2016 [12] and formalized in 2018 [13], TCNs leverage convolutional networks to model sequence-based data. Naturally, they were also ideal candidates for time-series forecasting tasks.

Figure 6: A dilated convolution with filter size k = 3 and dilation factors d = 1, 2, 4. The receptive field can cover all datapoints x_0...x_T from the input sequence. (Source)

The "secret weapon" of TCNs is the dilated convolution, displayed in Figure 6. Standard CNNs use kernels/filters of fixed size and are thus only able to cover data elements in immediate proximity. In contrast, TCNs use dilated convolutions that apply padding on the input sequences at various lengths – making them able to detect dependencies between items that are not only close to each other but also at entirely different positions.
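Below is a minimal sketch of a causal dilated convolution in PyTorch. The layer sizes are hypothetical, and real TCNs add residual connections and normalization on top:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv(nn.Module):
    """One TCN-style layer: left-only padding so each output sees only the past."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation  # grows with the dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len); pad on the left to preserve causality.
        return self.conv(F.pad(x, (self.left_pad, 0)))

# Stacking d = 1, 2, 4 (as in Figure 6) grows the receptive field exponentially.
tcn_stack = nn.Sequential(*[CausalDilatedConv(32, 3, d) for d in (1, 2, 4)])
```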

Apart from that, TCNs use other techniques, such as residual connections, which are now standard in deep networks. Again, we will not delve into the specifics here (more on TCNs in a future article). Instead, we will focus on the differences with regard to LSTMs:

  • Speed: In general, TCNs are faster than LSTMs because they rely on convolutions, which can be computed in parallel. In practice, however, with many dilation layers and the residual connections, a TCN may end up being slower.
  • Input length: Both TCNs and LSTMs are able to accept variable-length inputs.
  • Memory: On average, TCNs require more memory than LSTMs, because each sequence is processed by multiple dilation layers. Again, this depends on the hyperparameters, which define how complex each model becomes.
  • Performance: The initial paper showed TCNs outperforming LSTMs. In practice, however, this is not always the case: the more exhaustive study in [13] showed that in some tasks TCNs were better, while in others LSTMs were more effective.

In other words, there is no clear winner here. Both models have their advantages and disadvantages. The best course of action is to evaluate both and see which one suits your case best.

Note, however, that this single-model approach is now obsolete. You will not achieve state-of-the-art performance by applying a single TCN – or LSTM, for that matter – unless your case is very trivial. Modern cases factor in more external parameters, which demand a more sophisticated approach; this in turn means that more than one component/model has to be used. This is better explained in the following section.


Deep Learning and Time Series in Kaggle

Up until now, we have been evaluating individual models from an academic perspective. However, the practical aspect cannot be overlooked if we are to form a more complete view.

A good basis for evaluation is Kaggle, which indirectly provides empirical evidence on the state of data science. We will focus on a recent Kaggle competition, Ventilator Pressure Prediction. The task was to predict the sequence of pressures inside a mechanical lung, given the sequence of control inputs. Each training instance could be considered a time series of its own, making the task a multiple time-series problem.

This competition was challenging for three reasons:

  1. The problem could be formulated as either a regression or a classification task.
  2. The dataset leaves the door open for creative feature engineering.
  3. Given that each subject/datapoint was represented by a different sequence, using a statistical model was not viable.

Now, there are two interesting things about this competition with regard to the topic of this article:

  • The top three teams, as well as many others, used at least one LSTM-based component in their final solution (e.g. stacked LSTMs, bidirectional LSTMs).
  • The winning team submitted a multi-level deep architecture which included, among other things, an LSTM network and a Transformer block. This architecture is shown in Figure 7:
Figure 7: Architecture of the 1st place solution (Source)

Of course, that team implemented many other techniques that contributed to their win. The important point here is that non-trivial datasets can be approached from many different angles and thus require more complex solutions. And since each model has its own unique strengths and weaknesses, you can’t limit yourself to a single model or a single approach.


The Fate of Convolutional Neural Networks

I hope this article has made a good case for the value of LSTMs. However, there is no doubt that the Transformer is an amazing breakthrough in the field of machine learning. This level of success will inevitably lead to an even higher level of adoption in the future.

In 2020, Transformers were adapted for computer vision, giving birth to the Vision Transformer – ViT [14]. That paper triggered further research, and after extra modifications, the new model was eventually able to outperform CNNs in many image classification tasks. Even better, researchers found that combining both components yields even better results. We will certainly see more of ViTs in the future.

Therefore, I hope that this time we will resist the temptation of bold statements such as "The death of CNNs" or "The fall of CNNs" and so on.

Closing Remarks

In a nutshell, this case study discussed the following points:

  • It is nearly impossible to properly evaluate the impact of a breakthrough in machine learning while it is still unfolding.
  • The advent of Transformers reshaped the landscape: LSTMs, especially in NLP, stopped being the center of attention.
  • For time series, however, LSTMs remain useful, and their benefits are considerable.
  • Modern (and interesting) challenges in data science factor in more than one domain – e.g. audio, text, graphs, and so on – which in turn requires combining various approaches/models.

Thank you for reading!



References

  1. Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks" (NIPS 2012)
  2. Hochreiter & Schmidhuber, "Long Short-Term Memory" (Neural Computation, 1997)
  3. Rumelhart et al., "Learning Internal Representations by Error Propagation" (1985)
  4. Cho et al., "On the Properties of Neural Machine Translation: Encoder-Decoder Approaches" (2014)
  5. Sutskever et al., "Sequence to Sequence Learning with Neural Networks" (2014)
  6. Wu et al., "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation" (2016)
  7. Vaswani et al., "Attention Is All You Need" (2017)
  8. Kazemi et al., "Time2Vec: Learning a Vector Representation of Time" (2019)
  9. Lim et al., "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting" (International Journal of Forecasting, 2021)
  10. Salinas et al., "DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks" (International Journal of Forecasting, 2019)
  11. Karpathy & Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions" (2015)
  12. Lea et al., "Temporal Convolutional Networks for Action Segmentation and Detection" (CVPR 2017)
  13. Bai et al., "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling" (2018)
  14. Dosovitskiy et al., "An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale" (2020)
