High-Level History of NLP Models

How we arrived at today's attention-based transformer architectures for NLP tasks

Mallory Hightower
Towards Data Science


Natural Language Processing (NLP), enabling computers to make sense of human language, is not a novel concept. However, the last decade has witnessed an unprecedented leap in NLP technology, with much of the progress enabled by deep learning. The field has advanced so rapidly that data scientists must continually learn new machine learning techniques and model architectures. Thankfully, since the development of the current state-of-the-art NLP architecture, attention-based models, progress in the field seems to have slowed momentarily. Data scientists finally have a moment to catch up!

But how did we arrive at our current state in NLP? The first big advancement came in 2013 with Word2Vec, the breakthrough research detailed in a paper by Mikolov et al. They realized that when a neural network is trained on an NLP task, it is forced to learn similarities between words, and that these learned vector representations, stored in the network’s embedding layer, are valuable in their own right. This discovery added an entirely new dimension to NLP tasks. Thanks to Word2Vec, we have a far more efficient way to represent words: we no longer have to rely on the traditional sparse, one-hot-encoded representation. Word embeddings require less memory, decrease compute time, and have been shown to drastically improve downstream model performance. Other word representation models, such as GloVe, have followed. No more one-hot encoding!
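To make the contrast with one-hot encoding concrete, here is a minimal sketch of training dense word vectors with the gensim library (gensim 4.x API). The toy corpus and hyperparameters are illustrative assumptions, not the setup from the original paper.

```python
# A minimal sketch of learning Word2Vec embeddings with gensim;
# the corpus and hyperparameters below are toy assumptions.
from gensim.models import Word2Vec

# Tiny tokenized corpus; real training uses millions of sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # dense embedding dimension (vs. a vocab-sized one-hot vector)
    window=3,        # context window around each target word
    min_count=1,     # keep every word in this tiny corpus
    sg=1,            # use the skip-gram variant of Word2Vec
)

vector = model.wv["cat"]                     # a 50-dimensional dense vector
print(model.wv.most_similar("cat", topn=3))  # nearest words by cosine similarity
```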

Thanks to advances in deep learning and increasing computational capabilities, Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), a variant of the RNN, rose in popularity in 2014 and 2015. Andrej Karpathy’s blog post entitled “The Unreasonable Effectiveness of Recurrent Neural Networks” is a famous and well-referenced love letter to RNNs. RNNs and LSTMs enabled the processing of textual sequence data, where the order of the inputs matters; before them there was no good way to model that order. LSTMs improved on RNNs for long sequences. A plain RNN suffers from the vanishing gradient problem: the influence of early inputs shrinks toward zero as the sequence grows. The LSTM’s gates learn which information in the sequence is important and prevent it from being forgotten. A further variant of the RNN, the Gated Recurrent Unit (GRU), is very similar to the LSTM but uses a simpler gating mechanism to retain long-sequence information. RNNs and LSTMs were the bread and butter of NLP tasks for a few years; everyone used them. But it wasn’t long before they were replaced by an even better architecture: attention networks!
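As a concrete illustration of the architecture, here is a minimal sketch of an LSTM sequence classifier in PyTorch. The vocabulary size, dimensions, and binary-classification task are illustrative assumptions rather than any particular published setup.

```python
# A minimal sketch of an LSTM text classifier in PyTorch;
# all sizes and the binary task are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # word IDs -> dense vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(embedded)  # h_n: final hidden state per sequence
        return self.fc(h_n[-1])            # classify from the last hidden state

model = LSTMClassifier()
logits = model(torch.randint(0, 10_000, (4, 20)))  # batch of 4 sequences, length 20
print(logits.shape)  # torch.Size([4, 2])
```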


Attention-based networks became popular around 2015–2016. Attention networks are neural networks that can focus on a specific subset of the input: you can let the network learn what to pay attention to. These models have been breaking performance records on many NLP tasks, such as neural machine translation, language modeling, and question answering. Attention networks are also more efficient, requiring fewer computational resources. This is an important improvement, as training RNNs often requires significant computing power in the form of a GPU, which is not always accessible.
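To make the mechanism concrete, here is a minimal numpy sketch of attention in its scaled dot-product form (the formulation the 2017 Transformer paper later standardized; the early translation models used an additive variant). The shapes and random data are toy assumptions.

```python
# A minimal sketch of scaled dot-product attention; shapes are toy assumptions.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight the values V by how well each query matches each key."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V, weights                     # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))  # 3 query positions, dimension 8
K = rng.normal(size=(5, 8))  # 5 key positions
V = rng.normal(size=(5, 8))  # one value vector per key
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.shape)  # (3, 5): how strongly each query attends to each input position
```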

A specific type of attention-based network introduced in 2017, the Transformer, has been especially dominant in modern NLP. The Transformer is similar to RNNs in that it handles sequence data, but it does not process the sequence one step at a time: it ingests all positions at once, with word order supplied by positional encodings. Because of this, the Transformer can be parallelized, training faster and on much more data. The Transformer led to our current state in NLP: the era of BERT, ERNIE 2.0, and XLNet.
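Since the Transformer sees all positions at once, word order has to be injected explicitly. Here is a minimal numpy sketch of the sinusoidal positional encoding from the original Transformer paper; the sequence length and model dimension are illustrative.

```python
# A minimal sketch of sinusoidal positional encodings; sizes are illustrative.
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # even embedding dimensions
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)  # cosine on odd dimensions
    return pe                     # added to word embeddings before the attention layers

pe = positional_encoding(seq_len=50, d_model=512)
print(pe.shape)  # (50, 512): one encoding vector per position
```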


Bidirectional Encoder Representations from Transformers (BERT) models were introduced in 2018 by researchers at Google, and versions of BERT are among the most advanced NLP models available. BERT is a deeply bidirectional, unsupervised model used for pre-training word representations that can later be fine-tuned for a downstream NLP task. Bidirectionality is crucial here: the model conditions on both the left and right context of each word, rather than reading the text in a single direction, which results in better representations and better model performance.

While the concept of BERT is similar to Word2Vec and GloVe, the BERT word vectors are context sensitive! With Word2Vec and GloVe, a word with high contextual variety (“I am feeling blue” versus “blue is my favorite color”) is represented by a single vector. You can guess that this kind of representation can hurt downstream model performance, as word meaning relies heavily on context. With BERT, the two uses of the word blue are represented by different vectors.
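As a quick check of this context sensitivity, here is a minimal sketch using the Hugging Face transformers library. The bert-base-uncased checkpoint and the cosine-similarity comparison are my own illustrative choices, not something from the BERT paper.

```python
# A minimal sketch: the vector for "blue" depends on its sentence context.
# The checkpoint and similarity check are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word="blue"):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]    # (seq_len, 768)
    position = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids(word))           # locate "blue"
    return hidden[position]                              # its in-context vector

v1 = vector_for("I am feeling blue.")
v2 = vector_for("Blue is my favorite color.")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1.0: different vectors
```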

BERT was just the tip of the iceberg for attention-based architectures. In 2019, researchers at Carnegie Mellon and Google created XLNet. The paper claims that XLNet “outperforms BERT on 20 tasks, often by a large margin.” Unlike other recent advancements in NLP, the architecture is not drastically different: like BERT, XLNet relies on an attention-based network. In the summer of 2019, the Chinese technology company Baidu published a paper on another attention-based network, ERNIE 2.0, claiming that it outperforms BERT and XLNet on 16 tasks, including Chinese-language tasks. Like BERT, ERNIE 2.0 and XLNet are both pre-training models built on transformer architectures and attention mechanisms. While the original BERT model is no longer king, versions of BERT, such as RoBERTa, remain competitive at the top of NLP leaderboards.

In conclusion, there is no single best NLP model at the moment, but attention-based transformer networks are the reigning architecture. The top models perform well on different tasks, and each has its own advantages and drawbacks. With all of these competing models, it can be difficult to figure out which one is best for your task. One of my new favorite resources is paperswithcode.com, which conveniently organizes research papers by machine learning task, making it easy to stay up to date on the newest models and architectures.

So there you have it: a brief history of the rapid advances in NLP over the last decade. NLP is an ever-changing and developing field, certainly not for the data scientist who prefers model stability. But that is part of the fun! We will see how long the era of attention-based networks lasts.
