
This is the second part of my earlier blog, Journey to BERT. In this post, I continue the narrative and explain the remaining conceptual milestones in the evolution towards BERT.
So Far
- Pre-neural word representations such as TF-IDF and GloVe
- Common NLP tasks such as classification, question answering, and text summarization
- Neural embeddings with Word2Vec
- Using deep sequential models (RNNs) for NLP
- Discovering attention and bi-directionality
With the following models and approaches, we continue through the subsequent advancements in NLP. From here on, the most sought-after attributes are transfer learning, contextual representations, and scale.
TagLM
TagLM (Language-Model-augmented sequence tagger) was perhaps the first serious venture into producing contextual word embeddings. Matthew Peters et al. published the paper in 2017, demonstrating how non-contextual word embeddings can be augmented with language-model embeddings in a sequence-tagging task. The language-model embeddings come from bi-directional RNNs pre-trained on unlabelled text, and are concatenated with the hidden-layer outputs of the multi-layer bi-directional RNN used for sequence tagging.
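To make the idea concrete, here is a minimal PyTorch sketch of the augmentation step (the dimensions and module names are hypothetical, not the paper's exact setup): frozen, pre-trained LM embeddings are simply concatenated with the tagger's own bi-RNN outputs before the next recurrent layer.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only
word_dim, lm_dim, hidden_dim, n_tags = 100, 512, 256, 9

class TagLMSketch(nn.Module):
    """Sequence tagger augmented with frozen, pre-trained LM embeddings."""
    def __init__(self):
        super().__init__()
        # First bi-RNN layer over (non-contextual) word embeddings
        self.rnn1 = nn.LSTM(word_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Second bi-RNN layer consumes [rnn1 output ; LM embedding]
        self.rnn2 = nn.LSTM(2 * hidden_dim + lm_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.tag_head = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, word_vecs, lm_vecs):
        h1, _ = self.rnn1(word_vecs)              # (batch, seq, 2*hidden)
        h1 = torch.cat([h1, lm_vecs], dim=-1)     # augment with LM embeddings
        h2, _ = self.rnn2(h1)
        return self.tag_head(h2)                  # per-token tag scores

# Dummy batch: 2 sentences, 7 tokens each
tags = TagLMSketch()(torch.randn(2, 7, word_dim), torch.randn(2, 7, lm_dim))
print(tags.shape)  # torch.Size([2, 7, 9])
```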

TagLM achieved impressive results on NER on the CoNLL dataset, beating all previous state-of-the-art approaches.
CoVe
Context Vectors (CoVe) take an approach to contextual word vectors comparable to TagLM. The paper was written by McCann et al. in 2017, and its main motivation was the successful transfer of learning from ImageNet to downstream tasks in computer vision. The authors suggest that the NLP equivalent of ImageNet could be the LSTM-based encoder of an attentional sequence-to-sequence model trained for neural machine translation (NMT). The pre-trained encoder can then serve as a source of contextual word embeddings that augment a conventional word-vector system such as GloVe. The NMT model is trained on machine-translation datasets, which are essentially pairs of sentences in two languages.


The paper suggested combining the CoVe vectors (the last layer of the bi-LSTM encoder) with GloVe vectors and illustrated performance gains on some common NLP tasks. Unlike TagLM, however, CoVe needs labelled data (pairs of sentences in two languages) to train the encoder on the machine-translation task, which is an obvious limitation of the approach. Moreover, the actual performance gains on a downstream task still depend heavily on the architecture used for that task.
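A minimal sketch of the CoVe recipe, using a frozen stand-in for the pre-trained NMT encoder (in the paper this is a two-layer bi-LSTM lifted from the translation model):

```python
import torch
import torch.nn as nn

glove_dim, cove_dim = 300, 600  # illustrative sizes

# Stand-in for the pre-trained, frozen NMT encoder
nmt_encoder = nn.LSTM(glove_dim, cove_dim // 2, num_layers=2,
                      bidirectional=True, batch_first=True)

def cove_features(glove_vecs):
    """Return [GloVe ; CoVe] features for a batch of sentences."""
    with torch.no_grad():                     # the encoder stays frozen
        cove_vecs, _ = nmt_encoder(glove_vecs)
    return torch.cat([glove_vecs, cove_vecs], dim=-1)

feats = cove_features(torch.randn(2, 7, glove_dim))
print(feats.shape)  # torch.Size([2, 7, 900])
```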
ELMo
Embeddings from Language Models (ELMo) is, in some sense, a refinement of TagLM from the same group (Peters et al.). The authors suggest that a bi-directional language model learnt in an unsupervised way over a large corpus carries both the semantic and the syntactic connotations of words. The lower layers of the model capture syntactic information (useful for NER, POS tagging), while the upper layers capture semantic information (useful for sentiment analysis, question answering, semantic similarity, etc.). Hence, instead of using only the last layer (as done in TagLM, or in CoVe, where the pre-trained network is frozen), taking a learned linear combination of all the layers gives a richer estimate of the contextual meaning of a word. ELMo representations are hence considered 'deep'.
In ELMo, the language model is learnt in the usual way: predicting the next word from the preceding sequence of words in a sentence, in both directions, with the negative log-likelihood as the loss function.
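Written out (roughly following the notation of the ELMo paper), the bi-directional LM objective and the task-specific weighted combination of layers are:

```latex
% Bi-directional LM objective: maximize the log-likelihood in both directions
\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1,\dots,t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s)
              + \log p(t_k \mid t_{k+1},\dots,t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)

% Task-specific ELMo vector for token k: a learned, softmax-normalized
% mixture of all L+1 bi-LM layers, scaled by a task-specific \gamma
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}
```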

ELMo is an improvement over TagLM and CoVe, and this can be attributed to the fact that its representations are 'deep'. The paper illustrated that ELMo achieved consistent performance gains on a variety of common NLP tasks.

ELMo, however, still carries the shortcoming that performance gains depend heavily on the architecture of the downstream task.
A contemporary of ELMo was another proposal from Jeremy Howard et al. (fast.ai): Universal Language Model Fine-tuning for Text Classification (ULMFiT). In addition to the language-model pre-training step (using an RNN), ULMFiT proposed fine-tuning the language model on the target dataset, the rationale being to learn the 'distribution', or task-specific features, of the target data. The final step is task-specific fine-tuning, for example of a classifier built from a few linear blocks.
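Here is a rough, toy sketch of the three ULMFiT stages with a plain LSTM (the actual method uses an AWD-LSTM plus tricks such as discriminative learning rates and gradual unfreezing, all omitted here):

```python
import torch
import torch.nn as nn

vocab, emb, hid, n_classes = 1000, 64, 128, 2   # toy sizes for illustration

# Shared encoder reused across all three ULMFiT stages
embedding = nn.Embedding(vocab, emb)
encoder = nn.LSTM(emb, hid, batch_first=True)
lm_head = nn.Linear(hid, vocab)                                  # next-token prediction
clf_head = nn.Sequential(nn.Linear(hid, 64), nn.ReLU(), nn.Linear(64, n_classes))

def lm_step(batch):
    """One next-word-prediction step (used in stages 1 and 2)."""
    h, _ = encoder(embedding(batch[:, :-1]))
    logits = lm_head(h)
    return nn.functional.cross_entropy(logits.reshape(-1, vocab),
                                       batch[:, 1:].reshape(-1))

def clf_step(batch, labels):
    """Stage 3: classify using the final hidden state of the fine-tuned encoder."""
    h, _ = encoder(embedding(batch))
    return nn.functional.cross_entropy(clf_head(h[:, -1]), labels)

# Stage 1: pre-train the LM on a large general corpus
# Stage 2: fine-tune the same LM on the target-domain corpus
# Stage 3: attach the classifier head and fine-tune on the labelled task data
print(lm_step(torch.randint(0, vocab, (4, 12))))
print(clf_step(torch.randint(0, vocab, (4, 12)), torch.randint(0, n_classes, (4,))))
```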
Towards Transformers
The Transformer architecture has become 'the' underlying building block of more modern approaches such as the BERT family and the GPT series. First proposed in the paper 'Attention Is All You Need' by Vaswani et al. in 2017, it presents an alternative to RNNs (and their variants) for processing sequential data.

For a comprehensive understanding, it is better to refer to this excellent article by Jay Alammar. In concise form, though, the architecture has the following important elements.
- Multi-head self-attention: At a very high level, self-attention lets each word reference other words in the sequence to build up its meaning; basically, it is another way of capturing (long-term) dependencies. Being 'multi-head' means employing several attention heads, each attending to the sequence in a different representational sub-space, something like using multiple brains. Mathematically, the attention output is a softmax over the (scaled) dot products of queries and keys, multiplied with the value matrix (see the sketch after this list).

- (Sinusoidal) positional encodings: Interestingly, the Transformer is not sequential in nature; in fact, it looks at and processes the entire sequence at once. The positional encoding therefore encapsulates the order of tokens. It is a vector, added to each token embedding, created using sine and cosine functions (also shown in the sketch below). An excellent reference here.
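Here is a minimal, single-head sketch of both ingredients (toy dimensions, no masking or multi-head projections):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(QK^T / sqrt(d_k)) V -- the core of each attention head."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # pairwise similarities
    weights = torch.softmax(scores, dim=-1)             # attention distribution
    return weights @ v                                   # weighted sum of values

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos position vectors added to token embeddings."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Toy example: 5 tokens, model dimension 16, a single head
x = torch.randn(5, 16) + sinusoidal_positional_encoding(5, 16)
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V = x
print(out.shape)  # torch.Size([5, 16])
```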
So what benefits does the Transformer architecture bring over RNNs (bi-LSTMs)?
- Vanishing gradients: There is no concept of memory gates in Transformers; this information-loss-prone mechanism is circumvented by giving every position direct access to all parts of the sequence.
- Long-term dependencies: Transformers are better at capturing long-term dependencies because of the multi-head self-attention layers.
- Bi-directional by design: The Transformer encoder reads the entire sequence at once and uses the full surroundings of each word, both before and after it. It is therefore inherently bi-directional; in fact, many argue that it is non-directional.
Generative Pre-Trained Transformers (GPT)
First introduced in 2018 by Radford et al. (just before BERT), GPT was one of the first models to use the Transformer architecture. The authors from OpenAI presented it as an effective combination of two existing ideas: (a) unsupervised pre-training (as seen in ELMo) and (b) Transformers.
The framework has two major components:
1. Unsupervised pre-training (using Transformers): essentially, maximizing the likelihood of a token given a context of preceding tokens, with respect to the parameters of the network (the objectives are written out right after this list).


The paper proposes a multi-layer (12-layer) Transformer decoder for this, which basically consists of multi-headed self-attention layers plus position-wise feed-forward layers, producing a distribution over target tokens via a softmax. This variation of the original Transformer architecture is uni-directional (left to right), as self-attention attends only to the left context.
2. Supervised fine-tuning: For a downstream task such as classification, the labelled data is fed through the pre-trained model to obtain representations, and the Transformer decoder is fine-tuned. An additional linear + softmax layer handles the final classification. The authors also propose keeping the language-modelling objective as an auxiliary loss during fine-tuning, which gives better generalization.
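Written out (following the notation of the GPT-1 paper), the pre-training objective, the fine-tuning objective, and their combination with the auxiliary LM loss are:

```latex
% 1. Unsupervised pre-training: left-to-right language-model likelihood
%    over a context window of k previous tokens
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)

% 2. Supervised fine-tuning on labelled examples (x, y), with the LM loss
%    retained as an auxiliary objective weighted by \lambda
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)

L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
```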
In addition to the features above, 'scale' was another attribute of GPT-1. It was trained on the massive BooksCorpus dataset, using 240 GPU-days of compute. All subsequent models after GPT-1 were trained on ever larger volumes of data, with powerful GPUs/TPUs and more and more parameters. GPT-1 successfully demonstrated that massive Transformer pre-training plus a little supervised fine-tuning with an auxiliary learning objective can cater to various NLP tasks (NLI, question answering, classification). In fact, it outperformed various state-of-the-art models of the time.
The GPT-1 model, however, is uni-directional (left to right) in nature, as self-attention is based only on previous tokens, something that BERT addresses.
BERT
A long journey indeed! BERT (Bidirectional Encoder Representations from Transformers) was published by Devlin et al. at Google shortly after GPT-1. Overall, the approach looks very similar to GPT-1: unsupervised language-model learning followed by a supervised fine-tuning step. However, BERT's architecture stays closer to the original Transformer by Vaswani et al. and is based on a multi-layer bidirectional Transformer encoder, whereas the GPT-1 architecture is a left-context-only (unidirectional) version of it, commonly referred to as a 'Transformer decoder'.

The authors' main argument was that unidirectional pre-training limits the representations available to downstream tasks and is hence sub-optimal. For example, a unidirectionally pre-trained model fine-tuned for question answering is sub-optimal because context information from both directions is not exploited. Since BERT is bi-directional, a standard language-modelling objective is unfit as a learning task: in the Transformer architecture, all words are fed in at once (and hence are all accessible), so each word could effectively 'see itself' through the bidirectional context and the prediction would become trivial.

BERT addresses this with 'Masked Language Modelling' (MLM), which essentially masks random tokens in the text and trains the model to predict them.
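A rough sketch of the masking procedure (in BERT, 15% of tokens are selected; of those, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (masked_tokens, labels) for one masked-LM training example.
    labels[i] holds the original token where a prediction is required, else None."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # the model must recover this token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random token
            # else 10%: keep the original token unchanged
    return masked, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab))
```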

In addition to MLM, BERT also employs another learning objective called 'Next Sentence Prediction' (NSP). In NSP, the objective is to classify whether one sentence actually follows another given sentence. Intuitively, this helps the model learn relations between sentences.
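A minimal sketch of how NSP training pairs can be constructed (a hypothetical helper, not BERT's actual data pipeline): half the pairs use the true next sentence, half use a random sentence from the corpus.

```python
import random

def make_nsp_example(doc_sentences, all_sentences):
    """Build one NSP example: (sentence_a, sentence_b, is_next_label)."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        return sent_a, doc_sentences[i + 1], 1            # actual next sentence
    return sent_a, random.choice(all_sentences), 0        # random sentence from corpus

doc = ["I bought a book.", "It was about NLP.", "I read it twice."]
print(make_nsp_example(doc, doc + ["The weather is nice today."]))
```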

Fine-tuning is the second phase in BERT, just like in GPT-1. The modifications (input representation and output layers) are essentially task-specific. For example, for a classification task, the final hidden state of the [CLS] token (the special first token of a sequence) is fed to a classifier network. The learning is end to end, which means that all layers and their weights continue to be updated.
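A minimal sketch of that classification setup, using a random stand-in for the pre-trained encoder (in practice you would load real BERT weights, e.g. via the Hugging Face transformers library):

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2   # BERT-base hidden size, binary classification

# Stand-in for a pre-trained BERT encoder: one hidden vector per input token.
# With a real encoder, the loss below back-propagates through all BERT layers too,
# which is what makes the fine-tuning end to end.
def bert_encoder(input_ids):
    return torch.randn(input_ids.size(0), input_ids.size(1), hidden_size)

classifier = nn.Linear(hidden_size, num_labels)

def classify(input_ids, labels):
    hidden = bert_encoder(input_ids)     # (batch, seq_len, hidden)
    cls_vec = hidden[:, 0]               # final hidden state of the [CLS] token
    logits = classifier(cls_vec)
    return nn.functional.cross_entropy(logits, labels), logits

loss, logits = classify(torch.randint(0, 30522, (4, 16)), torch.tensor([0, 1, 1, 0]))
print(loss.item(), logits.shape)
```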

BERT was proposed in two flavors, BERT Base and BERT Large. They primarily differ in the number of layers (Transformer blocks): Base has 12 layers and 110M parameters, Large has 24 layers and 340M parameters. BERT is indeed a milestone in natural language processing, successfully demonstrating a state-of-the-art approach that enables transfer learning based on Transformers (self-attention), bi-directionality, and clever learning objectives. And, of course, it was trained on a large corpus (BooksCorpus + English Wikipedia, with 256 TPU-days of training).
Beyond BERT
There have already been many advances since the original BERT paper, in various directions. There are more sophisticated variants such as RoBERTa, which is trained for longer on a larger corpus and employs cleverer training choices (such as dynamic masking and dropping NSP). Another variant, ALBERT, aims to produce a smaller model by using parameter-reduction techniques. ELECTRA and XLNet are a few other interesting variations.
There has also been active research into making the BERT model lightweight (BERT Large has ~340M parameters). Several approaches have been proposed for this, such as weight pruning, quantization, and distillation (DistilBERT). Here is an excellent blog on the topic: https://blog.inten.to/speeding-up-bert-5528e18bb4ea
Summary
There has been rapid and tremendous growth in the NLP world: from statistical representations of text to context-aware neural representations; from statistical and classical ML approaches to deep-learning-based sequence models; discovering attention and bi-directionality along the way and realizing the power of transfer learning; and finally arriving at the sophisticated Transformer architecture. Modern NLP frameworks have come a long way in leveraging these important milestones, and scale.
References
- http://jalammar.github.io/illustrated-bert/
- http://web.stanford.edu/class/cs224n/
- https://devopedia.org/bert-language-model
- https://www.cs.princeton.edu/courses/archive/fall19/cos484/lectures/lec16.pdf
- https://openai.com/blog/better-language-models/
- https://neptune.ai/blog/ai-limits-can-deep-learning-models-like-bert-ever-understand-language