Understanding BERT: Is it a Game Changer in NLP?

Bharat S Raj
Towards Data Science
7 min read · Oct 11, 2019


One of the most path-breaking developments in NLP was the release of BERT, widely described as the ImageNet moment for NLP: a model that substantially outperforms traditional NLP approaches. It has also inspired many recent architectures, training approaches and language models, such as Google’s Transformer-XL, OpenAI’s GPT-2, ERNIE 2.0, XLNet and RoBERTa.

Let’s take a deep dive into BERT and its potential to transform NLP.

What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is an open-sourced NLP pre-training model developed by researchers at Google in 2018. A direct descendant of OpenAI’s GPT (Generative Pre-Training), BERT has outperformed several models in NLP and delivered top results in Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and other benchmarks.

It builds on recent work in pre-training contextual representations, including Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from AI2 and UW CSE), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI Transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer (Vaswani et al.).

What makes it unique is that it is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain-text corpus. Since it is open-sourced, anyone with machine-learning knowledge can build an NLP model on top of it without sourcing massive datasets for training, saving time, energy, knowledge and resources.

Finally, BERT is pre-trained on a large corpus of unlabelled text that includes the entire English Wikipedia (about 2,500 million words) and BookCorpus (800 million words).

How does it work?

Traditional context-free models (like word2vec or GloVe) generate a single word-embedding representation for each word in the vocabulary, which means the word “right” would have the same context-free representation in “I’m sure I’m right” and “Take a right turn.” BERT, in contrast, represents each word based on both its previous and next context, making it bidirectional. While the idea of bidirectional context had been around for a long time, BERT was the first of its kind to successfully pre-train deep bidirectional representations in a deep neural network. A minimal sketch of the contrast follows.
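
Here is a small sketch of contextual representations, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (an assumption about tooling, not part of the original post). The vector BERT produces for “right” differs between the two sentences, whereas a word2vec-style lookup table would return one fixed vector for the word.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["I'm sure I'm right.", "Take a right turn."]
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]  # (sequence_length, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        # Pull out the contextual vector for the token "right" in this sentence.
        right_vector = hidden[tokens.index("right")]
        print(text, right_vector[:5])  # the two "right" vectors come out different
```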

How did they achieve this?

Source: BERT [Devlin et al., 2018]

They use two training strategies. The first is the Masked Language Model (MLM): some of the words in the input are masked out, and the model conditions on context from both directions to predict the masked words. Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.
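
A minimal fill-in-the-blank sketch, again assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; it shows a pre-trained MLM head predicting a masked word from its bidirectional context.

```python
from transformers import pipeline

# Masked-word prediction with a pre-trained BERT checkpoint.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The pipeline returns the top candidate tokens for the [MASK] position with scores.
for prediction in unmasker("Take a [MASK] turn at the next intersection."):
    print(prediction["token_str"], round(prediction["score"], 3))
```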

The second technique is Next Sentence Prediction (NSP), through which BERT learns to model relationships between sentences. During training, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the sentence that follows the first in the original document. Given two sentences A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence? For example, “The man went to the store.” followed by “He bought a gallon of milk.” should be labelled IsNext, while the same sentence followed by “Penguins are flightless birds.” should be labelled NotNext.
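
A short NSP sketch under the same tooling assumption (Hugging Face transformers, bert-base-uncased); the pre-trained NSP head should assign the genuine continuation a much higher “is next” probability than the random sentence.

```python
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

sentence_a = "The man went to the store."
for sentence_b in ["He bought a gallon of milk.", "Penguins are flightless birds."]:
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # In this head, index 0 scores "B follows A" and index 1 scores "B is random".
    probs = logits.softmax(dim=-1)[0]
    print(sentence_b, "-> IsNext probability:", round(probs[0].item(), 3))
```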

When training the BERT model, the two techniques are used together, and the model minimizes the combined loss function of the two strategies.

Architecture

BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional. Image Source: Google AI Blog

The BERT architecture builds on top of the Transformer. There are two variants available (a quick check of their sizes follows the list):

· BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters

· BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters
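
A quick way to verify these figures, assuming the Hugging Face transformers library: each checkpoint reports its layer and head counts in its config, and the parameters can simply be summed (the totals come out close to the numbers quoted above).

```python
from transformers import BertModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {model.config.num_hidden_layers} layers, "
          f"{model.config.num_attention_heads} heads, "
          f"{n_params / 1e6:.0f}M parameters")
```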

Results

On SQuAD v1.1, BERT achieves a 93.2% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and the human-level score of 91.2%. BERT also improves the state of the art by 7.6% absolute on the very challenging GLUE benchmark, a set of nine diverse Natural Language Understanding (NLU) tasks.

BERT is here — But is it ready for the real world?

BERT is undoubtedly a milestone in the use of Machine Learning for Natural Language Processing. But we need to consider how BERT can actually be used in various NLP scenarios.

Text Classification and Categorization has been one of the prime applications of NLP. For example, the concept has been used in ticketing tools to classify tickets based on the short description or email and to categorize or route the ticket to the right team for resolution. Similarly, it can be used to classify whether an email is spam or not; a minimal sketch follows.
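
A minimal spam-classification sketch, assuming the Hugging Face transformers library; the two labels and the example email are hypothetical, and the classification head is freshly initialized, so it would still need fine-tuning on labelled tickets or emails before its outputs mean anything.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Hypothetical two-class setup: 0 = not spam, 1 = spam.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

inputs = tokenizer("Congratulations, you have won a free cruise!", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # class probabilities (only meaningful after fine-tuning)
```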

You can find some applications of it already used in your daily life.

Gmail’s Suggested Replies, Smart Compose & Google Search Autocomplete

Chatbots are disrupting the messaging industry with their ability to answer user queries and handle a variety of tasks. However, one of their biggest limitations has been intent recognition and capturing entities from sentences.

A Question Answering (QnA) model is one of the most fundamental systems in Natural Language Processing. In QnA, a machine-learning-based system generates answers from a knowledge base or text paragraphs for the questions posed as input. Can BERT be used in a chatbot? Certainly, yes. BERT is now being utilized in many conversational AI applications, so your chatbots should be getting smarter.
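
A short extractive question-answering sketch, assuming the Hugging Face transformers library and a BERT checkpoint fine-tuned on SQuAD; the model name used here is one commonly published SQuAD fine-tune and the question/context pair is made up, so treat it as an illustration rather than a recipe.

```python
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

result = qa(question="What was BERT pre-trained on?",
            context="BERT was pre-trained on the English Wikipedia and BookCorpus, "
                    "a large corpus of unlabelled text.")
print(result["answer"], round(result["score"], 3))  # span extracted from the context
```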

However, BERT can be used only for answering questions from very short paragraphs, and a lot of key issues still need to be addressed. NLP as a general task is far more complex, with many more meanings and subtleties. BERT solves only a part of it, but it is certainly going to change entity-recognition models soon.

BERT today can address only a limited class of problems. However, there are many other tasks, such as sentiment detection, classification, machine translation, named entity recognition, summarization and question answering, that need to be built upon. A common criticism is that such models manipulate representations without any kind of understanding, and that adding simple adversarial content that modifies the original text can confuse them.

The true benefits of BERT in NLP will only be realized when there is broader adoption in operations and improvement in live scenarios, supporting a wide range of applications across organizations and users.

However, things are changing quickly with a wave of transformer-based methods (GPT-2, RoBERTa, XLNet) that keep raising the bar by demonstrating better performance, easier training, or some other specific benefit.

Let’s look at some of the other developments that came after BERT’s introduction.

RoBERTa

Developed by Facebook, RoBERTa builds on BERT’s language-masking strategy and modifies some of BERT’s key hyperparameters. To improve the training procedure, RoBERTa removes the Next Sentence Prediction (NSP) task from BERT’s pre-training and introduces dynamic masking, so that the masked tokens change across training epochs (a toy sketch of the idea follows). It was also trained on an order of magnitude more data than BERT, for a longer amount of time.
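
A toy sketch of the dynamic-masking idea in plain Python. It is deliberately simplified (the real recipe also replaces some selected tokens with random words or leaves them unchanged, and RoBERTa’s actual mask token is different); the point is only that the masked positions are re-sampled every time a sequence is seen, rather than fixed once during preprocessing as in the original static masking.

```python
import random

def dynamically_mask(tokens, mask_token="[MASK]", mask_prob=0.15):
    """Return a freshly masked copy of a token list.

    Static masking fixes the masked positions once while the training data is
    prepared; calling this on every pass approximates dynamic masking, where
    the masked positions change between epochs.
    """
    return [mask_token if random.random() < mask_prob else tok for tok in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()
print(dynamically_mask(tokens))  # different positions are masked on each call
```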

DistilBERT

Developed by HuggingFace, DistilBERT learns a distilled (approximate) version of BERT, retaining 95% of its performance on GLUE while using only about half the number of parameters (66 million instead of 110 million). The concept is that once a large neural network has been trained, its full output distributions can be approximated by a smaller network (similar to posterior approximation); a simplified sketch of the distillation loss follows.
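
A simplified sketch of the soft-target distillation term, assuming PyTorch. DistilBERT’s actual objective also combines a masked-language-modelling loss and a cosine embedding loss, which are omitted here; this only illustrates how a smaller student can be trained to match a larger teacher’s output distribution.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy example: vocabulary-sized logits for one masked position.
teacher = torch.randn(1, 30522)
student = torch.randn(1, 30522)
print(soft_target_loss(student, teacher))
```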

XLM/mBERT

Developed by Facebook, XLM uses a well-known pre-processing technique, Byte Pair Encoding (BPE), and a dual-language training mechanism with BERT in order to learn relations between words in different languages. The model outperforms other models on a multi-lingual classification task and significantly improves machine translation when a pre-trained model is used to initialize the translation model.

ALBERT

Jointly developed by Google Research and the Toyota Technological Institute, ALBERT (A Lite BERT for Self-Supervised Learning of Language Representations) is positioned as a successor to BERT that is much smaller, lighter and smarter. Two key architecture changes allow ALBERT to both outperform BERT and dramatically reduce the model size. The first addresses the number of parameters: ALBERT improves parameter efficiency by sharing all parameters across all layers, meaning the feed-forward network parameters and the attention parameters are all shared.

Researchers also isolated the size of the hidden layers from the size of vocabulary embeddings. This was done by projecting one-hot vectors into a lower-dimensional embedding space and then to the hidden space, which made it easier to increase the hidden layer size without significantly increasing the parameter size of the vocabulary embeddings.
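
A back-of-the-envelope illustration of why this factorization shrinks the embedding table, using illustrative sizes (a 30,000-token vocabulary, a BERT-style hidden size of 768, and an ALBERT-style embedding size of 128; the exact numbers are assumptions for the sake of the arithmetic).

```python
V, H, E = 30_000, 768, 128

direct_projection = V * H          # one-hot -> hidden space directly
factorized        = V * E + E * H  # one-hot -> small embedding -> hidden space

print(f"V*H       = {direct_projection:,}")  # ~23.0M embedding parameters
print(f"V*E + E*H = {factorized:,}")         # ~3.9M embedding parameters
```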

When it comes to pre-training, ALBERT has its own objective, called Sentence-Order Prediction (SOP), as opposed to NSP. The problem with NSP, as theorized by the authors, is that it conflates topic prediction with coherence prediction.

ALBERT represents a new state of the art for NLP on several benchmarks and a new state of the art for parameter efficiency. It’s an amazing breakthrough that builds on the great work done by BERT one year ago and advances NLP in multiple aspects.

BERT and models like it are certainly game-changers in NLP. Machines can now better understand language and respond intelligently in real time. Many BERT-based models are being developed, including VideoBERT, ViLBERT (Vision-and-Language BERT), PatentBERT, DocBERT, etc.

What are your thoughts on the state of NLP and BERT?
