
Waves of hype around Data Science technologies surge day after day. Many of you may be tired of surfing through hypes, large and small, to find a solution for a new proposal or to appear familiar with new AI technologies in a meeting. This article is aimed at managers without a strong data science background, junior-level data scientists, and developers interested in NLP. Its goal is to help you feel comfortable with BERT with minimal background knowledge, so you can ask "Can we use BERT for this project?" intelligently in brainstorming sessions.
Table of Contents
- Why should we learn about BERT now and use it?
- How does BERT work in layman’s terms?
- What value does BERT bring to your projects?
- Where can we learn more about BERT?
1. Why Should We Learn about BERT Now and Use It?
It is sometimes a good strategy to postpone learning about new technologies in the data science domain, given the hype and biased success stories that surround them. Let's start with my short answer about the "now." BERT was developed and published by Google in 2018, and after two years the open-source community around this approach has matured, so we can use an amazing toolbox built by that community. We should learn about BERT because it alleviates the training-efficiency problems of conventional recurrent neural networks, as Culurciello discussed in The fall of RNN / LSTM. We should use BERT because it is easy to fine-tune and use models thanks to Hugging Face's framework, and we can switch from BERT to other state-of-the-art NLP models with small modifications to our code – sometimes just a few lines.
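As a rough illustration of how small that switch can be, here is a minimal sketch using Hugging Face's transformers library (the checkpoint names and the two-label setup are just examples, not a prescribed recipe): moving from BERT to a derivative such as DistilBERT or RoBERTa is often only a change of the checkpoint string.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load BERT for a two-class text classification task.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Switching to another model is usually just a new checkpoint name,
# e.g. "distilbert-base-uncased" or "roberta-base"; the rest of the code stays the same.
```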
2. How Does BERT Work in Layman’s Terms?
The official name of BERT is rather long: it stands for Bidirectional Encoder Representations from Transformers. Let's start with the last word, which is the most important. A Transformer is a type of neural network architecture, and BERT and its derivatives inherit it. Unlike recurrent neural networks, Transformers do not require the input sequence to be processed in order. In other words, a Transformer does not need to process the beginning of a sentence before it processes the middle or the end. An RNN, on the other hand, has to process the input in order, which creates bottlenecks. This property gives the Transformer much more freedom to parallelize model training.
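To make the difference concrete, here is a minimal PyTorch sketch (the layer sizes are arbitrary): the LSTM has to walk through the sequence one position at a time because each step depends on the previous hidden state, while a Transformer encoder layer attends to every position in a single parallel pass.

```python
import torch
import torch.nn as nn

seq_len, batch, d_model = 10, 1, 64
tokens = torch.randn(seq_len, batch, d_model)  # toy token embeddings

# RNN: each step needs the hidden state of the previous step,
# so the sequence is consumed position by position.
lstm = nn.LSTM(input_size=d_model, hidden_size=d_model)
hidden = None
for t in range(seq_len):
    _, hidden = lstm(tokens[t:t + 1], hidden)

# Transformer encoder layer: self-attention sees all positions at once,
# so the whole sequence is processed in one parallel pass.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
output = encoder_layer(tokens)
```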
Next, let's look at the first word, Bidirectional. If an architecture is directional, it processes the input either from left to right or from right to left. BERT is free from the curse of processing inputs in order, so it can learn the context of a word from all of its surroundings. This approach is called a Masked Language Model, in contrast to the traditional approach of building a language model by predicting the next word given the preceding words as input.
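You can see masked language modeling in action with Hugging Face's fill-mask pipeline; here is a minimal sketch (the sentence is just an example) that asks a pre-trained BERT to fill in a hidden word using context from both sides:

```python
from transformers import pipeline

# BERT predicts the token hidden behind [MASK] using context on both sides.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("The animal didn't cross the street because it was too [MASK].")

for p in predictions:
    print(f"{p['token_str']:>10s}  {p['score']:.3f}")
```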
The remaining part of BERT is Encoder Representations. Before diving into the explanation, let me ask a question: when do you think someone understands a sentence well? Understanding the meanings of the vocabulary in a sentence would not be enough; a reasonable answer is when someone understands the context of the sentence. Then how can we tell whether someone understands the context? To me, one measurable, introductory test is whether someone can identify what pronouns such as "it," "this," or "that" refer to in a sentence. BERT uses Encoders to understand the context of words. The Encoders are identical in architecture, and each one consists of two components: a Self-Attention layer and a Feed-Forward Neural Network. Of these components, I will focus only on Self-Attention.
Let me use the sentence "The animal didn't cross the street because it was too tired." to demonstrate how the Self-Attention mechanism interprets "it." We know "it" refers to "the animal" in this sentence, and the pre-trained BERT model interpreted it correctly, as shown below.

Now we know Self-Attention has the capability to understand the relationships between words, but a fair next question is how well it interprets them. To answer this, I modified the original sentence to end with "because it was too wide." instead of "too tired." As you can see in the image below, the strongest connection still exists between "the animal" and "it," contrary to my expectation.

Each of BERT's Encoders has multiple Self-Attention heads, and my expectation was that one of them would capture the relationship between "it" and "the street," since it was the street that was too wide. Even though BERT has the capability to understand context, it is not yet perfect even on a relatively easy sentence. As a disclaimer, I validated my hypothesis using a pre-trained BERT; other BERT derivatives may perform better thanks to larger training datasets and architectural improvements. If you are interested in understanding how the Self-Attention and Multi-Headed Attention mechanisms work step by step, please read The Illustrated Transformer by Jay Alammar.
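If you want to run this kind of check yourself, here is a minimal sketch of how you might inspect attention weights from a pre-trained BERT with Hugging Face's transformers library (averaging the heads of the last layer is just one simple choice; dedicated visualization tools such as BertViz give much richer views):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "The animal didn't cross the street because it was too wide."
inputs = tokenizer(sentence, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, shaped (batch, heads, seq, seq).
it_index = tokens.index("it")
last_layer = outputs.attentions[-1][0]             # (heads, seq, seq)
avg_attention = last_layer.mean(dim=0)[it_index]   # average over heads, row for "it"

# Which tokens does "it" attend to most strongly?
top = sorted(zip(tokens, avg_attention.tolist()), key=lambda pair: -pair[1])[:5]
for token, weight in top:
    print(f"{token:>10s}  {weight:.3f}")
```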
3. What Value Does BERT Bring to Your Projects?
Successful applications of BERT include Text Classification, Question Answering, and Sentiment Analysis. The results are not as mature yet, but we can also apply BERT to Named Entity Recognition and Text Summarization. For Natural Language Generation, the most successful model is Turing-NLG, a Transformer-based generative language model developed by Microsoft. If you need to tease patterns out of text such as emails, documents, and customer reviews, I suggest that BERT and its derivatives are worth applying at the experiment stage.
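For a quick sense of how little code these applications require, here is a minimal sentiment-analysis sketch with Hugging Face pipelines (the default checkpoint behind this pipeline is a fine-tuned DistilBERT, a lighter BERT derivative; the example texts are made up):

```python
from transformers import pipeline

# The default sentiment-analysis pipeline runs a fine-tuned DistilBERT model.
classifier = pipeline("sentiment-analysis")

reviews = [
    "The support team resolved my issue within an hour. Great service!",
    "I have been waiting two weeks for a refund and nobody replies.",
]
for review, result in zip(reviews, classifier(reviews)):
    print(f"{result['label']:>8s} ({result['score']:.2f})  {review}")
```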
I have listed the common applications of BERT above, and BERT is definitely a great method, but please try not to go hunting for projects that a single method can solve; the true value of BERT and other NLP models lies in creating new metadata for business units and their customers. A project like email classification is an intuitive application of BERT, but for many business problems the categories of emails are an intermediate product rather than the final product. In practice, this intermediate product brings more value to your projects when you combine it with traditional approaches.
As an example, if you are interested in customer churn analysis, you can use BERT to extract metadata from the text communication between customers and agents, such as customer sentiment, the quality of agents' responses, and the categories of reported problems. You can then combine some or all of this metadata with traditional features, such as the recency and frequency of interactions, agents' response times, and the dollar amounts customers have spent. Beyond churn analysis, we can run a marketing analysis that suggests cross-sells or up-sells based on the combination of problems or challenges each customer has faced, using the metadata supplied by BERT to enhance customer satisfaction. Of course, we can also use these combinations of customer-reported problems to improve the quality and features of our own products by forming hypotheses about the failures, and then track how the distribution of categorized problems shifts over time to test whether those hypotheses were correct. I believe we can deliver higher value to business units and customers by connecting these pieces of metadata as dots in a creative manner. I will not touch on it in this article, but the concept of MLOps or ML Pipelines becomes more important as solutions require complicated development and operation processes.
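As a toy sketch of the churn idea (all column names, values, and the model choice are made up for illustration), BERT-derived sentiment from the latest customer message can simply become one more feature next to traditional recency, frequency, and monetary features:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from transformers import pipeline

# Hypothetical customer table: traditional features plus the latest message text.
customers = pd.DataFrame({
    "recency_days":   [3, 45, 12, 90],
    "frequency":      [20, 2, 8, 1],
    "monetary_value": [540.0, 80.0, 210.0, 35.0],
    "last_message":   ["Thanks, that solved it!",
                       "Still waiting for a refund...",
                       "Works fine now, thank you.",
                       "This is the third time I have to ask."],
    "churned":        [0, 1, 0, 1],
})

# BERT-derived metadata: sentiment of the latest message becomes a new feature.
sentiment = pipeline("sentiment-analysis")
customers["negative_sentiment"] = [
    int(result["label"] == "NEGATIVE")
    for result in sentiment(customers["last_message"].tolist())
]

# Combine the new metadata with traditional features in a conventional model.
features = customers[["recency_days", "frequency", "monetary_value", "negative_sentiment"]]
churn_model = RandomForestClassifier(random_state=0).fit(features, customers["churned"])
print(churn_model.predict(features))
```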
4. Where Can We Learn More about BERT?
Let me share some links to deepen your understanding of BERT, since this post only covers the minimal concepts.
For managers, developers and junior-level data scientists
- The Illustrated Transformer by Jay Alammar
This is my favorite blog post, and it would be the best next step. If you don't feel comfortable with linear algebra, you can focus on the big picture by skipping the matrix calculations in the middle of the post.
- Deep Learning 2019 – Image Classification by Jeremy Howard
The series does not cover BERT, but just watching the first video will help you experience a modern deep learning framework and how easy it is to fine-tune a model, compared with using TensorFlow four years ago. Also, you do not need to watch every video if you feel overwhelmed by the number of lessons: Lessons 1, 2, and 4 cover both Image Classification and NLP, which are the minimal domains to go through.
For developers and junior-level data scientists
- The Illustrated Word2vec by Jay Alammar
This is another great post by Jay Alammar. I had a hard time understanding high-dimensional word embeddings, with dimensions such as 256 or 512. This post taught me the concept of word embeddings using just two dimensions and many visual examples.
- BERT Research Series by Chris McCormick
I prefer Jay Alammar's blogs for understanding the concepts because of their creative visual representations. But if you prefer learning from videos and code, this series may work better for you, since Chris McCormick uses Jupyter Notebooks to show how the concepts can actually be implemented.