This is Part 3 in the 3 Part Series on Transformers for Product People. Click here for Part 1. This article relies on concepts and information covered in the previous articles. If you lack familiarity with transformers and GPT, beginning from Part 1 is recommended.
If you are reading this, you design, manage, or invest in technical products. You may also just be a smart engineer or data scientist who has figured out that these articles are very useful for understanding new research. So, do you actually need to understand transformer models, or can you treat them as a black box and still build great products on top of them?
No.

To build cutting-edge products, understand transformer models. Product people at Google understood transformers very well, leading them to revamp a line of products in their core business: Search [1]. Hum to find songs – powered by BERT. Featured snippets – powered by BERT. Key moments in videos – powered by BERT. These features may already look familiar to you:

To understand transformer models, understand BERT and GPT. Each has a unique architecture optimized for specific training tasks. While I could tell you now which applications each model is best suited for, explaining the design choices and differences between the models will enable you to identify new applications that haven’t yet been tried.
The following section will walk through BERT, which requires an understanding of Generative Pre-Training (GPT). In Part 2, I explained the GPT model architecture in depth, and I recommend reading that article before continuing from here. Warning: if you haven’t read Part 2, the following section will confuse you and it will be your fault. After explaining the BERT architecture, I will contextualize this model within my larger series on transformer models, so that you can identify and evaluate potential applications in full.
BERT Explained
Four months to the day after OpenAI introduced GPT, Google published BERT: Bidirectional Encoder Representations from Transformers. BERT leveraged the power of pre-trained transformers while addressing some of the limitations presented by the GPT architecture. In doing so, BERT vastly expanded the set of tasks that transformers could effectively tackle.
We’ll begin by summarizing only those components of GPT which BERT innovates upon:
- GPT was designed by dropping the encoder component from the transformer (T-ED) and keeping only the decoder (T-D).
- The decoder, sometimes referred to as a generator, functions as a language model, which means that it is optimized to predict the next word in a sentence.
- Attention in this model is uni-directional, meaning that the model can only look at previous words when predicting the next word, not at the words that follow the word being predicted.
GPT was the first fine-tuning-based language model built on the transformer architecture: a pre-trainable transformer trained using next-word prediction. Language modeling based on next-word prediction carries a major limitation: only previous context can be leveraged in understanding meaning. As the authors of BERT note [2]:
[the limitations of uni-directional attention] are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.
Essentially, if comprehension within a sentence is required, then relying only on the first part of the sentence will be too limiting. Here is an example where this can be problematic: "Today, I went to the strome and bought some bread and peanut butter." Autocorrect has two potential candidates: store and stream. Both are equally close in spelling to "strome", and both are plausible when only considering the sentence up to that point. An autocorrect tool built on GPT can rely only on the document context up until "strome" for its decision, even though the words that follow ("bought some bread and peanut butter") make "store" the obvious choice. If the critical context comes directly after the word in question, can’t we do better?
Transformer models require bi-directional attention in order to incorporate context from both directions into model decisions. Transformer-decoders (T-D), which suit language modeling and therefore pre-training, only have uni-directional attention. Pre-training with bi-directional attention requires leveraging the transformer-encoder (T-E), which is the component where bi-directional attention happens in T-ED.
So, the BERT architecture is based on preserving only the encoder component of the T-ED. Where GPT is a T-D, BERT is a T-E. Training a T-E is less straightforward than training a T-D, as we’ll see. Nonetheless, now is a good time to recap the different transformer architectures that we’ve seen over the last three articles. Simply put, the three reigning NLP models are:
- Transformers: T-ED
- GPT: T-D
- BERT: T-E
While there are other models with names like RoBERTa, DistilBERT, T5, DialoGPT, etc., all of those models use one of the three architectures above. This is all you need to know about transformer architectures at the highest level.
Training BERT
Pay particular attention to the language modeling tasks being described here, because they directly determine potential applications of these models.
T-E took the longest of the three to be published (well over a year after T-ED), because it’s the least obvious to train. The problem stems from the fact that T-E is fed the full input text in a monolingual task. What training task can you challenge T-E with if it’s fed the full input sentence? How can you ask it to decide whether "stream" or "store" forms the correct sentence if it’s fed the correct sentence to begin with? BERT was published with two distinct training tasks that circumvent this problem.
BERT is first trained as a masked language model (MLM). MLM entails passing BERT a sentence like "I sat [MASK] my chair" and requiring BERT to predict the masked word. Next-word prediction language modeling can be considered a special case of MLM, where the last word in the sentence is always the masked word. Hence, MLM can be thought of as a more generalized form of language modeling than the task employed to train GPT.
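To make this concrete, here is a minimal sketch of MLM in code. The Hugging Face transformers library and the bert-base-uncased checkpoint are my choices for illustration; the article doesn’t prescribe either.

```python
# A sketch only: the library (Hugging Face transformers) and the checkpoint
# (bert-base-uncased) are assumptions made for illustration.
from transformers import pipeline

# Wrap a pre-trained BERT in a fill-mask pipeline, i.e. an MLM interface.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Ask BERT to predict the masked word in the example sentence above.
for prediction in unmasker("I sat [MASK] my chair."):
    print(f"{prediction['token_str']}: {prediction['score']:.3f}")  # e.g. "in", "on", ...
```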
After MLM, BERT is trained on a task called "next-sentence prediction". In this task, BERT is passed sentence pairs separated by a special indicator token ([SEP]). BERT is trained to predict whether the second sentence actually follows the first or is unrelated. An example looks like this:

BERT, like you, should predict "Low probability".
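Here is a similarly hedged sketch of next-sentence prediction in code. The library, the checkpoint, and the specific sentence pair are illustrative assumptions rather than the example from the figure above.

```python
# A sketch only: library, checkpoint, and the sentence pair are illustrative.
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "I walked into the kitchen to make breakfast."
sentence_b = "Penguins are unable to fly."  # an unrelated second sentence

# The tokenizer places the indicator token ([SEP]) between the two sentences.
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
logits = model(**inputs).logits

# Index 0 of the logits corresponds to "sentence B follows sentence A".
prob_is_next = torch.softmax(logits, dim=1)[0][0].item()
print(f"P(next sentence) = {prob_is_next:.3f}")  # expect a low probability here
```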
These two tasks constitute the pre-training that enables BERT to be easily adapted to new tasks.
Applications of BERT
Understanding the BERT training tasks is essential for determining its applications. It works like this: If you can show that the task you want your product to perform for customers can be framed as one of these training tasks, then your task is a feasible application. Let’s look at Google’s Featured Snippets and the Google product manager (PM) Reuben as an example.
Reuben is a PM working in Google Search, and he’s familiar with BERT. Reuben noticed that many users were typing full questions into the Google search bar and then navigating through the suggested results to find their answer. He has identified a customer need: users have questions they need answered, but they want to maintain their existing behavior of querying the Google search bar.
Reuben follows one user’s session closely and observes the following behavior: the user types the question "How many data scientists in the US" into Google Search, clicks through to the first site, and comes across this sentence: "Growth in data scientist job postings were flat from 2019 to 2020 at around 6,500 in the US, according to Glassdoor." The user copies this sentence and pastes it into his Google Doc.
Thinking about his current product, Reuben notices that Google’s solution retrieves a corpus of documents that are likely to contain the answer to the user’s question, and the user is then required to sift through those documents to find the answer. The user starts with the query sentence, "How many data scientists in the US", and ends by selecting the sentence, "Growth in data scientist job postings were flat from 2019 to 2020 at around 6,500 in the US, according to Glassdoor." Reuben realizes that this task can be translated into the BERT training task of next-sentence prediction: he can use BERT to rank the sentences in the retrieved corpus by the probability that each sentence "follows" the query sentence.
In effect, he wants his BERT output to look like this:

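As a rough illustration of that ranking step, here is a hypothetical sketch that scores each retrieved sentence against the query with BERT’s next-sentence prediction head. The distractor sentences and helper function are made up for illustration; this is not Google’s actual implementation.

```python
# Hypothetical sketch of ranking candidate snippet sentences against a query
# using BERT's next-sentence prediction head. Not Google's implementation.
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

query = "How many data scientists in the US."
candidates = [
    "Growth in data scientist job postings were flat from 2019 to 2020 at around 6,500 in the US, according to Glassdoor.",
    "Data science bootcamps vary widely in length and price.",   # made-up distractor
    "The weather in Seattle was unusually warm this spring.",    # made-up distractor
]

def follow_probability(first: str, second: str) -> float:
    """Probability, per the NSP head, that `second` follows `first`."""
    inputs = tokenizer(first, second, return_tensors="pt")
    logits = model(**inputs).logits
    return torch.softmax(logits, dim=1)[0][0].item()  # index 0 = "is next sentence"

# Rank the retrieved sentences by how likely they are to "follow" the query.
for sentence in sorted(candidates, key=lambda s: follow_probability(query, s), reverse=True):
    print(f"{follow_probability(query, sentence):.3f}  {sentence}")
```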
Translating a domain-specific task into a transformer training task is essentially the process of fine-tuning your model. To develop Featured Snippets, Reuben will want to fine-tune BERT to shift its behavior from finding the sentence that most likely follows the query to finding the sentence that most likely answers the user’s question. Reuben can do this by training BERT further on a set of sentence pairs that more directly fit this pattern.
All Together Now
This same process can be used to understand applications of transformer models generally. Let’s quickly review the training tasks for the three model architectures we covered and their attention mechanisms:
- T-ED. Task: Translation. Attention: bi-directional in the encoder, uni-directional in the decoder.
- T-D. Task: Next-word prediction. Attention: uni-directional.
- T-E. Tasks: 1. MLM. 2. Next-sentence prediction. Attention: bi-directional.
Spell-check in the middle of a sentence: Replace the misspelled word with a mask: MLM. Transform casual English into formal English: Translation. Suggest what to respond to a text: Next-word prediction. Etc. Etc.
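To see how directly the spell-check case maps onto a training task, here is a small sketch that applies MLM to the "strome" sentence from earlier, again assuming the Hugging Face fill-mask pipeline and a stock BERT checkpoint.

```python
# Sketch: mask the misspelled word and let BERT's MLM head use context from
# both directions. Library and checkpoint are assumptions for illustration.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# "strome" is replaced with [MASK]; the words after the mask ("bread and
# peanut butter") are available to the model, unlike with a T-D such as GPT.
sentence = "Today, I went to the [MASK] and bought some bread and peanut butter."
for prediction in unmasker(sentence, top_k=3):
    print(f"{prediction['token_str']}: {prediction['score']:.3f}")
```

A real spell-checker would also restrict the candidates to words close in spelling to "strome", but we would expect "store" to outrank "stream" here precisely because the model can see "bread and peanut butter" after the mask.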
Determining applications is related to, but not the same as model selection. As Google researchers noted in their release of the T5 model (T-ED):
We propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings…We can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself.
As the authors note, on some level all tasks can be framed in a text-to-text format. Maybe, then, a T-ED can be used for any task at all. While this makes mathematical sense, it oftentimes doesn’t make business sense. One could frame writing their PhD dissertation as a next-word prediction task by passing "Here is my PhD dissertation on artificial intelligence" into GPT and waiting for GPT to do the rest. Unfortunately, they’d likely end up handing in garbage, failing their program, and questioning their future.
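As a quick illustration of that text-to-text framing, here is a hedged sketch using a publicly released T5 checkpoint through Hugging Face. The "translate English to German:" prefix is one of T5’s documented task prefixes; the rest of the setup is my assumption.

```python
# Sketch of T5's text-to-text interface: the task is named in the input string.
# Library and checkpoint choices are assumptions for illustration.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task, translation included, is phrased as "text in, text out".
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # expect "Das Haus ist wunderbar."
```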
So, how can you know whether an application is really feasible for a transformer model? As a rule of thumb, the more directly an application task translates into a training task, the better the model will perform. As noted above, every next-sentence prediction task could be reformulated as a translation-style, text-to-text task, but a T-E will yield better results on that task at a lower cost.
All in all, remember that these models operate by leveraging the patterns detected in a massive training corpus of text to determine the likeliest output for a new text-based task. Model performance is limited by what is contained in that corpus. Transformer models don’t make discoveries or figure things out. They automate text-based tasks made predictable by the patterns contained in millions and millions of existing documents. Nonetheless, they promise to absolutely revolutionize artificial intelligence as we know it.
References
[1] Nayak, Pandu. "Understanding Searches Better than Ever Before." Google, 25 Oct. 2019, blog.google/products/search/search-language-understanding-bert/.
[2] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805, 2018.
[3] "Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer." Google AI Blog, 24 Feb. 2020, ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html.
All images created by author