Deconstructing BERT, Part 2: Visualizing the Inner Workings of Attention

A new visualization tool shows how BERT forms its distinctive attention patterns.

Jesse Vig
Towards Data Science


In Part 1 (not a prerequisite) we explored how the BERT language model learns a variety of intuitive structures. In Part 2, we will drill deeper into BERT’s attention mechanism and reveal the secrets to its shape-shifting superpowers.

🕹 Try out an interactive demo with BertViz.

Giving machines the ability to understand natural language has been an aspiration of Artificial Intelligence since the field’s inception, but this goal has proved elusive. In some sense, understanding language requires solving the larger problem of artificial general intelligence (AGI). For example, the Turing Test — one of the earliest conceived measurements of machine intelligence — is based on a machine’s ability to converse in natural language with a human.

In just the last few years, however, a quiet revolution has been taking place in the field of Natural Language Processing (NLP). New deep learning models have emerged that have dramatically improved the ability of machines to process language, resulting in large performance improvements across NLP tasks ranging from sentiment analysis to question answering.

Perhaps the most well-known of these models is BERT (Bidirectional Encoder Representations from Transformers). BERT builds on two recent trends in the field of NLP: (1) transfer learning and (2) the Transformer model.

The idea of transfer learning is to train a model on one task, and then leverage the acquired knowledge to improve the model’s performance on a related task. BERT is first trained on two unsupervised tasks: masked language modeling (predicting a missing word in a sentence) and next sentence prediction (predicting if one sentence naturally follows another). By pre-training over a large corpus (all of English Wikipedia and 11,000 books), BERT comes to any new task with a solid foundation in the workings of the English language.
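To get a feel for the masked language modeling task, here is a minimal sketch, assuming the Hugging Face transformers library (not something BERT itself requires, just a convenient wrapper around a pretrained model):

```python
from transformers import pipeline

# The fill-mask pipeline wraps a pretrained BERT together with its
# masked language modeling head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the most likely tokens for the [MASK] position.
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```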

Underlying BERT is the Transformer model, a type of neural network that accepts a sequence as input (e.g. a sequence of words) and produces some output (e.g. sentiment prediction). Unlike traditional recurrent networks such as LSTMs, which process each sequence element in turn, the Transformer processes all elements simultaneously by forming direct connections between individual elements through an attention mechanism. Not only does this enable greater parallelization, but it also results in higher accuracy across a range of tasks.

The Architecture of BERT

BERT is a bit like a Rube Goldberg machine: though the end-to-end system may seem convoluted, the individual components are quite simple. In this article I will focus on the core component of BERT: attention.

Roughly speaking, attention is a way for a model to assign weight to input features based on their importance to some task. When deciding whether an image contains a dog or cat, for example, a model might pay more attention to — i.e. place more weight on — the furry parts of the image as opposed to the lamp or window in the background.

Similarly, a language model that is trying to complete the sentence “The dog from down the street ran up to me and ____” may want to pay more attention to the word dog than street, because knowing that the subject is dog is more important for predicting the next word than knowing where the dog came from. As we’ll see later, attention can also be used to form connections between words, enabling BERT to learn a variety of rich lexical relationships.

Are you paying attention?

Fortunately, the mechanics of attention in BERT are quite simple. Suppose you have some sequence X, where each element is a vector (referred to as a value). In the following example, X consists of 3 vectors, each of length 4:

Attention is simply a function that takes X as input and returns another sequence Y of the same length, composed of vectors of the same length as those in X:

where each vector in Y is simply a weighted average of the vectors in X:

That’s it — attention is just a fancy name for weighted average! The weights show how much the model attends to each input in X when computing the weighted average, and are thus referred to as attention weights. Later on, we’ll discuss how these attention weights are calculated.
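Here is a minimal numerical sketch of that idea; only the shapes mirror the example above, the vectors and weights are made up for illustration:

```python
import numpy as np

# X: 3 input vectors (values), each of length 4, as in the example above.
X = np.array([
    [0.2, -1.3,  0.4,  1.1],
    [0.9,  0.0, -0.5,  0.3],
    [-0.7, 0.6,  1.2, -0.1],
])

# One row of attention weights per output vector;
# each row is non-negative and sums to one.
A = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])

# Each output vector y_i is a weighted average of the vectors in X.
Y = A @ X
assert Y.shape == X.shape   # same number of vectors, same length
print(Y[1])                 # = 0.2*X[0] + 0.7*X[1] + 0.1*X[2]
```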

(Note that attention is not the only component of BERT; there are also feed-forward layers, residual connections, and layer normalization modules that all work together with the attention component to produce the model output. But attention is the real workhorse, and so we’ll focus on that. For more details on the other components, check out the tutorials in the references section.)

Attending to language

So how does attention apply to language? Well, suppose that X represents a sequence of words like “the dog ran”. We can associate each word with a continuous vector — called a word embedding — that captures various attributes of the word:

One can imagine that these attributes represent things like the sentiment of the word, whether it’s singular or plural, part-of-speech indicators, etc. In practice, word embeddings are not so interpretable, but still have nice properties. For example, words with similar meanings are generally close to one another in the embedding space. Also, one can perform arithmetic on word embeddings and produce meaningful results, e.g. embedding(king) − embedding(man) + embedding(woman) ≈ embedding(queen).
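If you want to try that arithmetic yourself, here is a small sketch assuming the gensim library and its downloadable GloVe vectors; these are not the embeddings BERT uses, just an easy way to reproduce the king/queen example:

```python
import gensim.downloader as api

# Download a small set of pretrained GloVe word vectors.
vectors = api.load("glove-wiki-gigaword-50")

# embedding(king) - embedding(man) + embedding(woman) ≈ embedding(queen)
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```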

Since attention is also a form of simple arithmetic, it seems reasonable to apply attention to these word embeddings:

By applying attention to the word embeddings in X, we have produced composite embeddings (weighted averages) in Y. For example, the embedding for dog in Y is a composite of the embeddings for the, dog, and ran in X, with weights of 0.2, 0.7, and 0.1, respectively.

How does composing word embeddings help the model in its ultimate goal of understanding language? To fully comprehend language, it is not sufficient to understand the individual words that make up a sentence; the model must understand how the words relate to each other in the context of the sentence. The attention mechanism enables the model to do this, by forming composite representations that the model can reason about. For example, when a language model tries to predict the next word in the sentence “the running dog was ___”, the model should understand the composite notion of running dog in addition to the concepts running or dog individually; e.g., running dogs often pant, so panting is a reasonable next word in the sentence.

Visualizing Attention

Attention provides us with a lens (albeit a blurry one) through which we can see how BERT forms composite representations to understand language. We can access this lens using BertViz, an interactive tool we developed that visualizes attention in BERT from multiple perspectives.

The visualization below (available in interactive form here) shows the attention induced by a sample input text. This view visualizes attention as lines connecting the word being updated (left) with the word being attended to (right), following the design of the figures above. Color intensity reflects the attention weight; weights close to one show as very dark lines, while weights close to zero appear as faint lines or are not visible at all. The user may highlight a particular word to see the attention from that word only. This visualization is called the attention-head view for reasons discussed later. It is based on the excellent Tensor2Tensor visualization tool from Llion Jones.

Left: visualization of attention between all words in the input. Right: visualization of attention from selected word only.

In this example, the input consists of two sentences: “the rabbit quickly hopped” and “the turtle slowly crawled”. The [SEP] symbols are special separator tokens that indicate a sentence boundary, and [CLS] is a symbol prepended to the input that is used for classification tasks (see references for more details).

The visualization shows that attention is highest between pairs of words that do not cross the sentence boundary; the model seems to understand that it should relate words to other words in the same sentence in order to best understand their context.

However, some specific word pairs have higher attention weights than others, e.g. rabbit and hopped. In this example, understanding the relationship between these words might help the model determine that this is a description of a nature scene as opposed to a carnivorous foodie’s review of a hopping restaurant that served rabbit.

Multi-head attention

The above visualization shows one attention mechanism within the model. BERT actually learns multiple attention mechanisms, called heads, which operate in parallel to one another. As we’ll see shortly, multi-head attention enables the model to capture a broader range of relationships between words than would be possible with a single attention mechanism.

BERT also stacks multiple layers of attention, each of which operates on the output of the layer that came before. Through this repeated composition of word embeddings, BERT is able to form very rich representations as it gets to the deepest layers of the model.

Because the attention heads do not share parameters, each head learns a unique attention pattern. The version of BERT that we consider here — BERT Base — has 12 layers and 12 heads, resulting in a total of 12 x 12 = 144 distinct attention mechanisms. We can visualize attention in all of the heads at once, using the model view (available in interactive form here):

Model view (first 6 layers) for input sentences “the rabbit quickly hopped” and “the turtle slowly crawled”.

Each cell in the model view shows the attention pattern for a particular head (indexed by column) in a particular layer (indexed by row), using a thumbnail form of the attention-head view from earlier. The attention patterns are specific to the input text (which in this case is the same as the input to the attention-head view above). From the model view, we can see that BERT produces a rich array of attention patterns. In the second half of this article we’ll explore how BERT is able to generate such diverse patterns.
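If you would like to generate these views for your own inputs, here is a minimal sketch assuming the Hugging Face transformers library and the pip-installable bertviz package (whose current API may differ slightly from the version described in this article):

```python
import torch
from transformers import BertModel, BertTokenizer
from bertviz import head_view, model_view

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

# Encode the two example sentences as one input: [CLS] ... [SEP] ... [SEP]
inputs = tokenizer("the rabbit quickly hopped", "the turtle slowly crawled",
                   return_tensors="pt")
with torch.no_grad():
    attention = model(**inputs).attentions   # one tensor per layer

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

# BERT Base: 12 layers x 12 heads = 144 attention mechanisms.
print(len(attention), attention[0].shape)    # 12, (1, 12, seq_len, seq_len)

head_view(attention, tokens)    # attention-head view (run in a Jupyter notebook)
model_view(attention, tokens)   # model view: thumbnail grid of all layers and heads
```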

Deconstructing Attention

Earlier we saw how attention is used by the model to compute a weighted average over word embeddings, but how are the attention weights themselves computed?

The answer is that BERT uses a compatibility function, which assigns a score to each pair of words indicating how strongly they should attend to one another. To measure compatibility, the model first assigns to each word a query vector and a key vector:

These vectors can be thought of as a type of word embedding like the value vectors we saw earlier, but constructed specifically for determining the compatibility of words. In this case, the compatibility score is just the dot product of the query vector of one word and the key vector of the other, e.g.:

To turn these compatibility scores into valid attention weights, we must normalize them to be positive and sum to one (since attention weights are used to compute a weighted average). This is accomplished by applying the softmax function over the scores for a given word. For example, when computing the attention from dog to The, dog, and ran, we have:

The softmax values on the right represent the final attention weights. Note that, in practice, the dot products are first scaled by dividing by the square root of the vector length (its dimensionality). This adjusts for the fact that high-dimensional vectors tend to produce very large dot products.
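Putting those steps together, here is a minimal sketch with made-up query and key vectors for The, dog, and ran; only the procedure, not the numbers, comes from the model:

```python
import numpy as np

d = 4  # length (dimensionality) of the query and key vectors

# Made-up query vector for "dog" and key vectors for "The", "dog", "ran".
q_dog = np.array([1.0, -0.5, 0.3, 0.8])
keys = {
    "The": np.array([0.2,  0.1, -0.4, 0.0]),
    "dog": np.array([0.9, -0.3,  0.5, 0.7]),
    "ran": np.array([-0.1, 0.6,  0.2, 0.4]),
}

# Compatibility scores: scaled dot products of the query with each key.
scores = np.array([q_dog @ k for k in keys.values()]) / np.sqrt(d)

# Softmax turns the scores into positive weights that sum to one.
weights = np.exp(scores) / np.exp(scores).sum()
print(dict(zip(keys, weights.round(3))))
```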

So we now understand that attention weights are computed from query and key vectors. But where do the query and key vectors come from? Like the value vectors mentioned earlier, they are computed dynamically based on the output from the previous layer. The details of this process are beyond the scope of this article, but you can read more about it in the references at the end.
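For the curious, here is a hedged sketch of the standard Transformer formulation, which BERT follows: each head computes its queries, keys, and values by multiplying the previous layer’s output by three learned weight matrices. The dimensions and weights below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_head = 3, 8, 4           # toy dimensions
H = rng.standard_normal((seq_len, d_model))  # output of the previous layer

# Learned projection matrices (random here, trained in the real model).
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_head))

Q, K, V = H @ W_q, H @ W_k, H @ W_v          # queries, keys, values per word

# Full single-head attention: scaled dot products, softmax over each row,
# then a weighted average of the value vectors.
scores = Q @ K.T / np.sqrt(d_head)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
Y = weights @ V
print(weights.round(2))
```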

We can visualize how attention weights are computed from query and key vectors using the neuron view, below (available in interactive form here). This view traces the computation of attention from the selected word on the left to the complete sequence of words on the right. Positive values are colored blue and negative values orange, with color intensity representing magnitude. Like the attention-head view presented earlier, the connecting lines indicate the strength of attention between the connected words.

Neuron View

Let’s go through the columns in the neuron view one at a time, and revisit some of the concepts discussed earlier:

Query q: the query vector q encodes the word on the left that is paying attention, i.e. the one that is “querying” the other words. In the example above, the query vector for “on” (the selected word) is highlighted.

Key k: the key vector k encodes the word on the right to which attention is being paid. The key vector and the query vector together determine a compatibility score between the two words.

q×k (elementwise): the elementwise product between the query vector of the selected word and each of the key vectors. This is a precursor to the dot product (the sum of the elementwise product) and is included for visualization purposes because it shows how individual elements in the query and key vectors contribute to the dot product.

q·k: the scaled dot product (see above) of the selected query vector and each of the key vectors. This is the unnormalized attention score.

Softmax: the softmax of the scaled dot product. This normalizes the attention scores to be positive and sum to one.

The neuron view is best understood through interaction. You can view a brief video demo below (or access the tool directly):

Explaining BERT’s attention patterns

As we saw from the model view earlier, BERT’s attention patterns can assume many different forms. In Part 1 of this series, I showed how many of these can be explained by a small number of interpretable structures. In this section, we revisit those core structures and use the neuron view to reveal the secrets to BERT’s powers of plasticity.

Delimiter-focused attention patterns

Let’s start with the simple case where most attention is focused on the sentence separator [SEP] token (Pattern 6 from Part 1). As discussed in this paper, this pattern serves as a kind of “no-op”; an attention head focuses on the [SEP] tokens when it can’t find anything else in the input sentence to focus on:

Delimiter-focused attention pattern for Layer 7, Head 3 of the BERT-base pretrained model.

So, how exactly is BERT able to fixate on the [SEP] tokens? Let’s see if the visualization can provide some clues. Here we see the neuron view of the example above:

In the Key column, the key vectors for the two occurrences of [SEP] carry a distinctive signature: they both have a small number of active neurons with strongly positive (blue) or negative (orange) values, and a larger number of neurons with values close to zero (light blue/orange or white):

Key vector for first [SEP] token.

The query vectors tend to match the [SEP] key vectors along those active neurons, resulting in high values for the elementwise product q×k, as in this example:

Query vector for first occurrence of “the”, key vector for first occurrence of [SEP], and elementwise product of the two.

The query vectors for the other words follow a similar pattern: they match the [SEP] key vector along the same set of neurons. Thus it seems that BERT has designated a small set of neurons as “[SEP]-matching neurons,” and query vectors are assigned values that match the [SEP] key vectors at these positions. The result is the [SEP]-focused attention pattern.

Bag of Words attention pattern

This is a less common pattern, which was not discussed in Part 1. In this pattern, attention is divided fairly evenly across all words in the same sentence:

Sentence-focused attention pattern for Layer 0, Head 0 of the BERT-base pretrained model.

BERT is essentially computing a bag-of-words embedding by taking an (almost) unweighted average of the word embeddings in the same sentence.

So how does BERT finesse the queries and keys to form this attention pattern? Let’s again turn to the neuron view:

Neuron view of sentence-focused attention pattern for Layer 0, Head 0 of the BERT-base pretrained model.

In the q×k column, we see a clear pattern: a small number of neurons (2–4) dominate the calculation of the attention scores. When the query and key vectors are in the same sentence (the first sentence, in this case), the product shows high values (blue) at these neurons. When the query and key vectors are in different sentences, the product is strongly negative (orange) at these same positions, as in this example:

The query-key product tends to be positive when query and key are in the same sentence (left), and negative when query and key are in different sentences (right).

When query and key are both from sentence 1, they tend to have values with the same sign along the active neurons, resulting in a positive product. When the query is from sentence 1, and the key is from sentence 2, the same neurons tend to have values with opposite signs, resulting in a negative product.

But how does BERT know the concept of “sentence”, especially in the first layer of the network before higher-level abstractions are formed? As mentioned earlier, BERT accepts special [SEP] tokens that mark sentence boundaries. Additionally, BERT incorporates sentence-level embeddings that are added to the input layer (see Figure 1, below). The information encoded in these sentence embeddings flows to downstream variables, i.e. queries and keys, and enables them to acquire sentence-specific values.

Figure 1: Segment embeddings for Sentences A and B are added to the input embeddings, along with position embeddings. (From BERT paper.)
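Here is a minimal sketch of how the three embeddings combine at the input layer, using toy dimensions and randomly initialized embedding tables in place of the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, num_segments, max_positions, d_model = 100, 2, 512, 16   # toy sizes

# Randomly initialized embedding tables; the real model learns these during pre-training.
token_emb = rng.standard_normal((vocab_size, d_model))
segment_emb = rng.standard_normal((num_segments, d_model))
position_emb = rng.standard_normal((max_positions, d_model))

# Made-up token ids standing in for
# "[CLS] the rabbit quickly hopped [SEP] the turtle slowly crawled [SEP]"
token_ids = np.array([5, 11, 42, 77, 93, 6, 11, 58, 64, 88, 6])
segment_ids = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # sentence A vs. sentence B
positions = np.arange(len(token_ids))

# The input to the first layer is simply the sum of the three embeddings.
inputs = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(inputs.shape)   # (sequence length, embedding size)
```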

Next-word attention patterns

In the next-word attention pattern, virtually all the attention is focused on the next word in the input sequence, except at the [SEP] and [CLS] tokens:

Next-word attention pattern at Layer 2, Head 0 of the BERT-base pretrained model.

It makes sense that the model would focus on the next word, because adjacent words are often the most relevant for understanding a word’s meaning in context. Traditional n-gram language models are based on this same intuition. Let’s check out the neuron view for the above example:

We see that the product of the query vector for “the” and the key vector for “store” (the next word) is strongly positive across most neurons. For tokens other than the next token, the key-query product contains some combination of positive and negative values. The result is a high attention score between “the” and “store”.

For this attention pattern, a large number of neurons figure into the attention score, and these neurons differ depending on the token position, as illustrated here:

Elementwise product of query and key vectors, for query at position i and key at position i+1, for i = 2, 8, 14. Note that the active neurons differ in each case.

This behavior differs from the delimiter-focused and the sentence-focused attention patterns, in which a small, fixed set of neurons determines the attention scores. For those two patterns, only a few neurons are required because the patterns are so simple, and there is little variation in the words that receive attention. In contrast, the next-word attention pattern needs to track which of the 512 words receives attention from a given position, i.e., which is the next word. To do so it needs to generate queries and keys such that each query vector matches with a unique key vector from the 512 possibilities. This would be difficult to accomplish using a small subset of neurons.

So how is BERT able to generate these position-aware queries and keys? In this case, the answer lies in BERT’s position embeddings, which are added to the word embeddings at the input layer (see Figure 1). BERT learns a unique position embedding for each of the 512 positions in the input sequence, and this position-specific information can flow through the model to the key and query vectors.

For updates on my visualization work and other AI projects, feel free to follow me on Twitter.

Notes

We have only covered some of the coarse-level attention patterns discussed in Part 1 and have not touched on lower-level dynamics around linguistic phenomena such as coreference, synonymy, etc. I hope that this tool can help provide intuition for many of these cases.

Try it out!

You can check out the visualization tool on GitHub. Please play with it and share what you find!

References

You can find more details on the architecture of BERT and other transformer models in these excellent tutorials:

The Illustrated BERT: A beautifully illustrated tutorial on the architecture of BERT.

Transformers from Scratch: A detailed but intuitive tutorial on building a Transformer from scratch, complete with PyTorch code.

Acknowledgements

I would like to thank Richelle Dumond, John Maxwell, Lottie Price, Kalai Ramea, and Samuel Rönnqvist for their feedback and suggestions for this article.

For further reading, check out my most recent article, in which I explore OpenAI’s text generator, GPT-2.
