What Does Transformer Self-Attention Actually Look At?

A deep dive into BERT’s attention heads

Jon Simon
Towards Data Science

--

Background: Transformers, Self-Attention, and BERT

In the past few years, progress on natural language understanding has exploded, thanks largely to the new neural network architecture known as the Transformer. The main architectural innovation of the Transformer model has been the extensive use of so-called “self-attention”, so much so that the paper introducing the model was titled “Attention is All You Need”. The self-attention mechanism encodes each input as a function of all of the other inputs, helping to formalize the intuitive notion of contextuality in language.

Example usage of GitHub Copilot to instantiate a complex object by looking only at its class signature and preceding boilerplate. Copilot was created by training a Transformer model on open source code from Github. (What can GitHub copilot do for Data scientists?)

Since its introduction in 2017, the Transformer architecture has branched off into multiple subfamilies, most notably those of the causal decoder variety (GPT-2, GPT-3), which are trained to predict the next token in a sequence, and those built around bidirectional encoders (BERT, plus encoder-decoder models like T5 and MUM), which are trained to fill in the blank at arbitrary positions in a sequence. Each has its own strengths and weaknesses, but for the purposes of this article we will be focusing on the encoder variety, particularly BERT.

BERT Architecture Refresher

BERT is designed to be a very general language encoder model which can be repurposed for many different types of tasks without the need for changes to its architecture. It achieves this by receiving inputs in the general form of word tokens* with a special [CLS] token at the start, and a special [SEP] token after each piece of text. (*technically WordPiece tokens)

This sequence of inputs is then converted to vector embeddings, which get repeatedly re-encoded with respect to one another via the self-attention mechanism (more on that later), with each subsequent re-encoding maintaining the same sequence length.

In this way, any task that receives multiple inputs, say a paragraph of text + a question about the text, can be passed in naturally as [SEP]-demarcated token sequences. And similarly, any task that expects a classification, such as “does the input have negative sentiment or positive sentiment?”, can naturally be read off from the final layer’s [CLS] token embedding.

Illustration of how BERT digests the input “The boy rode a horse” + “It was fun”. The input tokens are converted into embeddings which are then repeatedly reencoded by each layer. The final embedding of the special [CLS] token can be used to perform a classification. (Image by author)
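
To make this concrete, here is a minimal sketch of how the example above gets tokenized, using the Hugging Face transformers library (my choice of tooling for illustration; the article doesn’t prescribe any particular library):

```python
# Sketch: how "The boy rode a horse" + "It was fun" is turned into a
# [CLS] ... [SEP] ... [SEP] token sequence, using Hugging Face transformers.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The boy rode a horse", "It was fun")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'boy', 'rode', 'a', 'horse', '[SEP]', 'it', 'was', 'fun', '[SEP]']
```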

For the base BERT model there are 12 layers, and each layer contains 12 attention heads, making for 144 attention heads in total. The attention operation is somewhat involved (for a detailed walkthrough see Illustrated: Self-Attention), but the important thing to know is, for each attention head:

  1. Each input is assigned three vectors: a Key, a Query, and a Value
  2. To determine how much input i should “pay attention” to input j, we take the dot product of input i’s Query vector with input j’s Key vector, rescale it, and pass the scores for all positions j through a softmax.
  3. We then use the resulting attention scores to weight the Value vectors, and sum the weighted Values to produce the new representation of input i (a minimal code sketch follows below).
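
Here is a minimal NumPy sketch of that computation for a single head. The matrix names (W_q, W_k, W_v) and the dimensions are illustrative assumptions, not BERT’s actual learned weights:

```python
# Minimal sketch of scaled dot-product attention for one head.
# Assumes inputs X of shape (seq_len, d_model) and learned projection
# matrices W_q, W_k, W_v of shape (d_model, d_head).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    Q = X @ W_q          # Query vector for each input token
    K = X @ W_k          # Key vector for each input token
    V = X @ W_v          # Value vector for each input token
    d_head = Q.shape[-1]
    # scores[i, j]: how much token i "pays attention" to token j
    scores = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
    # Each output is an attention-weighted sum of the Value vectors
    return scores @ V, scores

# Toy usage: 5 tokens, d_model=8, d_head=4
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out, attn = attention_head(X,
                           rng.normal(size=(8, 4)),
                           rng.normal(size=(8, 4)),
                           rng.normal(size=(8, 4)))
print(attn.shape)  # (5, 5); each row sums to 1
```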

Ok, with all of that introduction out of the way we can finally get to the interesting part.

What BERT’s Attention Heads (Don’t) Look At

Often when you hear a high level description of BERT’s attention, it’s followed by “and this allows the attention heads to learn classic NLP relationships all on their own!” followed by a provocative graphic like this:

One of BERT’s Layer 5 attention heads performing coreference resolution. “it” is attended to most strongly by its associated noun phrase “the dog”. (Image by author, generated using BertViz)
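
If you want to poke at attention maps like this yourself, the sketch below pulls the raw per-head attention matrices out of a pretrained BERT via the Hugging Face transformers library. The sentence and the layer/head indices are arbitrary illustrations, not the specific head shown in the figure:

```python
# Sketch: extracting BERT's per-head attention matrices so you can inspect
# patterns like the coreference example above.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The dog ran because it was excited", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of 12 tensors, one per layer,
# each of shape (batch, num_heads, seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 4, 0                           # layer 5, head 1 (one-indexed); arbitrary choice
attn = outputs.attentions[layer][0, head]    # (seq_len, seq_len)

it_pos = tokens.index("it")
for tok, weight in zip(tokens, attn[it_pos]):
    print(f"{tok:>10s}  {weight.item():.3f}")  # where does "it" attend?
```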

And while it’s true that some attention heads learn to represent nice interpretable relationships like this, most of them don’t. Kovaleva et al. (2019) group BERT’s attentional focus into 5 types:

  1. Vertical: All tokens attend strongly to the same single other token, usually the [SEP] token.
  2. Diagonal: All tokens attend strongly either to themselves, or to a token with a constant offset, e.g. the token immediately after themselves.
  3. Vertical + Diagonal: A combination of the first two patterns.
  4. Block: Tokens attend strongly to other tokens within their [SEP]-demarcated block, and not to any of the tokens outside of that.
  5. Heterogeneous: More complex, non-obvious patterns.

Each grid represents the behavior of a particular attention head for a particular input. Position (i,j) in the grid denotes the strength of the attention of token i for token j. Shown here are inputs demonstrating the 5 characteristic attention patterns. (Kovaleva et al., 2019)
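
Kovaleva et al. detected these patterns by training a small CNN classifier on manually annotated attention maps. The function below is only a rough heuristic restatement of the five categories, with made-up thresholds, to show what each pattern looks like numerically:

```python
# Rough heuristic sketch (not Kovaleva et al.'s actual classifier) for
# bucketing a single attention map into the five patterns described above.
import numpy as np

def classify_pattern(attn, sep_positions, threshold=0.5):
    """attn: (seq_len, seq_len) attention map for one head on one input."""
    seq_len = attn.shape[0]
    # Vertical: most of the mass lands in a single column (often [SEP])
    vertical = attn.mean(axis=0).max() > threshold
    # Diagonal: most of the mass sits on the diagonal or a fixed offset
    offsets = [np.trace(attn, offset=k) / (seq_len - abs(k)) for k in (-1, 0, 1)]
    diagonal = max(offsets) > threshold
    if vertical and diagonal:
        return "vertical+diagonal"
    if vertical:
        return "vertical"
    if diagonal:
        return "diagonal"
    # Block: very little attention crosses the first [SEP] boundary
    if sep_positions:
        b = sep_positions[0]
        cross = attn[:b, b:].sum() + attn[b:, :b].sum()
        if cross / attn.sum() < 0.1:
            return "block"
    return "heterogeneous"
```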

The interesting thing is that when you look at how common each of these attentional patterns is, you find that the Heterogeneous and Block patterns, aka the only ones that do anything interesting, account for the typical behavior of only about half of the attention heads. Even more bizarrely, the Vertical pattern, which just stares at the same single token in all cases, characterizes a full third of the attention heads.

Fraction of inputs producing each attention pattern. Each bar represents a different set of inputs, all of which are tasks from the GLUE dataset. (Kovaleva et al., 2019)

When you dig into the Vertical pattern, you find that most of the single tokens being attended to are the [CLS], [SEP], and punctuation tokens. So why on earth would a model as supposedly smart as BERT spend so much of its valuable attentional resources on these uninformative tokens?

The theory about what’s going on here is that when an attention head stares at one of these special tokens, it acts like a no-op. So if whatever linguistic structure a particular attention head is tuned for isn’t present in the input, this lets the head effectively “turn off”.

Kobayashi et al. (2020) dug into this strange finding further and found that although the attention scores on these tokens are high, the norm of the Value vector being multiplied by each score is very low. So low, in fact, that the final product winds up being near zero.

The amount of attentional weight given to different token types in each layer. Left: Naive attention score, Right: Attention score weighted by the norm of the Value vector. (Kobayashi et al., 2020)
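
Here is a simplified sketch of that norm-weighted analysis: multiply each attention score by the norm of the Value vector it weights. It leans on the internals of the Hugging Face BertModel implementation and ignores the per-head output projection, so treat it as an approximation of Kobayashi et al.’s measure rather than a faithful reimplementation:

```python
# Approximate sketch of the Kobayashi et al. (2020) idea: reweight each
# attention score alpha_ij by the norm of the Value vector v_j it multiplies.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained(
    "bert-base-uncased", output_attentions=True, output_hidden_states=True
)

inputs = tokenizer("The boy rode a horse", "It was fun", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

layer, head = 0, 0
num_heads = model.config.num_attention_heads
head_dim = model.config.hidden_size // num_heads

# Input to this layer's self-attention (hidden_states[0] is the embedding output)
x = outputs.hidden_states[layer]                         # (1, seq_len, hidden)
value_proj = model.encoder.layer[layer].attention.self.value
v = value_proj(x).view(1, -1, num_heads, head_dim)        # (1, seq_len, heads, head_dim)
v_norm = v.norm(dim=-1)[0, :, head]                       # norm of each token's Value vector

attn = outputs.attentions[layer][0, head]                 # (seq_len, seq_len) attention scores
norm_weighted = attn * v_norm.unsqueeze(0)                # alpha_ij * ||v_j||

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
sep = tokens.index("[SEP]")
print("attention mass on first [SEP]:     ", attn[:, sep].mean().item())
print("norm-weighted mass on first [SEP]: ", norm_weighted[:, sep].mean().item())
```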

The really funny thing is that because there are so many of these attention heads that are basically doing nothing, you can actually improve the model’s performance by removing certain attention heads! In fact for some tasks like MRPC (determine if two sentences are equivalent) and RTE (determine if one sentence implies another) removing a head at random is more likely to help performance than hurt it.

Task performance when disabling different attention heads. Orange line indicates the accuracy of the unaltered BERT model. Layer # shown on y-axis, head # shown on x-axis. Left: Performance on the MRPC task, Right: Performance on the RTE task. (Kovaleva et al., 2019)
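
Experiments like these are easy to approximate with the head_mask argument that Hugging Face BERT models accept (1 keeps a head, 0 zeroes it out). The head chosen below and the untrained classification layer are placeholders; a real ablation study would fine-tune and score on the actual task:

```python
# Sketch: disabling a single attention head via head_mask.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

head_mask = torch.ones(model.config.num_hidden_layers,
                       model.config.num_attention_heads)
head_mask[4, 3] = 0.0   # zero out layer 5, head 4 (one-indexed); arbitrary choice

inputs = tokenizer("The boy rode a horse", "It was fun", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs, head_mask=head_mask).logits
print(logits)  # classification scores with that head switched off
```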

What about the Useful Attention Patterns?

Going back to the figure in the first section: It’s true that there are some attention heads in BERT which seem to be tuned to perform recognizable NLP subtasks. So, what are these heads, and what can they do?

Some attention heads turn out to encode the same relationships as particular edge types in dependency parse trees. In particular, there are heads whose strongest attention for a given input (ignoring [CLS] and [SEP]) consistently lands on pairs of words that stand in a specific dependency relation.

BERT heads encoding specific dependency relations. Left: Direct Object relation, Center: Determiner relation, Right: Possessive Modifier relation. (Clark et al., 2019)
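
The measurement behind figures like this one is simple to sketch: for a chosen head, find which token each word attends to most strongly, skipping [CLS] and [SEP]. The layer/head indices below are placeholders, not the specific heads Clark et al. report:

```python
# Sketch: for one head, print each token's most-attended-to token
# (ignoring the special tokens), to compare against a dependency parse by eye.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The boy rode a horse", return_tensors="pt")
with torch.no_grad():
    attns = model(**inputs).attentions

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 7, 10                      # placeholder indices
attn = attns[layer][0, head].clone()     # (seq_len, seq_len)

special = [i for i, t in enumerate(tokens) if t in ("[CLS]", "[SEP]")]
attn[:, special] = 0.0                   # ignore attention onto special tokens

for i, tok in enumerate(tokens):
    if tok in ("[CLS]", "[SEP]"):
        continue
    j = attn[i].argmax().item()
    print(f"{tok:>8s} -> {tokens[j]}")
```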

Another complex and important task for many NLP systems is coreference resolution: the problem of determining when two expressions in a text refer to the same entity. To see why this is such a hard problem, consider the sentence “Sally gave Sarah a Mentos because her breath was bad.” Here “her” refers to “Sarah”, but making that determination requires knowing that Mentos are used to alleviate bad breath, and are not some kind of apology gift.

This reliance on world knowledge makes the task very nontrivial, yet Clark et al. find that Head #4 in Layer #5 correctly identifies (i.e. most strongly attends to) the coreferent with 65% accuracy, compared to only 27% for a system that simply selects the nearest mention as the coreferent.

Another route we can go is to look at what all of the attention heads learn in aggregate. Remember that dependency parse tree from earlier? One question we can ask is: For a given word pair, if we consider the attention scores produced by all of the attention heads in the network, can we figure out whether they should be connected by an edge in the dependency parse?

Creating a network-wide attention vector for a given word pair by aggregating the attention scores across all of the 144 attention heads (Coenen et al., 2019)

It turns out the answer is yes: a classifier constructed in this way can predict edges in the dependency parse with 85.8% accuracy, far better than chance (Coenen et al., 2019). Although, as Rogers et al. (2020) point out, when a classifier is built on top of a high-dimensional representation like this, it’s not always clear how much of the knowledge lives in the underlying representation versus how much is injected by the classifier itself.
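
As a sketch of how such a probe could be assembled (not Coenen et al.’s exact setup), one can stack the 144 per-head attention scores for each word pair into a feature vector and train a simple classifier on top. The edge labels here are random placeholders; in practice they would come from a dependency-parsed corpus:

```python
# Sketch of an aggregate-attention probe: represent each ordered token pair
# (i, j) by its attention scores across all 144 heads, then fit a classifier.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

def pair_features(sentence):
    """144-dim attention vector for every ordered token pair in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        attns = model(**inputs).attentions         # 12 tensors of (1, 12, seq, seq)
    stacked = torch.cat(attns, dim=1)[0]           # (144, seq, seq)
    seq_len = stacked.shape[-1]
    return {(i, j): stacked[:, i, j].numpy()
            for i in range(seq_len) for j in range(seq_len) if i != j}

feats = pair_features("The boy rode a horse")
X = np.stack(list(feats.values()))
y = np.random.randint(0, 2, size=len(X))           # placeholder edge labels
probe = LogisticRegression(max_iter=1000).fit(X, y)  # the probing classifier
```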

Final Thoughts

There is a lot of research on this topic that I haven’t had time to touch on, such as the role different architectural decisions play in determining BERT’s capabilities (Ontañón et al., 2021), what sort of reasoning is performed within different layers of BERT (Tenney et al., 2019), and what the word embeddings downstream of the attention heads encode (Hewitt & Manning, 2019).

So many of the headline papers are about the breakthrough capabilities that these enormous neural network models possess, but what has always interested me most about the field is how such models work under the hood. And it’s especially in cases like this, where the much-touted attention mechanism turns out to spend much of its time looking at uninformative tokens, that we realize just how poor our understanding of these models is. To me, that air of mystery makes them even cooler.

Works Cited

[1] J. Alammar, The Illustrated Transformer (2018), GitHub Blog

[2] R. Karim, Illustrated: Self-Attention (2019), Towards Data Science

[3] J. Vig, Deconstructing BERT, Part 2: Visualizing the Inner Workings of Attention (2019), Towards Data Science

[4] A. Keerthi, What can GitHub copilot do for Data scientists? (2021), Towards Data Science

[5] J. Devlin, M. Chang, K. Lee, and K. Toutanova, BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding (2019), ArXiv.org

[6] A. Rogers, O. Kovaleva, and A. Rumshisky, A Primer in BERTology: What We Know about How BERT Works (2020), ArXiv.org

[7] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning, What Does BERT Look At? An Analysis of BERT’s Attention (2019), ArXiv.org

[8] O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky, Revealing the Dark Secrets of BERT (2019), ArXiv.org

[9] G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui, Attention Is Not Only a Weight: Analyzing Transformers with Vector Norms (2020), ArXiv.org

[10] A. Coenen, E. Reif, A. Yuan, B. Kim, A. Pearce, F. Viégas, and M. Wattenberg, Visualizing and Measuring the Geometry of BERT (2019), ArXiv.org

[11] S. Ontañón, J. Ainslie, V. Cvicek, and Z. Fisher, Making Transformers Solve Compositional Tasks (2021), ArXiv.org

[12] I. Tenney, D. Das, and E. Pavlick, BERT Rediscovers the Classical NLP Pipeline (2019), ArXiv.org

[13] J. Vig, A Multiscale Visualization of Attention in the Transformer Model (2019), ACL Anthology

[14] C. D. Manning, K. Clark, J. Hewitt, U. Khandelwal, and O. Levy, Emergent Linguistic Structure in Artificial Neural Networks Trained by Self-Supervision (2020), Proceedings of the National Academy of Sciences

[15] J. Hewitt and C. D. Manning, A Structural Probe for Finding Syntax in Word Representations (2019), Association for Computational Linguistics

[16] M.-C. de Marneffe, B. MacCartney, and C. D. Manning, Generating Typed Dependency Parses from Phrase Structure Parses (2006), LREC

[17] D. Jurafsky and J. H. Martin, Speech and Language Processing, Second Edition (2014), Pearson Education

[18] Universal Dependency Relations, UniversalDependencies.org
