Source: Generated by Midjourney.

How GPT works: A Metaphoric Explanation of Key, Value, Query in Attention, using a Tale of Potion

Lili Jiang
Towards Data Science
10 min read · Jun 17, 2023


Update: a 10-min video version of this post is now available!

The backbone of ChatGPT is the GPT model, which is built using the Transformer architecture. The backbone of the Transformer is the Attention mechanism. For many, the hardest concepts to grok in Attention are Key, Value, and Query. In this post, I will use a potion analogy to internalize these concepts. Even if you already understand the maths of the Transformer mechanically, I hope that by the end of this post you will have a more intuitive understanding of the inner workings of GPT from end to end.

This explanation requires no maths background. For the technically inclined, I add more technical explanations in [square brackets]. You can safely skip those bracketed notes, as well as side notes in quote blocks like this one. Throughout, I make up human-readable interpretations of the intermediate states of the Transformer model to aid the explanation, but GPT doesn’t think exactly like that.

[When I talk about “attention”, I exclusively mean “self-attention”, as that is what’s behind GPT. But the same analogy explains the general concept of “attention” just as well.]

The Setup

GPT can spew out paragraphs of coherent content, because it does one task superbly well: “Given a text, what word comes next?” Let’s role-play GPT: “Sarah lies still on the bed, feeling ____”. Can you fill in the blank?

One reasonable answer, among many, is “tired”. In the rest of the post, I will unpack how GPT arrives at this answer. (For fun, I put this prompt in ChatGPT and it wrote a short story out of it.)

The Analogy: (Key, Value, Query), or (Tag, Potion, Recipe)

You feed the above prompt to GPT. In GPT, each word is equipped with three things: a Key, a Value, and a Query, all learned from devouring the entire internet of text during the training of the GPT model. It’s the interaction among these three ingredients that allows GPT to make sense of a word in the context of a text. So what do they do, really?

Source: created by the author.

Let’s set up our analogy of alchemy. For each word, we have:

  • A potion (aka “value”): The potion contains rich information about the word. For illustrative purposes, imagine that the potion of the word “lies” contains information like “tired; dishonesty; can have a positive connotation if it’s a white lie; …”. The word “lies” can take on multiple meanings, e.g. “tell lies” (associated with dishonesty) or “lies down” (associated with being tired). You can only tell the real meaning from the context of a text. Right now, the potion contains information for both meanings, because it doesn’t have the context of a text.
  • An alchemist’s recipe (aka “query”): The alchemist of a given word, e.g. “lies”, goes over all the nearby words. He finds a few of those words relevant to his own word “lies”, and he is tasked with filling an empty flask with the potions of those words. The alchemist has a recipe, listing the specific criteria that identify which potions he should pay attention to.
  • A tag (aka “key”): Each potion (value) comes with a tag (key). If the tag (key) matches well with the alchemist’s recipe (query), the alchemist will pay attention to this potion.
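
For the technically inclined, here is a minimal NumPy sketch of where the recipe, tag, and potion come from: each word’s embedding is multiplied by three learned weight matrices. (The dimensions and weights below are made up for illustration; they are not GPT’s actual ones.)

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                    # toy sizes; GPT's real dimensions are much larger
x = rng.normal(size=(7, d_model))      # embeddings of the 7 words in "Sarah lies still on the bed, feeling"

# Learned projection matrices (random here, purely for illustration).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = x @ W_Q   # each word's recipe (query)
K = x @ W_K   # each word's tag (key)
V = x @ W_V   # each word's potion (value)

print(Q.shape, K.shape, V.shape)       # (7, 4) (7, 4) (7, 4): one query, key, value per word
```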

Attention: the Alchemist’s Potion Mixology

The potions with their tags. Source: created by the author.

In the first step (attention), the alchemists of all words each go out on their own quests to fill their flasks with potions from relevant words.

Let’s take the alchemist of the word “lies” for example. He knows from previous experience — after being pre-trained on the entire internet of texts — that words that help interpret “lies” in a sentence are usually of the form: “some flat surfaces, words related to dishonesty, words related to resting”. He writes down these criteria in his recipe (query) and looks for tags (key) on the potions of other words. If the tag is very similar to the criteria, he will pour a lot of that potion into his flask; if the tag is not similar, he will pour little or none of that potion.

So he finds the tag for “bed” says “a flat piece of furniture”. That’s similar to “some flat surfaces” in his recipe! He pours the potion for “bed” in his flask. The potion (value) for “bed” contains information like “tired, restful, sleepy, sick”.

The alchemist for the word “lies” continues the search. He finds that the tag for the word “still” says “related to resting” (among other connotations of the word “still”). That matches his criterion “words related to resting”, so he pours in part of the potion from “still”, which contains information like “restful, silent, stationary”.

He looks at the tags of “on”, “Sarah”, “the”, and “feeling” and doesn’t find them relevant. So he doesn’t pour any of their potions into his flask.

Remember, he needs to check his own potion too. The tag of his own potion “lies” says “a verb related to resting”, which matches his recipe. So he pours some of his own potion into the flask as well, which contains information like “tired; dishonest; can have a positive connotation if it’s a white lie; …”.

By the end of his quest to check words in the text, his flask is full.

Source: created by the author.

Unlike the original potion for “lies”, this mixed potion now takes into account the context of this very specific sentence. Namely, it has a lot of elements of “tired, exhausted” and only a pinch of “dishonest”.

In this quest, the alchemist knows to pay attention to the right words and combines the values (potions) of those relevant words. This is the metaphorical version of “attention”. We’ve just explained the most important equation of the Transformer, the underlying architecture of GPT:

Attention(Q, K, V) = softmax(Q·K.transpose() / sqrt(d_k)) · V

Q is Query; K is Key; V is Value; d_k is the dimension of the Key. Source: Attention is All You Need
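
If you’d like to see that equation run, here is a minimal NumPy sketch of scaled dot-product attention. (The dimensions and values are toy ones, purely for illustration.)

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # how well each recipe (query) matches each tag (key)
    weights = softmax(scores, axis=-1)     # each row sums to 1: the flask comes back exactly full
    return weights @ V                     # mix the potions (values), weighted by attention

# Toy example: 7 words, with 4-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(7, 4)) for _ in range(3))
mixed = attention(Q, K, V)
print(mixed.shape)                         # (7, 4): one mixed flask per word
```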

Advanced notes:

1. Each alchemist looks at every bottle, including their own [Q·K.transpose()].

2. The alchemist can match his recipe (query) with the tag (key) quickly and make a fast decision. [The similarity between query and key is determined by a dot product, which is a fast operation.] Additionally, all alchemists do their quests in parallel, which also helps speed things up. [Q·K.transpose() is a matrix multiplication, which is parallelizable. Speed is a winning feature of the Transformer, compared to its predecessor, the Recurrent Neural Network, which computes sequentially.]

3. The alchemist is picky. He only selects the top few potions, instead of mixing in a bit of everything. [We use softmax to collapse Q·K.transpose(). Softmax will pull the inputs into more extreme values, and collapse many inputs to near-zero.]

4. At this stage, the alchemist does not take into account the ordering of words. Whether it’s “Sarah lies still on the bed, feeling” or “still bed the Sarah feeling on lies”, the filled flask (output of attention) will be the same. [In the absence of “positional encoding”, Attention(Q, K, V) is independent of word positions.]

5. The flask always comes back exactly 100% full, no more, no less. [The softmax weights are normalized to sum to 1.]

6. The alchemist’s recipe and the potions’ tags must speak the same language. [The Query and Key must be of the same dimension to be able to dot product together to communicate. The Value can take on a different dimension if you wish.]

7. Technically astute readers may point out that we didn’t do masking. I don’t want to clutter the analogy with too many details, but I will explain it here. In masked self-attention, each word can only see itself and the previous words. So in the sentence “Sarah lies still on the bed, feeling”, “lies” only sees “Sarah” (and itself); “still” only sees “Sarah” and “lies” (and itself). The alchemist of “still” can’t reach into the potions of “on”, “the”, “bed”, and “feeling”.
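
Here is a minimal NumPy sketch of that causal mask: positions a word is not allowed to see are set to negative infinity before the softmax, so they end up with zero weight. (Toy dimensions and random values, purely for illustration.)

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Causal (masked) self-attention: word i only attends to words 0..i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future words
    scores = np.where(future, -np.inf, scores)           # future words get -inf, i.e. zero weight
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(7, 4)) for _ in range(3))
out = masked_attention(Q, K, V)   # "still" can no longer reach the potions of "on", "the", "bed", "feeling"
```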

Feed Forward: Chemistry on the Mixed Potions

Up to this point, the alchemist has simply been pouring potions from other bottles. In other words, he pours the potion of “lies” (“tired; dishonest; …”) as a uniform mixture into the flask; he can’t distill out the “tired” part and discard the “dishonest” part just yet. [Attention simply sums the different V’s together, weighted by the softmax.]

Source: created by the author.

Now comes the real chemistry (feed forward). The alchemist mixes everything together and does some synthesis. He notices interactions between words like “sleepy” and “restful”, etc. He also notices that “dishonesty” is only mentioned in one potion. He knows from past experience how to make some ingredients interact with each other and how to discard the one-off ones. [The feed forward layer is a linear (and then non-linear) transformation of the mixed potion, i.e. the attention output. Feed forward layers are the building blocks of neural networks. You can think of this as the “thinking” step in the Transformer, while the earlier mixology step is simply “collecting”.]

The resulting potion after his processing becomes much more useful for the task of predicting the next word. Intuitively, it represents some richer properties about this word in the context of its sentence, in contrast with the starting potion (value) that is out of context.
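
For the technically inclined, here is a minimal sketch of that position-wise feed forward step: two linear transformations with a non-linearity in between. (I use a plain ReLU and random weights here for simplicity; this is an illustration, not GPT’s exact layer.)

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward: applied to each word's mixed flask independently."""
    hidden = np.maximum(0, x @ W1 + b1)   # linear, then non-linear (a plain ReLU here)
    return hidden @ W2 + b2               # linear back down to the model dimension

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                     # the inner layer is typically wider than the model dimension
mixed = rng.normal(size=(7, d_model))     # one mixed flask per word, from the attention step
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

synthesized = feed_forward(mixed, W1, b1, W2, b2)
print(synthesized.shape)                  # (7, 8): still one potion per word, now "synthesized"
```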

The Final Linear and Softmax Layer: the Assembly of Alchemists

How do we get from here to the final output, which is to predict that the next word after “Sarah lies still on the bed, feeling ___” is “tired”?

So far, each alchemist has been working independently, only tending to his own word. Now all the alchemists of different words assemble and stack their flasks in the original word order and present them to the final linear and softmax layer of the Transformer. What do I mean by this? Here, we must depart from the metaphor.

This final linear layer synthesizes information across different words. One plausible thing to have learned from pre-training is that the immediately previous word is important for predicting the next word. As an example, the linear layer might focus heavily on the last flask (“feeling”’s flask).

Then, combined with the softmax layer, this step assigns every single word in our vocabulary a probability of being the next word after “Sarah lies still on the bed, feeling…”. For example, non-English words will receive probabilities close to 0, while words like “tired”, “sleepy”, and “exhausted” will receive high probabilities. We then pick the top winner as the final answer.

Source: created by the author.
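
Here is a minimal sketch of that final step: project the last word’s processed potion onto the vocabulary and apply a softmax. (The tiny vocabulary and random weights are made up for illustration, so the “prediction” here is meaningless; a trained model would put its weight on words like “tired”.)

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_model = 8
vocab = ["tired", "sleepy", "exhausted", "dishonest", "banana", "the"]   # a made-up, tiny vocabulary

h_last = rng.normal(size=d_model)                   # the processed flask of the last word, "feeling"
W_out = rng.normal(size=(d_model, len(vocab)))      # the final linear layer (random here, so untrained)

probs = softmax(h_last @ W_out)                     # one probability per word in the vocabulary
next_word = vocab[int(np.argmax(probs))]            # pick the top winner as the predicted next word
print(dict(zip(vocab, probs.round(2))), "->", next_word)
```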

Recap

Now you’ve built a minimalist GPT!

To recap: in the attention step, each word determines which words (including itself) it should pay attention to, based on how well its query (recipe) matches the other words’ keys (tags). It then mixes together those words’ values (potions), in proportion to the attention it pays to them. It processes this mixture to do some “thinking” (feed forward). Once every word is processed this way, the mixtures of all the words are combined to do more “thinking” (linear layer) and make the final prediction of what the next word should be.

Source: created by the author.

Side note: the term “decoder” is a vestige of the original paper, as the Transformer was first used for machine translation tasks. You “encode” the source language into embeddings, and “decode” from the embeddings into the target language.
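
Putting the recap together, here is a minimal sketch of one decoder block: masked attention (“collect”) followed by feed forward (“synthesize”). (This toy version leaves out several pieces of the real architecture, such as the positional encoding from advanced note 4; the function and weight names are mine, for illustration only.)

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def decoder_block(x, p):
    """One minimalist block: masked self-attention ("collect"), then feed forward ("synthesize")."""
    Q, K, V = x @ p["W_Q"], x @ p["W_K"], x @ p["W_V"]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    mixed = softmax(np.where(future, -np.inf, scores)) @ V       # attention: fill each word's flask
    return np.maximum(0, mixed @ p["W_1"]) @ p["W_2"]            # feed forward: do the chemistry

rng = np.random.default_rng(0)
d = 8
params = {name: rng.normal(size=(d, d)) for name in ["W_Q", "W_K", "W_V", "W_1", "W_2"]}
x = rng.normal(size=(7, d))        # embeddings of the 7 words in the prompt
out = decoder_block(x, params)     # (7, 8): one context-aware potion per word
```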

This is a good stopping point.

If you are eager to learn more, we will go over two more variants on top of the minimalist architecture above: Multi-Head Attention and repeated blocks.

Multi-Head Attention: Many Sets of Alchemists

So far, every word has only one alchemist’s recipe, one tag, and one potion. [For each word, the query, key, and value are each a single vector, not a matrix.] But we can get better results if we equip each word with a few sets of recipes, tags, and potions. For reference, GPT uses 12 sets (aka 12 “attention heads”) per word. Maybe for each word, the alchemist from the first set specializes in analyzing sentiments, the alchemist from the second set specializes in resolving references (what does “it” refer to), etc.

Advanced note: The group of sentiment alchemists have only studied the sentiment potions; they wouldn’t know how to handle potions from the other sets, nor will they ever touch those. [V, K, Q from the same attention head are jointly trained. V, K, Q from different attention heads don’t interact in the Multi-Head Attention step.]

Source: created by the author.

The 12 alchemists of each word present their specialized, filled flasks together [concatenate the outputs of the different attention heads]. As a group, they do a giant chemical reaction using all these flasks and present one resulting potion [the concatenated heads are passed through a linear projection, followed by the feed forward layer].

Advanced note: Just like before, within the decoder block, flasks of different words don’t get mixed together. [The feed forward step is position-wise, meaning it applies to each word independently.]

This metaphor explains the following equation and diagram from the original paper.

Source: Attention is All You Need.
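
For the technically inclined, here is a minimal NumPy sketch of Multi-Head Attention, with 2 heads instead of GPT’s 12 and the causal mask omitted for brevity. (All weights are random and the names are mine, purely for illustration.)

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, heads, W_O):
    """Each head is its own set of alchemists; their flasks are concatenated, then linearly combined."""
    outputs = []
    for W_Q, W_K, W_V in heads:                    # heads work independently and never share potions
        Q, K, V = x @ W_Q, x @ W_K, x @ W_V
        scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (causal mask omitted here for brevity)
        outputs.append(softmax(scores) @ V)
    return np.concatenate(outputs, axis=-1) @ W_O  # concatenate the heads, then project back down

rng = np.random.default_rng(0)
d_model, d_head, n_heads = 8, 4, 2
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))
x = rng.normal(size=(7, d_model))
print(multi_head_attention(x, heads, W_O).shape)   # (7, 8)
```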

Stacking the Blocks: And … Repeat!

Source: created by the author.

For better results, we repeat the decoder block N times. For reference, GPT repeats it 12 times. Intuitively, you want to intersperse the act of collecting potions from other relevant words (attention) with synthesizing those potions on your own to extract meaning from them (feed forward): collect, synthesize; collect, synthesize…
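
In code terms, this stacking is just a loop: the output of one block becomes the input of the next. (The decoder_block below is a trivial stand-in; see the earlier sketch for a fuller version.)

```python
import numpy as np

def decoder_block(x, W):
    """Trivial stand-in for one collect-and-synthesize block (see the fuller sketch earlier)."""
    return np.maximum(0, x @ W)

rng = np.random.default_rng(0)
N, d = 12, 8                                                  # GPT repeats the block 12 times
blocks = [rng.normal(size=(d, d)) * 0.1 for _ in range(N)]    # one set of learned weights per block

x = rng.normal(size=(7, d))         # the prompt's word embeddings
for W in blocks:                    # collect, synthesize; collect, synthesize...
    x = decoder_block(x, W)         # the output of one block is the input of the next
```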

Now you know alchemy… I mean… GPT!
