
All You Need to Know about In-Context Learning

What it is, how it works, and what makes Large Language Models so powerful

| IN-CONTEXT LEARNING | LARGE LANGUAGE MODELS | LLMs |

Photo by 🇸🇮 Janko Ferlič on Unsplash

"For me context is the key – from that comes the understanding of everything." – Kenneth Noland

In-context learning (ICL) is one of the most surprising skills of large language models. First observed with GPT-3, it immediately caught researchers' attention. What exactly is ICL? And, more importantly, what gives rise to it?

This article is divided into sections, each answering one of these questions:

  • What is In-Context Learning (ICL)? Why is it interesting? Why is it useful?
  • The mystery of ICL: how does it work? Is it the training data? The prompt? The architecture?
  • What is the future of ICL? What challenges remain?

Check the list of references at the end of the article; I also provide some suggestions for exploring the topics in more depth.

What is In-Context Learning (ICL)?

Photo by Dmitry Ratushny on Unsplash

"The limits of my language mean the limits of my world." – Ludwig Wittgenstein

Before Large Language Models (LLMs), an artificial intelligence model was limited to the data it was trained on. In other words, a model could only solve the tasks its training was designed for.

GPT-3 and today’s LLMs, on the other hand, show a new capability: they can learn new skills and solve new tasks simply by being shown new examples in the input (prompt). In this case, we are not training the model; there is no gradient update or change in model parameters. This skill is called In-Context Learning (ICL).

image source: here

To be more specific, we interact with a model by providing natural-language instructions in a prompt. Although this may seem limiting, a prompt can contain several examples, up to a certain number of tokens (the context window). And despite the textual format, the prompt can also get the model to solve mathematical exercises: we can insert examples of word corrections, arithmetic problems, translations, programming tasks, and so on.

image source: here

Now we can give a formal definition of ICL:

In-context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration. (source)

Simply put, by giving a model a list of input-output pairs that demonstrate a task, the model reads the demonstration examples to figure out the input and output distributions, learns to map inputs to outputs, and generates an appropriate response for a new input.
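To make this concrete, here is a minimal sketch of what an ICL prompt can look like in practice: a handful of input-output demonstrations followed by the query we want answered. The reviews, labels, and format below are invented for illustration, and no model is actually called.

```python
# A minimal sketch of a few-shot (in-context learning) prompt.
# The demonstrations are illustrative; no model is called here.
demonstrations = [
    ("The movie was a wonderful surprise.", "positive"),
    ("I wasted two hours of my life.", "negative"),
    ("The soundtrack alone is worth the ticket.", "positive"),
]

query = "The plot made no sense at all."

# Each demonstration is an input-output pair rendered in a fixed format;
# the model is expected to continue the pattern for the final input.
prompt = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in demonstrations)
prompt += f"\nReview: {query}\nSentiment:"

print(prompt)  # this string would be sent to the LLM as-is
```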

As shown in this study, this simple idea helps the model perform certain tasks more easily. Giving the model unambiguous instructions on how to perform a task allows it to understand and solve the task better. Moreover, ICL with only a few examples is competitive with training models on far more labeled data.

This has led to the emergence of various strategies to exploit ICL (prompt engineering), since changing the prompt can yield better performance than fine-tuning for a specific task.

image source: here

Speak to me: How many words a model is reading

This behavior also seems to emerge only at scale: ICL appears only above a certain number of parameters. For some capabilities, the model shows near-random performance up to a certain size, and then its performance abruptly improves.

image source: here

Emergent Abilities in AI: Are We Chasing a Myth?

In brief, this behavior is actively researched and studied because it has definite advantages:

  • The examples are written in natural language, so communication with the model is interpretable and understandable to humans. It is also much easier to integrate human knowledge, because you only need to change the prompt.
  • In-context learning also resembles how humans learn, since it recalls the process of learning by analogy.
  • It is training-free: we do not have to train the model (unlike supervised learning). This makes it much cheaper computationally and very efficient, since the skill is acquired instantly.
  • It also means that the model can be offered as a service and deployed for many tasks; in fact, tasks can be taught by everyday users.
  • ICL allows the model to generalize: it learns the underlying patterns and rules present in the examples and then applies them to new situations. It also provides versatility, since it can be applied to many different types of skills.

It looks amazing, but how does it work?

The mystery of ICL: how does it work?

Photo by 𝓴𝓘𝓡𝓚 𝕝𝔸𝕀 on Unsplash

"Usually, if you want to fine-tune these models, you need to collect domain-specific data and do some complex engineering. But now we can just feed it an input, five examples, and it accomplishes what we want. So, in-context learning is an unreasonably efficient learning phenomenon that needs to be understood," Akyürek says. (source)

As much as [ICL](https://en.wikipedia.org/wiki/Prompt_engineering) seems almost magical, it also has its limitations. GPT-3, for example, showed what seemed like incredible reasoning capabilities. Yet on some datasets that require reasoning, such as the Winograd dataset, ICL did not appear:

A Winograd schema is a pair of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways in the two sentences and requires the use of world knowledge and reasoning for its resolution. (source)

In fact, there was no improvement with the use of a few examples in the prompt.

image source: here

These facts and other seemingly contradictory behaviors have led researchers to ask: where does ICL originate? Why does it work better than fine-tuning? Can ICL be improved by changing the prompt?

Meanwhile, one must remember that most skills are learned during pre-training, the first step of training an LLM. It requires huge amounts of text and typically consists of asking the model to predict a word in a sequence given the previous part of the sequence. This step is the most expensive, time-consuming, and resource-intensive one. During alignment (the transition from GPT-3.5 to ChatGPT), the model only improves its ability to exploit this knowledge and to interact with humans.
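As a rough illustration of the pre-training objective just described, the sketch below computes the next-token cross-entropy loss for a toy batch. PyTorch is assumed here only for convenience, and the random logits stand in for what a real transformer would produce.

```python
import torch
import torch.nn.functional as F

# Toy setup: batch of 2 sequences, 6 tokens each, vocabulary of 100 tokens.
vocab_size, batch, seq_len = 100, 2, 6
token_ids = torch.randint(0, vocab_size, (batch, seq_len))

# Placeholder logits standing in for a transformer's output
# (one distribution over the vocabulary per position).
logits = torch.randn(batch, seq_len, vocab_size)

# Next-token prediction: position t must predict token t+1,
# so inputs and targets are shifted by one.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = token_ids[:, 1:].reshape(-1)

loss = F.cross_entropy(pred, target)
print(loss.item())  # the quantity minimized during pre-training
```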

The Infinite Babel Library of LLMs

Use what you see: the impact of pre-training on ICL

During pre-training, LLMs are thus exposed to an enormous amount of text: from Wikipedia, books (fiction and nonfiction), scientific articles, tweets, Reddit posts, blog posts, internet dumps, and so on.

In a 2022 article, it was proposed that [ICL](https://en.wikipedia.org/wiki/Prompt_engineering) can be considered a kind of implicit fine-tuning. The main difference is that ICL is produced only by forward computation, while fine-tuning also has a backpropagation step (in which parameters are updated). This confirms that ICL must originate during pretraining, but how does pretraining influence it?

As one article showed, the pretraining dataset is critical for a model to develop [ICL](https://en.wikipedia.org/wiki/Prompt_engineering). According to the authors, the source domain is more important than the size of the [corpus](https://en.wikipedia.org/wiki/Text_corpus). Also, putting several corpora together can lead to the emergence of ICL (if two corpora alone do not give ICL, joining them can). Another important factor is the domain relevance of the corpus: training only on a news corpus yields in-context learning ability mainly on news-related downstream tasks. Finally, the authors note that although perplexity (one of the most commonly used metrics for tracking [LLMs](https://en.wikipedia.org/wiki/Large_language_model)) and ICL generally correlate, perplexity alone does not reflect a model’s ability for ICL (comparing two LLMs, the one with the lowest perplexity is not necessarily the one with the strongest ICL).
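Since perplexity comes up here as the standard metric, a quick reminder of how it is computed may help: it is simply the exponential of the average next-token cross-entropy. The loss value below is made up for illustration.

```python
import math

# Average next-token cross-entropy (in nats) on a held-out corpus; value is invented.
avg_cross_entropy = 2.3

perplexity = math.exp(avg_cross_entropy)
print(f"perplexity ≈ {perplexity:.1f}")
# Lower perplexity means the model assigns higher probability to the held-out text,
# but, as the study above notes, it does not guarantee stronger ICL.
```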

This was further confirmed by work showing that the distribution of the training data matters for ICL. According to the authors, the training examples should appear in clusters (i.e., several examples for each class) and cover a sufficiently large variety of classes, including rare ones.

image source: here

Another study states instead that in-context learning appears when the pretraining distribution (the training data) is an implicit mixture of tasks: the pretraining examples are drawn from a mixture of tasks, and the association between examples and tasks is latent. Once the model is trained, it manages on its own to uncover the latent task in the demonstration. For example, a series of tweets with positive and negative content represents a latent sentiment-analysis task.
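To build some intuition for what an "implicit mixture" of tasks means, here is a toy sketch that generates pretraining-like snippets by first sampling a latent task and then sampling examples of it. The tasks and text pairs are invented; the point is only that the task name never appears in the document, it is implied by the pairs that co-occur in it.

```python
import random

random.seed(0)

# Toy latent tasks: each task is a pool of (text, label) pairs. Invented data.
latent_tasks = {
    "sentiment": [("I loved this phone", "positive"), ("Terrible battery", "negative")],
    "translation": [("cat", "gatto"), ("dog", "cane")],
}

def sample_document(n_examples: int = 3) -> str:
    # The task is latent: it is never written in the document explicitly,
    # only implied by the kind of pairs that appear together.
    task = random.choice(list(latent_tasks))
    pairs = random.choices(latent_tasks[task], k=n_examples)
    return "\n".join(f"{x} -> {y}" for x, y in pairs)

for _ in range(2):
    print(sample_document(), end="\n---\n")
```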

In short, these articles claim that ICL appears if the dataset is diverse: it should contain many different classes (but, at the same time, several examples per class), cover multiple domains, and ideally its examples should represent latent NLP tasks. Since LLMs are generally trained on a huge amount of text, these premises are met.

Say Once! Repeating Words Is Not Helping AI

How you use what you learn: can you recall what you learned?

Some researchers have attempted to develop a framework for understanding how [ICL](https://en.wikipedia.org/wiki/Prompt_engineering) emerges during pretraining. According to the authors, an [LLM](https://en.wikipedia.org/wiki/Large_language_model) uses ICL to "locate" the concepts needed to perform the task. The idea is that during training the model acquires latent concepts and then finds them again during ICL. To find them again, the LLM can use all or some components of a prompt: format, inputs, outputs, and input-output mapping.

As explained in a blog post by the authors, the model learns several concepts during training; at inference time it uses the examples in the prompt to understand whether the required task is, say, sentiment analysis or topic classification, and at that point applies the mapping to the test input:

In this paper, we study how in-context learning can emerge when pretraining documents have long-range coherence. Here, the LM must infer a latent document-level concept to generate coherent next tokens during pretraining. At test time, in-context learning occurs when the LM also infers a shared latent concept between examples in a prompt. (source)

image source: here

Yes, but what is a concept? For the authors, it is "a latent variable that contains various document-level statistics." So the concept for a topic (for example, news) includes the distribution of words (which words are used), the format (how they are written), the relationship between articles and topics, and so on.

The body of text provided to the model is not a set of random words: texts have their own internal coherence. In other words, similar texts share semantic information (the same topic) and formatting (programming documentation, for example, alternates explanations and code snippets). By learning to predict a word given the words that precede it, the LLM also models this internal coherence, which allows it to infer the latent concepts present in a prompt.

In the authors’ words:

1. Pretrain: To predict the next token during pretraining, the LM must infer ("locate") the latent concept for the document using evidence from the previous sentences.

2. In-context learning: If the LM also infers the prompt concept (the latent concept shared by examples in the prompt) using in-context examples in the prompt, then in-context learning occurs! (source)

So for the authors, this process of "locating" can be seen as Bayesian inference, in which the LLM infers the concept in the prompt (a concept shared by all the examples presented in the input prompt). Once it has inferred the concept, it can produce the correct answer.

As a formula:

image source: here
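For reference, the marginalization shown in the figure (from Xie et al.) can be written, up to notation, as:

```latex
p(\text{output} \mid \text{prompt})
  = \int_{\text{concept}} p(\text{output} \mid \text{concept}, \text{prompt})\,
    p(\text{concept} \mid \text{prompt})\, d(\text{concept})
```

As more demonstrations are added, the posterior over concepts, p(concept | prompt), should concentrate on the concept shared by the examples, so the model effectively "selects" the right task.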

Ask nicely: effect of the prompt

In recent work, Min et al. defined the characteristics of a prompt for ICL and studied how its various components affect the model’s performance in doing ICL.

image source: here

Considering a demonstration as a set of input-output pairs ((x1, y1), …, (xk, yk)), there are four formal aspects:

  • The input-label mapping: whether each input x is correctly paired with its label y.
  • The distribution of the input text: the distribution from which the inputs x are drawn.
  • The label space: the space of the outputs y.
  • The format: specifically, how the input-output pairs are formatted in the demonstration.

For the authors, the format, the input distribution, and the label space are important. In contrast, the input-label mapping matters little for ICL. According to the Stanford AI Lab, this would stem from the fact that the model is already exposed to input-output mappings during pretraining, so it does not need the input-label mapping in the demonstration. The other elements, instead, are needed to locate the concepts it has learned (in short, to perform Bayesian inference).
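This finding can be made concrete with a toy comparison: the two prompts below share the same inputs, label space, and format, but one uses gold labels and the other randomly assigned ones. The data is invented and no model is queried; in Min et al.'s experiments, prompts of these two kinds yield surprisingly similar accuracy.

```python
import random

random.seed(0)

examples = [
    ("The acting was superb.", "positive"),
    ("A dull, lifeless film.", "negative"),
    ("I would watch it again.", "positive"),
]
label_space = ["positive", "negative"]

def build_prompt(pairs, query):
    body = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in pairs)
    return f"{body}\nReview: {query}\nSentiment:"

query = "The ending felt rushed."

gold_prompt = build_prompt(examples, query)

# Same inputs, same label space, same format -- only the mapping is randomized.
random_pairs = [(x, random.choice(label_space)) for x, _ in examples]
random_prompt = build_prompt(random_pairs, query)

print(gold_prompt, random_prompt, sep="\n\n===\n\n")
```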

image source: here

Another paper states that the input-label mapping actually does matter, while according to yet another it is indeed important, but a sufficiently large model can learn the mapping on its own from the demonstration.

For other authors, it is important that the demonstrations are diverse and simple, yet similar to the test example (at least in structure). For another paper, the order of the demonstrations matters. Liu et al., meanwhile, show that the choice of examples strongly impacts ICL, so one should choose examples that are close to the query in [embedding](https://en.wikipedia.org/wiki/Embedding) space. In fact, one technique that shows good results is to embed the question and retrieve the examples that are closest to it in the embedding space.
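Here is a minimal sketch of that retrieval idea: embed the candidate demonstrations and the query, then keep the nearest neighbors by similarity. The `embed` function below is a stand-in (a hashed bag of words); in practice one would use a real sentence encoder.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in encoder: hashed bag-of-words. A real setup would use a
    # sentence-embedding model instead of this toy function.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

candidates = [
    "Translate 'cat' into French.",
    "What is the capital of Italy?",
    "Translate 'dog' into French.",
    "Sum 17 and 25.",
]
query = "Translate 'bird' into French."

# Pick the k demonstrations closest to the query in embedding space.
k = 2
sims = [float(embed(c) @ embed(query)) for c in candidates]
selected = [c for _, c in sorted(zip(sims, candidates), reverse=True)[:k]]
print(selected)
```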

image source: here

The AI college student goes back to the bench

A closer look at attention

So far we have seen the role of the training dataset and of the prompt; now it is time to take a closer look at the effect of the architecture. An LLM is a transformer, and the transformer is mainly based on multi-head self-attention. Because [ICL](https://en.wikipedia.org/wiki/Prompt_engineering) is one of the most interesting behaviors of LLMs, many authors have tried to find a mechanistic answer to how it occurs:

If we can understand the internal structures that cause Transformer models to produce the outputs they do, then we may be able to address current safety problems more systematically, as well as anticipating safety problems in future more powerful models. (source)

image source: here

Researchers at Anthropic identified circuits they called induction heads. An induction head is a circuit consisting of two attention heads in different layers that cooperate to copy or complete patterns. Basically, a first attention head copies information from the previous token into the current one, so that a second attention head knows which token preceded each position. This second head can then look back in the sequence for a place where the present token A appeared and was followed by some token B, making the model more likely to output B once it sees A again.

image source: here

For the authors, however, this is not a simple copying mechanism. In inductive reasoning, we can infer that A is followed by B if, earlier in the context, we saw that A was most likely followed by B. For the authors, these induction heads crystallize this inference mechanism, which is based not on the training data but on the context: [A][B]…[A]→[B].
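The completion rule [A][B]…[A]→[B] can be mimicked in a few lines of code: scan the context for the previous occurrence of the current token and propose the token that followed it. This is, of course, only a caricature of what the two cooperating attention heads do inside the transformer.

```python
def induction_guess(tokens: list[str]) -> str | None:
    """Toy version of the induction-head rule: if the last token A appeared
    earlier in the sequence, predict the token B that followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # search backwards through the context
        if tokens[i] == current:
            return tokens[i + 1]
    return None

context = ["Mr", "Dursley", "was", "thin", ",", "Mr"]
print(induction_guess(context))  # -> "Dursley"
```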

For Anthropic, these induction heads play an important role in ICL: the ability to learn and repeat arbitrary sequences can be thought of as a simplified form of few-shot learning. In a large model this effect is amplified, since the heads can work on abstract representations. Thus: "the very same heads that do this sequence copying also take on a more expanded role of analogical sequence copying or in-context nearest neighbors".

This mechanism is also interesting because it supports a fuzzier kind of sequence completion, [A*][B*] … [A] → [B*], where A* and B* are not literally the same tokens as A and B but tokens that are similar in embedding space (for example, the same word in different languages).

These induction heads seem to appear as the LLM improves its [ICL](https://en.wikipedia.org/wiki/Prompt_engineering) skill. Also, according to Anthropic, this relationship with ICL can already be observed in small LMs (for them, the induction heads are the driver of ICL). In addition, these induction heads can be reverse-engineered, which seems like a promising line of research for understanding how they form and how they impact ICL.

image source: here

In any case, [ICL](https://en.wikipedia.org/wiki/Prompt_engineering) is linked to, and emerges through, attention. Attention, however, has a quadratic computational cost. Several models with simplified forms of attention (linear or logarithmic) have been tested, but this led to a decrease in expressiveness and impacted the models' ICL ability. Therefore, when an alternative to multi-head self-attention is proposed, the authors must take care that their model is still capable of ICL.

Welcome Back 80s: Transformers Could Be Blown Away by Convolution

Learning to learn the context

Clearly, since the transformer is trained through gradient descent, there is a relationship between the latter and [ICL](https://en.wikipedia.org/wiki/Prompt_engineering). Using linear regression as a starting point, Akyürek suggests that transformers implicitly treat ICL as an optimization problem.

[Oswald](https://arxiv.org/abs/2212.07677) showed that transformer layers can theoretically implement [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) on the in-context data. According to Oswald, in-context learning mimics gradient descent in certain cases; the paper thus shows that there is a relationship between ICL and gradient descent.

Image source: here

The results above, and [Akyürek](https://arxiv.org/abs/2211.15661)’s results, mean that models doing in-context learning are not just matching previously seen patterns but are actually learning to perform new tasks (an extension of the induction-head idea). In fact, Akyürek used prompts containing synthetic data, to ensure the model could not have seen the data before.

Akyürek’s hypothesis is then that models internally run a sort of machine-learning algorithm (which partly echoes, but extends, the idea that the model performs Bayesian inference). In the article, they state that the model implements a linear model in its hidden states, and that this is learned during training.

"In this case, we tried to recover the actual solution to the linear model, and we could show that the parameter is written in the hidden states. This means the linear model is in there somewhere," Akyürek says. (source)

In an intriguing experiment, Google tested whether models can use ICL to override their prior knowledge; for the authors, this is also an example of an emergent property of large LLMs.

In one of the experiments, the authors performed regular ICL, flipped ICL (where the labels are flipped), and semantically-unrelated label ICL (SUL-ICL), where the labels are words with no semantic relation to the task.
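A quick sketch of how the three prompt variants differ, using made-up sentiment examples: regular ICL keeps the true labels, flipped ICL swaps them, and SUL-ICL replaces them with semantically unrelated tokens such as "foo"/"bar".

```python
examples = [
    ("The service was excellent.", "positive"),
    ("My order arrived broken.", "negative"),
]

flip = {"positive": "negative", "negative": "positive"}
sul = {"positive": "foo", "negative": "bar"}  # semantically unrelated labels

def render(pairs):
    return "\n".join(f"Input: {x}\nLabel: {y}" for x, y in pairs)

regular = render(examples)
flipped = render([(x, flip[y]) for x, y in examples])
sul_icl = render([(x, sul[y]) for x, y in examples])

print(regular, flipped, sul_icl, sep="\n\n---\n\n")
```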

image source: here

This article shows some interesting things. When the labels are flipped (but the ground-truth evaluation is kept the same), a model able to override its prior knowledge should show a decrease in measured accuracy. The result is that the performance of small models stays flat, while large models show a clear drop.

These results indicate that large models can override prior knowledge from pre-training when contradicting input-label mappings are presented in-context. Small models can’t do this, making this ability an emergent phenomena of model scale. (source)

image source: here

Second, the model can also learn input-label mappings when the demonstration uses semantically irrelevant labels ("foo"/"bar" instead of "negative"/"positive" for sentiment analysis). A model that relies only on [prior knowledge](https://en.wikipedia.org/wiki/Prior_knowledge_for_pattern_recognition) should suffer a performance drop, because it cannot exploit the semantic meaning of the labels for its predictions. In fact, small models do show a drop, while large models do not. For the authors, this means that while small models rely on prior knowledge, "large models, on the other hand, have the ability to learn input-label mappings in context when the semantic nature of labels is removed."

image source: here

The authors also looked at the effect of instruction tuning on ICL. During instruction tuning, the model is trained on instructions that often contain questions and answers. Since this process involves natural-language labels, the authors wondered whether it affects an LLM’s ability to learn input-label mappings or to exploit semantic prior knowledge. The experiments show that while "instruction tuning improves the ability to learn input-label mappings, it strengthens the usage of semantic prior knowledge more."

So these results show that it is not only the architecture, the amount of data, and the prompt that influence ICL: for Google, the number of parameters itself matters as well:

These results underscore how the in-context learning behavior of language models can change depending on the scale of the language model, and that larger language models have an emergent ability to map inputs to many types of labels, a form of true symbolic reasoning in which input–label mappings can be learned for arbitrary symbols. (source)

Scaling Isn’t Everything: How Bigger Models Fail Harder

Conclusions, challenges, and perspectives

Photo by Nadine Shaabana on Unsplash

"Separate text from context and all that remains is a con." ― Stewart Stafford

"Words are never good or bad on their own, context makes them so." ― Abhijit Naskar

In-context learning is one of the most interesting and elusive behaviors of LLMs. First observed with the publication of GPT-3, it has excited the community about its possible applications.

ICL, in simple terms, is the ability to learn by analogy: it only takes a few examples in a demonstration for the model to make a prediction. This gives the model unprecedented versatility and opens the possibility of endless applications.

Despite this, we still do not understand precisely how it originates during training. We have seen the importance of the training data, the prompt, and attention itself. Today, with proposals to replace attention-based models with new architectures, we need to understand how to preserve ICL.

Research on ICL is very active; some of the main lines of research are:

  • New pretraining strategies. As mentioned, a training strategy that improves classical performance (lower perplexity) does not necessarily improve the model’s ICL skills. Therefore, strategies specifically aimed at increasing a model’s ICL ability are being sought.
  • ICL ability distillation. ICL seems to emerge with model scale, but if these skills could be distilled into smaller models, we would save computational cost, memory, and infrastructure. Preliminary studies in this direction look promising.
  • ICL robustness. As we have seen, ICL skills are not stable: permutations and changes in the format of the demonstration impact ICL. One study shows that increasing robustness comes at the cost of decreasing accuracy, so we need work that digs deeper into how ICL works. A better theoretical understanding can help develop more robust ICL.
  • ICL efficiency and scalability. ICL requires several examples in the demonstration, and in theory more examples improve ICL. However, increasing the number of examples has a computational cost, which comes from computing attention (efficiency), and you cannot add more examples than the context window allows (scalability). As we saw in a previous article, research on extending the context window (and the strategies used to do so) has been very active, although it is unclear whether the model can then actually exploit it. Also, in some cases inverse scaling was observed, where the model, instead of following the in-context instructions, regurgitated memorized data.

Another line of research is the development of techniques that improve ICL by acting on the format of the prompt. Several interesting approaches have been proposed over time (including Chain of Thought (CoT), self-consistency CoT, Tree of Thoughts, and so on). These approaches have succeeded in improving model performance on mathematical exercises and other reasoning problems, all simply through modifications of the prompt. In this article, I have focused on the more mechanistic aspects of the model, the training data, the prompt, and how ICL emerges. In the next article, I will discuss these approaches in detail.

What do you think? Let me know in the comments.


If you have found this interesting:

You can look for my other articles, subscribe to get notified when I publish new ones, become a Medium member to access all its stories (an affiliate link of the platform, from which I get a small revenue at no cost to you), and connect with or reach me on LinkedIn.

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…


References

Here is the list of the principal references I consulted to write this article; only the first author of each paper is cited. I also suggest them if you want to explore the topic in more depth.

  1. Brown, 2020, Language Models are Few-Shot Learners, link
  2. Dong, 2022, A Survey on In-context Learning, link
  3. Zhao, 2023, A Survey of Large Language Models, link
  4. Xie, 2022, How does in-context learning work? A framework for understanding the differences from traditional supervised learning. link
  5. Wei, 2022, Emergent Abilities of Large Language Models, link
  6. Zhou, 2022, Teaching Algorithmic Reasoning via In-context Learning, link
  7. Vinita Silaparasetty, What is Prompt Engineering?, link
  8. Fareed Khan, Prompt Engineering Complete Guide, link
  9. Paul DelSignore, The Dark Side Of Prompt Engineering, link
  10. Babar M Bhatti, The Art and Science of Crafting Effective Prompts for LLMs, link
  11. Dai, 2022, Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers, link
  12. Shin, 2022, On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model, link
  13. Cameron R. Wolfe, Ph.D., Language Model Scaling Laws and GPT-3, link
  14. Xie, 2021, An Explanation of In-context Learning as Implicit Bayesian Inference, link
  15. Huszár, 2022, Implicit Bayesian Inference in Large Language Models, link
  16. Min, 2022, Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?, link
  17. Chan, 2022, Data Distributional Properties Drive Emergent In-Context Learning in Transformers, link
  18. Liu, 2023, What Makes Good In-Context Examples for GPT-3?, link
  19. Priyanka, Perplexity of Language Models, link
  20. Olsson, 2022, In-context Learning and Induction Heads, link
  21. Wies, 2023, The Learnability of In-Context Learning, link
  22. Akyürek, 2022, What learning algorithm is in-context learning? Investigations with linear models, link
  23. Oswald, 2022, Transformers learn in-context by gradient descent, link
  24. Wei, 2023, Larger language models do in-context learning differently, link
  25. Google blog, Larger language models do in-context learning differently, link
  26. Zewe, Solving a machine-learning mystery, link
  27. Kaddour, 2023, Challenges and Applications of Large Language Models, link
  28. Magister, 2022, Teaching Small Language Models to Reason, link
  29. Chen, 2022, On the Relation between Sensitivity and Accuracy in In-context Learning, link
