| ARTIFICIAL INTELLIGENCE | LLM | NLP |
LLMs have shown their skills in recent months, demonstrating that they are proficient in a wide variety of tasks. All this through one mode of interaction: prompting.
In recent months there has also been a rush to extend the context of language models. But how does a longer context affect a language model?
This article is divided into sections, each answering one of these questions:
- What is a prompt, and how do you build a good one?
- What is the context window? How long can it be? What limits the length of a model's input sequence, and why does this matter?
- How can we overcome these limitations?
- Do the models use the long context window?
How to talk to a model?
What is a prompt and what is a good prompt?
Simply put, a prompt is how one interacts with a large language model (LLM). Given an LLM, we can interact by providing instructions in text form. This textual prompt contains the information the model needs to produce a response. The prompt can contain a question, a task description, content, and much other information. Essentially, through the prompt we tell the model what our intent is and what kind of response we expect.
The prompt can drastically change the behavior of the model. For example, asking the model "describe the history of France" is different from asking it "describe the history of France in three sentences" or "describe the history of France in rap form."
To get useful information from the model, it is advisable to write a good prompt. In general, a good prompt should contain a question or a set of instructions, and it may also include context (question + context). For example, we could provide an article (the context) and ask the model who its main characters are (the question).
As a rule of thumb, there are some elements that need to be considered when writing a prompt:
- Simplicity. Since prompting is often an iterative process, it is best to start with simple questions and gradually ask for more information. Prompts also work better when complex tasks are broken down into simpler subtasks.
- Instructions about the task. Using verbs that specify the instruction helps the model better understand the task at hand. Some verbs work better for certain tasks, and it is recommended to place the instruction at the beginning of the prompt.
- Specificity. Being specific and detail-oriented in the prompt is beneficial to the execution of the task. Examples can also be provided to better explain what is desired (few-shot prompting). Since the length of the prompt is not infinite, however, you should avoid providing too many examples or too much detail (in many LLMs the prompt will be truncated beyond a certain length). A minimal sketch of how such a prompt can be assembled follows this list.
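To make this concrete, here is a minimal Python sketch of assembling a prompt with an instruction, a few-shot example, context, and a question. The helper function, instruction, and texts are purely illustrative assumptions, not the API or templates of any particular model.

```python
# A minimal sketch of building a prompt: instruction first, then few-shot
# examples, then the context and the question. All strings are illustrative.

def build_prompt(instruction, examples, context, question):
    parts = [instruction.strip(), ""]
    # Few-shot examples show the model the expected input/output format.
    for example_text, example_answer in examples:
        parts.append(f"Text: {example_text}\nAnswer: {example_answer}\n")
    # The context carries the material the model should reason over.
    parts.append(f"Text: {context}\nQuestion: {question}\nAnswer:")
    return "\n".join(parts)

prompt = build_prompt(
    instruction="Extract the main characters from the text. Answer with a comma-separated list.",
    examples=[("Dorothy followed the yellow brick road with Toto.", "Dorothy, Toto")],
    context="Alice met Bob at the station before travelling to Paris with Charlie.",
    question="Who are the main characters?",
)
print(prompt)  # this string would then be sent to the LLM of choice
```

Keeping the instruction first and the question last mirrors the advice above.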
In addition, there are other techniques that can be used to improve the prompt, such as chain-of-thought (prompting the model to work through intermediate reasoning steps), self-reflection (allowing the model to evaluate its own response), tree of thoughts, and so on.
In general, although some of these techniques are simple to apply, they do not always yield good results. There are even more sophisticated techniques, and prompt engineering is still an open field. The principle behind these techniques is to allow the model to reason about a problem or to make the best use of what it has learned during training.
In any case, all of these techniques run up against one constraint: the maximum number of tokens (subwords) that can fit inside a prompt.
How long can a prompt be and why?
How long can a prompt be: the length of the context
The prompt can grow quickly, especially if the context contains a lot of information (articles used as context, past conversations, added external information, and so on). This means that the model must operate on long input sequences.
Basically, an LLM is a transformer, and transformers do not scale well with sequence length. This comes from the fact that the transformer is built on repeated blocks of self-attention that have a quadratic cost with respect to length.
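A quick NumPy sketch makes the quadratic term visible: the attention scores form an n x n matrix, so doubling the sequence length quadruples the memory and compute for that step (the dimensions below are arbitrary choices for illustration).

```python
import numpy as np

def attention_scores(n, d=64, seed=0):
    """Naive self-attention scores for a sequence of length n: an (n, n) matrix."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    return q @ k.T / np.sqrt(d)  # this (n, n) product is the quadratic part

for n in (512, 1024, 2048):
    scores = attention_scores(n)
    print(n, scores.shape, f"{scores.nbytes / 1e6:.1f} MB just for the score matrix")
```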
Of course, there has been much previous work that has tried to reduce this cost with various strategies. Linear alternatives to self-attention have been found to be unexpressive, though.
Welcome Back 80s: Transformers Could Be Blown Away by Convolution
Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. (source)
Usually, context windows have been relatively small (512-1024 tokens). Yet in recent months we have seen models with context windows of tens of thousands of tokens.
For example, GPT-4 has a context length of 32K tokens. Aside from the fact that the number is impressive, this is not just a marketing question. The longer the context, the more information a model can relate. A greater context length also allows for greater accuracy and fluency, and is thought to stimulate the creativity of the model.
In theory, a transformer trained with a context length of 1K tokens could also generate a 100K-token sequence at inference time. But since it was trained on a different data distribution (sequences much shorter than 100K tokens), the generated result will be nonsensical.
In fact, it has been shown that asking ChatGPT for a spell check of texts longer than 1,000 words leads the model to hallucinate.
The quadratic cost of self-attention means that increasing the context length brings a corresponding increase in the cost of training. The GPU cost of training LLaMA has been estimated at $3 million; increasing the context length by 50 times (from 2K to 100K) also means increasing that cost by 50 times.
Is self-attention the only limit to enlarging context length?
No. After tokenization, the first step of the model is embedding: for a sequence of n tokens, each token is mapped to an embedding of size d. If n >> d there is a risk of information loss, and increasing n far beyond d poses significant challenges.
Also, sinusoidal positional encoding is not compatible with some of the solutions used to enlarge the context length and needs to be rethought.
In addition, training is parallelized, but inference is sequential: each new token depends on the tokens generated before it. So inference must also be optimized to enlarge the context length.
How is it possible to have a context window of tens of thousands or even hundreds of thousands of tokens?
Extending the context window of LLM
Make your context XL
Although the results of the past few months are impressive, there had already been attempts to increase the length of the context window in 2019. In fact, the Transformer-XL could generate coherent text up to thousands of tokens.
The authors borrowed an idea from recurrent neural networks, in which a hidden state is reused to carry information forward. In other words, after processing a sequence segment, the hidden states obtained there are reused while processing the next segment. Looking at the description of the model, the similarity to RNNs is evident.
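Conceptually, the recurrence looks like the following NumPy sketch: the hidden states of the previous segment are cached and concatenated, as extra memory, with the current segment when computing attention. This is a simplified illustration of the idea, not the actual Transformer-XL implementation (which also uses relative positional encodings and stops gradients through the cache).

```python
import numpy as np

def segment_attention(current, cache, d=64):
    """Attend from the current segment over [cached previous segment + current segment]."""
    context = np.concatenate([cache, current], axis=0)   # reuse old hidden states as memory
    scores = current @ context.T / np.sqrt(d)            # (seg_len, mem_len + seg_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(0)
cache = rng.standard_normal((128, 64))    # hidden states kept from the previous segment
segment = rng.standard_normal((128, 64))  # the segment currently being processed
out = segment_attention(segment, cache)
print(out.shape)  # (128, 64): each token effectively saw 256 positions of context
```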
A short training, a long inference
Although Transformer-XL was an interesting solution, other strategies have been tested in recent months. These strategies aim to solve the limitations inherent in the original transformer and also to take advantage of today's hardware advances.
One idea to reduce the cost of training is to train the model with a context length of 2K and then fine-tune it on longer sequences (e.g., 65K). Theoretically, this could work (the model learns a general representation of the language during the first training and then specializes on longer sequences later).
In reality, with the original transformer this strategy is doomed to fail, as explained in a 2021 paper. The authors call the ability to handle a longer context length at inference "extrapolation."
We define extrapolation as a model's ability to continue performing well as the number of input tokens during validation increases beyond the number of tokens on which the model was trained. We find that transformer language models (LMs) that use sinusoidal position embeddings have very weak extrapolation abilities. (source)
So for the authors, positional encoding is the culprit behind the original transformer's inability to extrapolate. Positional encoding, a step at the beginning of the model, was introduced as a clever trick to let the model account for the position of each token in the sequence.
The authors suggest replacing it with Attention with Linear Biases (ALiBi). In simple words, a penalty is added to the query-key products in the attention, and this penalty is proportional to the distance between the two tokens:
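In code, the bias looks roughly like the following NumPy sketch of the ALiBi idea; the head-specific slope is an arbitrary value here (the paper uses a geometric series of slopes across heads).

```python
import numpy as np

def alibi_bias(n, slope=0.25):
    """Linear bias added to attention scores: the penalty grows with query-key distance."""
    positions = np.arange(n)
    distance = positions[None, :] - positions[:, None]   # key index minus query index
    # In a causal model only past tokens are visible; farther tokens get a larger penalty.
    return slope * np.where(distance <= 0, distance, -np.inf)

# The bias is simply added to q @ k.T / sqrt(d) before the softmax.
print(alibi_bias(5, slope=1.0))
```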
The trick is ingenious because it does not add parameters to learn and does not increase the computational cost by much.
Do you need all these tokens?
Extending the context window to more than 100K tokens certainly attracts great interest. On the other hand, not all tokens in the sequence are actually interesting. So, is it necessary for us to calculate the relationships (attention scores) among all these tokens?
The idea, then, is to leverage sparsity when calculating the attention scores, so that we do not compute the relationships between tokens we are not interested in. As Google explained, though, this is not exactly straightforward:
Two natural questions arise: 1) Can we achieve the empirical benefits of quadratic full Transformers using sparse models with computational and memory requirements that scale linearly with the input sequence length? 2) Is it possible to show theoretically that these linear Transformers preserve the expressivity and flexibility of the quadratic full Transformers? (source)
Google tried to answer these questions starting from an intriguing observation: the attention layer can be understood as a graph. In fact, when computing attention for all positions (nodes) in a sequence, we compute pairwise similarities (edges). So visually we can think of attention as a directed graph.
Building on this concept, several alternatives to classical attention can be devised using graphs that are not fully connected.
Google took advantage of this concept by first exploiting the combination of global and local (or window) attention in this paper, and then improving on the idea with BigBird.
BigBird basically combines three concepts: global tokens that cover the whole sequence, local attention that covers the neighborhood of each token, and, for each token, a few randomly sampled tokens (a sketch of such a sparse attention mask follows).
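Here is a rough NumPy sketch of what such a sparse mask could look like (positions marked True are the only pairs for which attention is computed). The window size, number of global tokens, and number of random connections are arbitrary choices for illustration, not BigBird's actual hyperparameters.

```python
import numpy as np

def bigbird_style_mask(n, window=3, n_global=2, n_random=2, seed=0):
    """Boolean attention mask mixing global, local (window), and random connections."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # Local attention: each token sees its neighbours within the window.
        mask[i, max(0, i - window):min(n, i + window + 1)] = True
        # Random connections: each token also attends to a few random positions.
        mask[i, rng.choice(n, size=n_random, replace=False)] = True
    # Global tokens: the first n_global tokens see, and are seen by, everyone.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

mask = bigbird_style_mask(64)
print(f"{mask.mean():.0%} of the full n x n score matrix would actually be computed")
```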
BigBird succeeds in approximating full attention well enough. At the same time, it is sparse (so less computation), but it does not interrupt the flow of information (the ability of one token to influence other tokens).
The authors prove that this sparse attention is not only as expressive as the original attention but can also be used to extract contextual information from naturally long sequences, such as genomic sequences.
In any case, the sparsity concept is so powerful that many researchers are trying to implement it in other models, such as Vision Transformers.
Conditional computation for a faster transformer
Based on the idea that not all tokens are equally important, there is also another approach: not applying all of the model's weights to all tokens during training.
CoLT5 exploits this concept to increase the length of the input. Conditional computation, simply put, makes sure to allocate more resources to those tokens that are considered important.
The authors construct a system in which the attention and feed-forward computations are split into two branches (heavy and light). The light branch is applied to all tokens, while the heavy branch is applied only to the important ones, which are selected by a routing module that decides which tokens matter (a simplified sketch is shown below).
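The routing idea can be sketched as follows: a scoring function picks the top-k "important" tokens, the light branch runs on every token, and the heavy branch runs only on the selected ones. The scorer and the two branches below are simple placeholders, not the actual CoLT5 layers.

```python
import numpy as np

def conditional_layer(x, k, light, heavy, scorer):
    """Apply a cheap transformation to every token and an expensive one only to the top-k tokens."""
    scores = scorer(x)                       # one importance score per token
    top = np.argsort(scores)[-k:]            # indices routed to the heavy branch
    out = light(x)                           # light branch: all tokens
    out[top] += heavy(x[top])                # heavy branch: only the important tokens
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 64))
w_light = rng.standard_normal((64, 64)) * 0.02
w_heavy = rng.standard_normal((64, 64)) * 0.02
y = conditional_layer(
    x, k=16,
    light=lambda h: h @ w_light,
    heavy=lambda h: h @ w_heavy,
    scorer=lambda h: np.abs(h).mean(axis=-1),  # placeholder importance score
)
print(y.shape)  # (128, 64): same output shape, but most tokens skipped the heavy branch
```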
Multi-Query Attention to save computation
During inference, the keys and values for each token are cached (to avoid redoing the same operations while generating text). On the one hand this saves computation, but it increases GPU memory usage.
To mitigate this, Multi-Query Attention (MQA) suggests sharing the weights of the linear projections of keys and values across all attention heads.
This is particularly advantageous when dealing with long sequences, decreasing not only memory usage but also generation time. Google demonstrated the advantage of MQA in PaLM. A sketch of the difference in cache size is shown below.
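This minimal NumPy sketch compares the size of the per-layer key cache under standard multi-head attention (one key projection per head) and under MQA (a single shared key projection); the shapes are illustrative, not those of any real model.

```python
import numpy as np

n, d, heads, d_head = 1024, 512, 8, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))

# Standard multi-head attention: one key projection per head, so one cache per head.
k_mha = np.stack([x @ rng.standard_normal((d, d_head)) for _ in range(heads)])  # (heads, n, d_head)

# Multi-Query Attention: a single key projection shared by all heads (same idea for values).
k_mqa = x @ rng.standard_normal((d, d_head))                                    # (n, d_head)

print("Key cache per layer:")
print(f"  MHA: {k_mha.nbytes / 1e6:.2f} MB")
print(f"  MQA: {k_mqa.nbytes / 1e6:.2f} MB  ({heads}x smaller)")
```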
Flash attention, the light in the new LLMs
The models and ideas seen so far seek to modify attention in ways that reduce its cost. FlashAttention takes a different approach and is used by virtually all models today.
At its base is a better use of the GPU, which has its own memory hierarchy. When the GPU performs an operation, the data must be present in fast on-chip memory (SRAM). The data is copied there from HBM memory, and once the calculation is finished, the output is copied back to HBM.
SRAM is not only much faster than HBM but also much smaller. Over time, computation has become faster and faster, and HBM has become the bottleneck.
During attention, several operations are performed (multiplication of queries and keys, softmax, multiplication of the result with the values). These operations generate intermediate matrices that are copied back and forth between HBM and SRAM, and this data movement is what slows everything down.
SRAM has limited capacity, so the FlashAttention solution is to divide the matrices (queries, keys, values) into blocks, perform all the operations within a single GPU kernel, and only then write the result to HBM. The softmax also becomes cheaper, since it is computed incrementally block by block rather than over the whole N x N matrix at once (a simplified sketch of this blockwise computation follows).
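Here is a simplified, single-head NumPy sketch of the blockwise idea: exact attention computed tile by tile with an online (running) softmax, never materialising the full n x n matrix. The real FlashAttention kernel additionally fuses these steps on-chip in SRAM and handles the backward pass; the block size here is an arbitrary choice.

```python
import numpy as np

def blockwise_attention(q, k, v, block=128):
    """Exact attention computed block by block, using an online (running) softmax."""
    n, d = q.shape
    out = np.zeros_like(q)
    for i in range(0, n, block):
        qi = q[i:i + block]
        m = np.full(qi.shape[0], -np.inf)           # running max of the scores
        l = np.zeros(qi.shape[0])                   # running softmax denominator
        acc = np.zeros_like(qi)                     # running weighted sum of the values
        for j in range(0, n, block):
            s = qi @ k[j:j + block].T / np.sqrt(d)  # scores for this tile only
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)               # rescale previously accumulated results
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ v[j:j + block]
            m = m_new
        out[i:i + block] = acc / l[:, None]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
scores = q @ k.T / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
reference = (weights / weights.sum(axis=-1, keepdims=True)) @ v
print(np.allclose(blockwise_attention(q, k, v), reference))  # True: same result, block by block
```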
Not only does META's LLaMA use FlashAttention; today virtually all models do.
Recent developments
Recent developments in GPUs have also enabled the increase in context length; for example, there are now GPUs with 80 GB of memory.
Also, in addition to the technical refinements we have seen above, there is now a better understanding of why the transformer lacks the ability to extrapolate. For example, in this paper the authors show that classical attention drifts toward later positions in the sequence (a behavior they attribute to sinusoidal positional encoding). For this reason, we have seen positional encoding replaced in ALiBi (others have proposed replacing it with a randomized version, random positional encoding).
Other authors point out that techniques such as chain-of-thought help the model extrapolate, since it must focus on the intermediate steps of the reasoning. Also, a few-shot examples can improve the model's ability to extrapolate better than fine-tuning can (in any case, it is not a silver bullet). Fine-tuning with the right tricks can still bring very good results: in the case of LLaMA 7B, one study introduced window attention and managed to increase the context window from 2K to 32K.
If Claude's 100K context length previously seemed unbelievable, META's Megabyte now claims 1M tokens (the trick: "Megabyte segments sequences into patches and uses a local submodel within patches and a global model between patches").
A paper published a short while ago even claims a context of one billion tokens. All of this shows that there is still a lot of active research, and that many groups are working on ways to extend the context length.
Given all this research and these alternative solutions, a question arises: how does the model actually use this long context? Can it make the best use of it?
Lost in the Middle: How Language Models Use Long Contexts
The latest advances in LLMs have allowed the context window to be extended; this opens up the question of whether the models have actually benefited. An article published this month investigated this issue.
The authors of this study took advantage of the fact that proprietary models such as Claude or GPT-4 are no longer the only ones with a long context window: MPT-30B and LongChat-13B have context windows of 8K and 16K tokens, respectively. The authors therefore decided to use these two open models as well as closed ones (GPT-3.5 and Claude).
Having chosen the models, one must also choose tasks where a long context window is necessary. For example, multi-document question answering requires the model to reason over a set of documents to find the information needed for the answer. This is an important task since it mimics searching a document corpus such as the Internet (we can imagine an AI search engine that has to search multiple websites for the answer).
For a question x, there is a set of documents, and only one of them contains the information needed to answer the question.
The authors used datasets of annotated questions (from Google searches). They then added Wikipedia chunks that were related to the topic but did not contain the answer (taking care that the right document was not always in the same location, so that the LLM could not learn a positional heuristic). A sketch of how such a prompt can be assembled is shown below.
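This Python sketch illustrates the setup: one relevant document placed at a chosen position among distractor documents, all concatenated into a single prompt. The question, documents, and formatting are made up for illustration and are not the exact templates from the paper.

```python
# Illustrative sketch of the multi-document QA setup: the document containing
# the answer is placed at a chosen position among irrelevant distractors.

def build_multidoc_prompt(question, relevant_doc, distractors, position):
    docs = distractors[:position] + [relevant_doc] + distractors[position:]
    numbered = "\n\n".join(f"Document [{i + 1}]: {doc}" for i, doc in enumerate(docs))
    return (
        "Answer the question using only the documents below.\n\n"
        f"{numbered}\n\nQuestion: {question}\nAnswer:"
    )

distractors = [f"(Wikipedia chunk on a related but irrelevant topic #{i})" for i in range(9)]
relevant = "The Eiffel Tower was completed in 1889 for the Exposition Universelle."
for position in (0, 4, 9):  # relevant document at the beginning, middle, or end of the context
    prompt = build_multidoc_prompt("When was the Eiffel Tower completed?", relevant, distractors, position)
    print(f"relevant document placed at position {position + 1} of {len(distractors) + 1}, "
          f"prompt length {len(prompt)} characters")
```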
The authors note three particularly interesting results:
- They observe a U-shaped curve that depends on the location of the relevant document. In other words, model performance degrades when the model must access information in the middle of the context window; models are much better at identifying relevant information at the beginning or at the end of the input sequence.
- The performance is inferior to the closed-book setting. When the relevant document is in the center of the input context, the model performs worse than when no document is provided. In a closed-book setting, the model has to rely only on the memory of its parameters.
- In general, performance degrades if more documents are provided to the model.
In addition, the authors note that models with a longer context window are not, per se, superior to their counterparts with a shorter one.
Since the model struggles to use information that lies in the middle of the context window, the authors wondered whether it is at least capable of retrieving information at all. In other words, given a simple file consisting of key-value pairs (a JSON object), can the model retrieve a requested value?
The authors decided to use the simplest possible task to study the model's behavior in depth: the model has to find a piece of information without needing any complex skill. Using synthetic data, the authors created a JSON object of key-value pairs, only one of which is of interest.
Our synthetic key-value retrieval task is designed to provide a minimal testbed for the basic ability to retrieve matching tokens from an input context. […] we explicitly seek to distill and simplify the task by removing as much natural language semantics as possible (using random UUIDs instead), since language features may present potential confounders (source)
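Roughly, such a synthetic example could be generated as in the following sketch, consistent with the description above (random UUID keys and values, one of which must be retrieved); the exact prompt wording used in the paper may differ.

```python
import json
import random
import uuid

def make_kv_retrieval_example(n_pairs, seed=0):
    """Build a JSON of random UUID key-value pairs plus a query for one of the keys."""
    rng = random.Random(seed)
    pairs = {str(uuid.UUID(int=rng.getrandbits(128))): str(uuid.UUID(int=rng.getrandbits(128)))
             for _ in range(n_pairs)}
    query_key = rng.choice(list(pairs))
    prompt = (
        "Extract the value corresponding to the specified key from the JSON object below.\n\n"
        f"{json.dumps(pairs, indent=1)}\n\n"
        f'Key: "{query_key}"\nCorresponding value:'
    )
    return prompt, pairs[query_key]

prompt, gold_value = make_kv_retrieval_example(n_pairs=140)
print(len(prompt), "characters of context; expected answer:", gold_value)
```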
The results show that not all models are up to it: Claude succeeds at the task, but other models show performance degradation when there are 140 or more key-value pairs.
In addition, the authors observe a curious fact:
LongChat-13B (16K) in the 140 key-value setting is a notable outlier; when the relevant information is at the start of the input context, it tends to generate code to retrieve the key, rather than outputting the value itself. (source)
Why can’t LLMs make the best use of a long context window?
The authors of this study wondered whether this was related to the architecture. Today, two main types of architecture are used: decoder-only and encoder-decoder language models. Although both have been used in a great many articles, there are still open questions about the differences in their behavior.
Therefore, the authors additionally used two other models: Flan-T5-XXL and Flan-UL2. These two models prove to be more robust when the relevant information is in the middle of the context window.
This is interesting because, being bidirectional, an encoder-decoder model could be more robust at processing information in a longer context window, and therefore more effective when it has to process multiple documents.
Is a long context window useful?
If the model is not capable of fully exploiting it anyway, is there really a benefit in having such a long context window? After all, a longer context window comes at a cost: the model has to process all the input information. In other words, if the model can take 100K tokens, does it really make sense to feed it 100K tokens of information?
The authors decided to test this using a retrieval system that takes an input query (a question from a dataset of questions) and finds k documents from Wikipedia. They then added them to the prompt and tested how the models behaved with these added documents.
Using more than 20 retrieved documents only marginally improves reader performance (∼1.5% for GPT-3.5-Turbo and ∼1% for Claude), while significantly increasing the input context length (and thus latency and cost). (source)
In other words, the model saturates. As mentioned before, this confirms that the model is more efficient at using information at the beginning of the context window.
Closing thoughts
The prompt is how we interact with a model. The more precise and detailed it is, the better the model’s response.
There is a limit to the amount of information we can put into a prompt, though. This limit is the context window and it comes from numerous factors as we saw earlier. Ever since the first transformer was published, attempts have been made to enlarge this context window by exploiting numerous solutions.
Despite this, we do not know how well models can use this context window. Studies today show that models do not make the most of it.
Since the scaling laws were published there has been a race for parameters, with ever larger models looking for elusive emergent properties. Today we know that all those parameters are not necessary, and that GPT-4 itself is reportedly not as large as thought, but rather an ensemble of eight models. The context window seems to be another frontier, where people try to reach a larger number not for real utility but to show the superiority of their model.
Despite the results and the myriad of published models, there are still points that need to be studied. How the model uses this longer context window is one such point that needs more analysis, because despite the elegance of the technical solutions, sometimes the cost of a longer context window is not justified.
What do you think? Let me know in the comments.
If you have found this interesting:
You can look for my other articles, you can also subscribe to get notified when I publish articles, you can become a Medium member to access all its stories (affiliate links of the platform for which I get small revenues without cost to you) and you can also connect or reach me on LinkedIn.
Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, Artificial Intelligence, and more.
GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…
or you may be interested in one of my recent articles:
CodeGen2: a new open-source model for coding
ClinicalGPT: the LLM clinician
References
Here is the list of the principal references I consulted to write this article; only the first author of each paper is cited. I also suggest them if you want to explore the topic further.
- Prompt Engineering Guide, link
- Wang, 2023, Prompt Engineering for Healthcare: Methodologies and Applications, link
- White, 2023, A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT, link
- Liu, 2021, Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing, link
- Liu, 2022, Generated Knowledge Prompting for Commonsense Reasoning, link
- Sascha Heyer, Generative AI – Best Practices for LLM Prompt Engineering, link
- Jacob Ferus, GPT-4 Has Arrived – Here’s What You Need To Know, link
- Simon Attard, Giving Large Language Models Context, blog post, link
- Rickard, Commoditization of Large Language Models: Part 3
- Vaswani, 2017, Attention Is All You Need, link
- Press, 2021, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, link
- Zaheer, 2021, Big Bird: Transformers for Longer Sequences, link
- Ainslie, 2020, ETC: Encoding Long and Structured Inputs in Transformers, link
- Google blog, Constructing Transformers For Longer Sequences with Sparse Attention Methods, 2021, link
- Shazeer, 2019, Fast Transformer Decoding: One Write-Head is All You Need, link
- Pope, 2022, Efficiently Scaling Transformer Inference, link
- Ahmed Taha, FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, link, medium post
- Angelina Yang, What is Global Attention in Deep Learning? blog post, link
- Dao, 2022, FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, link
- A deep dive on FlashAttention here
- Medium post on LLaMA, META’s LLaMA: A small language model beating giants
- Ainslie, 2023, CoLT5: Faster Long-Range Transformers with Conditional Computation, link
- Yu, 2023, MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers, link
- A blog post about recent development in extending the context length, link
- Andrew Lukyanenko, Paper Review: Scaling Transformer to 1M tokens and beyond with RMT, blog post, link
- Liu, 2023, Lost in the Middle: How Language Models Use Long Contexts, link