
Towards infinite LLM context windows

From 512 to 1M+ tokens in 5 years – LLMs have rapidly expanded their context windows. Where's the limit?

It all started with GPT having an input context window of 512 tokens. After only 5 years, the newest LLMs are capable of handling inputs of 1M+ tokens. Where’s the limit?

I like to think of an LLM (specifically, of the model’s parameters, i.e., the weights of its neural network layers and attention mechanisms) as a vast memory bank filled with a wide range of information about the world, mixed with linguistic knowledge on how to process text. It’s the static knowledge the model once learned during pre-training and (optionally) fine-tuning; a common core reused by all conversations with the model, no matter who asks the questions. The input context, on the other hand, is where the magic happens. It’s like passing a note to the model with special instructions or information. This can be anything from a question we want answered, to unique information we have, or even a history of past chats we’ve had with it. It’s a way of giving the model a personal memory.

Evolution of Context Size

One of the key characteristics of an LLM is its input context size. It can be defined as the number of words (or tokens, to be precise) that a language model takes into consideration when generating a response. The longer the context, the more "personal" (or "contextual", thus the name) information can be passed to the model. This is why researchers work hard on new techniques that enable models with longer and longer context windows.

The first GPT, announced by OpenAI in June 2018, had a context size of merely 512 tokens. Its successor, GPT-2, announced in February 2019, doubled the context size to 1,024 tokens. Then came the first pop-star LLM – GPT-3 (announced in May 2020, but made broadly available in November 2021) – which doubled it again, to 2,048 tokens. It was followed by GPT-3.5 with a context of up to 16K tokens. Successive models have pushed these boundaries even further.

The chart below shows the evolution of consecutive generations of OpenAI GPT models over time. Note the logarithmic vertical scale for context size. Until 2022, OpenAI was the only game in town. Yet, they challenged themselves by training new models with longer and longer context windows. Recently, they have been caught up with, or even outpaced, by the competition. Stop here for a moment before you go further: can you name the recent models that achieved that? They are marked as gray points in the chart below.

Since the beginning of 2023, a boom of diverse models from various vendors has started. The models were offered with different context sizes, yet the largest vendors competed by offering LLMs with ever-growing context lengths. Mixtral 8x7B by Mistral AI came with a 32K context size. MPT-7B-StoryWriter-65k+ by MosaicML came with a context window of over 65K tokens. The newest models, like Google’s Gemini 1.5 Pro and Anthropic’s Claude 3, leap even further, offering a staggering 128K and 200K tokens, respectively (both with the possibility to extend to even 1M tokens). Even Microsoft has recently published the Phi-3 model with a 128K context window, while all its predecessors (Phi-1, Phi-1.5, Phi-2) had only a 2K window.

This progression illustrates not only the technical advancements but also a fundamental shift in how we envision interactions with AI – aiming to provide more and more personalized information without losing coherence or context.

Let’s take a closer look at the explosion of new LLMs that started in 2023.

What catches the eye is the huge variety of models from various vendors with varying context sizes. The models also differ significantly in other critical characteristics, like their size (i.e., the number of their parameters – the size of the "static" part of the model). So, one may wonder what drives this diversity. It’s important to note that the context size usually goes hand in hand with the size of the model itself (expressed as the number of its parameters).

This variability in the sizing of LLMs underscores an important shift towards specialized applications rather than a one-size-fits-all approach. Smaller LLMs (called Small Language Models – SLMs) are significantly less costly to train and host. It’s also more cost-effective to fine-tune them and turn them into specialized SLMs, which, for specific domain-related tasks, might perform as well as LLMs at a small fraction of their cost.

On the other hand, there are flagship LLMs of very large sizes, which are extremely expensive to train, host, and fine-tune. Yet, they are capable of handling a variety of tasks with decent quality (without the need for fine-tuning) and offer much higher reasoning capabilities.

Feeling the Scale

So, context window sizes of 100K+ tokens are considered large, but what exactly does that mean? To get a sense of how much data can be passed within a prompt of that size, note that 100K tokens is roughly the equivalent of (a quick back-of-envelope check follows the list):

  • ~165 A4 pages of a text in English (assuming 12pt font size, 1.5 line spacing, 400–500 words per page, ~3/4 word per token)
  • 10K lines of code (LOC) in Python (assuming ~10 tokens per line of code; based on my own analysis)
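
The figures above follow from simple arithmetic. Here is a quick sanity check in Python, using the same rough per-page and per-line ratios assumed in the list:

```python
# Back-of-envelope check of the figures above (the ratios are rough assumptions)
tokens = 100_000
words = tokens * 0.75         # ~3/4 of a word per token
pages = words / 450           # 400–500 words per A4 page -> ~167 pages
python_loc = tokens / 10      # ~10 tokens per line of code -> 10,000 LOC
print(f"~{pages:.0f} A4 pages of English text or ~{python_loc:.0f} lines of Python")
```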

If you want to test how many tokens are required to represent a given text, you can use this free tool from OpenAI: OpenAI Tokenizer. Keep in mind that different LLMs use different tokenizers, so the results may vary.
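
For a programmatic check, you can do the same with OpenAI’s open-source tiktoken library (a minimal sketch, assuming the package is installed; cl100k_base is the encoding used by GPT-4-era OpenAI models):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era OpenAI models

text = "LLM context windows have grown from 512 to over 1M tokens in five years."
tokens = enc.encode(text)

print(len(tokens))          # number of tokens needed to represent the text
print(enc.decode(tokens))   # round-trip back to the original string
```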

Cost of the Scale

Another aspect of the scale is the cost of using such large contexts. As of April 2024, passing 100K input tokens (via an API priced on a pay-per-token basis) costs:

  • 0.025 USD for Claude 3 Haiku (0.25 USD/1M tokens)
  • 0.05 USD for Command-R (0.50 USD/1M tokens)
  • 0.30 USD for Claude 3 Sonnet (3 USD/1M tokens)
  • 0.70 USD for Gemini 1.5 Pro (7 USD/1M tokens)
  • 1.00 USD for GPT-4 Turbo (10 USD/1M tokens)
  • 1.50 USD for Claude 3 Opus (15 USD/1M tokens)

Note: The above prices are for input tokens. Output tokens are 3x to 5x more expensive (depending on the model).
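
The per-prompt figures above are simple proportions of the list prices. A small sketch of that arithmetic (prices copied from the list; input tokens only):

```python
# Input-token cost of a 100K-token prompt (April 2024 list prices, USD per 1M input tokens)
PRICE_PER_1M_INPUT = {
    "Claude 3 Haiku": 0.25,
    "Command-R": 0.50,
    "Claude 3 Sonnet": 3.00,
    "Gemini 1.5 Pro": 7.00,
    "GPT-4 Turbo": 10.00,
    "Claude 3 Opus": 15.00,
}

def input_cost(model: str, prompt_tokens: int) -> float:
    """Cost in USD of sending `prompt_tokens` input tokens to `model`."""
    return prompt_tokens / 1_000_000 * PRICE_PER_1M_INPUT[model]

for model in PRICE_PER_1M_INPUT:
    print(f"{model}: {input_cost(model, 100_000):.3f} USD per 100K-token prompt")
```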

It’s important to be aware that LLMs are stateless, which means they don’t remember what was passed to them (within the input context) in previous conversations. So all the important information has to be included in the input context every single time.

Imagine a chatbot. It is expected to remember what it has been told in previous messages within a conversation, so with each new message, the history of the conversation so far has to be passed to it. As we pile up all past messages (and model responses) in the input context, we increase the cost of processing each consecutive question. A long conversation with a bot (based on models like GPT-4 Turbo or Claude 3 Opus) that keeps unconstrained chat history in the input context may eventually cost as much as 1 USD per message. That’s way too much.
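
To see how quickly this adds up, here is a toy simulation of a chat where the full history is resent on every turn (the ~500 tokens per turn and the GPT-4 Turbo input price are illustrative assumptions):

```python
# Toy simulation: the whole chat history is resent as input on every turn,
# so the prompt grows linearly and the cumulative cost grows quadratically.
TOKENS_PER_TURN = 500                    # assumed size of one user message + one reply
PRICE_PER_INPUT_TOKEN = 10 / 1_000_000   # GPT-4 Turbo input price, USD per token

history_tokens = 0
total_cost = 0.0
for turn in range(1, 101):
    history_tokens += TOKENS_PER_TURN
    turn_cost = history_tokens * PRICE_PER_INPUT_TOKEN
    total_cost += turn_cost
    if turn % 25 == 0:
        print(f"turn {turn}: prompt ≈ {history_tokens} tokens, "
              f"this turn {turn_cost:.2f} USD, conversation so far {total_cost:.2f} USD")
```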

So, while having these very long context windows sounds appealing and might be extremely useful in some specific scenarios, one has to carefully consider the cost attached to it.

Community-driven Context Extensions

While authors of foundational models compete to provide new LLMs with increasingly longer context windows, there is an interesting phenomenon within the research and open-source community. Having access to open foundational models (like the Llama family), they apply various techniques to create variations of the base models with extended input context windows.

A nice example is the recent extension of the Llama 3 model by Matt Shumer from the native 8K context window to 16K tokens: mattshumer/Llama-3-8B-16K. Or a further extension to a 24K context window (fine-tuned for role playing): openlynn/Llama-3-Soliloquy-8B. Or even more – to 64K: winglian/Llama-3-8b-64k-PoSE.

One mainstream technique enabling context extension is RoPE (Rotary Position Embeddings; see [2104.09864] RoFormer: Enhanced Transformer with Rotary Position Embedding (arxiv.org) for details). RoPE encodes token positions by rotating each token’s embedding (more precisely, its query and key vectors) by an angle that depends on its position in the sequence. This rotation gives the model a consistent, relative notion of token positions even as the context window expands, allowing it to follow the order of tokens over long stretches of text. There are also novel techniques based on (or compatible with) RoPE, like PoSE and Self-Extend, discussed below.
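
To make the rotation idea more tangible, here is a minimal NumPy sketch of the core RoPE operation (a simplified illustration of the idea from the RoFormer paper, not any particular library’s implementation):

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate token vectors of shape (seq_len, dim) by position-dependent angles.
    Minimal sketch: dim must be even; in a real model this is applied to the
    query and key vectors inside each attention layer."""
    seq_len, dim = x.shape
    half = dim // 2
    theta = base ** (-2.0 * np.arange(half) / dim)          # one frequency per dimension pair
    angles = np.arange(seq_len)[:, None] * theta[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin                # 2D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Toy usage: 8 token vectors of dimension 16
q_rotated = rope(np.random.randn(8, 16))
print(q_rotated.shape)  # (8, 16)
```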

Towards Infinite Context Window

There are different approaches to extending the context window. Some authors claim the capability of unlimited extension. For example, as stated by the authors of PoSE:

[…] our method can potentially support infinite length, limited only by memory usage in inference.

Others, like the authors of [2401.01325] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning (arxiv.org), claim that:

[…] LLMs should have inherent capabilities to handle long contexts.

I’m not going to dive into the details of these methods; I just want to highlight that new records for very long context windows are about to be set. So where’s the limit? And does it make sense to create LLMs with such enormous context windows?

For sure, there are use cases that benefit from such large context windows. Just imagine how handy it may be to paste the entire code base of some troublesome and complex library (with all its dependencies) and let the LLM analyze it in one shot. Or to let the LLM answer questions about an extremely lengthy contract with all its appendices and related documents, while having it all in the context. For such ad-hoc operations, LLMs with vast context windows are perfect.

Now imagine an LLM that can take an unlimited prompt at its input and, in addition, remembers all previous prompts and generated answers. Well… this is already feasible. You can create a wrapper for any LLM with a finite context window that internally stores all processed prompts and their answers, and for each new input prompt extends it with relevant information retrieved from that internal storage.
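
A minimal sketch of such a wrapper, with a hypothetical call_llm placeholder standing in for any LLM API and a naive keyword-overlap retriever standing in for a real vector store:

```python
def call_llm(prompt: str) -> str:
    # Placeholder for any LLM API with a finite context window
    raise NotImplementedError("plug in your favourite LLM client here")

class MemoryWrappedLLM:
    """Stores every prompt and answer, and prepends the most relevant
    stored entries to each new prompt."""

    def __init__(self) -> None:
        self.memory: list[str] = []   # all past prompts and answers

    def _retrieve(self, query: str, k: int = 3) -> list[str]:
        # Rank stored entries by crude word overlap with the new prompt
        q_words = set(query.lower().split())
        ranked = sorted(self.memory,
                        key=lambda entry: len(q_words & set(entry.lower().split())),
                        reverse=True)
        return ranked[:k]

    def ask(self, prompt: str) -> str:
        context = "\n".join(self._retrieve(prompt))
        answer = call_llm(f"Relevant past information:\n{context}\n\nQuestion: {prompt}")
        self.memory.extend([prompt, answer])  # remember this exchange for later
        return answer
```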

Rings a bell? It should! This is exactly what RAG is. If you want to learn more about this technique, I recommend reading [2312.10997] Retrieval-Augmented Generation for Large Language Models: A Survey (arxiv.org), which explains what RAG is, reviews its building blocks in detail, covers the latest technologies and methodologies in this area, and introduces evaluation frameworks and benchmarking processes.

Naïve RAG explainer from "Retrieval-Augmented Generation for Large Language Models: A Survey" | Source: RAG-Survey/images/Survey/RAG_case.png at main

Final Thoughts

When you try to fit larger amounts of information into the input context of an LLM, reflect on the rapid journey we have had from GPT-3’s 2,048-token context window to the expansive horizons of 1M+ token contexts in models like Claude 3 and Gemini 1.5 Pro. What’s the next boundary to be crossed?

And before you get too excited about the novel possibilities opened by the extremely long context windows, keep in mind the cost related to using them. Every token counts!

So in the end, maybe a solid RAG is a better option, especially in heavy-duty enterprise/commercial use cases, where cost is a significant factor.


🖐 🤓👉 Fun Fact

Ever thought about how our memory works like an LLM’s context window? To me, it resembles humans’ limited short-term memory capacity. Just like LLMs can manage a specific number of tokens within their context, human short-term memory can typically hold about 7 pieces of information at once. This magic number 7 isn’t an exact count of separate items; it’s more about "chunks" of information.

Take phone numbers, for example. You don’t remember each digit individually; you chunk them into groups, making it easier to recall. Another analogy is with a grocery list: you remember the items in the list (or maybe even groups of items, if they are strictly related), not specific words in their names or descriptions. This is why you might see people breaking down complex information into manageable bits to remember it better (which, on the other hand, is a very good analogy for the tokenization process). It’s a trick we use without even thinking about it.

So, the next time you’re trying to remember a grocery list or someone’s phone number, think of it as your brain’s context window at work, just like an LLM. It’s all about chunking and maximizing what can be held in a compact space.

Oh… and if you wonder why this magic number is 7, refer to The Magical Number Seven, Plus or Minus Two – Wikipedia.

