
Inference is one of the main drivers of the monetary and time costs of using large language models, and the problem grows considerably for longer inputs. Below, you can see the relationship between model performance and inference speed.
![Performance score vs inference throughput [1]](https://towardsdatascience.com/wp-content/uploads/2024/01/1ry6MM0kUsofErl_iD16oXw.png)
Fast models, which generate more tokens per second, tend to score lower in the Open LLM Leaderboard. Scaling up the model size enables better performance but comes at the cost of lower inference throughput. This makes it difficult to deploy them in real-life applications [1].
Enhancing LLMs’ speed and reducing resource requirements would allow them to be more widely used by individuals or small organizations.
Various solutions have been proposed for increasing LLM efficiency; some focus on the model architecture or the serving system. However, proprietary models like ChatGPT or Claude can be accessed only through APIs, so we cannot change their inner workings.
We will discuss a simple and inexpensive method that relies only on changing the input given to the model – prompt compression.
First, let’s clarify how LLMs understand language. The first step in making sense of natural language text is to split it into pieces. This process is called tokenization. A token can be an entire word, a syllable, or a character sequence that appears frequently in everyday language.

As a rule of thumb, the number of tokens is 33% higher than the number of words. So, 1000 words correspond to approximately 1333 tokens.
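You can check this yourself with OpenAI’s tiktoken library, which exposes the tokenizer used by the GPT models. A minimal sketch (the exact counts depend on the text you pass in):

import tiktoken

# Load the tokenizer used by gpt-3.5-turbo
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "Prompt compression shortens the original prompt while keeping the most important information."
tokens = encoding.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
# For English prose, the ratio usually lands around 1.3 tokens per word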
Let’s look specifically at the OpenAI pricing for the gpt-3.5-turbo model, as it’s the model we will use down the line.
![OpenAI pricing [3]](https://towardsdatascience.com/wp-content/uploads/2024/01/1V90w4PQY6MB6h4dGOE472w.png)
We can see that the inference process has a cost for both the input tokens, corresponding to the prompt sent to the model, and output tokens, the text generated by the model.
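Since input and output tokens are billed separately, a quick back-of-the-envelope estimate of a single call’s cost looks like the sketch below. The per-1K-token prices are placeholders, not official figures; substitute the current values from the pricing page [3].

# Placeholder prices per 1K tokens; replace with the values from the pricing page [3]
INPUT_PRICE_PER_1K = 0.001   # $ per 1K prompt (input) tokens
OUTPUT_PRICE_PER_1K = 0.002  # $ per 1K completion (output) tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate for a single API call."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# e.g. a 2,362-token prompt and a short 50-token answer
print(f"${estimate_cost(2362, 50):.4f}")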
One of the applications in which input tokens consume the most resources is Retrieval-Augmented Generation (RAG), where the input can easily reach thousands of tokens. In RAG, the user query is sent to a vector database, where the most similar pieces of information are retrieved and sent to the LLM along with the query. The vector database can be filled with personal documents that the model didn’t see during its initial training.

The number of tokens sent to the LLM can be significant depending on how many chunks of text are retrieved from the database.
Prompt Compression
![Illustration of prompt compression [1]](https://towardsdatascience.com/wp-content/uploads/2024/01/1iPsv67TX93GTVSIrVkXphw.png)
Prompt compression shortens the original prompt while keeping the most important information. It also reduces the amount of input the language model has to process, helping it generate fast and accurate answers.
This technique uses the fact that language often includes unnecessary repetition. Research shows English has a lot of redundancy – around 75% – in texts that are paragraph or chapter length [2]. This means most of the words can be predicted from the words before them in the context.
AutoCompressors
The first compression method we’ll talk about is AutoCompressors. It works by summarizing long text into short vector representations called summary vectors. These compressed summary vectors then act as soft prompts for the model [4].
![AutoCompressors flow [4]](https://towardsdatascience.com/wp-content/uploads/2024/01/1MQp_M4DqZauPWgswFI4h1Q.png)
During soft prompting, the pre-trained model is kept frozen, and a small number of trainable tokens are added at the beginning of the input text for each specific task. These tokens are not fixed but are instead learned through training. They are optimized end-to-end in the context of the entire model to best suit the specific task.
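To make the idea concrete, here is a minimal PyTorch sketch of soft prompting, using GPT-2 as a small stand-in model (an illustration of the general technique, not the AutoCompressors code): the pre-trained weights are frozen and only a small matrix of prepended embeddings would be trained.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Freeze the pre-trained model
for param in model.parameters():
    param.requires_grad = False

# A handful of trainable "soft" token embeddings prepended to every input
embed_layer = model.get_input_embeddings()
num_soft_tokens = 10
soft_prompt = torch.nn.Parameter(torch.randn(1, num_soft_tokens, embed_layer.embedding_dim) * 0.02)

input_ids = tokenizer("Nicolas Cage is an American actor.", return_tensors="pt").input_ids
token_embeds = embed_layer(input_ids)                          # (1, seq_len, hidden)
inputs_embeds = torch.cat([soft_prompt, token_embeds], dim=1)  # prepend the soft prompt

outputs = model(inputs_embeds=inputs_embeds)
# During training, only `soft_prompt` receives gradient updates; the base model stays frozen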
![RAG with AutoCompressor [4]](https://towardsdatascience.com/wp-content/uploads/2024/01/1EPrR2rN9bL9XO5x29iaLUw.png)
For RAG, the indexed documents can be pre-processed in order to be transformed into summary vectors. During the retrieval phase, the retrieved chunks are fused and sent to the LLM: their summary vectors are concatenated end-to-end to form a single, longer sequence of soft-prompt vectors.
In order to create these summary vectors, you can choose to train a compressor yourself, or you can use a pre-trained one. Below is an example of how to use the API with a pre-trained compressor taken from the GitHub page of the paper [5].
![Example use of the API with a pre-trained AutoCompressor model [5]](https://towardsdatascience.com/wp-content/uploads/2024/01/1AlVS9hw_ecpYEB2tWs0vVQ.png)
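In text form, the example looks roughly like the sketch below, adapted from the repository’s README [5]. The auto_compressor module and the output_softprompt / softprompt arguments are taken from that README and may change between versions, so treat this as a sketch rather than a guaranteed API.

import torch
from transformers import AutoTokenizer
# Provided by the AutoCompressors repository [5]
from auto_compressor import LlamaAutoCompressorModel

checkpoint = "princeton-nlp/AutoCompressor-Llama-2-7b-6k"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LlamaAutoCompressorModel.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).eval().cuda()

# Compress a long context into summary vectors (a soft prompt)
context = "Nicolas Cage attended Beverly Hills High School ..."  # long document text
context_tokens = tokenizer(context, add_special_tokens=False, return_tensors="pt").input_ids.cuda()
summary_vectors = model(context_tokens, output_softprompt=True).softprompt

# Generate an answer conditioned on the compressed context
prompt = "Where did Nicolas Cage go to school? Answer:"
prompt_tokens = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids.cuda()
output = model.generate(prompt_tokens, softprompt=summary_vectors, do_sample=False, max_new_tokens=16)
print(tokenizer.decode(output[0]))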
AutoCompressor-Llama-2-7b-6k is a fine-tuned version of the Llama-2-7B model. It was fine-tuned on a single NVIDIA A100 80GB GPU. The training data consisted of 15 billion tokens from RedPajama, split into sequences of 6,144 tokens each. The Llama-2 model itself stayed frozen during training; only the summary token embeddings and attention weights were optimized, using LoRA.
Selective Context
![Example of context filtering based on the self-information score [6]](https://towardsdatascience.com/wp-content/uploads/2024/01/1N1H4E40ktukLFAg-W2BX8A.png)
In information theory, entropy measures the unpredictability or uncertainty of a piece of information. In the context of language models, it represents the level of uncertainty in predicting the next token in a sequence. A higher entropy indicates greater unpredictability.
When an LLM predicts tokens with high certainty, those tokens contribute less to the model’s overall entropy. This motivated the introduction of a method for prompt compression based on removing predictable tokens from the data.
The idea is that if tokens with low perplexity are removed, it has a minor impact on the LLM’s understanding of the context because these tokens don’t add much new information to begin with. Tokens with high perplexity are said to have high self-information values.
In order to compress the prompt, a base language model such as Llama or GPT-2 assigns a self-information value to each lexical unit (essentially, how surprised it is to see it). A lexical unit can be a phrase, a sentence, or a token, depending on our choice. The base model then ranks the units in descending order of self-information and keeps only those in the top p-th percentile, where p is a value we can set. The authors chose a percentile-based threshold instead of an absolute one because it is more flexible.
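The core computation is easy to sketch at the token level with a small causal LM such as GPT-2: score each token by its negative log-probability (its self-information) and keep only the most surprising ones. This is a simplified illustration of the idea, not the selective_context implementation used later in the article.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "Nicolas Cage is an American actor. He was born in Long Beach, California."
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits

# Self-information of token i given the tokens before it: -log p(x_i | x_<i)
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = input_ids[:, 1:]
self_info = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]

# Keep only the top 50% most surprising tokens (a percentile-based threshold)
threshold = self_info.quantile(0.5)
kept = [tokenizer.decode([int(t)]) for t, s in zip(targets[0], self_info) if s >= threshold]
print("".join(kept))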
Let’s see an example of a text compressed at different lexical units.

Of these three lexical units, sentence-level compression keeps the original sentences intact. Note also that a lower reduce ratio compresses the text more.
LongLLMLingua
![Framework of LongLLMLingua [7]](https://towardsdatascience.com/wp-content/uploads/2024/01/14p6DDg1NtFBA5qgQAZkLcA.png)
The last compression method we’ll talk about is LongLLMLingua. LongLLMLingua builds on LLMLingua, which uses a base LLM like Llama to assess the perplexity of each token in a prompt, discarding those with lower perplexities. This approach is based on information entropy, similar to Selective Context.
However, instead of just removing the tokens directly, LLMLingua uses a budget controller, a token-level prompt compression algorithm, and a distribution alignment mechanism. We won’t go into too much detail, but you can read more about it in the original paper [8].
The problem with LLMLingua is that it doesn’t take into consideration the user question during the compression process, which could result in retaining information that is not relevant. LongLLMLingua claims to improve this shortcoming by incorporating the user’s question into the compression process.
The four new components they bring to the table are a question-aware coarse-to-fine compression method, a document reordering mechanism, dynamic compression ratios, and a post-compression subsequence recovery strategy to improve the LLM’s perception of the key information.
Question-Aware Coarse-Grained Compression means that instead of looking at each document alone, the new method checks how each document relates to the question. If a document makes the question seem more expected or "less surprising" to the model, it’s seen as more important.
Question-Aware Fine-Grained Compression
![Contrastive perplexity equation [7]](https://towardsdatascience.com/wp-content/uploads/2024/01/1EqFL1PqRLNt0-pQB9fSLgw.png)
First, we measure how surprising a word is normally, without the question being considered. This is perplexity(x_i | x_<i): the perplexity, or surprise, of seeing the word x_i given all the words before it. Then, we measure the perplexity again, but this time including the question in the context. This is perplexity(x_i | x^que, x_<i): the surprise of seeing the word x_i given the question and all the words before it.
The idea is to find out how much the question changes the surprise level for each word. If a word becomes a lot less surprising when you include the question, it’s probably very relevant to the question.
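Written out (a reconstruction based on the description above, with the sign chosen so that larger scores mean more question-relevant; the figure shows the paper’s exact form), the contrastive score of token x_i is the drop in surprise once the question is included:

$$ s_i = \operatorname{perplexity}(x_i \mid x_{<i}) - \operatorname{perplexity}(x_i \mid x^{\mathrm{que}}, x_{<i}) $$

Tokens with the largest s_i are the ones kept during fine-grained compression.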
Then, we reorder the documents in descending order based on the importance scores obtained from the first step. This way, the most important documents come first.
Subsequence recovery
![Example of Subsequence Recovery, the red text represents the original text, and the blue text is the result after using the LLaMA 2–7B tokenizer [7]](https://towardsdatascience.com/wp-content/uploads/2024/01/1dNB0ty6oxSDC2ELglgGlsA.png)
After compression, it can happen that key entities like dates or names become altered. For instance, "2009" might become "209," or "Wilhelm Conrad Rontgen" might turn into "Wilhelmgen." In order to avoid this problem, we first identify the longest substring in the LLM’s response that matches a part of the compressed prompt. This substring is considered a key entity. Next, we find the original, uncompressed subsequence corresponding to the compressed entity. Then, we replace the compressed entity with the original one.
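A heavily simplified sketch of this recovery idea, using Python’s difflib (a stand-in for the paper’s token-level alignment, not its exact algorithm):

from difflib import SequenceMatcher

def recover_entities(response: str, compressed: str, original: str) -> str:
    """Simplified illustration of subsequence recovery."""
    # 1. Longest span shared by the response and the compressed prompt -> key entity
    m = SequenceMatcher(None, response, compressed).find_longest_match(0, len(response), 0, len(compressed))
    key_entity = response[m.a:m.a + m.size].strip()
    if not key_entity:
        return response

    # 2. Best-matching span in the original prompt, expanded to whole words
    m2 = SequenceMatcher(None, key_entity, original).find_longest_match(0, len(key_entity), 0, len(original))
    start, end = m2.b, m2.b + m2.size
    while start > 0 and original[start - 1].isalnum():
        start -= 1
    while end < len(original) and original[end].isalnum():
        end += 1
    recovered = original[start:end]

    # 3. Substitute the recovered entity back into the response
    return response.replace(key_entity, recovered)

print(recover_entities("He discovered X-rays in 209.", "in 209,", "in 2009,"))
# -> "He discovered X-rays in 2009."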
RAG with LlamaIndex and prompt compression
We will use Nicolas Cage’s Wikipedia page for a practical RAG application. The model has probably already seen information about the actor in its training data, so we specify that we expect an answer based only on the retrieved context. We load the Wikipedia page using the WikipediaReader() loader.
from llama_index import (
    VectorStoreIndex,
    download_loader,
    load_index_from_storage,
    StorageContext,
)
WikipediaReader = download_loader("WikipediaReader")
loader = WikipediaReader()
documents = loader.load_data(pages=['Nicolas Cage'])
Now, we are building a simple vector store index. It takes only one line to do the chunking, embedding, and indexing of the documents.
The retriever will be used to return the most relevant documents given the user query. It does so by computing the similarity between the query and various document chunks within the embedding space. We want to retrieve the top 2 most similar chunks.
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=2)
Now that our data is stored in the index, we launch a user query. The retriever.retrieve(question) function searches the index to find the 2 chunks of data that are most similar to the query.
question = "Where did Nicolas Cage go to school?"
contexts = retriever.retrieve(question)
# Expected answer: Beverly Hills High School
The contexts list contains NodeWithScore objects, which carry metadata and relationship information to other nodes. For now, we are only interested in the text content.
context_list = [n.get_content() for n in contexts]
context_list
This is the retrieved context. Even if we choose to get only the top two documents, we still have to deal with a lot of text.

We combine these relevant chunks with the original query to create a prompt. We will use a prompt template instead of just an f-string because we want to reuse it down the line.
We then feed this prompt into gpt-3.5-turbo-16k to generate a response.
# The response from original prompt
from llama_index.llms import OpenAI
from llama_index.prompts import PromptTemplate
llm = OpenAI(model="gpt-3.5-turbo-16k")
template = (
    "We have provided context information below.\n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given this information, please answer the question: {query_str}\n"
)
qa_template = PromptTemplate(template)
# you can create text prompt (for completion API)
prompt = qa_template.format(context_str="\n\n".join(context_list), query_str=question)
response = llm.complete(prompt)
print(str(response))
Output:
Nicolas Cage attended Beverly Hills High School and later attended UCLA School of Theater, Film and Television.
Now, let’s measure the RAG performance after using different prompt compression techniques.
Selective Context
We will use a reduce_ratio of 0.5 and see how the model does. If the compression keeps the information we are interested in, we will lower the value in order to compress more text.
from selective_context import SelectiveContext
sc = SelectiveContext(model_type='gpt2', lang='en')
context_string = "\n\n".join(context_list)
context, reduced_content = sc(context_string, reduce_ratio=0.5, reduce_level="sent")
prompt = qa_template.format(context_str="\n\n".join(reduced_content), query_str=question)
response = llm.complete(prompt)
print(str(response))
This is the reduced content.

It was compressed at the sentence level, but unfortunately, the information about where Nicolas Cage went to school got lost. We also tried token and phrase-level compression, but the information was still absent.
Output:
The provided information does not mention where Nicolas Cage went to school.
LongLLMLingua
# Setup LLMLingua
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import CompactAndRefine
from llama_index.indices.postprocessor import LongLLMLinguaPostprocessor
node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # enable document reordering
        "dynamic_context_compression_ratio": 0.3,
    },
)
retrieved_nodes = retriever.retrieve(question)
synthesizer = CompactAndRefine()
The postprocess_nodes function is the one we care about the most because it shortens the node text given the query.
from llama_index.indices.query.schema import QueryBundle
new_retrieved_nodes = node_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str=question)
)
Now let’s see the results.
original_contexts = "\n\n".join([n.get_content() for n in retrieved_nodes])
compressed_contexts = "\n\n".join([n.get_content() for n in new_retrieved_nodes])
original_tokens = node_postprocessor._llm_lingua.get_token_length(original_contexts)
compressed_tokens = node_postprocessor._llm_lingua.get_token_length(compressed_contexts)
print(compressed_contexts)
print()
print("Original Tokens:", original_tokens)
print("Compressed Tokens:", compressed_tokens)
print("Compressed Ratio:", f"{original_tokens/(compressed_tokens + 1e-5):.2f}x")
Original Tokens: 2362
Compressed Tokens: 344
Compressed Ratio: 6.87x
Compressed context:

Let’s see if the model understands the compressed context.
response = synthesizer.synthesize(question, new_retrieved_nodes)
print(str(response))
Output:
Nicolas Cage attended Beverly Hills High School.
From the context compressed with LongLLMLingua, it is clear where the actor went to school. We also got an almost 7x reduction in input tokens! For this call, that translates to a saving of about $0.00202. Now imagine the cost reduction at the scale of 1B input tokens: normally they would cost around $1,000, but with prompt compression we would pay only around $150.
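The arithmetic behind those numbers, assuming an input price of $0.001 per 1K tokens (the value the savings above imply; check the pricing page [3] for current figures):

PRICE_PER_1K_INPUT = 0.001  # $ per 1K input tokens (assumed; see pricing page [3])

original_tokens, compressed_tokens = 2362, 344
saving_per_call = (original_tokens - compressed_tokens) / 1000 * PRICE_PER_1K_INPUT
print(f"Saved per call: ${saving_per_call:.5f}")  # ≈ $0.00202

# At scale: 1B input tokens with and without the ~6.87x compression
full_cost = 1_000_000_000 / 1000 * PRICE_PER_1K_INPUT  # $1,000
compressed_cost = full_cost / (original_tokens / compressed_tokens)
print(f"1B tokens: ${full_cost:,.0f} -> ${compressed_cost:,.0f}")  # ≈ $1,000 -> $146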

Conclusion
Of the methods discussed, LongLLMLingua seems to be the most promising for prompt compression in RAG applications. It compresses prompts by 6–7x while still retaining key information needed for the LLM to generate accurate responses.
. . .
If you enjoyed this article, join Text Generation – our newsletter has two weekly posts with the latest insights on Generative AI and Large Language Models.
You can also find me on LinkedIn.
. . .
References
2. Prediction and entropy of printed English
3. OpenAI pricing – https://openai.com/pricing
4. Adapting Language Models to Compress Contexts
5. AutoCompressors – https://github.com/princeton-nlp/AutoCompressors
6. Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering
7. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
8. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models