How to Cut RAG Costs by 80% Using Prompt Compression
Accelerating Inference With Prompt Compression
Inference is one of the main drivers of the monetary and time costs of using large language models, and the problem grows considerably for longer inputs. Below, you can see the relationship between model performance and inference time.
Fast models, which generate more tokens per second, tend to score lower on the Open LLM Leaderboard. Scaling up the model size enables better performance but comes at the cost of lower inference throughput, which makes larger models difficult to deploy in real-life applications [1].
Enhancing LLMs’ speed and reducing resource requirements would allow them to be more widely used by individuals or small organizations.
Various solutions have been proposed to improve LLM efficiency; some focus on the model architecture or the serving system. However, proprietary models like ChatGPT or Claude can be accessed only via APIs, so we cannot modify their inner workings.
We will discuss a simple and inexpensive method that relies only on changing the input given to the model: prompt compression.
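To make the idea concrete, here is a minimal sketch of compressing a RAG prompt, assuming the open-source LLMLingua library (`pip install llmlingua`); the document strings, the question, and the `target_token` budget are illustrative placeholders rather than values from this article.

```python
# A minimal sketch of prompt compression with LLMLingua (assumption: this
# library is used; the inputs and budget below are illustrative only).
from llmlingua import PromptCompressor

# LLMLingua scores token importance with a small language model and drops
# low-information tokens from the prompt.
compressor = PromptCompressor()  # downloads the default scorer model

retrieved_context = [
    "Document 1: a long passage retrieved for the RAG pipeline ...",
    "Document 2: another long retrieved passage ...",
]

result = compressor.compress_prompt(
    retrieved_context,
    instruction="Answer the question using the context.",
    question="What does the report say about Q3 revenue?",
    target_token=200,  # illustrative compression budget
)

print(result["compressed_prompt"])  # shortened prompt to send to the LLM
print(result["origin_tokens"], "->", result["compressed_tokens"])
```

The compressed prompt is then sent to the API model in place of the full retrieved context, so fewer input tokens are billed and processed.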