
How to Cut RAG Costs by 80% Using Prompt Compression

Accelerating Inference With Prompt Compression

Iulia Brezeanu
Towards Data Science
11 min read · Jan 4, 2024

Inference is one of the biggest contributors to the monetary and time costs of using large language models, and the problem grows considerably for longer inputs. Below, you can see the relationship between model performance and inference throughput.

Performance score vs inference throughput [1]

Fast models, which generate more tokens per second, tend to score lower on the Open LLM Leaderboard. Scaling up the model size enables better performance but comes at the cost of lower inference throughput, which makes large models difficult to deploy in real-life applications [1].

Enhancing LLMs’ speed and reducing resource requirements would allow them to be more widely used by individuals or small organizations.

Various solutions have been proposed for increasing LLM efficiency; some focus on the model architecture or the serving system. However, proprietary models like ChatGPT or Claude can be accessed only through APIs, so we cannot change their internals.

We will discuss a simple and inexpensive method that relies only on changing the input given to the model: prompt compression.
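
To give a concrete sense of what prompt compression looks like in practice, here is a minimal sketch using Microsoft's open-source LLMLingua library. The library choice, the example chunks, the question, and the token budget are assumptions for illustration, not necessarily the exact setup used later in the article.

```python
# A minimal sketch of prompt compression with LLMLingua (assumed library;
# install with: pip install llmlingua). It shrinks retrieved context before
# it is sent to the LLM, which is where most RAG token spend goes.
from llmlingua import PromptCompressor

# Loads a small causal LM that scores token importance; note that this
# downloads a model on first use, so initialization is not free.
compressor = PromptCompressor()

# Hypothetical retrieved chunks standing in for real RAG context.
retrieved_chunks = [
    "Document 1: ... long retrieved passage ...",
    "Document 2: ... another long retrieved passage ...",
]

result = compressor.compress_prompt(
    retrieved_chunks,                                        # context to shrink
    question="What does the contract say about termination?",
    target_token=200,                                        # rough budget for compressed context
)

# The compressed text replaces the original context in the final LLM call,
# so the API is billed for far fewer input tokens.
print(result["compressed_prompt"])
```

The key design point is that compression happens entirely on the input side: the downstream model, whether open or proprietary, is called exactly as before, just with a much shorter prompt.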
