
How to Cut RAG Costs by 80% Using Prompt Compression

Accelerating Inference With Prompt Compression

Iulia Brezeanu
Towards Data Science
11 min read · Jan 4, 2024

Image by the author. AI Generated.

Inference is one of the main contributors to the monetary and time costs of using large language models, and the problem grows considerably for longer inputs. Below, you can see the relationship between model performance and inference time.

Performance score vs inference throughput [1]

Fast models, which generate more tokens per second, tend to score lower on the Open LLM Leaderboard. Scaling up the model size enables better performance but comes at the cost of lower inference throughput, which makes large models difficult to deploy in real-life applications [1].

Enhancing LLMs’ speed and reducing resource requirements would allow them to be more widely used by individuals or small organizations.

Various solutions have been proposed to increase LLM efficiency; some focus on the model architecture or the serving system. However, proprietary models like ChatGPT or Claude can be accessed only via APIs, so we cannot change their inner workings.

We will discuss a simple and inexpensive method that relies only on changing the input given to the model: prompt compression.
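
To make the idea concrete before diving in, here is a minimal sketch of what prompt compression can look like in a RAG pipeline, assuming the open-source LLMLingua library; the documents, question, and token budget are illustrative placeholders rather than code from this article.

```python
# Minimal prompt-compression sketch, assuming the open-source
# LLMLingua library (pip install llmlingua). All inputs below are
# hypothetical placeholders.
from llmlingua import PromptCompressor

# Loads a small causal language model that scores tokens and drops
# low-information ones from the prompt.
compressor = PromptCompressor()

retrieved_docs = [
    "Passage 1: ...",  # hypothetical passages returned by a RAG retriever
    "Passage 2: ...",
]

result = compressor.compress_prompt(
    retrieved_docs,
    question="What were the key findings?",  # hypothetical user query
    target_token=200,  # shrink the context to roughly 200 tokens
)

print(result["compressed_prompt"])  # the shortened prompt to send to the API
print(result["origin_tokens"], "->", result["compressed_tokens"])
```

Because API providers charge per token, sending the compressed prompt instead of the full retrieved context reduces both cost and latency, at the price of a small, controllable loss of detail.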
