The speed at which people are evolving into GenAI experts these days is remarkable. And each of them proclaims that GenAI will bring the next industrial revolution.
It’s a big promise, but I agree, I think it’s true this time. AI will finally revolutionize the way we work and live. I can’t imagine sliding into the next AI winter.
LLMs and multimodal models are simply too useful and "relatively" easy to implement into existing processes. A vector database, a few lines of code to process the raw data, and a few API calls.
That’s it. – At least in theory.
Although it sounds quite straightforward, the real progress in the industry is probably better described by one of Matt Turck’s recent LinkedIn posts:
2023: "I hope Generative AI doesn’t kill us all!"
2024: "I hope Generative AI goes from proof of concept experiment at my company to small production deployment to read PDFs in the next 12–18 months!"
- Matt Turck
Building a prototype is easy. Turning it into something production-ready is hard.
The following post is for everybody who spent the last month building their first LLM apps and started productizing them. We will look at 17 techniques you should try to mitigate some of the pitfalls along the RAG process and gradually develop your application into a powerful and robust solution that will last.
To understand where we have opportunities to improve the standard RAG process, I will briefly recap what components we need to create a simple RAG app. Feel free to skip this part, to get directly to the "Advanced" techniques.
Table of Contents
1. Raw data preparation
(1) Preparing (or processing) the data
2. Indexing / Chunking
(3) Enhancing data quality – Abbreviations / technical terms / links
(5) Optimizing the indexing structure – Full Search vs. Approximate Nearest Neighbor, HNSW vs. IVFPQ
(6) Choose the right embedding model
3. Retrieval Optimization – Query Translation / Query Rewriting / Query Extension
(7) Query Expansion with generated answers – HyDE and co.
4. Post-Retrieval
(11) Sentence Window Retrieval
(12) Auto-merging Retriever (aka Parent Document Retriever)
5. Generation / Agents
6. Evaluation – Evaluate our RAG system
(17) Continued data collection from the app and users
What do we need for a production-ready solution?
That LLMs are getting more and more powerful is one thing, but if we are honest with each other, the bigger lever for improving RAG performance is, once again, the less fancy stuff:
Data quality – data preparation – data processing.
During the runtime of the app or when preparing the raw data, we process data, classify data and gain insights from the data to steer the outcome in the right direction.
It is illusory to simply wait for bigger and bigger models in the hope that they will solve all our problems without us working on our data and processes.
Maybe one day we’ll get to the point where we feed all the crappy raw data into a single model and it somehow makes something useful out of it. But even if we reach that point, I doubt it will make sense from a cost and performance perspective.
The RAG concept is here to stay.
In his talk at Sequoia Capital’s AI Ascent, Andrew Ng showed how GPT-3.5 can beat far more powerful models by using an agent-based approach. The image below compares the accuracy of GPT-4 (zero-shot prompting) to GPT-3.5 + agentic workflow.
I explain the idea behind Agents in one of my recent articles. You can find it here.
![Coding Benchmark (HumanEval) - Image by the author (data by [Andrew Ng, 2024])](https://towardsdatascience.com/wp-content/uploads/2024/06/1cdQNL9JpKRYv1hZt0ihhyQ-1.png)
With all these capabilities that the big Transformer models have brought us, AGI (Artificial General Intelligence) is no longer just something for science fiction films.
At the same time, I doubt that we will reach AGI level with one single omniscient AI model. Andrew Ng describes the "AGI-like" program as a whole system, combining LLMs and multimodal models together with suitable components and tools around them. [Andrew Ng, 2024]
If that’s the case, the journey towards AGI is for all of us. We can fill the space between the large models and real-world applications, and step by step make such a system more intelligent and capable over time.
In this article, I will show you some tools for that journey. The described techniques will help us improve the robustness and performance of our system.
They will help us answer questions like:
- How does my system find the "really" relevant content?
- How do I prepare and process the data so that the LLM can do something with it?
- Can I connect the LLMs in series? Route requests through different components?
Before we dive into the advanced techniques, let’s talk about Naive RAG – the most simplistic RAG system – first and build on top of that.
I promise to keep it short, but feel free to skip that part if you are more than familiar with the standard RAG system.
Feel free to jump to the advanced techniques here.
Naive RAG – A Brief Recap
With RAG, we are building on the shoulders of giants, utilizing existing concepts and technologies and combining them in a suitable way.
A lot of the technologies have their origin in the search engine domain. The aim is to build a process around the LLMs, feeding the model with the right data, to make decisions or summarise information.
The image below just shows a bunch of technologies we are using while building such a system.

Besides the Transformer Models, we are using a bunch of technologies, like:
- Semantic search techniques
- Techniques to prepare and process textual data
- Knowledge graphs, smaller classification/regression/NLP models, etc.
All of those technologies have existed for years. The vector search library FAISS, for example, was released back in 2017. And text vectorization is nothing new either. [Ilin, 2024]
RAG just connects those components, to solve a particular problem.
Bing Search, for example, combines its traditional "BING" web search with the capabilities of LLMs. That allows its chatbot to answer questions about "real-life" data, like:
- "What is Google’s share price today?"
The image below shows the standard RAG process. When the user asks a question, Naive RAG compares the user’s question directly with any content in our vector database.

We are interested in finding similar content. Similar content is content that is close to each other in our vector space. The distance is measured e.g. by calculating the cosine similarity.

Sample:
Question: "What position has Tom Brady played?"
Let’s assume that we have two main data sources in our vector database:
- Tom Brady’s Wikipedia article.
- Articles from a cookbook.
In the example below the content from Wikipedia should be more relevant and thus closer to the user’s question.

But how "close" is "close" enough?
There is no universal threshold for the similarity score that would allow us to cleanly separate relevant from irrelevant content. You can try to find one yourself, but you will probably realise that it is not practical.
Found content? Let’s build the prompt
Now that we have found some content similar to the user's question, we need to pack it all into a meaningful prompt. Such a prompt has at least 3 building blocks.
- The system prompt, to tell the LLM how to behave
- The user's question
- The context: the relevant documents found during the similarity search
A suitable prompt template could look like this:

The part of the system prompt that says "… use only the provided information" turns the LLM into a system that processes and interprets information. In this scenario, we do not directly rely on the model's own knowledge to answer the question.
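As a rough sketch in code, such a template could look like the following (the wording is illustrative, adapt it to your use case):

```python
# Illustrative prompt template - the exact wording is up to you.
PROMPT_TEMPLATE = """System: You are a helpful assistant.
Answer the user's question using only the provided information.
If the context does not contain the answer, say that you don't know.

Context:
{context}

Question:
{question}
"""


def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Fill the template with the retrieved context and the user's question."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```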
That's it. – It's that simple. A vector store, an embedding model, an LLM, a few lines of Python code – and of course some documents.
When we scale these systems and take them from a prototype to a production solution, reality sets in.
We will most likely encounter various pitfalls along the process, e.g.:
- (Valuable Content vs. Distractors) Our nearest neighbor search will always come up with some content, but how relevant is that content? We call data points that are not relevant for answering the user's question distractors.

- (Chunk Optimization) What size should the individual content pieces have, to be just as specific enough but still include enough context around it?
- (Monitoring/ Evaluation) How do we monitor our solution in operations? (LLMOps)
- (Agents / Multiple Prompting) How do we handle more complex queries, which can’t be solved with a single prompt?
How to deal with RAG’s potential pitfalls?
As mentioned above, we have different components that interact with each other. This gives us a whole range of possible ways to improve the performance of the overall system.
Basically we have the following 5 process steps we can try to improve:
- Pre-Retrieval: ingest embeddings into the vector store
- Retrieval: find relevant content
- Post-Retrieval: pre-process the retrieved results before we send them to the LLM
- Generation: use the context provided to solve the user’s problem
- Routing: the overall routing of the request, e.g. an agentic approach that breaks the question down and sends it back and forth between the model and other components

If we look at them more closely, we get the following picture.

Let’s take a look at each of them.
We start with the most obvious and simplest method – Data quality. For most RAG use cases it is textual data, e.g. some wiki articles.
1. Raw Data Creation / Preparation
We don't always have to work with what we happen to have. Often, we can influence how the documents are created in the first place.
Through LLMs and RAG applications we are suddenly forced to structure our knowledge bases. In Naive RAG, we search for information pieces, which are somehow similar to the user’s question.
This way, the model never sees the entire context of the wiki, but only individual text snippets. That is a problem when the documents include e.g.:
- Domain-specific or even document-specific abbreviations
- Text passages that are linked to other parts of the wiki
If it is difficult for a human without background knowledge to grasp the full meaning behind text snippets, the LLM will also stumble.
In later parts of the article, you will find some techniques which try to tackle those problems after or during the retrieval step.
In an ideal world, we don’t need them.
Every section in our wiki should be as easy to understand as possible, which also makes it easier for us humans. This way, we improve the readability of our wiki and the performance of our RAG app at the same time. Win-win.
The following example shows how we can make life easier for our RAG app by setting up the content in the right way.
(1) Preparing the data in a way that the text chunks are self-explanatory
In the figure below you can see an example similar to what you often see in tutorials and technical documentation. If we only have a pure (text-only) LLM and not a multimodal model, the model will find it difficult to fully understand the content of version 1 on the left. Version 2 at least gives it a better chance to do so.

The next step along the RAG process is to chunk the data in a meaningful way, translate it into embeddings, and index it.
2. Indexing / Chunking – Chunk Optimization
Transformer models have a fixed input sequence length. So we are limited in the number of tokens we can send to the LLM and the embedding model.
But in my opinion, this is not a limitation.
It makes sense to think about the optimal length of a text snippet and prompt, as this has a significant impact on performance, e.g.:
- response time (+costs)
- similarity search accuracy
- etc.
There are various text splitters available to chunk the text.
(2) Chunk Optimization – Sliding Windows, Recursive Structure Aware Splitting, Structure Aware Splitting, Content-Aware Splitting
The size of the chunk is a parameter to think about. It depends on the embedding model you use and its capacity in tokens: standard transformer encoder models like BERT-based Sentence Transformers take 512 tokens at most, while some embedding models can handle longer sequences of 8,191 tokens and more.
But bigger is not always better. I would rather find the two sentences in a book that contain the two most important pieces of information than 5 pages that somehow include the answer as well.

The compromise here is:
enough context for the LLM to reason upon vs. specific enough text embedding in order to efficiently execute search upon
There are different ways to address those chunk size selection concerns. In LlamaIndex this is covered by the NodeParser class with some advanced options such as defining your own text splitter, metadata, nodes/chunks relations, etc.
The simplest way to make sure that no information gets cut off at a hard boundary when we divide the text into sections is to use sliding windows.
It's simple: the text sections overlap – that's it.
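A minimal sliding-window splitter could look like this (token counts are approximated by words here; in practice you would count tokens with your model's tokenizer):

```python
def sliding_window_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` words each."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```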

Beyond that there are a bunch of other chunking techniques you can try, to improve the chunking process, like:
- Recursive Structure Aware Splitting
- Structure Aware Splitting (by Sentence, Paragraph)
- Content-Aware Splitting (Markdown, LaTeX, HTML)
- Chunking using NLP: Tracking Topic Changes
(3) Enhancing data quality – Abbreviations, technical terms, links
Data cleansing techniques remove irrelevant information or put sections of text into context to make them easier to understand.
Sometimes the meaning of a particular paragraph in a longer text seems perfectly clear if you know the context of the book. If the context is missing, it becomes difficult.
Abbreviations, specific technical terms, or company internal terms make it hard for the model to understand the full meaning.
The image below shows some examples.
- If you have no idea about basketball, you will never associate MJ with THE "Michael Jordan".
- If you don’t know that the sentence describes something in the supply chain context, you would probably assume PO means "post office" or any other word combination starting with a P and O.

To mitigate that issue, we can try to ingest that necessary additional context while processing the data, e.g. replace abbreviations with the full text by using an abbreviation translation table.
You will definitely need that when you are working on use cases around Text-to-SQL. The field names in a lot of databases are often weird. Often only the developer, and God, knows what the meaning behind a particular field name is.
SAP (the Enterprise Resource Planning solution) often uses abbreviated German words to label its fields. The field "WERKS", for example, is short for the German word "Werk", which describes the plant or production site.
Sure, this may make sense for the team that defined the database structure. Everyone else will have a hard time, including our model.
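A simple mitigation is a translation table that expands known abbreviations during preprocessing, before chunking and embedding. A minimal sketch (the table entries are just placeholders for your own domain terms):

```python
import re

# Placeholder abbreviation table - fill it with your own domain / company terms.
ABBREVIATIONS = {
    "MJ": "Michael Jordan (MJ)",
    "PO": "purchase order (PO)",
    "WERKS": "plant (SAP field WERKS)",
}


def expand_abbreviations(text: str, table: dict[str, str] = ABBREVIATIONS) -> str:
    """Replace known abbreviations with their long form before chunking/embedding."""
    for short, long_form in table.items():
        text = re.sub(rf"\b{re.escape(short)}\b", long_form, text)
    return text
```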
(4) Adding Metadata
You can add metadata to your vector data in all vector databases. Metadata can later help to (pre-)filter the entire vector database before we perform a vector search.
Let’s say half of our data in our vector store is targeted at users in Europe and the other half at users in the US. If we know the user’s location, we don’t want to search the entire database, we want to be able to search directly for the relevant bits. If we have this information as a metadata field, most vector stores allow us to pre-filter the database before performing the similarity search.
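Most vector stores expose this as a filter argument on the query. As one example, a sketch with Chroma could look like this (the collection name and metadata values are made up for illustration):

```python
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("support_articles")

# Ingest: store the target region as metadata next to each document.
collection.add(
    ids=["doc-eu", "doc-us"],
    documents=["Return policy for EU customers ...", "Return policy for US customers ..."],
    metadatas=[{"region": "EU"}, {"region": "US"}],
)

# Query: pre-filter on the metadata field before the similarity search runs.
results = collection.query(
    query_texts=["How long can I return a product?"],
    n_results=1,
    where={"region": "EU"},
)
```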
(5) Optimizing the indexing structure – Full Search vs. Approximate Nearest Neighbor, HNSW vs. IVFPQ
I don’t believe that the similarity search is the weak point of most RAG systems – at least not when we look at the response time – but I want to mention it anyway.
Similarity search in most vector stores is very fast, even with millions of entries, because it uses Approximate Nearest Neighbor (ANN) techniques, implemented in libraries such as FAISS, NMSLIB, or ANNOY.

For use cases with only a few thousand entries, this is usually overkill. Whether we perform an ANN search or a full nearest neighbor search barely influences the response time of our RAG system.
However, if you want to set up a scalable system, you can certainly speed up things.
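If you do want to experiment with this, FAISS lets you swap the index type with a few lines. A small sketch comparing an exact (flat) index with an HNSW index (the dimensionality and parameters are arbitrary here):

```python
import faiss
import numpy as np

d = 384                                                  # embedding dimensionality
xb = np.random.random((10_000, d)).astype("float32")     # dummy corpus embeddings
xq = np.random.random((5, d)).astype("float32")          # dummy query embeddings

# Exact (brute-force) nearest neighbor search - fine for a few thousand entries.
flat_index = faiss.IndexFlatL2(d)
flat_index.add(xb)
distances_exact, ids_exact = flat_index.search(xq, 5)

# Approximate nearest neighbor search with HNSW - scales to millions of entries.
hnsw_index = faiss.IndexHNSWFlat(d, 32)                  # 32 = graph neighbors per node
hnsw_index.add(xb)
distances_ann, ids_ann = hnsw_index.search(xq, 5)
```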
(6) Choose the right embedding model
To embed the text chunks, there are various options. If you are not sure which models to use, have a look at existing benchmarks for measuring the performance of text embedding models, like MTEB (Massive Text Embedding Benchmark).
When it comes to embeddings, we usually need to decide how many dimensions we want them to have. Higher dimensions let you capture and store more semantic facets of a sentence; on the other hand, they need more storage space and more compute time.

We translate all our content into embeddings and add them to our vector database. There are now several models available from different providers. If you want to get an idea of which models you can use, you can take a look at the models supported by the langchain.embeddings module. In the source code of the module you will find a rather long list of supported models:
```python
__all__ = [
    "OpenAIEmbeddings",
    "AzureOpenAIEmbeddings",
    "CacheBackedEmbeddings",
    "ClarifaiEmbeddings",
    "CohereEmbeddings",
    ...
    "QianfanEmbeddingsEndpoint",
    "JohnSnowLabsEmbeddings",
    "VoyageEmbeddings",
    "BookendEmbeddings",
]
```
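As a small sketch, embedding a few chunks with an open-source model could look like this (assuming the sentence-transformers package; swap in whichever model scores well for your language and domain on MTEB):

```python
from sentence_transformers import SentenceTransformer

# Example open-source model with 384-dimensional embeddings - pick one that fits
# your language and domain (see the MTEB leaderboard).
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Tom Brady played as a quarterback in the NFL.",
    "Preheat the oven to 180 degrees Celsius.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)   # (2, 384)
```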
3. Retrieval Optimization – Query Translation / Query Rewriting / Query Extension
Query expansion, query rewriting, query translation – whatever we call it, all of these techniques use an LLM to modify the original query before retrieval.
Basically, we are using the power of LLMs to augment and enhance the query we send to the vector search.
There are different ways, e.g.
- Expanding the user's query with LLM-generated documents that answer the user's question.
- Alternatively, we can use the LLM to rephrase the user's query slightly, send different prompts to the model, and afterwards interpret the different responses from the model.
Let’s start with the first one.
(7) Query Expansion with generated answers – HyDE and co.
We use the LLM to generate an answer, before performing the similarity search. If it is a question that can only be answered using our internal knowledge, we indirectly ask the model to hallucinate, and use the hallucinated answer to search for content that is similar to the answer and not the user query itself.
![Expansion with generated answers - Image by the author (inspired by [Gao, 2022])](https://towardsdatascience.com/wp-content/uploads/2024/06/0Uy_ckFyxaaWVmFlK-1.png)
There are several techniques, like HyDE (Hypothetical Document Embeddings), Rewrite-Retrieve-Read, Step-Back Prompting, Query2Doc, ITER-RETGEN, etc.
In HyDE we let the LLM create an answer to the user’s query without context first, and use the answer to search for relevant information within our vector database.
![How does HyDE work? - Image by the author (inspired by [Gao, 2022])](https://towardsdatascience.com/wp-content/uploads/2024/06/1yHYD9N78jMvE7j0BOsnq3g-1.png)
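A minimal HyDE sketch could look like this (assuming the openai client; `vector_store.search()` is a placeholder for whatever vector database you use):

```python
from openai import OpenAI

client = OpenAI()


def hyde_search(question: str, vector_store, k: int = 5):
    # 1. Let the LLM "hallucinate" a hypothetical answer without any context.
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        messages=[{"role": "user", "content": f"Write a short passage answering: {question}"}],
    )
    hypothetical_answer = response.choices[0].message.content

    # 2. Search with the hypothetical answer instead of the raw question.
    #    `vector_store.search()` stands in for your own retrieval call.
    return vector_store.search(hypothetical_answer, k=k)
```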
As an alternative to HyDE and co., we can expand the user's query by using multiple system prompts.
(8) Multiple System Prompts
The idea is simple: we generate e.g. 4 different prompts, which will deliver 4 different responses.
You can be creative. The difference between the prompts can be anything.
- If we have a long text and we want to summarise it, we can look for certain aspects in the text by using different prompts and summarise those findings in the final answer.
- Or we have 4 times more or less the same prompt, with the same context, but the system prompt is slightly different. So we tell the LLM 4 times in a slightly different way what to do with it or how to phrase the answer.

This idea pops up everywhere in data science. In boosting algorithms, we usually have several simple models, each slightly different, each making one small decision. In the end, the results are somehow consolidated. This concept is very powerful.
We do the same here, just using the model again to consolidate the different predictions. The downside, of course, is higher compute time and/or response time.
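A rough sketch of the idea (the prompts and the consolidation step are illustrative, and `ask_llm()` is a placeholder for your own LLM call):

```python
SYSTEM_PROMPTS = [
    "You are a precise assistant. Answer in one short paragraph.",
    "You are a domain expert. Focus on technical details in your answer.",
    "You are a skeptical reviewer. Point out uncertainties in your answer.",
    "You are a teacher. Explain the answer in simple terms.",
]


def multi_prompt_answer(question: str, context: str, ask_llm) -> str:
    """Ask the same question with different system prompts and consolidate the answers."""
    answers = [
        ask_llm(system_prompt=sp, user_prompt=f"Context:\n{context}\n\nQuestion: {question}")
        for sp in SYSTEM_PROMPTS
    ]
    consolidation_prompt = (
        "Consolidate the following answers into one final answer:\n\n" + "\n\n".join(answers)
    )
    return ask_llm(system_prompt="You merge answers faithfully.", user_prompt=consolidation_prompt)
```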
(9) Query Routing
In query routing, we use the LLMs’ decision-making ability to decide what to do next.
Let's say we have data from different domains in our vector store. To search for relevant content in a more targeted way, it makes sense to let the model decide in a first step which data pool we should use to answer the question.
The example vector store in the image below contains news from all over the world. News about sports and football, cooking trends and news about politics. When the user queries our chatbot, we don’t want to mix this data.
Sports rivalries between countries should not be mixed with politics. And content about cooking is probably not helpful if the user is looking for news about politics.

This way, we can significantly increase performance. And we can also give the end user the option of selecting the topic to be used to answer the question.
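A sketch of such a routing step (the topic list and the fallback are made up for this example):

```python
from openai import OpenAI

client = OpenAI()
TOPICS = ["sports", "cooking", "politics"]


def route_query(question: str) -> str:
    """Let the LLM decide which data pool (collection) should answer the question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        messages=[{
            "role": "user",
            "content": (
                f"Classify this question into exactly one of {TOPICS}. "
                f"Answer with the topic only.\n\nQuestion: {question}"
            ),
        }],
    )
    topic = response.choices[0].message.content.strip().lower()
    return topic if topic in TOPICS else "politics"   # fallback if the model answers off-list

# Afterwards, search only the collection for that topic, e.g. collections[route_query(q)].query(...)
```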
(10) Hybrid Search
Basically, the retrieval step of the RAG pipeline is nothing other than a search engine – probably the most important part of our RAG system.
When we want to improve the similarity search, it is worth having a look into the search domain. One example is the Hybrid Search approach. We are performing a vector search, as well as a lexical (keyword) search, and somehow bring those results together.

This is a common approach in machine learning: different techniques or different models predict the same output, and the results are then consolidated. The idea is always the same:
A bunch of experts trying to find solutions and making compromises will make better decisions than one stand-alone expert.
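A minimal hybrid search sketch, using BM25 for the keyword part and simple reciprocal rank fusion to merge both result lists (assuming the rank_bm25 package; the vector ranking is a placeholder for your own similarity search):

```python
from rank_bm25 import BM25Okapi


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one combined ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


documents = {"d1": "Tom Brady played quarterback ...", "d2": "Preheat the oven ..."}
bm25 = BM25Okapi([doc.split() for doc in documents.values()])

query = "What position did Tom Brady play?"
bm25_scores = bm25.get_scores(query.split())
keyword_ranking = [doc_id for _, doc_id in sorted(zip(bm25_scores, documents), reverse=True)]

vector_ranking = ["d1", "d2"]   # placeholder: result of your vector similarity search
hybrid_ranking = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```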
4. Post-Retrieval: How to improve the retrieval step?
Context Enrichment – e.g. Sentence Window Retrieval
Usually we try to keep the text chunk small, so we actually find what we are looking for and keep the search quality high.
On the other hand, the LLM often finds it easier to give a correct answer when it sees not only the exact best-matching sentence but also the context around it.
Let’s look at the example in the image below:
We have a bunch of text chunks from a Wikipedia article about the German football club FC Bayern Munich. I have not tested it, but I could imagine that the first text snippet yields the highest similarity score.
Nevertheless, the information in the second chunk is probably more relevant. We want to catch that one as well. Here Context Enrichment comes into play.

There are various ways to enrich the context, in the following I will describe just two of them briefly:
- Sentence Windows Retriever
- Auto-Merging Retriever
(11) Sentence Window Retrieval
The text chunk with the highest similarity score represents the best-matching content found. Before sending the content to the LLM, we add the k sentences before and after it. This makes sense because the surrounding information is very likely connected to the middle part, and the piece of information in the middle chunk alone may be incomplete.
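A sketch of the idea, assuming the chunks are stored in their original order and the similarity search returns the index of the best-matching chunk:

```python
def sentence_window(chunks: list[str], best_match_index: int, window: int = 2) -> str:
    """Return the best-matching chunk plus `window` chunks before and after it."""
    start = max(0, best_match_index - window)
    end = min(len(chunks), best_match_index + window + 1)
    return " ".join(chunks[start:end])
```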
(12) Auto-merging Retriever (aka Parent Document Retriever)
The Auto-merging Retriever does it similarly, only this time each small text chunk is assigned certain "parent" chunks, which do not necessarily have to be the chunk before and after the text chunk found.
You can put all your creativity into how you define and identify such relationships between text chunks.
For example, when we look at technical documents or legal contracts, paragraphs or sections often refer to other parts of the contract. The challenge is to enrich paragraphs with the relevant information from other paragraphs. So we need to be able to recognize links within the text which refer to other parts of the document.

We can build on top of that concept and set up a whole hierarchy like a decision tree with different levels of Parent Nodes, Child Nodes and Leaf Nodes. We could for example have 3 levels, with different chunk sizes [LlamaIndex, 2024]:
- 1st level: chunk size 2048
- 2nd level: chunk size 512
- 3rd level: chunk size 128 (Leaf Node)
When we index the data and perform the similarity search, we work with the smallest chunks, the leaf nodes. After this step, we look up the matching parent nodes of those leaves.
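LlamaIndex ships this out of the box (hierarchical node parser plus auto-merging retriever). As a plain-Python sketch that spells out the underlying idea with two levels:

```python
def build_hierarchy(text: str, parent_size: int = 512, leaf_size: int = 128):
    """Split text into parent chunks and smaller leaf chunks that point to their parent."""
    words = text.split()
    parents, leaves = [], []
    for p_start in range(0, len(words), parent_size):
        parent_id = len(parents)
        parents.append(" ".join(words[p_start:p_start + parent_size]))
        for l_start in range(p_start, min(p_start + parent_size, len(words)), leaf_size):
            leaves.append({
                "text": " ".join(words[l_start:l_start + leaf_size]),
                "parent_id": parent_id,
            })
    return parents, leaves

# Index and search over the leaves; once a leaf matches, hand its parent chunk to the LLM:
# parent_context = parents[matched_leaf["parent_id"]]
```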
After the retrieval step, we have to interpret the content found and use it to solve the user query. To do this, we use large language models (LLMs). But which model is the right one for our use case?
5. Generation / Agents
(13) Picking the right LLM and provider – Open Source vs. Closed Source, Services vs Self-Hosted, Small model vs. huge model
Selecting the right model for your application is not as easy as you might think. It depends on the specific application and what your process looks like.
Some would say, the most obvious solution is:
simply use the most powerful one
However, there are some undeniable advantages of using smaller, cheaper and faster models.
There are some parts of the RAG process where accuracy can be a bit lower but the response time needs to be fast, e.g. when we use an agent-based approach and constantly have to make simple decisions along the pipeline.

Also, if the responses and decision-making capabilities of smaller models are powerful enough for our use case, there is no real need to go for the most powerful models. You will reduce the operating costs of your solution and the user will thank you with an improved response time of your system.
But how do we choose the model?
There are several benchmarks out there, comparing LLMs from different angles. But in the end, we just have to try them out for our RAG solution.
(14) Agents
Agents combine some of the components and execute them iteratively according to certain rules.
Agents use the so-called "chain of thought reasoning" concept, which describes the iterative process of:
- sending requests,
- interpreting the response using the LLM,
- deciding on the next step and
- acting again.
The question in the figure below shows an example that is usually too complex to answer in one go, as the answer is likely not written down anywhere.
We humans would break it down into simpler sub-questions that we can answer and then calculate the answer we are looking for. The agent does the same in this scenario.
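A very reduced sketch of such a loop (`ask_llm` and `rag_answer` are placeholders for your own LLM call and your retrieval pipeline):

```python
def agent_answer(question: str, ask_llm, rag_answer, max_steps: int = 5) -> str:
    """Iteratively break a complex question into sub-questions and answer them one by one."""
    notes = []
    for _ in range(max_steps):
        next_step = ask_llm(
            "You answer complex questions step by step.\n"
            f"Question: {question}\n"
            f"Findings so far: {notes}\n"
            "Reply either with 'SUB: <next sub-question>' or 'FINAL: <final answer>'."
        )
        if next_step.startswith("FINAL:"):
            return next_step.removeprefix("FINAL:").strip()
        sub_question = next_step.removeprefix("SUB:").strip()
        notes.append((sub_question, rag_answer(sub_question)))
    return ask_llm(f"Give your best final answer to '{question}' based on: {notes}")
```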

With an agent-based approach, we can significantly improve accuracy. Of course, there is always a trade-off: compared to one-shot prompting, we increase the required computing power and the response time.
Nevertheless, with this approach, we can surpass the accuracy of much larger models with smaller, faster models. In the end, this may be the better approach for your solution.

It always depends on your concrete use case – when we build a bot for pure information retrieval, we are always competing with the ultra-short response time of search engines.
The response time is key.
Waiting several seconds or even minutes for a result sucks.
6. Evaluation – Evaluate our RAG system
The performance of a RAG-based system highly depends on the data provided and on the LLM's ability to extract useful information from it. To achieve that, we need a few components playing together well. When we want to evaluate the whole system, we usually want to track not only the overall performance but also get a feeling for how well the individual components do what they are supposed to do.
As before, we can split this up into an evaluation of the retriever components and the generator components.
We can evaluate the search part using typical search metrics like DCG and nDCG, which evaluate the ranking quality. They check whether the really relevant content was actually classified as such in the similarity search. [EvidentlyAI, 2024][Leonie Monigatti, 2023]
!["Ideal ranking" vs. Real ranking: NDCG as a metric that helps evaluate the ranking quality - Image by the author [EvidentlyAI, 2024]](https://towardsdatascience.com/wp-content/uploads/2024/06/0Doi96e2CGK-eJWi2-1.png)
Assessing the response of the models itself is tricky.
How do we evaluate the response? Language is ambiguous, so how can we give the output something like a rating?
The easiest way would be to ask lots of people whether they think the answer is helpful – let's say we get 1,000 people to rate the LLM's answer. That would give us a good idea of how well it works. But for a production solution, it's pretty impractical.
Every time you change your RAG procedure slightly, it will affect the result. I know how hard it can be to convince domain experts to test your solution. We can do it once or twice, but not every time we change something in our pipeline.
So we need to come up with a better way. One approach is to use other LLMs instead of humans to evaluate the results – the "LLM-as-a-judge" approach.
(15) LLM-as-a-judge
The generation part can be evaluated using an LLM-as-a-judge approach. The concept is simple:
- We generate an evaluation dataset
- Then define a so-called critique agent with suitable criteria we want to evaluate,
- and finally set up a test pipeline that automatically evaluates the responses of the LLMs based on the defined criteria
![LLM-as-a-judge approach - Inspired by [Databricks, 2023]](https://towardsdatascience.com/wp-content/uploads/2024/06/0fqZ4s8uGgiulLuiN-1.png)
Step (1) – Generate a synthetic evaluation dataset
This is usually a set of (1) context, (2) question, and (3) answer triples. We do not necessarily have a full dataset in place; we can create it ourselves by providing an LLM with context and letting it guess what the matching questions could be – step by step building up a synthetic dataset.
Step (2) – Set up a so-called critique agent
A critique agent is another LLM (usually a powerful one) that we use to evaluate the response of the system based on a handful of criteria, e.g.:
- Professionalism: If the answer is written using a professional tone
An example metric definition could look like this [Databricks, 2023]:
```python
# Metric definition from the MLflow / Databricks example [Databricks, 2023].
# The make_genai_metric wrapper and the judge model are added here for illustration,
# so that the snippet is runnable as-is.
from mlflow.metrics.genai import make_genai_metric

professionalism = make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is "
        "tailored to the context and audience. It often involves avoiding overly casual language, slang, or "
        "colloquialisms, and instead using clear, concise, and respectful language."
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below are the details for different scores: "
        "- Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for "
        "professional contexts."
        "- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in "
        "some informal professional settings."
        "- Score 3: Language is overall formal but still have casual words/phrases. Borderline for professional contexts."
        "- Score 4: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 5: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for formal "
        "business or academic settings. "
    ),
    model="openai:/gpt-4",
    greater_is_better=True,
)
```
Step (3) – Test the RAG system: with the evaluation dataset we just created, we test the system
For each metric / criterion we want to test, we define a detailed description (e.g. on a scale from 1–5) and let the model decide. This is not an exact science; the model's answers will vary, but it gives us an idea of how well the system performs.
You can find some examples of what such a grading prompt could look like in the Prometheus prompt templates or in this MLflow tutorial from Databricks.
(16) RAGAs
RAGAs (Retrieval-Augmented Generation Assessment) is a framework that allows you to evaluate each component of your RAG system.
One core concept is again the idea behind "LLM-as-a-judge" / "LLM-assisted evaluation". But Ragas offers a lot more: different tools and techniques that enable continuous learning of your RAG application.
A core concept worth mentioning is "Component-Wise Evaluation". Ragas offers predefined metrics to evaluate each component of the RAG pipeline in isolation, e.g. [Ragas, 2024]:
Generation:
- faithfulness: How factually accurate is the generated answer?
- answer relevancy: How relevant is the generated answer to the question?
Retrieval:
- context precision: The signal-to-noise ratio of the retrieved context
- context recall: Can it retrieve all the relevant information required to answer the question?
Other metrics focus on evaluating the RAG pipeline end-to-end (a small usage sketch follows after this list), like:
- Answer semantic similarity
- Answer correctness
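A minimal evaluation run with Ragas could look like this (assuming the ragas and datasets packages; exact column names and imports may differ between versions):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

eval_data = Dataset.from_dict({
    "question": ["What position did Tom Brady play?"],
    "answer": ["Tom Brady played as a quarterback."],
    "contexts": [["Tom Brady is a former American football quarterback ..."]],
    "ground_truth": ["Quarterback."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```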
(17) Continued data collection from the app and users
Collecting data is the key to specifically identifying and filling the gaps in your process. Often it is the data itself in the knowledge base that you feed your system that is not good enough. To be aware of this, we need to introduce methods that make it as easy as possible for users to provide this feedback.

Other interesting metrics to track are:
- Timestamp, response time of every step along the RAG pipeline (vectorization, similarity search, LLM response, etc.)
- Used prompt template
- Relevant documents found, version of the documents
- LLM response
- …
A RAG system consists of several processing steps. To optimize performance and response time, we need to know where the bottleneck is. That's why we monitor the individual steps, so that we can work on the biggest levers.
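A simple way to start is to time and log every step of the pipeline. A bare-bones sketch (in production you would send this to your logging or observability stack instead of printing it):

```python
import json
import time


def timed(step_name: str, fn, *args, **kwargs):
    """Run one pipeline step and log how long it took."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    duration_ms = (time.perf_counter() - start) * 1000
    print(json.dumps({"step": step_name, "duration_ms": round(duration_ms, 1)}))
    return result

# Usage along the pipeline (embed_query, search and generate are your own functions):
# query_vector = timed("vectorization", embed_query, question)
# documents = timed("similarity_search", search, query_vector)
# answer = timed("llm_response", generate, question, documents)
```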

Summary
There is no clear path to follow. It’s a lot of trial and error. As with any other data science use case, we have a certain set of tools that we can use to try and find a solution to our specific problem.
That’s what makes these projects fun in the first place. Wouldn’t it be boring if there was a static cookbook to follow?
Let me know your thoughts in the comments
You can find me on LinkedIn!