
I've recently started to favor Graph RAGs over vector store-backed ones.
No offense to vector databases; they work fantastically in most cases. The caveat is that you need explicit mentions in the text to retrieve the correct context.
We have workarounds for that, and I’ve covered a few in my previous posts.
Building RAGs Without A Retrieval Model Is a Terrible Mistake
For instance, ColBERT and Multi-representation are helpful retrieval models we should consider when building RAG apps.
GraphRAGs suffer less from retrieval issues (I didn't say they don't suffer at all). Whenever the retrieval requires some reasoning, GraphRAG performs extraordinarily well.
Providing relevant context addresses a key problem in LLM-based applications: hallucination. However, it does not eliminate hallucinations altogether.
When you can’t fix something, you measure it. And that’s the focus of this post. In other words, how do we evaluate RAG apps?
But before that, why do LLMs lie in the first place?
Why do LLMs hallucinate (even RAGs)?
Language models sometimes lie, and sometimes they're simply inaccurate. This is primarily due to two reasons.
The first is that the LLM doesn't have enough context to answer. This is why Retrieval-Augmented Generation (RAG) came into existence. RAGs provide the LLM with context it hasn't seen in its training data.
Some models answer well within the provided context, and others don't. For instance, Llama 3.1 8B works fine when you provide context for generating answers, while DistilBERT doesn't.
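To make the idea of providing context concrete, here's a minimal sketch of how a RAG app hands retrieved context to the model. The build_prompt helper and the snippets below are hypothetical placeholders, not from any particular framework.

def build_prompt(question: str, context_chunks: list[str]) -> str:
    # Combine the retrieved chunks and the user question into one grounded prompt.
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

retrieved = [
    "XYZ Corporation announced its plans to acquire ABC Enterprises.",
    "The acquisition was completed on January 15, 2024.",
]
print(build_prompt("When did XYZ complete the acquisition of ABC?", retrieved))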
The second reason is that the answer requires some reasoning. Not every LLM is good at reasoning, and each does it differently. For context, Llama 2 13B doesn't perform well on reasoning tasks compared to GPT-4o.
Of course, these models are from different generations, and it wouldn’t be appropriate to compare them side by side. But that’s also the point I’m trying to make – if you don’t choose your model wisely, you can expect it to produce hallucinated answers.
How to evaluate hallucination in RAGs
Now that we’re convinced that every LLM-based app can hallucinate and that measuring is the only way to keep it in check, how can we do that?
A few LLM evaluation frameworks have emerged recently. Two of them, RAGAS and DeepEval, are particularly popular. Both tools are open source and free to use, although paid versions exist.
I'll be using DeepEval in this post. A quick note: I am not affiliated with DeepEval; I simply like it.
Let’s start by installing the tool. You can get it from the PyPI repository.
pip install -qU deepeval ragas
Because LLM outputs are open-ended, we cannot test LLMs the way we test software. LLM-generated responses are never fully predictable, so they require another LLM to evaluate them.
The LLM evaluator needs to be a competent model. I’d choose GPT-4o-mini for this. It’s cost-effective and accurate, but you can experiment.
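Before running the examples below, DeepEval needs access to the evaluator model. Here's a minimal setup sketch, assuming the OpenAI API is used as the backend; the key value is a placeholder, and GEval is the metric class the examples below rely on.

import os
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# DeepEval's built-in metrics call the evaluator LLM through the OpenAI API by default.
# Replace the placeholder with your own key, or export it in your shell instead.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, not a real key

# Most metrics accept a `model` argument, so the evaluator can be swapped easily.
correctness = GEval(
    name="Correctness",
    criteria="Check whether the actual output agrees with the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    model="gpt-4o-mini",  # the evaluator model; experiment with others if you like
)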
We can evaluate RAGs in two different ways. The first is a prompt-based technique called G-Eval. The second is RAGAS, which allows us to evaluate RAGs systematically.
Both RAGAS and G-Eval are incredibly helpful for evaluating RAGs. Between the two, I'd use RAGAS as the default. Since its metrics are computed in a fixed way, the results are easy to compare across evaluations. However, your evaluation criteria can sometimes be more complex than what the RAGAS framework covers. In such situations, you can dictate the evaluation steps yourself using G-Eval.
G-Eval: The more versatile evaluation framework
As mentioned earlier, G-Eval is a prompt-based evaluation framework, which means we get to tell the LLM how to evaluate our RAG. We can do this by setting either the criteria or the evaluation_steps parameter.
Here's an example using the criteria parameter.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

test_case = LLMTestCase(
    input="When did XYZ, Inc complete the acquisition of ABC, Inc?",
    actual_output="XYZ, Inc completed the acquisition of ABC, Inc on January 15, 2025, solidifying its market leadership.",
    expected_output="XYZ, Inc completed the acquisition of ABC, Inc on January 10, 2025.",
)

correctness_metric_criteria = GEval(
    name="Correctness with Criteria",
    criteria="Verify if the actual output provides a factually accurate and complete response to the expected output without contradictions or omissions.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

correctness_metric_criteria.measure(test_case)

print("Score:", correctness_metric_criteria.score)
print("Reason:", correctness_metric_criteria.reason)
>> Score: 0.6683615588714628
>> Reason: The Actual Output accurately states the acquisition and maintains a similar context, but the date is incorrect compared to the Expected Output.
The above example verifies that the actual output aligns with the expected output. In this case, the date was different, but everything else was correct.
The good thing about G-Eval is that we get to define how the evaluation should work. If we're okay with a date difference of less than 10 days, we can say so in the criteria. Here's how the new evaluation looks.
...
correctness_metric_criteria = GEval(
    ...
    criteria="Verify if the actual output is correct and the date is not more than 10 days apart from the expected output.",
    ...
)
...
>> Score: 0.8543395625001086
>> Reason: The Actual Output provides an accurate acquisition date of XYZ, Inc completing the acquisition of ABC, Inc and matches the Expected Output except for a 5-day difference in dates, which is within the acceptable range.
As you've noticed, this raises the score from 0.66 to 0.85 because we've allowed a 10-day tolerance for the date match.
The above example only checks whether the LLM's response is correct with respect to an expected output. To evaluate a RAG, however, we also need to check the retrieved context.
Here's a RAG evaluation example. Notice that we've used the evaluation_steps parameter instead of criteria.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define the retrieval context with multiple relevant pieces of information
retrieval_context = [
    "XYZ Corporation announced its plans to acquire ABC Enterprises in a deal valued at approximately $4.5 billion.",
    "The merger is expected to consolidate XYZ's market position and expand its reach into new domains.",
    "The acquisition was completed on January 15, 2024.",
    "Post-acquisition, XYZ, Inc. aims to integrate ABC, Inc.'s technologies to enhance its product offerings.",
    "The regulatory bodies approved the acquisition without any objections."
]

# Create the test case with input, actual output, expected output, and retrieval context
test_case = LLMTestCase(
    input="When did XYZ, Inc. complete the acquisition of ABC, Inc?",
    actual_output="XYZ, Inc. completed the acquisition of ABC, Inc. on January 17, 2024, solidifying its market leadership.",
    expected_output="XYZ, Inc. completed the acquisition of ABC, Inc. on January 15, 2024.",
    retrieval_context=retrieval_context
)

# Define the correctness metric with evaluation steps
correctness_metric_steps = GEval(
    name="Correctness with Evaluation Steps",
    evaluation_steps=[
        "Verify that the 'retrieval_context' has sufficient information to respond to the input.",
        "Verify that the 'actual_output' provides a completion date no more than 10 days apart from the date stated in the 'retrieval_context'.",
        "Ensure that the 'actual_output' matches the 'expected_output' in terms of factual accuracy.",
        "Check for any contradictions or omissions between the 'actual_output' and the 'expected_output'."
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT
    ],
)

# Measure the test case using the defined metric
correctness_metric_steps.measure(test_case)

# Print the evaluation score and reason
print("Score:", correctness_metric_steps.score)
print("Reason:", correctness_metric_steps.reason)
>> Score: 0.7907626389536403
>> Reason: The retrieval_context provides accurate completion info as January 15, 2024. The actual_output date is close but slightly off, being two days later than in the expected_output. No other significant factual inaccuracies or contradictions are present.
In the above example, we've used evaluation_steps instead of criteria. This is optional: explaining the evaluation process in a single criteria statement would work just fine. But it's usually easier to break it down into steps.
Regardless of the number of steps, G-Eval produces a single score for the evaluation. Give it one evaluation step or a dozen, and it will do all the work and boil the result down to one number.
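Because it's a single number, you can also attach a threshold and treat the evaluation as a pass/fail test. Here's a short sketch, assuming the test_case from the previous example is still in scope; the 0.8 threshold is an arbitrary example value.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

strict_correctness = GEval(
    name="Strict Correctness",
    criteria="Verify that the actual output states the same acquisition date as the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.8,  # the single G-Eval score must reach 0.8 for the test to pass
)

strict_correctness.measure(test_case)
print("Passed:", strict_correctness.is_successful())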
A single score is easy to understand and often sufficient. But what if we need to test a RAG pipeline component by component? This is where RAGAS comes into play.
RAGAS: Standard and more granular evaluation
RAGAS combines four individual evaluations of a RAG pipeline: answer relevancy, faithfulness, contextual recall, and contextual precision. The RAGAS score is simply the average of these four.
Here's an example. I've used both RagasMetric and the individual component metrics below. In a real evaluation, you can use either the individual metrics or RagasMetric alone.
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics.ragas import RagasMetric

# Individual metrics that make up RagasMetric
from deepeval.metrics.ragas import RAGASAnswerRelevancyMetric
from deepeval.metrics.ragas import RAGASFaithfulnessMetric
from deepeval.metrics.ragas import RAGASContextualRecallMetric
from deepeval.metrics.ragas import RAGASContextualPrecisionMetric

# Define the retrieval context with multiple relevant pieces of information
retrieval_context = [
    "XYZ Corporation announced its plans to acquire ABC Enterprises in a deal valued at approximately $4.5 billion.",
    "The merger is expected to consolidate XYZ's market position and expand its reach into new domains.",
    "The acquisition was completed on January 15, 2024.",
    "Post-acquisition, XYZ, Inc. aims to integrate ABC, Inc.'s technologies to enhance its product offerings.",
    "The regulatory bodies approved the acquisition without any objections."
]

# Create the test case with input, actual output, expected output, and retrieval context
test_case = LLMTestCase(
    input="When did XYZ, Inc. complete the acquisition of ABC, Inc?",
    actual_output="XYZ, Inc. completed the acquisition of ABC, Inc. on January 15, 2025, solidifying its market leadership.",
    expected_output="XYZ, Inc. completed the acquisition of ABC, Inc. on January 15, 2024",
    retrieval_context=retrieval_context
)

# Initialize RagasMetric and its individual components with a threshold and the evaluator model
ragas_metric = RagasMetric(threshold=0.5, model="gpt-4o-mini")
ragas_answer_relevancy_metric = RAGASAnswerRelevancyMetric(threshold=0.5, model="gpt-4o-mini")
ragas_faithfulness_metric = RAGASFaithfulnessMetric(threshold=0.5, model="gpt-4o-mini")
ragas_contextual_recall_metric = RAGASContextualRecallMetric(threshold=0.5, model="gpt-4o-mini")
ragas_contextual_precision_metric = RAGASContextualPrecisionMetric(threshold=0.5, model="gpt-4o-mini")

# Measure the test case with each metric
ragas_metric.measure(test_case)
ragas_answer_relevancy_metric.measure(test_case)
ragas_faithfulness_metric.measure(test_case)
ragas_contextual_recall_metric.measure(test_case)
ragas_contextual_precision_metric.measure(test_case)

# Print the combined evaluation score
print("Score:", ragas_metric.score)

# Alternatively, evaluate test cases in bulk
result = evaluate([test_case], [
    ragas_metric,
    ragas_answer_relevancy_metric,
    ragas_faithfulness_metric,
    ragas_contextual_recall_metric,
    ragas_contextual_precision_metric,
])
>>Metrics Summary
- ✅ RAGAS (score: 0.6664463603011416, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: None, error: None)
- ✅ Answer Relevancy (ragas) (score: 0.9988984715390413, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: None, error: None)
- ❌ Faithfulness (ragas) (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: None, error: None)
- ✅ Contextual Recall (ragas) (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: None, error: None)
- ❌ Contextual Precision (ragas) (score: 0.3333333333, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: None, error: None)
To understand this, let’s revisit how a RAG app responds to input. The first step is fetching contextual information. The LLM then uses the retrieved context to answer the user’s query.
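As a rough mental model, here's a schematic sketch of that two-step flow; retrieve() and generate() are hypothetical stubs standing in for your retriever and your LLM call.

def retrieve(query: str) -> list[str]:
    # Hypothetical retriever stub: a real app would query a vector or graph store here.
    return ["The acquisition was completed on January 15, 2024."]

def generate(query: str, context: list[str]) -> str:
    # Hypothetical generator stub: a real app would call the LLM with the context here.
    return "XYZ, Inc. completed the acquisition on January 15, 2024."

def answer(query: str) -> str:
    retrieval_context = retrieve(query)         # step 1: contextual precision/recall judge this part
    return generate(query, retrieval_context)   # step 2: faithfulness/answer relevancy judge this part

print(answer("When did XYZ, Inc. complete the acquisition of ABC, Inc?"))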
Contextual Precision evaluates whether the statements relevant to the input are ranked higher in the retrieved context. In our example, the statement containing the acquisition date is ranked #3 in the retrieved context. Hence, it gets a low score (0.33) in the evaluation.
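To see where the 0.33 comes from, here's a rough sketch of the calculation, assuming the RAGAS-style definition where precision@k is averaged over the positions of relevant chunks; the exact implementation may differ slightly.

# Only the third of the five retrieved chunks mentions the completion date.
relevance = [False, False, True, False, False]

precisions_at_relevant_ranks = [
    sum(relevance[: k + 1]) / (k + 1)   # precision@k: relevant chunks seen so far / k
    for k, is_relevant in enumerate(relevance)
    if is_relevant
]
score = sum(precisions_at_relevant_ranks) / len(precisions_at_relevant_ranks)
print(round(score, 2))  # 0.33, matching the reported Contextual Precision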
Contextual Recall tests if the retrieved context is sufficient to get an answer closer to the expected output. In our example, the acquisition date is available in the retrieved context. Hence, the score for this metric is 1.0.
Faithfulness is the metric that measures hallucinations in responses. It checks the correctness of the actual output with respect to the retrieved context. In our example, the retrieved context clearly states that the acquisition was completed in 2024, but the output says 2025, a full year off. Hence, it gets a faithfulness score of 0.
Finally, the answer relevancy metric measures whether the response generated was at least an attempt to answer the right question (input). Even though factually incorrect, the LLM answered the right question in our example, so it received a 0.99 score.
The RAGAS score averages these four individual metrics into a single number.
The downside of RAGAS is that you can sometimes get a pass even when the response isn't correct. That's precisely the case in our example: getting the acquisition date wrong by a year isn't negligible, yet faithfulness was only one of the four metrics considered, so the overall score of 0.66 still sits well above the 0.5 threshold.
Thus, I'd suggest using the individual metrics to understand each component rather than scoring the whole system with a single number like RAGAS. It makes debugging the app much easier.
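For example, after measuring, you can inspect each component's score on its own instead of relying only on the combined number. A small sketch, reusing the metric objects from the RAGAS example above:

component_metrics = {
    "Answer Relevancy": ragas_answer_relevancy_metric,
    "Faithfulness": ragas_faithfulness_metric,
    "Contextual Recall": ragas_contextual_recall_metric,
    "Contextual Precision": ragas_contextual_precision_metric,
}

for name, metric in component_metrics.items():
    status = "ok" if metric.score >= metric.threshold else "needs attention"
    print(f"{name}: {metric.score:.2f} ({status})")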
Final thoughts
Evaluating LLM-based apps is different from evaluating ordinary software, and evaluating a RAG app is even more challenging.
This is mainly because an LLM's responses aren't deterministic, and it may sometimes hallucinate.
G-Eval and RAGAS are popular frameworks for evaluating RAG applications. Besides covering many aspects of a RAG pipeline, they also surface LLM hallucinations.
Once you discover that your app suffers from hallucinations, you can rework the retrieval workflow to fetch better contextual information (perhaps with a graph DB) or switch the model.