
TLDR
This article explains the concept, and a low-abstraction implementation, of using an LLM judge to evaluate another LLM judge. The purpose is to improve the evaluation process of LLM applications by reducing the cases where an LLM judge fails to make a fair assessment.
Table of Contents
- Introduction
- Research Question
- Experiment Design
- Implementation
- Experiment Results
- Conclusions
👉 Introduction
❇️ When building LLM applications, ensuring consistent and reliable performance is one of the most frequently asked questions in QA (Quality Assurance). Due to their non-deterministic nature, LLMs can produce highly variable outputs, so rigorous evaluation of LLM applications is essential. Without good evaluation methods, we must accept a certain level of risk (e.g. customer complaints) because we cannot promptly identify unexpected behaviors in our LLM applications. Common LLM evaluation methodologies include heuristic evaluations, LLM-as-judge, and human review.
- 📍 Heuristic evaluators: e.g. a function that checks whether the output equals "yes" or whether the output is greater than 10 (see the sketch after this list).
- 📍 LLM-as-judge: using an LLM to judge the output of another LLM.
- 📍 Human judge: employing a human to evaluate the LLM’s output.
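For illustration, heuristic evaluators of the kind mentioned above can be plain Python functions; here is a minimal sketch (the function names are mine, not from the original code):
# Minimal sketch of heuristic evaluators (illustrative only).
def output_is_yes(output: str) -> bool:
    # Pass if the model answered exactly "yes" (case-insensitive).
    return output.strip().lower() == "yes"

def output_greater_than_ten(output: str) -> bool:
    # Pass if the output parses as a number greater than 10.
    try:
        return float(output.strip()) > 10
    except ValueError:
        return False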
❇️ Employing an LLM judge is a top choice as it can be automated and is much cheaper (and more feasible) than human judges. Moreover, LLM judges can handle free-text outputs, unlike heuristic evaluators. However, the non-deterministic nature of LLMs means that even with controlled parameters, outputs may vary, raising concerns about the reliability of these judgments.
💥 The concern we will address today:
When opting for an LLM judge to evaluate our LLM application, we should also question the integrity of the LLM judge itself.
✅ So, the experiment described below aims to determine whether we can use an LLM judge (let’s call it "Supreme LLM Judge") to evaluate the judgments of another LLM judge without any ground truth reference (reference-free evaluation). The ultimate goal is to find ways to improve the first LLM judge.
The diagram below illustrates this framework.

👉 Research Question
Can "the evaluation of an LLM application by an LLM judge" be audited by another LLM judge for the continuous improvement of the evaluation process?
👉 Experiment Design
🔹 One important constraint in this experiment is that both LLM judges evaluate without a ground-truth reference. Reference-based evaluation would provide the judges with the correct answers and ask them to compare; however, in most scenarios we do not have human-curated datasets, so reference-free evaluation is the preferred approach.
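To make the distinction concrete, the only difference lies in what the judge gets to see; here is a rough sketch of the two input templates (illustrative only, not the prompts used in this experiment):
# Reference-based evaluation: the judge compares against a curated correct answer.
reference_based_input = """Question: {question}
Ground-Truth Answer: {ground_truth}
Student's Answer: {predicted_answer}"""

# Reference-free evaluation (used here): the judge only sees context, question, and answer.
reference_free_input = """Context: {context}
Question: {question}
Student's Answer: {predicted_answer}"""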
🔹 The proposed framework improves the conventional single-layer evaluation of LLM applications by adding a Supreme LLM Judge. We can take two approaches to this framework.
- Approach 1: An LLM Application is evaluated by an LLM Judge, whose judgment is afterward reviewed by a Supreme LLM Judge (reference-free). Disagreements or anomalies are subsequently reviewed by a human.

- Approach 2: Both the LLM Judge and the Supreme LLM Judge independently evaluate the LLM Application (reference-free). The judgments are then compared, and any discrepancies are flagged for human review.
Approach 1 will be discussed further in this article.
👉 Implementation
🔹 My implementation of this framework focuses on the high-level concept rather than fine-tuning for perfect performance.
No LLM evaluation libraries or platforms (e.g., LangChain, LangSmith, LangFuse) were used. The code has low abstraction, allowing readers to follow along without getting lost in intricate details.
Since referencing the LLM judge and the Supreme LLM Judge can be hard to follow, let’s assign nominal roles for the components in the evaluation setup:
- LLM Application ➡️ The Student
- LLM Judge ➡️ The Teacher
- Supreme LLM Judge ➡️ The Reviewer
💥 The complete code can be found in this repository.
# LLM Application
def trigger_llm_app(context: str, question: str):
    # openai_client is an OpenAI client instance created elsewhere (see the linked repository).
    fmt_context_and_question = f"""Context: {context}\nQuestion: {question}"""
    messages = [
        llm_app_prompt,
        {"role": "user", "content": fmt_context_and_question},
    ]
    response = openai_client.chat.completions.create(messages=messages,
                                                     model="gpt-3.5-turbo")
    return response.choices[0].message.content
# LLM Judge
def eval_llm_app(context: str, question: str, predicted_answer: str):
    fmt_input = f"""Context: {context}\nQuestion: {question}\nStudent's Answer: {predicted_answer}"""
    messages = [
        llm_judge_prompt,
        {"role": "user", "content": fmt_input},
    ]
    response = openai_client.chat.completions.create(messages=messages,
                                                     model="gpt-3.5-turbo")
    return response.choices[0].message.content
# Supreme LLM Judge
def eval_llm_judge(context: str, question: str, student_answer: str, teacher_grading: str):
    fmt_input = f"""Context: {context}\nQuestion: {question}\nStudent's Answer: {student_answer}\nTeacher's Grading: {teacher_grading}"""
    messages = [
        supreme_llm_judge_prompt,
        {"role": "user", "content": fmt_input},
    ]
    response = openai_client.chat.completions.create(messages=messages,
                                                     model="gpt-4")
    return response.choices[0].message.content
A subtle but important decision in this experiment design is to use GPT-4 as the Supreme LLM Judge, while the LLM Application and LLM Judge use GPT-3.5-turbo. This ensures that the Supreme LLM Judge’s evaluations are more robust and reliable (read more about the comparison here).
The prompts for each of the components in this experiment are as follows. You can see that I used the few-shot prompting technique to improve the consistency of the evaluation outputs.
llm_app_prompt = {
    "role": "system",
    "content": """You are a helpful assistant. Please use step-by-step reasoning to address questions based on the specific context provided."""}
llm_judge_prompt = {
    "role": "system",
    "content": """You are a math teacher tasked with grading a student's answer.
Evaluate the student's response considering the context of the question, the correctness of the answer, and the reasoning provided.
Conclude with a grade: assign '0' if the answer and the reasoning is incorrect and '1' if it is correct.
Your grading output should be strictly in this format (no other words allowed): 'Grade: 0' or 'Grade: 1'.
Below are examples for your reference:
- Example:
Question: How long does it take to drive 100 kilometers at 50 kilometers per hour?
Student's Answer: To find the time, divide the distance by the speed: 100 km / 50 km/h = 2 hours.
Grade: 1
- Example:
Question: Calculate the area of a square with a side length of 5 meters.
Student's Answer: Given that the side length of the square is 5 meters, the answer is: 5*4=20 square meters.
Grade: 0
- Example:
Question: How many seconds are in an hour?
Student's Answer: 3600 seconds
Grade: 1
- Example:
Question: Given two sets, Set A containing the elements 1, 2, and 3, and Set B containing the elements 3, 4, and 5, what is the intersection of Set A and Set B?
Student's Answer: The element that is common to both sets is 1.
Grade: 0
"""
}
supreme_llm_judge_prompt = {
    "role": "system",
    "content": """You are an exam reviewer tasked with evaluating teachers' grading. Your job is to review the grade given by the teacher to a student's answer and assess its correctness.
Important: Your review is of the teacher's grading, not the student's answer.
Output Format: Your review output should be strictly in this format (no other words allowed): 'Correctness: 0' or 'Correctness: 1'.
Below are examples for your reference:
- Example:
Question: How long does it take to drive 100 kilometers at 50 kilometers per hour?
Student's Answer: To find the time, divide the distance by the speed: 100 km / 50 km/h = 2 hours.
Grade: 1
Correctness: 1
- Example:
Question: Calculate the area of a square with a side length of 5 meters.
Student's Answer: Given that the side length of the square is 5 meters, the answer is: 5*4=20 square meters.
Grade: 0
Correctness: 1
- Example:
Question: How many seconds are in an hour?
Student's Answer: 3600 seconds
Grade: 0
Correctness: 0
- Example:
Question: Given two sets, Set A containing the elements 1, 2, and 3, and Set B containing the elements 3, 4, and 5, what is the intersection of Set A and Set B?
Student's Answer: The element that is common to both sets is 1.
Grade: 1
Correctness: 0
"""
}
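Because both judges are constrained to a strict output format, their verdicts can be parsed downstream with a small helper; here is a minimal sketch (hypothetical, not part of the original code):
# Hypothetical helper to parse the strict 'Grade: X' / 'Correctness: X' outputs into 0 or 1.
def parse_verdict(output: str) -> int:
    value = output.strip().rsplit(":", 1)[-1].strip()
    if value not in ("0", "1"):
        raise ValueError(f"Unexpected judge output: {output!r}")
    return int(value)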
🔹 The question we asked the LLM application:
In a group of 30 people who can speak either English or German, 10 can speak both, and 25 can speak German. How many speak only English?
The LLM Application must not only provide the correct answer but also explain its reasoning. The LLM Judge then evaluates this output – both the final answer and the reasoning. Finally, the Supreme LLM Judge will evaluate the evaluation given by the LLM Judge.
You may notice that I intentionally left redundant information in the context of this question to challenge the LLM Application.
🔹 I ran this evaluation cycle 100 times, using the same question and the default temperature of the OpenAI API, to examine the performance of the judges.
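The evaluate helper called in the loop below is defined in the linked repository; here is a minimal sketch of what it could look like, simply chaining the three functions above:
# Sketch of the evaluate helper (assumption: the real one in the repository works similarly,
# and may additionally normalize the judges' outputs to bare '0'/'1' as in the excerpts shown later).
def evaluate(context: str, question: str) -> dict:
    student_answer = trigger_llm_app(context, question)
    teacher_grading = eval_llm_app(context, question, student_answer)
    reviewer_feedback = eval_llm_judge(context, question, student_answer, teacher_grading)
    return {
        "student_answer": student_answer,
        "teacher_grading": teacher_grading,
        "reviewer_feedback": reviewer_feedback,
    }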
if __name__ == "__main__":
    context = "In a group of 30 people who can speak either English or German, 10 can speak both, and 25 can speak German."
    user_question = "How many speak only English?"
    list_results = []
    for i in range(100):
        print(f"===> Iteration {i+1}")
        list_results.append(evaluate(context, user_question))
👉 Experiment Results
💥 Once again, before reading our results, just a reminder of our definitions:
- LLM Application ➡️ The Student
- LLM Judge ➡️ The Teacher
- Supreme LLM Judge ➡️ The Reviewer
💥 IMPORTANT: We define a "positive case" as one where the Teacher’s evaluation is wrong.
We will measure the performance of the Reviewer (Supreme LLM Judge) by the following metrics.
- reviewer_recall: measures the ability of the Reviewer to identify all the positive cases. It indicates how effectively the Reviewer catches the Teacher’s mistakes.
- reviewer_precision: the proportion of the cases the Reviewer flags as positive that are actually positive.
There is always a trade-off between precision and recall: the more true positives you try to capture (while tolerating more false positives), the less precise your model becomes.
reviewer_precision: 0.43
reviewer_recall: 0.70
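For reference, here is a minimal sketch of how these two metrics could be computed from the logged results, assuming each record also carries a human label and that the Reviewer flags a Teacher mistake with a '0' verdict (the field names follow the result excerpts shown below):
# Sketch of the metric computation (assumes records shaped like the excerpts below,
# plus a human_grading field added by the human evaluator).
def reviewer_metrics(results: list) -> tuple:
    tp = fp = fn = 0
    for r in results:
        teacher_wrong = r["teacher_grading"] != r["human_grading"]  # actual positive case
        reviewer_flagged = r["reviewer_feedback"] == "0"            # Reviewer predicts a mistake
        if reviewer_flagged and teacher_wrong:
            tp += 1
        elif reviewer_flagged and not teacher_wrong:
            fp += 1
        elif teacher_wrong and not reviewer_flagged:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall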
Let’s revisit our Research Question to see how the Supreme LLM Judge (the Reviewer) can help to audit the work of the LLM Judge (the Teacher).
The Supreme LLM Judge can identify 70% of the instances where the LLM Judge made incorrect evaluations. By analyzing these identified cases, we can understand why the LLM Judge was confused and improve our LLM Application’s evaluation process.
😮 You might be curious about the wrong judgments from the LLM Judge that the Supreme LLM Judge has captured.
Below are examples where the Reviewer successfully identified the Teacher’s grading errors. By looking into these examples, we can study why the LLM Judge did not perform well.
👋 Before reading the examples below, a note about the "human evaluator":
- Yes, I (the Author) am the human evaluator!
- In the context of this experiment, the human grades the student’s answers as correct if the reasoning is sound, even if it is lengthy and contains redundant calculations.
- Please note that this human_grading applies to the Student’s answers (not to the Teacher’s grading).
{
"student_answer": "1. We know that there are 10 people who can speak both English and German.n2. There are 25 people who can speak German.n3. To find out how many speak only English, we need to subtract the number of people who can speak both English and German from the total number of people who can speak German.n4. So, the number of people who speak only English is 25 (people who speak German) - 10 (people who speak both) = 15 people.n5. Therefore, there are 15 people who speak only English in the group of 30 people.",
"teacher_grading": "1",
"reviewer_feedback": "0",
"human_grading": "0"
},
{
"student_answer": "To find out how many people speak only English, we first calculate the total number of people who can speak English by subtracting the number of people who can speak both English and German (10) from the total number of people who can speak German (25). This is because those 10 people are counted twice - once in the English-speaking group and once in the German-speaking group.nnTotal number of people who can speak English = Total number of people who can speak German - Number of people who can speak both English and GermannTotal number of people who can speak English = 25 - 10nTotal number of people who can speak English = 15nnTherefore, there are 15 people who speak only English.",
"teacher_grading": "1",
"reviewer_feedback": "0",
"human_grading": "0"
},
{
"student_answer": "Given that 10 people can speak both English and German, and 25 people can speak German, it means that 25 - 10 = 15 people speak only German.nSince there are 30 people in total and we have now found the number of people who speak only German, the number of people who speak only English is 30 - 15 - 10 = 5.nTherefore, 5 people speak only English.",
"teacher_grading": "0",
"reviewer_feedback": "0",
"human_grading": "1"
},
{
"student_answer": "To find out how many people speak only English, we can start by calculating the total number of people who can speak English. nnFrom the information provided, we know that there are 10 people who can speak both English and German. Since there are a total of 30 people in the group, and 25 of them can speak German, the remaining 5 people must be the ones who speak only English.nnTherefore, the number of people who speak only English is 5.",
"teacher_grading": "0",
"reviewer_feedback": "0",
"human_grading": "1"
},
{
"student_answer": "To find out how many people speak only English, we will first subtract the total number of German speakers from the total number of people in the group:nnTotal number of people = 30nNumber of German speakers = 25nnTherefore, the number of people who speak only English is:n30 - 25 = 5nnSo, there are 5 people in the group who speak only English.",
"teacher_grading": "0",
"reviewer_feedback": "0",
"human_grading": "1"
},
👉 Conclusions
✅ The evaluation of the LLM Judge by the Supreme LLM Judge gave us insights into the effectiveness of a multi-layered evaluation system for LLM applications. The Supreme LLM Judge achieved a recall of 70%, i.e. it identified 70% of the incorrect evaluations made by the LLM Judge. This is a respectable result for a reference-free, proof-of-concept implementation. The captured incorrect evaluations can help us derive solutions to continuously improve our LLM Judge.

✅ The difficulties we encounter when using an LLM judge to evaluate an LLM application also apply to using a Supreme LLM Judge to judge an LLM Judge (tongue twister!). Besides aiming for high evaluation accuracy, ensuring consistent evaluation outputs is also a big challenge.
✅ Given that the second layer of evaluation requires considerable human assessment effort, this approach is better suited to audits or periodic offline evaluations than to ongoing evaluations.
✅ It is also worth noting that using a random sampling method for evaluation might be a good approach to save resources. By strategically deploying a second layer of LLM evaluation with human reviewers, we can enhance the overall reliability and accuracy of the evaluation system, contributing to a high-performing LLM Application.
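As a rough illustration of the sampling idea, the second evaluation layer could be applied to only a random subset of the Teacher’s judgments; here is a minimal sketch (hypothetical, with an arbitrary sample rate):
import random

# Hypothetical: audit only a random fraction of the Teacher's judgments with the
# Supreme LLM Judge (and, downstream, a human) to save resources.
def sample_for_audit(results: list, sample_rate: float = 0.1) -> list:
    if not results:
        return []
    k = max(1, int(len(results) * sample_rate))
    return random.sample(results, k)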
About me
I am Daniel Le, based in Berlin. I currently work in the fields of Machine Learning and Data Engineering.
I am interested in new technologies and how they can be implemented to solve real-world problems.
Should you have any inquiries or wish to discuss these interests further, please do not hesitate to connect with me on LinkedIn.