
The Challenge
I was recently given a challenge: develop a chatbot that uses an LLM to answer questions about pre-defined student data in JSON format. The JSON is deliberately poorly structured and in places deeply nested, perhaps representing a database call from a legacy system.
Why not simply upload the JSON to ChatGPT?
Simply calling ChatGPT on the data may not yield the best results, and it is not scalable. Because JSON is highly structured and often nested, complex questions about the intricate relationships between data points pose a higher risk of hallucinated answers. I set off to investigate an advanced agent that could use purpose-built tools to carry out the task and deal with these intricacies.
Choosing the tools
One of the most popular LLM frameworks is LangChain, which provides advanced prompting tools. It also supports running at scale through LangSmith, a service that (in the simplest of terms) provides a dashboard to monitor LLM calls and agent runs, with a full breakdown of action chains.
Upon investigating the latest docs, I found that LangChain provides JsonToolkit, which is specifically designed to handle JSON (https://python.langchain.com/v0.2/docs/integrations/toolkits/json/), so I set about utilising this tool for the job.
The Code
For complete control over the path of the agent, we first need to ensure that it is working with the right student ID. This removes the possibility of the agent simply matching the first name it finds and returning that student's data when there are duplicates. It would not be unusual at any school to have two students with the same name.
from langchain_community.agent_toolkits import JsonToolkit, create_json_agent
from langchain_community.tools.json.tool import JsonSpec
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
import json
import os
import datetime
# Load the environment variables
load_dotenv()
# Set up LangSmith for monitoring and tracing, following the documentation: https://docs.smith.langchain.com/
# Once these environment variables are set, LangSmith automatically starts monitoring and tracing.
LANGCHAIN_TRACING_V2 = os.getenv("LANGCHAIN_TRACING_V2")
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
LANGCHAIN_PROJECT = os.getenv("LANGCHAIN_PROJECT")
# Load the student data from JSON
with open("school_dummy_data.json") as file:
    data = json.load(file)
# Set up the LLM model, we will be using the latest OpenAI model, GPT-4o for the best reasoning and understanding.
llm = ChatOpenAI(model="gpt-4o", temperature=0.9)
# Set up the JsonToolkit and JsonSpec for the agent using the latest toolkit: https://python.langchain.com/v0.2/docs/integrations/toolkits/json/
json_spec = JsonSpec(dict_=data, max_value_length=4000)
json_toolkit = JsonToolkit(spec=json_spec)
# Create the JSON agent executor and pass the LLM and JsonToolkit
json_agent_executor = create_json_agent(
    llm=llm, toolkit=json_toolkit, verbose=True
)
# Define the current term by month name with datetime, this is crucial for the agent to understand dates and times that might be associated with the student data.
current_month = datetime.datetime.now().strftime("%B")
# Define a function to find the student ID based on the student name, this is crucial for the agent to provide complete accuracy when there are multiple students with the same name.
def find_student_id(name):
    student_ids = []
    for student in data["students"]:
        if student["name"].lower() == name.lower():
            student_ids.append(student["studentId"])
    if len(student_ids) == 0:
        return None
    elif len(student_ids) == 1:
        return student_ids[0]
    else:
        print(f"There are multiple students found with the name '{name}':")
        for i, student_id in enumerate(student_ids, start=1):
            print(f"{i}. Student ID: {student_id}")
            for student in data["students"]:
                if student["studentId"] == student_id:
                    print(f" Date of Birth: {student['dob']}")
            print()
        while True:
            try:
                choice = int(input("Please select which option shown corresponds to the student you want to query: "))
                if choice < 1 or choice > len(student_ids):
                    raise IndexError
                return student_ids[choice - 1]
            except (ValueError, IndexError):
                print("Invalid input. Please enter a valid number within the range.")
# Example usage
while True:
    question = input("Do you have a question about a student? (Y/N): ")
    # Only quit on an explicit "exit", not on any input that happens to contain the word
    if question.strip().lower() == "exit":
        break
    if question in ["Y", "y"]:
        student_name = input("Please enter the student name: ")
        if student_name.lower() == 'exit':
            break
        student_id = find_student_id(student_name)
        # Compare against None so that a student ID of 0 is still treated as found
        if student_id is not None:
            print(f"Student ID for '{student_name}': {student_id}")
            question = input("Please enter your student related question: ")
            json_agent_executor.run(
                f"{question} + The student name is {student_name}, student ID is {student_id}. You must make sure that the student ID is correct. For reference the current month is {current_month}.")
        else:
            print(f"Sorry, there is no student with the name '{student_name}'.")
    elif question in ["N", "n"]:
        # We add a route for general questions that are not related to a specific student
        question = input("Please enter your general question: ")
        if question.strip().lower() == "exit":
            break
        json_agent_executor.run(
            f"{question} + For reference the current month is {current_month}.")
        # Additional context not needed on not returning data outside of the JSON as the JSON toolkit will handle this.
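Because find_student_id mixes the name lookup with console prompts, it is awkward to test on its own. Below is a minimal sketch of the matching step as a pure function, assuming the same record shape as the "students" list (studentId, name, dob). The name match_students is illustrative and not part of the original script, the sample records are a small subset of the dataset, and unlike the original it also strips surrounding whitespace from the name.

```python
def match_students(students, name):
    """Return every student record whose name matches, ignoring case and surrounding spaces."""
    wanted = name.strip().lower()
    return [s for s in students if s["name"].strip().lower() == wanted]


# A few records in the same shape as the "students" list in the JSON file.
sample_students = [
    {"studentId": 635, "name": "Abbie Adams", "dob": "31 Aug 2013"},
    {"studentId": 927, "name": "Abbie Adams", "dob": "29 Sep 2018"},
    {"studentId": 359, "name": "Zach Hill", "dob": "19 Sep 2005"},
]

# Two students share this name, so the caller still has to ask which one is meant.
matches = match_students(sample_students, "abbie adams")
for student in matches:
    print(student["studentId"], student["dob"])
```

Returning the full list of matches leaves the decision about how to disambiguate (console prompt, web form, date of birth) to the caller.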
Console Output
Here’s the chatbot flow and chatbot interaction with the question being asked:
Compared to all other students, how does she (Abbie Adams) perform in maths throughout the entire year? Is she a top performer?

The JSON Agent in action with LangChain
Here we see the console log once the agent is fired off, which details each step taken by the agent, the agent’s thought process and the final response.
> Entering new AgentExecutor chain...
Action: json_spec_list_keys
Action Input: data
Observation: ['terms', 'students', 'guardians', 'attendance', 'behaviour', 'attainment']
Thought:To determine Abbie Adams' performance in maths, I need to explore details under the 'students' key to validate her ID and then check her academic performance. I'll start by exploring the keys under 'students'.
Action: json_spec_list_keys
Action Input: data["students"]
Observation: ValueError('Value at path `data["students"]` is not a dict, get the value directly.')
Thought:I should directly retrieve the value at the path `data["students"]` to understand the structure and find Abbie Adams' information.
Action: json_spec_get_value
Action Input: data["students"]
Observation: [{'studentId': 155, 'name': 'Eden Turner', 'sex': 'Female', 'yearGroup': 'Year 8', 'form': 'Form 8HV', 'dob': '16 May 2011'}, {'studentId': 359, 'name': 'Zach Hill', 'sex': 'Male', 'yearGroup': 'Year 13', 'form': 'Form 13HD', 'dob': '19 Sep 2005'}, {'studentId': 568, 'name': 'Harvey Walker', 'sex': 'Male', 'yearGroup': 'Year 8', 'form': 'Form 8HV', 'dob': '18 Apr 2011'}, {'studentId': 635, 'name': 'Abbie Adams', 'sex': 'Female', 'yearGroup': 'Year 6', 'form': 'Form 6FA', 'dob': '31 Aug 2013'}, {'studentId': 927, 'name': 'Abbie Adams', 'sex': 'Female', 'yearGroup': 'Reception', 'form': 'Form REAZ', 'dob': '29 Sep 2018'}]
Thought:I have confirmed that Abbie Adams with student ID 635 is the correct student. To determine her performance in maths, I should look into her academic records in the `attainment` or related key. First, I'll explore the keys under the `attainment` section in the JSON.
Action: json_spec_list_keys
Action Input: data["attainment"]
Observation: ValueError('Value at path `data["attainment"]` is not a dict, get the value directly.')
Thought:Thought: I should directly retrieve the value at the path `data["attainment"]` to understand the structure and find Abbie Adams' academic performance in maths.
Action: json_spec_get_value
Action Input: data["attainment"]
Observation: [{'studentId': 155, 'termsAttainmentData': [{'termName': 'Autumn', 'english': 8, 'maths': 6, 'science': 7}, {'termName': 'Spring', 'english': 9, 'maths': 9, 'science': 8}, {'termName': 'Summer', 'english': 9, 'maths': 9, 'science': 8}]}, {'studentId': 359, 'termsAttainmentData': [{'termName': 'Autumn', 'english': 8, 'maths': 9, 'science': 5}, {'termName': 'Spring', 'english': 8, 'maths': 9, 'science': 5}, {'termName': 'Summer', 'english': 9, 'maths': 9, 'science': 8}]}, {'studentId': 568, 'termsAttainmentData': [{'termName': 'Autumn', 'english': 7, 'maths': 8, 'science': 5}, {'termName': 'Spring', 'english': 7, 'maths': 8, 'science': 6}, {'termName': 'Summer', 'english': 8, 'maths': 8, 'science': 8}]}, {'studentId': 635, 'termsAttainmentData': [{'termName': 'Autumn', 'english': 8, 'maths': 7, 'science': 5}, {'termName': 'Spring', 'english': 9, 'maths': 9, 'science': 5}, {'termName': 'Summer', 'english': 9, 'maths': 9, 'science': 7}]}, {'studentId': 927, 'termsAttainmentData': [{'termName': 'Autumn', 'english': 8, 'maths': 9, 'science': 8}, {'termName': 'Spring', 'english': 9, 'maths': 9, 'science': 9}, {'termName': 'Summer', 'english': 9, 'maths': 9, 'science': 9}]}]
Thought:Thought: I have found the attainment data for Abbie Adams (student ID 635). To determine her performance in maths throughout the year, I need to compare her scores with those of other students. This involves analyzing the maths scores of all students to see if Abbie is a top performer.
Action: json_spec_list_keys
Action Input: data["attainment"][0]
Observation: ValueError('Value at path `data["attainment"][0]` is not a dict, get the value directly.')
Thought:Since the value at the path `data["attainment"][0]` is not a dict, I should directly retrieve the value at that path to understand the structure of attainment records for each student.
Action: json_spec_get_value
Action Input: data["attainment"][0]
Observation: {'studentId': 155, 'termsAttainmentData': [{'termName': 'Autumn', 'english': 8, 'maths': 6, 'science': 7}, {'termName': 'Spring', 'english': 9, 'maths': 9, 'science': 8}, {'termName': 'Summer', 'english': 9, 'maths': 9, 'science': 8}]}
Thought:Thought: I have already obtained the attainment data for all students, including Abbie Adams. Now, I will summarize the maths scores for all students to determine if Abbie is a top performer.
**Maths Performance Summary:**
- **Abbie Adams (studentId: 635):**
- Autumn: 7
- Spring: 9
- Summer: 9
- **Eden Turner (studentId: 155):**
- Autumn: 6
- Spring: 9
- Summer: 9
- **Zach Hill (studentId: 359):**
- Autumn: 9
- Spring: 9
- Summer: 9
- **Harvey Walker (studentId: 568):**
- Autumn: 8
- Spring: 8
- Summer: 8
- **Abbie Adams (studentId: 927):**
- Autumn: 9
- Spring: 9
- Summer: 9
**Analysis:**
Abbie Adams (studentId: 635) has the following maths scores:
- Autumn: 7
- Spring: 9
- Summer: 9
Other students' scores vary, but notable top performers in maths are:
- Zach Hill (studentId: 359) with scores (9, 9, 9)
- Abbie Adams (studentId: 927) with scores (9, 9, 9)
- Eden Turner (studentId: 155) with scores (6, 9, 9)
**Conclusion:**
Abbie Adams (studentId: 635) has strong performance in maths, particularly from Spring to Summer, but she is not the top performer. Zach Hill and Abbie Adams (studentId: 927) consistently achieved the highest scores in maths across all terms.
Final Answer: Abbie Adams (student ID 635) is a strong performer in maths, especially in Spring and Summer, but she is not the top performer. The top performers are Zach Hill and Abbie Adams (student ID 927), who consistently achieved the highest scores across all terms.
> Finished chain.
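Since the agent's comparison is ultimately simple arithmetic, it can be cross-checked deterministically. The sketch below copies the maths scores from the attainment observation in the log and ranks students by their total across the three terms. Using the yearly total as the measure of "top performer" is my assumption, and rank_by_total is an illustrative helper rather than part of the original script.

```python
# Maths scores per term (Autumn, Spring, Summer), copied from the attainment observation above.
maths_scores = {
    155: [6, 9, 9],
    359: [9, 9, 9],
    568: [8, 8, 8],
    635: [7, 9, 9],
    927: [9, 9, 9],
}


def rank_by_total(scores):
    """Return (student_id, total) pairs sorted from highest total to lowest."""
    totals = {sid: sum(terms) for sid, terms in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_total(maths_scores)
best_total = ranking[0][1]
# Everyone tied on the best total counts as a top performer.
top_performers = [sid for sid, total in ranking if total == best_total]

print(ranking)
print(top_performers)
```

On this data, students 359 and 927 tie for first place with 27 points each and student 635 comes third with 25, which agrees with the agent's final answer.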
The JSON Agent in LangSmith

In LangSmith we are able to log each query very much like the console, but with a lot more detail, and actionable insights into the chain and execution process. In production, this would assist with error learning, performance monitoring and program development as we explore usage outside of a development environment.
When to use this solution?
Web applications regularly make database calls in response to user actions, and the data retrieved is most likely to be in the form of JSON. By running the agent on the JSON that has already been fetched for the current use case, we can avoid further unnecessary calls to the database.
Furthermore, with an agent specifically designed to read JSON, we complement the full capabilities of an LLM when processing natural language queries about that data.
Conclusion
By using the JsonToolkit and JsonSpec, the agent is given a strict process for navigating, retrieving and comparing information from the structured JSON data. The toolkit was specifically designed for this purpose. If we simply passed the JSON file to an LLM there is a much higher chance of hallucinations.
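To make that "strict process" concrete, here is a simplified stand-in for the two operations seen in the console log: listing the keys at a path and fetching the value at a path. This is not LangChain's implementation. The names list_keys and get_value and the sample data are hypothetical, and paths are given as Python lists rather than the data["..."] strings the real tools accept.

```python
def get_value(data, path):
    """Walk a list of dict keys and list indexes, returning the value found there."""
    value = data
    for step in path:
        value = value[step]
    return value


def list_keys(data, path):
    """Return the keys at a path, or raise if the value there is not a dict."""
    value = get_value(data, path)
    if not isinstance(value, dict):
        raise ValueError("Value at path is not a dict, get the value directly.")
    return list(value.keys())


# Hypothetical, cut-down data in the same shape as the student JSON.
sample = {
    "students": [{"studentId": 635, "name": "Abbie Adams"}],
    "attainment": [
        {"studentId": 635, "termsAttainmentData": [{"termName": "Autumn", "maths": 7}]}
    ],
}

print(list_keys(sample, []))
print(get_value(sample, ["attainment", 0, "termsAttainmentData", 0, "maths"]))
```

Each call exposes only one level or one value at a time, so the model has to work from what the data actually contains at each step.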
However, this is not to say that LLMs will get the results wrong if you don’t use an agent. In testing, there were many times that directly feeding the JSON file and asking the question resulted in the right response. The issue is that there is always more room for hallucination when you don’t provide the strict context, or finely tuned prompting techniques, that agent tools can provide.