
Table of Contents
∘ Introduction ∘ Objective ∘ Chatbot Architecture ∘ Tech Stack ∘ Procedure ∘ Step 1 – Load the PDF Documents ∘ Step 2 – Build the Vector Store ∘ Step 3 – Load the LLM ∘ Step 4 – Create the Retrieval Chain ∘ Step 5 – Build the User Interface ∘ Step 6 – Run the Chatbot Application ∘ Step 7 – Containerize the Application ∘ Future Steps ∘ Conclusion ∘ References
Introduction
Not too long ago, I attempted to build a simple custom chatbot that would run entirely on my CPU.
The results were appalling, with the application crashing frequently. That said, this was not a shocking outcome. As it turns out, housing a 13B-parameter model on a $600 computer is the programming equivalent of making a toddler trek up a mountain.
This time, I made a more serious attempt towards building a research chatbot with an end-to-end project that uses AWS to house and provide access to the models needed to build the application.
The following article details my efforts to leverage retrieval-augmented generation (RAG) to build a high-performing research chatbot that answers questions with information from research papers.
Objective
The aim of this project is to build a QA chatbot using the RAG framework. It will answer questions using the content of PDF documents available in the arXiv repository.
Before delving into the project, let’s consider the architecture, the tech stack, and the procedure for building the chatbot.
Chatbot Architecture

The diagram above illustrates the workflow for the LLM application.
When a user submits a query on a user interface, the query will get transformed using an embedding model. Then, the vector database will retrieve the most similar embeddings and send them along with the embedded query to the LLM. The LLM will use the provided context to generate an accurate response, which will be shown to the user on the user interface.
Tech Stack
Building the RAG application with the components shown in the architecture will require several tools. The noteworthy tools are the following:
1. Amazon Bedrock
Amazon Bedrock is a serverless service that gives users access to foundation models via API. Since it uses a pay-as-you-go system that charges by the number of tokens used, it is convenient and cost-effective for developers.
Bedrock will be used to access both the embedding model and the LLM. In terms of configuration, using Bedrock will require creating an IAM user with access to the service. Furthermore, access to the models of interest must be granted in advance.
2. FAISS
FAISS is a popular library in the Data Science space and will be used to create the vector database for this project. It enables quick and efficient retrieval of relevant documents based on a similarity metric. It is free too, which always helps.
3. LangChain
The LangChain framework will facilitate the creation and usage of the RAG components (e.g., vector store, LLM).
4. Chainlit
The Chainlit library will be used to develop the user interface of the chatbot. It enables users to build an aesthetic front end with minimal code and offers features suited for chatbot applications.
Note: The technical portion of the article will include code snippets of Chainlit operations, but will not cover the syntax or functionality of these operations.
5. Docker
For portability and ease of deployment, the application will be containerized using Docker.
Procedure
Developing the LLM application will require the following steps. Each step will be explored individually.
- Load the PDF Documents
- Build the Vector Store
- Load the LLM
- Create the Retrieval Chain
- Design the User Interface
- Run the Chatbot Application
- Run the Application in a Docker Container
Step 1 – Load the PDF Documents

arXiv is a repository containing a plethora of free, open-access papers on topics ranging from economics to engineering. The backend data of the application will comprise a few documents on LLMs from the repository.
Once the selected documents are stored in a directory, they will be loaded and transformed into text chunks using LangChain’s PyPDFDirectoryLoader and RecursiveCharacterTextSplitter, respectively.
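Below is a minimal sketch of this step, assuming the papers sit in a local "data" directory (the directory name and chunking parameters are illustrative, and import paths may vary slightly across LangChain versions):

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every PDF stored in the local "data" directory
loader = PyPDFDirectoryLoader("data")
documents = loader.load()

# Split the documents into overlapping text chunks for embedding
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)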
Step 2 – Build the Vector Store

The text chunks created in Step 1 are embedded using Amazon’s Titan Text Embeddings model, which can be accessed in code with boto3, the AWS SDK for Python. The Titan model is identified by the model_id provided in the Bedrock documentation.
The embedded chunks are stored in a FAISS vector store, which is saved locally as "faiss_index".
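A minimal sketch of this step is shown below; it assumes the chunks from Step 1 and that Bedrock access has already been configured (the region and model_id are illustrative and should be checked against the Bedrock documentation):

import boto3
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import FAISS

# Client for invoking Bedrock models
bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Titan Text Embeddings, identified by its Bedrock model_id
embeddings = BedrockEmbeddings(client=bedrock_client, model_id="amazon.titan-embed-text-v1")

# Embed the chunks from Step 1 and persist the FAISS index locally
vector_store = FAISS.from_documents(chunks, embeddings)
vector_store.save_local("faiss_index")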
Step 3 – Load the LLM

The LLM for the application will be Meta’s 13B Llama 2 model. Much like the embedding model, the LLM is accessed with Amazon Bedrock.
One noteworthy parameter is temperature, which affects the randomness of the model’s output. Since the application is designed to be used for research, randomness will be minimized by setting temperature to 0.
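A minimal sketch of loading the model through LangChain’s Bedrock wrapper is shown below (the model_id corresponds to the Llama 2 13B chat variant in Bedrock, and the extra model_kwargs are illustrative):

import boto3
from langchain_community.llms import Bedrock

bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Llama 2 13B served by Bedrock, with randomness minimized via temperature=0
llm = Bedrock(
    client=bedrock_client,
    model_id="meta.llama2-13b-chat-v1",
    model_kwargs={"temperature": 0, "max_gen_len": 512},
)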
Step 4 – Create the Retrieval Chain

In LangChain, a "chain" is a wrapper that facilitates a series of events in a specific order. In this RAG application, the chain will receive the user query and retrieve the most similar chunks from the vector store. The chain will then send the embedded query and the retrieved chunks to the loaded LLM, which will generate a response using the provided context.
The chain also incorporates a ConversationBufferMemory, which allows the chatbot to retain memory of previous queries. This enables the user to ask follow-up questions.
Another noteworthy mention is the hyperparameter k for the retriever, which specifies the number of embeddings that should be taken from the vector store. For this use case, we set k to 3, meaning that the LLM application will use 3 embeddings for context to answer each query.
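A sketch of how such a chain can be assembled is shown below. It assumes the llm and vector_store objects from the previous steps and uses LangChain’s ConversationalRetrievalChain, which is one way to combine retrieval with conversation memory (the project’s actual chain setup may differ):

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Memory of previous turns, so follow-up questions can be resolved in context
memory = ConversationBufferMemory(
    memory_key="chat_history",
    output_key="answer",
    return_messages=True,
)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),  # retrieve the top 3 chunks
    memory=memory,
    return_source_documents=True,  # keep the retrieved chunks so responses can be cited
)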
Step 5 – Build the User Interface
So far, the backend components of the application have been developed, so it is time to work on the front end. Chainlit makes it easy to build user interfaces for LangChain applications, as existing code only needs to be modified with additional Chainlit commands.
Chainlit is used for creating the function that establishes the chain.
It is also used for creating the function that uses the chain to generate responses and send them to the user.
The Chainlit decorators are a necessary inclusion. The on_chat_start decorator defines the operations that should run when the chat session is started (i.e., setting up the chain), while the on_message decorator defines the operations that should run when the user submits a query (i.e., sending the response).
In addition, the code incorporates the async and await keywords so that these tasks are handled asynchronously.
Finally, since the LLM application is designed for research, the generated response will include the sources of the embeddings retrieved from the vector store after the similarity search. This makes the generated responses citable, and as a result, more credible in the eyes of the user.
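Putting these pieces together, a minimal Chainlit app might look like the sketch below; build_chain() is a hypothetical helper that wires up the chain from Step 4, and the message formatting is illustrative:

import chainlit as cl

@cl.on_chat_start
async def on_chat_start():
    # Build the retrieval chain once per session and store it in the user session
    chain = build_chain()  # hypothetical helper from Step 4
    cl.user_session.set("chain", chain)
    await cl.Message(content="Hi! Ask me anything about the loaded research papers.").send()

@cl.on_message
async def on_message(message: cl.Message):
    chain = cl.user_session.get("chain")

    # Run the chain asynchronously on the user's query
    res = await chain.acall(
        message.content,
        callbacks=[cl.AsyncLangchainCallbackHandler()],
    )

    # Append the document name and page number of each retrieved chunk as a citation
    sources = "\n".join(
        f"- {doc.metadata.get('source')}, page {doc.metadata.get('page')}"
        for doc in res.get("source_documents", [])
    )
    await cl.Message(content=f"{res['answer']}\n\nSources:\n{sources}").send()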
Step 6 – Run the Chatbot Application
With all components in the chatbot workflow created, the application can be run and tested. With Chainlit, a session can be started with a simple one-liner:
chainlit run app.py

The chatbot is now up and running! It shows the message provided in the code upon the start of the session.
Let’s test it with a simple query:

When a query is submitted, the response is both concise and comprehensible. Furthermore, it includes the sources of the 3 vector embeddings that were used to generate the response, including the name of the document and the page number.
To ensure that the chatbot is retaining memory of previous queries, we can submit a follow-up query.

Here, we ask for "another example" without providing additional information on the example that is needed. Since the bot is retaining memory, it knows that the query is referring to pretrained LLMs.
Overall, the application performs at a satisfactory level. One aspect that can’t be demonstrated in an article is the significantly lower computational demand needed to run the chatbot. Since AWS houses the embedding model and the LLM, there is no risk of crashes from excessive CPU utilization.
Step 7 – Containerize the Application
Although the chatbot is up and running, there is still one step remaining. The LLM application still needs to be containerized with Docker for easier portability and version control.
The first step for containerization is to develop the Dockerfile.
In this Dockerfile, we use a Python image as a base, define build arguments for the AWS access key ID and secret access key, install the dependencies listed in requirements.txt, copy the current directory into the container, and run the Chainlit application.
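A sketch of a Dockerfile along those lines is shown below (the base image tag, port, and file names are assumptions rather than the project’s exact configuration; note that passing credentials as build arguments bakes them into the image, so runtime environment variables or IAM roles are generally safer):

FROM python:3.11-slim

# Build arguments for the AWS credentials needed to call Bedrock
ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
ENV AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
ENV AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}

WORKDIR /app

# Install the dependencies, then copy the application code into the container
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Launch the Chainlit app on the port mapped by docker run
EXPOSE 8000
CMD ["chainlit", "run", "app.py", "--host", "0.0.0.0", "--port", "8000"]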
It is pretty easy from here on out. Building the Docker image takes a one-liner:
docker build --build-arg AWS_ACCESS_KEY_ID=<your_access_key_id> --build-arg AWS_SECRET_ACCESS_KEY=<your_secret_access_key> -t chainlit_app .
The command above builds an image named chainlit_app. It includes the AWS access key ID and the AWS secret access key as build arguments since they are needed to access the models in Amazon Bedrock via API.
Finally, the application can be run in a Docker container:
docker run -d --name chainlit_app -p 8000:8000 chainlit_app

The application is now running on port 8000! Since the application is being run locally, the chatbot will be hosted at http://localhost:8000.
Let’s see if the RAG components (including the AWS Bedrock models) are still operational by submitting a query.

It works just as expected!
Future Steps
The current chatbot is able to respond to queries with decent performance and at a low cost. However, the application is still run locally and uses default parameters. Thus, there are still measures that can be taken to further enhance the performance and usability of the chatbot.
1. Perform Rigorous Testing
The LLM application appears to perform effectively, with the responses being concise and accurate. However, the tool still needs to undergo rigorous testing before it can be deemed usable.
The testing would primarily serve to ensure that response accuracy is maximized while hallucinations are minimized.
2. Implement Advanced RAG Techniques
If the chatbot is unable to answer specific types of questions or just consistently performs poorly, it would be worth considering the use of advanced RAG techniques to improve certain aspects of the workflow, such as the retrieval of content from the vector database.
3. Polish the Front End
Currently, the tool uses the default front end provided by Chainlit. To make the tool more aesthetic and intuitive, the UI design can be further customized.
In addition, the citation feature of the chatbot (i.e., identifying the source of the response) can be improved by providing a hyperlink so that the user can immediately go to the page that contains the information they need.
4. Deploy to the Cloud
If there is a need to offer this application to a larger user base, the next step would be to deploy it on a remote server using cloud services like Amazon EC2 or Amazon ECS. High scalability, availability, and performance are attainable with many cloud platforms, but since the tool leverages AWS Bedrock, the natural progression would be to harness other resources under the AWS umbrella.
Conclusion

Working on this project, I felt blown away by how far the data science space has advanced. NLP applications that harness generative AI would have been difficult to build just 5 years ago, as they would have required considerable time, money, and manpower.
In 2024, such tools can be built by a single person with a little code and minimal expense (the whole project has cost under $1 so far). It makes you wonder what will be possible in the coming years.
For those more interested in the codebase for the project, please visit the GitHub repository:
https://github.com/anair123/Building-a-Research-Chatbot-with-AWS-and-Llama-2
Thank you so much for reading!
References
- Stehle, J., Eusebius, N., Khanuja, M., Roy, M., & Pathak, R. (n.d.). Getting Started with Amazon Titan Text Embeddings in Amazon Bedrock. AWS Machine Learning Blog. https://aws.amazon.com/blogs/machine-learning/getting-started-with-amazon-titan-text-embeddings/