
Table of Contents
∘ Introduction ∘ Case Study ∘ Step 1 – Creating a Vector Store ∘ Step 2 – Creating the QA Chain ∘ Step 3 – Creating the User Interface ∘ Evaluating the Chatbot ∘ The Results ∘ The Final Verdict ∘ References
Introduction
The advent of local models has been welcomed by businesses looking to build their own custom LLM applications. They enable developers to build solutions that can run offline and adhere to their privacy and security requirements.
Such LLMs were originally huge and mostly catered to enterprises with the funds and resources to provision GPUs and train models on large volumes of data.
However, local LLMs are now available in much smaller sizes, which begs the question: is it possible for individuals with basic CPUs to harness these same tools and technologies?
It’s a question worth considering as users stand to gain a lot from building their own personal, local Chatbots that can perform tasks offline.
Here, we explore this possibility by building a closed-sourced chatbot using Meta’s Llama2 on a CPU and evaluating its performance as a reliable tool for individuals.
Case Study
To test the feasibility of building a local chatbot that can run offline on a personal computer, let’s carry out a case study.
The objective is to build a chatbot using a quantized version of Meta’s Llama2 (7B parameters). The model will power a LangChain application that generates responses, and the application will be exposed through a user interface that lets people interact with it.

The chatbot will be trained with two PDF documents (both accessible with the arXiv API):
- A Comprehensive Review of Computer Vision in Sports: Open Issues, Future Trends and Research Directions
- A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision
For context, this bot will be trained on a computer with the following specifications:
- Operating System: Windows 10
- Processor: Intel i7
- RAM: 8GB
Note: Following along with this case study requires prior knowledge of the LangChain framework and the Streamlit library.
Step 1 – Creating a Vector Store
First, we create the vector store, which will store the embedded data from the documents and facilitate the retrieval of documents relevant to the users’ queries.

For that, the data has to be converted into chunks. This is done by loading the PDF documents with the PyPDFLoader and splitting the text into chunks of 500 characters.
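A minimal sketch of this loading and splitting step is shown below; the file paths and the chunk overlap are assumptions, and only the 500-character chunk size comes from the description above.

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load both research papers (the file paths are placeholders).
pdf_paths = ["data/2203.02281.pdf", "data/2307.03353.pdf"]
documents = []
for path in pdf_paths:
    documents.extend(PyPDFLoader(path).load())

# Split the pages into 500-character chunks; the overlap value is an assumption.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
```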
Next, the chunks are converted into embeddings with the use of a sentence transformer from HuggingFace. It’s important to specify the device as "cpu" in the parameters.
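The embedding setup could look like the following sketch; the specific sentence-transformer model name is an assumption (a common lightweight choice), while the "cpu" device setting mirrors the text.

```python
from langchain.embeddings import HuggingFaceEmbeddings

# The model name is an assumed lightweight default; the device must be "cpu".
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)
```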
With the text chunks created and the embedding model loaded, we can create the vector store. For this case study, we use the Facebook AI Similarity Search (FAISS).
This vector store will be saved locally for future use.
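Building and saving the index might look like this, assuming the chunks and embeddings from the previous snippets; the local folder name is a placeholder.

```python
from langchain.vectorstores import FAISS

# Embed the chunks, build the FAISS index, then persist it to disk.
db = FAISS.from_documents(chunks, embeddings)
db.save_local("vectorstore/db_faiss")  # placeholder path
```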
Step 2 – Creating the QA Chain
Next, we need to load the retrieval QA chain, which retrieves the relevant documents from the vector store and uses them to answer the users’ queries.

The QA chain requires three components: the quantized Llama 2 model, the FAISS vector store, and a prompt template.
First, we download the quantized Llama2 model, which is available in the HuggingFace repository. For this case study, the model is downloaded as the file "llama-2-7b-chat.ggmlv3.q2_K.bin", which takes up 2.87 GB of memory.
The model is then loaded using CTransformers, the Python bindings for Transformer models implemented in C/C++. Since we want an objective chatbot that generates responses with little creativity, we set the temperature to 0.
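Loading the model might look like the sketch below; the file name and the zero temperature come from the text, while max_new_tokens is an assumed value.

```python
from langchain.llms import CTransformers

llm = CTransformers(
    model="llama-2-7b-chat.ggmlv3.q2_K.bin",
    model_type="llama",
    # Temperature 0 keeps responses deterministic; max_new_tokens is an assumption.
    config={"max_new_tokens": 512, "temperature": 0.0},
)
```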
Next, we load the vector store previously created.
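Reloading the index saved in Step 1 could look like this; the folder name matches the earlier placeholder, and the embedding model must be the same one used to build the index.

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Same embedding model as in Step 1 (assumed name), running on the CPU.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)
db = FAISS.load_local("vectorstore/db_faiss", embeddings)  # placeholder path
```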
After that, we define the prompt template. This step is optional, but since we are looking to interact with research papers, we need to prioritize accuracy, which we can enforce through the instructions in the prompt. Hallucinations are highly undesirable, so the main instruction is to respond with "I don’t know" to questions that cannot be answered with the provided PDF documents.
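An illustrative template enforcing that behavior is shown below; the exact wording is an assumption rather than the author's original prompt.

```python
from langchain.prompts import PromptTemplate

# Instruct the model to answer only from the retrieved context.
qa_template = """Use the following pieces of context to answer the question.
If the answer is not in the context, just say "I don't know"; do not make up an answer.

Context: {context}
Question: {question}

Helpful answer:"""

prompt = PromptTemplate(template=qa_template, input_variables=["context", "question"])
```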
With these elements, we can create the QA chain, which will generate responses to users’ queries using the loaded quantized Llama2 model, the vector store, and the prompt template.
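One way to assemble the chain, assuming the llm, db, and prompt objects from the previous snippets; the "stuff" chain type and the k=2 retriever setting are assumptions.

```python
from langchain.chains import RetrievalQA

# llm, db, and prompt are the objects created in the earlier snippets.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                                   # assumed chain type
    retriever=db.as_retriever(search_kwargs={"k": 2}),    # assumed number of chunks
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)
```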
Finally, we create the function that executes the response generation.
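A small wrapper along these lines would do; the function name and return shape are illustrative.

```python
def generate_response(query: str) -> str:
    """Run a user query through the QA chain and return the answer text."""
    result = qa_chain({"query": query})
    return result["result"]
```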
Step 3 – Creating the User Interface
The core elements needed for the LangChain application have been built, so we can pivot to building a user interface for the chatbot.
The Streamlit library is suited for this task as it contains features tailored for chatbot applications.
The following code incorporates the previously built functions into the user interface (to review the entire source code, please visit the GitHub repository).
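Since the full listing lives in the repository, the snippet below is only a minimal sketch of such an interface built around Streamlit's chat elements; the widget layout and session-state handling are assumptions, and generate_response is the hypothetical wrapper from Step 2.

```python
import streamlit as st

st.title("Llama2-Powered QA Chatbot for Research Papers")

# Keep the conversation history across Streamlit reruns.
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay previous turns.
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# Accept a new question and answer it with the QA chain.
if query := st.chat_input("Ask a question about the papers"):
    st.session_state.messages.append({"role": "user", "content": query})
    with st.chat_message("user"):
        st.write(query)
    with st.chat_message("assistant"):
        answer = generate_response(query)  # wrapper defined in Step 2
        st.write(answer)
    st.session_state.messages.append({"role": "assistant", "content": answer})
```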
The Streamlit app is run with the following:
```bash
streamlit run app.py
```
And voila! We have our personal closed-sourced chatbot up and running!

Evaluating the Chatbot
We have our chatbot, so let’s evaluate its performance with 3 different questions:
1. What is the benefit of computer vision in sports analysis?
2. Give me a list of sports that incorporate computer vision.
3. What algorithms are used to track players?

The Results
Overall, the bot seems to return satisfactory responses without including unsolicited information.
However, one limitation is evident from the response to the question: "What algorithms are used to track players?", where the answer cuts off mid-sentence. This can be attributed to the limited context window (i.e. number of tokens) of the quantized version of the Llama2 model. The chatbot is unable to properly answer questions that require many tokens.
Moreover, a limitation that isn’t conveyed in the responses themselves is time. While the bot responded appropriately to the questions, it took over a minute on average to generate a response on this CPU. With such a long run time, there isn’t a strong argument for using this tool over manually searching for content in a collection of documents.
Finally, this entire exercise consumed a large share of my computer’s memory, rendering other applications unusable.
The Final Verdict

Now that the case study has concluded, let’s revisit the initial question: Can we build LLM-powered applications on a CPU?
The answer is: Yes… but we probably shouldn’t.
The positives from this case study are that the quantized Llama2 model is easy to download and incorporate into the application and that the chatbot is able to generate responses of high quality.
However, the combination of limited tokens, long run times, and high memory usage makes the prospect of building closed-sourced chatbots on CPUs unfeasible.
Of course, this conclusion is predicated on the constraints of the device used to conduct the case study, so a computer with greater processing power and storage may yield more promising results. In addition, in time, smaller LLMs with larger context windows will be made available to the public, making closed-sourced chatbots easier to build and utilize on CPUs.
If you’re interested in finding out how this application would perform on your device, feel free to check out the code in the following repository:
anair123/Llama2-Powered-QA-Chatbot-For-Research-Papers (github.com)
Thank you for reading!
References
- Naik, B. T., Hashmi, M. F., & Bokde, N. D. (n.d.). A Comprehensive Review of Computer Vision in Sports: Open Issues, Future Trends and Research Directions. arXiv. https://arxiv.org/pdf/2203.02281
- Zhao, Z., Chai, W., Hao, S., Hu, W., Wang, G., Cao, S., Song, M., Hwang, J.-N., & Wang, G. (n.d.). A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision. arXiv. https://arxiv.org/pdf/2307.03353.pdf