
With the introduction of Mistral 7B, an open-source language model from the French startup Mistral AI, the breathtaking performance demonstrated by proprietary models such as ChatGPT and Claude became available to the open-source community as well. To explore the feasibility of using this model on resource-constrained systems, its quantized versions have been shown to maintain strong performance.
Even though the 2-bit quantized Mistral 7B model passed our accuracy test with flying colors in an earlier study, it took around 2 minutes on average to respond to questions on a Mac. Enter TinyLlama [1], a compact 1.1B-parameter language model pretrained on 3 trillion tokens with the same architecture and tokenizer as Llama 2, and aimed at more resource-constrained environments.
In this article, we will compare the question-answering accuracy and response times of quantized Mistral 7B against quantized TinyLlama 1.1B in an ensemble Retrieval-Augmented Generation (RAG) setup.
Contents
- Enabling Technologies
- System Architecture
- Environment Setup
- Implementation
- Results and Discussions
- Final Thoughts
Enabling Technologies
This test will be conducted on a MacBook Air M1 with 8GB RAM. Due to its limited compute and memory resources, we are adopting quantized versions of these LLMs. In essence, quantization involves representing the model's parameters using fewer bits, which effectively compresses the model. This compression results in reduced memory usage, faster execution and increased energy efficiency, at some cost to accuracy. We will be using the 2-bit quantized Mistral 7B Instruct and the 5-bit quantized TinyLlama 1.1B Chat models in the GGUF format for this study. GGUF is a binary format designed for fast loading and saving of models. To load such GGUF models, we will be using the llama-cpp-python library, which provides Python bindings for the llama.cpp library.
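As a quick illustration of how such a GGUF model can be loaded, here is a minimal sketch using llama-cpp-python directly; the prompt string is arbitrary and the TinyLlama file path matches the one used later in this article:

from llama_cpp import Llama

# minimal sketch: load a pre-downloaded GGUF model with Metal offloading
llm = Llama(
    model_path="../models/tinyllama_gguf/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf",
    n_ctx=2048,        # context window size
    n_gpu_layers=1,    # offload layers to the Metal GPU on Apple silicon
    verbose=False,
)
output = llm("Q: What is model quantization? A:", max_tokens=64)
print(output["choices"][0]["text"])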
Retrieval-Augmented Generation (RAG) is the process of enhancing the output of an LLM by referencing an authoritative knowledge base outside of its training data before generating a response. A RAG application consists of a retriever system to fetch relevant document snippets from a corpus, and an LLM to generate responses using the retrieved snippets as context. Since the retriever is a major part of RAG, it has a significant influence on the overall question-answering (QA) system performance. Langchain, a powerful framework for working with LLMs, includes the EnsembleRetriever, which accepts a list of retrievers as input [2], combines their results and reranks them using the Reciprocal Rank Fusion algorithm. By leveraging the strengths of different retrieval algorithms, we have previously shown that it achieves better accuracy [3]. In this article, we will combine a BM25 retriever and a FAISS retriever at a 0.3:0.7 ratio for our ensemble.
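To give an intuition for Reciprocal Rank Fusion, below is a toy sketch of how two weighted rankings could be fused. The constant k=60 is the value commonly used in the literature, and the weighting scheme here is illustrative rather than Langchain's exact internal implementation:

# toy sketch of weighted Reciprocal Rank Fusion (RRF)
def rrf_fuse(rankings, weights, k=60):
    # rankings: one ranked list of document ids per retriever
    scores = {}
    for ranked_docs, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # return document ids ordered by their fused score
    return sorted(scores, key=scores.get, reverse=True)

# e.g. BM25 and FAISS rankings fused at a 0.3:0.7 ratio
print(rrf_fuse([["d1", "d2", "d3"], ["d2", "d3", "d1"]], weights=[0.3, 0.7]))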
To bring all these enablers together, let’s now take a look at the system architecture.
System Architecture
We will be reusing the modular architecture introduced in our previous article, as shown below:

There are three modules in this QA system, namely:
- The first module loads and vectorizes an online PDF document.
- The second module loads the quantized LLM, instantiates a FAISS retriever and creates an ensemble retriever instance from the FAISS and BM25 retrievers. This is followed by the creation of a retrieval chain encompassing the LLM, the ensemble retriever and a custom prompt.
- The third module serves as a helper for this RAG application. It facilitates an objective measurement of LLM performance across a set of questions by computing cosine similarity and the model response time, as sketched after this list.
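As a rough idea of what the helper does, the cosine similarity between a model response and an expert answer could be computed with sentence-transformers as sketched below. The embedding model named here is an assumption, and the actual LLMPerfMonitor module from the earlier article may differ:

# sketch of the similarity computation inside a helper like LLMPerfMonitor
from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model is an assumption

def calc_similarity(expected: str, actual: str) -> float:
    # embed both texts and return their cosine similarity
    embeddings = _embedder.encode([expected, actual])
    return util.cos_sim(embeddings[0], embeddings[1]).item()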
Before looking at the implementation, let’s prepare the environment.
Environment Setup
The version of Python used here is 3.10.5. We will create a virtual environment to manage this project. To create and activate the environment, let's do the following:
python3.10 -m venv mychat
source mychat/bin/activate
Let’s proceed to install all the required libraries (which will also install quite a number of dependent libraries):
pip install langchain==0.0.259
pip install faiss-cpu
pip install rank_bm25
pip install sentence_transformers
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
Since langchain has been refactored recently, the specific version used for our code is pinned above. To use the llama-cpp-python library with hardware acceleration on the M1 processor, the last install command above enables Metal support; with Metal, the computation runs on the GPU. faiss-cpu is a library for efficient similarity search and clustering of dense vectors on the CPU, instead of the GPU. rank_bm25 implements Okapi BM25, a ranking function that estimates the relevance of documents to a given search query. sentence-transformers provides easy methods to compute embeddings for sentences and other texts.
We are now ready to look at the code.
Implementation
As per our system architecture, there are three modules, namely LoadVectorize, LLMPerfMonitor and the main module. The first two modules are reused from our previous study without any modification. The sample document loaded by the first module is a recently released 600+ page guide of an acceleration appliance known as SteelHead.
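For completeness, the first module could look roughly like the sketch below. The document URL is a placeholder, and the chunk sizes and embedding model are assumptions; the actual LoadVectorize module is described in the previous article:

# LoadVectorize.py (sketch)
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.retrievers import BM25Retriever

def load_db():
    # load the online PDF and split it into overlapping chunks
    loader = PyPDFLoader("https://example.com/steelhead-user-guide.pdf")  # placeholder URL
    docs = loader.load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_documents(docs)
    # dense (FAISS) store for vector search, sparse (BM25) retriever for keyword search
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    db = FAISS.from_documents(chunks, embeddings)
    bm25_r = BM25Retriever.from_documents(chunks)
    return db, bm25_r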
To load either LLM, we will instantiate a LlamaCpp instance with some typical model parameters. The model GGUF file itself was pre-downloaded and saved to a specific directory. Loading either the Mistral 7B Instruct or the TinyLlama 1.1B Chat model then just involves specifying a different value for the model_path attribute of the LlamaCpp instance. We will then create an EnsembleRetriever instance from a FAISS and a BM25 retriever. Finally, a RetrievalQA chain is created with the LlamaCpp instance, the ensemble retriever and the prompt.
The Mistral prompt follows this template:
<s>[INST] {context} [/INST]</s>{question}
whereas the TinyLlama prompt follows this:
<|system|>{context}</s><|user|>{question}</s><|assistant|>
Accordingly, the following code listing represents the main module for the TinyLlama application.
# main.py
from langchain.retrievers import EnsembleRetriever
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.llms import LlamaCpp
import LoadVectorize
import LLMPerfMonitor
import timeit
# Prompt template
qa_template = """<|system|>
You are a friendly chatbot who always responds in a precise manner. If answer is
unknown to you, you will politely say so.
Use the following context to answer the question below:
{context}</s>
<|user|>
{question}</s>
<|assistant|>
"""
# Create a prompt instance
QA_PROMPT = PromptTemplate.from_template(qa_template)
llm = LlamaCpp(
    model_path="../models/tinyllama_gguf/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf",
    temperature=0.01,
    max_tokens=2000,
    top_p=1,
    verbose=False,
    n_ctx=2048
)
# load doc, vectorize and create retrievers
db,bm25_r = LoadVectorize.load_db()
faiss_retriever = db.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 3}, max_tokens_limit=1000)
r = 0.3 # ensemble ratio
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_r,faiss_retriever],weights=[r,1-r])
# Custom QA Chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=ensemble_retriever,
    chain_type_kwargs={"prompt": QA_PROMPT}
)
# List of questions
qa_list = LLMPerfMonitor.get_questions_answers()  # alternating question/answer pairs
print('model;question;cosine;resp_time')
for i,query in enumerate(qa_list[::2]):
    start = timeit.default_timer()
    result = qa_chain({"query": query})
    cos_sim = LLMPerfMonitor.calc_similarity(qa_list[i*2+1],result["result"])
    time = timeit.default_timer() - start # seconds
    print(f'bm25-{r:.1f}_f-{1-r:.1f};Q{i+1};{cos_sim:.5};{time:.2f}')
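To run the Mistral variant, only the prompt template and the model_path value need to change, along the lines of the sketch below (the Mistral GGUF file name is an example):

# Mistral-flavoured prompt, following the Mistral template shown earlier (sketch)
mistral_qa_template = """<s>[INST] You are a friendly chatbot who always responds in a precise manner. If answer is
unknown to you, you will politely say so.
Use the following context to answer the question below:
{context} [/INST]</s>
{question}
"""
MISTRAL_QA_PROMPT = PromptTemplate.from_template(mistral_qa_template)
# and point LlamaCpp at the Mistral model file instead, e.g. (example file name):
# model_path="../models/mistral_gguf/mistral-7b-instruct-v0.1.Q2_K.gguf"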
Results and Discussions
To understand the accuracy of the responses, we computed cosine similarity between the model response and the sample answer from an expert. For illustration, here is a question asked against both models:
When was full transparency support introduced on SteelHead for IPv6 optimized traffic?
The expected answer for this question is SteelHead RiOS v9.7. Below are the responses from each model.
Mistral: Starting with RiOS 9.7, IPv6 addressing is supported with port transparency and full transparency modes; however, IPv6 transparency is supported only when you select the Enable Enhanced IPv6 Auto-Discovery check box in the Optimization > Network Services: Peering Rules page.
TinyLlama: Full transparency support was introduced on SteelHead for IPv6 optimized traffic starting with RiOS 9.7, which is the latest version of the RiO software.
Both LLMs were able to identify the correct version. However, neither response is fully accurate in the extra details it includes. This nonetheless bodes well, especially for TinyLlama. The accuracy for all 10 questions is captured in Fig. 2 as a treemap chart. A treemap is useful for representing data with an inherent hierarchy, and it can use both the area of a rectangle and its color shade to convey the magnitude of the chosen metric.

The lighter the blue shade, the higher the cosine similarity, indicating higher accuracy. Based on the above, Mistral 7B has better accuracy than TinyLlama; the former's enclosing rectangle has a lighter blue shade than the latter's. Both models had their worst cosine value for question #7. However, when their responses were manually inspected, both models returned good answers but used different words than the expert's answer, which lowered their similarity scores.
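For illustration, such a treemap could be produced with Plotly Express from the script's semicolon-separated output. The charting library, the combined results.csv file and its llm column are assumptions, as the article does not state how the figures were generated:

# sketch: treemap of cosine similarity per model and question
import pandas as pd
import plotly.express as px

# assume the output of both runs was collected into one file with an extra 'llm' column
df = pd.read_csv("results.csv", sep=";")
fig = px.treemap(df, path=["llm", "question"], values="cosine",
                 color="cosine", color_continuous_scale="Blues")
fig.show()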
Fig. 3 shows how the LLMs differed in terms of their response times. There is a definite winner here: TinyLlama response times were an order of magnitude smaller than Mistral 7B's. All TinyLlama responses appear in light beige at the bottom end of the colour scale, as well as in much smaller rectangles than Mistral 7B's.

Based on the results obtained for the chosen questions, TinyLlama performed reasonably well in terms of accuracy. More importantly, it responded much more quickly than the more accurate Mistral 7B model. As claimed, TinyLlama does appear to be a well-performing model for a resource-constrained system.
Final Thoughts
Since the introduction of the Mistral 7B model a few months ago, open-source LLMs have experienced a giant jump in accuracy. Even though such a model is not trained on your internal documents, we are able to harness its capabilities in a retrieval-augmented generation (RAG) setup. It is, however, not entirely feasible to run such large models in a resource-constrained environment. Quantization comes to our aid here by reducing the number of bits used for the model's parameters. In addition, models with a smaller resource footprint are being introduced for such constrained environments, and TinyLlama is one such model.
In this article, we compared the performance of quantized Mistral 7B Instruct against quantized TinyLlama 1.1B Chat in terms of accuracy as well as question response times. In terms of accuracy, the larger Mistral 7B model is still better than TinyLlama. However, in terms of response time, TinyLlama's responses were an order of magnitude quicker than Mistral 7B's; in one instance, TinyLlama responded 17 times faster. As such, if an application can tolerate some drop in accuracy, TinyLlama is well suited for more resource-constrained systems.
Thank you for reading!
References
[1] https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
[2] https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble
[3] Querying Internal Documents using Mistral 7B with Context from an Ensemble Retriever