
In my last article, I showed how I could fine-tune OpenAI’s ChatGPT to improve the results of performing tasks like formatting text documents. Although fine-tuning helped the Large Language Model (LLM) better perform the formatting task, the model didn’t seem to absorb much of the knowledge in the training data.
For this article, I explored using an open-source LLM, Meta’s LLaMa 2 [1], and a technique called Retrieval Augmented Generation (RAG) [2] to automatically find relevant reference text that helped the model answer questions. I used the text from 36 of my articles on Medium as reference material to see if I could create a private, custom chatbot that could answer questions about my writings.
For example, without providing any reference text, here’s how Llama 2 responded to a question about one of my articles.
RobG: What is Muybridge Derby?
LLaMa 2: Thank you for asking! I’m here to help you with your question. However, I must inform you that "Muybridge Derby" is not a real or recognized term in any context. It’s possible that it may be a misspelling or a made-up term, and I cannot provide information on something that does not exist. If you could provide more context or clarify the term you are referring to, I would be happy to try and assist you further. Please let me know if there’s anything else I can help with!
That was a very polite way to say it had never heard of the project. And it didn’t "hallucinate" and make up something false.
Here’s the question again with a follow-up request for more info, but the system had access to some reference text this time.
RobG: What is Muybridge Derby?
LLaMa 2: Muybridge Derby is a project by Robert A. Gonsalves where he used Midjourney and RunwayML to transform Eadweard Muybridge’s photo sequences into high-resolution videos.
RobG: Tell me more about the AI systems.
LLaMa 2: In the Muybridge Derby project, Robert A. Gonsalves used two AI systems to transform Eadweard Muybridge’s photo sequences into high-resolution videos: 1. Midjourney: a tool for creating interactive visual stories using AI-generated images and text. 2. RunwayML: a platform for creating, training, and deploying machine learning models. These systems allowed Gonsalves to transform Muybridge’s static photographs into dynamic videos that bring the animal locomotion to life.
Much better! The system answered the first question briefly and accurately. The response to the follow-up request was a bit wordy in that it repeated some of the info in the first answer, but overall, the results were good.
Overview
This section provides an overview of the project and the components I used. I’ll discuss each component and process in more detail further below.

I started with the text from 36 of my articles on Medium and converted them from HTML to Markdown for easier retrieval. I used LlamaIndex, an open-source framework for connecting LLMs to external data, to index, search, and provide sections of my articles as reference text to the LLMs.
To test the system’s accuracy, I used GPT-4 [3] to generate a set of 100 questions and answers about my articles. I tested three LLMs to see how well they worked: two variants of LLaMa 2 with 7 and 13 billion parameters and ChatGPT [4], which has a whopping 175B parameters.
I assessed the results using a semantic text embedding model, MPNet [5], which a team at Hugging Face further trained as a model called all-mpnet-base-v2. I compared the embeddings of the answers from the three models to the embedding of the expected response from GPT-4 using cosine similarity to get accuracy scores, which are reported in the Results section below.
Components and Processes
Here are the details of the components and processes I used in this project.
Converting HTML to Markdown
Markdown is a lightweight markup language with plain-text-formatting syntax. John Gruber and Aaron Swartz created Markdown in 2004 as a markup language intended to be easy to read and easy to write [6]. Because of the simplified syntax, documents in Markdown format are easier to parse and break into cohesive text segments than HTML. I used the Beautiful Soup Python library to help convert my articles from HTML to Markdown.
For example, here is an excerpt from my Muybridge Derby article in HTML and Markdown formats.

You can see how the HTML format on the left has a lot more metadata and punctuation characters than the same document in Markdown format on the right. Having the documents in Markdown format makes the downstream processes easier. You can see my conversion code in this Google Colab.
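As a rough sketch of the conversion step, here is one way to pull headers, paragraphs, and list items out of an article’s HTML with Beautiful Soup and emit Markdown. The tag handling is simplified compared to the code in the Colab, so treat it as illustrative.
from bs4 import BeautifulSoup

def html_to_markdown(html):
    # Parse the article HTML
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tag in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        text = tag.get_text(" ", strip=True)
        if not text:
            continue
        # Map heading levels to Markdown '#' prefixes and list items to bullets
        if tag.name == "h1":
            lines.append("# " + text)
        elif tag.name == "h2":
            lines.append("## " + text)
        elif tag.name == "h3":
            lines.append("### " + text)
        elif tag.name == "li":
            lines.append("* " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)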
LlamaIndex
LlamaIndex is an open-source project that connects LLMs to various data sources, enabling structured data interaction and application enhancement through natural language interfaces and knowledge-augmented responses. It facilitates data ingestion, indexing, and utilization across diverse formats and supports the development of interactive, data-informed applications [7].
It’s straightforward to use the LlamaIndex framework in Python. Here’s a simple example that shows the system’s API. It uses ChatGPT by default, but other LLMs, like LLaMa 2, can easily be used.
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Load the Markdown files from the folder of articles
documents = SimpleDirectoryReader('robgon_articles_md').load_data()

# Index the documents: split them into segments and create semantic embeddings
index = VectorStoreIndex.from_documents(documents)

# Create a query engine and ask a question about the articles
query_engine = index.as_query_engine()
results = query_engine.query("What is DeepHaiku? Be brief")
print(results)
I stored the Markdown files for my 36 articles in the robgon_articles_md folder. The SimpleDirectoryReader object reads in the files, and the framework breaks them into multiple segments, using the formatting as hints to isolate cohesive blocks of text. It averaged about 18 segments per article, for a total of 632. The VectorStoreIndex object creates semantic embeddings for the segments, which are used to find relevant references for each query. A query engine created from the index answers questions related to the documents.
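The example above uses ChatGPT by default. Here is a minimal sketch of how a local LLaMa 2 chat model could be swapped in instead, assuming a LlamaIndex release from this period that provides the HuggingFaceLLM wrapper and ServiceContext; the model name and generation settings are illustrative.
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# Wrap a local LLaMa 2 chat model from Hugging Face (settings are illustrative)
llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    context_window=4096,
    max_new_tokens=256,
    device_map="auto",
)

# Tell LlamaIndex to use this LLM instead of the default ChatGPT
service_context = ServiceContext.from_defaults(llm=llm)
documents = SimpleDirectoryReader('robgon_articles_md').load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()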
Here is the response to a question about another one of my projects.
RobG: What is DeepHaiku? Be brief.
ChatGPT: Deep Haiku is a model that generates Haikus using a fine-tuned GPT-J 6B model. It starts with a prompt and creates 20 candidate Haikus, filtering for adherence to the [5, 7, 5] meter and optionally removing explicit language. The top candidates are displayed along with their scores.
This is an excellent summary of my DeepHaiku article. We’ll see in a moment how well LLaMa 2 7B and 13B perform answering questions like this.
Language Models
The following sections describe the three LLMs I used in this project, listed chronologically by release date.
ChatGPT
ChatGPT is a language generation model developed by OpenAI, based on the GPT (Generative Pretrained Transformer) architecture and released in November 2022. Like GPT-3, the ChatGPT model has 175 billion machine-learning parameters. It’s capable of understanding and generating human-like text based on the input it receives, and it’s used for various natural language processing tasks, such as text completion, summarization, and conversational agents [4]. A version of ChatGPT called GPT-3.5 Turbo was released in March 2023. The model was trained using supervised learning on a massive corpus of text data and fine-tuned through reinforcement learning from human feedback, which allows it to create coherent and contextually relevant text based on the input it receives [8].
GPT-4
GPT-4 is a large-scale, multimodal model developed to accept image and text inputs and produce text outputs. It was also released in March 2023 by OpenAI. Although a version of the model supports image inputs, it’s not currently available to the public. GPT-4 demonstrated human-level performance on various benchmarks, including passing a simulated bar exam with high scores. The model underwent a post-training alignment process to enhance its factuality and adherence to desired behavior. Most of its development involved creating infrastructure and optimization methods that behave predictably across various scales [3].
LLaMa 2
Meta’s LLaMa 2 is a collection of pretrained and fine-tuned LLMs ranging in scale from 7 billion to 70 billion parameters, released in July 2023. The collection includes a variant named LLaMa 2-Chat for interactive text generation. Meta showed that these models outperform other open-source chat models across numerous benchmarks and show promising results in human evaluations for helpfulness and safety. The developers explain their fine-tuning and safety enhancement approaches for LLaMa 2-Chat, aiming to facilitate further community involvement and responsible LLM development [1].
The authors described their safety considerations for the LLaMa 2 model. They explained their safety alignment process, which involved collecting safety-related annotations supported by experimental results. They also engaged "red teams" that tested and enhanced model safety. They presented an examination and mitigation of potential risks associated with the model’s deployment and concluded their paper with the following warning.
Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Llama 2’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their specific applications of the model. – Hugo Touvron, et al. [1]
I’m happy to read that the researchers at Meta are taking model safety seriously. Fortunately, my use of the LLaMa 2 models didn’t push the system toward producing objectionable responses, but it’s good to know that mitigations are in place.
LLaMa 2 License
Although Meta released the source code and trained weights for LLaMa 2 as free and open-source, their license has a couple of unique twists. It stipulates that entities with over 700 million monthly active users must request and be granted a specific license from Meta to use the model. It explicitly prohibits anyone from using LLaMa 2 to enhance other LLMs. I’m not a lawyer, so please seek legal advice to fully understand the terms and conditions before using LLaMa 2.
Generating Test Questions and Answers
To judge the correctness of answers from the LLaMa 2 chat models, I used GPT-4 to generate 100 questions and answers about my articles on Medium. I started with 100 segments of text collected by LlamaIndex and fed the reference text into GPT-4 with the following prompt.
Ask one question about the reference material and answer it. Don’t use the phrase ‘reference material.’ Instead, use the article’s title or the project’s nickname. In the question don’t use the phrase ‘the author’. Use ‘Robert’ or ‘Robert A. Gonsalves’ instead. In the answer, refer to the author as ‘he/him/his’ if his name was used in the question. Use the format: q: a:
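As a sketch of this step, here is how a question-and-answer pair could be generated for one text segment using the OpenAI Python library available at the time; the segment variable is an illustrative placeholder for one of the 100 segments collected by LlamaIndex.
import openai

# One of the 100 reference segments collected by LlamaIndex (placeholder)
segment = "..."

prompt = ("Ask one question about the reference material and answer it. "
          "Don't use the phrase 'reference material.' Instead, use the article's title "
          "or the project's nickname. In the question don't use the phrase 'the author'. "
          "Use 'Robert' or 'Robert A. Gonsalves' instead. In the answer, refer to the "
          "author as 'he/him/his' if his name was used in the question. Use the format: q: a:")

# Send the instructions and the reference text to GPT-4
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "system", "content": prompt},
              {"role": "user", "content": segment}])
print(response.choices[0].message.content)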
Here are some example questions and answers generated by GPT-4.
q: How did Robert use CLIP to filter the images for the MAGnet project? a: Robert used CLIP to filter the images by comparing the embedding from the phrase "modern painting" to the embeddings from the paintings. This allowed him to find the top 10,000 images that best matched the phrase "modern painting". This was done to ensure that the images used for training were good examples of modern painting.
q: What is the main objective of Robert A. Gonsalves’ project ‘GPT-4 Chords’? a: The main objective of his project ‘GPT-4 Chords’ is to demonstrate how to use the latest large language model from OpenAI, GPT-4, to help compose chords for original songs.
q: How did Robert A. Gonsalves modify the images created by DALL-E for the Mark Madness comic? a: Robert cleaned up the images in Photoshop, added the title, and replaced the gibberish text in the word balloons with dialog generated by GPT-3. He also used a trick in Photoshop to replicate the eyes of the characters Mark and Enrique, and used the Comic Sans font for the dialog.
These all look pretty good. The questions ask about things readers might be interested in, and the answers seem accurate, well-written, and brief.
Generating the Answers
For each of the three LLMs I tested, LLaMa 2 7B, LLaMa 2 13B, and ChatGPT, I fed in the corresponding reference material and the question with an additional prompt, "Be brief." The systems generated answers for all 100 questions. Here is a table that shows the answers to three of the questions.
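As a rough sketch of this loop, assuming questions holds the 100 generated questions and query_engine is the LlamaIndex query engine configured with the LLM under test, the answers could be collected like this.
# Collect one answer per test question from the current model
answers = []
for question in questions:
    response = query_engine.query(question + " Be brief.")
    answers.append(str(response))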

All of the answers seem good. They all state the main facts clearly and succinctly. The only notable differences can be seen in the responses from LLaMa 2 13B. All three of its answers start with a general clause like "Sure! Here is the answer to your question …" and "Based on the information provided …" These phrases don’t add much value to the responses.
Comparing Text for Semantic Similarity
To assess the three models quantitatively, I used a semantic text encoding model called all-mpnet-base-v2, developed by the researchers at Hugging Face. The encoding model takes in text strings and produces corresponding embeddings, each a list of 768 floating-point numbers that represents the meaning of the text. Here’s some Python code to show how it works.
from sentence_transformers import SentenceTransformer
import numpy as np
# Load the model
encoder = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
# Create embeddings for two example sentences.
sentences = ["This is an example sentence", "This is another example sentence"]
embeds = encoder.encode(sentences)
# Normalize the embeddings
text_features_1 = embeds[0] / np.linalg.norm(embeds[0], axis=-1, keepdims=True)
text_features_2 = embeds[1] / np.linalg.norm(embeds[1], axis=-1, keepdims=True)
# Calculate the cosine similarity
similarity = np.dot(text_features_1, text_features_2.T)
print(similarity)
I sent two text strings into the model to create embeddings and compared them using the cosine similarity metric. The resulting similarity ranges from 0.0 to 1.0, depending on how close the two text strings are in meaning. In this example, the two strings had a similarity of 90.2%. I used this metric to assess the results of the LLMs.
Results
I used the semantic embedding model to compare all 100 answers from the three LLMs to the baseline answers from GPT-4 and compiled the following graph that shows the results.
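As a sketch of the scoring step, assuming answers_by_model maps each model’s name to its list of 100 answers and answers_gpt4 holds the GPT-4 baseline answers, the per-question scores could be computed and plotted like this.
import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

def similarity_scores(answers, baseline):
    # Encode both sets of answers and normalize the embeddings
    a = encoder.encode(answers)
    b = encoder.encode(baseline)
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    # Cosine similarity between each answer and its baseline counterpart
    return (a * b).sum(axis=-1)

# answers_by_model and answers_gpt4 are assumed to hold the collected answers
scores = {name: similarity_scores(answers, answers_gpt4)
          for name, answers in answers_by_model.items()}
plt.boxplot(scores.values(), labels=scores.keys(), showmeans=True)
plt.ylabel("Cosine similarity to GPT-4 answer")
plt.show()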

This Box and Whisker graph visualizes the test results. The central boxes represent the middle 50% of the data, with lines indicating the median. Whiskers extend to the smallest and largest values, while dots indicate statistical outliers. The X’s represent the mean values of the scores. In general, you can see that the results from LLaMa 2 7B and ChatGPT are very close, although LLaMa 2 7B does account for some of the lowest scores. It’s curious to see that the results from LLaMa 2 13B are slightly worse than the other two models. This may be due to the extra wordiness of its answers.
Conclusion
My exploration of using Meta’s LLaMa 2 for enhancing the quality and accuracy of responses in a custom chatbot, built upon a knowledge base of my 36 articles on Medium, has provided insights into LLMs’ practical application and limitations. The experiment demonstrated that while providing LLaMa 2 with relevant reference text significantly improved response accuracy, the relationship between model size and response quality was not linear, as evidenced by the varied performance between LLaMa 2 7B and 13B. Furthermore, using semantic similarity as a metric for evaluating responses highlighted the nuanced differences and variability in the answers provided by different models.
This investigation not only underscores the potential of using LLaMa 2 for specific, customized applications but also brings to light the critical importance of ethical considerations and thorough testing in deploying such technologies. The guidelines and warnings provided by Meta are a pertinent reminder of the responsibilities of developers and researchers in ensuring the safe and ethical use of LLMs, especially as they become more integrated into various digital platforms and applications. The balance between technological advancement and ethical deployment will continue to be a pivotal focus in developing and applying LLMs in diverse environments.
Next Steps
Further exploration into the retrieval accuracy of LlamaIndex could involve a structured evaluation to understand the system’s effectiveness in identifying and providing relevant reference material. Additionally, investigating text comparison metrics beyond semantic similarity might offer insights into the linguistic and contextual nuances of the model responses. This dual-faceted approach aims to provide a more comprehensive understanding of model performance and guide future developments in enhancing the accuracy and linguistic quality of LLM responses.
Source Code
I am releasing the source code for this project on GitHub under the Creative Commons Attribution Sharealike license.

Acknowledgments
I want to thank Jennifer Lim for reviewing the article and providing feedback.
References
[1] H. Touvron et al., Llama 2: Open Foundation and Fine-Tuned Chat Models, (2023)
[2] P. Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, (2020), Advances in Neural Information Processing Systems
[3] OpenAI, GPT-4 Technical Report, (2023)
[4] T. Brown et al., Language Models are Few-Shot Learners, (2020), Advances in Neural Information Processing Systems
[5] K. Song et al., MPNet: Masked and Permuted Pre-training for Language Understanding, (2020), Advances in Neural Information Processing Systems
[6] A. Swartz, Markdown, (2017)
[7] J. Liu, LlamaIndex, (2022)
[8] OpenAI, Introducing ChatGPT, (2022)