Hands-On GenAI for Product & Engineering Leaders

Make better product decisions by taking a peek under the hood of LLM-based products

Ninad Sohoni
Towards Data Science


Image generated by Bing Image Creator based on the prompt “product owner for a machine learning powered application working on a prototype”

Introduction

If you’re a regular driver, the hood of your car could be full of cotton for all you care. However, if you are anywhere in the design and execution chain responsible for building a better car, knowing what the different parts are and how they work together will help you do exactly that.

Similarly, as a product owner, business leader, or engineer responsible for creating new Large Language Model (LLM) powered products, or for bringing LLMs / generative AI to existing products, an understanding of the building blocks that go into an LLM-powered product will help you tackle strategic and tactical technology questions, such as:

  1. Is our use case a good fit for LLM-powered solutions? Perhaps traditional analytics, supervised machine learning, or another approach is a better fit?
  2. If LLMs are the way to go, can our use case be addressed by an off-the-shelf product (say, ChatGPT Enterprise) now or in the near-future? Classic build-vs-buy decision.
  3. What are the different building blocks of our LLM-powered product? Which of these are commoditized, and which are likely to need more time to build and test?
  4. How do we measure the performance of our solution? What levers are available to improve the quality of outputs from our product?
  5. Is our data quality acceptable for the use case? Are we organizing our data correctly, and passing relevant data to the LLM?
  6. Can we be confident that the LLM’s responses will always be factually accurate? Or will our solution ‘hallucinate’ every once in a while when generating responses?

These questions are answered later in the article, but the objective of getting a little hands-on first is to build an intuitive understanding of LLM-powered solutions, which should help you answer these questions on your own, or at least put you in a better position to research further.

In a previous article, I delved into some foundational concepts associated with building LLM-powered products. But you can’t learn to drive just by reading blogs or watching videos — it requires you to get behind the wheel. Well, thanks to the age we live in, we have free-to-us tools (which cost millions of dollars to create) at our fingertips to build our own LLM solution in under an hour! So, in this article, I propose we do just that. It’s a much easier undertaking than learning to drive 😝.

Build a Chatbot that allows you to “chat” with websites

Objective: Build a chatbot that answers questions based on information on a provided website, to gain a better understanding of the building blocks of popular GenAI solutions today

We will create a question-answering chatbot that will answer questions based on information in a knowledge repository. This solution pattern, called Retrieval Augmented Generation (RAG), has become a go-to approach in companies. One reason for the popularity of RAG is that rather than relying solely on the LLM’s own knowledge, you can bring external information to the LLM in an automated manner. In real-world implementations, the external information can come from an organization’s own knowledge repository, holding proprietary information that enables the product to answer questions about the business, its products, business processes, etc. RAG also reduces LLM ‘hallucinations’, in that the generated responses are grounded in the information provided to the LLM. According to a recent talk,

“RAG will be the default way enterprises use LLMs”
-Dr. Waleed Kadous, Chief Scientist, AnyScale

For our hands-on exercise, we will let a user enter a website, which our solution will “read” into its knowledge repository. The solution will then be able to answer questions based on the information on the website. The website is a placeholder — in reality, this can be tweaked to consume text from any data source like PDFs, Excel, another product or internal system, etc. The approach also extends to other media, such as images, but those require a few different models. For now, we will focus on text from websites.
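As a quick illustration of how the data source could be swapped, here is a minimal sketch (not used in the rest of this walkthrough) that reads a local PDF instead of a webpage with LangChain; the file name is a placeholder, and the loader assumes the `pypdf` package is installed.

# A minimal sketch of swapping the data source: load a local PDF instead of a webpage
# Assumes `pip install pypdf` and a placeholder file name - not used elsewhere in this article
from langchain.document_loaders import PyPDFLoader

pdf_docs = PyPDFLoader("my_internal_report.pdf").load() # One LangChain Document per page
print(f"Loaded {len(pdf_docs)} pages; first page starts with:")
print(pdf_docs[0].page_content[:200])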

For our example, we will use a sample book list webpage created for this blog: Books I’d Pick Up — If There Were More Hours in the Day! You are welcome to use another website of your choice.

Here’s what our result will look like:

LLM-powered chatbot to intelligently answer questions based on information on a website. (Image by the author)

Here are the steps we will go through to build our solution:

0. Getting Set Up — Google Colaboratory & OpenAI API Key
1. Create knowledge repository
2. Search question-relevant context
3. Generate answer using LLM
4. Add “chat” capability (optional)
5. Add a simple pre-coded UI (optional)

0.1. Getting Set Up — Google Colaboratory & OpenAI API Key

To build an LLM solution, we need a place to write and run code, and an LLM to generate responses to questions. We will use Google Colab for the code environment, and the model behind ChatGPT as our LLM.

Let’s start with setting up Google Colab, a free service by Google that enables running Python code in an easy-to-read format — no need to install anything on your computer. I find it convenient to add Colab to Google Drive so that I can later find Colab notebooks easily.

To do so, navigate to Google Drive (using a browser) > New > More > Connect More Apps > Search “Colaboratory” in the Google Marketplace > Install.

To start using Colaboratory (“Colab”), you can select New > More > Google Colaboratory. This will create a new notebook in your Google Drive so you can return to it later.

Google Colaboratory accessible in Google Drive after adding it as an application. (Image by the author)

Next, let’s get access to an LLM. There are several open source and proprietary options available. While open source LLMs are free, powerful LLMs generally require powerful GPUs to process inputs and generate responses, and there is a nominal operating cost for the GPUs. In our example, we will instead use OpenAI’s API to access the LLM behind ChatGPT. To do so, you will require an API key, which is like a username/password rolled into one to let OpenAI know who is trying to access the LLM. As of this writing, OpenAI offers a $5 credit for new users, which should be sufficient for this hands-on tutorial. Here are the steps to get the API key,

Go to OpenAI’s Platform website > Get started > Sign up with email & password or use Google or Microsoft account. You may also need a phone number for verification.

Once logged in, click on your profile icon in the top right corner > View API keys > Create new secret key. The key will look something like the following (fake key for informational purposes only). Save it for use later.

sk-4f3a9b8e7c4f4c8f8f3a9b8e7c4f4c8f-UsH4C3vE64

Now, we are ready to build the solution.

0.2. Prepare Notebook for Building Solution

We need to install some packages in the Colab environment to facilitate our solution. Just type the following code in the text box (called a “cell”) in Colab and press “Shift + Return (enter)”. Alternatively, just click the “play” button on the left of the cell or use the “Run” menu at the top of the notebook. You may need to use the menu to insert new code cells for running subsequent code:

# Install OpenAI & tiktoken packages to use the embeddings model as well as the chat completion model
!pip install openai tiktoken
# Install the langchain package to facilitate most of the functionality in our solution, from processing documents to enabling "chat" using LLM
!pip install langchain
# Install ChromaDB - an in-memory vector database package - to save the "knowledge" relied on by our solution to answer questions
!pip install chromadb
# Install HTML to text package to transform webpage content to a more human readable format
!pip install html2text
# Install gradio to create a basic UI for our solution
!pip install gradio

Next, we should pull in code from the packages we installed so that the packages can be used in the code we write. You can use the new code cell and hit “Shift + Return” again — and continue in this manner for each subsequent code block.

# Import packages needed to enable different functionality for the solution
from langchain.document_loaders import AsyncHtmlLoader # To load website content into a document
from langchain.text_splitter import MarkdownHeaderTextSplitter # To split documents into smaller chunks by document headings
from langchain.document_transformers import Html2TextTransformer # To convert HTML to Markdown text
from langchain.chat_models import ChatOpenAI # To use OpenAI's LLM
from langchain.prompts import PromptTemplate # To formulate instructions / prompts
from langchain.chains import RetrievalQA, ConversationalRetrievalChain # For RAG
from langchain.memory import ConversationTokenBufferMemory # To maintain chat history
from langchain.embeddings.openai import OpenAIEmbeddings # To convert text to numerical representation
from langchain.vectorstores import Chroma # To interact with vector database
import pandas as pd, gradio as gr # To show data as tables, and to build UI respectively
import chromadb, json, textwrap # Vector database, converting json to text, and prettify printing respectively
from chromadb.utils import embedding_functions # Setting up embedding function, following protocol required by Chroma

Finally, add the OpenAI API key to a variable. Note that this key is like your password — do not share it. Also, do not share your Colab notebook without removing the API key first.

# Add your OpenAI API Key to a variable
# Saving the key in a variable like so is bad practice. It should be loaded into environment variables and loaded from there, but this is okay for a quick demo
OPENAI_API_KEY='sk-4f3a9b8e7c4f4c8f8f3a9b8e7c4f4c8f-UsH4C3vE64' # Fake Key - use your own real key here

Now we are ready to start building the solution. Here is a high-level view of the next steps:

Core steps to build the RAG solution (Image by the author)

When coding, we will use LangChain, which has emerged as a popular framework for building solutions such as this one. It has packages facilitating each of the steps, from connecting to data sources to sending and receiving information from the LLM. LlamaIndex is another option to simplify building LLM-powered apps. Using LangChain (or LlamaIndex) is not strictly required, and in some cases the high-level abstraction risks leaving teams oblivious to what’s happening under the hood. We will use LangChain, but still look under the hood often.

Note that since the pace of innovation is so quick, it is likely that packages used in this code get updated, and some updates may cause the code to stop working unless updated accordingly. I do not intend to keep this code up-to-date. Nevertheless, the article is intended to serve as a demonstration, and the code could serve as reference, or a starting point that you may adapt to your needs.

1. Create Knowledge Repository

1.1. Identify & Read-in Documents
Let’s access the book list and read the content into our Colab environment. The content is originally loaded as HTML, which is useful for web browsers. However, we will convert it to a more human-readable format using an HTML-to-text converter.

url = "https://ninadsohoni.github.io/booklist/" # Feel free to use any other website here, but note that some code might need to be edited to display contents properly

# Load HTML from URL and transform to a more readable text format
docs = Html2TextTransformer().transform_documents(AsyncHtmlLoader(url).load())

# Let's take a quick peek to see what we have now
print("\n\nIncluded metadata:\n", textwrap.fill(json.dumps(docs[0].metadata), width=100), "\n\n")
print("Page content loaded:")
print('...', textwrap.fill(docs[0].page_content[2500:3000], width=100, replace_whitespace=False), '...')

Here is what running the code generates on Google Colab:

Result of executing the code above. The website content is loaded into the Colab environment. (Image by the author)

1.2. Break Documents into Smaller Excerpts
There is one more step before we load the blog’s information into our knowledge repository (which is essentially a database of our choice). The text should not be loaded into the database as-is. It should first be split into smaller chunks. This is for a few reasons:

  1. If our text is too long, it cannot be sent to the LLM because it would exceed the LLM’s text length limit (known as the “context size”).
  2. Longer text might have broad, loosely related information. We would be relying on the LLM to pick out the relevant portions — and this might not always work out as expected. With smaller chunks, we can use retrieval mechanisms to identify only the relevant pieces of information to send to the LLM, as we will see later.
  3. LLMs tend to pay stronger attention to the beginning and end of their input, so longer chunks could lead the LLM to pay less attention to content in the middle (a phenomenon known as “lost in the middle”).

The right chunk size will vary with the specifics of the use case, including the type of content, the LLM being used, and other factors. It is prudent to experiment with different chunk sizes and evaluate response quality before finalizing the solution. For this demonstration, let’s use context-aware splitting where each book recommendation from the list gets its own chunk,

# Now we split the entire content of the website into smaller chunks
# Each book review will get its own chunk since we are splitting by headings
# The LangChain splitter used here will also create a set of metadata from headings and associate it with the text in each chunk
headers_to_split_on = [ ("#", "Header 1"), ("##", "Header 2"),
("###", "Header 3"), ("####", "Header 4"), ("#####", "Header 5") ]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on = headers_to_split_on)
chunks = splitter.split_text(docs[0].page_content)

print(f"{len(chunks)} smaller chunks generated from original document")

# Let's look at one of the chunks
print("\nLooking at a sample chunk:")
print("Included metadata:\n", textwrap.fill(json.dumps(chunks[5].metadata), width=100), "\n\n")
print("Page content loaded:")
print(textwrap.fill(chunks[5].page_content[:500], width=100, drop_whitespace=False), '...')
One of many document chunks as a result of splitting the original content. (Image by the author)

Note that if the chunks created so far are still longer than desired, they can be split further using other text splitting algorithms, easily available via LangChain or LlamaIndex. For example, each book’s review could be split into paragraphs, if needed.
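If further splitting were needed, a character-based splitter could be layered on top of the heading-based chunks. Here is a minimal sketch using LangChain’s RecursiveCharacterTextSplitter; the chunk size and overlap values are arbitrary and would need tuning for a real use case.

# Optional: split any still-too-long chunks further with a character-based splitter
# The chunk_size / chunk_overlap values below are arbitrary and should be tuned per use case
from langchain.text_splitter import RecursiveCharacterTextSplitter

secondary_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
smaller_chunks = secondary_splitter.split_documents(chunks) # Heading metadata is carried over to the sub-chunks
print(f"{len(chunks)} heading-based chunks became {len(smaller_chunks)} smaller chunks")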

1.3. Load Excerpts into Knowledge Repository
The text chunks are now ready to be loaded into the knowledge repository. They are first passed through an embedding model to convert the text to a series of numbers that capture the meaning of the text. Then the actual text, along with its numerical representation (i.e., embeddings), is loaded into the vector database — our knowledge repository. Note that embeddings are also generated by LLMs, just a different kind of model than the chat LLM. If you wish to read more about embeddings, the previous article demonstrates the concept using examples.
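To see what an embedding looks like, the optional snippet below converts a short phrase to its numerical representation using the same OpenAI embeddings model; the sample phrase is arbitrary, and the exact vector length depends on the embedding model.

# Optional: peek at what an embedding looks like for a short, arbitrary phrase
sample_embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY).embed_query("a gripping detective novel")
print(f"Embedding length: {len(sample_embedding)}") # On the order of 1,500 numbers for OpenAI's embedding model
print("First few values:", [round(x, 4) for x in sample_embedding[:5]])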

We will use a vector database to store all the information. This will serve as our knowledge repository. Vector databases are purpose-built to enable searching by embedding similarity. If we want to search for something in the database, the search term is first converted to a numerical representation by running it through the embedding model, and then the question embeddings are compared to all of the embeddings in the database. Records (in our case, text chunks about each book on the list) that are closest to the question embeddings are returned as search results, as long as they clear a threshold.

# We will get embeddings for each chunk (and subsequently questions) using an embeddings model from OpenAI
openai_embedding_func = embedding_functions.OpenAIEmbeddingFunction(api_key=OPENAI_API_KEY)

# Initialize vector DB and create a collection
persistent_chroma_client = chromadb.PersistentClient()
collection = persistent_chroma_client.get_or_create_collection("my_rag_demo_collection", embedding_function=openai_embedding_func)
cur_max_id = collection.count() # To not overwrite existing data

# Let's add data to our collection in the vector DB
collection.add(
ids=[str(t) for t in range(cur_max_id+1, cur_max_id+len(chunks)+1)],
documents=[t.page_content for t in chunks],
metadatas=[None if len(t.metadata) == 0 else t.metadata for t in chunks]
)

print(f"{collection.count()} documents in vector DB")
# 25 documents in vector DB

# Optional: We will write a scrappy helper function to print data slightly better -
# it limits the length of embeddings shown on the screen (since these are over 1,000 numbers long).
# It also shows a subset of text from the documents as well as the metadata fields
def render_vectorDB_content(chromadb_collection):
    vectordb_data = pd.DataFrame(chromadb_collection.get(include=["embeddings", "metadatas", "documents"]))
    return pd.DataFrame({'IDs': [str(t) if len(str(t)) <= 10 else str(t)[:10] + '...' for t in vectordb_data.ids],
                         'Embeddings': [str(t)[:27] + '...' for t in vectordb_data.embeddings],
                         'Documents': [str(t) if len(str(t)) <= 300 else str(t)[:300] + '...' for t in vectordb_data.documents],
                         'Metadatas': ['' if not t else json.dumps(t) if len(json.dumps(t)) <= 90 else '...' + json.dumps(t)[-90:] for t in vectordb_data.metadatas]
                        })

# Let's take a look at what is in the vector DB using our helper function. We will look at the first 4 chunks
render_vectorDB_content(collection)[:4]
A view of the first few text chunks loaded to the vector DB along with numerical representations (i.e., embeddings). (Image by the author)

2. Search Question-Relevant Context

We ultimately want our solution to pick out relevant information from our vector DB knowledge corpus and pass it to the LLM together with the question we want the LLM to answer. Let’s try out the vector DB search by asking the question “Can you recommend a few detective novels?”

# Here we are connecting to the previously created instance of the ChromaDB database using a LangChain ChromaDB client
vectordb = Chroma(client=persistent_chroma_client, collection_name="my_rag_demo_collection",
                  embedding_function=OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
                  )

# Optional - We will define another scrappy helper function to print data slightly better
def render_source_documents(source_documents):
    return pd.DataFrame({'#': range(1, len(source_documents) + 1),
                         'Documents': [t.page_content if len(t.page_content) <= 300 else t.page_content[:300] + '...' for t in source_documents],
                         'Metadatas': ['' if not t else '...' + json.dumps(t.metadata, indent=2)[-88:] for t in source_documents]
                        })

# Here's where we are compiling the question
question = "Can you recommend a few detective novels?"

# Running the search against the vector DB based on our question
relevant_chunks = vectordb.similarity_search(question)

# Printing results
print(f"Top {len(relevant_chunks)} search results")
render_source_documents(relevant_chunks)
Top search results for the question “Can you recommend a few detective novels?” (Image by the author)

We get the top 4 results by default, unless we explicitly set the value to a different number. In this example, the top result, which is a Sherlock Holmes novel, mentions the term ‘detective’ directly. The second result (The Day of the Jackal) does not contain the term ‘detective’, but mentions ‘police agencies’ and ‘uncover the plot’, which bear semantic association with “detective novels”. The third result (The Undercover Economist) mentions the term ‘undercover’, though it is about economics. I believe the last result was fetched due to its association with novels / books rather than “detective novels” specifically, simply because four results were requested.

Also, it is not strictly necessary to use a vector DB. You could load embeddings and facilitate search in other forms of storage. “Normal” relational databases or even Excel can be used. But you would have to handle the “similarity” calculation, which can be a dot product when using OpenAI embeddings, in your application logic. On the other hand, a vector DB does that for you.
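To make that concrete, here is a minimal sketch of doing the similarity math yourself with NumPy, outside of any vector database, using the chunks and question from our example; it relies on OpenAI embeddings being (approximately) unit-length, so a plain dot product works as the similarity score.

# A minimal sketch of handling the similarity calculation yourself, without a vector DB
import numpy as np

embedder = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
chunk_vectors = np.array(embedder.embed_documents([c.page_content for c in chunks]))
question_vector = np.array(embedder.embed_query("Can you recommend a few detective novels?"))

scores = chunk_vectors @ question_vector # One similarity score per chunk (dot product)
top_ids = np.argsort(scores)[::-1][:4] # Indices of the 4 most similar chunks
for i in top_ids:
    print(round(scores[i], 3), chunks[i].page_content[:80].replace("\n", " "), "...")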

Note that if we wanted to pre-filter some search results by metadata, we could do so. For our demonstration, let’s filter by the genre, which is under “Header 2” in the metadata we loaded from the book list.

# Let's try filtering and accessing records that match a specific metadata filter. This can probably be transformed into a pre-filter if needed
pd.DataFrame(vectordb.get(where = {'Header 2': 'Finance'}))
Search results based on applying a metadata pre-filter, showing only key columns. (Image by the author)

An interesting opportunity offered by LLMs is to use the LLM itself to inspect a user question, review available metadata, assess whether a metadata-based pre-filter is required and possible, and formulate the pre-filter query code, which can then be used on the vector DB to actually pre-filter data. See LangChain’s self-query retriever for more information about this.
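For reference, here is a rough, untested sketch of what a self-query retriever setup might look like with our book-list metadata; the metadata descriptions are illustrative assumptions, and LangChain’s self-query retriever typically also needs the `lark` package installed.

# A rough sketch of LangChain's self-query retriever (illustrative; typically needs `pip install lark`)
# The LLM inspects the question and formulates a metadata filter (e.g., on genre) plus a search query by itself
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(name="Header 2", description="Genre of the book, e.g. Finance or Fiction", type="string"),
    AttributeInfo(name="Header 3", description="Title of the book being reviewed", type="string"),
]
sq_llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0, openai_api_key=OPENAI_API_KEY)
self_query_retriever = SelfQueryRetriever.from_llm(sq_llm, vectordb, "Reviews of recommended books", metadata_field_info)
# self_query_retriever.get_relevant_documents("Can you recommend a finance book?")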

3. Generate Answer using LLM

Next, we will add instructions to the LLM that basically say “I am going to give you some information snippets, and a question. Please answer the question using the provided information snippets”. Then, we bundle these instructions, the search results from the vector DB, and our question into a packet and send it to the LLM to respond. All these steps are facilitated by the following code.

Note that LangChain offers the opportunity to abstract some of this code, so your code doesn’t have to be as verbose as the code that follows. However, the objective with the code below is to showcase the instructions sent to the language model. Here’s where you can customize them too — like in this case, the default instructions are changed to request the LLM to keep responses as concise as possible. If the default works for your use case, your code can skip the question template part altogether and LangChain will use the default prompt from its own package when sending a request to the LLM.

# Let's select the language model behind free version of ChatGPT: GPT-3.5-turbo
llm = ChatOpenAI(model_name = 'gpt-3.5-turbo', temperature = 0, openai_api_key = OPENAI_API_KEY)

# Let's build a prompt. This is what actually gets sent to the ChatGPT LLM, with the context from our vector database and question injected into the prompt
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that the available information is not sufficient to answer the question.
Don't try to make up an answer. Keep the answer as concise as possible, limited to five sentences.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# Define a Retrieval QA chain, which will take the question, get the relevant context from the vectorDB, and pass both to the language model for a response
qa_chain = RetrievalQA.from_chain_type(llm,
retriever=vectordb.as_retriever(),
return_source_documents=True,
chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

Now, let’s ask for detective novel recommendations again and see what we get as a response.

# Let's ask a question and run our question-answering chain
question = "Can you recommend a few detective novels?"

result = qa_chain({"query": question})

# Let's look at the result
result["result"]
Response from the solution recommending detective novels (Image by the author)

Did the LLM review all four of our previous search results from the vector DB, or was it only given the two results noted in the response? Let’s confirm.

# Let's look at the source documents used as context by the LLM
# We will use our helper function from before to limit the size of the information shown
render_source_documents(result["source_documents"])
What was passed in as context to the LLM along with the question, to facilitate the response. (Image by the author)

We can see that the LLM still had access to all four search results and reasoned that only the first two books were detective novels.

Be forewarned that the response from the LLM could change each time you ask a question, despite sending the same instructions and the same information from the vector database. For example, on asking about fantasy book recommendations, the LLM sometimes gave three book recommendations, and sometimes more — though all from the book list. In all cases, the top recommended book stayed the same. Note that these variations occurred despite setting the ‘temperature’ parameter (which controls the consistency vs. creativity trade-off) to 0 to minimize variance.

4. Add “chat” capability (optional)

The solution now has the necessary core functionality — it is able to read in information from a website and answer questions based on that information. But it does not yet offer a “conversational” user experience. Thanks to ChatGPT, the “chat interface” has become the dominant design: we now expect this to be the “natural” way to interact with generative AI, and LLMs in particular 😅. The first step towards a chat interface involves adding “memory” to the solution.

“Memory” here is an illusion, in that the LLM is not actually remembering the conversation up to that point — it needs to be shown the full conversation history in each turn. So, if a user asks the LLM a follow-up question, the solution will package the original question, the LLM’s original answer, and the follow-up question and send it to the LLM. The LLM reads the entire conversation and generates a meaningful response to continue the conversation.
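In its simplest form, that packaging is just string assembly. The toy sketch below is plain Python (not LangChain) and uses a made-up exchange, just to show what gets re-sent to the LLM each turn.

# Toy illustration of the "memory" illusion: the full conversation is re-sent on every turn
chat_history = [
    ("Can you recommend a few detective novels?",
     "Sure - the list includes a Sherlock Holmes novel and 'The Day of the Jackal'."),
]
follow_up = "Tell me more about the second book"

conversation_text = "\n".join(f"Human: {q}\nAI: {a}" for q, a in chat_history)
packaged_prompt = f"{conversation_text}\nHuman: {follow_up}\nAI:"
print(packaged_prompt) # This entire text, not just the follow-up, would be sent to the LLM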

In question-answering chatbots, like the one we’re building, this approach needs to be extended further because there is the interim step to reach out to the vector database and pull relevant information to formulate the response to a user’s follow-up question. The way “memory” is simulated in question-answering chatbots is,

  1. Retain all questions and responses (in a variable) as “chat history”
  2. When the user asks a question, send the chat history and the new question to the LLM and ask it to generate a standalone question
  3. At this point, the chat history is no longer needed. Use the standalone question to run a new search on the vector DB
  4. Pass the standalone question and search results, along with instructions to the LLM to get a final answer. This step is similar to what we implemented in the previous stage “Generate Answer using LLM”

While we can keep a track of chat history in simpler variables, we will use one of LangChain’s memory types. The particular memory object we will use offers the nice feature of automatically truncating older chat history when it reaches a size limit you specify, generally the size of the text that the selected LLM can accept. In our case, the LLM should be able to accept a little over 4,000 “tokens” (which are word parts), which should roughly be 3,000 words or ~5 pages from a Word document. OpenAI offers a 16k variant of the same ChatGPT LLM, which can accept 4x the input. Hence, the need to configure the memory size.
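If you are curious how many tokens a given piece of text consumes, the tiktoken package we installed earlier can count them. A quick sketch, using one of our chunks as the sample text:

# Quick check of how many tokens a piece of text uses with the GPT-3.5-turbo tokenizer
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
sample_text = chunks[5].page_content
print(f"{len(sample_text.split())} words -> {len(encoding.encode(sample_text))} tokens")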

Here is the code to achieve these steps. Again, LangChain provides a higher-level abstraction, and the code does not have to be this explicit. This version is just to expose the underlying instructions sent to the LLM — first to condense the chat history into a single standalone question, which will then be used for the vector DB search, and then to generate a response to that standalone question based on the vector DB search results.

# Let's create a memory object to track chat history. This will start accumulating human messages and AI responses.
# Here a "token" based memory is used to restrict the length of chat history to what can be passed into the selected LLM.
# Generally, the maximum token length configured will depend on the LLM. Assuming we are using the 4K version of the LLM,
# we will set the token maximum to 3K, to allow some room for the question prompt.
# The LLM parameter is to make LangChain aware of the tokenization scheme of the selected LLM.
memory = ConversationTokenBufferMemory(memory_key="chat_history", return_messages=True, input_key="question", output_key="answer", max_token_limit=3000, llm=llm)

# While LangChain includes a default prompt to generate a standalone question based on the users' latest question and
# any context from the conversation up to that point, we will extend the default prompt with additional instructions.
standalone_question_generator_template = """Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question, in its original language.
Be as explicit as possible in formulating the standalone question, and
include any context necessary to clarify the standalone question.

Conversation:
{chat_history}
Follow Up Question: {question}
Standalone question:"""
updated_condense_question_prompt = PromptTemplate.from_template(standalone_question_generator_template)

# Let's rebuild the final prompt (again, optional since LangChain uses a default prompt, though it might be a little different)
final_response_synthesizer_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that the available information is not sufficient to answer the question.
Don't try to make up an answer. Keep the answer as concise as possible, limited to five sentences.
{context}
Question: {question}
Helpful Answer:"""
custom_final_prompt = PromptTemplate.from_template(final_response_synthesizer_template)

qa = ConversationalRetrievalChain.from_llm(
llm=llm,
retriever=vectordb.as_retriever(),
memory=memory,
return_source_documents=True,
return_generated_question=True,
condense_question_prompt= updated_condense_question_prompt,
combine_docs_chain_kwargs={"prompt": custom_final_prompt})

# Let's again ask the question we previously asked the retrieval QA chain
query = "Can you recommend a few detective novels?"
result = qa({"question": query})
print(textwrap.fill(result['answer'], width=100))
Detective novel recommendations from the solution. Same response as the one received earlier using only “question-answering” capability, without “memory” (Image by the author)

Let’s ask a follow-up question and look at the response to validate the solution now has “memory” and can respond conversationally to follow-up questions:

query = "Tell me more about the second book"
result = qa({"question": query})
print(textwrap.fill(result['answer'], width=100))
Response to follow-up question asking more information about “the second book”. The solution responds back with more information about the same book as before (Image by the author)

Let’s look at what is happening under the hood to validate that the solution does indeed go through the four steps outlined at the beginning of this section. Let’s start with the chat history to verify that the solution does indeed log the conversation so far:

# Let's look at chat history upto this point
result['chat_history']
Chat history after asking the second question. Note that the response is also included in the conversation at this point. (Image by the author)

Let’s look at what else the solution is tracking besides the chat history:

# Let's print the other parts of the results
print("Here is the standalone question generated by the LLM based on chat history:")
print(textwrap.fill(result['generated_question'], width=100 ))
print("\nHere are the source documents the model referenced:")
display(render_source_documents(result['source_documents']))
print(textwrap.fill(f"\nGenerated Answer: {result['answer']}", width=100, replace_whitespace=False) )
Outputs, other than chat history, after asking the second question. (Image by the author)

The solution internally uses the LLM to first convert the question “Tell me more about the second book” to “What additional information can you provide about ‘The Day of the Jackal’ by Frederick Forsyth?”. Armed with this question, the solution is able to search the vector DB for any relevant information and retrieves The Day of the Jackal chunk first this time, though note that some irrelevant search results about other books are also included.

Quick optional sidebar discussing potential issues

Potential Issue #1 — Poor Standalone Question Generation: In my tests, the chat solution wasn’t always successful in generating a good standalone question until the question generator prompt was tweaked. For example, for a follow-up question, “Tell me about the second book”, more often than not the generated standalone question was “What can you tell me about the second book?”, which is not particularly meaningful in itself and led to random search results and consequently a seemingly random generated response from the LLM.

Potential Issue #2 — Changing Search Results Between Original & Follow-up Questions: It is noteworthy that even though the second generated question specifically names the book of interest, the returned results from the vector DB search include other book results, and more importantly, these search results are different from those for the original question! In this example, the change in search results was desirable since the question changed from “detective novel recommendations” to a particular novel. However, when a user is asking follow-up questions intending to dig deeper into a topic, variations in question formulation or in the LLM-generated standalone question may lead to different search results or a different ranking of search results, which might not be desirable.

This issue can possibly be mitigated automatically, at least to a degree, by doing a broader initial search of the vector DB — returning many results instead of just 4–5 as in our example — and re-ranking them so that the most relevant results bubble up to the top and are always sent to the LLM to generate the final response (see Cohere’s ‘Reranking’). Besides, it should be relatively straightforward for an app to recognize that search results have changed. It might be possible to apply some heuristics around whether the degree of change in search results (measured by ranking and overlap metrics) is commensurate with the degree of change in the question (measured by distance metrics such as cosine similarity). At least in cases where there are unexpected swings in search results over chat turns, the end user could be alerted and brought into the loop for a closer inspection, depending on the use case criticality and the training or sophistication of end users. A small sketch of such an overlap check follows.
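As an illustration of that kind of heuristic, the sketch below compares the chunks retrieved in two consecutive turns using a simple overlap ratio; the identity shortcut and the 0.5 threshold are arbitrary choices for demonstration only.

# Illustrative heuristic: flag large swings in retrieved context between consecutive chat turns
def context_overlap(previous_chunks, current_chunks):
    prev_ids = {doc.page_content[:80] for doc in previous_chunks} # Crude chunk identity: first 80 characters
    curr_ids = {doc.page_content[:80] for doc in current_chunks}
    if not prev_ids or not curr_ids:
        return 0.0
    return len(prev_ids & curr_ids) / len(prev_ids | curr_ids) # Jaccard overlap between the two sets

# Example usage, assuming the source documents from the previous turn were kept in `previous_result`
# overlap = context_overlap(previous_result['source_documents'], result['source_documents'])
# if overlap < 0.5:
#     print(f"Retrieved context changed substantially between turns (overlap = {overlap:.2f})")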

Another idea to control this behavior is to leverage the LLM to decide whether a follow-up question requires going to the vector DB again, or can the question be meaningfully answered with previously fetched results. Some use cases might want to generate two sets of search results and responses and let the LLM adjudicate between the answers, some others might be justified in passing the responsibility of controlling context to users by empowering them to freeze context (depending on the use case, user training or sophistication, and other considerations), and some others might simply be tolerant of changing search results over follow-up questions.

As you can probably tell, it is quite easy to get a basic solution working, but getting things just right — that’s the hard part. The issues called out here just scratch the surface. Alright, back to the main exercise …

5. Add a pre-coded UI

Finally, the chatbot’s functionality is ready. Now, we can add a nice user-interface to improve user experience. This is (somewhat) easily possible due to Python libraries such as Gradio and Streamlit, which build front-end widgets based on instructions written in Python. Here, we will go with Gradio to quickly create a user interface.

With the dual objectives of catching up anyone who was not able to execute the code so far, and of demonstrating some variations in getting to the same place, the following two blocks of code are self-contained and can be run in a completely new Colab notebook to generate the complete chatbot.

# Initial setup - Install necessary software in code environment
!pip install openai tiktoken langchain chromadb html2text gradio # Skip or comment out this line if you already installed these packages earlier in the notebook

# Import packages needed to enable different functionality for the solution
from langchain.document_loaders import AsyncHtmlLoader # To load website content into a document
from langchain.text_splitter import MarkdownHeaderTextSplitter # To split documents into smaller chunks by document headings
from langchain.document_transformers import Html2TextTransformer # To convert HTML to Markdown text
from langchain.chat_models import ChatOpenAI # To use OpenAI's LLM
from langchain.prompts import PromptTemplate # To formulate instructions / prompts
from langchain.chains import RetrievalQA, ConversationalRetrievalChain # For RAG
from langchain.memory import ConversationTokenBufferMemory # To maintain chat history
from langchain.embeddings.openai import OpenAIEmbeddings # To convert text to numerical representation
from langchain.vectorstores import Chroma # To interact with vector database
import pandas as pd, gradio as gr # To show data as tables, and to build UI respectively
import chromadb, json, textwrap # Vector database, converting json to text, and prettify printing respectively
from chromadb.utils import embedding_functions # Setting up embedding function, following protocol required by Chroma

# Add the OpenAI API Key to a variable
# Saving the key in a variable like so is bad practice. It should be loaded into environment variables and loaded from there, but this is okay for a quick demo
OPENAI_API_KEY='sk-4f3a9b8e7c4f4c8f8f3a9b8e7c4f4c8f-UsH4C3vE64' # Fake Key - use your own real key here

Before running the next set of code to render the chatbot UI, note that when rendered through Colab, the app becomes publicly accessible for 3 days for anyone with the link (the link is provided in the Colab notebook cell output). In theory, the app can be kept private by changing the last line in the code to demo.launch(share=False), but I was not able to get the app to work at all then. Instead, I prefer running it in ‘debug’ mode in Colab, so the Colab cell stays “running” until stopped, which then terminates the chatbot. Alternatively, run the code shown below in a different Colab cell to terminate the chatbot and delete content loaded to the Chroma vector DB within Colab.

# To be run at the end to terminate the demo chatbot
demo.close() # To end the chat session and terminate the shared demo

# Retrieve and delete the vector DB collection created for the chatbot
vectordb = Chroma(client=persistent_chroma_client, collection_name="my_rag_demo_collection", embedding_function=openai_embedding_func_for_langchain)
vectordb.delete_collection()

Below is the code to run the chatbot as an app. Most of it reuses the code up to this point of the article, so it should seem familiar. Note that there are some differences compared to the earlier code, including but not limited to there being no memory management using LangChain’s ‘token’ memory object that we used before. This means that if the conversation continues for a while, the history will eventually become too long to fit into the language model’s context, and the app will need a restart.

# Initiate OpenAI embedding functions. There are two because the function protocol is different when passing the function to Chroma DB directly vs using it with Chroma DB via LangChain
openai_embedding_func_for_langchain = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
openai_embedding_func_for_chroma = embedding_functions.OpenAIEmbeddingFunction(api_key=OPENAI_API_KEY)

# Initiate the LangChain chat model object using the GPT 3.5 turbo model
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0, openai_api_key=OPENAI_API_KEY)

# Initialize vector DB and create a collection
persistent_chroma_client = chromadb.PersistentClient()
collection = persistent_chroma_client.get_or_create_collection("my_rag_demo_collection", embedding_function=openai_embedding_func_for_chroma)

# Function to load website content into the vector DB
def load_content_from_url(url):
    # Load HTML from URL and transform to a more readable text format
    docs = Html2TextTransformer().transform_documents(AsyncHtmlLoader(url).load())
    # Split docs by section
    headers_to_split_on = [ ("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3"), ("####", "Header 4"), ("#####", "Header 5") ]
    chunks = MarkdownHeaderTextSplitter(headers_to_split_on = headers_to_split_on).split_text(docs[0].page_content)
    # Connect to the previously created instance of the ChromaDB database
    vectordb_collection = persistent_chroma_client.get_or_create_collection("my_rag_demo_collection", embedding_function=openai_embedding_func_for_chroma)
    # Add data to the vector DB; specifically to our collection in the vector DB
    cur_max_id = vectordb_collection.count()
    vectordb_collection.add(ids=[str(t) for t in range(cur_max_id+1, cur_max_id+len(chunks)+1)],
                            documents=[t.page_content for t in chunks],
                            metadatas=[None if len(t.metadata) == 0 else t.metadata for t in chunks]
                            )
    # Alert the user that content is loaded and ready to be queried
    gr.Info(f"Website content loaded. Vector DB now has {vectordb_collection.count()} chunks")
    return

# Define the UI and the chat function
with gr.Blocks() as demo:

    # Function to chat with the language model using documents for context
    def predict(message, history):
        # Connect to the previously created instance of the ChromaDB database using a LangChain ChromaDB client
        langchain_chroma = Chroma(client=persistent_chroma_client, collection_name="my_rag_demo_collection", embedding_function=openai_embedding_func_for_langchain)
        # Convert to LangChain chat history format - a list of tuples rather than a list of lists
        langchain_history_format = []
        for human, ai in history:
            langchain_history_format.append((human, ai))
        # We are now defining the ConversationalRetrieval chain, starting with the prompts used
        standalone_question_generator_template = """Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question, in its original language. Be as explicit as possible
in formulating the standalone question, and include any context necessary to clarify the standalone question.

Conversation: {chat_history}
Follow Up Question: {question}
Standalone question:"""
        updated_condense_question_prompt = PromptTemplate.from_template(standalone_question_generator_template)
        # Let's rebuild the final prompt (again, optional since LangChain uses a default prompt, though it might be a little different)
        final_response_synthesizer_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that the available information is not sufficient to answer the question.
Don't try to make up an answer. Keep the answer as concise as possible, limited to five sentences.
{context}
Question: {question}
Helpful Answer:"""
        custom_final_prompt = PromptTemplate.from_template(final_response_synthesizer_template)
        # Define the chain
        qa_chain = ConversationalRetrievalChain.from_llm(llm, retriever=langchain_chroma.as_retriever(),
                                                         return_source_documents=True, return_generated_question=True,
                                                         condense_question_prompt=updated_condense_question_prompt,
                                                         combine_docs_chain_kwargs={"prompt": custom_final_prompt})
        # Execute the chain
        gpt_response = qa_chain({"question": message, "chat_history": langchain_history_format})
        # Add the human message and LLM response to chat history
        langchain_history_format.append((message, gpt_response['answer']))
        return gpt_response['answer']

    gr.Markdown(
"""
# Chat with Websites
### Enter URL to extract content from website and start question-answering using this Chatbot
"""
    )
    with gr.Row():
        url_text = gr.Textbox(show_label=False, placeholder='Website URL to load content', scale = 5)
        url_submit = gr.Button(value="Load", scale = 1)
        url_submit.click(fn=load_content_from_url, inputs=url_text)

    with gr.Row():
        gr.ChatInterface(fn=predict)

demo.launch(debug=True)

You can play around with the app by giving it a different URL to load content from. It goes without saying: this is not a production-grade app, and was only created to demonstrate the building blocks of RAG-based GenAI solutions. It is an early prototype at best, and if it were to be converted into a regular product, most of the software engineering work would still lie ahead.

Revisiting FAQs from the Introduction

With the context and knowledge of the chatbot we created, let’s revisit some of the questions posed in the Introduction and dive just a little bit deeper.

  1. Is our use case a good fit for LLM-powered solutions? Perhaps traditional analytics, supervised machine learning, or another approach is a better fit?
    LLMs are good at “understanding” language-related tasks as well as following instructions. So, the early use cases for LLMs have been question-answering, summarization, generation (of text in this case), enabling better meaning-based search, sentiment analysis, coding, etc. LLMs have also picked up the ability to problem-solve and reason. For example, LLMs can act as automated graders for students’ assignments if you provide them with an answer key, or sometimes even without one.
    On the other hand, predictions or classifications based on a large number of data points, multi-armed bandit experiments for marketing optimization, recommender systems, reinforcement learning systems (Roomba, Nest thermostat, optimizing power consumption or inventory levels, etc.) are the forte of other types of analytics or machine learning … at least for the time being. Hybrid approaches where traditional ML models feed information to LLMs and vice-versa should also be considered as a holistic solution to a core business problem.
  2. If LLMs are the way to go, can our use case be addressed by an off-the-shelf product (say, ChatGPT Enterprise) now or in the near-future? Classic build-vs-buy decision.
    Services and products offered by OpenAI, AWS, and others are going to grow broader, better, and possibly cheaper. For example, ChatGPT lets users upload their files for analysis, Bing Chat and Google’s Bard let you point to external websites for question answering, AWS Kendra brings semantic search to an enterprise’s information, and Microsoft Copilot brings LLMs to Word, PowerPoint, Excel, etc. For the same reason that companies do not build their own operating systems or their own databases, companies should think about whether they need to build AI solutions that might be made obsolete by current and future off-the-shelf products. On the other hand, if a company’s use cases are specific or restrictive in some sense — such as not being able to send sensitive data to any vendor, or due to regulatory guidance — then it might be necessary to build generative AI products within the company. Products that use an LLM’s reasoning ability but undertake tasks or generate outputs too distinct from vended solutions might also warrant in-house development. For example, a system that monitors the factory floor, a manufacturing process, or inventory levels might warrant custom development, especially if there are no good domain-specific product offerings. Also, if the application requires specialized domain knowledge, then an LLM fine-tuned on domain-specific data is likely to outperform a general-purpose LLM from OpenAI, and in-house development could be considered.
  3. What are the different building blocks of our LLM-powered product? Which of these are commoditized, and which are likely to need more time to build and test?
    The high-level building blocks for a RAG solution like the one we built are the data pipeline, vector database, retrieval, generation, and of course the LLM. There are lots of great choices for LLMs and for vector databases. The data pipelines, retrieval, and prompt engineering for generation will require some good old-fashioned data-science experimentation to optimize for a use case. Once an initial solution is in place, productionization will require a lot of work, which is true of any data science / machine learning pipeline. This talk offers hard-earned wisdom on the topic of productionization: LLMs in Production: Learning from Experience, Dr. Waleed Kadous, Chief Scientist, AnyScale
  4. How do we measure the performance of our solution? What levers are available to improve the quality of outputs from our product?
    As with any technology (or non-technology) solution, business impact should be measured using leading KPIs. Direct business metrics that are difficult to measure are often replaced by surrogate metrics such as daily active users (DAU) and other product metrics.
    Business metrics should be complemented with technical metrics evaluating the performance of the RAG solution. The overall quality of the response (how good the system’s response is compared to the best response from an expert human or a state-of-the-art frontier model such as GPT-4, currently) can be evaluated using a range of metrics that test for informativeness, factualness, relevance, toxicity, etc. It also helps to delve deeper into the performance of individual components to iterate and improve each: the quality of information that the solution will use as context, retrieval, and generation.
    i. How good is the data quality? If data available to an organization and stored in the vector database doesn’t have the required information, no human or LLM can conjure a response based on it.
    ii. How good is the retrieval? Assuming the information is available, how successful is the system in finding and fetching the relevant bits?
    iii. How good is the generation (i.e. synthesis)? Assuming the information is available, retrieved correctly, and passed on to the LLM to generate the final response, is the LLM using the information as expected?
    Each of these areas could be evaluated separately and improved concurrently to improve the overall output.
    Improve Data quality: Companies need to work on data pipelines to feed good information into the system. If the information in the vector database is of poor quality, having great LLMs will not improve the outputs drastically. In addition to employing traditional data quality and governance frameworks, companies should also consider improving the quality of chunking (more on this in the next question’s response).
    Improve Retrieval: Retrieval could be improved by trying different retrieval algorithms, semantic re-ranking, hybrid search combining semantic search and keyword search, and fine-tuning embeddings (a rough hybrid-search sketch follows this list). Improving instructions / prompts should also contribute to the quality of retrieval.
    Improve Generation: As LLMs improve, the synthesis step will improve, and possibly retrieval too due to improved embedding models. Another option, assuming resource & time availability, is fine-tuning, which can improve the quality of responses for specific domains and tasks. For example, a smaller model fine-tuned on diagnosing specific medical conditions might be better at the task than a general-purpose model like GPT-4, while also being faster and cheaper.
  5. Is our data quality acceptable for the use case? Are we organizing our data correctly, and passing relevant data to the LLM?
    Data quality can be assessed with the traditional data quality & governance frameworks. Additionally for LLM-powered solutions, the information required by LLMs to answer user questions or carry out tasks should be available within the data available to the solution.
    Assuming the data is available, it should be chunked appropriately for the use case and the LLM being used. Chunks shouldn’t be so broad that they dilute coherence with respect to a specific topic, nor so narrow that they omit necessary context. Data also shouldn’t be split in a way that leaves necessary context spread across chunks that are meaningless when separated. For example, consider the two sentences below being split into two chunks,
    “OpenAI’s GPT-3.5 is a powerful LLM. It can support context sizes up to 16K tokens.”
    A question such as “Tell me about the GPT-3.5 LLM” may not fetch the second sentence, since it doesn’t mention GPT-3.5, and so that information might not be surfaced to the user, purely because of suboptimal chunking. More dangerously, the second sentence might still be fetched when a user asks about a completely different LLM, due to the semantic association of context sizes and tokens with LLMs, and the response might claim that the other model supports context sizes of up to 16K tokens, which would be factually inaccurate. This is a simplified example unlikely to be encountered in production, but the idea holds.
    One possible approach to improving the quality of chunks is to use context-aware text splitting, such as splitting by logical sections (as in our example of the book list). If any logical chunk is still too big (a Wikipedia page on a particular topic could be quite lengthy, for instance), it could be split further by logical sections or by semantic units such as paragraphs, with a meaningful overlap between chunks, while ensuring that the overall metadata and chunk-specific metadata is passed to the LLM.
  6. Can we be confident that the LLM’s responses will always be factually accurate? Or will our solution ‘hallucinate’ every once in a while when generating responses?
    A key selling point of RAG is that it drives factuality. GPT-3.5 and GPT-4 are good at following this instruction: “respond only from the provided context or say ‘the question cannot be answered based on the information provided’”. This is hypothesized to be due to the extensive reinforcement learning from human feedback (RLHF) conducted by OpenAI. As a corollary, other LLMs might not currently be as good at following such instructions. For a production application, especially an external-facing one, it is prudent to conduct a lot of testing aimed at validating that the generated output is faithful to the context retrieved from the vector database, even when the LLM asserts that it is. Approaches range from manual tests on samples, to using a powerful model such as GPT-4 to test samples of retrieved context and responses generated by other models, to using services and products such as Galileo which focus on detecting LLM hallucinations in real time.
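To give a concrete flavor of one of the retrieval levers mentioned under “Improve Retrieval” above, here is a rough, untested sketch of hybrid search that blends keyword matching with the semantic retriever from our walkthrough; the BM25Retriever typically requires the `rank_bm25` package, and the weights are arbitrary values that would need tuning.

# A rough sketch of hybrid retrieval: blend keyword (BM25) search with the semantic retriever
# Assumes `pip install rank_bm25`; weights are arbitrary and would need tuning for a real use case
from langchain.retrievers import BM25Retriever, EnsembleRetriever

keyword_retriever = BM25Retriever.from_documents(chunks) # Classic keyword-based scoring over the same chunks
keyword_retriever.k = 4
semantic_retriever = vectordb.as_retriever(search_kwargs={"k": 4}) # The embedding-based retriever from earlier

hybrid_retriever = EnsembleRetriever(retrievers=[keyword_retriever, semantic_retriever], weights=[0.5, 0.5])
# hybrid_retriever.get_relevant_documents("Can you recommend a few detective novels?")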

Conclusion

Had you known all of this 11 months ago, it would have justified a demonstration to the CEO of your company, possibly even a TED talk to a wider audience. Today, this has become part of the AI literacy baseline, especially if you are involved in delivering generative AI products. Hopefully, you’re fairly caught up thanks to this exercise! 👍

A few closing thoughts,

  • There is serious promise in the technology — how many other technologies can “think” to this degree and be used as “reasoning engines” (in the words of Dr. Andrew Ng)?
  • While frontier models (currently, GPT-4) will continue to advance, open source models and their domain-specific and task specific fine-tuned variants will be competitive on numerous tasks and will find many applications.
  • For better or worse, this cutting-edge technology that took millions (hundreds of millions?) of dollars to develop is available for free — you could fill out a form and download Meta’s capable Llama 2 model with a very permissive license. Nearly 300,000 baseline LLMs or their fine-tuned variants are on Hugging Face’s model hub. Hardware is also commoditized.
  • OpenAI models are now capable of being aware of and using “tools” (functions, APIs, etc.), letting solutions interface with not just humans and databases, but with other programs. LangChain and other packages have already demonstrated using LLMs as the “brain” for autonomous agents that can accept input, decide what action to take, and follow through, repeating these steps until the agent reaches its goal. Our simple chatbot used two LLM calls in a deterministic sequence — generate standalone question, and synthesize search results into a coherent natural language response. Imagine what hundreds of calls to rapidly evolving LLMs with agentic autonomy can achieve!
  • These rapid advancements are a result of tremendous momentum around GenAI, and it will proliferate through enterprises and day-to-day life via our devices: first in simpler ways, but later in increasingly sophisticated applications that leverage the reasoning and decision-making capability of the technology, blending it with traditional AI.
  • Finally, now is a great time to get involved as the playing field is fairly level, at least for applying this technology — everyone is learning about this at more or less the same time since the ChatGPT boom in Dec 2022. Things are of course different on the R&D side, with Big Tech companies that have spent years, and billions of dollars in developing this technology. Regardless, to build more sophisticated solutions later, it’s the perfect time to get started now!


Constant Learner. Enthusiastic about AI, business, personal finance, and their crossover episodes! Would love to collaborate: nsohoni@chicagobooth.edu