
Build a Transparent Question-Answering Bot for Your Documents with LangChain and GPT-3

A guide to developing an informative QA bot that displays the sources it used

Photo by Justin Ha on Unsplash.

A Question Answering system can be of great help in analyzing large amounts of your data or documents. However, the sources (i.e., parts of your document) that the model used to create the answer are usually not shown in the final answer.

Understanding the context and origin of responses is valuable not only for users seeking accurate information, but also for developers wanting to continuously improve their QA bots. With the sources included in the answer, developers gain valuable insights into the model’s decision-making process, facilitating iterative improvements and fine-tuning.

This article shows how to use LangChain and GPT-3 (text-davinci-003) to create a transparent Question-Answering bot that displays the sources used to generate the answer, based on two examples.

In the first example, you’ll learn how to create a transparent QA bot that leverages your website’s content to answer questions. In the second example, we’ll explore the use of transcripts from different YouTube videos, both with and without timestamps.

Process the Data and Create a Vector Store

Before we can leverage the capabilities of an LLM like GPT-3, we need to process our documents (e.g., website content or YouTube transcripts) into the correct format (first chunks, then embeddings) and store them in a vector store. Figure 1 below shows the process flow from left to right.

Figure 1. Process flow of data processing and the creation of a vector store (image by author).

Website content example

In this example, we’ll process the content of the web portal, It’s FOSS, which specializes in Open Source technologies, with a particular focus on Linux.

First, we need to obtain a list of all the articles we wish to process and store in our vector store. The code below reads the sitemap-posts.xml file, which contains a list of links to all the articles.

import xmltodict
import requests

r = requests.get("https://news.itsfoss.com/sitemap-posts.xml")
xml = r.text
rss = xmltodict.parse(xml)

article_links = [entry["loc"] for entry in rss["urlset"]["url"]]

At the time of writing, the list contained over 969 links to articles.

With the list of links, we can now write a little helper function called extract_content that uses BeautifulSoup to extract specific elements from the article’s page containing the relevant content.

from bs4 import BeautifulSoup
from tqdm.notebook import tqdm

def extract_content(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, features="html.parser")

    elements = [
        soup.select_one(".c-topper__headline"),
        soup.select_one(".c-topper__standfirst"),
        soup.select_one(".c-content"),
    ]

    text = "".join([element.get_text() for element in elements])

    return text

articles = []
# Limited the list of > 900 articles to 10 for this example
for url in tqdm(article_links[0:10], desc="Extracting article content"):
    articles.append({"source": url, "content": extract_content(url)})

In the loop above, we iterate over the list of links and apply our helper function extract_content to each URL. For demonstration purposes, I have limited the list to 10 elements. If you want to crawl all articles, simply remove the [0:10] slice from article_links[0:10].

The articles list now contains, for each article, a dictionary with the "source" (link to the article) and "content" (content of the article). The link to the article will be displayed later as the source in the final answer.

Since GPT-3 comes with a token limit (4,096 tokens), it makes sense to split long articles into chunks. These chunks will later be combined with a prompt and sent to GPT-3.

The code below splits up the content of the articles into several chunks.

from langchain.text_splitter import RecursiveCharacterTextSplitter

rec_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, 
                                              chunk_overlap=150)

web_docs, meta = [], []

for article in tqdm(articles, desc="Splitting articles into chunks"):
    splits = rec_splitter.split_text(article["content"])
    web_docs.extend(splits)
    meta.extend([{"source": article["source"]}] * len(splits))

We use the RecursiveCharacterTextSplitter here because it tries to keep semantically related content (paragraphs, then sentences, then words) together for as long as possible.
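
If you want to sanity-check the chunking before building the vector store, a quick look at the results can help. The snippet below is purely illustrative and only assumes the web_docs and meta lists created above.

# Illustrative sanity check: how many chunks were created and what they look like
print(f"Number of chunks: {len(web_docs)}")
print(f"Source of first chunk: {meta[0]['source']}")
print(web_docs[0][:300])  # first 300 characters of the first chunk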

Once this is done, all we have to do is execute the following line to store the articles and their sources in our vector store.

import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

os.environ["OPENAI_API_KEY"] = "YOUR KEY"

article_store = FAISS.from_texts(
    texts=web_docs, embedding=OpenAIEmbeddings(), metadatas=meta
)

For this example, we use FAISS as the vector store and OpenAIEmbeddings as the embedding model. Of course, you could also explore other options, such as Chroma for the vector store or embedding models from Hugging Face, as sketched below.
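
As a rough idea of what such a swap could look like, the sketch below replaces FAISS with Chroma and OpenAIEmbeddings with a sentence-transformers model from Hugging Face. The model name is just an example, and chromadb plus sentence-transformers would need to be installed.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Hypothetical alternative setup: Hugging Face embeddings + Chroma vector store
hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
article_store_alt = Chroma.from_texts(
    texts=web_docs, embedding=hf_embeddings, metadatas=meta
)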

Note: You can also persist your vector store by running article_store.save_local("your_name") so you don’t have to recreate it every time you use it. See here for more details.
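
For reference, persisting and reloading the index could look roughly like this (the folder name is just an example):

# Persist the FAISS index once ...
article_store.save_local("article_store_index")

# ... and reload it in a later session instead of recreating it
article_store = FAISS.load_local("article_store_index", OpenAIEmbeddings())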

If you are not interested in processing YouTube transcripts, you can skip the part below and jump to the section "Run Transparent Question Answering".

YouTube transcript example

The transcripts can be processed in two independent ways. The first option demonstrates how to process YouTube transcripts while preserving the links to the videos as sources (e.g., https://youtu.be/XYZ).

The second option does the same but shows how to preserve the links including timestamps (e.g., https://youtu.be/XYZ&t=60) for more granular information.

For both approaches, we use the transcripts of four XGBoost videos from the StatQuest channel (see the yt_ids list in the code below).

YouTube transcript example (without timestamps)

The first part is very straightforward. The code below utilizes LangChain’s DocumentLoader YoutubeLoader, which incorporates youtube-transcript-api and pytube.

from langchain.document_loaders import YoutubeLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
import os

os.environ["OPENAI_API_KEY"] = "YOUR KEY"

yt_ids = [
    "OtD8wVaFm6E",  # XGBoost Part 1 (of 4): Regression
    "8b1JEDvenQU",  # XGBoost Part 2 (of 4): Classification
    "ZVFeW798-2I",  # XGBoost Part 3 (of 4): Mathematical Details
    "oRrKeUCEbq8",  # XGBoost Part 4 (of 4): Crazy Cool Optimizations
]

yt_docs = []

for yt_id in tqdm(yt_ids, desc="Retrieving transcripts"):
    splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=150, 
                                     separator=" ")
    yt_loader = YoutubeLoader(yt_id, add_video_info=True)
    yt_docs.extend(yt_loader.load_and_split(splitter))

To avoid conflicts with the token limit, we split the data into several chunks using the CharacterTextSplitter. The add_video_info parameter is set to True so that we also receive the video’s title and author.

The returned chunked transcripts are document objects. Before creating embeddings and storing them in a vector store, we manipulate or extend their metadata by adding information about the title, author, and link to the video.

# Manipulate / extend source attribute
for doc in yt_docs:
    doc.metadata["source"] = (
        doc.metadata["title"]
        + " ["
        + doc.metadata["author"]
        + "] "
        + "https://youtu.be/"
        + doc.metadata["source"]
    )

# Vector store
yt_store = FAISS.from_documents(yt_docs, OpenAIEmbeddings())

YouTube transcript example (with timestamps)

The second way is a bit more sophisticated. Here we retrieve the transcript with a different package, youtube-transcript-api, since LangChain’s YoutubeLoader does not return timestamps. The output is a list of dictionaries containing the text, start time, and duration.

An example can be seen here:

[
 {'text': "gonna talk about XG boost part 1 we're",
  'start': 14.19,
  'duration': 6.21},
 {'text': 'gonna talk about XG boost trees and how',
  'start': 17.91,
  'duration': 6.66},
...
]

Creating a document object out of each text entry would not make much sense, as the entries are too short (e.g., 8 words per entry in the example above) to be useful later. When searching the vector store, only a limited number of matching documents (e.g., 4) are returned, and such short snippets would not carry enough information.

Therefore, we need to aggregate or join the text entries into a proper chunk of text first. The code snippet below contains a custom helper function.

import pandas as pd

# Aggregate raw transcript entries into 3-minute chunks with timestamped source links
def create_transcript_df(yt_transcript: list, yt_id: str):
    return (
        pd.DataFrame(yt_transcript)
        .assign(start_dt=lambda x: pd.to_datetime(x["start"], unit="s"))
        .set_index("start_dt")
        .resample("3min")
        .agg({"text": " ".join})
        .reset_index()
        .assign(start_dt=lambda x: x["start_dt"].dt.minute * 60)
        .assign(
            source=lambda x: "https://youtu.be/"
            + yt_id
            + "&t="
            + x["start_dt"].astype("str")
        )
        .drop(columns=["start_dt"])
    )

The helper applies resampling to adjust the frequency of the time dimension to 3-minute steps. In other words, it merges the transcript entries into 3-minute blocks of text and attaches a source link with the corresponding timestamp. With this function in hand, we can now fetch and process the transcripts.

from youtube_transcript_api import YouTubeTranscriptApi

yt_ids = [
    "OtD8wVaFm6E",  # XGBoost Part 1 (of 4): Regression
    "8b1JEDvenQU",  # XGBoost Part 2 (of 4): Classification
    "ZVFeW798-2I",  # XGBoost Part 3 (of 4): Mathematical Details
    "oRrKeUCEbq8",  # XGBoost Part 4 (of 4): Crazy Cool Optimizations
]
transcript_dfs = []
for yt_id in tqdm(yt_ids, desc="Fetching transcription"):
    yt_transcript = YouTubeTranscriptApi.get_transcript(yt_id)
    transcript_dfs.append(create_transcript_df(yt_transcript, yt_id))

transcripts_df = pd.concat(transcript_dfs).reset_index(drop=True)

An excerpt of the outcome can be seen in the figure below.

Figure 2. Excerpt of transcripts_df (image by author).

Since the merged 3-minute parts could now cause issues with the token limits, we need to process them with a splitter again before generating embeddings and storing them in our vector store.

from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
import os

os.environ["OPENAI_API_KEY"] = "YOUR KEY"

text_splitter = CharacterTextSplitter(separator=" ", chunk_size=1500, 
                                      chunk_overlap=150)

yt_docs, yt_meta = [], []

for index, row in tqdm(transcripts_df.iterrows(), total=len(transcripts_df)):
    splits = text_splitter.split_text(row["text"])
    yt_docs.extend(splits)
    yt_meta.extend([{"source": row["source"]}] * len(splits))
    print(f"Split {row['source']} into {len(splits)} chunks")

yt_ts_store = FAISS.from_texts(yt_docs, OpenAIEmbeddings(), metadatas=yt_meta)

Run Transparent Question Answering

With our filled vector store, we can now focus on the transparent question answering. The figure below gives an overview of the process.

Figure 3. Overview of Transparent Question Answering Process (image by author).

We start by defining a question, which is then converted by the embedding model or API into an embedding. The vector store utilizes this question embedding to search for ‘n’ (default: 4) similar documents or chunks in the storage. Subsequently, the content of each document or chunk is combined with a prompt and sent to GPT-3.

The results returned from GPT-3 are then combined with another prompt in a final step and sent back to GPT-3 once more to obtain the final answer, including sources.
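
To get a feel for the retrieval step in isolation, you can query the vector store directly before wiring up the full chain. The snippet below is just an illustration using the article_store created earlier; it lists the sources of the 4 most similar chunks for a sample question.

# Illustrative only: inspect which chunks the retriever would hand to GPT-3
similar_docs = article_store.similarity_search("What is Skiff?", k=4)

for doc in similar_docs:
    print(doc.metadata["source"])
    print(doc.page_content[:200], "\n")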

Website content example

Before using RetrievalQAWithSourcesChain, we give our bot a memory so it can recall previous turns of the conversation. This enables contextually relevant follow-up interactions with users.

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    input_key="question",
    output_key="answer",
    return_messages=True,
)

To integrate previous chat history into the used prompts, we need to modify the existing template.

from langchain import PromptTemplate

template = """You are a chatbot having a conversation with a human.
Given the following extracted parts of a long document and a question, 
create a final answer.
{context}
{chat_history}
Human: {question}
Chatbot:"""

question_prompt = PromptTemplate(
    input_variables=["chat_history", "question", "context"], template=template
)

Afterward, we can use the RetrievalQAWithSourcesChain to ask questions. In this example, we set k=4, which means the vector store is queried for the 4 most similar documents.

from langchain.chains import RetrievalQAWithSourcesChain
from langchain.llms import OpenAI

article_chain = RetrievalQAWithSourcesChain.from_llm(
    llm=OpenAI(temperature=0.0),
    retriever=article_store.as_retriever(k=4),
    memory=memory,
    question_prompt=question_prompt,
)

result = article_chain({"question": "What is Skiff?"}, 
                        return_only_outputs=True)

The result is returned as a dictionary:

{'question': 'What is Skiff?',
 'answer':   'Skiff is a privacy-focused email service with unique 
              functionalities such as the ability to manage multiple 
              sessions, appearance tweaks, dark mode, white theme, 
              two layouts, supporting imports from Gmail, Outlook, 
              Proton Mail, and more, creating and managing aliases, 
              and connecting a crypto wallet from Coinbase, BitKeep, 
              Brave, and others to send/receive email utilizing Web3. 
              It also includes Pages to create/store documents securely, 
              the ability to use Skiff's server or IPFS (decentralized 
              technology) for file storage, and Skiff Pages, 
              Encrypted Cloud Storage With IPFS Support.\n',
 'sources': 'https://news.itsfoss.com/skiff-mail-review/'}

We can observe that the result contains the sources used to answer the given question. To generate this final answer, the API was called 5 times: 4 times to extract relevant information from the 4 most similar chunks, and 1 additional time to combine those intermediate results into the final answer.
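
If you want to verify the number of calls (and the resulting costs) yourself, LangChain’s OpenAI callback can report the requests and tokens consumed by a chain call. A minimal sketch, assuming the article_chain defined above:

from langchain.callbacks import get_openai_callback

# Wrap the chain call to count requests, tokens, and cost (illustrative)
with get_openai_callback() as cb:
    result = article_chain({"question": "What is Skiff?"},
                           return_only_outputs=True)

print(f"Requests made: {cb.successful_requests}")
print(f"Total tokens:  {cb.total_tokens}")
print(f"Cost (USD):    {cb.total_cost}")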

We can also ask questions referring to the previous question.

article_chain(
    {"question": "What are its functionalities?"},
    return_only_outputs=True,
)

The outcome would then look as follows.

{
'answer': "Skiff offers a range of functionalities, 
including Web3 integration, IPFS decentralized storage, 
creating and managing aliases, connecting crypto wallets, 
getting credits to upgrade your account, importing from Gmail, 
Outlook, Proton Mail, and more, Pages to create/store documents securely, 
encrypted cloud storage with IPFS support, and the ability to use 
Skiff's server or IPFS (decentralized technology) for file storage.\n",
 'sources': 'https://news.itsfoss.com/anytype-open-beta/, 
             https://news.itsfoss.com/skiff-mail-review/'
}

Keep in mind that for this question, the API was also called 5 times.

YouTube transcript example (with and without timestamps)

The code for the YouTube transcript example looks quite similar to the one for the website. First, we initialize the ConversationBufferMemory and create a custom question prompt template.

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    input_key="question",
    output_key="answer",
    return_messages=True,
)

template = """You are a chatbot having a conversation with a human.
    Given the following extracted parts of a long document and a question, 
    create a final answer.
    {context}
    {chat_history}
    Human: {question}
    Chatbot:"""

question_prompt = PromptTemplate(
    input_variables=["chat_history", "question", "context"], template=template
)

Then we create the QA chain with sources.

# Use yt_store for YouTube transcripts without timestamps or
# yt_ts_store with timestamps as sources.
yt_chain = RetrievalQAWithSourcesChain.from_llm(
    llm=OpenAI(temperature=0.0),
    retriever=yt_store.as_retriever(k=4),
    memory=memory,
    question_prompt=question_prompt,
)

Let’s ask a question.

result = yt_chain(
    {
        "question": "What is the difference in building a tree for a "
                    "regression case compared to a classification case?"
    },
    return_only_outputs=True,
)

The result for the example without timestamps:

{'answer': ' The main difference between building a tree for a regression case 
              and a classification case is that in a regression case, the goal 
              is to predict a continuous value, while in a classification case,
              the goal is to predict a discrete value. In a regression case, 
              the tree is built by splitting the data into subsets based on 
              the value of a certain feature, while in a classification case, 
              the tree is built by splitting the data into subsets based on 
              the value of a certain feature and the class label. 
              Additionally, in a regression case, 
              the weights are all equal to one, 
              while in a classification case, the weights are the previous 
              probability times one minus the previous probability.\n',
 'sources': 'XGBoost Part 2 (of 4): Classification [StatQuest with Josh Starmer] https://youtu.be/8b1JEDvenQU, 
             XGBoost Part 3 (of 4): Mathematical Details [StatQuest with Josh Starmer] https://youtu.be/ZVFeW798-2I, 
             XGBoost Part 4 (of 4): Crazy Cool Optimizations [StatQuest with Josh Starmer] https://youtu.be/oRrKeUCEbq8'
}

The result for the example with timestamps:

{'answer': 'The difference in building a tree for a regression case compared 
            to a classification case is that in a regression case, the goal 
            is to predict a continuous value, while in a classification case, 
            the goal is to predict a probability that the drug will be 
            effective. Additionally, the numerator for classification is the 
            same as the numerator for regression, but the denominator 
            contains a regularization parameter. The denominator for 
            classification is different from the denominator for regression, 
            and is the sum for each observation of the previously predicted 
            probability times 1 minus the previously predicted probability. 
            The only difference between building a tree for a regression case 
            and a classification case is the loss function.\n',
 'sources': 'https://youtu.be/ZVFeW798-2I&t=0 
             https://youtu.be/8b1JEDvenQU&t=180
             https://youtu.be/OtD8wVaFm6E&t=0'
}

Conclusion

The combination of LangChain’s RetrievalQAWithSourcesChain and GPT-3 is excellent for enhancing the transparency of Question Answering. As the process overview illustrates (Figure 3), multiple calls to OpenAI are necessary to obtain the final answer.

Depending on your service usage and the number of similar documents you need to process, the number of calls can increase, resulting in higher costs. It’s definitely worth keeping an eye on that, although for hobby projects it shouldn’t be too critical. To keep a closer eye on costs and the prompts sent, you could consider using PromptLayer or TruLens (see the sketch below).
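
As a rough idea of the PromptLayer route, LangChain ships a PromptLayerOpenAI wrapper that can be dropped in place of the plain OpenAI LLM; the tag below is just an example, and a PromptLayer API key is required. Treat this as a sketch rather than a definitive setup.

import os
from langchain.llms import PromptLayerOpenAI

os.environ["PROMPTLAYER_API_KEY"] = "YOUR KEY"

# Drop-in replacement for OpenAI(temperature=0.0); requests are then logged
# in the PromptLayer dashboard under the given tag
llm = PromptLayerOpenAI(temperature=0.0, pl_tags=["qa-bot-with-sources"])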

The Colab notebooks can be found here:

Sources

Owners or creators have been asked in advance whether I am allowed to use their content/data as examples for this article.

