
🦜🔗 LangChain: Question Answering Agent over Docs

Learn about embeddings and agents to build a QA application

Photo by Mike Alonzo on Unsplash

Introduction

One of the most common use cases in NLP is question answering over documents. For example, imagine feeding one or more PDF files to the machine and then asking questions about their content. This can be useful if you are preparing for a university exam and want to ask the machine about things you didn’t understand. A more advanced use case would be to have the machine ask you questions, as a sort of mock oral exam.

A lot of research has been done on this task and many tools have been developed, but today we can also leverage the power of Large Language Models (LLMs), such as GPT-3 from OpenAI and beyond.

LangChain is a fairly recent library that lets us build and manage applications based on LLMs. In fact, the LLM is just one part of a much more complex AI architecture: when we create a system like this, we don’t just query the OpenAI models and get a response; we also need, for example, to store that response, structure the prompt properly, and so on.

In this article, we will see how to build a simple Question Answering over Docs application using LangChain and OpenAI.

Embeddings

In this application, we will make use of a library called ChromaDB, an open-source library that allows us to store embeddings. At this point a question might arise: what exactly are embeddings?

An embedding is nothing more than the projection of a word (or a piece of text) into a vector space.

Let me explain with a simpler example. Suppose we have the following words available: "king," "queen," "man," and "woman."

From experience, we intuitively understand the distances between these words. For example, "man" is conceptually closer to "king" than it is to "queen." But machines are not intuitive: they need data and metrics to work with. So what we do is turn these words into points in a Cartesian space, so that this intuitive notion of distance is represented accurately.

Embedding Example (Image By Author)

In the image above we have a dummy example of embeddings. We see that "man" is much closer to "king" than to the other words, and the same is true for "woman" and "queen."

Another interesting thing is that the distance between "man" and "king" is the same as that between "woman" and "queen." So, somehow, this embedding really seems to have captured the essence of these words.

Another thing to specify is the metric, i.e. how the distance is measured. In most cases, this is measured using cosine similarity, that is, the cosine of the angle between two embeddings.

Cosine Similarity (Image By Author)

The embedding in my example has only two dimensions, two axes. But the embeddings that are created by modern algorithms like BERT have hundreds or thousands of axes, so it’s hard to understand why the algorithm put text at a particular point in space.

In this demo, you can navigate a real embedding space in 3 dimensions and see how the words are near or far from each other.
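To make this concrete, here is a minimal sketch of how cosine similarity could be computed between toy 2D word vectors. The numbers are made up purely for illustration (they are not real embeddings), and NumPy is assumed to be available.

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: a·b / (|a| |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy 2D "embeddings", made-up numbers roughly matching the picture above
embeddings_2d = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.7, 0.9]),
    "woman": np.array([0.7, 0.1]),
}

print(cosine_similarity(embeddings_2d["man"], embeddings_2d["king"]))   # high similarity
print(cosine_similarity(embeddings_2d["man"], embeddings_2d["queen"]))  # lower similarity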

Let’s code!

First of all, we need to install some libraries. We will certainly need LangChain and OpenAI to instantiate and manage LLMs.

After that, we will install ChromaDB and tiktoken (the latter is required to successfully install ChromaDB).

!pip install langchain
!pip install openai
!pip install chromadb
!pip install tiktoken

Now we need a text file to work on; our goal is to ask the LLM questions about this file. Downloading a file with Python is very simple and can be done with the following commands.

import requests

text_url = 'https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt'
response = requests.get(text_url)

# let's extract only the text from the response
data = response.text

Now we import all the classes we will need.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

Obviously, to use the OpenAI models you have to enter your personal API key. If you don’t know how to do this, you can check my previous article about LangChain.

import os
os.environ["OPENAI_API_KEY"] = "your_open_ai_key"

In a real application, you will probably have many text files, and you will want the LLM to figure out which of them contains the answer to your question. In this simple example, instead, we break the single text file into multiple parts (chunks) and treat each part as a separate document. The model will have to figure out which chunk contains the answer to our question. We split the text into chunks, assigning each one a maximum length, using the commands below.

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(data)
len(texts)

We see that 64 parts have been created from the original text. We can print the different parts individually since they are contained in a list.

texts[0],texts[1]

Now we create the object that will compute the embeddings of the various text chunks.

embeddings = OpenAIEmbeddings()

But we want to save the embeddings in a persistent database, because recreating them every time we open the application would be a waste of resources. This is where ChromaDB helps us. We can create and save the embeddings from the text chunks, adding metadata for each chunk. In this case, the metadata will be strings that label each chunk with its source.

persist_directory = 'db'
docsearch = Chroma.from_texts(
    texts, 
    embeddings,
    persist_directory = persist_directory,
    metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))]
    )
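Since we passed a persist_directory, we can also make sure the embeddings actually end up on disk and can be reloaded later without recomputing them. Here is a minimal sketch, assuming the classic LangChain Chroma wrapper where persist() flushes the collection to disk and the constructor can reload it from the same directory:

# flush the collection to the persist_directory on disk
docsearch.persist()

# in a later session, reload the saved embeddings instead of recreating them
docsearch = Chroma(
    persist_directory=persist_directory,
    embedding_function=embeddings
)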

Now we want to turn docsearch into a retriever, because that is the role it will play.

from langchain import OpenAI

#convert the vectorstore to a retriever
retriever=docsearch.as_retriever()

We can also check which search type the retriever is using: in this case it is the default, "similarity", i.e. the similarity between embeddings explained in the previous section.

retriever.search_type

Finally, we can ask the retriever to fetch the document that best answers one of our queries. The retriever may also return more than one document if necessary.

docs = retriever.get_relevant_documents("What did the president say about Justice Breyer")

Let’s now see how many documents it retrieved and what they contain.

len(docs)
docs
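By default, the retriever decides how many chunks to return for a query. If we want to control that number explicitly, as_retriever also accepts search parameters; here is a minimal sketch (the value of k is just an illustrative choice):

# retrieve at most k chunks per query (k=2 is only an example value)
retriever = docsearch.as_retriever(search_kwargs={"k": 2})
docs = retriever.get_relevant_documents("What did the president say about Justice Breyer")
len(docs)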

Now we can create an agent. An agent is able to perform a series of steps on its own to solve the user’s task. Our agent will have to look through the documents available to it, find the one containing the answer to the question, and return it along with the answer.

from langchain.chains import RetrievalQAWithSourcesChain

# create the chain to answer questions
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",  # "stuff" puts all retrieved chunks directly into the prompt
    retriever=retriever,
    return_source_documents=True
    )

If we want, we can also create a function to post-process the agent’s output so that it is more readable.

def process_result(result):
  # print the answer followed by the sources it came from
  print(result['answer'])
  print("\n\nSources:", result['sources'])

Now everything is finally ready: we can use our agent to answer our queries!

question = "What did the president say about Justice Breyer"
result = chain({"question": question})
process_result(result)

Final Thoughts

In this article, we introduced LangChain and ChromaDB and gave a brief explanation of embeddings. We saw, with a simple example, how to save the embeddings of several documents, or parts of a document, into a persistent database and retrieve the part needed to answer a user query. If you found this article useful, follow me here on Medium! 😉

The End

Marcello Politi

Linkedin, Twitter, Website

