
Ask Wikipedia ELI5-like Questions Using Long-Form Question Answering on Haystack

Build a Long-Form Question Answering platform using your documents and 26 lines of Python code

Recent advancements in NLP question answering (QA)-based systems have been astonishing. QA systems built on top of the most recent language models (BERT, RoBERTa, etc.) can answer factoid-based questions with relative ease and excellent precision. The task involves finding the relevant document passages containing the answer and extracting the answer by identifying the correct span of word tokens within them.

More challenging QA systems engage with so-called "generative question answering". These systems focus on handling questions where the provided context passages are not simply the source tokens for extracted answers, but provide the larger context to synthesize original answers.

Long-Form QA motivation

Just last week, I was reviewing metric learning, and it occurred to me that it had some similarities with contrastive learning. I didn’t have the time right then to do a deep dive in order to satisfy my curiosity, as much as I wanted to. A QA platform that I could ask, "What is the main difference between metric and contrastive learning?" would have made this tangent quick and fruitful by providing a reliable, detailed answer quickly.

Remember the last time you researched a particular topic on Google, made dozens of queries looking for relevant web results, and then painstakingly synthesized a paragraph-long answer by yourself? What if a QA system could do this for you automatically?

Long-Form Question Answering (LFQA) systems attempt to replicate and automate this arduous activity. As these QA systems are relatively new, researchers train the models for them on a single, publicly available dataset – ELI5 (Explain Like I’m Five). ELI5, sourced from the subreddit /r/explainlikeimfive/, captures the challenge of synthesizing information from multiple web sources and generating answers that a five-year-old would understand.

So, what kind of questions and answers are in the ELI5 dataset? Feel free to take a look at the r/explainlikeimfive/ subreddit or, even better, check out the ELI5-dedicated website at FAIR.

Although the Facebook AI research team, led by Angela Fan, Yacine Jernite, and Michael Auli, publicly released the ELI5 dataset and accompanying language models, there are no readily available QA platforms that allow users to customize such an LFQA system easily. Until now.

LFQA in Haystack

Haystack is an end-to-end, open-source framework that enables users to build robust and production-ready pipelines for various question answering and semantic search cases. Starting with the 0.9.0 release, Haystack supports LFQA as well as the previously supported QA and semantic search scenarios.

Creating your own end-to-end LFQA pipeline with Haystack is simple. LFQA on Haystack consists of three main modules: DocumentStore, Retriever, and Generator. Let’s look more closely into these three modules and how they fit into the LFQA platform.

LFQA components in Haystack and the question answering flow. Image by the author.

DocumentStore

As its name implies, DocumentStore holds your documents. Haystack has several document storage solutions available for different use cases. For LFQA, we need to use one of the vector-optimized document stores, where embedded document vectors represent our documents. Therefore, in our demo, we’ll use FAISSDocumentStore, but we could have easily used any other vector-optimized document store from the Haystack platform, such as Milvus or the recently added Weaviate.

Retriever

Before generating answers for the given query, our QA system needs to find supporting documents. The Retriever module’s job is to find the best candidate documents by calculating the similarity between query and document vectors. To find the documents that best match our query, we’ll use one of Haystack’s dense retrievers – the EmbeddingRetriever. The retriever first passes the query through its language model to get the query embedding. Then, by computing the dot product between the embedded query and each embedded document vector in the document store, we can quickly find the best-matching documents and retrieve them.

We’ll use the already available BERT variant called Retribert, specifically fine-tuned for this query/document matching task. The Retribert language model is publicly available on the HuggingFace model hub, and the details of its training are available here.

Generator

After the retriever returns the most relevant documents for our query, we’re ready to feed the selected documents to the ELI5 BART-based model to generate the answer for the given query. The ELI5 BART language model, also available on the HuggingFace model hub, has a seq2seq architecture (like the models used in machine translation), which Haystack wraps with its Seq2SeqGenerator.

To generate a paragraph-long answer from the supporting documents found by the retriever, we concatenate the query and the supporting document(s) and pass the combined text to the ELI5 BART model as input. The output of the model is our generated answer. For more details on how exactly to train the ELI5 model, please refer to this document.

LFQA demo

Now that we have a better understanding of the significant components necessary to build the LFQA system, let’s build and test it using Haystack!

In the rest of the article, we’ll show you how to quickly create an LFQA deployment scenario using the off-the-shelf components mentioned above. We’ll use the HuggingFace Wiki snippets dataset (100-word passages extracted from Wikipedia articles) as the source documents for our LFQA system. Then we’ll query the system with ELI5-like questions to see what kind of answers we get.

To follow this deployment scenario, you can use Google’s Colaboratory notebook for free GPU access.

Setting up Haystack

We’ll start with the pip install of the required libraries. In our case, all we need are Haystack and the HuggingFace datasets.
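Since we’ll run everything in a Colab notebook, a cell along these lines should suffice. This is a minimal sketch; the exact package extras can differ between Haystack releases.

```python
# Colab cell: install Haystack and the HuggingFace datasets library
# (on newer Haystack releases you may need the FAISS extra: farm-haystack[faiss])
!pip install farm-haystack
!pip install datasets
```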

Initialize the document store

Now that we’ve installed the required libraries and their dependencies, including the HuggingFace transformers and more, we are ready to initialize our QA pipeline. We’ll start with the FAISSDocumentStore to store our documents.
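A minimal sketch of that step, assuming a recent Haystack release (older 0.x versions import from haystack.document_store instead, and newer releases rename vector_dim to embedding_dim):

```python
from haystack.document_stores import FAISSDocumentStore

# 128-dimensional vectors to match Retribert; "Flat" is the default FAISS index type
document_store = FAISSDocumentStore(vector_dim=128, faiss_index_factory_str="Flat")
```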

There is almost no additional explanation needed for this one-liner. We’ll use the default flavor of FAISSDocumentStore with the "Flat" index. We need to initialize the vector_dim parameter to 128 because our Retribert language model encodes queries and documents into a vector with 128 dimensions.

Add Wikipedia documents to DocumentStore

After the FAISSDocumentStore initializes, we’ll load and store our Wikipedia passages. The HuggingFace datasets library offers an easy and convenient way to load enormous datasets like Wiki Snippets. The full Wiki snippets dataset has more than 17 million Wikipedia passages, but we’ll stream only the first one hundred thousand passages and store them in our FAISSDocumentStore.

Now, let’s iterate over the first 100k Wiki snippets and save them to our DocumentStore:
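Here’s one way that loop could look. The wiki_snippets config and field names follow the dataset card on the HuggingFace hub and are assumptions as far as this demo goes; on Haystack 0.x, use the "text" key instead of "content".

```python
from itertools import islice
from datasets import load_dataset

# Stream the snippets so we never materialize all ~17M passages
# (if streaming isn't supported for this config, drop streaming=True and slice the split instead)
wiki_data = load_dataset("wiki_snippets", "wiki40b_en_100_0", split="train", streaming=True)

docs = []
for snippet in islice(wiki_data, 100_000):
    docs.append({
        "content": snippet["passage_text"],            # "text" on Haystack 0.x
        "meta": {"title": snippet["article_title"],
                 "section": snippet["section_title"]},
    })

document_store.write_documents(docs)
```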

Now that all our documents are in the FAISSDocumentStore, we need to initialize our second Haystack component – the Retriever. For LFQA, we’ll use EmbeddingRetriever initialized with the retribert-base-uncased language model we briefly discussed. After retriever initialization, we are ready to calculate the embeddings for each document and store them in the document store.
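A sketch of the retriever setup and the embedding pass, again assuming a recent Haystack release (0.x imports EmbeddingRetriever from haystack.retriever.dense):

```python
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="yjernite/retribert-base-uncased",
    model_format="retribert",
)

# Embed every stored document with Retribert and write the vectors back into FAISS
document_store.update_embeddings(retriever)
```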

Grab a coffee, as it will take approximately fifteen minutes to update the embeddings for all the Wikipedia documents in the FAISSDocumentStore. We could speed up the document embedding process by using a dedicated GPU instance but, for the purpose of the demo, even Colab’s GPU will do just fine.

Test the Retriever

Before we blindly use the EmbeddingRetriever to fetch the documents and pass them to the answer generator, let’s first empirically test it to ensure an example query finds the relevant documents. We’ll use Haystack’s pre-made component for document search – the DocumentSearchPipeline. When you try out your ELI5-like questions, don’t forget that you are using just a tiny slice of one hundred thousand wiki snippets. Use the pipeline below to ensure the topics and the documents you want are in the database before you ask your questions.
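For example, a quick sanity check might look like this. The question is only an illustration, and the params-style run() call reflects the 1.x API; Haystack 0.x takes top_k_retriever=5 instead.

```python
from haystack.pipelines import DocumentSearchPipeline

search_pipeline = DocumentSearchPipeline(retriever)

# Hypothetical test query - confirm the topic is actually covered by our 100k snippets
result = search_pipeline.run(
    query="What does a sloop look like?",
    params={"Retriever": {"top_k": 5}},
)

for doc in result["documents"]:
    print(doc.meta.get("title"), "->", doc.content[:100])
```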

And indeed, the DocumentSearchPipeline does find the relevant documents:

Generator

The final component in our LFQA stack is the Generator. We’ll initialize Haystack’s generic Seq2SeqGenerator with a specific model for LFQA – the bart_eli5 model. Apart from the model, we’ll use the default initialization values for the remaining parameters. You can fine-tune various aspects of the text generation with the other Seq2SeqGenerator constructor parameters; refer to the Haystack documentation for more details. The last thing we need to do is to connect the retriever and the generator in one of the predefined Haystack pipelines – the GenerativeQAPipeline.
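Sketching those last two steps (import paths follow recent Haystack releases; 0.x places Seq2SeqGenerator under haystack.generator.transformers and GenerativeQAPipeline under haystack.pipeline):

```python
from haystack.nodes import Seq2SeqGenerator
from haystack.pipelines import GenerativeQAPipeline

# BART model fine-tuned on ELI5; all other generation parameters stay at their defaults
generator = Seq2SeqGenerator(model_name_or_path="yjernite/bart_eli5")

# Retriever finds supporting documents, generator synthesizes the long-form answer
pipeline = GenerativeQAPipeline(generator=generator, retriever=retriever)
```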

As you have probably guessed, the GenerativeQAPipeline combines the retriever and the generator to generate answers to our queries. It represents the primary API for interacting with the LFQA system we composed.

Running Queries

We’ll interact with GenerativeQAPipeline to get answers to our queries. In addition to specifying the query itself, we’ll set a constraint on the number of matched documents our retriever passes to the generator. This can be any number, but for this demonstration, we chose to limit the sources to 4. Let’s start by asking one of the ELI5-like queries:
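A sketch of such a call; the question string below is only a placeholder for your own ELI5-style query (Haystack 0.x uses top_k_retriever=4 in place of the params dict):

```python
query = "Why do some sailing ships have two masts?"   # hypothetical example question

result = pipeline.run(query=query, params={"Retriever": {"top_k": 4}})
print(result["answers"][0])
```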

We get the following answer:

🚀🎇 This answer is simply astonishing. It starts with a brief explanation of a Bermudan sloop and continues to elaborate on the features that make it widely prized in the category of sailing ships.

Let’s try another one:

The answer:

Brief and to the point. The answer is not as elaborate as in the previous example. However, we can force the model to generate longer answers by passing the optional min_length parameter (the minimum number of generated tokens) to the Seq2SeqGenerator constructor.
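For example, re-creating the generator with a minimum answer length (the value 200 is just an illustrative choice):

```python
# Ask the generator for answers of at least ~200 tokens, then rebuild the pipeline
generator = Seq2SeqGenerator(model_name_or_path="yjernite/bart_eli5", min_length=200)
pipeline = GenerativeQAPipeline(generator=generator, retriever=retriever)
```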

The new answer for the same query is:

Excellent, that’s the more detailed answer we wanted. Feel free to experiment with different questions, but don’t forget that the answers come from the small sample of Wikipedia we used.

