
Retrieval-augmented generation (RAG) has become one of the most discussed topics in generative AI literature. With the daily influx of blog articles and scientific papers, it can be challenging to stay up to date. However, the popularity of RAG is well deserved: no other solution has proven as effective at mitigating hallucinations in large language models.
RAG enhances a language model’s general knowledge with reliable external sources like Wikipedia pages, private PDFs, etc. That’s why the most important step for RAG is to ensure that our retrieval finds the right documents to feed into the model.
We need RAG so much because we currently face limitations in fitting full documents into the context window. Reasons include restricted token length for model inputs, the proportional increase in computational cost, and issues like "lost in the middle" which refers to a phenomenon where models struggle to use information found in the middle of a long input context [2].
![Relationship between model performance and the position of the relevant information in the context window.[2]](https://towardsdatascience.com/wp-content/uploads/2024/01/1EQ_JERWU7eiVyLLoScAKeg.png)
If the retrieved documents are too long or irrelevant, as the saying goes, garbage in, garbage out.
There are many techniques for enhancing RAG, creating the additional challenge of knowing when to apply each. In this article, we will analyze query transformations and how to use a router to select the appropriate transformation based on the input prompt.
The idea behind query transformations is that the retriever may not consider a user’s initial prompt particularly similar to the relevant documents in the database. In these cases, we can modify the query to increase its relevance to our sources before retrieving and feeding them to the language model.
We will start with a simple RAG application by loading three Wikipedia pages about Nicolas Cage, The Best of Times (the television pilot in which Nicolas Cage made his acting debut), and Leonardo DiCaprio.
We will then split the documents into chunks of 256 tokens with no overlap. These chunks will be embedded and indexed in a Vector Store, which keeps everything in memory by default; if you need persistence, dozens of vector store integrations are available.
from llama_index import ServiceContext, VectorStoreIndex, download_loader
from llama_index.embeddings.openai import OpenAIEmbedding, OpenAIEmbeddingModelType
from llama_index.llms import OpenAI

# Load the three Wikipedia pages as documents
WikipediaReader = download_loader("WikipediaReader")
loader = WikipediaReader()
pages = ['Nicolas_Cage', 'The_Best_of_Times_(1981_film)', 'Leonardo DiCaprio']
documents = loader.load_data(pages=pages, auto_suggest=False, redirect=False)

# LLMs and embedding model
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
gpt3 = OpenAI(temperature=0, model="text-davinci-003")
embed_model = OpenAIEmbedding(model=OpenAIEmbeddingModelType.TEXT_EMBED_ADA_002)

# Chunk, embed, and index the documents; retrieve the top 3 chunks per query
service_context_gpt3 = ServiceContext.from_defaults(llm=gpt3, chunk_size=256, chunk_overlap=0, embed_model=embed_model)
index = VectorStoreIndex.from_documents(documents, service_context=service_context_gpt3)
retriever = index.as_retriever(similarity_top_k=3)
Now, we have to make sure that the model answers only based on the context and does not rely on its training data, even if it may have previously learned the answer.
# The response from original prompt
from llama_index.prompts import PromptTemplate

template = (
    "We have provided context information below.\n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given this information, please answer the question: {query_str}\n"
    "Don't give an answer unless it is supported by the context above.\n"
)
qa_template = PromptTemplate(template)
We will test the RAG application we just created using two more complex queries. Let’s take a look at the first one.
Query 1 – "Who directed the pilot that marked the acting debut of Nicolas Cage?"
The first challenging query requires linking multiple pieces of information: Nicolas Cage’s acting debut and the director of that specific film. The prompt mentions only Nicolas Cage, while the director’s name is not included anywhere.
Since the model doesn’t know the name of the television pilot that marked Cage’s debut, called "The Best of Times", it cannot retrieve the relevant details from the documents we indexed.
question = "Who directed the pilot that marked the acting debut of Nicolas Cage?"
contexts = retriever.retrieve(question)
context_list = [n.get_content() for n in contexts]
prompt = qa_template.format(context_str="nn".join(context_list), query_str=question)
response = llm.complete(prompt)
print(str(response))

Query 2 – "Compare the education received by Nicolas Cage and Leonardo DiCaprio."
For the second query, the retriever selects relevant chunks of text only about Leonardo DiCaprio’s education. The chunks about Nicolas Cage are irrelevant, so we cannot get an accurate comparison.
question = "Compare the education received by Nicolas Cage and Leonardo DiCaprio."
contexts = retriever.retrieve(question)
context_list = [n.get_content() for n in contexts]
prompt = qa_template.format(context_str="nn".join(context_list), query_str=question)
response = llm.complete(prompt)
print(str(response))

Let’s analyze some query transformation techniques and see which works best in each case.
HyDE

Hypothetical Document Embeddings (HyDE) is a technique for retrieving relevant documents without needing relevance labels or training data. First, an LLM creates a hypothetical answer in response to the query. While this answer reflects patterns relevant to the query, the information it contains may not be factually accurate.
Next, both the query and the generated answer are transformed into embeddings. The system then identifies and retrieves actual documents from a predefined database that are closest to these embeddings in the vector space.
![Illustration of the HyDE model [3]](https://towardsdatascience.com/wp-content/uploads/2024/01/1dZnaxSGEvZlNJtiLQ5nDNg.png)
from llama_index.indices.query.query_transform import HyDEQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine

index = VectorStoreIndex.from_documents(documents, service_context=service_context_gpt3)
query_engine = index.as_query_engine(similarity_top_k=3)

# Wrap the base query engine so every query is first expanded with a hypothetical answer
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
Query 1
response = hyde_query_engine.query("Who directed the pilot that marked the acting debut of Nicolas Cage?")
print(response)

We have made partial progress: the model’s answer is still incorrect, but it has moved closer to the right response. Specifically, it can now identify the name of the television pilot ("The Best of Times"). Let’s see what the hallucinated answer looks like.
query_bundle = hyde("Who directed the pilot that marked the acting debut of Nicolas Cage?")
hyde_doc = query_bundle.embedding_strs[0]
hyde_doc

While it is not true that Francis Coppola directed "The Best of Times", at least the hallucination included the pilot’s name.
Query 2
response = hyde_query_engine.query("Compare the education received by Nicolas Cage and Leonardo DiCaprio.")
print(response)

This time the answer is correct: the hallucinated answer improved retrieval significantly, since the LLM already had information about the actors’ education in its training data.

Sub Questions
The Sub Questions technique uses a divide-and-conquer approach to handle complex questions. It first analyzes the question and breaks it down into simpler sub-questions. Each sub-question targets different relevant documents that can provide part of the answer.
The engine then gathers the intermediate responses and synthesizes all the partial results into a final response.
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

# Setup the base query engine over the vector index as a tool
vector_query_engine = index.as_query_engine(similarity_top_k=3)
query_engine_tools = [
    QueryEngineTool(
        query_engine=vector_query_engine,
        metadata=ToolMetadata(
            name="Sub-question query engine",
            description="Questions about actors",
        ),
    ),
]

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context_gpt3,
    use_async=False,
)
Query 1
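The query is issued the same way as before; a minimal sketch of the call:
response = query_engine.query("Who directed the pilot that marked the acting debut of Nicolas Cage?")
print(response)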


There is no good way to break this question down into simpler sub-questions that would make it easier to answer. The model tried to generate one sub-question, but it didn’t provide any additional context beyond the original query, wasting computation on an ineffective transformation.
Query 2
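Again, a minimal sketch of the call:
response = query_engine.query("Compare the education received by Nicolas Cage and Leonardo DiCaprio.")
print(response)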


This time, generating sub-questions was extremely useful because we needed to compare two distinct pieces of information – the educational backgrounds of two different people. Each sub-question can be answered independently using the retrieved context.
Multi-Step Query Transformation
![Self-ask + Search Engine [1]](https://towardsdatascience.com/wp-content/uploads/2024/01/1Nen6WVH8sfr-Oudxk_wwRw.png)
The multi-step query transformation approach is based on the self-ask method, where the language model asks and answers follow-up questions to itself before answering the original question. This helps the model combine facts and insights it learned separately during pretraining.
The original paper showed that LLMs often fail to compose two facts together, even if they know each one independently. For example, a model may know Fact A and Fact B but fail to deduce the implication of A and B together.
The self-ask method aims to overcome this limitation. At test time, we simply provide the prompt and question to the model. Then, it automatically generates any necessary follow-up questions to connect facts, compose reasoning steps, and decide when to stop [1].
from llama_index.indices.query.query_transform.base import StepDecomposeQueryTransform
from llama_index.query_engine.multistep_query_engine import MultiStepQueryEngine

# Decomposition transform that generates the follow-up questions with the GPT-3 model
step_decompose_transform_gpt3 = StepDecomposeQueryTransform(llm=gpt3, verbose=True)
index_summary = "Used to answer questions about the actors"
query_engine = MultiStepQueryEngine(
    query_engine=vector_query_engine,
    query_transform=step_decompose_transform_gpt3,
    index_summary=index_summary,
)
Query 1
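Running the first query through the multi-step engine follows the same pattern as before (a sketch):
response = query_engine.query("Who directed the pilot that marked the acting debut of Nicolas Cage?")
print(response)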
Don Mischer directed the pilot that marked the acting debut of Nicolas Cage.
This is the first time we have had a correct answer to this query! Let’s see the intermediate questions.
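In LlamaIndex, the multi-step engine exposes the intermediate question-answer pairs through the response metadata; the sketch below assumes the "sub_qa" metadata key:
# Inspect the follow-up questions generated by the multi-step engine
sub_qa = response.metadata["sub_qa"]
for sub_question, answer in sub_qa:
    print(sub_question, str(answer))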

The output of the first sub-question was the following:
The Best of Times pilot that marked the acting debut of Nicolas Cage was not directed by anyone in the Coppola family. It was directed by Rod Amateau.
Having identified the pilot name, the second sub-question now asks specifically about the director:
Who directed the Best of Times pilot that marked the acting debut of Nicolas Cage?
This multi-step approach helped the model build on the additional context from the prior question and led us to the correct answer.
Query 2
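The second query is run the same way (sketch):
response = query_engine.query("Compare the education received by Nicolas Cage and Leonardo DiCaprio.")
print(response)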
Nicolas Cage received education in the field of theater, film, and television at UCLA School of Theater, Film and Television. On the other hand, Leonardo DiCaprio attended the Los Angeles Center for Enriched Studies, Seeds Elementary School, and John Marshall High School. However, DiCaprio dropped out of high school and later earned a general equivalency diploma.

The multi-step approach was also useful in this case. Still, in our opinion, the sub-question technique from before is a better fit for this query: the sub-questions are independent of each other and can be answered in parallel rather than having to build on one another. But a multi-step approach can also work.
RouterQueryEngine
Each query transformation proved useful for different cases. The sub-question decomposition works best for questions that can be broken into simpler sub-questions, like comparing Nicolas Cage and Leonardo DiCaprio’s education.
Multi-step transformation works best for queries that require exploring context iteratively, like linking multiple facets of information.
Simple queries might not need any transformations at all, and applying them would be a waste of resources.
To handle choosing between all these cases, we can use a RouterQueryEngine – we give an LLM a set of tools for query transformations and let it decide the best one to apply based on the input prompt.
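Each transformation must be wrapped as a QueryEngineTool with a description that tells the router when to pick it. The sketch below is illustrative: it assumes the base, multi-step, and sub-question engines from the previous sections were kept in separate variables (vector_query_engine, multi_step_query_engine, sub_question_query_engine), and the descriptions are our own.
from llama_index.tools import QueryEngineTool

# Hypothetical tool wrappers around the engines built earlier
simple_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for simple factual questions about the actors",
)
multi_step_tool = QueryEngineTool.from_defaults(
    query_engine=multi_step_query_engine,
    description="Useful for questions that require linking multiple pieces of information step by step",
)
sub_question_tool = QueryEngineTool.from_defaults(
    query_engine=sub_question_query_engine,
    description="Useful for comparisons that can be split into independent sub-questions",
)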
from llama_index.query_engine import RouterQueryEngine
from llama_index.selectors.pydantic_selectors import PydanticSingleSelector

# The selector picks exactly one tool per query based on the tool descriptions
query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[
        simple_tool,
        multi_step_tool,
        sub_question_tool,
    ],
)
We created a router that allows selection between no transformation, sub-question decomposition, or multi-step transformations as needed for each unique query. Let’s examine how the router reasons about which approach to take.
First of all, we will choose a very simple question that needs no transformations.
response_1 = query_engine.query("What is Nicolas Cage's profession?")

The router made the right choice by assessing that the query is relatively simple. No query transformations are needed to decompose or expand this question since it asks directly about a single fact regarding Nicolas Cage’s occupation.
response_2 = query_engine.query("Compare the education received by Nicolas Cage and Leonardo DiCaprio.")

To answer a comparative question, the router accurately broke it down into simpler sub-questions about each individual.
response_3 = query_engine.query("Who directed the pilot that marked the acting debut of Nicolas Cage?")

For the third query, the router recognizes that answering correctly requires linking multiple pieces of contextual information – specifically, identifying the pilot episode that marked Cage’s acting debut and then determining who directed that particular pilot.
Conclusion
As we have explored, enhancing RAG through advanced query transformations can significantly improve model performance.
While query transformation is just one of many techniques for improving retrieval, it demonstrates the potential and need for customized analysis that merges retrieval with the reasoning capabilities inherent to LLMs.
. . .
If you enjoyed this article, join Text Generation – our newsletter has two weekly posts with the latest insights on Generative AI and Large Language Models.
_You can find the full code for this project on GitHub._
You can also find me on LinkedIn.
. . .
References
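[1] O. Press et al., Measuring and Narrowing the Compositionality Gap in Language Models (2022), arXiv:2210.03350
[2] N. F. Liu et al., Lost in the Middle: How Language Models Use Long Contexts (2023), arXiv:2307.03172
[3] L. Gao et al., Precise Zero-Shot Dense Retrieval without Relevance Labels (2022), arXiv:2212.10496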