
Building RAGs Without A Retrieval Model Is a Terrible Mistake

Here are my favorite techniques – one is faster, the other is more accurate.

Photo by Alexander Grey

I build RAG apps; it’s fun!

But the apps I build don’t do well in production. They’re promising prototypes, but they never go live!

The culprit is almost always the retrieval. Come on, this is the heart of RAGs. What are we supposed to build without this?

That is, until I index my documents for faster or more accurate retrieval.

Indexing helps us engineer solutions that retrieve data faster. It significantly reduces latency, improving the overall app experience. We use indexing in almost every app we build. It has nothing to do with LLMs or RAGs.

Almost all databases ship with indexing support. Postgres, for instance, offers B-Tree, GiST, SP-GiST, BRIN, GIN, and Hash indexes. That’s a list long enough to deserve a separate post of its own.

In this post, I’ll discuss the popular indexing strategies I frequently use for better document retrieval. These techniques are, however, specific to RAG apps. You’ll see why in a moment.

My two go-to indexing techniques are multi-representation and ColBERT. These aren’t the only methods we have. And it’s an evolving sub-field in RAGs.

If there are many, why do I prefer these two over the others? There are two main reasons.

They are both easy to understand and implement.

And they perform well in most cases.

You’ll also see that indexing improves speed and, especially with ColBERT, retrieves more relevant documents than a system without indexing would.

Real quick—before moving on, you should know how to chunk large documents into smaller pieces for embedding and retrieval. Whenever I say document, I mean these chunks. As always, we have more than one method for chunking documents. Your job as an engineer is to pick the best for your case.

I’ve discussed these in my previous posts. Take a moment to check out position-based chunking vs. semantic chunking and agentic chunking. I’ve even tried clustering as a cheap and fast alternative to agentic chunking.


Multi-representation: Faster and reasonably accurate

The name speaks for itself, right?

We store two versions of each document: one optimized for indexing, and the other the actual document we retrieve.

Simple!

Once we chunk our documents, we create a version optimized for retrieval. This would usually be a summary of the original chunk, which should contain all the vital information.

We store the optimized version in a vector store and the original chunk in a document store. We bind them both with the same key to trace which optimized version is for which document.

Now, we retrieve as usual. We ask the vector store to fetch the documents closest to our query in the vector space. But instead of returning the optimized version, we use the key we set earlier to fetch the original document.

Here’s a code walkthrough.

# Multi-Representation Indexing Tutorial

## 1. Import required libraries and load environment variables
import uuid
from dotenv import load_dotenv
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain.storage import InMemoryByteStore
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.documents import Document
# Load environment variables (e.g., API keys)
load_dotenv()
## 2. Load and preprocess the document
# Load the document from a web page. We use Django's security page as an example
loader = WebBaseLoader("https://docs.djangoproject.com/en/5.1/topics/security/")
docs = loader.load()
# Split the document into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = text_splitter.split_documents(docs)
## 3. Generate optimized summaries for each chunk
# Set up the language model and prompt
model = ChatOpenAI()
template = """
Generate a concise and coherent summary of the following document, ensuring that all key details, important concepts, and relevant information are preserved. 
Highlight the main points and conclusions while maintaining clarity and context.
{document}
"""
prompt = ChatPromptTemplate.from_template(template)
output_parser = StrOutputParser()
# Create a chain for summarization, mapping each chunk's text into the prompt
chain = {"document": lambda doc: doc.page_content} | prompt | model | output_parser
# Generate optimized summaries for all chunks in one batched call
optimized_chunks = chain.batch(chunks)
## 4. Set up the multi-vector retriever
# Initialize vector store and document store
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
documentstore = InMemoryByteStore()
# Generate unique keys for each chunk
key = "unique_id"
keys = [str(uuid.uuid4()) for _ in range(len(optimized_chunks))]
# Create Document objects with metadata
docs = [
    Document(page_content=chunk, metadata={key: keys[ix]})
    for ix, chunk in enumerate(optimized_chunks)
]
# Set up the multi-vector retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=documentstore,
    id_key=key,
)
## 5. Add documents to the retriever
# Add the optimized summaries to the vector store
retriever.vectorstore.add_documents(docs)
# Add the original chunks to the document store, keyed to match their summaries
retriever.docstore.mset(list(zip(keys, chunks)))
## 6. Retrieve relevant documents
# Example query
query = "How to restrict access to certain hosts?"
relevant_docs = retriever.get_relevant_documents(query)
# Print the retrieved documents
for doc in relevant_docs:
    print(f"Content: {doc.page_content}n")
    print(f"Metadata: {doc.metadata}n")
    print("---n")

If we ignore the library loading part, five key steps are involved. Let’s look at them in detail.

Multi-representation – Image by the author

Firstly, we load our sources. Then we chunk them.

I’ve used recursive character splitting here, even though I’m not a fan of position-based chunking; it keeps the example simple.

Then, we create an LLM chain to summarize our original chunks. The summaries we create should have all the essential details. This is the crucial part of multi-representation indexing. Although I’ve given a short prompt, you should spend most of your time here.

Better prompts make better summaries, and better summaries, in turn, get you more relevant retrieved documents.

Note that we use the batch function to convert all the chunks into summaries in one go. This comes in handy for large datasets.

In steps 4 and 5, we create a MultiVectorRetriever. This is essentially the multi-representation indexing process: you create a retriever that knows about your vector store, your document store, and how the optimized versions relate to the original documents. That link is the document key.

Now, we’re ready to run queries and retrieve documents. Here’s the output we get.

Content: Django now requires setting ALLOWED_HOSTS explicitly to enhance security, even when using a virtual host with ServerName configured. This change prevents potential security vulnerabilities from fake Host headers in HTTP requests. It is recommended to update Django configurations accordingly to mitigate risks effectively. Source: https://docs.djangoproject.com/en/5.1/topics/security/

Metadata: {}

---

This is the original chunk from the website. However, we only have the optimized version in the vector store for retrieval.

How does it get more relevant documents?

You might ask, "Okay, this may improve performance, but how does it help retrieve more relevant documents?"

Here’s how.

A large chunk of text mainly contains noise: information that is less relevant to the main point. Cosine similarity, or any other distance-based similarity search, gets thrown off by that noise.

By including only the necessary details in the indexed version, we cut out most of that noise. Now, the similarity search only has to worry about ‘similarity,’ not the noise.
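
To see the effect for yourself, here’s a small experiment you can run. The chunk, summary, and query texts below are made up for illustration, and the embedding model is the same OpenAI one used above; your exact numbers will vary, but it shows how to compare the two representations.

# Toy check: how close does a query sit to a noisy chunk vs. its summary?
# The texts below are made-up examples for illustration only.
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

noisy_chunk = (
    "Changelog entry. See also the release notes and the deployment checklist. "
    "Django requires ALLOWED_HOSTS to be set explicitly, even when a virtual host "
    "has ServerName configured. Back to top. Table of contents. Edit this page."
)
summary = "Django requires ALLOWED_HOSTS to be set explicitly to block fake Host headers."
query = "How to restrict access to certain hosts?"

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = embeddings.embed_query(query)
print("query vs. noisy chunk:", cosine(q, embeddings.embed_query(noisy_chunk)))
print("query vs. summary:    ", cosine(q, embeddings.embed_query(summary)))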

ColBERT: When accuracy matters more than speed

The idea behind ColBERT is complex—at least to my immature mind—but not too complicated to grasp.

But first, here’s why we need a different approach.

In regular retrieval, the chunk and the query are each embedded into a single vector. Then, a similarity search finds the nearest neighbors.

This approach is straightforward – but has an issue.

Documents are usually much longer than queries. The comparison isn’t apples-to-apples.

ColBERT handles it in a slightly different way. Here’s a diagram that illustrates it.

ColBERT indexing for RAG – Image by the Author

We take chunks one by one and tokenize them. We do the same for our search query.

Then, for each token in the search query, we compute its similarity with every token in the chunk. We take the maximum and keep a note of it.

We do this for every token in our search query. Now you’ve got an array of maximum similarity scores; each number represents a token in the query. Take the sum of this array, and you’ve got a single score for the chunk.

Okay, we need to do this for all the chunks in the vector store. Now, we’ve got an array of similarity values. This time, each number in the array represents a document.

Lastly, you can compare the values you’ve computed for all the chunks to find the most similar chunks.
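
If the walk-through above feels abstract, here’s a minimal sketch of that scoring step (often called MaxSim) in plain NumPy. The token embeddings are random placeholders purely for illustration; in ColBERT they come from a BERT encoder and are normalized per token.

# Minimal sketch of ColBERT-style MaxSim scoring with placeholder embeddings.
import numpy as np

rng = np.random.default_rng(0)

def normalize(m):
    # L2-normalize each token embedding so dot products act like cosine similarity
    return m / np.linalg.norm(m, axis=1, keepdims=True)

query_tokens = normalize(rng.normal(size=(8, 128)))                    # 8 query tokens
chunks = [normalize(rng.normal(size=(n, 128))) for n in (40, 60, 25)]  # 3 chunks

def maxsim(query_emb, chunk_emb):
    sims = query_emb @ chunk_emb.T    # similarity of every query token to every chunk token
    return sims.max(axis=1).sum()     # best match per query token, summed over the query

scores = [maxsim(query_tokens, chunk) for chunk in chunks]
print("chunk scores:", scores, "-> most similar chunk:", int(np.argmax(scores)))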

That’s a lot of work.

I know what you’re thinking.

If it’s complex, why would you still pick it as your first choice?

The credit goes to a library called "RAGatouille" (not Ratatouille 🐀), which makes implementing ColBERT effortless.

Here’s how.

Let’s first install RAGatouille from PyPI.

!pip install ragatouille

If you run this on your local machine, you’ll need to download many files and wait a long time. You’re better off doing it in a Colab notebook or on another cloud resource.

We then load the pre-trained ColBERT model. The following lines will do.

from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
Image by the Author

We’re now ready to index and retrieve using the ColBERT model. But let’s fetch and prepare our text for this. We’ll use the same Django security documentation for this purpose.

import requests
from bs4 import BeautifulSoup, SoupStrainer

url = "https://docs.djangoproject.com/en/5.1/topics/security/"
only_content = SoupStrainer(id="docs-content")
soup = BeautifulSoup(requests.get(url).content, "html.parser", parse_only=only_content)
full_document = soup.get_text(separator='\n', strip=True)

full_document
Image by the Author

Here’s the magical part. Indexing with RAGatouille is just a single line. Retrieval is another.

RAG.index(
    collection=[full_document],
    index_name="Djagno-security",
    max_document_length=180,
    split_documents=True,
)

This step is going to take a lot of time. ColBERT isn’t built for speed; we’ll come back to that later.

The point here is that the implementation is now simple; simpler, in fact, than the other methods.

And here’s how the retrieval step goes.

results = RAG.search(query="How to restrict access based on host names?", k=3)
results
Image by the Author

The first time you run a query against your documents, it will take a while. This is because RAGatouille loads the ColBERT model and the index into memory on the first search.

Don’t be alarmed!

Your subsequent queries will take little time, and you will get more accurate document chunks with reasonably low latency.
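
If you want to plug this into an existing LangChain pipeline, RAGatouille can also wrap the index as a retriever. The as_langchain_retriever helper below is what the library exposed in the version I used; check the docs for your version before relying on it.

# Optionally expose the ColBERT index as a LangChain retriever (assumes the
# as_langchain_retriever helper available in the RAGatouille version I used).
retriever = RAG.as_langchain_retriever(k=3)
docs = retriever.invoke("How to restrict access based on host names?")
for doc in docs:
    print(doc.page_content[:200], "...")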

The big caveat of ColBERT

I love ColBERT just as much as I hate it.

Even though ColBERT is excellent at retrieving relevant information, it has a high computing cost and indexing time.

If you’ve followed this post, you’ll have already noticed that it takes a long time to index even a tiny web page. Imagine indexing every document a decades-old organization has produced.

You’ll need a giant computer, or your vacations will be canceled.

Because of this, I haven’t used ColBERT in any production project, but it’s still in all my prototypes.

Final thoughts

You can build a RAG app overnight and put it live the following day.

Yet, without making critical decisions, the app won’t survive in production for various reasons. Often, the complaints are latency and inaccuracy.

For RAG apps, the problem is almost always the retrieval phase or the communication with the LLM.

In this post, I’ve shared some fantastic techniques I learned that helped me reduce retrieval-related issues.

I hope it helps you too.


Thanks for reading, friend! Besides Medium, I’m on LinkedIn and X, too!

