Beyond English: Implementing a multilingual RAG solution

An introduction to the do’s and don’ts when implementing a non-English Retrieval Augmented Generation (RAG) system

Jesper Alkestrup
Towards Data Science


RAG, an all-knowing colleague, available 24/7 (Image generated by author w. Dall-E 3)

TLDR

This article provides an introduction to the considerations one should take into account when developing non-English RAG systems, complete with specific examples and techniques. Some of the key points include:

  • Prioritize maintaining syntactic structure during data loading, as it is crucial for meaningful text segmentation.
  • Format documents using simple delimiters like \n\n to facilitate efficient text splitting.
  • Opt for rule-based text splitters, given the computational intensity and subpar performance of ML-based semantic splitters in multilingual contexts.
  • In selecting an embedding model, consider both its multilingual capabilities and asymmetric retrieval performance.
  • For multilingual projects, fine-tuning an embedding model with a Large Language Model (LLM) can enhance performance, and may be needed to achieve sufficient accuracy.
  • Implementing an LLM-based retrieval evaluation benchmark is strongly recommended to fine-tune the hyperparameters of your RAG system effectively, and can be done easily with existing frameworks.

It is no wonder that RAG has become the trendiest term within search technology in 2023. Retrieval Augmented Generation (RAG) is transforming how organizations utilize their vast quantity of existing data to power intelligent ChatBots. These bots, capable of conversations in natural language, can draw on an organization’s collective knowledge to function as an always-available, in-house expert to deliver relevant answers, grounded in verified data. While a considerable number of resources are available on building RAG systems, most are geared toward the English language, leaving a gap for smaller languages.

This 6-step easy-to-follow guide will walk you through the do’s and don’ts when creating RAG systems for non-English languages.

RAG structure, a brief recap

This article presumes familiarity with concepts like embeddings, vectors, and tokens. For those needing a brief refresher on the architecture of RAG systems, they essentially consist of two core components:

  1. Indexing phase (the focus of this article): This initial stage involves processing the input data. The data is first loaded, appropriately formatted, then split. Later, it undergoes vectorization through embedding techniques, culminating in its storage within a knowledge base for future retrieval.
  2. Generative phase: In this phase, a user’s query is input to the retrieval system. This system then extracts relevant information snippets from the knowledge base. Leveraging a Large Language Model (LLM), the system interprets this data to formulate a coherent, natural language response, effectively addressing the user’s inquiry.

Now let’s get started!

Disclaimer:

This guide doesn’t aim to be an exhaustive manual on using any particular tool. Instead, its purpose is to shed light on the overarching decisions that should guide your tool selection. In practice, I strongly recommend leveraging an established framework for constructing your system’s foundation. For building RAG systems, I would personally recommend LlamaIndex as they provide detailed guides and features focused strictly on indexing and retrieval optimization.

Additionally, this guide is written with the assumption that we’re dealing with languages that use the Latin script and read from left to right. This includes languages like German, French, Spanish, Czech, Turkish, Vietnamese, Norwegian, Polish, and quite a few others. Languages outside of this group may have different needs and considerations.

1. Data loader: The devil’s in the details

A cool looking multi-modal dataloader (Image generated by author w. Dall-E 3)

The first step in a RAG system involves using a dataloader to handle diverse formats, from text documents to multimedia, extracting all relevant content for further processing. For text-based formats, dataloaders typically perform consistently across languages, as they don’t involve language-specific processing. With the advent of multi-modal RAG systems, however, it is crucial to be aware that speech-to-text models perform noticeably worse in most languages than they do in English. Models like Whisper v3 demonstrate impressive multilingual capabilities, but it’s wise to check their performance on benchmarks like Mozilla Common Voice or the Fleurs dataset, and ideally evaluate them on your own benchmark.
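As a minimal sketch (assuming the open-source openai-whisper package; the audio file name and the Danish language code are placeholders, so adjust both to your case), transcribing non-English audio could look like this:

import whisper

# Load the multilingual model; smaller variants degrade faster on non-English audio
model = whisper.load_model("large-v3")

# "meeting_recording.mp3" is a hypothetical file; omit language to let Whisper auto-detect it
result = model.transcribe("meeting_recording.mp3", language="da")
print(result["text"])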

For the remainder of this article, we’ll however concentrate on text-based inputs.

Why retaining syntactic structure is important

A key aspect of data loading is to preserve the original data’s syntactic integrity. The loss of elements such as headers or paragraph structures can impact the accuracy of subsequent information retrieval. This concern is heightened for non-English languages due to the limited availability of machine learning-based segmentation tools.

Syntactic information plays a crucial role because the effectiveness of RAG systems in delivering meaningful answers depends partly on their ability to split data into semantically accurate subsections.

To highlight the difference between a data loading approach that retains structure and one that does not, let’s take the example of using a basic HTML dataloader versus a PDF loader on a Medium article. Frameworks such as LangChain and LlamaIndex rely on the same underlying libraries (Requests + BS4 for web, PyPDF2 for PDFs) and simply wrap the functions in their own document classes.

HTML Dataloader: This method retains the syntactic structure of the content.

import requests
from bs4 import BeautifulSoup
url = "https://medium.com/llamaindex-blog/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
filtered_tags = soup.find_all(['h1', 'h2', 'h3', 'h4', 'p'])
filtered_tags[:14]
<p class="be b dw dx dy dz ea eb ec ed ee ef dt"><span><a class="be b dw dx eg dy dz eh ea eb ei ec ed ej ee ef ek el em eo ep eq er es et eu ev ew ex ey ez fa bl fb fc" data-testid="headerSignUpButton" href="https://medium.com/m/signin?operation=register&amp;redirect=https%3A%2F%2Fblog.llamaindex.ai%2Fboosting-rag-picking-the-best-embedding-reranker-models-42d079022e83&amp;source=post_page---two_column_layout_nav-----------------------global_nav-----------" rel="noopener follow">Sign up</a></span></p>
<p class="be b dw dx dy dz ea eb ec ed ee ef dt"><span><a class="af ag ah ai aj ak al am an ao ap aq ar as at" data-testid="headerSignInButton" href="https://medium.com/m/signin?operation=login&amp;redirect=https%3A%2F%2Fblog.llamaindex.ai%2Fboosting-rag-picking-the-best-embedding-reranker-models-42d079022e83&amp;source=post_page---two_column_layout_nav-----------------------global_nav-----------" rel="noopener follow">Sign in</a></span></p>
<p class="be b dw dx dy dz ea eb ec ed ee ef dt"><span><a class="be b dw dx eg dy dz eh ea eb ei ec ed ej ee ef ek el em eo ep eq er es et eu ev ew ex ey ez fa bl fb fc" data-testid="headerSignUpButton" href="https://medium.com/m/signin?operation=register&amp;redirect=https%3A%2F%2Fblog.llamaindex.ai%2Fboosting-rag-picking-the-best-embedding-reranker-models-42d079022e83&amp;source=post_page---two_column_layout_nav-----------------------global_nav-----------" rel="noopener follow">Sign up</a></span></p>
<p class="be b dw dx dy dz ea eb ec ed ee ef dt"><span><a class="af ag ah ai aj ak al am an ao ap aq ar as at" data-testid="headerSignInButton" href="https://medium.com/m/signin?operation=login&amp;redirect=https%3A%2F%2Fblog.llamaindex.ai%2Fboosting-rag-picking-the-best-embedding-reranker-models-42d079022e83&amp;source=post_page---two_column_layout_nav-----------------------global_nav-----------" rel="noopener follow">Sign in</a></span></p>
<h1 class="pw-post-title gp gq gr be gs gt gu gv gw gx gy gz ha hb hc hd he hf hg hh hi hj hk hl hm hn ho hp hq hr bj" data-testid="storyTitle" id="f2a9">Boosting RAG: Picking the Best Embedding &amp; Reranker models</h1>
<p class="be b iq ir bj"><a class="af ag ah ai aj ak al am an ao ap aq ar is" data-testid="authorName" href="https://ravidesetty.medium.com/?source=post_page-----42d079022e83--------------------------------" rel="noopener follow">Ravi Theja</a></p>
<p class="be b iq ir dt"><span><a class="iv iw ah ai aj ak al am an ao ap aq ar eu ix iy" href="https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fsubscribe%2Fuser%2F60738cbbc7df&amp;operation=register&amp;redirect=https%3A%2F%2Fblog.llamaindex.ai%2Fboosting-rag-picking-the-best-embedding-reranker-models-42d079022e83&amp;user=Ravi+Theja&amp;userId=60738cbbc7df&amp;source=post_page-60738cbbc7df----42d079022e83---------------------post_header-----------" rel="noopener follow">Follow</a></span></p>
<p class="be b bf z jh ji jj jk jl jm jn jo bj">LlamaIndex Blog</p>
<p class="be b du z dt"><span class="lq">--</span></p>
<p class="be b du z dt"><span class="pw-responses-count lr ls">5</span></p>
<p class="be b bf z dt">Listen</p>
<p class="be b bf z dt">Share</p>
<p class="pw-post-body-paragraph nl nm gr nn b no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi gk bj" id="4130"><strong class="nn gs">UPDATE</strong>: The pooling method for the Jina AI embeddings has been adjusted to use mean pooling, and the results have been updated accordingly. Notably, the <code class="cw oj ok ol om b">JinaAI-v2-base-en</code> with <code class="cw oj ok ol om b">bge-reranker-large</code>now exhibits a Hit Rate of 0.938202 and an MRR (Mean Reciprocal Rank) of 0.868539 and with<code class="cw oj ok ol om b">CohereRerank</code> exhibits a Hit Rate of 0.932584, and an MRR of 0.873689.</p>
<p class="pw-post-body-paragraph nl nm gr nn b no np nq nr ns nt nu nv nw nx ny nz oa ob oc od oe of og oh oi gk bj" id="8267">When building a Retrieval Augmented Generation (RAG) pipeline, one key component is the Retriever. We have a variety of embedding models to choose from, including OpenAI, CohereAI, and open-source sentence transformers. Additionally, there are several rerankers available from CohereAI and sentence transformers.</p>

PDF dataloader: an example in which syntactic information is lost (the article was saved as a PDF, then re-loaded)

# Note: PdfFileReader / getPage / extractText are the legacy PyPDF2 API;
# PyPDF2 >= 3.0 uses PdfReader, .pages[0] and .extract_text() instead
from PyPDF2 import PdfReader

pdf = PdfReader('data/Boosting_RAG_Picking_the_Best_Embedding_&_Reranker_models.pdf')
pdf.pages[0].extract_text()
'Boosting RAG: Picking the Best\nEmbedding & Reranker models\n
Ravi Theja·Follow\nPublished inLlamaIndex Blog·7 min read·Nov 3\n
389 5\nUPDATE: The pooling method for the Jina AI embeddings has been adjusted\n
to use mean pooling, and the results have been updated accordingly.\n
Notably, the JinaAI-v2-base-en with bge-reranker-largenow exhibits a Hit\n
Rate of 0.938202 and an MRR (Mean Reciprocal Rank) of 0.868539 and\n
withCohereRerank exhibits a Hit Rate of 0.932584, and an MRR of 0.873689.\n
When building a Retrieval Augmented Generation (RAG) pipeline, one key\n
component is the Retriever. We have a variety of embedding models to\n
choose from, including OpenAI, CohereAI, and open-source sentence\n
Open in app\nSearch Write\n'

Upon initial review, the PDF dataloader’s output appears more readable, but closer inspection reveals a loss of structural information — how would one tell what is a header, and where a section ends? In contrast, the HTML file retains all the relevant structure.

Ideally, you want to retain all original formatting in the data loader, and only decide on filtering and reformatting in the next step. However, that might involve building custom data loaders for your use case, and in some cases it may be impossible. I recommend simply starting with a standard data loader, but spending a few minutes carefully inspecting examples of the loaded data to understand what structure has been lost.

Understanding what syntactic structure has been lost is crucial, as it allows for targeted refinements if the system’s downstream retrieval performance needs improvement.

2. Data formatting: Boring… but important

Document chunking (Image generated by author w. Dall-E 3)

The second step, formatting, serves the primary purpose of standardizing the data from your data loaders in a way that prepares it for the next step of text splitting. As the following section explains, dividing the input text into a myriad of smaller chunks will be necessary. Successful formatting sets up the text so that it provides the best possible conditions for dividing the content into semantically meaningful chunks. Simply put, your goal is to transform the potentially complex syntactic structure retrieved from an HTML or Markdown file into a plain text file with basic delimiters such as \n (line break) and \n\n (end of section) to guide the text splitter.

A simple function to format the BS4 HTML object into a dictionary with title and text could look like the below:

def format_html(tags):
    formatted_text = ""
    title = ""

    for tag in tags:
        # The article title carries the 'pw-post-title' class
        if 'pw-post-title' in tag.get('class', []):
            title = tag.get_text()
        # Body paragraphs are separated by a single newline
        elif tag.name == 'p' and 'pw-post-body-paragraph' in tag.get('class', []):
            formatted_text += "\n" + tag.get_text()
        # Headers mark the start of a new section, separated by a double newline
        elif tag.name in ['h1', 'h2', 'h3', 'h4']:
            formatted_text += "\n\n" + tag.get_text()

    return {title: formatted_text}

formatted_document = format_html(filtered_tags)
{'Boosting RAG: Picking the Best Embedding & Reranker models': "\n
UPDATE: The pooling method for the Jina AI embeddings has been adjusted to use mean pooling, and the results have been updated accordingly. Notably, the JinaAI-v2-base-en with bge-reranker-largenow exhibits a Hit Rate of 0.938202 and an MRR (Mean Reciprocal Rank) of 0.868539 and withCohereRerank exhibits a Hit Rate of 0.932584, and an MRR of 0.873689.\n
When building a Retrieval Augmented Generation (RAG) pipeline, one key component is the Retriever. We have a variety of embedding models to choose from, including OpenAI, CohereAI, and open-source sentence transformers. Additionally, there are several rerankers available from CohereAI and sentence transformers.\n
But with all these options, how do we determine the best mix for top-notch retrieval performance? How do we know which embedding model fits our data best? Or which reranker boosts our results the most?\n
In this blog post, we’ll use the Retrieval Evaluation module from LlamaIndex to swiftly determine the best combination of embedding and reranker models. Let's dive in!\n
Let’s first start with understanding the metrics available in Retrieval Evaluation\n\n
... }

For complex RAG systems, where there might be multiple correct answers relative to the context, storing additional information such as document titles or headers as metadata alongside the text chunks is beneficial. This metadata can later be used for filtering, and if available, formatting elements like headers should influence your chunking strategy. A library like LlamaIndex natively works with the concept of metadata and text wrapped together in Node objects, and I highly recommend using it or a similar framework.
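As an illustrative sketch (the exact import path depends on your LlamaIndex version, and the section header value here is hypothetical), wrapping a chunk together with its title as metadata could look like this:

from llama_index.schema import TextNode  # llama_index.core.schema in newer releases

node = TextNode(
    text="When building a Retrieval Augmented Generation (RAG) pipeline, one key component is the Retriever...",
    metadata={
        "title": "Boosting RAG: Picking the Best Embedding & Reranker models",
        "section_header": "Introduction",  # hypothetical header value, useful for later filtering
    },
)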

Now that we’ve done our formatting correctly, let’s dive into the key aspects of text splitting!

3. Text splitting: Size matters

Splitting text, the simple way (Image generated by author w. Dall-E 3)

When preparing data for embedding and retrieval in a RAG system, splitting the text into appropriately sized chunks is crucial. This process is guided by two main factors: model constraints and retrieval effectiveness.

Model Constraints

Embedding models have a maximum token length for input; anything beyond this limit gets truncated. Be aware of your chosen model’s limitations and ensure that each data chunk doesn’t exceed this max token length.

Multilingual models, in particular, often have shorter sequence limits compared to their English counterparts. For instance, the widely used Paraphrase multilingual MiniLM-L12 v2 model has a maximum context window of just 128 tokens.

Also, consider the text length the model was trained on — some models might technically accept longer inputs but were trained on shorter chunks, which could affect performance on longer texts. One such example is the Multi QA base model from SBERT.
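Both limits are easy to check programmatically; a quick sketch with the Sentence Transformers library (the model card remains the authoritative source for what the model was actually trained on):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

print(model.max_seq_length)              # 128 - tokens beyond this limit are truncated
print(model.tokenizer.model_max_length)  # the underlying tokenizer's hard limit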

Retrieval effectiveness

While chunking data to the model’s maximum length seems logical, it might not always lead to the best retrieval outcomes. Larger chunks offer more context for the LLM but can obscure key details, making it harder to retrieve precise matches. Conversely, smaller chunks can enhance match accuracy but might lack the context needed for complete answers. Hybrid approaches use smaller chunks for search but include surrounding context at query time for balance.
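To make the hybrid idea concrete, here is a minimal, framework-free sketch (all names are hypothetical): small chunks are what gets embedded and searched, but each one keeps a pointer to its parent section, which is what is ultimately passed to the LLM.

# Small chunks are indexed for search; parent sections provide the context at answer time
parent_sections = {
    0: "Full text of section 0 ...",
    1: "Full text of section 1 ...",
}

small_chunks = [
    {"text": "a small, precise chunk from section 0", "parent_id": 0},
    {"text": "another small chunk from section 1", "parent_id": 1},
]

def retrieve_with_context(query, top_k=1):
    # rank_by_similarity is a hypothetical helper standing in for a vector similarity search
    hits = rank_by_similarity(query, small_chunks)[:top_k]
    # Hand back the parent sections so the LLM receives enough context to answer
    return [parent_sections[hit["parent_id"]] for hit in hits]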

While there isn’t a definitive answer regarding chunk size, the considerations for chunk size remain consistent whether you’re working on multilingual or English projects. I would recommend reading further on the topic from resources such as Evaluating the Ideal Chunk Size for RAG System using Llamaindex or Building RAG-based LLM Applications for Production.

Methods for splitting text

Text can be split using various methods, mainly falling into two categories: rule-based (focusing on character analysis) and machine learning-based models. ML approaches, from simple NLTK & Spacy tokenizers to advanced transformer models, often depend on language-specific training, primarily in English. Although simple models like NLTK & Spacy support multiple languages, they mainly address sentence splitting, not semantic sectioning.

Since ML based sentence splitters currently work poorly for most non-English languages, and are compute intensive, I recommend starting with a simple rule-based splitter. If you’ve preserved relevant syntactic structure from the original data, and formatted the data correctly, the result will be of good quality.

A common and effective method is a recursive character text splitter, like those used in LangChain or LlamaIndex, which shortens sections by finding the nearest split character in a prioritized sequence (e.g., \n\n, \n, ., ?, !).

Taking the formatted text from the previous section, an example of using LangChain’s recursive character splitter would look like this:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

# Count length in tokens of the embedding model we intend to use, not in characters
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")

def token_length_function(text_input):
    return len(tokenizer.encode(text_input, add_special_tokens=False))

text_splitter = RecursiveCharacterTextSplitter(
    # Chunk size in tokens, matching the context window of the embedding model
    chunk_size = 128,
    chunk_overlap = 0,
    length_function = token_length_function,
    separators = ["\n\n", "\n", ". ", "? ", "! "]
)

split_texts = text_splitter.split_text(
    formatted_document['Boosting RAG: Picking the Best Embedding & Reranker models']
)

Here it’s important to note that you should define the tokenizer based on the embedding model you intend to use, since different models ‘count’ tokens differently. The function will now, in prioritized order, split any text longer than 128 tokens: first by the \n\n we introduced at the end of sections, and if that is not possible, then by the end of paragraphs delimited by \n, and so forth. The first 3 chunks will be:

Token of text: 111 

UPDATE: The pooling method for the Jina AI embeddings has been adjusted to use mean pooling, and the results have been updated accordingly. Notably, the JinaAI-v2-base-en with bge-reranker-largenow exhibits a Hit Rate of 0.938202 and an MRR (Mean Reciprocal Rank) of 0.868539 and withCohereRerank exhibits a Hit Rate of 0.932584, and an MRR of 0.873689.

-----------

Token of text: 112

When building a Retrieval Augmented Generation (RAG) pipeline, one key component is the Retriever. We have a variety of embedding models to choose from, including OpenAI, CohereAI, and open-source sentence transformers. Additionally, there are several rerankers available from CohereAI and sentence transformers.
But with all these options, how do we determine the best mix for top-notch retrieval performance? How do we know which embedding model fits our data best? Or which reranker boosts our results the most?

-----------

Token of text: 54

In this blog post, we’ll use the Retrieval Evaluation module from LlamaIndex to swiftly determine the best combination of embedding and reranker models. Let's dive in!
Let’s first start with understanding the metrics available in Retrieval Evaluation

Now that we have successfully split the text in a semantically meaningful way, we can move onto the final part of embedding these chunks for storage.

4. Embedding Models: Navigating the jungle

Embedding models convert text to vectors (Image generated by author w. Dall-E 3)

Choosing the right embedding model is critical for the success of a RAG system, and the choice is less straightforward than for English. A comprehensive resource for comparing models is the Massive Text Embedding Benchmark (MTEB), which includes benchmarks for over 100 languages.

The model of your choice must either be multilingual or specifically tailored to the language you’re working with (monolingual). Remember, the latest high-performing models are often English-centric and may not work well with other languages.

If available, refer to language-specific benchmarks relevant to your task. For instance, for classification there are over 50 language-specific benchmarks, aiding in selecting the most efficient model for languages ranging from Danish to Spanish. However, it’s important to note that these benchmarks may not directly indicate a model’s ability to retrieve relevant information for a RAG system, because retrieval is a different task from classification or clustering. What you are looking for are models trained for asymmetric search, as models not trained for this specific task might incorrectly prioritize shorter passages over longer, more relevant ones.

The model should excel at asymmetric retrieval, matching short queries to longer text chunks. The reason is that, in a RAG system, you typically match a brief query against more extensive passages to extract meaningful answers. The MTEB benchmarks related to asymmetric search are listed under the Retrieval category. A challenge is that, as of November 2023, MTEB’s Retrieval benchmark includes only English, Chinese, and Polish.

When dealing with languages like Norwegian, for which there may be no dedicated retrieval benchmark, you might wonder whether to choose the best-performing model from the classification benchmarks or a general multilingual model that performs well on English retrieval.

As for practical advice, a simple rule of thumb is to opt for the top-performing multilingual model in the MTEB Retrieval benchmark. Beware that the retrieval score itself is still based on English, so benchmarking in your own language (step 6) is needed to qualify the performance. As of December 2023, the E5-multilingual family is a strong choice for an open-source model. It is fine-tuned for asymmetric search: by tagging texts as ‘query’ or ‘passage’ before embedding, it optimizes retrieval by taking the nature of the input into account, ensuring a more effective match between queries and the relevant information in your knowledge base. As seen on the benchmark, cohere-embed-multilingual-v3.0 likely performs better, but it is a paid offering.

The embedding step is often performed as part of storing the documents in a vector DB, but a simple example of embedding all the split chunks with an E5 model, using the Sentence Transformers library, looks like this:

from sentence_transformers import SentenceTransformer

# This example uses an English E5 model; for non-English data, a multilingual
# variant such as 'intfloat/multilingual-e5-large' is the natural choice
model = SentenceTransformer('intfloat/e5-large')

prepended_split_texts = ["passage: " + text for text in split_texts]
embeddings = model.encode(prepended_split_texts, normalize_embeddings=True)

print(f'We now have {len(embeddings)} embeddings, each of size {len(embeddings[0])}')
We now have 12 embeddings, each of size 1024
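At query time, the counterpart is to prefix the user’s question with ‘query: ’ and compare it against the stored passage embeddings. A minimal sketch continuing the example above (the question is made up for illustration):

from sentence_transformers import util

query = "Which reranker achieved the highest MRR?"
query_embedding = model.encode("query: " + query, normalize_embeddings=True)

# Cosine similarity between the query and every stored chunk
scores = util.cos_sim(query_embedding, embeddings)[0]
best_chunk = split_texts[int(scores.argmax())]
print(best_chunk)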

If off-the-shelf embeddings turn out not to provide sufficient performance for your specific retrieval domain, fear not. With the advent of LLMs, it has become feasible to auto-generate training data from your existing corpus and increase performance by up to 5–10% by fine-tuning an existing embedding model on your own data. LlamaIndex provides a guide here, and SBERT’s GenQ approach is also relevant (mainly the Bi-Encoder training part).

5. Vector databases: The home of embeddings

Embeddings are stored in a database for retrieval (Image generated by author w. Dall-E 3)

After loading, formatting, splitting your data, and selecting an embedding model, the next step in your RAG system setup is to embed the data and store these vector embeddings for retrieval. Most platforms, including LangChain and LlamaIndex, provide integrated local storage solutions, using vector databases like Qdrant, Milvus, Chroma DB or offer direct integration with cloud-based storage options such as Pinecone or ActiveLoop. The choice of vector storage is generally unaffected by whether your data is in English or another language. For a comprehensive understanding of storage and search options, including vector databases, I recommend exploring existing resources, such as this detailed introduction: All You Need to Know About Vector Databases and How to Use Them to Augment Your LLM Apps. This guide will provide you with the necessary insights to effectively manage the storage aspect of your RAG system.
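As a minimal local example (using Chroma purely as an illustration; the collection name, ids, and path are arbitrary), storing the chunks and embeddings from the previous steps and querying them could look like this:

import chromadb

client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_or_create_collection("articles")

# Store the pre-computed passage embeddings together with their text chunks
collection.add(
    ids=[f"chunk-{i}" for i in range(len(split_texts))],
    documents=split_texts,
    embeddings=[emb.tolist() for emb in embeddings],
)

# Retrieve the 3 most similar chunks for a query embedding (created with the same 'query: ' prefix)
results = collection.query(query_embeddings=[query_embedding.tolist()], n_results=3)
print(results["documents"][0])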

At this point, you have successfully created the knowledge base that will serve as the brain of the retrieval system.

6. The generative phase: Go read elsewhere 😉

Generating responses (Image generated by author w. Dall-E 3)

The second half of the RAG system, the generative phase, is equally important in ensuring a successful solution. Strictly speaking, it’s a search optimization problem with a sprinkle of LLM on top, where the considerations are less language-dependent. This means that guides written for English retrieval optimization are generally applicable to other languages as well, which is why this phase is not covered in detail here.

In its simplest form, the generative phase involves a straightforward process: taking a user’s question, embedding it using the selected embedding model from step 4, performing a vector similarity search in the newly created database, and finally feeding the relevant text chunks to the LLM. This allows the system to respond to the query in natural language. However, to achieve a high-performing RAG system, several adjustments on the retrieval side are necessary, such as re-ranking, filtering, and more. For further insights, I recommend exploring articles such as 10 ways to improve the performance of retrieval augmented generation systems or Improving Retrieval performance in RAG pipelines with Hybrid Search.
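As a minimal sketch of this retrieve-then-generate flow (the prompt wording and the choice of OpenAI’s chat API are purely illustrative, not tied to any specific framework):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question, retrieved_chunks):
    # Stitch the retrieved chunks into a grounding context for the LLM
    context = "\n\n".join(retrieved_chunks)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context, in the language of the question."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content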

Outro: Evaluating your RAG system

What are the right choices? (Image generated by author w. Dall-E 3)

So what do you do from here, what is the right configuration for your exact problem, and language?

As it might be clear at this point, deciding on the optimal settings for your RAG system can be a complex task due to the numerous variables involved. A custom query & context benchmark is essential to evaluate different configurations, especially since a pre-existing benchmark for your specific multilingual dataset and use case is very unlikely to exist.

Thankfully, with Large Language Models (LLMs), creating a tailored benchmark dataset has become feasible. A benchmark for retrieval systems typically comprises search queries and their corresponding context (the text chunks we split in step 3). If you have the raw data, LLMs can automate the generation of fictional queries related to your dataset. Tools like LlamaIndex provide built-in functions for this purpose. By generating custom queries, you can systematically test how adjustments in the embedding model, chunk size, or data formatting impact the retrieval performance for your specific scenario.
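A sketch of what this can look like with LlamaIndex (import paths vary between versions; the nodes and the retriever are assumed to already exist from the earlier indexing steps):

import asyncio
from llama_index.evaluation import generate_question_context_pairs, RetrieverEvaluator
from llama_index.llms import OpenAI

# Let an LLM write fictional queries for each stored chunk
# (adapt the prompt template if you want the queries generated in your own language)
qa_dataset = generate_question_context_pairs(
    nodes,  # the text chunks from step 3, wrapped as LlamaIndex nodes
    llm=OpenAI(model="gpt-3.5-turbo"),
    num_questions_per_chunk=2,
)

# Evaluate the retriever built on top of your vector store against the generated queries
evaluator = RetrieverEvaluator.from_metric_names(["hit_rate", "mrr"], retriever=retriever)
eval_results = asyncio.run(evaluator.aevaluate_dataset(qa_dataset))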

Creating a representative evaluation benchmark involves a fair number of dos and don’ts of its own, and in early 2024 I will follow up with a separate post on how to create a well-performing retrieval benchmark — stay tuned!
