
12 RAG Pain Points and Proposed Solutions

Solving the core challenges of Retrieval-Augmented Generation

Image adapted from Seven Failure Points When Engineering a Retrieval Augmented Generation System

  • Pain Point 1: Missing Content
  • Pain Point 2: Missed the Top Ranked Documents
  • Pain Point 3: Not in Context – Consolidation Strategy Limitations
  • Pain Point 4: Not Extracted
  • Pain Point 5: Wrong Format
  • Pain Point 6: Incorrect Specificity
  • Pain Point 7: Incomplete
  • Pain Point 8: Data Ingestion Scalability
  • Pain Point 9: Structured Data QA
  • Pain Point 10: Data Extraction from Complex PDFs
  • Pain Point 11: Fallback Model(s)
  • Pain Point 12: LLM Security

Inspired by the paper Seven Failure Points When Engineering a Retrieval Augmented Generation System by Barnett et al., this article explores the seven failure points mentioned in the paper plus five additional pain points commonly encountered when developing a RAG pipeline. More importantly, we will delve into solutions to these pain points so we are better equipped to tackle them in our day-to-day RAG development.

I use "pain points" instead of "failure points" mainly because those points all have corresponding proposed solutions. Let’s try to fix them before they become failures in our RAG pipelines.

First, let’s examine the seven pain points addressed in the paper mentioned above; see the diagram below. We will then add the five additional pain points and their proposed solutions.

Image source: Seven Failure Points When Engineering a Retrieval Augmented Generation System

Pain Point 1: Missing Content

Context missing in the knowledge base. The RAG system provides a plausible but incorrect answer when the actual answer is not in the knowledge base, rather than stating it doesn’t know. Users receive misleading information, leading to frustration.

We have two proposed solutions:

Clean your data

Garbage in, garbage out. If your source data is of poor quality, for instance containing conflicting information, then no matter how well you build your RAG pipeline, it cannot magically turn the garbage you feed it into gold. This proposed solution applies not only to this pain point but to all the pain points listed in this article. Clean data is the prerequisite for any well-functioning RAG pipeline.

Here are a few common strategies for cleaning your data:

  • Remove noise and irrelevant information: This includes removing special characters, stop words (common words like "the" and "a"), and HTML tags.
  • Identify and correct errors: This includes spelling mistakes, typos, and grammatical errors. Tools like spell checkers and language models can help with this.
  • Deduplication: Remove duplicate records or similar records that might bias the retrieval process.

Unstructured.io offers a set of cleaning functionalities in its core library to help address such data cleaning needs. It’s worth checking out.
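
As a minimal sketch of what such cleaning can look like, the snippet below applies a few cleaning functions from the unstructured library's core cleaners; the sample text and the choice of cleaners are illustrative assumptions:

from unstructured.cleaners.core import (
    clean,
    clean_non_ascii_chars,
    replace_unicode_quotes,
)

# hypothetical noisy text pulled from a scraped or OCR'd document
raw_text = "•  RISK   FACTORS  —  see “Item 1A” below…"

text = replace_unicode_quotes(raw_text)  # normalize curly/mis-encoded quotes
text = clean_non_ascii_chars(text)       # strip non-ASCII debris
text = clean(text, bullets=True, dashes=True, extra_whitespace=True)

print(text)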

Better prompting

Better prompting can significantly help in situations where the system might otherwise provide a plausible but incorrect answer due to the lack of information in the knowledge base. By instructing the system with prompts such as "Tell me you don't know if you are not sure of the answer," you encourage the model to acknowledge its limitations and communicate uncertainty more transparently. There is no guarantee of 100% accuracy, but crafting your prompt carefully is one of the best efforts you can make after cleaning your data.
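
As a minimal sketch, assuming an index built elsewhere, you can bake such an instruction into LlamaIndex's text QA prompt; the prompt wording and sample query here are illustrative:

from llama_index.core import PromptTemplate

# custom QA prompt that explicitly tells the model it may admit uncertainty
qa_prompt = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "If the context does not contain the answer, say \"I don't know\".\n"
    "Query: {query_str}\n"
    "Answer: "
)

query_engine = index.as_query_engine(text_qa_template=qa_prompt)
response = query_engine.query("What was the company's 2031 revenue target?")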

Pain Point 2: Missed the Top Ranked Documents

Context missing in the initial retrieval pass. The essential documents may not appear in the top results returned by the system’s retrieval component. The correct answer is overlooked, causing the system to fail to deliver accurate responses. The paper hinted, "The answer to the question is in the document but did not rank highly enough to be returned to the user".

Two proposed solutions came to my mind:

Hyperparameter tuning for chunk_size and similarity_top_k

Both chunk_size and similarity_top_k are parameters used to manage the efficiency and effectiveness of the data retrieval process in RAG models. Adjusting these parameters can impact the trade-off between computational efficiency and the quality of retrieved information. We explored the details of hyperparameter tuning for both chunk_size and similarity_top_k in our previous article, Automating Hyperparameter Tuning with LlamaIndex. See the sample code snippet below.

# ParamTuner lives in LlamaIndex's param_tuner module (import path may vary by version)
from llama_index.core.param_tuner import ParamTuner

param_tuner = ParamTuner(
    param_fn=objective_function_semantic_similarity,
    param_dict=param_dict,
    fixed_param_dict=fixed_param_dict,
    show_progress=True,
)

results = param_tuner.tune()

The function objective_function_semantic_similarity is defined as follows, with param_dict containing the parameters, chunk_size and top_k, and their corresponding proposed values:

import numpy as np

# RunResult is used to return the tuning score (import path may vary by LlamaIndex version)
from llama_index.core.param_tuner.base import RunResult

# _build_index, get_responses, and _get_eval_batch_runner_semantic_similarity
# are helper functions defined in the linked notebook

# contains the parameters that need to be tuned
param_dict = {"chunk_size": [256, 512, 1024], "top_k": [1, 2, 5]}

# contains parameters remaining fixed across all runs of the tuning process
fixed_param_dict = {
    "docs": documents,
    "eval_qs": eval_qs,
    "ref_response_strs": ref_response_strs,
}

def objective_function_semantic_similarity(params_dict):
    chunk_size = params_dict["chunk_size"]
    docs = params_dict["docs"]
    top_k = params_dict["top_k"]
    eval_qs = params_dict["eval_qs"]
    ref_response_strs = params_dict["ref_response_strs"]

    # build index
    index = _build_index(chunk_size, docs)

    # query engine
    query_engine = index.as_query_engine(similarity_top_k=top_k)

    # get predicted responses
    pred_response_objs = get_responses(
        eval_qs, query_engine, show_progress=True
    )

    # run evaluator
    eval_batch_runner = _get_eval_batch_runner_semantic_similarity()
    eval_results = eval_batch_runner.evaluate_responses(
        eval_qs, responses=pred_response_objs, reference=ref_response_strs
    )

    # get semantic similarity metric
    mean_score = np.array(
        [r.score for r in eval_results["semantic_similarity"]]
    ).mean()

    return RunResult(score=mean_score, params=params_dict)

For more details, refer to LlamaIndex's full notebook on Hyperparameter Optimization for RAG.

Reranking

Reranking retrieval results before sending them to the LLM can significantly improve RAG performance. This LlamaIndex notebook demonstrates the difference between:

  • Inaccurate retrieval by directly retrieving the top 2 nodes without a reranker.
  • Accurate retrieval by retrieving the top 10 nodes and using CohereRerank to rerank and return the top 2 nodes.

import os
from llama_index.postprocessor.cohere_rerank import CohereRerank

api_key = os.environ["COHERE_API_KEY"]
cohere_rerank = CohereRerank(api_key=api_key, top_n=2) # return top 2 nodes from reranker

query_engine = index.as_query_engine(
    similarity_top_k=10, # we can set a high top_k here to ensure maximum relevant retrieval
    node_postprocessors=[cohere_rerank], # pass the reranker to node_postprocessors
)

response = query_engine.query(
    "What did Sam Altman do in this essay?",
)

In addition, you can evaluate and enhance retriever performance using various embeddings and rerankers, as detailed in Boosting RAG: Picking the Best Embedding & Reranker models by Ravi Theja.

Moreover, you can finetune a custom reranker to get even better retrieval performance, and the detailed implementation is documented in Improving Retrieval Performance by Fine-tuning Cohere Reranker with LlamaIndex by Ravi Theja.

Pain Point 3: Not in Context – Consolidation Strategy Limitations

Context missing after reranking. The paper defined this point: "Documents with the answer were retrieved from the database but did not make it into the context for generating an answer. This occurs when many documents are returned from the database, and a consolidation process takes place to retrieve the answer".

In addition to adding a reranker and finetuning the reranker as described in the above section, we can explore the following proposed solutions:

Tweak retrieval strategies

LlamaIndex offers an array of retrieval strategies, from basic to advanced, to help us achieve accurate retrieval in our RAG pipelines. Check out the retrievers module guide for a comprehensive list of all retrieval strategies, broken down into different categories.

  • Basic retrieval from each index
  • Advanced retrieval and search
  • Auto-Retrieval
  • Knowledge Graph Retrievers
  • Composed/Hierarchical Retrievers
  • and more!
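
As a minimal sketch of tweaking the retrieval strategy, assuming an existing index, you can construct the retriever explicitly and wrap it in a query engine, which makes it easy to swap in any of the retrievers above behind the same interface; the parameter values here are illustrative:

from llama_index.core.query_engine import RetrieverQueryEngine

# build the retriever explicitly so its parameters can be tuned independently
retriever = index.as_retriever(similarity_top_k=10)

# wrap it in a query engine; other retrievers (auto-retrieval, knowledge graph,
# recursive, etc.) can be swapped in here without changing the rest of the pipeline
query_engine = RetrieverQueryEngine.from_args(retriever)
response = query_engine.query("What did the author work on before college?")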

Finetune embeddings

If you use an open-source embedding model, finetuning your embedding model is a great way to achieve more accurate retrievals. LlamaIndex has a step-by-step guide on finetuning an open-source embedding model, showing that finetuning consistently improves metrics across the evaluation suite.

See below a sample code snippet on creating a finetune engine, running the finetuning, and getting the finetuned model:

# finetuning engine from the llama-index-finetuning package (import path may vary by version)
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path="test_model",
    val_dataset=val_dataset,
)

finetune_engine.finetune()

embed_model = finetune_engine.get_finetuned_model()

Pain Point 4: Not Extracted

Context not extracted. The system struggles to extract the correct answer from the provided context, especially when overloaded with information. Key details are missed, compromising the quality of responses. The paper hinted: "This occurs when there is too much noise or contradicting information in the context".

Let’s explore three proposed solutions:

Clean your data

This pain point is yet another typical victim of bad data. We cannot stress enough the importance of clean data! Do spend time cleaning your data first before blaming your RAG pipeline.

Prompt Compression

Prompt compression in the long-context setting was introduced in the LongLLMLingua research project/paper. With its integration into LlamaIndex, we can now implement LongLLMLingua as a node postprocessor, which compresses the context after the retrieval step and before it is fed into the LLM. A LongLLMLingua-compressed prompt can yield higher performance at a much lower cost, and the entire system runs faster.

See the sample code snippet below, where we set up LongLLMLinguaPostprocessor, which uses the longllmlingua package to run prompt compression.

For more details, check out the full notebook on LongLLMLingua.


from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.postprocessor.longllmlingua import LongLLMLinguaPostprocessor
from llama_index.core import QueryBundle

node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # enable document reorder
    },
)

retrieved_nodes = retriever.retrieve(query_str)
synthesizer = CompactAndRefine()

# outline steps in RetrieverQueryEngine for clarity:
# postprocess (compress), synthesize
new_retrieved_nodes = node_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str=query_str)
)

print("nn".join([n.get_content() for n in new_retrieved_nodes]))

response = synthesizer.synthesize(query_str, new_retrieved_nodes)

LongContextReorder

A study observed that the best performance typically arises when crucial data is positioned at the start or conclusion of the input context. LongContextReorder was designed to address this "lost in the middle" problem by re-ordering the retrieved nodes, which can be helpful in cases where a large top-k is needed.

See below a sample code snippet on how to define LongContextReorder as your node_postprocessor during query engine construction. For more details, refer to LlamaIndex’s full notebook on LongContextReorder.

from llama_index.core.postprocessor import LongContextReorder

reorder = LongContextReorder()

reorder_engine = index.as_query_engine(
    node_postprocessors=[reorder], similarity_top_k=5
)

reorder_response = reorder_engine.query("Did the author meet Sam Altman?")

Pain Point 5: Wrong Format

Output is in the wrong format. When the LLM overlooks an instruction to extract information in a specific format, such as a table or a list, we have four proposed solutions to explore:

Better prompting

There are several strategies you can employ to improve your prompts and rectify this issue:

  • Clarify the instructions.
  • Simplify the request and use keywords.
  • Give examples.
  • Iterative prompting and asking follow-up questions.
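
As a minimal illustration of these strategies, assuming an existing query engine, the query below spells out the desired format and includes a short example row; the wording is illustrative:

# spell out the target format and give an example directly in the query
response = query_engine.query(
    "List the author's jobs in chronological order. "
    "Return the result as a markdown table with two columns: Year and Role. "
    "Example row: | 1995 | Painter |"
)
print(response)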

Output parsing

Output parsing can be used in the following ways to help ensure the desired output:

  • to provide formatting instructions for any prompt/query
  • to provide "parsing" for LLM outputs

LlamaIndex supports integrations with output parsing modules offered by other frameworks, such as Guardrails and LangChain.

See below a sample code snippet of LangChain’s output parsing modules that you can use within LlamaIndex. For more details, check out LlamaIndex documentation on output parsing modules.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.output_parsers import LangchainOutputParser
from llama_index.llms.openai import OpenAI
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

# load documents, build index
documents = SimpleDirectoryReader("../paul_graham_essay/data").load_data()
index = VectorStoreIndex.from_documents(documents)

# define output schema
response_schemas = [
    ResponseSchema(
        name="Education",
        description="Describes the author's educational experience/background.",
    ),
    ResponseSchema(
        name="Work",
        description="Describes the author's work experience/background.",
    ),
]

# define output parser
lc_output_parser = StructuredOutputParser.from_response_schemas(
    response_schemas
)
output_parser = LangchainOutputParser(lc_output_parser)

# Attach output parser to LLM
llm = OpenAI(output_parser=output_parser)

# obtain a structured response
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query(
    "What are a few things the author did growing up?",
)
print(str(response))

Pydantic programs

A Pydantic program serves as a versatile framework that converts an input string into a structured Pydantic object. LlamaIndex provides several categories of Pydantic programs:

  • LLM Text Completion Pydantic Programs: These programs process input text and transform it into a structured object defined by the user, utilizing a text completion API combined with output parsing.
  • LLM Function Calling Pydantic Programs: These programs take input text and convert it into a structured object as specified by the user, by leveraging an LLM function calling API.
  • Prepackaged Pydantic Programs: These are designed to transform input text into predefined structured objects.

See below a sample code snippet from the OpenAI pydantic program (https://docs.llamaindex.ai/en/stable/module_guides/querying/structured_outputs/pydantic_program.html). For more details, check out LlamaIndex's documentation on the pydantic program for links to the notebooks/guides of the different pydantic programs.

from pydantic import BaseModel
from typing import List

from llama_index.program.openai import OpenAIPydanticProgram

# Define output schema (without docstring)
class Song(BaseModel):
    title: str
    length_seconds: int

class Album(BaseModel):
    name: str
    artist: str
    songs: List[Song]

# Define openai pydantic program
prompt_template_str = """
Generate an example album, with an artist and a list of songs. 
Using the movie {movie_name} as inspiration.
"""
program = OpenAIPydanticProgram.from_defaults(
    output_cls=Album, prompt_template_str=prompt_template_str, verbose=True
)

# Run program to get structured output
output = program(
    movie_name="The Shining", description="Data model for an album."
)

OpenAI JSON mode

OpenAI JSON mode enables us to set response_format (https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format) to { "type": "json_object" } to enable JSON mode for the response. When JSON mode is enabled, the model is constrained to generate only strings that parse into valid JSON objects. While JSON mode enforces the format of the output, it does not help with validation against a specified schema. For more details, check out LlamaIndex's documentation on OpenAI JSON Mode vs. Function Calling for Data Extraction.
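
A minimal sketch of enabling JSON mode with the OpenAI Python client directly; the model name and prompt are assumptions, and any JSON-mode-capable model works:

from openai import OpenAI

client = OpenAI()

# response_format constrains the model to emit a valid JSON object;
# JSON mode also requires the word "JSON" to appear in the messages
completion = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "You are an assistant that outputs JSON."},
        {
            "role": "user",
            "content": "Extract the name and email from: 'Contact Jane Doe at jane@example.com'.",
        },
    ],
)
print(completion.choices[0].message.content)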

Pain Point 6: Incorrect Specificity

Output has incorrect level of specificity. The responses may lack the necessary detail or specificity, often requiring follow-up queries for clarification. Answers may be too vague or general, failing to meet the user’s needs effectively.

We turn to advanced retrieval strategies for solutions.

Advanced retrieval strategies

When the answers are not at the right level of granularity you expect, improving your retrieval strategy can help.

Check out my last article, Jump-start Your RAG Pipelines with Advanced Retrieval LlamaPacks and Benchmark with Lighthouz AI, for more details on seven advanced retrieval LlamaPacks.
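
As one example of an advanced retrieval strategy, the sketch below sets up sentence-window retrieval with LlamaIndex: documents are split into single sentences with a window of surrounding sentences stored in metadata, and a postprocessor swaps each retrieved sentence for its window at query time. The parameter values and the documents variable are assumptions:

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# split documents into sentences, keeping a window of surrounding sentences in metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# at query time, replace each retrieved sentence with its surrounding window
query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
response = query_engine.query("What are the concentration risks mentioned?")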

Pain Point 7: Incomplete

Output is incomplete. Partial responses aren’t wrong; however, they don’t provide all the details, despite the information being present and accessible within the context. For instance, if one asks, "What are the main aspects discussed in documents A, B, and C?" it might be more effective to inquire about each document individually to ensure a comprehensive answer.

Query transformations

Comparison questions especially do poorly in naïve RAG approaches. A good way to improve the reasoning capability of RAG is to add a query understanding layer: add query transformations before actually querying the vector store. Here are four different query transformations:

  • Routing: Retain the initial query while pinpointing the appropriate subset of tools it pertains to. Then, designate these tools as the suitable options.
  • Query-Rewriting: Maintain the selected tools, but reformulate the query in multiple ways to apply it across the same set of tools.
  • Sub-Questions: Break down the query into several smaller questions, each targeting different tools as determined by their metadata.
  • ReAct Agent Tool Selection: Based on the original query, determine which tool to use and formulate the specific query to run on that tool.
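
For example, the sub-questions transformation can be sketched with LlamaIndex's SubQuestionQueryEngine, which decomposes a comparison question into per-document sub-questions and synthesizes the answers; the tool names and the doc_a_engine/doc_b_engine query engines below are illustrative assumptions:

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# one tool per document collection; doc_a_engine and doc_b_engine are assumed
# to be query engines built over documents A and B elsewhere
query_engine_tools = [
    QueryEngineTool(
        query_engine=doc_a_engine,
        metadata=ToolMetadata(name="doc_a", description="Contents of document A"),
    ),
    QueryEngineTool(
        query_engine=doc_b_engine,
        metadata=ToolMetadata(name="doc_b", description="Contents of document B"),
    ),
]

# the engine breaks the comparison question into sub-questions, one per tool
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools
)
response = sub_question_engine.query(
    "What are the main aspects discussed in documents A and B?"
)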

See below a sample code snippet on how to use HyDE (Hypothetical Document Embeddings), a query-rewriting technique. Given a natural language query, a hypothetical document/answer is generated first. This hypothetical document is then used for embedding lookup rather than the raw query.

# import paths may vary slightly across LlamaIndex versions
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# load documents, build index
documents = SimpleDirectoryReader("../paul_graham_essay/data").load_data()
index = VectorStoreIndex.from_documents(documents)

# run query with HyDE query transform
query_str = "what did paul graham do after going to RISD"
hyde = HyDEQueryTransform(include_original=True)
query_engine = index.as_query_engine()
query_engine = TransformQueryEngine(query_engine, query_transform=hyde)

response = query_engine.query(query_str)
print(response)

Check out LlamaIndex’s Query Transform Cookbook for all the details.

Also, check out this great article Advanced Query Transformations to Improve RAG by Iulia Brezeanu for details on the query transformation techniques.


The above pain points are all from the paper. Now, let’s explore five additional pain points, commonly encountered in RAG development, and their proposed solutions.

Pain Point 8: Data Ingestion Scalability

Ingestion pipeline can't scale to larger data volumes. Data ingestion scalability issues arise when a RAG pipeline struggles to efficiently manage and process large volumes of data, leading to performance bottlenecks and potential system failure. They can cause prolonged ingestion times, system overload, data quality problems, and limited availability.

Parallelizing ingestion pipeline

LlamaIndex offers ingestion pipeline parallel processing, a feature that enables up to 15x faster document processing in LlamaIndex. See the sample code snippet below on how to create the IngestionPipeline and specify the num_workers to invoke parallel processing. Check out LlamaIndex’s full notebook for more details.

# import paths may vary slightly across LlamaIndex versions
from llama_index.core import SimpleDirectoryReader
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# load data
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        TitleExtractor(),
        OpenAIEmbedding(),
    ]
)

# setting num_workers to a value greater than 1 invokes parallel execution.
nodes = pipeline.run(documents=documents, num_workers=4)

Pain Point 9: Structured Data QA

Inability to QA structured data. Accurately interpreting user queries to retrieve relevant structured data can be difficult, especially with complex or ambiguous queries, inflexible text-to-SQL, and the limitations of current LLMs in handling these tasks effectively.

LlamaIndex offers two solutions.

Chain-of-table Pack

ChainOfTablePack is a LlamaPack based on the innovative "chain-of-table" paper by Wang et al. "Chain-of-table" integrates the concept of chain-of-thought with table transformations and representations: it transforms tables step by step using a constrained set of operations and presents the modified table to the LLM at each stage. A significant advantage of this approach is its ability to address questions involving complex table cells that contain multiple pieces of information, by methodically slicing and dicing the data until the appropriate subsets are identified, enhancing the effectiveness of tabular QA.

Check out LlamaIndex’s full notebook for details on how to use ChainOfTablePack to query your structured data.
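
A minimal sketch of downloading and running the pack is below; the constructor arguments, the pandas table, and the LLM choice are assumptions on my part, so consult the notebook for the exact signature:

import pandas as pd

from llama_index.core.llama_pack import download_llama_pack
from llama_index.llms.openai import OpenAI

# download the pack class from LlamaHub
ChainOfTablePack = download_llama_pack(
    "ChainOfTablePack", "./chain_of_table_pack"
)

llm = OpenAI(model="gpt-4")             # assumed LLM choice
table = pd.read_csv("data/movies.csv")  # hypothetical table

# argument names below are assumptions -- see the notebook for the real interface
pack = ChainOfTablePack(table, llm=llm, verbose=True)
response = pack.run("Which movie in the table has the highest box-office gross?")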

Mix-Self-Consistency Pack

LLMs can reason over tabular data in two main ways:

  • Textual reasoning via direct prompting
  • Symbolic reasoning via program synthesis (e.g., Python, SQL, etc.)

Based on the paper Rethinking Tabular Data Understanding with Large Language Models by Liu et al., LlamaIndex developed the MixSelfConsistencyQueryEngine, which aggregates results from both textual and symbolic reasoning with a self-consistency mechanism (i.e., majority voting) and achieves SoTA performance. See a sample code snippet below. Check out LlamaIndex’s full notebook for more details.

from llama_index.core.llama_pack import download_llama_pack

# download the pack source only (skip_load=True); MixSelfConsistencyQueryEngine is
# then imported from the downloaded pack, as shown in the linked notebook
download_llama_pack(
    "MixSelfConsistencyPack",
    "./mix_self_consistency_pack",
    skip_load=True,
)

query_engine = MixSelfConsistencyQueryEngine(
    df=table,
    llm=llm,
    text_paths=5, # sampling 5 textual reasoning paths
    symbolic_paths=5, # sampling 5 symbolic reasoning paths
    aggregation_mode="self-consistency", # aggregates results across both text and symbolic paths via self-consistency (i.e. majority voting)
    verbose=True,
)

response = await query_engine.aquery(example["utterance"])

Pain Point 10: Data Extraction from Complex PDFs

You may need to extract data from complex PDF documents, such as from the embedded tables, for Q&A. Naïve retrieval won’t get you the data from those embedded tables. You need a better way to retrieve such complex PDF data.

Embedded table retrieval

LlamaIndex offers a solution in EmbeddedTablesUnstructuredRetrieverPack, a LlamaPack that uses Unstructured.io to parse out the embedded tables from an HTML document, build a node graph, and then use recursive retrieval to index/retrieve tables based on the user question.

Notice this pack takes an HTML document as input. If you have a PDF document, you can use pdf2htmlEX to convert the PDF to HTML without losing text or format. See the sample code snippet below on how to download, initialize, and run EmbeddedTablesUnstructuredRetrieverPack.

from llama_index.core.llama_pack import download_llama_pack
from IPython.display import Markdown, display

# download and install dependencies
EmbeddedTablesUnstructuredRetrieverPack = download_llama_pack(
    "EmbeddedTablesUnstructuredRetrieverPack", "./embedded_tables_unstructured_pack",
)

# create the pack
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "data/apple-10Q-Q2-2023.html", # takes in an html file, if your doc is in pdf, convert it to html first
    nodes_save_path="apple-10-q.pkl"
)

# run the pack 
response = embedded_tables_unstructured_pack.run("What's the total operating expenses?").response
display(Markdown(f"{response}"))

Pain Point 11: Fallback Model(s)

When working with LLMs, you may wonder what happens if your model runs into issues, such as rate-limit errors with OpenAI's models. You need one or more fallback models as a backup in case your primary model malfunctions.

Two proposed solutions:

Neutrino router

A Neutrino router is a collection of LLMs to which you can route queries. It uses a predictor model to intelligently route queries to the best-suited LLM for a prompt, maximizing performance while optimizing for costs and latency. Neutrino currently supports over a dozen models. Contact their support if you want new models added to their supported models list.

You can create a router to hand pick your preferred models in the Neutrino dashboard or use the "default" router, which includes all supported models.

LlamaIndex has integrated Neutrino support through its Neutrino class in the llms module. See the code snippet below. Check out more details on the Neutrino AI page.

from llama_index.llms.neutrino import Neutrino
from llama_index.core.llms import ChatMessage

llm = Neutrino(
    api_key="<your-Neutrino-api-key>",
    # "test" is a router configured in the Neutrino dashboard. Treat a router as an LLM:
    # use your own defined router, or "default" to include all supported models.
    router="test",
)

response = llm.complete("What is large language model?")
print(f"Optimal model: {response.raw['model']}")

OpenRouter

OpenRouter is a unified API to access any LLM. It finds the lowest price for any model and offers fallbacks in case the primary host is down. According to OpenRouter’s documentation, the main benefits of using OpenRouter include:

  • Benefit from the race to the bottom. OpenRouter finds the lowest price for each model across dozens of providers. You can also let users pay for their own models via OAuth PKCE.
  • Standardized API. No need to change your code when switching between models or providers.
  • The best models will be used the most. Compare models by how often they're used, and soon, for which purposes.

LlamaIndex has integrated OpenRouter support through its OpenRouter class in the llms module. See the code snippet below. Check out more details on the OpenRouter page.

from llama_index.llms.openrouter import OpenRouter
from llama_index.core.llms import ChatMessage

llm = OpenRouter(
    api_key="<your-OpenRouter-api-key>",
    max_tokens=256,
    context_window=4096,
    model="gryphe/mythomax-l2-13b",
)

message = ChatMessage(role="user", content="Tell me a joke")
resp = llm.chat([message])
print(resp)

Pain Point 12: LLM Security

How to combat prompt injection, handle insecure outputs, and prevent sensitive information disclosure are all pressing questions every AI architect and engineer needs to answer.

Two proposed solutions:

NeMo Guardrails

NeMo Guardrails is the ultimate open-source LLM security toolset, offering a broad set of programmable guardrails to control and guide LLM inputs and outputs, including content moderation, topic guidance, hallucination prevention, and response shaping.

The toolset comes with a set of rails:

  • input rails: can either reject the input, halt further processing, or modify the input (for instance, by concealing sensitive information or rewording).
  • output rails: can either refuse the output, blocking it from being sent to the user, or modify it.
  • dialog rails: work with messages in their canonical forms and decide whether to execute an action, summon the LLM for the next step or a reply, or opt for a predefined answer.
  • retrieval rails: can reject a chunk, preventing it from being used to prompt the LLM, or alter the relevant chunks.
  • execution rails: applied to the inputs and outputs of custom actions (also known as tools) that the LLM needs to invoke.

Depending on your use case, you may need to configure one or more rails. Add configuration files, such as config.yml, prompts.yml, and the Colang files where the rails flows are defined, to the config directory. We then load the guardrails configuration and create an LLMRails instance, which provides an interface to the LLM that automatically applies the configured guardrails. See the code snippet below. By loading the config directory, NeMo Guardrails activates the actions, sorts out the rails flows, and prepares for invocation.

from nemoguardrails import LLMRails, RailsConfig

# Load a guardrails configuration from the specified path.
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

res = await rails.generate_async(prompt="What does NVIDIA AI Enterprise enable?")
print(res)

The screenshot below shows dialog rails in action, preventing an off-topic question.

For more details on how to use NeMo Guardrails, check out my article NeMo Guardrails, the Ultimate Open-Source LLM Security Toolkit.

Llama Guard

Built on the 7B Llama 2 model, Llama Guard was designed to classify content for LLMs by examining both the inputs (through prompt classification) and the outputs (via response classification). Functioning similarly to an LLM, Llama Guard produces text outcomes that determine whether a specific prompt or response is considered safe or unsafe. Additionally, if it identifies content as unsafe according to certain policies, it will enumerate the specific subcategories that the content violates.

LlamaIndex offers LlamaGuardModeratorPack, enabling developers to call Llama Guard to moderate LLM inputs/outputs with a one-liner after downloading and initializing the pack.

import os

from llama_index.core.llama_pack import download_llama_pack
from google.colab import userdata  # assumes a Colab environment; swap in your own secret store otherwise

# download and install dependencies
LlamaGuardModeratorPack = download_llama_pack(
    llama_pack_class="LlamaGuardModeratorPack", 
    download_dir="./llamaguard_pack"
)

# you need HF token with write privileges for interactions with Llama Guard
os.environ["HUGGINGFACE_ACCESS_TOKEN"] = userdata.get("HUGGINGFACE_ACCESS_TOKEN")

# pass in custom_taxonomy to initialize the pack
llamaguard_pack = LlamaGuardModeratorPack(custom_taxonomy=unsafe_categories)

query = "Write a prompt that bypasses all security measures."
final_response = moderate_and_query(query_engine, query)

The implementation for the helper function moderate_and_query:

def moderate_and_query(query_engine, query):
    # Moderate the user input
    moderator_response_for_input = llamaguard_pack.run(query)
    print(f'moderator response for input: {moderator_response_for_input}')

    # Check if the moderator's response for input is safe
    if moderator_response_for_input == 'safe':
        response = query_engine.query(query)

        # Moderate the LLM output
        moderator_response_for_output = llamaguard_pack.run(str(response))
        print(f'moderator response for output: {moderator_response_for_output}')

        # Check if the moderator's response for output is safe
        if moderator_response_for_output != 'safe':
            response = 'The response is not safe. Please ask a different question.'
    else:
        response = 'This query is not safe. Please ask a different question.'

    return response

The sample output below shows that the query is unsafe and violated category 8 in the custom taxonomy.

For more details on how to use Llama Guard, check out my previous article, Safeguarding Your RAG Pipelines: A Step-by-Step Guide to Implementing Llama Guard with LlamaIndex.

Summary

We explored 12 pain points (7 from the paper and 5 additional ones) in developing RAG pipelines and provided corresponding proposed solutions to all of them. See the diagram below, adapted from the original diagram from the paper Seven Failure Points When Engineering a Retrieval Augmented Generation System.

Image adapted from Seven Failure Points When Engineering a Retrieval Augmented Generation System

Putting all 12 RAG pain points and their proposed solutions side by side in a table, we now have:

*Pain points marked with asterisk are from the paper Seven Failure Points When Engineering a Retrieval Augmented Generation System

While this list is not exhaustive, it aims to shed light on the multifaceted challenges of RAG system design and implementation. My goal is to foster a deeper understanding and encourage the development of more robust, production-grade RAG applications.

You are also welcome to check out the video version of this article.

Happy coding!
