Learning Retrieval Augmented Generation

It may not come as a surprise that retrieval augmented generation (RAG) is among the most widely applied techniques in the world of generative AI and large language model-powered applications. In fact, according to a Databricks report, more than 60% of LLM-powered applications use RAG in some form. With the global LLM market currently valued at around $6 billion and growing at almost 40% year over year, RAG is undoubtedly one of the crucial techniques to master.
Building a PoC RAG pipeline is not too challenging today. There are readily available examples of code leveraging frameworks like LangChain or LlamaIndex and no-code/low-code platforms like RAGArch, HelloRAG, etc.
Conversing with Documents: Unleashing the Power of LLMs and LangChain
A production-grade RAG system, on the other hand, is composed of several specialised layers specific to generative AI applications complementing the standard software architecture. All these layers, stacked together and supported by a technology infrastructure, create a robust RAG system design. This we will call an operations stack for RAG, or a RAGOps Stack.
In this blog post, we will have a detailed discussion on the components of the RAGOps stack. Before going deep into the layers of the stack, we’ll build context with a quick introduction to RAG and the overarching anatomy of a RAG system. The blog will include the following sections:
- Introduction to Retrieval Augmented Generation – This section will cover the what and why of RAG along with some real-world use cases.
- Anatomy of a RAG system – Here, we will look at the core components of the RAG pipeline and a few challenges associated with RAG.
- RAG Ops Stack: This section will be an introduction to the Ops stack beyond the core RAG pipeline.
- Critical Layers: This section will discuss the four mandatory layers for a production RAG system. These are the data layer, model layer, deployment layer and application orchestration layer.
- Essential Layers: This section covers the important layers that enhance the performance, reliability and safety of a RAG system. The layers here include the prompt layer, caching layer, security & privacy layer and the evaluation & monitoring layers.
- Enhancement Layers: This section covers a few of those use-case specific layers like human-in-the-loop, explainability, personalisation, etc., which are tailored to the requirements of the business.
- Best Practices: The concluding discussion will focus on certain challenges and the best practices to overcome them.
In case you’re already familiar with RAG pipelines and are more interested in the RAGOps stack, you can skip the first two sections and start reading from section 3. I hope you find this article as enjoyable as I found researching and writing it. Let’s get started.
Before we begin, please pardon me for some shameless self promotion. This blog is based on Chapter 7 of my book, A Simple Guide to Retrieval Augmented Generation. If you like what you read here and want to build contextual AI systems that fulfil their potential, this book will be a good starting point. Early access is available at a discount at manning.com! Please click on my affiliate link below to get your copy.
Introduction to Retrieval Augmented Generation
30th November 2022 will be remembered as a watershed moment in Artificial Intelligence. OpenAI released ChatGPT and the world was mesmerised. We are nearing the two-year mark since that date, and terms like generative AI, Large Language Models (LLMs) and transformers have gained unprecedented popularity. This is thanks to the phenomenal ability of LLMs to process and generate natural language (initially text, but now even images and other data modalities).
LLMs are large machine learning models trained on massive volumes of data, leveraging an architecture known as the transformer architecture. You are encouraged to read more about transformers and LLMs. The important thing to understand for the purpose of understanding RAG is that LLMs are designed to predict the next word given a sequence of words.

If you’re interested, there’s a fantastic new book by Sebastian Raschka that is a deep dive into training LLMs. Please click on my affiliate link below to get your copy –
The usage of LLMs has soared. Users can write emails, caption their Instagram photos, have a casual conversation with ChatGPT and even generate blogs like this one. However, with usage, expectations have also exploded. At a high level, there are three expectations from any LLM and the applications built on them. We expect LLMs to be Comprehensive i.e., know everything, be Current i.e., be up-to-date with information, and be factually Correct every single time. LLMs, on the other hand, are designed just to predict the next word in a sequence. There are three main limitations that prevent them from being comprehensive, current and correct –
- A Knowledge Cut-off Date: Training an LLM is both expensive and time-consuming. It requires vast amounts of data and takes several weeks, or even months, to train an LLM. The data that LLMs are trained on is therefore never fully current. Any event that happened after the cut-off date of the data is not available to the model.
- Training dataset constraints: LLMs, as we already read, have been trained on large volumes of data drawn from a variety of sources, including the open internet. They do not have any knowledge of information that is not public. LLMs have not been trained on non-public information like internal company documents, customer information, product documents, etc.
- Hallucinations: Often, it is observed that LLMs provide responses that are factually incorrect. Despite being factually incorrect, the LLM responses sound extremely confident and legitimate. This characteristic of "lying with confidence", called hallucinations, has proved to be one of the biggest criticisms of LLMs.

Does that mean this technology is not useful? Absolutely not – if it were, the hype would have died down by now. LLMs, because of their tremendous ability to understand language, can consume and process information with extreme efficiency. If you can point an LLM to a source of information, it can process that information to generate accurate results rooted in the source. This source of information can be your company documents, third-party databases, or even the internet.
This is the main idea behind Retrieval Augmented Generation, and in 2024, it is one of the most widely used techniques in generative AI applications.
What is RAG?
In one line, Retrieval Augmented Generation is a technique that provides an LLM with information it may not have but that is necessary to respond to a user’s query (or prompt, as we call it in generative AI parlance). To understand RAG, let’s understand two concepts of ‘memory’ –
- Parametric Memory: LLMs, you may know, are incredible stores of knowledge in themselves. Based on the data they’ve been trained on, they store factual information. This storage, inherent to an LLM, is also called ‘parametric memory’, or memory that is stored in the model parameters. But, as we’ve been discussing, this parametric memory is not enough because of the three limitations above.
- Non-parametric Memory: To make LLMs more useful, they also need information that is not stored in their parametric memory. This can be thought of as ‘non-parametric memory’, or memory that is not stored in the model parameters.
The technique of supplementing the parametric memory of an LLM by providing access to an external, non-parametric source of information, thereby enabling the LLM to generate an accurate response to the user query, is called Retrieval Augmented Generation.

In a way, RAG can supplement an LLM’s internal knowledge with virtually unlimited external non-parametric memory. The data from the external sources can be cited for increased trust and reliability. It has also been demonstrated that RAG systems are less prone to hallucinations. Let’s continue to build on this and see how RAG works.
How does it work?
At the core of the system is still an LLM (an LLM which can be large or small, open source or proprietary, a foundation model or fine-tuned). We all know that when we prompt an LLM, it generates a response. But we’ve been saying that these responses can be sub-optimal and inaccurate. If we can find a way to search through an information store, or knowledge base, to fetch an accurate source of information, then add this information to the prompt and pass it to the LLM, we can expect the LLM to generate responses that are accurate and rooted in a verifiable source.
To enable this search and retrieval of information, a retriever component is introduced into the system. So now the three steps of the process become:
- retrieval of accurate information,
- augmentation of the user prompt with the retrieved information,
- and lastly, generation of an accurate and contextual response using an LLM.

Real-World Applications of RAG
RAG is not just an exercise in theory. RAG systems today power applications like search engines such as Perplexity, Google, and Bing, advanced Question Answering systems, and conversational agents like customer support bots. RAG also enables personalisation in AI-generated content and is being used in educational tools and legal research amongst other domains.
Some of the use cases of RAG systems in production are –
- Search Engine Experience: Conventional search results are shown as a list of page links ordered by relevance. More recently, Google Search, Perplexity and You.com have used RAG to present a coherent piece of text, in natural language, with source citations. As a matter of fact, search engine companies are now building LLM-first search engines where RAG is the cornerstone of the algorithm. Even ChatGPT now has a web search mode.
- Conversational agents: LLMs can be customised to product/service manuals, domain knowledge, guidelines, etc. using RAG. The agent can also route users to more specialised agents depending on their query. SearchUnify has an LLM+RAG powered conversational agent for their users.
- Real-time Event Commentary: Imagine a live event like a sports match or a news event. A retriever can connect to real-time updates/data via APIs and pass this information to the LLM to create a virtual commentator. These can further be augmented with Text-to-Speech models. IBM leveraged this technology for commentary during the 2023 US Open.
- Content Generation: The widest use of LLMs has probably been in content generation. Using RAG, the generation can be personalised to readers, incorporate real-time trends and be contextually appropriate. Yarnit is an AI-based content marketing platform that uses RAG for multiple tasks.
- Personalised Recommendation: Recommendation engines have been a game changer in the digital economy. LLMs are capable of powering the next evolution in content recommendations. Check out Aman’s blog on the utility of LLMs in recommendation systems.
- Virtual Assistants: Virtual personal assistants like Siri, Alexa and others are now using LLMs to enhance the experience. Coupled with more context on user behaviour, these assistants can become highly personalised.
Creating Impact: A Spotlight on 6 Practical Retrieval Augmented Generation Use Cases
So, how do you build one?
Anatomy of a RAG System
To construct a RAG-enabled system there are several components that need to be assembled. This includes creation and maintenance of the non-parametric memory, or a knowledge base, for the system. Another required process is one that facilitates real-time interaction by sending the prompts to and accepting the response from the LLM, with retrieval and augmentation steps in the middle. Evaluation is yet another critical component, ensuring the effectiveness of the system. All these components of the system need to be supported by a robust service infrastructure.
Overview of RAG Pipelines
The retrieval, augmentation, and generation components form the generation pipeline that the user interacts with in real time. The generation pipeline retrieves information from the knowledge base. Therefore, it is critical to establish a process that can create and maintain the knowledge base. This is done through another pipeline known as the indexing pipeline.
Indexing Pipeline
The set of processes employed to create the knowledge base for RAG applications forms the indexing pipeline. It is a non-real-time pipeline that updates the knowledge base at periodic intervals. The indexing pipeline can be summarised in five steps –
Step 1 : Connect to previously identified external sources
Step 2 : Extract documents and parse text from these documents
Step 3 : Break down long pieces of text into smaller manageable pieces
Step 4 : Convert these small pieces into a suitable format
Step 5 : Store this information
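To make these steps concrete, here is a minimal indexing sketch using LangChain and Chroma. Treat it as illustrative only – the package names vary across LangChain versions, and the file path, chunk sizes and embedding model are assumptions you would replace with your own.

```python
# Minimal indexing-pipeline sketch (package names vary across LangChain versions;
# the file path, chunk sizes and embedding model below are illustrative).
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Steps 1 & 2: connect to a source and extract/parse the text
docs = TextLoader("data/product_manual.txt").load()

# Step 3: break long text into smaller, manageable chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Steps 4 & 5: convert chunks into embeddings and store them in a vector database
vectorstore = Chroma.from_documents(
    chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./knowledge_base",
)
```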
Read more about the indexing pipeline in the blogs below –
Generation Pipeline
The set of processes employed to search and retrieve information from the knowledge base to generate responses to user queries forms the generation pipeline. It facilitates real-time interaction with users. This can also be distilled into five steps.
Step 1: User asks a question to our system
Step 2: The system searches for information relevant to the input question
Step 3: The information relevant to the input question is fetched, or retrieved, and added to the input question
Step 4: This question + information is passed to an LLM
Step 5: The LLM responds with a contextual answer
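Continuing the indexing sketch above, here is a minimal generation-pipeline sketch. Again, this is only illustrative – the retriever settings, prompt wording and model name are assumptions.

```python
# Minimal generation-pipeline sketch, reusing the vector store built earlier
# (retriever settings, prompt wording and model name are illustrative).
from langchain_openai import ChatOpenAI

question = "What is the warranty period for the product?"

# Steps 2 & 3: search the knowledge base and retrieve relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
retrieved_docs = retriever.invoke(question)
context = "\n\n".join(doc.page_content for doc in retrieved_docs)

# Step 4: augment the question with the retrieved context
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

# Step 5: the LLM responds with a contextual answer
answer = ChatOpenAI(model="gpt-4o-mini").invoke(prompt)
print(answer.content)
```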
7 Retrieval Metrics for Better RAG Systems
RAG Value Chain: Retrieval Strategies in Information Augmentation for Large Language Models
The figure below illustrates the two pipelines coming together to form the core of the RAG system.

Key Components of RAG Systems
Apart from the two pipelines we can also think of certain other components that are required in a RAG system.
The main components of a RAG enabled system include –
- Data Loading component : connects to external sources, extracts and parses data
- Data Splitting component : breaks down large pieces of text into smaller manageable parts
- Data Conversion component : converts text data into a more suitable format
- Storage component : stores the data to create a knowledge base for the system
These four components above complete the indexing pipeline

- Retrievers : are responsible for searching and fetching information from the Storage
- LLM Setup : is responsible for generating the response to the input
- Prompt Management : enables the augmentation of the retrieved information to the original input
These three components complete the generation pipeline

- Evaluation component : measures the accuracy and reliability of the system before and after deployment
- Monitoring : tracks the performance of the RAG-enabled system and helps detect failures
- Service Infrastructure : in addition to facilitating deployment and maintenance, ensures a seamless integration of various system components for optimal performance.
Other components include caching which helps store previously generated responses to expedite retrieval for similar queries, guardrails to ensure compliance with policy, regulation and social responsibility, and security to protect LLMs against breaches like prompt injection, data poisoning etc.

This high level anatomy is the intuition behind a robust operations stack for RAG. Let us now delve deeper into the RAGOps stack.
In case you’re interested in coding a simple RAG pipeline in python using LangChain, check out the repository below –
GitHub – abhinav-kimothi/A-Simple-Guide-to-RAG: This repository is the source code for examples and…
The RAGOps Stack
A standard software application stack may include layers like a database, runtime, front-end framework, OS, middleware, etc. A RAG system includes additional components on top of these: vector stores and embeddings models, which are essential components of the indexing pipeline; knowledge graphs, which are increasingly popular indexing structures; different kinds of language models in the generation component; and prompt management, which is becoming increasingly complex.
The production ecosystem for RAG and LLM applications is still evolving, though early tooling and design patterns have emerged. RAGOps (RAG Operations) refers to the operational practices, tools, and processes involved in deploying, maintaining, and optimising RAG systems in production environments.
Note: RAG, like generative AI in general, is an evolving technology and therefore the operations stack continues to evolve. You may find varying definitions and structures.
The RAGOps stack can be visualised as layers in three categories –
- Critical layers that are fundamental to the operation of a RAG system. A RAG system is likely to fail if any of these layers are missing or incomplete.
- Essential layers that are important for performance, reliability and safety of the system. These essential components bring the system to a standard that provides value to the user.
- Enhancement layers that improve the efficiency, scalability and usability of the system. These components are used to make the RAG system better and are decided based on the end requirements.

Let us now discuss these layers one by one.
The RAGOps Stack: Critical Layers
The critical layers enable the two core pipelines of the RAG system – the indexing pipeline and the generation pipeline. There are four layers that are critical to the stack.
Data Layer
The data layer is responsible for collecting data from source systems, transforming it into a usable format and storing it for efficient retrieval. It can have three components –
- Data Ingestion component collects data from source systems like databases, content management systems, file systems, APIs, devices, and even the internet. Data Ingestion Tools: AWS Glue, Azure Data Factory, Google Cloud Dataflow, Fivetran, Apache NiFi, Apache Kafka and Airbyte are amongst several tools available for use. For rapid prototyping and Proof of Concepts (PoCs), frameworks like LangChain and LlamaIndex have inbuilt functions that can assist in connecting to some sources and extracting information.
- Data Transformation component converts the ingested data from a raw form to a usable form. The processes of chunking, embedding, cleaning, metadata creation, etc. are the responsibility of this component. Data Transformation Tools: AWS Glue, Azure Data Factory, Google Cloud Dataflow, Fivetran, Apache NiFi, Apache Kafka, Airbyte, Apache Spark, dbt, Unstructured.io, etc.
- Data Storage component stores the transformed data in a way that allows for fast and efficient retrieval. This includes document storage, vector storage and graph storage. Data Storage Tools: Pinecone is a fully managed, cloud-native vector database service. Milvus, Qdrant and Chroma are amongst the open-source vector databases. Document stores like Redis, S3, etc. can be used to store raw files. Neo4j is the leading graph data store.
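To illustrate the transformation component, here is a small, framework-free sketch of chunking with metadata creation. The chunk size, overlap and metadata fields are assumptions rather than a standard.

```python
# Framework-free chunking with metadata (chunk size, overlap and fields are illustrative).
def chunk_document(text: str, source: str, chunk_size: int = 500, overlap: int = 50):
    """Split raw text into overlapping chunks, each tagged with metadata."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "text": text[start:end],
            "metadata": {
                "source": source,     # where the chunk came from
                "char_start": start,  # position for traceability and citations
                "char_end": end,
            },
        })
        start = end - overlap if end < len(text) else end
    return chunks

chunks = chunk_document(open("data/policy.txt").read(), source="policy.txt")
```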

A strong data layer is the foundation of an efficient RAG system. The data layer also comes in handy when fine-tuning of models is required.
Model Layer
Foundation models like LLMs, embeddings models, etc. enable generative AI applications. These can be open-source or proprietary models provided by service providers. Some can be custom trained or fine-tuned. The components of the model layer are –
- A Model Library contains the list of models that have been chosen for the application. It can contain pre-trained LLMs (foundation models), fine-tuned models, embeddings models and task-specific models. Models: Providers like OpenAI, Google (Gemini), Voyage AI and Cohere offer a variety of embeddings model choices, and a host of open-source embeddings models can also be used via HuggingFace Transformers. The GPT series by OpenAI, the Gemini series by Google, the Claude series by Anthropic, and the Command R series by Cohere are popular proprietary LLMs. The Llama series by Meta and Mistral are open-source models that have gained popularity.
- Model Training and Fine-tuning component is responsible for building custom models and fine-tuning foundation models on custom data. This is generally done for task-specific models and for domain adaptation. Model Training and Fine-tuning Tools: HuggingFace, AWS SageMaker, Azure ML, etc.
- Inference Optimization component is responsible for generating responses quickly and cost-effectively. Inference Optimization Tools: ONNX and NVIDIA TensorRT-LLM are popular frameworks that optimize inferencing.
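To make the model library idea concrete, here is a hedged sketch of wiring up one proprietary and one open-source embeddings model side by side. The model identifiers are examples, not recommendations.

```python
# Proprietary embeddings via a hosted API vs. an open-source model run locally
# (model names are examples, not recommendations).
from openai import OpenAI
from sentence_transformers import SentenceTransformer

texts = ["RAG augments an LLM with external knowledge."]

# Proprietary embeddings via API (assumes OPENAI_API_KEY is set in the environment)
client = OpenAI()
api_embedding = client.embeddings.create(
    model="text-embedding-3-small", input=texts
).data[0].embedding

# Open-source embeddings via HuggingFace / sentence-transformers, run locally
local_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
local_embedding = local_model.encode(texts)[0]

print(len(api_embedding), len(local_embedding))  # dimensions differ by model
```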

Model Deployment
Model deployment is responsible for making the RAG system available to the application layer. It handles the infrastructure on which the models run. The main methods of model deployment are –
- Fully Managed Deployment can be provided by proprietary model providers, where all infrastructure for model deployment, serving, and scaling is managed and optimized by these providers. Fully Managed Service Providers: OpenAI, Google, Anthropic, Cohere, AWS SageMaker, Google Vertex AI, Azure Machine Learning, HuggingFace, Amazon Bedrock.
- Self-hosted Deployment is enabled by cloud VM providers. The models are deployed in private clouds or on-premises, and the infrastructure is managed by the application developer. Self-hosted Solutions: VM providers like AWS, GCP and Azure, hardware providers like Nvidia, Kubernetes and Docker for containerisation, and Nvidia Triton Inference Server for inference optimization.
- Local/edge Deployment involves running optimized versions of models on local hardware or edge devices, ensuring data privacy, reduced latency, and offline functionality. Local/edge Deployment Solutions: ONNX, TensorFlow Lite, PyTorch Mobile, GGML, NVIDIA TensorRT, GPT4All.
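As an illustration of the self-hosted option, many serving stacks (vLLM, Ollama and others) expose an OpenAI-compatible endpoint, so the application code only needs a different base URL. The URL, API key and model name below are placeholders for your own deployment.

```python
# Calling a self-hosted model through an OpenAI-compatible endpoint
# (base_url, api_key and model name are placeholders for your own deployment).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the server is serving
    messages=[{"role": "user", "content": "Summarise our returns policy."}],
)
print(response.choices[0].message.content)
```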

With the data and the model layers, most essential components of the RAG system are in place. Now we need a layer that manages the co-ordination between the data and the models. This is the responsibility of the Application Orchestration Layer.
Application Orchestration Layer
An application orchestration layer is like a musical conductor leading a group of musicians in an orchestra. It is responsible for managing the interactions amongst the other layers in the system. The major components of the orchestration layer are –
- Query Orchestration component is responsible for receiving and orchestrating user queries. All pre-retrieval query optimization steps like query classification, query expansion, query rewriting, etc. are orchestrated by this component.
- Retrieval Coordination component hosts the various retrieval logics. Depending on the input from the query orchestration module it will select the appropriate retrieval method (dense retrieval or hybrid retrieval etc.) and interact with the data layer.
- Generation Coordination component receives the query and the context from the previous components and coordinates all the post-retrieval steps. Its primary function is to interact with the model layer and prompt the LLM to generate the output.
- Multi-agent orchestration component is used for agentic RAG where multiple agents handle specific tasks.
- Workflow automation component can sometimes be employed for managing the flow and the movement of data between different components.
Orchestration Frameworks & Tools: LangChain and LlamaIndex. Microsoft’s AutoGen and CrewAI are upcoming frameworks for multi-agent orchestration. Apache Airflow and Dagster are popular tools used for workflow automation.
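A highly simplified sketch of what query orchestration and retrieval coordination might look like is shown below. The classification rule is a toy, and dense_retrieve/hybrid_retrieve are hypothetical placeholders for your actual retrievers.

```python
# Toy sketch of query orchestration + retrieval coordination.
# dense_retrieve and hybrid_retrieve are placeholders for real retrievers.
def classify_query(query: str) -> str:
    """Toy classifier: route short, keyword-like queries to hybrid retrieval."""
    return "hybrid" if len(query.split()) <= 4 else "dense"

def route_and_retrieve(query: str, dense_retrieve, hybrid_retrieve, k: int = 4):
    strategy = classify_query(query)
    if strategy == "hybrid":
        return hybrid_retrieve(query, k=k)  # e.g. BM25 + vector search
    return dense_retrieve(query, k=k)       # pure vector similarity search
```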

These four critical layers complete the core RAG system. This core system can interact with the end software application layer, which acts as the interface between the RAG system and the user. The application layer can be custom-built or leverage hosting platforms like Streamlit, Vercel, and Heroku.

The next set of layers improves the reliability, performance and usability of a RAG system.
The RAGOps Stack: Essential Layers
The critical layers, by themselves, do not evaluate or monitor the system. Web applications are also vulnerable to cyber attacks, and latency and cost are growing concerns in the field of generative AI. The essential layers address these challenges and make the RAG system viable.
Prompt Engineering
The critical application orchestration layer, which is responsible for co-ordination amongst the components of a RAG system also manages the prompts (or instructions) that are sent as input to the LLMs. While this is manageable independently by the orchestration layer in small scale systems, in more complex systems the number of prompts can be in hundreds or even thousands. Poor prompting leads to hallucinations and imperfect responses. Therefore, a separate layer is essential for crafting and managing prompts. Tools like Azure Prompt Flow, LangChain Expression Language (LCEL), Weights & Biases prompts, PromptLayer come in handy.
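Here is a minimal sketch of what a prompt layer manages once prompts move out of application code: versioned templates with named placeholders. The template text and versioning scheme are illustrative.

```python
# Minimal versioned prompt registry (template text and version scheme are illustrative).
PROMPTS = {
    ("rag_answer", "v2"): (
        "You are a helpful assistant. Answer using only the context provided.\n"
        "If the context is insufficient, say so.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
}

def render_prompt(name: str, version: str, **variables) -> str:
    """Fetch a template by name and version, then fill in its placeholders."""
    return PROMPTS[(name, version)].format(**variables)

prompt = render_prompt("rag_answer", "v2", context="...", question="What is RAG?")
```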
Evaluation Layer
Regular evaluation of retrieval accuracy, context relevance, faithfulness and answer relevance is necessary to ensure the quality of responses. TruLens by TruEra, Ragas, and Weights & Biases are commonly used platforms and frameworks for evaluation; ARISE and ARES are other popular evaluation frameworks.
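As a rough illustration of one such check, the sketch below scores faithfulness by asking a judge LLM whether the answer is supported by the retrieved context. The prompt and 1–5 scale are simplified assumptions, not the method of any particular framework.

```python
# Simplified LLM-as-judge faithfulness check (prompt and scale are illustrative,
# not the method of any specific evaluation framework).
from openai import OpenAI

client = OpenAI()

def faithfulness_score(question: str, context: str, answer: str) -> int:
    judge_prompt = (
        "Rate from 1 (unsupported) to 5 (fully supported) how well the answer "
        "is supported by the context. Reply with a single digit.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return int(reply.choices[0].message.content.strip()[0])
```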
A previous blog discusses evaluation in detail. If it is of interest to you, please give it a read.
Stop Guessing and Measure Your RAG System to Drive Real Improvements
Monitoring Layer
While evaluation comes in handy during the development of the system, continuous monitoring ensures its long-term health. Observing the execution of the processing chain is essential for understanding system behaviour and identifying points of failure. In addition to tracking regular system metrics like resource utilisation, latency and error rates, the monitoring layer also assesses the information going to the language models. ARISE, Ragas and ARES are evaluation frameworks that are also used in monitoring. TraceLoop, TruLens and Galileo are examples of providers that offer monitoring services.
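At its simplest, monitoring can start with structured logs around each stage of the chain. The decorator below is a bare-bones sketch that stands in for whatever observability tooling you adopt.

```python
# Bare-bones monitoring sketch: log latency and errors for each pipeline stage.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def monitored(stage: str):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                logging.info("%s succeeded in %.2fs", stage, time.perf_counter() - start)
                return result
            except Exception:
                logging.exception("%s failed after %.2fs", stage, time.perf_counter() - start)
                raise
        return wrapper
    return decorator

@monitored("retrieval")
def retrieve(query: str):
    ...  # call the actual retriever here
```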
LLM Security and Privacy
Software security is an independent and expansive domain. In the context of RAG, some additional considerations pop up. RAG systems need to follow all data privacy regulations. AI models are susceptible to manipulation and poisoning, and prompt injection is a malicious attack that uses crafted prompts to extract sensitive information. Data protection strategies like anonymisation, encryption and differential privacy should be employed. All of this is maintained in the security and privacy layer. Lakera, OWASP, Lasso Security, etc. are tools and resources that can be leveraged.
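As one small example of a data-protection strategy, a redaction step can mask obvious PII before text reaches the LLM or the knowledge base. The patterns below are deliberately simplistic and are no substitute for a dedicated tool.

```python
# Deliberately simplistic PII-masking sketch; real systems should use a
# dedicated library or service rather than a handful of regexes.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Contact jane.doe@example.com or +1 (555) 010-9999."))
```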
Caching Layer
Generative AI models have high costs and inherent latency associated with them. Semantic caching of frequently asked queries controls this to an extent and is therefore an important component of the RAGOps stack.
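A semantic cache stores the embedding of each answered query and serves the cached response when a new query is similar enough. The sketch below assumes an embed() function you supply and a similarity threshold that would need tuning per application.

```python
# Semantic-cache sketch: embed() is a placeholder for your embeddings model,
# and the 0.9 threshold is an assumption to be tuned per application.
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed      # function: str -> np.ndarray
        self.threshold = threshold
        self.entries = []       # list of (embedding, response) pairs

    def get(self, query: str):
        q = self.embed(query)
        for emb, response in self.entries:
            sim = float(np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb)))
            if sim >= self.threshold:
                return response  # cache hit: skip retrieval and generation
        return None

    def put(self, query: str, response: str):
        self.entries.append((self.embed(query), response))
```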
These essential layers stacked together with the critical layers create a robust, accurate and high performing RAG system.

With the critical and essential layers, the RAG system is good to go. But, there might be some more components needed depending on the requirements of the application being developed.
Enhancing RAG Systems with Optional Layers
Enhancement layers are the parts of the RAGOps stack that are optional but can lead to significant gains depending upon the use case environment. These are focused on efficiency and usability of the system.
Human-in-the-Loop
This layer provides human oversight to reduce bias and model hallucinations. It becomes critical in use cases that require near-perfect accuracy.
Cost Optimization
This layer helps manage resources efficiently, which is particularly important for large-scale systems.
Explainability and Interpretability
This layer helps provide transparency for system decisions, especially important for domains requiring accountability.
Collaboration and Experimentation
This layer enhances productivity and iterative improvements. Weights and Biases is a popular platform that helps track experiments.
Multimodal Layer
RAG applications are no longer text only. Data of other modalities, especially image, is now a regular feature of RAG applications. This layer manages the adapters to incorporate multimodal data into the RAG system.
There can be more such layers that cater to feedback, personalisation, scaling etc. The idea is that the stack should be modular and expandable.
With the knowledge of the critical, essential and enhancement layers, you should be ready to put together a technology stack to build your RAG system.
Factors affecting the choice of tools
There are several service providers, tools, and technologies that you can use in the development of RAG systems. Throughout our discussion above, we have listed examples of these. But how does one evaluate which tool to choose? There are six factors that you should consider depending on your requirements.
- Scalability and Performance required: The estimated volumes and the acceptable levels of latency should dictate the choice of auto-scaling, vector databases and inference optimisation tools.
- Integration with existing stack: If your system already operates on AWS, GCP, or Azure, using services that integrate well with these platforms can streamline development and maintenance.
- Cost efficiency: Costs, even with pay-as-you-go models, can escalate quickly with scale. Choose the model and deployment strategy with this in mind.
- Domain adaptation: The peculiarity of the domain for which the system is being developed will have a significant impact on the choice of embeddings and language model. It will also dictate whether custom training or fine-tuning is required.
- Vendor lock-in constraints: Generative AI is an evolving field and there are no clear winners yet. Use interoperable technologies wherever possible. This helps in maintaining flexibility.
- Community support: Access to resources, tutorials, troubleshooting, and regular updates can accelerate development and reduce debugging time. Tools with active communities like HuggingFace, LangChain, etc. are more likely to offer frequent updates, plugins, and third-party integrations.
Production Best Practices
It is inevitable that some issues creep up during development, deployment and even post-deployment. Though RAG is still in its nascent form, some early trends of common mishaps and best practices have emerged.
Addressing Latency
Due to pre-retrieval, retrieval, reranking, etc., RAG systems add to the inherent latency of the LLM. Query classification, hybrid retrieval filtering, limiting similarity searches and caching help in managing this latency.
Reducing Hallucinations
Though RAG is designed to reduce hallucinations, it can never eliminate them entirely. Adding post-generation validations and human verification may be necessary for high-risk applications.
Scalability Strategies
RAG systems may struggle with scalability as the number of users and the volume of data in the knowledge base grow. Autoscaling vector databases and cloud solutions should be employed if the usage is expected to grow rapidly.
Handling Privacy and Security
LLMs may expose sensitive data and PII. PII masking, data redaction, privacy filters have started playing an important role in the RAGOps stack.
A holistic RAGOps stack enables the building of production-grade RAG systems. Starting with an introduction to RAG, we delved into the anatomy of a RAG system before jumping into a detailed discussion on the layers of the RAGOps stack. This field is developing rapidly and new technologies and use cases are getting introduced every week. So are the challenges. The RAGOps stack is, consequently, bound to evolve.
What did you think about this discussion on the stack? Are there any layers that you find misplaced or missing from this framework? Which ones do you find most interesting? Please let me know in the comments.
If you liked what you read, please clap, comment and share this blog with your network.
This article is based on my book, A Simple Guide to Retrieval Augmented Generation published by Manning Publications. If you’re interested, do check it out.

My name is Abhinav and I’d love to stay connected on LinkedIn, X, Instagram and Medium. You can also check out my linktree for other resources.
I write about Machine Learning, LLMs, RAG and AI Agents. If this is of interest to you, please check out my other blogs –
A Taxonomy of Retrieval Augmented Generation
Beyond Naïve RAG: Advanced Techniques for Building Smarter and Reliable AI Systems
Stop Guessing and Measure Your RAG System to Drive Real Improvements
Generative AI Terminology – An evolving taxonomy to get you started
Getting the Most from LLMs: Building a Knowledge Brain for Retrieval Augmented Generation