
How to add Domain-Specific Knowledge to an LLM Based on Your Data

Turn your LLM into a field expert

Photo by Hubi’s Tavern on Unsplash

Introduction

In recent months, Large Language Models (LLMs) have profoundly changed the way we work and interact with technology, and have proven to be helpful tools in various domains, serving as writing assistants, code generators, and even creative collaborators. Their ability to understand context, generate human-like text, and perform a wide range of language-related tasks has propelled them to the forefront of Artificial Intelligence research.

While LLMs excel at generating generic text, they often struggle when confronted with highly specialized domains that demand precise knowledge and nuanced understanding. When used for domain-specific tasks, these models can exhibit limitations or, in some cases, even produce erroneous or hallucinatory responses. This highlights the need for incorporating domain knowledge into LLMs, enabling them to better navigate complex, industry-specific jargon, exhibit a more nuanced understanding of context, and limit the risk of producing false information.

In this article, we will explore one of several strategies for infusing domain knowledge into LLMs, allowing them to perform at their best within specific professional contexts: retrieving relevant chunks of documentation and adding them to the prompt as context alongside the query.

This method works with any type of documentation, and it only uses secure, open-source technologies that run locally on your computer, without any need to access the internet. Thanks to that, I could use it on personal and confidential data that I didn’t want third-party websites to access.

Principle

Here’s a breakdown of how it works:

Graph explanation of the process. Image by Author.

The first step is to take our documentation and build a vector index database from it. Vector databases are a type of database designed to store and query high-dimensional vectors efficiently. They enable fast similarity and semantic search, letting users find the vectors closest to a given query vector according to some distance metric, instead of querying values stored in rows and columns as in traditional OLTP and OLAP databases.

That means that we can create embeddings that represent any documentation and populate the database with it.
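To make the idea of a “distance metric” concrete, here is a tiny illustrative sketch (not part of the repo) that finds the documentation chunk whose embedding is closest to a query embedding, using cosine similarity and NumPy:

import numpy as np

# Toy example: pretend these are the embeddings of three documentation chunks
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.9],
])

# ...and this is the embedding of our query
query_vector = np.array([0.85, 0.15, 0.05])

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine_similarity(query_vector, v) for v in doc_vectors]
best = int(np.argmax(scores))
print(f"Most relevant chunk: {best} (similarity {scores[best]:.3f})")  # chunk 0 wins here

A vector database like Qdrant performs the same kind of comparison, but at scale and with indexes optimized for fast nearest-neighbor search.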

Then, once it is built, we can perform a query that will also be embedded, and injected into the vector index database, which will return the most related pieces of documentation for our query.

Finally, these can be injected into a local LLM as context alongside our original query. This way, the selected context will be small enough to be accepted by most LLMs and, since it is related to our query, the model will have sufficient knowledge to accurately answer the question. A little bit of prompt engineering can also help.
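Putting those three steps together, here is a hedged sketch of the retrieval-then-prompt flow. The actual repo wraps this logic in its own DocSearch class and uses an Instructor-based embedder; below, sentence-transformers and the Qdrant Python client are substituted purely for illustration, and the collection name pep_docs and the "text" payload field are hypothetical:

from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# Stand-in embedder; the repo uses an Instructor-based embedder instead
embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(host="localhost", port=6333)  # the Dockerized Qdrant instance

def retrieve_context(query: str, top_k: int = 1) -> str:
    """Embed the query, search the index, and return the best-matching chunks."""
    query_vector = embedder.encode(query).tolist()
    hits = client.search(
        collection_name="pep_docs",  # hypothetical collection name
        query_vector=query_vector,
        limit=top_k,
    )
    return "\n\n".join(hit.payload["text"] for hit in hits)  # "text" is an illustrative payload field

def build_prompt(query: str) -> str:
    """Inject the retrieved chunks as context alongside the original query (full template shown later)."""
    context = retrieve_context(query)
    return f"Relevant documentation: {context}\n\nQuery: {query}\n\nResponse:"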

Case example and repo

In this article, we will use a local, open-source LLM and inject it with domain knowledge of all the Python Enhancement Proposals (PEPs). This principle can be applied to any sort of documentation, but I’ll use the PEPs because they are easily accessible and in the public domain, which makes them perfect as an example dataset.

You can find the full code that I used to write this article in this repo: https://github.com/Anvil-Late/knowledge_llm/tree/main


Quick preview of the results

Here’s what the results look like:

Example of a query being processed and answered. Image by Author.

How to install the LLM

If you don’t have an LLM installed on your computer, you can find a step-by-step guide on how to do that here: https://medium.com/better-programming/how-to-run-your-personal-chatgpt-like-model-locally-505c093924bc


How to build and query the Vector Index Database

You can find the full code to build the vector index database in this repo: https://github.com/Anvil-Late/knowledge_llm/tree/main

Broadly speaking, in the src folder (a simplified sketch of these steps follows the list):

  • parse.py creates the PEP corpus
  • embed.py creates the embedded corpus
  • You can pull the Docker image of the Qdrant vector index database and run it with the commands docker pull qdrant/qdrant and docker run -d -p 6333:6333 qdrant/qdrant
  • create_index.py creates and populates the vector index database
  • query_index.py embeds a query and retrieves the most relevant documentation
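If you just want the gist without reading the repo, here is a hedged sketch of what the embed and create_index steps amount to. The actual scripts differ in the details (they use an Instructor-based embedder and their own payload schema); sentence-transformers, the naive chunking, and the collection name pep_docs below are stand-ins for illustration:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

def build_index(chunks, collection="pep_docs"):
    """Embed documentation chunks and load them into the local Qdrant instance."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedder
    vectors = embedder.encode(chunks)                   # one vector per chunk

    client = QdrantClient(host="localhost", port=6333)  # the Dockerized Qdrant container
    client.recreate_collection(
        collection_name=collection,
        vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
    )
    client.upsert(
        collection_name=collection,
        points=[
            PointStruct(id=i, vector=vec.tolist(), payload={"text": chunk})
            for i, (vec, chunk) in enumerate(zip(vectors, chunks))
        ],
    )

# Example usage with a toy corpus (parse.py would produce real PEP chunks instead)
build_index(["PEP 8 is the style guide for Python code.", "PEP 20 is the Zen of Python."])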

If you need more details, you can find my step-by-step guide here: https://betterprogramming.pub/efficiently-navigate-massive-documentations-ai-powered-natural-language-queries-for-knowledge-372f4711a7c8


Combine everything

First, we’ll write a script that generates a prompt for the LLM:

import os
from query_index import DocSearch
import logging
import re
from utils.parse_tools import remove_tabbed_lines
logging.disable(logging.INFO)

def set_global_logging_level(level=logging.ERROR, prefices=[""]):
    """
    Override logging levels of different modules based on their name as a prefix.
    It needs to be invoked after the modules have been loaded so that their loggers have been initialized.

    Args:
        - level: desired level. e.g. logging.INFO. Optional. Default is logging.ERROR
        - prefices: list of one or more str prefices to match (e.g. ["transformers", "torch"]). Optional.
          Default is `[""]` to match all active loggers.
          The match is a case-sensitive `module_name.startswith(prefix)`
    """
    prefix_re = re.compile(fr'^(?:{ "|".join(prefices) })')
    for name in logging.root.manager.loggerDict:
        if re.match(prefix_re, name):
            logging.getLogger(name).setLevel(level)

def main(
    query,
    embedder = "instructor",
    top_k = None, 
    block_types = None, 
    score = False, 
    open_url = True,
    print_output = True
    ):

    # Set up query
    query_machine = DocSearch(
        embedder=embedder,
        top_k=top_k,
        block_types=block_types,
        score=score,
        open_url=open_url,
        print_output=print_output
    )

    query_output = query_machine(query)

    # Generate prompt
    prompt = f"""
Below is relevant documentation and a query. Write a response that appropriately completes the query based on the relevant documentation provided.

Relevant documentation: {remove_tabbed_lines(query_output)}

Query: {query}

Response: Here's the answer to your query:"""

    print(prompt)
    return prompt

if __name__ == '__main__':
    set_global_logging_level(logging.ERROR, ["transformers", "nlp", "torch", "tensorflow", "tensorboard", "wandb"])
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--query', type=str, default=None)
    parser.add_argument('--top_k', type=int, default=5)
    parser.add_argument('--block_types', type=str, default='text')
    parser.add_argument('--score', type=bool, default=False)
    parser.add_argument('--open_url', type=bool, default=False)
    parser.add_argument('--embedder', type=str, default='instructor')
    parser.add_argument('--print_output', type=bool, default=False)
    args = parser.parse_args()
    main(**vars(args))

logging.disable(logging.INFO) and set_global_logging_level prevent excessive log output during execution, since everything printed by this script will be captured.

We combine this prompt generation with the injection of the prompt into the LLM using the following bash script:

#!/bin/bash

# Get the query from the command-line argument
query="$1"

# Launch prompt generation script with argument --query
if ! prompt=$(python src/query_llm.py --query "$query" --top_k 1); then
    echo "Error running query_llm.py"
    exit 1
fi

# Run the terminal command
<PATH_TO_LLAMA.CPP>/main \
    -t 8 \
    -m <PATH_TO_LLAMA.CPP>/models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin \
    --color \
    -c 4000 \
    --temp 0.1 \
    --repeat_penalty 1.1 \
    -n -1 \
    -p "$prompt" \
    -ngl 1

What happens here is that the prompt generation script prints the prompt, and the bash script captures it in the $prompt variable, which is then passed to the llama.cpp ./main command via the -p (or --prompt) parameter.

The LLM will then take over and complete the prompt starting from ‘Response: Here’s the answer to your query:’.

Remember to replace <PATH_TO_LLAMA.CPP> with the path to your llama.cpp clone on your computer, and Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin with your own LLM. Personally, I chose this one because it gave me pretty good results and it is not under a restrictive license, but feel free to try other models!
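If you prefer to stay entirely in Python rather than using the bash wrapper, the same flow can be driven with subprocess. This is a hedged sketch, assuming the prompt-generation script above is saved as src/query_llm.py and reusing the same llama.cpp binary, model, and flags shown in the bash script (the paths remain placeholders to replace):

import subprocess

from query_llm import main as build_prompt  # the prompt-generation script shown above

def answer(query: str) -> str:
    """Generate the RAG prompt, then let llama.cpp complete it."""
    prompt = build_prompt(query, top_k=1, open_url=False, print_output=False)
    result = subprocess.run(
        [
            "<PATH_TO_LLAMA.CPP>/main",  # placeholder: path to your llama.cpp clone
            "-t", "8",
            "-m", "<PATH_TO_LLAMA.CPP>/models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin",
            "-c", "4000",
            "--temp", "0.1",
            "--repeat_penalty", "1.1",
            "-n", "-1",
            "-p", prompt,
            "-ngl", "1",
        ],  # --color is dropped because the output is captured, not displayed in a terminal
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(answer("What does PEP 8 say about line length?"))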

Conclusion

Let’s recap what we have accomplished here:

Throughout this article, we have delved into an effective strategy to augment the capabilities of Large Language Models (LLMs) by infusing them with domain knowledge. While LLMs have demonstrated remarkable proficiency in a variety of tasks, they often encounter difficulties when confronted with highly specialized domains that necessitate precise knowledge and nuanced understanding.

To address these limitations, we explored a methodology that involves incorporating domain-specific documentation into LLMs. By constructing a vector index database based on the documentation, we established a foundation for efficient similarity and semantic search. This allowed us to identify the most relevant pieces of documentation for a given query, which could then be injected as context into a local LLM.

The approach we presented was exemplified using the Python Enhancement Proposals (PEPs) as a representative dataset. However, it is important to note that this methodology is applicable to any form of documentation. The code snippets and repository provided in this article serve as practical demonstrations of the implementation process.

By following the outlined steps, users can enhance LLM performance within specific professional contexts, enabling the models to navigate complex industry-specific jargon and generate more accurate responses. Moreover, the secure and open-source technologies employed in this strategy ensure that the process can be executed locally without external internet dependencies, thereby safeguarding privacy and confidentiality.

In conclusion, infusing domain knowledge into LLMs empowers these models to excel at specialized tasks, since they gain a deeper understanding of the context in which they operate. The implications extend across diverse fields, enabling LLMs to provide assistance and insights tailored to specific professional requirements. By combining the potential of LLMs with domain expertise, we unlock new possibilities for improving human-AI interaction and applying artificial intelligence in specialized domains.

Thank you for reading!

If you have any questions, don’t hesitate to leave them in the comments; I’ll do my best to answer!

If you liked this, you can also support my work on Medium directly and get unlimited access by becoming a member using my referral link here 🙂

