
We live in a world of extremely messy data, as almost any data scientist or analyst knows. In nearly every data science project you start, the hardest part will be acquiring, packaging, and cleaning data in a way that allows you to deliver insights. For most companies and individuals these days, though, the issue isn't necessarily acquiring data (there's plenty of that), but rather packaging and cleaning unstructured, messy data.
Take the PDFs of a public company's SEC filings, for example. Those documents contain hundreds of pages of important company information, and new filings are created quarterly and annually, spanning from when the company went public to its most recent reports. That is an enormous amount of data, just not in the traditional sense: the documents contain tables, graphs, summaries, explanations, and much more. The problem with analyzing them usually isn't a lack of information (though sometimes that can be true); the larger issue is finding a quick and easy way to extract the details that matter most to the individual user. That last piece is possibly the most important part: the individual user. What I find important, another person may not care about, so how do we build something that can dynamically query this unstructured data source? That's exactly what I hope to cover in this post.
NOTE: I use the example of SEC filings, but the code and methods I am going to be sharing can be applied to almost any unstructured data you have (e.g. instruction manuals, cookbooks, website extracts, etc.).
Library Installs & Setup
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pymilvus import MilvusClient, model
from ollama import Client
Before diving into the meat and potatoes of how we will build this tool, these are the libraries and packages we need to install and import. As we work through the script, I will describe in detail what each one is used for.
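If you don't already have these installed, running something like the following in your terminal should cover them: pip install langchain-community langchain-text-splitters pymupdf "pymilvus[model]" ollama. Exact package names and extras can vary between versions, so check each library's documentation if an import fails.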
PDF Loading/Parsing
The exact SEC filing I will be using for this is Apple's 2023 10-K annual filing. All SEC filings for public companies are public domain, and you can download this exact filing by following this link:
https://investor.apple.com/sec-filings/sec-filings-details/default.aspx?FilingId=17028298
The first thing we need to do is read in our SEC filing and parse the text from the document. To do this we will leverage LangChain's PyMuPDFLoader. There are tons of different PDF loaders out there, but in my experience this one has consistently performed the best for what I've needed, so we will use it for this project. The following code snippet loads the document, where file_path is the local file path to our SEC filing.
loader = PyMuPDFLoader(file_path)
aapl_sec_filing_pages = loader.load()
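Each element of aapl_sec_filing_pages is a LangChain Document holding one page's text in page_content and its source information in metadata. A quick, purely optional sanity check confirms the load worked:
print(len(aapl_sec_filing_pages))                    # number of pages parsed from the PDF
print(aapl_sec_filing_pages[0].metadata)             # source file path, page number, etc.
print(aapl_sec_filing_pages[0].page_content[:300])   # preview of the first page's text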
After loading our PDF into our environment, we need to split the document into chunks that will be digestible for our LLM. We will do this using another LangChain utility called RecursiveCharacterTextSplitter. For the purposes of this project, we will use chunk sizes of 1,000 characters with an overlap of 100 characters (we pass length_function=len, so size is measured in characters rather than tokens). I have defined a function that will do this for us:
def text_split_documents(documents):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        length_function=len
    )
    return splitter.split_documents(documents)
After defining our function, we simply need to apply this function to our loaded PDF using the code below:
split_filing = text_split_documents(aapl_sec_filing_pages)
The RecursiveCharacterTextSplitter function returns the chunks of text along with the source of each chunk (document name and page number). This is important information that we will need later when feeding the context and sources to the LLM. To extract this information from split_filing we apply a couple of transformations to the data:
document_content = [chunk.page_content for chunk in split_filing]
document_source_names = [(chunk.metadata['source'].split('/')[-1] + f" (pg. {chunk.metadata['page']})") for chunk in split_filing]
This stores the page content and the sources in lists that we will index from later.
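If you want to verify the source strings, printing the first entry should give something like 'aapl_10k.pdf (pg. 0)', though the exact name depends on what you called your local file:
print(document_source_names[0])   # e.g. 'aapl_10k.pdf (pg. 0)', depending on your file name
print(len(document_content))      # total number of chunks produced by the splitter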
At this point we have fully loaded and chunked our SEC filing data. The next step will be uploading and storing this information to a vector database which our LLM will pull from to answer user questions.
Creating and Populating our Vector Database
For this project I am using Milvus as our vector database, however there are lots of others that you could choose from. The use of vector databases is essential in this process because we need a way to efficiently store our chunks of text in a form that is conducive to calculating text similarities. I won't go into too much detail here because the topic of vector databases is lengthy, but there are plenty of good write-ups that explain them well if you are interested in learning more.
First, we need to initialize our MilvusClient, which will allow us to create our collection (think of this as a table in a traditional database). We also need to define our embedding function, which is simply the model we will use to generate our text embeddings. We can do this using the following code:
client = MilvusClient("aapl_10k.db")
embedding_fn = model.DefaultEmbeddingFunction()
This code utilizes Milvus's default embedding function, which is an implementation of the paraphrase-albert-small-v2 embedding model. It should be noted that you can use any embedding model you desire; you would just need to replace embedding_fn with your model of choice.
NOTE: I’ve found the multilingual-e5-large embedding model to be one of the best, free, open-source embedding models out there.
The next step is to encode our documents with the embedding function and insert those embeddings into a vector database. The following code snippet begins by encoding the documents, then creates our collection, and finally inserts the encoded documents into it:
vectors = embedding_fn.encode_documents(document_content)
dims = embedding_fn.dim
client.create_collection(
    collection_name="aapl_10k_collection",
    dimension=dims
)

data = [
    {"id": i, "vector": vectors[i], "text": document_content[i]}
    for i in range(len(vectors))
]
client.insert(collection_name="aapl_10k_collection", data=data)
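If you want to confirm the insert worked before moving on, MilvusClient exposes basic inspection helpers; to the best of my knowledge the stats call below reports the stored row count, which should match len(vectors) (it can lag slightly until data is flushed):
print(client.get_collection_stats(collection_name="aapl_10k_collection"))   # expect row_count == len(vectors)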
Querying our Vector Database
At this point we’ve successfully encoded and uploaded our data to a vector database. The next step will be querying our database with questions that we may have about Apple’s SEC filings. This step is straightforward and can be done using the following code snippet:
questions = ["What was Apple's net income this year?"]
query_vectors = embedding_fn.encode_queries(questions)
res = client.search(
    collection_name="aapl_10k_collection",
    data=query_vectors,
    limit=5,  # change this to the number of results you desire
    output_fields=["id", "text"],
)
In this case, questions is a list we populate with the questions we want answered; I've pre-populated it with a single question about Apple's income for testing purposes. The response of this search will be, for each question, a list of the k chunks of text (along with their ids) that most relate to it. I've defaulted k to 5 via the limit parameter, but change this to whatever you desire.
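Each hit in the result carries the chunk's id, a similarity distance, and the stored text under entity, so you can eyeball what was retrieved before handing it to the LLM:
for hit in res[0]:   # hits for our first (and only) question
    print(hit['id'], round(hit['distance'], 3))
    print(hit['entity']['text'][:150], '...')
    print('---')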
Generating LLM Response
The final step in this process is feeding our chunks of context to our LLM to analyze and answer our questions. For this final section, I will be using Ollama, a tool that lets you run open models such as Llama 3 locally on your computer. If you prefer to use a different LLM, simply modify the code to invoke your LLM of choice.
If you don’t have Ollama…
Getting Ollama running is extremely simple and straightforward.
Step 1: Download Ollama from the official website (ollama.com)
Step 2: Navigate to the downloaded file and open it. When the installation prompt appears, click "Next".
Step 3: Click "Install". You may need to enter your computer password after this.
Step 4: Copy and paste the following snippet into your terminal to confirm a successful installation: ollama run llama3. If it worked, you should be able to begin using Llama 3 directly in your terminal.
NOTE: Make sure you have the Ollama application running before executing any LLM code; if it isn't running, the calls will fail.
If You Already Have Ollama…
The following code snippet takes the response from our vector database and formulates an LLM prompt to answer the original question(s).
llm_responses = []
q_num = 0

for q in res:
    # Gather the retrieved chunks and their source ids for this question
    context = []
    sources = []
    for chunk in q:
        sources.append(chunk['id'])
        context.append(chunk['entity']['text'])

    PROMPT_TEMPLATE = f"""
    You are a financial analyst who has extensive knowledge of financial markets and
    specialize in understanding SEC filings. You are only given the following chunks of context to
    answer any questions:
    {context}"""

    # Send the system prompt (context) and the user's question to the local Llama 3 model;
    # use a separate variable name so we don't overwrite the Milvus client defined earlier
    ollama_client = Client(host='http://localhost:11434')
    response = ollama_client.chat(model='llama3', messages=[
        {
            'role': 'system',
            'content': PROMPT_TEMPLATE,
        },
        {
            'role': 'user',
            'content': questions[q_num],
        }
    ])

    # Build a readable answer string with the sources appended
    sources_str = ""
    for idx in sources:
        sources_str += document_source_names[idx] + ', '

    output_str = "Answer: "
    output_str += response['message']['content']
    output_str += '\nSources: ' + sources_str
    llm_responses.append(output_str)
    q_num += 1

q_count = 1
for answr in llm_responses:
    print(f'------------------------------------------------Question {q_count}:------------------------------------------------\n')
    print(answr)
    print('\n')
    q_count += 1
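Because the generation loop walks res and questions in parallel, asking multiple questions is just a matter of expanding the list before the vector search and re-running everything from the search step onward. The second question below is purely an illustration:
questions = [
    "What was Apple's net income this year?",
    "What risk factors does Apple discuss in this filing?"   # hypothetical additional question
]
query_vectors = embedding_fn.encode_queries(questions)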
LLM Output
Here are the results from our LLM! You can see that it is easily able to answer the questions asked and provide a detailed output, accompanied by the sources of the information used. At the end of the post, I've included a link to my GitHub where you can download all the code, which I've modularized and set up to run with a single function call:

Conclusion
If you made it this far, congrats and thanks for reading! Hopefully you found this post helpful and interesting. The code snippets above can easily be swapped out for your own data, and I encourage everyone to try applying this approach to other use cases. One thing I didn't cover was combining multiple PDF documents, but that can be done with small adjustments to the PDF parsing code, as sketched below.
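For example, a minimal sketch of multi-document loading (assuming your filings live in a hypothetical local folder called filings/) might look like this:
from pathlib import Path
from langchain_community.document_loaders import PyMuPDFLoader

# Load every PDF in the folder and chunk all pages together
all_pages = []
for pdf_path in Path("filings").glob("*.pdf"):
    all_pages.extend(PyMuPDFLoader(str(pdf_path)).load())

split_filing = text_split_documents(all_pages)
Since each chunk's metadata still carries its own file name and page number, the document_source_names logic from earlier works unchanged.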
Disclaimer: Unless otherwise noted, all images are by the author