
Process Pandas DataFrames with a Large Language Model

Seamless Integration of Python, Pandas, and LLM

Pandas, Image by Stone Wang, Unsplash

Nowadays, it is easy to use different large language models (LLMs) via a web interface or a public API. But can we seamlessly integrate an LLM into the data analysis process and use the model directly from Python or a Jupyter notebook? Indeed, we can, and in this article, I will show three different ways to do it. As usual, all components used in this article are available for free.

Let’s get into it!

1. Pandas AI

The first Python library I am going to test is Pandas AI. It allows us to ask questions about a Pandas dataframe in natural language. As a toy example, I created a small dataframe with European countries and their populations:

import pandas as pd

df = pd.DataFrame({
    "Country": ['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland',
                'France', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Liechtenstein', 'Lithuania',
                'Luxembourg', 'Malta', 'Monaco', 'Montenegro', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Serbia',
                'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland'],
    "Population": [8_205000, 10_403000, 7_148785, 4_491000, 1_102677, 10_476000, 5_484000, 1_291170, 5_244000,
                   64_768389, 82_369000, 11_000000, 9_930000, 308910, 4_622917, 58_145000, 2_217969, 35000, 3_565000,
                   497538, 403000, 32965, 666730, 16_645000, 4_907000, 38_500000, 10_676000, 21_959278, 7_344847,
                   5_455000, 2_007000, 46_505963, 9_045000, 7_581000]
})
df.to_csv('data.csv', index=False)

Before using Pandas AI, let’s create the LLM instance:

from pandasai.llm.local_llm import LocalLLM
from pandasai.llm import OpenAI

# Local LLM
pandas_llm = LocalLLM(api_base="http://localhost:8000/v1")
# OR
# OpenAI
pandas_llm = OpenAI(api_token="...")

Here, we have two choices. Readers who have an OpenAI API key can use the OpenAI class. It will provide better and faster results, but this API is obviously not free. Another option is to use a LocalLLM instance, which talks to an OpenAI-compatible server under the hood (a llama-cpp-python server is a good option for this). For my test, I will run a CodeLlama-13B-Instruct model locally on my PC.
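
Before wiring the local model into Pandas AI, it can be useful to check that the OpenAI-compatible endpoint actually responds. Below is a minimal sanity check, assuming the server runs on http://localhost:8000/v1 (the model name is just a placeholder; many local servers ignore this field):

from openai import OpenAI

# Point the standard OpenAI client at the local server; the key is not validated locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="codellama-13b-instruct",  # placeholder; local servers often ignore this field
    messages=[{"role": "user", "content": "Reply with a single word: ready?"}],
    max_tokens=8,
)
print(response.choices[0].message.content)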

Now, we are ready to ask the LLM a question about the dataframe. First, let’s try a simple way:

import logging
from pandasai import SmartDataframe

logging.basicConfig(level=logging.DEBUG,
                    format="[%(levelname)s] [%(asctime)-15s] %(message)s")

sdf = SmartDataframe(df, config={"llm": pandas_llm})
sdf.chat("Find a country with a highest population.")

Here, I enabled logging to see what’s going on "under the hood." The log output shows that the library constructed a fairly complex prompt, which looks like this:

{'messages': 
  [{'role': 'user', 'content': 'Find a country with a highest population.'},
   {'role': 'user', 'content': 'dfs[0]:\n  name: null\n  description: null\n  type: pd.DataFrame\n  rows: 34\n  columns: 2\n
      schema:\n    fields:\n    - name: Country\n      type: object\n
      samples:\n
      - Austria\n      - Serbia\n      - Switzerland\n
      - name: Population\n      type: int64\n
      samples:\n      - 1102677\n      - 10676000\n      - 4907000\n\n\n\n\n
    Update this initial code:\n
    ```python\n# TODO: import the required dependencies\nimport pandas as pd\n\n
    # Write code here\n\n# Declare result var: \ntype (possible values "string", "number", "dataframe", "plot"). 
    Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." }
    or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) }
    or { "type": "plot", "value": "temp_chart.png" }\n\n```\n\n\n\n
    ### QUERY\n Find a country with a highest population.\n\n
    Variable `dfs: list[pd.DataFrame]` is already declared.\n\n
    At the end, declare "result" variable as a dictionary of type and value.\n\n
    If you are asked to plot a chart, use "matplotlib" for charts, save as png.\n\n\n
    Generate python code and return full updated code:'}], 
'model': ''}}

As we can see, the prompt asks the model to generate Python code, and it also gives the model a few sample values from the dataframe.
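
For reference, a "good" model response to this prompt would be a short script that fills in the result dictionary, roughly like this sketch (the dfs variable is injected by Pandas AI before the generated code runs):

import pandas as pd

df = dfs[0]  # `dfs` is declared by Pandas AI at execution time
row = df.loc[df["Population"].idxmax()]
result = {
    "type": "string",
    "value": f"The country with the highest population is {row['Country']}."
}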

I ran the code with my local LLM, and… nothing happened. I got only this: "No code found in the response".

This prompt works well with the "real" OpenAI API, but it is too complicated for a 13B model. Let’s take a more advanced approach and specify more parameters with an Agent class:

from pandasai import Agent

agent = Agent(
    [df],
    config={"llm": pandas_llm},
    description="<s>[INST]Create a Python code and help user to answer "
                "the question.[/INST]"
)

query = """You have a dataframe with fields "Country" and "Population",
           saved in "data.csv". Find the country with a highest population
           in the dataframe using Pandas."""
code = agent.generate_code(query)
agent.execute_code(code)

Here, I created a prompt that better helps the model understand what to do. First, it turned out that the 13B model did not grasp the idea of having the local variable "dfs" (mentioned in the original prompt) but was able to load the dataframe from a file. Second, I specified the "[INST]" section of the prompt, which is required for the CodeLlama-13B-Instruct model. In practice, the generated prompt looks like this:

{'messages': 
  [{'role': 'system', 'content': '<s>[INST]Create a Python code and help user to answer the question.[/INST]'},
   {'role': 'user', 'content': 'You have a dataframe with fields "Country" and "Population",
                                saved in "data.csv". Find the country with a highest population in the dataframe using Pandas.'},
   {'role': 'user', 'content': 'dfs[0]:\n  name ...
  ...
'model': ''}}

As we can see, the first "system" message with the "[INST]" text was successfully added.

After running the code, I got the result:

Output in Python Notebook, Image by author

To produce this output, CodeLlama generated the following code "under the hood" (because of the LLM’s randomness, your result will probably differ):

import pandas as pd
import matplotlib.pyplot as plt

# load the data
df = pd.read_csv("data.csv")

# create a bar plot of the top 10 most populous countries
df_top10 = df.sort_values("Population", ascending=False).head(10)
ax = df_top10.plot(kind="barh", x="Country", y="Population", figsize=(10, 6))
ax.set_title("Top 10 Most Populous Countries")
ax.set_xlabel("Country")
ax.set_ylabel("Population")
plt.show()

Interestingly, I only asked the model to "find the country with the highest population" and did not ask it to draw a graph. Apparently, "matplotlib" was mentioned in the prompt, and the model used it as a "hint" of what to do. This prompt (see the example above) is probably still too complex for a 13B model. But technically, the model answered my question correctly, and we can see that this approach works.
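
For comparison, the deterministic Pandas answer to the same question is a one-liner:

import pandas as pd

df = pd.read_csv("data.csv")
# idxmax() returns the index of the largest value in the column
print(df.loc[df["Population"].idxmax(), "Country"])  # Germany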

2. LangChain

The popular LangChain library also has a special Pandas Dataframe agent that can do similar tasks.

First, we need to create a language model instance. As before, I will be using a local CodeLlama 13B model for the requests:

from langchain_openai import OpenAI as LangChainOpenAI

llm = LangChainOpenAI(openai_api_key="12345678",
                      openai_api_base="http://localhost:8000/v1",
                      verbose=True)

Now we are ready to create the agent:

from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_core.callbacks import StdOutCallbackHandler, BaseCallbackManager

prefix = """<s>[INST]You are a Python expert. Create a Python code and help user to answer the question.[/INST].

You have the following tool:"""

handlers = [StdOutCallbackHandler()]
callback_manager = BaseCallbackManager(handlers)

agent = create_pandas_dataframe_agent(llm,
                                      df,
                                      verbose=True,
                                      agent_executor_kwargs={"handle_parsing_errors": True},
                                      agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                                      callback_manager=callback_manager,
                                      prefix=prefix)

Here, I also added the "INST" section, which is required for CodeLlama.

Everything is ready; it’s time to ask the model a question about our dataframe:

agent.invoke("Write the Python code to calculate total population of all countries.")

We can enable logging and see that LangChain generated this prompt:

<s>[INST]You are a Python expert.
Create a Python code and help user to answer the question.[/INST].
You have the following tool:
python_repl_ast: A Python shell. 
Use this to execute python commands. Input should be a valid python 
command. When using this tool, sometimes output is abbreviated - make sure 
it does not look abbreviated before using it in your answer.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [python_repl_ast]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

This is the result of `print(df.head())`:
|    | Country   |   Population |
|---:|:----------|-------------:|
|  0 | Austria   |      8205000 |
|  1 | Belgium   |     10403000 |
|  2 | Bulgaria  |      7148785 |
|  3 | Croatia   |      4491000 |
|  4 | Cyprus    |      1102677 |

Begin!
Question: Write the Python code to calculate total population of all
countries.

A "create_pandas_dataframe_agent" method uses PythonAstREPLTool, and as we can see from the prompt, we ask the model to generate a Python code and put it into the "Action Input" section. The prompt itself is pretty complex, but surprisingly, a 13B model was able to understand it and generate this answer:

#> Action: python_repl_ast
#> Action Input: df['Population'].sum()
463034138
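
Under the hood, the agent passes the "Action Input" line to a PythonAstREPLTool, which evaluates it against the dataframe. A rough sketch of that step (assuming df is the dataframe created earlier):

from langchain_experimental.tools.python.tool import PythonAstREPLTool

# The tool evaluates Python expressions with `df` available in its local namespace
python_tool = PythonAstREPLTool(locals={"df": df})
print(python_tool.invoke("df['Population'].sum()"))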

Still, this prompt is complex for a 13B model, and when I ran the code several times, sometimes the answer was incorrect:

#> Thought: I will use Pandas to calculate the total population of all countries.
#> Action: I will use `df['Population'].sum()` to calculate the total population of all countries.
#> Action Input: df['Population'].sum()
Unknown action

Funny enough, I also tried a CodeLlama 34B model, and it did not work for a simple reason: the prompt says "Action: the action to take, should be one of [python_repl_ast]." The model took this example literally and produced this answer:

Thought: I need to sum up all population values in the dataframe.
Action: [python_repl_ast]
Action Input:  df['Population'].sum()

In my opinion, the answer is good, and the model did what it was asked, but the LangChain library was not able to parse the "[python_repl_ast]" string; it expected only "python_repl_ast", without the brackets. Anyway, the LangChain prompt is likely optimized for the OpenAI API, and results with other models are just not guaranteed.

3. Text Processing

In the first two examples, we tested the possibility of using agents and generating Python code with an LLM. But we can also use an LLM "natively" for the tasks LLMs are genuinely good at: natural language processing (NLP).

As a second toy example, let’s create a dataframe with a list of items from a "lost and found" department:

df_goods = pd.DataFrame({
    "Item": ["Toshiba laptop", "iPhone 12", "iPhone 14", "Old bicycle",
             "Public Transport card", "Pair of gloves", "Kids Helmet",
             "Samsung Smartphone", "iPhone 14", "Cap", "Shawl"],
})
display(df_goods)

In Jupyter, it looks like this:

Sample dataframe, Image by author

Let’s say we want to use a Pandas "map-style" function to perform an NLP task: group all items into four categories ("Electronics," "Clothes," "Documents," and "Other") and put the result in a new column. An "old-school" approach would be to code some rules or regular expressions, but we can just ask the LLM to do the job!
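
Just to illustrate what we are replacing, an "old-school" rule-based categorizer could look like this (a deliberately naive sketch; every new item type would require new rules):

import re

def categorize_rule_based(item: str) -> str:
    """ Keyword-based categorization: brittle and needs constant maintenance """
    if re.search(r"laptop|iphone|smartphone", item, re.IGNORECASE):
        return "Electronics"
    if re.search(r"gloves|helmet|cap|shawl", item, re.IGNORECASE):
        return "Clothes"
    if re.search(r"card", item, re.IGNORECASE):
        return "Documents"
    return "Other"

# Usage would look like: df_goods["Category"] = df_goods["Item"].map(categorize_rule_based)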

First, let’s load a Llama-2 13B chat model:

from llama_cpp import Llama

llm = Llama(
      model_path="llama-2-13b-chat.Q4_K_M.gguf",
      n_gpu_layers=-1,
      n_ctx=2048,
      verbose=True
)

This is a request for the model:

question = "You are a sales expert. You have a list of categories:
            Electronics, Clothes, Documents, and Other.
            Write a category for each item."

Now, let’s create the prompt (here, items_csv is the comma-separated list of items; we will build it in a helper function in a moment):

prompt = f"""<s>[INST]{question}
I will give you a list of items in comma-separated format.
Write the output in JSON format {{NAME}}: {{RESULT}}.
Here is the list: {items_csv}.
Now write the answer.
[/INST]"""

The prompt is made as universal as possible, so we can reuse it for different questions. It also turned out that even a free 13B model has a good understanding of the JSON format. In my case, it produced this output:

Sure! Here is the list of items with their corresponding categories:

{
"Toshiba laptop": "Electronics",
"iPhone 12": "Electronics",
"iPhone 14": "Electronics",
"Old bicycle": "Other",
"Public Transport card": "Documents",
"Pair of gloves": "Clothes",
"Kids Helmet": "Clothes",
"Samsung Smartphone": "Electronics",
"Cap": "Clothes",
"Shawl": "Clothes"
}

Let me know if you need anything else!

The output looks good enough, and we are ready to make a universal solution using it. First, let’s create a helper method to extract JSON from text:

import json
import re

def extract_json(output_str: str) -> dict:
    """ Extract JSON from a given string """
    try:
        json_str = re.search(r"{(?:[^{}])*}", output_str).group(0)
        data = json.loads(json_str)
        return data
    except Exception as exp:
        print(f"Cannot extract data from {output_str}: {exp}")
        return None

Now, let’s create a method to generate the prompt:

from typing import List

def make_prompt(question: str, items: List) -> str:
    """ Generate a prompt for the LLM """
    items_csv = ",".join(items)
    return f"""<s>[INST]{question}
               I will give you a list of items in comma-separated format.
               Write the output in the JSON format {{NAME}}: {{RESULT}}.
               Here is the list: {items_csv}.
               Now write the answer.
               [/INST]"""

Finally, we are ready to process the Pandas column using LLM:

from typing import Any

def llm_map(model: Any, df_column: pd.Series, question: str, batch_size: int=64) -> pd.Series:
    """ Create a new Pandas.Series object based on existing column and question """
    items = df_column.unique()
    chunks = [items[i * batch_size:(i + 1) * batch_size] for i in range((len(items) + batch_size - 1) // batch_size)] 
    results = {}
    for chunk in chunks:
        prompt = make_prompt(question, chunk)
        res = model(prompt, max_tokens=1024, stream=False)
        output_text = res["choices"][0]["text"]
        data = extract_json(output_text)
        if data is not None:
            results.update(data)

    return df_column.map(lambda item_name: results.get(item_name, None))

Here, I first created a list of unique items in the column; there is no need to ask the model the same question several times. I also added a "batch_size" parameter: if the dataframe is too large, we can send the data in chunks. It also turned out that the LLM sometimes returns malformed JSON. In that case, the "extract_json" function returns None, and it makes sense to repeat the question 1–2 times (for clarity, this retry is not implemented in "llm_map" itself; a possible wrapper is sketched below).
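
A possible retry wrapper could look like this (a sketch; it simply re-asks the model until the JSON parses or the attempts run out):

def ask_with_retries(model: Any, prompt: str, retries: int = 2) -> dict:
    """ Call the model and re-ask if the returned JSON cannot be parsed """
    for _ in range(retries + 1):
        res = model(prompt, max_tokens=1024, stream=False)
        data = extract_json(res["choices"][0]["text"])
        if data is not None:
            return data
    return {}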

Now, we are ready to apply a "map" to the Pandas dataframe in a single line of code:

question = "You are a sales expert. You have a list of categories:
            Electronics, Clothes, Documents, and Other.
            Write a category for each item."

df_goods["Category"] = llm_map(llm, df_goods["Item"], question)
display(df_goods)

The output looks good and accurate:

Processing result, Image by author

Our solution is universal, and we can use "llm_map" with different questions:

question = "You are a sales expert. Write an estimated USD price for each item."

df_goods["Price"] = llm_map(llm, df_goods["Item"], question)
display(df_goods)

The output looks like this:

Processing result, Image by author

Naturally, processing with a 13B model is relatively slow. Readers can replace the local model call in the "llm_map" method with an OpenAI API call; it will be faster but not free.
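
One way to do this without touching "llm_map" is a small adapter that mimics the llama-cpp call signature (the model name and client setup below are assumptions, not part of the original code):

from openai import OpenAI

openai_client = OpenAI(api_key="...")

def openai_model(prompt: str, max_tokens: int = 1024, stream: bool = False) -> dict:
    """ Adapter with the same call signature and return shape as llama_cpp.Llama """
    # the stream flag is accepted only to match the signature; it is ignored here
    response = openai_client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=max_tokens,
    )
    return {"choices": [{"text": response.choices[0].text}]}

# df_goods["Price"] = llm_map(openai_model, df_goods["Item"], question)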

The "batch_size" is also an interesting parameter. In theory, free LLMs like Google Gemma have a pretty large context window (8192 tokens for Gemma and 2048 for LLaMA). It allows us to process a large number of items in a single request, but I got a feeling that (at least for the 7B and 13B models) results for long prompts are less accurate.

Conclusion

In this article, we were able to process Pandas dataframes with large language models in two conceptually different ways.

The first way utilizes the concept of tools and agents: we ask a model to write Python code, and this code is then executed by a local agent. This approach is beneficial for data analysis. First, as we know, LLMs are not good at math; for example, we can ask a model to calculate a mean, but the result will likely be incorrect. Second, a large dataset will simply not fit into the LLM’s prompt size limit. Asking the model to write the Python code eliminates both disadvantages. It also allows us to reuse the generated code later (calling a model every time to do the same processing would be slow and expensive).

The second way utilizes the "natural" ability of the LLM to process the text. This can be beneficial when we need to process a lot of unstructured text data.

As a final thought, I like the idea of "seamless" integration of software libraries with LLMs: I can try different questions and see the code and the processing results directly in the IDE. Alas, a local model is often slow (it takes about 3 minutes to process a prompt with a 34B CodeLlama on my PC), and the results are not always perfect. But in the future, AI assistants like this will definitely become a helpful tool for developers and data analysts.

Thanks for reading. If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.

Those who are interested in using language models and natural language processing are also welcome to read other articles:
