Analyzing Humanitarian Data Unstructured Excel Tables with ChatGPT Code Interpreter

Some Initial Exploration with Code Interpreter

Matthew Harris
Towards Data Science

--

Created by DALL-E2 with prompt “child’s crayon drawing of a happy robot processing data, with graphs in the background”

TL;DR

The new experimental feature ‘Code Interpreter’ provides native support for generating and running Python code as part of using ChatGPT. It shows great potential for performing data engineering and analysis tasks, providing a conversational interface that non-technical users could potentially use. This article presents some tests of ChatGPT (GPT-4) Code Interpreter on an unstructured Excel table from my previous blog post, to see if it is able to automatically convert this table to a more standard form that could be loaded into a database. With limited prompting, it was able to identify the hierarchical heading structure but was unable to generate code that would parse the table accurately. On adjusting the prompt to suggest using the openpyxl Python library to extract information about Excel merged cells, it was able to parse the table in one attempt. However, on repeating the task with the exact same prompt, it failed. With no control yet over the temperature parameter to make results more deterministic, Code Interpreter does not appear able to tackle this particular task consistently. It’s early days though and only a beta feature, the pattern for automated data processing using Large Language Models is likely here to stay and will no doubt improve over time.

This week ChatGPT released a new feature called Code Interpreter, which allows ChatGPT to generate and call Python code, as well as upload data files to perform tasks such as data analysis. As I’ve explored in previous blog posts, Large Language Models have the potential for simplifying data engineering and analysis tasks. The LangChain project has some great patterns, and there is already a lot of commercial activity in this area, so it’s interesting to see OpenAI starting to offer native support.

There are many articles already available exploring OpenAI Code Interpreter, but I wondered how well it might perform using some of the tabular data I’ve previously explored as found on the amazing Humanitarian Data Exchange (HDX). Being able to offer natural language interfaces for platforms such as HDX opens the way for less technical users to explore and understand this data, which has implications for anticipating and accelerating response times for humanitarian disaster events.

Getting Access to Open AI’s Code Interpreter

Code Interpreter is an ‘Alpha’ feature currently, meaning it’s in an early testing phase and not part of standard ChatGPT. To access it you will need to:

  1. Be a ChatGPT+ subscriber, costing $20 per month
  2. Go to https://chat.openai.com/
  3. Select the “…” next to your name bottom-left, and select “Settings”
  4. Click on “Beta Features” and activate “Code Interpreter”
  5. Back in the chat window hover over either GPT-3.5 or GPT-4 and select “Code Interpreter”

It’s worth noting that originally you had to be on OpenAI’s plugin waitlist, but I’m not sure if that’s still the case. The features appeared for me even though I’ve not had confirmation of being given access via the list. If the above doesn’t work, you might need to be added.

Analyzing Unstructured Tables in Excel Files

As mentioned in a previous blog post, tables in Excel files can come in all sorts of wonderful forms with merged cells, blank rows, and other things which can make automated processing a bit of a challenge. For this article, I decided to try using GPT-4 with Code Interpreter to analyze a typical example of an Excel table as found on the Humanitarian Data Exchange (HDX) …

Example of an irregular table in Excel, with blank top rows, labels and merged cells. Perfectly readable for humans, but a challenge to parse for data science. This file was sourced from the Humanitarian Data Exchange

Though Code Interpreter has access to open datasets, it doesn’t yet include HDX and doesn’t have the ability to access the internet. I expect this will change quite soon, but in the meantime, I downloaded the file, started a session on https://chat.openai.com/ as described above, then uploaded this file by clicking the ‘+’ in the chat entry box …

After uploading, ChatGPT gets right into things by loading the data into Python Pandas and doing some initial analysis …

Right away it has spotted that the table in the sheet is somewhat irregular, and immediately tries to address this …

That’s actually spot on. In fact, it goes on to ask …

I responded …

It’s also possible to see how exactly it came to this conclusion by selecting the ‘Show Work’ dropdown under each response which provides the generated Python code and result …

Its approach seems reasonable, but it’s worth noting that it doesn’t use native Excel information about merged cells, instead directly loading into Pandas.

ChatGPT gets to work and automatically starts trying to process the data …

But it fails at this right away. On selecting ‘Show Work’ we see that it has generated some Python for parsing our table …

# Drop the rows that are not needed
data = data.drop([0, 1])

# Reset the index
data = data.reset_index(drop=True)

# Define the land size categories
land_sizes = data.iloc[:, 1].values

# Initialize empty lists to hold the tidy data
sub_counties = []
genders = []
land_sizes_tidy = []
numbers = []
percentages = []

# Iterate over the sub-counties and genders
for i in range(4, data.shape[1], 2):
sub_county = data.columns[i].split(' ')[0]
gender = data.columns[i].split(' ')[1]

# Iterate over the land size categories
for j in range(len(land_sizes)):
land_size = land_sizes[j]
number = data.iloc[j, i]
percentage = data.iloc[j, i + 1]

# Append the values to the lists
sub_counties.append(sub_county)
genders.append(gender)
land_sizes_tidy.append(land_size)
numbers.append(number)
percentages.append(percentage)

# Create a tidy dataframe
tidy_data = pd.DataFrame({
'Sub County': sub_counties,
'Gender': genders,
'Land Size': land_sizes_tidy,
'Number': numbers,
'Percentage': percentages
})

tidy_data

But this leads to a bounds error …

ChatGPT doesn’t give up and automatically starts debugging …

Which is pretty impressive in that it seems to have identified the issue correctly. However, it’s perhaps less than impressive given it had already identified the exact column hierarchy at the very start of the conversation and seems to have ‘Lost’ this information.

Again, it gets back into things and automatically proceeds …

Which gives a table with headers like this …

Where we see data included in the column heading, suggesting it hasn’t identified where the column end and data starts. In fact, it even spots this and bravely continues …

At this point it gets itself into a bit of a confused state, trying out lots of things in cycles not displayed here.

Eventually, I think the token limit was breached and generation stopped, with the table looking like this …

Spot-checking the above values in ‘Show Work’ output compared to the original table, we see that for the last ‘Total’ row the values look correct, but there are two ‘Bomet Central Femail N Bomet’ column headings. It spots this …

Since it seemed so close, I asked ChatGPT to proceed …

I had left it a little while before asking it to continue which I suspect resulted in the code environment job being terminated. It seemed happy to start this back up again, but in doing so had lost some variables …

I did what was prompted and reuploaded the file, and it picked things up again. Eventually, this is the table it produced …

Which is great …. for just the Total rows from the original table. ChatGPT has lost all other rows where the data was split by acreage, so in fact, the parsing has failed.

I pointed out that it was in fact missing a column related to land size, which it had in fact identified immediately after the first initial upload …

At this point ChatGPT started off on another quest, iterating through multiple attempts at parsing the sheet, none of which we ultimately successful. A full link to the chat can be found here.

I started a brand new chat session and tried again, which resulted in different results, which has implications for the reproducibility of this technique. But no matter how much I tried, the results were never correct.

Giving ChatGPT Some Coding Suggestions

In my previous blog post, I was able to achieve the most success in parsing tables like the example provided by using information extracted from Excel to indicate which cells had been merged. This is key to understanding the table heading hierarchy and what we humans use when looking at Excel tables. If using only Pandas Excel parsing — which is what ChatGPT did in our test — rather than interrogating the Excel structure cell-by-cell this information is lost, making it a lot harder to successfully parse tables with merged heading cells.

So what about if we give ChatGPT a tip to use ‘openpyxl’ directly and to take into account merged cells?

Just like last time, it gets things right from the start …

And again proceeds with its analysis …

However, this time, having been provided some guidance regarding technique …. It flattened the table almost perfectly with just the single prompt above and no further intervention.

Results of ChatGPT ‘Code Interpreter’ processing an unstructured Excel table into a form that might be loaded into a SQL Database. This was created with a prompt that suggested using the ‘openpyxl’ Python library to identify merged cells in column headings

The only issue is that last row, but with further prompting we can remove this easily. To see the full interaction and generated code for this see here.

I asked ChatGPT to generate the full code to see what it finally came up with …

from openpyxl import load_workbook
import pandas as pd

# Load the workbook
wb = load_workbook(filename='/mnt/data/Number of Acreage under Irrigation.xlsx')
ws = wb.active
data = ws.values
data = list(data)

# Convert the tuples to lists
data = [list(row) for row in data]

# Function to handle None values in the first two columns
def handle_none(data):
last_valid = [None, None]
for row in data:
for i in range(2):
if row[i] is not None:
last_valid[i] = row[i]
else:
row[i] = last_valid[i]
return data

data_filled = handle_none(data)

# Construct the headers
headers = [None] * len(data_filled[0])
for row in data_filled[2:6]:
for i, value in enumerate(row):
if value is not None:
if headers[i] is None:
headers[i] = value
else:
headers[i] += '_' + value

# Adjust the first two headers
headers[0] = 'Question'
headers[1] = 'Category'

# Adjust '%' headers
for i in range(len(headers)):
if headers[i] == '%':
headers[i] = headers[i-1].rsplit('_', 1)[0] + '_%'

# Adjust 'Female_N' and 'Female_%' headers
headers[6] = 'Sub county_Chepalungu_Female_N'
headers[7] = 'Sub county_Chepalungu_Female_%'
headers[10] = 'Bomet Central_Female_N'
headers[11] = 'Bomet Central_Female_%'

# Create a DataFrame
df = pd.DataFrame(data_filled[6:], columns=headers)

# Save the DataFrame as a CSV file
df.to_csv('/mnt/data/Number_of_Acreage_under_Irrigation_SQL.csv', index=False)

Which seems reasonable. It’s not generic and has lines specific to the file being processed. I suspect we would need more prompting to (maybe) make ChaGPT generate generic code, but for the task in this study it was able to parse the unstructured table nicely.

Great result!

Given the fact in the first test that ChatGPT gave different results with the same prompt, I decided to repeat the exact same succsessful prompt to see how things behaved in the successful test. Sadly, it came up with an entirely different, and incorrect answer using the exact same prompt.

NOT a great result!

In the API the model can be made more deterministic and produce repeatable results by decreasing the temperature parameter, but as Code Interpreter isn’t available in the API just yet, I was not able to experiment with this.

Conclusions

After initially failing, we were able to prompt ChatGPT to correctly parse an unstructured table by providing some coding tips about how one might do this in Python, which is a pretty amazing result actually. However, results were not reproducible with the exact same prompt failing on a second attempt. This is likely because we don’t yet have control over the model temperature parameter in this beta feature.

Another interesting limitation was noted, for example when token limits are breached and completions stop before the task is complete, requiring another prompt to carry on. Also, the process is rather slow as ChatGPT goes through iterations trying out different chunks of code. It’s not yet a technique which could be applied to tasks requiring quick responses.

Basically, Code Interpreter looks really impressive and shows great promise, but doesn’t appear ready just yet for the task attempted above.

So for now at least, albeit a very short time … I have one up on ChatGPT. 😊

--

--

Matt is the Head of Data Science at DataKind, helping social sector organizations harness the power of data science and AI in the service of humanity.