Parsing Irregular Spreadsheet Tables in Humanitarian Datasets (with Some Help from GPT-3)

Processing Irregular Excel Tables Without Using Hard-coded Rules

Matthew Harris
Towards Data Science


Created by DALL-E2 with prompt “A painting of 10 wood tables”. There are 9 tables in the image above.

TL;DR

As part of a previous study using data from the Humanitarian Data Exchange, I had to analyze thousands of Excel files where tables within those files were often difficult to parse into database tables. Irregular layouts with merged cells, hierarchical columns, and annotations are difficult to anticipate with rule-based parsing when files originate from hundreds of organizations across the world. In this article, I explore using GPT-3 zero-shot, single-shot, and single-shot-with-reasoning completions to reformat irregular (small) tables, as well as fine-tuning the model to predict table attributes which can then be used for accurate parsing.

There have been quite a few times on my travels when I’ve needed to review a large number of Excel files to understand what data they contain, how well it is structured, and the work required to clean it into a form where we can get to the juicy stuff like training models. For the most part this is fairly straightforward, as long as the data is regular with nice neat column headings. However, life is never that easy, and it’s often the case that the tables in these files are in a less-than-perfect format for parsing into neat data frames that can be uploaded into relational databases. Excel supports a lot of features such as pivot tables and cell merging, which human beings use to create a wide variety of layouts, with blank rows, random text here and there, and more!

Here is an example to illustrate …

Example of an irregular table in Excel, with blank top rows, labels and merged cells. Perfectly readable for humans, but a challenge to parse for data science. This file was sourced from the Humanitarian Data Exchange

If we read the above file directly into Pandas …

import pandas as pd
df = pd.read_excel(filename)
df = df.fillna("")
display(df)

We get this …

Example of Pandas dataframe after parsing a table on an Excel sheet, where there are blank rows and merged cells to indicate hierarchical columns. Example data from the Humanitarian Data Exchange

Loading this into a database would result in near-unusable data because …

  1. There is a table title in the top-left cell
  2. Column ‘Unnamed: 1’ has no heading; its real title is the question sitting in the first column at row 5, “What is the average size of land you own that …”
  3. Columns ‘Unnamed: 2’ and ‘Unnamed: 3’ are aggregate totals split into ’N’ numeric and ‘%’ percentage values
  4. Most columns are hierarchical, with merged cells above unmerged cells

It’s not that bad, right?

It is of course possible to provide parameters to Pandas read_excel that will convert hierarchical columns to indexes, which can then be collapsed into a single row. Alternatively, we might manipulate the sheet in Openpyxl using information from Excel itself about merged cells. However, these methods require knowledge of the table — specifically where the headings finish and the data starts, and how hierarchical columns are structured — a luxury we might not always have if processing thousands of spreadsheets. Maintaining rule-based parsing for large volumes of files can be time-consuming and brittle, requiring continued maintenance as new layouts appear on the scene.
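For example, if we already knew that the column headings of our sample file occupied Excel rows 2 to 5, a minimal sketch along these lines would do the job (the header row indices and the ‘ - ’ join are assumptions specific to this one layout, which is exactly the knowledge we don’t have at scale) …

import pandas as pd

# Assumes we already know that rows 2-5 of the sheet (0-indexed 1-4) are headings
df = pd.read_excel(filename, header=[1, 2, 3, 4])

# Collapse the resulting MultiIndex into a single header row, skipping the
# 'Unnamed: ...' placeholders pandas creates for blank and merged cells.
# The result still needs cleanup, but illustrates the idea.
df.columns = [
    " - ".join(str(level) for level in col if not str(level).startswith("Unnamed"))
    for col in df.columns
]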

As it happens, I am not alone! Parsing irregular tables is a challenge being actively researched. For example, Microsoft authors have shown some great results using Convolutional Neural Networks to develop an algorithm called ‘TableSense’ [1]. This technique treats Excel sheets in a similar way to images but with richer featurization as each cell can have a range of attributes and data types, as well as formatting and merging characteristics. Very cool. I hope fantastic work like this will be included in Microsoft’s products soon, but until then I wanted to explore some other approaches.

It’s also worth noting that my use-case is not just to identify the range in a sheet where the table is (see the training data for the Microsoft paper above), but also to identify elements within the table so that irregular formats can be converted to something that can be easily imported into a database. The main challenge is hierarchical columns in Excel: flattening these into a single row that captures information from the overlying merged cells. Sounds simple to fix, but the challenge is: where do the headings stop and the data start? This is obvious to us humans, but it’s surprising how something so simple can be quite noisy in the real world when processing sheets with code.

Given all the recent attention for generative AI and Large Language Models (LLMs), I wondered if perhaps OpenAI’s GPT-3 might be up to the challenge. These models are trained on huge amounts of data extracted from the internet, which includes tables and CSV files, so they might be useful in handling some of the nuances of tables put together by us crazy humans.

Prompting GPT-3 to Tidy Up (a Small) Table

We will first try to solve our problem as zero- and few-shot tasks for GPT-3, before moving on to using fine-tuning techniques.

Zero-shot, one-shot and few-shot tasks, contrasted with traditional fine-tuning. The panels above show four methods for performing a task with a language model. From Brown et al [2]

GPT-3 is trained on text scraped from the web, so we cannot prompt it with Excel (yet!). We first have to convert our sheet into a form that occurs on the web: a CSV string …

df = pd.read_excel('sample.xlsx', sheet_name='Sheet1')
df = df.fillna("")
print(df.to_csv())
,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,Table 3: Number of acreage under irrigation,,,,,,,,,,,
1,,,OVERALL,,Sub county,,,,,,,
2,,,,,Chepalungu,,,,Bomet Central,,,
3,,,,,Male,,Female,,Male,,Female,
4,,,N,%,N,%,N,%,N,%,N,%
5,What is the average size of land you own that is currently under irrigation?,0 - 2 acres,22,2.8%,4,2.2%,10,3.8%,3,1.7%,5,2.9%
6,,2 - 5 acres,6,.8%,2,1.1%,2,.8%,0,0.0%,2,1.2%
7,,5 - 10 acres,1,.1%,0,0.0%,0,0.0%,0,0.0%,1,.6%
8,,More than 10 acres,0,0.0%,0,0.0%,0,0.0%,0,0.0%,0,0.0%
9,,None,760,96.3%,176,96.7%,251,95.4%,170,98.3%,163,95.3%
10,,Total,789,100.0%,182,100.0%,263,100.0%,173,100.0%,171,100.0%

Side note: I also tried with Markdown and HTML tables, but got best results for my use-case with CSV.

It’s worth noting that for this analysis the tables we are dealing with are thin, i.e. having fewer than 100 columns. This means the first 10 rows can be represented easily in a GPT-3 prompt. This is fine for most of the Excel tables I have been analyzing for the Humanitarian Data Exchange, but might not extend to other scenarios. Also, this analysis doesn’t consider cases where there are multiple tables on the same Excel sheet … that is for a later blog post. 🙂

Zero-Shot Prompt

Let’s now see if GPT-3 can reformat our untidy table with just a single prompt, a zero-shot task [2] where we are providing no examples, just a CSV file of the table we want to be reformatted …

import openai as ai
from IPython.display import Markdown

# OpenAI API key should be put into this file
ai.api_key_path = "./api_key.txt"

csv_as_str = df.to_csv()

prompt = (
"Reformat this table to be a simpler markdown table with "
+ "no hierarchical columns, no pivoting, values and percentages in different columns, "
+ "and no blank cells\n\n"
+ csv_as_str
)

completions = ai.Completion.create(
engine="text-davinci-003",
temperature=0.0,
prompt=prompt,
max_tokens=999,
n=1,
stop=None,
)

Markdown(completions.choices[0].text)

It discarded unnecessary rows and converted the data to a nice regular table with column headings, but look closely and you’ll see it has lost some key information, such as the breakdown by Male/Female. Classic hallucination territory: it looks very plausible but is wrong.

Let’s play with the temperature parameter. Lower values make the model more deterministic (giving the same results every time for the same prompt) whereas higher values are more random. With a higher temperature value, we get …

prompt = (
"Reformat this table to be a simpler markdown table with "
+ "no hierarchical columns, no pivoting, values and percentages in different columns, "
+ "and no blank cells\n\n"
+ csv_as_str
)

completions = ai.Completion.create(
engine="text-davinci-003",
temperature=1.0,
prompt=prompt,
max_tokens=999,
n=1,
stop=None,
)

Markdown(completions.choices[0].text)

Looks nice! Almost all of the correct column headings from merged cells in our CSV, which is pretty amazing actually. However, spot-checking a few cells shows that though many are correct, some are not. Also, ‘Overall’ has been split into Male and Female in the above which is incorrect.

Another issue here is that calling GPT-3 with exactly the same prompt will produce different results because of the high temperature value …

completions = ai.Completion.create(
engine="text-davinci-003",
temperature=1.0,
prompt=prompt,
max_tokens=999,
n=1,
stop=None,
)

Markdown(completions.choices[0].text)

Not unreasonable, albeit with incorrect values, but an entirely different layout. Reproducibility is very important for our task: we should be able to process the tabular data in exactly the same way with each processing run.

So high temperatures are not a good option for this use-case it seems.

What about if we provide more context in the table? CSV isn’t very expressive, for example, merged columns in hierarchical headers tell humans that the columns are grouped, but a CSV file doesn’t capture this …

1,,,OVERALL,,Sub county,,,,,,,
2,,,,,Chepalungu,,,,Bomet Central,,,
3,,,,,Male,,Female,,Male,,Female,
4,,,N,%,N,%,N,%,N,%,N,%

In the above example, GPT-3 must infer that blank columns to the right of merged row titles correspond with those titles, and many times it actually does this. However, we can help a little since we know whether a cell is merged in our Excel file.

To represent this in CSV we can unmerge merged cells and populate with their merged value …

import openpyxl
import pandas as pd
from openpyxl.utils import range_boundaries


def pad_merged_cells(sheet):
    """
    Unmerge merged cells and fill with merged value.

    Input Parameters
    ----------------
    sheet: Obj
        Openpyxl sheet object

    Output Parameters
    -----------------
    df: Dataframe
        Pandas dataframe of the table
    sheet: Obj
        Openpyxl sheet object with merged cells unmerged
    hasmerged: bool
        True if the sheet contained merged cells
    """

    dd = pd.DataFrame(sheet.values)

    # Scan for maxn rows
    maxn = 10

    hasmerged = False
    if len(sheet.merged_cells.ranges) > 0:
        hasmerged = True

    if hasmerged:
        merge_list = []
        for merge in sheet.merged_cells.ranges:
            merge_list.append(merge)

        for cell_group in merge_list:
            min_col, min_row, max_col, max_row = range_boundaries(str(cell_group))
            top_left_cell_value = sheet.cell(row=min_row, column=min_col).value
            sheet.unmerge_cells(str(cell_group))
            # Fill every cell in the previously merged range with the merged value
            for row in sheet.iter_rows(
                min_col=min_col, min_row=min_row, max_col=max_col, max_row=max_row
            ):
                for cell in row:
                    cell.value = top_left_cell_value

    # Extract data and save to dataframe
    data = []
    for row in sheet.iter_rows(min_row=1):
        row_data = []
        for cell in row:
            if cell.value is None:
                row_data.append(None)
            else:
                row_data.append(cell.value)
        if any(row_data):
            data.append(row_data)

    df = pd.DataFrame(data)

    # Remove duplicate columns
    df = df.T.drop_duplicates().T

    # Remove duplicate rows
    df = df.drop_duplicates()

    # Fill NaN with blank string for easier viewing
    df = df.fillna("")

    return df, sheet, hasmerged


wb = openpyxl.load_workbook(filename)
sheet = wb['Sheet1']
merged_table, sheet, hasmerged = pad_merged_cells(sheet)

display(merged_table)
Table where merged cells are unmerged and populated with merged value, to provide context in CSV file format
,Table 3: Number of acreage under irrigation,,,,,,,,,,,
1,,,OVERALL,OVERALL,Sub county,Sub county,Sub county,Sub county,Sub county,Sub county,Sub county,Sub county
2,,,OVERALL,OVERALL,Chepalungu,Chepalungu,Chepalungu,Chepalungu,Bomet Central,Bomet Central,Bomet Central,Bomet Central
3,,,OVERALL,OVERALL,Male,Male,Female,Female,Male,Male,Female,Female
4,,,N,%,N,%,N,%,N,%,N,%
5,What is the average size of land you own that is currently under irrigation?,0 - 2 acres,22,2.8%,4,2.2%,10,3.8%,3,1.7%,5,2.9%
6,What is the average size of land you own that is currently under irrigation?,2 - 5 acres,6,.8%,2,1.1%,2,.8%,0,0.0%,2,1.2%
7,What is the average size of land you own that is currently under irrigation?,5 - 10 acres,1,.1%,0,0.0%,0,0.0%,0,0.0%,1,.6%
8,What is the average size of land you own that is currently under irrigation?,More than 10 acres,0,0.0%,0,0.0%,0,0.0%,0,0.0%,0,0.0%
9,What is the average size of land you own that is currently under irrigation?,None,760,96.3%,176,96.7%,251,95.4%,170,98.3%,163,95.3%
10,What is the average size of land you own that is currently under irrigation?,Total,789,100.0%,182,100.0%,263,100.0%,173,100.0%,171,100.0%

The CSV file now captures overlying merged column headings. Let’s see if this improves things, first with temperature=0.0 …

csv_as_str_merged = merged_table.to_csv()

prompt = (
"Reformat this table to be a simpler markdown table with "
+ "no hierarchical columns, no pivoting, values and percentages in different columns, "
+ "and no blank cells\n\n"
+ csv_as_str_merged
)

completions = ai.Completion.create(
engine="text-davinci-003",
temperature=0.0,
prompt=prompt,
max_tokens=999,
n=1,
stop=None,
)

Markdown(completions.choices[0].text)

And the same, but with temperature=1.0, just for fun …

A bit better, but there is always something a bit off. A missing category, cell values shifted, and neither table is usable if we require an accurate representation of the source data.

At this point, I experimented with various combinations of:

  • Prompts
  • Temperature
  • Using Markdown, HTML, and CSV to define the input table
  • Prompting GPT-3 to generate the python for parsing rather than parsing the tables

Occasionally the process was able to generate a table where column heading and values were perfect, but typically this required high temperature values and so wasn’t reproducible. For the most part, results looked plausible but the data was incorrect.

In fairness, we are really asking a lot of GPT-3 with what is a complicated zero-shot task. I am really impressed at how well it did, and perhaps with some better prompting and reframing of the problem — or GPT-4! — results may improve, but I wasn’t able to achieve what was required.

Single-Shot Prompt

Now, let’s provide an example in the prompt. I took a similar Excel file from the Humanitarian Data Exchange …

Table we will use for our example in single-shot prompt. This file was sourced from the Humanitarian Data Exchange

We want this to be processed to look like this …

What our sample file should look like after reformatting

Obviously, this is an unrealistic ‘Real world’ example, as the format and content are very similar to the table we are trying to process, but it’s a good first test.

Converting our input table to CSV and unmerging merged cells as described above, we get …

We can now construct our single-shot prompt (assuming a temperature of zero for reproducibility) …

from io import StringIO

# parse_excel_sheet is a helper from the full notebook (see references) which
# produces the reformatted version of a sheet, used here as the example completion
wb = openpyxl.load_workbook(prompt_sample_table1, data_only=True)
sheet = wb["Sheet1"]
example_before, sheet, hasmerged = pad_merged_cells(sheet)
example_before_csv = example_before.to_csv()
example_after, hasmerged, report = parse_excel_sheet(sheet)
example_after_markdown = example_after.to_markdown()
example_after_csv = example_after.to_csv()

example_before_csv = """
,0,1,2,3,4,5,6,7
0,Table 16: % of infants on Minimum Dietary Diversity,,,,,,,
1,,,OVERALL,OVERALL,Sub county,Sub county,Sub county,Sub county
2,,,OVERALL,OVERALL,Chepalungu,Chepalungu,Bomet Central,Bomet Central
3,,,N,%,n,%,n,%
4,Infants on Dietary Diversity,Infants on Minimum Dietary Diversity,37,17.5%,24,17.9%,13,16.7%
5,Infants on Dietary Diversity,Infants not on Dietary Diversity,175,82.5%,110,82.1%,65,83.3%
6,Infants on Dietary Diversity,Total,212,100.0%,134,100.0%,78,100.0%
"""

example_after_markdown = (
"""
| | | | OVERALL - N | OVERALL - % | Sub county - Chepalungu | Sub county - Chepalungu - % | Sub county - Bomet Central | Sub county - Bomet Central - % |
|---:|:-------------------------------------|:----------------------------------------|--------------:|:--------------|--------------------------:|:------------------------------|-----------------------------:|:---------------------------------|
| 1 | Infants on Dietary Diversity | Infants on Minimum Dietary Diversity | 37 | 17.5% | 24 | 17.9% | 13 | 16.7% |
| 2 | Infants on Dietary Diversity | Infants not on Dietary Diversity | 175 | 82.5% | 110 | 82.1% | 65 | 83.3% |
| 3 | Infants on Dietary Diversity | Total | 212 | 100.0% | 134 | 100.0% | 78 | 100.0% |
""".replace(
":|", "|"
)
.replace("|:", "|")
.replace("\n", "\n<RETURN>")
)

example_after_csv = """
, , ,OVERALL - N,OVERALL - %,Sub county - Chepalungu,Sub county - Chepalungu - %,Sub county - Bomet Central,Sub county - Bomet Central - %
1,Infants on Dietary Diversity,Infants on Minimum Dietary Diversity,37,17.5%,24,17.9%,13,16.7%
2,Infants on Dietary Diversity,Infants not on Dietary Diversity,175,82.5%,110,82.1%,65,83.3%
3,Infants on Dietary Diversity,Total,212,100.0%,134,100.0%,78,100.0%
"""

table_to_parse_padded = """
,0,1,2,3,4,5,6,7,8,9,10,11
0,Table 3: Number of acreage under irrigation,,,,,,,,,,,
1,,,OVERALL,OVERALL,Sub county,Sub county,Sub county,Sub county,Sub county,Sub county,Sub county,Sub county
2,,,OVERALL,OVERALL,Chepalungu,Chepalungu,Chepalungu,Chepalungu,Bomet Central,Bomet Central,Bomet Central,Bomet Central
3,,,OVERALL,OVERALL,Male,Male,Female,Female,Male,Male,Female,Female
4,,,N,%,N,%,N,%,N,%,N,%
5,What is the average size of land you own that is currently under irrigation?,0 - 2 acres,22,2.8%,4,2.2%,10,3.8%,3,1.7%,5,2.9%
6,What is the average size of land you own that is currently under irrigation?,2 - 5 acres,6,.8%,2,1.1%,2,.8%,0,0.0%,2,1.2%
7,What is the average size of land you own that is currently under irrigation?,5 - 10 acres,1,.1%,0,0.0%,0,0.0%,0,0.0%,1,.6%
8,What is the average size of land you own that is currently under irrigation?,More than 10 acres,0,0.0%,0,0.0%,0,0.0%,0,0.0%,0,0.0%
9,What is the average size of land you own that is currently under irrigation?,None,760,96.3%,176,96.7%,251,95.4%,170,98.3%,163,95.3%
10,What is the average size of land you own that is currently under irrigation?,Total,789,100.0%,182,100.0%,263,100.0%,173,100.0%,171,100.0%
"""

prompt = (
"Reformat this table to only have a single header row: \n\n"
+ example_before_csv
+ "\n\n"
+ "Result: \n\n"
+ example_after_csv
+ "\n\n"
+ "Reformat this table to only have a single header row: \n\n"
+ table_to_parse_padded
+ "\n\n"
+ "Result: \n\n"
)

print("\n\n", prompt, "\n\n")

completions = ai.Completion.create(
engine="text-davinci-003",
temperature=0.0,
prompt=prompt,
n=1,
stop=None,
max_tokens=2068,
top_p=1,
frequency_penalty=0,
presence_penalty=0,
)

print("\n========== Model prediction:\n")

display(pd.read_csv(StringIO(completions.choices[0].text)))

Here is the generated prompt …

Reformat this table to only have a single header row: 


,0,1,2,3,4,5,6,7
0,Table 16: % of infants on Minimum Dietary Diversity,,,,,,,
1,,,OVERALL,OVERALL,Sub county,Sub county,Sub county,Sub county
2,,,OVERALL,OVERALL,Chepalungu,Chepalungu,Bomet Central,Bomet Central
3,,,N,%,n,%,n,%
4,Infants on Dietary Diversity,Infants on Minimum Dietary Diversity,37,17.5%,24,17.9%,13,16.7%
5,Infants on Dietary Diversity,Infants not on Dietary Diversity,175,82.5%,110,82.1%,65,83.3%
6,Infants on Dietary Diversity,Total,212,100.0%,134,100.0%,78,100.0%


Result:


, , ,OVERALL - N,OVERALL - %,Sub county - Chepalungu,Sub county - Chepalungu - %,Sub county - Bomet Central,Sub county - Bomet Central - %
1,Infants on Dietary Diversity,Infants on Minimum Dietary Diversity,37,17.5%,24,17.9%,13,16.7%
2,Infants on Dietary Diversity,Infants not on Dietary Diversity,175,82.5%,110,82.1%,65,83.3%
3,Infants on Dietary Diversity,Total,212,100.0%,134,100.0%,78,100.0%


Reformat this table to only have a single header row:


,0,1,2,3,4,5,6,7,8,9,10,11
0,Table 3: Number of acreage under irrigation,,,,,,,,,,,
1,,,OVERALL,OVERALL,Sub county,Sub county,Sub county,Sub county,Sub county,Sub county,Sub county,Sub county
2,,,OVERALL,OVERALL,Chepalungu,Chepalungu,Chepalungu,Chepalungu,Bomet Central,Bomet Central,Bomet Central,Bomet Central
3,,,OVERALL,OVERALL,Male,Male,Female,Female,Male,Male,Female,Female
4,,,N,%,N,%,N,%,N,%,N,%
5,What is the average size of land you own that is currently under irrigation?,0 - 2 acres,22,2.8%,4,2.2%,10,3.8%,3,1.7%,5,2.9%
6,What is the average size of land you own that is currently under irrigation?,2 - 5 acres,6,.8%,2,1.1%,2,.8%,0,0.0%,2,1.2%
7,What is the average size of land you own that is currently under irrigation?,5 - 10 acres,1,.1%,0,0.0%,0,0.0%,0,0.0%,1,.6%
8,What is the average size of land you own that is currently under irrigation?,More than 10 acres,0,0.0%,0,0.0%,0,0.0%,0,0.0%,0,0.0%
9,What is the average size of land you own that is currently under irrigation?,None,760,96.3%,176,96.7%,251,95.4%,170,98.3%,163,95.3%
10,What is the average size of land you own that is currently under irrigation?,Total,789,100.0%,182,100.0%,263,100.0%,173,100.0%,171,100.0%


Result:

And here is the completion from GPT-3 converted to a dataframe for easier display …

Generated table from single-shot prompt to reformat table with hierarchical headers (completion was a CSV, converted to a pandas dataframe here for easier display)

Nice! When provided an example, GPT-3 was able to reformat our new table perfectly. However, this isn’t a great test because the example and test tables were very similar in structure and content, but it’s interesting to note that even though the example did not have the Male/Female hierarchy, GPT-3 was able to collapse this extra level correctly.

Let’s use the same example table to reformat a table that has a different layout and content …

When processed with the same code, this results in …

Which is close: the headings are spot on, but the farm column has shifted to the left. Our single-shot prompt does quite well for reformatting very similar tables, but slight variation leads to poor results.

Single-Shot, with reasoning

There has been quite a bit of research already around prompt engineering. A really great resource can be found in the OpenAI Cookbook’s Techniques to Improve Reliability [3]. One of the most effective methods to improve results is to include reasoning in the example prompt [4]. Using our previous table, let’s adjust the prompt to include reasoning …

prompt = (
"We need to reformat this table to only have a single header row: \n\n"
+ example_before_csv
+ "\n"
+ "Let's think step by step \n"
+ "Row 1 is just an index row, it has no text or data \n"
+ "Row 2 contains just label text \n"
+ "Rows 3 to 5 contain column headers \n"
+ "Rows 6 onwards contain data \n"
+ "Columns are separated by commas, there should be 7 commas on each row \n"
+ "If we combine each colummn of rows 3 to 5 by concatenating vertically, we get \n"
+ example_after_csv
+ "\n\n"
+ "We need to reformat this table to only have a single header row: \n\n"
+ table_to_parse_padded
+ "\n\n"
+ "Let's think step by step \n\n"
)

The full prompt looks like this …

We need to reformat this table to only have a single header row: 


,0,1,2,3,4,5,6,7
0,Table 16: % of infants on Minimum Dietary Diversity,,,,,,,
1,,,OVERALL,OVERALL,Sub county,Sub county,Sub county,Sub county
2,,,OVERALL,OVERALL,Chepalungu,Chepalungu,Bomet Central,Bomet Central
3,,,N,%,n,%,n,%
4,Infants on Dietary Diversity,Infants on Minimum Dietary Diversity,37,17.5%,24,17.9%,13,16.7%
5,Infants on Dietary Diversity,Infants not on Dietary Diversity,175,82.5%,110,82.1%,65,83.3%
6,Infants on Dietary Diversity,Total,212,100.0%,134,100.0%,78,100.0%

Let's think step by step
Row 1 is just an index row, it has no text or data
Row 2 contains just label text
Rows 3 to 5 contain column headers
Rows 6 onwards contain data
Columns are separated by commas, there should be 7 commas on each row
If we combine each colummn of rows 3 to 5 by concatenating vertically, we get

, , ,OVERALL - N,OVERALL - %,Sub county - Chepalungu - N,Sub county - Chepalungu - %,Sub county - Bomet Central - N,Sub county - Bomet Central - %
1,Infants on Dietary Diversity,Infants on Minimum Dietary Diversity,37,17.5%,24,17.9%,13,16.7%
2,Infants on Dietary Diversity,Infants not on Dietary Diversity,175,82.5%,110,82.1%,65,83.3%
3,Infants on Dietary Diversity,Total,212,100.0%,134,100.0%,78,100.0%


We need to reformat this table to only have a single header row:


,0,1,2,3,4,5
0,Random text here,,,,,
1,,Cows,Cows,Cowboy Hats,Cowboy Hats,Cowboy Hats
2,,TOT,Farm,Small ,Medium,Large
3,,97,6666,0.65,,0.13
4,,93,Bar M,,0.2,
5,,140,Daisy,0.3,0.89,
6,,,Plew,0.2,0.5,0.1
7,,119,Birdie,0.2,0.1,
8,,29,Kit Kat,,0.55,


Let's think step by step

Which results in this completion from GPT-3 for our input table …

Row 1 is just an index row, it has no text or data 
Row 2 contains just label text
Rows 3 to 8 contain column headers
Rows 9 onwards contain data
Columns are separated by commas, there should be 6 commas on each row
If we combine each colummn of rows 3 to 8 by concatenating vertically, we get

, ,Cows - TOT,Cows - Farm,Cowboy Hats - Small ,Cowboy Hats - Medium,Cowboy Hats - Large
1,,97,6666,0.65,,0.13
2,,93,Bar M,,0.2,
3,,140,Daisy,0.3,0.89,
4,,,Plew,0.2,0.5,0.1
5,,119,Birdie,0.2,0.1,
6,,29,Kit Kat,,0.55,

It’s correct! The reformatted table is exactly what we wanted …

Results are improved if we provide reasoning in the single-shot prompt

That said, the task we have provided isn’t that great a test because, even though the content is different from the provided example, the heading layout is still quite similar. In fact, if we tweak the table we want to reformat a little and add an extra column ‘Organic’ …

Adding an extra column to the input

The prediction is now incorrect …

Only off by one extra comma in the header row, but this results in everything being shifted to the right.

We might continue to engineer the prompt with more reasoning or apply more advanced techniques to automatically structure our prompt workflow, but the real issue is that one example isn’t really enough to capture all the variations of table formats we might encounter. It’s amazing how well GPT-3 does even with one example, but it’s not yet good enough for production for this task (at least how it has been framed so far).

Few-Shot …. err, or not

The next approach might be to provide more than a single example. However, table excerpts require a lot of tokens (more on this later), so if we have to provide multiple examples in a prompt, plus the tokens in the result, we start to hit the OpenAI API’s token limits. For the davinci model this is currently set at 4,000 tokens. Also, since we are charged by the token, it can get expensive to send and receive a lot of tokens for a small non-profit like DataKind. There are also performance implications with longer prompts.
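To get a feel for how quickly that budget disappears, we can count the tokens in just one of our padded table excerpts. A minimal sketch, assuming the tiktoken library (which maps text-davinci-003 to the p50k_base encoding) and the csv_as_str_merged string from earlier …

import tiktoken

# text-davinci-003 uses the p50k_base encoding
enc = tiktoken.encoding_for_model("text-davinci-003")

# Count the tokens in the padded CSV table from earlier
table_tokens = len(enc.encode(csv_as_str_merged))
print(f"One table excerpt uses {table_tokens} of the 4,000-token limit")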

So I decided to skip few-shot for now.

Fine-tuning

It was interesting to explore zero- and single-shot prompts, and had that worked for this use-case it would have been an amazing result. In future, as models improve this may well become a viable option, but for now, it might make sense to reframe the task a little.

Another approach is to provide lots of examples through fine-tuning. As OpenAI note:

Fine-tuning lets you get more out of the models available through the API by providing:

  1. Higher quality results than prompt design
  2. Ability to train on more examples than can fit in a prompt
  3. Token savings due to shorter prompts
  4. Lower latency requests

At first, I considered fine-tuning by providing GPT-3 (i) prompts of the raw table (with merged cells unmerged) and; (ii) completions being the reformatted table. The challenge with this approach however is that it still uses a lot of tokens, especially as we are now going to use hundreds of examples.

Instead of passing in raw table excerpts, let’s try using attributes of that table and have GPT-3 predict key further attributes which we can use for parsing …

Reframing the Task — Using Table Attributes as Prompts

As a human (well, mostly human), when I scan a table in Excel I am able to pick out the structure by looking at the values and making a decision about where the data resides.

Identifying the data section in a table is key to parsing into a regular tabular structure

Once I know the row at which data begins, it’s straightforward to deduce heading hierarchies from the rows above and collapse them into a single header row in order to create a nice, regular table to use …

Processed table with flat headings, easily imported into a relational database
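As a rough sketch of that collapsing step (this is not the exact code from the notebook linked in the references; it assumes merged cells have already been filled in with pad_merged_cells and that stray label rows such as the table title have been dropped) …

import pandas as pd

def collapse_headers(df, data_start_row, sep=" - "):
    """
    Collapse the rows above data_start_row into a single header row by
    concatenating the distinct values in each column, top to bottom.
    """
    header_rows = df.iloc[:data_start_row]
    new_columns = []
    for col in df.columns:
        parts = []
        for val in header_rows[col]:
            val = str(val).strip()
            # Skip blanks and repeats left over from unmerged cells
            if val and val not in parts:
                parts.append(val)
        new_columns.append(sep.join(parts))
    data = df.iloc[data_start_row:].copy()
    data.columns = new_columns
    return data.reset_index(drop=True)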

Identifying where the data begins at first seems trivial with a bit of manipulation in openpyxl or pandas.read_excel. However, if working through tens of thousands of spreadsheets with different heading layouts, blank rows, and more, it can be challenging to develop a set of rules which can be used to identify exactly where the data starts in every sheet.

It’s complicated because:

  • Column headings can have a high degree of variability and look like data
  • Blank cells and annotations can easily confuse parsing rules
  • Data isn’t always numeric, it can be categorical and look a lot like column headings
  • Some column headings are numbers and can look like data, for example, years

So what table attributes/features should we use to predict the row number where data first occurs?

I came up with a short list of table attributes I thought might be useful …

import numpy as np
import openpyxl
import pandas as pd


def get_sheet_attributes(sheet, maxn):
    """
    Returns a set of table attributes for a given sheet

    Input Parameters:
        sheet: Obj
            Openpyxl sheet object
        maxn: int
            Number of rows to scan at start of sheet

    Returns:
        report: str
            Human-readable summary of the attributes below
        report_json: dict
            null_cells_in_rows: list of ints
                Count of NULL cells in first maxn rows
            float_cells_in_rows: list of ints
                Count of numeric cells in first maxn rows
            unique_vals_in_rows: list of ints
                Count of unique values in first maxn rows
            year_vals_in_rows: list of ints
                Count of year values in first maxn rows
            hxl_row: int
                Row number of HXL header row
            first_float_row: int
                Row number where numeric cells first increase
            first_not_null_row: int
                Row number with the fewest null cells
    """
    dd = pd.DataFrame(sheet.values)

    null_cells_in_rows = list(
        dd[0:maxn].apply(lambda x: x.isnull().sum(), axis="columns")
    )
    float_cells_in_rows = []
    unique_vals_in_rows = []
    year_vals_in_rows = []
    report_json = {}
    hxl_row = None
    for index, row in dd[0:maxn].iterrows():
        unique_vals = list(row.unique())
        unique_vals = [i for i in unique_vals if i is not None and str(i) != "nan"]
        unique_vals_in_rows.append(len(unique_vals))
        float_count = 0
        year_count = 0
        # check_hdx_header is a helper from the full notebook (see references)
        # which detects a row of HXL hashtags
        if check_hdx_header(list(row)):
            hxl_row = index
        for col in dd.columns:
            val = row[col]
            # Handle numbers that come through as strings
            if isinstance(val, str):
                val = val.replace(",", "").replace(" ", "")
                if val.isnumeric():
                    val = int(val)
            # Check for year values
            if (
                ((isinstance(val, int) or isinstance(val, float)) and val % 1 == 0)
                and val > 1900
                and val < 2100
            ):
                year_count += 1
                continue
            # Count numeric cells
            if isinstance(val, float) or isinstance(val, int) or "^=" in str(row[col]):
                float_count += 1
        float_cells_in_rows.append(float_count)
        year_vals_in_rows.append(year_count)

    max_floats = max(float_cells_in_rows)
    min_nulls = min(null_cells_in_rows)
    first_float_row = 0
    if sum(float_cells_in_rows) > 0:
        for i in range(1, len(float_cells_in_rows)):
            # Use a ratio, or the special case where we go from zero to some
            if float_cells_in_rows[i] / max_floats > 0.5 or (
                float_cells_in_rows[i] > 0 and float_cells_in_rows[i - 1] == 0
            ):
                first_float_row = i
                break
    first_not_null_row = np.argmin(null_cells_in_rows)

    report = f"Nulls in first {maxn} rows: {str(null_cells_in_rows)}\n"
    report += f"Numeric first {maxn} rows: {str(float_cells_in_rows)}\n"
    report += f"Unique values in first {maxn} rows: {str(unique_vals_in_rows)}\n"
    report += f"Year values in first {maxn} rows: {str(year_vals_in_rows)}\n"
    report += f"HXL row: {str(hxl_row)}\n"

    report += f"\nFirst reduced nulls row: {str(first_not_null_row)}\n"
    report += f"First increased numeric row (excluding years): {str(first_float_row)}\n"

    report_json = {
        "null_cells_in_rows": null_cells_in_rows,
        "float_cells_in_rows": float_cells_in_rows,
        "unique_vals_in_rows": unique_vals_in_rows,
        "year_vals_in_rows": year_vals_in_rows,
        "hxl_row": hxl_row,
        "first_float_row": first_float_row,
        "first_not_null_row": first_not_null_row,
    }

    return report, report_json


maxn = 10  # Number of rows to scan at the top of each sheet
wb = openpyxl.load_workbook(filename, data_only=True)
for s in wb.sheetnames:
    sheet = wb[s]
    report, report_json = get_sheet_attributes(sheet, maxn)
    print(report)

Which gives output like this …

Nulls in first 10 rows: [12, 11, 10, 10, 8, 2, 0, 1, 1, 1]
Numeric first 10 rows: [0, 0, 0, 0, 0, 0, 5, 5, 5, 5]
Unique values in first 10 rows: [0, 1, 2, 2, 2, 2, 12, 8, 6, 3]
Year values in first 10 rows: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
HXL row: None

First reduced nulls row: 6
First increased numeric row (excluding years): 6

These will be our prompts for fine-tuning the model.

To create completions for the fine-tuning file, I used the Humanitarian Data Exchange dataset for Kenya (see here for more details of how I extracted Excel files). Parsing files and looping through sheets in each, I generated prompts.

I used the following logic to estimate the row on which data starts on a sheet, using the table parameters above …

# Make a guess at which row is the first data row, using the attributes
# returned by get_sheet_attributes
datarow = first_not_null_row
# Sometimes we have header rows where none are null; in this case use the row
# where the count of numeric cells jumps
if first_float_row > datarow:
    datarow = first_float_row
# If there is an HXL tag row, use it to anchor the data row
if hxl_row is not None:
    datarow = hxl_row
# If the candidate row contains lots of year values, it is probably a row of
# year column headings, so move down one row
if year_vals_in_rows[datarow] > 3:
    datarow = datarow + 1

This rule-based approach actually does a pretty good job, but it is not perfect, hence the need for GPT-3. However, it’s handy for creating a labelled set where most of the completions are accurate; I then only have to adjust the few where the logic above does not hold.

For my training set, I used one table per organization from multiple Excel sheets labelled ‘Kenya’ from 10 humanitarian provider organizations, where the prediction of the first data row was made using the rule-based approach above. I then reviewed this list and compared with the actual sheets to make corrections where the spreadsheet table started on a different row. I excluded cases where there were multiple tables on a sheet for this study, after which I had 232 fine-tuning prompts like this …

{"prompt": "Nulls in first 15 rows: [9, 8, 7, 7, 3, 1, 2, 2, 2, 2, 2]\nNumeric first 15 rows: [0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3]\nUnique values in first 15 rows: [0, 1, 2, 2, 3, 8, 7, 7, 6, 6, 5]\nYear values in first 15 rows: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\nHXL row: None\nFirst reduced nulls row: 5\nFirst increased numeric row (excluding years): 5", "completion": "Data starts at row: 5\n", "meta_data": "./data/Kenya/kenya-hand-washing-statistics-in-bomet-county_118ea93f-83ce-4b86-b1c4-ca54ea9acc8a/Hand_washing_practices_xlsx_efc74f32_ac23_463a_924b_d53c3656b406/Hand washing practices.xlsx"}
{"prompt": "Nulls in first 15 rows: [2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1]\nNumeric first 15 rows: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\nUnique values in first 15 rows: [0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]\nYear values in first 15 rows: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\nHXL row: None\nFirst reduced nulls row: 1\nFirst increased numeric row (excluding years): 0", "completion": "Data starts at row: 1\n", "meta_data": "./data/Kenya/shcchealthcare-dataset_02995168-3644-4b78-92be-cdf67275b39d/2018_SHCC_Overview_Data_xlsx_d053b42a_7d31_41b5_a6d9_c8b0a424241c/2018 SHCC Overview Data.xlsx"}
{"prompt": "Nulls in first 15 rows: [6, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 3, 7, 6, 3]\nNumeric first 15 rows: [0, 0, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 0, 0, 0]\nUnique values in first 15 rows: [1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 0, 1, 4]\nYear values in first 15 rows: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\nHXL row: None\nFirst reduced nulls row: 1\nFirst increased numeric row (excluding years): 2", "completion": "Data starts at row: 2\n", "meta_data": "./data/Kenya/eastern-southern-africa-refugees-and-idps-situation-and-response-dec2019_e1f9f55e-08db-4166-a787-c7ea9969dc4d/UNICEF_ESARO_Regional_refugee_and_idp_db_2019_November_27_2019_xlsx_0696b7f3_6368_403e_bcb7_eccdc617961f/UNICEF ESARO Regional refugee and idp db 2019 November 27.2019.xlsx"}

Side Note: In the above you might notice I’ve added a ‘meta_data’ element to each prompt. This isn’t part of the required JSONL prompt record, but I include this to be able to easily associate each prompt with a file for debugging. The prompt file still seems to be accepted by OpenAI with this extra data, I think as long as there are ‘prompt’ and ‘completion’ elements, it’s happy!
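For completeness, here is a minimal sketch of how records like the above might be written out to the prompts.json file used below (the training_rows list is purely illustrative; report, datarow, and filename come from the attribute and rule-based code earlier) …

import json

# Each record: the table-attribute report (prompt), the labelled first data
# row (completion), and the source file for debugging (meta_data)
training_rows = [
    {
        "prompt": report,
        "completion": f"Data starts at row: {datarow}\n",
        "meta_data": filename,
    },
    # ... one record per sheet ...
]

with open("prompts.json", "w") as f:
    for row in training_rows:
        f.write(json.dumps(row) + "\n")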

I then fine-tuned a DaVinci model …


from openai import cli  # helper used below to upload the training file

ai.api_key_path = "./api_key.txt"

train_file = './prompts.json'

print("Uploading training file ...")
training_id = cli.FineTune._get_or_upload(train_file, True)

print("Fine-tuning model ...")
create_args = {
    "training_file": training_id,
    "model": "davinci"
}
resp = ai.FineTune.create(**create_args)
job_id = resp["id"]
status = resp["status"]

print(f'Fine-tuning model with jobID: {job_id}.')

I manually checked fine-tuning status as follows …

ai.api_key_path="./api_key.txt"
result = ai.FineTune.retrieve(id=job_id)

print(result['status'])

Then once finished, retrieved the model …

model = result["fine_tuned_model"]

For the test set I used one table from each Excel file sourced from organizations not in the training set (labelled ‘Kenya’), first running the rule-based prediction above to generate prompts and completions and then correcting where this returned incorrect values. Again, I excluded cases where multiple tables appeared on an Excel sheet. This gave me a test set of 72 prompts.


import sys

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def make_gpt3_prediction(prompt, model, temperature=0.99, max_tokens=13):
    """
    Wrapper to call GPT-3 to make a prediction (completion) on a single prompt.
    Also calls post_process() to clean up the prediction.

    Parameters
    ----------
    prompt : str
        Prompt to use for prediction
    model : str
        GPT-3 model to use
    temperature : float
        Temperature to use for sampling
    max_tokens : int
        Maximum number of tokens to use for sampling

    Returns
    -------
    result : dict
        Dictionary with prompt, predicted, predicted_post_processed
    """
    result = {}
    result["prompt"] = prompt
    model_result = ai.Completion.create(
        engine=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=["\n"],
        logprobs=1,
    )
    result["predicted"] = model_result["choices"][0]["text"].replace(" ", "")
    result["logprobs"] = model_result["choices"][0]["logprobs"]["top_logprobs"]
    return result


def output_prediction_metrics(results, prediction_field="predicted_post_processed"):
    """
    Prints out model performance report if provided results in the format:

    [
        {
            'prompt': ' \'ISO3\' | "[\'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\']"',
            'predicted': ' #country+code+iso3+v_iso3+',
            'expected': '#country+code'
        },
        ... etc ...
    ]

    Parameters
    ----------
    results : list
        See above for format
    prediction_field : str
        Field name of element with prediction. Handy for comparing raw and post-processed predictions.
    """
    y_test = []
    y_pred = []
    for r in results:
        if "expected" not in r:
            print("Provided results do not contain expected values.")
            sys.exit()
        y_pred.append(r[prediction_field])
        y_test.append(r["expected"])

    print(f"There were {len(y_test)} predictions made.")
    print(f"\nPrediction using field {prediction_field} ...\n")
    print(f"Accuracy: {round(accuracy_score(y_test, y_pred),2)}")
    print(
        f"Precision: {round(precision_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"Recall: {round(recall_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"F1: {round(f1_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )


# File generated by downloading and processing HDX files. See this blog post
# for more details: https://medium.com/towards-data-science/predicting-metadata-for-humanitarian-datasets-using-gpt-3-b104be17716d
country = 'Kenya'
resources = pd.read_pickle(hdx_resources_pkl_file)

df = resources[(resources["resource_format"] == 'XLSX')][["resource_format", "file", "sheet", "dataset_name", "dataset_org_title"]]
df.drop_duplicates(inplace=True)
orgs = df["dataset_org_title"].unique()

# Number of rows to use when calculating table row parameters
maxn = 15

# Determine test/train split, 0:10 used for training, 11:len(orgs) for test
dataset_orgs_cutoff = 10

results = []
for dataset_org in orgs[dataset_orgs_cutoff: len(orgs)]:
    rows = df.loc[df['dataset_org_title'] == dataset_org]
    row = rows.iloc[0]  # Take one sheet from each org to get more variation
    filename = row["file"]
    sheetname = row["sheet"]

    wb = openpyxl.load_workbook(filename, data_only=True)
    for s in wb.sheetnames:
        sheet = wb[s]

        # Extract table attributes. In the full notebook (see references) the
        # report also includes the rule-based 'Data starts at row' estimate as
        # its final section, which is used as the expected completion below
        report, report_json = get_sheet_attributes(sheet, maxn)

        report_elements = report.split('\n\n')
        prompt = report_elements[0] + report_elements[1]
        completion = report_elements[2]

        # Make our GPT-3 prediction with the fine-tuned model
        res = make_gpt3_prediction(prompt, model, temperature=0.0)

        predicted = res["predicted"].split(':')[1].strip()
        actual = completion.split(':')[1].strip()

        results.append({
            "prompt": prompt,
            "predicted": predicted,
            "expected": actual
        })

output_prediction_metrics(results, prediction_field="predicted")

Side Note: In my previous blog post to predict HXL tags, I had to filter completions by log probability, but in this study it wasn’t required.

GPT-3 predicted the first data row on sheets in our test set with the following results …

Prediction using field predicted ...

Accuracy: 0.97
Precision: 1.0
Recall: 0.97
F1: 0.99

So GPT-3 does a nice job of predicting where the first data row is.

Putting it All Together

Step 1 — Read in our data

Example spreadsheet, with varying hierarchical headers and notes in cells

Step 2 — Unmerge merged columns and populate with merged value

Pandas dataframe of sheet after being processed by function ‘pad_merged_cells’ to unmerge and fill with merged values

Step 3 — Calculate table parameters to generate GPT-3 prompt

Nulls in first 10 rows: [20, 20, 20, 21, 10, 8, 19, 9, 21, 0]
Numeric first 10 rows: [0, 0, 0, 0, 0, 0, 0, 0, 0, 14]
Unique values in first 10 rows: [1, 1, 1, 0, 11, 13, 2, 4, 0, 21]
Year values in first 10 rows: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
HXL row: None

First reduced nulls row: 9
First increased numeric row (excluding years): 9

Step 4 — Call GPT-3 to predict where the data row starts

GPT-3 prediction: 9

Step 5 — now that we know the row where the data begins, concatenate the column headings above it into one row

Parsed sheet, with collapsed hierarchical columns, no random labels. This can now be imported into a database

This is a nice table we can upload into a relational database. See references section below for the full code.

Admittedly, it would be easy to parse this sheet manually, or even to specify some rules related to the table parameters we found, but the point of the above process is that it can be applied to the wide range of table layouts encountered when analyzing the thousands of Excel sheets in the Humanitarian Data Exchange datasets.

Conclusions and Future Work

Though zero- and single-shot prompting show great potential in general, they did not work out just yet for this particular task when prompted with CSV tables. As Large Language Models advance this will likely change — I’m excited to see what GPT-4 might be capable of — but for now, it seems that fine-tuning is a better option, predicting key table attributes which can be used in reformatting. This method does of course require some pre-processing in order to determine table parameters for the prompt. It’s also worth noting that in using table ‘features’ it starts to look more like a classification task than text completion, and might be better framed that way. Irrespective, the technique performs well on the Humanitarian Data Exchange Excel files.

I think it would be really interesting to extend this work to also handle cases where Excel sheets have multiple tables on any given sheet. This would require more table features than I’ve used in this study, such as cell formatting and column (rather than row) attributes.

More fun ahead!

References

[1] Haoyu Dong et al., TableSense: Spreadsheet Table Detection with Convolutional Neural Networks (2021)

[2] Brown et al., Language Models are Few-Shot Learners (2020)

[3] OpenAI Cookbook: Techniques to Improve Reliability

[4] Kojima et al., Large Language Models are Zero-Shot Reasoners (2022)

Code for this analysis can be found in this notebook.


Matt is the Head of Data Science at DataKind, helping social sector organizations harness the power of data science and AI in the service of humanity.