Predicting Metadata for Humanitarian Datasets Using GPT-3

Matthew Harris
Towards Data Science
19 min read · Jan 18, 2023


Responding to humanitarian disasters quickly, or better still anticipating them, can save lives [1]. Data is key to this: not just having lots of data, but clean data which is well understood [2], in order to build a clear view of the situation on the ground. In many cases this critical data is stored in hundreds of small spreadsheets, so piecing them all together can be time consuming and difficult to maintain as new data comes in during a humanitarian incident. Automating the process of data discovery could speed up responses and improve outcomes for affected people.

One way to make discovery easier is to ensure that tabular data has metadata describing each column. This can help in linking datasets together, for example knowing that a column in a table of landmine locations specifies longitude and latitude, similar to a column in another table locating field hospitals. It's not always obvious from column names, which can be in many languages and follow varying standards, what data a column might contain. In an ideal world such metadata is provided with the data, but as we will see below this isn't typically the case. Doing it manually can be a BIG job.

In this article I look at how we might help automate this process by using OpenAI's GPT-3 Large Language Model to predict metadata attributes of humanitarian datasets, and improve on the performance of previous work.

The Humanitarian Data Exchange (HDX)

The Humanitarian Data Exchange (HDX) is a fantastic platform which aims to address some of these issues by bringing humanitarian datasets together in a standardized way. As I write this there are 20,403 datasets globally, covering a wide range of domains and file types. CSV and Excel files in these datasets result in about 148,000 distinct tables, lots of lovely data!

Types of files on the Humanitarian Data Exchange (HDX) Platform. See this notebook for how the data was collated.

The Humanitarian Exchange Language (HXL)

One great thing about the HDX platform is that it encourages data owners to tag their data using the Humanitarian Exchange Language (HXL) standard. This metadata makes it easier to combine and use data in a meaningful way, speeding things up when time is important.

HXL Tags come in two forms, those set at the dataset level, and field-level tags which apply to columns in tabular data. The latter look like this:

Example of a table with HXL tags on the second row (from the HXL Standards examples)

Notice the second row just below the column headers; those are HXL tags. They consist of the tag, which is prefixed with a '#' (e.g. '#adm1'), and in some cases attributes, e.g. '+name'.
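
To make the format concrete, here is a small, purely made-up table read with pandas, with the HXL tag row sitting just below the column headers. The column names, tags and values here are illustrative only, not taken from any real HDX dataset …

import pandas as pd
from io import StringIO

# Made-up example data, for illustration only
csv_text = (
    "Admin 1,Population,Date\n"
    "#adm1+name,#population+total,#date+reported\n"
    "Nairobi,4397073,2023-01-01\n"
    "Mombasa,1208333,2023-01-01\n"
)

raw = pd.read_csv(StringIO(csv_text))
hxl_tags = list(raw.iloc[0])                 # the second row of the file: HXL tags
data = raw.iloc[1:].reset_index(drop=True)   # the actual data rows
print(dict(zip(raw.columns, hxl_tags)))
# {'Admin 1': '#adm1+name', 'Population': '#population+total', 'Date': '#date+reported'}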

The challenge is that these field-level tags are not always set on datasets in HDX, making it harder to use the amazing data there. Looking at CSV and Excel data for Kenya, most tables appear to be missing column HXL tags.

Analysis of data files for Kenya on the Humanitarian Data Exchange (HDX) Platform, to see which have HXL column tags. See this notebook for how data was collated.

Wouldn’t it be great if we could fill in those blanks and populate HXL tags for columns that don’t yet have them?

There has already been some really fantastic work by Microsoft on predicting HXL tags using fastText embeddings, see this notebook and corresponding paper [3]. The authors achieved 95% accuracy predicting tags and 92% predicting attributes, really great performance.

However, I wondered if we might use another technique now that there are some new kids on the block …

GPT-3

As I mentioned in a previous article, there has been a real buzz around generative AI in the last year. One of the stars of this story has been OpenAI's GPT-3 Large Language Model (LLM), which has some pretty amazing capabilities. Importantly, it can be fine-tuned to learn patterns in special applications of language such as computer code.

So it occurred to me that HXL tags are just another 'special case' of language, and that it might be possible to fine-tune GPT-3 with some HXL tag examples, then see if it can predict them on new data.

Getting some training data from HDX

First, it's worth clarifying the HDX hierarchy of datasets, resources and tables. A 'Dataset' can include a set of 'Resources', which are files. Datasets have their own page, like this one, which provides lots of useful information about the dataset's history, who uploaded it and dataset-level tags.

Example of an HDX Dataset on the HDX platform

The example above has two CSV file resources, which if you select More > Preview on HDX display the HXL tags.

An example resource for a dataset on the HDX Platform

It’s a super cool platform!

We will be downloading resources like the one above for our analysis. HDX provides a Python library for interacting with their API, which can be installed with …

pip install hdx-python-api

You will then need to set up the connection. As we are only downloading open datasets we don’t need to set up an API key …

from hdx.utilities.easy_logging import setup_logging
from hdx.api.configuration import Configuration
from hdx.data.dataset import Dataset

setup_logging()
Configuration.create(hdx_site="prod", user_agent="my_agent_name", hdx_read_only=True)
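
As a quick sanity check of the dataset → resource hierarchy described above, you can read a single dataset and list its resources. The dataset name below is just a placeholder; substitute any dataset id or name from HDX …

# Placeholder dataset name, swap in any dataset id/name from HDX
dataset = Dataset.read_from_hdx("some-hdx-dataset-name")
print(dataset["title"])
for resource in dataset.get_resources():
    print(f" - {resource['name']} ({resource['format']})")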

After some experimentation, I wrote a little wrapper to download resources (files) for each dataset. It supports CSV, TSV, XLS and XLSX file types, which should include enough tables for our model fine-tuning. It also saves the dataset and resource HDX JSON metadata along with each file.

import os
import sys
import traceback


def is_supported_filetype(format):
    """
    Checks if the file format is currently supported for extracting meta data.

    Parameters
    ----------
    format : str
        The file format to check.

    Returns
    -------
    bool
        True if the file format is supported, False otherwise.
    """
    matches = ["CSV", "XLSX", "XLS", "TSV"]
    if any(x in format for x in matches):
        return True
    else:
        return False


def download_data(datasets, output_folder):
    """
    Downloads data from HDX. Will save dataset and resource meta data for each file

    Parameters
    ----------
    datasets : pandas.DataFrame
        A dataframe containing the datasets to download.
    output_folder : str
        The folder to download the data to.
    """
    if not os.path.exists(output_folder):
        os.mkdir(output_folder)

    for index, row in datasets.iterrows():
        dataset = Dataset.read_from_hdx(row["id"])
        resources = dataset.get_resources()
        for resource in resources:
            dir = f"./{output_folder}/{row['name']}_{row['id']}"
            print(
                f"Downloading {row['name']} - {resource['name']} - {resource['format']}"
            )
            resource["dataset_name"] = row["name"]
            # dump_hdx_meta_file and get_safe_name are small helpers defined in the accompanying notebook
            if not os.path.exists(dir):
                dump_hdx_meta_file(dataset, dir, "dataset.json")
            try:
                dir = f'{dir}/{get_safe_name(resource["name"])}_{get_safe_name(resource["id"])}'
                if not os.path.exists(dir):
                    dump_hdx_meta_file(resource, dir, "resource.json")
                    if is_supported_filetype(resource["format"]):
                        url, path = resource.download(dir)
                    else:
                        print(
                            f"*** Skipping file as it is not a supported filetype *** {resource['name']}"
                        )
                else:
                    print(f"Skipping {dir} as it already exists")
            except Exception as e:
                traceback.print_exc()
                sys.exit()

    print("Done")

The above is a bit long-winded because I wanted to be able to restart the download and have the process continue where it left off. Also, the API seemed to error from time to time, likely due to my internet connection, so there are some Try/Excepts in there. Not a fan of Try/Excepts usually but the aim is to create a training dataset, so I don’t mind some missing resources as long as I have a representative sample to train GPT-3.

Using the HDX search API, we search for 'HXL' to find datasets which are likely to have HXL tags, then download files for those …

datasets_hxl = pd.DataFrame(Dataset.search_in_hdx("HXL"))
download_data(datasets_hxl, output_folder)

This can take a while (a few hours) so get yourself a nice cup of tea!

Column HXL tags are not listed in HDX resource metadata as far as I could tell, so to extract these we will have to analyze our downloaded files. After a bit of experimentation I wrote a few helper functions …

def check_hdx_header(first_row):
    """
    This function checks if the first row of a csv file is likely an HXL tag row.
    """
    matches = ["#meta", "#country", "#data", "#loc", "#geo"]
    if any(x in first_row for x in matches):
        return True
    else:
        return False


def set_meta_data_fields(data, file, dataset, resource, sheet, type):
    """
    This function creates a dictionary of meta data about the data, as well as a snippet of its
    first nrows.

    Parameters:
        data: a dataframe
        file: the name of the data file
        dataset: the dataset JSON object from HDX
        resource: the resource JSON object from HDX
        sheet: the sheet name if the data was a tab in a sheet
        type: the type of file, CSV, XLSX, etc.
    Returns:
        dict: a dictionary with metadata about the dataframe
    """

    nrows = 10

    # Data preview to only include columns with values
    data = data.dropna(axis=1, how="all")

    cols = str(list(data.columns))
    if data.shape[0] > 0:
        first_row = str(list(data.iloc[0]))
        has_hxl_header = check_hdx_header(first_row)
        num_rows = int(data.shape[0])
        num_cols = int(data.shape[1])
        first_nrows = data.head(nrows)
    else:
        first_row = "No data"
        has_hxl_header = "No data"
        num_rows = 0
        num_cols = 0
        first_nrows = None

    dict = {}

    dict["resource_id"] = resource["id"]
    dict["resource_name"] = resource["name"]
    dict["resource_format"] = resource["format"]
    dict["dataset_id"] = dataset["id"]
    dict["dataset_name"] = dataset["name"]
    dict["dataset_org_title"] = dataset["organization"]["title"]
    dict["dataset_last_modified"] = dataset["last_modified"]
    dict["dataset_tags"] = dataset["tags"]
    dict["dataset_groups"] = dataset["groups"]
    dict["dataset_total_res_downloads"] = dataset["total_res_downloads"]
    dict["dataset_pageviews_last_14_days"] = dataset["pageviews_last_14_days"]
    dict["file"] = file
    dict["type"] = type
    dict["dataset"] = dataset
    dict["sheet"] = sheet
    dict["resource"] = resource
    dict["num_rows"] = num_rows
    dict["num_cols"] = num_cols
    dict["columns"] = cols
    dict["first_row"] = first_row
    dict["has_hxl_header"] = has_hxl_header
    dict["first_nrows"] = first_nrows
    return dict


def extract_data_details(f, dataset, resource, nrows, data_details):
    """
    Reads saved CSV and XLSX HDX files and extracts headers, HXL tags and sample data.
    For XLSX files, it extracts data from all sheets.

    Parameters
    ----------
    f : str
        The file name
    dataset : str
        The dataset name
    resource : str
        The resource name
    nrows : int
        The number of rows to read
    data_details : list
        The list of data details

    Returns
    -------
    data_details : list
        The list of data details
    """
    if f.endswith(".xlsx") or f.endswith(".xls"):
        print(f"Loading xlsx file {f} ...")
        try:
            sheet_to_df_map = pd.read_excel(f, sheet_name=None)
        except Exception:
            print(f"An exception occurred trying to read the file {f}")
            return data_details
        for sheet in sheet_to_df_map:
            data = sheet_to_df_map[sheet]
            data_details.append(
                set_meta_data_fields(data, f, dataset, resource, sheet, "xlsx")
            )
    elif f.endswith(".csv"):
        print(f"Loading csv file {f}")
        # Detect encoding
        with open(f, "rb") as rawdata:
            r = chardet.detect(rawdata.read(100000))
        try:
            data = pd.read_csv(f, encoding=r["encoding"], encoding_errors="ignore")
        except Exception:
            print(f"An exception occurred trying to read the file {f}")
            return data_details
        data_details.append(set_meta_data_fields(data, f, dataset, resource, "", "csv"))
    else:
        type = f.split(".")[-1]
        print(f"Type {type} for {f}")
        data = pd.DataFrame()
        data_details.append(set_meta_data_fields(data, f, dataset, resource, "", type))

    return data_details


# Loop through downloaded folders
def extract_all_data_details(startpath, data_details):
    """
    Extracts all data details for downloaded HDX files in a given directory.

    Parameters
    ----------
    startpath : str
        The path to the directory containing all datasets.
    data_details : list
        Results

    Returns
    -------
    data_details : pandas.DataFrame
        Results, to which new meta data was appended.
        See function set_meta_data_fields for columns
    """
    for d in os.listdir(startpath):
        d = f"{startpath}/{d}"
        with open(f"{d}/dataset.json") as f:
            dataset = json.load(f)
        for r in os.listdir(d):
            if "dataset.json" not in r:
                with open(f"{d}/{r}/resource.json") as f:
                    resource = json.load(f)
                for f in os.listdir(f"{d}/{r}"):
                    file = str(f"{d}/{r}/{f}")
                    if ".json" not in file:
                        data_details = extract_data_details(
                            file, dataset, resource, 5, data_details
                        )
    data_details = pd.DataFrame(data_details)
    return data_details

Now we can run it on our previously downloaded datafiles …

hxl_resources_data_details = extract_all_data_details(f"./data/hxl_datasets/", [])
print(hxl_resources_data_details.shape)

(25695, 22)

This dataframe has 25,695 rows, one for each table found when scanning the CSV and Excel files of datasets on HDX matching a search for 'HXL', along with a data preview, column names and, where present, HXL tags.

The Train/Test split

Normally, I would simply use scikit-learn's train_test_split on the data to be used with the model. However, in doing this I noticed that repeat resources (files) from the same dataset often occur in both training and test sets. For example, an organization might provide files for multiple airports, each in exactly the same format with the same HXL tags. If we generate a prompts dataframe and then split, airports from this dataset will appear in both the training and test sets, which wouldn't reflect our real problem of predicting HXL tags for brand-new datasets.

To get around this I did the following:

  1. Split the HDX ‘datasets’ into train/test (remember a dataset can have multiple resource files)
  2. Using each split, I created dataframes of resources, one row per data file (a minimal sketch of steps 1 and 2 is shown below)
  3. Then using these train/test resources dataframes, I created train/test dataframes, one row per column. These are the GPT-3 prompts needed for fine-tuning
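
Here is a minimal sketch of steps 1 and 2, splitting on dataset IDs so that all resources from one dataset land entirely in either train or test. The 20% test size and random seed are arbitrary choices for illustration, not necessarily what was used for the reported results …

from sklearn.model_selection import train_test_split

# Split on dataset IDs (not individual resources) to avoid leakage between train and test
dataset_ids = hxl_resources_data_details["dataset_id"].unique()
train_ids, test_ids = train_test_split(dataset_ids, test_size=0.2, random_state=42)

X_train_resources = hxl_resources_data_details[
    hxl_resources_data_details["dataset_id"].isin(train_ids)
]
X_test_resources = hxl_resources_data_details[
    hxl_resources_data_details["dataset_id"].isin(test_ids)
]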

Creating GPT-3 fine-tuning prompts

In order to fine-tune GPT-3, we need to provide a prompt and completion training file in JSONL format. For the prompt I decided to use (i) the column name; (ii) a sample of data from that column. The completion will be the HXL tag and attributes.

Here is an example …

{"prompt": " 'scheduled_service' | \"['1', '1', '0', '0', '0', '0', '0', '0']\"", "completion": " #status+scheduled"}

The format for GPT-3 is very particular and took a while to get right! Things like having a space at the start of the completion are recommended by OpenAI.
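
If you're unsure whether your file matches what the API expects, OpenAI's command-line tool could (at the time of writing) check and reformat a JSONL training file for you, flagging things like missing leading spaces or duplicate prompts; the exact invocation may have changed since …

openai tools fine_tunes.prepare_data -f fine_tune_openai_train.jsonl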


def get_hxl_tags_list(resources):
    """
    Build a list of the HXL tags found in a dataframe of HDX resources.

    Parameters
    ----------
    resources : pandas dataframe
        A dataframe of HDX resources

    Returns
    -------
    hxl_tags : list
        A list of HXL tags.
    """
    hxl_tags = []
    for row, d in resources.iterrows():
        if d["has_hxl_header"] == True:
            fr = d["first_row"].replace(" ", "")
            for c in fr.split(","):
                fr = re.sub(r"\[|\]|\"|\'", "", c)
                hdxs = fr.split("+")
                for h in hdxs:
                    if h not in hxl_tags and len(h) > 0:
                        hxl_tags.append(h.lower())
    hxl_tags = list(set(hxl_tags))
    hxl_tags.remove("nan")
    return hxl_tags


def get_prompt(col_name, data):
    """
    Builds the prompt for GPT-3 for predicting HXL tags and attributes

    Parameters
    ----------
    col_name : str
        Column name
    data : list
        A list of sample data for the column

    Returns
    -------
    prompt : string
        A prompt for GPT-3.
    """
    ld = len(data) - 1
    col_data = json.dumps(str(list(data.iloc[1:ld])))
    prompt = f" {col_name} | {col_data}".lower()
    return prompt


def create_training_set(resources):
    """
    Builds a jsonl training data file for GPT-3 where each row is a prompt for a column HXL tag.

    It will only output prompts where the sample data for the column didn't contain nans.

    Parameters
    ----------
    resources : pandas dataframe
        A dataframe of HDX resources

    Returns
    -------
    train_data : list
        A list of prompts and completions for fine-tuning.
    """
    train_data = []
    for row, d in resources.iterrows():
        if d["has_hxl_header"] == True:
            cols = d["columns"][1:-1].split(",")
            hdxs = d["first_row"][1:-1].split(",")
            data = d["first_nrows"]
            has_hxl_header = d["has_hxl_header"]
            if len(cols) == len(hdxs) and len(cols) > 1:
                ld = len(data) - 1
                for i in range(0, len(cols)):
                    if i < len(hdxs):
                        hdx = re.sub("'|\"", "", hdxs[i])
                        # Only include if the row has HXL tags and good sample data in the column
                        if has_hxl_header == True and hdx != np.nan:
                            prompt = get_prompt(cols[i], data.iloc[:, i])
                            if "nan" not in hdx and "nan, nan" not in prompt:
                                p = {
                                    "prompt": prompt,
                                    "completion": f" {hdx}",
                                }
                                train_data.append(p)
    return train_data

You’ll notice in the above that I exclude any prompts where there are NaNs in the data. I figured we’d start with good data samples but this is something to be revisited in future.

We can now generate a training dataset and save to a file for GPT-3 …

# Create training set
X_train = create_training_set(X_train_resources)
print(f"Training records: {len(X_train)}")

train_file = "fine_tune_openai_train.jsonl"

with open(train_file, "w") as f:
    for p in X_train:
        json.dump(p, f)
        f.write("\n")

print("Done")

This is what the training data looks like …

{"prompt": "  'Country ISO3' | \"['COD', 'COD', 'COD', 'COD', 'COD', 'COD', 'COD', 'COD']\"", "completion": "  #country+code"}
{"prompt": " 'Year' | \"['2010', '2005', '2000', '1995', '1990', '1985', '1980', '1975']\"", "completion": " #date+year"}
{"prompt": " 'Indicator Name' | \"['Barro-Lee: Percentage of female population age 15-19 with no education', 'Barro-Lee: Percentage of female population age 15-19 with no education', 'Barro-Lee: Percentage of female population age 15-19 with no education', 'Barro-Lee: Percentage of female population age 15-19 with no education', 'Barro-Lee: Percentage of female population age 15-19 with no education', 'Barro-Lee: Percentage of female population age 15-19 with no education', 'Barro-Lee: Percentage of female population age 15-19 with no education', 'Barro-Lee: Percentage of female population age 15-19 with no education']\"", "completion": " #indicator+name"}
{"prompt": " 'Indicator Code' | \"['BAR.NOED.1519.FE.ZS', 'BAR.NOED.1519.FE.ZS', 'BAR.NOED.1519.FE.ZS', 'BAR.NOED.1519.FE.ZS', 'BAR.NOED.1519.FE.ZS', 'BAR.NOED.1519.FE.ZS', 'BAR.NOED.1519.FE.ZS', 'BAR.NOED.1519.FE.ZS']\"", "completion": " #indicator+code"}
{"prompt": " 'Value' | \"['48.1', '51.79', '52.1', '43.62', '35.44', '38.02', '43.47', '49.08']\"", "completion": " #indicator+value+num"}
{"prompt": " 'Country ISO3' | \"['COD', 'COD', 'COD', 'COD', 'COD', 'COD', 'COD', 'COD']\"", "completion": " #country+code"}
{"prompt": " 'Year' | \"['2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008']\"", "completion": " #date+year"}

There were 139,503 rows in this training dataset, one row per column as found in the tabular data we downloaded from HDX, specifically for cases where the column had HXL tags.

Generating an OpenAI API Key

Before we can do anything, you will need to sign up for an OpenAI account. Once that’s done, you should have $18 of free credits. If using a small amount of data, this should suffice, but for this analysis and a few model trainings I racked up a bill of $50, so you may need to attach a credit card to your account.

Once you have an account you can generate an API key. I opted to save this to a local file and reference the file in code, but the OpenAI Python library supports using environment variables too.
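
For example, something like this works if you prefer the environment-variable route (assuming you have exported OPENAI_API_KEY in your shell) …

import os
import openai

# Read the API key from an environment variable instead of a local file
openai.api_key = os.environ["OPENAI_API_KEY"]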

Fine-tuning GPT-3

Right, here comes the exciting bit! With our nice training data, we can fine-tune GPT-3 as follows …

import openai
from openai import cli

# Open AI API key should be put into this file
openai.api_key_path = "./api_key.txt"

print("Uploading training file ...")
training_id = cli.FineTune._get_or_upload(train_file, True)
# validation_id = cli.FineTune._get_or_upload(validation_file_name, True)

print("Fine-tuning model ...")
create_args = {
    "training_file": training_id,
    # "validation_file": test_file,
    "model": "ada",
}
# https://beta.openai.com/docs/api-reference/fine-tunes/create
resp = openai.FineTune.create(**create_args)
job_id = resp["id"]
status = resp["status"]

print(f"Fine-tunning model with jobID: {job_id}.")

In the above we submit the fine-tuning job to OpenAI; we can then check its status with …

result = openai.FineTune.retrieve(id=job_id)
print(result['status'])

I opted to keep things simple, but you can also submit to OpenAI and monitor status via a stream as shown here.
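
If you don't want to stream, a simple polling loop does the job. A sketch: the 60-second interval is arbitrary, and 'failed' is included as a terminal status so the loop stops on errors …

import time

# Poll the fine-tune job until it reaches a terminal status
while True:
    result = openai.FineTune.retrieve(id=job_id)
    print(result["status"])
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(60)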

Once the status is ‘succeeded’ you can now get a model ID to use for predictions (completions) …

result = openai.FineTune.retrieve(id=job_id)
model = result["fine_tuned_model"]

Predicting HXL tags with our fine-tuned GPT-3 model

We now have a model, let’s see what it can do!

To call GPT-3 you can use the OpenAI Python library's 'create' method. It's worth checking out the documentation to see what parameters you can tune.

def create_prediction_dataset_from_resources(resources):
    """
    Generate a list of model column-level prompts from a list of resources (tables).

    It will only output prompts where the sample data for the column didn't contain nans.

    Parameters
    ----------
    resources : list
        A list of dictionaries containing the resource name, columns, first_row, and first_nrows.

    Returns
    -------
    prediction_data : list
        A list of dictionaries containing GPT-3 prompts (one per column in resource table)
    """
    prediction_data = []
    for index, d in resources.iterrows():
        cols = d["columns"][1:-1].split(",")
        hdxs = d["first_row"][1:-1].split(",")
        data = d["first_nrows"]
        has_hxl_header = d["has_hxl_header"]
        if len(cols) == len(hdxs) and len(cols) > 1:
            ld = len(data) - 1
            # Loop through columns
            for i in range(0, len(cols)):
                if i < len(hdxs) and i < data.shape[1]:
                    prompt = get_prompt(cols[i], data.iloc[:, i])
                    # Skip any prompts with at least two nan values in sample data
                    if "nan, nan" not in prompt:
                        r = {
                            "prompt": prompt
                        }
                        # If we were called with HXL tags (ie for test set), populate 'expected'
                        if has_hxl_header == True:
                            hdx = re.sub("'|\"| ", "", hdxs[i])
                            # Row has HXL tags, but this particular column doesn't have tags
                            if hdx == "nan":
                                continue
                            else:
                                r["expected"] = hdx
                        prediction_data.append(r)
    return prediction_data

def make_gpt3_prediction(prompt, model, temperature=0.99, max_tokens=13):
    """
    Wrapper to call GPT-3 to make a prediction (completion) on a single prompt.

    Parameters
    ----------
    prompt : str
        Prompt to use for prediction
    model : str
        GPT-3 model to use
    temperature : float
        Temperature to use for sampling
    max_tokens : int
        Maximum number of tokens to use for sampling

    Returns
    -------
    result : dict
        Dictionary with prompt, predicted, and
        log probabilities of each completed token
    """
    result = {}
    result["prompt"] = prompt
    model_result = openai.Completion.create(
        engine=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=["\n"],
        logprobs=1,
    )
    result["predicted"] = model_result["choices"][0]["text"].replace(" ", "")
    result["logprobs"] = model_result["choices"][0]["logprobs"]["top_logprobs"]
    return result

def make_gpt3_predictions(
    sample_size, prediction_data, model, temperature=0.99, max_tokens=13, logprob_cutoff=-0.01
):
    """
    Wrapper to call GPT-3 to make predictions on test file for sample_size samples.

    Parameters
    ----------
    sample_size : int
        Number of predictions to make from test file
    prediction_data : list
        List of dictionaries with prompts
    model : str
        GPT-3 model to use
    temperature : float
        Temperature to use for sampling
    max_tokens : int
        Maximum number of tokens to use for sampling
    logprob_cutoff : float
        Logprob cutoff for filtering out low probability tokens

    Returns
    -------
    results : list
        List of dictionaries with prompt, predicted, predicted_log_prob_cutoff
    """
    results = []
    prediction_data = sample(prediction_data, sample_size)
    for i in range(0, sample_size):
        prompt = prediction_data[i]["prompt"]
        res = make_gpt3_prediction(prompt, model, temperature, max_tokens)

        # Filter out low logprob predictions
        pred = ""
        seen_tokens = []
        for w in res["logprobs"]:
            token = list(w.keys())[0]
            prob = w[token]
            if prob > logprob_cutoff and token not in seen_tokens:
                pred += token
                if "+" not in token:
                    seen_tokens.append(token)
            else:
                break
        pred = re.sub(r" |\+$|\+v_$", "", pred)

        r = {
            "prompt": prompt,
            "predicted": res["predicted"],
            "predicted_log_prob_cutoff": pred,
            # "logprobs": res["logprobs"]
        }
        # For test sets we have expected values, add back for performance reporting
        if "expected" in prediction_data[i]:
            r["expected"] = prediction_data[i]["expected"].replace(" ", "")
        results.append(r)
    return results

Which we call with the following, limiting to 500 prompts …

# Generate the prompts we want GPT-3 to complete
print("Building model input ...")
prediction_data = create_prediction_dataset_from_resources(X_test_resources)

# How many predictions to try from the test set
sample_size = 500

# Make the predictions
print("Making GPT-3 predictions (completions) ...")
results = make_gpt3_predictions(
    sample_size, prediction_data, model, temperature=0.99, max_tokens=20, logprob_cutoff=-0.001
)

This yields the following results …

def output_prediction_metrics(results, prediction_field="predicted_post_processed"):
    """
    Prints out model performance report if provided results in the format:

    [
        {
            'prompt': ' \'ISO3\' | "[\'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\', \'RWA\']"',
            'predicted': ' #country+code+iso3+v_iso3+',
            'predicted_post_processed': '#country+code',
            'expected': '#country+code'
        },
        ... etc ...
    ]

    Parameters
    ----------
    results : list
        See above for format
    prediction_field : str
        Field name of element with prediction. Handy for comparing raw and post-processed predictions.
    """
    y_test = []
    y_pred = []
    y_justtag_test = []
    y_justtag_pred = []
    for r in results:
        if "expected" not in r:
            print("Provided results do not contain expected values.")
            sys.exit()
        y_pred.append(r[prediction_field])
        y_test.append(r["expected"])
        expected_tag = r["expected"].split("+")[0]
        predicted_tag = r[prediction_field].split("+")[0]
        y_justtag_test.append(expected_tag)
        y_justtag_pred.append(predicted_tag)

    print(f"GPT-3 results for {prediction_field}, {len(results)} predictions ...")
    print("\nJust HXL tags ...\n")
    print(f"Accuracy: {round(accuracy_score(y_justtag_test, y_justtag_pred),2)}")
    print(
        f"Precision: {round(precision_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"Recall: {round(recall_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"F1: {round(f1_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0),2)}"
    )

    print(f"\nTags and attributes with {prediction_field} ...\n")
    print(f"Accuracy: {round(accuracy_score(y_test, y_pred),2)}")
    print(
        f"Precision: {round(precision_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"Recall: {round(recall_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )
    print(
        f"F1: {round(f1_score(y_test, y_pred, average='weighted', zero_division=0),2)}"
    )

    return


output_prediction_metrics(results, prediction_field="predicted")

GPT-3 results for predicted, 500 predictions ...

Just HXL tags ...

Accuracy: 0.99
Precision: 0.99
Recall: 0.99
F1: 0.99

Tags and attributes with predicted ...

Accuracy: 0.0
Precision: 0.0
Recall: 0.0
F1: 0.0

Uhhhh!? That’s, well …. terrible. Predicting just the HXL tag worked really well, but predicting tag and attributes, not so much.

Let’s look at some of the failed predictions …


{
"prompt": " 'gho (code)' | \"['mort_100', 'mort_100', 'mort_100', 'mort_100', 'mort_100', 'mort_100', 'mort_100', 'mort_100']\"",
"predicted": "#indicator+code+v_hor_funder_",
"expected": "#indicator+code"
}
{
"prompt": " 'region (code)' | \"['afr', 'afr', 'afr', 'afr', 'afr', 'afr', 'afr', 'afr']\"",
"predicted": "#region+code+v_reliefweb+f",
"expected": "#region+code"
}
{
"prompt": " 'dataid' | \"['310633', '310634', '310635', '310636', '310629', '310631', '310630', '511344']\"",
"predicted": "#meta+id+fts_internal_view_all",
"expected": "#meta+id"
}
{
"prompt": " 'gho (url)' | \"['https://www.who.int/data/gho/indicator-metadata-registry/imr-details/5580', 'https://www.who.int/data/gho/indicator-metadata-registry/imr-details/5580']\"",
"predicted": "#indicator+url+name+has_more_",
"expected": "#indicator+url"
}
{
"prompt": " 'year (display)' | \"['2014', '2014', '2014', '2014', '2014', '2014', '2014', '2014']\"",
"predicted": "#date+year+name+tariff+for+",
"expected": "#date+year"
}
{
"prompt": " 'byvariablelabel' | \"[nan]\"",
"predicted": "#indicator+label+code+placeholder+Hubble",
"expected": "#indicator+label"
}
{
"prompt": " 'gho (code)' | \"['ntd_bejelstatus', 'ntd_pintastatus', 'ntd_yawsend', 'ntd_leishcend', 'ntd_leishvend', 'ntd_leishcnum_im', 'ntd_leishcnum_im', 'ntd_leishcnum_im']\"",
"predicted": "#indicator+code+v_ind+olk_ind",
"expected": "#indicator+code"
}
{
"prompt": " 'enddate' | \"['2002-12-31', '2003-12-31', '2004-12-31', '2005-12-31', '2006-12-31', '2007-12-31', '2008-12-31', '2009-12-31']\"",
"predicted": "#date+enddate+enddate+usd+",
"expected": "#date+end"
}
{
"prompt": " 'endyear' | \"['2013', '2013', '2013', '2013', '2013', '2013', '2013', '2013']\"",
"predicted": "#date+year+endyear+end_of_",
"expected": "#date+year+end"
}
{
"prompt": " 'country (code)' | \"['dnk', 'dnk', 'dnk', 'dnk', 'dnk', 'dnk', 'dnk', 'dnk']\"",
"predicted": "#country+code+v_iso2+v_",
"expected": "#country+code"
}

Interesting. It seems the model completed and captured the correct tag and attributes almost perfectly, then added some extra attributes at the end. So for example …

"predicted": "#country+code+v_iso2+v_",
"expected": "#country+code"

Let’s see how often the expected tags and attributes occurred in the first half of the predictions …

passes = 0
fails = 0
for r in results:
    if r["predicted"].startswith(r["expected"]):
        passes += 1
    else:
        fails += 1
        # print(json.dumps(r, indent=4, sort_keys=False))

print(f"Out of {passes + fails} predictions, the expected tags and attributes were in the predicted tags and attributes {round(100*passes/(passes+fails),1)}% of the time.")

Out of 500 predictions, the expected tags and attributes were in the predicted tags and attributes 99.0% of the time.

Out of 500 predictions, the expected tags and attributes were in the predicted tags and attributes 99% of the time. Put another way, the expected values were the first part of most predictions.

So GPT-3 has great accuracy for predicting tags and attributes but adds extra attributes at the end.

So, how to exclude those extra tokens?

Well, it turns out that GPT-3 returns log probabilities for each token. As you will notice above, we also calculated a prediction where we stop accepting tokens once the log probability drops below some cutoff value …

# Filter out low logprob predictions
pred = ""
seen_tokens = []
for w in res["logprobs"]:
    token = list(w.keys())[0]
    prob = w[token]
    if prob > logprob_cutoff and token not in seen_tokens:
        pred += token
        if '+' not in token:
            seen_tokens.append(token)
    else:
        break
pred = re.sub(r" |\+$|\+v_$", "", pred)

Let’s see how that performed, assuming a cutoff of -0.001 ..

output_prediction_metrics(results, prediction_field="predicted_log_prob_cutoff")

Just HXL tags ...

Accuracy: 0.99
Precision: 1.0
Recall: 0.99
F1: 0.99

Tags and attributes with predicted_log_prob_cutoff ...

Accuracy: 0.94
Precision: 0.99
Recall: 0.94
F1: 0.95

That’s pretty good, 0.94 for tags and attributes. Since we know that the correct tags and attributes occur in the prediction 99% of the time, we should be able to do a little better with some tuning of the log probability cutoff and maybe with some post-processing.
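
As a rough sketch of how that tuning might look, if we keep the raw logprobs in each result (they are commented out in make_gpt3_predictions above), we can re-apply the cutoff filter offline for several candidate values without making any more API calls. The cutoff values below are arbitrary examples …

import re
from sklearn.metrics import accuracy_score

def apply_cutoff(logprobs, cutoff):
    # Same filtering logic as in make_gpt3_predictions, applied offline
    pred, seen_tokens = "", []
    for w in logprobs:
        token = list(w.keys())[0]
        if w[token] > cutoff and token not in seen_tokens:
            pred += token
            if "+" not in token:
                seen_tokens.append(token)
        else:
            break
    return re.sub(r" |\+$|\+v_$", "", pred)

for cutoff in [-0.01, -0.005, -0.001, -0.0005]:
    y_pred = [apply_cutoff(r["logprobs"], cutoff) for r in results]
    y_true = [r["expected"] for r in results]
    print(f"cutoff={cutoff}: accuracy={round(accuracy_score(y_true, y_pred), 2)}")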

Conclusions and Future Work

The above is a quick analysis to see how GPT-3 might be applied to predicting metadata, specifically HXL tags on humanitarian datasets. It performs really well on this task and has a lot of potential for similar metadata prediction tasks.

More work is needed to refine the approach of course, such as:

  1. Trying other models (I used 'ada' above) to see if this improves performance (though it will cost more)
  2. Model hyperparameter tuning. The log probability cutoff will likely be very important
  3. More prompt engineering. Including the table's full column list in the prompt might provide better context, as well as overlaying columns on two-row header tables
  4. More preprocessing. Not much was done for this article, blindly taking tables extracted from CSV files, so the data can be a bit messy

That said, some great potential here I feel for using GPT-3 to predict metadata on datasets.

More to follow soon!

References

[1] Mark Lowcock, Under-Secretary-General for Humanitarian Affairs and Emergency Relief Coordinator, Anticipation saves lives: How data and innovative financing can help improve the world’s response to humanitarian crises (2019)

[2] Sarah Telford, Opinion: Humanitarian world is full of data myths. Here are the most popular (2018)

[3] Vinitra Swamy et al, Machine Learning for Humanitarian Data: Tag Prediction using the HXL Standard (2019)

A notebook used for this analysis can be found here.


Matt is the Head of Data Science at DataKind, helping social sector organizations harness the power of data science and AI in the service of humanity.