
Data labeling is a critical component of machine learning projects, and the old adage "garbage in, garbage out" captures why it matters. Labeling involves creating annotated datasets for training and evaluation, but the process can be time-consuming and expensive, especially for projects with large amounts of data. What if we could use the advances in LLMs to reduce the cost and effort involved in data labeling tasks?
GPT-4 is a state-of-the-art language model developed by OpenAI. It has a remarkable ability to understand and generate human-like text and has been a game changer in the natural language processing (NLP) community and beyond. In this blog post, we’ll explore how you can use GPT-4 to bootstrap labels for various tasks. This can significantly reduce the time and cost involved in the labeling process. We’ll focus on sentiment classification to demonstrate how prompt engineering can enable you to create accurate and reliable labels using GPT-4 and how this technique can be used for much more powerful things as well.
Leveraging GPT-4’s Predictions for Data Pre-labeling
As in writing, editing is usually less strenuous than composing the original work, which is why starting with pre-labeled data is more attractive than starting from a blank slate. The idea of using GPT-4 as a prediction engine to pre-label data stems from its ability to understand context and generate human-like text. Leveraging it this way can reduce the manual effort required for data labeling, lower costs, and make the labeling process less mundane.
So how do we do this? If you’ve used GPT models, you’re probably familiar with prompts. Prompts set the context for the model before it begins generating output and can be tweaked and engineered (i.e. prompt engineering) to help the model deliver highly specific results. This means we can create prompts that GPT-4 can use to generate text that looks like model predictions. For our use case, we will craft our prompts in a way that guides the model toward producing the desired output format as well.
Let’s take a straightforward example of sentiment analysis. If we are trying to classify the sentiment of a given string of text as positive, negative, or neutral, we could provide a prompt like:
"Classify the sentiment of the following text as 'positive', 'negative', or 'neutral': <input_text>"
Once we have a well-structured prompt, we can use the OpenAI API to generate predictions. Here’s an example using Python (the snippet below calls gpt-3.5-turbo; to use GPT-4, swap it in via the model parameter):
import openai
import re

openai.api_key = "<your_api_key>"

def get_sentiment(input_text):
    # Ask the model to respond in a predictable JSON-like format
    prompt = f"Respond in the json format: {{'response': sentiment_classification}}\nText: {input_text}\nSentiment (positive, neutral, negative):"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_tokens=40,
        n=1,
        stop=None,
        temperature=0.5,
    )
    response_text = response.choices[0].message['content'].strip()
    # Extract the sentiment keyword from the model's response
    sentiment = re.search("negative|neutral|positive", response_text).group(0)
    # Add input_text back in for the result
    return {"text": input_text, "response": sentiment}
We can run this with a single example to inspect the output we’re receiving from the API.
# Test single example
sample_text = "I had a terrible time at the party last night!"
sentiment = get_sentiment(sample_text)
print("Resultn",f"{sentiment}")
Result:
{'text': 'I had a terrible time at the party last night!', 'response': 'negative'}
Once we’re satisfied with our prompt and the results we’re getting, we can scale this up to our entire dataset. Here, we’ll assume a text file with one example per line.
import json

input_file_path = "input_texts.txt"
output_file_path = "output_responses.json"

with open(input_file_path, "r") as input_file, open(output_file_path, "w") as output_file:
    examples = []
    for line in input_file:
        text = line.strip()
        if text:
            # convert_ls_format reshapes each prediction into Label Studio's
            # import format (defined in the full example notebook)
            examples.append(convert_ls_format(get_sentiment(text)))
    output_file.write(json.dumps(examples))
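The convert_ls_format helper used above is defined in the full example notebook. A minimal sketch of what it could look like is shown below, assuming the sentiment labeling configuration used later in this post (a Choices control named "sentiment" applied to a Text field named "my_text", with the source text stored under the reviewText key):

def convert_ls_format(sentiment_result):
    # Sketch only: reshape {"text": ..., "response": ...} into a Label Studio
    # task with a pre-annotation ("prediction") attached
    return {
        "data": {"reviewText": sentiment_result["text"]},
        "predictions": [
            {
                "model_version": "gpt-3.5-turbo",
                "result": [
                    {
                        "from_name": "sentiment",
                        "to_name": "my_text",
                        "type": "choices",
                        # Capitalized to match the Choice values in the labeling config
                        "value": {"choices": [sentiment_result["response"].capitalize()]},
                    }
                ],
            }
        ],
    }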
We can import the data with pre-labeled predictions into Label Studio and have reviewers verify or correct the labels. This approach significantly reduces the manual work required for data labeling, as human reviewers only need to validate or correct the model-generated labels rather than annotate the entire dataset from scratch. See our full example notebook here.
Note that in most situations, OpenAI is allowed to use any information sent to their APIs to train their models further, so it’s important not to send protected or private data to these APIs for labeling if we don’t want to expose the information more broadly.
Reviewing Pre-labeled Data in Label Studio
Once we have our pre-labeled data ready, we will import it into a data labeling tool, such as Label Studio, for review. This section will guide you through setting up a Label Studio project, importing the pre-labeled data, and reviewing the annotations.

Step 1: Install and Launch Label Studio
First, you need to have Label Studio installed on your machine. You can install it using pip:
pip install label-studio
After installing Label Studio, launch it by running the following command:
label-studio
This will open Label Studio in your default web browser.
Step 2: Create a New Project
Click on "Create Project" and enter a project name, such as "Review Bootstrapped Labels." Next, you need to define the labeling configuration. For Sentiment Analysis, we can use the text Sentiment Analysis Text Classification.
These templates are configurable, so if we want to change any of the properties, it’s really straightforward. The default labeling configuration is shown below.
<View>
  <Header value="Choose text sentiment:"/>
  <Text name="my_text" value="$reviewText"/>
  <Choices name="sentiment" toName="my_text" choice="single" showInline="true">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
    <Choice value="Neutral"/>
  </Choices>
</View>
Click "Create" to finish setting up the project.
Step 3: Import Pre-labeled Data
To import the pre-labeled data, click the "Import" button and select the JSON file generated earlier (e.g., "output_responses.json"). The data will be imported along with the pre-populated predictions.
Step 4: Review and Update Labels
After importing the data, you can review the model-generated labels. The annotation interface will display the pre-labeled sentiment for each text sample, and reviewers can either accept or correct the suggested label.
You can improve quality further by having multiple annotators review each example.
By utilizing GPT-4-generated labels as a starting point, the review process becomes much more efficient, and reviewers can focus on validating or correcting the annotations rather than creating them from scratch.
Step 5: Export Labeled Data
Once the review process is complete, you can export the labeled data by clicking the "Export" button in the "Data Manager" tab. Choose the desired output format (e.g., JSON, CSV, or TSV), and save the labeled dataset for further use in your machine learning project.
Cost Analysis
One question rolling around in my mind was: "How much did this cost me at the end of the day?"
Note: Prices shown below reflect current data for the author at the time of publication. Pricing may differ in the future or based on geographic location.
For language models, OpenAI charges based on the number of tokens in your request. Tokens roughly correspond to the words in a query, though special characters and emojis can each count as an individual token. OpenAI’s pricing page states, "You can think of tokens as pieces of words, where 1,000 tokens is about 750 words." For more information on how tokens are counted, see this page.
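If you want to count tokens locally before sending anything to the API, one option is OpenAI’s tiktoken library. A minimal sketch:

import tiktoken

# Load the tokenizer used by the target model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "I had a terrible time at the party last night!"
print(len(encoding.encode(text)))  # number of tokens this text contributes to a prompt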
The cost per token differs according to the model used. For example, the GPT-4 8K-context model costs $0.03/1K tokens for prompt tokens and $0.06/1K tokens for generated (completion) tokens, while the GPT-3.5-turbo model costs $0.002/1K tokens.

To estimate the cost of pre-labeling a dataset, we can use a simple formula that considers the number of examples in the dataset, the price per token for prompts and completions, and the average number of tokens per example:

Total Cost = N × ((Prompt Price / 1,000) × (Prompt Tokens + Avg. Example Tokens) + (Completion Price / 1,000) × Avg. Completion Tokens)

Where:

- N is the number of examples in the dataset
- Prompt Price and Completion Price are the model’s prices per 1K prompt tokens and per 1K completion tokens
- Prompt Tokens is the length of the prompt instructions in tokens
- Avg. Example Tokens is the average length of an example in tokens
- Avg. Completion Tokens is the average length of the generated response in tokens
Additionally, we can calculate the total number of tokens in the dataset as follows:

Total Tokens = N × (Prompt Tokens + Avg. Example Tokens + Avg. Completion Tokens)

Where:

- N, Prompt Tokens, Avg. Example Tokens, and Avg. Completion Tokens are defined as above
Using this formula, we can estimate the cost of pre-labeling a dataset by multiplying the number of examples by the sum of the prompt cost and the completion cost, adjusted for the average number of tokens per example.
For instance, suppose we have a dataset with 1,000 examples that we want to pre-label for sentiment analysis with GPT-4, with a prompt price of $0.03 per 1K tokens, a completion price of $0.06 per 1K tokens, a prompt length of 20 tokens, an average example length of 80 tokens, and an average result length of 3 tokens. The total cost of pre-labeling would be:

Total Cost = 1,000 × (($0.03 / 1,000) × (20 + 80) + ($0.06 / 1,000) × 3) = 1,000 × ($0.003 + $0.00018) = $3.18
In this example, pre-labeling the dataset using GPT-4 would cost $3.18. Note: the same dataset with GPT-3.5-turbo would cost ~$0.21.
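The same arithmetic is easy to capture in a small helper function. Below is a minimal sketch (the function name and parameters are illustrative, not part of any library):

def estimate_prelabel_cost(num_examples, prompt_price_per_1k, completion_price_per_1k,
                           prompt_tokens, avg_example_tokens, avg_completion_tokens):
    # Per-example cost: prompt (instructions + example text) plus generated tokens
    prompt_cost = (prompt_price_per_1k / 1000) * (prompt_tokens + avg_example_tokens)
    completion_cost = (completion_price_per_1k / 1000) * avg_completion_tokens
    return num_examples * (prompt_cost + completion_cost)

# GPT-4 (8K context) example from above: ~$3.18
print(estimate_prelabel_cost(1000, 0.03, 0.06, 20, 80, 3))

# GPT-3.5-turbo for comparison: ~$0.21
print(estimate_prelabel_cost(1000, 0.002, 0.002, 20, 80, 3))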
If our pre-labeling task requires less specialized knowledge, we may want to use a less robust model to save cost. It’s usually worth manually reviewing a handful of examples with varying levels of complexity to get a sense of how accurate one model is compared to another. For information on the models, see the OpenAI Pricing page.
Beyond Sentiment Analysis: Label Any NLP Task

The great thing about this approach is that it’s not limited to sentiment analysis. We can pre-label data for various NLP tasks using GPT-4 by using prompt engineering to guide it toward the correct output. Here are a few examples of prompts for different NLP tasks, all of which can be reviewed and labeled in Label Studio (the example responses were generated using GPT-4). Remember that more complex tasks may require longer prompts, which add to your token count and, consequently, your cost.
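All of the prompts below follow the same pattern as the sentiment example, so one approach is to generalize the earlier get_sentiment function into a small helper that takes the task-specific prompt as an argument. A sketch, assuming the same openai setup as before:

def get_prediction(prompt, model="gpt-3.5-turbo", max_tokens=256):
    # Generic version of get_sentiment: send any labeling prompt and
    # return the raw model response for downstream parsing
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5,
        max_tokens=max_tokens,
    )
    return response.choices[0].message['content'].strip()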
Summarization
Prompt:
Respond in the json format with a summary for the following text: {'summary': summary}
Text: I love going to the park on a sunny day. The customer service was terrible; they were rude and unhelpful. I am neither happy nor sad about the new policy changes. The cake was delicious and the presentation was fantastic. I had a really bad experience with the product; it broke after two days.
Summary:
Response:
{'summary': 'Artificial intelligence is a rapidly developing technology that can learn from data and make decisions without being explicitly programmed. It has the potential to help solve some of the world's most pressing problems, but there are concerns about risks and ethical considerations such as bias and job displacement. It is important to consider AI's impact and ensure responsible and beneficial use.'}
Taxonomy
Prompt:
Respond in the JSON format with a diagnosis for the following text: {'diagnosis': diagnosis}, where the possible diagnoses are Pneumonia, Acute bronchitis, Asthma exacerbation, Congestive heart failure, Chronic obstructive pulmonary disease (COPD), Pulmonary embolism, Pleural effusion, Interstitial lung disease, or Lung cancer.
Text: The patient presented with shortness of breath and a persistent cough. On physical examination, crackles were heard in the lower lung fields. Chest X-ray revealed bilateral infiltrates consistent with pneumonia. The patient was started on antibiotics and showed improvement in symptoms within 48 hours.
Diagnosis:
Response:
{'diagnosis': 'Pneumonia'}
Named Entity Recognition (NER)
Prompt:
You are an expert NER labeler. Only return JSON. Do not return explanations. Return the CoNLL format for the following sentence in JSON format.
Text: Harry Potter was a student at Hogwarts.
Response:
{
  "data": [
    {"word": "Harry", "entity": "B-PER"},
    {"word": "Potter", "entity": "I-PER"},
    {"word": "was", "entity": "O"},
    {"word": "a", "entity": "O"},
    {"word": "student", "entity": "O"},
    {"word": "at", "entity": "O"},
    {"word": "Hogwarts", "entity": "B-LOC"}
  ]
}
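To review these in Label Studio, the word/entity pairs need to be converted into span pre-annotations with character offsets. A rough sketch is below, assuming a Labels control named "label" applied to a Text field named "text" (the names must match your actual NER labeling configuration):

def ner_response_to_ls_prediction(sentence, ner_data):
    # Walk the sentence, find each word's character offsets, and merge
    # consecutive B-/I- tags into a single entity span
    spans = []
    cursor = 0
    current = None
    for item in ner_data["data"]:
        start = sentence.index(item["word"], cursor)
        end = start + len(item["word"])
        cursor = end
        tag = item["entity"]
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = {"start": start, "end": end, "label": tag[2:]}
        elif tag.startswith("I-") and current:
            current["end"] = end
        else:  # "O" closes any open span
            if current:
                spans.append(current)
                current = None
    if current:
        spans.append(current)
    return {
        "data": {"text": sentence},
        "predictions": [{
            "result": [
                {
                    "from_name": "label",
                    "to_name": "text",
                    "type": "labels",
                    "value": {
                        "start": s["start"],
                        "end": s["end"],
                        "text": sentence[s["start"]:s["end"]],
                        "labels": [s["label"]],
                    },
                }
                for s in spans
            ]
        }],
    }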
In my opinion, there’s no substitute for human evaluation, but using the GPT models to give us a starting point can be a huge advantage.
Conclusion
Labeling data is always difficult, and sometimes just getting a starting point is a huge advantage. In this blog, we showed how you can use OpenAI’s GPT models to generate predictions that serve as the starting point for your data labeling workflows. This process can significantly reduce the amount of human effort involved and focus labelers’ attention where it adds the most value. Check out the resources below for more info on the topics presented in this blog.
Resources
Full example notebook – notebook with all code ready to run in Colab
Label Studio – Open Source data labeling tool
OpenAI Pricing page – details for the pricing estimate in this post