From Chaos to Clarity: Streamlining Data Cleansing Using Large Language Models

Cleaning up survey responses using OpenAI’s GPT Model. Full Code with Github link.

Naresh Ram
Towards Data Science


Image by Dall-E 2. Generated and modified by the author.

In the digital age, accurate and reliable data is paramount for businesses striving to deliver personalized experiences and make informed decisions[1]. However, the sheer volume and complexity of data often pose significant challenges requiring many hours of tedious and manual work. Enter the game-changing technology of large language models (LLMs). These advanced AI tools, with their natural language processing capabilities and pattern recognition, have the potential to revolutionize the process of cleansing data to make it more usable.

Among the wrenches and screwdrivers in the data scientist's tool chest, LLMs are a powerful new addition for enhancing data quality. A proverbial whack of this hammer can unlock actionable insights and ultimately pave the way for better customer experiences.

That said, let’s drill right into the use case that I will be using as an example today.

Photo by Scott Graham on Unsplash

The Use Case

The worst thing you could do when conducting a survey among students is to leave a factual field as free-form text! You can imagine some of the responses we got.

Jokes aside, one of our clients, Study Fetch, an AI-powered platform that uses course material to create personalized all-in-one study sets for students, conducted a survey among university students. After receiving a whopping 10K+ responses, their CEO and Co-Founder, Esan Durrani, stumbled upon a little hiccup. Turns out, the “major” field in the survey was a free-form text box, meaning respondents could type in whatever they pleased. Now, as data scientists, we know that’s not the brightest move if you want to crunch some statistical numbers. So, the raw data from the survey ended up looking like this…

Anthropology 
Chem E
Computer Science
Business and Law
Drama
cs
IMB

Oh my! Get ready to grab your Excel and embark on a sorting adventure that might take you a mere hour or, who knows, maybe even three. Only then will this data heresy be thoroughly purged.

Yet, fear not, as we have the hammer of the Large Language Model (LLM).

As a wise man once said, if a hammer is all you have, everything looks like a nail. And boy, doesn’t the data cleansing job look like the most perfect nail?

We can simply ask our friendly neighborhood LLM to classify these into known majors. Specifically, OpenAI’s Generative Pre-trained Transformers (GPT), an LLM that powers the popular Chatbot app ChatGPT, will work for this case. GPT models use upwards of 175 billion parameters and have been trained on 2.6 billion stored web pages scraped from Common Crawl, an open dataset. Additionally, through a technique known as reinforcement learning from human feedback (RLHF), trainers can nudge and prod the model into providing more accurate and useful responses. [2]

I think for our purpose, 175 billion+ parameters should do just fine, as long as we are able to come up with the right prompt.

Photo by Kelly Sikkema on Unsplash

It’s all in the Prompt

Ryan and Esan, from the AI company whose bread-and-butter is writing great prompts, proposed the first version of our prompt. It was a great one and did work very well using language inference[3], but there were two things that could be improved:

  • It was written to work for one record
  • It was written as a ‘Completion’ using the Da Vinci Model (My bank account recoiled in fear at the mere mention of IT)

It would cost us too much and that simply wasn’t going to do. So, Ryan and I independently rewrote the prompt as a chat prompt using ‘gpt-3.5-turbo’ to perform bulk action. OpenAI’s prompt best practices and the course ChatGPT Prompt Engineering for Developers came in handy for me. We went through a few more iterations of ideate, implement, analyze, and reform and we had a good working version.

Without further ado, here is the prompt after revision 2:

1. Arts and Humanities: Literature and arts.
2. Social Sciences: psychology, sociology, and political science
3. Business and Management: marketing, finance, economics, MBA
4. Life Sciences: Biology, environmental,agriculture-related
5. Engineering and Technology: Electrical, mechanical, etc.
6. Medicine Health and Allied Sciences: Nursing, medical school
7. All Other Programs: Everything else.

The above information pertains to categories of undergraduate programs.
Categorize the below data into one of the program categories above.

Desired Format: CSV of data row, program name

Data
###
Drama
IMB
Health sciences
MBA
Mechanical Engineering
###

The response from the LLM for this prompt was:

Drama, Arts and Humanities: Literature and arts
IMB, All Other Programs: Everything else
Health sciences, Medicine Health and Allied Sciences: Nursing, medical school
MBA, Business and Management: marketing, finance, economics, MBA
Mechanical Engineering, Engineering and Technology: Electrical, mechanical, etc.

This will work, sort of. But I didn’t quite like the redundant text with the long program names repeating. With LLMs, text is tokens and tokens cost money. You see, my programming skills were forged in the fiery depths of the Dot Com Bust. And let me tell you, I never pass up an opportunity for some cost-saving optimizations.

So, I changed the prompt slightly in the ‘Desired Format’ section. I asked the model to output just the ordinal number of the survey responses (E.g. 1 for Drama above) and the ordinal number of the program (E.g. 1 for Literature). Then Ryan suggested I should request a JSON output instead of CSV to make it simpler to parse. He also recommended I add an ‘example output’ section, an excellent suggestion.

The final prompt is as follows (simplified for clarity):

1. Arts and Humanities: Literature and arts.
2. Social Sciences: psychology, sociology, and political science
3. Business and Management: marketing, finance, economics, MBA
4. Life Sciences: Biology, environmental,agriculture-related
5. Engineering and Technology: Electrical, mechanical, etc.
6. Medicine Health and Allied Sciences: Nursing, medical school
7. All Other Programs: Everything else.

The above information pertains to categories of undergraduate programs.
Categorize the below data into one of the program categories above.

Desired Format: JSON format of an array with response number, program number

Example output: [[1,7],[2,1],[3,5],[4,8],[5,7]]

Data
###
1. Drama
2. IMB
3. Health sciences
4. MBA
5. Mechanical Engineering
###

The full prompt we used can be viewed on the GitHub link here.

The output from the model:

[
  [1, 1],
  [2, 3],
  [3, 6],
  [4, 3],
  [5, 5]
]

So, as discussed earlier, the output from the model is a mapping between the ordinal numbers of survey responses and the categories we defined. Take the first line for example: 1,1. That means 1 is the response number and 1 is the corresponding mapped program number. Survey response 1 is “Drama” and the mapped program 1 is “Arts and Humanities”. This seems right! Drama in its proper #1 place, all eyes on it.

While the output at first glance may look like embeddings (used in clustering and dimensionality reduction), it is simply the same mapping information expressed as ordinal positions. Besides providing some cost benefits on token usage, the numbers are easier to parse.

We can now translate the original survey response in the file into meaningful majors, do aggregation, and gain valuable actionable insights.
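
To make the translation concrete, here is a tiny sketch of how the pairs above map back to text. The variable names are illustrative, not from our actual script:

// Illustrative only: turning the [response number, program number] pairs back into labels.
const categories = [
  "Arts and Humanities", "Social Sciences", "Business and Management",
  "Life Sciences", "Engineering and Technology",
  "Medicine Health and Allied Sciences", "All Other Programs"
];
const responses = ["Drama", "IMB", "Health sciences", "MBA", "Mechanical Engineering"];
const pairs = [[1, 1], [2, 3], [3, 6], [4, 3], [5, 5]];

for (const [responseNum, programNum] of pairs) {
  console.log(responses[responseNum - 1] + " -> " + categories[programNum - 1]);
}
// Drama -> Arts and Humanities, IMB -> Business and Management, and so on.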

But wait, I'm not going to sit in front of my computer, type each block of survey responses into the browser, and tally the mappings by hand. Besides being mind-numbing, that approach would have an error rate that simply wouldn't do.

What we need is some good old automation. Enter the API …

Photo by Laura Ockel on Unsplash

API to the Rescue

As you may be aware, an Application Programming Interface (API) allows our program to interact with third-party services efficiently. While many people are accomplishing impressive feats with ChatGPT, the real potential of language models lies in using the API to seamlessly integrate natural language capabilities into an application, so the integration is imperceptible to the users. Much like the incredible science and technology that go into the phone or computer you are using to read this article.

If you don't already have it, you can request access to the API here, https://openai.com/blog/openai-api [4]. Once you sign up and get your API key, the specification can be found here. Some really helpful examples with code samples can be found here. The playground is a nice feature for testing the prompt with various settings before you put it into your code [5].

We will be using the chat completion API using REST. A sample payload of the call is as follows:

{
  "model": "gpt-3.5-turbo",
  "temperature": 0,
  "n": 1,
  "messages": [
    {"role": "user", "content": "Hello, I'm a nail."}
  ]
}
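
If you prefer to call the REST endpoint directly rather than use the Node library we use later, a minimal sketch looks like this (assuming Node 18+ for the built-in fetch and an OPENAI_API_KEY environment variable):

// Minimal REST call sketch; run inside an async function or with top-level await.
const response = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + process.env.OPENAI_API_KEY
  },
  body: JSON.stringify({
    model: "gpt-3.5-turbo",
    temperature: 0,
    n: 1,
    messages: [{ role: "user", content: "Hello, I'm a nail." }]
  })
});
const data = await response.json();
console.log(data.choices[0].message.content);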

Let's take a quick look at the parameters and their effects.

model

The only model open to the public at this point for chat completions is gpt-3.5-turbo. Esan had access to the GPT-4 model, which I was very jealous about. While gpt-4 is more accurate and hallucinates less [2], it is roughly 20 times more expensive, and for our needs, Mr. Turbo was quite adequate, thank you.

temperature

Next to the prompt, the temperature is one of the most important settings we can pass to the model. It can be set to a value between 0 and 2, as per the API docs. It has a significant impact [6], as it controls how much randomness is in the output, sort of like the amount of caffeine in your system before you start writing. A guide to values you can use for each application is given here [7].

For our use case, we simply want no variations. We want the engine to give us mappings as is and the same ones every single time. So, we used a value of 0.

n

The number of chat completion choices to generate. If we were doing this for creative writing and wanted more than one choice to select from, we could use 2 or even 3. For our case, n=1 (the default) works well.

messages

Each message has a role, which can be system, user, or assistant. The system role provides instructions and sets the context. The user role represents the prompt from the end user. The assistant role carries the model's previous responses, i.e., the conversation history. These roles structure the conversation and enable effective interaction between users and the AI assistant.
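
Purely for illustration, here is how the three roles might appear together in a multi-turn payload (not the exact messages we used):

// Hypothetical multi-turn conversation showing all three roles.
const messages = [
  { role: "system", content: "You are a helpful assistant that categorizes majors." },
  { role: "user", content: "Categorize: Drama" },
  { role: "assistant", content: "[[1,1]]" },        // the model's earlier reply
  { role: "user", content: "Categorize: Chem E" }   // the next user turn
];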

MODEL MAX TOKENS

This isn't a parameter we pass in the request, though a related parameter called max_tokens limits the total length of the response from the chat.

Firstly, a token can be thought of as a piece of a word. One token is approximately 4 characters in English. For example, the quote “The best way to predict the future is to create it” attributed to Abraham Lincoln and others, contains 11 tokens.

Image from the OpenAI Tokenizer. Generated by the author.

If you are thinking that a token is exactly a word, here is another example of 64 tokens, to show it isn’t all that straightforward.

Image from the OpenAI Tokenizer. Generated by the author.

Brace yourself for a shocking revelation: every emoji you include in your message adds a hefty toll of up to 6 tokens. That's right, your beloved smileys and winks are sneaky little token thieves! 😉💸
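
If you would rather count tokens in code than paste text into the web tokenizer, a rough sketch like the one below works. It assumes the gpt-3-encoder npm package; newer models use a slightly different encoding, so treat the counts as approximate:

// Approximate token counting (assumes: npm install gpt-3-encoder).
const { encode } = require('gpt-3-encoder');

const quote = "The best way to predict the future is to create it";
console.log(encode(quote).length); // roughly 11 tokens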

The model max token window is a technical limit. Your prompt (including any additional data you place into it) and the answer must all fit within the model max limit listed here. In the case of chat completions, the content, role, and all of the previous messages all consume tokens. If you remove a message from the input or the output (assistant messages), the model will lose all knowledge of it [8]. Like Dory as she helps find Chico, no Fabio, no Bingo, no Harpo, no Elmo?… Nemo!

For gpt-3.5-turbo, the model maximum limit is 4096 tokens or roughly 16K characters. For our use case, the prompt is roughly 2000 characters, each survey response is roughly 20 chars (average) and the mapping response is 7 characters. So, if we put N survey responses in each prompt, the max characters will be:

2000 + 20*N + 7*N should be less than 16,000.

Solving, we get an N value of less than 518, or roughly 500. Technically, we could put 500 survey responses in each request and cover our data in 20 requests. Instead, we chose to put 50 in each request and make 200 requests, as we were intermittently receiving abnormal responses when we put more than 50 survey responses in a single request. Once in a while, the service threw a temper tantrum! We're not sure if it's a chronic case of systemic petulance or if we just happened to stumble upon the grumpy side of luck.
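
Here is that back-of-the-envelope math as a quick sketch, using the rough averages mentioned above:

// Rough batch-size estimate: prompt + data + answer must fit the token window.
const maxChars = 4096 * 4;        // ~4 characters per token for gpt-3.5-turbo
const promptChars = 2000;         // our fixed prompt text
const perRecordChars = 20 + 7;    // average survey response + its mapping
const maxBatch = Math.floor((maxChars - promptChars) / perRecordChars);
console.log(maxBatch);            // ~518, though we settled on batches of 50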

So, how do we use this API we have? Let’s get to the good part, the code.

Photo by Markus Spiske on Unsplash

The Way of the Code

Node.js is a JavaScript runtime environment [9]. We will write a Node.js/JavaScript program that performs the actions described in this flow chart:

Flowchart of the program. Image by the Author.

My JavaScript skills aren't that great. I can write better Java, PHP, Julia, Go, C#, or even Python. But Esan insisted on Node, so JavaScript it is.

The entire code, the prompt, and the sample input are available at this GitHub link. However, let’s take a gander at the juiciest bits:

First, let's see how we read the CSV file in using the "csv-parser" Node library.

const fs = require('fs');
const csv = require('csv-parser');

const rows = [];
fs.createReadStream(fileName)
  .pipe(csv())
  .on('data', (data) => rows.push(data))
  .on('end', async () => {
    // Reading is done. Call the classifier here
    // ...
  });

Next, we call the classifier to generate the mappings.

for (let index = 0; index < totalDataPoints; ++index) {
  dataText += (index + 1).toString() + ". "
    + uniqueData[index] + "\n";
  requestCount++;
  if (requestCount > batchSize || index == totalDataPoints - 1) {
    dataText += "###\n";
    // We have loaded batchSize (50) responses.
    // Now construct the prompt
    // ...
  }
}

The prompt is then constructed from the categories, main prompt text, and the data from the CSV. We then send the prompt to the service using their OpenAI Node Library.

let prompt = categoriesText + mainPrompt + dataText;
let payload = {
  model: "gpt-3.5-turbo",
  temperature: 0,
  messages: [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
  ]
};

try {
  const response = await openai.createChatCompletion(payload);
  let mapping = JSON.parse(response.data.choices[0].message.content);
  // Here save the mappings
  // ...
} catch (error) {
  console.error('API Error:', error);
}
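
For completeness, the elided "save the mappings" step could look roughly like this. It is a sketch, not our exact code: batchStart (the index of the first survey response in the current batch) and categories (the seven program names) are illustrative names.

// Sketch: turn [response number, program number] pairs into a text -> program lookup.
for (const [responseNum, programNum] of mapping) {
  const surveyText = uniqueData[batchStart + responseNum - 1];
  dataMapping[surveyText] = categories[programNum - 1];
}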

Finally, when all the iterations are done, we can translate the srcCol text (the survey response) to the targetCol (the normalized program name), and write out the CSV.

const { stringify } = require('csv-stringify'); // from the csv-stringify package

for (let row of rows) {
  const srcVal = row[srcCol].trim();
  if (!row[targetCol] && dataMapping[srcVal])
    row[targetCol] = dataMapping[srcVal];
}

stringify(rows, {
  header: true
}, function (err, output) {
  fs.writeFile(__dirname + '/' + destFileName, output,
    function (err, result) {
      if (err) console.log('error', err);
    });
});

That bit of JavaScript wasn’t as hairy as I expected and it got done in 2 to 3 hours. I guess it always looks daunting until you get into it.

So, now that we have the code ready, it’s time for the final execution…

Photo by Alexander Grey on Unsplash

The Execution

Now, we needed a place to run the code. After debating if we should get a cloud instance to run the load, I did some quick math and realized that I could run it on my laptop in less than an hour. That wasn’t so bad.

We started with a test round and noticed that the service, 1 out of 10 times, would respond with the data that was provided to it instead of the mappings. So, we would just get the list of survey responses back. Since no mappings were found, those responses in the CSV file would be mapped to an empty string.

Instead of detecting this and retrying in code, I decided to rerun the script but have it only process records for which the target column was empty.

The script would start with the target column empty in all rows and fill in the normalized program name. Due to the errors in the responses, some rows would not get mapped, and their target column would remain empty. When the script ran a second time, it would construct the prompt for only those responses that were not processed in the first run. We reran the program a couple of times and got everything mapped out.
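
The "only process unmapped rows" trick is simple to express in code. A sketch, reusing the rows, srcCol, and targetCol variables from the code above:

// Sketch: collect only the survey responses whose target column is still empty.
const pending = new Set();
for (const row of rows) {
  if (!row[targetCol]) pending.add(row[srcCol].trim());
}
const uniqueData = [...pending];  // only these go into the next run's prompts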

The multiple runs took roughly 30 minutes and did not need much supervision. Here is a selection of some of the more interesting mappings from the model:

Sample Mappings between input and program name. Image by the Author.

Most look right. I'm not sure whether Organizational Behavior belongs under Social Sciences or Business, but I guess either would work.

Each request of about 50 records used a total of roughly 800 tokens. The cost of the entire exercise was 40 cents. We probably spent another 10 cents on testing, reruns, etc. So for a total cost of about 50 cents, about 2 ½ hours of coding/testing time, and ½ hour of runtime, we got the job done.

Total Cost: Approx. Less than $1

Total Time: Approx. 3 hours

Perhaps with manual conversion using Excel, sorting, regex, and drag-and-copy, we could have accomplished it in the same amount of time and saved a little change. But this was way more fun, we learned something, we have a repeatable script/process, and we got an article out of it. Besides, I have a feeling StudyFetch can afford the 50 cents.

This was a good use case that we handled efficiently and cost-effectively, but what else can LLMs be used for?

Photo by Marcel Strauß on Unsplash

Looking for More Nails

Adding language capability to your applications has more use cases than the one I illustrated above. Here are a few more, pertaining just to the survey data we were looking at:

Data Parsing and Standardization: LLMs can help in parsing and standardizing data by identifying and extracting relevant information from unstructured or semi-structured data sources like the one we just looked at.

Data Deduplication: LLMs can help identify duplicate records by comparing various data points. For example, we can compare names, majors, and universities in the survey data and flag potential duplicates.

Data Summarization: LLMs can summarize distinct records to get an idea of the responses. For example, for the question “What is the biggest challenge you face while studying?”, a large language model can summarize several responses from the same major and university to see if there are any patterns. We can then put all the summarizations into a single request and get an overall list. But I suspect summarization from each customer segment will be more useful.

Sentiment Analysis: LLMs can analyze the responses to determine sentiment and extract valuable insights. For the question “Would you pay for a service to help you study?”, LLMs can score the sentiment from 0 (very negative) to 5 (very positive). We can then use this to analyze student interest in a paid service by segment.
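
To show how little would need to change for something like sentiment analysis, the same bulk/ordinal pattern could be reused with a different prompt. This is purely illustrative, with made-up sample responses:

// Hypothetical sentiment version of the same bulk-classification payload.
const sentimentPrompt =
  "Rate each response to 'Would you pay for a service to help you study?' " +
  "from 0 (very negative) to 5 (very positive).\n" +
  "Desired Format: JSON array of [response number, score]\n" +
  "Data\n###\n1. Absolutely, if it saves me time.\n2. No, there are enough free tools.\n###";

const payload = {
  model: "gpt-3.5-turbo",
  temperature: 0,
  messages: [{ role: "user", content: sentimentPrompt }]
};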

While student surveys are a great example of a smaller microcosm, the wider world has several uses for this technology as well. At AAXIS, where I work, we implement business-to-business as well as business-to-consumer digital commerce solutions. Doing so includes migrating large amounts of data from an existing older system to a newer system with different data structures. We use a variety of data tools to analyze source data to ensure consistency. The techniques outlined in this article could be of good use for that purpose.

Other digital commerce use cases include checking the product catalog for errors, writing product copy, scanning review responses, and summarizing product reviews, to name a few. All a lot simpler to code for than the murky waters of undergraduate creativity when asked about their majors.

Still, it is important to note that while LLMs can be powerful tools in cleansing data, they should be used in conjunction with other techniques and human oversight. Data cleansing processes often require domain expertise, context understanding, and manual review to make informed decisions and maintain data integrity. LLMs are also not inference engines[10]. They are next-word predictors. And they tend to provide incorrect information very confidently and convincingly (hallucinations) [2][11]. Fortunately, during our testing, we didn’t encounter any hallucinations since our use case primarily involved classification.

LLMs can be a great tool in your arsenal if you proceed with caution and are aware of the pitfalls.

Photo by Paul Szewczyk on Unsplash

The Final Nail

We started out, in this article, by looking at a specific use case of data cleansing: normalizing survey responses to a specific set of values. This would allow us to group the responses and gain valuable insights. We used a Large Language Model (LLM), OpenAI's GPT-3.5 Turbo, to help classify these responses. We reviewed the prompt that was used, how to make the API calls with that prompt, and the code needed to automate it all. Finally, we put it all together and got the job done for a total OpenAI utility cost of less than one dollar.

Did we have a proverbial LLM hammer and find the perfectly shiny nail in free-form survey responses? Maybe. More likely, we had a Swiss army knife and used it to skin and eat some fish. Not quite purpose-built, but still very adequate. And Esan really loves sushi.

What is your use case? We’d love to hear from you!

The Co-conspirators

The main work in this article was performed by me, Esan Durrani, and Ryan Trattner, Co-Founders of StudyFetch, an AI-powered platform that uses course material to create personalized all-in-one study sets for students.

I would like to thank Prashant Mishra, Rajeev Hans, Israel Moura, and Andy Wagner, my colleagues at AAXIS Digital for their review of this article and suggestions.

I would also like to thank my friend of thirty years, Kiran Bondalapati, VP of Engineering at TRM Labs for his initial formative guidance through the world of Generative AI and for reviewing this article.

Also, thanks to my editor, Megan Polstra, for making the article look and feel professional as always.

Reference

1. Temu Raitaluoto, “The importance of personalized marketing in the digital age”, MarketTailor Blog, May 2023, https://www.markettailor.io/blog/importance-of-personalized-marketing-in-digital-age

2. Ankur A. Patel, Bryant Linton and Dina Sostarec, GPT-4, GPT-3, and GPT-3.5 Turbo: A Review Of OpenAI’s Large Language Models, Apr 2023, Ankur’s Newsletter, https://www.ankursnewsletter.com/p/gpt-4-gpt-3-and-gpt-35-turbo-a-review

3. Alexandra Mendes, Ultimate ChatGPT prompt engineering guide for general users and developers, Jun 2023, Imaginary Cloud Blog, https://www.imaginarycloud.com/blog/chatgpt-prompt-engineering/

4. Sebastian, How to Use OpenAI’s ChatGPT API in Node.js, Mar 2023, Medium — Coding the Smart Way, https://medium.com/codingthesmartway-com-blog/how-to-use-openais-chatgpt-api-in-node-js-3f01c1f8d473

5. Tristan Wolff, Liberate Your Prompts From ChatGPT Restrictions With The OpenAI API Playground, Feb 2023, Medium — Tales of Tomorrow, https://medium.com/tales-of-tomorrow/liberate-your-prompts-from-chatgpt-restrictions-with-the-openai-api-playground-a0ac92644c6f

6. AlgoWriting, A simple guide to setting the GPT-3 temperature, Nov 2020, Medium, https://algowriting.medium.com/gpt-3-temperature-setting-101-41200ff0d0be

7. Kane Hooper, Mastering the GPT-3 Temperature Parameter with Ruby, Jan 2023, Plain English, https://plainenglish.io/blog/mastering-the-gpt-3-temperature-parameter-with-ruby

8. OpenAI Authors, GPT Guide — Managing tokens, 2023, OpenAI Documentation, https://platform.openai.com/docs/guides/gpt/managing-tokens

9. Priyesh Patel, What exactly is Node.js?, Apr 2018, Medium — Free Code Camp, https://medium.com/free-code-camp/what-exactly-is-node-js-ae36e97449f5

10. Ben Dickson, Large language models have a reasoning problem, June 2022, Tech Talks Blog, https://bdtechtalks.com/2022/06/27/large-language-models-logical-reasoning/

11. Frank Neugebauer, Understanding LLM Hallucinations, May 2023, Towards Data Science, https://towardsdatascience.com/llm-hallucinations-ec831dcd7786
