Opinion

When ChatGPT first came out in November 2022, the LLM (Large Language Model) craze was immense. Straight out of Tony Stark’s lab, we finally had an Artificial Intelligence that communicated like a human. Even for the tech-initiated, its capabilities were shocking at first, almost frightening. Granted, LLMs had been around for some time by then, but ChatGPT, built on GPT-3.5, took things to a new level.
But then, the issues started to show themselves. ChatGPT hallucinates, said machine learning researchers – it would often make things up and cite "sources" that did not exist. ChatGPT is a disaster for academic integrity, cautioned ethicists – students could cheat more easily than ever. And, arguably most importantly, ChatGPT is not ethically sound, warned AI ethics researchers – much of its training data was full of bias, and this is reflected in its responses.
This leads to a dilemma. ChatGPT is powerful, yes – it certainly can do things. But at the same time, it is far from perfect. So should we use it? And if so, how?
I acknowledge the arguments against ChatGPT above. In fact, in many cases, you’ll find me actively making them. My own lab at the University of Washington is rife with research concerning the ethics of LLMs.
That said, I maintain it would be foolish to ignore LLMs altogether. Technology is advancing, and we must advance with it. We can only combat the issues with LLMs by actively using them in effective ways to learn what must be changed, not by shunning them.
Every field has its own unique drawbacks and benefits in this new technological age. In this article, I’ll discuss two ways in which you, the aspiring data scientist, can harness the power of ChatGPT. We’ll talk about what you can do, and, perhaps more importantly, what you can’t.
I want to consider this dilemma from two different perspectives. First, I’ll give a technical example, and then I’ll provide a broader, subtler perspective.
Let’s get to it.
First, ChatGPT can’t process all your data, but it can help you find it, format it, and guide you in the early stages of processing (code generation).
My point here is best illustrated by an example. Most quarters, I teach an undergraduate data visualization course. This, as you can imagine, involves data. And where there is data, there is a headache involved in getting it in the right format.
One student (let’s call him Dan) came to me with a particularly annoying issue. Dan had collected some user data about water quality in the department building. One of the questions asked respondents for adjectives to describe the water’s taste, and he wanted to visualize these results in a bar chart.
Unfortunately, the way the data was collected in the back-end inadvertently resulted in all the free-response "Other" adjectives being grouped together. More concretely, Dan’s data looked like this:

As you can see, the three "Other" responses were grouped together into a single item. This made for a subpar bar chart, since in an ideal world, each of those adjectives would receive its own bar, rather than being grouped into a joint category.
And so, Dan came to me for assistance. Try as I might, I couldn’t determine the correct method in pandas to solve this issue and separate the data out into individual rows. I tried various things, the most unwieldy of which was trying to define some custom function and use it in combination with Series.apply. I am sure the more astute among you have already identified the right function, but the important point is this – neither Dan nor I knew it, and our online searches for information did not prove fruitful.
Granted, we could have done it manually, but there was much more data than the example subset I’ve shown here, and it would have been a pain, not to mention rather inelegant.
After nearly 30 minutes of failed attempts, I turned to ChatGPT as a last-resort effort to help Dan. This was before I had much experience using LLMs, so I did not think to try this earlier. We described the data we had, the problem we were facing, and the desired output.
Lo and behold, ChatGPT solved our problem by introducing us to the explode method, quite literally designed to take list data in a column and expand it out into one row per item. Running df.explode('Description') gives us the following output:
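For readers who want to try this themselves, here is a minimal sketch using made-up data rather than Dan’s actual survey responses. It assumes the grouped "Other" adjectives arrive as a single comma-separated string, which is why the string is split into a list before explode is called. The Respondent column and the adjectives themselves are hypothetical; only the Description column name comes from the call above.

```python
import pandas as pd

# Hypothetical survey responses; the real data set is not reproduced here.
df = pd.DataFrame({
    "Respondent": [1, 2, 3, 4],
    "Description": ["Clean", "Metallic", "Clean", "Salty, Bitter, Earthy"],
})

# explode() expands list-like cells, so first turn each string into a list.
# Assumption: grouped answers are separated by a comma followed by a space.
df["Description"] = df["Description"].str.split(", ")

# One row per adjective; values in the other columns are repeated as needed.
exploded = df.explode("Description").reset_index(drop=True)

# Tally each adjective so every one can get its own bar in a chart.
counts = exploded["Description"].value_counts()

print(exploded)
print(counts)
```

If your data already stores the adjectives as Python lists, the split step is unnecessary and explode can be called directly.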

But my point here isn’t about this specific function. If you’d known what to look for, you likely could have found a solution to this problem using standard Google searches. Heck, just writing this article, I found a Stack Overflow post that mentions this function.
And yet I maintain the newfound utility of ChatGPT (and other LLMs) for such use cases. Why? Well, in order to find solutions using standard Internet search, you often need to ask the question in the right way, using technical terms you may or may not be aware of. This can be a barrier, especially if you’re a beginner, or if English isn’t your first language.
Since ChatGPT is a large language model designed for conversation with humans, it can be easier to explain your particular issue and be understood, even if you’re a bit uncertain about how to describe it. Dan and I experienced this ourselves above.
This brings me to my main point. In a data science context, ChatGPT is great at assisting with specific and targeted problems due to its vast amount of training data. In addition to a use case like the one above, you can ask it where you might find examples of specific types of data (the advanced GPT-4 model can even search the internet live), or ask it to conduct basic data transformations (such as organizing raw text data into JSON).
You can’t ask ChatGPT to complete an end-to-end data science exploration for you (I mean, you can try, but it’s just going to give you some vague guidelines), but you absolutely can (and should) use it to troubleshoot small problems along the way.
Second, ChatGPT can’t make you a data scientist, but it can make you a competitive data scientist.
Now, let’s zoom out a bit. Rather than discuss a particular use case, I want to talk about the broader impact that ChatGPT can have on your career.
What am I arguing in the title of this subsection? This is an extremely important point. Many people think because LLMs can reason with human-like abilities, generate code, and hold so much information, they’ll replace software engineers, data scientists, and the like.
This is far from the reality. LLMs may be powerful, but they still struggle with basic tasks like arithmetic and can be prone to errors and hallucinations (read: literally making stuff up).
Practically speaking, this means that – despite what overenthusiastic influencers on social media want you to believe – someone with no training who just has a subscription to GPT-4 won’t take your job from you. Nor will your employer completely automate your work with an app running an LLM. All that effort you put into learning statistics and programming and people skills will pay off. ChatGPT alone can’t make you a data scientist.
But ChatGPT and other LLMs absolutely can make you more competitive among data scientists. If you learn to adopt these new tools, you can improve your workflow and take your skills to the next level.
There are levels to working in data science. The base level is just a standard data scientist who doesn’t take advantage of the up-and-coming utility of LLMs. Someone at this level will still have a marketable skill set and much to contribute, albeit with an unwise reluctance to embrace new technological advances.
One level up, you have someone with an identical base skill set, who has also put in the effort to learn some basic prompt engineering with LLMs. This data scientist is capable of the same work output, but with slightly more efficient solutions and faster workflows because they know how to use ChatGPT as an AI coding assistant.
And finally, you’ve got a data scientist who has put significant effort into understanding the various workflows and implications surrounding LLMs. Their abilities go beyond simple prompt questions, encompassing detailed development with APIs as well as a thorough understanding of the ethical concerns surrounding LLMs.
Among these three, the last is the most attractive hiring candidate. They’re equipped to help companies augment their products with this advancing technology while remaining cognizant of common pitfalls and issues. As more and more companies realize they’ll get left behind if they don’t harness the power of generative AI, this kind of data scientist becomes more and more attractive.
This is a data scientist who is trained in the field’s foundations, who is engaging meaningfully with technological advancement, and who approaches modern problems with a forward-thinking, but ethical and just, mindset.
So I conclude by asking you a question.
Which kind of data scientist are you going to be?
My name is Murtaza Ali, and I am a PhD Student at the University of Washington studying computer science education. I enjoy writing about education, programming, life, and the occasional random musing.