How Human Labor Enables Machine Learning

Much of the division between technology and human activity is artificial — how do people make our work possible?

Stephanie Kirmer
Towards Data Science


We don’t talk enough about how much manual, human work we rely upon to make the exciting advances in ML possible. The truth is, the division between technology and human activity is artificial. All the inputs that make models are the result of human effort, and all the outputs in one way or another exist to have an impact on people. I’m using today’s column to talk about some of the specific areas in which we overlook how important people are to what we do — and not just the data scientists who write the code.

The division between technology and human activity is artificial, because all the inputs that make models are the result of human effort, and all the outputs in one way or another exist to have an impact on people.

Generating Data

You almost certainly already know this one: LLMs require extraordinary quantities of text data to train. We often think of this in terms of hundreds or thousands of gigabytes on a hard drive, but that's a bit abstract. Some reports indicate that GPT-4 had on the order of 1 trillion words in its training data. Every one of those words was written by a person, out of their own creative capability. For context, book 1 in the Game of Thrones series was about 292,727 words, so the training data for GPT-4 was roughly 3,416,152 copies of that book long. And this is only an example from text modeling: other kinds of models, like those that generate or classify multimedia, train on similarly massive volumes of their own kinds of data.
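To make the scale concrete, here's a minimal back-of-envelope sketch in Python using the approximate figures quoted above; the numbers are rough estimates, not exact counts.

```python
# Rough scale comparison using the approximate figures cited above.
gpt4_training_words = 1_000_000_000_000  # ~1 trillion words (reported estimate)
book_one_words = 292_727                 # approximate length of book 1 of Game of Thrones

equivalent_copies = gpt4_training_words / book_one_words
print(f"Roughly {equivalent_copies:,.0f} copies of the book")  # ~3,416,152
```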

There are a few things to consider where this data is concerned. First, all that data is generated by people; it doesn't just appear on our hard drives by magic. Respecting and acknowledging the people who create our data matters simply as an ethical issue, because they have put in work and created value that we are benefiting from. But there are also more selfish reasons to know where our data comes from. As data scientists, we have a responsibility to know what material we are giving to our models as exemplars, and to understand it in depth. If we ignore the provenance of our data, we open ourselves up to being unpleasantly surprised by how our models behave when faced with the real world. For example, training LLMs on internet forums or social media data puts those models at risk of replicating the worst of those spaces, including racism, hate speech, and more. In less extreme cases, too, we know that models are flavored by the training data they get.

If we ignore the provenance of our data, we open ourselves up to being unpleasantly surprised by how our models behave when faced with the real world.

Labeling Data

Human help is required to label data. But what are labels, exactly? At its core, labeling data means using human discernment to assign values or judgments to what we find in the data. No matter how data is collected or created, a great many machine learning use cases for that data require labeling of some kind.

This may mean deciding whether a datapoint is good or bad, judging whether words are positive or negative, creating derived values, dividing records into categories, determining what tags apply to an image or video, or countless other tasks. One common example is identifying what text appears in an image or other multimedia to improve character recognition models. If you have ever filled out a CAPTCHA, I bet this sounds familiar: you've done data labeling work.
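To make this concrete, here is a small, purely hypothetical sketch of what a handful of labeled records might look like; the fields, categories, and sanity check are invented for illustration, not taken from any particular labeling pipeline.

```python
# Hypothetical labeled records: each label is a human judgment attached to raw data.
labeled_examples = [
    {"text": "The checkout page kept crashing.", "sentiment": "negative", "category": "bug report"},
    {"text": "Love the new dashboard layout!", "sentiment": "positive", "category": "praise"},
]

# A simple sanity check a human reviewer might run before the data reaches a model.
allowed = {"positive", "negative", "neutral"}
for record in labeled_examples:
    assert record["sentiment"] in allowed, f"Unexpected label: {record['sentiment']}"
```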

LLMs themselves, in theory, don't require labeling, because we take the human-ness of the texts as given: they were already written by real people, so they are by definition as 'similar to human output' as text can possibly be. Basically, because a human wrote it, it's an acceptable example for the model to learn from and emulate. This is where we use things like semantic embeddings: the model learns how language patterns work in human-generated text and quantifies them into mathematical representations. But we are still choosing which text goes into the model's training, as I described earlier, and we have a responsibility to understand and assess that text.
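As one concrete illustration of quantifying language patterns into mathematical representations, here is a minimal sketch using the open-source sentence-transformers library; the library and model name are my choice of example, not something specified in this article.

```python
# Minimal sketch: turning human-written text into numeric vectors (embeddings).
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one commonly used embedding model
sentences = [
    "Every word in the training data was written by a person.",
    "Human-generated text carries human patterns and judgments.",
]
embeddings = model.encode(sentences)  # a (2, 384) array for this model
print(embeddings.shape)
```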

Teaching Models

Reinforcement learning uses human intervention for tasks around tuning: adjusting, slightly, how the model responds to prompts once it has basically got the hang of returning a coherent answer, whether that's text, images, video, or something else. After a mainly automated phase of pre-training or base training, many models are fine-tuned by human beings who make sometimes subtle determinations of whether the model is doing what was desired. This is a very hard task, because the nuances of what we actually want from the model can be really complicated. It's basically copy-editing an LLM in a pass-fail fashion, at a massive scale.

As I have discussed before, many modern models aim to produce the content that will be most pleasing to a human user: something that will seem right and appealing to a human being. What better way to train for this than to ask human beings to look at the results of an intermediate stage of training, decide whether those results fit that description, and tell the model so it can make more appropriate choices? Not only is that the most effective way, it may be the only way this can work.
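For a rough sense of what that human signal can look like, here is a small, hypothetical sketch of pairwise preference records of the kind used in human-feedback tuning (RLHF-style pipelines, for example); the prompts, responses, and tallying are invented for illustration.

```python
# Hypothetical human-preference records: pass/fail-style judgments between candidate answers.
preference_data = [
    {"prompt": "Summarize the report in one sentence.",
     "response_a": "The report covers the quarter's revenue and key risks.",
     "response_b": "Report. Revenue. Risks. Done.",
     "preferred": "a"},  # a human rater judged response_a more helpful
    {"prompt": "Explain overfitting to a beginner.",
     "response_a": "Overfitting is when a model memorizes noise instead of learning general patterns.",
     "response_b": "Overfitting is bad.",
     "preferred": "a"},
]

# In human-feedback tuning, judgments like these become the training signal that
# steers the model toward the answers people actually prefer.
wins = {"a": 0, "b": 0}
for record in preference_data:
    wins[record["preferred"]] += 1
print(wins)
```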

It’s basically copy-editing an LLM in a pass-fail fashion.

Why this matters

Ok, so what? Is it enough to be conscientious about the fact that real people do a lot of hard work to make our models possible? Pat them on the back and say thanks? No, not quite, because we need to interrogate what the human influence means for the results we generate. As data scientists, we need to be curious about the interaction between what we build and the rest of the world in which it lives.

Because of all these areas of influence, human choices shape a model's capabilities and judgments. We embed human bias into our models, because humans create, control, and judge all the material involved. We decide that this bit of text will be provided to the model for training, or that this specific response from the model is worse than another, and the model solidifies these choices of ours into mathematical representations that it can reuse and replicate.

This element of bias is inevitable, but it's not necessarily bad. Looking to create something free of all human influence suggests that human influence and human beings themselves are problems to be avoided, which isn't a fair assessment in my opinion. At the same time, we should be realistic about the fact that human bias is part of our models, and resist the temptation to view models as beyond our human foibles. The way we assign labels, for example, imbues the data with meaning, consciously or subconsciously. We leave traces of our thought processes and our histories in the data we create, whether it's original creative content, data labels, or judgments of model output.

Looking to create something free of all human influence suggests that human influence and human beings themselves are problems to be avoided, which isn’t a fair assessment in my opinion.

In addition, human effort in the machine learning space is often perceived as being in service of the "real" work rather than meaningful on its own. People who produce original work stop being seen as uniquely creative individuals and instead get folded into "content generators" in service of the model. We lose track of the humanity involved and of the real reason this content exists: to serve and empower people. As with the previous point, we end up devaluing people in favor of idolizing technology, which I think is foolish. Models are the product of people and exist to serve people; they aren't an independent end unto themselves. If you build a model that is never used and never gets run, what is the point?

Is data a renewable resource?

There is another interesting issue: the risk that running out of pristine human-generated content becomes a limit on model capability. That is, as our society begins to use LLMs to generate our text and DALL-E to generate our images, and we stop incentivizing real people to be creative without these technologies, the trillions of words and mountains of images we need to train new versions of these models will become contaminated with artificially generated content. That content is, of course, derived from human content, but it's not the same. We don't yet have very good ways to distinguish content generated by people without models, so we're going to struggle to know whether the training data for future models contains this contamination, and how much.

Some people argue this is not actually a big deal, and that training models on at least some proportion of artificial content won't be a problem; others theorize that when we start to cannibalize artificially generated content in this way, the underlying processes of training will be existentially altered, a phenomenon called model collapse. This is in some ways an example of the essential problem that your model affects the world that your model relies on, so the model is definitionally changed by its own behavior. This isn't just true of LLMs, as data scientists know all too well. Any model can work itself out of a job by affecting how people behave, resulting in performance drift as the underlying data relationships shift.
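To give a flavor of the dynamic, here is a toy simulation in the spirit of the model collapse argument: a one-dimensional Gaussian stands in for "the model," and each generation is fit only to data sampled from the previous generation. This is an analogy under strong simplifying assumptions, not the mechanism inside a large language model.

```python
# Toy illustration of model collapse: each generation is fit only to the
# previous generation's synthetic output, never to fresh human data.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0            # the "real" human data distribution
n_samples, n_generations = 50, 500

for _ in range(n_generations):
    synthetic = rng.normal(mu, sigma, n_samples)   # data produced by the current model
    mu, sigma = synthetic.mean(), synthetic.std()  # the next model sees only that output

# The estimated spread typically collapses toward zero: later generations
# capture less and less of the original distribution's variety.
print(f"sigma after {n_generations} generations: {sigma:.4f}")
```

Real training pipelines are far more complicated, but the direction of the concern is the same: variety in the data erodes when models feed on their own output.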

Your model affects the world that your model relies on, so the model is definitionally changed by its own behavior.

Even if we aren’t training on actually artificial data, also, there are many scholars considering whether our human composition and creative processes will change due to our exposure to artificially created content. If you read a whole lot of LLM generated text, either while writing and getting a model’s advice or just around the internet in general, is that going to subtly change how you write? It’s too early to know on a community level, but it’s a serious concern.

Human influence is a fact of machine learning; it's also a philosophical issue. We think of machine learning as a pure scientific enterprise, something that acts upon us, and this is one of the reasons it seems terrifying to some. But in reality, the systems being created are the product of human intervention and human creativity. Creating and curating data makes all the rest of machine learning possible. In some ways, this should be comforting, because we have control over what we do with machine learning and how we do it. The process of machine learning is taking the relationships between pieces of data and distilling them into mathematical representations, but that data is produced by people and is under our control. Machine learning and AI aren't some alien, abstract force; they're just us.

See more of my work at www.stephaniekirmer.com.

