Whereas many still cling to the model-centric view, some, including Artificial Intelligence (AI) luminary Andrew Ng, fervently argue that data, not models, must be at the core of the advancement of AI. It turns out he’s got a point; in fact, more than one. But let’s start at the beginning.
While all ML models essentially try to make predictions, it’s the accuracy of the labels that determines whether these predictions hold true in real life. In other words, the labeled aspects of the data need to be consistent with the "outside world," i.e. the actual conditions the model was designed for. For this very reason, as far as the model is concerned, data labels come before everything else. Unless the data is labeled correctly, no model can achieve a reasonable degree of practical accuracy, even if it claims to be 99% accurate on paper.

A well-known example of this, in the form of a meme, comes from the field of Computer Vision: whereas the average person has no problem telling chihuahuas apart from muffins, this seemingly elementary task is very challenging for a machine, and the model’s success relies heavily on clean, correctly labeled data.
Accordingly, data accuracy is by far the defining component of data quality. And it must be present throughout the entire dataset for an ML model to be of any real use. This is why we can argue that focusing on the accuracy of data labels today can be the main gateway to increasing the accuracy of all AI models.
In June, Professor Ng – who at one point headed Google Brain and at another was Baidu’s Chief Scientist – launched a campaign encouraging AI developers to do just that: shift their attention from models and algorithms to improving the quality of the data used to train those models. According to Ng, the established model-centric approach keeps the data fixed while developers improve the model until they achieve suitable results. This requires an extremely agile model that can somehow resolve all the quality-related issues in the data that are likely to arise at some point. Alas, most models simply can’t. To put it bluntly, your final results will only be as good as the data you used – rarely, if ever, better. The data-centric approach flips the script: it holds the model or code fixed, while AI practitioners iteratively improve the quality of the data. Ng believes this is the way to proceed. We think so, too.
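To make the contrast concrete, here’s a minimal sketch of the two loops in Python. Everything in it is a stand-in – the helper functions (train_and_score, fix_worst_labels) are hypothetical stubs, not anyone’s actual pipeline:

```python
# A schematic contrast of the two development loops. Everything here is
# a stand-in: replace the stubs with your real training, evaluation,
# and relabeling steps.

def train_and_score(model_config, dataset):
    """Stub: train a model with the given config and return test accuracy."""
    raise NotImplementedError

def fix_worst_labels(dataset):
    """Stub: find and relabel the noisiest or most inconsistent examples."""
    raise NotImplementedError

def model_centric_loop(dataset, model_configs, target):
    # The data is held fixed; we search over models until one is good enough.
    for config in model_configs:
        if train_and_score(config, dataset) >= target:
            return config

def data_centric_loop(dataset, model_config, target, max_rounds=10):
    # The model/code is held fixed; we iterate on label quality instead.
    for _ in range(max_rounds):
        if train_and_score(model_config, dataset) >= target:
            return dataset
        dataset = fix_worst_labels(dataset)
```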
Model-centric vs. data-centric
Ng’s zealous quest is yet another testament to the essential role of data in the advancement of modern AI. Of course, data quality is not exactly a new topic for the AI industry – research consistently shows that around 80% of AI project time is spent on data preparation. So, as AI development becomes increasingly data-centric, it’s fundamental that ML scientists develop a keen understanding of how to increase the accuracy of their models through the data-labeling part of the process.
Truth be told, the model-centric approach does have some perks to offer; after all, it’s cheaper, less time-consuming, and, at first glance, less troublesome. As a computer scientist, you have full control over what is going on – the model is in your hands to tweak and twist as much as you like. From the perspective of those working with the data, this essentially comes down to going with what you know.
Generally, scientists try to avoid factors they can’t directly influence. And data scientists aren’t labelers, so more often than not, they don’t have a say in how exactly the data is labeled or by whom. For this reason, many ML researchers find it makes sense to downplay the importance of labeling and focus on the model instead – in other words, on their own job.

To make a rough analogy: as a confectioner, you do your best to make top-quality chocolate with the ingredients you can get your hands on, but you rarely, if ever, grow the cocoa beans yourself. That’s up to the tree farmers located halfway across the world. The cocoa beans in this case are the labeled data, and the farmers are the labelers. And just like confectioners, data scientists simply work with what’s been handed to them to make their product, be it chocolate or ML models.
But although this stance is both plausible and appealing, taking the model-centric track is sure to run into serious problems, because:
💡 Your output, by definition, will only be as good as the data you used. If the data is noisy, even the best model can only go so far. Say you’re working on an app that helps its users maintain a healthy diet. If your dataset contains mislabeled items, such as similar-looking parsnips and carrots, then the app is destined to perform poorly. Without the right data, the whole enterprise simply fails to work, no matter how much you enhance your model. Thinking otherwise is basically on par with claiming that a new paint job will somehow boost your car’s performance, never mind the rotten engine.
💡 If the data is fixed, it can’t be updated or changed. Flexibility is something that a good scientific model (in any field) should always account for, and that’s not possible if your first step is to declare that you’ll never change your data. To put it into perspective: if you were working on a traffic control solution and another basic color suddenly had to be added to the standard red-yellow-green of the traffic light, your model would become obsolete. To make your solution work, you’d have to relabel the data to accommodate this change.
💡 Now, imagine that you’re working on a project that uses continuously updated data, such as facial recognition software (new faces) or a voice assistant (new accents and dialects). The model-centric approach will probably either come to a grinding halt or essentially reinvent itself as data-centric in order to retain a competitive edge.
💡 If changes do need to take place within the model-centric framework, any manipulations will take longer than simply updating your dataset would – a direct consequence of having fixed data that can never be tweaked. Ergo, those working on the model often have to find ways to work around problems. Going back to the traffic-light example: to accommodate the new change, rather than updating your dataset, you’d have to come up with an ingenious way to boost your model. This would mean teaching the machine to recognize the new color without actually having it as a labeled color in the set – which may or may not work, and will at any rate be challenging.
💡 Effectively, this means that the model-centric approach is the shorter road to take initially – provided everything stays the same – but it swiftly turns into a zigzag when changes are required that aren’t aligned with the model’s initial parameters.
💡 As a consequence, scalability becomes a real problem, too – you can’t make things stable at the macro level if they’re unstable at the micro level. Scaling becomes nearly impossible when any significant change in the original parameters can mean going back to the drawing board to think up a novel way to solve a problem without touching your data. Combine this with the most common method of data labeling today – the in-house route (where teams label data from scratch using their own staff instead of outsourcing) – and you’re in for a pretty long wait before you can deliver anything solid. And when you do, it might not hold up for long.
Meanwhile, the data-centric approach has a number of big advantages to offer:
💡 Your data can be continuously updated and rapidly validated, because the data itself – not the model – is what you iterate on.
💡 This approach lends a fair degree of flexibility and scalability (and, generally, bouncebackability) to your whole pipeline, because you have data at your disposal that can, in theory, be refined ad infinitum. Naturally, this is an invaluable asset when dealing with human-labeled data on a large scale. You may, for example, be collecting audio annotations to train a voice-activated AI. If you already have 50,000 recordings (which is also possible with the model-centric approach), this may be sufficient to release a basic product. But with the data-centric approach, you can always add another accent or dialect to the same dataset, or even include a different language. Qualitatively speaking, this changes everything.
💡 In the model-centric scenario, on the other hand, you either have to leave things as they are (i.e. come to terms with your product’s limitations), or think of elaborate ways to tweak your model without touching the data (which rarely works in Natural Language Processing), or – perhaps without acknowledging it – switch to the data-centric approach, even if temporarily.
💡 Universality is another great perk. Labelers, particularly in the context of crowdsourcing, can tackle virtually any assignment and switch between tasks, delivering valuable and even rare data for model training. Say you’re dealing with maps and navigation and need up-to-date info on businesses. With the model-centric, fixed-data approach, you’re basically stuck with whatever data you had at the time. But what if things have since changed – for instance, there are new street signs, or some businesses have relocated? This can be a real problem, but not if your approach is data-centric. Especially if you’re using crowd workers: simply ask them to revisit the points in question, and you’ve got yourself an updated dataset.
💡 Improved quality of the incoming data can boost the model’s overall results tremendously, provided that adequate quality control mechanisms are in place and the projects are appropriately managed (for example, with the help of a Crowd Solutions Architect, or CSA).
What’s the delay then?
To be fair, there are some downsides to the data-centric approach, too. If your team is on a tight budget and you’re using a ready-made dataset to bypass human-handled labeling altogether, then the model-centric approach is the only option. At the same time, when it comes to your labeling options, the aforementioned in-house route can be very expensive and time-consuming, while outsourcing can be a gamble that normally also costs a lot. Another method, known as synthetic labeling, entails generating artificial data for ML, but it requires a great deal of computing power that many smaller companies don’t have access to. As a result, many teams think the data-centric approach isn’t worth the trouble – mainly because, as it turns out, they’re ill-informed.
The data-centric approach will get you far, but only if you’re prepared to work with the data by investing time and/or money in it. The good news is that with methods like crowdsourcing, data labeling no longer has to be costly or take months to complete. The trouble is that many people don’t know that such methods exist, or that they’ve become effective. Research indicates that almost 80% of ML practitioners opt for the in-house track despite knowing its shortcomings. And, as a recent survey reveals, these practitioners do it not because they particularly like this method, but because they just don’t know any better.
We can only assume that the remaining 20% are familiar with some of the newer labeling methods, but probably feel unsure and wary about venturing outside the comfort zone of the familiar model-centric approach – they get the heebie-jeebies at the thought of having to deal with that pesky data labeling in any shape or form. So this is not just an ideological issue of switching to the data-centric approach; it’s about learning to recognize that the switch doesn’t have to mean a trip to the gallows – it’s actually bearable, and can even be painless.
Standardization is key
Standardization is what we’re after. When it comes to standardization, Ng takes an industry-wide view and argues that there should be a single system across all AI projects for how data is labeled. But until industry standards are established, there are certain things AI developers can already do today to ensure the consistency of the labeled data in their projects.
Firstly, we need to measure label accuracy using one or more standard quality assurance methods, such as gold standard, consensus, or sample review. A good place to start is to provide solid instructions with plentiful examples for the labelers. Importantly, these examples must cover not only what to do, but also what not to do. And they must be detailed enough to specify what counts as acceptable or appropriate, in order to avoid the common problem of confusion of observations. For example, when labelers have to mark human faces, are they also expected to mark realistic cartoon characters meant to represent people? This has to be spelled out before they proceed with the task, because, as a lot of research indicates, labelers often avoid asking clarifying questions and will simply interpret the task their own way.
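To show what the gold-standard method can look like in practice, here’s a minimal Python sketch that measures per-labeler accuracy on known-answer ("golden") tasks mixed into the queue. The task IDs, labels, and data structures are invented for the example and don’t reflect any particular platform’s API:

```python
from collections import defaultdict

# Known-answer ("golden") tasks secretly mixed into the labeling queue.
golden_answers = {
    "task_001": "chihuahua",
    "task_002": "muffin",
    "task_003": "chihuahua",
}

# Labels submitted by crowd workers: (labeler_id, task_id, label).
submissions = [
    ("alice", "task_001", "chihuahua"),
    ("alice", "task_002", "muffin"),
    ("bob",   "task_001", "muffin"),
    ("bob",   "task_003", "chihuahua"),
]

def accuracy_on_golden(submissions, golden_answers):
    """Per-labeler accuracy measured only on the golden tasks."""
    correct, total = defaultdict(int), defaultdict(int)
    for labeler, task, label in submissions:
        if task in golden_answers:
            total[labeler] += 1
            correct[labeler] += (label == golden_answers[task])
    return {labeler: correct[labeler] / total[labeler] for labeler in total}

print(accuracy_on_golden(submissions, golden_answers))
# {'alice': 1.0, 'bob': 0.5} -- labelers below a chosen threshold get
# extra training or are filtered out before their labels pollute the set.
```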
Secondly, the labeling process should be conducted within a set framework of domain knowledge and context. This is vital because labelers who understand the setting for the labels are better equipped to label correctly when dealing with words that have multiple meanings, such as "arm," "date," or "crane." Vocabulary, format, and text style can vary greatly depending on the industry and should invariably be included in the context established for specific ML projects.
These steps to solidify consistency help prevent accuracy issues that may arise, particularly when labeling _subjective data_. This type of data has no gold standard of truth, which often exposes the process to inconsistent labels stemming from each labeler’s experience and understanding of the assignment. This is known as _labeling bias_, which is different from individual errors or even confusion of observations, because labeling inconsistencies in this case may follow a specific group trend, often related to one’s cultural background. For example, "identify an Asian man in the picture" is going to mean different things to someone from the US vs. the UK: a man from East Asia, like China or Japan, for the American, and a man from South Asia, like India or Nepal, for the Brit.
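One simple way to surface such group-level trends is to compare how different labeler cohorts answer the same tasks. The sketch below uses invented cohort tags, task IDs, and labels purely for illustration:

```python
from collections import Counter

# (labeler_cohort, task_id, label) -- all values invented for illustration.
labels = [
    ("US", "img_1", "asian_man"), ("US", "img_1", "asian_man"),
    ("UK", "img_1", "no_match"),  ("UK", "img_1", "no_match"),
    ("US", "img_2", "no_match"),  ("UK", "img_2", "no_match"),
]

def label_distribution(labels, cohort, task):
    """Share of each label that a given cohort assigned to a given task."""
    counts = Counter(l for c, t, l in labels if c == cohort and t == task)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# If the same task gets systematically different answers per cohort, the
# disagreement follows a group trend (labeling bias), not random error.
print(label_distribution(labels, "US", "img_1"))  # {'asian_man': 1.0}
print(label_distribution(labels, "UK", "img_1"))  # {'no_match': 1.0}
```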
Lastly, while standards may at times feel unwieldy or leave some irregularities unresolved, it’s important to remember that ML is inherently iterative. As various models are tested and their outcomes evaluated, new datasets should be continuously brought in for model improvement. Therefore, both the data labeling process and the teams doing the work should remain agile and flexible enough to incorporate any changes necessary to fine-tune the ML model.
Those changes may involve modifying certain facets of the data, such as its volume or complexity. They may also involve changing the process itself based on relevant insights from model testing and validation. Such flexibility should allow performers, whenever necessary, to contribute beyond the actual labeling by providing insights during ongoing assignments – which, in the context of crowdsourcing, can mean aggregating their answers through majority voting.
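As an illustration, here’s a minimal majority-voting aggregator in Python – the simplest form of answer aggregation; real crowdsourcing pipelines often use weighted or probabilistic variants such as Dawid-Skene. The task IDs and labels are made up:

```python
from collections import Counter, defaultdict

# Raw crowd labels: task_id -> list of answers from different performers.
raw = defaultdict(list)
for task, label in [
    ("t1", "cat"), ("t1", "cat"), ("t1", "dog"),
    ("t2", "dog"), ("t2", "dog"), ("t2", "dog"),
]:
    raw[task].append(label)

def majority_vote(labels_per_task, min_margin=1):
    """Accept the top answer only if it wins by at least min_margin votes;
    otherwise flag the task for relabeling or expert review."""
    resolved, disputed = {}, []
    for task, labels in labels_per_task.items():
        (top, top_n), *rest = Counter(labels).most_common()
        runner_up = rest[0][1] if rest else 0
        if top_n - runner_up >= min_margin:
            resolved[task] = top
        else:
            disputed.append(task)
    return resolved, disputed

print(majority_vote(raw))  # ({'t1': 'cat', 't2': 'dog'}, [])
```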
Retaining control
To sum up, the data-centric approach has many advantages to offer, and data-labeling accuracy lies at its heart. To ensure this accuracy, AI practitioners must plan and configure a quality control system. This multi-stage process involves decomposing the data-labeling task, writing instructions, designing a clear interface, and establishing a two-way communication channel with the labeling team. Moreover, in order to derive value from the labelers’ insights, it is crucial that the data labeling process involve a closed feedback loop – going back and forth in an open dialogue until the desired goal is achieved.
Let’s imagine you’re working on an interactive bot that talks to customers who buy electronic products, asks them questions, clarifies their preferences or shipment dates, and closes sales. This type of AI isn’t static: not only does it train itself, as every AI should, but it also needs customer feedback to operate and improve. Based on that feedback, some tweaking may be required, either on the part of the model or on the part of the labelers. In this case, it could mean recognizing new delivery addresses and planning the fastest routes across different drop-off points, or cross-referencing other products purchased by the same client to solve compatibility issues. Only the data-centric approach fully allows for this. In Toloka’s case, where the data-centric track takes the form of crowdsourcing with countless labelers across the globe, every step of the way is monitored using automated quality control tools and the modern Crowd Science methodology taught at Toloka Academy.
Main takeaway
A shortage of quality labeled data remains a major obstacle in AI development. As the movement with Ng as its bellwether gains momentum, the demand for accurately labeled data can be expected to rise substantially. As a result, forward-looking AI practitioners are beginning to re-evaluate how they label their data. They may outgrow in-house labeling due to its high cost and poor scalability, or find themselves priced out of external sources, such as pre-packaged data, scraping, or building relationships with data-rich entities.
This sets the scene for the continued rise of crowdsourced data labeling as a cost- and time-effective, scalable alternative with integrated quality control and fine-tuned, largely automated delivery pipelines. In 2019, the analyst firm Cognilytica projected that the market for third-party data labeling would expand from $150 million in 2018 to more than $1 billion by 2023. Other research predicts a similar trend: Grand View Research, for one, estimates that the data-labeling market will be worth over $8 billion by 2028.
The bottom line is that if AI projects are to succeed in the real world, they must be fed high-quality input. And improving the quality of data – and, consequently, of the models it fuels – requires precision, that is, accurate labels. Fortunately, suitable data-labeling technology and ready solutions are available: Appen, Toloka, and Scale AI, to name but a few. Now it is up to AI practitioners to determine how to effectively integrate these solutions into their existing and future ML projects, which comes down to making a decisive step in favor of the data-centric approach. Meeting this prerequisite ensures error-free labels that enhance the quality of ML data, resulting in more robust models and, ultimately, successful AI innovations that can improve our lives in the years to come.