Building Better ML Systems — Chapter 2: Taming Data Chaos

About data-centric AI, training data, data labeling and cleaning, synthetic data, and a bit of Data Engineering and ETLs.

Olga Chernytska
Towards Data Science

Building Machine Learning systems means much more than iterating through cool state-of-the-art algorithms.

A research or study project ends with a demo. In a commercial project, the model is released to thousands, if not millions, of users who use it in all imaginable and unimaginable ways and expect it to always work quickly, accurately, and fairly. A single incorrect prediction could cost someone their life, cause losses in the millions of dollars, or seriously damage the company’s reputation.

Throughout this series, we’re discussing important topics that need to be addressed to build a good ML system: business value and requirements, data collection and labeling, model development, experiment tracking, online and offline evaluation, deployment, monitoring, retraining, and much much more.

In the previous chapter, we learned that every project must start with a plan because ML systems are too complex to implement in an ad-hoc manner. We reviewed the ML project lifecycle, discussed why and how to estimate project business value, how to collect the requirements, and then reevaluate with a cold mind whether ML is truly needed. We learned how to start small and fail fast using concepts like “PoC” and “MVP”. And finally, we talked about the importance of design documents during the planning stage.

And this chapter is all about data. We’ll be diving into various aspects of data in ML systems — data-centric AI, training data, data labeling and cleaning, synthetic data, and a bit of Data Engineering and ETLs. This post is the longest in the series, but for a good reason: most of a Data Scientist’s working time is devoted to data.

So let the story begin.

Data-centric AI

There are two ways to improve the model accuracy:

  1. Collect more data or clean your existing data, while keeping the model constant.
  2. Use a more advanced algorithm or fine-tune the hyperparameters of your current model, while keeping the dataset constant.

The first approach is known as data-centric, and the second one as model-centric. The ML community now gravitates towards data-centric AI: many researchers and practitioners have concluded that improving the data leads to a larger increase in model accuracy than improving the algorithm. “Garbage in, garbage out”, a phrase you’ve heard a million times, is becoming great again.

Here is what Andrew Ng, the founder of DeepLearning.AI and Landing AI, says:

“Instead of focusing on the code, companies should focus on developing systematic engineering practices for improving data in ways that are reliable, efficient, and systematic. In other words, companies need to move from a model-centric approach to a data-centric approach.”

Companies that build great AI products also use a data-centric approach. Andrej Karpathy, the former director of AI at Tesla, shared that most of his time at Tesla was devoted to data.

Image. Great AI companies focus more on data than algorithms. Source: “Building the Software 2.0 Stack” by Andrej Karpathy.

Data-centric AI has become so popular that it has recently evolved into a separate discipline that studies techniques to improve datasets. To be on the same page with the ML community, I highly recommend that you take this excellent free course by MIT: Introduction to Data-Centric AI.

Data Pipelines

Everything is data. System-generated logs, bank transactions, website data, user input data, and customer data are just a few examples that your business may work with.

Data that arrives is often chaotic, unstructured, and dirty. It comes from multiple data sources, which can be tricky to merge; sometimes it’s encrypted or may have missing snippets. Data can take the form of byte streams, text files, tables, images, voice and video recordings; it can be binary or human-readable.

Before Data Scientists and ML Engineers can make any use of it, the data needs to be processed, transformed, cleaned, aggregated, and stored.

A data pipeline is a way to organize the flow of the data.

ETL (Extract-Transform-Load) is an example of a data pipeline widely used for data analytics and ML. Within an ETL pipeline, data is organized in the following way (a minimal code sketch follows the list):

  • First, you determine what data you want to collect and from which sources.
  • Next, you merge these data sources, transform the data to the required format, resolve inconsistencies, and fix errors.
  • Afterward, you design data storage and store the processed and cleaned data there.
  • Finally, you automate the entire process to run without human intervention. The data pipeline should be automatically triggered periodically or once a specific event occurs.
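
To make this concrete, here is a minimal sketch of a daily batch ETL job in Python. All file, table, and column names are hypothetical placeholders, and in a real setup an orchestrator (e.g., Airflow or cron) would trigger the job on a schedule rather than a human.

    import sqlite3
    import pandas as pd

    def extract() -> pd.DataFrame:
        # Extract: pull raw data from a source (a CSV export here; in practice
        # it could be an API, a message queue, or another database).
        return pd.read_csv("exports/orders.csv")

    def transform(raw: pd.DataFrame) -> pd.DataFrame:
        # Transform: fix types, resolve inconsistencies, drop broken records.
        df = raw.copy()
        df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
        df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
        df = df.dropna(subset=["order_date", "amount"])
        df["country"] = df["country"].str.strip().str.upper()
        return df

    def load(df: pd.DataFrame) -> None:
        # Load: store the cleaned data where analysts and ML pipelines can reach it.
        with sqlite3.connect("warehouse.db") as conn:
            df.to_sql("orders_clean", conn, if_exists="append", index=False)

    if __name__ == "__main__":
        # In production, this entry point is triggered by the scheduler, not by hand.
        load(transform(extract()))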

To dive deeper into ETLs, check out the post What is ETL process: Overview, Tools, and Best Practices by NIX United.

Image. ETL pipeline. Image by NIX United.

That was a high-level overview of data pipelines. The topic is much broader and more nuanced, which is why more and more companies are hiring Data Engineers to work with data storage and pipelines, allowing Data Scientists and Machine Learning Engineers to focus on data analysis and modeling.

If you are curious about what’s in a Data Engineer skillset, read Modern Data Engineer Roadmap by datastack.tv. I love seeing the rise of specialized roles within the field, and I am really happy that Data Scientists are not expected to know everything anymore. What a relief!

And one more important thing before we jump into training data and labeling:

If the data pipelines are set up well, your company will benefit from the data even without advanced Machine Learning. So before adopting ML, companies usually start with reports, metrics, and basic analytics.

Training Data

To train the “Cat vs Dog” classifier, you show the model a lot of cat images while saying “This is a cat,” and a lot of dog images while saying “This is a dog.” Without providing any rules or explanations, you let the model decide what to look at to make a prediction. Mathematically, it means that the model adjusts its parameters until its outputs match the expected outputs on the training data.
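
As a rough sketch of what this looks like in code, here is a minimal supervised training loop in PyTorch. It assumes images are organized into hypothetical data/train/cat and data/train/dog folders; the model and hyperparameters are purely illustrative.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms, models

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_set = datasets.ImageFolder("data/train", transform=transform)
    loader = DataLoader(train_set, batch_size=32, shuffle=True)

    model = models.resnet18(weights=None)          # small backbone trained from scratch
    model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: cat, dog
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(5):
        for images, labels in loader:              # labels come from the folder names
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)  # how far predictions are from the expected outputs
            loss.backward()                        # adjust parameters to reduce that mismatch
            optimizer.step()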

The model builds its understanding of the world based on training data, with the assumption that training data represents the real world and represents it correctly. That’s why the quality of the training data matters a lot.

  • The “Cat vs Dog” model won’t be able to predict breeds or classify other animals because this information was not present in the training set.
  • If there are mistakes in the labels, and some cats are labeled as dogs and vice versa, the model will be confused and unable to achieve high accuracy. Non-random mistakes can be extremely detrimental to the model. For example, if all chihuahuas are labeled as cats, the model will learn to predict chihuahuas as cats.
  • Real-world data contains biases. For instance, women are paid less. So, if you train a model to predict the salaries of your company employees, the model may end up predicting lower salaries for women, because that’s exactly what it sees in the data and assumes it should be that way.
  • If some classes or segments are underrepresented or absent in the training data, the model won’t be able to learn them well and will produce incorrect predictions.

Training data should be Relevant, Uniform, Representative, and Comprehensive. The meaning of these terms is explained well in the post What Is Training Data? How It’s Used in Machine Learning by Amal Joby.

Now that we agree it’s extremely important to train models on high-quality data, let me share some practical tips.

Before collecting training data, understand the business task and then frame it as a machine learning problem: what should be predicted and from what input. Almost any business task may be represented in several ways depending on requirements and restrictions. While working on a Computer Vision project, I usually choose among object detection, segmentation, and classification, and decide on the number of classes.

Training data must be very similar to the data your model will ‘see’ in production. Theoretically, models can generalize to unseen data, but in practice, this generalization ability is quite limited. For example, if you train a computer vision model for an indoor environment, it won’t work well outdoors. Similarly, a sentiment model trained on tweets won’t be effective for analyzing classic literature text snippets. I’ve personally experienced cases where a computer vision model struggled to generalize even with narrower gaps, such as slight changes in lighting, skin tones, weather conditions, and compression methods. To overcome differences between the training and production data, a popular approach is to use the most recent data from production as the training dataset.
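
As a sketch of that last idea, here is how one might pull the most recent production traffic as a candidate training set; the table and column names below are hypothetical placeholders for whatever your prediction logs look like.

    from datetime import datetime, timedelta
    import sqlite3
    import pandas as pd

    # Take the last 30 days of production traffic as the candidate training set.
    cutoff = datetime.utcnow() - timedelta(days=30)
    with sqlite3.connect("warehouse.db") as conn:
        recent = pd.read_sql_query(
            "SELECT input_id, input_payload, created_at "
            "FROM predictions_log WHERE created_at >= ?",
            conn,
            params=[cutoff.isoformat()],
        )

    # These samples are then sent for labeling, so the training distribution
    # tracks what the model actually sees in production.
    recent.to_csv("to_label/recent_production_sample.csv", index=False)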

Image. Example of mismatch between train and production (test) data. Source: Google Research Blog.

A small, clean dataset is better than a large but dirty one. For most projects, data annotation is a bottleneck. Data labeling is an extremely complicated, slow, and expensive process (the next section will be devoted to that). Having a huge, clean dataset is a luxury that only gigantic tech companies can afford. All others have to choose between size and quality, and you should always opt for quality, especially for the datasets on which you evaluate your models.

No one can really tell how much data is needed. It depends on the complexity of the predicted real-world phenomenon, variability in the training data, and the required model accuracy. The only way to find this out is by trial and error. And because of that…

Acquire data in chunks. Start with a small dataset, label it, train a model, check the accuracy, analyze errors, and plan the next data collection and labeling iteration.

Training data is not static. As you recall from the previous chapter, you are going to train and retrain the model a lot of times during the research phase and when the model is already in production. With each new iteration and model update, a new training dataset is needed. No rest for the wicked, remember? :)

Data Labeling

Most ML models in production today are supervised. This means that labeled data is required for training and evaluating the model. Even in the case of unsupervised learning, where the model learns patterns and structures from unlabeled data, labeled data is still needed to evaluate the model’s accuracy; otherwise, how else would you know that it’s good enough for production?

There are two types of labels: human labels and natural labels.

Some machine learning tasks are about predicting the future. Examples are predictions of stock prices, customer churn, time of arrival, fraudulent transactions, and recommendations. Once the future has come, we know the true label. These labels are referred to as natural labels, and we only need to collect them when they arrive.
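
Collecting natural labels is often just a join between what you predicted and what actually happened later. A minimal sketch, assuming hypothetical log files for a time-of-arrival model:

    import pandas as pd

    # Predictions logged at request time, and outcomes observed later
    # (file and column names are hypothetical).
    predictions = pd.read_parquet("logs/eta_predictions.parquet")   # order_id, predicted_eta
    outcomes = pd.read_parquet("logs/actual_deliveries.parquet")    # order_id, actual_arrival

    # Once the future has arrived, the observed outcome becomes the label.
    labeled = predictions.merge(outcomes, on="order_id", how="inner")
    labeled.to_parquet("training/eta_labeled.parquet")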

In Computer Vision and NLP, we don’t predict the future; instead, we classify, analyze, and retrieve information from images and texts. That’s why we cannot obtain natural labels and must rely heavily on human labels.

Human data labeling is an extremely complicated, slow, and expensive process. Don’t think of it as a task within a Machine Learning project; it is better to approach it as a separate Data Annotation project, with its own scope, budget, timeline, team, tools, and KPIs.

Image. Stages of Data Annotation Projects. Source: Best Practices for Managing Data Annotation Projects

If you work closely with data annotation, I recommend checking the 30-page report on Best Practices for Managing Data Annotation Projects by Tina Tseng et al. For a shorter version accompanied by my own insights, continue reading this post.

The first thing to decide on is: Who is going to label the data? There are three options to consider: crowd, vendor, and in-house labeling team. I remember the excitement around crowdsourcing tools like Amazon Mechanical Turk about five years ago. However, it quickly turned out that crowdsourced labeling is only suitable for very simple tasks that require minimal to no workforce training. As a result, most companies choose between vendors and in-house labeling teams. Startups usually lean towards vendors as they offer a simpler starting point, while large AI companies build their own labeling teams to have control over the process and achieve higher annotation quality. As an example, Tesla has 1,000 full-time employees on its manual data labeling team. Just saying.

Create guidelines and train annotators based on them. Guidelines are documents that provide explanations and visuals of what should be labeled and how. Guidelines are then transformed into training materials that annotators must complete before undertaking actual labeling tasks. If you work with vendors, make sure their workforce training process is set well.

Real-world data is ambiguous and confusing, so allow annotators to say, “I do not know how to label this sample.” Then, collect these confusing samples and use them to improve the guidelines.

The annotation tool matters. Annotators are typically paid hourly rates, so helping them label faster and more accurately will save you a lot of money. At a large scale, the difference between an annotator labeling 100 samples per hour and 300 becomes very noticeable. So choose wisely, and pay attention to the following:

  • How much time it takes to label a single sample. Some tools were specifically developed for NLP tasks; completely different tools are used for 2D or 3D Computer Vision.
  • Whether AI-powered labeling is supported. It is something you want to use. The tool may predict segmentation masks with a single user click on an object or allow you to deploy your own models to assist the labeling process.
  • How well it fits your infrastructure. The annotation tool will be integrated into the data pipeline. Once data arrives, it is automatically sampled and sent to the annotators. They label the data, and the labels are automatically stored in the database. Some tools may fit your infrastructure better than others, so consider that.

The list of open-source annotation tools is here, and here is a nice comparison of some free and paid tools.

Estimate the costs and timelines. You’ll be surprised how slow and expensive data labeling can be (I was). Therefore, it’s better to be prepared (and prepare your manager in advance).

Here are the formulas to roughly estimate costs and time (a small worked example follows the list):

  1. Labeling Time (in man-hours) = Time to label a sample (in hours) * Dataset size (in samples) + Time reserved for training and error correction (in man-hours)
  2. Labeling Time (in working days) = Labeling Time (in man-hours) / Number of employees / 8 hours
  3. Costs ($) = Annotator’s hourly rate ($) * Labeling Time (in man-hours)
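
For instance, plugging in some made-up numbers (30 seconds per sample, 50,000 samples, 40 man-hours reserved for training and error correction, 5 annotators at $12/hour; these are illustrative values, not benchmarks):

    def estimate_labeling(time_per_sample_h, dataset_size, overhead_h,
                          num_annotators, hourly_rate):
        # Formula 1: total labeling effort in man-hours
        man_hours = time_per_sample_h * dataset_size + overhead_h
        # Formula 2: calendar time in working days (8-hour days)
        working_days = man_hours / num_annotators / 8
        # Formula 3: total cost in dollars
        cost = hourly_rate * man_hours
        return man_hours, working_days, cost

    man_hours, days, cost = estimate_labeling(30 / 3600, 50_000, 40, 5, 12)
    print(f"{man_hours:.0f} man-hours, {days:.1f} working days, ${cost:,.0f}")
    # -> 457 man-hours, 11.4 working days, $5,480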

No matter how hard you try, data labels will inevitably be dirty. Humans make mistakes: they may become distracted or misunderstand the task. Therefore, checking the quality of the labels is a must. And, of course, the algorithm or tool you select for this task must also be integrated into the data pipeline. I won’t stop repeating it: everything must be automatic.

One such tool is Cleanlab. It was developed by MIT graduates and has recently gained great popularity. Cleanlab improves labels for images, text, and tabular data using statistical methods and machine learning algorithms (for examples of what it can do, check out the Cleanlab blog).
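
As a rough illustration, a label-quality check with Cleanlab might look like the sketch below. It assumes you already have out-of-sample predicted probabilities from any classifier (e.g., via cross-validation); the call shown is from the Cleanlab 2.x API, so double-check it against the version you install.

    import numpy as np
    from cleanlab.filter import find_label_issues

    # labels: integer class labels given by annotators, shape (n_samples,)
    # pred_probs: out-of-sample predicted class probabilities from any model,
    #             shape (n_samples, n_classes). File names are hypothetical.
    labels = np.load("labels.npy")
    pred_probs = np.load("pred_probs.npy")

    issue_idx = find_label_issues(
        labels=labels,
        pred_probs=pred_probs,
        return_indices_ranked_by="self_confidence",  # most suspicious samples first
    )
    print(f"{len(issue_idx)} samples look mislabeled; review these first.")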

As a final note on data annotation, I recommend this insightful post by Synced — Data Annotation: The Billion Dollar Business Behind AI Breakthroughs. The title is self-explanatory, and the article is certainly worth a read.

Synthetic Data

Take all the aforementioned challenges with data labeling, add data privacy issues and severe class imbalance in real-world data, and you’ll have the reason why synthetic data is becoming increasingly popular.

Synthetic data is typically generated using some combination of gaming engines, generative adversarial networks, and perhaps a touch of magic. In the self-driving car industry, synthetic data has already become essential. Check out what NVIDIA and Tesla are already doing.

Once synthetic data generation is set up, one can obtain large, diverse datasets with extremely accurate labels relatively quickly and inexpensively. Even if the synthetic data doesn’t look perfect, it can still be useful for model pre-training.

If you’re interested in expanding your knowledge on this topic, here is a great resource: What Is Synthetic Data? by NVIDIA.

Conclusion

In this chapter, we discussed a new trend in the industry — data-centric AI, an approach to building ML systems that considers clean data to be much more important than advanced ML algorithms. We touched on data pipelines, which are designed to organize the flow of chaotic and unstructured data so the data can be used for analytics. We learned that training data should be relevant, uniform, representative, and comprehensive, as models build their understanding of the world based on this data. We reviewed two types of labels — human and natural — and navigated through the complex, slow, and expensive process of obtaining human labels, and discussed best practices to make this process less painful. Lastly, we talked about an alternative to real data and human labeling: synthetic data.

In the next posts, you will learn about model development, experiment tracking, online and offline evaluation, deployment, monitoring, retraining, and much much more — all this will help you build better Machine Learning systems.

The next chapter is already available:
