Stop treating data as a commodity

Why 87% of applied ML projects never make it to production—learnings from training over 50k models

Tobias Schaffrath Rosario
Towards Data Science

--

As you’re reading this on “Towards Data Science”, I probably don’t need to tell you about all the amazing things happening in ML research. I also don’t need to tell you about the potential of AI when applied to real-world use cases. However, 87% of applied deep learning projects still fail in the proof-of-concept stage and never make it to production [1].

In this post, I’ll dive deeper into what we learned from working on over 10,000 projects and training over 50,000 vision models for production together with our users here at hasty.ai. I summarize the reasons we observed for why so many applied ML projects fail and propose an approach to solving these issues.

TL;DR: We need a paradigm shift from model-centric to data-centric ML development.

Disclaimer: Here at hasty.ai, we’re building an end-to-end platform for data-centric visionAI. Thus, all the examples I use come from the vision space. However, I believe the presented concepts are just as relevant for other fields of AI.

Many applied ML teams mimic the research mindset

To understand why 87% of applied ML projects fail, we need to take a step back and grasp how most teams approach building ML applications today. This approach is dictated by the way ML research works.

Research teams mostly try to redefine the state of the art (SOTA) for a specific task. SOTA is measured by reaching the best performance score on a given dataset. Research teams keep the dataset fixed and tweak small parts of existing approaches to achieve a 0.x% increase in performance, because that is what gets them published.

Because the hype in research is generated by this SOTA-chasing mindset, many teams working on applied ML adopt it, believing it will lead them to success as well. They invest most of their resources in developing models and treat the data as a given, or as something easy to outsource.

Chasing SOTA in applied ML is a recipe for failure

However, this thinking is flawed.

First of all, the goals in research and applied ML differ significantly. As a researcher, you want to be sure that your work is cutting-edge. On the other hand, if you work in applied ML, your primary goal should be to make it work in production—if you use a five-year-old architecture, so what?

Second, the conditions you encounter in the real world diverge drastically from a research environment, making things much more complicated.

In research, teams work under perfect conditions on clean and structured data. However, in reality, data looks very different, and teams face a whole new set of challenges. Image by Author, created with imgflip.com.

To illustrate this, let me tell you one of my favorite urban myths from the world of autonomous driving:

As with most ML research, the prototypes were developed in the Bay Area. Teams spent years tweaking the models before doing their first test drives in the Midwest. When they finally drove out in the Midwest, the models broke because all the training data had been collected in sunny California. The models couldn’t handle the harsh Midwest climate: suddenly there was snow on the road and raindrops on the cameras, which confused the models.

Side note: I heard this story over and over but never found a citable source. If you know one, I’d appreciate it a lot if you’d share it with me.

This anecdote perfectly symbolizes a larger issue: in a research environment, you encounter quite pleasant conditions, just like the nice California weather. In practice, however, conditions are rougher and things get more challenging; you cannot apply everything that’s theoretically possible.

This creates a whole new set of challenges if you want to do applied ML. Unfortunately, you won’t overcome these problems by simply improving your underlying architecture and redefining SOTA.

Applied ML ≠ research

To be more precise, here are five ways in which the environment in applied ML differs from the one in research:

1—Data quality becomes relevant

In applied ML, there’s the saying:

“Garbage in, garbage out.”

This means that your models will only be as good as the data you train them on. Most researchers disregard the fact that the 10 most-used benchmark datasets across domains have an average label error rate of 3.4% [2].

The most-used benchmark datasets have a label error rate of 3.4% [2]. While this can be neglected when comparing model architectures with each other, it can have severe consequences in applied ML.

Many researchers argue that it’s reasonable not to pay too much attention to this, as you still get meaningful insights when comparing model architectures with each other. However, if you want to build business logic on top of your model, it is pretty critical to be certain whether you’re looking at a lion or a monkey, to give just one example of many.
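To make this tangible with a quick and deliberately simplified sketch: one common way to audit label quality is to flag samples where a confidently trained model disagrees with the given label, then review those samples by hand. The Python snippet below illustrates the idea with NumPy; the function name and threshold are my own, and the authors of [2] provide a much more rigorous implementation of this approach in their cleanlab library.

```python
import numpy as np

def flag_suspicious_labels(pred_probs: np.ndarray,
                           given_labels: np.ndarray,
                           confidence_threshold: float = 0.95) -> np.ndarray:
    """Return indices of samples whose given label disagrees with a confident
    model prediction. `pred_probs` has shape (n_samples, n_classes) and should
    come from out-of-fold predictions so the model hasn't seen its own labels."""
    predicted = pred_probs.argmax(axis=1)
    confidence = pred_probs.max(axis=1)
    suspicious = (predicted != given_labels) & (confidence >= confidence_threshold)
    return np.where(suspicious)[0]

# Usage sketch (pred_probs from cross-validation, labels from your annotations):
# suspects = flag_suspicious_labels(pred_probs, labels)
# -> send these sample indices back for human review before retraining.
```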

2—More often than not, you need custom data

Most benchmark datasets used in research are huge general-purpose ones because the goal is to generalize as much as possible. In applied ML, on the other hand, you’ll most likely run into specialized use cases which are not represented in the publicly available general-purpose datasets.

As a result, you’ll need to collect and prepare your own data that represents your specific problem appropriately. Often, choosing the right data has a bigger impact on your model’s performance than which architecture you choose or how you set your hyperparameters.

Decisions made early on in the ML development process have a huge impact on the end result. The data collection and labeling strategy at the beginning is especially critical, as the cascades compound with every subsequent step. The image is taken from [3].

In the paper “Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI” [3], a team from Google AI impressively shows how picking the wrong data strategy at the beginning of a project can adversely affect the model’s performance later on. I really recommend reading the whole paper.

3—Sometimes, you need to work with small data samples

In research, a common answer to low model performance is: “Just collect more data.” Conversely, if you try to submit a paper to a conference with a model trained on only a few hundred examples, it will most likely be rejected. Even if you achieved high performance, the argument would be that your model is overfitted.

In applied ML, though, it’s often not possible or too expensive to collect enormous datasets, so you have to work with what’s available to you.

Here, however, the overfitting argument is not as critical. If you’re solving a clearly defined problem and operating in a rather stable environment, it can be viable to deploy an overfitted model as long as it produces the right results.

4—Don’t forget about data drift

To make things even more complex in applied ML, you’ll most likely run into some sort of data drift. It happens when the underlying distribution of your features or target variables changes in the real world once the model is in production.

An obvious example: you collected your initial training data and deployed your model in summer, but the world looks completely different in winter, and the model breaks. Another example, which is harder to anticipate, is when your users start to change their behavior after interacting with your model.

A slide from the great Josh Tobin summarizing the most common occurrences of data drift [4].

A small body of research focuses on this problem (see the slide above from the great Josh Tobin), but most researchers overlook this in their work chasing SOTA.
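You don’t need heavy research machinery to notice drift, though; monitoring a few simple input statistics already catches a lot. As a minimal illustration (my own, not from [4]), the Python sketch below uses SciPy’s two-sample Kolmogorov-Smirnov test to compare one feature, say mean image brightness, between the training set and live production traffic.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_has_drifted(train_values: np.ndarray,
                        live_values: np.ndarray,
                        p_threshold: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test on a single numeric feature.
    A small p-value suggests the live distribution no longer matches training."""
    _statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < p_threshold

# Toy example: "summer" training images vs. darker "winter" production images.
rng = np.random.default_rng(0)
train_brightness = rng.normal(0.60, 0.05, size=5_000)
live_brightness = rng.normal(0.45, 0.08, size=1_000)
print(feature_has_drifted(train_brightness, live_brightness))  # True
```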

5—You’ll run into computational limitations

Have you ever tried to deploy a deep model to a Texas Instruments ARM device? We did. Ultimately, we got it to work, but it was a very frantic process.

Depending on your case, you might have to run inference on the edge, serve millions of users in parallel and in real time, or simply lack the budget for limitless GPU consumption. All of these constraints rule out much of what is feasible in a research environment, where spending $20,000+ on GPUs for a single project is not unusual.
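As a hedged illustration of designing for those constraints: for edge deployments, you typically pick a compact architecture and export it to a portable format for a lightweight on-device runtime, rather than reaching for the heaviest SOTA model. The PyTorch-to-ONNX sketch below shows the general shape of that step; the architecture, input resolution, and file name are placeholders, not what we actually ran on the Texas Instruments device.

```python
import torch
import torchvision

# Pick a small architecture that fits the device's memory and latency budget.
model = torchvision.models.mobilenet_v2()  # load your own trained weights here
model.eval()

# Export to ONNX so the model can be served by a lightweight runtime on-device.
dummy_input = torch.randn(1, 3, 224, 224)  # assumed input resolution
torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",
    opset_version=11,
    input_names=["image"],
    output_names=["logits"],
)
```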

A solution: from model- to data-centric ML

As mentioned above, the drastically different environment in applied ML creates a whole new set of challenges. Focusing only on improving the model, as is common in research, doesn’t meet these challenges; the relationship between your model and your data is much more important than the model itself!

Most teams in applied ML, however, have adopted the research mindset and chase SOTA. They treat data as a commodity that can be outsourced and invest all their resources in the model work. From our experience, this is the single most important reason why so many applied ML projects fail.

More and more people in applied ML are recognizing, though, that focusing on the model alone produces disappointing results, and a movement advocating a new approach is emerging: a shift from model-centric to data-centric ML development.

To learn more about it, I really recommend watching the talk by Andrew Ng below, where he makes the case for data-centric ML much more eloquently than I ever could.

In this talk, Andrew Ng gives many great examples of why a data-centric approach outperforms a model-centric one in applied ML.

But to summarize, the main thesis is: when you do applied ML, you shouldn’t worry about SOTA too much. Even models that are a few years old are powerful enough for most use cases. Tweaking the data and ensuring that it fits your use case and is of high quality has a much bigger impact.

To make this a bit more tangible, let me share one of the talk’s anecdotes with you:

Andrew and his team were stuck at an accuracy of 76.2% when working on a defect detection project in manufacturing. He then split up the team. One group kept the model constant and added new data/improved the data quality, and the other used the same data but tried to improve the model. The team working on the data was able to boost the accuracy to 93.1%, whereas the other team couldn’t improve the performance at all.

Most teams we see succeed follow a similar approach, and we try to put it at the heart of every conversation we have with our users. Concretely, from our experience, data-centric ML development comes down to the following points:

1—The data flywheel: develop model and data in tandem

In the talk above, Andrew mentions that his teams generally develop a model early on in the project and don’t spend time tuning it initially. After running it on the first batch of annotated data, they usually notice some issues with the data: a class is under-represented, the data is noisy (e.g., blurred images), the data is labeled poorly, and so on.

Then, they spend time fixing their data, and only once the model’s performance no longer improves from fixing the data do they go back to the model work, compare different architectures, and do the fine-tuning.

We also advocate for this approach at every opportunity we get. We call it the data flywheel. It’s the idea of having an ML pipeline that allows you to iterate quickly on model and data in tandem. You should be able to build models quickly, expose them to new data, take the samples with poor predictions, annotate those samples, add them to your dataset, retrain your model, and then test it again. This is the fastest and most reliable way to build applied ML.

The Data Flywheel is an ML pipeline that allows you to iteratively develop your data and model in tandem, reaching ever-improving performance under real-world conditions. Graphic by author.
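To make the loop concrete, here is a minimal sketch of the same idea in Python with scikit-learn. It is a toy active-learning loop, not Hasty’s implementation: a logistic regression stands in for the vision model, and revealing the held-back ground-truth label stands in for the human annotation step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy flywheel: train on a small labeled batch, find the least confident
# predictions, "annotate" them, add them to the training set, and retrain.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[:100] = True  # first small annotated batch

for round_idx in range(5):
    model = LogisticRegression(max_iter=1_000).fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[~labeled])
    confidence = probs.max(axis=1)
    # Send the 100 least confident samples to the (simulated) annotator.
    uncertain = np.where(~labeled)[0][np.argsort(confidence)[:100]]
    labeled[uncertain] = True
    score = model.score(X[~labeled], y[~labeled])
    print(f"round {round_idx}: {labeled.sum()} labels, accuracy {score:.3f}")
```

In production, the annotator is a person in your labeling tool and the retraining step runs on your full vision pipeline, but the shape of the loop is the same.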

This post is not meant to be promotional, but it shows how serious we are about this idea here at hasty.ai: we actually built our whole business around it, and as far as we know, we’re the only ones doing so, at least in the vision space. When you use our annotation tool, you get a data flywheel out of the box within our interface.

While you annotate your images, we constantly (re-)train a model for you in the background, without you writing a single line of code. We then use that model to give you label predictions for the next image, which you can correct, improving the model’s performance significantly. You can also use our API or export the models to build the flywheel with your own (potentially customer-facing) interface.

Once you’ve reached a performance plateau by improving your data in Hasty, you can use our Model Playground to fine-tune the model and hit that 99.9% accuracy.

2—Annotate the data yourself, at least at the beginning

Doing data-centric ML also means that you should annotate your data yourself. Most companies stuck in the SOTA-chasing mindset view data as a commodity and outsource the labeling.

However, they overlook that the data asset they’re building, and not the models they train, is their future competitive advantage. How to build models is public knowledge and (for most use cases) easily replicable; everyone can learn the foundations of ML for free nowadays. Building a large ground-truth dataset, however, can be very time-consuming and challenging, and it cannot be reproduced without investing that time.

The more complex your use case, the harder it’ll be to build that data asset. Medical imaging, where you need a subject-matter expert to create high-quality labels, is just one example of many. But even outsourcing the annotation work for apparently simple problems can lead to trouble, as the example below shows; again, it’s stolen from Andrew Ng’s great talk.

It’s not always clear how to label objects. Identifying edge cases like this while annotating can save you hours of searching for them later when debugging your model.

This is an image from ImageNet. The labeling instruction was to “use bounding boxes to indicate the position of iguanas”, which sounds unambiguous at first sight. However, one annotator drew the labels so that the bounding boxes did not overlap, disregarding the tails, while the second one included the tail of the left iguana as well.

Both approaches are perfectly fine on their own, but it becomes problematic when half of the annotations are done one way and the other half the other way. When you outsource the annotation work, you might spend hours finding a mistake like this, whereas you can align much more easily when you do the annotations internally.

Furthermore, running into issues like this can give you great insight into edge cases that might cause your model to break once it’s in production. Without the tail, for example, an iguana looks somewhat like a frog, and your model might confuse the two.

Of course, this example is made up, but when you annotate yourself, you will very often run into objects you’re not sure how to label. Your model will struggle with exactly the same images once it’s in production. Being aware of this early on allows you to take action and mitigate the potential consequences.

Often, companies outsource the annotation work because they perceive it as too much effort and painful to do themselves. However, more and more tools offer great degrees of automation, speeding up the labeling work immensely.

Working for hasty.ai, I of course think that our annotation tool is the best one out there. I’m not trying to convince you to use our tool, though. I just want to stress that it’s definitely worth checking out the available tools and surveying which one best fits your needs. Using the right annotation tool can make it economically viable to annotate your data in-house, bringing you all the benefits I mentioned above.

3—Use tools to reduce the MLOps hassle as much as possible

Following a data-centric approach brings many more challenges than just labeling the data. Building the data flywheel described above is quite tricky infrastructure-wise.

Getting this right is the art of MLOps, a relatively new term in the applied ML world that covers pipeline management and making sure your model runs as it should in production. It’s a bit like DevOps for traditional software engineering.

If you’ve ever dived into the world of MLOps before, you’ve probably seen the graphic below from Google’s paper “Hidden Technical Debt in Machine Learning Systems” [5].

Applied ML is so much more than just the model code; all of the other tasks make up MLOps. There are more and more tools out there making MLOps easier. Be smart and use some of them to take some of the hassle out of applied ML. The image is taken from [5].

It shows all the different elements of MLOps. In the past, the companies that managed to implement all of this built large teams and maintained a Frankensuite of tools, writing countless lines of glue code to make it work. For the most part, only the FAANG companies could afford it.

But now, more and more startups are emerging that offer tools to simplify the process and let you build production-ready applied ML without the MLOps hassle. We here at hasty.ai are one of those startups: we offer an end-to-end solution for building complex visionAI applications and handle all the flywheel infrastructure for you.

But regardless of whether you end up using Hasty or not, be smart and check what’s out there instead of trying to handle all of the MLOps yourself. This frees up time you can use to focus on the relationship between your data and your models, increasing the chances of success of any applied ML project.

Conclusion

The potential of applying all the exciting things happening in ML research to real-world use cases is endless. However, 87% of applied ML projects still fail in the proof-of-concept phase.

From our experience of working on over 10,000 projects and training over 50,000 models for production, we think that a paradigm shift from model- to data-centric development is the solution to make more of those projects successful. With that thinking, we’re not alone but part of a rapidly growing movement in the industry.

Many applied ML teams haven’t adopted this mindset yet because they are chasing SOTA and mimicking how ML is done in the research community, neglecting the very different conditions they encounter in applied ML.

However, by no means do I want to trash the research community or give the impression that their approach is valueless. Even though there is justified criticism of the academic world in ML (which I didn’t even touch on in this article) [6], it’s fascinating how much great work has been produced over the past years, and I expect many more foundational advances to come out of academia.

But this is exactly the point: the goal of academia is to do the groundwork and lay the path for applied ML. The goals and environment in applied ML are completely different, so we need to adopt a different, data-centric mindset to make applied ML work.

From our experience, data-centric ML comes down to the following three points:

  1. The data flywheel: develop model and data in tandem
  2. Annotate the data yourself, at least at the beginning
  3. Use tools to reduce the MLOps hassle as much as possible

Thanks for sticking with me until here and reading the article. I’d love to hear your feedback and learn how you approach the challenges of applied ML. You can reach out to me on Twitter or LinkedIn at any time.

If you liked the article, please share it to spread the data-flywheel and data-centric ML ideas.

If you want to read more hands-on articles about how to do data-centric visionAI, make sure to follow me on Medium.

About Hasty.ai 🦔

Hasty was founded in Berlin in 2018 by engineers who wanted to make it easier, faster, and cheaper to get vision AI into production environments. Having worked on a broad range of AI projects in the German manufacturing space, we found ourselves spending countless hours doing manual annotation and MLOps for what seemed like simple use cases.

Today, we take care of all the distractions and roadblocks for you so that you can focus on what matters — putting visionAI into production.

Sources

[1] “Why do 87% of data science projects never make it into production?” (2019), VentureBeat magazine article
[2] C. Northcutt, A. Athalye, J. Mueller, Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks (2021), ICLR 2021 RobustML and Weakly Supervised Learning Workshops, NeurIPS 2020 Workshop on Dataset Curation and Security
[3] N. Sambasivan, S. Kapania, H. Highfill, D. Akrong, P. Paritosh, L. Aroyo, “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI (2021), proceedings of the 2021 CHI Conference on Human Factors in Computing Systems.
[4] S. Karayev, J. Tobin, P. Abbeel, Full Stack Deep Learning (2021), UC Berkeley course
[5] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J. Crespo, D. Dennison, Hidden Technical Debt in Machine Learning Systems (2015), Advances in Neural Information Processing Systems 28 (NIPS 2015)
[6] J. Buckman, Please Commit More Blatant Academic Fraud (2021), personal blog


Working at hasty.ai on teaching machines how to see the world. Constantly learning about visionAI.