In notebooks and learning materials, you'll find no shortage of ways to improve the accuracy of machine learning models through parameter optimization. That can only get you so far.
Data is everything in modern-day machine learning, but is often neglected and not handled properly in AI projects. As a result, hundreds of hours are wasted on tuning a model built on low-quality data. That’s the main reason why the accuracy of your model is significantly lower than expected – it has nothing to do with model tuning.
Don’t let this happen to you.
Every AI solution consists of two parts: code (model) and data. Libraries let you write less and less code for identical results, but no one can tell you how to prepare the data adequately. That's the general gist of a data-centric approach. More details in a bit.
Today’s article answers the following questions:
- What is Data-Centric AI?
- Data Quantity vs. Data Quality – Which one should you go for?
- Where to find good datasets?
What is Data-Centric AI?
The words of Andrew Ng are wise yet again. In his hour-long session on YouTube, Andrew makes a particular statement that really hits home: your model architecture is good enough.
And it makes sense – teams of geniuses worked on the model architecture (think ResNet, VGG, EfficientNet…) – so it's safe to assume they did their homework. Stop trying to improve their work – it's a windmill you don't want to tilt at.
That said, your approach to machine learning can be either model-centric or data-centric:
- Model-centric approach: Asks how you can change the model to improve performance.
- Data-centric approach: Asks how you can change or improve your data to improve performance.
The model-centric approach tends to be more fun for practitioners. That’s easy to understand, as practitioners can directly apply their knowledge to solve a specific task. On the other hand, nobody wants to label data the entire day, as it’s seen as a tedious low-skill job.
Don't be that guy. Data is as important as – heck, even more important than – the model.
In the same YouTube session, Andrew makes one more claim that makes you think: In recent publications, 99% of the papers were model-centric with only 1% being data-centric.
And as it turns out, most performance gains were made with a data-centric approach. For example, take a look at the following image, taken from the mentioned session:

I don't know much about steel defect detection, solar panels, or surface inspection, but I know an improvement in accuracy when I see one. The model-centric approach provided either zero or close to zero improvement over the baseline, but probably took hundreds of hours of practitioners' time.
To summarize – don’t try to outsmart a room full of PhDs. Instead, make sure the quality of your data is top-notch before trying to improve the model.
Data Quantity vs. Data Quality
Data quantity represents the amount of data available. The usual approach is to collect as much data as humanly possible and leave it for a neural network to learn the mappings.
I’ll share my top websites for finding good and large datasets later, but let’s explore the average dataset size on Kaggle. It’s presented in the following figure:

As you can see, most datasets aren't that big. Dataset size doesn't matter too much in a data-centric approach. Sure, you can't train a neural network on 3 images, but the focus shifts to quality, not quantity.
Data quality is all about, well, quality. It doesn’t matter if you don’t have hundreds of thousands of images – it’s vital that they’re high quality and labeled correctly.
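One simple quality check you can run before anything else is scanning for exact duplicate files: duplicates inflate the apparent dataset size and can leak between your train and test splits. Below is a minimal sketch using content hashing – the directory name and function name are just illustrative:

```python
import hashlib
from pathlib import Path


def find_duplicates(image_dir):
    """Group files by content hash; any group with more than
    one file is a set of exact (byte-for-byte) duplicates."""
    by_hash = {}
    for path in sorted(Path(image_dir).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash.setdefault(digest, []).append(path.name)
    return [names for names in by_hash.values() if len(names) > 1]
```

Note that this only catches byte-identical copies – near-duplicates (resized or re-encoded images) need perceptual hashing, which is beyond this sketch.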
Take a look at the following example – it shows how the data labeler might approach labeling images for an object detection task:

Want to confuse a neural network? That’s easy – just label things inconsistently. It is essential to have strict labeling rules if you care about data quality. That’s especially the case when you have multiple labelers.
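One way to enforce those rules is to measure agreement between labelers directly. The sketch below compares bounding boxes from two labelers using Intersection over Union (IoU) and flags images where they disagree – it assumes one box per image for simplicity (real object detection datasets have many boxes per image, so this is illustrative only):

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def flag_disagreements(labeler_a, labeler_b, threshold=0.5):
    """Return image ids where the two labelers' boxes overlap poorly."""
    flagged = []
    for image_id, box_a in labeler_a.items():
        box_b = labeler_b.get(image_id)
        if box_b is None or iou(box_a, box_b) < threshold:
            flagged.append(image_id)
    return flagged
```

Images that get flagged are exactly the ones worth re-reviewing against your labeling guidelines.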
Still, one question remains – how much data is enough?
That’s harder to answer than you think. Most algorithms will have a minimum recommended number of data points in the documentation. For example, YOLOv5 recommends at least 1500 images per class. I’ve managed to get good results with less, but the accuracy would definitely improve with more training samples.
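Checking your dataset against such a recommendation is easy to automate. Here's a minimal sketch that counts annotations per class in a YOLO-style label directory (one `.txt` file per image, each line starting with a class id) and flags classes that fall short – the threshold and directory layout are assumptions based on the YOLOv5 convention:

```python
from collections import Counter
from pathlib import Path


def count_labels_per_class(labels_dir, min_per_class=1500):
    """Count annotations per class id across YOLO-format label files
    and return (counts, classes that fall below the threshold)."""
    counts = Counter()
    for label_file in Path(labels_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            if line.strip():
                # First token on each line is the integer class id.
                counts[int(line.split()[0])] += 1
    shortfalls = {c: n for c, n in counts.items() if n < min_per_class}
    return counts, shortfalls
```

Running this before training tells you immediately which classes need more data or relabeling attention.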
To summarize – having an abundance of data is a perk, not a necessity. You can achieve more with less, provided the smaller dataset is of higher quality.
Where to Find Good Datasets?
We'll now explore two resources from which you can download good-quality datasets – free of charge. The first one is well-known in the Data Science community, while the other is a newcomer specializing in particular areas.
Kaggle
Yes, you can find almost anything on Kaggle. It's an established resource for finding datasets of any kind – tabular, image, and others.

Kaggle is the place to start if you're new to data science and Machine Learning. It's also a good fit for anyone who wants to try their luck at competitions, which can be financially rewarding if you win.
Graviti Open Datasets
This is the new kid on the block. Graviti provides a collection of premium-quality datasets, mainly for computer vision.

There are over 1100 datasets available, but not all of them are finalized yet. However, their team is adding new datasets daily, so the future is looking bright. For example, here's what their nuScenes dataset looks like when visualized through their built-in tools:

As you can see, the website is simple to use and lets you either explore the datasets online or download them locally.
Wrapping Up
And there you have it – a basic introduction to a data-centric approach to AI. It boils down to caring more about data quality than data quantity. Still, good quality datasets are tough to find. That’s where the previous section of the article comes in.
You need a premium-quality dataset if you want to build a premium-quality machine learning model. For everyday use, Kaggle is a good place to start. If you’re into computer vision and AI, give Graviti a try – it’s completely free.
Loved the article? Become a Medium member to continue learning without limits. I’ll receive a portion of your membership fee if you use the following link, with no extra cost to you.
Stay Connected
- Follow me on Medium for more stories like this
- Sign up for my newsletter
- Connect on LinkedIn