
Rethinking How We Approach AI Problems

Let's focus on data


When we talk about machine learning and Artificial Intelligence, in most cases we talk about the models we apply to tackle our classification or regression tasks: Some experts pledge their loyalty to ensemble models such as gradient-boosted decision trees. Others try their luck with fine-tuning highly complex neural network architectures. Some go even a step further, try to make the best of both worlds, and stack all sorts of models together into a gigantic black box to reach the best accuracy score possible.

But why does nobody ever talk about the data?


For a long time now, research, literature, and online AI tutorials have focused on algorithms and models, suggesting that in order to become a better Machine Learning expert and achieve better results, we should use more complex models, tune hyperparameters, stack models, and, if we witness overfitting, simply perform some regularization. But why do most textbooks and tutorials almost completely neglect the importance of the data that actually goes into the model? I believe it is time for a paradigm shift!

The moving parts of AI


On a very abstract level, one can say that AI has two components: the trained model and its code (e.g. a neural network) that outputs some prediction, and the data that was used to train the model. As discussed above, we currently live in a time where most of the focus lies on improving the model, for example by fine-tuning parameters. Andrew Ng from DeepLearning.AI calls this a Model-centric view and distinguishes it from a new alternative approach that he calls Data-centric.

Model-centric

In the already discussed model-centric view, we first collect and download all the data we can possibly get and subsequently develop a model that performs well. Starting from this baseline, the data is held fixed while the model and code are iteratively improved (e.g. by changing model architectures or tuning parameters) until a satisfactory level of accuracy is reached.

Data-centric

In a data-centric approach, the focus is on the data, not on the model: After finding a well-suited model for the task, the code of the model is fixed and the data quality is iteratively improved. The main paradigm shift here is that working with data is no longer a preprocessing step! Working on the data is not something we do only once, but rather the core activity that we repeatedly improve. With this approach, the consistency of the data is paramount.

"No other activity in the machine learning life cycle has a higher return on investment than improving the data a model has access to." – Gojek

Becoming more data-centric


The aim of this article is not to make you completely neglect your model, but to motivate you to dedicate some time to working with the data instead of investing everything into the model. After all, AI has two moving parts, the model and the data, so why focus on only one? If you want to learn more about how to become more data-centric, here are some guidelines from Andrew Ng that will get you started – most of them, however, have one thing in common: consistency.

Consistent y labels

Data comes in many forms, but what every dataset for supervised learning tasks has in common are y labels. These y labels were either collected automatically or added in a (semi-)manual annotation process. However, very often they are not consistent or free of errors, and the model ends up learning a contradictory signal, which hampers performance.

The image below depicts inconsistent labels for an image recognition task, an elephant detector. To train this detector, images of elephants were manually annotated with bounding boxes. But which alternative is more correct – left or right? If the task is not to count the elephants but simply to tell whether the image contains the animal, then both versions are completely fine. It is, however, important that the way the bounding boxes are drawn is consistent across the training images; otherwise, the classifier will not be able to learn properly.

Inconsistent bounding box annotation of elephants – Image was taken by author
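
One way to quantify such disagreement is to have multiple annotators label the same images and compare their bounding boxes via intersection over union (IoU). The following Python snippet is a minimal sketch of this idea; the box coordinates and the 0.5 threshold are illustrative assumptions, not values from a real elephant dataset.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

# Hypothetical boxes from two annotators for the same elephant image:
# one tight box per animal vs. one loose box around the whole group
annotator_1 = (10, 20, 110, 120)
annotator_2 = (5, 10, 200, 160)

if iou(annotator_1, annotator_2) < 0.5:  # threshold is a judgment call
    print("Annotation styles disagree – review the labelling standard!")
```

A low IoU between annotators does not tell you who is "right", only that no common standard exists yet – which is exactly the signal you are looking for.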

How to find inconsistencies?

Finding these inconsistencies, especially when your dataset is large, can be very time-consuming. As a rule of thumb, Andrew Ng states that finding inconsistencies and focussing on the data works best if the dataset doesn't exceed 10,000 observations. Up to this dataset size, and with the right workflow, inconsistencies can be spotted and fixed manually within a reasonable time and will likely yield a greater boost in accuracy than focussing on the model.

"With a data centric view, there is significant room for improvements in problems with < 10,000 examples" – Andrew Ng


Example: Quantifying improvement

If you have 500 observations and 12% of the observations are inconsistently or incorrectly labelled, the following approaches turn out to be about equally effective:

  • Fixing the inconsistencies (relabelling of data)
  • Collection of another 500 new observations (doubling the training set)

Source: https://www.deeplearning.ai
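
To build some intuition for this trade-off, here is a small toy experiment on synthetic data with scikit-learn. It is only a sketch, not Andrew Ng's original study: the synthetic dataset, the helper that flips 12% of the labels, and the logistic regression model are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)

# A large pool of synthetic data; the first observations serve as
# training sets, everything after index 2,000 as a clean test set.
X, y = make_classification(n_samples=20_000, n_features=20,
                           n_informative=5, random_state=0)
X_test, y_test = X[2_000:], y[2_000:]

def flip_labels(labels, fraction):
    """Corrupt a given fraction of binary labels to simulate noise."""
    labels = labels.copy()
    idx = rng.choice(len(labels), size=int(fraction * len(labels)),
                     replace=False)
    labels[idx] = 1 - labels[idx]
    return labels

def fit_and_score(X_train, y_train):
    model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Baseline: 500 observations, 12% of them mislabelled
print("500 noisy: ", fit_and_score(X[:500], flip_labels(y[:500], 0.12)))

# Option 1: fix the labels (relabelling)
print("500 clean: ", fit_and_score(X[:500], y[:500]))

# Option 2: collect 500 more observations (noise level unchanged)
print("1000 noisy:", fit_and_score(X[:1_000], flip_labels(y[:1_000], 0.12)))
```

Depending on the random seed, the cleaned 500-observation run and the noisy 1,000-observation run often end up in a similar accuracy range, which mirrors the point of the example above.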


One way of finding inconsistencies in labels is to use multiple labellers and see how they interpret and annotate the data. If you, for example, work on a speech recognition task and hire five people to transcribe audio sequences for training your classifier, don't simply split your data into five parts and watch the labellers do their job! It is better to start with a small percentage of the data and show the same observations to multiple labellers. Very likely you will spot differences in how the labellers transcribed the audio sequences – not because some were lazy or did a poor job, but because no standard was initially established! Imagine these two transcriptions of the same sentence:

  • Transcription 1: "Uhm… it will rain tomorrow"
  • Transcription 2: "It will rain tomorrow"

Both transcriptions are fine; however, because there was no agreed standard on how to transcribe filler words such as "uhm", inconsistencies in the labelling occurred. These inconsistencies will very likely lower the performance of your speech recognition model. It is therefore important to make a decision and agree on a standard for handling these ambiguities. As you can imagine, you won't be able to define all necessary standards before the annotation process, and new inconsistencies will likely show up during the annotation. It is therefore important to see this as an iterative approach!
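
A simple way to surface such disagreements programmatically is to compare overlapping annotations after normalising them according to the agreed standard. The sketch below assumes two hypothetical labellers and a standard that drops filler words; both the example transcripts and the normalisation rules are illustrative.

```python
import re

# Hypothetical transcripts from two labellers for the same audio clips
labeller_a = ["Uhm... it will rain tomorrow", "the meeting is at, uh, three"]
labeller_b = ["It will rain tomorrow", "The meeting is at three"]

def normalise(text):
    """Lowercase, drop filler words and punctuation before comparing."""
    text = text.lower()
    text = re.sub(r"\b(uhm|uh|erm)\b", " ", text)  # agreed standard: no fillers
    text = re.sub(r"[^a-z ]", " ", text)           # strip punctuation
    return " ".join(text.split())

for a, b in zip(labeller_a, labeller_b):
    status = "match" if normalise(a) == normalise(b) else "DISAGREEMENT"
    print(f"{status}: {a!r} vs {b!r}")
```

Raw string comparison would flag both pairs as disagreements; with the normalisation encoding the standard, both pairs match – the inconsistency lies in the missing standard, not in the labellers.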

How to fix inconsistencies?

As already stated above, fixing inconsistencies is an iterative process. Once you know of inconsistencies in your data labels, you should consider the following steps as a blueprint:

  • Find examples with these ambiguities and inconsistencies
  • Agree on a standard by making a decision on how they should be labelled/handled (e.g. in the example above: should "uhm" be transcribed or left out?)
  • Document the new standard in a dedicated labelling instruction manual

Only if the standards are reproducible and well documented with meaningful examples can you make sure that the hired labellers – or anybody else you later hire to do the job – will know what to do in case of uncertainty. It is therefore important that labelling instructions include borderline cases, near misses, and confusing examples, not only unambiguous dummy examples.


What if relabelling and fixing inconsistencies is not an option?

Sometimes, for example for budgetary or time reasons, it is not possible to fix inconsistencies in the whole dataset. If you nevertheless want to make the best out of your data and stay with a data-centric view, you can consider these options:

  • Collect more data (also not always feasible)
  • Use data augmentation (add some noise and variation to existing observations – see the sketch after this list)
  • Toss out inconsistent examples and examples that contain too much noise
  • Focus on a subset of the data that shows the most potential for improvement – use error analysis
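
As a concrete illustration of the augmentation option, the following minimal sketch creates noisy copies of existing tabular observations. The feature matrix, the labels, and the noise scale are placeholder assumptions; for images, typical augmentations are rotations, crops, or brightness changes instead – a brightness example follows later in this article.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(X, y, n_copies=2, noise_scale=0.05):
    """Append noisy copies of each observation to the dataset."""
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        noise = rng.normal(0.0, noise_scale * X.std(axis=0), size=X.shape)
        X_parts.append(X + noise)
        y_parts.append(y)
    return np.vstack(X_parts), np.concatenate(y_parts)

X = rng.normal(size=(500, 20))    # placeholder feature matrix
y = rng.integers(0, 2, size=500)  # placeholder binary labels

X_big, y_big = augment(X, y)
print(X_big.shape, y_big.shape)   # (1500, 20) (1500,)
```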

The last two options are discussed in more detail below.

Tossing out observations – from big data to good data


This option might seem quite counterintuitive, but more data is not always better! Having a large dataset helps only if it's "good" data – but what defines a "good" dataset? Generally speaking, a high-quality dataset has four characteristics:

  1. Labels that are consistent and don't show ambiguities
  2. Observations that cover the most important cases
  3. Training data that receives timely feedback from the production data (especially in regard to data drift)
  4. A dataset of appropriate size (not too small)

If your data doesn't meet the criteria above, you might want to consider spending some time to improve it or, as mentioned above, remove poor observations! The relationship between "good" data and model performance is shown in simplified form in the graphic below:

Image inspired by Andrew Ng, redrawn and adapted by author

If you have a small dataset with few, noisy observations, then fitting a robust function (red line) is difficult (left). If you have a big dataset with a lot of noise, then fitting a robust function is possible and might yield good results (middle). If you have few observations of very high data quality, then, as in this example, the best function can be fitted (right).

Performing error analysis to focus on a subset of data to improve

As already stated above, sometimes improving the whole dataset would be too time-consuming, especially when working with big data. Furthermore, it is often important to prioritize correctly so as not to waste time on issues that would not significantly boost your model's performance once fixed. Focussing on a subset of the data is therefore often the right approach. Here are a couple of practical guidelines and options you can follow when performing error analysis:

  • When working on classification, use the confusion matrix as a tool to get a first overview of which class is performing the worst (see the sketch after this list). Often this is a good starting point for a deeper investigation.
  • If you have any knowledge of the human level of performance (how a human expert would perform on the same or a similar task), try to include this information in the error analysis: Look for the class with the largest gap in accuracy between the estimated human level of performance and your model's performance. This class has the biggest potential for improvement; therefore, starting with fixing the data of this class is the best option! Often it is not possible to get estimates of the human level of performance; in this case, try to look for studies in related fields or investigate the performance of machine learning models that were built in the past!
  • Take a close look at the misclassified observations and try to see if you can find a pattern. For example, if you work in image analysis, you can investigate whether the misclassified images exhibit some kind of noise, such as changing lighting conditions or unsharpness, or whether they might even be mislabelled. If there are multiple patterns of noise, focus on the one that occurs with the highest frequency.
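
As a starting point for such an error analysis, the sketch below computes a confusion matrix and per-class recall with scikit-learn; the two label vectors are made-up placeholders standing in for a real validation set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels from a validation set
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 1, 2, 1])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
print(cm)

# Per-class recall: diagonal over row sums; the smallest value points
# to the class whose data deserves the closest look first.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
worst = per_class_recall.argmin()
print(f"worst class: {worst} (recall {per_class_recall[worst]:.2f})")
```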

How you handle the noise is highly case-dependent. If the noise is not to be expected in production (e.g. images that were acquired by an old sensor that has since been replaced), it might be worth removing or relabelling the misclassified images. However, if the noise is to be expected in production (e.g. changing lighting conditions), you can try adding more images that contain the same noise signal to the dataset. A second option would be to artificially augment your dataset, for example by making some existing images brighter or darker. In both cases, you provide the model with more examples to learn from, which will ultimately improve performance.
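
For the brightness option, a minimal sketch could look as follows; the random placeholder image and the scaling factors are illustrative assumptions.

```python
import numpy as np

def adjust_brightness(image, factor):
    """Scale pixel intensities; factor < 1 darkens, factor > 1 brightens."""
    return np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # placeholder

# Create darker and brighter variants that mimic changing lighting conditions
augmented = [adjust_brightness(image, f) for f in (0.6, 0.8, 1.2, 1.4)]
```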


Lastly, it is important to say that error analysis is not a one-step approach but has to be iterative: Train your model, perform error analysis, improve your data (subset) and retrain your model. Repeat this to increase your model’s accuracy continuously.

Final thoughts and limitations

This article showed you why focusing on improving the data instead of the model might pay off, and discussed guidelines on how to become more data-centric. However, there are limitations to everything, and it's important to outline some of them here:

  • Having a data-centric view is often easier with unstructured data such as images, videos or audio sequences. Finding patterns in data and fixing labels is usually much harder when working with structured tabular data.
  • The bigger the dataset, the harder it gets to ensure that it is consistent. Spending time on improving the dataset has the biggest payoff with datasets of fewer than 10,000 observations.

Even though the focus of this article lay on establishing consistency in your dataset, the importance of data science and feature engineering has to be stressed here too. Complex models with thousands of parameters should never be an excuse for not performing proper data science and gaining a real understanding of your data. Often, better performance can be achieved by spending more time on analysing and fixing your dataset in combination with a simpler model than by neglecting the data and using a very complex model (e.g. a neural network). A trend in recent times is that machine learning experts don't deem proper feature engineering important because neural networks are supposed to be powerful learners that don't require these time-consuming steps. This trend has proven problematic, since predictions come out of black-box models with little to no interpretability, and machine learning experts are then often unable to explain why a model is predicting something.


Further material

[1] Andrew Ng – MLOps: From Model-centric to Data-centric AI [PDF]: https://www.deeplearning.ai/wp-content/uploads/2021/06/MLOps-From-Model-centric-to-Data-centric-AI.pdf

[2] Andrew Ng – A Chat with Andrew on MLOps: From Model-centric to Data-centric AI on YouTube: https://www.youtube.com/watch?v=06-AZXmwHjo&t

[3] DeepLearning.AI – Data-centric AI: Real World Approaches on YouTube: https://www.youtube.com/watch?v=Yqj7Kyjznh4

