
7 Common Gotchas of Data Projects

Challenges That Occur When Working With Data

Data is central to our work as data professionals. However, data very rarely comes prepared for us to begin waving our so-called "magic wand". There will be problems with our data, and the best way to get the most out of our data projects is to know what those problems are so that we can devise ways to work around them.

Let’s explore some of these problems…

1 Data Collection and Labeling

Data collection can be extremely expensive in terms of both time and money. This typically happens when we have a custom problem without readily available data to leverage, so we have to collect the data ourselves.

Note: Check out Always Remember Data Comes Before The Science to learn techniques for acquiring data.


Where the big bucks get spent is labeling the data for supervised learning tasks. It’s even worse when it has to be done manually. For example, if the goal of our project is to identify all of the supermarkets in a city, the ideal solution would be to acquire this data from somewhere, but for this scenario, let’s imagine we cannot.

To get up-to-date data, the team decides to send out a car fitted with cameras to take pictures of the environment within that city. This form of data collection is, in and of itself, a very expensive process, but now we need to label the data. The labeling would have to be done manually, which involves paying humans to do so, and that is not cheap.

Within this category, there is also a sub-problem that could jeopardize the outcome of our data project. It’s simply called bad quality, and it encompasses two components:

  • The quality of the raw data being poor
  • The quality of the labeling being poor

As we proceed you’ll see more examples of bad quality.
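The labeling-quality half of this problem can be made measurable. One common check is inter-annotator agreement: have two people label the same examples and compute Cohen’s kappa, which corrects raw agreement for chance. The sketch below is a minimal, self-contained illustration (the example labels are made up):

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for
    the agreement we would expect by chance alone."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction both annotators labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

annotator_1 = ["cat", "cat", "dog", "dog", "cat", "dog"]
annotator_2 = ["cat", "cat", "dog", "cat", "cat", "dog"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))  # ≈ 0.667
```

Kappa near 1 means the labels are consistent; kappa near 0 means agreement is no better than chance, which is a strong hint that the labeling itself is of bad quality.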

2 Noise

When we speak of noisy data, we are referring to additional, meaningless information within our dataset. The noise itself can be defined as corruption or distortion of examples. For instance, images may be blurred, a voice note may have police sirens in the background, or some text may concatenate words that shouldn’t be concatenated.

"Noise is often a random process that corrupts each example independently of other examples in a collection" – Burkov, A. Machine Learning Engineering. Page 44

Noise typically becomes a problem when the dataset is small relative to the problem being solved – it usually leads to overfitting on small datasets. In other words, the model will learn the noise that contaminates the data, which leads to poor generalization on new, unseen data.

Note: When the dataset is large, noise could serve as a form of regularization.
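That regularizing effect is also why noise is sometimes injected deliberately as data augmentation. The `augment_with_noise` helper below is a hypothetical sketch, not a standard API: it pads a training set with Gaussian-perturbed copies of each example:

```python
import numpy as np

def augment_with_noise(X, sigma=0.1, copies=3, seed=0):
    """Return the original rows plus `copies` noisy duplicates.

    On a large dataset, small random perturbations stop the model from
    memorizing exact feature values, acting like a regularizer.
    """
    rng = np.random.default_rng(seed)
    noisy = [X] + [X + rng.normal(0.0, sigma, X.shape) for _ in range(copies)]
    return np.vstack(noisy)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_aug = augment_with_noise(X, sigma=0.05)
print(X_aug.shape)  # (8, 2): the 2 original rows plus 3 noisy copies of each
```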

3 Low Predictive Power

If you’ve never experienced testing multiple algorithms on a dataset and having them all perform ridiculously badly, I envy you. We generally cannot know whether we have a low-predictive-power problem until we have spent some time trying our best to get a good model.

If we exhaustively try many different solutions in search of an acceptable result, to no avail, regardless of how complex our solutions become, then it may be a good idea to consider the possibility that we have low predictive power.

Low predictive power can be the result of two factors:

  • The model may not be expressive enough
  • The data may not contain enough information for the model to learn a good enough function to map inputs to outputs
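A quick way to probe the second factor is to compare against a trivial baseline. In the hypothetical sketch below, the labels are generated independently of the features, so no model can do much better than always predicting the majority class; the nearest-centroid classifier is just a stand-in for whatever learner you would actually use:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic dataset whose labels carry no relationship to the features.
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Baseline: always predict the majority class.
majority = int(y.mean() >= 0.5)
baseline_acc = (y == majority).mean()

# Stand-in "model": nearest-centroid classification (in-sample, for brevity).
centroids = {c: X[y == c].mean(axis=0) for c in (0, 1)}
preds = np.array(
    [min(centroids, key=lambda c: np.linalg.norm(x - centroids[c])) for x in X]
)
model_acc = (preds == y).mean()

# When the model barely beats the baseline no matter what we try,
# suspect low predictive power in the data rather than a bad model.
print(f"baseline: {baseline_acc:.3f}, model: {model_acc:.3f}")
```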

4 Bias

Wikipedia describes bias as a disproportionate weight in favor of or against an idea or thing, usually in a way that is closed-minded, prejudicial, or unfair. Biases can be innate or learned. People may develop biases for or against an individual, a group, or a belief. In science and engineering, a bias is a systematic error [Source: Wikipedia].

There are a number of reasons why bias may occur, so I will explore this phenomenon in a separate article.

5 Outdated Examples

MLOps has become increasingly popular over the past few years, and this is one of the reasons. Once a model has been built and deployed into a production environment, it will generally perform well for some time before declining – how long that takes depends on the task at hand.

A model generally starts to make errors because of a phenomenon called concept drift. Concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes [Source: Wikipedia].
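In practice, concept drift is caught by monitoring, which is exactly the kind of tooling MLOps emphasizes. The `DriftMonitor` class below is a hypothetical sketch: it tracks rolling accuracy over a sliding window and raises a flag when accuracy drops below a threshold (both knobs are illustrative and task-dependent):

```python
from collections import deque

class DriftMonitor:
    """Track the rolling accuracy of a deployed model and flag possible
    concept drift when it drops below a threshold."""

    def __init__(self, window=100, threshold=0.8):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def update(self, prediction, actual):
        self.results.append(prediction == actual)

    @property
    def accuracy(self):
        return sum(self.results) / len(self.results) if self.results else None

    def drift_suspected(self):
        # Only raise the alarm once the window is full of observations.
        full = len(self.results) == self.results.maxlen
        return full and self.accuracy < self.threshold

monitor = DriftMonitor(window=10, threshold=0.8)
for _ in range(10):
    monitor.update(1, 1)          # the model is predicting correctly
print(monitor.drift_suspected())  # False: rolling accuracy is 1.0

for _ in range(5):
    monitor.update(1, 0)          # the world has shifted; predictions miss
print(monitor.drift_suspected())  # True: rolling accuracy fell to 0.5
```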

6 Outliers

In statistics, an outlier is an example that is significantly dissimilar from the majority of examples within the dataset. Although "dissimilarity" is determined by the practitioner, there are metrics that can be used to measure how dissimilar one example is from another, such as the Euclidean distance.

Simple models like linear regression or logistic regression, as well as some ensemble methods such as AdaBoost, are particularly sensitive to outliers. On the other hand, models that can explicitly or implicitly perform feature-space transformations tend to handle outliers well.
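As a concrete sketch of the Euclidean-distance idea, the hypothetical helper below flags rows that lie unusually far from the feature-wise mean. The cut-off factor `k` is an arbitrary choice, reflecting the point that "dissimilarity" is ultimately defined by the practitioner:

```python
import numpy as np

def flag_outliers(X, k=2.0):
    """Flag rows whose Euclidean distance from the column-wise mean
    exceeds `k` times the average distance."""
    centroid = X.mean(axis=0)
    dists = np.linalg.norm(X - centroid, axis=1)
    return dists > k * dists.mean()

X = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.0], [1.0, 1.1],
    [10.0, 10.0],  # looks nothing like the other rows
])
print(flag_outliers(X))  # only the last row is flagged
```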

7 Data Leakage

Data leakage can occur at many different stages within the lifecycle of a data project. To define it simply, data leakage is when information from outside the training data is used to create the model, which can lead you to create models that are extremely optimistic (if not completely invalid).

"Data leakage is when the data you are using to train a Machine Learning model has some information about what you are trying to predict."

Therefore, the easiest way to identify whether you’re faced with a data leakage problem is to consider whether the results of your models seem a little too good to be true.
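A classic instance is computing preprocessing statistics on the full dataset before splitting off a test set. The synthetic sketch below makes the leak visible by drawing the "future" test data from a shifted distribution: the mean used for standardization quietly absorbs information about the hold-out data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature column; the test portion comes from a shifted
# distribution so the leak is easy to see.
train = rng.normal(0.0, 1.0, 800)
test = rng.normal(3.0, 1.0, 200)

# LEAKY: the standardization mean is computed on ALL the data,
# so test-set information bleeds into the preprocessing step.
leaky_mean = np.concatenate([train, test]).mean()

# CORRECT: compute statistics on the training set only, then reuse
# them to transform the test set.
clean_mean = train.mean()

print(f"leaky mean: {leaky_mean:.3f}, clean mean: {clean_mean:.3f}")
```

The leaky mean is pulled toward the test distribution; models evaluated on data standardized this way will look better than they really are.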

Wrap Up

In this article, I covered seven common problems that we tend to face when working on data projects. While I didn’t provide many solutions for combating these problems here, I believe that acknowledging their existence and knowing how to identify them is the first step toward a solution, which is generally quite intuitive. In later articles, I will touch on some of these points, e.g. how to combat the different types of bias in data.

Thank You for reading!

I recently started my own mailing list. If you enjoyed this article, subscribe to my mailing list to connect with me so that you never miss a post I make about Artificial Intelligence, Data Science, and Freelancing.


Related Articles

Always Remember Data Comes Before The Science

Don’t Make Breaking Into Data Science Harder Than It Needs To Be

My Biggest Challenges Being A Self-Taught Data Scientist

4 Data Related Books I’ll Be Reading In April

