
Improving Models Using Missing Values

Get valuable information from missing values.


Image by the author.

When I load a new dataset, one of the first things I check is how many missing values it has. Seeing all those NaN values is disappointing and discouraging. It is not unusual to have columns with more than 50% missing values. We get excited about new datasets, mainly because they mean more data and more room for improvement. But like every previous dataset, the new ones are full of USELESS MISSING VALUES!!!!

But wait! Are missing values really worthless? Can I get some extra information or insight from the missing values?

In this short article, I will change your view of missing data. I’ll show you how to look at missing values differently and how to improve your model performance with them. Let me be clear: one of the main differences between a professional data scientist and a regular one is how they approach missing values and get valuable information from them.


Let me start with a question. What do you do when you see missing values? Many of us remove them initially (or permanently). Some of us torture ourselves for weeks to impute them.

Removing or imputing missing values before investigating the reason behind them has four negative consequences.

  1. Removing samples (i.e., rows) with missing values forces us to ignore other non-missing values that could help us build better predictive models.
  2. Removing features (i.e., columns) with a large share of missing values also throws away the non-missing values of that feature, which is a waste of data.
  3. Sometimes missing values by themselves have good information that is necessary for improving the analysis or the performance of our predictive model.
  4. Removing or imputing missing values can introduce bias to our model.
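
To make the first two points concrete, here is a minimal sketch in pandas that counts how many perfectly good values a blanket row deletion would throw away. The DataFrame and its column names (`age`, `blood_pressure`, `biopsy`) are hypothetical toy data, not taken from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 47, 29],
    "blood_pressure": [120, np.nan, 135, 128, 118],
    "biopsy": [np.nan, "positive", np.nan, "negative", np.nan],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# Rows that a blanket df.dropna() would remove.
rows_with_nan = df.isna().any(axis=1)

# Non-missing cells that sit in those rows and would be discarded with them.
discarded_cells = int(df[rows_with_nan].notna().sum().sum())

print(missing_share)
print(f"rows dropped: {int(rows_with_nan.sum())} of {len(df)}")
print(f"non-missing cells discarded: {discarded_cells}")
```

Running this kind of check before calling `dropna()` shows how much usable data the deletion would cost.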

Let me be clear. I have no problem with removing or imputing missing values, but you should ask yourself a few questions first. Here are three important questions to ask before deleting or imputing missing data.

Photo by Markus Winkler on Unsplash

What generated these missing values?

Imagine we have a dataset of people with and without a certain type of cancer, gathered by a medical research group. We have all the available information for both groups. One of the columns with missing values could be the biopsy test result. It makes sense to have missing values for people without cancer, but why do we have many missing values for people who were eventually diagnosed with cancer? One simple explanation is that the research team could not find biopsy results for some patients and skipped them. But a more important reason could be that the doctor never ordered a biopsy because of other factors. For example, maybe the other test results made the doctor confident enough to diagnose cancer even without a biopsy. Or, at the other extreme, the doctor saw no sign of cancer and therefore did not order a biopsy. If you remove or ignore these missing values, you lose a big piece of information. In this example, missing values mark a group of patients with either very clear or very unclear symptoms of this specific type of cancer. Patients (samples) without missing values in their biopsy results are those who had enough symptoms for the doctor to order a biopsy.

As you see, missing values are giving us crucial information. The only way to get this valuable information is to think about why we see a missing value and understand its meaning.

Remember, a missing value is not always caused by a data collector’s mistake. Other factors (doctors in this example) may have left the value missing for a reason.
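
One common way to keep this information is to record the missingness itself as a feature before doing any imputation. The snippet below is only a sketch of that idea on hypothetical data: the column names `biopsy_result` and `diagnosed`, and the placeholder category `"missing"`, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: diagnosed = 1 means the patient was diagnosed with cancer.
df = pd.DataFrame({
    "biopsy_result": ["positive", np.nan, np.nan, "negative", np.nan, "positive"],
    "diagnosed": [1, 1, 0, 0, 0, 1],
})

# Keep the "was it missing?" signal as its own 0/1 feature.
df["biopsy_missing"] = df["biopsy_result"].isna().astype(int)

# For a categorical column, missingness can also stay visible as an explicit category.
df["biopsy_result"] = df["biopsy_result"].fillna("missing")

# Compare the diagnosis rate between rows with and without a biopsy result.
rate_by_missing = df.groupby("biopsy_missing")["diagnosed"].mean()
print(rate_by_missing)
```

If the outcome rate differs noticeably between the two groups, the indicator column is likely carrying information a model can use. Whether it actually helps should still be checked on real data.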

Photo by Mike Alonzo on Unsplash

Any correlation between missing data and other data?

Sometimes there is a strong correlation between one or several features and the missing values. In other words, the values are not missing at random.

A very simple example is data collected from patient information sheets. Many of these datasets (especially the raw ones) have missing values for the pregnancy questions. Men leave these questions unchecked (of course), so a raw dataset built from these forms will be full of missing values in those columns. Note that some women might also leave those questions unchecked, but a missing value means something different for a male patient than for a female patient.

Sometimes you might see a correlation between missing values and some strange features like IDs or code numbers. For example, in a hospital, patient ID numbers or order numbers could have a specific meaning. Sometimes, the correlation between missing values and ID numbers can help us understand both the missing values and the structure behind those codes and numbers.
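
A quick way to look for such relationships is to turn missingness into a 0/1 indicator and compare it against the other columns. The following is a minimal sketch on hypothetical data (the columns `sex`, `pregnant`, and `age` are made up for illustration), not a full test of whether values are missing at random.

```python
import numpy as np
import pandas as pd

# Hypothetical patient-form data; column names are illustrative only.
df = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "F", "M"],
    "pregnant": [np.nan, "no", np.nan, "yes", np.nan, np.nan],
    "age": [40, 31, 55, 28, 62, 37],
})

# 0/1 indicator: 1 where the pregnancy answer is missing.
pregnant_missing = df["pregnant"].isna().astype(int)

# Missing rate of the pregnancy answer within each category of another feature.
missing_rate_by_sex = pregnant_missing.groupby(df["sex"]).mean()

# For a numeric feature, correlate the indicator with the feature itself.
corr_with_age = pregnant_missing.corr(df["age"])

print(missing_rate_by_sex)
print(f"correlation between missingness and age: {corr_with_age:.2f}")
```

A missing rate that differs sharply between groups, as it does between `M` and `F` in this toy table, is a hint that the missingness has a structural cause worth investigating.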

Photo by David Travis on Unsplash

Is there any bias in samples with missing values?

The other danger of removing samples with missing values (before investigating the reasons behind them) is introducing bias into our models or studies. Let’s imagine we have a database of user profiles for an online dating app. This database could be built from a questionnaire in which age and race were optional questions. Should we expect women or minority groups to leave some of these age- and race-related questions unanswered? I think it is a possibility. If we remove rows based on missing values, we are introducing bias into our model. My hypothesis (which should be tested) is that the database would be skewed toward white men after removing the samples with those missing values. With this hypothetical example, I want to show you how removing samples with missing values without considering the inherent bias could result in a biased model.
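
A hypothesis like this can be tested by comparing the composition of the dataset before and after the rows are dropped. Here is a minimal sketch with made-up data; the `gender` and `age` columns and their values are assumptions chosen only to illustrate the check.

```python
import numpy as np
import pandas as pd

# Hypothetical user profiles where age was an optional question.
df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M", "M", "F"],
    "age": [np.nan, 29, np.nan, 35, 41, np.nan, 30, np.nan],
})

# Group shares in the full data and after dropping rows with a missing age.
before = df["gender"].value_counts(normalize=True)
after = df.dropna(subset=["age"])["gender"].value_counts(normalize=True)

# Side-by-side comparison; a group that disappears entirely gets a share of 0.
comparison = pd.DataFrame({"before": before, "after": after}).fillna(0.0)
comparison["shift"] = comparison["after"] - comparison["before"]

print(comparison)
```

A large `shift` for any group means the deletion changed who is represented in the data, which is exactly the kind of bias described above.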

Summary

Missing values can carry important information that improves not only the performance of our models but also our understanding of the business questions we are addressing.

Ask yourself three questions before removing or imputing missing values.

1) What generated these missing values?

2) Any correlation between missing data and other data?

3) Is there any bias in samples with missing values?

