
The Three Types of Missing Data Every Data Professional Should Know

And Why It's Important To Know Them

DATA SCIENCE. ANALYTICS. DATA ENGINEERING.

Photo by Matt Walsh on Unsplash

INTRODUCTION

If you ask data scientists which data problem they wish they could avoid but cannot, chances are they will all answer: missing data.

You know how they say that the only things certain in life are death and taxes? Well, for data scientists, missing data is probably third on that list.

Missing data, in general, restricts the effectiveness of our machine learning (ML) models, especially when applied to real-world use cases.

If we think about it, one of the most popular suggestions for making ML models better is getting more data and more observations. It should therefore come as no surprise that missing data does the opposite: it weakens the models and obscures the phenomena we are trying to explain or solve.

In the book "Missing Data: A Gentle Introduction", the authors reviewed three years' worth of published journals (that's right, published journals) and found not only that missing data was prevalent but also that the average amount of missing data was estimated to exceed 30%.

It is therefore time for us to devote some resources to discussing, and coming up with an appropriate response to, missing data.

While the common approaches to missing data are deletion and simple imputation, these solutions are not always appropriate. Take deletion: you may end up discarding as much as 30% of your data, lowering the statistical power of your analysis, especially for small datasets.

This article serves as an introduction and the first of a series designed to discuss the handling of missing data.

THE THREE TYPES OF MISSING DATA

MOCK DATASET: WEIGHT MANAGEMENT DATASET

To assist our efforts in understanding missing data, it will be helpful to generate some mock data which we can use.

#PRELIMINARIES
import numpy as np
import pandas as pd
from numpy import random  # provides seed, normal, randint and binomial used below

Assume our dataset comes from a weight-management program in which initial weights are recorded (in lbs). Let us set aside how much weight a person can realistically and safely lose; for illustration purposes, assume that all of these values are possible.

Weights are then measured again in each of the next two months.

#INITIAL WEIGHTS
random.seed(19)
initial_weights = random.normal(150,30,10)

After the first month of the program, let the participants’ weights be:

random.seed(20)
first_weigh_in = random.normal(145,20,10)

So what we have right now is a dataset with initial weights and first-month weights. Because the numbers are randomly generated, some observations have first-month weights lower than the initial weights (signifying weight loss) and others have first-month weights higher than the initial weights (signifying weight gain).

For the second-month weigh-in, let us assume that those who lost weight are determined to keep losing weight and therefore have a higher probability of losing more (around 3 to 5 lbs); if they do gain some weight, the gain will be small (around 1 to 2 lbs).

Those who gained, however, either were demotivated and gained more, or were inspired by those who lost weight and started to lose some weight themselves.

We can code this relationship as follows:

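# Positive difference = weight gained in the first month
first_month_diff = first_weigh_in - initial_weights
second_weigh_in = [None] * 10
random.seed(21)
for i in range(len(first_month_diff)):
    if first_month_diff[i] > 0:
        # Gained in month one: change of -3 to +6 lbs (upper bound is exclusive)
        second_weigh_in[i] = first_weigh_in[i] + random.randint(-3,7)
    else:
        # Lost in month one: change of -5 to +2 lbs
        second_weigh_in[i] = first_weigh_in[i] + random.randint(-5,3)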

Let’s check out our full dataset:

df = pd.DataFrame({"Initial":initial_weights,
                  "First Month": first_weigh_in,
                  "Second Month": second_weigh_in})
df

Take a look at our dataset and try to remember which participants (1–10) have continuously lost weight and which ones gained throughout the months.

SIMULATING THE THREE MISSING DATA MECHANISMS

To make sense of this simulation, we will apply the three different mechanisms on the "Second Month Weigh-In" observations.

Thus, for each mechanism, the initial-weigh-in and first-month weigh-in observations are available.

TYPE 1: MISSING COMPLETELY AT RANDOM (MCAR)

Missing Completely at Random is a mechanism where data is missing due to completely random reasons; there is no specific structure as to why data might be missing.

For example, it is quite possible that during the weigh-in for the second month, a participant happens to be sick and just misses it. It may also be that something completely unrelated to the phenomenon you are studying or measuring gets in the way, such as a car breakdown on the way to the gym or measuring center. Other reasons relate to data management, for example, an accidental deletion.

This is the easiest mechanism to code:

random.seed(25)
# Each second-month value is kept with probability 0.5; dropped values become 0, then NaN
df["MCAR"] = random.binomial(1,0.5, size=10)*df['Second Month']
df["MCAR"] = df["MCAR"].replace(0, np.nan)
df

TYPE 2: MISSING AT RANDOM (MAR)

Suppose, for example, that people who gained weight in the first month, instead of losing it, got demotivated and purposely did not show up for the second-month weigh-in.

That is, and this is an important piece for MAR: the observations in the initial and first months determine whether the observation in the second month would be missing.

This systematic relationship can be coded as follows:
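A minimal sketch of one way to simulate this mechanism. The 80% no-show probability, the seed, and the names `df_mar`, `gained_first_month`, and `is_missing` are assumptions made for illustration. The block rebuilds a small mock dataset of its own, so its rows will not match the table discussed in this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # hypothetical seed, chosen only for reproducibility

# Rebuild a small mock dataset in the same spirit as the one above (weights in lbs).
n = 10
initial = rng.normal(150, 30, n)
first = rng.normal(145, 20, n)
second = first + rng.integers(-5, 7, n)  # integer change of -5..6 lbs
df_mar = pd.DataFrame({"Initial": initial,
                       "First Month": first,
                       "Second Month": second})

# MAR rule (assumed): only OBSERVED data, the first-month change, drives missingness.
gained_first_month = df_mar["First Month"] > df_mar["Initial"]
p_missing = np.where(gained_first_month, 0.8, 0.0)  # hypothetical probabilities
is_missing = rng.random(n) < p_missing

# Blank out the second-month value wherever the participant did not show up.
df_mar["MAR"] = df_mar["Second Month"].mask(is_missing)
print(df_mar.round(1))
```

The second-month value itself never enters `p_missing`, which is what separates this mechanism from MNAR.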

Note that missingness does not depend on the value of the second-month weigh-in itself. For example, look at person number 10: he lost some weight by the second weigh-in, but because missingness is driven only by the initial and first weigh-in information, he never had the chance to find out, having chosen not to be measured.

TYPE 3: MISSING-NOT-AT-RANDOM (MNAR)

Now, this is where it becomes a little tricky. Suppose that people who gained weight during the second month purposely did not show up for the second-month weigh-in.

In this scenario, the probability of the data being missing is directly related to the value of the missing data itself. We call this "missing not at random", or MNAR, data.

Unlike MAR, in which the probability of missingness is related to the other observed data, MNAR has a structure that is directly related to the missing observations themselves.
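The contrast between the three mechanisms can be written compactly. The notation below is an assumption introduced here, not taken from the article: R indicates whether the second-month weight is missing, Y_obs stands for the observed values, and Y_mis for the unobserved one.

```latex
% R = 1 if the second-month weight is missing, 0 otherwise
% Y_obs = observed values (initial and first-month weights)
% Y_mis = the unobserved second-month weight
\begin{align*}
\text{MCAR:}\quad & P(R = 1 \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R = 1) \\
\text{MAR:}\quad  & P(R = 1 \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R = 1 \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R = 1 \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ still depends on } Y_{\text{mis}}
\end{align*}
```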

The MNAR structure can be coded as follows:

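random.seed(34)
first, second = df["First Month"], df["Second Month"]
# Second-month gainers keep their value with probability first / (4 * second), so
# larger gains are more likely to go missing; everyone else is always observed.
df["MNAR"] = [second[i] * random.binomial(1, first[i] / (4 * second[i]))
              if second[i] - first[i] > 0 else second[i]
              for i in range(10)]
df["MNAR"] = df["MNAR"].replace(0, np.nan)
df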

Out of the three mechanisms that we have considered, MNAR creates the most difficult situation to overcome.

If you look closely at the relationship we have modeled, you will see that the greater the weight gained in the second month, the higher the probability that the second-month weigh-in is missing.

But the tricky part is this: the data scientist cannot know this relationship, because the values that drive it have not been observed.

So this is the challenge in distinguishing MAR from MNAR: to classify data as MNAR, one must establish a relationship between the missing values themselves and their probability of being missing, whereas for MAR, the relationship can be established from the observed, available data alone.
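As a rough illustration of what "looking at the observed data alone" can mean, the sketch below compares the observed first-month change between participants whose second-month value is missing and those whose value is present. It uses a larger simulated sample than the article's 10 rows; the seed, the 50% and 80% probabilities, and names such as `df_check` and `observed_gap` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # hypothetical seed
n = 2000  # larger than the article's 10 rows so group means are stable

initial = rng.normal(150, 30, n)
first = rng.normal(145, 20, n)
df_check = pd.DataFrame({"Initial": initial, "First Month": first})
df_check["First Month Change"] = df_check["First Month"] - df_check["Initial"]

# Two hypothetical missingness indicators for the second-month weigh-in.
gained = df_check["First Month Change"] > 0
df_check["missing_mcar"] = rng.random(n) < 0.5            # unrelated to any data
df_check["missing_mar"] = gained & (rng.random(n) < 0.8)  # driven by observed gain

def observed_gap(indicator):
    """Mean first-month change among missing rows minus that among observed rows."""
    change = df_check["First Month Change"]
    is_missing = df_check[indicator]
    return change[is_missing].mean() - change[~is_missing].mean()

mcar_gap = observed_gap("missing_mcar")
mar_gap = observed_gap("missing_mar")
print(f"MCAR gap: {mcar_gap:.1f} lbs, MAR gap: {mar_gap:.1f} lbs")
```

A large gap suggests that missingness depends on observed data, which argues against MCAR. A small gap does not rule out MNAR, because the values that would reveal it are exactly the ones that are missing.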

FINAL DATASET

Codes to Enhance Table Visualization Found on my Github Page

FINAL REMARKS

For researchers, missing data is like death and taxes: inevitable. This article introduced the three types of missing data, identified primarily by the mechanism (or structure) that governs their missingness.

It is important to identify the mechanism that underlies the missingness of data because not all types can be ignored. For example, with MCAR and MAR, missingness can be ignored and may have little effect on the phenomenon that we are studying.

However, with MNAR, failing to recognize the mechanism would lead to a biased study and to models that are less effective in real-world applications.

While there is no one-size-fits-all solution, one can employ a range of techniques to handle missing data, depending on the underlying mechanism of missingness. For example, deletion, the most popular method for handling missing data, may be appropriate only for datasets that are MCAR. Mean and median imputation may work for MCAR and MAR.

Next in this article series, we will discuss how to handle missing data like a pro.
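To see why deletion is safest under MCAR, the sketch below applies listwise deletion under each mechanism and compares the complete-case mean of the second-month weight with the mean of the full data. It is a simulation under stated assumptions, not the article's dataset: the sample size, the seed, and the 50% and 80% missingness probabilities are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # hypothetical seed
n = 20_000  # much larger than the article's 10 rows, to make the bias visible

initial = rng.normal(150, 30, n)
first = rng.normal(145, 20, n)
gained_first = first > initial
# Same rule as the mock dataset: first-month gainers change by -3..6 lbs,
# everyone else by -5..2 lbs (the upper bound of integers() is exclusive).
change = np.where(gained_first, rng.integers(-3, 7, n), rng.integers(-5, 3, n))
second = pd.Series(first + change, name="Second Month")

# Hypothetical missingness rules for the second-month weigh-in.
missing = {
    "MCAR": rng.random(n) < 0.5,                   # unrelated to any data
    "MAR": gained_first & (rng.random(n) < 0.8),   # driven by observed first-month gain
    "MNAR": (change > 0) & (rng.random(n) < 0.8),  # driven by the second-month gain itself
}

true_mean = second.mean()
bias = {}
for name, is_missing in missing.items():
    complete_cases = second.mask(is_missing).dropna()  # listwise deletion
    bias[name] = complete_cases.mean() - true_mean
    print(f"{name}: bias of complete-case mean = {bias[name]:+.2f} lbs")
```

In this setup the complete-case mean stays close to the full-data mean only under MCAR; under the MAR and MNAR rules it is pulled downward, because the people who drop out are disproportionately the heavier ones.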

Handling "Missing Data" Like a Pro – Part 1 – Deletion Methods

Handling "Missing Data" Like a Pro – Part 2: Imputation Methods

Handling "Missing Data" Like a Pro – Part 3: Model-Based & Multiple Imputation Methods

Let me know what you think!

Full code for simulation found on my Github page.

REFERENCES

McKnight, P. E., McKnight, K. M., Sidani, S., & Figueredo, A. J. (2007). Missing data: A gentle introduction. Guilford Press.

