Handling “Missing Data” Like a Pro – Part 1 – Deletion Methods

Basic and Advanced Techniques for the 21st Century Data Scientist

Data Science. Analytics. Statistics. Python.

Photo by Emily Morter on Unsplash

As we mentioned in the first article of this series on missing data, knowing the mechanism or structure of "missingness" is crucial because our response depends on it.

While the list of techniques for handling missing data keeps growing, we discuss below some of the most basic as well as the most celebrated ones. These include data deletion and constant, single, and model-based imputation, among others.

Before we begin discussing them, please note that applying these techniques requires discernment from the data scientist. Even if we can identify the mechanism of missingness, other information, such as how the data was collected and the study methodology, is needed to choose the most appropriate technique.

Because we will be covering a wide range of techniques, let us split them into parts to make them more digestible. This part of the article focuses on deletion methods.

ADULT INCOME DATASET

To reinforce our understanding, let’s use a dataset, particularly the Adult Income Dataset from UCI.

Before we begin, it is important that you perform an EDA on this dataset. Knowing how each variable is distributed will come in handy when choosing the appropriate technique to apply. See my article regarding this.

PRELIMINARIES

import numpy as np
import pandas as pd
# For data exploration
import missingno as msno  # A simple library to view completeness of data
import matplotlib.pyplot as plt
from numpy import random  # NumPy's random module (random.seed and random.binomial are used below)
%matplotlib inline

LOAD THE DATASET

df = pd.read_csv('data/adult.csv')
df.head()
Completeness Visualization by the Missingno Package. Our data is missing some values in the categorical variables.

SIMULATING MISSINGNESS

Let us simulate some missingness for some of our continuous variables: age and fnlwgt. Note that income here is the target variable and is a categorical variable.

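# Random seed for reproducibility (seeds NumPy's global generator)
random.seed(25)
# Share of values to set as missing in each column
percentage_missing_1 = 0.15  # fnlwgt
percentage_missing_2 = 0.10  # age
# Number of observations
obs = df.shape[0]
# Simulation: multiply each value by a 0/1 Bernoulli draw, so dropped cells become 0
df["fnlwgt"] = random.binomial(1, 1 - percentage_missing_1, size=obs) * df["fnlwgt"]
df["age"] = random.binomial(1, 1 - percentage_missing_2, size=obs) * df["age"]
# Turn the zeros into NaN (this assumes neither column contains genuine zeros)
df["fnlwgt"] = df["fnlwgt"].replace(0, np.nan)
df["age"] = df["age"].replace(0, np.nan)
msno.matrix(df)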
Completeness Visualization after we simulated some missing data.

The "final weight" variable has the most missing values, which is consistent with our missing data simulation. Note that the mechanism we employed is simply MCAR (missing completely at random).
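As a quick sanity check on a simulation like this, the share of missing values per column can be compared against the target rates. The sketch below is not taken from the original notebook: it uses a small synthetic DataFrame (the values are made up, only the column names mirror the dataset) and NumPy's `default_rng` generator, masking cells directly instead of multiplying by a 0/1 draw.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(25)
n_obs = 10_000

# Synthetic stand-in for the two continuous columns (values are made up).
toy = pd.DataFrame({
    "age": rng.integers(17, 91, size=n_obs).astype(float),
    "fnlwgt": rng.integers(10_000, 1_000_000, size=n_obs).astype(float),
})

# MCAR: every cell is dropped with the same probability,
# independent of its own value or any other column.
target_rates = {"fnlwgt": 0.15, "age": 0.10}
for column, rate in target_rates.items():
    drop_mask = rng.random(n_obs) < rate
    toy.loc[drop_mask, column] = np.nan

# Observed share of missing values per column; should sit close to the targets.
missing_share = toy.isna().mean()
print(missing_share.round(3))
```

Masking with a boolean array avoids the intermediate zeros, so it also works for columns where zero is a legitimate value.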

Now that we have a dataset with missing data, we can examine how the different techniques affect the dataset and, consequently, the outcome of using such datasets.

DATA DELETION METHODS

Data deletion is the simplest way of handling missing data, and the one you will find across most data science blogs (and even some published articles). But as we mentioned in our introduction, data deletion diminishes the effectiveness of our models, especially if the amount of missing data is significant.

Summary for Data Deletion Methods for Missing Data

LISTWISE METHOD / COMPLETE CASE METHOD

As the name suggests, the listwise (or complete case) method drops an observation if even one of its values is missing. Applied carelessly, this can sharply reduce the number of observations:

df1 = df.copy()
df1.dropna(inplace=True)
df1.shape[0]
We lost about 30% of our original observations.

We lost 30% of our observations, which is a lot! This is expected, because dropna() applied to an entire dataset drops every observation that has a missing value in at least one column.

This method has a few advantages, such as its simplicity and efficiency, but note that it is only appropriate for data that are MCAR.

BEFORE APPLYING LISTWISE DELETION

Before deciding what to do with missing data, especially if you plan to apply listwise deletion, you need to identify relevant variables first for your study.

If a variable is not going to be needed, it does not matter whether its values are missing, and it should be excluded from the dataframe before applying listwise deletion.

For example: if we think that the final weight is irrelevant to our study (e.g. predicting income class) we can exclude it from our features dataframe.

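df2 = df.copy()
df2 = df2.loc[:, df2.columns != 'fnlwgt']  # keep every column except fnlwgt
df2.dropna(inplace=True)  # listwise deletion on the remaining columns
df2.shape[0]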
Excluding one variable leads to simply dropping only 17% of the total observations.

With this additional step, we save 6,182 observations from wasteful deletion. For some studies, this small step may make the difference in reaching your target accuracy.
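An equivalent way to restrict listwise deletion to the relevant variables is the `subset` argument of pandas' `dropna()`. The sketch below runs on a tiny made-up DataFrame (the values and the choice of relevant columns are illustrative assumptions); unlike the column exclusion above, it keeps the irrelevant column in the result.

```python
import numpy as np
import pandas as pd

# Toy data: the values are invented for illustration.
toy = pd.DataFrame({
    "age": [39.0, np.nan, 38.0, 53.0, 28.0],
    "fnlwgt": [77516.0, 83311.0, np.nan, np.nan, 338409.0],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

# Plain listwise deletion: a missing value in any column removes the row.
all_columns = toy.dropna()

# Listwise deletion restricted to the variables the study actually needs.
relevant = ["age", "income"]
relevant_only = toy.dropna(subset=relevant)

print(len(all_columns), len(relevant_only))
```

Rows that are missing only `fnlwgt` survive in `relevant_only`, so the column can still be dropped or inspected afterwards.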

Listwise deletion is primarily useful for MCAR data. As data is missing in a completely random way, assuming that we do not delete a substantial amount of observations, then we can assume that little to no information is lost by deletion.
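To make the MCAR condition concrete, here is a minimal simulation sketch on synthetic data. The variables `x` and `y`, the 30% missing rate, and the logistic missingness rule are all assumptions chosen for illustration. It compares the complete-case mean of `y` when values are removed completely at random against the case where removal depends on `x`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_obs = 100_000

# Two correlated synthetic variables; the true mean of y is 0.
x = rng.normal(size=n_obs)
y = 0.8 * x + rng.normal(scale=0.6, size=n_obs)
full = pd.DataFrame({"x": x, "y": y})

# MCAR: 30% of y is removed regardless of any value in the data.
mcar = full.copy()
mcar.loc[rng.random(n_obs) < 0.30, "y"] = np.nan

# Not MCAR: y is more likely to be removed when x is large.
p_missing = 1 / (1 + np.exp(-2 * x))  # between 0 and 1, increasing in x
not_mcar = full.copy()
not_mcar.loc[rng.random(n_obs) < p_missing, "y"] = np.nan

# Complete-case (listwise) estimates of the mean of y.
mean_full = full["y"].mean()
mean_mcar = mcar.dropna()["y"].mean()
mean_not_mcar = not_mcar.dropna()["y"].mean()
print(round(mean_full, 3), round(mean_mcar, 3), round(mean_not_mcar, 3))
```

Under MCAR the complete-case mean stays close to the full-data mean, and the cost is mainly a smaller sample. When missingness depends on `x`, the rows that remain are no longer representative and the estimate shifts downward.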

Listwise deletion is employed in most regression and supervised learning methods, as well as in Principal Component Analysis (PCA).

PAIRWISE DELETION / AVAILABLE CASE METHOD

In contrast with listwise deletion, the available case method uses all available observations. That is, if a variable is missing for an observation, a technique that needs that variable skips only the missing value and not the entire observation.

For example, if an observation above in our dataframe does not contain a value for "final weight", then measures/metrics or parameters that require the final weight value would not be calculated for that observation. Everything else will still continue to make use of that observation.

This method is so underappreciated that most do not recognize it as the method employed in correlation analysis. To see this in action:
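The sketch below is a stand-in on a small made-up DataFrame with missing values, not the article's original snippet; the column names only mirror the dataset. It relies on the documented behavior of pandas' `DataFrame.corr()`, which excludes missing values pair by pair, and contrasts it with correlating only the complete rows.

```python
import numpy as np
import pandas as pd

# Toy data with missing values (numbers are invented for illustration).
toy = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 47.0, 51.0, 38.0],
    "fnlwgt": [226802.0, np.nan, 336951.0, np.nan, 198693.0, 104626.0],
    "hours_per_week": [40.0, 50.0, 40.0, 45.0, 30.0, 60.0],
})

# Pairwise deletion: each coefficient uses every row where both columns are present.
pairwise_corr = toy.corr()

# Listwise alternative: drop incomplete rows first, then correlate.
listwise_corr = toy.dropna().corr()

# Number of rows behind each pairwise coefficient.
observed = toy.notna().astype(int)
pair_counts = observed.T @ observed
print(pair_counts)
```

In this toy example the age and hours_per_week coefficient is based on five rows under pairwise deletion, while the listwise version is left with only the three complete rows for every pair.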

Notice that we used the original dataframe with missing values and a correlation can still be calculated.

In addition to correlation analysis, the pairwise method is used for factor analysis. For those who are calculating

AVAILABLE ITEM

The available item method is used for creating composite variables. Like the available case method, it uses all available information.

The available item method aggregates across correlated items by:

  1. First applying a standardization method, for example, z-score.
  2. After that, the transformed variables, instead of being added, are averaged for each observation.

Thus, a composite score can now be created.
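The two steps above can be sketched in a few lines of pandas. The item names and values below are hypothetical, and the sketch assumes the items are already known to be correlated and scored in the same direction.

```python
import numpy as np
import pandas as pd

# Three hypothetical items on different scales, each with some missing values.
items = pd.DataFrame({
    "item_a": [4.0, 2.0, np.nan, 5.0, 3.0],
    "item_b": [40.0, np.nan, 30.0, 50.0, 20.0],
    "item_c": [0.7, 0.3, 0.5, np.nan, np.nan],
})

# Step 1: z-score each item using only its observed values
# (pandas mean() and std() skip NaN by default).
z_scores = (items - items.mean()) / items.std()

# Step 2: average the standardized items per observation,
# again skipping whatever is missing for that row.
composite = z_scores.mean(axis=1)

# How many items each composite score is based on.
items_used = z_scores.notna().sum(axis=1)
print(pd.DataFrame({"composite": composite, "items_used": items_used}))
```

Averaging rather than summing is what keeps the scores comparable: a row with two observed items and a row with three both end up on the same z-score scale.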

Now, this is called a deletion method because it makes no attempt to replace missing values.

If you are planning to create composite scores, you can simply apply this algorithm.

CONCLUDING REMARKS

This article introduced the first category of techniques used in handling missing data – deletion.

The primary advantage of deletion is its simplicity, while the primary disadvantage is the loss of statistical power. But as we will see in the next article, the other category of techniques, namely imputation, has disadvantages of its own, to the point that in certain situations missing data experts would rather use deletion methods.

Lastly, applying any of the techniques mentioned above requires judgment guided by the researcher’s objectives, data collection methodology, and the underlying mechanism of missingness.

In the next article, we discuss imputation methods.

Handling "Missing Data" Like a Pro – Part 2: Imputation Methods

Handling "Missing Data" Like a Pro – Part 3: Model-Based & Multiple Imputation Methods

