
As we mentioned in the first article of this series on missing data, knowing the mechanism or structure of "missingness" is crucial because our response depends on it.
While the list of techniques for handling missing data keeps growing, below we discuss some of them, from the most basic to the most celebrated. These techniques include data deletion, constant and single imputation, model-based imputation, and many more.
Before we begin discussing them, please note that the application of these techniques requires discernment from the data scientist. Even if we can identify the mechanism of missingness, other information such as the data collection and methodology are needed to choose the most appropriate technique.
Because the range of techniques we will be discussing is quite numerous and comprehensive, let us split them up to make them more digestible. This part of the article focuses on deletion methods.
ADULT INCOME DATASET
To reinforce our understanding, let’s use a dataset, particularly the Adult Income Dataset from UCI.
Before we begin, it is important that you perform an EDA on this dataset. Knowledge of the variables' distributions will come in handy when choosing the appropriate technique to apply. See my article regarding this.
PRELIMINARIES
import numpy as np
import pandas as pd
from numpy import random  # NumPy's random module (used for the simulation below)
#For data exploration
import missingno as msno #A simple library to visualize the completeness of data
import matplotlib.pyplot as plt
%matplotlib inline
LOAD THE DATASET
df = pd.read_csv('data/adult.csv')
df.head()

SIMULATING MISSINGNESS
Let us simulate some missingness for some of our continuous variables: age and fnlwgt. Note that income here is the target variable and is a categorical variable.
#Random seed for reproducibility
random.seed(25)
#Percentage missing per variable
percentage_missing_1 = 0.15 #fnlwgt
percentage_missing_2 = 0.10 #age
#Number of observations
obs = df.shape[0]
#Simulation: multiply each value by a Bernoulli draw, then mark the
#zeroed-out entries as missing. (Safe here because neither age nor
#fnlwgt takes the value zero in the original data.)
df["fnlwgt"] = random.binomial(1, (1-percentage_missing_1), size=obs)*df['fnlwgt']
df["age"] = random.binomial(1, (1-percentage_missing_2), size=obs)*df['age']
df["fnlwgt"] = df["fnlwgt"].replace(0, np.nan)
df["age"] = df["age"].replace(0, np.nan)
msno.matrix(df)

We have the highest number of missing data for the "final weight" variable. This is consistent with our missing data simulation. Note that the mechanism we employed is simply MCAR.
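As a numeric complement to the matrix plot, the realized missingness fractions can be checked directly with isna(); this is sketched here on a small hypothetical frame, and on the article's df the same call applies to age and fnlwgt:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the article's df (hypothetical values)
toy = pd.DataFrame({
    "age": [25, np.nan, 47, 52, np.nan],
    "fnlwgt": [np.nan, 120000, np.nan, 90000, np.nan],
})

# Fraction of missing values per column
missing_frac = toy[["age", "fnlwgt"]].isna().mean()
print(missing_frac)  # age: 0.4, fnlwgt: 0.6
```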
Now that we have our dataset with missing data, we can proceed to examine how the different techniques affect the dataset and, consequently, the outcomes of models trained on it.
DATA DELETION METHODS
The simplest missing-data handling method seen across Data Science blogs (and even some published articles) is data deletion. But as we mentioned in our introduction, data deletion diminishes the effectiveness of our models, especially if the amount of missing data is significant.

LISTWISE DELETION (COMPLETE CASE METHOD)
As the name suggests, the listwise or complete case method drops an observation as long as even one value is missing. If applied carelessly, see how this reduces our observations:
df1 = df.copy()
df1.dropna(inplace=True)
df1.shape[0]
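The loss can also be quantified directly by comparing row counts before and after dropna(); here is a sketch on a small hypothetical frame (on the article's df this comes out to roughly 30%):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the article's df (hypothetical values)
toy = pd.DataFrame({
    "age": [25, np.nan, 47, 52, 39],
    "fnlwgt": [1.0, 2.0, np.nan, 4.0, 5.0],
})

# Share of rows removed by listwise (complete case) deletion
lost = 1 - toy.dropna().shape[0] / toy.shape[0]
print(f"{lost:.0%}")  # 40% on this toy frame
```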

We lost about 30% of our observations, which is a lot! This is expected, as dropna() applied to the entire dataset drops an observation whenever any column is missing.
Now, this method has a few advantages, such as its simplicity and efficiency, but note that it is only appropriate for datasets that are MCAR.
BEFORE APPLYING LISTWISE DELETION
Before deciding what to do with missing data, especially if you plan to apply listwise deletion, you need to identify relevant variables first for your study.
If a variable is not going to be needed, it does not matter whether that particular item is missing, and the variable should be excluded from the subset of the dataframe before applying listwise deletion.
For example: if we think that the final weight is irrelevant to our study (e.g. predicting income class) we can exclude it from our features dataframe.
df2 = df.copy()
df2 = df2.loc[:, df2.columns != 'fnlwgt'] #Drop the irrelevant column first
df2.dropna(inplace=True)
df2.shape[0]

With this additional step, we save an additional 6,182 observations from wasteful deletion. For some studies, this little step may be the difference in hitting the accuracy you are targeting.
Listwise deletion is primarily useful for MCAR data. As data is missing in a completely random way, assuming that we do not delete a substantial amount of observations, then we can assume that little to no information is lost by deletion.
Listwise deletion is employed in most regression and supervised learning methods, including Principal Component Analysis (PCA).
PAIRWISE DELETION (AVAILABLE CASE METHOD)
In contrast with listwise deletion, the available case method uses all available observations. That is, if a feature/variable for an observation is missing, a method or technique that uses this discards only the variable with missing information and not the entire observation.
For example, if an observation above in our dataframe does not contain a value for "final weight", then measures/metrics or parameters that require the final weight value would not be calculated for that observation. Everything else will still continue to make use of that observation.
This method is so underappreciated that most do not recognize it is the method employed in correlation analysis. To see this in action:
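A sketch of pairwise deletion in correlation analysis, using a small hypothetical frame in place of the article's df: pandas computes each pairwise entry from the rows where both columns are observed.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in different rows (hypothetical values)
toy = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 50.0, 60.0],
    "fnlwgt": [1.0, np.nan, 3.0, 4.0, 5.0],
    "hours": [35.0, 40.0, 45.0, np.nan, 55.0],
})

# pandas uses pairwise deletion by default: each (i, j) entry is
# computed from the rows where both column i and column j are non-missing.
corr = toy.corr(numeric_only=True)
print(corr)
```

Notice that no row had to be discarded as a whole: each correlation simply uses the observations available for its pair of columns.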

Notice that we used the original dataframe with missing values and a correlation can still be calculated.
In addition to correlation analysis, the pairwise method is used in factor analysis.
AVAILABLE ITEM METHOD
A method used for creating composite variables is the available item method. Like the available case method, it uses all available information.
The available item method aggregates across correlated items by:
- First applying a standardization method, for example, z-score.
- After that, the transformed variables, instead of being added, are averaged for each observation.
Thus, a composite score can now be created.
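The two steps above can be sketched as follows; the item columns here are hypothetical toy data, and scaling choices (e.g. the ddof used for the standard deviation) may differ in your setting:

```python
import numpy as np
import pandas as pd

# Toy frame of correlated items, with gaps (hypothetical values)
items = pd.DataFrame({
    "item1": [1.0, 2.0, np.nan, 4.0],
    "item2": [2.0, np.nan, 6.0, 8.0],
})

# Step 1: z-score each item (mean and std are computed on available values)
z = (items - items.mean()) / items.std()

# Step 2: average the standardized items per row, skipping missing entries
composite = z.mean(axis=1, skipna=True)
print(composite)
```

Because the row-wise mean skips NaNs, every observation receives a composite score from whatever items it has available, with no value replaced and no observation deleted.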
Now, this is called a deletion method because it makes no attempt to replace missing values.
If you are planning to create composite scores, you can simply apply this algorithm.
CONCLUDING REMARKS
This article introduced the first category of techniques used in handling missing data – deletion.
The primary advantage of deletion is its simplicity, while the primary disadvantage is the loss of statistical power. But as we will see in the next article, the other category of techniques, namely imputation, has disadvantages as well; in certain situations, missing data experts would rather use deletion methods.
Lastly, applying any of the techniques mentioned above requires judgment guided by the researcher's objectives, the data collection methodology, and the underlying mechanism of missingness.
In the next article, we discuss imputation methods.
Handling "Missing Data" Like a Pro – Part 2: Imputation Methods
Handling "Missing Data" Like a Pro – Part 3: Model-Based & Multiple Imputation Methods
REFERENCES
McKnight, P. E. (2007). Missing data: a gentle introduction. Guilford Press.