
3 Approaches to Find Missing Values

Learn a good way to deal with different types of missing values

Photo by Ehimetalor Akhere Unuabona on Unsplash

Introduction

I will begin this post with a "revelation": real-world datasets will not always encode missing values as NaN.

Yes. They can be anything. And before you lose a lot of time trying to figure out what to do, go ahead and read this short post to learn 3 good approaches for dealing with those types of data.

First of all, I should define what missing data, or a null value, is.

A null value in a dataset is used when the value in a column is unknown or missing. (adapted from docs.microsoft.com)

The definition above means that whenever you see NA, NaN, ?, –, or anything else that does not represent an actual observation of that variable or column, you are looking at a missing (null) value.

How You Will Find Them

Many times, yes, the null value will appear as the famous NaN in Python or a simple NA in R. In those cases, the value will not change the data type of the variable.

Other times, for many reasons, missing data can appear under a specific textual notation determined by the person who created the dataset. One can use, for example, "No Value" as a proxy for the missing value. Or it can be a symbol like #, –, ! or anything else. In those cases, the variable's type will be coerced to string.

Thus, if you’re dealing with another data type, like numbers, it’s easier to notice that there’s a wrong value in that column. If you’re working with text, it’s slightly more complex, but we will see what to do.

Let’s see how to deal with each case.

The Common Missing Value NaN

First, I will create a toy dataset and import some libraries.

# imports
import pandas as pd
import numpy as np

# Create a dataframe
df = pd.DataFrame({'ID': range(1, 31),
                   'value': [n if n % 5 != 0 else np.nan for n in range(1, 31)],
                   'numbers': [n if n % 4 != 0 else '?' for n in range(1, 31)],
                   'text': ['txt' if n % 3 != 0 else 'No Value!' for n in range(1, 31)]
                   })
df.head()
Dataframe created. Image by the author.

Checking the data types, notice that the column value, where we have the NaN, keeps its type as float. But the column numbers has been coerced to string (object).

df.dtypes
Data types can change if the missing value is a string. Image by the author.

You can use the methods isna() or isnull() to find missing values, but neither will find the missing values in the columns numbers or text, as those are textual missing values within columns identified (or coerced) by Pandas as text.

# Finding the total of null values
df.isna().sum()
# Or you can also use
df.isnull().sum()
See that none will point to '?' or 'No Value!' as a missing value. Image by the author.
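As a side note, if you already know which textual markers stand in for missing data, you can declare them when loading the file and pandas will convert them to NaN for you. A minimal sketch (the CSV content and marker list below are assumptions for illustration):

```python
import io
import pandas as pd

# Hypothetical CSV where '?' and 'No Value!' stand in for missing data
raw = io.StringIO("ID,numbers,text\n1,10,txt\n2,?,No Value!\n3,30,txt")

# na_values tells read_csv to treat those strings as NaN on load
df = pd.read_csv(raw, na_values=['?', 'No Value!'])

print(df.isna().sum())      # numbers and text each report 1 missing value
print(df['numbers'].dtype)  # float64 -- no coercion to string
```

This way, isna() works out of the box and the numeric column keeps its numeric type.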

For the known missing data, find them by simply slicing the dataset.

# Pandas has good methods for that
df[ df.value.isna() ]
The 6 missing values from the 'value' column. Image by the author.

From here, you can use any of the many techniques available to deal with the missing data: deletion, imputation, interpolation etc. I go over a few options in the article below.
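As a minimal sketch of those three options on a toy column (the series below is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

dropped = s.dropna()          # deletion: rows with NaN are removed
imputed = s.fillna(s.mean())  # imputation: NaN replaced by the mean (3.25)
interp = s.interpolate()      # interpolation: NaN filled linearly (3.0 and 5.0)
```

Which option is right depends on your data and how much information you can afford to lose or estimate.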

Exploratory Data Analysis with Python – Part 1

Numeric Data with Textual Missing Data

Now for the other columns, one of the best ideas, at first, is to check the data types, as we did before df.dtypes.

If we know which type we should have in a certain column – like we know that the column numbers should have integers – but we got something different than expected, then we know we probably have a missing data point or something we should double check in that column.

So, since I know that my column numbers must have integers, I will go ahead and look for non-numeric data, like something different than a digit (0–9).

# Find non-numeric entries to assess if the data is a number.
# You could also use .str.isdigit()
df[ df.numbers.astype(str).str.isnumeric() == False ]
We have found the missing numbers : '?' - Image by the author.

Of course, there are cases where numbers come as text just because Python could not identify the correct type, or because a value carries an extra space (for example, ‘ 3’). Such a value will either not appear as missing, or it will show up in your list but you know it’s correct. Either way, you can go ahead and clean/format/transform it to reflect the correct data type.

Number 3 with an extra space was identified as "not digit". Image by the author.
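When the flagged values are valid numbers with formatting noise, one quick way to clean them is a sketch like this: strip the whitespace, then coerce the column to numeric so anything truly unparseable becomes NaN (the series below is illustrative):

```python
import pandas as pd

numbers = pd.Series(['1', '2', ' 3', '?', '5'])

# Strip spaces, then coerce: valid digits become numbers, '?' becomes NaN
clean = pd.to_numeric(numbers.str.strip(), errors='coerce')

print(clean)  # ' 3' is recovered as 3.0; '?' is now a real NaN
```

After that, isna() will find the remaining missing values directly.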

Textual Data with Missing Values

What to do with Textual data where I don’t know where the NAs are?

Many times, the NaN values will come as a textual entry like "NA", "?", "—", "!", "NULL" etc., but they can be many other things. So the idea is to use Regular Expressions to try to find them. Once you find the pattern, it becomes easier to identify the rest.

Let’s imagine we don’t know that the NA values for the column text are "No Value!". We will check for several patterns using RegEx.

In the code below, I apply a lambda function that looks for the following textual values in each row of the column text: "NA" or any of the symbols *, ?, !, #, – and checks the length of the list that re.findall returns. If the length is different from zero, I found something, so there is at least one of those patterns in that row. You can include any other value you want in that RegEx.

# Import Regular Expressions lib
import re

# Using regex to find the NA values
df[ df.text.apply(lambda x: len(re.findall('NA|[*?!#-]', x)) != 0) ]
Results found. All of the null values. Image by the author.
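Once the pattern is confirmed, you can turn the matching rows into real NaN values so isna() starts seeing them. A sketch using str.contains and mask (the series below is illustrative):

```python
import pandas as pd

text = pd.Series(['txt', 'No Value!', 'txt', '?'])

# mask() replaces entries where the condition is True with NaN by default
flagged = text.str.contains(r'NA|[*?!#-]', regex=True)
clean = text.mask(flagged)

print(clean.isna().sum())  # 2
```

From there, the standard missing-data tools apply to the column like any other.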

Before You Go

Knowing your data is always a must before you start an exploratory analysis. Look at the dataset documentation if available, like a data dictionary or research summary. Those documents will tell you how missing data is identified.

For business data from your company, try to chat with the owner of that database; it will save you time and help you find the NAs quicker.

If you lack both of those options, then go with the three approaches shown above.

Missing data is not a complicated matter to deal with, but it can certainly hurt your work if not properly cleaned.

For more content, don’t forget to follow me.

Gustavo Santos – Medium

Subscribe to Medium, if you’d like. You can use my referral code here.

Pandas Documentation.

