
Introduction
I will begin this post with a "revelation": real-world datasets will not always mark missing values as NaN.
Yes, they can be anything. And before you lose a lot of time trying to figure out what to do, just go ahead and read this short post to pick up three good approaches to dealing with that kind of data.
First of all, let me introduce what missing data, or a null value, actually is.
A null value in a dataset is used when the value in a column is unknown or missing. (adapted from docs.microsoft.com)
In other words, whenever you see NA, NaN, ?, –, or anything else that does not represent an actual observation of that variable or column, you are looking at a missing or null value.
How You Will Find Them
Many times, yes, the null value will appear as the famous NaN in Python or a plain NA in R. In those cases, the value will not change the data type of the variable.
Other times, for many reasons, the missing data appears under a specific textual notation chosen by whoever created the dataset. One may use, for example, "No Value" as a proxy for the missing value, or symbols like #, –, ! or anything else. In those cases, the whole column gets coerced to text (a string type).
Thus, if you’re dealing with another data type, like numbers, it’s easier to notice that there’s a wrong value in that column. If you’re working with text, it’s slightly more complex, but we will see what to do.
Let’s see how to deal with each case.
The Common Missing Value NaN
First, I will create a toy dataset and import some libraries.
# imports
import pandas as pd
import numpy as np
# Create a dataframe
df = pd.DataFrame({'ID': range(1,31),
'value':[n if n % 5 != 0 else np.nan for n in range(1,31)],
'numbers': [n if n % 4 != 0 else '?' for n in range(1,31)],
'text': ['txt' if n % 3 != 0 else 'No Value!' for n in range(1,31)],
})
df.head()

Checking the data types, notice that the column value, where we have the NaN, keeps its type as float. But look at the column numbers: it has been coerced into a string type.
df.dtypes

You can use the methods isna() or isnull() to find missing values, but neither of them will find the missing values in the columns numbers or text, as those are textual placeholders inside columns identified (or coerced) by Pandas as text.
# Finding the total of null values
df.isna().sum()
# Or you can also use
df.isnull().sum()

For the missing data that Pandas already recognizes, you can find the rows by simply slicing the dataset.
# Pandas has good methods for that
df[ df.value.isna() ]

From here, you can just use the many techniques available to deal with missing data: deletion, imputation, interpolation, etc. I go over a few options in the article below.
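Just as a quick illustration, here is a minimal sketch of those three options applied to the column value (the choice of the mean as a fill value is only an example, not a rule):
# Deletion: drop the rows where value is missing
df_dropped = df.dropna(subset=['value'])
# Imputation: fill the NaNs with a statistic (the mean is just one example)
df_imputed = df.copy()
df_imputed['value'] = df_imputed['value'].fillna(df_imputed['value'].mean())
# Interpolation: estimate the NaNs from the neighboring points
df_interp = df.copy()
df_interp['value'] = df_interp['value'].interpolate()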
Numeric Data with Textual Missing Data
Now, for the other columns, one of the best first steps is to check the data types, as we did before with df.dtypes.
If we know which type a certain column should have (like we know that the column numbers should hold integers) but we got something different than expected, then we probably have a missing data point, or at least something to double-check in that column.
So, since I know that my column numbers must hold integers, I will go ahead and look for non-numeric data, meaning anything that is not a digit (0–9).
# Find the non-numeric entries: isnumeric() checks whether each value is made of digits
# You could also use .isdigit()
df[ df.numbers.astype(str).str.isnumeric() == False ]

Of course, there are cases where a number comes in as text just because Python could not identify the correct type, or because it carries an extra space (for example, ‘ 3’). In that case, it will either not appear as a missing value or show up in your list, but you will know it is actually correct, and you can go ahead and clean, format, or transform it to reflect the right data type.
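Once you have confirmed which entries are wrong, one possible fix (just a sketch, not the only way) is to coerce the column to a numeric type so the placeholders become regular NaN values that isna() can see:
# Sketch: coerce the column to numeric; anything that is not a valid number
# (like the '?' placeholder) becomes NaN
df['numbers'] = pd.to_numeric(df['numbers'], errors='coerce')
df.numbers.isna().sum()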

Textual Data with Missing Values
What do we do with textual data when we don’t know how the NAs are represented?
Many times, the NaN values come as a textual entry such as "NA", "?", "—", "!", "NULL", etc., but they can be many other things. So the idea is to use Regular Expressions to try to find them. Once you find the pattern, it becomes easier to identify the rest.
Let’s imagine we don’t know that the NA value for the column text is "No Value!". We will check for several patterns using RegEx.
In the code below, I apply a lambda function that looks for the following textual values in each row of the column text: "NA", "*", "?", "!", "#" or "-", and checks the length of the list of matches. If the length is different from zero, it means there is at least one of those patterns in that row. And you can include any other value you want in that RegEx.
# Import Regular Expressions lib
import re
# Using regex to find the NA values
df[ df.text.apply(lambda x: len(re.findall('NA|[*?!#-]', x)) != 0) ]
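From the rows returned above, we can see the placeholder is "No Value!". One way to handle it (again, just a sketch) is to turn it into a real NaN so it behaves like the other missing values:
# Sketch: replace the textual placeholder we just uncovered with a real NaN
df['text'] = df['text'].replace('No Value!', np.nan)
df.text.isna().sum()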

Before You Go
Knowing your data is always a must before you start an exploratory analysis. Look at the dataset documentation if it is available, such as a data dictionary or research summary. Those documents will tell you which missing-data cases exist and how they are identified.
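If the documentation already names the placeholders and your data comes from a file, a handy shortcut is to let Pandas translate them into NaN at load time (the file name and the list of placeholders below are just hypothetical examples):
# Sketch: convert known textual placeholders to NaN while reading the file
# 'my_data.csv' and the na_values list are hypothetical
df = pd.read_csv('my_data.csv', na_values=['No Value!', '?', '-'])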
For business data from your company, try to chat with the owner of that database; it will save you time and help you find the NAs quicker.
If you lack both of those options, then go for the three approaches shown above.
Missing data is not a complicated matter to deal with, but it can for sure hurt your work if not properly cleaned.
For more content, don’t forget to follow me.
Subscribe to Medium, if you’d like. You can use my referral code here.