values.

Missing values indicate that we lack information about a feature (column) for a particular observation (row). Why not just remove such observations from the dataset and move on? We can, but we should not. The reasons are:
- We typically have many features of an observation so we don’t want to lose the observation just because of one missing feature. Data is valuable.
- We typically have more than one observation with missing values. In some cases, we cannot afford to remove many observations from the dataset. Again, data is valuable.
In this post, we will go through how to detect and handle missing values as well as some key points to keep in mind.
The outline of the post:
- Missing value markers
- Detecting missing values
- Calculations with missing values
- Handling missing values
As always, we start with importing numpy and pandas.
import numpy as np
import pandas as pd
Missing value markers
The default missing value representation in Pandas is NaN, but Python’s None is also detected as a missing value.
s = pd.Series([1, 3, 4, np.nan, None, 8])
s

Although we created a series with integers, the values are upcast to float because np.nan is a float. Pandas 1.0 introduced a new representation for missing values, pd.NA, which can be used with integers without causing upcasting. We need to explicitly request the nullable integer dtype, pd.Int64Dtype().
s = pd.Series([1, 3, 4, np.nan, None, 8], dtype=pd.Int64Dtype())
s

The integer values are not upcasted to float.
Another missing value representation is NaT, which is used to represent missing values in datetime64[ns] data.
Note: np.nan values do not compare equal to each other, whereas None values do.
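A quick check of this comparison behavior:

```python
import numpy as np

# np.nan never compares equal, not even to itself
nan_equal = np.nan == np.nan    # False
# None compares equal to another None
none_equal = None == None       # True
print(nan_equal, none_equal)
```

This is why we use functions like isna() to detect missing values rather than comparing against np.nan directly.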

Note: Not all missing values come in the clean np.nan or None format. For example, the dataset we work on may contain "?" or "- -" in some cells. We can convert them to np.nan when reading the dataset into a pandas dataframe by passing these values to the na_values parameter.
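A minimal sketch of this, using an in-memory CSV in place of a hypothetical file with "?" and "- -" markers:

```python
import io
import pandas as pd

# hypothetical raw data where "?" and "- -" mark missing cells
raw = io.StringIO("col_a,col_b\n1,?\n- -,4\n")

df = pd.read_csv(raw, na_values=["?", "- -"])
print(df.isna().sum())  # one missing value in each column
```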

Detecting missing values
Let’s first create a sample dataframe and add some missing values to it.
df = pd.DataFrame({
    'col_a': np.random.randint(10, size=8),
    'col_b': np.random.random(8),
    'col_c': [True, False, True, False, False, True, True, False],
    'col_d': pd.date_range('2020-01-01', periods=8),
    'col_e': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C']
})
df.iloc[2:4, 1:2] = np.nan
df.iloc[3:5, 3] = np.nan
df.iloc[[1,4,6], 0] = np.nan
df

As we mentioned earlier, NaT is used to represent datetime missing values.
isna() returns the dataframe indicating missing values with booleans.

isna().sum() returns the number of missing values in each column.

notna() is the opposite of isna(), so notna().sum() returns the number of non-missing values in each column.

isna().any() returns a boolean value for each column: True if there is at least one missing value in that column.
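A minimal sketch pulling these detection methods together on a toy dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': [1.0, np.nan, 3.0],
                   'col_b': ['x', 'y', None]})

print(df.isna())         # boolean mask of missing cells
print(df.isna().sum())   # missing values per column: 1 and 1
print(df.notna().sum())  # non-missing values per column: 2 and 2
print(df.isna().any())   # True for both columns
```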

Calculations with missing values
Arithmetic operations between np.nan and numbers return np.nan.
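A quick scalar check of this behavior:

```python
import numpy as np

result = np.nan + 5
print(np.isnan(result))  # True: nan propagates through arithmetic
```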

df['sum_a_b'] = df['col_a'] + df['col_b']
df

Cumulative methods like cumsum and cumprod ignore missing values by default, but they preserve the positions of missing values in the result.
df[['col_a','col_b']].cumsum()

We can change this behavior by setting the skipna parameter to False.
df[['col_a','col_b']].cumsum(skipna=False)

The missing values are now included in the summation. Thus, all the values after the first NaN are also NaN.
The groupby function excludes missing values by default.
df[['col_e','col_a']].groupby('col_e').sum()
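To illustrate on a small hypothetical dataframe: in pandas 1.1 and later, groupby also accepts a dropna parameter that controls whether a missing group key gets its own group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_e': ['A', 'A', np.nan, 'B'],
                   'col_a': [1, 2, 3, 4]})

# default: the row whose group key is missing is excluded
print(df.groupby('col_e').sum())                # groups A and B only

# dropna=False keeps NaN as its own group (pandas >= 1.1)
print(df.groupby('col_e', dropna=False).sum())  # groups A, B and NaN
```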


Handling missing values
There are mainly two ways to handle missing values. We can either drop the missing values or replace them with an appropriate value. The better option is to replace missing values but in some cases, we may need to drop them.
Dropping missing values
We can drop rows or columns with missing values using the dropna() function. The how parameter sets the condition to drop:
- how='any' : drop if there is any missing value
- how='all' : drop if all values are missing
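A small sketch of both options on a toy dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 3.0, np.nan]})

print(df.dropna(how='any'))  # keeps only the first row
print(df.dropna(how='all'))  # drops only the last row (all values missing)
```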
Let’s first modify our dataframe a little:
df.iloc[7,:] = np.nan
df

how='any' will drop all rows except for the first and sixth:

how='all' will only drop the last row:

Note: In order to save these changes in the original dataframe, we need to set the inplace parameter to True.
Using the thresh parameter, we can require a minimum number of non-missing values for a row/column to be kept. dropna also operates column-wise if the axis parameter is set to 1.
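A short sketch of both parameters, assuming a small toy dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                   'b': [1.0, 2.0, np.nan, np.nan],
                   'c': [1.0, 2.0, 3.0, 4.0]})

# keep rows that have at least 2 non-missing values
print(df.dropna(thresh=2))
# keep columns that have at least 3 non-missing values
print(df.dropna(thresh=3, axis=1))
```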
Replacing missing values
The fillna() function of Pandas conveniently handles missing values. Using fillna(), missing values can be replaced by a special value or an aggregate value such as the mean or median. Furthermore, missing values can be replaced with the value before or after them, which is pretty useful for time-series datasets.
We can select one value to replace all missing values in a dataframe, but a single value rarely makes sense for every column. Instead, we can create a dictionary mapping each column to a separate replacement value.

replacements = {'col_a':0, 'col_b':0.5, 'col_e':'Other'}
df.fillna(replacements)

We can use an aggregate function as the value to replace missing values:
df['col_b'].fillna(df['col_b'].mean())

We can also fill missing values with the values before or after them using the method parameter. ffill stands for "forward fill" and replaces missing values with the value in the previous row. As the name suggests, bfill (backward fill) does the opposite. If there are many consecutive missing values in a column or row, we can use the limit parameter to cap how many of them are forward or backward filled.


All the missing values are filled with the values in the previous cell. Let’s limit the number of missing values to be filled to 1:

Let’s try filling the missing values with the bfill method:

The values at the end remain as missing because there are no values after them.
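The behavior above can be sketched on a short toy series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

print(s.fillna(method='ffill'))           # [1, 1, 1, 4, 4]
print(s.fillna(method='ffill', limit=1))  # [1, 1, NaN, 4, 4]
print(s.fillna(method='bfill'))           # [1, 4, 4, 4, NaN]
```

Note that newer pandas versions (2.1+) deprecate fillna(method=...) in favor of the dedicated s.ffill() and s.bfill() methods, which behave the same way.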
Bonus: Interpolation
interpolate fills missing values by interpolation, which is especially useful for sequential or time-series data. The default method is linear, but it can be changed using the method parameter; some available options are polynomial, quadratic, and cubic. Let’s do an example with linear interpolation.
s = pd.Series(np.random.random(50))
s[[4, 5, 9, 11, 18, 19, 33, 34, 46, 47, 48]] = np.nan
s.plot()

s.interpolate().plot()

Thank you for reading. Please let me know if you have any feedback.