
A Practical Guide on Missing Values with Pandas

A comprehensive guide to handling missing values.

Photo by Zach Lucero on Unsplash

Missing values indicate that we do not have information about a feature (column) of a particular observation (row). Why not just remove that observation from the dataset and move on? We can, but we should not. The reasons are:

  • We typically have many features of an observation so we don’t want to lose the observation just because of one missing feature. Data is valuable.
  • We typically have more than one observation with missing values. In some cases, we cannot afford to remove many observations from the dataset. Again, data is valuable.

In this post, we will go through how to detect and handle missing values as well as some key points to keep in mind.

The outline of the post:

  • Missing value markers
  • Detecting missing values
  • Calculations with missing values
  • Handling missing values

As always, we start with importing numpy and pandas.

import numpy as np
import pandas as pd

Missing value markers

The default missing value representation in Pandas is NaN, but Python’s None is also detected as a missing value.

s = pd.Series([1, 3, 4, np.nan, None, 8])
s

Although we created the series with integers, the values are upcast to float because np.nan is a float. Pandas 1.0 introduced a new representation for missing values, pd.NA, which can be used with integers without causing upcasting. We just need to explicitly request the nullable integer dtype, pd.Int64Dtype().

s = pd.Series([1, 3, 4, np.nan, None, 8], dtype=pd.Int64Dtype())
s

This time the integer values are not upcast to float, and the missing entries are displayed as <NA>.

Another missing value representation is NaT, which is used for missing values in datetime64[ns] data.
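For example:

pd.Series([pd.Timestamp('2020-01-01'), None])
# the None is converted to NaT:
# 0   2020-01-01
# 1          NaT
# dtype: datetime64[ns]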

Note: np.nan values do not compare equal to each other, whereas None values do.
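We can quickly verify this:

np.nan == np.nan  # False
None == None      # True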

Note: Not all missing values come in a nice and clean np.nan or None form. For example, the dataset we work on may include "?" and "- -" values in some cells. We can convert them to np.nan when reading the dataset into a pandas dataframe by passing them to the na_values parameter.
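For example, assuming a hypothetical file missing_data.csv that uses these markers:

# cells containing "?" or "- -" are parsed as np.nan
df = pd.read_csv('missing_data.csv', na_values=['?', '- -'])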


Detecting missing values

Let’s first create a sample dataframe and add some missing values to it.

df = pd.DataFrame({
    'col_a': np.random.randint(10, size=8),
    'col_b': np.random.random(8),
    'col_c': [True, False, True, False, False, True, True, False],
    'col_d': pd.date_range('2020-01-01', periods=8),
    'col_e': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C']
})
df.iloc[2:4, 1:2] = np.nan    # rows 2-3 of col_b
df.iloc[3:5, 3] = np.nan      # rows 3-4 of col_d
df.iloc[[1,4,6], 0] = np.nan  # rows 1, 4 and 6 of col_a
df

As we mentioned earlier, NaT is used to represent datetime missing values.

isna() returns a dataframe of booleans indicating missing values.

isna().sum() returns the number of missing values in each column.

notna() is the opposite of isna(), so notna().sum() returns the number of non-missing values in each column.

isna().any() returns a boolean value for each column. If there is at least one missing value in that column, the result is True.
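Applied to our dataframe, these checks look as follows:

df.isna().sum()   # number of missing values in each column
df.notna().sum()  # number of non-missing values in each column
df.isna().any()   # True for each column with at least one missing value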


Calculations with missing values

Arithmetic operations between np.nan and numbers return np.nan.

df['sum_a_b'] = df['col_a'] + df['col_b']
df

Cumulative methods like cumsum and cumprod ignore missing values by default but they preserve the positions of missing values.

df[['col_a','col_b']].cumsum()

We can change this behavior by setting the skipna parameter to False.

df[['col_a','col_b']].cumsum(skipna=False)

The missing values are included in the summation now, so all the values after the first NaN are also NaN.

The groupby function excludes missing values by default.

df[['col_e','col_a']].groupby('col_e').sum()

Handling missing values

There are mainly two ways to handle missing values: we can either drop them or replace them with an appropriate value. Replacing is usually the better option, but in some cases we may need to drop them.

Dropping missing values

We can drop a row or column with missing values using the dropna() function. The how parameter sets the condition for dropping:

  • how='any' : drop if there is any missing value
  • how='all' : drop if all values are missing

Let’s first modify our dataframe a little:

df.iloc[7,:] = np.nan
df

how='any' will drop all rows except for the first and sixth:
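df.dropna(how='any')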

how='all' will only drop the last row:
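df.dropna(how='all')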

Note: In order to save these changes in the original dataframe, we need to set the inplace parameter to True.

Using the thresh parameter, we can require a minimum number of non-missing values for a row/column to be kept. dropna() also works column-wise when the axis parameter is set to 1.
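For example:

df.dropna(thresh=4)           # keep rows with at least 4 non-missing values
df.dropna(axis=1, how='any')  # drop columns that contain any missing value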

Replacing missing values

The fillna() function of Pandas conveniently handles missing values. Using fillna(), missing values can be replaced with a specific value or an aggregate value such as the mean or median. Furthermore, missing values can be replaced with the value before or after them, which is pretty useful for time-series datasets.

We could select one value to replace all missing values in a dataframe, but that rarely makes sense since columns hold different kinds of data. Instead, we can create a dictionary indicating a separate replacement value for each column.

replacements = {'col_a':0, 'col_b':0.5, 'col_e':'Other'}
df.fillna(replacements)

We can use an aggregate function as the value to replace missing values:

df['col_b'].fillna(df['col_b'].mean())

We can also fill missing values with the values before or after them using the method parameter. ffill stands for "forward fill" and replaces missing values with the values in the previous row. As the name suggests, bfill (backward fill) does the opposite. If there are many consecutive missing values in a column or row, we can use the limit parameter to limit the number of missing values to be forward or backward filled.
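For example, forward filling our dataframe:

df.fillna(method='ffill')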

All the missing values are filled with the values in the previous cell. Let’s limit the number of missing values to be filled to 1:
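df.fillna(method='ffill', limit=1)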

Let’s try filling missing values with the bfill method:
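df.fillna(method='bfill')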

The values at the end remain as missing because there are no values after them.


Bonus: Interpolation

interpolate() fills missing values by interpolation, which is especially useful for sequential or time series data. The default method is linear, but it can be changed with the method parameter; some available options are polynomial, quadratic, and cubic. Let’s do an example with linear interpolation.

s = pd.Series(np.random.random(50))
s[[4, 5, 9, 11, 18, 19, 33, 34, 46, 47, 48]] = np.nan  # insert missing values
s.plot()                # original series with gaps
s.interpolate().plot()  # gaps filled by linear interpolation

Thank you for reading. Please let me know if you have any feedback.

