
Data is the new fuel. However, raw data is cheap; we need to process it well to get the most value out of it. Even complex, well-structured models are only as good as the data we feed them. Thus, data needs to be cleaned and processed thoroughly in order to build robust and accurate models.
One of the issues that we are likely to encounter in raw data is missing values. Consider a case where we have features (columns in a dataframe) on some observations (rows in a dataframe). If we do not have the value for a particular row-column pair, then we have a missing value. We may have only a few missing values, or half of an entire column may be missing. In some cases, we can just ignore or drop the rows or columns with missing values. On the other hand, there might be cases in which we cannot afford to drop even a single row. In any case, handling missing values starts with exploring them in the dataset.
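To make the "drop rows or columns" option concrete, here is a minimal sketch on a small toy dataframe. The column names, values, and the 50% threshold are made up for illustration and do not come from the dataset used later in this post.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the movies dataset).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [2001, np.nan, 2010, 2015],
    "rating": [np.nan, np.nan, np.nan, 7.5],
})

# Option 1: drop every row that contains at least one missing value.
rows_dropped = toy.dropna(axis=0)

# Option 2: drop only the columns where more than half of the values are missing.
# The 0.5 threshold is an arbitrary choice for this sketch.
mostly_missing = toy.columns[toy.isna().mean() > 0.5]
cols_dropped = toy.drop(columns=mostly_missing)

print(rows_dropped.shape)
print(cols_dropped.columns.tolist())
```

Dropping rows here throws away most of the observations, which is exactly why it pays to look at where the missing values sit before choosing a strategy.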
Pandas provides functions to check the number of missing values in a dataset. The missingno library takes it one step further and shows the distribution of missing values through informative visualizations. Using the missingno plots, we can see where the missing values are located in each column and whether the missing values of different columns are correlated. Because it is very important to explore missing values before handling them, I consider missingno a highly valuable asset in the data cleaning and preprocessing steps.
In this post, we will explore the plots that missingno offers by going through some examples.
Let’s first try to explore a dataset about the movies on streaming platforms. The dataset is available on Kaggle.
import numpy as np
import pandas as pd
df = pd.read_csv("/content/MoviesOnStreamingPlatforms.csv")
print(df.shape)
df.head()

The dataset contains 16,744 movies and 17 features that describe each movie. The pandas isna function returns a boolean dataframe that marks the missing values, and we apply the sum function to it to see the number of missing values in each column.
df.isna().sum()
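Alongside the raw counts, the share of each column that is missing can be easier to compare across columns. The sketch below shows both on a toy dataframe; the values are invented, and only the column names are borrowed from this post.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) standing in for the movies dataframe.
toy = pd.DataFrame({
    "Age": ["18+", np.nan, np.nan, "7+"],
    "Runtime": [90.0, 120.0, np.nan, 95.0],
    "Title": ["A", "B", "C", "D"],
})

# isna() gives a boolean mask; sum() counts the True values per column,
# and mean() gives the fraction of True values per column.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

# Combine both into one table, most incomplete column first.
summary = pd.DataFrame({"count": missing_count, "share": missing_share})
summary = summary.sort_values("count", ascending=False)
print(summary)
```

The same calls work on df directly. The counts discussed next come from the df.isna().sum() call above.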

The "Age" and "Rotten Tomatoes" columns have lots of missing values. There are about six other columns with more than 200 missing values each. Let’s now use missingno to see if we can get a better intuition about the missing values.
import missingno as msno
%matplotlib inline
We imported the missingno library. The %matplotlib inline command renders visualizations within the Jupyter notebook. The first tool we use is the missing value matrix.
msno.matrix(df)

White lines indicate missing values. The "Age" and "Rotten Tomatoes" columns are dominated by white lines, as we expect. But there is an interesting trend in the other columns that have missing values: they mostly have missing values in the same rows. If a row has a missing value in the "Directors" column, it is likely to have missing values in the "Genres", "Country", "Language", and "Runtime" columns as well. This is highly valuable information when handling missing values.
Heatmaps are used to visualize correlation matrices, which show the correlation of values between different columns. The missingno library also provides a heatmap that shows whether the missing values in different columns are correlated.
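The pattern seen in the matrix can also be checked numerically with plain pandas. The sketch below uses a toy dataframe with invented values (only the column names mirror this post): it selects the rows where "Directors" is missing and measures how often other columns are missing in those same rows.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with missing values that partly co-occur.
toy = pd.DataFrame({
    "Directors": ["X", np.nan, "Y", np.nan, "Z"],
    "Genres": ["Drama", np.nan, "Comedy", np.nan, "Action"],
    "Runtime": [90.0, np.nan, 100.0, 80.0, 110.0],
})

# Keep only the rows where "Directors" is missing.
no_director = toy[toy["Directors"].isna()]

# Fraction of those rows in which each other column is missing too.
share_missing = no_director[["Genres", "Runtime"]].isna().mean()
print(share_missing)
```

A share close to 1 for a column means it is almost always missing together with "Directors", which matches what aligned white lines in the matrix suggest.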
msno.heatmap(df)

The darker the blue, the stronger the positive correlation, as indicated by the color bar on the right side. There are positive correlations at different levels between the "Directors", "Genres", "Country", "Language", and "Runtime" columns. The highest correlation, 0.8, is between "Language" and "Country". This confirms our intuition from the missing value matrix, as these columns have missing values in the same rows.
Another missingno tool is the bar plot of missing values.
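A similar correlation can be approximated by hand with pandas, which helps in reading the heatmap. The sketch below is an assumption about how such a number can be reproduced, not a copy of missingno's internals: it turns the dataframe into a missing/not-missing mask and correlates the mask columns. The toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); "Title" has no missing values at all.
toy = pd.DataFrame({
    "Country": ["US", np.nan, "UK", np.nan, "FR", "DE"],
    "Language": ["en", np.nan, "en", np.nan, "fr", np.nan],
    "Title": list("ABCDEF"),
})

# Boolean mask: True where a value is missing.
nullity = toy.isna()

# Columns that are never or always missing have zero variance,
# so their correlation is undefined; leave them out.
nullity = nullity.loc[:, nullity.any() & ~nullity.all()]

# Pearson correlation between the missingness indicators of each column pair.
nullity_corr = nullity.astype(int).corr()
print(nullity_corr.round(2))
```

A value near 1 means two columns tend to be missing in the same rows, while a value near 0 means the missingness of one says little about the other.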
msno.bar(df)

It shows bars that are proportional to the number of non-missing values and also provides the actual number of non-missing values. This gives us an idea of how much of each column is missing.
As we mentioned earlier, in order to handle missing values well, we need to understand the structure of the dataset in terms of missing values. It is not enough to just know the number of missing values. The plots of the missingno library help a lot in understanding missing values well. After this step, we can start thinking about how to handle missing values.
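The numbers behind such a bar plot can be obtained with pandas as well. As a rough sketch on toy data (the values are invented), count returns the non-missing values per column, and dividing by the number of rows gives the filled fraction.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "Age": ["18+", np.nan, np.nan, "7+"],
    "Language": ["en", "fr", np.nan, "en"],
    "Title": ["A", "B", "C", "D"],
})

# count() ignores missing values, so it gives the non-missing count per column.
non_missing = toy.count()

# Fraction of each column that is filled in.
completeness = non_missing / len(toy)
print(non_missing)
print(completeness)
```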
The following post provides a detailed guide on how to handle missing values with Pandas. Missingno and pandas can be used together in order to build a robust and efficient strategy to deal with missing values.
Thank you for reading. Please let me know if you have any missing values.