
One of the tasks that you need to perform prior to training your machine learning model is data preprocessing. Data cleansing is one key part of the data preprocessing task, and usually involves removing rows with empty values, or replacing them with some imputed values.
To "impute" a value is to assign it to something by inference from the value of the products or processes to which it contributes. In statistics, imputation is the process of replacing missing data with substituted values.
In this article, I will show you how to use the SimpleImputer class in sklearn to quickly and easily replace missing values in your Pandas dataframes.
Loading the Sample Data
For this article, I have a simple CSV file (NaNDataset.csv) that looks like this:
A,B,C,D
1,2,3,'Good'
4,,6,'Good'
7,,9,'Excellent'
10,11,12,
13,14,15,'Excellent'
16,17,,'Fair'
19,12,12,'Excellent'
20,11,23,'Fair'
The following code snippet loads the CSV file into a Pandas DataFrame:
import numpy as np
import pandas as pd
df = pd.read_csv('NaNDataset.csv')
df
The DataFrame looks like this:

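If you do not have NaNDataset.csv on disk, here is a minimal sketch that inlines the same contents with io.StringIO (my own substitution for the file) and counts the missing values per column. The variable names csv_text and missing_per_column are just illustrative.

```python
import io
import pandas as pd

# Same contents as NaNDataset.csv, inlined so the sketch runs without the file.
csv_text = """A,B,C,D
1,2,3,'Good'
4,,6,'Good'
7,,9,'Excellent'
10,11,12,
13,14,15,'Excellent'
16,17,,'Fair'
19,12,12,'Excellent'
20,11,23,'Fair'
"""

df = pd.read_csv(io.StringIO(csv_text))

# Count the missing (NaN) values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Checking the counts first tells you which columns actually need a fill strategy.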
Replacing Missing Values
All the missing values in the dataframe are represented using NaN. Usually, you can either drop them, or replace them with some inferred values. For example, to fill the NaN in the B column with the mean, you can do something like this:
df['B'] = df['B'].fillna(df['B'].mean())
df
The empty values in column B are now filled with the mean of that column:

This is straightforward, but sometimes your fill strategy might be a little different. Instead of filling missing values with the mean of the column, you might want to fill them with the value that occurs most frequently. A good example is column D, where the most frequently occurring value is "Excellent".
To fill the missing value in column D with the most frequently occurring value, you can use the following statement:
df['D'] = df['D'].fillna(df['D'].value_counts().index[0])
df

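The two fillna() calls above can also be combined. The sketch below assumes a small DataFrame that mirrors columns B and D of the sample data (built by hand rather than loaded from the file) and passes a dictionary to fillna() so that each column gets its own fill value. The name fill_values is illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for columns B and D of the sample data.
df = pd.DataFrame({
    'B': [2, np.nan, np.nan, 11, 14, 17, 12, 11],
    'D': ['Good', 'Good', 'Excellent', np.nan,
          'Excellent', 'Fair', 'Excellent', 'Fair'],
})

# Map each column to its own fill value: mean for B, most frequent value for D.
fill_values = {
    'B': df['B'].mean(),
    'D': df['D'].mode()[0],   # mode() returns a Series; take the first entry
}
df = df.fillna(fill_values)
print(df)
```

Here mode()[0] plays the same role as value_counts().index[0] in the statement above.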
Using sklearn’s SimpleImputer Class
An alternative to using the fillna() method is to use the SimpleImputer class from sklearn. You can find the SimpleImputer class in the sklearn.impute module. The easiest way to understand how to use it is through an example:
from sklearn.impute import SimpleImputer
df = pd.read_csv('NaNDataset.csv')
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
imputer = imputer.fit(df[['B']])
df['B'] = imputer.transform(df[['B']])
df
You first initialize an instance of the SimpleImputer class by indicating the strategy (mean) as well as specifying the missing values that you want to locate (np.nan):
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
Once the instance is created, you use the fit() function to fit the imputer on the column(s) that you want to work on:
imputer = imputer.fit(df[['B']])
You can now use the transform() function to fill the missing values based on the strategy you specified in the initializer of the SimpleImputer class:
df['B'] = imputer.transform(df[['B']])
Take note that both the fit() and transform() functions expect a 2D array, so be sure to pass in a 2D array or dataframe. If you pass in a 1D array or a Pandas Series, you will get an error.
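To make the 2D requirement concrete, here is a small sketch using a hand-built stand-in for column B. It compares the shapes of df['B'] and df[['B']], shows that the 1D Series is rejected (a ValueError in the scikit-learn versions I am aware of), and reshapes a 1D NumPy array into a single column as one possible workaround.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'B': [2, np.nan, np.nan, 11, 14, 17, 12, 11]})
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)

print(df['B'].shape)     # (8,)   -> a Series is 1D
print(df[['B']].shape)   # (8, 1) -> a single-column DataFrame is 2D

# Passing the 1D Series is rejected.
try:
    imputer.fit(df['B'])
    rejected = False
except ValueError:
    rejected = True

# A 1D NumPy array can be reshaped into one 2D column before imputing.
values_2d = df['B'].to_numpy().reshape(-1, 1)
filled = imputer.fit_transform(values_2d)
```

The double square brackets in df[['B']] are what keep the selection two-dimensional.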
The transform() function returns the result as a 2D array. In my example, I assign the result back to column B:
df['B'] = imputer.transform(df[['B']])
The updated dataframe looks like this:

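As a variation on the steps above, the sketch below uses fit_transform() to fit and fill in one call, reads the learned fill value from the statistics_ attribute, and then reuses the fitted imputer on new rows. The new_rows DataFrame is hypothetical data I made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'B': [2, np.nan, np.nan, 11, 14, 17, 12, 11]})

imputer = SimpleImputer(strategy='mean', missing_values=np.nan)

# fit_transform() learns the column mean and fills the NaNs in one call.
# ravel() flattens the (8, 1) result so it can be assigned to one column.
df['B'] = imputer.fit_transform(df[['B']]).ravel()

# statistics_ stores the fill value learned for each fitted column.
learned_mean = imputer.statistics_[0]

# The fitted imputer fills new rows using the same learned mean.
new_rows = pd.DataFrame({'B': [np.nan, 5.0]})
new_filled = imputer.transform(new_rows)
```

Keeping fit() and transform() separate is what lets you learn the fill value from one dataset and apply it to another.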
Replacing Multiple Columns
To replace the missing values for multiple columns in your dataframe, you just need to pass in a dataframe containing the relevant columns:
df = pd.read_csv('NaNDataset.csv')
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
imputer = imputer.fit(df[['B','C']])
df[['B','C']] = imputer.transform(df[['B','C']])
df
The above example replaces the missing values in columns B and C using the "mean" strategy:

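When several columns are passed in, each column is filled with its own mean rather than one overall mean. The sketch below checks this on a hand-built copy of columns B and C. The column_means dictionary is just an illustrative way of labelling the values in statistics_.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'B': [2, np.nan, np.nan, 11, 14, 17, 12, 11],
    'C': [3, 6, 9, 12, 15, np.nan, 12, 23],
})

imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
df[['B', 'C']] = imputer.fit_transform(df[['B', 'C']])

# One learned fill value per column, in the order the columns were passed in.
column_means = dict(zip(['B', 'C'], imputer.statistics_))
print(column_means)
```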
Replacing using the median
Instead of using the mean of each column to update the missing values, you can also use the median:
df = pd.read_csv('NaNDataset.csv')
imputer = SimpleImputer(strategy='median', missing_values=np.nan)
imputer = imputer.fit(df[['B','C']])
df[['B','C']] = imputer.transform(df[['B','C']])
df
Here is the result:

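To see how the two strategies differ on this data, here is a sketch that fits one imputer per strategy on a hand-built copy of columns B and C and records the fill value each one picks. The fill_values dictionary is an illustrative helper, not part of sklearn.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'B': [2, np.nan, np.nan, 11, 14, 17, 12, 11],
    'C': [3, 6, 9, 12, 15, np.nan, 12, 23],
})

# Fit one imputer per strategy and record the fill value chosen per column.
fill_values = {}
for strategy in ['mean', 'median']:
    imputer = SimpleImputer(strategy=strategy, missing_values=np.nan)
    imputer.fit(df[['B', 'C']])
    fill_values[strategy] = dict(zip(['B', 'C'], imputer.statistics_))

print(fill_values)
```

The median is less affected by extreme values such as the 23 in column C, which is one common reason to prefer it.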
Replacing with the most frequent value
If you want to replace missing values with the most frequently occurring value, use the "most_frequent" strategy:
df = pd.read_csv('NaNDataset.csv')
imputer = SimpleImputer(strategy='most_frequent',
                        missing_values=np.nan)
imputer = imputer.fit(df[['D']])
df[['D']] = imputer.transform(df[['D']])
df
This strategy is useful for categorical columns (although it also works for numerical columns). The above code snippet returns the following result:

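Here is a self-contained sketch of the same idea on a hand-built copy of column D. I assume plain string labels stored with object dtype, which is my own simplification of the sample data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hand-built stand-in for column D, stored as generic Python objects.
df = pd.DataFrame({
    'D': ['Good', 'Good', 'Excellent', np.nan,
          'Excellent', 'Fair', 'Excellent', 'Fair'],
}, dtype=object)

imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
df[['D']] = imputer.fit_transform(df[['D']])

# The value chosen to fill the gap in column D.
most_frequent_value = imputer.statistics_[0]
print(most_frequent_value)
```

As far as I know, when two values are tied for most frequent, SimpleImputer uses the smallest of them, so it is worth checking statistics_ when the counts are close.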
Replacing with a fixed value
Another strategy you can use is replacing missing values with a fixed (constant) value. To do this, specify "constant" for strategy and specify the fill value using the fill_value parameter:
df = pd.read_csv('NaNDataset.csv')
imputer = SimpleImputer(strategy='constant',
                        missing_values=np.nan, fill_value=0)
imputer = imputer.fit(df[['B','C']])
df[['B','C']] = imputer.transform(df[['B','C']])
df
The above code snippet replaces all the missing values in columns B and C with 0’s:

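The "constant" strategy is not limited to numbers. The sketch below assumes a hand-built copy of column D with object dtype and fills its missing entry with a placeholder label. The 'Unknown' label is my own choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hand-built stand-in for column D, stored as generic Python objects.
df = pd.DataFrame({
    'D': ['Good', 'Good', 'Excellent', np.nan,
          'Excellent', 'Fair', 'Excellent', 'Fair'],
}, dtype=object)

# Fill missing categories with a fixed placeholder label.
imputer = SimpleImputer(strategy='constant',
                        missing_values=np.nan, fill_value='Unknown')
df[['D']] = imputer.fit_transform(df[['D']])
print(df)
```

A placeholder like this keeps the fact that a value was missing visible, instead of blending it into an existing category.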
Applying the SimpleImputer to the entire dataframe
If you want to apply the same strategy to the entire dataframe, you can call the fit() and transform() functions with the dataframe. When the result is returned, you can use the iloc[] indexer method to update the dataframe:
df = pd.read_csv('NaNDataset.csv')
imputer = SimpleImputer(strategy='most_frequent',
                        missing_values=np.nan)
imputer = imputer.fit(df)
df.iloc[:,:] = imputer.transform(df)
df
Another technique is to create a new dataframe using the result returned by the transform() function:
df = pd.DataFrame(imputer.transform(df.loc[:,:]),
                  columns = df.columns)
df
In either case, the result will look like this:

In the above example, the "most_frequent" strategy is applied to the entire dataframe. If you use the median or mean strategies, you will get an error as column D is not a numerical column.
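One possible way around this limitation is to split the columns by type and use a separate imputer for each group. The sketch below is my own workaround, not something built into SimpleImputer: it assumes a hand-built copy of columns B, C, and D and uses select_dtypes() to pick the numeric and non-numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'B': [2, np.nan, np.nan, 11, 14, 17, 12, 11],
    'C': [3, 6, 9, 12, 15, np.nan, 12, 23],
    'D': pd.Series(['Good', 'Good', 'Excellent', np.nan,
                    'Excellent', 'Fair', 'Excellent', 'Fair'], dtype=object),
})

# Split the column names by type.
numeric_cols = df.select_dtypes(include='number').columns
other_cols = df.select_dtypes(exclude='number').columns

# Mean for the numeric columns, most frequent value for everything else.
df[numeric_cols] = SimpleImputer(strategy='mean').fit_transform(df[numeric_cols])
df[other_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[other_cols])
print(df)
```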
Conclusion
In this article, I discuss how to replace missing values in your dataframe using sklearn’s SimpleImputer class. While you can also replace missing values manually using the fillna() method, the SimpleImputer class makes it relatively easy to handle missing values. If you are working with sklearn, it would be easier to use SimpleImputer together with Pipeline objects (more on this in a future article).
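As a small preview, here is a minimal sketch of what combining SimpleImputer with a Pipeline might look like. The toy arrays X and y and the choice of LinearRegression are assumptions I made purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: two features with a few missing entries, one target.
X = np.array([[1.0, 2.0],
              [4.0, np.nan],
              [7.0, 8.0],
              [np.nan, 11.0],
              [13.0, 14.0]])
y = np.array([3.0, 6.0, 9.0, 12.0, 15.0])

# The imputer fills the NaNs first, then the regressor sees complete data.
model = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('regressor', LinearRegression()),
])
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the imputer is part of the pipeline, the fill values learned during fit() are reused automatically whenever you call predict().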