
Data Science Superpowers: Missing Data

Imputing Missing Values with ML Predictions

Photo by Markus Spiske on Unsplash

Data isn’t always pretty…

…okay, let’s face it: data is almost never pretty.

Missing data is a very common occurrence, and there are numerous ways to address it. Sometimes dropping the entries with missing values is fine; other times imputing with the mean, median, or mode is a convenient strategy.
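
For reference, those conventional strategies are one-liners in pandas (a minimal sketch, assuming a DataFrame df with missing values in a year_built column):

# Option 1: drop rows containing any missing values.
df_dropped = df.dropna()
# Option 2: impute with a summary statistic (mean shown here;
# median and mode work the same way).
df['year_built'] = df['year_built'].fillna(df['year_built'].mean())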

But there’s another way!

DS Superpower: Filling missing values with ML predictions!

So you’re working on a project where you’re going to predict something (let’s say sale_price) and you encounter some missing data in a predictive column.

Before we predict for y, let’s predict for x!

Photo by Steven Kamenar on Unsplash

Example: Setup

Importing libraries.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

Create toy dataset.

# Set random_state variable.
R = 2020
# Set up phony names for features.
x_vars = [
    'year_built',
    'bedrooms',
    'bathrooms',
    'sqft',
    'lot_size'
]
y_var = ['sale_price']
# Make a toy dataset.
X, y = make_regression(
    n_samples=1_000,
    n_features=5,
    n_targets=1,
    random_state=R)
# Convert to pandas.
X = pd.DataFrame(X, columns=x_vars)
y = pd.DataFrame(y, columns=y_var)
# Set 30 random missing values in `year_built` column.
X.loc[X.sample(30, random_state=R).index, 'year_built'] = np.nan

We now have two Pandas DataFrames: X and y. The columns have been given names; they’re nonsense names with regard to the actual data values, but we’ll roll with it for demonstration.

Importantly, we’ve sprinkled some missing values throughout the year_built column. We want to fill in these values using ML, rather than either dropping them or using a mean.

>>> X.shape, y.shape
((1000, 5), (1000, 1))
>>> X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
year_built    970 non-null float64
bedrooms      1000 non-null float64
bathrooms     1000 non-null float64
sqft          1000 non-null float64
lot_size      1000 non-null float64
dtypes: float64(5)
memory usage: 39.2 KB

Things look to be in order with our data. So let’s get to the fun part!

Example: Process walk-through

It’s as simple as splitting out the predictive columns from the column with missing values, instantiating a model, fitting it, and setting its predictions into the DataFrame! Let’s take it step by step and then functionalize it.

  1. Determine which X-columns are going to be used to predict and which X-column is the new temporary target. In our case, we will use X_cols = ['bedrooms', 'bathrooms', 'sqft', 'lot_size'] and y_col = 'year_built'.
  2. Instantiate and fit model. I like using a Random Forest model because they are less prone to overfitting and can take data that isn’t scaled.
rf = RandomForestRegressor(random_state=R)
# Here `df` is our feature DataFrame from above (i.e., df = X).
# Fit only on the rows where `y_col` is not missing.
rf.fit(
    df[~df[y_col].isna()][X_cols],
    df[~df[y_col].isna()][y_col]
)
  3. Get predictions and set them into the DataFrame.
y_pred = rf.predict(df[df[y_col].isna()][X_cols])
df.loc[df[y_col].isna(), y_col] = y_pred
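
If we check the column afterwards, the missing values should be gone:

# Sanity check: `y_col` should no longer contain NaNs.
df[y_col].isna().sum()  # expect 0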

That’s how it works! Simple!

Of course, this process is screaming to be functionalized, so let’s do it! We should also add some verbosity and check how good the forest’s fit is.

Fill Missing Values with RF: Function

def fill_missing_values_with_rf(df: pd.DataFrame, 
                                X_cols: list,
                                y_col: str,
                                astype: type,
                                random_state: int,
                                verbose=True):
    """
    Replace missing values from `y_col` with predictions from a 
    RandomForestRegressor.
    """

    if not df[y_col].isna().sum():
        if verbose:
            print(f'No missing values found in `{y_col}`.')
        return df
    df = df.copy()

    # Instantiate and fit model.
    if verbose:
        print('Instantiating and fitting model...')
    rf = RandomForestRegressor(random_state=random_state)
    rf.fit(
        df[~df[y_col].isna()][X_cols],
        df[~df[y_col].isna()][y_col]
    )
    if verbose:
        print('\tModel fit.')
        # Note: this r^2 is measured on the training rows (in-sample fit).
        r2score = rf.score(
            df[~df[y_col].isna()][X_cols],
            df[~df[y_col].isna()][y_col]
        )
        print(f'\tModel `r^2`: {round(r2score, 3)}')

    # Get predictions.
    if verbose:
        print('Predicting values...')
    y_pred = rf.predict(df[df[y_col].isna()][X_cols])

    # Set values in df.
    if verbose:
        print('Setting values...')
    df.loc[df[y_col].isna(), y_col] = y_pred

    # Set dtype.
    df[y_col] = df[y_col].astype(astype)

    if verbose:
        print('Complete!')
    return df

We can now see the function in action:

>>> new_df = fill_missing_values_with_rf(
        X, 
        ['bedrooms', 'bathrooms', 'sqft', 'lot_size'],
        'year_built',
        float,
        R)
Instantiating and fitting model...
    Model fit.
    Model `r^2`: 0.844
Predicting values...
Setting values...
Complete!
>>> new_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
year_built    1000 non-null float64
bedrooms      1000 non-null float64
bathrooms     1000 non-null float64
sqft          1000 non-null float64
lot_size      1000 non-null float64
dtypes: float64(5)
memory usage: 39.2 KB

Looks great!

We walked through the well-established process of taking data with missing values and filling those values intelligently with what we do best: modeling!

This kind of Data Processing is certainly a DS Superpower. We can predict the future, the past, and the unknown – we should take advantage of that!

Filling missing data by any method should be performed mindfully.
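
One simple way to stay mindful is to compare the column’s summary statistics before and after imputation (a quick check using the X and new_df objects from above):

# Compare the imputed column's distribution to the original.
print(X['year_built'].describe())       # original (NaNs are skipped)
print(new_df['year_built'].describe())  # after RF imputation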

