
In Python, function wrappers (also called decorators) are used to modify or extend the behavior of an existing function. They have a wide range of applications, including debugging, runtime monitoring, user login and access control in web development, plugins, and more. While they are typically applied in a software engineering context, function wrappers are also useful for data science and machine learning tasks. For example, both runtime monitoring and debugging with function wrappers can be used when developing data processing pipelines and machine learning models.
An interesting application of function wrappers in the space of data science and machine learning is data imputation. Data imputation is the task of inferring and replacing missing values in data. It can help decrease bias, increase efficiency in data analysis, and even improve the performance of machine learning models.
There are several well-known techniques for imputing missing values in a data set. The simplest is replacing all missing values with zero. This is limited because zero may not accurately reflect reality, and it won't necessarily reduce bias or increase efficiency in data analysis. In some cases it may actually introduce a significant amount of bias, especially when a large percentage of the values in a column are missing. Another method is replacing missing numerical values with the column mean. While this is better than imputing with zero, it still lends itself to bias, especially when a large percentage of the data in a column is missing.
Another approach is to build a machine learning model that predicts a missing value based on the values of the other columns in the data. This approach is ideal since, even when a large percentage of the data in a specific column is missing, inference based on the other columns should help reduce bias. It can be improved further by applying a machine learning model at the category level, which in theory can impute an entire column of missing values relatively well. Further, the approach should work better the more granular the category, and consequently the model.
For the first two approaches we can simply use the pandas fillna() method to fill missing values with zero, the mean, or the mode. For imputing missing values with a prediction we can use the IterativeImputer class in the scikit-learn package. Here we will look at how to use function wrappers to design a data imputation method for each of these approaches.
Here we will be working with the Wine Magazine dataset. The data is publicly free to use, modify, and share under a Creative Commons license (CC0: Public Domain).
For my analysis, I will be writing code in Deepnote, which is a collaborative data science notebook that makes running reproducible experiments very easy.
Imputing Missing Values with Zero
To start, let’s navigate to Deepnote and create a new project (you can sign-up for free if you don’t already have an account).
Let's create a project called 'data_impute' and, within this project, a notebook called 'imputer'.

Next let’s import the packages we will be working with:
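Based on what the code in this post uses, the imports would be, at minimum:

import functools

import numpy as np
import pandas as pd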
Now let's define the function we will use to impute missing values with zero. We will call it simple_imputation, and it will take a single parameter called input_function. We will also pass the input function to the wraps method in the functools module, placing it just above our actual imputation function, called simple_imputation_wrapper:
def simple_imputation(input_function):
    @functools.wraps(input_function)
    def simple_imputation_wrapper(*args, **kwargs):
Next, within the scope of simple_imputation_wrapper, we specify the logic for imputing missing values in the data frame returned by our input function:
def simple_imputation_wrapper(*args, **kwargs):
    return_value = input_function(*args, **kwargs)
    print(" - - - - - - - Before Imputation - - - - - - - ")
    print(return_value.isnull().sum(axis=0))
    return_value.fillna(0, inplace=True)
    print(" - - - - - - - After Imputation - - - - - - - ")
    print(return_value.isnull().sum(axis=0))
    return return_value
Our imputation function (simple_imputation_wrapper) is defined within the scope of our simple_imputation function, which must return it for the decorator to work. The full function is as follows:
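Assembled from the snippets above, with the inner wrapper returned at the end:

def simple_imputation(input_function):
    @functools.wraps(input_function)
    def simple_imputation_wrapper(*args, **kwargs):
        # call the wrapped function and capture the data frame it returns
        return_value = input_function(*args, **kwargs)
        print(" - - - - - - - Before Imputation - - - - - - - ")
        print(return_value.isnull().sum(axis=0))
        # replace every missing value with zero
        return_value.fillna(0, inplace=True)
        print(" - - - - - - - After Imputation - - - - - - - ")
        print(return_value.isnull().sum(axis=0))
        return return_value
    return simple_imputation_wrapper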
Next let's define a function that reads in our wines data set and returns a data frame containing our data. We decorate it with simple_imputation so the imputation logic runs whenever it is called:
@simple_imputation
def read_data():
    df = pd.read_csv("wines_data.csv", sep=";")
    return df
Now if we call our read_data function it will have the added behavior from the simple imputation method:
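Calling it, assuming wines_data.csv sits in the working directory, is just:

df = read_data()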
Imputing Missing Values with Mean & Mode
Next we will define a data imputation method that replaces missing numerical values with the mean and missing categorical values with the mode.
We will call our new function meanmode_imputation. It will also take input_function as an argument. Again we pass the input function to the wraps method in the functools module, placed just above our actual mean/mode imputation function, called meanmode_imputation_wrapper:
def meanmode_imputation(input_function):
    @functools.wraps(input_function)
    def meanmode_imputation_wrapper(*args, **kwargs):
Next, within the scope of meanmode_imputation_wrapper, we specify the logic for imputing missing values in the data frame returned by our input function. Here we iterate over the columns, imputing the mean if the column type is float and the mode if the column type is category:
def meanmode_imputation_wrapper(*args, **kwargs):
    return_value = input_function(*args, **kwargs)
    print(" - - - - - - - Before Mean/Mode Imputation - - - - - - - ")
    print(return_value.isnull().sum(axis=0))
    for col in list(return_value.columns):
        if return_value[col].dtype == float:
            return_value[col].fillna(return_value[col].mean(), inplace=True)
        elif return_value[col].dtype.name == 'category':
            return_value[col].fillna(return_value[col].mode()[0], inplace=True)
    print(" - - - - - - - After Mean/Mode Imputation - - - - - - - ")
    print(return_value.isnull().sum(axis=0))
    return return_value
The full function is as follows:
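Assembled from the snippets above, again with the wrapper returned at the end:

def meanmode_imputation(input_function):
    @functools.wraps(input_function)
    def meanmode_imputation_wrapper(*args, **kwargs):
        return_value = input_function(*args, **kwargs)
        print(" - - - - - - - Before Mean/Mode Imputation - - - - - - - ")
        print(return_value.isnull().sum(axis=0))
        # mean for float columns, mode for categorical columns
        for col in list(return_value.columns):
            if return_value[col].dtype == float:
                return_value[col].fillna(return_value[col].mean(), inplace=True)
            elif return_value[col].dtype.name == 'category':
                return_value[col].fillna(return_value[col].mode()[0], inplace=True)
        print(" - - - - - - - After Mean/Mode Imputation - - - - - - - ")
        print(return_value.isnull().sum(axis=0))
        return return_value
    return meanmode_imputation_wrapper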
We also need to modify our read_data function so that it takes a dictionary mapping column names to data types, with categorical columns typed as category and numerical columns as float. We do this by iterating over the column names and casting each column using the dictionary:
for col in list(df.columns):
    df[col] = df[col].astype(data_type_dict[col])
The full function is the following:
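With the type casting added and the new decorator applied, read_data becomes:

@meanmode_imputation
def read_data(data_type_dict):
    df = pd.read_csv("wines_data.csv", sep=";")
    # cast each column to the type specified in the dictionary
    for col in list(df.columns):
        df[col] = df[col].astype(data_type_dict[col])
    return df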
Next we need to define our dictionary of data type mappings:
data_type_dict = {
    'country': 'category',
    'designation': 'category',
    'points': 'float',
    'price': 'float',
    'province': 'category',
    'region_1': 'category',
    'region_2': 'category',
    'variety': 'category',
    'winery': 'category',
    'last_year_points': 'float',
}
And we can pass our dictionary to our read_data method:
df = read_data(data_type_dict)
The output shows the per-column missing-value counts before and after the mean/mode imputation.
Imputing Missing Values with IterativeImputer
For our final function wrapper we will use the IterativeImputer available in scikit-learn's impute module. The IterativeImputer uses an estimator to iteratively impute missing values in one column using the values in all of the other columns. The default estimator is Bayesian ridge regression, but this is a parameter that can be changed. Note that IterativeImputer is still experimental, so it must be explicitly enabled before it can be imported:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
Next, similar to the previous function wrappers, we define a function called iterative_imputation that takes an input function, calls the wraps method above our imputation wrapper, and defines that wrapper as iterative_imputation_wrapper. We also store the return value of our input function and print the number of missing values present before imputation:
def iterative_imputation(input_function):
    @functools.wraps(input_function)
    def iterative_imputation_wrapper(*args, **kwargs):
        return_value = input_function(*args, **kwargs)
        print("--------------Before Bayesian Ridge Regression Imputation--------------")
        print(return_value.isnull().sum(axis=0))
Next, within the scope of iterative_imputation_wrapper, we define data frames containing the numerical and categorical columns:
return_num = return_value[['price', 'points', 'last_year_points']]
return_cat = return_value.drop(columns=['price', 'points', 'last_year_points'])
We can then define our imputation model, with 10 iterations and a fixed random state for reproducibility. We will also use the default estimator, which is BayesianRidge:
imp_bayesian = IterativeImputer(max_iter=10, random_state=0)
We can then fit our model and impute the missing numerical values:
imp_bayesian.fit(np.array(return_num))
return_num = pd.DataFrame(np.round(imp_bayesian.transform(np.array(return_num))),
                          columns=['price', 'points', 'last_year_points'])
We will also continue imputing the categorical variables with the mode. It is worth noting that categorical values can also be imputed with a classification model (I will save this task for a future post):
for col in list(return_cat.columns):
    return_cat[col].fillna(return_cat[col].mode()[0], inplace=True)
return_value = pd.concat([return_cat, return_num], axis=1)
The full function is as follows:
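Assembled from the snippets; the closing report and the return statements follow the same pattern as the earlier wrappers:

def iterative_imputation(input_function):
    @functools.wraps(input_function)
    def iterative_imputation_wrapper(*args, **kwargs):
        return_value = input_function(*args, **kwargs)
        print("--------------Before Bayesian Ridge Regression Imputation--------------")
        print(return_value.isnull().sum(axis=0))
        # split the data into numerical and categorical columns
        return_num = return_value[['price', 'points', 'last_year_points']]
        return_cat = return_value.drop(columns=['price', 'points', 'last_year_points'])
        # fit the iterative imputer on the numerical columns and impute them
        imp_bayesian = IterativeImputer(max_iter=10, random_state=0)
        imp_bayesian.fit(np.array(return_num))
        return_num = pd.DataFrame(np.round(imp_bayesian.transform(np.array(return_num))),
                                  columns=['price', 'points', 'last_year_points'])
        # impute the categorical columns with the mode
        for col in list(return_cat.columns):
            return_cat[col].fillna(return_cat[col].mode()[0], inplace=True)
        return_value = pd.concat([return_cat, return_num], axis=1)
        print("--------------After Bayesian Ridge Regression Imputation--------------")
        print(return_value.isnull().sum(axis=0))
        return return_value
    return iterative_imputation_wrapper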
We can now place our iterative imputation decorator before our read_data method:
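The read_data function is unchanged apart from the decorator:

@iterative_imputation
def read_data(data_type_dict):
    df = pd.read_csv("wines_data.csv", sep=";")
    for col in list(df.columns):
        df[col] = df[col].astype(data_type_dict[col])
    return df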
And call our method as before:
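df = read_data(data_type_dict)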
Another way to improve this imputation method is to build an imputation model at a category level. For example, for each country build an estimator that imputes missing numerical values. I encourage you to play around with the code and see if you can make this modification. In a future post I will walk through how to build these category-level imputation models as well as explore additional imputation methods.
The code in this post is available on GitHub.
Conclusions
There is a wide variety of techniques that can be used for data imputation. The simplest method we covered is replacing missing values with zero. This method is not ideal, as it can introduce a great deal of bias, especially when a large number of values are missing. A better approach is to impute missing numerical values with the mean and missing categorical values with the mode. While this is an improvement over imputing with zero, it can be improved further using a machine learning model. Additionally, building an imputation model at the category level can provide further improvements.