Notes from Industry

As a consultant, I don’t always have control over the data I receive. Going back and forth with a client can only get you so far. At a certain point, you need to work with the data you have. That means I work with a lot of messy data, and have become intimately familiar with all different kinds of data leakage.
Data leakage can be extremely sneaky, especially if you don’t have a lot of insight into the data collection process. Often, data leakage occurs when data points are:
- only collected sometimes
- not collected in a standard way
- overwritten
- collected at different times
I realized early on that I would need a principled way of dealing with these situations. Whatever I came up with, it had to work on the training data, AND new future data, since my clients want daily predictions on new data.
The first idea that came to my mind was to throw the offending variables out. For some clients, this meant throwing out a lot of signal. I knew I could do better. I needed a way to use the information density that existed in the dataset, but get rid of the leaks.
What I settled on is an often overlooked statistical methodology that many data scientists have never heard of – Multiple Imputation by Chained Equations (MICE).
The Data
Let’s look at a classic example of data leakage using the breast cancer Wisconsin dataset. In our scenario, we work for a medical imaging lab. We have been tasked with coming up with a model that can predict whether a tumor is malignant. However, we receive data from multiple doctors, who run different tests, which provide different data. Sometimes, if a tumor is found to be malignant, a second series of tests is run, and more data is collected. Sometimes, it isn’t. This causes some of our fields to have missing values in much higher concentrations for non-malignant tumors. It’s a mess.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

random_state = np.random.RandomState(seed=0)

# Load data and split into training and new, "unseen" data.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X.drop([c for c in X.columns if "error" in c], axis=1, inplace=True)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=100, stratify=y, random_state=0
)
# Training data has missing values correlated with the target
# (tumor malignancy): non-malignant tumors (target = 1) are twice
# as likely to be missing each field.
X_train = X_train.copy()
for c in X_train.columns:
    ampute_ind = random_state.choice(
        X_train.index,
        size=100,
        p=(1 + y_train) / (1 + y_train).sum()
    )
    X_train.loc[ampute_ind, c] = np.nan
# New data does not have missing values correlated
# with the target.
X_new = X_new.copy()
for c in X_new.columns:
    ampute_ind = random_state.choice(
        X_new.index,
        size=10
    )
    X_new.loc[ampute_ind, c] = np.nan
# Check out the distribution of missing values
# grouped by the target
X_train.isnull().groupby(y_train).mean().transpose()
Our missing fields are highly correlated with whether or not the tumor is malignant. On the surface, this seems like a good thing: it should help us classify the tumors. However, this data is suffering from pretty severe data leakage. More tests were only run once we had actually determined whether the tumor was malignant. Therefore, our training data behaves fundamentally differently from the new data we will be predicting on.
If we were to run our model on this data right now, the model would unfairly classify samples with missing values as non-malignant far more often than it should. We have a conundrum – we have a high information density in our dataset, with a lot of valid data. We need a way to get rid of this leakage, while still keeping all the information we can.
Problems with Mean/Median Imputation
A common way to deal with missing data is to simply impute each point with the mean or some other constant. This is usually not a good way to fix data leakage: any predictive model capable of modeling non-linear relationships will pick up on this pattern, and the unwanted information will still leak through. If we imputed X_train with each variable’s mean and then ran our model, we would find that it unfairly classifies samples with mean-imputed values as non-malignant. We wouldn’t be any better off.
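To make that failure mode concrete, here is a minimal sketch of what mean imputation would look like on our training data (the SimpleImputer approach and the X_train_mean name are mine, not part of the pipeline above):
from sklearn.impute import SimpleImputer
import pandas as pd

# Replace every missing value with its column mean (what we want to avoid).
mean_imputer = SimpleImputer(strategy="mean")
X_train_mean = pd.DataFrame(
    mean_imputer.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)
# A tree-based model can split almost exactly at the imputed constant,
# so "value equals the column mean" becomes a stand-in missingness flag
# and the target still leaks through.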

Multiple Imputation
Multiple imputation is an iterative method for modeling each variable as a function of the other variables in a dataset.
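Conceptually, each iteration cycles through the variables, fits a model of each one on all the others, and refreshes that variable’s missing entries with predictions. A rough hand-written sketch of a single pass (not the miceforest implementation) might look like this:
from sklearn.ensemble import RandomForestRegressor

def mice_iteration(df, missing_mask):
    # `df` has already been given starting values (e.g. random draws from
    # the observed data); `missing_mask` records which cells were missing.
    df = df.copy()
    for col in df.columns:
        miss = missing_mask[col]
        if not miss.any():
            continue
        other_cols = [c for c in df.columns if c != col]
        model = RandomForestRegressor(n_estimators=50)
        # Fit on rows where this column was actually observed...
        model.fit(df.loc[~miss, other_cols], df.loc[~miss, col])
        # ...then overwrite the originally-missing entries with predictions.
        df.loc[miss, col] = model.predict(df.loc[miss, other_cols])
    return df
Real multiple-imputation software repeats this loop for several iterations and several datasets, and draws imputations from a predictive distribution rather than taking a single point prediction.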

The details of this algorithm are outside the scope of this article, but if you would like to know more, Stef van Buuren has written a great free online book, Flexible Imputation of Missing Data (https://stefvanbuuren.name/fimd/). Here, we impute our missing values with the [miceforest](https://github.com/AnotherSamWilson/miceforest) package, which uses lightgbm models to run the chained equations:
import miceforest as mf

kernel = mf.ImputationKernel(
    data=X_train,
    datasets=5,
    save_all_iterations=True,
    random_state=random_state
)
kernel.mice(2, verbose=True)
What we just did was train 200 random forests (20 variables x 5 datasets x 2 iterations) and use the predictions from those random forests, in an iterative fashion, to fill in the missing values of our dataset. We create multiple datasets because we can rarely be 100% sure of our imputed values, so we run the process several times to get uncertainty measures.
In general, MICE is an expensive process. However, there is a trick we can use to considerably decrease imputation time on new data, and make this method feasible to use in production: miceforest keeps track of the models trained for each variable, and we can simply reuse them to impute any new data we come across.
Building the Classifier
Now that we have our completed datasets and have eliminated the data leakage, we need to train our model. Since we created 5 imputed datasets, the question instantly becomes "which dataset should we use to train our model?"
The answer is "it depends". If you have enough computing resources, and want to be as robust as possible, ideally you would train one model on each imputed training dataset. In general, the more imputed datasets you can build, the better your understanding will be of how the imputations affect the variance of your predictions.
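If you do go the robust route, it only takes a few extra lines on top of the objects we already have. A sketch (not something we run in this article) could look like this:
import lightgbm as lgb

# Sketch: train one lightgbm model per imputed training dataset.
models = []
for d in range(kernel.dataset_count()):
    dset = lgb.Dataset(data=kernel.complete_data(d), label=y_train)
    models.append(
        lgb.train(
            params={"objective": "binary", "seed": 1, "verbosity": -1},
            train_set=dset
        )
    )
# At scoring time, every (model, imputed dataset) pair would contribute a
# prediction, widening the spread of scores you can report per sample.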
However, for simplicity, here we are just going to train 1 model on the first dataset, and use that model to get predictions from 5 different imputed datasets in production.
import lightgbm as lgb

dtrain = lgb.Dataset(
    data=kernel.complete_data(0),
    label=y_train
)
lgb_model = lgb.train(
    params={"objective": "binary", "seed": 1, "verbosity": -1},
    train_set=dtrain
)
Alright, we are getting pretty close. We have eliminated our data leakage, kept the information density in the dataset, and trained a model that can predict whether a tumor is malignant. The only thing left is to build the inference part of our pipeline.
Scoring the New Data
Every day, doctors will submit their test results, and we need to return a prediction. Before we can score our new data, we need to impute its missing values. The completed training data our model was fit on had no missing values, and neither should our new data. Thankfully, miceforest saves the models it created while running mice, and can impute our new data without having to re-train them:
# Get missing data imputations
new_data_imputed = kernel.impute_new_data(X_new)

# Get predictions for each imputed dataset
predictions = np.array([
    lgb_model.predict(
        new_data_imputed.complete_data(i),
        raw_score=True
    )
    for i in range(kernel.dataset_count())
]).transpose()
Our predictions array contains our 5 predictions for each sample in the new data. If you view this array, you will see that the predictions for a given sample are usually very close together. Our dataset has a very high information density, which allowed us both to impute our variables and to model our target with a high degree of accuracy. However, we can still see the negative effect that the missing data had:
import pandas as pd

sample_var = predictions.var(1)
sample_missing_value_count = X_new.isnull().sum(1).values
missing_metrics = pd.DataFrame({
    "sample_stdev": (sample_var ** 0.5).round(3),
    "missing_value_count": np.minimum(sample_missing_value_count, 4)
})
missing_metrics.groupby("missing_value_count").agg({
    "sample_stdev": ["mean", "count"]
})
Samples that had more missing variables tend to have higher variance in their predictions. This happens for two reasons:
- We had less real data to perform inference with
- There are more chances for the imputed values to differ from one dataset to the next

Finally, we need to decide how we will actually return a prediction. This entirely depends on the situation. Having uncertainty measures for each sample is pretty useful. We may wish to return the mean, as well as the 25th and 75th percentile for each sample. We may wish to just return the median.
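Whichever summary you choose, it is only a line or two on top of the predictions array. For example (the column names here are my own):
# Summarize the 5 per-dataset raw scores for each new sample.
prediction_summary = pd.DataFrame({
    "mean": predictions.mean(axis=1),
    "p25": np.percentile(predictions, 25, axis=1),
    "median": np.median(predictions, axis=1),
    "p75": np.percentile(predictions, 75, axis=1),
}, index=X_new.index)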
Conclusion
We have shown how Multiple Imputation by Chained Equations (MICE) can be used to eliminate data leakage. We have also shown how this method can be extended to production environments, and be used to obtain uncertainty measures around our resulting predictions.