The Dreaded Antagonist: Data Leakage in Machine Learning

Probably one of the most underappreciated concepts in Machine Learning

Andreas Lukita
Towards Data Science

--

I have attended more than 5 Business Analytics and Machine Learning courses, both in-person and online. Surprisingly, only one of them briefly scratched the surface of data leakage.

Photo by Luis Tosta on Unsplash

When talking about data leakage outside the context of machine learning, we usually mean the scenario in which confidential information is transferred to a third party without proper security measures or permission, leading to a breach of privacy and security¹.

While the concept is somewhat similar, that is not quite what the term means in this field. Here is what it means in the world of machine learning:

Data leakage occurs when information from the test dataset is mistakenly included in the training dataset.²

The result? Unrealistically good performance metrics during training, but poor performance when the model is actually put to use.

In simpler terms, the model memorized information that it should not have access to, leading to artificially inflated performance metrics during training.

Still can’t wrap your head around it? Well, picture this. You are studying for your upcoming math exam. You do lots of practice questions to get better each day. Then, you find out that the exam questions have been accidentally leaked online. You have access to this critical information and decide to practice on this paper (you train yourself on a dataset that is not supposed to be known before the exam, and thus you “memorize” the pattern of the questions). The result? You become overly familiar with the test questions and get unrealistically good performance metrics on that piece of paper, but when you are actually put to use in the real world… (let’s not touch on that).

Table of Contents

  1. Target Leakage
  2. Train-Test Contamination and Leakage during Data Preprocessing
  3. Consequences of Data Leakage in Machine Learning
  4. Preventing Data Leakage: Manual Review
  5. Preventing Data Leakage: Pipeline is King
  6. Preventing Data Leakage: Cross-Validation
  7. Real-World Dataset Example: The Titanic Dataset
  8. Afterword

Target Leakage

Target Leakage might not be so easy or straightforward to recognize. Picture this: you are building a model to predict whether your customers will cancel their monthly subscriptions to your service (i.e., churn out). At first glance, including the “number of customer service calls” made by a customer as a feature in the model does not seem problematic at all; you may reason that more customer service calls are linked to a higher probability of churning out.

However, upon closer inspection, it turns out that the “number of customer service calls” is the result of customers churning out, rather than a contributing feature. Customers who have already decided to churn out merely call to settle any outstanding issues before eventually canceling their subscriptions. Hence, this information would not be available to us at the time of predicting whether a customer will churn out (in other words, we only know this information for customers who have already decided to churn out).

Including Target Variable as part of Feature Variables, or any proxy that is directly or indirectly derived from the Target Variable, could lead to data leakage.

Train-Test Contamination and Leakage during Data Preprocessing

These situations arise when preprocessing is fitted on the full dataset, test set included, rather than on the training data alone. When we carry out preprocessing steps such as feature scaling, imputing missing values, and removing outliers, we should make sure that we do not “learn” anything from the test dataset, as shown below³.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                   # learn the scaling parameters from the training data only
X_train = scaler.transform(X_train)   # apply them to the training data
X_test = scaler.transform(X_test)     # apply the same train-fitted scaling to the test data

Here, we split the dataset into training and test sets before the preprocessing steps, so that we fit the scaler on the training data only. Note that we should not fit on the entire dataset (both train and test), as this would result in data leakage: the model learns what it is not supposed to learn, i.e., information about the test dataset that would not be known at the time of prediction.

Fit on training data, transform on both training and testing data.

Consequences of Data Leakage in Machine Learning

The consequences of not detecting data leakage in a machine learning project are immense: they give false hope. Ever experienced a training performance that is ridiculously high while the test dataset performs very poorly? Data leakage might be the culprit. The keywords here are overfitting and the inability to generalize. The model has learned to memorize noise and irrelevant information, resulting in poor performance when faced with a real test dataset².

The eventual outcome?

You make an inaccurate model evaluation and unreliable predictions. What a waste of resources!

Preventing Data Leakage: Manual Review

Yes, we all get it. Manual review is inefficient and can be very time-consuming. However, putting in the time to study the relationship between features and the target variable is perhaps the most consistent way to detect data leakage, and hence well worth it. When a feature has a very high correlation with the target, for example, we should be skeptical and investigate the relationship further. Exploratory Data Analysis (EDA) can help uncover such correlations between features and target. Moreover, well-rounded domain knowledge and expertise help determine whether a feature should or should not be included in the model. Remember, when in doubt, always ask yourself this guiding question:

“Does this feature contain information that would not be available at the time of prediction?”

If the answer to the above question is a “yes”, then including that feature could result in data leakage.
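
As a quick first pass, correlation can surface candidates for that manual review. Below is a minimal sketch (assuming a pandas DataFrame df that already contains a numeric target column; both names are placeholders, not from the article) that ranks numeric features by absolute correlation with the target and flags near-perfect ones for closer inspection:

import pandas as pd

def suspicious_features(df: pd.DataFrame, target: str, threshold: float = 0.9) -> pd.Series:
    # Rank numeric features by absolute correlation with the target
    corr = (df
            .select_dtypes("number")
            .corr()[target]
            .drop(target)
            .abs()
            .sort_values(ascending=False))
    return corr[corr > threshold]  # near-perfect correlations deserve a closer manual look

A feature that shows up here is not automatically leaky, but it deserves the “would I know this at prediction time?” question before it goes into the model.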

Preventing Data Leakage: Pipeline is King

None of the Business Analytics and Machine Learning courses I attended ever mentioned building a machine learning preprocessing pipeline. The most common practice is to write spaghetti code all over the place without any standardized workflow. While this may feel familiar to many people, it is simply not best practice, one reason being the possibility of introducing data leakage into the model. I was first exposed to the idea of leveraging a pipeline by the book Data Cleaning and Exploration with Machine Learning⁴. Embedding each preprocessing step as an argument to the make_pipeline method, and separating the steps for numerical, categorical, and binary variables, are some of the key lessons I picked up from the book.

Simply put, a pipeline is a linear sequence of data preprocessing steps, executed one after another. Pipelines offer a clear and orderly chaining process for automating a machine learning project’s workflow. We can leverage the scikit-learn Pipeline class⁵, which takes a list of tuples as input, where each tuple represents a single step in the pipeline. The first element of each tuple is a string naming the step, and the second element is an instance of a scikit-learn transformer or estimator. There is also a shorthand alternative, make_pipeline, which does not require us to name the estimators (we are all lazy creatures). Remember, every intermediate step needs to have both fit and transform methods; only the final estimator can get away with fit alone.
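
As a minimal sketch (a toy two-step pipeline, not the article’s example), these two constructions are equivalent:

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Explicit naming with the Pipeline class: a list of (name, estimator) tuples
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())])

# The make_pipeline shorthand generates the step names for us
pipe = make_pipeline(StandardScaler(), LogisticRegression())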

Why bother with Pipeline?

Pipeline handles all the data preprocessing steps automatically. It also makes sure that each step is fitted only on the data passed to fit (for example, the training folds during cross-validation), which prevents data leakage and guarantees that the stages are carried out in the proper order.

Here’s an example: we want to preprocess the different data types in our dataset, namely numerical, categorical, and binary features, each with different steps. We can leverage make_pipeline to lay out the process in an orderly manner and let the Pipeline take care of all the jobs behind the scenes. This returns a Pipeline object with several attributes and methods we can call. For example, we could call fit(X_train, y_train) and score(X_test, y_test) to fit and evaluate the model, respectively.

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from feature_engine.encoding import OneHotEncoder
from preprocfunc import OutlierTrans #self-created Python class

# Numerical features: handle outliers, then standardize
standardtrans = make_pipeline(OutlierTrans(2),
                              StandardScaler())

# Categorical features: impute with the most frequent value, then one-hot encode
categoricaltrans = make_pipeline(SimpleImputer(strategy="most_frequent"),
                                 OneHotEncoder(drop_last=True))

# Binary features: impute with the most frequent value only
binarytrans = make_pipeline(SimpleImputer(strategy="most_frequent"))

# Route each group of columns to its own preprocessing sub-pipeline
columntrans = ColumnTransformer(transformers=[
    ("standard", standardtrans, numerical_cols),
    ("categorical", categoricaltrans, ['gender']),
    ("binary", binarytrans, ['completedba'])
])

lr = LinearRegression()
pipe = make_pipeline(columntrans, KNNImputer(n_neighbors=5), lr)

Preventing Data Leakage: Cross-Validation

This section is inspired by the book Data Cleaning and Exploration with Machine Learning⁴. Another thing I picked up from the book is marrying the concepts of Pipeline and cross-validation. Yes, they are not mutually exclusive! The selection of train and test datasets is critical and could lead to data leakage if not done correctly. When we do not perform cross-validation to evaluate our model, we run the risk of overfitting the training data and getting poor performance on new, unseen data. It could be by chance that the one-off train-test split we performed resulted in our model learning a specific feature exclusive to that split, which might not generalize.

Why bother with Cross-Validation?

Cross-validation gives us a more reliable estimate of how well our model will perform on brand-new, unseen data. By using cross-validation, we can test the effectiveness of our model on more than one subset of the data.

We can leverage scikit-learn K-fold CV to achieve this.

How does K-fold CV work, in short? The data is first divided into k equal-sized folds; the model is then trained on k-1 folds and tested on the remaining fold. Each fold serves as the testing set exactly once, so the operation is repeated k times. At the end, an estimate of the model’s performance is obtained by averaging the outcomes of the k iterations. A single train-test split can be thought of as just one iteration of this procedure: we train the model on the training portion and test it on the held-out portion.

The good news is, we can pick up from where we left off in the Pipeline.

from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import cross_validate, KFold

# Standardize the target as well by wrapping the whole pipeline
ttr = TransformedTargetRegressor(regressor=pipe, transformer=StandardScaler())

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(ttr,
                        X=X_train,
                        y=y_train,
                        cv=kf,
                        scoring=('r2', 'neg_mean_absolute_error'),
                        n_jobs=1)
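
The scores dictionary returned by cross_validate holds one array per metric, with one entry per fold; averaging them gives the cross-validated estimate. A minimal sketch continuing from the code above:

import numpy as np

print("Mean R²:", np.mean(scores['test_r2']))
print("Mean MAE:", -np.mean(scores['test_neg_mean_absolute_error']))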

Real-World Dataset Example: The Titanic Dataset

Titanic. Classic. The Titanic dataset is a classic machine learning problem, where we’re given a set of features for each passenger, such as their age, gender, ticket class, place of embarkation, and whether they had family members on board. Using these features, the goal is to train a machine-learning model to predict whether a passenger survived or not. Here is a short and quick version of the prediction without delving into hyperparameter tuning and feature selection.

The following code cleans up the raw dataset.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

df = pd.read_csv("../dataset/titanic/csv_result-phpMYEkMl.csv")

#Change column names, replace "?" with NaN, change data types
def tweak_df(df):
    features = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
    return (df
            .rename(columns={"id": "PassengerId", "'pclass'": "Pclass", "'survived'": "Survived", "'name'": "Name", "'sex'": "Sex", "'age'": "Age", "'sibsp'": "SibSp", "'parch'": "Parch", "'ticket'": "Ticket", "'fare'": "Fare", "'cabin'": "Cabin", "'embarked'": "Embarked"})
            [features]
            .replace('?', np.nan)
            .astype({'Age': 'float', 'Fare': 'float16'})
            )

#Splitting dataset into train, validation, test, and unseen
X_train_val_test, X_unseen, y_train_val_test, y_unseen = train_test_split(
    tweak_df(df).drop(columns=['Survived']), tweak_df(df).Survived, test_size=0.33, random_state=42)
X_train, X_val_test, y_train, y_val_test = train_test_split(
    X_train_val_test, y_train_val_test, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_val_test, y_val_test, test_size=0.5, random_state=42)

#Extensive cleanup
def tweak_titanic_cleaned(train_df):

    # Mean age grouped by whether the passenger has siblings/spouse or parents/children aboard
    impute_table = (train_df
                    .assign(SibSp=lambda df_: np.where(df_.SibSp==0, 0, 1),
                            Parch=lambda df_: np.where(df_.Parch==0, 0, 1))
                    .groupby(['SibSp', 'Parch'])
                    ['Age']
                    .agg('mean')
                    )

    train_df_intermediary = (train_df
                             .assign(SibSp=lambda df_: np.where(df_.SibSp==0, 0, 1),
                                     Parch=lambda df_: np.where(df_.Parch==0, 0, 1),)
                             )

    # Conditions for imputing Age, matched one-to-one with the means in impute_table
    condlist = [((train_df_intermediary.Age.isna()) & (train_df_intermediary.SibSp == 0) & (train_df_intermediary.Parch == 0)),
                ((train_df_intermediary.Age.isna()) & (train_df_intermediary.SibSp == 0) & (train_df_intermediary.Parch == 1)),
                ((train_df_intermediary.Age.isna()) & (train_df_intermediary.SibSp == 1) & (train_df_intermediary.Parch == 0)),
                ((train_df_intermediary.Age.isna()) & (train_df_intermediary.SibSp == 1) & (train_df_intermediary.Parch == 1)),]

    choicelist = [impute_table.iloc[0],
                  impute_table.iloc[1],
                  impute_table.iloc[2],
                  impute_table.iloc[3],]

    bins = [0, 12, 18, 30, 50, 100]
    labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']
    features = ["Survived", "Pclass","Sex","Fare","Embarked","AgeGroup","SibSp","Parch","IsAlone","Title"]

    return (train_df
            .assign(Embarked=lambda df_: SimpleImputer(strategy="most_frequent").fit_transform(df_.Embarked.values.reshape(-1,1)),
                    Age=lambda df_: np.select(condlist, choicelist, df_.Age),
                    IsAlone=lambda df_: np.where(df_.SibSp + df_.Parch > 0, 0, 1),
                    Title=lambda df_: df_.Name.str.extract(r',(.*?)\.'))
            .assign(AgeGroup=lambda df_: pd.cut(df_.Age, bins=bins, labels=labels),
                    Title=lambda df_: df_.Title.replace(['Dr', 'Rev', 'Major', 'Col', 'Capt', 'Sir', 'Lady', 'Don', 'Jonkheer', 'Countess', 'Mme', 'Ms', 'Mlle','the Countess'],
                                                        'Other'))
            .set_index("PassengerId")
            [features]
            )

Let me explain the idea behind the preprocessing steps I implemented:

  1. Impute the missing data in the Embarked column with the most frequent entry, using the SimpleImputer class from scikit-learn.
  2. Impute the missing data in the Age column with mean values conditioned on whether the person has family members onboard (i.e., if the person has no family members onboard, we impute the mean calculated from the other passengers who also have no family members onboard).
  3. Engineer a feature IsAlone to denote whether the passenger has any family members onboard.
  4. Engineer a feature Title to denote the title of the passenger.
  5. Engineer a feature AgeGroup to categorize passengers into 5 age groups.

Data Leakage 1st illustration: Including the target Survived as a feature

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#Intentionally add target variable to list of features
X_train = tweak_titanic_cleaned(pd.concat([X_train, pd.DataFrame(y_train)], axis=1))
X_val = tweak_titanic_cleaned(pd.concat([X_val, pd.DataFrame(y_val)], axis=1))
X_test = tweak_titanic_cleaned(pd.concat([X_test, pd.DataFrame(y_test)], axis=1))

# Prepare the training data (the leaked Survived column is dummy-encoded along with the rest)
X_train = pd.get_dummies(X_train, columns=["Survived", "Pclass", "Sex", "Embarked", "AgeGroup", "IsAlone", "Title"], drop_first=True)
X_val = pd.get_dummies(X_val, columns=["Survived", "Pclass", "Sex", "Embarked", "AgeGroup", "IsAlone", "Title"], drop_first=True)

# Scale numerical columns
scaler = MinMaxScaler()
num_cols = ["Fare","SibSp","Parch"]
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_val[num_cols] = scaler.transform(X_val[num_cols])

# Fit and evaluate Logistic Regression model
lr_model = LogisticRegression(random_state=0)
lr_model.fit(X_train, y_train)

# Make predictions on validation data
y_pred_val = lr_model.predict(X_val)

# Evaluate model on validation data
acc_val = round(accuracy_score(y_val, y_pred_val) * 100, 2)
print("Logistic Regression Model accuracy on validation data:", acc_val)

As you might have expected, including the target variable Survived as a feature effectively makes our model useless: it now reaches an accuracy of 100.0% on the validation data, so there is no point in doing any prediction. This mistake is easy to spot and therefore less common.
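
One quick way to spot this kind of leak is to inspect the largest model coefficients; the dummy column derived from the leaked Survived target should dominate. A minimal sketch, assuming the fitted lr_model and encoded X_train from above:

# Rank features by the magnitude of their logistic regression coefficients
coef = pd.Series(lr_model.coef_[0], index=X_train.columns)
print(coef.abs().sort_values(ascending=False).head())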

Data Leakage 2nd illustration: Mixing up records of training and testing data

If testing data is accidentally included in the training set, the model may be trained on this leaked information and thus perform unrealistically well on the test set. The following is an intentionally made-up example that leaks part of the validation set (used here as our test set) into the training set.

# Intentionally leak the first 150 validation rows (and their labels) into the training set
X_train = pd.concat([tweak_titanic_cleaned(X_train),
                     tweak_titanic_cleaned(X_val).iloc[:150, :]])
y_train = pd.concat([y_train,
                     y_val.iloc[:150]])

X_val = tweak_titanic_cleaned(X_val)

# Prepare the training data
X_train = pd.get_dummies(X_train, columns=["Pclass", "Sex", "Embarked", "AgeGroup", "IsAlone", "Title"], drop_first=True)
X_val = pd.get_dummies(X_val, columns=["Pclass", "Sex", "Embarked", "AgeGroup", "IsAlone", "Title"], drop_first=True)

# Scale numerical columns
scaler = MinMaxScaler()
num_cols = ["Fare","SibSp","Parch"]
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_val[num_cols] = scaler.transform(X_val[num_cols])

# Fit and evaluate Logistic Regression model
lr_model = LogisticRegression(random_state=0)
lr_model.fit(X_train, y_train)

# Make predictions on validation data
y_pred_val = lr_model.predict(X_val)

# Evaluate model on validation data
acc_val = round(accuracy_score(y_val, y_pred_val) * 100, 2)
print("Logistic Regression Model accuracy on validation data:", acc_val)

Data Leakage 3rd illustration: Wrong data preprocessing steps

Here, we are supposed to split the dataset into training and testing sets before the preprocessing steps. If we do the preprocessing before splitting, we may accidentally learn from the testing dataset, leading to overly inflated model performance.

# Prepare the training data
df_leaked = pd.get_dummies(tweak_titanic_cleaned(X_train_val_test), columns=["Pclass", "Sex", "Embarked", "AgeGroup", "IsAlone", "Title"], drop_first=True)

# Scale numerical columns
scaler = MinMaxScaler()
num_cols = ["Fare","SibSp","Parch"]
df_leaked[num_cols] = scaler.fit_transform(df_leaked[num_cols])

# Split the data into train, validation, and test sets
X_train, X_val_test, y_train, y_val_test = train_test_split(df_leaked, y_train_val_test, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=42)

The correct step is to fit_transform on the training dataset, and transform on the testing dataset.
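
For reference, here is a minimal sketch of the corrected order under the same setup: split first, then fit the scaler on the training portion only and reuse it on the held-out portion (dummy encoding and the remaining steps would follow the same rule).

# Split first, then fit preprocessing on the training portion only
X_train, X_val_test, y_train, y_val_test = train_test_split(
    tweak_titanic_cleaned(X_train_val_test), y_train_val_test, test_size=0.4, random_state=42)

scaler = MinMaxScaler()
num_cols = ["Fare", "SibSp", "Parch"]
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])    # learn min/max from training data
X_val_test[num_cols] = scaler.transform(X_val_test[num_cols])  # reuse them on held-out data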

Data Leakage 4th illustration: Including the feature Cabin in the model

This data leakage problem is not easy to spot and may need some domain knowledge to understand. The main question to ask is “Does this feature contain information that would not be available at the time of prediction?” If the answer to this question is a yes, then there is a high chance that data leakage is in play.

In this context, the “prediction” is being made after the passengers have already boarded the ship and the event has occurred. The goal is to predict whether a passenger would have survived based on the available data (i.e., passenger class, age, etc.) after the event has already taken place. At the time of prediction, cabin number information is not available, since cabins were only assigned to passengers after they boarded the ship.

Not every passenger had their cabin number recorded in the dataset; in fact, a large share of the values is missing. Even for passengers who do have a record, the cabin number may not be accurate or complete. Thus, we cannot use the cabin number as a predictor when developing a model to estimate survival, because it may not be correct or available for all passengers.

Imagine that we decide to use the cabin number as one of the predictors in the model. While training the model, it uses the cabin number to make predictions and gets really good at it. But when we try to use the model in the real world, we may not have the cabin number for all passengers, or the cabin number we have might be wrong. This means that even if the model was very accurate during training, it may not do well in a real-world deployment case.

One possible solution is to drop this feature and exclude it from the model building.
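
In code, this could be as simple as dropping the column before any further preprocessing. A minimal sketch (in this article, tweak_titanic_cleaned already excludes Cabin from its feature list, so this is purely illustrative):

# Drop the unreliable Cabin column before modeling
X_train = X_train.drop(columns=["Cabin"], errors="ignore")
X_val = X_val.drop(columns=["Cabin"], errors="ignore")
X_test = X_test.drop(columns=["Cabin"], errors="ignore")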

Afterword

Preventing data leakage is indeed a challenging task. Studying the relationship between features and the target variable is key to uncovering this problem. Next time you see an insanely high performance from your model, it may be better to sit back and observe, because not everything needs a reaction.

Thank you for reading, and happy modeling!

If you pick up something useful from this article, do consider giving me a Follow on Medium. Easy, 1 article a week to keep yourself updated and stay ahead of the curve!

You can connect with me on LinkedIn: https://www.linkedin.com/in/andreaslukita7/

References:

  1. Forcepoint. What is Data Leakage? Data Leakage Defined, Explained, and Explored. https://www.forcepoint.com/cyber-edu/data-leakage
  2. Analytics Vidhya. Data Leakage And Its Effect On The Performance of An ML Model. https://www.analyticsvidhya.com/blog/2021/07/data-leakage-and-its-effect-on-the-performance-of-an-ml-model/
  3. JFrog. Be Careful from Data Leakage — Potential Pitfalls in your Machine Learning Model. https://jfrog.com/community/data-science/be-careful-from-data-leakage-2/
  4. Data Cleaning and Exploration with Machine Learning by Michael Walker: https://www.packtpub.com/product/data-cleaning-and-exploration-with-machine-learning/9781803241678
  5. Scikit-learn Pipeline. https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline
  6. Titanic Dataset. https://www.openml.org/search?type=data&status=active&id=40945&sort=runs
  7. Data Science Dojo. Titanic Dataset (CSV). https://github.com/datasciencedojo/datasets/blob/master/titanic.csv
