
Data leakage is a sneaky issue that often plagues machine learning models. The term refers to information from the test set leaking into the training process: the model is trained on data it shouldn’t have access to, which leads to overly optimistic results during development and poor performance on unseen data. It’s like training a student for a test using the test answers: they’ll do great on that specific test, but not so well on others. The goal of machine learning is to create models that generalize and make accurate predictions on new, unseen data. Data leakage undermines this goal, so it’s important to be aware of it and guard against it. In this article, we’ll take a closer look at what data leakage is, its potential causes, and ways to prevent it, with practical examples using Python and scikit-learn, and cases from research.
Consequences of Data Leakage
- Overfitting. One of the most significant consequences of data leakage is overfitting. Overfitting occurs when a model fits the training data so closely that it fails to generalize to new data. When data leakage occurs, the model shows high accuracy on the train and test sets you used while developing it. However, once deployed, it will not perform as well, because it cannot generalize its classification rules to unseen data.
- Misleading Performance Metrics. Data leakage can also result in misleading performance metrics. The model may appear to have high accuracy because it has seen some of the test data during training. It’s thus very difficult to evaluate the model and understand its performance.
Data leakage before splitting
The first case we are presenting is the simplest one, but probably the most common: when preprocessing is performed before the train/test split.
You want to use a StandardScaler to standardize your data, so you load your dataset, standardize it, create a train and test set, and run the model. Right? Wrong.
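Here is a minimal sketch of that leaky workflow. The dataset (scikit-learn’s built-in breast cancer data) and the Logistic Regression model are placeholder choices for illustration, not necessarily the setup that produced the score below:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Leak: the scaler sees the whole dataset, including the future test rows
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The split happens *after* preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```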
0.745
The mean and standard deviation are computed on the whole column, so they include information from the test set. Using these statistics to standardize the training data means information from the test set leaks into training.
The solution: Pipelines
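A corrected sketch of the same placeholder setup, with the scaler wrapped in a pipeline so it is fit on the training set only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Split first, on the raw data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The scaler is fit on the training set only; at score time the test set
# is transformed using the training statistics
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
```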
0.73
In this version, a pipeline is used to encapsulate the preprocessing step, and the whole pipeline is fit and evaluated on the training set only. StandardScaler standardizes each feature by subtracting the mean and scaling to unit variance. When you call the fit method on the pipeline, the scaler learns the mean and standard deviation from the training set alone; at prediction time, the test set is transformed using those training statistics. This ensures that the test set never informs the preprocessing step, avoiding data leakage.
Data leakage when using cross-validation
The second example is a very common mistake that often goes unnoticed. Your dataset is unbalanced, and you’ve read that you should use oversampling to "fix" it. After some googling, you find SMOTE, an algorithm that uses nearest neighbors to generate synthetic samples of the minority class in order to balance the dataset. Let’s apply this technique to a dataset called credit_g, from the library PMLB.
The dataset is unbalanced, with a 70/30 ratio between the classes.
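A sketch of how the baseline can be computed, assuming pmlb’s fetch_data (PMLB names the class column 'target') and a plain Logistic Regression evaluated with 5-fold cross-validation; the exact model settings are assumptions:

```python
from pmlb import fetch_data
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Load credit_g from PMLB; the label column is named 'target'
df = fetch_data('credit_g')
X, y = df.drop(columns='target'), df['target']

# Baseline: no resampling, plain Logistic Regression in 5-fold cross-validation
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"ROC AUC score (baseline): {scores.mean():.2f} +/- {scores.std():.2f}")
```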
ROC AUC score (baseline): 0.75 +/- 0.01
As a baseline result, we show the AUC score without applying any transformation. Running a Logistic Regression model gives a mean ROC AUC score of 0.75.
Let’s now apply SMOTE.
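Here is a sketch of the leaky version: SMOTE is fit on the full dataset before cross-validation ever splits it. The model and fold settings are the same assumptions as in the baseline sketch:

```python
from pmlb import fetch_data
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

df = fetch_data('credit_g')
X, y = df.drop(columns='target'), df['target']

# Leak: oversampling is applied to the full dataset *before* cross-validation,
# so synthetic samples derived from test-fold rows end up in the training folds
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(y_res.value_counts())

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc')
print(f"ROC AUC score (with data leakage): {scores.mean():.2f} +/- {scores.std():.2f}")
```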
1 700
0 700
Name: target, dtype: int64
ROC AUC score (with data leakage): 0.84 +/- 0.07
After applying SMOTE you’re happy to see that the AUC score increased from 0.75 to 0.84! However, all that glitters is not gold: you just caused data leakage. In the code above, SMOTE was applied to the entire dataset before cross-validation, so synthetic samples generated from observations that end up in the validation folds leak into the training folds. This is a very common scenario that can trick beginners into thinking that SMOTE increased their model performance.
Let’s now take a look at a corrected version of the code, where SMOTE is applied only to the training data within each cross-validation fold.
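A sketch of the corrected approach, assuming imbalanced-learn’s Pipeline, which applies SMOTE only to the training portion of each fold and leaves the validation fold untouched:

```python
from pmlb import fetch_data
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

df = fetch_data('credit_g')
X, y = df.drop(columns='target'), df['target']

# SMOTE lives inside the pipeline, so it is refit on the training folds only;
# the validation fold of each split is never resampled
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('model', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(f"ROC AUC score: {scores.mean():.2f} +/- {scores.std():.2f}")
```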
ROC AUC score: 0.67 +/- 0.00
Applying SMOTE correctly actually made the model worse.
As Samuele Mazzanti highlighted in his article Your Dataset Is Imbalanced? Do Nothing!, oversampling is not necessary to handle unbalanced datasets.
Data leakage in time series
Time series data has unique characteristics that make it different from other types of data, which can lead to specific challenges when splitting the data, preparing features, and evaluating models. Here, we’ll elaborate on these challenges and suggest best practices to minimize data leakage in time series analysis.
Incorrect train-test split: In time series data, it’s essential to maintain the temporal order of observations when splitting the dataset into training and test sets. A random split can introduce leakage, as it may include future information in the training set. To avoid this, you should use a time-based split, ensuring that all data points in the training set come before those in the test set. You can also use techniques such as time-series cross-validation or walk-forward validation to assess your model’s performance more accurately.
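As a sketch, scikit-learn’s TimeSeriesSplit produces folds that respect temporal order; the data here is a placeholder:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder series: 100 observations in chronological order
X = np.arange(100).reshape(-1, 1)

# Each split trains only on observations that precede its test window
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    print(f"train: {train_idx.min()}-{train_idx.max()}  "
          f"test: {test_idx.min()}-{test_idx.max()}")
```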
Feature engineering: You should avoid using future information that wouldn’t be available at the time of prediction. For instance, calculating technical indicators, lagged variables, or rolling statistics should be done only using past data, not future data. To prevent data leakage during feature engineering, you can use techniques like applying time-based window functions, ensuring that the calculation window only includes data available up to the prediction time. This also applies to external data. Sometimes, time series models incorporate external data sources that may contain future information. Make sure that the indicators are lagged appropriately, so they don’t provide information from the future and always verify that external data sources maintain the same temporal order as your primary time series dataset.
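A sketch of leakage-safe feature engineering with pandas, using placeholder column names: lags and rolling statistics are shifted so that each row only sees values that were available before its own timestamp:

```python
import numpy as np
import pandas as pd

# Placeholder daily series
df = pd.DataFrame(
    {"price": np.random.default_rng(0).normal(100, 5, 60)},
    index=pd.date_range("2023-01-01", periods=60, freq="D"),
)

# Lagged value: yesterday's price, never today's
df["price_lag_1"] = df["price"].shift(1)

# Rolling mean computed on past values only: shift first, then roll
df["rolling_mean_7"] = df["price"].shift(1).rolling(window=7).mean()

print(df.head(10))
```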
Data leakage in image data
When you’re working with medical data there are often multiple images taken from the same patient. In this case, you can’t just split the dataset randomly to train a model, because you might accidentally end up with images from the same person in both the training and test sets. Instead, you need to use a per-subject split.
So what’s a per-subject split? It just means that you keep all the images from the same person together, either in the training or test set. This way, your model can’t cheat by learning from images of the same person in both sets.
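A sketch of a per-subject split using scikit-learn’s GroupShuffleSplit, with a hypothetical patient_id array aligned to the images; every image from a given patient lands on the same side of the split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder data: 12 images belonging to 4 patients (3 images each)
X = np.random.default_rng(0).random((12, 16))   # e.g. flattened image features
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])
patient_id = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

# All images sharing a patient_id end up in the same split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

print("train patients:", set(patient_id[train_idx]))
print("test patients: ", set(patient_id[test_idx]))
```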
A study examined the difference between random splits and per-subject splits. The authors tested both strategies on three different datasets and found that random splits lead to inflated test accuracy because of data leakage, while per-subject splits give more realistic results. The datasets used are the following:
- AIIMS dataset: Contains 18,480 2D OCT images of healthy and cancerous breast tissue from 45 subjects (22 cancer patients and 23 healthy subjects).
- Srinivasan’s dataset: An ophthalmology dataset with 3,231 2D OCT images of age-related macular degeneration (AMD), diabetic macular edema (DME), and normal subjects, including 15 subjects per class.
- Kermany’s dataset: A large open-access ophthalmology dataset featuring images from 5,319 patients with choroidal neovascularization (CNV), diabetic macular edema (DME), drusen, and normal retina images. The dataset is available in different versions, with variations in the number of images, organization, and data overlap between training and testing sets.
The results speak for themselves.

The models are evaluated using the Matthews Correlation Coefficient (MCC), defined as follows:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.
As you can imagine, when the data was split randomly, the scores were too good to be true. That’s because images from the same person look very similar, so the model had an easier time recognizing images from patients it had already seen in the training data. In the real world, we need models that can reliably identify diseases in new patients.
Data leakage is a common problem that can affect even the most skilled data scientists when building machine learning models. Next, we’ll take a look at another case from research. In 2017, Andrew Ng and his team published a groundbreaking paper titled "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." The paper introduced an algorithm that used deep learning to detect pneumonia in chest X-rays, achieving performance on par with expert radiologists. The following image is taken from the paper.

Do you notice something wrong?
In the first version of the study, the data was split randomly between the training and test sets. Since multiple scans from the same patient were included, this potential data leakage raised concerns about the reliability and generalizability of CheXNet’s results. The authors acknowledged the problem and later released a corrected version of the paper. The following image is taken from the corrected version.

Conclusion
Data leakage is a sneaky issue that can affect machine learning models at various stages of development. As we’ve explored in this article, it can lead to overfitting, misleading performance metrics, and ultimately, a model that doesn’t generalize well to unseen data. Whether you’re working with tabular data, time series, or images, it’s important to be aware of it to build successful models. Here are some key takeaways from the article:
- If your model suddenly starts performing too well after making some changes, it’s always a good idea to check for any data leakage.
- Avoid preprocessing the entire dataset before splitting it into training and test sets. Instead, use pipelines to encapsulate preprocessing steps.
- When using cross-validation, be cautious with techniques like oversampling or any other transformation. Apply them only to the training set in each fold to prevent leakage.
- For time series data, maintain the temporal order of observations and use techniques like time-based splits and time-series cross-validation.
- In image data or datasets with multiple records from the same subject, use per-subject splits to avoid leakage.
By keeping these points in mind, you’ll be better equipped to build more robust and accurate machine learning models.
Enjoyed this article? Get weekly Data Science interview questions delivered to your inbox by subscribing to my newsletter, The Data Interview.
Also, you can find me on LinkedIn.