
How Data Leakage affects model performance claims


This year has seen several important scientific advancements enabled by machine-learning-driven research. Along with the enthusiasm came some concern about the reproducibility issues encountered in ML-based science. Several methodological problems have been identified, of which data leakage seems to be the most widespread. In general, data leakage skews results and leads to overly optimistic conclusions.

There are several different ways in which data leakage can occur. The objective of this post is to present some of the most commonly encountered types, along with a few tips about how to identify and mitigate them.

Image generated by the author using dreamstudio.ai

Data leakage can be defined as an artificial relationship between the target variable and its predictors that is unintentionally introduced through the data collection method or the pre-processing strategy.

The main sources of data leakage I will try to exemplify are:

  1. The improper separation between training and test datasets
  2. The usage of features that are not legitimate (proxy variables)
  3. The test set is not drawn from the distribution of interest

1. The improper separation between training and test datasets

Data scientists know that they need to divide their input data into train and test sets, train the model only on the training set, and compute evaluation metrics only on the test set. Mixing the two is a textbook error that most people know to avoid. However, the initial exploratory analysis is often performed on the complete data set. If this initial analysis also involves pre-processing and data-cleaning steps, it can become a source of data leakage.

Pre-processing steps that can introduce data leakage (a sketch of a leak-free workflow follows the list):

  • performing missing-value imputation or scaling before splitting the two sets. By using the complete data set to compute imputation parameters (mean, standard deviation, etc.), information that should not be available to the model during training leaks into the training set
  • performing under- or oversampling before splitting also leads to an improper separation between the training and test sets (oversampled records from the training data would also appear in the test set, leading to optimistic conclusions)
  • not removing duplicates from the data set before splitting. In this case, identical records can end up in both the training and test sets, leading to optimistic evaluation metrics.
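
Below is a minimal sketch of a split-first workflow using scikit-learn, with a small synthetic DataFrame standing in for real data (all column names are hypothetical). Deduplication happens before the split, and the imputer and scaler are fitted inside a Pipeline on the training portion only; any over- or undersampling should likewise be applied only to the training data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical numeric data with a binary target; replace with your own DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["f1", "f2", "f3"])
df.loc[rng.random(500) < 0.05, "f1"] = np.nan          # introduce some missing values
df["target"] = (df["f2"] + rng.normal(size=500) > 0).astype(int)

df = df.drop_duplicates()                               # deduplicate BEFORE splitting

X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The imputer and scaler are fitted on the training data only, so no test-set
# statistics (means, standard deviations) leak into training.
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```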

2. The usage of features that are not legitimate

Data leakage also occurs when the data set contains features that should not legitimately be used in modeling. An intuitive example is a feature that is a proxy for the outcome variable.

The Seattle Building Energy Benchmarking data set contains an example of such a variable. Seattle’s objective was to predict a building’s energy performance based on characteristics that are already publicly available, such as building surface, building type, property usage, date when it was built, etc. The dataset also contains Electricity and Natural Gas consumption values, along with the target variables Site Energy Use and GHG Emissions. Because electricity and natural gas consumption are highly correlated with the target variables, including them among the features when building a prediction model would yield very accurate results.

Correlation between some features and the target variables (Image by the author)

However, these features are just proxies for the output variable. They do not actually explain anything that common sense does not already tell us: buildings that use a lot of electricity will have a high energy usage overall.

If the Electricity usage values are available at prediction time, then predicting Site Energy Use becomes a trivial task and there is no actual need to build a model.

The example given here is simple but, in general, the judgment of whether to use a particular feature or not requires domain knowledge and can be problem specific.
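
One rough heuristic is to flag features whose correlation with the target is suspiciously high and then review them with domain knowledge. The sketch below assumes a pandas DataFrame; the column names in the commented usage are hypothetical, not the actual Seattle benchmarking schema.

```python
import pandas as pd

def flag_suspect_features(df: pd.DataFrame, target: str, threshold: float = 0.9) -> pd.Series:
    """Return numeric features whose absolute correlation with the target exceeds `threshold`."""
    corr = df.corr(numeric_only=True)[target].drop(target).abs()
    return corr[corr > threshold].sort_values(ascending=False)

# Example usage (hypothetical columns):
# suspects = flag_suspect_features(buildings_df, target="SiteEnergyUse")
# print(suspects)  # e.g. Electricity, NaturalGas -> review whether they are legitimate inputs
```

A high correlation alone does not prove a feature is a proxy; the check only narrows down which features deserve a closer look.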

3. The test set is not drawn from the distribution of interest

This particular source of data leakage can be a bit harder to exemplify but can be intuitively explained. We can divide it into several sub-categories:

  • Temporal leakage: if a model is used to make predictions about the future, then the test set should not contain any data that pre-dates the training set (otherwise the model would be built on data from the future); see the sketch after this list
  • Non-independence between train and test samples: this problem arises frequently in the medical domain, where several samples are collected from the same patients over a period of time. It can be handled with specific methods such as block cross-validation, but it is a difficult problem in the general case, since not all the underlying dependencies in the data might be known
  • Sampling bias: choosing a non-representative subset of the dataset for evaluation. An example of such bias would be choosing only cases of extreme depression to evaluate the effectiveness of an anti-depressant and then making claims about its effectiveness for treating depression in general
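
The sketch below illustrates the first two points with hypothetical column names ("date" for the timestamp, "patient_id" for the grouping key): a cutoff date keeps future data out of training, and scikit-learn's GroupKFold keeps samples from the same patient within a single fold.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical data: a timestamp, a patient identifier, one feature, and a target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=400, freq="D"),
    "patient_id": rng.integers(0, 40, size=400),
    "feature": rng.normal(size=400),
    "target": rng.integers(0, 2, size=400),
})

# Temporal leakage: pick a cutoff date so the test set only contains data that
# comes strictly after everything the model was trained on.
cutoff = pd.Timestamp("2021-12-01")
train, test = df[df["date"] < cutoff], df[df["date"] >= cutoff]

# Non-independence: GroupKFold keeps all samples from the same patient in a
# single fold, so the model is never evaluated on a patient it has already seen.
X, y, groups = df[["feature"]], df["target"], df["patient_id"]
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # fit a model on (X_train, y_train) and evaluate on (X_test, y_test) here
```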

Conclusion

Data leakage can be introduced at various stages of the modeling pipeline, and detecting it is not always straightforward. The pre-processing steps and the train/test split method depend on the characteristics of the dataset and might require specific domain knowledge. As a general rule, if the obtained results seem too good to be true, there is a high chance of data leakage.

