Pre-Process Data with Pipeline to Prevent Data Leakage during Cross-Validation

Kai Zhao
Towards Data Science
4 min read · Sep 4, 2019


image source: http://www.securitymea.com/2018/04/25/data-leakage-has-gone-up-by-four-times/

In machine learning, K-fold cross-validation is a frequently used validation technique for assessing how the results of a statistical model will generalize to an unseen data set. It is often used to estimate the predictive power of a model. Today I would like to discuss a common mistake made during cross-validation when applying a data preprocessor such as StandardScaler.

The Mistake

Let’s identify the mistake by reviewing the code below. The two sections for discussion are marked with the comments (1) and (2).
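A minimal sketch of the pattern in question; the breast cancer data set, the SVC estimator, the C grid, and the random_state are hypothetical choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# (1) Scale before cross-validation: fit on the train set, transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# (2) 5-fold cross-validation on the already-scaled train set
param_grid = {'C': [0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)  # every validation fold was scaled with statistics that include it

print(grid.best_params_, grid.score(X_test_scaled, y_test))
```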

In the first section (1), the train set is fit and transformed with StandardScaler, and the test set is only transformed (not fit). This seems correct because both the train set and the test set are normalized using statistics computed from the train set alone (the test set remains unseen).

The second section (2) shows the 5-fold cross-validation performed by GridSearchCV to select the best model, fit on the already-normalized train set. Here comes the problem, but let me first take a step back and briefly review how GridSearchCV performs cross-validation. GridSearchCV splits the train set into an inner train set and a validation set. For 5-fold cross-validation, the ratio of the inner train set to the validation set is 4:1. The process repeats five times so that every sample appears in the validation set exactly once (see figure below).

Figure: K-fold cross-validation (source: https://scikit-learn.org/stable/modules/cross_validation.html)
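The split sizes can be checked directly. A small sketch, assuming the same hypothetical train set as above (GridSearchCV defaults to stratified 5-fold splitting for classifiers):

```python
# Check the 4:1 split sizes produced by 5-fold cross-validation on the train set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for inner_idx, val_idx in StratifiedKFold(n_splits=5).split(X_train, y_train):
    print(len(inner_idx), len(val_idx))  # roughly 341 inner-train vs 85 validation samples
```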

Now let me discuss the problem in 4 steps:

  1. The purpose of cross-validation is to estimate predictive power, so each validation set should be treated as a temporary, unseen test set.
  2. StandardScaler should not be fit on this temporary test set. However, because we fit GridSearchCV with an already-scaled train set, the scaler has effectively been fit on every validation fold as well, so their information is leaked into the preprocessing.
  3. Because each temporary test set was included when the scaler was fit, its variance has effectively been seen and normalized away, so the resulting cross-validation score is optimistically biased (a short numeric illustration follows this list).
  4. GridSearchCV, which selects model parameters based on the cross-validation score, may therefore fail to find the best model for mitigating overfitting.
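To make step 3 concrete, here is a small numeric sketch, again assuming the breast cancer data set; it compares a scaler fit on all rows (including the held-out fold) with one fit on the inner train fold only:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# Take one 80/20 split of the data
inner_idx, val_idx = next(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

leaky_scaler = StandardScaler().fit(X)             # fit on everything, including the validation fold
proper_scaler = StandardScaler().fit(X[inner_idx]) # fit on the inner train fold only

# Mean absolute deviation of the scaled validation fold from zero mean:
# the leaky scaler typically leaves the fold looking closer to the training
# distribution than it really is, which is the source of the optimistic bias.
print(np.abs(leaky_scaler.transform(X[val_idx]).mean(axis=0)).mean())
print(np.abs(proper_scaler.transform(X[val_idx]).mean(axis=0)).mean())
```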

I noticed the importance of this issue during my capstone project for the Data Science Immersive Program I studied at General Assembly. The project was to build classification models to classify breast cancer cells using the classic University of Wisconsin breast cancer (diagnostic) data set from Kaggle. Part one of the project was to reproduce the model for breast cancer diagnosis discussed in Breast Cancer Diagnosis and Prognosis via Linear Programming, published by University of Wisconsin professors Olvi L. Mangasarian, W. Nick Street, and William H. Wolberg back in 1994. Because the data set is small (i.e., 569 samples), the predictive power of the model was estimated solely from the cross-validation scores (rather than doing a three-part split into train, dev, and test sets).

The Solution

Fortunately, there is a simple solution. By putting the preprocessors and the estimator into a Pipeline, we can avoid this undesired outcome from GridSearchCV. See the code below:
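A sketch of the corrected version, with the same hypothetical estimator and grid as before; note that parameters of pipeline steps are addressed as step-name__parameter (here svc__C):

```python
# The scaler now lives inside the Pipeline, so GridSearchCV re-fits it on the
# inner train folds only and merely transforms each validation fold.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

param_grid = {'svc__C': [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)  # the raw (unscaled) train set goes in

print(grid.best_params_, grid.score(X_test, y_test))
```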

To demonstrate how Pipeline properly solves the problem (fit_transform is applied only to the inner train set by StandardScaler), I would like to quote the amazing response from Shihab Shahriar (with his permission) to the question I posted on Stackoverflow.com. The response simplifies things further by looking only at the cross-validation part of GridSearchCV, using cross_val_score. The demonstration uses the breast cancer data set (569 samples) and proceeds in three steps; a combined sketch follows the steps below.

1. Subclass StandardScaler to print the size of the dataset sent to the fit_transform method.

2. Place the class in the pipeline and run it through cross_val_score for a 5-fold cross-validation.

3. The output shows that only 80% of the data set (569 × 80% ≈ 455 samples) is fit and transformed in each round of cross_val_score.
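Putting the three steps together, here is a sketch of the demonstration; the TracingStandardScaler class name is my own, and the SVC estimator is again a hypothetical choice:

```python
# Step 1: subclass StandardScaler so it reports how many samples fit_transform receives.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

class TracingStandardScaler(StandardScaler):
    def fit_transform(self, X, y=None, **fit_params):
        print(f"fit_transform called on {X.shape[0]} samples")
        return super().fit_transform(X, y, **fit_params)

# Step 2: place the subclass in a Pipeline and run 5-fold cross_val_score.
X, y = load_breast_cancer(return_X_y=True)  # 569 samples
pipe = Pipeline([('scaler', TracingStandardScaler()), ('svc', SVC())])
scores = cross_val_score(pipe, X, y, cv=5)

# Step 3: each printed line shows roughly 455 samples (about 80% of 569),
# confirming that only the inner train folds are fit and transformed;
# the validation folds are merely transformed.
print(scores)
```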

I would like to give my special thanks to Shihab for a solution that benefits the readers. The original question I posted on Stackoverflow.com focused on whether Pipeline works properly with preprocessors during cross-validation. Shihab not only gave the solution but also discussed the sensitivity of different estimators to the output of the preprocessor, which may also answer some of your other questions. If you are interested, click here for the entire discussion.

Conclusion

The takeaway from this blog post is that whenever data preprocessors such as StandardScaler and PCA (Principal Component Analysis) are needed for building a machine learning model, make sure to put them in a Pipeline for the cross-validation used in model tuning. Again, our ultimate goal is to build a machine learning model that will generalize to an unseen data set, so preventing data leakage during cross-validation is indeed important. I hope this post is helpful.


I am a data scientist with an engineering background and practical business experience in environmental & sustainability and financial investment.