Data Leakage in Machine Learning: How to Detect It and Minimize the Risk

Prerna Singh
Towards Data Science
8 min read · Jan 9, 2022


Introduction

We can define data leakage as:

“When the training data set contains information that will not be obtainable when the model is used for predictions, data leakage (or leaking) occurs. This results in great performance on the training dataset (and possibly even on the validation set), but a lack of performance in production.”

Data leakage, or simply leaking, is a term used in machine learning to describe the situation in which the data used to train a machine-learning algorithm contains unexpected extra information about the target you are trying to estimate. Leakage occurs when information about the target label or value is introduced during training that would not legitimately be available at prediction time. The most basic example of data leakage would be including the true label of a dataset as a feature in the model: if an object is labeled as an apple, the algorithm simply learns to predict that it is an apple.

Why does data leakage happen?

Data leakage can occur for a variety of reasons, often in subtle and difficult-to-detect ways. When it happens, it usually produces overly optimistic results during the model-building phase, followed by the unpleasant surprise of poor results once the model is deployed and tested on new data. In other words, leakage can lead you to train a suboptimal model that performs significantly worse in practice than a model built in a leak-free setting. Leakage can have a variety of real consequences, ranging from the financial cost of investing in something that does not work to system failures that damage users’ trust in your system’s reliability or in your employer’s products. As a result, data leakage is among the most significant and pervasive challenges in machine learning and statistics, and one that practitioners must be mindful of.

“So now we’ll go through what data leakage is, why it matters, how to recognize it, and how to prevent it in your real-world applications.”

In the world of data security, the term data leakage also refers to the unauthorized movement of information outside of a protected facility such as a data center. Given the importance of keeping information about the prediction target strictly separated from the training and model-development stages, this security metaphor is actually quite fitting for our machine learning setting.

Data Leakage in More Subtle Forms

Let’s take a look at some more subtle data leakage issues.

1st Example:

One typical example is when future information is incorporated into the training data that would not legitimately be available in real use. Assume you are working on a retail website and need to build a classifier to forecast whether a user will stay and read another page or leave. If the classifier predicts that they are about to leave, the website may display something that encourages them to stay and shop. The user’s overall session length, or the total number of pages viewed during their visit to the site, is an example of a feature that contains leaked information. During post-processing of the visit log data, this total is frequently added as a new column, so it encodes how many more pages the user will go on to visit, something that cannot be known at prediction time in a real-world deployment. The total session length feature could be replaced with a pages-visited-so-far feature, which only counts how many pages have been viewed up to the current point in the session, not how many are left.
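
To make this concrete, here is a minimal sketch (with a toy visit log and hypothetical column names, not taken from the article) of how a total-pages-per-session feature encodes future information, while a pages-so-far counter does not:

```python
import pandas as pd

# Hypothetical visit log: one row per page view, in chronological order.
log = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "page":       ["home", "shoes", "cart", "home", "jackets"],
})

# Leaky feature: total pages in the session, computed after the visit ended.
# At prediction time we cannot know how many pages the user will eventually view.
log["total_pages_in_session"] = log.groupby("session_id")["page"].transform("count")

# Safe alternative: pages viewed so far, known at the moment of each page view.
log["pages_so_far"] = log.groupby("session_id").cumcount() + 1

print(log)
```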

2nd Example:

The second example of leakage could be attempting to forecast whether a customer visiting a bank’s website will open an account. If the user’s record includes an account number field, it will be blank for users who are still browsing the site and filled in only after the user creates an account. Clearly, the account number field is not a viable feature to use in this situation, as it will not be available while the user is still browsing the site.

3rd Example:

Another example of future information leaking into the past arises if you are building a diagnostic test to predict a specific medical condition. A binary variable in the existing patient data set might indicate whether or not the patient had surgery for that condition. Clearly, such a feature would be an excellent predictor of the medical condition, but it is only known after the diagnosis has been made. There are many other ways predictive information may find its way into the feature set. It is possible that a certain combination of missing diagnosis codes is highly indicative of the condition; since that information is not available while the patient’s condition is still being examined, it would be illegitimate to use. Finally, another example with the same patient data is the patient ID itself. The ID may be assigned based on the diagnosis path taken: if the ID is the result of a referral to a specialist, issued because the initial doctor judged the medical condition to be likely, the ID itself carries information about the label.

This last example illustrates that data leakage can manifest itself in a training set in a variety of ways, and that multiple leakage issues are often present at the same time. Fixing one leaky feature can sometimes reveal the existence of another, for example.

Other Data Leakage Examples

Here are some more examples of data leakage to serve as a guide. Leakage can be classified into two categories: leakage in the training data, which occurs when test or future data is mixed in with the training data, and leakage in features, which occurs when something highly informative about the true label is included as a feature.

  • Data leakage in the training data:

Performing some kind of pre-processing on the full dataset whose results influence what is seen during training is one of the most common causes of data leakage. This includes computing normalization and rescaling parameters, finding minimum and maximum feature values to detect and remove outliers, estimating missing values, or performing feature selection using the distribution of a variable across the entire dataset rather than the training set alone (a minimal code sketch of this pitfall follows this section). When working with time-series data, another crucial need for caution arises when records of future events are mistakenly used to compute features for a specific forecast. The session length example we saw was one case, but more subtle effects can arise when data collection errors or missing-value indicators are present: if a feature requires at least one record to be collected in a certain time period, the mere presence or absence of a record may reveal information about the future, for example that no further observations are expected.

  • Leakage in features:

Leakage in features refers to the case where we remove one variable, such as a diagnosis ID or a patient record, but forget to remove other variables, known as proxy variables, that contain the same or comparable information. A good example of this is the patient ID, which contained hints about the severity of the patient’s condition due to the admission procedure. In some datasets, predictor variables are deliberately randomized or anonymized, for example by abstracting away personal data such as a user’s name and location. Undoing this privacy protection can expose information about users that would not legitimately be available in real use, depending on the prediction task.

Finally, any of the above scenarios can also be introduced through third-party datasets that are added to the training set as a source of new features. As a consequence, always be aware of the characteristics of such external data, as well as their interpretation and source.
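
As a concrete illustration of leakage in the training data, the following sketch (using synthetic data and scikit-learn, which the examples above do not prescribe) contrasts the leaky pattern of fitting a scaler on the full dataset with fitting it on the training split only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))        # synthetic features
y = (X[:, 0] > 0).astype(int)         # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Leaky: the scaler sees the test rows, so their statistics influence training.
scaler_leaky = StandardScaler().fit(X)          # fit on ALL data
X_train_leaky = scaler_leaky.transform(X_train)

# Correct: fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler().fit(X_train)
X_train_ok = scaler.transform(X_train)
X_test_ok = scaler.transform(X_test)
```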

How to Detect Data Leakage?

In this section, I describe three checks that help detect and avoid data leakage in your application.

1. Before beginning to construct the model:

Before beginning to construct the model, exploratory data analysis can reveal surprises in the data. Look for features that are highly correlated with the target label or value. In the medical diagnostic example, a binary feature indicating that a patient underwent a specific surgical treatment for the ailment would be one such case: it could be extremely closely linked to the diagnosis itself.
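
One simple way to run this check, sketched below with a synthetic patient table and hypothetical column names, is to rank features by their absolute correlation with the target and scrutinize anything suspiciously close to 1.0:

```python
import numpy as np
import pandas as pd

# Hypothetical patient data: 'had_surgery_for_condition' is effectively a copy of the label.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "age": rng.randint(20, 80, 500),
    "had_surgery_for_condition": rng.binomial(1, 0.5, 500),
})
df["target"] = (df["had_surgery_for_condition"] == 1).astype(int)

# Rank features by absolute correlation with the target; values near 1.0 deserve scrutiny.
corr_with_target = (
    df.corr()["target"].drop("target").abs().sort_values(ascending=False)
)
print(corr_with_target)
```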

2. After you’ve built your model:

After you’ve built your model, look for unusual feature behavior in the fitted model, such as exceptionally high feature weights or very large information gains associated with a single variable. Next, look for overall model performance that is unexpectedly good: if your evaluation results are substantially higher than those reported for the same or comparable problems and datasets, look closely at the instances and features that have the most influence on the model.
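
A small sketch of this kind of inspection, assuming a scikit-learn linear model and synthetic data with a deliberately leaky proxy feature, might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data where one feature is an almost perfect proxy for the label.
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "age": rng.normal(50, 10, 500),
    "num_visits": rng.poisson(3, 500),
    "leaky_proxy": rng.normal(0, 1, 500),
})
y = (X["leaky_proxy"] > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# One weight dominating all others is a red flag worth investigating.
weights = pd.Series(model.coef_[0], index=X.columns).sort_values(key=abs, ascending=False)
print(weights)
```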

3. Limited application of the trained model in the real world:

Another reliable, but potentially expensive, check for leakage is a limited real-world deployment of the trained model, used to determine whether there is a significant discrepancy between the performance the model achieved during development and the results it achieves on real data.

This check of whether the model generalizes effectively to new data is useful, but it may not provide any quick insight into whether leakage is occurring, or whether any drop in performance is due to other factors such as classical overfitting.

How to Minimize Data Leakage?

There are some best practices you can use to help limit the risk of data leakage in any application.

One essential guideline is to make sure that any data preprocessing is done separately within each cross-validation fold. In other words, any statistics or parameters you compute for normalizing or rescaling features should be based only on the training portion of the cross-validation split, not the entire dataset, and the same fitted parameters should then be applied to the matching held-out test fold.

If you are working with time-series data, keep track of the timestamp associated with each data instance, such as a user’s click on a webpage, and make sure that any data used to compute features for that instance does not include records with a timestamp later than the cutoff. This ensures that future data is not folded into existing feature computations or training data.
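
A minimal sketch of enforcing such a cutoff, using a hypothetical click log and pandas, could look like this:

```python
import pandas as pd

# Hypothetical click log and the moment we want to make a prediction for user 42.
clicks = pd.DataFrame({
    "user_id":   [42, 42, 42, 42],
    "timestamp": pd.to_datetime(
        ["2021-06-01 10:00", "2021-06-01 10:05", "2021-06-01 10:20", "2021-06-02 09:00"]
    ),
})
prediction_time = pd.Timestamp("2021-06-01 10:10")

# Only records observed strictly before the prediction time may feed the features.
history = clicks[(clicks["user_id"] == 42) & (clicks["timestamp"] < prediction_time)]
clicks_so_far = len(history)  # 2, not 4: the later rows would leak future information
```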

If you have sufficient data, consider creating a separate test set before working with a new dataset, and evaluate your final model on this test set only as the very last step. The purpose is similar to a real-world deployment: you want to make sure your trained model generalizes effectively to new data. If there is no big drop in performance, that is great. If there is, leakage, along with the usual suspects such as classical overfitting, could be a contributing cause.

To conclude, the following points are helpful in combating data leakage.

· Apply a temporal cutoff: remove data from just prior to the event of interest, focusing on the point in time at which you learn about an observation or fact.

· Add random noise to the input data in order to smooth out the effects of possibly leaking variables.

· Remove leaky variables: evaluate simple rule-based models to check whether a variable is leaky, and if so, remove it. If you have any doubt about whether a variable is leaky, removing it is the safer choice.

· Use pipeline architectures, which make it easy to apply the full sequence of data-preparation steps inside each cross-validation fold. Packages for this are available in several languages, such as the caret package in R and scikit-learn in Python (see the sketch after this list).

· Use a holdout dataset: hold back a validation set as a final stability check of your model before using it.
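
As a rough sketch of the last three points taken together (using scikit-learn and synthetic data, since the article does not prescribe a specific implementation), a pipeline re-fits the preprocessing inside each cross-validation fold, and a holdout set is evaluated only once at the very end:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold back a test set first; it is touched only once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The pipeline re-fits the scaler inside each cross-validation fold,
# so no statistics from held-out folds leak into the preprocessing.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("CV accuracy:", cv_scores.mean())

# Final stability check on the untouched holdout set.
pipe.fit(X_train, y_train)
print("Holdout accuracy:", pipe.score(X_test, y_test))
```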

Summary

In this article, we covered the following:

· What is Data Leakage in Machine Learning?

· Different Real-world Examples of Data Leakage

· How to Detect Data Leakage?

· How to Minimize Data Leakage?


Ph.D. in Computer Science | Data Scientist | Machine Learning Researcher | Currently working in Unity Technologies -Weta Digital