
Detecting Data Leakage before it’s too late

Sometimes it's too good to be true.

[UPDATE: I have started a tech company. You can find out more here]

After reading Susan Li’s Expedia case study article, I wanted to see if I could reproduce the results using AuDaS, Mind Foundry’s automated Machine Learning platform. The data is available on Kaggle and contains customer web analytics information for hotel bookings, labelled true or false. The goal of the competition is to predict whether a customer will make a reservation or not. However, after cleaning the data and training a model for 3 minutes, I reached a classification accuracy of 100%, which immediately triggered an alarm. I was a victim of data leakage.

In this post we will see how AuDaS raised the alarm and how I could have avoided it in the first place!

Detecting the Leakage

The goal is to classify whether a customer will book or not, so I launched a simple classification task in AuDaS. It immediately triggered a model health warning after reaching 100% accuracy.

Model Health Warning raised by AuDaS

Upon further inspection of the feature relevance, I realised that the leakage was caused by the gross bookings column, which contains the $ values associated with the bookings. A dollar amount only exists once a booking has been made, so this column gives the label away. I then went back to the data preparation step in AuDaS to exclude it and, most importantly, to see if there were any other columns I should remove.

Feature Relevance in AuDaS
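Outside of AuDaS, one quick way to confirm a suspect like this is to check how well the column predicts the label on its own. The snippet below is a minimal sketch of that idea, not how AuDaS computes feature relevance. The column names (gross_bookings_usd, price_usd, booking_bool) and the toy rows are assumptions made up for illustration.

```python
def nonzero_rule_accuracy(rows, feature, target):
    """Accuracy of the one-line rule: predict target == 1 iff feature > 0."""
    hits = 0
    for row in rows:
        value = row[feature] or 0.0  # treat a missing value (None) as zero
        predicted = 1 if value > 0 else 0
        hits += predicted == row[target]
    return hits / len(rows)


# Hypothetical toy rows: a dollar amount is only recorded once a booking exists.
rows = [
    {"gross_bookings_usd": 0.0, "price_usd": 120.0, "booking_bool": 0},
    {"gross_bookings_usd": 310.5, "price_usd": 150.0, "booking_bool": 1},
    {"gross_bookings_usd": None, "price_usd": 90.0, "booking_bool": 0},
    {"gross_bookings_usd": 89.0, "price_usd": 80.0, "booking_bool": 1},
    {"gross_bookings_usd": 0.0, "price_usd": 200.0, "booking_bool": 0},
]

leak_score = nonzero_rule_accuracy(rows, "gross_bookings_usd", "booking_bool")
price_score = nonzero_rule_accuracy(rows, "price_usd", "booking_bool")
print(f"gross bookings rule: {leak_score:.2f}, price rule: {price_score:.2f}")
```

A column that reproduces the label with a trivial rule like this is almost certainly recorded after the outcome, which makes it unusable at prediction time.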

Proactive Detection of Data Leakage with AuDaS

I decided to use AuDaS’ auto-histogram page to identify perfect predictors of bookings, which I could then exclude from the training data. Not surprisingly, the gross bookings and click bool (whether or not the hotel was clicked) columns were strong predictors of a booking, because in order to make a booking we need to click the link and then pay!
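The same kind of screening can be approximated by hand by tabulating the booking rate for each value of a column, which is roughly what a histogram split by the target shows. This is a sketch under assumptions: the column names (click_bool, random_bool, booking_bool), the toy rows and the min_rows threshold are all hypothetical, and it is not the AuDaS implementation.

```python
from collections import defaultdict


def target_rate_by_value(rows, feature, target):
    """Return {feature value: (row count, share of rows with target == 1)}."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        counts[row[feature]] += 1
        positives[row[feature]] += row[target]
    return {value: (counts[value], positives[value] / counts[value]) for value in counts}


def deterministic_values(rows, feature, target, min_rows=2):
    """Feature values whose rows all share one target value (rate exactly 0 or 1)."""
    table = target_rate_by_value(rows, feature, target)
    return sorted(
        value for value, (n, rate) in table.items()
        if n >= min_rows and rate in (0.0, 1.0)
    )


# Hypothetical toy rows: nobody books without clicking first.
rows = [
    {"click_bool": 0, "random_bool": 0, "booking_bool": 0},
    {"click_bool": 0, "random_bool": 1, "booking_bool": 0},
    {"click_bool": 0, "random_bool": 0, "booking_bool": 0},
    {"click_bool": 1, "random_bool": 1, "booking_bool": 1},
    {"click_bool": 1, "random_bool": 0, "booking_bool": 0},
    {"click_bool": 1, "random_bool": 1, "booking_bool": 1},
    {"click_bool": 0, "random_bool": 1, "booking_bool": 0},
    {"click_bool": 1, "random_bool": 0, "booking_bool": 1},
]

click_table = target_rate_by_value(rows, "click_bool", "booking_bool")
click_flags = deterministic_values(rows, "click_bool", "booking_bool")
random_flags = deterministic_values(rows, "random_bool", "booking_bool")
```

In this toy data click_bool is flagged because the value 0 never coincides with a booking, while random_bool is not flagged. A flag is a prompt to ask whether the column would really be known at prediction time, not proof of leakage on its own.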

After excluding the gross bookings column and relaunching a classification task, I was still reaching a classification accuracy of 99%, which made me take a closer look at what was happening.

Effectively, because bookings were rare and AuDaS balances the classes in both the training data and the 10% hold-out used for validation, the click bool column was a near-perfect predictor of bookings.
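A back-of-the-envelope calculation shows why a balanced data set makes the click column look so strong. The sketch assumes every booking was preceded by a click. The 2% click rate among listings that were not booked is a made-up figure for illustration, not something measured from the Expedia data.

```python
def click_rule_accuracy(booking_share, click_rate_without_booking):
    """Accuracy of the rule 'predict a booking iff the listing was clicked'.

    Assumes every booking was preceded by a click, so booked rows are
    always classified correctly.
    """
    correct_bookings = booking_share
    correct_non_bookings = (1 - booking_share) * (1 - click_rate_without_booking)
    return correct_bookings + correct_non_bookings


# Hypothetical: 2% of listings that were not booked were still clicked.
HYPOTHETICAL_CLICK_RATE = 0.02

# Balanced data: half the rows are bookings, so always answering
# "no booking" would only score 0.5.
balanced_baseline = 0.5
balanced_click_rule = click_rule_accuracy(0.5, HYPOTHETICAL_CLICK_RATE)
print(f"baseline {balanced_baseline:.2f}, click rule {balanced_click_rule:.2f}")
```

Under these assumed numbers the click rule alone lifts accuracy from 50% to about 99% on balanced data, so a model barely needs any other feature.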

Robust Model building with AuDaS

Finally, I decided to exclude the click bool column as well, in order to identify the key features that predict a booking (and therefore clicks), as this is Expedia’s main objective.
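In code, the equivalent of this data preparation step is to strip the target and the leaky columns before anything reaches the model. The sketch below assumes the column names follow the Kaggle data set (gross_bookings_usd, click_bool, booking_bool); the helper split_features_and_target and the toy rows are hypothetical.

```python
# Assumed column names; adjust them to your own schema.
TARGET = "booking_bool"
LEAKY_COLUMNS = {"gross_bookings_usd", "click_bool"}


def split_features_and_target(rows, target=TARGET, excluded=LEAKY_COLUMNS):
    """Return (features, labels) with the target and leaky columns removed."""
    features = [
        {name: value for name, value in row.items()
         if name != target and name not in excluded}
        for row in rows
    ]
    labels = [row[target] for row in rows]
    return features, labels


# Hypothetical toy rows.
rows = [
    {"position": 1, "price_usd": 120.0, "click_bool": 1,
     "gross_bookings_usd": 240.0, "booking_bool": 1},
    {"position": 7, "price_usd": 95.0, "click_bool": 0,
     "gross_bookings_usd": 0.0, "booking_bool": 0},
]

features, labels = split_features_and_target(rows)
```

Keeping the exclusion list in one place makes it harder for a leaky column to slip back in when the pipeline is rerun on fresh data.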

The classification accuracy was reduced to 72% as a result, but AuDaS was able to identify a more nuanced ranking of relevant predictors of bookings.

The main features for the chosen model were:

  • The hotel’s position on the search results page
  • The A/B sorting method of the search results
  • The property’s location score and price

Key Takeaways

Data leakage in Data Science can often go unnoticed, which is why it is important to have mechanisms that raise warnings and detect the sources of the leaks. Once the leaks have been identified, an understanding of the data is needed to explain why and how they occur and to decide on the best way to repair them. AuDaS helped us avert a disaster.

AuDaS

AuDaS is an Automated Data Science platform developed by Mind Foundry that provides a robust framework for building end-to-end Machine Learning solutions. This framework helps identify and act on data leakage before it’s too late. You can try AuDaS here and view more demos below:

Team and Resources

Mind Foundry is an Oxford University spin-out founded by Professors Stephen Roberts and Michael Osborne, who have 35 person-years in data analytics. The Mind Foundry team is composed of over 30 world-class Machine Learning researchers and elite software engineers, many of them former post-docs from the University of Oxford. Moreover, Mind Foundry has privileged access to over 30 Oxford University Machine Learning PhDs through its spin-out status. Mind Foundry is a portfolio company of the University of Oxford and its investors include Oxford Sciences Innovation, the Oxford Technology and Innovations Fund, the University of Oxford Innovation Fund and Parkwalk Advisors.

