
The last time you had to find and gather data – for a school project, for a business project, for a personal project – where did you look? Were you handed the data directly by someone within your company? Did you go out and track down an interesting data set on your own?
Most importantly – did you validate the data’s legitimacy after you had it in hand?
I needed a classification data set for a project. The data could be about anything, so naturally I wandered over to Kaggle to look for some cool data sets. I found this data set on Airline Passenger Satisfaction, which is a very rich set for classification.

This is a great-looking data set. It has various customer service metric ratings, a binary target variable, and over 130k rows of responses. It’s a silver medal data set on Kaggle, and for good reason. It checks all the boxes for being a fun and interesting data set.
However, I noticed immediately that there was no attribution for this data, beyond an acknowledgement that it was modified from a different Kaggle dataset (the John D upload listed in the references).

I went to this "original" data set, uploaded by John D. Notably, there is still no attribution for the data set there either, so I turned to Google.
In my searches, I found two potential sources for the data. The first is the American Customer Satisfaction Index, which has a similar list of independent variables in its Airline Customer Experience Benchmark survey. However, their statement is "Each year, the ACSI interviews hundreds of passengers about their recent flight experiences". Hundreds is great, but it is NOT 130k. Is this the source of the data? If so, the data would have to be pooled from potentially DECADES (or more) of responses, and is therefore not necessarily useful as a picture of passengers today.
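To put rough numbers on that gap, here is a quick back-of-the-envelope sketch. It assumes "hundreds" means somewhere between 100 and 999 interviews a year (my reading of the ACSI statement, not a figure they publish) and that every interview became exactly one row, which is also an assumption.

```python
import math

# Approximate row count of the Kaggle data set discussed above.
ROWS_IN_DATASET = 130_000


def years_needed(rows, interviews_per_year):
    """Whole years of surveying needed to accumulate `rows` responses."""
    return math.ceil(rows / interviews_per_year)


# Assumed range for "hundreds of passengers" per year (hypothetical values).
for per_year in (100, 500, 999):
    print(f"{per_year} interviews/year -> {years_needed(ROWS_IN_DATASET, per_year)} years")
```

Even at the top of that assumed range, the survey would need to run for well over a century to produce this many rows, so either the interview counts are much higher than "hundreds" or this is not the source.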
Another potential source of the data turned up at IATA: the Airsat Passenger Satisfaction Benchmark. They list a similar grouping of variables, but don’t specify how many surveys they collect.
Beyond my being unable to source the data, the most disturbing thing is how many people have used this unsourced data set as the basis of important projects. I found one instance of a published article in a peer-reviewed publication which openly used the original "John D" data set as the basis of its work, referring to it only as a "Kaggle repository". While the purpose of the article was more about methodology than business answers, the underlying data should still be validated and sourced. "John D on Kaggle" is not a source. Another article written for the Journal of Retailing and Consumer Services, for which I could only read the abstract, alluded to a "dataset comprising feedback from more than 133,000 customers". I can’t be certain, but that specific response count is suspiciously close to the size of the cleaned-up silver-medal dataset that started this whole curious rabbit hole.
Any search on airline satisfaction yields several projects, articles and references based on the now-infamous (if only in my own mind) Airline Passenger Satisfaction data set – a data set with NO KNOWN SOURCE.
For all we know, John D wrote this data set on a whim one afternoon. We’ll never know, because despite requests in the comments on his data set, he never attributed his source. And now this data set lives on forever on the internet, referenced in legitimate publications.
This is exactly how misinformation spreads. Information with no source is passed out into the world as being legitimate. When taken on faith often enough, it is accepted as legitimate with no confirmation. And perhaps this data set IS legitimate – but what if it isn’t? What if this data set is providing faulty, or at best suspect, information to businesses and to the world?
Don’t add to this problem. Know your data’s source. Validate your data. Use only data that comes from known sources, and always get a "second opinion" when possible. Provide attribution for your data, and increase confidence.
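As a minimal sketch of what "know your data’s source" can look like in practice, here is a small provenance check you could run before starting any analysis. The record fields and the `provenance_gaps` helper are hypothetical names I made up for illustration, not a standard or a Kaggle feature.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """What we know about a data set before we trust it (hypothetical schema)."""
    name: str
    uploader: str
    original_source: Optional[str] = None    # who actually collected the data
    collection_method: Optional[str] = None  # e.g. survey, logs, simulation
    collection_period: Optional[str] = None  # when the data was gathered


def provenance_gaps(record: DatasetRecord) -> List[str]:
    """Return the provenance fields that are missing or blank."""
    required = ("original_source", "collection_method", "collection_period")
    return [field for field in required
            if not (getattr(record, field) or "").strip()]


# Only the name and the uploader are known for the data set in this post.
kaggle_set = DatasetRecord(name="Airline Passenger Satisfaction",
                           uploader="John D")
print(provenance_gaps(kaggle_set))
```

For the data set in this post, every provenance field comes back as a gap. An uploader name alone does not fill any of them.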
It is our job as data scientists to tell stories with real, actionable data. Stories based on faulty data are only fairy tales.
References:
- Airline Passenger Satisfaction dataset on Kaggle cleaned by TJ Klein
- Passenger Satisfaction dataset on Kaggle uploaded by John D
- ACSI Airline Experience Benchmark Program
- Airsat Passenger Satisfaction Benchmark Program