Making Data Useful

How to work with someone else’s data

A guide to wrestling with inherited data

Cassie Kozyrkov
Towards Data Science
8 min read · Apr 4, 2020

Definitions

You’re using primary data if you (or the team you’re part of) collected observations directly from the real world. In other words, you had control over how those measurements were recorded and stored.

What’s the opposite? Inherited (secondary) data are those you obtain from someone else. (For example, you can get over 20 million datasets here.)

Want definitions of other related jargon? Find them in my main article on data provenance. (You’re reading Part 2.)

Buyer beware

Inherited datasets are like inherited toothbrushes: using them is an act of desperation. You’d always prefer to use your own if possible. Unfortunately, you might not have that option. Collecting primary data can be prohibitively expensive.

Collecting your own data is a luxury not everyone can afford.

While primary data carries a whiff of superiority reminiscent of artisanal cheese, anyone who insists that you’re worthless for chowing down on inherited data should check their privilege. Individuals (as well as firms without strong data traditions or newcomers to a domain) might not have the resources to collect data on their own, especially when the project requires very large datasets or specialist skills/equipment. For example, not everyone can afford to take pathogen measurements in a lab environment with a high biosafety rating.

Sometimes your only option is to try to make the best of someone else’s data.

But if you’re forced to work with someone else’s data, don’t be surprised later on if things don’t pan out the way you’d hoped. There’s no guarantee that inherited data will serve your needs… (Also, the tooth fairy isn’t real.)

Buyer beware: There’s no guarantee that inherited data will serve your needs.

Just because someone sold you a package labeled “dinner things” doesn’t mean you can make a good dinner with it. What if it only contains a bunch of toilet rolls? Be careful with inherited data; it might not suit your needs.

Here’s a quote I like from R.A. Fisher:

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”

That goes for your inherited dataset too — its collection was finished before you arrived on the scene, so it wasn’t designed to fit your needs. As far as your intended purpose goes, it might be so kaput that the most you can squeeze out of it is an autopsy. I hope you’ve taken enough journeys around the sun not to let this ruffle your feathers.

Inherited data are easier to get but harder to trust.

If you’re on the verge of protesting that it’s better to have some data rather than no data at all, replace the word “data” with “noise” / “lies” / “distractions” and try your sentence again. Quality is everything, and if you didn’t collect the dataset yourself, you have no control over what was measured and (perhaps more importantly) what was left out. Some data is better than no data only some of the time.

When you’re forced to work with inherited data, you’ve got five main problems to worry about:

  • Purpose — was the dataset collected for a similar purpose to yours?
  • Competence — do you trust the team who collected the data to take measurements in a competent fashion?
  • Agenda — do you trust the dataset not to be tainted by the biases and agendas of its authors? (Don’t be too quick to trust!)
  • Clarity — is there clear documentation to prevent you from misinterpreting the contents of the dataset?
  • Processing — are you sure the dataset hasn’t been transformed, skewed, or otherwise tampered with?

Advice for working with someone else’s data

How should you approach working with inherited data? As with any dataset you get your paws on, your first action, habit — instinct! — should be SYDD. (“Split your damned data.”) Put any data not earmarked for exploration in a safe place.
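
If it helps to see it concretely, here’s a minimal sketch of SYDD with pandas and scikit-learn; the filenames, the 80/20 split, and the random seed are illustrative assumptions, not a prescription:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the inherited dataset (filename is a placeholder).
df = pd.read_csv("inherited_dataset.csv")

# Split once, up front: one chunk for exploration, one locked away for
# confirming whatever you find later. The 80/20 ratio is an arbitrary choice.
explore_df, safe_df = train_test_split(df, test_size=0.2, random_state=42)

# Stash the untouched portion somewhere you won't "accidentally" explore it.
safe_df.to_csv("locked_away_do_not_explore.csv", index=False)
explore_df.to_csv("exploration.csv", index=False)
```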

Documentation

Next, it’s crucial that you seek documentation about how the inherited dataset was born. If possible, try to identify and contact whichever culprits, er, project members were responsible for the data collection legwork to find out what precisely they did and ask them clarifying questions. If consultation with the originating team is impossible (or if they forgot what they did), you’ll be forced to rely on written documentation. (See? Yet another argument for why good documentation is so important.)

If you’re an analyst (not doing statistical inference), you can take a lightweight approach: find out how much documentation exists and how detailed it is, make a gut check about whether the data quality justifies your time investment, and then use the documentation only as needed for your usual exploratory data analysis (EDA). However, if you think you found something inspiring enough to bring to a decision-maker, thoroughly check the documentation to ensure that you haven’t fallen for an obvious red herring.

If you’re doing ML/AI, you can take a similar approach and use inherited data for training after only a cursory foray into documentation, but — for goodness’ sake! — validate carefully and test your model on your own primary data. If you haven’t thoroughly checked performance on primary data, it doesn’t count. You’re a danger to yourself and others. Pretty please don’t launch.
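
As a hedged illustration of that workflow (a sketch, not an official recipe), the snippet below trains on inherited data and reports performance on a primary dataset your own team collected; the model, the filenames, and the “label” column are all assumptions:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Inherited (secondary) data: used for training only. Column names are hypothetical.
inherited = pd.read_csv("inherited_training_data.csv")
X_train, y_train = inherited.drop(columns=["label"]), inherited["label"]

# Primary data: collected by your own team, never touched during training.
primary = pd.read_csv("primary_test_data.csv")
X_test, y_test = primary.drop(columns=["label"]), primary["label"]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The number that matters for launch decisions is performance on primary data.
print("Accuracy on primary data:", accuracy_score(y_test, model.predict(X_test)))
```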

Statisticians won’t get off so easy. If you’re engaged in a data-driven decision-making project, you’re going to have to get to know the documentation thoroughly before proceeding with inherited data. This is soul-crushingly boring, but it’s the price you have to pay for working safely with data you didn’t create yourself. Specifically, make sure that you compile your own document detailing your understanding of the sampling procedure, potential biases, and real-world actions that generated the data you inherited.

Data quality

As you’re poring over documentation, look up the exact meaning of the variable names and ask an analyst friend to help you do some EDA to ensure that the data aren’t completely bonkers: no negative values for things that should be positive, no columns without variance, no duplicates, no contradictions, and so on. (A minimal code sketch of these basic sanity checks follows the checklist below.) If you’re not sure what to keep an eye out for, you might like to look up some of these keywords from standard health checkups for data quality:

Does it cover the right topic?

  • purpose
  • usefulness
  • relevance
  • comparability
  • comprehensiveness

Are there any barriers to access?

  • availability
  • accessibility
  • clarity
  • storage

Can you trust the source?

  • competence
  • credibility
  • reliability
  • agenda
  • bias

Can you trust what you’re seeing?

  • accuracy
  • consistency
  • validity
  • uniqueness
  • processing

Can you trust what you’re not seeing?

  • completeness
  • coherence
  • timeliness
  • latency
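
For the basic “not completely bonkers” checks mentioned above, a minimal pandas sketch might look something like this; which columns should be non-negative is an assumption you’d confirm against the documentation:

```python
import pandas as pd

# The exploration split from earlier (filename is a placeholder).
df = pd.read_csv("exploration.csv")

# Exact duplicate rows.
print("Duplicate rows:", df.duplicated().sum())

# Columns with no variance (a constant column can't tell you anything).
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
print("Constant columns:", constant_cols)

# Share of missing values per column, worst offenders first.
print(df.isna().mean().sort_values(ascending=False).head(10))

# Negative values in columns that should never be negative
# (the column names here are hypothetical; use your own).
for col in ["weight", "age"]:
    if col in df.columns:
        print(f"Negative values in {col}:", (df[col] < 0).sum())
```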

Assumptions and caveats

It’s very important that you don’t make premature assumptions: just because something is called “weight” doesn’t mean it’s what you think it is. It could be a physical weight in SI units, an importance weighting, or anything else; people put the darnedest labels in their datasets, especially if they’re not creating them with other data scientists in mind.

Don’t make premature assumptions.

Additionally, trust no one! Even if the documentation says that “weight” refers to weight in kilograms, remember that transcription booboos and other measurement errors are a thing. For example, if the column claims to contain weights, you might ask yourself: Was a scale used? (Or just an eyeball? How do I know?) How precise was the scale? Was there more than one scale? Were they calibrated carefully? How well-trained were the personnel taking the measurements?

Image caption: I’ll measure it myself, thanks.

As you attack these questions in the documentation, you’ll be shocked to see how few “what the #$%@ is this?” and “how did this actually work in real life?” questions you can get good, precise answers for.

Does that mean you must give up? No. You may proceed… with appropriate caution and humility. Every time you find an issue, you’ll be honor-bound to wax lyrical about all the guesses you’re forced to make about what really happened during the data collection.

You’ll spend a lot of boring — yes, boring — time on the following:

  • State the assumptions you’re being forced to make.
  • Write up caveat notes to be included in the appendix of your final report.
  • Write cautionary notes that warn the decision-maker (and your other readers) that conclusions from the study will need to be downgraded due to potential data issues.

The level of apology you owe your readers and decision-makers depends on your experience and expertise with similar data.

In plain English: the more you know about the real-world context in which data were collected, the less rubbishy your assumptions are likely to be, and the less you need to grovel in humility before your audience.

If you’re an expert in the field — for example, if you’re an experienced biostatistician working with epidemiologists — you’re better qualified to make reasonable assumptions about what happened during the collection of a COVID-19 dataset than I am. If I were cheeky enough to dive into such data without the support of experienced subject-matter experts, I would owe it to myself and my audience to take stock of my ignorance (in written form!) before proceeding.

In addition to writing conclusion-softening notes, every self-respecting author should remember to include a blanket inherited data warning like this one: “Since our project team did not participate in planning the study or data collection, it is possible that we are missing crucial context which renders our conclusions invalid.”

This isn’t about perfectionism. It’s perfectly reasonable to try to slightly increase your (imperfect) understanding with data. Just remember that learning a little bit more doesn’t equal knowing everything… and don’t forget to acknowledge how much you don’t know. If you fail to do that, you’re leading anyone who trusts you straight off a cliff.

Humility may be a wordy strain on the media-weaned attention span, but the overconfident alternative is worse.

Thanks for reading! How about an AI course?

If you had fun here and you’re looking for an applied AI course designed to be fun for beginners and experts alike, here’s one I made for your amusement:

Enjoy the entire course playlist here: bit.ly/machinefriend

Liked the author? Connect with Cassie Kozyrkov

Let’s be friends! You can find me on Twitter, YouTube, Substack, and LinkedIn. Interested in having me speak at your event? Use this form to get in touch.
