Making Data Useful

All about data provenance

Unstructured data, inherited data, exhaust data, obfuscated data, and other goblins

Cassie Kozyrkov
Towards Data Science
8 min read · Apr 3, 2020


If you’re about to jump on the citizen data scientist bandwagon (diving into COVID-19 data, perhaps?) there are a few things you should know about data provenance…

Data provenance: “Who collected it and why?”

Society is plagued by distorted expectations regarding data, littered with nonsense like “numbers can’t lie” and “it’s just your opinion until you show me the data” (no, it’s still your opinion) and “I looked at data, so now I’m informed.”

There comes a time in every child’s life when they must learn that:
1) The tooth fairy isn’t real.
2) Things don’t just magically work out because you have some numbers. It really matters where those numbers came from. (Some children are a few decades overdue for this developmental milestone.)

Anyone can put some electronic scribbles in a table and call it data. That doesn’t make it good/true/useful/worthy in the sense you associate with Science.

Even if the dataset was collected carefully, are you sure you know what happened to it on its way to you? The only reason that a villain won’t remove inconvenient rows from your dream dataset (“Hide data from you? I would never! Those were outliers.”) or aggregate things in a way that skews the message before sharing data with you is that it’s too easy. (Self-respecting villains prefer a challenge.)

Ability to reason about provenance is one of the basic requirements of data literacy, so let’s make a few distinctions:

  • Primary data versus inherited (secondary) data
  • Captured data versus exhaust data
  • Structured data versus unstructured data
  • Raw data versus processed data

and then round out the article with a summary of their relative virtues.

Primary versus secondary

You’re using primary data if you (or the team you’re part of) collected observations directly from the real world. In other words, you had control over how those measurements were recorded and stored.

What’s the opposite? Inherited (secondary) data are those you obtain from someone else. (For example, you can get over 20 million datasets here.)

Inherited datasets are like inherited toothbrushes: using them is an act of desperation. You’d always prefer to use your own if possible, since secondhand datasets are rife with gotchas. Unfortunately, you might not have that option.

If you’re interested in my deep-dive on inherited datasets — including advice on how to work with them — take a small detour here.

Captured versus exhaust

Captured data are intentionally created for a specific analytical purpose, while exhaust data are byproducts of digital/online activity. Exhaust data usually come about when websites store activity logs for purposes — such as debugging or data hoarding — other than specific analyses.

Captured data are created to be used by data science professionals for a specific purpose, while exhaust data are not.

Exhaust data can be a treasure trove for analytics, but a nightmare for statistical inference. (My article on the difference between statistics and analytics is here.) If you’re doing careful data-driven decision-making, you’d prefer captured data if given the choice. On the other hand, if you’re looking for surprises or hoping to spark inspiration for an original idea, you might want to direct your gaze outside captured datasets.

If you’re used to working with data created for analysis, you’re likely to feel that something is not quite right in the topsy-turvy exhaust data wonderland… but you can’t quite put your finger on it. If all the conventions you’re used to were ignored and the logging choices feel a bit batty, there’s a simple reason: those datasets were not designed for your eyeballs. So don’t be surprised if exhaust data feel like messy junk that requires extra cleaning and processing. Take a deep breath and allocate additional time to your wrangling efforts.
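To make that concrete, here’s a minimal sketch of the kind of wrangling exhaust data tends to demand. The log lines and field names are invented for illustration; the point is that the useful records have to be fished out of a stream that was written for debugging, not for you:

```python
import re
import pandas as pd

# Invented exhaust data: a debug log written for engineers, not analysts.
log_lines = [
    "2020-04-03T10:15:01Z [INFO] user=482 action=click page=/home",
    "2020-04-03T10:15:07Z [WARN] retrying connection (attempt 2)",
    "2020-04-03T10:15:09Z [INFO] user=482 action=purchase page=/checkout",
]

# Keep only the lines matching the shape we care about; the retries and
# other debugging chatter are noise for this particular analysis.
pattern = re.compile(
    r"(?P<timestamp>\S+) \[INFO\] user=(?P<user>\d+) "
    r"action=(?P<action>\w+) page=(?P<page>\S+)"
)

records = [m.groupdict() for line in log_lines if (m := pattern.match(line))]
df = pd.DataFrame(records)  # two tidy rows, salvaged from three messy lines
print(df)
```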

Structured versus unstructured

On a related note, I had a hard time wrapping my head around the newfangled moniker “unstructured data” until I realized that it was just the business world’s way of rediscovering a very old concept: a mess.

That thing you call unstructured data is just data that needs *you* to put structure on it.

If we’re pedantic about it, there’s no such thing as unstructured data (since by being stored, they’re necessarily forced to have some kind of structure), but let me be generous. Here’s what the definition intends to convey:

  • Structured data are neatly formatted, ready for analysis.
  • Unstructured data are not in the format you want and they force you to put your own structure onto them. (Think of images, emails, music, videos, text comments left by trolls.)

For example, you might want to work with image data in a nice tabular format but instead you find a folder full of pictures with all kinds of different extensions and formats. Adding insult to injury, you discover a text file with links to more pictures on miscellaneous websites. Yuckity yuck.
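To see what imposing structure looks like in practice, here’s a minimal sketch that walks such a folder and builds the beginnings of a table. The folder name and column choices are mine, not a standard:

```python
from pathlib import Path
import pandas as pd

# Hypothetical messy folder of pictures in assorted formats.
image_dir = Path("downloaded_pictures")
image_extensions = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}

rows = []
for path in image_dir.rglob("*"):
    if path.suffix.lower() in image_extensions:
        rows.append({
            "filename": path.name,
            "format": path.suffix.lower().lstrip("."),
            "size_bytes": path.stat().st_size,
        })

# One row per image: structure you had to put there yourself.
catalog = pd.DataFrame(rows)
print(catalog.head())
```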

It takes a special sort of talent to store things in such disarray, which is why you’ll usually find that the culprits behind it are human. At least machine-generated exhaust data like log files have some kind of standardized format (even if it’s a daft one) — that’s why we call those things semi-structured data.

One analyst’s junk is another analyst’s treasure.

Think of structured data as tidy libraries and unstructured data as chaotic attics. Just as there are far fewer libraries in the world than collections of mess (I can see one right from my desk), tidy datasets are the exception. Of course, chaos is in the eye of the beholder: you might find image data messy by definition, while I might only find them messy if they arrive as described above.

The sentiment behind the unstructured data “revolution” is that your patience with nonstandard data formats might pay off handsomely.

The only thing you can do with unstructured data is analytics. In order to use them for statistical inference or ML/AI, you’d have to impose structure on them. When your data make the move from unstructured to structured, you usually erode some of their information, just like you toss away the flaky outer layers of an onion when you prepare it for your salad.

Since unstructured data become structured by the time you’re done with them, the real “advantage” to them is that you have them in the first place. If businesses had to make all their data presentable to your tastes, they’d probably turn their noses up at the effort involved and not bother to store as much data. As a result, you’d have access to far fewer sources for your data-mining adventures.

Raw versus processed

Speaking of cleaning, raw data are those which were not altered after collection, while processed data are those which are cleaned up and/or transformed. In other words, raw data arrive in their original form, untampered with by strangers. Raw data almost always need a bit of love (read: janitorial work) to become useful.

You’d think that means processed data are superior to raw data… but that’s a trap! Because data science professionals write code to process their data (cf. “nondestructive editing”), the truly superior option is raw data accompanied by processing code. If you’re sharing data, that’s the best way to do it (unless there’s a legal/privacy reason to hide some of it with processing or if the raw dataset is painfully massive).

The truly superior option is raw data accompanied by processing code.

Weaning yourself off spreadsheets and writing code to do your data work ensures that no information is destroyed during cleaning. It also lets you change your mind about, say, outlier removal or category aggregation as easily as snipping out the offending code.
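As a sketch of what that nondestructive editing looks like (the file and column names are hypothetical), all the cleaning decisions can live in a short script while the raw file stays untouched:

```python
import pandas as pd

# The raw file is never edited by hand and never overwritten.
raw = pd.read_csv("measurements_raw.csv")

processed = (
    raw
    .dropna(subset=["weight_lb"])   # drop missing measurements
    .query("weight_lb < 1000")      # outlier rule: change your mind by editing one line
    .assign(weight_kg=lambda d: d["weight_lb"] * 0.453592)
)

# Rerun the script to regenerate the processed version from scratch.
processed.to_csv("measurements_processed.csv", index=False)
```

Delete the query() line and rerun, and the “outliers” come right back. That reversibility is the whole point.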

If we’re going granular, then let’s name some breeds of processed data:

  • Transformed data — information is altered in a way that prevents its reconstruction
  • Aggregated data — original’s information content is collapsed over observations
  • Obfuscated data — information is intentionally hidden (e.g. by deletion or noise injection)
  • Polite data — huh?

That last one is a tongue-in-cheek distinction I like to make. I have no objections to receiving a polite version of a raw dataset which is nicely formatted (with good documentation and variable names that make sense) as long as no information has been lost to transformation / aggregation. For example, if you’re sending me data and you know I like to use kilograms instead of pounds, replacing your confusingly-named wlb column with a weight_kg column that I’d find easier to use is polite; it’s a slightly better choice than shipping me the raw data. But if you replace wlb with the binary column is_overweight, you’d be sending me impolitely processed data. That’s because you’ve unnecessarily destroyed some information that was available in the raw data… and who’s to say that I’d agree with your definition of overweight?
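Here’s that example as a tiny pandas sketch. The values and the 200 lb cutoff are invented, which is precisely the problem with the impolite version:

```python
import pandas as pd

# Raw data with a confusingly-named pounds column.
raw = pd.DataFrame({"wlb": [150.0, 210.0, 180.0]})

# Polite: a clearer name and a unit conversion. Invertible, so no
# information is lost (divide by 0.453592 to get pounds back).
polite = pd.DataFrame({"weight_kg": raw["wlb"] * 0.453592})

# Impolite: collapses each weight to a yes/no. The original values can't
# be reconstructed, and the cutoff bakes in one person's definition of
# "overweight".
impolite = pd.DataFrame({"is_overweight": raw["wlb"] > 200})
```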

Selecting a dataset

Primary data reign supreme… if you can afford them. If you’re going to use someone else’s data, it’s better to work with raw/polite data unless their domain expertise vastly outclasses yours. If you’re asking the same questions the original collectors were interested in, captured data are superior. If you’re looking to get inspired in directions that haven’t occurred to anyone yet, exhaust data might be a better bet (since captured data are more likely to lead you to the same conclusions the original collectors reached). Quick summary:

Primary data advantage: control over quality.
Secondary data advantage: saving time and money.

Captured data advantage: statistical inference and decision-making.
Exhaust data advantage: inspiring original ideas.

Structured data advantage: less cleaning required.
Unstructured data advantage: some things wouldn’t be stored otherwise.

Raw data advantage: almost everything.
Processed data advantage: coping with massive raw datasets.

Advice for working with someone else’s data

Read on to Part 2 here

Thanks for reading! How about an AI course?

If you had fun here and you’re looking for an applied AI course designed to be fun for beginners and experts alike, here’s one I made for your amusement:

Enjoy the entire course playlist here: bit.ly/machinefriend

Connect with Cassie Kozyrkov

Let’s be friends! You can find me on Twitter, YouTube, and LinkedIn. Interested in having me speak at your event? Use this form to get in touch.
