Data Does Not Have Intrinsic Value

Peter Binggeser
Towards Data Science
5 min read · Oct 3, 2017


Combined with mystique, confusion, and general misunderstanding — the phenomenon of data being produced everywhere, all the time, has led to two problematic assumptions:

1. Data has value.

2. More data means more value.

These statements aren’t wrong, but they aren’t right, either. Every day people face the concept of data and its presumed importance, yet they are rarely face to face with the real thing. Few are ever hands-on with messy, tangled, often redundant, or self-contradicting datasets. Instead, we have abstracted notions that working with data is like panning for gold. We envision teams of data scientists sifting through rows to uncover magic answers or those all-important insights.

Few have experienced how difficult a material data can be to work with and just how easy it is to get it wrong.

The inability to understand data’s meaning in its rawest form causes its content to be clouded by assumptions of eventual usefulness. We experience and rely on dozens of data-enabled systems every day, yet few have experienced how difficult a material data can be to work with. Few know just how easy it is to get it wrong.

Look at all the value! Many numbers. Wow.

It’s too easy to picture data as binary bits of ones and zeros, colorful charts, or detailed spreadsheets, and to assume that it exists in a form that someone, somewhere can easily understand. We find an expert who can do just that, and all is well in the world. Meanwhile, in just the time it takes to read this sentence, many business leaders have wondered whether that artificial intelligence with a human name they’ve seen in commercials could just make sense of all their data instead.

We don’t think of other materials this way. You don’t have to be an expert carpenter to tell if a rocking chair is comfortable or to guess whether it will hold your weight before sitting on it. We understand that the branches of a dead tree are fragile and dangerous, while simultaneously trusting our lives to the quality lumber that makes up the structure of our homes.

I’m not advocating for everyone to attend a data science bootcamp, be able to spin up a Jupyter Notebook, or attempt to keep up with the latest techniques from academia and Silicon Valley. I’m saying that businesses benefit when team members across disciplines all share some level of data literacy. Have experts on your teams, but don’t throw problems over the wall to them. Work in collaboration with them to make sense of the underlying issues, the larger strategy, and problems you might not even realize exist yet.

Why do all IoT devices have temperature sensors?

You can now buy a refrigerator that records every time you open its doors. Is that data inherently useful? No. But if the data comes with context — a timestamp for the event, information about the make, model, and serial number — well then, yes, probably. With even basic statistical or machine learning methods (and some examples as training data), a decent data scientist could assign probabilities: someone in the home is having company over, cooking a meal, on vacation, or has a drinking problem. Then your typical weekday routine could be modeled with ease, turning that data into information that could be very valuable to an appliance manufacturer, grocery store chain, food producer, recipe aggregator, or at-home meal service.
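To make that concrete, here is a minimal sketch of what “basic statistical or machine learning methods” could look like for door-open events. The event format, the features, the activity labels, and the use of scikit-learn are all assumptions for illustration, not any appliance maker’s actual pipeline.

```python
# Hypothetical sketch: turning fridge door-open timestamps into activity guesses.
# Event format, features, and labels are invented for illustration.
from datetime import datetime
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression  # assumes scikit-learn is installed

def featurize(day_events):
    """Summarize one day of door-open timestamps into a few simple features."""
    hours = [datetime.fromisoformat(ts).hour for ts in day_events]
    counts = Counter(hours)
    return [
        len(day_events),                        # total opens that day
        max(counts.values()) if counts else 0,  # opens in the busiest hour
        sum(1 for h in hours if 17 <= h <= 20), # evening opens (meal prep?)
    ]

# Toy training data: (one day of events, label). Labels are made up.
days = [
    (["2017-10-03T18:05", "2017-10-03T18:20", "2017-10-03T19:01"], "cooking"),
    (["2017-10-03T21:00"] * 15, "hosting"),
    ([], "away"),
]

X = np.array([featurize(events) for events, _ in days])
y = [label for _, label in days]

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict_proba([featurize(["2017-10-04T18:10", "2017-10-04T18:40"])]))
```

With enough labeled days, the same shape of pipeline is what would turn raw door events into the “company over / cooking / on vacation” probabilities described above.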

What if the fridge had no way of differentiating between “hasn’t been opened in a while” and “can’t access the internet”? This discrepancy calls into question every gap between the data points. Did the refrigerator go unused all day, or is the data just missing?

This is why many IoT devices include a temperature sensor — not because temperature data is always useful, but because it’s a cheap sensor that can be pinged at a constant rate and act as a slightly useful heartbeat for the device. Temperature brings validity to the other elements of the data.
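Here is a minimal sketch of how such a heartbeat might be used, assuming a hypothetical five-minute reporting interval: if the expected temperature pings arrived during a quiet stretch, the fridge was online and simply unused; if they didn’t, the door data is probably just missing.

```python
# Minimal heartbeat sketch; the interval and threshold are assumptions.
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(minutes=5)  # hypothetical reporting rate

def classify_gap(gap_start, gap_end, temperature_pings):
    """Was the fridge simply unused during this gap, or was it offline?"""
    expected = (gap_end - gap_start) // HEARTBEAT_INTERVAL
    received = sum(1 for t in temperature_pings if gap_start <= t <= gap_end)
    if expected == 0:
        return "gap too short to judge"
    # If most heartbeats arrived, the device was online and genuinely unused.
    return "unused" if received >= 0.9 * expected else "offline / data missing"

gap_start = datetime(2017, 10, 3, 9, 0)
gap_end = datetime(2017, 10, 3, 17, 0)
pings = [gap_start + i * HEARTBEAT_INTERVAL for i in range(96)]  # a ping every 5 minutes
print(classify_gap(gap_start, gap_end, pings))  # -> "unused"
```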

Samsung’s newest fridge (Family Hub) takes 3 photos every time the door closes.

Can I guess your SSN?

Here’s a simpler, more sensitive example: it’s very easy to generate social security numbers. I can type random numbers in the pattern and likely match an issued one pretty quickly: 652–12–7623, 236–76–1230, 824–01–3628. Out of 1 billion possible SSNs, about 450 million have been issued, so each guess has close to a 50/50 chance of hitting a real one. Prior to 2011 this was even easier: the first three digits encoded geographic area-number information, and some ranges of numbers have been completely used up. One can very easily generate a dataset containing every single valid, issued, and yet-to-be-issued SSN in this format. However, without context like the name, birthday, and address of the individual owner, they are obviously just random numbers with some hyphens.
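A small sketch of how little effort this takes: the snippet below only formats random nine-digit numbers in the SSN pattern (no lookup of real issued numbers) and repeats the back-of-the-envelope arithmetic from above.

```python
# Sketch: random digits in the SSN pattern are trivially easy to produce.
import random

def random_ssn_like():
    """Return a random string in the NNN-NN-NNNN pattern (just digits and hyphens)."""
    n = random.randrange(1_000_000_000)  # one of ~1 billion possible 9-digit values
    s = f"{n:09d}"
    return f"{s[:3]}-{s[3:5]}-{s[5:]}"

print([random_ssn_like() for _ in range(3)])

# Back-of-the-envelope hit rate: ~450 million issued out of ~1 billion possibilities.
print(450_000_000 / 1_000_000_000)  # ≈ 0.45, close to a coin flip per guess
```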

At what point does a list of possible social security numbers really become data? If it has a billion rows, can be placed in a database, downloaded and shared — but isn’t valuable, is it still data? What if we somehow could accurately assign each a status: issued_alive, issued_deceased, or not_yet_issued? Even then, as people die and new numbers are issued — our dataset slowly becomes invalid and out of date.

Maybe it isn’t valuable until we can assign another piece of personally identifiable information: last_name, first_name, date_of_birth? Each piece of information might be fairly useless on its own, but incredibly valuable when added to the others.
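As a purely synthetic illustration of that point: each field below is close to useless on its own, and only once they are joined on the same key do they start to resemble information about a person. Every value here is invented.

```python
# Purely synthetic illustration: value appears when fields are joined, not before.
ssn_status = {"652-12-7623": "issued_alive"}   # a status alone: near-useless
names = {"652-12-7623": ("Doe", "Jane")}       # a name alone: near-useless
birthdays = {"652-12-7623": "1980-01-01"}      # a date alone: near-useless

# Joined on the same key, the fields become a (fake) personally identifiable record.
key = "652-12-7623"
record = {
    "ssn": key,
    "status": ssn_status[key],
    "last_name": names[key][0],
    "first_name": names[key][1],
    "date_of_birth": birthdays[key],
}
print(record)
```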

You don’t need to be a great writer to know how to read.

In order for a business to advance into a truly data-driven enterprise, it needs to move beyond an isolated team of data scientists to a broader, multidisciplinary approach to data literacy. Decisions about what data to gather, how to store it, and how to make sense of it will all be improved when the business as a whole understands what makes its data significant.

Be critical about data’s substance and schema: interrogate the data to understand what it will be able to tell you instead of just assuming its eventual utility. Treat it like you would any other material, with a healthy dose of pragmatism and common sense. Without this, data won’t be gathered for an eventual purpose; it will be hoarded and eventually thrown out.

Remember: just because you wrote it down, doesn’t mean it’s valuable.
