Storytelling with Data

The subtle art of good analytics and explainable insights

Scott Haines
Towards Data Science
6 min read · Oct 9, 2019

An open book with tropical pirates, a treasure chest, and pirate ships coming off the page. Storytelling with data isn’t quite as whimsical, but it does provide the right mental model. (Source: Pixabay)

What is your story?

Across the world, we are producing and consuming data at exponential rates. With the advent of 5G networks we should be capable of handling up to 10 Gb/s, up from roughly 300 Mb/s on 4G LTE. That means we have the bandwidth to send and consume more event data and metrics from connected devices and embedded systems than at any point in human history. However, you have to ask yourself: “what is your data telling you?”

When you consider companies that are data-first and data-driven and succeeding within their respective industries (think Google, Facebook, Microsoft) versus companies that are merely collecting data in the hope of using it to meet business goals at some point further down the road, the goals and mindsets couldn’t be more different.

If you are actively producing, consuming, and putting to use the data generated within your organization, there is typically a rich data culture within the business as well as a direct, company-wide mandate to emit and store usable data from across the verticals or business units. Companies that are not using the data they produce, on the other hand, have no idea whether the structure of that data will benefit them down the line, and that uncertainty hampers their future ability to make informed decisions from it. (See my other post on Preventing the Data Lake Abyss for more details on data hygiene.)

Preparing your data for success

Library books on a shelf: organized storage for tomes of the written word, or data. The library is a good analogy for a data store or data warehouse: it is organized by category, broken down by author, and you can look up exactly what you are looking for no matter when the book was written. (Source: Pixabay)

Maintain a Data Vocabulary

If you consider how intelligent life communicates, you can distill it down to a common vocabulary that transforms words into meaning. Given that the meaning of a word is informed by the context of its setting, it makes sense to represent the data encapsulated within an event or metric with a common vocabulary, defined within a structure that can be easily understood and maintained for the foreseeable future.

Take initiatives like the CloudEvents open specification, which is building an open schema for event-based infrastructure; the Elastic Common Schema, which aims to unify the way organizations write data into Elasticsearch; or the older HTML microdata specification, which enables a common set of entities and relationships to live in HTML markup. All of these initiatives share overlapping goals: specifications that enable a unified vocabulary and structure for data.
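To make that concrete, here is a minimal sketch of what a shared event envelope could look like as a compiled schema (Protocol Buffers, the format championed later in this post). The attribute names only loosely mirror the required CloudEvents attributes (id, source, type); the package and message names are hypothetical and not part of any of the specifications above.

```protobuf
syntax = "proto3";

package company.events.v1;

import "google/protobuf/timestamp.proto";

// A minimal shared event envelope. Names are illustrative and only loosely
// mirror the required CloudEvents attributes (id, source, type).
message EventEnvelope {
  string id = 1;                              // unique identifier for this event
  string source = 2;                          // producing service, device, or system
  string type = 3;                            // e.g. "delivery.schedule.changed"
  google.protobuf.Timestamp occurred_at = 4;  // when the event happened
  bytes data = 5;                             // serialized event payload
}
```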

It is invaluable to document an agreed upon format that can be reused within your company or organization

Whether you use any of these specifications or simply reference them as a cheatsheet for how to model your data, it is invaluable to document an agreed-upon format that can be reused within your company or project.

Shared libraries and common formats help to reduce the burden of figuring out how to model new data. This can be achieved in many ways, but I am a huge proponent of compiled structured data (like Google’s Protocol Buffers). Protocol Buffers allow the reuse of data structures across many composable types with the use of imports, and they compile down to many languages (Java, Objective-C, JavaScript, Scala, and more), so the same data structures can be used across an organization by importing a library; a quick sketch of that composition follows below. This post is not about protobuf, though, but about how to think about modeling data, so let’s get back to that.
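As a quick illustration of that import-based composition, the listing below shows two separate files: a shared Address type and a User message that reuses it via an import. The file paths, package names, and fields are all made up for this example.

```protobuf
// ---- common/address.proto ----
// A shared type meant to be reused across many messages.
syntax = "proto3";

package company.common.v1;

message Address {
  string street = 1;
  string city = 2;
  string region = 3;
  string postal_code = 4;
}

// ---- users/user.proto (a separate file) ----
// Composes the shared Address type via an import.
syntax = "proto3";

package company.users.v1;

import "common/address.proto";

message User {
  string user_id = 1;
  string email = 2;
  company.common.v1.Address shipping_address = 3;
}
```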

When thinking about Data

Imagine for a second the seemingly simple use case of modeling User data at a company with no data warehouse, no data lake, and not even a traditional database. Startups face this kind of problem all the time, and if you have started to think about solving it too, you have probably started to think of a User in terms of, say, your Netflix or Facebook account. It helps, though, to ask yourself what a User is in general, whether the notion of a User has the same meaning across more than one context, and what exactly it is that encapsulates “User data”.

What are the use cases you would like this data to solve for?

I like to start this process off by working backwards: what are the use cases you would like this data to solve for? This is no different from the exercise a product manager (PM) goes through, one that most engineers are now intimately familiar with. You know, “As a user of this feature I would like to…”. One can draw many corollaries between PM user stories and data modeling, and use that process as a mental model for working with data, since it helps to break down complex problems into snackable, contextual ideas.

A pile of Lego bricks, just building blocks for imaginative creations. Data primitives can be combined in the same way to create new structures from common components. (Source: Pixabay)

Say we all work at a new startup that is focused on delivering the next great local-farm-fresh-to-front-porch-in-a-box concept. Now that we have context, and given that context is king, we can start to consider the User and the User data modeling journey on the same page. So what kinds of Data Stories should we be able to tell?

“From this data, I would like to be able to understand customer churn based on geographical region”

“From this data, I would like to be able to understand how the provider of the farm fresh food factors into customer satisfaction”

“From this data, I would like to understand what times of year certain produce is in higher demand, what that produce is, and how we can source it from local providers (vendors)”

Breaking things down. Because it is fun!

If we start with concrete examples of how our data will be used, the process of breaking things down in reverse order becomes an easier exercise and gives the data modeler a clear idea of what is necessary.

Given the three use cases above, we can distill the need for a User (customer), a Vendor (the provider or farm in this case), and a Product (a box of farm-fresh food) that contains a collection of one or more Items (produce/goods) and is controlled by a Schedule (an interval of deliveries), which dictates when a box of farm-fresh food will show up on someone’s doorstep as a Delivery.

Now imagine these relationships as an overarching data story: “A User of our Service receives our Product on a given Schedule, and the Items in each Delivery contain seasonal selections from local Vendors.”
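Sketched as compiled structured data, those core entities might look something like the following. Every package, message, and field name here is a hypothetical assumption; treat it as one possible shape for the model rather than the definitive one.

```protobuf
syntax = "proto3";

package harvestbox.catalog.v1;

import "google/protobuf/timestamp.proto";

// One possible shape for the entities distilled from the use cases above.
message User {
  string user_id = 1;
  string name = 2;
}

message Vendor {
  string vendor_id = 1;
  string name = 2;
}

message Item {
  string item_id = 1;
  string name = 2;            // e.g. "heirloom tomatoes"
  string vendor_id = 3;       // the local provider sourcing this item
}

message Product {
  string product_id = 1;
  repeated Item items = 2;    // a box contains one or more items
}

message Schedule {
  string schedule_id = 1;
  string user_id = 2;
  uint32 interval_days = 3;   // how often a delivery occurs
}

message Delivery {
  string delivery_id = 1;
  string product_id = 2;
  string schedule_id = 3;
  google.protobuf.Timestamp delivered_at = 4;
}
```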

This reminds me that we also need to attribute a GeoRegion, as well as some temporal data, to the User, the Vendor, and the User and Vendor activity. This data can be encapsulated directly in the main records (User/Vendor) as well as in an Event. Events could include increasing or decreasing the frequency of a Delivery schedule, or a Vendor declaring they are out of an Item. Events can be created for almost anything, and they enable you to cross-correlate Users’ and Vendors’ behavior in the system and start to layer more insights on top of the event data.
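Continuing the hypothetical schema from above, the geographic and temporal pieces could be folded in with a small GeoRegion type and a generic activity Event. Again, the names and event types below are illustrative assumptions, not a prescription.

```protobuf
syntax = "proto3";

package harvestbox.events.v1;

import "google/protobuf/timestamp.proto";

// Hypothetical geographic attribution, embeddable in User and Vendor records.
message GeoRegion {
  string country = 1;
  string region = 2;                          // e.g. "pacific-northwest"
}

// A generic activity event correlating User and Vendor behavior over time.
message Event {
  string event_id = 1;
  string event_type = 2;                      // e.g. "schedule.frequency.increased",
                                              //      "vendor.item.out_of_stock"
  string user_id = 3;                         // populated when the actor is a User
  string vendor_id = 4;                       // populated when the actor is a Vendor
  GeoRegion region = 5;
  google.protobuf.Timestamp occurred_at = 6;  // the temporal component
}
```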

While building out the use cases, we also have to account for a User’s satisfaction, and that could be solved with the notion of a Rating associated with each Product Delivery. This could be sourced through direct requests via the service’s app or website, e.g. the typical “Rate your experience” prompt you get when you step out of a Lyft or have some DoorDash delivered. The benefit here is that you can begin to learn from each User based on their prior satisfaction with each of their Deliveries, which opens the door to building new learning models on top of all of this data. Say you want to keep track of the best and worst Vendors by region based on direct customer feedback: that metric could be used to discount partnership fees for the best-performing Vendors, and it is also a metric to use when dropping the poorest-performing Vendors.
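A Rating could then be another small message tied back to a specific Delivery, something like the hypothetical sketch below (the 1–5 score scale and the field names are assumptions for illustration).

```protobuf
syntax = "proto3";

package harvestbox.feedback.v1;

import "google/protobuf/timestamp.proto";

// Hypothetical per-delivery satisfaction rating, e.g. from a
// "Rate your experience" prompt in the app or on the website.
message Rating {
  string rating_id = 1;
  string user_id = 2;
  string delivery_id = 3;                     // ties feedback to a specific Delivery
  uint32 score = 4;                           // assumed 1-5 scale
  string comment = 5;                         // optional free-form feedback
  google.protobuf.Timestamp submitted_at = 6;
}
```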

The Storyteller

Breaking down complex ideas and delivering meaning from words is the responsibility of any great storyteller. It is even more important for a data storyteller, given that their success can make or break a company, and that their ideas and meaning are a composition of easy-to-follow metrics, clearly defined events, and insightful, behavior-driven accounting of piles of otherwise boring data! Now go make this happen for yourself.

Distinguished Software Engineer @ Nike. I write about all things data, my views are my own.