Good Data Citizenship Doesn’t Work

We can and should ask everyone to be good data citizens. But we have to be good leaders as well.

Jointly authored by Benn Stancil at Mode and Mark Grover at Stemma.

Beginning in grade school, we’re all taught to be good citizens. We learn from civics teachers that democracies don’t work if we don’t vote and participate in local politics; we hear from social leaders that we have to speak up and make our voices heard; we’re told by elected officials that we should ask not what our country can do for us, but ask what we can do for our country.

It’s a powerful message. But it’s incomplete. Citizens alone can’t create a prosperous society; elected leaders have to do their part as well. They have to write bills, pass laws, negotiate compromises, and build fair and functioning institutions that carry out the will of the people. Just as people can’t pass the responsibilities of democracy off to governments, governments can’t pass the responsibility of leading off to the people.

Data organizations — which increasingly frame their goals around terms like “data democratization” and “citizen analysts” — need to remember the same thing. As data has grown in importance and scope, data teams have been trying to outsource responsibilities to their non-data colleagues, calling on more of them to manage, analyze, and curate an organization’s analytical assets. On the whole, the effort makes sense: Data teams are too small to do it on their own, and, just as is true in our political process, more involvement from the citizenry makes for a more vibrant society.

But we should be careful about taking this approach too far. Organizations “elect” us to be the primary stewards of our companies’ data, and to build the institutions that both make their participation effective and provide for their needs. We can, and should, ask everyone to be good data citizens. But we have to be good leaders as well — especially in matters related to managing and organizing data assets.

From full service to self-service

Decades ago, when data was primarily used to report a few business KPIs to executives, data teams (or, more often, IT teams) could do most of this work on their own. There was no need for a data citizenry, any more than there is a need for a citizenry at a restaurant: A few people place a few orders, and the kitchen handles them just fine.

In recent years, this has all changed. The number of patrons has grown enormously, from executives and a few operational leaders to everyone at a company. The variety of orders has also exploded. Data isn’t used just for KPIs, but for answering complex questions across a wide range of topics, from product and engineering to operations, customer support and executive decision making. And the complexity of each order has increased too. Data that used to live in a few central systems is now spread across dozens (if not hundreds) of third-party SaaS applications. According to some sources, organizations now use more than ten times as many SaaS products as they did just five years ago [1].

As a data industry, we’ve responded to this enormous increase in demand by building a buffet. Using self-serve tools in which people can create their own analyses, patrons assemble their own orders from the basic ingredients provided by the data team. This new restaurant can serve far more people than before, dramatically expanding access to data across an organization.

But it’s created a new problem: How can people know they can trust the data they’re seeing? To overextend the buffet analogy, how can people know the food they find in the corner station, the one that feels like it’s been unattended all night, is still safe to eat?

All too often, it isn’t. People show up in a meeting to talk through a business problem and bring mismatched numbers. All too often, data slows decisions down rather than speeding them up, because people spend more time debating whether the data is right than talking about what to do about it. And all too often, data teams update core datasets or KPIs, and old versions linger on for months.

Incidents like these erode trust in data and data teams. They teach people not to participate in the data democracy, and to make decisions entirely on their own, often without using data at all.

Building a better menu

To solve this problem, many data teams turn to data documentation. The idea is simple enough: Create a menu that explains what data is available and what it means. Though this documentation can take lots of different forms — internal wikis, third-party products, chatbots, a Google sheet — for self-serve experiences to work, we need something to help people make sense of the raw data they’re served.

Unfortunately, as anyone who’s tried to create — and maintain — data documentation knows, this is much easier said than done. Organizations have so much data that it’s often impossible to reliably document even the most important datasets. And because data changes so frequently, docs also often go stale, leading people to distrust the docs as much as they do the underlying data.

To solve these problems, organizations sometimes hire full-time data stewards, who are accountable for defining data documentation and ensuring it stays up-to-date. Not only is this expensive, it rarely works. Data stewards are assigned to document data in departments in which they aren’t experts; as a result, they end up asking domain experts for help. Organizations could address this by hiring more stewards, but a lot of organizations can’t hire even one steward, much less several.

Seeing this challenge, many data teams lean on good citizenship. A member of the data team creates a Google doc for other people to document all the canonical sources of truth and the gotchas related to them. The doc gets announced on Slack, and for a few weeks, people diligently update and expand it. But the good habits slip. Data citizens forget to update it, or, understandably busy with their day jobs, don’t have time to do it. New people join the company and are never trained to update the docs.

Good citizenship, it turns out, doesn’t work either. Data documentation is a Herculean task, too big to be maintained by an overburdened data steward, and too ever-present to be shared among a distributed citizenry.

Thinking outside the doc

These efforts fail because, fundamentally, documentation is the wrong solution. Data, it turns out, doesn’t so much drift towards entropy as sprint at it.

Businesses change quickly; data sources are constantly evolving; responses to questions go stale almost as quickly as they’re answered. Trying to document a system like this would be akin to documenting the news — and that’s simply not how people keep up with current events. (There have been attempts at this, like Vox, which was launched as a website that would be full of news “explainers” that functioned as a kind of documentation. This model never really stuck, and Vox quickly became a news site that was structurally no different from any other.)

Rather than looking at systems of documentation for solutions, data teams should instead look at systems that organize vast amounts of constantly changing information. Fortunately, there are lots of examples.

  • News sites — Vox tried to replace — and eventually became — a standard news site. But if the goal is to keep people informed about current events, traditional news sites work pretty well. They succeed because they provide a constant feed of information. Even if you don’t read every story, the firehose creates a general awareness of current events, just as Instagram and Facebook feeds keep you passively updated about your friends. What can we learn from it? Airtime matters. Getting information in front of people, even if no single story explains everything, can keep people well informed.
  • Wikipedia — Wikipedia successfully “documents” an extraordinary amount of information, and responds remarkably quickly to current events. However, it works because of its scale: It’s maintained by hundreds of thousands of people, who are incentivized to do so because of Wikipedia’s billions of readers [2]. What can we learn from it? “Citizen documentation” only works on an enormous scale. About 0.05% of Wikipedia’s readers edit an article in a given month. Even if companies could get people to update data documentation at 50 times that rate, a company of 1,000 people would only have 25 people contribute to their documentation. That’s not nearly enough to keep up with the changing pace of data.
  • Yelp — Rather than sourcing reviews from professional critics, Yelp (and dozens of other sites like it) relies on restaurant patrons to write them. As a result, reviews are updated much more frequently than they would be otherwise. What can we learn from it? When we talk about documentation, we tend to focus on owners. Customer input can also be useful, even if it’s just flagging something as out of date. This is especially true for “savvy” customers who know the domain well.
  • Quora and StackOverflow — These sites focus on answering questions instead of documenting information. This lets people address specific needs, provides mechanisms for important problems to bubble to the top of the pile, and focuses less on live documents and more on point-in-time answers. This also encourages people to respond to questions — to “document” things — because the value of the answer is clearer. What can we learn from them? Focused mechanisms for answering questions can make it easier to incentivize people to participate, and ensure that their participation is more targeted. However, this works best in settings where answers are stable. In domains that are changing quickly, point-in-time answers go stale quickly.
  • Google — Google used crawlers to index the web, not unlike other search engines that existed before. However, its true innovation was PageRank, which used the number of backlinks to a site as a proxy for how trustworthy it is. Over time, Google built tools to give website owners more control over what gets indexed from their sites and how they can optimize their ranking. For businesses, Google even began providing ways for people to upload information directly. What can we learn from it? Discovery systems can start with automation, and layer on various levels of curation for people to review, correct and augment.
  • The internet — Nothing is changing faster than the internet, and yet, we still do a pretty good job of navigating it. This happens through a couple of means: bookmarks and search. First, we have primary sites that we actively go to — sites like Reddit, Twitter, the New York Times, and so on. For everything else, we find it through search, which is powered by popularity. We’d never (ok, AOL tried, back in the day) try to document the internet and navigate it with a table of contents. What can we learn from it? It’s sometimes better to throw out “organization” completely and rely entirely on search, powered by usage.

Lessons learned

These lessons — as well as those from our past mistakes — will soon find their way into our approach for organizing and curating data. We think they’ll get applied in a handful of ways.

Review more, document less

Just as Google organized the world’s information by automatically mapping what’s popular, we will need to do the same for data. We need to auto-document as much as we can, based on how people are using data already. The aim shouldn’t be to create a perfect source of truth, but a sketch of what appears to be true.

Once that first pass is drafted, we should hand it over to data owners and domain experts to review, correct, and augment. This way, people are only required to document the things that are tribal knowledge — the places where something looks one way, but actually means something different — while machines automate the rest by scanning the logs and APIs of data warehouses, BI tools, Slack, JIRA, and more.
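
To make this concrete, here is a minimal sketch of the automated first pass, assuming warehouse query logs have already been exported as simple records; the field names and the naive table-extraction regex are illustrative stand-ins rather than any particular warehouse’s API.

```python
import re
from collections import defaultdict

# Naive table extraction; a real system would use a proper SQL parser.
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)

def draft_docs(query_log):
    """Build a rough popularity sketch: which tables are used, how often,
    and by how many distinct people."""
    usage = defaultdict(lambda: {"queries": 0, "users": set()})
    for record in query_log:
        for table in TABLE_PATTERN.findall(record["sql"]):
            usage[table]["queries"] += 1
            usage[table]["users"].add(record["user"])
    # Turn raw counts into draft documentation stubs for owners to review.
    stubs = [
        {
            "table": table,
            "queries": stats["queries"],
            "distinct_users": len(stats["users"]),
            "status": "needs owner review",  # humans add the tribal knowledge
        }
        for table, stats in usage.items()
    ]
    return sorted(stubs, key=lambda stub: stub["queries"], reverse=True)

if __name__ == "__main__":
    log = [
        {"user": "ana", "sql": "select * from analytics.orders o join analytics.users u on o.user_id = u.id"},
        {"user": "raj", "sql": "select count(*) from analytics.orders"},
    ]
    for stub in draft_docs(log):
        print(stub)
```

The output is deliberately just a popularity-ranked draft; the “needs owner review” status is where the humans come in.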

Let there be mess

With the amount of data that organizations have today, any effort to review and document it has to be focused. Quora and StackOverflow show us that questions, rather than documentation, help uncover the most important issues.

In data, we should follow the same principle: Use the questions people are asking to find data hotspots and focus our energy on those. That means some corners of your data will be messy, and some concepts will go undocumented. That’s ok, so long as there’s a method for identifying when those areas “heat up.”
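
As a rough illustration of that method, a hotspot detector can be as simple as counting which datasets come up in people’s questions. The question source (say, a #data-help Slack channel) and the dataset names below are assumptions for the sketch, not a prescribed setup.

```python
from collections import Counter

# Datasets we already know about, e.g. pulled from the warehouse catalog.
KNOWN_DATASETS = {"analytics.orders", "analytics.users", "finance.invoices"}

def hotspots(questions, threshold=3):
    """Return datasets mentioned in at least `threshold` recent questions,
    most-asked-about first."""
    mentions = Counter(
        dataset
        for text in questions
        for dataset in KNOWN_DATASETS
        if dataset in text
    )
    return [dataset for dataset, count in mentions.most_common() if count >= threshold]

# Example: documentation effort goes to whatever this returns first.
recent_questions = [
    "why does analytics.orders double-count refunds?",
    "is analytics.orders safe to use for the board deck?",
    "which column in analytics.orders has the net amount?",
    "where do finance.invoices come from?",
]
print(hotspots(recent_questions))  # ['analytics.orders']
```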

It’s not just the owner

With Yelp, we saw that a savvy consumer is often more helpful to other customers than the restaurant owner. Patrons know what other customers care about, and are often more honest about it too.

Similarly, data consumers know what data they trust and what data they don’t. They’re the people who’ve been burned, who’ve run into the data’s gotchas, and learned about them the hard way. These are your analysts, data scientists, and business users. Give them a voice in the process, and don’t just rely on the data owners.

Get it while it’s hot

Lastly, the biggest thing we learned from data documentation in the past was that getting started, for all its difficulties, is the easy part. Maintaining up-to-date documentation is much harder.

The best way to address this is by documenting data when it’s generated. Make it part of the process of adding a new event, table, or replication job, when the change is already top of mind. If possible, embed it in the development process, and pester people when they don’t include the necessary updates. This shifts the burden of documentation upstream, making it part of the development cycle. Otherwise, “fix docs” gets added to an ever-growing backlog of follow-on projects that we never actually get to.
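
One way to wire that pestering into the development cycle is a pre-merge check that fails when a model ships without a description. The sketch below assumes dbt-style schema.yml files and the PyYAML package; the directory layout and failure message are illustrative, not a prescription.

```python
import sys
from pathlib import Path

import yaml  # requires PyYAML

def undocumented_models(models_dir="models"):
    """Find models declared in schema.yml files that have no description."""
    missing = []
    for schema_file in Path(models_dir).rglob("*.yml"):
        schema = yaml.safe_load(schema_file.read_text()) or {}
        for model in schema.get("models", []):
            if not (model.get("description") or "").strip():
                missing.append(f"{schema_file}: {model.get('name', '<unnamed>')}")
    return missing

if __name__ == "__main__":
    missing = undocumented_models()
    if missing:
        print("These models are missing descriptions:")
        print("\n".join(f"  - {m}" for m in missing))
        sys.exit(1)  # fail the CI job until the docs ship with the change
    print("All models documented.")
```

Run in CI on every pull request, a check like this keeps “fix docs” out of the backlog by refusing to merge a change without its docs.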

We like to tout how important data is to the modern organization, and how the more of it we have, the more valuable it is. We’d argue the opposite: Organizations have too much data. Without better ways of organizing it, large volumes of data are more overwhelming than useful.

Fortunately, we’re making progress. When it comes to managing data, we’ve figured out a bunch of things that don’t work, a few things that do, and have good parallels to draw from in adjacent spaces. By following these lessons — and, most importantly, by recognizing that we need a mix of good citizenship, good product, and good processes — we can finally build a data society that works for, and is trusted by, everyone.

References

[1] Statista, Average number of software as a service (SaaS) applications used by organizations worldwide from 2015 to 2021, accessed January 9, 2022.

[2] In the second half of 2021, 800 million unique devices visited the English Wikipedia each month. It was edited by about 400,000 people per month. Wikimedia Statistics, Unique devices, accessed January 9, 2022; Wikimedia Statistics, Editors, accessed January 9, 2022.
