The Future of Data Catalogs

Let’s visit a website just to “browse the metadata,” said no one ever

By Prukalpa. Published in Towards Data Science, May 10, 2022 (6 min read).

Last Friday, Data Twitter was buzzing with Josh Wills’ tweet about metadata and business intelligence.

At Atlan, we started as a data team, and we failed three times at implementing a data catalog. As a data leader who saw these projects fail, I found that the biggest reason data catalogs fail is the user experience. This isn’t just about a beautiful user interface though. It’s about truly understanding how people work and giving them the best possible experience.

People like Josh want context where they are, when they need it.

For example, when you’re in a BI tool like Looker, you inevitably think, “Do I trust this dashboard?” or “What does this metric mean?” And the last thing anyone wants to do is open up another tool (aka the traditional data catalog), search for the dashboard, and browse through metadata to answer that question.

Imagine a world where data catalogs don’t live in their own “third website”. Instead, a user can get all the context where they need it — either in the BI tool of their choice or whatever tool they’re already in, whether that’s Slack, Jira, the query editor, or the data warehouse.

*Active metadata in Looker. (Image by author.)*

I believe this is the future of data catalogs — activating metadata and bringing metadata back into the daily workflows of data teams.

In Josh’s words, “It’s like reverse ETL but for metadata”.

Why don’t data catalogs work like this today?

Traditionally, data catalogs were built to be passive. They brought metadata from a bunch of different tools into another tool called the “data catalog” or the “data governance tool”.

The problem with this approach is that it tries to solve a “too many silos” problem by adding one more siloed tool. That doesn’t solve the problem that users like Josh face every day. Eventually, user adoption suffers!

A senior data leader at a large company called these data catalogs “expensive shelfware”, or software that sits on the shelf and never gets used.

*The problem with traditional data catalogs. (Image by author.)*

How can we save data catalogs from becoming shelfware?

Think about the modern tools we use and love today — GitHub, Figma, Slack, Notion, Superhuman, etc.

One common thing across all these tools is the concept of flow. In the words of Rahul Vohra (founder of Superhuman):

“Flow is a magical feeling.
Time melts away. Your fingers dance across the keyboard. You’re driven by boundless energy and a wellspring of creativity — you are completely absorbed by your task.
Flow turns work into play.”

The secret to magical data experiences lies in flow. These great user experiences aren’t about the macro-flows. They’re about micro-flows, like not having to switch to a separate data catalog to get context for the dashboards in your BI tool. There are dozens of micro-flows like this that can power magical experiences and completely change the way that data users feel about their work.

Therein lies the promise of active metadata.

What is active metadata?

Instead of just collecting metadata from the rest of the stack and bringing it back into a passive data catalog, active metadata makes a two-way movement of metadata possible, sending enriched metadata back into every tool in the data stack.

My favorite explanation of “active metadata” and how it is different from traditional, passive approaches actually goes back to… the dictionary.

“If you describe someone as passive, you mean that they do not take action but instead let things happen to them.”
— Collins Dictionary

Being “active” is about always being engaged and moving forward, rather than sitting back and letting things happen around you.

Take a moment to think about what this means in the context of metadata, and it paints a picture of what active metadata can be: metadata that transforms into “action” to make our data experiences better.

Achieving flow through active metadata

The only reality in data teams is diversity — a diversity of people, tools, and technology. Diversity that leads to chaos and sub-optimal experiences for everyone involved.

The key to wrangling this diversity and achieving flow lies in metadata. It’s the common thread across all of our tools that gives the context we’re desperately lacking every time we bounce between tools to figure out what’s going on with a data project.

When you’re browsing through the lineage of a data asset and find an issue, you can create a Jira ticket right then and there.
When you ask a question about a data asset in Slack, a bot brings context about that asset directly to you in Slack.
When you are pushing to production in GitHub, a bot runs through the lineage and dependencies and gives you a “green” status that you’re not going to break anything — right in GitHub.
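To make the GitHub example a little more concrete, here is a minimal sketch in Python of what such a lineage check could look like. Everything in it is an assumption for illustration: the `LINEAGE` graph, the asset names, and the `downstream_assets` and `check_status` helpers are hypothetical, not the API of GitHub, Atlan, or any other tool.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that read from it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "marts.customers"],
    "marts.revenue": ["dashboard.weekly_revenue"],
    "marts.customers": [],
    "dashboard.weekly_revenue": [],
}


def downstream_assets(lineage, changed):
    """Walk the lineage breadth-first and return every asset downstream of `changed`."""
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


def check_status(lineage, changed, breaking):
    """Return ("green" | "red", impacted assets) for a proposed change.

    `breaking` is an assumed flag (for example, set when a column is dropped).
    The status is "red" only if a breaking change has downstream dependents.
    """
    impacted = downstream_assets(lineage, changed)
    if breaking and impacted:
        return "red", impacted
    return "green", impacted


status, impacted = check_status(LINEAGE, "staging.orders", breaking=True)
print(status, impacted)
```

In a real setup, the graph would come from the metadata platform rather than a hard-coded dictionary, and the status would be posted back to the pull request so nobody has to leave GitHub to see it.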

*Activating metadata. (Image by author.)*

Going beyond the data catalog

The “data catalog” is just a single use case of metadata — helping users understand their data assets. But that barely scratches the surface of what metadata can do.

Activating metadata holds the key to dozens of use cases like observability, cost management, remediation, quality, security, programmatic governance, auto-tuned pipelines, and more.

The more I think about this, the more I believe that active metadata can make the dream of an intelligent data system a reality.

Here’s an example of how it could work:

With active metadata, you could use past usage metadata from BI tools to understand which dashboards are used the most and when people use them.
End-to-end lineage connects these dashboards to the tables that power them in the data warehouse.
Operational metadata shows connected compute workloads, associated data pipelines, and run times.

Couldn’t we use all of this information to auto-tune our pipelines and compute, optimizing for a great user experience (updated data in the dashboard when people need it, and best performance at the time of max usage) while minimizing costs?
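As a rough illustration of that idea, the sketch below combines the three kinds of metadata to suggest when a pipeline should start so the dashboard is fresh just before peak usage. The data, the names (`DASHBOARD_VIEWS`, `DASHBOARD_PIPELINE`, `PIPELINE_RUNTIME_MIN`), and the scheduling rule are all assumptions I made up for this example. A real system would pull these from BI usage logs, lineage, and orchestrator run history.

```python
from collections import Counter

# Hypothetical usage metadata: hour of day (0-23) of each dashboard view.
DASHBOARD_VIEWS = {"weekly_revenue": [9, 9, 10, 9, 14, 9, 10]}

# Hypothetical lineage: which pipeline feeds each dashboard.
DASHBOARD_PIPELINE = {"weekly_revenue": "revenue_pipeline"}

# Hypothetical operational metadata: typical pipeline run time in minutes.
PIPELINE_RUNTIME_MIN = {"revenue_pipeline": 45}


def peak_hour(view_hours):
    """Return the hour of day with the most dashboard views."""
    return Counter(view_hours).most_common(1)[0][0]


def suggest_start(dashboard, buffer_min=15):
    """Suggest a (pipeline, (hour, minute)) start time.

    The pipeline should finish `buffer_min` minutes before the dashboard's
    peak usage hour, so the start time is peak - runtime - buffer.
    """
    peak = peak_hour(DASHBOARD_VIEWS[dashboard])
    pipeline = DASHBOARD_PIPELINE[dashboard]
    runtime = PIPELINE_RUNTIME_MIN[pipeline]
    # Wrap around midnight so early-morning peaks still give a valid time.
    start_min = (peak * 60 - runtime - buffer_min) % (24 * 60)
    return pipeline, divmod(start_min, 60)


print(suggest_start("weekly_revenue"))
```

The same inputs could drive cost decisions too, such as scaling compute down outside the hours when anyone actually opens the dashboard.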

*Active metadata use cases. (Image by author.)*

Beyond that, it feels like the use cases of active metadata are limitless. It has the potential to bring intelligence and flow to every part of the data stack and act as the gateway to the data stack of our dreams: a truly intelligent data system. For example, it could:

Automatically deduce the owners and experts for data tables or dashboards based on SQL query logs
Automatically stop downstream pipelines when a data quality issue is detected, and use past records to predict what went wrong and fix it without human intervention
Automatically purge low-quality or outdated data products
and much more
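The first idea in that list, deducing experts from SQL query logs, can be sketched in a few lines of Python. This is a simplified illustration under loud assumptions: the `QUERY_LOG` entries and user names are invented, and the regular expression is a crude stand-in for a real SQL parser (it only catches table names that directly follow `FROM` or `JOIN`).

```python
import re
from collections import Counter, defaultdict

# Hypothetical query log: (user, SQL text) pairs.
QUERY_LOG = [
    ("ana", "SELECT * FROM marts.revenue WHERE week = 12"),
    ("ana", "SELECT r.total FROM marts.revenue r JOIN marts.customers c ON r.cid = c.id"),
    ("ben", "SELECT count(*) FROM marts.customers"),
    ("ana", "SELECT sum(total) FROM marts.revenue"),
    ("ben", "SELECT * FROM marts.customers LIMIT 10"),
]

# Crude table extraction: the identifier right after FROM or JOIN.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def likely_experts(query_log):
    """Map each table to the user who queried it in the most log entries."""
    counts = defaultdict(Counter)
    for user, sql in query_log:
        # Use a set so a table mentioned twice in one query counts once.
        for table in set(TABLE_PATTERN.findall(sql)):
            counts[table][user] += 1
    return {table: users.most_common(1)[0][0] for table, users in counts.items()}


print(likely_experts(QUERY_LOG))
```

Query frequency is only a proxy for expertise, so in practice a suggestion like this would probably be surfaced for a human to confirm rather than written straight into the catalog.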

In the past few years, it has been heartening to see active metadata become the de facto standard for next-generation metadata, with even Gartner releasing its inaugural Market Guide for Active Metadata a few months ago. This may sound a little crazy, but in a world with self-driving cars, smart houses, and rovers that navigate themselves across Mars, why can’t we imagine a smarter data experience powered by our wealth of metadata?

