Your Data Catalog Shouldn’t Be Just One More UI

An in-depth look at how an API-focused data catalog can help ensure the success of your data platform by combining different types of metadata

Mahdi Karabiben
Towards Data Science


Over the past few years, the data field has gone through a major shift in how data platforms are designed, built, and used. This change, often referred to as the third wave of data technologies, ushered in the era of the Modern Data Stack: an era in which scalability in terms of storage and compute is a solved problem, so we finally get the chance to focus on extracting as much value as possible from the data, democratizing access to it, and even shipping quality-of-life upgrades, without worrying too much about technical constraints and complexities.

One of the key pillars of this era was the data catalog, a concept that’s not new or revolutionary by any means — we can think of Informatica or the Hadoop ecosystem’s Apache Atlas — but that promised to unlock the untapped potential of combining the different types of metadata thanks to today’s mighty Modern Data Stack.

Like other parts of the stack, the problem was first tackled at data-driven tech companies. Airbnb talked about Dataportal back in 2017, and Netflix countered with a blog post about Metacat in 2018. Lyft open-sourced the pull-based Amundsen in 2019, and LinkedIn responded by open-sourcing the push-based DataHub in early 2020.

The expectations were high: unlocking the untapped potential of metadata would optimize how we interact with our data assets, streamline processes, solve many of today’s data management problems, and help us navigate the maze that is today’s data lineage. And yet here we are in 2022, with a dozen different data catalogs each offering merely one addition to our stack: a UI.

A UI that has a lot of capabilities, sure. Today’s data catalogs offer seamless data discovery, centralized data glossaries, a wide range of integrations, and many other useful features. But is this really it? I personally don’t think so.

In this article, we’ll go through some of the capabilities that data catalogs should (and can) actually offer. But first:

What went wrong?

Up to 2019, tackling metadata management at scale was still a complex challenge with many unanswered questions. Should we pull the metadata from different tools or should the tools push the metadata into the catalog? Should we only catalog tables, or should the graph also include all sorts of data assets (dashboards, notebooks, pipelines, etc.) and even people? Should we think of column-level lineage or would it only make things even more complex?

By 2019, the landscape had become clearer, and many major tech companies had disclosed how they tackled the problem and what did and didn’t work with their approach. SaaS tools were able to offer similar functionality in the following months, and we quickly reached a state where any company could add to its stack a data catalog offering core features like data discovery and lineage. And then progress stalled.

We no longer hear about innovative new capabilities or major improvements. Instead, today’s data catalogs seem content with being just one more brick within the data platform. A brick that tells us what’s in the warehouse and how assets are connected, but not much more.

The current state of the Modern Data Stack relies on a set of different tools that each store some of the metadata. For example, our data orchestration tool knows about the execution state of our data assets and the pipelines that build them, our data observability tool knows about the health of our assets, and our data catalog ensures we can navigate the different types of data assets.

The current state of metadata within the Modern Data Stack: different tools (in green) that connect to different assets (in yellow) and store a portion of the metadata. (Image by author)

In this current state, each new tool within the stack needs to go back to an empty drawing board and solve the problems that every other tool has already solved — like retrieving metadata from different sources directly and defining schemas for the different entities (for example, every tool needs to answer questions like “What metadata do we want to store for a given table?” and “What metadata can we retrieve from BI assets?”).

I personally believe that this current state is what’s blocking the progress of data catalogs. How can we unlock the potential of metadata if it’s scattered across ten different tools that have defined their own schemas and standards? Many data catalogs tried to address the problem by pulling metadata from as many tools as possible, but adding yet more connectors and integrations is never the correct answer to fix a messy architecture.

The path forward: fewer UIs, more APIs

“Metadata management is a solved problem”

Although it may not feel like it, there are currently no technical complexities or unsolved engineering problems preventing a given company from centralizing all of its metadata in one place (though industry-specific edge cases may still exist).

The next logical step, then, is for the data catalog to become the central metadata repository within the company, with other tools connecting to it via APIs to retrieve any metadata they need and to enrich it with new metadata. This shift demands standardized definitions of the different metadata entities, something that open-source projects like OpenMetadata have already implemented.
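
To make this concrete, here’s a minimal sketch, in Python, of what a standardized “table” entity could look like. It’s a deliberately simplified illustration inspired by the idea behind projects like OpenMetadata, not their actual schemas, and every field name here is a hypothetical choice:

```python
# A hypothetical, simplified "table" entity: one schema that every tool in the
# stack reads from and enriches, instead of each tool defining its own.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Column:
    name: str
    data_type: str
    description: Optional[str] = None
    tags: list[str] = field(default_factory=list)


@dataclass
class Table:
    fully_qualified_name: str  # e.g. "warehouse.analytics.orders"
    owner: str                 # team or person responsible for the asset
    columns: list[Column] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    # Enrichment pushed by other tools (quality checks, freshness, usage
    # stats, ...), all attached to this single shared entity.
    custom_properties: dict[str, object] = field(default_factory=dict)
```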

Once all of the metadata is available in one place, the data catalog moves into the role of a central horizontal metadata repository within the Modern Data Stack, instead of being just one more brick that stores a portion of the metadata. The data catalog UI, in turn, becomes a separate component built on top of this metadata repository.

Proposed management of metadata within the Modern Data Stack: a central metadata repository connected to the different tools (in green) and assets (in yellow). (Image by author)

This design offers two core benefits that unblock the current state of data catalogs:

  • We (slowly) move towards metadata standards: This ends the endless integrations/connectors race. Not every tool in the data stack needs to integrate with every other tool and asset to exchange metadata. Instead, we should aim to define standards and push the existing tools within the stack to align on them. Tools within the stack can still leverage and enrich metadata, but through standardized entities and schemas.
  • The data catalog becomes the access point to all of the metadata: This opens the door to use cases that combine different types of metadata, and it ensures that we can always build the full metadata picture of a given asset by simply interacting with the catalog (a rough sketch of what that could look like follows this list).
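
As an illustration of that second point, here’s what retrieving the full metadata picture of a table through a single catalog API could look like. The endpoint, parameters, and response fields are hypothetical, meant only to show the single-access-point idea rather than any real catalog’s API:

```python
# Hypothetical sketch: one call to the catalog instead of separate calls to
# the orchestrator, the observability tool, and the BI tool.
import requests

CATALOG_URL = "https://catalog.internal.example.com/api/v1"  # placeholder URL


def full_metadata_picture(table_fqn: str) -> dict:
    """Fetch schema, ownership, lineage, quality, and usage metadata for one table."""
    response = requests.get(
        f"{CATALOG_URL}/tables/{table_fqn}",
        params={"include": "columns,owner,lineage,quality,usage"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


picture = full_metadata_picture("warehouse.analytics.orders")
print(picture["owner"], len(picture["lineage"]["upstream"]))
```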

Activating the metadata (for real)

Centralizing all of the company’s metadata in one location has massive potential benefits that are critical to the success of the data platform.

When we talk about activating the metadata, we shouldn’t merely aim for Slack notifications following a schema change or an alert when a test fails. Instead, there’s a much wider range of potential use cases:

  • Automatically detect upstream bottlenecks for business-critical data assets and alert on them: Given that the data catalog has access to all of the metadata, it would be straightforward to monitor the full upstream lineage of business-critical data assets (financial dashboards, for example) and automatically detect issues in terms of data quality, performance, or even possible optimizations (a rough sketch of this use case follows the list).
  • Enrich the whole data stack with different types of metadata: Pulling different types of metadata out of the catalog via its APIs means that we can inject it into other tools within the stack when it makes sense. For example, when there’s a data quality incident for a given table, wouldn’t it be convenient to see an alert directly within the downstream dashboards that rely on that table? The current workflow expects data users to navigate to the data catalog UI to ensure that it’s safe to trust a given chart or data asset, which adds unnecessary friction to such scenarios.
  • Immediately detect and fix unoptimized data assets: For example, the data catalog would be able to detect tables that are frequently filtered on a column other than the one they’re partitioned by, and then repartition those tables on the column that’s used the most for filtering.
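
To make the first use case more tangible, here’s a rough sketch of what upstream-bottleneck detection could look like once the catalog exposes lineage, quality, and freshness metadata through one client. The `catalog` client, its methods, and the metadata fields are all assumptions for illustration:

```python
# Hypothetical sketch: walk the upstream lineage of a business-critical asset
# and flag failed quality checks or stale upstream tables.
from datetime import datetime, timedelta, timezone


def detect_upstream_bottlenecks(catalog, asset_fqn: str, max_staleness_hours: int = 24):
    issues = []
    for upstream_fqn in catalog.get_upstream_lineage(asset_fqn):  # full transitive lineage
        meta = catalog.get_asset(upstream_fqn)

        # Failed quality checks recorded by the observability tool.
        failed = [c for c in meta.get("quality_checks", []) if c["status"] == "failed"]
        if failed:
            issues.append((upstream_fqn, f"{len(failed)} failed quality check(s)"))

        # Staleness, based on the last successful run pushed by the orchestrator
        # (timestamps assumed to be ISO-8601 strings with a timezone offset).
        last_updated = datetime.fromisoformat(meta["last_updated"])
        if datetime.now(timezone.utc) - last_updated > timedelta(hours=max_staleness_hours):
            issues.append((upstream_fqn, f"stale since {last_updated.isoformat()}"))

    return issues
```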

These are just a few examples of what can be achieved by centralizing the metadata and building functionalities on top of it — the realm of untapped possibilities is even more exciting once we consider API-based access to all the metadata types that our data platform has to offer.

A proactive data catalog

Considering that the data catalog has the full data lineage, it should be the foundation for all metadata-based workflows.

As an example, let’s take data contracts, a concept that has gotten a lot of attention recently without a clear approach to implementing it. In an ideal world, data contracts would be supported by the data catalog’s API. With that support in place, we could, for example, implement the following scenario (sketched in code after the numbered steps):

  1. Someone from a product dev team raises a pull request that updates a column in a given table or a schema definition.
  2. A CI step performs an API call to the data catalog.
  3. The data catalog returns the list of downstream tables and columns that rely on this column, their importance (based on business criticality), and their corresponding owners.
  4. All the owners of the impacted downstream tables with an importance score higher than a predefined value (4/5 for example) get added to the PR as reviewers.
  5. Other owners whose impacted downstream tables/columns have lower importance get notified (via Slack, email, or any other tool) about the proposed change.
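
Here’s a rough sketch of what that CI step (steps 2 to 5) could look like in code. The catalog endpoint, the 1-to-5 importance scale, the response fields, and the pull-request and notification helpers are all illustrative assumptions, not a real API:

```python
# Hypothetical sketch of the CI step: ask the catalog for the downstream impact
# of a changed column, then add reviewers or send notifications accordingly.
import requests

CATALOG_URL = "https://catalog.internal.example.com/api/v1"  # placeholder URL
REVIEW_THRESHOLD = 4  # owners of assets scored 4/5 or higher become PR reviewers


def notify(owner: str, column_fqn: str, impacted: dict) -> None:
    """Placeholder notification hook (Slack, email, or any other tool)."""
    print(f"Heads-up for {owner}: a change to {column_fqn} impacts "
          f"{impacted['fully_qualified_name']}")


def enforce_data_contract(changed_column_fqn: str, pull_request) -> None:
    """`pull_request` is whatever object the CI integration exposes for the PR."""
    response = requests.get(
        f"{CATALOG_URL}/lineage/downstream",
        params={"column": changed_column_fqn},
        timeout=10,
    )
    response.raise_for_status()

    for impacted in response.json()["impacted_assets"]:
        if impacted["importance"] >= REVIEW_THRESHOLD:
            pull_request.add_reviewer(impacted["owner"])  # blocking review
        else:
            notify(impacted["owner"], changed_column_fqn, impacted)  # heads-up only
```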

This simple workflow already streamlines how we manage data contracts: instead of managing the contract separately and adding unnecessary friction, enforcing it becomes an automated part of managing the data asset itself. Data consumers are notified of the change before it happens (instead of receiving an alert after the data catalog detects a change) and can act on it before it’s applied.

The above example is merely one scenario in which a proactive data catalog simplifies how we manage data assets and opens the door to new enhancements and automation.

The data catalogs that got it right

There’s already a large number of data catalogs on the market, so it’s quite possible that many options can handle the scenarios discussed in this article. But I’m personally aware of two open-source projects that got it right:

OpenMetadata

Most of the ideas and principles discussed in this article align perfectly with the vision behind OpenMetadata, an open-source project built by the team that revamped Uber’s data culture. OpenMetadata was designed to solve the generic metadata management problem, instead of simply open-sourcing the work that was done at Uber or merely looking at the problem from the Uber lens.

DataHub

Built initially at LinkedIn and open-sourced back in 2020, DataHub was designed with scalability in mind from day one. Its heavy focus on APIs makes it an ideal catalog to enrich with a wide range of components on top — and the team at LinkedIn is doing exactly that.

Conclusion

Although they’re arguably the most exciting component of the Modern Data Stack, today’s data catalogs are for some reason content with the status quo as yet one more UI within the stack — but this needs to change.

The path forward is blurry, messy, and by no means easy, but going through it will be worth the effort. By storing all of the company’s metadata in one place and offering optimized API-based access to it, we open the door to a very wide range of possibilities to solve some of today’s most pressing data problems.
