Addressing the Data Catalog Identity Crisis

Is the data catalog dead? Not by a long shot. But it’s time to change how we think about data catalogs.

Jon Loyens
Towards Data Science



Data catalogs, and in fact the entire ecosystem that has formed around metadata management over the past decade, are in need of a wake-up call. As the chief product officer of one such company, I find that tough to write, but it doesn't make it any less true.

Is the data catalog dead? Not by a long shot. Do you need to change how you think about data catalogs? Most likely, yes.

It’s ironic that one of the most important solutions for driving shared understanding among your data teams is often the most misunderstood. I’m writing, of course, about data catalogs.

These tools, like most data management products developed a decade ago, were meant to handle slowly changing, relational data. Catalogs were deployed in service of data governance and compliance, not discovery and democratization. Needless bureaucracy and process often stymied data analysts who were looking to find and understand data required for analytics.

Data catalogs have evolved quite a bit in recent years, but data work today demands so much more, and many of the legacy offerings in this space haven’t met the moment. Modern data catalogs must be powerful tools for enabling DataOps, they must support distributed architectures, and they must be able to scale as demand for data and knowledge grows within the enterprise.

Recently, Barr Moses of Monte Carlo, a leading provider of data quality and observability tools, argued that data catalogs are having an identity crisis. I liked the article, and not just because it references one of my favorite pre-pandemic activities: drinking in a dive bar. She compares deploying data catalogs to a bartender asking you to mix your own drink. It’s a great analogy when applied to traditional data catalogs, and I definitely prefer my cocktails professionally made.

The essence of this article, and in fact of many of the questions I get from customers, is improving the day-to-day work experience of data engineers: the people who bear the unenviable task of meeting ever-tighter SLAs while working with more complex and constantly changing tools. Ultimately it boils down to a simple question: how should a modern data catalog make data work easier, faster, and more valuable for your organization? Below I address a few areas that Barr references in her article:

  • Automation — Actionable data requires automation that goes beyond simply scanning schemas and counting how often a table is referenced in query logs. At data.world, we rely on knowledge graphs to draw inferences from all the available metadata and automatically alert data engineers and stewards when issues arise that require immediate attention (see the sketch after this list).
  • Scalability — Modern data catalogs must extend beyond structured data models to encompass everything from APIs to streams to processes, and even interrelate measures and business domains. If you impose limits on what gets cataloged, you risk losing potentially critical context for your data. Knowledge graphs and open standards are again the answer — giving you a flexible and extensible metadata model that can truly catalog all your organization’s data assets.
  • Distributed architecture — Data gravity and application silos have rendered the notion of unified data a pipe dream. A modern data catalog must embrace the data mesh paradigm and support both virtualized and federated access to data. Data analysts should be able to explore and understand data based on its meaning, not its resident datastore. Meanwhile, data engineers and stewards should have full lineage, auditability, and metrics on how the data is being used to ensure safety and compliance. We call this last-mile governance, and it clears the final hurdle to true data supply chain efficiency.
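To make the knowledge-graph point concrete, here is a minimal sketch built with Python’s rdflib. The namespace, asset names, and properties are all hypothetical, not data.world’s actual metadata model; the point is simply that one query over the graph can surface assets of any type, in any domain, that need a steward’s attention.

```python
# Hypothetical sketch: catalog metadata as a small knowledge graph, plus one
# SPARQL query that flags assets needing attention. Not a real vendor model.
from rdflib import Graph, Literal, Namespace, RDF

CAT = Namespace("https://example.org/catalog/")  # made-up namespace

g = Graph()
g.bind("cat", CAT)

# Catalog three different kinds of assets: a table, an API, and a stream.
g.add((CAT.orders_table, RDF.type, CAT.Table))
g.add((CAT.orders_table, CAT.inDomain, CAT.Sales))
g.add((CAT.orders_table, CAT.lastQualityCheck, Literal("failed")))

g.add((CAT.billing_api, RDF.type, CAT.API))
g.add((CAT.billing_api, CAT.inDomain, CAT.Finance))
g.add((CAT.billing_api, CAT.lastQualityCheck, Literal("passed")))

g.add((CAT.clickstream, RDF.type, CAT.Stream))
g.add((CAT.clickstream, CAT.inDomain, CAT.Marketing))
g.add((CAT.clickstream, CAT.lastQualityCheck, Literal("failed")))

# "Automation" here means the graph answers, in one query, which assets of any
# type and in any domain currently need a steward's attention.
needs_attention = g.query("""
    PREFIX cat: <https://example.org/catalog/>
    SELECT ?asset ?domain WHERE {
        ?asset cat:lastQualityCheck "failed" ;
               cat:inDomain ?domain .
    }
""")

for asset, domain in needs_attention:
    print(f"Alert the {domain} stewards: {asset} failed its last quality check")
```

Because new asset types and properties are just new triples, the same query keeps working as the model grows, which is the flexibility the scalability bullet is pointing at.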

Your front office for data and analytics

If data warehouses, pipelines, and transformation tools are the data and analytics factory line, then data catalogs are the front office. DataOps applications that generate metadata such as data quality, observability, classification, and lineage produce information that needs to be consumed by data scientists and engineers, the front-office workers. But how do they get this information without digging into ten different applications? Like a great CRM tool does for sales, a data catalog must put this information at their fingertips to enable great data discovery and reproducible analytics.
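As a rough illustration of that front-office idea, the sketch below assembles one consumable catalog record from several metadata producers, so a consumer reads a single entry instead of logging into ten tools. The fetch functions are stand-ins, not any real product’s API.

```python
# Hypothetical sketch: aggregate metadata produced elsewhere (quality checks,
# observability alerts, lineage) into one record. All functions are stand-ins.
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str
    quality_status: str = "unknown"                              # from a quality tool
    freshness_alerts: list[str] = field(default_factory=list)    # from observability
    upstream: list[str] = field(default_factory=list)            # from lineage


def fetch_quality_status(asset: str) -> str:
    return "passed"                          # stand-in for a quality tool's API


def fetch_freshness_alerts(asset: str) -> list[str]:
    return ["orders not updated in 26h"]     # stand-in for an observability feed


def fetch_upstream(asset: str) -> list[str]:
    return ["raw.orders", "raw.customers"]   # stand-in for a lineage service


def build_entry(asset: str, owner: str) -> CatalogEntry:
    """Assemble one consumable record from several metadata producers."""
    return CatalogEntry(
        name=asset,
        owner=owner,
        quality_status=fetch_quality_status(asset),
        freshness_alerts=fetch_freshness_alerts(asset),
        upstream=fetch_upstream(asset),
    )


print(build_entry("analytics.orders", owner="data-eng"))
```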

For data catalogs to overcome their “identity crisis” and truly become that front office — a knowledge operating system if you will — they must have the following qualities:

  • Information Radiator — A one-stop solution that can automatically aggregate mission-critical DataOps metadata and present it to your data and analytics community in a quickly consumable fashion. A knowledge-graph-based metadata model quickly and easily incorporates new information about your data, making this task much easier.
  • Collaboration Hub — A central clearinghouse for actions to be taken and jobs to be done relating to your data assets. By bringing together data producers and consumers and enabling them to take action together, in real time, you can iteratively close the knowledge gaps that stand in the way of reproducible and trustworthy analytics without the chore of manual documentation efforts. Supporting a data mesh architecture, with both virtualized and federated query access as well as last-mile data governance, is critical to truly encouraging collaborative exploration of data, free of the shackles of physical data stores.
  • Open Ecosystem — Are the popular tools in your data architecture right now the same ones your team used a year ago? Three years ago? The data and analytics ecosystem is changing quickly, so it’s critical that a modern data catalog acting as the front office for DataOps is built on open and extensible standards. This ensures you can quickly incorporate new tools into your workflow so new silos don’t develop (see the connector sketch after this list).
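To illustrate the open-ecosystem point, here is a hypothetical sketch of a minimal connector interface; the class and method names are illustrative, not any vendor’s real API. The idea is that adding next year’s tool means writing one small adapter rather than creating a new silo.

```python
# Hypothetical sketch of an open, extensible connector interface for a catalog.
from typing import Protocol


class MetadataConnector(Protocol):
    """What any tool must expose to feed its metadata into the catalog."""
    def list_assets(self) -> list[dict]: ...


class WarehouseConnector:
    def list_assets(self) -> list[dict]:
        return [{"name": "analytics.orders", "type": "table"}]


class StreamConnector:
    def list_assets(self) -> list[dict]:
        return [{"name": "clickstream.events", "type": "stream"}]


def ingest(connectors: list[MetadataConnector]) -> list[dict]:
    """Pull assets from every registered tool into one catalog view."""
    catalog: list[dict] = []
    for connector in connectors:
        catalog.extend(connector.list_assets())
    return catalog


# Adding a new tool to the stack means adding one more connector to this list.
print(ingest([WarehouseConnector(), StreamConnector()]))
```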

Finally, and most importantly, none of this matters if your data people don’t use the catalog. In fact, I believe data catalogs can and should “spark joy” for the entire data and analytics team as well as the stakeholders who rely on their work to make critical business decisions. This is part of our ethos at data.world, where data catalog adoption isn’t just a measure of client satisfaction but a barometer of our own success.

We’ve built our business on teaching clients how to adopt agile data governance and DataOps, much like early purveyors of agile software tools had to, and by helping them achieve early success focused on specific use cases rather than “boiling the ocean.”

As Barr writes in her article, “a data catalog is only useful when it’s designed with a purpose in mind.” At data.world, our purpose is to empower people, teams, and companies to transform complex, inscrutable data and metadata into usable knowledge that propels business and society. It’s a cocktail for success — one you don’t have to mix yourself.
