Selecting the right data catalog for a data mesh

No one tool can be used to enable a data mesh, but a knowledge-graph-based data catalog empowers you to work with data across your organization

Jon Loyens
Towards Data Science

--


In 2019, Thoughtworks consultant Zhamak Dehghani coined the phrase “data mesh,” and the industry quickly viewed it as the next big thing in data architecture. Zhamak’s socio-technical approach marries product thinking with domain-driven data management, enabling organizations to collect, manage, and share data more effectively. Furthermore, it empowers domain experts to own the data they create and make it available to consumers across business lines.

When the idea of data mesh was introduced and its benefits became apparent, many organizations working with large amounts of data and aiming to become more data-driven worked to implement it, some with greater success than others.

No one tool by itself can be used to enable a data mesh; a tight coordination of tools throughout the stack is required. A data catalog is a major component of that and can help with that coordination. But traditional data catalogs oriented around relational schemas and machine learning will struggle in this role.

I’m here to explain why a simple machine learning catalog — while useful — isn’t a silver bullet for creating a data-mesh-driven culture and architecture, and why a better bet is to build out a data mesh environment underpinned by a knowledge-graph-based data catalog.

What constitutes a data mesh?

A data mesh, according to the commonly agreed upon definition, is a socio-technical architecture that establishes how you organize the people, teams, and working groups within your business; how you organize and share the data itself; and how you build the underlying architecture to support independence and agility while still enabling cross-domain leverage for all your data assets.

To be clear, a data mesh is not a tool you can buy. It’s a way of building your data architecture and governance that relies at least as much on organizational and operational structure as it does on the tools in your data stack.

While the data mesh philosophy rests on four key tenets (domain ownership, data as a product, a self-serve data platform, and federated computational governance), if you’re thinking about it simply, from a process-and-tools perspective, your choices break down into two categories:

How you organize your data:

True domain independence and ownership means each domain can define their data as they see fit. Within this framework, data stewards of each domain are responsible for building and maintaining data products in line with the organization’s federated governance standards. The data stewards are also responsible for making their products available to consumers in other domains using a self-service data architecture.

What tools you use to make your data commonly accessible:

Though each domain owner can define their products as they wish, there are aspects of data governance that need to be standardized across the organization to ensure interoperability between domains. This standardization is called federated computational governance.

And in terms of tools and technology, you need an org-wide self-service data platform that provides domain-independent capabilities for data collection, management, distribution, and analysis to the owners of each domain.
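To make this concrete, here is a minimal sketch of what a standardized data product descriptor might look like: something every domain publishes to the shared self-service platform so its products are discoverable and governed consistently. All field names, conventions, and values below are illustrative assumptions, not part of any specific data mesh standard or product.

```python
from dataclasses import dataclass

# Hypothetical descriptor for a domain-owned data product. The fields
# represent the kind of metadata a federated governance standard might
# require every domain to publish; the names here are invented.
@dataclass(frozen=True)
class DataProduct:
    domain: str               # owning domain, e.g. "sales"
    name: str                 # product name within the domain
    owner: str                # accountable data steward
    schema_uri: str           # where consumers can find the product's schema
    sla_freshness_hours: int  # org-wide governance: freshness guarantee
    tags: tuple = ()          # discovery metadata

    def qualified_name(self) -> str:
        # An org-wide naming convention makes products addressable
        # from any other domain, not just their own.
        return f"{self.domain}.{self.name}"

orders = DataProduct(
    domain="sales",
    name="orders_daily",
    owner="sales-data-stewards@example.com",
    schema_uri="https://catalog.example.com/sales/orders_daily",
    sla_freshness_hours=24,
    tags=("orders", "revenue"),
)
print(orders.qualified_name())  # sales.orders_daily
```

The point of the sketch is the split of responsibilities: the domain decides what goes inside the product, while the descriptor's required fields are the same everywhere, which is what makes cross-domain discovery and governance possible.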

Why Machine Learning Alone Isn’t a Silver Bullet

Now that I’ve established the definition of a data mesh, described how it works, and identified the components it needs to succeed, I’ll explain why machine learning data catalogs aren’t enough to implement your data mesh toolchain.

The benefit of machine learning data catalogs is that they aid in discovery, governance, and curation (identifying the data type in a column, suggesting better labels, flagging errors or inconsistencies, and so on) and help you understand what data you have faster and more efficiently.

Helpful functionality, certainly. But despite their usefulness, even the most advanced machine learning data catalogs don’t possess the capabilities needed to enable a data mesh.

Most machine-learning catalog functionality is limited to a single domain, meaning these catalogs can’t operate across the enterprise as a whole, siloing data and data teams completely. This is the antithesis of a data mesh: such a catalog doesn’t help you cross organizational boundaries, treat data as a product, or truly operationalize those data products.

If you’re operating within a single domain with a single, simple data warehouse and a common schema for your entire business, this works just fine. But it isn’t realistic for enterprises at scale, and it can’t help you build an organization-wide self-service architecture. It also limits your business agility and creates either new silos or bottlenecks. Even equipping each domain in your organization with its own machine learning data catalog wouldn’t help; instead, it would further silo your data and reinforce any silos that already exist.

Unfortunately, machine learning data catalogs leave you with a painful choice: do you force every domain to operate in a strict common model and lose agility, or do you equip each domain with its own machine learning data catalog and lose interoperability and federated governance?

How a Knowledge Graph Helps Solve the Problem

There is a third option, and it’s to choose a data catalog built on a knowledge graph.

Knowledge-graph-based data catalogs go far beyond simply relabeling data against a straightforward schema, breaking down the barriers between domains and building a true semantic layer across your entire enterprise. They allow each domain to have its own metadata model, and enable agility across the organization without necessitating massive software redeployments. And, because they’re readily expandable and extensible, they allow you to add new domains and semantics over time.

A knowledge-graph-based data catalog is the perfect tool for enabling a data mesh architecture, as it allows for true federated interoperability. It allows you to query across domains despite differences in underlying architecture, and it lets you curate and treat your data as a product regardless of differences between a domain’s data stack. All of this means you can manage your data across your entire enterprise without having to govern each silo separately.
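As a toy illustration of that cross-domain querying (not a real catalog API), the sketch below merges metadata triples from two hypothetical domains, maps each domain's own predicates onto shared vocabulary terms, and then queries ownership across both at once. All vocabularies and names are invented for illustration.

```python
# Each domain publishes triples using its own vocabulary. These names
# are hypothetical, standing in for domain-specific metadata models.
sales_triples = {
    ("sales:orders", "rdf:type", "catalog:DataProduct"),
    ("sales:orders", "sales:steward", "alice"),
}
marketing_triples = {
    ("mkt:campaigns", "rdf:type", "catalog:DataProduct"),
    ("mkt:campaigns", "mkt:owner", "bob"),
}

# Federated mapping: both domain-specific predicates mean "owned by"
# in the shared semantic layer.
mappings = {"sales:steward": "catalog:ownedBy", "mkt:owner": "catalog:ownedBy"}

# Build one enterprise-wide graph, normalizing predicates as we go.
graph = set()
for s, p, o in sales_triples | marketing_triples:
    graph.add((s, mappings.get(p, p), o))

def query(graph, predicate):
    """Return (subject, object) pairs for a shared predicate, across all domains."""
    return sorted((s, o) for s, p, o in graph if p == predicate)

print(query(graph, "catalog:ownedBy"))
# [('mkt:campaigns', 'bob'), ('sales:orders', 'alice')]
```

A real knowledge-graph catalog would use RDF and SPARQL rather than Python sets, but the principle is the same: each domain keeps its own model, and a thin layer of shared semantics makes one query answerable across all of them.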

Additionally, a knowledge-graph-based catalog’s graph analytics can alert you immediately to upstream data failures, allowing you to start work on a fix right away while also finding an alternative data source that keeps your business running.
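One way to picture that kind of impact analysis is as a traversal of a lineage graph. The sketch below, with invented asset names and a plain breadth-first search rather than any particular catalog's analytics engine, walks downstream from a failed source to find everything affected:

```python
from collections import deque

# Hypothetical lineage graph: edges point from an upstream asset to
# the assets derived from it.
lineage = {
    "raw.orders": ["sales.orders_clean"],
    "sales.orders_clean": ["finance.revenue", "mkt.attribution"],
    "finance.revenue": ["exec.dashboard"],
}

def downstream_impact(failed_asset, lineage):
    """BFS over the lineage graph: every asset affected by an upstream failure."""
    affected, queue = set(), deque([failed_asset])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return sorted(affected)

print(downstream_impact("raw.orders", lineage))
# ['exec.dashboard', 'finance.revenue', 'mkt.attribution', 'sales.orders_clean']
```

Because the catalog already stores relationships as a graph, this kind of traversal is cheap to run on every failure, which is what makes immediate, targeted alerts practical.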

They also allow business users to identify and interact with the true owners of a data product much more quickly and easily, can provide real-time feedback on what your data assets are being used for, and can surface real-time recommendations on which data to use to solve specific problems. All of this means they might even lead to business insights or solutions you hadn’t considered before.

Altogether, a knowledge-graph-based data catalog enables your business — your entire business — to be much more data driven in real time.

Knowledge Graph or Bust(ed Interoperability)

For large organizations and/or organizations working with massive amounts of data, enabling a data mesh architecture across the enterprise allows greater autonomy and flexibility for domain experts and data owners, and provides a self-serve infrastructure for data experts and business users to find, collect, and share data.

A knowledge-graph-based data catalog is the perfect tool for enabling a data mesh, and because it enables every stakeholder to manage and work with data across your organization, it’s a great investment for any team trying to build a truly data-driven business.


Co-founder and Chief Product Officer at data.world, ex-HomeAway-BV-Trilogy, Python and JS nut, Austinite, Canadian, Midgetman, Tennis Player, Geek