Data Contracts: The Mesh Glue
A practical definition and implementation guidelines
Context and motivation
Far from a single, consolidated data platform (a “datalith”), the data mesh’s distributed nature advocates instead for a set of loosely coupled data products that can interact with each other.
In this article we will explore data contracts, artifacts that maintain coherence once we break the big data rock into pieces.
It is not only the data that gets broken up, but also some of the platform-enabling components, such as ingestion frameworks, metadata repositories, or scheduling engines. As Zhamak Dehghani explains in her foundational “Data Mesh” book, the Data Mesh “engineering state of mind” should learn from the knowledge accumulated in the software engineering discipline over the years. For instance, it can draw inspiration from the well-known UNIX design principles:
- Write <programs> that do one thing and do it well.
- Write <programs> to work together.
Replace “programs” with “data products” and you get the Data Mesh data disaggregation philosophy; replace it again with “data engineering components” and you get the engineering way of thinking.
Furthermore, as we saw in “Dissecting the Data Mesh Technical Platform: Exposing Polyglot Data,” not only do we have to design very modular components to handle granular tasks, but we also have to implement different variations of the same modules to adapt to different user needs and expectations. For example, different data products might use different ingestion frameworks or data transformation technologies, and THAT’S PERFECTLY FINE and, to some extent, even desirable.
The tradeoff of this approach is evident: rising maintenance costs.
I believe that one of the key unanswered questions in the Data Mesh paradigm is how to decide when to promote a central platform component and when to give the business domains the freedom to implement their own pieces. Some pieces are clearly global by nature (think, for instance, of a data product search component), but there are gray areas, like metadata repositories or data quality engines, where the decision is not that straightforward.
Back to data: by fragmenting the data products and embedding their implementation details inside the data domains, thus transferring the responsibility to the business domains, we satisfy the first UNIX principle (do one thing and, hopefully, do it well).
But how do we ensure that a conglomerate of data products can work together seamlessly? No need to reinvent the wheel here: clearly defined APIs and expectations, in the form of contracts, are the solution.
Now let's try to understand the concept of data contracts and then dive deep into the multimodal technical implementation required by different user needs. We will use a few open source components that I think fit perfectly with the Data Mesh philosophy.
So, what is a data contract?
With the ultimate goal of building trust in “someone else’s” data products, data contracts are artifacts that sit at the intersection of (a) a business glossary providing rich semantics, (b) a metadata catalog providing information about the structure, and (c) a data quality repository setting expectations about the content across different dimensions.
Unable to come up with a canonical definition, I’d rather attempt to “duck type” a data contract (i.e., describe its properties).
So a data contract …
Is meant to ease and promote data sharing: The data contract is, in a sense, the external and observable view of the data product. It should be designed to “seduce 🥰” potential data consumers by clearly communicating the underlying business semantics of the data product.
A data contract is not a bunch of technical metadata of disjoint tables.
Taking object-oriented programming as a proxy, I envision the data contract as a class’s interface (i.e., its list of public methods) rather than a list of private class properties. As an example, instead of exposing tables like `full_customer_list` and `historical_orders`, we should expose an interface like `top_january_2022_customers_by_clv_emea`. I believe that this property aligns well with the “valuable on its own” principle of data products.
Naturally, the contract needs to be meaningful not only from a business perspective but also technically, providing rich metadata about its base structure (table, event, graph, ...), data schema, and supported consumption formats.
Guarantees consumption stability: Data products are far from static, so one of the key use cases of data contracts is to provide backward compatibility via interface versioning. As we do with programming APIs, data contracts are versioned, and it is the responsibility of the data product owners to maintain and support older versions of data products.
Sets expectations: The data contract communicates the results of executing global and local policies over the data product, displaying SLO values for KPIs such as data downtime or the percentage of NULL fields, to name just a few.
Is consumable and enrichable: The contract should be consumable by downstream processes; it can act as an input for software processes like data transformation pipelines.
Finally, it is the responsibility of the data product owners to bundle and maintain the contracts inside the data product.
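To make the versioning property above a bit more concrete, here is a minimal sketch of a backward-compatibility check between two contract versions. The rule it encodes (a new version may add columns but must keep every existing column with the same type) and the names `is_backward_compatible`, `v1`, `v2`, and `v3` are assumptions made for illustration, not a standard.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Return True if every column of old_schema survives, with the same type, in new_schema.

    Schemas are plain {column_name: type_name} dicts (a hypothetical representation).
    """
    return all(new_schema.get(col) == dtype for col, dtype in old_schema.items())


# Hypothetical schema versions of the same data product
v1 = {"customer_id": "string", "clv": "double"}
v2 = {"customer_id": "string", "clv": "double", "region": "string"}  # adds a column
v3 = {"customer_id": "string", "clv": "decimal"}                     # changes a type

print("v1 -> v2 compatible:", is_backward_compatible(v1, v2))
print("v1 -> v3 compatible:", is_backward_compatible(v1, v3))
```

Under this assumed rule, a change like `v3` would call for a new major version of the contract while the older one keeps being supported.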
Now, from a technical perspective, data contracts are, at the end of the day, table metadata that needs to be managed. The implementation can range from something as simple as an MS Excel file in a shared repository all the way to a NoSQL database (document store), with one big trend being to express data contracts as YAML/JSON files versioned in the data product’s source repository.
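As an illustration of the file-based approach, the sketch below shows what a JSON contract could look like and how a downstream process might load it. The layout and every field name (`data_product`, `version`, `schema`, `expectations`, `check`, `value`) are hypothetical; no standard contract format is implied.

```python
import json

# A hypothetical contract document, as it might be versioned next to the data product code
CONTRACT_JSON = """
{
  "data_product": "top_january_2022_customers_by_clv_emea",
  "version": "1.2.0",
  "structure": "table",
  "schema": [
    {"name": "customer_id", "type": "string"},
    {"name": "clv", "type": "double"}
  ],
  "expectations": [
    {"column": "customer_id", "check": "exists"},
    {"column": "clv", "check": "max_null_fraction", "value": 0.10}
  ]
}
"""

# Parse the contract so other processes (pipelines, validators) can consume it
contract = json.loads(CONTRACT_JSON)
columns = [field["name"] for field in contract["schema"]]

print(contract["data_product"], contract["version"])
print("columns:", columns)
```

Keeping the contract in a plain, parseable format is what lets it act as input for transformation pipelines and validation jobs alike.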
The need for different contract validation execution strategies
One key aspect of the data contract lifecycle is the actual contract validation process. So far, we have described data contracts as declarative objects that offer trustworthy information about the data products they describe.
But at some point, we need to “fill the contract” and ensure that our data assets meet the contract expectations. That means evaluating the contract against the data it describes and ensuring that the results do not break the expectations. For instance, if the contract states that at most 10% of the values in a certain column can be NULL, we need to actually count the NULL values in that column every time our data product receives new data or is modified. The results should be stored in either local or global contract information repositories.
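The NULL example above can be sketched in a few lines of plain Python. This is a toy, in-memory version under stated assumptions: rows are dictionaries, NULL is represented by `None`, and the function names and the shape of the result record are hypothetical.

```python
def null_fraction(rows, column):
    """Fraction of rows whose value for `column` is None (our stand-in for NULL)."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)


def check_max_null_fraction(rows, column, max_fraction):
    """Evaluate one contract expectation and return a result record that could be stored."""
    observed = null_fraction(rows, column)
    return {
        "column": column,
        "observed_null_fraction": observed,
        "max_allowed": max_fraction,
        "passed": observed <= max_fraction,
    }


# Toy data: every fourth row has a NULL `clv` (5 out of 20 rows)
rows = [
    {"customer_id": f"c{i}", "clv": None if i % 4 == 0 else 100.0 + i}
    for i in range(20)
]

result = check_max_null_fraction(rows, "clv", max_fraction=0.10)
print(result)
```

In a real data product, the same check would run each time the data changes, and the returned record is the kind of result that would go to a contract information repository.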
So how can the platform validate a contract?
As we saw in the introduction, the beauty of Data Mesh is that it acknowledges that different personas/journeys will have different requirements when it comes to, in this case, contract evaluation. We should therefore let users choose DIFFERENT implementations according to their specific needs. To illustrate this and hopefully inspire the reader, we will focus on the implementation of two opposite sets of requirements. In real life there will be many shades of gray between these somewhat “extreme” scenarios.
- Scenario #1: Automated transformation pipeline. This is perhaps the most classical one, where we load a large table, let’s say on a daily basis, and need to ensure that the new table state conforms to the data contract. In such a scenario, the requirements might include the ability to process large data sets with high throughput in an automated fashion. With that in mind, and with the goal of enabling developers in the business domains to automate the contract validation, we can design a software component like the one below.
The idea is to agree on a contract YAML format that can be automatically fed into the Great Expectations + Spark combo to perform the validations at scale. Great Expectations is an amazing tool for executing data expectations; it is based on defining assertions about your data. Those assertions are expressed in a declarative language in the form of simple, human-readable Python methods, so it is straightforward to generate the expectations after parsing a simple YAML file with the contract.
The following code snippet is a Spark job that performs column validation using this approach (using the `expect_column_to_exist` assertion):
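As a dependency-free illustration of the parsing-to-expectations step (not the Spark job itself), the sketch below translates contract entries into calls named after two Great Expectations assertions, `expect_column_to_exist` and `expect_column_values_to_not_be_null` with its `mostly` parameter. The contract layout, `contract_to_expectations`, `validate`, and the `InMemoryValidator` class are hypothetical; the latter is a toy stand-in so the example runs without Spark or Great Expectations installed.

```python
def contract_to_expectations(contract):
    """Translate contract checks into (method name, kwargs) pairs."""
    calls = []
    for check in contract["expectations"]:
        kind, column = check["check"], check["column"]
        if kind == "exists":
            calls.append(("expect_column_to_exist", {"column": column}))
        elif kind == "max_null_fraction":
            # "at most X NULLs" becomes "at least 1 - X non-NULL values"
            calls.append(("expect_column_values_to_not_be_null",
                          {"column": column, "mostly": 1 - check["value"]}))
        else:
            raise ValueError(f"Unsupported check: {kind}")
    return calls


class InMemoryValidator:
    """Toy stand-in (assumption) exposing the two assertion methods over a list of dicts."""

    def __init__(self, rows):
        self.rows = rows

    def expect_column_to_exist(self, column):
        return {"success": all(column in row for row in self.rows)}

    def expect_column_values_to_not_be_null(self, column, mostly=1.0):
        non_null = sum(1 for row in self.rows if row.get(column) is not None)
        ratio = non_null / len(self.rows) if self.rows else 1.0
        return {"success": ratio >= mostly}


def validate(validator, contract):
    """Dispatch every generated expectation to the validator by method name."""
    results = []
    for method_name, kwargs in contract_to_expectations(contract):
        outcome = getattr(validator, method_name)(**kwargs)
        results.append((method_name, kwargs["column"], outcome["success"]))
    return results


contract = {"expectations": [
    {"column": "customer_id", "check": "exists"},
    {"column": "clv", "check": "max_null_fraction", "value": 0.10},
    {"column": "region", "check": "exists"},  # not present in the toy data
]}
rows = [{"customer_id": f"c{i}", "clv": None if i == 0 else 100.0 + i} for i in range(20)]

for result in validate(InMemoryValidator(rows), contract):
    print(result)
```

Because the dispatch goes through method names, the same generated list could in principle be applied to an object wrapping a Spark DataFrame, provided it exposes assertions with these names and parameters.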
- Scenario #2: Interactive development. In this second scenario, more aligned with the kind of work done by the Data Scientist persona, the data product is generated interactively via IDEs like Jupyter Notebooks. As the development tends to be iterative, the requirement is to quickly evaluate the contract over and over without spinning up large clusters or submitting batch jobs. One specific consideration of this scenario is that data tends to fit in memory. With all of that in mind, a component like the following will come in handy:
The idea is to embed every component locally. Thanks to technologies like Apache Arrow and DuckDB, we can efficiently query analytical data in memory with an in-process OLAP database. Special shout-out to duckDQ, a fantastic Python library that provides a fluent API for the definition and execution of data checks, following the estimator/transformer paradigm of scikit-learn, on structures like pandas DataFrames or Arrow tables.
The following code snippet illustrates this process:
NOTE: At the time of writing this article, the Python interface to Iceberg tables (PyIceberg) is at a very early stage of development, so in the code snippet we are directly loading the underlying Parquet files.
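To give a feel for the interactive loop with an in-process engine, here is a dependency-free sketch. It uses Python’s built-in `sqlite3` module purely as a stand-in so that it runs anywhere; it is not DuckDB, it does not read Parquet, and it does not show any of duckDQ’s API. The table, the data, and the `null_fraction` helper are hypothetical.

```python
import sqlite3

# In-process, in-memory database (stand-in for an in-process OLAP engine)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, clv REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("c1", 120.0), ("c2", None), ("c3", 87.5), ("c4", 42.0)],
)


def null_fraction(conn, table, column):
    """Fraction of NULL values in a column, computed by the engine itself.

    Identifiers are interpolated into the SQL text, so only pass trusted
    table/column names (for example, names taken from the contract).
    """
    query = f"SELECT AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) FROM {table}"
    (fraction,) = conn.execute(query).fetchone()
    return fraction or 0.0  # AVG over an empty table yields NULL


# Re-run this cell after each change to the data product
observed = null_fraction(conn, "customers", "clv")
passed = observed <= 0.10
print(f"clv NULL fraction: {observed:.2f}, within contract: {passed}")
```

Since everything lives in the notebook process, the check can be repeated after every iteration without submitting a batch job.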
Conclusion
In this article we have explored the “data contract” concept, a key artifact to ensure that information spread across different data products can be shared and reused. Trust is the glue that ties together different data assets under the Data Mesh paradigm, and data contracts are essential to overcome the fear of consuming data assets that do not come from a central entity.
We have also analyzed a couple of technical implementations, using open source components, of one fundamental process in the data contract lifecycle: contract evaluation.