The world’s leading publication for data science, AI, and ML professionals.

Scaling Data Products Delivery Using Domain-Oriented Data Pipelines

A tested approach for the rapid delivery of data products at scale.

Notes from Industry

Introduction

The recent reimagining of "data as a product", a slight deviation from the well-known mantra of "data as a strategic asset", has necessitated the revamp of modern data pipeline architectures to support the rapid delivery of data solutions at scale. This proposition holds irrespective of the underlying data architecture upon which the target enterprise is defined.

The concept of "data as a product" is gaining momentum amongst organizations that have huge amounts of data at their disposal, typically giant retailers, social media platforms and financial services firms, to mention a few. While most of these organizations desire a transition to this new paradigm and way of working, many of them cannot, simply because their existing data pipeline architectures are monolithic in nature.

Scaling the delivery of data products in modern data ecosystems therefore necessitates a complete overhaul of existing monolithic data pipelines. This should be preceded by a thorough understanding of the various interactions (also known as interface contracts) that exist between the domains that contribute to the evolution of the data product. Empowering these domains with dedicated pipelines is key to successful data product delivery. Further reading on the construct of "data domain ownership", as well as the related principles governing data as a product, can be found in Dehghani’s recent publication.

What is a domain-oriented data pipeline?

As stated above, the philosophy underpinning data as a product enforces domain ownership on the components of the data product. This includes the data pipelines that orchestrate the build, test and deployment of the domain components. It is important to mention that while these pipelines may be independent in their own right, they are not completely isolated. As such, a data pipeline architecture that enables the respective domains in a data product team to independently build, test and deploy re-usable components of the data product, using a minimum set of shared baseline artefacts, is known as a domain-oriented data pipeline. This definition assumes an architecture that implicitly incorporates best practices for modern data pipeline designs, e.g. security, scalability, observability etc.
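To make the definition concrete, here is a minimal sketch in Python. All names (the artefact fields, domain name, image tag) are hypothetical; the point is the shape of the contract: the core product team publishes a versioned baseline, and each domain runs its own build, test and deploy stages against it.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BaselineArtefacts:
    """Shared components published by the core product team."""
    base_image: str           # e.g. a versioned container image tag
    shared_libs_version: str  # common libraries pinned for all domains

@dataclass
class DomainPipeline:
    """An independent build/test/deploy pipeline for one domain."""
    domain: str
    baseline: BaselineArtefacts
    stages: list = field(default_factory=lambda: ["build", "test", "deploy"])

    def run(self) -> list:
        # Each domain runs the same stages independently,
        # but always against the shared baseline.
        return [f"{self.domain}:{stage}@{self.baseline.base_image}"
                for stage in self.stages]

baseline = BaselineArtefacts(base_image="product-base:1.4.0",
                             shared_libs_version="2.1")
sales = DomainPipeline(domain="sales", baseline=baseline)
print(sales.run())
```

The baseline is deliberately frozen: domains consume it but do not mutate it, which is what keeps the pipelines independent without being isolated.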

High-level architecture of a domain-oriented data pipeline

The high-level architecture depicted above captures the essential components of a domain-oriented data pipeline. The key principles outlined below are very important to help maximize the benefits delivered by a domain-oriented data pipeline:

  1. Create the right team topology: As seen in the diagram above, the pipeline design is predicated on the domain teams responsible for building out various components of the data product (e.g. Domain Team 1 … Domain Team N). Also, since re-use is a key feature of a domain-oriented data pipeline, the team topology should support a core product team responsible for embedding common standards that can be leveraged by the domain teams as they build out their respective data workloads. For example, the core product team is responsible for baking and publishing known versions of base images of the data product for consumption by the domains in the ecosystem.
  2. Minimize duplication and maximize re-use via shared baseline components: This comes on the back of the understanding of the relationships between the various domain components. A domain-oriented data pipeline architecture should encourage re-use and minimize duplication across the pipeline ecosystem as much as possible. While this can be a balancing act, the overall benefits far outweigh the initial efforts with the domain components relationship mapping exercise.
  3. Define clear domain regions (including asset nomenclatures): While the intention is for each domain pipeline to exist independently (even though not completely isolated), proper attention has to be given to how separation of concerns is achieved for the domains with respect to the shared components. This is done by creating regions of interaction for each domain. Furthermore, the regions must be identifiable by clear nomenclatures (i.e. naming standards), especially on shared components such as storage, pipeline build artefacts, code repositories etc. This clarity eases the evolution of data products and their consumption by end-users.
  4. Define clear domain boundaries (including ownership and accountability): A lack of boundaries, ownership and accountability is always a recipe for disaster in any pipeline architecture, and the problem can be further exacerbated in a domain-oriented pipeline architecture. If clearly defined, however, a domain-oriented data pipeline architecture can deliver a federated governance model over the data ecosystem. That is, the domain teams as well as the core product team have the flexibility to define structures that create common standards and help to eliminate potential friction in the overall team topology.
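The third principle, clear nomenclatures for domain regions, can be illustrated with a small sketch. The naming pattern, segment names and asset types below are illustrative assumptions, not a prescribed standard:

```python
import re

# Hypothetical naming standard for shared assets:
#   <product>-<domain>-<asset>-<env>, e.g. "retail-sales-storage-prod".
ASSET_NAME = re.compile(
    r"^(?P<product>[a-z0-9]+)-(?P<domain>[a-z0-9]+)-"
    r"(?P<asset>storage|repo|artifact)-(?P<env>dev|test|prod)$"
)

def parse_asset_name(name: str) -> dict:
    """Validate an asset name against the naming standard and return
    its segments, so ownership can be resolved per domain region."""
    match = ASSET_NAME.match(name)
    if match is None:
        raise ValueError(f"asset name {name!r} violates the naming standard")
    return match.groupdict()

print(parse_asset_name("retail-sales-storage-prod"))
```

Enforcing a check like this in the shared pipeline stages makes the domain regions self-describing: any shared asset can be traced back to its owning domain from its name alone.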

Benefits of domain-oriented data pipelines

There is no doubt that scaling data products with traditional monolithic data pipelines can be very challenging. Domain-oriented data pipelines, on the other hand, tend to simplify the delivery approach by adding the following benefits, among many others:

  1. They are very easy to scale, from both a technology and a business standpoint. Imagine the convenience of adding a new pipeline to support a brand new business domain from a pre-packaged domain pipeline template.
  2. The independence of the domain pipelines largely helps to eliminate a single point of failure in the entire data product delivery process.
  3. Domain-oriented data pipelines make it easier to detect performance bottlenecks in the data product delivery lifecycle since metrics can be measured and collected at the level of each domain.
  4. They deliver a federated data governance model which gives control to the actors in the team topology to own and be accountable for their respective pipeline components.
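The first benefit above, onboarding a new business domain from a pre-packaged pipeline template, can be sketched as follows. The definition layout and field names are illustrative assumptions, not a real pipeline format:

```python
from string import Template

# Hypothetical pre-packaged domain pipeline template: a brand new
# domain gets its own pipeline by filling in a few placeholders.
PIPELINE_TEMPLATE = Template("""\
pipeline: ${domain}-pipeline
base_image: ${base_image}
stages:
  - build
  - test
  - deploy
metrics_namespace: ${domain}
""")

def scaffold_domain_pipeline(domain: str, base_image: str) -> str:
    """Render a pipeline definition for a new business domain."""
    return PIPELINE_TEMPLATE.substitute(domain=domain, base_image=base_image)

print(scaffold_domain_pipeline("marketing", "product-base:1.4.0"))
```

Note that the template also reserves a per-domain metrics namespace, which is what makes the third benefit (domain-level performance measurement) possible from day one.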

Conclusion

A data product, if developed correctly, can be elegant at first sight. However, a "true data product" is defined not just by its elegance but also by its ease of maintenance and its ability to scale to support the changing needs of the enterprise. This is where domain-oriented data pipelines come in.

The principles described herein stem from my recent experience leading a team of data pipeline engineers to architect and implement a domain-oriented data pipeline for an enterprise. While some of them may not be directly applicable to your data ecosystem, the bulk of them holds true for most modern data ecosystems, albeit with a few tweaks. For example, even though only one core product team has been depicted in the reference architecture, in reality this team can be scaled out with dedicated pipelines, depending on the complexity of the data product.

Finally, it is important to mention that data, at the right grain, must be made available to the actors in a domain-oriented pipeline architecture to support the build and test of the various data components before they are published to the shared zones for consumption in the ecosystem.
