Smashing silos with Domain Data Stores

Piethein Strengholt
Towards Data Science
Mar 17, 2021


In my other posts you learned how ABN AMRO makes data available in a data mesh-style architecture. In this blog post you will learn how to break big data monoliths apart.

Data-driven decision-making shift

In the years since data warehouses became a commodity, much has changed. Distributed systems have gained great popularity, data is larger and more diverse, new database designs have popped up, and the advent of the cloud has separated compute and storage for increased scalability and elasticity. Combine these trends with the shift from centralized to domain-oriented data ownership, and you will immediately understand why the way data-intensive applications are designed has to change.

In our data architecture we made a clear split between direct data consumption and the creation of new data. In the data distribution architecture, as you can read here and here, we have positioned Read Data Stores (RDSs) to capture larger volumes of immutable data and serve them out repeatedly to consumers. In this pattern, data is read but no new data is created. Consuming applications or users use the RDSs directly as their data sources and might perform some lightweight integration based on mappings between similar data elements. The big benefit of this model is that it does not require data engineering teams to create and maintain new data models. You don’t extract, transform, and load data into a new database. Transformations happen on the fly, but the results don’t need a permanent new home. This approach is particularly useful for data exploration, lightweight reporting, and simple analytical models that don’t require complex data transformation.
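To make this concrete, here is a minimal sketch of direct RDS consumption, assuming two Read Data Stores expose extracts as Parquet files; the paths, column names, and the use of pandas are illustrative assumptions rather than a prescribed implementation.

```python
# Hypothetical example: join two RDS extracts on the fly; nothing is persisted.
import pandas as pd

customers = pd.read_parquet("rds_customer/customers.parquet")  # assumed RDS extract
payments = pd.read_parquet("rds_payments/payments.parquet")    # assumed RDS extract

# Lightweight integration: map similar data elements and join in memory.
payments = payments.rename(columns={"cust_id": "customer_id"})
report = (
    customers.merge(payments, on="customer_id", how="inner")
             .groupby("segment")["amount"].sum()
             .reset_index()
)
print(report)  # used directly for exploration or a report; no new data store is created
```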

The problem, however, is that the consumers’ needs can exceed what RDSs offer. In some cases, there is a clear need for new data creation: for example, complex business logic followed by analytical models that generate new business insights. To preserve these insights for later analysis, you need to retain this information somewhere, for example in a database. Another situation can be that the amount of data that needs to be processed exceeds what the RDS platform can handle. In such a case, the data processing, for example of historical data, is so intense that you would be justified in incrementally bringing data over to a new location, processing it, and pre-optimizing it for later consumption. One more situation would be when multiple RDSs need to be combined and harmonized. This typically requires orchestrating many tasks and bringing data together; making users wait until all of these tasks are finished would negatively affect the user experience. These implications bring us to the second pattern of data consumption: creating Domain Data Stores (DDSs).
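As a hedged illustration of the second scenario, the sketch below incrementally brings changed records over to a new location using a watermark column; the table, column names, and the SQLite stand-ins are assumptions made for the sake of the example.

```python
# Hypothetical incremental load: copy only rows changed since the last run.
import sqlite3

source = sqlite3.connect("rds_source.db")       # stand-in for the source RDS
dds = sqlite3.connect("domain_data_store.db")   # stand-in for the new DDS location

dds.execute("""CREATE TABLE IF NOT EXISTS orders (
    order_id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)""")

# Watermark: the highest timestamp already present in the DDS.
watermark = dds.execute(
    "SELECT COALESCE(MAX(last_modified), '1970-01-01') FROM orders").fetchone()[0]

delta = source.execute(
    "SELECT order_id, amount, last_modified FROM orders WHERE last_modified > ?",
    (watermark,)).fetchall()

# Upsert the delta so the DDS stays pre-optimized for later consumption.
dds.executemany(
    "INSERT OR REPLACE INTO orders (order_id, amount, last_modified) VALUES (?, ?, ?)",
    delta)
dds.commit()
```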

Domain Data Stores

We want to manage newly created data more carefully, while at the same time increasing agility. This is what DDSs are positioned for. This type of application has the role of intensively processing data, storing the newly created data, and facilitating the consumer’s use case. To unlock the value at large, we have designed a new architecture, which includes a platform for data engineering teams. Let’s look inside and evaluate the characteristics.

Figure 1 by Piethein Strengholt: DDS Reference Architecture

What we envision is an ecosystem that allows rapid delivery of new data-driven decision-making use cases. It facilitates data engineering and intensive processing at large, while we stay in control and avoid a proliferation of technologies. We foresee a shift from generic (enterprise) data integration towards business-specific data creation; a shift from integration specialists to community building and seamless collaboration; and a shift from rigid data models toward more flexible or “schema-light” approaches.

At a high level, the ecosystem looks like Figure 1 above: a fully managed platform that allows fast data ingestion, transformation, and usage. At the bottom, you see managed infrastructure, whose main goal is to hide complexity from the data engineering teams. There are reusable functions to support data engineering teams in a self-service manner. These include reusable and managed database technologies, central monitoring and logging, lineage, identity and access management, orchestration, CI/CD, data and schema versioning, patterns for batch-, API- and event-based ingestion, integration with business intelligence and advanced analytics capabilities, and so on. The underpinning platform is managed using a Team Topologies approach: a central platform team manages the underlying platform while supporting all other teams. The main purpose is to simplify all services and to govern and secure the platform, thereby reducing the overhead for data engineering teams.
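To give a feel for the self-service model, here is a purely illustrative sketch of what a DDS request to the central platform team could look like; every field name and service in it is a hypothetical example, not the actual platform API.

```python
# Hypothetical, declarative DDS request handled by the central platform team.
dds_request = {
    "domain": "customer-journeys",
    "dds_name": "customer-insights",
    "stores": [
        {"type": "relational", "purpose": "reporting"},
        {"type": "object-storage", "purpose": "raw-and-historical-zones"},
    ],
    "ingestion_patterns": ["batch", "event"],  # reusable patterns offered by the platform
    "platform_services": ["monitoring", "lineage", "ci-cd", "iam", "orchestration"],
    "owner_team": "insights-data-engineering",
}

# The platform (or automation on its behalf) validates requests against the
# common, reusable building blocks before anything is provisioned.
supported_stores = {"relational", "object-storage", "key-value", "document"}
assert all(store["type"] in supported_stores for store in dds_request["stores"])
```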

On top, you see DDSs in which data is managed by the data engineering teams. These domain teams focus on either data products, customer journeys or business use cases. The boundaries around DDSs also determine data responsibilities. These include data quality, ownership, integration and distribution, metadata registration, modeling, and security. I’ll come back to the granularity and domain boundaries later.

For the functional requirements, we ensure business objectives and goals are well-defined, detailed, and complete. Understanding them is the foundation for your solution and requires you to clarify what business problems need to be solved, what data sources are required, what solutions need to be operational, what data processing must be performed in real time or offline, what the integrity requirements are, and what outcome is subject to reuse by other domains.

For the non-functional requirements, we made choices about how many and what type of data store technologies are offered. You can think of a common set of reusable database technologies or data stores and patterns that ensure the strengths of each data store are leveraged. For example, mission-critical and transactional applications might only be allowed to go with strong consistency models, or business intelligence and reporting might only be allowed with stores that provide fast SQL access.
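A small sketch of how such constraints could be encoded, with workload categories and store types that are illustrative assumptions rather than our actual catalog:

```python
# Hypothetical mapping of workload characteristics to the allowed store types.
ALLOWED_STORES = {
    "mission-critical-transactional": ["relational-strong-consistency"],
    "business-intelligence": ["columnar-sql-warehouse"],
    "low-latency-lookups": ["key-value"],
    "semi-structured-exploration": ["document", "object-storage"],
}

def pick_store(workload):
    """Return the store types a team is allowed to use for a given workload."""
    if workload not in ALLOWED_STORES:
        raise ValueError(f"Workload '{workload}' is not covered by the standard patterns")
    return ALLOWED_STORES[workload]

print(pick_store("business-intelligence"))  # ['columnar-sql-warehouse']
```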

Different data stores manage and organize their data internally in different ways. One common way of organizing is to separate (either logically or physically) the concerns of ingesting, cleansing, curating, harmonizing, serving, and so on. Within our domain data stores, we encourage using various zones with different storage techniques, such as folders, buckets, databases, and the like. Zones also allow us to combine purposes, so a store can be used to facilitate operations and analytics at the same time. For all stores and zones, the scope must be very clear.

Figure 2 by Piethein Strengholt: Layered DDS approach
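A minimal sketch of the layered approach from Figure 2, assuming object storage organized as folders; the zone names, paths, and the cleansing step are illustrative.

```python
# Hypothetical zones inside one DDS, implemented as folders on (object) storage.
from pathlib import Path
import pandas as pd

ZONES = ["ingested", "cleansed", "curated", "serving"]
root = Path("dds_customer_insights")
for zone in ZONES:
    (root / zone).mkdir(parents=True, exist_ok=True)

# Example hop: cleanse raw data and promote it to the next zone.
raw = pd.read_csv(root / "ingested" / "customers.csv")  # assumed raw ingest
cleansed = raw.dropna(subset=["customer_id"]).drop_duplicates("customer_id")
cleansed.to_parquet(root / "cleansed" / "customers.parquet", index=False)
```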

For the data models, we encourage teams to shift away from rigid data models toward more “schema-light” approaches. However, any architectural style is allowed. If teams embrace schema-on-read or prefer directly building up simple dimensional models, we encourage them to do so. Kimball or Data Vault modeling can also be applied. It all depends on the needs and size of the use case, which brings me to the next subject.
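To illustrate the difference, here is a hedged sketch of both styles on the same (hypothetical) event data: schema-on-read first, then a simple Kimball-style split into a dimension and a fact table. All file and column names are assumptions.

```python
# Schema-on-read: load semi-structured data as-is and interpret it at query time.
import json
import pandas as pd

with open("events.json") as f:                # assumed event dump from a source system
    events = pd.json_normalize(json.load(f))  # structure is inferred on read

# Simple dimensional model: split the same data into a conformed dimension
# and a fact table (illustrative column names).
dim_customer = events[["customer_id", "customer_name"]].drop_duplicates()
fact_orders = events[["order_id", "customer_id", "amount", "order_date"]]
```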

Domain Data Store Granularity

When we transition away from our enterprise data warehouses into more fine-grained DDS designs, we need to consider the granularity and logically segment our data. Determining the scope, size, and placement of logical DDS boundaries is difficult and causes challenges when distributing data between domains. Typically, the boundaries are subject-oriented and aligned with business capabilities. When defining the logical boundaries of a domain, there is value in decomposing it into subdomains for ease of data modelling activities and internal data distribution within the domain.

The important task is to think carefully about the logical role of your DDS. This covers both the business granularity and the technical granularity:

  • The business granularity starts with a top-down decomposition of the business concerns: the analysis of the highest-level functional context, scope (i.e., the bounded context), and activities. These must be divided into smaller ‘areas’, use cases, and business objectives. This exercise requires good business knowledge and expertise on how to efficiently divide business processes, domains, functions, and so on. The best practice is to use business capabilities as a reference model, and to study common terminology (ubiquitous language) and overlapping data requirements.
  • The technical granularity is performed with specific goals in mind, such as reusability, flexibility (easy adaptation to frequent functional changes), performance, security, and scalability. The key point is making the right trade-offs. A business domain might use the same data, but if the technical requirements conflict with each other, it might be better to separate the concerns. For example, if one specific business task needs to intensively aggregate data and another one only quickly selects individual records, it can be better to separate the concerns. The same might apply to flexibility: one use case might require daily changes, while the other must remain stable for at least a quarter. Again, you should consider separating the concerns. Therefore, we decomposed DDSs in such a way that multiple instances are allowed within a DDS boundary, as sketched after this list.
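The sketch below illustrates the technical-granularity trade-off with two instances inside one DDS boundary: one shape of the data tuned for intensive aggregation, another tuned for quick selection of individual records. The data and structures are hypothetical.

```python
# Hypothetical example: the same domain data kept in two instances within one DDS boundary.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": ["a", "a", "b"],
    "amount": [10.0, 25.0, 40.0],
})

# Instance 1: column-oriented frame for intensive aggregation.
totals_per_customer = orders.groupby("customer_id")["amount"].sum()

# Instance 2: key-value index for quickly selecting individual records.
order_index = {row.order_id: row._asdict() for row in orders.itertuples(index=False)}
print(order_index[2])  # fast point lookup without scanning the whole dataset
```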

The story of organizing data internally can become more complex when a domain is larger and composed of several subdomains. The DDS in this view is more abstract: instances and zones can be shared between multiple subdomains, while other zones remain exclusive to one subdomain. Let me try to make this concrete with an example. For a large domain, you could plot a boundary around all the various zones of one DDS. Within this DDS, for example, the first two zones can be shared between multiple subdomains, so cleaning, correcting, and building up historical data is commonly performed for all subdomains. For the transformation, the story becomes more complex because data is required to be specific for a subdomain or use case. So, there can be pipelines that are shared and pipelines that are solely specific to one use case. This entire chain of data, including all of the pipelines, belongs together and thus can be seen as one giant DDS implementation. Inside this giant DDS implementation, as you just learned, you see different boundaries: boundaries that are generic for all subdomains and boundaries that are specific.
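A hedged sketch of how pipelines inside such a large DDS could be organized, with some steps shared by all subdomains and others specific to one use case; the functions and subdomain names are purely illustrative.

```python
# Hypothetical pipeline organization inside one large DDS implementation.
def clean_and_historize(data):
    """Shared step: performed for all subdomains."""
    return [row for row in data if row.get("customer_id") is not None]

def marketing_features(data):
    """Specific step: only relevant for the marketing subdomain."""
    return [{**row, "segment": "retail" if row["amount"] < 100 else "wholesale"}
            for row in data]

PIPELINES = {
    "shared": [clean_and_historize],
    "marketing": [clean_and_historize, marketing_features],  # reuses the shared step
}

raw = [{"customer_id": "a", "amount": 50}, {"customer_id": None, "amount": 10}]
result = raw
for step in PIPELINES["marketing"]:
    result = step(result)
print(result)  # [{'customer_id': 'a', 'amount': 50, 'segment': 'retail'}]
```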

Decomposing a domain is especially important when a domain is larger, or when subdomains require generic — repeatable — integration logic. In such situations it could help to have a generic subdomain that provides integration logic in a way that allows other subdomains to standardize and benefit from it. A ground rule is to keep the shared model between subdomains small and always aligned on the ubiquitous language. For the overlap, we use different patterns from domain-driven design.

Imagine three illustrative use cases in which data requirements overlap. Different integration and distribution patterns can be applied within and across the different teams. Let’s explore the different approaches you can apply.

The separate ways pattern can be used if the associated cost of duplication is preferred over reusability. This pattern is typically a choice when high flexibility and agility are required. It can also be a choice when little or nothing is in common from a modeling perspective.

Figure 3 by Piethein Strengholt: DDS integration patterns

Teams can use a partnership pattern to accommodate the shared development needs of all parties when the overlap is large. All teams must be willing to cooperate with and regard each other’s needs. A big commitment is needed from everybody, because no team can change the shared logic freely. Data engineering teams, in this approach, are both data consumers and providers: they capture data, extract and load it into data stores, and republish or distribute it.

Figure 4 by Piethein Strengholt: DDS integration patterns

A customer-supplier pattern can be used if one team is strong and willing to take ownership of the data and needs of downstream consumers. The drawbacks of this pattern can be conflicting concerns, forcing downstream teams to negotiate deliverables and schedule priorities.

Figure 5 by Piethein Strengholt: DDS integration patterns

A conformist pattern can be used when all parties conform entirely to a single set of requirements. This pattern can also be a choice when the integration work is extremely complex, when no other parties are allowed to have control, or when vendor packages are used.

Figure 6 by Piethein Strengholt: DDS integration patterns

Conclusion

The architecture that we have been building throughout this blog post helps you understand how we manage data-intensive applications at scale. You’ve seen that to achieve a faster time to value, it’s important to decompose data using domain boundaries. By smashing silos and keeping dependencies between DDSs to a minimum, we let teams stay focused.

The architecture discussed throughout this blog post helps us manage data at scale. If you are curious to learn more, I encourage you to have a look at the book Data Management at Scale.
