
Data Management Architectures – Monolithic Data Architectures and Distributed Data Mesh

Learn about the limitations of monolithic data architectures and how a distributed data mesh helps address those challenges.


Introduction

Image by author

A data management architecture governs how organizations collect, store, secure, organize, integrate, and use data. A good data management architecture provides clarity about every aspect of data and how enterprises can get the best out of their data for business growth and profitability. Conversely, a bad one leads to inconsistent datasets, incompatible data silos, and data quality problems that render data worthless or impair an organization’s ability to run analytics, data warehousing (DW), and business intelligence (BI) activities, particularly at Big Data scale.

Conventionally, most organizations prefer to start with a central data team and a monolithic data management architecture where all data operations are carried out from and to a single, centralized data platform. While a monolithic data architecture is comparatively easy to set up and can handle small-scale data analytics and storage without performance degradation, things get overwhelming over time. Also, the central data management team tends to become a bottleneck as data volume and demand increase.

Introduced by Zhamak Dehghani, a distributed data mesh provides a way to reconcile and hopefully address the challenges associated with previous data architectures, which are usually hamstrung by data standardization challenges between data consumers and producers. Distributed data mesh nudges us towards more empowered, agile, leaner, multi-function teams and a domain-driven business structure. It combines the best data management approaches in a decentralized fashion while maintaining a data-as-a-product viewpoint, self-service user access, domain awareness, and governance.

This post will help readers understand a monolithic data architecture, the challenges associated with monolithic data architectures, and how a distributed data mesh can help organizations transform their analytical data as a product and build highly scalable, resilient, and data-driven applications. The target audience is software engineers, data engineers, data scientists, MLOps engineers, software developers, and database architects interested in learning more about monolithic data architecture and distributed data mesh.

Monolithic Data Architecture

Image by author

A monolithic data architecture is a framework in which application data is stored, transformed, manipulated, consumed, and managed from a single, centralized data store. Monolithic data architectures are run by one large platform team. They suit smaller organizations whose business domains are relatively simple and whose data landscape is not constantly changing, but they pose several challenges for growing engineering teams. Let’s take a look at some of these challenges.

The first concern is that a monolithic data architecture can’t scale indefinitely, and most monolithic databases don’t auto-scale at all. As application workload and data volume grow, a monolithic database gradually becomes slow, expensive, and hard to maintain. For organizations with rapidly changing use cases, many data sources, and many data consumers, the ability to consume and harmonize all of that data in one place, managed by one central team, also diminishes.

Secondly, monolithic databases often suffer from high latency and low throughput, which arise naturally from concurrent database reads and writes by the different, disconnected teams and services within an application. With monolithic data architectures, it’s also challenging to respond to new needs without altering the entire data pipeline.

Thirdly, a monolithic data architecture lacks modularity and suffers from the homogenization of technology. When a monolithic database becomes faulty or unresponsive, it affects the entire application and halts all database-related activities. Also, if engineering teams are forced to adopt a single database for an application, there won’t be much room for innovation and experimentation.

Eventually, a monolithic data architecture naturally results in impatient data consumers, disconnected data producers, and backlogged engineering teams burdened by towering technical debt and struggling to thrive in an agile world where change is constant and businesses need to innovate quickly.

Now you understand the meaning and limitations of monolithic data architecture. Let’s now look at a more optimized data architecture that primarily addresses the challenges associated with a monolithic architecture.

Distributed Data Mesh

Image by author

A distributed data mesh is highly decentralized, treats "data as a product," and supports distributed, domain-specific data owners responsible for handling their own data products and pipelines in an easily consumable, user-friendly way, while improving communication between distributed data in different locations. In many ways, a distributed data mesh is the data-platform analogue of microservices.
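To make "data as a product" concrete, here is a minimal sketch in Python. The data mesh prescribes no particular API, so every name below is purely illustrative: the idea is that a domain team publishes a discoverable product with metadata and a self-serve read port, and consumers never touch the team's internal pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Illustrative sketch only: a "data product" owned by one domain team.
# The mesh does not mandate any particular interface; these names are invented.

@dataclass
class DataProduct:
    name: str                            # discoverable identifier
    domain: str                          # owning business domain
    owner: str                           # team accountable for quality and SLAs
    schema: dict                         # published, versioned output schema
    read: Callable[[], Iterable[dict]]   # self-serve output port for consumers

# The owning team decides how the data is collected and stored internally;
# consumers only ever see the published port and schema.
orders_summary = DataProduct(
    name="orders.daily_summary",
    domain="order-management",
    owner="orders-team",
    schema={"order_date": "date", "total": "float"},
    read=lambda: [{"order_date": "2023-01-01", "total": 1250.0}],
)

for row in orders_summary.read():
    print(row["order_date"], row["total"])
```

The point of the sketch is the contract: the `read` port and `schema` are the product's public face, while storage and pipeline choices stay private to the owning domain.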

Engineering teams can achieve scalability with a distributed data mesh by breaking the entire data architecture into smaller, domain-oriented, more decentralized components and having a different team manage each domain. That way, it’s easier to scale out as the number of use cases, data sources, and access models grows. It also allows teams to build highly scalable, data-driven applications and use data effectively to improve marketing campaigns, reduce costs, optimize business operations, and make more informed decisions.

A distributed data mesh also provides greater flexibility, improved productivity, and autonomy for data owners. By distributing and decentralizing responsibility to the people closest to the data, a distributed data mesh reduces the read/write load on each database, facilitates data innovation, and eliminates the burden on engineering teams to meet the needs of every data consumer through a single data pipeline or data store. In a distributed data mesh, each team is free to decide how to collect, organize, store, and use data.

Finally, distributed data meshes have a self-serve "infrastructure-as-a-platform" and "federated computational governance" design that enables domain autonomy while providing a domain-agnostic, interoperable, and universal approach to data product monitoring and governance, data standardization, logging, alerting, product quality metrics, and more. The impact of a database failure is drastically reduced in a distributed data mesh, and each team can change its data platform, introduce new features, and deploy updates without altering other data stores.
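One way to picture federated computational governance is as a set of global, domain-agnostic policy checks applied automatically to every domain's data product. The sketch below is an assumption-laden illustration, not a real governance framework: the policy names, metadata fields, and PII-tagging rule are all invented.

```python
# Illustrative sketch: federated computational governance as global,
# domain-agnostic policy checks run against every data product.
# All field names and rules below are invented for illustration.

REQUIRED_METADATA = {"name", "domain", "owner", "schema_version"}

def check_metadata(product: dict) -> list[str]:
    """Global standard: every product must publish core metadata."""
    missing = REQUIRED_METADATA - product.keys()
    return [f"missing metadata field: {m}" for m in sorted(missing)]

def check_pii_tagging(product: dict) -> list[str]:
    """Global standard: every column must declare whether it contains PII."""
    return [
        f"column '{col}' lacks a pii flag"
        for col, spec in product.get("columns", {}).items()
        if "pii" not in spec
    ]

GLOBAL_POLICIES = [check_metadata, check_pii_tagging]

def govern(products: list[dict]) -> dict[str, list[str]]:
    """Apply every global policy to every domain's product, uniformly."""
    return {
        p.get("name", "<unnamed>"): [v for policy in GLOBAL_POLICIES for v in policy(p)]
        for p in products
    }

report = govern([
    {"name": "orders.daily_summary", "domain": "orders", "owner": "orders-team",
     "schema_version": "1.2", "columns": {"email": {"type": "str"}}},
])
print(report)  # flags the untagged 'email' column
```

The design choice this illustrates: the governance body owns the policy functions, while each domain remains free to build its product however it likes, as long as the checks pass.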

When To Transition From A Monolithic To A Decentralized Data Mesh

Image by author

A distributed data mesh isn’t a silver bullet that’s suitable for all organizations and teams. As you can imagine, the monolithic data architecture also isn’t going out of existence. Before transitioning from a monolithic to a distributed data architecture, organizations need to do their homework and ensure that moving would be a wise decision for the business.

Here are some questions organizations and their teams should ask when evaluating the readiness to transition to a decentralized data mesh:

  • Size of data team: how many data engineers, data analysts, data scientists, and product managers do we have on the data team?
  • Data size: how voluminous is our data? At what rate is it growing?
  • Data variety and sources: how many data use cases and sources do we have?
  • Data bottlenecks: how often is the data team occupied with resolving technical debts (hence slowing down the implementation of new data products and turning it into a bottleneck)?
  • Lead times: how long does it take us to ship a new data product? Despite the growth of our team, are members of each team disconnected, or do they lack domain knowledge?
  • The number of data domains: how many functional teams rely on our data store to make decisions, and how many data-driven products and features do we have?
  • Data governance: how much of a priority is data governance for our organization? Is there political infighting over who controls the data?

By asking these questions and evaluating the responses, organizations can determine whether it’s best to stick with traditional monolithic architecture or switch to a decentralized architecture. In general, companies with microservices-based applications, very demanding and complex data infrastructure requirements, high data volumes, numerous data sources and domains, and large team sizes that are often overwhelmed would benefit more from a data mesh. Otherwise, I believe that a decentralized solution would be overkill.
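The checklist above can be turned into a rough, back-of-the-envelope readiness check. The questions mirror the article's list, but the boolean encoding, the weighting (all signals equal), and the threshold are invented for illustration; a real decision would weigh these factors qualitatively.

```python
# Rough readiness sketch based on the checklist above.
# The signals mirror the article's questions; the equal weighting and the
# threshold of 5 are illustrative assumptions, not a real rule.

def mesh_readiness(answers: dict[str, bool]) -> bool:
    """Count how many mesh-favoring signals an organization reports."""
    signals = [
        "large_data_team",          # many engineers, analysts, scientists
        "high_data_volume",         # voluminous, fast-growing data
        "many_sources",             # many data use cases and sources
        "central_team_bottleneck",  # debt work blocks new data products
        "long_lead_times",          # teams disconnected, lacking domain knowledge
        "many_domains",             # many functional teams and data products
        "governance_priority",      # governance matters / ownership disputes
    ]
    score = sum(answers.get(s, False) for s in signals)
    return score >= 5  # illustrative threshold

print(mesh_readiness({
    "large_data_team": True, "high_data_volume": True, "many_sources": True,
    "central_team_bottleneck": True, "long_lead_times": True,
    "many_domains": False, "governance_priority": False,
}))  # → True (5 of 7 signals present)
```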

Conclusion

Organizations need to adopt a new methodology for managing data at scale to augment and improve every aspect of life and business with data. Although the technological advancements in the past have addressed the scale of data volumes and data processing compute, they’re unable to address scale in other dimensions such as the proliferation of data sources, changes in the data landscape, speed of response to change, and diversity of data users and use cases.

A data mesh architecture addresses these dimensions and drives a new logical view of organization structures and technical architecture. A properly implemented data mesh bridges the gap between IT and business leaders, giving them a platform to ensure that business strategy and technology align to power a business forward. By moving from monolithic data architecture to a distributed data mesh, organizations and engineering teams can radically decrease lead times, better position themselves to efficiently use data for several use cases, and build scalable, data-driven applications that align properly with the current shifts towards cloud-native architectures and ecosystems.

