
Layers of Data Quality

Where and how to address problems with your data

With the recent surge of interest in generative AI and LLMs, data quality has seen a resurgence of attention. Not that the space needed much help: companies like Monte Carlo, Soda, Bigeye, Sifflet, Great Expectations, and dbt Labs have been developing a range of solutions, from proprietary to open core. While some of these solutions are direct competitors, they don’t all address the same problems. For example, defining an explicit dbt test to ensure that a column contains unique values is very different from anomaly detection on metrics (e.g., your dim_orders process generated 500,000 records one day, when it’s usually more like 50,000). Data can fail in spectacular and myriad ways.
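To make that contrast concrete, here is a minimal Python sketch of the two styles of check: an explicit rule that a column must be unique, and a crude volume anomaly check that flags a day whose record count deviates wildly from the recent average. The names and the three-standard-deviation threshold are illustrative assumptions, not taken from any particular tool.

```python
import statistics

def check_unique(records: list[dict], column: str) -> bool:
    """Explicit, rule-based test: every value in `column` must be unique."""
    values = [r[column] for r in records]
    return len(values) == len(set(values))

def check_volume_anomaly(daily_counts: list[int], today_count: int, threshold: float = 3.0) -> bool:
    """Crude anomaly check: flag today's row count if it is more than
    `threshold` standard deviations away from the historical mean."""
    mean = statistics.mean(daily_counts)
    stdev = statistics.stdev(daily_counts)
    if stdev == 0:
        return today_count != mean
    return abs(today_count - mean) / stdev > threshold

# dim_orders usually lands around 50,000 rows; 500,000 should be flagged.
history = [48_000, 51_500, 49_800, 50_200, 52_100]
print(check_volume_anomaly(history, 500_000))  # True -> investigate
```

The first check encodes a rule we already know; the second notices that something looks unusual without anyone having written a rule for it.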

You’ve probably heard of data quality dimensions; I particularly like Richard Farnworth’s take¹, but a quick Google search will yield dozens of different opinions. At the core, though, the idea is that data can be "right" in one aspect but wrong in others. If your data is correct but late, is it valuable? What if the numbers are objectively wrong, but they’re consistent²? This is an important part of data product management: identifying your stakeholders’ priorities.

There’s a lot of focus on how data is malformed, missing, late, incomplete, etc., and less attention is given to the root causes of data quality issues. We spend an inordinate amount of time testing and observing the data itself rather than seeking improvement in the systems that produce, transform, and use that data. I want to explore these "layers" of data quality issues, solutions, and the teams who should be involved in resolving problems related to them.

Layer 1: Data Production

All data comes from somewhere, and the source is frequently the root cause of data quality issues. Following the principle of garbage in, garbage out, you can’t make useful data products from poor source system data.

This layer has three fundamental sources of data quality issues: schema drift, semantic drift, and system availability and reliability. They’re all extremely important, but they lead to different data quality failures. What’s more, they need different solutions, and frequently, different teams have to be engaged to resolve them.

Schema changes are in many ways the simplest to identify and resolve. If product engineers, either in-house or through a vendor, change the schema of a table you’re using, your downstream processes may simply break. This tends to be less of an issue with SaaS APIs, since they follow a more mature change management protocol. This isn’t a slight against developers, however; frequently, these internal teams don’t even know that downstream teams are using their data assets. Schema changes that break ETL processes are often a symptom of poor organizational communication.
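A lightweight way to catch this class of breakage before it propagates is to compare the columns a source actually delivers against the columns your pipeline expects. A minimal sketch, assuming a hypothetical orders table and made-up column names:

```python
# Hypothetical set of columns the downstream pipeline was built against.
EXPECTED_COLUMNS = {"order_id", "customer_id", "order_total", "created_at"}

def detect_schema_drift(actual_columns: set[str]) -> dict:
    """Report columns that disappeared or appeared relative to expectations."""
    return {
        "missing": sorted(EXPECTED_COLUMNS - actual_columns),
        "unexpected": sorted(actual_columns - EXPECTED_COLUMNS),
    }

drift = detect_schema_drift({"order_id", "customer_id", "total_amount", "created_at"})
if drift["missing"] or drift["unexpected"]:
    # Fail fast (or alert) instead of letting a renamed column silently break a join downstream.
    raise ValueError(f"Schema drift detected: {drift}")
```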

Semantic drift is more pernicious and has more exposure across the enterprise. What if your development team changes the unit of measure for a field? Or what if they change the values that populate a field via a dropdown menu? Your product isn’t the only thing that’s exposed: operations teams for enterprise systems like SFDC, Zuora, NetSuite, and Zendesk may likewise change how they use their operational systems, and this can ripple into downstream data. Semantic issues can also be entirely accidental; I think we’ve all seen the poor sales rep who entered a multi-billion dollar deal through a typo. What’s really fun is that you may get a combination of schema change and semantic drift; for example, an is_enterprise_customer flag replaces the valid value ‘Enterprise’ in the customer_type field. As with schema changes, communication is the missing piece.
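Semantic drift is harder to catch mechanically, but accepted-value and plausibility checks go a long way. A small illustrative sketch; the field names, accepted values, and the sanity ceiling are assumptions, not from any real system:

```python
ACCEPTED_CUSTOMER_TYPES = {"Enterprise", "SMB", "Self-Serve"}  # hypothetical dropdown values
MAX_PLAUSIBLE_DEAL_AMOUNT = 50_000_000  # hypothetical sanity ceiling

def validate_record(record: dict) -> list[str]:
    """Return a list of semantic problems found in a single CRM record."""
    problems = []
    if record.get("customer_type") not in ACCEPTED_CUSTOMER_TYPES:
        problems.append(f"unexpected customer_type: {record.get('customer_type')!r}")
    if record.get("deal_amount", 0) > MAX_PLAUSIBLE_DEAL_AMOUNT:
        problems.append(f"implausible deal_amount: {record['deal_amount']}")
    return problems

# Catches both the new dropdown value and the fat-fingered multi-billion dollar deal.
print(validate_record({"customer_type": "Enterprize", "deal_amount": 4_000_000_000}))
```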

Finally, there’s the issue of source system availability and reliability. If the source system is down, data may not be generated for a period of time. Depending on the architecture, data may still be generated, but the endpoint for retrieval may be down. In these cases, the data quality failure is categorically different from schema changes and semantic drift.

Chad Sanderson has been a champion for the idea of data contracts³: formal, programmatically enforced definitions of data as it’s emitted from (usually) operational systems. Data contracts are a good starting point for addressing both schema changes and some types of semantic drift. Note that data contracts don’t address SaaS operations teams; they need a separate system for change management.
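As a sketch of what "programmatically enforced" can look like, here is a minimal contract expressed as a Pydantic model that rejects records violating the agreed shape. The event and field names are hypothetical, and real data contract implementations, including the ones Sanderson describes, are considerably richer than this.

```python
from datetime import datetime
from typing import Optional
from pydantic import BaseModel, ValidationError

class OrderEvent(BaseModel):
    """Contract for order events emitted by a (hypothetical) operational system."""
    order_id: int
    customer_type: str   # types alone don't capture semantics; those still need human agreement
    order_total: float
    created_at: datetime

def enforce_contract(raw_event: dict) -> Optional[OrderEvent]:
    try:
        return OrderEvent(**raw_event)
    except ValidationError as exc:
        # Reject at the boundary and notify the producer, rather than letting
        # malformed events flow into downstream models.
        print(f"Contract violation: {exc}")
        return None

enforce_contract({"order_id": "not-a-number", "customer_type": "SMB",
                  "order_total": 10.0, "created_at": "2024-01-01T00:00:00"})
```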

Product teams are familiar with observability and monitoring solutions like Datadog and Splunk. SaaS vendors typically have status portals, and some offer notifications when services are down. The challenge is that these system outages frequently never make it past the development and/or operations teams, even though they’re vital information for downstream data teams. Creating good processes to communicate outages and issues across the organization is just as important as the observability systems that give us that visibility in the first place.

Interventions: data contracts, communication channels, change management, system observability and monitoring, communication to downstream consumers

Go-to teams: product engineers, SaaS systems operators (e.g., AR specialists, account executives), platform engineers, platform operators (e.g., SaaS admins, SaaS business analysts)

Layer 2: Data Processing – Extract / Transform

Let’s say the source system data is pristine, or at least clean enough to be usable. There’s still a lot that can go wrong. This time, we have exposure on three fronts: logical errors in development, low-resiliency design, and platform stability.

When developing a data pipeline, sometimes we get it wrong. Maybe Airflow tasks don’t have the proper dependencies established, or perhaps an analytics engineer made a mistake in a join. Depending on the specific error, the outcome could be the pipeline breaking entirely (usually in dramatic and fiery fashion); however, subtle logical bugs can lurk in systems for years, either slightly affecting metrics all of the time or rarely affecting metrics in a big way.
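For the dependency side of this, the fix is often just being explicit. A minimal Airflow sketch, with placeholder task names and callables (on older Airflow versions the `schedule` argument is called `schedule_interval`):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders():   # placeholder callables for illustration
    ...

def transform_orders():
    ...

def load_orders():
    ...

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform_orders", python_callable=transform_orders)
    load = PythonOperator(task_id="load_orders", python_callable=load_orders)

    # Explicit ordering: transform never runs against a partially extracted table.
    extract >> transform >> load
```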

During the requirements gathering process, it’s essential to know your stakeholders’ priorities. In some cases, it’s critical to fail a pipeline if data is missing or has some kind of quality issue; in other cases, speed of delivery matters more than absolute accuracy. Similarly, is it OK to show partial or incomplete data, or should we only show data once it’s been fully processed for a given period? Misalignment between data product teams and stakeholders can mean that we’re not meeting expectations.

Then there’s the reality that, just as with products and SaaS solutions, our data pipelines actually run on platforms. What if Snowflake is down? What if there’s a bug in BigQuery or EMR that affects resource allocation? These things happen, and while platform teams may be aware, there’s still a communication and visibility gap to downstream teams.

Solutions are plentiful but not necessarily easy to implement. Strategies like unit testing and integration testing are vital to catching bugs before they hit production. Tooling in the data space still lags the broader software engineering ecosystem, and data team culture is still catching up, too.
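As a sketch of what unit testing a transformation can look like, here is a pytest-style test for a hypothetical join that should neither drop nor duplicate orders. The function and column names are illustrative assumptions.

```python
import pandas as pd

def enrich_orders(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    """Attach customer attributes to orders; orders without a match are kept."""
    return orders.merge(customers, on="customer_id", how="left", validate="many_to_one")

def test_enrich_orders_keeps_all_orders():
    orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 10, 99]})
    customers = pd.DataFrame({"customer_id": [10, 20], "segment": ["Enterprise", "SMB"]})
    result = enrich_orders(orders, customers)
    # A wrong join type or duplicated customer rows would change the row count.
    assert len(result) == len(orders)
    # An unmatched order should survive with a null segment, not disappear.
    assert result.loc[result["order_id"] == 3, "segment"].isna().all()
```

A test like this is cheap to write, and it is exactly the kind of thing that catches a subtle join bug before it spends years quietly skewing a metric.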

On the other hand, many of the observability solutions that apply to source systems also apply to data processing systems. The sticking point, however, is making sure that this visibility is shared throughout the organization. Communication can happen through notification channels or dashboards, but regardless of approach, we need to make sure impacted teams know what’s going on.
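One lightweight pattern is to push pipeline failures into a shared channel the moment they happen. A minimal sketch using a Slack-style incoming webhook; the URL and message format are placeholders, not a prescription for any particular stack:

```python
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook URL

def notify_data_incident(pipeline: str, error: str) -> None:
    """Post a short, human-readable incident message to a shared channel."""
    message = f":rotating_light: `{pipeline}` failed: {error}"
    requests.post(WEBHOOK_URL, json={"text": message}, timeout=10)

try:
    raise RuntimeError("dim_orders load wrote 0 rows")  # stand-in for a real failure
except RuntimeError as exc:
    notify_data_incident("dim_orders", str(exc))
```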

It’s also worth calling out that the same issues caused by data producers apply to core asset developers, too; their consumers, like analysts, data scientists, and BI developers, are just as vulnerable to schema changes, semantic drift, and system (ETL) outages.

Interventions: unit testing, integration testing, clear design documentation, communication channels, change management

Go-to teams: data engineers, analytics engineers, data platform engineers

Layer 3: Data Consumption – Analytics, AI, and ML

In the last mile between data and information, there’s still plenty of room for mistakes. Here, the questions are less about how the data is being handled, produced, and transformed, and more about how it’s being understood and used. Specifically, I want to focus on errors in applying techniques and misunderstandings of the data itself.

Statistics, ML, and AI are complicated. Like really, really complicated. There are tons of considerations when choosing models: whether your variables are continuous or discrete, the distribution of your data, heteroskedasticity, sample size, and dozens if not hundreds of others. Even if you make all of the right decisions, there may be obscure implementation bugs in your ML library.

A data consumer can also just not understand the context of upstream data. Maybe they didn’t check the data catalog, or maybe there’s not a data catalog at all. Familiarity with a data asset is vital but not always enough to avoid making these kinds of mistakes and ultimately drawing incorrect conclusions.

Also, there’s no guarantee that someone in the business will interpret a dashboard, analysis, or report correctly. This isn’t necessarily through malicious intent, either: most people consuming data have day jobs outside of the data. Data teams are often so deep in the trenches that they fail to see where metrics may be unclear.

You’ll probably have spotted the theme by now, but the solutions here are people-centric. Folks hired into advanced analytics and ML roles need the right training and experience to succeed. As for creating products on top of established assets, data catalogs and even just conversations with data asset owners can be a huge help. Finally, making sure that artifacts like dashboards and reports are clear is essential, and warm handoffs and stakeholder education are even better.

Interventions: Data catalogs, dashboard labels, stakeholder education, practitioner education and training

Go-to teams: data analysts, data scientists, BI developers, ML engineers

Conclusion

While it’s important to recognize all of the different ways your data can be wrong (dimensions), it’s equally important to understand how and where it can go wrong (layers). We talked about errors at the source (Layer 1), during data processing (Layer 2), and during consumption (Layer 3). We also talked about interventions and which teams are closest to the problem and the solution.

Data quality is vital to getting value from your data, whether that’s through analytics, automation, or an external data product. Data quality dimensions help us identify the axes on which our data is failing us: is it stale? Is it wrong? But it’s equally important to understand why our data is failing us. The "why" is what gives us insight into solutions, either for immediate remediation or for longer-term, systemic corrections.


¹Richard Farnworth. (June 28, 2020). The Six Dimensions of Data Quality – and how to deal with them. https://towardsdatascience.com/the-six-dimensions-of-data-quality-and-how-to-deal-with-them-bdcf9a3dba71

²Benn Stancil. (June 9, 2023). All I want to know is what’s different. https://benn.substack.com/p/all-i-want-is-to-know-whats-different

³Chad Sanderson. (January 25, 2023). Data Contracts for the Warehouse. https://dataproducts.substack.com/p/data-contracts-for-the-warehouse

