Performing Data Validation at Scale with Soda Core

An in-depth look into the past and present of data validation, and how you can leverage today’s tools to ensure data quality at scale

Mahdi Karabiben
Towards Data Science
10 min read · May 26, 2022


With the rising interest in the Modern Data Stack (MDS) and its growing ecosystem of data technologies and SaaS offerings, data teams are betting heavily on it to unlock new capabilities and avenues of innovation. But more importantly, the MDS ecosystem promises efficient alternatives to costly decisions made in the not-so-distant past.

After all, when we compare today’s data stack to “first-wave” BI platforms or “second-wave” Hadoop environments, it’s evident that we can now get more (features, capabilities, and value generation) for less, whether in terms of engineering effort, time, or infrastructure costs. That makes this a very good moment to enhance our data platforms or, better yet, update and fine-tune their existing components.

In this article, we’ll go into the details of how we can leverage this new wave of data tools to do more with less in the area of data validation.

Wait, do we even have a clear definition of data validation?

The broad theoretical definition of data validation is static: it’s the set of processes that ensure our data can be trusted and that it meets pre-defined data quality standards, usually measured against six key dimensions (accuracy, completeness, consistency, timeliness, validity, and uniqueness). On the other hand, like most other areas within data engineering, the technical ways in which we apply that definition are evolving at a rapid pace and in different directions, making the concept itself rather fluid. Nonetheless, the end goal remains the same: ensuring that we can trust the data we’re consuming.

Now that we’ve — sort of — cleared up the definition, let’s look at how we used to tackle (and sometimes avoid) data validation a few years ago.

The dark old days of data validation

For many years (think 2010 to 2016), data engineers were building pipelines without software engineering best practices: the goal was to deliver as much data as possible, as fast as possible, and to decide later what to do with it. This didn’t cause any immediate issues at the time, because “big data” was still a secondary factor for decision-making at most companies, and so skipping data validation altogether was tolerated.

Data engineering teams that wanted to ensure data quality in those years had very limited options:

  • If the data was processed using a distributed compute engine (usually on top of a distributed file system like HDFS), then the team would need to write dedicated tasks/jobs that would cleanse the data and run data quality checks. (For example, if the team maintained Spark jobs that did the processing, the data validation would also happen using Spark either via dedicated jobs or as a step of the processing job.)
  • If, on the other hand, the data lived in a distributed data warehouse (like Apache Hive), then the team would need to write and maintain multiple SQL queries that run on the different input tables to perform the necessary checks.

In both scenarios, the difficult-to-scale brute-force approach was the only available option. Different companies built internal frameworks and abstractions to simplify the process, but there was no “aha” moment within the data community: data validation required a lot of effort because everyone was starting from square one, and so most companies simply pushed it to the side.

But now, things have changed: data is a first-class citizen everywhere and metrics are consumed by a wide range of users/teams. Now, we frequently find ourselves trying to understand why two dashboards present different values for the same metric, or how a failure in one pipeline would impact our end-users. Now, we’re paying our overdue debts for building data pipelines without having data quality in mind — and so, how can we tackle data quality with open-source “third-wave” tools? How can we leverage these tools to automate existing processes and reduce the costs (whether in terms of engineering effort, time, or budget) of implementing data validation components at scale?

The state of data validation today

Let’s start by looking at the considerable progress that data validation has seen in the past few years. Whether via open-source projects like Great Expectations and Soda Core (previously SodaSQL) or SaaS platforms focused on the broader space of data observability like Monte Carlo and Sifflet, data validation has evolved a lot.

When we talk about data validation today, we’re talking about just one component within the broader and quickly maturing space that encompasses data observability and DataOps, a space that’s making it much easier to say “yes, the data is OK” with confidence. So even though we’re focusing on open-source data validation tools, the design we’re building can later be extended into a full data observability layer; that, however, is outside the scope of this article.

What’s in scope, though, is the state of open-source data quality in 2022 — so let’s take a look at the lay of the land.

Great Expectations

Great Expectations is arguably the tool that defined the current standard of what should be expected from a data validation tool: You define your checks (or expectations) and how/when you want to run them, and then your data validation component takes care of the rest. Pretty neat.

In the past four years, the tool has expanded in every way possible: an integrations list that keeps getting longer, data profiling capabilities, and built-in data documentation. And the cherry on top is that Great Expectations is a Python library, and your expectations are simply Python functions.
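To make that concrete, here’s a minimal sketch of what expectations can look like, assuming the classic pandas-based Great Expectations API; the DataFrame and column names are made up, and the exact entry points vary between versions.

```python
# A minimal sketch using the classic pandas-based Great Expectations API
# (names are made up; the exact entry points vary between versions).
import pandas as pd
import great_expectations as ge

# Hypothetical orders data; in practice this would come from your warehouse or lake.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [10.5, 99.9, 42.0],
})

dataset = ge.from_pandas(orders)

# Each expectation is a method call that returns a result describing
# whether the expectation was met.
dataset.expect_column_values_to_be_unique("order_id")
dataset.expect_column_values_to_not_be_null("order_id")
dataset.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

# Validate the whole suite at once and inspect the outcome.
print(dataset.validate())
```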

But what’s even simpler than maintaining Python functions? YAML and SQL.

Soda Core (previously SodaSQL)

Soda Core is another open-source tool that provides the capabilities needed for data validation. And even though the tool itself is also written in Python (like Great Expectations), it tackles data validation differently: as a developer, you’re simply expected to deliver a set of YAML configuration files that tell Soda how to connect to your data warehouse and which checks you want to run on your different tables.

This approach scales very conveniently when managing hundreds of tables with different owners and maintainers. Initially, Soda required one YAML file per table, but now with the release of SodaCL you can leverage loops and custom Soda syntax within the YAML configuration to optimize how you define the metrics/checks.

Soda mainly prioritizes CLI interactions to run the checks (with a wide range of commands and options), but it also offers a rich Python library — which opens the door to custom usage and leveraging the output of its checks directly within a Python application.

Concrete example: ensuring data quality at scale using Soda Core

To showcase the simplicity of setting up a scalable data validation component today, let’s go through the different steps of adding Soda Core to an existing data platform.

You might encounter this scenario in various use cases, like needing to enhance data quality and improve trust in your data, or a data reconciliation project (ensuring that the data you consume from a new source matches the data you were consuming from a legacy source you want to deprecate). In such scenarios, adding a data validation component boils down to answering three questions.

Step 1: What are we validating?

First, we’ll need to start by defining which assets we actually want to test/validate, and which checks we want to run on these assets. Soda makes this process extremely efficient because instead of writing dozens of redundant SQL queries, we merely need to pick the metrics we want to leverage (out of a lengthy list of pre-defined metrics) and the checks we want to perform.

Then, we need to communicate our choices to Soda via YAML. We can use loops, lists, and even custom SQL-based metrics/checks — ensuring that we can implement all possible scenarios without any redundancies when defining the checks.
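As a rough illustration of such a file, the hypothetical sketch below defines a few checks for a dim_customers table and uses a loop to cover a whole family of raw_ tables; the dataset and column names are made up, and the exact syntax is worth double-checking against the SodaCL documentation.

```yaml
# checks.yml -- a rough SodaCL sketch (dataset and column names are made up)
checks for dim_customers:
  - row_count > 0
  - missing_count(customer_id) = 0
  - duplicate_count(customer_id) = 0

# Loops let a single definition cover many datasets at once
for each dataset T:
  datasets:
    - raw_%   # every dataset whose name starts with "raw_"
  checks:
    - row_count > 0
```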

Sample Soda Core checks YAML file.

Soda is heavily optimized to minimize the cost and effort needed to add and maintain data checks, and in the sample above we only scratched the surface of the available functionalities (other notable features include built-in freshness checks and check configurations).

With this approach, we get to manage our data validation checks via source control (ensuring versioning and centralization). Additionally, the usage of YAML means that no matter the background of the user interacting with the data (whether they’re a data scientist, data engineer, ML engineer, or even a PM), they’d be able to not only understand the existing checks for a given table but also propose modifications. For better or worse, YAML managed to position itself as the universal configuration language of the tech world, and so we might as well leverage it to the fullest.

Now that we have defined the checks we want to run, let’s talk about where we should run them.

Step 2: Where are we validating?

Considering that we’re talking about a modern data platform, the assumption would be that we’re applying an ELT design and running our transformations on a cloud-based distributed data warehouse.

Soda offers connectors to all of the “mainstream” data warehouses and has built-in optimizations for its queries (like leveraging caching) to ensure optimal performance and minimal costs for our data validation queries that run on the data warehouse. With that in mind, we merely need to provide a configuration.yml file describing the connection and the data warehouse we want to use — and then Soda will abstract the rest.
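As a hypothetical illustration, such a file generally follows the shape below for a Postgres warehouse; the data source name, host, and credentials are placeholders, and the exact connection keys can differ between Soda Core versions.

```yaml
# configuration.yml -- a sketch for a Postgres data source
# (connection details are placeholders; in practice credentials would be
# injected from a secrets manager or environment variables)
data_source my_postgres:
  type: postgres
  connection:
    host: analytics-db.internal
    port: "5432"
    username: soda_user
    password: "<redacted>"
  database: analytics
  schema: public
```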

Sample configuration for a Postgres warehouse.

It’s also worth noting that it’s possible to leverage Soda for Spark-based architectures, thanks to a feature-rich Soda Spark extension.

Now that Soda can connect to our data warehouse and run its queries on it, let’s decide on how and when we actually want to run these queries.

Step 3: How (and when) are we validating?

Since we have an existing data platform, we can assume that we also have an orchestrator that triggers the different tasks within our pipelines and ensures scheduling and — unsurprisingly — orchestration (examples include Airflow and Dagster). Ideally, we’d want to perform the data validation as soon as possible within our DAGs — so let’s see how we can achieve this thanks to our orchestrator.

Soda offers integrations with Airflow and Prefect out of the box, and the recommendation is to run Soda checks as dedicated tasks within our DAGs. If we take Airflow as an example, this can be done in different ways, like leveraging the PythonVirtualenvOperator or even running the soda scan CLI command directly via Airflow’s BashOperator.
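For illustration, a dedicated validation task built around the BashOperator could look like the sketch below; the DAG name, file paths, and data source name are all hypothetical, and it assumes the soda scan CLI with its -d (data source) and -c (configuration file) options.

```python
# A minimal Airflow sketch (DAG name, paths, and data source name are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_ingestion",
    start_date=datetime(2022, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # `soda scan` exits with a non-zero code when checks fail,
    # which fails the Airflow task and blocks downstream tasks.
    validate_raw_orders = BashOperator(
        task_id="validate_raw_orders",
        bash_command=(
            "soda scan -d my_postgres "
            "-c /opt/soda/configuration.yml /opt/soda/checks.yml"
        ),
    )
```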

But even if we’re using an orchestrator that doesn’t integrate with Soda (one that doesn’t allow us to run Python packages), the ideal pattern can be as follows:

  1. Receiving the raw data from an external source or via an EL component (like Airbyte for example). This data will be ingested into our raw layer (whether in a data warehouse or a lakehouse) and the completion of its ingestion will, in turn, trigger the validation task.
  2. Using dedicated data validation tasks/steps within our orchestrator that trigger an execution environment to run Soda Core. A serverless service like an AWS Lambda function (or its equivalents on other cloud providers) is ideal here, since it eliminates the need to run Soda directly within the orchestrator itself. The serverless function would be triggered via an HTTP request, and the orchestrator task would be marked as a success or a failure based on the function’s response (this can be as simple as checking the response code); a rough sketch of such a function is shown below.
  3. Pushing the metrics generated by the data validation task into our data warehouse either for further monitoring or to build dedicated dashboards on top of them. To retrieve the metrics we can use the scan_result object which is part of the Soda scan output.
  4. Relying on the output of the Soda checks to determine how to proceed with the execution of our DAG. The action to take when a check fails depends entirely on the use case (raising a warning or an error based on the type of check and the severity of the issue, sending alerts, and so on). If there are no blocking failures, we move on to the data transformation part of our pipeline knowing that we can trust the data.
Sample data pipeline with a Soda Core task. (image by author)
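As for the serverless function itself, the sketch below shows what it could look like, assuming Soda Core’s programmatic scan API; the data source name and file paths are hypothetical, and the method names may differ between Soda Core versions.

```python
# A rough sketch of a serverless handler running a Soda scan, assuming
# Soda Core's programmatic scan API (method names may differ by version).
import json

from soda.scan import Scan


def handler(event, context):
    scan = Scan()
    scan.set_data_source_name("my_postgres")               # hypothetical data source
    scan.add_configuration_yaml_file("configuration.yml")  # warehouse connection
    scan.add_sodacl_yaml_file("checks.yml")                # checks to run

    # execute() returns a non-zero exit code when checks fail or error out.
    exit_code = scan.execute()

    return {
        # The orchestrator task only needs the status code to pass or fail.
        "statusCode": 200 if exit_code == 0 else 422,
        # The full scan results can be pushed to the warehouse for monitoring.
        "body": json.dumps(scan.get_scan_results()),
    }
```

The checks and configuration files would simply be packaged alongside the function, so the orchestrator never needs Soda installed locally.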

With this design, we leverage the fact that Soda is a lightweight package and run it in a dedicated serverless environment, which minimizes the risk of potential issues.

Summing things up

Throughout the article, we saw how data validation has evolved over the past few years and how, today, implementing a dedicated data quality component boils down to answering three questions via YAML configuration.

This progress immensely lowers the cost of ensuring data quality at scale, whether to enhance existing data pipelines and improve trust in the data or to implement a specific use case like in a data reconciliation scenario.

Similar patterns can be seen in other parts of the data stack, where resource-heavy problems of the not-so-distant past become abstracted features that can be implemented with minimal resources. This brings us back to the first point of the article: before expecting the MDS to introduce new capabilities or use cases, why not leverage its ecosystem to improve existing components within our data stack?
