Making Sense of Big Data

Test Your Data Until It Hurts

A data testing tale

Micha Kunze
Towards Data Science
8 min read · Nov 3, 2020


Photo by Jay Heike on Unsplash

If you work in analytics, it is safe to assume that you have complained about data quality issues more than once. We all have, and the pain is real!

As a Data Engineer, I hold data and its quality close to my heart. I am acutely aware of how paramount good data quality is — and recently the topic has created some buzz under labels like #DataOps. There is a plethora of articles out there: simply go to the DataOps Medium page from DataKitchen, or, if you prefer the term MLOps, there is plenty of content as well, such as this blog post about data testing in MLOps from the Great Expectations blog.

There is much more out there — but in my experience, the adoption of systematic data testing and quality monitoring is still lacking in practice. If there is any data testing at all, it often ends up as a few assertions mangled into the pipeline, polluting the pipeline code and creating no visibility into your data quality issues. In this post, I want to give some insight into our (ongoing) data testing journey and why it has to hurt!

Data Quality

The first problem I see with data quality is that it is a very vague term. How you define data quality, and which data quality issues actually matter, depends on your use case and your type of data.

One usual suspect is missing data/observations, either due to incomplete records or due to a copying error in your pipeline that went unnoticed: an upstream job in the enterprise data store changed, and suddenly you have fewer rows/observations in your data. Then there are wrong entries, e.g. a person entered a number in a field somewhere that was plain wrong or in the wrong format. And the list goes on.

OK, so the data coming into your system has issues, but what about the data that you put out? Even if you manage to fix the incoming issues, are you making sure that the data you publish is of top-notch quality? That is yet another facet of data quality.

All in all, there are a lot of things to consider mixed with a lot of noise and uncertainty on how to deal with data quality issues.

You should care

If you still think that you have no data quality issues because your code passed all the automated checks, you mocked all the data, or you even unit tested your sample data, you are wrong! Or rather, you might be correct right now, but eventually, when things break in your data (and they will), you might very well be consuming, and possibly publishing, bad data without even knowing.

If you do not test, you do not know, simple as that. So start testing now! And once you start testing, you should test until it hurts. If you just collect data metrics showing that the pipeline is running with X rows so you can pat yourself on the back for how great your automation skills are, this will not generate any value for you. Put the other way around, paraphrasing Daniel Molnar: “vanity metrics are useless”. Don't monitor metrics that make you look good; instead, monitor the bad things that make it painfully obvious what is wrong and what you need to fix. If the tests do not hurt you, if they do not show you the things that are wrong, they are worthless.

Test your data until it hurts so that you have to fix the issues and constantly improve!

Let’s flip that over to something good: what can we do to start testing? How can we feel the good pain? 😅

The most important thing is to start. And to start with something simple. From my perspective the most value can be gained from two things when testing our data:

  1. Break automated pipelines if data quality is bad
  2. Observe data quality and unexpected data / failed pipelines

The first point is obvious: we do not want to use or publish bad data, so we fail if the data is bad. The second one is nearly as valuable. When you start testing data quality, you very likely will not know what good data even looks like, i.e. what are the bounds of good data, and what does a good data distribution look like? There might be differences between data coming in on weekends and weekdays, or seasonal trends that you did not pick up in the initial dataset you worked on. So, to stay on top of breaking things for the sake of data quality, you need to learn and iterate on what good data actually looks like over time.

This works for us

Until now I have been vague on how we actually test — so let's get to some more tangible examples of how our team continuously tests data.

As I wrote in a story on Data Engineering practices: keep it simple! And if you, like me, work a lot with Python and pandas, then great-expectations is a fantastic place to start your data testing journey.

great-expectations works by building expectation suites (basically test suites) which are subsequently used to validate batches of data. These suites are saved as .json files and live with your code. To build expectations, you can use the basic profiler from the package (or write your own — it is rather simple) to generate some suggestions, or you can start from scratch. Let us look at a simple example:
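The gist of it, using the v0.12-era DataContext API, is roughly the following sketch. The data source, file, and suite names here are made up for illustration, and the exact calls may differ slightly between great_expectations versions.

import great_expectations as ge

# Assumes great_expectations init has been run in the project directory
context = ge.data_context.DataContext()

# Register a pandas-backed data source (the name is arbitrary)
context.add_datasource("local_files", class_name="PandasDatasource")

# Create an expectation suite and add expectations against a sample batch
suite = context.create_expectation_suite("orders_suite", overwrite_existing=True)
batch = context.get_batch({"datasource": "local_files", "path": "data/orders.csv"}, suite)
batch.expect_column_values_to_not_be_null("order_id")
batch.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)
batch.save_expectation_suite()

# Validate a batch of data against the suite
results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch])
print(results["success"])

# Optionally render the data docs for a human-readable report
context.build_data_docs()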

The above example assumes you have installed and initialized the package (I created this repo to get you started quickly). It shows how to quickly add a data source to great_expectations, add an expectation to the expectation suite, and then validate data against that suite. Optionally, you can check the rendered data documentation.

Key features

As stated earlier, the first thing we wanted was to break a pipeline if the data is bad, and the second was to observe data quality over time. The data validation feature covers the former easily: just run the validation as part of your pipeline and break if needed. The latter is also covered, as the validation result tells you what went wrong, i.e. which values you did not expect, so you immediately know what is wrong and can act.
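As a rough sketch of that, using the standalone pandas API (the paths are made up, and I assume the suite has been saved as a .json file next to the code):

import great_expectations as ge

def validate_or_fail(csv_path: str, suite_path: str) -> None:
    # Validate one data batch against a saved expectation suite
    # and break the pipeline loudly if any expectation is not met.
    batch = ge.read_csv(csv_path)
    result = batch.validate(expectation_suite=suite_path)
    if not result["success"]:
        # Each entry in result["results"] says which expectation failed
        # and which unexpected values were observed, so you know what to fix.
        failed = [r for r in result["results"] if not r["success"]]
        raise ValueError(f"Data validation failed for {csv_path}: {failed}")

# e.g. at the start of a pipeline run (hypothetical paths):
validate_or_fail("data/orders.csv", "expectations/orders_suite.json")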

Other key features include that the expectation suites live with the code, which is perfect for version control. Furthermore, you can easily add comments to your suites and capture data quality issues (and they render nicely in the automated data documentation).

Since all of it lives in your codebase (and of course you use git), it is straightforward to collaborate within the team. The ability to edit expectations in a notebook (with customizable jinja templates) is a feature that we use constantly: just run great_expectations suite edit <suite_name>!

Testing your data and data distributions. great-expectations allows you to easily test data distributions: simple mean, median, min, max, or quantiles, or more advanced measures such as the Kullback-Leibler divergence. You can use the mostly keyword on most expectations to tolerate a certain fraction of outliers. The simple built-in expectations get you very far!
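A few distribution-oriented expectations as a sketch (the column names, bounds, and reference partition below are invented for illustration):

import great_expectations as ge

df = ge.read_csv("data/orders.csv")  # hypothetical dataset

# Summary statistics and quantiles of a column
df.expect_column_mean_to_be_between("amount", min_value=50, max_value=150)
df.expect_column_quantile_values_to_be_between(
    "amount",
    quantile_ranges={
        "quantiles": [0.05, 0.5, 0.95],
        "value_ranges": [[1, 20], [40, 120], [300, 800]],
    },
)

# Compare the observed distribution to a reference partition
# (in practice built from a reference batch, e.g. training data)
reference = {"bins": [0, 50, 100, 500, 1000], "weights": [0.3, 0.4, 0.2, 0.1]}
df.expect_column_kl_divergence_to_be_less_than("amount", partition_object=reference, threshold=0.1)

# Tolerate up to 1% outliers with the mostly keyword
df.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000, mostly=0.99)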

Of course, there is much more: you can test data freshness against the pipeline runtime, use evaluation parameters generated at runtime, or build your own expectations. There is a ton that I am not covering, such as automated data documentation and data profiling reports, all with automatic HTML renders that make them easy to share and publish.

And on top of all that, you can easily get involved and contribute to the code base — the maintainers are extremely helpful and appreciative!

How we use it

We test the input and output data of each pipeline run, independent of the pipeline code. Our pipelines run as Kubernetes jobs, so we created a simple wrapper that reads the command line arguments, resolves the dataset names, matches them to an expectation suite, and validates the data.

Data validation (blue diamonds) decoupled from the actual pipeline code, controlling failing, alerting, and publishing of the pipeline.

The above image is a simple diagram of what happens when any of our pipelines runs: the input gets validated, and only then does the actual job run. After the job has run, the output data gets validated, and only then is the data published. If validation of the output data fails, we publish into a failed destination so we can inspect the output if needed.
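A hedged sketch of that flow (the job entry point and the publish helpers are hypothetical placeholders, not our actual wrapper code):

import great_expectations as ge

def publish(path: str) -> None:
    ...  # hypothetical: move the output to its published destination

def publish_to_failed_destination(path: str) -> None:
    ...  # hypothetical: keep failed output around for inspection

def run_with_validation(input_path, output_path, input_suite, output_suite, job):
    # 1. Validate the input; break the pipeline on bad input data
    input_result = ge.read_csv(input_path).validate(expectation_suite=input_suite)
    if not input_result["success"]:
        raise ValueError(f"Input validation failed for {input_path}")

    # 2. Run the actual pipeline code, fully decoupled from the validation
    job(input_path, output_path)

    # 3. Validate the output; publish only if it passes
    output_result = ge.read_csv(output_path).validate(expectation_suite=output_suite)
    if output_result["success"]:
        publish(output_path)
    else:
        publish_to_failed_destination(output_path)
        raise ValueError(f"Output validation failed for {output_path}")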

Pain

As soon as we put that into production, the pain began. We had spent significant time profiling the data and running all the pipelines with the new data validation checks in our non-production environment. And we were confident. Still, there were so many things we did not know about our data and how it behaved over time.

For several weeks we had pipelines failing due to the data validations. Some failures we expected, but at some point I felt really horrible for putting one of my colleagues, who was on support rotation, through this: constantly failing runs, repeatedly updating the expectation suites, rinse and repeat. In the beginning, it was hard to figure out when to change thresholds/expectations and when we were looking at an actual problem.

There was simply pain.

Gain

After the first pain subsided, we quickly saw strange data issues. In one instance, this led us to find bugs in some of our data transformation pipelines that only showed up over time. We fixed them -> profit!

In another instance we prevented publishing bad predictions — it turned out that one of the upstream jobs copying data from the enterprise data store had incomplete data 😱. Previously, this bug had caused us to give out bad predictions in one known incident, which had real 💰 consequences for the business. So, preventing us from doing that again is some serious profit!

Conclusion

It is true: no pain no gain!

We all know we have to get our data quality in check. And once you open that can of worms you will quickly feel the pain of dealing with all the data quality issues that you find. But the good news is: it is well worth it. Simple data tests can get you very far and significantly improve the quality of your product.

In the team, I work closely with data scientists, and they now love the data testing. They appreciate the tool and the value it generates for the team and our products. We continuously improve our datasets with it, we collaborate on the tests, and the testing does not get mangled up with the pipeline code.

So, while this journey promises pain, it is well worth it. Start testing your data now! 💯



Data engineer, scientist, and developer. Passionate about driving good outcomes with data and solving complex data problems! 👉 linkedin.com/in/michakunze