Evolving a Data Pipeline Testing Plan

The Perils of Exhaustive Multi-Source Multi-Destination Test-Driven Development

Moussa Taifi PhD
Towards Data Science


(image by author)

The first contact with the ideas of Test-Driven Development (TDD) leaves many beginner data engineers in shock at what TDD promises. Faster development, cleaner code, career advancement, and world domination, to name a few. Yet, the reality is quite different. The initial attempts at applying TDD to data engineering leave many data engineers demoralized. Extracting the value of TDD takes so much effort. It requires a deep knowledge of testing techniques that are not in the beginner DE toolbelt. The process of learning “what” to test is hard. Learning the tradeoffs inherent to applying TDD to data pipelines is even harder.

In this article we look at how to evolve a data pipeline testing plan to avoid feeling the full pain that comes from over-specified testing.

Problem

What are the perils of test-driven development? For all its benefits, TDD can be a dangerous thing for a new data engineer. The initial drive to test everything is strong, and it can lead to sub-optimal design choices. Too much of a good thing, as they say.

The drive to test every single part of the data pipeline is a tempting direction for engineering-minded folks. But in order to preserve one’s sanity, there must necessarily be some restraint. Otherwise, you’ll end up with a jungle of tests surrounded by a sea of red. And balls of mud are never far behind.

For example, say we have the following data pipeline:

(image by author)

We have three data sources, six transformations, and two data destinations.

What would an inexperienced data engineer produce as a test plan?
We’ve all been there.

Solution #1: Exhaustive Multi-Source Multi-Destination Paths with Edge Length > 0

Using the classic three-part testing framework (unit, integration, and end-to-end tests), we can safely assume that our data engineer will start with this:

  • Unit tests ✅ : Sure, take each transformation, generate some sample input data for it, run the sample data through that step of the pipeline, capture the results, and use the output to validate the transformation logic (a minimal sketch follows this list).
  • E2E test ✅ : We are gonna need to run the pipeline on full prod data anyways, so let’s run the whole pipeline on a sample from the production data, capture the results, and use that output to validate the end-to-end pipeline.
  • Integration tests ❓❓❓: But what to do here? The first inclination is to build one test for each combination of the transformation stages.
(image by author)
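
As a concrete version of the unit-test bullet above, here is a minimal pytest-style sketch. The transformation dedupe_events, its columns, and the expected values are all hypothetical stand-ins for one of the six steps, not the actual pipeline code.

    # test_transformations.py -- hypothetical unit test for one pipeline step.
    import pandas as pd

    def dedupe_events(df: pd.DataFrame) -> pd.DataFrame:
        """Hypothetical transformation: keep the latest row per event_id."""
        return (
            df.sort_values("updated_at")
              .drop_duplicates(subset="event_id", keep="last")
              .reset_index(drop=True)
        )

    def test_dedupe_events_keeps_latest_row():
        # Small hand-written sample input, no real sources involved.
        sample = pd.DataFrame({
            "event_id": [1, 1, 2],
            "updated_at": ["2023-01-01", "2023-01-02", "2023-01-01"],
            "value": [10, 20, 30],
        })
        result = dedupe_events(sample)
        # The duplicate event_id=1 should collapse to its latest value.
        assert len(result) == 2
        assert result.loc[result["event_id"] == 1, "value"].item() == 20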

After some back-of-the-envelope calculation, our data engineer starts internalizing the fact that the number of combinations of the 6 transformation steps grows rapidly. There must be a better way.

(image by author)

Testing every single combination with edges of variable length is not gonna meet the deadline we promised to the customer. We should have budgeted for more time.
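
To make “grows rapidly” concrete, here is a back-of-the-envelope sketch that assumes, purely for illustration, that the six transformations form a single linear chain; the real pipeline’s branching from three sources to two destinations only makes the count larger.

    from itertools import combinations

    n_transforms = 6
    # Contiguous sub-paths (start, end) with at least one edge, assuming a
    # simple chain t1 -> t2 -> ... -> t6 purely for illustration.
    sub_paths = list(combinations(range(1, n_transforms + 1), 2))
    print(len(sub_paths))  # 15 candidate integration tests for the chain alone,
                           # before multiplying in sources and destinations.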

Solution #2:

“OK OK, but it can’t be that bad.”

Yes, considering that integration tests are not going to be touching real sources and sinks, so be it: let’s plug these 6 transformations together. We get the graph combinations below, which works out to approximately 10 integration tests.

(image by author)
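
For the record, a hedged sketch of what one of those integration tests might look like: two adjacent (entirely hypothetical) transformations plugged together on an in-memory frame, with no real sources or sinks involved.

    import pandas as pd

    # Hypothetical adjacent steps from the pipeline; names are placeholders.
    def normalize_currency(df: pd.DataFrame) -> pd.DataFrame:
        # Convert cents to dollars for the downstream steps.
        return df.assign(amount=df["amount_cents"] / 100).drop(columns="amount_cents")

    def aggregate_daily(df: pd.DataFrame) -> pd.DataFrame:
        # Roll amounts up to one row per day.
        return df.groupby("day", as_index=False)["amount"].sum()

    def test_normalize_then_aggregate():
        sample = pd.DataFrame({
            "day": ["2023-01-01", "2023-01-01", "2023-01-02"],
            "amount_cents": [100, 250, 400],
        })
        result = aggregate_daily(normalize_currency(sample))
        # Two days in, two aggregated rows out, with dollar totals.
        assert list(result["amount"]) == [3.5, 4.0]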

But remember, this is data engineering, which means that the input data is out of our control and changes over time. So we need to add the data-centric tests in there. (You surely know better, but let’s follow this argument.)

(image by author)
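
By “data-centric tests” I mean checks on the data itself rather than on the code: schema, nulls, ranges, freshness. A minimal hand-rolled sketch with a hypothetical schema (dedicated tools such as Great Expectations or pandera cover this ground more thoroughly):

    import pandas as pd

    def check_events_frame(df: pd.DataFrame) -> list[str]:
        """Return a list of data problems instead of assuming the input is clean."""
        problems = []
        expected_cols = {"event_id", "day", "amount"}  # hypothetical contract
        missing = expected_cols - set(df.columns)
        if missing:
            problems.append(f"missing columns: {sorted(missing)}")
        if "event_id" in df.columns and df["event_id"].isna().any():
            problems.append("null event_id values found")
        if "amount" in df.columns and (df["amount"] < 0).any():
            problems.append("negative amounts found")
        return problems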

Solution #3:

“Right, but can’t we summarize this somehow? There must be a core set of data scenarios we absolutely need to support, like a priority list of data validation checks?”

Sure, but even if we delay the data validation checks we still get this picture:

(image by author)

“Do we really need this much testing? Isn’t acceptance testing about testing what the user sees? Can’t we sacrifice the developer experience in order to deliver a good product, on time, that solves the customer’s problem?”

Sure, yes, the next logical step is to only run the e2e tests and move on with our lives. However, there is a middle step that solves both problems: “too many integration tests” and the need for “solid data validation tests”. You’ve probably used it before but didn’t have a name for it: “Inline assertions”. This is quite a useful trick from the defensive programming tradition.

Solution #4:

The core idea of these “Inline Assertions” is that, when possible, you build the whole pipeline as a monolith that includes assertions about both the code interfaces AND the data interfaces between the components of your monolith.

(image by author)
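
A minimal sketch of the idea, with hypothetical step and column names: one monolithic run function whose inline assertions cover both the code interfaces (the columns each step relies on) and the data interfaces (row counts, value ranges) between the components.

    import pandas as pd

    def run_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
        # Inline assertions on the incoming data interface.
        assert {"event_id", "day", "amount_cents"} <= set(raw.columns), "unexpected input schema"
        assert len(raw) > 0, "empty input batch"

        normalized = raw.assign(amount=raw["amount_cents"] / 100)
        # Inline assertion on the code interface the next step relies on.
        assert "amount" in normalized.columns

        daily = normalized.groupby("day", as_index=False)["amount"].sum()
        # Inline assertions on the outgoing data interface.
        assert (daily["amount"] >= 0).all(), "negative daily totals"
        assert len(daily) <= len(raw), "aggregation should not add rows"
        return daily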

That’s it: you put that in the red-green-refactor development loop, and keep growing that list of assertions as random things happen.

We are running a little hot here, but notice that we are using production data sources, and production sinks. If you are in a bind, go for it. If you have some time, at least create dedicated testing sinks, and remember to put limits on the number of rows you get from the input data sources.
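
As a hedged illustration of that last point, here is a sketch of a capped test run; the connection URL, table names, row cap, and the stand-in aggregation are all assumptions, not the real pipeline.

    import pandas as pd
    from sqlalchemy import create_engine

    def run_test_batch(db_url: str, max_rows: int = 10_000) -> None:
        engine = create_engine(db_url)
        # Cap what we pull from the production source so test runs stay cheap.
        raw = pd.read_sql(f"SELECT * FROM events LIMIT {max_rows}", engine)
        # Stand-in for the real transformations.
        daily = raw.groupby("day", as_index=False)["amount"].sum()
        # Write to a dedicated test sink, never to the production table.
        daily.to_sql("daily_totals__test", engine, if_exists="replace", index=False)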

This might seem obvious to you, but we are all learning how to build data pipelines that solve the customer’s problem here :)

How would that look in the revered Testing Pyramid?

(image by author)

Let’s call that DEE2E++ Testing.
Data Engineering End to End ++ Testing.

There seem to be two flavors of DEE2E++ Testing:

  • Ubiquitous Anti-Corruption Layers (U-ACL)
  • Mostly-Warnings Anti-Corruption Layer (MW-ACL)

On the Ubiquitous side it looks like the following:

(image by author)

Each transformation gets one input anti-corruption layer that protects it from upstream changes, and one output anti-corruption layer that protects the downstream consumers from the current transformation’s internal changes. If something changes in the upstream data schema or contents, the input ACL will stop the processing and report the error to the user. And if we change something in the data schema or contents of the current transformation, the output ACL will catch the error and stop the processing.
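
A sketch of what a U-ACL could look like in code, using a hypothetical decorator that runs a blocking input check before the transformation and a blocking output check after it:

    import functools
    import pandas as pd

    def with_acl(input_check, output_check):
        """Wrap a transformation with blocking input and output anti-corruption layers."""
        def decorator(transform):
            @functools.wraps(transform)
            def wrapper(df: pd.DataFrame) -> pd.DataFrame:
                input_check(df)    # raises if the upstream contract is broken
                out = transform(df)
                output_check(out)  # raises if we broke the downstream contract
                return out
            return wrapper
        return decorator

    def expects_columns(*cols):
        def check(df: pd.DataFrame) -> None:
            missing = set(cols) - set(df.columns)
            if missing:
                raise ValueError(f"ACL violation, missing columns: {sorted(missing)}")
        return check

    @with_acl(input_check=expects_columns("event_id", "amount_cents"),
              output_check=expects_columns("event_id", "amount"))
    def normalize_currency(df: pd.DataFrame) -> pd.DataFrame:
        # Hypothetical transformation body.
        return df.assign(amount=df["amount_cents"] / 100)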

This is quite a lot of work for a starting data engineer. Adding mandatory validation rules on each and every transformation will push the data eng toward “batch work”. Instead of breaking the pipeline into multiple steps, they will say to themselves: “If I have to add these ACLs for each transform, that’s gonna be 2 times the number of transformations. I might as well add just two. One at the top of the pipeline and one at the bottom. I’ll deal with the internals of the transformations on my own.” That is a valid initial approach where the focus is on 1) top-level ingestion logic and 2) customer-visible data outputs. The issue with this strategy is that we lose one of the benefits of tests: localizing bugs. If there is a bug in transformation 4 of 6, then the ACL tests will only show that the final output is invalid, not that transformation 4 is the culprit.

In addition, as the data sources evolve, what we are talking about here is 90% warnings and 10% blocking errors. Just because a new column showed up in the input data does not mean that the whole pipeline should fail. And just because the distribution mean of some column has shifted a bit does not mean that all of the data is invalid. Customers might still be interested in the freshest available data for making their business decisions, and can do reconciliation later if needed.

For that, you need the “Mostly-Warnings Anti-Corruption Layer”.

(image by author)

It notifies the dev that there is something wrong, but does not stop the processing. It plays the same role as the input and output ACLs, but for each transformation, and it is much more tolerant of change. This type of ACL emits warnings and metrics, and the dev can prioritize the warnings later. If something is completely out of hand, the dev can backfill the data after fixing the data transformation.
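
A sketch of the MW-ACL flavor, under the same hypothetical schema: the checks log warnings (a real version would also emit metrics), and only a small set of critical rules actually stops the job.

    import logging
    import pandas as pd

    logger = logging.getLogger("pipeline.acl")

    def mw_acl(df: pd.DataFrame, stage: str) -> pd.DataFrame:
        """Mostly-warnings check between stages: report, but keep the data flowing."""
        expected = {"event_id", "day", "amount"}  # hypothetical contract
        unexpected = set(df.columns) - expected
        if unexpected:
            # A new column is a warning, not a reason to stop the pipeline.
            logger.warning("%s: unexpected columns %s", stage, sorted(unexpected))
        if "amount" in df.columns and df["amount"].mean() > 1_000:
            # A drifting mean is worth a look, but the data may still be usable.
            logger.warning("%s: amount mean drifted to %.2f", stage, df["amount"].mean())
        if df.empty:
            # One of the few blocking rules: an empty batch fails the job.
            raise ValueError(f"{stage}: empty batch, stopping")
        return df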

Obviously, YMMV. If the current data pipeline’s output has few established human users, then communicating with the consumers will help the new dev learn about the domain. On the other hand, if the pipeline has a multitude of established automated data pipelines that consume its output, then DEE2E++ Testing might not be sufficient. However, new data devs are probably not assigned on day one to business-critical data pipelines that impact hundreds of data consumers. So, instead of forcing new data devs to be crushed by both foreign testing techniques and mission-critical domains, the DEE2E++ method can be a good starting point.

Here is the DEE2E++ diagram again.

(image by author)

“Wait, wait, wait, are you saying that each component is only gonna get a Warnings-Only Anti-Corruption Layer?”

Not “Warnings-Only”, “Mostly-Warnings”. Some of the assertions will certainly stop the processing and fail the job. But yes, that’s the idea. If you go with the “Ubiquitous Anti-Corruption Layer” strategy, then you will need more time. As the domain becomes clearer, you can add stricter ACLs around critical pieces of the data pipeline. This domain understanding will help rank the transformations in terms of complexity. As you identify the complex ones that need extra care, you can move from the “MW-ACL”s to the “U-ACL”s to protect highly critical business logic, for example.
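
One way to stage that escalation, sketched with a hypothetical strict switch: every check reports through the same helper, and flipping the flag on a critical transformation turns its warnings into hard failures.

    import logging

    logger = logging.getLogger("pipeline.acl")

    def report(problem: str, stage: str, strict: bool = False) -> None:
        """Escalate a check from MW-ACL (warn) to U-ACL (fail) per stage."""
        if strict:
            raise ValueError(f"{stage}: {problem}")
        logger.warning("%s: %s", stage, problem)

    # Early on everything warns; later, critical stages get strict=True:
    # report("negative amounts found", stage="transformation_4", strict=True)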

“I mean, yes, but then why bother with the unit tests? Aren’t they covered by the inline assertions?”

Sure, fine, let’s remove them.

(image by author)

OK? I guess we can all go back to work now.

Conclusion

In short, the common principles of test-driven development can be quite overwhelming for a new data engineer. It is important to remember that TDD is a design tool, not a law. Use it wisely, and it will serve you well. But use it too much, and you’ll find yourself in a world of hurt.

In this article we examined what can happen with over-specified testing. First, we took a seemingly simple data pipeline and saw what happens when we fall into the trap of “Exhaustive Multi-Source Multi-Destination Paths”. Then we observed how integration tests are only the tip of the iceberg when compared to the data-centric tests. Finally, we found out that a good place to start for beginner data engineers is to focus on E2E tests with “Mostly-Warnings Anti-Corruption Layers” as inline tests. This DEE2E++ Testing strategy has two benefits. First, the fresh data engineer will not give up on testing from day 1. Second, it gives the devs breathing room to learn about the domain and iterate on their data pipeline design using their existing basic knowledge of data engineering. Instead of immediately getting lost in the micro-level of TDD, they can deliver working software to the stakeholders, and then build on the protection that DEE2E++ Testing provides to add more fine-grained tests as the requirements evolve.

So there you have it. The perils of test-driven development. May you all avoid them, and may your data pipelines be ever green.

Want to learn more about modern data pipelines testing techniques?

Check out my latest book on the subject. This book gives a visual guide to the most popular techniques for testing modern data pipelines.

Book (2023): Modern Data Pipelines Testing Techniques on Leanpub.

See ya!

Disclaimer: The views expressed on this post are mine and do not necessarily reflect the views of my current or past employers.
