Quality Assurance in Data Pipelines

Techniques to create trust in your data solution

Travis Cooper
Towards Data Science



A core responsibility of a data engineer is to build data pipelines. Moving data from one location to another, with mutations along the way, is a primary focus of the role. However, changing data makes people nervous. Even though you are actively transforming the data from its raw form into something usable, it is no longer the data that was originally collected. To mitigate these fears, it is vital that data engineers build as many data integrity measures into the solution as possible.

Techniques in quality assurance

1. Byte-by-Byte Comparison

A byte-by-byte comparison confirms the integrity of the data end to end. It involves the following steps:

  1. Ingest the source data, transform it, and load it into the target destination
  2. Read the data in the target destination, reverse the transformations, and compare it byte-by-byte to the original source

While a byte-by-byte comparison can confirm the integrity of the data, it is a very expensive operation with respect to time and resources. It also adds complexity to the development work: you must develop not only the transformations but also the logic to reverse them. However, it can be executed alongside the regular ingestion process to ensure data integrity.

Use Case: This can be used in most pipelines, provided every transformation is reversible, since it relies on undoing the transformations to validate the data against the source.
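The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the transformation (storing a dollar amount as integer cents) and the record layout are hypothetical, chosen because the transformation is cleanly invertible.

```python
def transform(row):
    # Forward transformation: store the dollar amount as integer cents.
    name, dollars = row
    return (name, round(dollars * 100))

def reverse_transform(row):
    # Inverse transformation: convert integer cents back to dollars.
    name, cents = row
    return (name, cents / 100)

# Step 1: ingest the source, transform it, and "load" it into the target.
source = [("alice", 19.99), ("bob", 5.0)]
target = [transform(r) for r in source]

# Step 2: read the target, reverse the transformations, and compare
# value-by-value against the original source.
round_tripped = [reverse_transform(r) for r in target]
assert round_tripped == source, "Data integrity check failed"
```

Note that this only works because every step has a well-defined inverse; a lossy transformation (e.g. trimming whitespace) cannot be validated this way.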

2. Checksums

Checksums allow us to check data integrity with far less overhead than a byte-by-byte comparison. There are several ways to calculate a checksum, but it typically involves hashing the data, or combining per-row hashes into a single value. In other words, if you calculate the checksum of the source data and compare it to the checksum of the destination data, the two values should match so long as the data has not mutated.

Use Case: This can be used where data sets are not expected to change from source to destination.
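As a rough sketch, one way to combine per-row hashes is to XOR their digests, which makes the result independent of row order (row order often differs between source and destination systems). The row format here is hypothetical.

```python
import hashlib

def checksum(rows):
    """Order-independent checksum: XOR together a SHA-256 digest per row."""
    combined = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        combined ^= int.from_bytes(digest, "big")
    return combined

source = [("alice", 30), ("bob", 25)]
destination = [("bob", 25), ("alice", 30)]  # same rows, different order

# Matching checksums indicate the data survived the move unchanged.
assert checksum(source) == checksum(destination)

# A single mutated value produces a different checksum.
assert checksum(source) != checksum([("alice", 31), ("bob", 25)])
```

One design caveat: because XOR is self-cancelling, duplicate rows can mask each other, so a scheme like this should be paired with a row count.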

3. Row Counts

An inexpensive operation to confirm data integrity is a simple row count. The number of rows in the source and the number of rows in the destination should be the same; if not, there may have been an unanticipated transformation. However, row counts should only be used as a first check of the data. There are many issues with relying on row counts to validate the data in its entirety: they do not validate the actual contents, nor do they catch NULL values that may have been introduced during the migration.

Use Case: This can be used where data sets are not expected to change from source to destination.
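A row-count check is usually a pair of `COUNT(*)` queries. The sketch below uses an in-memory SQLite database with hypothetical `source` and `target` tables standing in for the two ends of a pipeline.

```python
import sqlite3

# Hypothetical source and target tables in one in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source(id INTEGER, name TEXT);
    CREATE TABLE target(id INTEGER, name TEXT);
    INSERT INTO source VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO target SELECT * FROM source;  -- stand-in for the pipeline
""")

src_count = conn.execute("SELECT COUNT(*) FROM source").fetchone()[0]
tgt_count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]

# First-pass integrity check: the counts must agree.
assert src_count == tgt_count, f"Row count mismatch: {src_count} != {tgt_count}"
```

In practice the two queries run against different systems, so this check is cheap even when the tables are large.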

4. Unit Tests

Unit tests are a standard for any code you develop, especially data pipelines that actively mutate the data. Pipelines can be tested at both the micro and macro levels. A micro unit test looks at an individual method. These tests are easier to create as you do not need a full data set to confirm the method's behavior (e.g. if a function is meant to filter out null values, a simple matrix with various data types will suffice). A macro unit test focuses on the pipeline as a whole. It is more difficult to create as it requires a full data set to ensure all transformation steps are performing as expected. Unit tests are essential to ensure data integrity remains intact during any and all mutations in a data pipeline.

Use Case: This can be used in any pipeline, regardless of how the data is transformed, since the tests are written against the pipeline's own code.
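A micro-level test for the null-filtering example above might look like the following sketch. The function name and the small "matrix" of rows are hypothetical; in a real project this would live in a test suite run by a framework such as pytest.

```python
def drop_null_rows(rows):
    """The unit under test: keep only rows with no None values."""
    return [r for r in rows if all(v is not None for v in r)]

def test_drop_null_rows():
    # A small hand-built matrix covering several shapes of null-ness.
    rows = [(1, "a"), (2, None), (None, "c"), (3, "d")]
    assert drop_null_rows(rows) == [(1, "a"), (3, "d")]
    assert drop_null_rows([]) == []           # empty input
    assert drop_null_rows([(None, None)]) == []  # fully-null row

test_drop_null_rows()
```

A macro-level test would instead run a representative data set through every stage of the pipeline and assert on the final output.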

Final Thoughts

As a data engineer, data integrity should be a top priority when building a pipeline. Not only does it give you confidence that your code is working as expected, but it also gives users confidence that their data is accurate. Build integrity into your solutions by verifying that any movement or mutation of data behaves as expected. Hopefully, the techniques above provide a good starting point for building reliable solutions.

