
If you do not have the time to read the full article, consider reading the 30-second version.
Synopsis
If you have Machine Learning (ML) pipelines in production, you have to worry about the backward compatibility of changes made to the pipeline. It may be tempting to increase test coverage, but a high test coverage cannot guarantee that your recent changes have not broken the pipeline or generated low-quality results. To do that, you need to develop end-to-end tests that can be executed as part of the Continuous Integration (CI) pipelines. Developing such a test requires sampling the dataset that powers the pipeline from a run that produced acceptable results and of which you have in-depth knowledge. Once you have the sampled data, run the stable version of your ML pipeline, e.g., the master branch, to produce the expected result. When you have a feature branch, run the branch on the sampled data, compare the actual result with the expected result, and consider the test green when the difference is acceptable.
If you are intrigued by how this can be done in a bit more detail, please check out the rest.
Background

A lot of machine learning pipelines are best abstracted as a directed acyclic graph (DAG) of task functions (see Figure 1). Underneath those task functions lie hundreds to thousands of lines of code. As your codebase matures, you adopt a branching strategy such as Gitflow. When you make changes to your pipeline, you typically do so through feature branches. Every commit to such a branch triggers static code analysis and a round of unit tests, giving you a red or green signal on what worked and what did not. After a few feedback rounds, you are happy with the changes. So you merge the branch and think happy thoughts for a job well done. A few days later, you are ready for the production runs with newfound confidence from the features that you recently rolled out. The morning of the first run, you are ready to bask in the glory of the successful run that finished overnight. When you look at the run status, you see what you did not want to see: the run failed! The new and improved imputation technique that you implemented failed to handle a division by zero. You are annoyed, since a unit test could have caught that easily. So you fix the code, add a test, and start the run again. Some time later, you check the run status; it failed again. This time it is mishandled string formatting while logging. If you are lucky, this is the last issue that you need to fix. However, you may not be that lucky, and there may be five other issues out there waiting to fail the run.
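To make the abstraction concrete, here is a minimal sketch of a pipeline expressed as a DAG of task functions. The task names and the toy runner are hypothetical and only illustrate the structure, not the actual pipeline discussed in this article.
# Minimal sketch: a pipeline as a DAG of task functions.
# Task names and the toy runner are hypothetical, for illustration only.
from graphlib import TopologicalSorter

def clean_data():      print("cleaning raw tables")
def build_features():  print("combining and aggregating tables")
def train_model():     print("training the model")
def write_results():   print("scoring and writing results")

# Map each task to the set of tasks it depends on.
PIPELINE = {
    clean_data: set(),
    build_features: {clean_data},
    train_model: {build_features},
    write_results: {train_model},
}

def run(dag):
    # Execute the task functions in dependency order.
    for task in TopologicalSorter(dag).static_order():
        task()

run(PIPELINE)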
Honestly speaking, it is a losing battle. It is not easy to maintain test coverage for an ever-growing, ever-changing codebase, a big part of which is hard to test due to randomness, complicated data structures, complicated logic, and so on. Mobility in teams does not help either. The biggest challenge of all is generating test data, which can be cumbersome if done manually.
Solution Approach
While there is no bulletproof solution, there is a pragmatic approach as described below.
- Split the data handling functions so that they save their intermediate results as clean tables, which enables reproducing the run.
- Execute the modified code to produce testcase tables from the clean tables.
- Run the ML pipeline end-to-end on the testcase tables to produce the expected result of a run. For any subsequent feature branch, adopt a variation of the ML pipeline that runs on the testcase tables and compares its result with the expected result.
Let's flesh out those steps a bit more in the following.
Step 1: Adapting data handling functions

Figure 2 illustrates an approach to enhance the ML pipeline to generate test data. Normally, all data-related activities, such as cleaning individual tables, combining multiple tables, and aggregating to meaningful dimensions, are done in the same task without preserving any intermediate steps (see Figure 2(a)). We advocate splitting that task into two tasks: Datafeeds and Dataprep (see Figure 2(b)). In the first task, we handle all activities related to cleaning individual data tables and save the output as clean tables. This should be done in such a way that if a clean table goes through the same process again, it produces a copy of itself, i.e., f(A) → A'; f(A') → A'. All data combination and aggregation is covered in the Dataprep task.
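For concreteness, here is a hedged sketch of what the Datafeeds/Dataprep split could look like; the column names, cleaning rules, and aggregation keys below are illustrative assumptions, not the actual code of any particular pipeline. The important property is that Datafeeds is idempotent: applying it to an already clean table returns an equivalent table.
# Illustrative sketch of the Datafeeds/Dataprep split; column names,
# cleaning rules, and aggregation keys are hypothetical assumptions.
import pandas as pd

def datafeeds(raw: pd.DataFrame) -> pd.DataFrame:
    # Clean a single table. Idempotent: datafeeds(datafeeds(x)) equals datafeeds(x).
    clean = raw.copy()
    clean.columns = [c.strip().lower() for c in clean.columns]  # normalise headers
    clean = clean.drop_duplicates()                             # stable under re-runs
    clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce").fillna(0.0)
    return clean

def dataprep(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    # Combine and aggregate clean tables into a model-ready table.
    joined = orders.merge(customers, on="customer_id", how="left")
    return joined.groupby("customer_id", as_index=False)["amount"].sum()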
Step 2: Generate testcase tables
It is not necessary to run the full pipeline to generate clean and testcase tables. We can create a variant of the ML pipeline that only runs the data handling functions (see Figure 3).
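As a rough sketch, assuming the cleaning step can be called as a plain function, such a data-only variant might look like the following; the function and path names are hypothetical.
# Hypothetical "data-only" variant: run only the data handling step on every
# raw table and persist the clean tables; the modelling tasks are skipped.
from pathlib import Path
from typing import Callable, Dict
import pandas as pd

def run_data_only(raw_tables: Dict[str, pd.DataFrame],
                  clean_fn: Callable[[pd.DataFrame], pd.DataFrame],
                  output_dir: str = "clean_tables") -> None:
    Path(output_dir).mkdir(exist_ok=True)
    for name, raw in raw_tables.items():
        clean_fn(raw).to_parquet(f"{output_dir}/{name}.parquet")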

When the tables are generated, the clean tables may be saved in different locations than the testcase tables. For example, we may want to save the clean tables in a secure datalake, since they may contain sensitive information, whereas we may want to save the testcase tables in a more easily accessible place for rapid use. On a separate note, some end-to-end tests may be smaller in nature, e.g., smoke tests, which require smaller tables. For these reasons, we need to generate sampled, synthetic tables instead of using the original clean tables.
To create a sampled, synthetic dataset out of the original dataset while preserving its statistical properties, we can use data synthesis libraries such as the Synthetic Data Vault (SDV). To use the library, fit a model that learns from a clean table and save it:
# Fit a synthesizer on a clean table (a pandas DataFrame) and persist it.
from sdv.tabular import GaussianCopula

model = GaussianCopula()
model.fit(clean_table)         # clean_table: pandas DataFrame of the clean data
model.save('clean_table.pkl')  # save the fitted model for later reuse
Once the model is saved, it can be loaded later to generate all sorts of testcase data, such as the following:
# Load the fitted model, sample a small testcase table, and sanity-check
# that the synthetic data still resembles the clean data.
from sdv.tabular import GaussianCopula
from sdv.evaluation import evaluate

model = GaussianCopula.load('clean_table.pkl')
testcase_table = model.sample(200)

# evaluate() takes the synthetic data first and returns a score between
# 0 and 1; higher means the testcase table is more similar to the clean table.
assert evaluate(testcase_table, clean_table) > 0.8
Step 3: Perform test runs

The test run includes the execution of two different pipelines (see Figure 4). First, the stable version, i.e., the master branch, runs on the testcase tables to produce the expected results. Then the feature branch runs on the same tables. However, in its last step, the feature branch runs an assert function that reads the expected result and compares it against the actual result of the run.
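As a rough sketch of that assert step (the file location, tolerance, and helper name below are assumptions, not a prescribed implementation), the comparison could look like this:
# Hypothetical final task of the feature-branch test pipeline: compare the
# actual run result against the expected result produced by the master branch.
import pandas as pd

def assert_results_match(expected_path: str,
                         actual: pd.DataFrame,
                         rel_tolerance: float = 1e-3) -> None:
    expected = pd.read_parquet(expected_path)

    # Structural backward compatibility: same columns, same number of rows.
    assert list(expected.columns) == list(actual.columns), "column mismatch"
    assert len(expected) == len(actual), "row count mismatch"

    # Numeric columns must agree within a relative tolerance; this absorbs
    # harmless non-determinism while still catching real regressions.
    numeric = expected.select_dtypes(include="number").columns
    pd.testing.assert_frame_equal(
        expected[numeric].reset_index(drop=True),
        actual[numeric].reset_index(drop=True),
        rtol=rel_tolerance,
        check_dtype=False,
    )
If the difference is within the tolerance, the feature branch run is considered green; otherwise the CI job fails.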
A few practical notes
Locations for saving test cases may vary depending on the size of the data. If the dataset is small, you may prefer to save it in the same repository where the codebase is maintained. If the dataset is a bit larger, it should not be saved in the repository, but in a low-security datalake accessible to the continuous integration system.
The master branch test pipeline should be triggered whenever a new commit lands on it, i.e., when the latest release/develop branch is merged into master. How often the feature branch test pipeline is triggered should depend on the runtime of the pipeline. If it takes a few minutes, the pipeline can be triggered after every commit. However, if it takes much longer, it should be triggered on rarer occasions, such as when a pull request is created, when a branch is approved, and so on.
The size of the sampled data determines how robust the test will be. The smaller the sample, the cheaper the test is to run, but the weaker it is in terms of robustness. Choose whatever your heart desires!
Disclaimers
In this article, I have expressed my opinion based on common sense and experiences. I do not assume that it will match your reality as is. However, like my code, my opinion has versions. It will not change that much by next week or month, but it will probably change a lot by next year. If you do not agree with me or prefer a variation of what I proposed, please provide feedback in the comments.