Don’t Let Your Data Fail You

Continuous Data Validation with whylogs and GitHub Actions

Published in Towards Data Science · 8 min read · Jul 12, 2021

From the beginning to the end of your ML pipeline, data is the lowest common denominator. But this prevalence has a downside: almost every problem in your ML pipeline either originates from or affects data in one way or another, often in complex and intricate ways. For example, bad data at serving time, whether introduced by external sources or produced by your own data transformation pipeline, will not only affect your current predictions but will also be fed back into the loop during future model retraining.

This is just one of many examples. The bottom line is that ensuring data quality should be among your top priorities when developing your ML pipeline, and data validation is a key component of that. In this article, we’ll show how whylogs can help. We’ll first introduce the concept of constraints and how to generate them. Once created, these constraints can be integrated directly into the pipeline by applying them during your whylogs logging session. In the last section, we’ll see another way of validating your data: applying these sets of constraints as part of a Continuous Integration pipeline with the aid of GitHub Actions.

The code and files for this article can be found in the project’s repository. You can also find more information about generating constraints in this example notebook, and about the whylogs integration with GitHub Actions in this example repository.

Let’s get started.

Constraints generation in whylogs
∘ Value Constraints vs Summary Constraints
∘ Assembling Dataset Constraints
∘ Applying the Constraints to a Dataset
GitHub Actions with whylogs
∘ Overview
∘ Configuring the workflow
∘ Workflow syntax
∘ Constraint Definition
What’s Next

Constraints generation in whylogs

To validate our data, we need an efficient way of expressing what we expect from it. In whylogs, that is done through constraints: rules you create to assert that your data lies within an expected range. These constraints are, in turn, applied to features of your dataset, and can be organized so that one feature has multiple constraints and one constraint is applied to multiple features.

Value Constraints vs Summary Constraints

Constraints can be checked against individual values or against a complete dataset profile. For a value constraint, a boolean relationship is verified against each value of a feature in a dataset. For the summary constraint, however, the relationship is verified against a whylogs “profile”, which is a collection of summary statistics for a dataset that whylogs has processed.
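To make the distinction concrete, here is a minimal plain-Python sketch. It deliberately does not use the whylogs API: the helper names (check_values, summarize, check_summary) are hypothetical, and the sample values are made up for illustration.

```python
import operator

# Hypothetical sample of one feature's values (made up for illustration).
values = [1.2, 3.1, 3.9, 2.4, 0.7]


def check_values(values, op, threshold):
    """Value-style check: evaluate the relation on every single value."""
    failed = sum(1 for v in values if not op(v, threshold))
    return {"total_run": len(values), "failed": failed}


def summarize(values):
    """A tiny stand-in for a profile: summary statistics only, no raw values."""
    return {"min": min(values), "max": max(values), "mean": sum(values) / len(values)}


def check_summary(profile, field, op, threshold):
    """Summary-style check: evaluate the relation once, on one statistic."""
    return {"total_run": 1, "failed": 0 if op(profile[field], threshold) else 1}


# One check per record: is every value less than 3.6?
value_result = check_values(values, operator.lt, 3.6)
# One check per profile: is the minimum greater than or equal to 0?
summary_result = check_summary(summarize(values), "min", operator.ge, 0)

print(value_result)
print(summary_result)
```

The value check runs once per record and counts each failure, while the summary check runs a single time against a statistic that has already been aggregated.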

For example, let’s assume that we want to create a constraint to ensure that a feature’s value should be less than 3.6 for every record in the dataset. This can be done through a ValueConstraint:

The ValueConstraint takes two arguments: the type of binary comparison operator (“less than”) and the static value to be compared against the incoming stream.

Then, we simply convert the constraint from protobuf to JSON, which would yield the JSON-formatted output:

The name is generated automatically by default, but it can also be customized by passing a name argument to ValueConstraint.
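As a rough illustration of the naming and serialization behaviour described above, the following sketch uses a hypothetical stand-in class rather than the whylogs ValueConstraint itself. The field names are assumptions; the default name simply mirrors the "value LT 548250" style that appears in the constraint report later in this article.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SimpleValueConstraint:
    """Hypothetical stand-in for a value constraint; not the whylogs class."""

    op: str                     # comparison operator code, e.g. "LT"
    value: float                # static value to compare incoming values against
    name: Optional[str] = None  # optional custom name

    def __post_init__(self):
        # Generate a default name when none is supplied.
        if self.name is None:
            self.name = f"value {self.op} {self.value}"


default_named = SimpleValueConstraint("LT", 3.6)
custom_named = SimpleValueConstraint("LT", 3.6, name="max_allowed")

# Serialize to JSON so the definition can be inspected or stored.
print(json.dumps(asdict(default_named), indent=2))
```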

Similarly, we can generate a constraint against a statistical property with SummaryConstraint:

The above code yields:

Assembling Dataset Constraints

We’ve seen how to create individual constraints, but for a given dataset, we’d like to group a number of them together so we have an overall description of what our data should look like, and then apply this list of constraints to our dataset.

To demonstrate, let’s use the LendingClub Dataset from Kaggle. The subset used here contains 1000 records of loans made through the LendingClub platform. We’ll create some constraints to validate three features of the dataset:

loan_amnt: the amount of the loan applied for by the borrower;
fico_range_high: the upper boundary of the range that the borrower’s FICO score at loan origination belongs to;
annual_inc: the borrower’s annual income.

For loan_amnt, we’ll set upper and lower boundaries of 548250 and 2500, and for fico_range_high a minimum value of 4000. Lastly, let’s assert that annual_inc has only non-negative values.

Which would give us the following:

To persist our constraints and re-use them, we can save them in a JSON file. We’ll need the file in the next section where we’ll integrate our constraints in the CI/CD pipeline.
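The grouping and persistence steps can be sketched in plain Python as follows. This is not the whylogs serialization format: the dictionary layout, the key names, and the constraints.json file name are all assumptions made for illustration. The thresholds are the ones listed in the constraint report in the next section.

```python
import json
import os
import tempfile

# Hypothetical layout: feature name -> list of constraints on that feature.
dataset_constraints = {
    "loan_amnt": [
        {"kind": "value", "op": "LT", "value": 548250},
        {"kind": "value", "op": "GT", "value": 2500.0},
    ],
    "fico_range_high": [{"kind": "value", "op": "GT", "value": 4000}],
    "annual_inc": [{"kind": "summary", "field": "min", "op": "GE", "value": 0}],
}

# Persist the definitions so another process (for example a CI job) can reuse them.
path = os.path.join(tempfile.mkdtemp(), "constraints.json")  # hypothetical file name
with open(path, "w") as f:
    json.dump(dataset_constraints, f, indent=2)

# Reload and confirm nothing was lost in the round trip.
with open(path) as f:
    reloaded = json.load(f)

print(reloaded == dataset_constraints)
```

Keeping the definitions in a file separates what the data should look like from the code that checks it, which is what lets the same constraints be reused in a CI job.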

Applying the Constraints to a Dataset

Once our constraints are created, we can finally apply them to our dataset. To do so, we simply pass our constraints as an argument to log_dataframe() while we log the records into the dataset.

The report can be accessed via dc.report() and displayed after some basic formatting to make it more readable:

Constraint failures by feature - 
loan_amnt:
    test_name          total_run    failed
    value LT 548250         1000         2
    value GT 2500.0         1000        20
fico_range_high:
    test_name        total_run    failed
    value GT 4000         1000      1000
annual_inc:
    test_name                total_run    failed
    summary min GE 0/None            0         0

In this case, the value constraints were applied 1000 times for each rule, and the failed column shows us how many times our data failed our expectations. The summary constraints, however, were not applied yet, as can be seen from the total_run field.
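To show how the total_run and failed columns come about, here is a small plain-Python sketch that tallies value constraints over a handful of records. It is an illustration under stated assumptions, not the whylogs implementation: the three records are made up, and run_value_constraints is a hypothetical helper.

```python
import operator

OPS = {"LT": operator.lt, "GT": operator.gt, "GE": operator.ge, "LE": operator.le}

# Made-up records; the real subset has 1000 rows.
records = [
    {"loan_amnt": 10000, "fico_range_high": 700},
    {"loan_amnt": 2000, "fico_range_high": 680},
    {"loan_amnt": 600000, "fico_range_high": 750},
]

# Feature name -> list of (operator code, threshold) pairs.
constraints = {
    "loan_amnt": [("LT", 548250), ("GT", 2500.0)],
    "fico_range_high": [("GT", 4000)],
}


def run_value_constraints(records, constraints):
    """Return {feature: [(test_name, total_run, failed), ...]}."""
    report = {}
    for feature, rules in constraints.items():
        rows = []
        for op_name, threshold in rules:
            # Each rule is evaluated once per record.
            results = [OPS[op_name](r[feature], threshold) for r in records]
            rows.append((f"value {op_name} {threshold}", len(results), results.count(False)))
        report[feature] = rows
    return report


report = run_value_constraints(records, constraints)
for feature, rows in report.items():
    print(f"{feature}:")
    for name, total, failed in rows:
        print(f"    {name:<20}{total:>10}{failed:>10}")
```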

The summary constraints can be applied to an existing profile. Since the constraints were already supplied while creating the profile, we can call apply_summary_constraints() with no arguments:

Constraint failures by feature - 
annual_inc:
    test_name                total_run    failed
    summary min GE 0/None            1         0

We could also overwrite the original summary constraint with a new one, for example:

GitHub Actions with whylogs

So far we have seen how to apply whylogs constraints directly within our pipelines. Alternatively, we can validate our data as part of our Continuous Integration pipeline with GitHub Actions.

GitHub Actions helps you automate your software development lifecycle by enabling the creation of workflows. A workflow is an automated procedure that you add to your project’s repository and that is triggered by an event, such as a commit being pushed or a pull request being created. A workflow itself is created by combining a series of building blocks, the smallest of which is called an action. With GitHub Actions you can test, build, and deploy your code in an automated manner.

Overview

With whylogs, we can expand the reach of GitHub Actions to test not only code but also data. Let’s demonstrate by imagining a simple data pipeline, in which we fetch data from a source and then apply a preprocessing routine to it.

Each validation step serves a different purpose. When applying a set of constraints to the source data, we’re interested in assessing the quality of the data itself. Changes in external data sources can happen at any time, so we will schedule the job to run at fixed intervals. The second validation step is applied after preprocessing the data with internal code. In this case, our goal is to test the quality of our data pipeline. Since we want to run these constraints every time the code changes, we will also execute the job whenever someone pushes a commit. For demonstration purposes, we’re creating only one job that is triggered by two different events. Another approach would be to keep things separate by fixing a version of the dataset whenever you’re testing the data pipeline.

GitHub will log information every time the workflow is triggered, so you can check the output from the Actions tab in your project’s repository. Additionally, it will warn you whenever your data fails to conform to your expectations.

Configuring the workflow

To build the workflow, we define its configuration in a .yml file under the .github/workflows folder. We will define only one job, whylogs_constraints, which is triggered every time someone pushes a commit to the repository and also on a schedule.

In broad terms, the workflow is straightforward. We fetch data from a given source with fetch_data.py, which generates the lending_club_1000.csv file. The CSV file is validated against a set of constraints defined in github-actions/constraints-source.json. If the data meets our expectations, the next step is preprocessing. In this example, the preprocessing routine simply removes rows containing NaN values and scales the loan_amnt column to the 0–1 interval. It writes a preprocessed dataset named lending_post.csv, which is validated in turn against a separate set of constraints defined in github-actions/constraints-processed.json.
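The preprocessing step can be sketched as follows. This is a standard-library illustration of dropping incomplete rows and min-max scaling, not the project’s actual preprocessing script: the preprocess function and the sample rows are hypothetical.

```python
import math


def preprocess(rows, column="loan_amnt"):
    """Drop rows containing NaN, then min-max scale `column` to the 0-1 interval."""
    clean = [
        row for row in rows
        if not any(isinstance(v, float) and math.isnan(v) for v in row.values())
    ]
    lo = min(row[column] for row in clean)
    hi = max(row[column] for row in clean)
    span = hi - lo
    # If every value is identical, map them all to 0.0 to avoid dividing by zero.
    return [
        {**row, column: (row[column] - lo) / span if span else 0.0}
        for row in clean
    ]


# Made-up rows for illustration.
rows = [
    {"loan_amnt": 5000.0, "annual_inc": 40000.0},
    {"loan_amnt": 20000.0, "annual_inc": float("nan")},  # dropped: contains NaN
    {"loan_amnt": 35000.0, "annual_inc": 90000.0},
    {"loan_amnt": 12500.0, "annual_inc": 55000.0},
]

processed = preprocess(rows)
scaled = [row["loan_amnt"] for row in processed]
print(scaled)
```

After scaling, the minimum of the column is 0 and the maximum is 1, which is exactly the property a summary constraint on the processed file can verify.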

It’s worth noting that the csv files don’t need to exist in our repository prior to making the commit. The files will be created inside the runner during the execution of the workflow and will not persist in our project’s repository.

Workflow syntax

Let’s discuss some of the lines in the configuration file:

on: [push] — Specifies the triggering event to our workflow. Every push event will trigger the workflow.

on: [schedule] — Specifies the triggering event to our workflow. It will be triggered on a schedule. In this example, the workflow will be executed every day at 05:30 and 17:30.

runs-on: ubuntu-latest — Specifies the virtual machine’s environment.

uses: actions/checkout@v2 — In order to run actions against our code, we need to first check out the repository into the runner, which is done by using the actions/checkout community action.

uses: whylabs/whylogs-action@v1 — The prepackaged whylogs action used to apply our constraints to the desired dataset. To use it, we also need to supply some parameters:

constraintsfile: The set of constraints in JSON to be applied
datafile: The file containing the data to which the constraints should be applied. Any format Pandas can load will work, but CSV works well.
expect-failure: Even though we usually write actions to expect success, whylogs also lets you create actions that are expected to fail by setting this flag. Defaults to false.
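One way to read the expect-failure flag is as an inversion of the step’s outcome: the step passes when what happened matches what was expected. The sketch below encodes that reading in plain Python. It is an assumption about the behaviour, not the action’s source code, and step_passes is a hypothetical helper.

```python
def step_passes(failed_checks: int, expect_failure: bool = False) -> bool:
    """Return True when the outcome matches what the workflow author expected."""
    constraints_failed = failed_checks > 0
    return constraints_failed == expect_failure


# Enumerate the four combinations of data quality and flag setting.
outcomes = {
    "clean data, default flag": step_passes(0),
    "bad data, default flag": step_passes(20),
    "bad data, expect-failure": step_passes(20, expect_failure=True),
    "clean data, expect-failure": step_passes(0, expect_failure=True),
}

for scenario, passed in outcomes.items():
    print(f"{scenario}: {'pass' if passed else 'fail'}")
```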

Constraint Definition

In this example, we’re using two sets of constraints: constraints-source.json and constraints-processed.json. For the first one, we’ll use the same constraints generated in the previous section of this article:

As for the processed file, we’ll define a SummaryConstraint to verify that the normalized loan_amnt feature is indeed in the 0–1 range:

As discussed previously, we can create actions that are expected to fail or succeed. To demonstrate both cases, we’ll expect constraints-source.json to fail and constraints-processed.json to succeed.

What’s Next

The WhyLabs team is constantly extending whylogs to support additional features. Regarding constraints, the following features are being considered:

Regex operators: constraints to match regex patterns on strings.
Automatic constraints generation: automatic generation of constraints from baseline profiles based on learned thresholds. This would expand an already existing feature.

Please feel free to like or comment on the related GitHub issues above if you would also like to see these features!

As for CI pipelines, much more can be done in terms of data validation. A real scenario will certainly have more complex data pipelines, and hence more stages for data to be validated. To further increase our system’s reliability, we could apply constraints not only on the model input but also on the output and real-time inference. Coupled with whylogs profiles, constraints enable data scientists and ML engineers to trust their ML pipelines!

If you’re interested in exploring whylogs in your projects, consider joining the Slack community to get support and also share feedback!