Notes from Industry

Why don’t we test machine learning like we test software?

Let’s make ML models more robust by taking inspiration from modern software testing! A new way to confidently improve your models.

Daniel Angelov
Towards Data Science
10 min read · Jun 24, 2021


Machine learning systems are now ubiquitous in our daily lives, so the correctness of their behaviour is absolutely crucial. When an ML system makes a mistake, it can not only result in an annoying online experience, but also limit your socio-economic mobility or, even worse, cause a life-threatening manoeuvre in your car. So how certain are you that a deployed ML system is thoroughly tested and that you are not effectively a test user? On the flip side, how do you know that the system you’ve been developing is reliable enough to be deployed in the real world? And even if the current version is rigorously tested in the real world, how can you be sure that updating one part of the model has not regressed its overall performance? These are hard questions, rooted in the sheer complexity of the problems we try to solve in a data-driven fashion and in the scale of the machine learning models we build nowadays.

In this blog post, we are going to take a closer look at another domain facing similar issues — software engineering — at the testing methodologies employed there, and at how they can be applied to ML testing. By the end of this article, I hope you will be seriously asking yourself “Why don’t we test machine learning as thoroughly as we test software?” and will see how advanced software testing methods can be used to extensively test models, catch regressions and become part of your ML quality assurance process. Below you can see a diagram of the continuous improvement pipeline my team and I are building, which shows how these advanced software QA techniques fit into the MLOps space.

Using the existing models and data, we can pair them with the expected domain of operation and test their robustness within it. This generates a domain coverage estimate along with any failure data samples that have been found. (Image by the author)

Testing Machine Learning Models

By far the most widely adopted strategies for evaluating ML models rely on fixed, already collected datasets and rarely explore more than accuracy, confusion matrices and F1 scores, or proxies thereof. More mature ML teams typically have more advanced testing pipelines that include extensive data slicing, uncertainty estimation and online monitoring.

It is well known, though, that these methods are prone to missing corner cases and suffer from issues such as domain shift and stale data. In the limit of access to almost infinite amounts of data, these widely adopted approaches provide solid results, but when solving real-world problems and building real-world systems this is simply not the case. Even though testing and iterating can take up to 60–70% of the development time for an ML system, evaluating ML models is currently more of an art than a standard engineering practice.

The problem of rigorous testing is well studied in other fields from software engineering to process control. So what can we borrow?

Extremely critical pieces of code are not just tested, but formally verified to be correct. This means the systems are mathematically proven to exhibit the correct behaviour in all considered scenarios. While algorithms for formally verifying deep neural networks are actively being developed, they have yet to be scaled to real-world applications. In practice, though, only an extremely small portion of the software in the world is formally verified.

However, probably every single piece of software deployed in production is tested using techniques ranging from manual testing to unit and end-to-end testing. While maximizing code coverage is a common approach, more advanced testing techniques are required for data-driven systems such as ML-based ones.

Software QA and Property-Based Testing

Code coverage is probably the most widely adopted measure within the software industry for how well a piece of code is tested. However, it is quite possible to achieve 100% code coverage and yet have particular data inputs that break your code.

For example, let’s imagine we’ve written our own division function ‘div(a, b)’ to calculate ‘a / b’. We can write a couple of tests ensuring that ‘div(8, 2)’ and ‘div(7, 6)’ work correctly and quickly get 100% code coverage. However, we were in a rush to write the function and completely forgot the divide-by-zero corner case. Despite that, we still achieved 100% code coverage! Something doesn’t feel right. The key problem is that code coverage is just not enough for data-driven solutions. And nowadays almost every single piece of software is data-driven!
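
Here is a minimal sketch of that scenario in Python (the function and tests are purely illustrative): both example-based tests pass and every line of ‘div’ is executed, so a coverage tool reports 100%, yet division by zero is never exercised.

```python
def div(a, b):
    return a / b

def test_div_exact():
    assert div(8, 2) == 4

def test_div_inexact():
    assert abs(div(7, 6) - 7 / 6) < 1e-9

# Both tests pass and every line of div() runs, so coverage is 100%,
# yet div(1, 0) still raises ZeroDivisionError at runtime.
```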

That’s why more advanced techniques have been used in software QA for quite a while. One such popular technique is called property-based testing. The key idea is that you specify rules that must invariably hold true (“properties”) across the entire set of possible data inputs (the “data domain”). Once you specify these rules, the computer automatically tries to find data inputs that violate the specified properties and, consequently, your assumptions.

If we are to test our ‘div(a, b)’ function in a property-based fashion, we’d specify that

  1. ‘a’, ‘b’ and the result ‘c’ are real numbers (data domain)
  2. ‘a = b * c’ must be true (property) +

This is enough information to let the computer search for corner cases, and you can be certain that modern property-based testing frameworks such as Hypothesis or QuickCheck will almost immediately find the corner case where ‘b = 0’. When that happens, it is up to us to either fix the code or restrict the set of possible inputs. Importantly, whichever action we choose, we will be converting implicit assumptions into explicit ones.
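
As a hedged sketch, here is roughly what that property looks like when written with Hypothesis (the test body is my own illustration, not a canonical example):

```python
import math

from hypothesis import given, strategies as st

def div(a, b):
    return a / b

# Data domain: real numbers, represented here as finite floats.
reals = st.floats(allow_nan=False, allow_infinity=False)

@given(a=reals, b=reals)
def test_div_property(a, b):
    c = div(a, b)
    # Property: a = b * c (checked up to rounding error).
    assert math.isclose(a, b * c, rel_tol=1e-9, abs_tol=1e-9)
```

Running this with pytest, Hypothesis reports a counterexample after only a handful of generated inputs, typically ‘b = 0’, where ‘div’ raises ZeroDivisionError. (It may also surface floating-point overflow cases, which is exactly the kind of numerical-accuracy detail the footnote below sets aside.)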

By now you must be wondering how a computer can find these corner cases. Many modern frameworks employ purely random data generation, perhaps with some heuristics to improve performance. While this works well for testing individual source code functions that rarely take more than 7–8 input parameters, it certainly is not applicable to testing entire machine learning models that typically operate in high dimensional spaces. So how can we bring property based testing to machine learning?

+ We omit issues related to numerical accuracy in order to keep the example simple.

Bringing property-based testing to ML

Before addressing the how, let’s take a moment to think about the why. Why would one want to test their ML model in a property-based fashion? Current ML testing methods that revolve around fixed hold-out datasets and data slices are the equivalent of coverage-based testing for software. But as we saw with our ‘div()’ example, code coverage is not enough for data-driven systems. Hence the need for more advanced strategies for testing our ML systems.

At first glance, any effort to bring property-based testing to ML seems futile due to the extremely large dimensionality of the input data. Even an image from the toy-like MNIST dataset lives in a 784-dimensional space. There are 2^784 possible binary images, so randomly hitting an image that actually represents a digit and also breaks your model is simply impossible from a practical perspective. But there are far more concise ways to describe the images in a dataset.

Operational vs. Data domain

Imagine a friend or a colleague of yours asks you what sort of images the MNIST dataset contains. You are not going to describe them as “Each image is a 784-dimensional binary vector that belongs to a manifold spanning handwritten digits.” You are more likely to say something along the lines of “Each image contains a handwritten digit with random rotation, size and style”.

The first answer describes the raw data domain, while the second describes the operational domain of the model. And as humans building and testing ML models to solve problems in the real world, we ultimately care about the operational domain and not so much about the data domain. Imagine having to convince a regulating body that your ML solution is sufficiently tested by describing what fraction of the raw data space you’ve explored… While this might be needed for safety-critical applications, the analysis should always start from the perspective of the operational domain.

It also turns out that thinking about the operational domain instead of the raw data domain is an important step towards bringing property-based testing to ML. Operational domains are much smaller and more intuitive to reason about. It is very easy to come up with requirements in the language of the operational domain, such as “My model should be able to recognize the digit regardless of its orientation” or “My model should work with small and large digits”.
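
To make this concrete, here is a hedged sketch of how the orientation requirement could be written as a property-based test. Everything model-specific is a placeholder: ‘model.predict’ and ‘reference_digits’ stand in for your own classifier and a handful of labelled 28x28 digit images. The key point is that Hypothesis searches over the operational parameter (the rotation angle) rather than over raw pixels.

```python
from hypothesis import given, settings, strategies as st
from scipy import ndimage

@settings(deadline=None)  # model inference can be slow; disable the per-example deadline
@given(angle=st.floats(min_value=-180.0, max_value=180.0))  # operational parameter
def test_orientation_invariance(angle):
    # reference_digits: assumed list of (image, label) pairs; model: your classifier.
    for image, label in reference_digits:
        rotated = ndimage.rotate(image, angle, reshape=False, order=1)
        # Property: the predicted digit must not change with its orientation.
        assert model.predict(rotated) == label
```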

Recognizing handwritten digits on a black background is a relatively easy problem with a somewhat small operational domain. However, that is not the case for many of the real-world problems we tend to solve with ML. How big is the operational domain for an autonomous vehicle?

Searching vs. Random Sampling

The size of the operational domain for an autonomous vehicle, coupled with rare events that follow long-tail distributions, makes random sampling absolutely impractical. Therefore, the second step needed to bring property-based testing to ML is to turn the problem of randomly generating a failure case into a search problem. Hence the idea of targeted property-based testing, which coincidentally is also an emerging software QA method. In order to make this switch, it is important to specify properties that are not just binary TRUE or FALSE, but follow a spectrum from 0 to 1. This step is quite intuitive for many ML problems, even discrete ones such as classification, where we can easily measure activations and their closeness to the decision boundary.
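
Below is a hedged sketch of that idea, with all model-specific pieces assumed: ‘model.predict_proba’ stands in for a classifier returning per-class probabilities, and the operational domain is reduced to a single rotation angle. Instead of a binary pass/fail, the property is scored by the margin between the true class and the strongest other class, and a simple hill-climbing search drives that margin towards (and past) zero.

```python
import numpy as np
from scipy import ndimage

def margin(angle, image, label):
    """Continuous property: > 0 means the prediction is correct, <= 0 means a failure."""
    rotated = ndimage.rotate(image, angle, reshape=False, order=1)
    probs = model.predict_proba(rotated)          # assumed per-class probabilities
    others = np.delete(probs, label)
    return probs[label] - others.max()

def search_failure(image, label, iterations=200, step=5.0, seed=0):
    """Simple stochastic hill climbing over the rotation angle."""
    rng = np.random.default_rng(seed)
    best_angle = 0.0
    best_score = margin(best_angle, image, label)
    for _ in range(iterations):
        candidate = float(np.clip(best_angle + rng.normal(scale=step), -180, 180))
        score = margin(candidate, image, label)
        if score < best_score:                    # move towards the decision boundary
            best_angle, best_score = candidate, score
        if best_score <= 0:                       # property violated: failure found
            return best_angle, best_score
    return best_angle, best_score
```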

Moving up and down the abstraction level

(Image by the author)

The figure above illustrates the difference between the operational and raw data domains and how one can move between them in order to perform property-based testing of an ML system. Importantly, there are two paths from a given raw data sample x to a new one x’. The first path, x -> measure -> search -> synthesize -> x’, is extremely powerful. It allows searching a precisely defined operational domain without worrying that invalid data samples will be used in the evaluation of the model. However, these capabilities require non-trivial steps, such as ‘measuring’ various aspects of the raw data sample in order to translate it into the language of the operational domain, as well as the inverse step of synthesizing a new raw data sample from a point in the operational domain.

With the advancement of simulators and deep generative modelling, synthesis is becoming practically feasible for more and more real-world problems, but there are still problems where this is not yet the case. This is where the second path, x -> domain-aware transform -> x’, comes into play. It is akin to standard augmentation techniques, but the important aspect is recognizing that transformations are often parameterized at the level of the operational domain. Adding pixel noise, for example, is a transformation at the low data level, whereas adding motion blur parameterized by the relative velocity of the camera to the scene is a transformation at the operational domain level. Therefore, it is possible to search through the operational domain even when measuring and synthesis are not feasible.
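
As a hedged illustration of that difference, here is a sketch of the two kinds of transforms for a grayscale image. The mapping from relative velocity to blur length (the exposure time and pixels-per-metre constants) is made up for the example; what matters is that the second function is parameterized by an operational-domain quantity.

```python
import numpy as np
from scipy import ndimage

def add_pixel_noise(image, sigma, seed=0):
    """Raw data level: perturb individual pixel values with Gaussian noise."""
    rng = np.random.default_rng(seed)
    return image + rng.normal(scale=sigma, size=image.shape)

def add_motion_blur(image, relative_velocity, exposure=0.01, px_per_m=50):
    """Operational domain level: blur length follows from velocity and exposure."""
    length = max(1, int(relative_velocity * exposure * px_per_m))
    kernel = np.full((1, length), 1.0 / length)   # horizontal box blur kernel
    return ndimage.convolve(image, kernel, mode="nearest")
```

A search over ‘relative_velocity’ then explores a physically meaningful slice of the operational domain, even though each candidate is ultimately realized as a plain pixel-level transformation.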

Benefits of Property-Based Testing for ML

Testing ML models in a property-based fashion does not just tell you whether a model breaks; it can provide deep insights into the failure modes of the system as well as actionable artefacts, such as a dataset of novel, unseen data samples that break your model.

Accurate robustness evaluation

Actively searching the operational domain for failure regions while ensuring that you do not violate any of the specifications inevitably generates an accurate evaluation of how robust your model is across the operational domain. And because the operational domain is specified in a human-understandable fashion you can quickly understand the reasons behind a failure case as well as identify actions to resolve the issue.

Novel unseen failure data samples

At the core of property-based testing is the idea of generating data inputs that violate your tests, so you’ll end up with new data samples that actually cause your model to fail. This is an extremely useful resource: you can not only inspect failures up close, but also enhance your training dataset, which will inevitably improve the robustness of the model. It is even possible to monitor new incoming data and automatically flag problematic samples by comparing them to the extensive dataset of failure cases. There are, though, plenty of other useful and interesting use cases!

Iteratively building an ML system, while armed with truly representative robustness scores and novel failure data samples, is no longer a game of whack-a-mole, as you can carefully track regressions and identify improvements. Ultimately, the property-based testing framework enables you and your team to build ML solutions with confidence like never before.

If you are interested in our upcoming blog post about applying property based testing to computer vision ML problems remember to subscribe!

About the Author

I am the CTO & Co-founder at Efemarai, an early-stage startup building a platform for testing and debugging ML models and data. With our platform, machine learning teams can track regressions, monitor improvements and ship their ML solutions with confidence.

Backed by Silicon Valley and EU investors, we are on our journey to bring quality assurance to the AI world and unleash a Cambrian explosion of new ML applications!

If you want to get in touch just drop us an email at team@efemarai.com
