Notes from Industry

Making AI models more robust, more efficiently

Deploying Machine Learning models to the real world tends to uncover domain coverage issues. One way to robustify a model is to generate unseen data on which the model is still expected to work. Property-based testing can address exactly this!

Daniel Angelov
Towards Data Science
7 min read · Jul 23, 2021


Property-based testing aims to maintain a certain property given an operational domain, allowing us to estimate the robustness of the model. Image by author.

In the last article on testing Machine Learning models, we looked into drawing parallels between software testing and how those strategies can be used within the MLOps domain. We showed that static dataset-based methods for evaluating model performance are not powerful enough to fully examine the accuracy of the models in high dimensional spaces. One way to gain more insight into the robustness of a model is through methods like property-based testing that go about synthesising input data, conditioned on some specifications, in order to break the model!
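To make the idea concrete, here is a minimal sketch of a property-based test using the hypothesis library. The predict function and the brightness range are hypothetical stand-ins for a real model and its operational domain, not anything from the experiments below.

```python
# A minimal property-based test: the predicted class should be stable under
# brightness changes within the expected operational range. `predict` is a
# hypothetical placeholder for a real model.
import numpy as np
from hypothesis import given, settings
from hypothesis import strategies as st

def predict(image: np.ndarray) -> int:
    """Toy classifier standing in for a real model."""
    return int(image.mean() > 0.5)

@settings(max_examples=200)
@given(brightness=st.floats(min_value=0.5, max_value=1.5))
def test_stable_under_brightness(brightness):
    image = np.full((32, 32), 0.7)                    # fixed probe input
    perturbed = np.clip(image * brightness, 0.0, 1.0)
    # Property: re-exposing the image must not change the prediction.
    assert predict(perturbed) == predict(image)
```

Run under pytest, hypothesis samples the brightness range and, if the property fails, shrinks the counterexample to a minimal failing value: exactly the kind of model-breaking input we are after.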

But how can testing make your models both more robust and more efficient to train?

First, let’s begin by talking about the expected operational domain. In other words, what is the data domain that we expect our inputs to come from — what types of variations, constraints, and extreme values do we expect to observe? For example, in the computer vision domain, we can begin describing the time of day, meteorological conditions, and even properties of the observations themselves — the size and type of objects, how they can vary and what needs to remain fixed.
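As a sketch, such a specification could be encoded as a list of explicitly named axes with their allowed ranges. The axis names and bounds below are purely illustrative, not a real API.

```python
# A purely illustrative encoding of an operational domain as named axes of
# variation with allowed ranges.
from dataclasses import dataclass

@dataclass
class Axis:
    name: str
    low: float
    high: float

OPERATIONAL_DOMAIN = [
    Axis("time_of_day_hours", 0.0, 24.0),    # dawn, noon, dusk, night
    Axis("rotation_degrees", -15.0, 15.0),   # viewing perspective
    Axis("object_scale", 0.5, 2.0),          # size of observed objects
    Axis("rain_intensity", 0.0, 1.0),        # meteorological conditions
]
```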

Describing the domain like this allows us to define the axes of variation, i.e. the different ways in which the data can vary. For a small problem with a few numerical inputs, it is tractable to find a transformed input that leads to poor model performance. However, if the input is an image, how can we do this? Let’s imagine we want to convert a daytime image to look like nighttime.

In the last blog post, we spoke about converting data samples from the data domain to an operational domain, performing the change along the time-of-day axis, and converting back to the data domain. In the figure below we can see an image of a deer whose full body is captured, with the head and antlers facing us. We can enumerate other attributes describing the image, such as the fact that it was taken during dawn, that it contains a variety of bushes and trees in the background, etc.

We can alter the time of day attribute from dawn to evening. Diagram by the author.
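As a toy illustration of moving along the time-of-day axis, the snippet below darkens an image with a gamma curve and a cool colour cast. This is a crude, hand-rolled stand-in; a real pipeline would typically use a learned image-to-image translation model.

```python
# A crude move along the time-of-day axis: gamma darkening plus a cool
# colour cast. Real systems would use a learned image-to-image model.
import numpy as np

def shift_time_of_day(image: np.ndarray, darkness: float) -> np.ndarray:
    """image: float RGB array in [0, 1]; darkness: 0.0 = dawn, 1.0 = night."""
    night = np.power(image, 1.0 + 2.0 * darkness)  # darken mid-tones
    night[..., 0] *= 1.0 - 0.30 * darkness         # pull down red
    night[..., 1] *= 1.0 - 0.15 * darkness         # and green, leaving a blue cast
    return np.clip(night, 0.0, 1.0)

day = np.random.rand(224, 224, 3)                  # placeholder image
evening = shift_time_of_day(day, darkness=0.7)
```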

We want to make sure that our car, navigating the interstate highway system, is capable of recognizing the deer regardless of the time of day or the assumed background. This means we can now change these attributes and evaluate the overall robustness of the model. In an industrial application, the collected data would be maximized to cover the operational domain of the model as well as possible. However, it is impractical at best, and impossible at worst, to observe data spanning the complete operational domain, which leaves the deployed Machine Learning algorithms unevaluated in some areas of that domain. Overcollecting data not only creates data management issues and increases annotation costs, but also slows down algorithm iterations. If we cannot explicitly enumerate the data domain, it is much more tractable to infer the operational domain within which we expect the ML model to work.

In this case, we change the background of the subject from trees to a highway. Diagram by the author with a photo by Sebastian Palomino.
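A minimal way to exploit such attribute changes is to randomly sample points in the operational domain and record where the model’s prediction breaks. The perturbation and model below are placeholders for the sake of the sketch.

```python
# Random search over one axis of the operational domain, collecting the
# sampled points where the (placeholder) model's prediction breaks.
import random
import numpy as np

def perturb(image: np.ndarray, darkness: float) -> np.ndarray:
    return np.clip(image * (1.0 - darkness), 0.0, 1.0)   # any axis transform fits here

def predict(image: np.ndarray) -> int:
    return int(image.mean() > 0.3)                       # placeholder model

def find_failures(image, label, n_samples=200):
    failures = []
    for _ in range(n_samples):
        darkness = random.uniform(0.0, 1.0)              # sample a domain point
        if predict(perturb(image, darkness)) != label:
            failures.append(darkness)                    # an edge case worth keeping
    return failures

img = np.full((32, 32, 3), 0.6)
edge_cases = find_failures(img, label=1)                 # here, any darkness > 0.5
```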

At the end of each training stage, the ML model needs to be tested on the corresponding operational domain before deployment. However, at Efemarai we argue that robustness evaluation does not need to sit at the end of the MLOps cycle, but can be integrated as part of the training pipeline.

Testing platform

Efemarai is a platform that tests and robustifies ML models. It works by finding edge cases in the operational domain of the problem that cause the model to underperform. It gives developers the ability to easily integrate their existing ML assets (models, data, code) with specifications and tests to uncover these robustness issues.

As a result, you get a robustness coverage diagram that enables developers to monitor progress along multiple axes (alongside industry-standard metrics such as accuracy, F1 score, and confusion matrices) to gain confidence in the deployed models and to provide insights to non-technical stakeholders such as business leaders, managers, users, and regulating bodies.

For example, UI/UX designers can use these insights to guide users along paths where the AI model performs well, minimizing the chances of them running into issues.

In addition, the resulting novel synthesized dataset of edge cases has a direct impact on the performance of the ML model. It has multiple uses, and in the rest of the article we’ll illustrate how adding these edge-case samples to the training dataset can have a direct and measurable impact not only on the accuracy and F1 scores, but also on the overall robustness of the model. It speeds up iterations and training times compared to an exhaustive search and thus minimizes the resources required.
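As a sketch of that feedback loop in PyTorch, folding discovered edge cases back into training is a one-liner with ConcatDataset. The tensors below are placeholders standing in for real images, labels, and generated samples.

```python
# Folding discovered edge cases back into the training set with PyTorch;
# the tensors below are placeholders for real images and labels.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

train_set = TensorDataset(torch.rand(1000, 3, 32, 32),
                          torch.randint(0, 43, (1000,)))    # original data
edge_cases = TensorDataset(torch.rand(50, 3, 32, 32),
                           torch.randint(0, 43, (50,)))     # generated samples

augmented = ConcatDataset([train_set, edge_cases])
loader = DataLoader(augmented, batch_size=64, shuffle=True)
```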

ML Model Robustification Challenge

A common problem in both the automotive industry and asset management is identifying traffic signs. Such models must be extremely accurate and robust to any perturbation that can be observed in the real world — from time of day, sign appearance, and viewing perspective to weather conditions and scale.

Let’s use the German Traffic Sign Recognition Benchmark dataset and try to train the most robust model in the least amount of time. Thus, we’ll be benchmarking the domain coverage robustness and the speed of training.

We’ll be testing the following conditions with the same ImageNet pre-trained model (a rough sketch of these setups follows the list):

  • A: An EfficientNet model trained without any additional perturbations and transformations;
  • B: An EfficientNet model trained with AutoAugment;
  • C: An EfficientNet model trained with all possible perturbations and transformations;
  • D: An EfficientNet model trained without additional perturbations and transformations, but with the added data generated by the Efemarai platform.
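Here is an illustrative torchvision setup for conditions A–C. The exact augmentations and hyperparameters used in the experiment are not public, so treat every choice below as a guess, not the actual experimental code.

```python
# Illustrative setups for conditions A-C with torchvision; condition D would
# reuse the "A" transforms plus the generated edge-case data (see the
# ConcatDataset pattern earlier in the article). Hyperparameters are guesses.
import torch
from torchvision import datasets, models, transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

condition = {
    "A": transforms.Compose([                       # no augmentation
        transforms.Resize((224, 224)),
        transforms.ToTensor(), normalize]),
    "B": transforms.Compose([                       # AutoAugment
        transforms.Resize((224, 224)),
        transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),
        transforms.ToTensor(), normalize]),
    "C": transforms.Compose([                       # "all" perturbations
        transforms.Resize((224, 224)),
        transforms.ColorJitter(0.5, 0.5, 0.5, 0.1),
        transforms.RandomRotation(15),
        transforms.RandomPerspective(),
        transforms.ToTensor(), normalize]),
}

train_set = datasets.GTSRB(root="data", split="train", download=True,
                           transform=condition["A"])
model = models.efficientnet_b0(weights="IMAGENET1K_V1")    # ImageNet weights
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, 43)
```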

ML Robustness Results

Let’s begin by highlighting the immense power of modern neural networks as universal function approximators: in all cases the models achieved more than 97% accuracy. It also highlights a known issue: a new technique may improve the results by a fraction of a percent, but it is hard to evaluate what real-life improvement that translates to. This is exactly where the domain coverage diagram can help.

But first, let’s look at the final accuracy and training time:

  • Model A: 98.6%, 2 min 0 sec for 20 epochs
  • Model B: 97.4%, 2 min 22 sec for 20 epochs
  • Model C: 97.0%, 8 min 20 sec for 20 epochs
  • Model D: 99.3%, 3 min 34 sec for 20 epochs
  • Model D*: 98.6%, 2 min 40 sec for 15 epochs (model D trained only until it matches the accuracy of model A)

Training dynamics for the different models. Image by author.

From the above results, we can see that all models performed well, and that adding different perturbations can in some cases significantly increase the training time (2 minutes vs. more than 8). Model D, which uses the Efemarai platform data, reaches the highest accuracy in 3 min 34 sec. If we instead train it only until it matches the accuracy of the second-best model (model D* above), it takes 25% fewer epochs and nearly matches the wall time of the fastest-training model. In comparison, it is nearly 3x faster than using all of the augmentations. This highlights that smart transformations targeting the areas where the model underperforms have a much greater impact on overall performance than uniformly sampled, blind augmentation.

Now let’s take a look at the robustness coverage diagrams.

Domain Coverage diagram. Image by author.

The radial diagram above shows the robustness level of each model with respect to a particular perturbation. The target is to be maximally robust and completely fill the corresponding slice. We have colour-grouped the perturbations by class: geometric, colour, or noise-based in nature.
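For intuition, the per-perturbation scores behind such a diagram can be computed by re-evaluating the same test set under each named perturbation. The perturbation set and evaluation loop below are illustrative, not the platform’s actual method.

```python
# Re-evaluating a model under each named perturbation to obtain one
# robustness score per slice of the radial diagram; all names illustrative.
import torch

perturbations = {
    "rotation":   lambda x: torch.rot90(x, 1, dims=(-2, -1)),             # geometric
    "brightness": lambda x: (x * 1.5).clamp(0.0, 1.0),                    # colour
    "noise":      lambda x: (x + 0.1 * torch.randn_like(x)).clamp(0.0, 1.0),
}

@torch.no_grad()
def robustness_scores(model, loader, device="cpu"):
    model.eval()
    scores = {}
    for name, fn in perturbations.items():
        correct = total = 0
        for images, labels in loader:
            preds = model(fn(images.to(device))).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.numel()
        scores[name] = correct / total        # one slice of the radial plot
    return scores
```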

First, we want to highlight that the original model A, trained with zero transformations, can be extremely susceptible to perturbations that are not seen within the dataset, with a performance drop of more than 30%! Considering the operational domain is thus a very important step in transitioning models from development to deployment, especially in cases that impact business metrics.

Model D, trained with the Efemarai platform, shows a significant improvement in robustness to geometric and some noise-based perturbations, and the best overall robustness of the group. It brings the lowest-performing case from around 50% to above 80%, which is a substantial improvement!

It is noteworthy how much of model D’s robustness can be attributed to the additional data, even with 3x less training time compared to model C.

Using the platform and the generated unseen data results in a direct improvement in model performance, fewer corner cases encountered by users in operation, and an understanding of what type of data needs to be collected. In the next blog post, we’ll show another use of the platform that directly targets this.

Conclusion

Thinking about model robustness during the development of the model has an outsized return in iteration speed and final performance compared to evaluating models only as the last step before deployment. We have shown that using property-based testing platforms for Machine Learning like Efemarai can result in better models, a deeper understanding of their performance, and more confidence in deploying them.

About the author

I am the CTO & Co-founder at Efemarai, an early-stage startup building a platform for testing and debugging ML models and data. With our platform, machine learning teams can track regressions, monitor improvements, and ship their ML solutions with confidence.

Backed by Silicon Valley and EU investors, we are on our journey to bring quality assurance to the AI world and unleash a Cambrian explosion of new ML applications!

If you want to get in touch, just drop us an email at team@efemarai.com.
