Rediscovering Unit Testing: Testing Capabilities of ML Models

What’s wrong with i.i.d. evaluations? How do we identify capabilities? How do we create tests? Could the model generalize to out-of-distribution data?

Christian Kästner
Towards Data Science

--

Triggered by two papers [1, 2], I have been reading and thinking a lot recently about testing of machine-learned models. There seems to be a growing trend to test capabilities of a model in addition to just measuring its average prediction accuracy, which reminds me a lot of unit testing and substantially shifts how we think about model evaluation.

In short, instead of just measuring generalization within a certain target population, we could try to evaluate whether the model learned certain capabilities, often key concepts of the problem, patterns of common mistakes to avoid, or strategies of how humans would approach the tasks. We can test whether a model has learned those capabilities separately, distinct from standard accuracy evaluations, more akin to black-box unit testing. Overall, the hope is that a model that has learned expected capabilities will be more robust to out-of-distribution data and attacks.

Traditional accuracy evaluations and their assumptions

To understand the relevance of testing capabilities, it is important to understand the assumptions and limitations behind traditional accuracy evaluations.

Example machine learning challenge: Detecting cancer in radiology images. Picture: Ivan Samkov

First of all, when evaluating models, we are not testing whether the model is “correct” for all inputs (we usually don’t even have a specification to establish what correct might mean), but whether it fits well for inputs from a target population, or whether it is useful for a problem. Remember that “all models are wrong, but some are useful” — we are trying to identify the useful ones. A model with 95% accuracy may fit a problem quite well, and 5% mistakes may be acceptable for a useful solution, say for detecting cancer in radiology images.

Second, we evaluate a model on test data that is separate from the training data to measure how well the model generalizes to unseen data. Importantly, we do not care about generalization to any data, but only about generalization to data from a target population. For example, we would not expect a cancer detection model to work well on landscape photography, but we want it to do well on radiology images taken with certain equipment. Data that does not match the target population is known as out-of-distribution data.

A key assumption in much of machine-learning theory and practice (not always explicit; it took me a while to realize) is that training and test data are independently drawn from the same target population, known as independent and identically distributed or i.i.d. for short. The dominant strategy is to collect a single dataset from the target population and then split it randomly into training and test data, ensuring that both are drawn from the same distribution. Any i.i.d. evaluation strategy will measure how well the model generalizes within that population. That is, we select training data from a target population to which we want to generalize, say radiology images in one hospital, and then evaluate the model on test data drawn from the same population to measure how well it generalizes within that target population.
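
As a minimal sketch of this standard setup (my own illustration, assuming scikit-learn and using synthetic data as a stand-in for a sample collected from the target population), the familiar recipe collects one dataset and splits it randomly:

```python
# Minimal sketch of the standard i.i.d. evaluation setup; the synthetic data
# stands in for a dataset sampled from the target population.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# One dataset, split randomly: training and test data come from the
# same distribution (the i.i.d. assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# This accuracy speaks only to generalization within the sampled
# distribution, not to out-of-distribution inputs.
print(accuracy_score(y_test, model.predict(X_test)))
```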

Importantly, the whole scheme falls apart if training and test data are not a representative, unbiased sample of the intended target population. For example, if we intend our cancer detector to work across all kinds of equipment and patient demographics, but we trained and tested it on scans from a single scanner in a single hospital, our model may generalize well to the distribution used for training and testing, but may still not work well for the intended target population. Systematic biases in how data was collected and shifts in the target distribution over time are further reasons why our training and test distributions might not align with the target distribution — problems commonly observed when trying to deploy machine learning solutions.

Training and test data does not always represent the full target population. Beyond the target population there are of course also many inputs that are not relevant for the problem. Image by author.

Generalizing beyond the training distribution

Ideally, we want a model to generalize well to unseen data, even to data that does not quite match the population represented by the model’s training data. This is useful if we want to transfer a problem from one population to another, say a cancer model trained on equipment and patients in one hospital to radiology images taken on other equipment or to patient demographics not well represented in the original hospital. It is similarly useful when the world changes and with it the target population, say when a new fad diet causes unusual forms of non-cancerous radiology imaging artifacts that did not exist when the training data was collected or when new radiology hardware is calibrated differently.

Indeed it is very common to see lower accuracy of a machine-learned model in production, compared to prior offline evaluations on i.i.d. data. The distribution of production data is often different from the distribution of training data, be it due to not sampling representatively in the first place or due to distribution shift as the world evolves.

One can argue (and many machine-learning researchers do, forcefully) that we have no business criticizing a model for making wrong predictions on out-of-distribution data. This would be akin to criticizing an elementary school child for not yet scoring well on high-school math tests. Why would we expect a model to make accurate predictions for things we have not taught it yet?

The typical reaction to poor performance in the actual target distribution is then to collect more training data to better represent the target distribution, to look for potential bias in the sampling method to ensure representativeness, and to update training and test data as the target distribution shifts.

Yet, we hope that we can find ways of training models so that they generalize better even beyond the training data, making those models more robust to potential bias in training data collection, to distribution shifts, and maybe even to adversarial attacks.

The promise of capabilities

One strategy to identify models that generalize beyond the distribution of the training data, which has emerged recently in parts of the literature, is to try to distinguish models that have learned key capabilities that humans consider essential for solving a task. Simply speaking, we are looking for models that better mirror human strategies for solving a problem (or use other forms of domain knowledge we have about the problem), hoping that these are more robust than any alternative strategies the model could come up with.

Capabilities are inherently domain-specific and most discussions I have seen revolve around NLP tasks, but here are a couple of examples from different domains:

  • In sentiment analysis, a model should be able to understand negation, emojis, and sarcasm.
  • In object detection, a model should not primarily rely on the background of an image and should favor shapes over texture (as humans do).
  • In question answering, a model should be able to reason about numbers and synonyms, and should not be distracted simply by word overlap between question and answer.
  • In cancer detection, a model should look for the same patterns in the same regions as radiologists do.
An object detection model should reason about the object in question without being distracted by the background. Examples and pictures from Beery et al. [9].

In a nutshell, the argument why this might work is as follows:

  1. For every training problem we can find many different models that have similar accuracy on the distribution of the training and test data, but that learn very different patterns. Some of those models generalize better to out-of-distribution data than others. Indeed, the Google paper [2] that led me down this rabbit hole has dozens of examples showing this pattern for many learning problems.
  2. Some models learn patterns in the data that are specific to the distribution used for training and testing, but do not generalize to the larger problem. Examples of shortcut learning on datasets abound: The object detector that detects cows on grass but not on a beach, because the training data never included cows on a beach [9]; the urban legend of the tank detector that learns how to recognize weather but not tanks, because all training data with tanks was taken on sunny days and data without tanks on rainy days; the cancer detector that learns whether an image was taken on a mobile or a stationary scanner, picking up on the human judgment of whether the patient was healthy enough to be moved to the scanner [10]. In all these cases, the model found patterns that worked well in the training and test data, but only because the data was collected in a way that was not representative of the target distribution where the model was eventually supposed to be used.
  3. We expect that learned patterns that we think are important for solving the problem, like the capabilities listed above, generalize better. The rationale is that capabilities represent domain knowledge that we already have, not just patterns discovered in a specific dataset that may represent shortcuts in that data from biases in how the data was collected. For example, we expect that detecting the shape of a cow should be a more robust mechanism in an object detector in general than analyzing the image’s background.
  4. So whenever we have a choice between multiple models with similar accuracy, we should pick the model that has better learned relevant capabilities. We might even want to steer our training (e.g., through model structure, inductive biases, or training data augmentation) to learn those capabilities. Note that this approach does not aim to reason only with a fixed list of capabilities, as perhaps in the good old days of symbolic AI. We still learn models from the training data that might find all kinds of patterns, many of which humans would probably never pick up on, but we steer the process to include patterns that give the model some capabilities that we think are important and will make the model more robust.

Testing capabilities

Let’s postpone for a second how we identify capabilities and assume that we already know the capability that we want to test, say understanding negation in sentiment analysis. Again, we are not testing whether a model has learned the capability correctly (i.e., we are not looking for individual counterexamples), but evaluating how well the model fits problems that benefit from the capability.

To test a capability, we curate capability-specific test data, independent from and in addition to the traditional i.i.d. test data that matches the distribution of the training data and is used for general accuracy evaluation. That is, we create test data separately for each capability, such that a model that has the capability will achieve a much higher prediction accuracy on that test data than a model that does not.

There are a couple of different strategies:

  • Domain-specific generators: A very common strategy is to generate test data from templates or with specific strategies that produce examples requiring the capability [1, 3, 5, 8]. For example, to test whether a sentiment analysis model understands negation, the template “I {NEGATION} {POS_VERB} the {THING}.” can be filled automatically with all kinds of negations, verbs, and things to generate many test sentences like “I didn’t love the food” that all include negation and are all expected to have negative sentiment, despite positive words [1]. As another example, a generator can create artificial images with the shape of one object but filled with the texture of another to test an object recognition model’s capability to prefer shape over texture [5]. One typically needs to write one or multiple generators per capability (a minimal sketch of such a generator follows after this list).
Artificially generating a picture with a texture-shape conflict to test a model’s capability to prioritize shape over texture. Example and picture from Geirhos et al. [5].
  • Mutating existing inputs: Instead of generating test inputs from scratch, one can often generate new test inputs by modifying existing ones [3, 6]. For example, to test an NLP model’s capability to understand synonyms, we can replace words in existing labeled sentences with synonyms to create new test data with the same label. For capabilities related to noise and distractions, we can also easily modify inputs to add neutral information to sentences (e.g., add “and false is not true” [3] or random URLs [1]) or automatically introduce typos [3]. Mutations are particularly effective if capabilities can be expressed as invariants (see metamorphic testing) that describe how a model’s prediction should change (or not change) for certain kinds of modifications; the sketch after this list includes a simple typo-based invariance check.
  • Crowd-sourcing test creation: Where we cannot automatically generate realistic tests, we can involve humans in the task [4, 6]. For example, this strategy can be used to test a sentiment analysis model’s capability to understand sarcasm by asking crowd-workers to minimally modify a movie review to flip its sentiment by introducing sarcasm [6]; for an object detector, we could ask crowd-workers to take pictures of an object with different, possibly unexpected backgrounds [4]. In each case, we need to create specific instructions for humans to create test cases that challenge the specific capability we are interested in.
Examples of small changes made by crowd workers to change the sentiment of a sentence (red replaced by blue) that require a model to have capabilities for distinguishing facts from hopes, for detecting sarcasm, and for understanding modifiers. Examples from Kaushik et al. [6]
  • Slicing test data: Finally, we can also search a large pool of test data for instances that are relevant to our capability, for example, all sentences that include negation. If we want to avoid sampling from the original test data, we can also collect previously-unlabeled data in production, identify potentially challenging cases, and then label those as test data for our capability. This strategy of slicing test data is already quite common in testing models to check accuracy and fairness for important subpopulations — it can likely be used for capabilities too.
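
To make the first two strategies concrete, here is a minimal sketch in Python (my own illustration, not code from any of the cited papers): a CheckList-style template generator for the negation capability and a simple typo mutation that serves as a metamorphic invariance check. The predict_sentiment function is a deliberately naive keyword baseline standing in for the model under test.

```python
import itertools
import random

# Deliberately naive stand-in for the model under test: a keyword baseline
# that ignores negation entirely (replace with a call to the real model).
POSITIVE_WORDS = {"love", "like", "enjoy", "great"}

def predict_sentiment(sentence: str) -> str:
    words = sentence.lower().replace(".", "").split()
    return "positive" if POSITIVE_WORDS & set(words) else "negative"

# Strategy 1: template-based generation for the negation capability.
NEGATIONS = ["didn't", "did not", "never"]
POS_VERBS = ["love", "like", "enjoy"]
THINGS = ["food", "service", "movie"]

def negation_tests():
    """Sentences that negate a positive verb; all should be labeled negative."""
    for neg, verb, thing in itertools.product(NEGATIONS, POS_VERBS, THINGS):
        yield f"I {neg} {verb} the {thing}.", "negative"

# Strategy 2: mutating existing inputs for a typo-robustness capability.
rng = random.Random(0)

def add_typo(sentence: str) -> str:
    """Swap two adjacent characters; the expected label should not change."""
    i = rng.randrange(len(sentence) - 1)
    chars = list(sentence)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def typo_invariance_tests(labeled_sentences):
    """Metamorphic tests: a small typo should not flip the expected label."""
    for sentence, label in labeled_sentences:
        yield add_typo(sentence), label

def capability_accuracy(tests):
    """Fraction of capability-specific tests the model predicts correctly."""
    results = [predict_sentiment(sentence) == label for sentence, label in tests]
    return sum(results) / len(results)

# The keyword baseline scores 0.0 on the negation tests: it has not learned
# negation, even if it looks acceptable on i.i.d. test data.
print(capability_accuracy(negation_tests()))
print(capability_accuracy(typo_invariance_tests(
    [("I love the food.", "positive"), ("The movie was awful.", "negative")])))
```

Because the baseline ignores negation entirely, it fails every generated negation test, exactly the kind of capability gap such a suite is meant to surface even when overall accuracy looks fine.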

None of these strategies to curate test data for capabilities is particularly cheap. Writing generators or hiring humans to create or label test data involves significant effort and cost, in addition to the investment already made to acquire the original training and test data. However, as discussed, the hope is that models that do better at these capabilities generalize better beyond the distribution of the training data, with promising results from multiple papers.

Aside: Training vs Testing. Most discussions here have dual insights for training and testing. Just as we can curate test data to test whether a model has learned a specific capability, we can usually use the same process to generate additional training data to drive the training to better learn that capability. This kind of data augmentation is common and goes beyond the i.i.d. assumption just as testing capabilities does, but potentially with the same benefits. To avoid going down yet another rabbit hole, I’m not going into different data augmentation strategies and how they may relate to capabilities, but I expect that thinking in capabilities might be a path not only for strategizing about testing but also about training. Conversely, data augmentation papers [e.g., 6] can be read from a perspective of identifying capabilities and may provide inspiration for testing too.

Identifying capabilities

A final question now remains: How do we identify which capabilities to test? Surprisingly, I have not seen any papers discuss this explicitly. Many papers focus on a single capability or a small number of capabilities for a specific problem and demonstrate the feasibility of testing and detecting better models, but usually without any justification of why this capability was chosen over others. Reading between the lines, there seem to be a few different ideas:

  • Analyzing common mistakes: Most papers seem to focus on specific common mistakes a model makes [3, 4, 5, 8], mostly involving shortcut reasoning, as when a model uses the background of an image to identify an object in the foreground. Some common problems are already well understood by the community and broadly studied, such as NLP models using word overlap rather than understanding a text’s content [8] or models focusing on texture over shape [5], making them clear candidates for capabilities. More systematically, we can carefully analyze a representative sample of mistakes to identify common kinds of problems and map those to capabilities — the Stress Testing paper [3] is a great example of this: The researchers manually classified 100 mistakes made by a model for a natural language inference task, finding that, beyond word overlap, wrong outputs often relate to negation and antonyms, numerical reasoning, ambiguity, and missing real-world knowledge. While some problems like missing real-world knowledge are truly hard for a model (and for humans), others can more readily be captured as testable capabilities the model should have.
Analysis of mistakes of a natural language inference model: X->Y indicates that X was expected but Y was predicted with options being entailment (E, hypothesis is true given the premise), contradiction (C, hypothesis is false given the premise), or neutral (N, truth cannot be determined). Table from Naik et al. [3].
  • Using existing knowledge about the problem: While we typically use machine learning because we do not fully understand how to solve a problem, for many problems we have at least some partial understanding. For example, linguistics was studied long before the rise of deep neural networks, and we know a lot about, say, sentence structures, how words relate, and which parts of a sentence are more important than others. Many capabilities seem to relate directly to theories we already have about a problem, including most of the capabilities for NLP models from the CheckList paper [1], such as synonyms, antonyms, identifying named entities, semantic role labeling, negation, and coreference.
  • Observing humans: Where we do not have domain knowledge, we can study how humans solve a problem. An interesting example is the Learning the difference that makes a difference paper [6], which observed what changes humans make to text when instructed to minimally modify a sentence to change its sentiment (example shown in a figure above). This way, they identified important mechanisms for changing sentiment (often but not necessarily challenging for a model) that can be mapped to capabilities, such as sarcasm and distinguishing facts from wishes.
  • Derived from requirements: Some capabilities correspond to requirements for the model or aspirational goals, typically invariants that should hold for the final model. This includes, in particular, fairness requirements that may not hold in or be observable from the original training data. For example, sentiment analysis shall not differ depending on the gender of actors in a sentence (a minimal sketch of such an invariance test follows after this list).
  • Causal discovery from observational data: A subfield of the machine-learning community focuses on encoding and discovering causal relationships, not just statistical relationships [7]. If we could discover and review causal relationships, many may map well to capabilities.
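
As a minimal sketch of such a requirement-derived invariant (my own illustration; the word list, the swap_gender helper, and the stand-in model are all hypothetical and far cruder than a real fairness test would be):

```python
# Requirement-derived invariance test: swapping gendered words in a sentence
# should not change the predicted sentiment. The swap list is deliberately
# crude; a real test would use a curated lexicon and handle grammar properly.
GENDER_SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him",
                "actor": "actress", "actress": "actor"}

def swap_gender(sentence: str) -> str:
    return " ".join(GENDER_SWAPS.get(word.lower(), word) for word in sentence.split())

def gender_invariance_violations(model, sentences):
    """Return the sentences for which the prediction flips after the swap."""
    return [s for s in sentences if model(s) != model(swap_gender(s))]

# Stand-in model for illustration only: it keys on the pronoun "he", so the
# first two sentences are flagged as violations of the fairness requirement.
biased_model = lambda s: "positive" if "he" in s.lower().split() else "negative"
sentences = ["he liked the plot", "she liked the plot", "the plot was dull"]
print(gender_invariance_violations(biased_model, sentences))
```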

While I have not seen anybody explicitly provide guidance on how to identify capabilities for actual quality assurance or model development activities, it seems like a careful analysis of common problems, of knowledge about the problem and non-ML solutions, and of requirements can be used to identify capabilities for many problems.

If we look one final time at the example of the cancer detection model, we find that we could use many of these strategies: We might find that existing solutions perform poorly when brightness is not calibrated across multiple scanners (analyzing model mistakes) and hence identify the capability that the model should be robust to different brightness levels, which we can translate into test data to identify those models that have learned this capability. We might ask radiologists why they disagree with a model to learn more about what capabilities experts use that the model may be missing (observing humans). Additionally, we should probably also look into pre-deep-learning literature, say at the strategies used in earlier medical imaging research in the 90s, which typically made use of specific insights about the problem as part of handcrafted mathematical models. Furthermore, non-machine-learning literature on cancer diagnosis, such as empirical studies of radiologists or training material for radiologists, will likely also identify capabilities that radiologists use when looking for cancer (existing knowledge).
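
For the brightness-robustness capability specifically, a test could perturb each image and check that predictions remain stable. Here is a minimal sketch (my own illustration, assuming images as NumPy arrays with values in [0, 1]; the predict function and the shortcut_predict stand-in are hypothetical):

```python
import numpy as np

def adjust_brightness(image: np.ndarray, factor: float) -> np.ndarray:
    """Scale pixel intensities, clipping to the valid [0, 1] range."""
    return np.clip(image * factor, 0.0, 1.0)

def brightness_robustness(predict, images, factors=(0.7, 0.9, 1.1, 1.3)):
    """Fraction of images whose prediction stays stable across brightness levels."""
    stable = 0
    for image in images:
        baseline = predict(image)
        if all(predict(adjust_brightness(image, f)) == baseline for f in factors):
            stable += 1
    return stable / len(images)

# Stand-in predictor for illustration only: it keys on overall image brightness,
# exactly the kind of shortcut this capability test is meant to expose.
shortcut_predict = lambda image: "cancer" if image.mean() > 0.5 else "healthy"

rng = np.random.default_rng(0)
scans = [rng.random((64, 64)) for _ in range(20)]
print(brightness_robustness(shortcut_predict, scans))  # near 0.0 for the shortcut model
```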

To wrap this up, let me now come back to unit testing: Identifying capabilities and then creating test data for them is not that different from selecting inputs when creating unit tests. While we do not have a strong specification for machine-learning problems, we have some knowledge about the problem and past mistakes. Identifying capabilities is not unlike selecting test inputs for a program without looking at its implementation, known as black-box testing or specification-based testing. Identifying capabilities from common mistakes and developing tests to ensure that they are usually avoided in future models is similar to regression testing, where developers add test cases to avoid breaking existing functionality with future changes.

Readings

[1] Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList.” In Proceedings of ACL, pp. 4902–4912 (2020).

[2] D’Amour, Alexander, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen et al. “Underspecification presents challenges for credibility in modern machine learning.” arXiv preprint arXiv:2011.03395 (2020).

[3] Naik, Aakanksha, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. “Stress test evaluation for natural language inference.” In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2340–2353 (2018).

[4] Barbu, Andrei, David Mayo, Julian Alverio, William Luo, Christopher Wang, Danny Gutfreund, Joshua Tenenbaum, and Boris Katz. “ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models.” In Proc. NeurIPS (2019).

[5] Geirhos, Robert, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.” In Proc. International Conference on Learning Representations (ICLR), (2019).

[6] Kaushik, Divyansh, Eduard Hovy, and Zachary C. Lipton. “Learning the difference that makes a difference with counterfactually-augmented data.” In Proc. International Conference on Learning Representations (ICLR), (2020).

[7] Schölkopf, Bernhard. “Causality for machine learning.” arXiv preprint arXiv:1911.10500 (2019).

[8] McCoy, R. Thomas, Ellie Pavlick, and Tal Linzen. “Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference.” Proc. ACL (2019).

[9] Beery, Sara, Grant Van Horn, and Pietro Perona. “Recognition in terra incognita.” In Proceedings of the European Conference on Computer Vision (ECCV), pp. 456–473. 2018.

[10] Agrawal, Ajay, Joshua Gans, and Avi Goldfarb. Prediction machines: the simple economics of artificial intelligence. Harvard Business Press, 2018.

PS: A word on naming. The idea of testing capabilities is implicit in many papers and discussions, but rarely ever deliberately named as a strategy. I mostly found the papers cited here by following up on references in other papers. Naik et al. [3] introduced the idea of testing capabilities as “stress testing a model,” even citing some software engineering textbooks; the Underspecification paper [2] adopts this term. I find “stress testing” misleading though: Stress testing usually refers to testing performance and error handling under heavy load and randomness, and tends to be less structured. The CheckList paper [1] introduces the term “testing capabilities,” which I much prefer. The concept of “capabilities” suggests that there are multiple distinct characteristics that should be tested, without causing confusion with “requirements” and “specifications” that we often do not have for machine-learning problems. In addition, I think black-box unit testing is a useful analogy generally, since we are testing a single unit of the system (the model) and we do so in a somewhat systematic way without knowing the model internals, one capability at a time.
