Covariate Shift in Malware Classification
Classifiers in rapidly changing environments need to go beyond generalization
Introduction
Covariate shift in the data-generating distribution lowers the usefulness of generalization error as a measure of model performance. By unpacking the definitions, the previous sentence translates to “high accuracy on the distribution of files we sampled our training set from (excluding the training set itself) does not always mean high accuracy on other important distributions (even if the true relationship between a sample and its label does not change).” A fruit classifier that had a single example of a banana in its training set might perform poorly on a sample set consisting solely of 10,000 variations of bananas, even if it did very well on a held-out test set sampled from the same distribution as its training set (where presumably bananas are scarce). This performance gap can be a problem because approximating generalization error with a held-out test set is the canonical way to measure a model’s performance.
This concept is essential for malware classification, and many papers discuss it in the context of sampling bias, where the samples used for training are not randomly sampled from the same distribution as the samples used for testing. Anecdotally, the intuition that models with good generalization may have poor performance once deployed — due to the difference between the files the model was trained on and the files it scans — is held by many malware analysts I’ve worked with, including those with little ML experience. However, it’s not always clear how to identify and fix issues related to covariate shift: conventional regularization techniques that improve generalization may not help when good generalization isn’t enough. After spending 5 years reverse engineering and classifying malware (using traditional, non-ML based techniques), I developed lots of intuitions about malware classification that I wasn’t able to express until I grew a passion for machine learning. My goal with this blog is to explain these opinions in the more formal context of machine learning. As some of these attempts at formal explanations are imprecise (and possibly incorrect), another goal of this blog is to solicit discussion, feedback, and corrections. I pack in a lot of different concepts and background because I feel relating concepts to each other helps improve understanding, especially if they can be tied back to practical goals like malware classification.
I hope that anyone interested in covariate shift can benefit from this blog, but the target audience is a person with some experience in machine learning who wants to apply it to computer security-related tasks, particularly “static” malware classification (where we want to classify a file given its bytes on disk, without running it). One of the key conclusions that I hope this blog supports is the importance of incorporating prior knowledge into a malware classification model. I believe prior knowledge is particularly relevant when discussing the ways a model might represent its input before the classification step. For example, I detail some domain knowledge that I believe is relevant to CNN malware classification models in this article: https://towardsdatascience.com/malware-analysis-with-visual-pattern-recognition-5a4d087c9d26#35f7-d78fef5d7d34
I’ve included an appendix that provides additional background by going over some core concepts used in the main section (such as inductive bias, generalization, and covariate shift)
Covariate Shift in Malware Classification
I do not think models that demonstrate good generalization on large sample sets are useless. If a model does not perform well on the distribution it was trained on, that’s a problem. There is plenty of empirical evidence in the security industry that a model that generalizes well without being super robust to changes in the input distribution can still provide value, especially if retrained frequently and targeted at specific customer environments. However, at this point, many models used in the industry demonstrate high accuracy on held-out test sets. I believe improving performance on samples from important new distributions, such as files created a month after the model was trained, has a more practical impact than making tiny marginal improvements in test set accuracy. The next few sections go over why these new distributions are essential and why they are an example of “covariate shift.” If you’re new to the concept of covariate shift, I’d recommend going through the background section in the appendix.
Why p(x) changes
In the case of malware classification, p(x) is the distribution of binary files analyzed by the model. Practical malware classifiers are never applied to the distribution they are trained on; they need to perform well on sets of samples in production environments that are continually evolving. For many antivirus vendors, that means performing well on files present on every machine with that antivirus, as well as samples submitted to other services like VirusTotal and antivirus review tests. There are two main reasons why generalization performance may not be a good predictor of how well a model performs on these samples:
Sample selection bias: It is rare in the security industry to train models on a random selection of files that the model encounters when deployed. This is because accurate labeling is expensive: it often requires some level of manual human analysis, and it requires having access to all scanned files. It is normal to include files that analysts have labeled for other reasons, such as for the direct benefit of customers or files that have been supplied and labeled by third parties. If human analysis isn’t involved, there is a danger of feedback loops between multiple automated analysis systems. Because we can’t feasibly train the model on a truly random sample of the target distribution, the sample selection process is biased. Technically we can get around this by randomly sampling files our scanner encounters and labeling them with human analysts, but this would result in a small sample set of mostly clean files. I refer to the distribution of files that are encountered by the scanner after the model has been deployed as the “in-production distribution.”
Environment shift: Every time you copy, edit, update, download, or delete a file, you modify the distribution your malware scanner encounters. When summed across every machine, the shift in distribution can be significant enough to mean a classifier that performed well last week may degrade significantly this week. Malware classifiers also exist in an adversarial setting, where attackers make deliberate modifications to the input to avoid the positive classification. A typical strategy to mitigate this issue is constant retraining, but it is impossible to ever train on the actual in-production distribution (even if we did have labels for them) unless we never scan files created after we gathered our training set. This scenario produces a type of sample selection bias that is impractical to overcome: we cannot train on samples that will be created in the future.
Why p(y|x) does not change
Covariate shift requires that p(y=malicious|x=file) does not change when p(x) does (see appendix). In the case of malware classification, p(y = malicious|x) is the probability a given file x is malicious. If p(y|x) and p(x) both change, it may be impossible to maintain accuracy. I can’t prove p(y|x) won’t change, but I argue that changes are generally, for certain classes of models, independent of p(x). Here p refers to the true data-generating distributions rather than a probabilistic classifier’s attempt to approximate p(y|x).
There are two main reasons why p(y=malicious|x) would be anything other than 1 or 0:
A. Uncertainty in the input: y is a deterministic function of some representation x’, but the input x the model uses does not contain all the relevant parts of x’. This concept is touched upon in the “deterministic vs stochastic classification” section in the appendix, but it’s a critical case, so I’ll expand on it in the next section. I use x’ to refer to inputs that can be mapped deterministically to a label and x to refer inputs that may or may not be deterministically mappable to a label.
B. Uncertainty in the label: The definition of y does not fully account for all inputs (grayware is an example in malware classification). I’ll leave a full explanation to the appendix section “Why p(y|x) does not change: Uncertainty in the label.”
Uncertainty in the input
In this part, I argue that the bytes in a file correspond to x’.
Here’s a simple example. I precisely define ‘eight’ to be a specific shape shown in Figure 1
and ‘three’ to be the shape shown in Figure 2
If we have access to the entire input (x’ = Figure 1 or Figure 2), we know with perfect certainty what the label should be. However, we may not have access to the full input as illustrated below, where some of the input is hidden with a red box:
In this case, our input x (Figure 3) maps stochastically (Figure 4) to two different x’ (Figure 1,2), which map deterministically to the y values ‘eight’ and ‘three’. I’ll refer to the mapping function as g, in this case, g adds a red box that hides the difference between an eight and three. I refer to the function that maps x’ to y as f_true, which is effectively a perfect classifier that avoids randomness by having access to all relevant information. See the ‘deterministic vs stochastic classification’ section in the appendix for more details.
If the training set contains 90 copies of Figure 3 labeled as “3” and 10 copies labeled as “8” and we have a uniform prior, then p(y=8|Figure 3) = 0.1 and p(y=3|Figure 3) = 0.9. If the test set is sampled from the same distribution, it is likely to have a similar ratio of 3’s to 8’s, so our model will generalize well. However, a new-distribution test set might have 8’s appear 9 times as much as 3’s on average. If we tested on a sample from that distribution, we would probably get abysmal performance. Unless we can un-hide some of the data, it may not be feasible to have good generalization error and new-distribution error. In this blog, I use new-distribution error to refer to the equivalent of generalization error when the model is applied to a distribution different than the one its training set was sampled from (more on this in the covariate shift section of the appendix). Practically, attempting to find the new-distribution error for the in-production distribution is likely a better measure of performance for a malware classifier than the generalization error. Figure 5 illustrates this example. A key takeaway is that when we transition from population 1 to population 2, we undergo covariate shift when considering our raw sample (figure 2 in both cases) and p(y|x) defined by f_true. However, the type of shift we encounter when we consider an intermediate representation that hides relevant information (figure 3) isn’t covariate shift: it’s worse than covariate shift. This is because p(y=label “three”|x=figure 3) goes from 90% in population 1 to 10% in population 2.
In summary: If information is hidden (input is g(x’)) then p(y|x) may change with p(x) because the change to p(x) is caused by the underlying distribution p(x’) changing. This change in p(y|x) can occur when there’s sampling bias or environment shift, which are both common in malware classification.
As an example, say your dataset consisted of many recent malicious samples, along with clean samples from a wide range of dates. Given that malicious samples change more frequently than clean samples, this is a common situation. Say the model uses the (PE) file’s timestamp value to classify the files (g hides all information except the timestamp). This feature works equally well for the training set and test sets sampled from the same distribution, as in both cases, malware tends to be more recent than clean files. The feature technically generalizes well, but any classifier that is based entirely on timestamp will not perform well when a dataset contains recent clean files, such as on any machine that just had a software update. To relate this to the deterministic vs stochastic classification section of the appendix, y=f_ideal(x = PE Timestamp) is very different from y=f_true(x’ = all bytes in a file), where x = g(x’) and g is a feature extractor that extracts the PE file’s timestamp. Note this assumes the malicious samples do not make deliberate modifications to the PE timestamp, which can happen.
Required assumption to keep p(y|x) constant
The definition of malware is precise, unchanging, and is dependent solely on the bytes of a file (x’), so that y is a deterministic function of x: p(y = malicious|x’) = 1 or 0 = f_true(x’ = all bytes in a file). Here’s another way to think about that: given that the space of files that can be practically represented is finite (though absurdly large), we can give a label to every point in that space. This assumption requires that we consider a file in the context of its dependencies, as some types of malware have multiple components.
In the next part, I discuss ways to try and determine if an approach is more or less vulnerable to changes in input distribution. There is always some vulnerability, as our classifiers need dimensionality reduction, and we don’t have dimensionality reduction techniques that won’t lose at least some information relevant to the deterministic mapping from x’ to y (f_true).
Mitigating Issues Caused By Covariate Shift
After making some assumptions, the previous section established that covariate shift is an ok way to describe what happens when we consider all the bytes in a file and the distribution of files changes. Unlike scenarios where both p(x) and p(y|x) change, we have a path to creating a classifier that works: find a function as close as possible to f_true. Here’s another way to look at that: probabilistic classifiers attempt to find p(y=label|x’=bytes in a file). The data generating distribution has a p(y|x’) = f_true which does not change when p(x’) changes. If we find f_true, then our model’s p(y|x’) will not change with changes in p(x’).
An alternative might be trying to predict precisely where the input distribution will shift, but I’m skeptical of how well that would work in a highly dynamic, adversarial setting. In the fruit classifier example from the introduction, trying to model f_true would mean creating a model that understood the definition of each fruit and was able to compare any input to that definition: this approach would work when the dataset shifts to mostly bananas. Predicting where the input distribution will shift might means putting a higher emphasis on a specific class during training. If we picked a class other than bananas, our model would still perform poorly after the input distribution shifts to mostly bananas.
What makes a model vulnerable to covariate shift, and how can that be fixed?
Hypothetically a new-distribution dataset sampled randomly from the files scanned by a deployed model could be used to measure relevant new-distribution error. Receiving more false-positive reports than expected given the test set error may also hint at high new-distribution error. However, as perfect generalization does not guarantee robustness to changes in the input distribution (see the PE timestamp example in the previous part), the strategies that can be used to increase generalization such as common regularization techniques will not help past a certain point (though if our model doesn’t generalize that’s bad news for new-distribution performance). We need to find a new knob to turn to improve the new-distribution error once we have a low generalization error. I think that one of the best ways to do that is to find ways to incorporate prior knowledge to get the best inductive bias. There are several ways to restrict the hypothesis space of our learner to exclude models that generalize well but perform poorly on the new-distribution test set. In the example where the timestamp is used as a feature to produce a model that generalizes but is not robust to changes in the input distribution, excluding hypotheses that rely on the PE timestamp may help reduce new-distribution error. This exclusion is based on the prior knowledge that clean and malicious files can both be new or old. For PE files, there are a huge number of features that may need to get excluded, and features we want to keep may be difficult to compute (though it may be possible to use neural networks as feature extractors for more complicated functions).
In the extreme, if we have prior knowledge of the function y = f_true(x’) that maps x’ to y across the entire domain of x’, then we can exclude all hypotheses that are not f_true(x’). While this function is generally not practical to find, it is required for malware analysts to be able to mentally approximate f_true to be able to classify all files they see, even if they haven’t seen similar files before. It helps that in practice, we don’t need to consider the whole domain of possible files, just all files created by modern humans and our tools.
We can (and probably should) incorporate inductive bias at every step: what data to use, how to preprocess it, what model to use, how to pick architecture, features, hyperparameters, training method, etc. An example of each of these is beyond the scope of this blog; instead, I’m going to focus on feature extraction, which is one of the most popular ways to introduce inductive bias.
Achieving robustness to changes in the input distribution
Feature engineering is a crucial step for many types of models as features contribute directly to the inductive bias of the model: if features exclude parts of the input data, the model cannot use that data as part of the learned hypothesis. Transforming the data into a lower-dimensional set of features is also very important, as one of the main challenges in finding f_true(x’) is that x’ is very high dimensional: a major problem for many types of classifiers. Hand engineered features are common for some models like random forests, while other models like convolutional neural networks automatically extract features. For both hand-engineered and automatically extracted features, models tend to follow the architecture shown in Figure 6:
Each box depends entirely on the previous box. The first three boxes contain groups of features or a ‘representation.’ Boxes toward the right cannot contain more information about the input than boxes to their left. Typically, the goal of an architecture like this is to throw away information that is irrelevant to the classification. Less irrelevant information allows the classifier to deal with a lower-dimensional representation than x’ while maintaining performance. Specifics depend on the model, with neural networks having many more layers of representation than decision tree-based models, for example. The critical detail is that each representation tends to have less information about the input than the previous representation, and if the information lost is relevant to f_true(x’), then all downstream portions of the model have to deal with the ‘Uncertain input’ case, where I argued that p(y|x) could change with p(x). Therefore, we want to ensure the lower-dimensional representation still contains enough information to compute f_true(x’) (see Figure 7), even though it contains less information about the original input overall. Unfortunately, we usually can’t achieve this in practice, so when considering a model’s intermediate representation of the input, we’re no longer dealing with covariate shift. That said, we can still work to mitigate the amount p(y|x=intermediate representation) changes with p(x).
As discussed in the paper “Why does deep and cheap learning work well”, this concept is related to the data processing inequality:
>= is replaced by = if g(x) is a sufficient statistic for y. Mutual information (I) is distribution-specific, so even if g(x) is a sufficient statistic for y given p(x), it may not be when the distribution shifts to p’(x). To be perfectly robust to change in the input distribution, we need y to be a deterministic function of g(x). If that’s the case, g(x) is a sufficient statistic for y for all distributions over x. We already established that this is the case for the data-generating distribution: covariate shift implies that p(y|x’) does not change with p(x’).
Classification is discussed in ‘deep and cheap learning’ paper as the reverse of a Markov chain-based generation procedure, which makes intuitive sense for malware classification: the author(s) of a file will know if the functionality they programmed is malicious or not as an early step in the creation process. An example of reversing the process for malware is shown in Figure 8:
There are shortcuts analysts use to speed up classification, but this diagram is not too far off from the engineering and reverse engineering that goes into malware creation and human analysis (if the analyst wants the most accurate result and has a lot of time). Unfortunately, each of these steps is complicated, and creating a robust function to transition between them is probably impractical today — malware packers are still the default way to hide from antivirus products. We need an alternative, but it helps to keep this roughly idealized Markov chain in mind: note that at no step in the classification process is information relevant to malicious functionality lost. In a later blog, I may discuss alternatives inspired by the shortcuts malware analysts use to speed up classification, with a focus on malware packers.
The key takeaway is that if a representation the model uses at any point does not contain information directly related to the file being malicious or not, the model may be more vulnerable to changes in the input distribution. For example, some models in the security industry primarily use values extracted from the PE header as features. However, this representation excludes a lot of information related to f_true(x’), as prior knowledge tells us PE header values do not give much indication about a file’s behavior, which is required to determine if the file is malicious. This is compounded by the fact that the default way to evade antivirus signatures — custom malware packers — can set the PE header to pretty much whatever they want. A model that depends entirely on PE header values may do a good job detecting files packed with malware packers present in its training set, but will likely do worse when presented with a new malware packer. A representation that includes a file’s possible behaviors, however, would not throw away information related to f_true(x’). This is difficult to do in the context of static detection (detection without executing the file); however, we can still use prior knowledge to try and extract features that are more closely related to f_true(x’) than PE header data. An example of this is a feature that detects obfuscated data, as obfuscation is often applied to make antivirus detection more difficult: something only malware needs to do.
The problem of throwing away information relevant to the data-generating distribution’s p(y|x’) was also illustrated back in figure 5: by hiding important data with a function g, a model that’s built on top of g’s representation will often have the relation between g(x’) and the label change when the raw input distribution p(x’) changes.
Conclusion/Future Work
Given that there are a lot of tools focused on improving generalization, it may be helpful to think about how we can tie generalization error to improving robustness to changes in the input distribution. If our samples are represented in a way that makes malicious files from a new distribution appear the same as malicious files from our original distribution, lowering generalization error may be all that’s necessary. I plan to discuss some ways to approach robustness to changes in input distribution in another blog, with a focus on representation. One example is using convolutional neural networks to specifically detect visual features in a file that human analysts would immediately recognize as malicious across any distribution of files, but for which it would be difficult to manually program heuristics. I would also like to find good methods for measuring new-distribution error for better model comparison. Some distributions are clearly better than others for measuring new-distribution error, and the practical problem of finding the true label of all in-production samples still prevents that distribution from being used in practice. For example, an alternative could be finding the true labels for a small subset (<1000) of samples from the in-production distribution and calculating accuracy (or precision) along with a confidence interval that accounts for the limited sample set. Precision is often easier to work within the security industry due to difficulty in finding in-production false negatives in an unbiased way (due to the number of true negatives), and the fact that false positives on average negatively impact more people than false negatives (clean files are much more likely to appear on more than one machine). I would also like to consider adversarial machine learning in the context of covariate shift.
In summary: In any situation where our data generating distribution undergoes covariate shift, we can potentially find a model that is robust to changes in the input distribution p(x’). One approach to this is to use domain knowledge to ensure that each intermediate representation utilized by our model does not throw away information relevant to the true relationship between p(y|x)=f_true.
*****************************************************************
Appendix
Background
In this part, I go over some relevant background concepts. Here’s a quick summary:
Training error measures how poorly a model performs on its training data, and out-of-sample (generalization) error measures how poorly a model performs on samples from the same distribution as the training data. New-distribution error measures how poorly a model performs on samples taken from a different distribution than the training data.
Deterministic vs Stochastic Classification
Given input space X and label space Y, the most desirable outcome of classification is often to find a function f_ideal that perfectly maps inputs to labels: y = f_ideal(x) (where x ∈ X, y ∈ Y) for our input distribution. This function would correspond to a Bayes optimal classifier with a Bayes error of 0. Unless we can somehow prove that we found f_ideal, we can only use a hypothesis function h that approximates it. I’ll discuss this more later in the context of measuring errors.
In many cases, the relation between Y and X isn’t deterministic: for the same x, there may be different possible y values. In that case, instead of y = f_ideal(x) we have a probability distribution where y~p(y|x). If this conditional distribution is Bernoulli — such as malicious (y=1) vs clean (y=0) for x=input file — we can have our classifier output 1 if p(y=1|x) > 0.5 and 0 otherwise. This produces the Bayes optimal classifier, but the Bayes error is not 0 due to irreducible uncertainty. A common explanation for why the same x has different y values is because we don’t have enough information. That is, x = g(x’) where x’ is the data we would need to have to compute y deterministically. g is a function that potentially hides information about x’: a lossy compressor, for example. This scenario implies there is a function f_true such that y = f_true(x’). The key difference between f_true and a f_ideal is that f_true is perfectly accurate across the entire input domain rather than a single distribution of inputs. In this case, we have an underlying distribution p(x’) where we sample x’ from and then convert to x using the deterministic, many-to-one function g. The reason the same x may have different y values is because a single sample of x represents multiple samples of x’ which have different y values.
A trivial example is if we want to determine the average color (y) of an image (x’), but we’re only given (with g) a subset (x) of the pixels. In some cases, we have no access to x’, just x, for example, recommender systems may need to simulate a person’s brain (x’) to be 100% certain what to recommend, so they use a collection of behavioral features instead (x). However, in other cases, we do have access to x’, but we create g as part of our model because x’ has too many dimensions for our classifier to handle. This scenario is central to my understanding of what can make malware classifiers vulnerable to changes in the input distribution, and much of this blog expands on it. I use x’ in cases where I believe the input deterministically determines the output; otherwise, I’ll use x for situations where the relation between x and y may or may not be deterministic. Note we can only learn the Bayes optimal classifier if our inductive biases do not exclude it from our hypothesis space; otherwise, we want to learn a function that approximates it as well as possible given the restrictions.
A review of inductive bias
As one of the most important topics in machine learning, inductive bias is relevant to most machine learning related discussions, but it is essential for this blog. Inductive bias refers to the set of assumptions a learner makes about the relationship between its inputs and outputs, which serve to prevent the learner from forming some (hopefully undesirable) hypotheses. All models we use today have some level of inductive bias, which is required to be useful. However, if the assumptions do not adequately match the problem, the model may not work well. Some views of conventional regularization methods frame them as an inductive bias for simple functions (Occam’s razor).
For example:
- If a learner assumes a simple relationship between input and output when the actual relationship is complex, it may lead to underfitting (poor performance on training data)
- If a learner assumes a complex relationship when the actual relationship is simple, it may lead to overfitting (poor performance on held-out test data)
- If a learner assumes a relationship holds across changes in distribution when it doesn’t, it may make the model vulnerable to changes in the input distribution (poor performance on new-distribution test set)
In the next section, I’ll review generalization: the final prerequisite for a discussion of covariate shift.
A review of generalization
Generalization is a fundamental concept in machine learning, but to understand a model’s performance under covariate shift it helps to know how covariate shift is different from performance on an out-of-sample test set (a common way to estimate generalization).
Classification models are usually created by training a model on a set of labeled samples. Given this data, it is trivial to create a (non-parametric) model that performs flawlessly on its training data: for each input that is in the dataset, return the label for that input (memorization). For each input that is not in the dataset, return some arbitrary label. While this model performs perfectly well on the training data, it probably won’t do better than random guessing on any other data. Given we usually want to label other data, we need a way to distinguish between two models that have perfect training accuracy, but different accuracy on unseen data.
Informally, generalization usually refers to the ability of a model to perform well on unseen inputs from the same distribution as the training data. For example, if we train two models on photographs of cats and dogs, and both have excellent performance on the training data, we will usually want to use the one that performs best on photographs of cats and dogs that weren’t in the training set. We can (probably) find which model this is by comparing performance on an “out-of-sample” test set we randomly selected to be held out from the training process. Using a random held-out test set ensures that both the test set and the training set were sampled from the same distribution.
However, just like how two models with the same training accuracy might not be equal, two models with the same accuracy on an out-of-sample test set may not be equal. For example, if one of the models can correctly classify images of cats and dogs drawn by humans in addition to photographs, even though we only trained it on photographs, it may be preferable to a model that can only classify photographs. This evaluation criterion also applies to samples designed to produce an incorrect classification (adversarial examples): if the model is not trained on any adversarial examples, good generalization does not imply that it will properly classify new examples that have been adversarially modified.
Malware packers and generalization
Bypassing malware classifiers with adversarial machine learning is in some cases overkill: creating a new malware packer (that produces samples with extremely low probability in the model’s training distribution) often achieves the same result (a FN) as adversarial machine learning for many types of malware classifiers that generalize well. Attackers have been creating new malware packers for years to avoid algorithmic detections employed by traditional antivirus software. Malware packers hide a malicious payload behind one or more layers of encryption/obfuscation, leaving just an executable decryption stub available for static analysis. The stub, along with other file attributes, can be made to have very little correlation with previous malware packers. As a result, any payload hidden by that packer will, by most standard measures, be dissimilar to files in any train or test set existing models have utilized. As a consequence, malware classifiers that generalize well do not require attackers to change their high-level strategy for avoiding detection: they can keep pumping out new malware packers to hide from both traditional antivirus techniques and ML-based static detection. A malware classifier that is somewhat robust to changes in the input distribution (in this case caused by the creation of new packers) could be much more difficult to bypass. Note malware packers are different from commercial packers such as UPX, which aren’t designed to avoid detection by antivirus software.
Covariate shift
In the case of generalization, we take both the training and out-of-sample sets from the same data-generating distribution p(x, y)=p(y|x)p(x) over inputs and labels. Covariate shift is typically defined as a scenario where the input distribution p(x) changes, but p(y|x) does not. Cases where p(y|x) changes, but p(x) does not are referred to as concept shift. If the data-generating p(y|x) changes, statistical models that approximated the original p(y|x) will not be as effective.
Here’s an example: Say we have a training set that consists of a single example, x = 5.02, with a label y = 10.04 for a regression problem. If our inductive bias is that there is a linear relationship between x and y, we might get the model y = 2x. If that is the actual relationship, this model works for the entire domain of x. If the distribution of x is a Gaussian with unit variance and 5.0 mean, our test set might contain the example x = 4.4, y = 8.8, and we’d get perfect test accuracy. However, the model still works even if we change the distribution of x to be a uniform distribution between 1 billion and 2 billion. The model is robust to changes in the input distribution, achieved by finding y = f_true(x). However, we’re out luck if the relation between y and x (p(y|x) in the stochastic case) changes in addition to p(x): say in one data set x=5.02 produces y=10.04 consistently, and in another x=5.02 produces y=20 consistently. Therefore it’s important that p(y|x) does not change.
In some cases, it can work to think of ‘robustness to changes in the input distribution’ as an extension of generalization. Take, for example, 3 different datasets. The first is a training set, the second is a test set from the same distribution as the training set (used to test generalization), and the third is a new-distribution test set, where the members are from the same domain as the other two datasets but sampled using a different distribution (used to test robustness to changes in the input distribution).
- If a learner’s inductive bias causes it to underfit the training data, it will (probably) perform poorly on all 3 sets.
- If a learner’s inductive bias causes it to overfit the training data, it will (probably) perform poorly on the test set and new-distribution test set.
- If a learner’s inductive bias allows it to perform well on the distribution its training data was sampled from and no other distributions, it will (probably) perform poorly on the new-distribution test set.
I’ll try and relate generalization and robustness to changes in the input distribution a bit more mathematically here. I loosely took the formulation of generalization used here from https://en.wikipedia.org/wiki/Generalization_error
V is defined as our loss function, h_n is our classifier trained on n training samples. x is the input, for example, binary files, and y is the label, for example, clean/malicious.
Training error can be calculated by averaging the loss function over the training samples:
Generalization error is calculated by taking the expectation of the loss function over the entire input distribution:
Generalization gap or sampling error is calculated as the difference between training and generalization error:
Given that for covariate shift p(y|x) is constant and p(x) changes to a new distribution p’(x), we can rewrite I_G[h_n] to create a “new-distribution error”:
Generalization gap is formulated as the difference between training and generalization error, so it may also be helpful to express a distribution gap as the difference between I_R[h_n] and generalization error. This difference can be calculated with simple subtraction like in https://arxiv.org/abs/1902.10811 (but reversed for consistency): I_R[h_n]-I_G[h_n]. If the new distribution is harder to classify, then the gap is positive; if it is easier, the gap is negative, and if it’s the same difficulty, the gap is 0. A significant difference between the distribution gap and generalization gap is that the generalization gap is always positive, but with a new distribution, it may be that all the samples are in ‘easy’ regions for the classifier. Because training error is optimistically biased, it is very likely for training data to be in the ‘easy’ region for a classifier trained on it, keeping the generalization gap greater than or equal to 0.
If the distributions aren’t significantly different, it may be informative to examine samples that are likely in the new distribution but unlikely in the original to provide a better test of robustness to changes in the input distribution. This next part is very informal, but one approach could be to change p’(x) to remove probability mass from areas that are likely under p(x) so that we are only evaluating samples that are more likely under p’(x) than p(x). This approach gives an error that is more focused on the samples that have become more likely in the new distribution. Here’s a pdf p’’ that subtracts p(x) from p’(x) (max prevents negative probability density, and the denominator keeps the total area at 1)
This equation may not be intractable, but my assumption would be using a new distribution test set sampled from p’(x) could help just like using an out-of-sample test set helps measure generalization error. It may take prior knowledge to confirm if a sample from p’(x) was also likely in p(x). For example, to better approximate R[h_n] rather than I_R[h_n] it may be worth focusing on the error of samples that are confirmed to be new malware families that had 0 probability in p(x). In the photo-vs-hand drawn case (from the ‘brief review of generalization’ section), it should be safe to assume that natural images of cats and dogs are rare when sampling hand-drawn images (with the exception of photorealistic rendering), so samples from p’’(x) (images of hand-drawn cats and dogs excluding those similar natural images of cats and dogs) should be similar to samples from p’(x) (images of hand-drawn cats and dogs).
Why p(y|x) does not change: Uncertainty in the label
Figure 9 shows a simple example of case B from the ‘Why p(y|x) does not change’ section. In this case, we have access to the entire input, but we define the label ‘eight’ to apply to anything that looks like an eight, and the label ‘three’ to apply to anything that looks like a three.
Uncertainty arises because the input in the top left of figure 9 looks like both an eight and a three.
Often there is a spectrum of certainty, where examples near the middle may be more vulnerable to changing definitions, as shown in Figure 10:
Because the definition of malware is not precise, the same type of uncertainty can arise. A parallel example to Figure 10 for malware would look something like this:
Due to uncertainty in definitions, files with behavior closer to the middle of the spectrum (sometimes called Potential Unwanted Applications or Grayware) can be more challenging to classify. Utilizing ‘soft labels’ — labels above 0 and below 1 — is one way to deal with uncertainty in the label definition.
The definition of what is malicious may change, which implies a change in p(y|x). However, there are a few reasons why I think we can disregard this in practice:
1. Relating to the ‘why p(x) changes’ section, the two main causes of p(x) changing — sample selection bias and environment shift — should not change our definitions.
2. Practically, we care most about the cases on the extreme ends of the spectrum where it is unlikely for changing definitions to have an impact. A file specifically designed to exfiltrate a user’s credit card info without their knowledge or permission will always be considered malicious, and an essential operating system file will always be considered clean. In the industry, changes in what we consider worth detecting is generally caused by debates around whether a certain behavior is clean or “potentially unwanted”
Obviously some future programs could become very difficult to classify, possibly requiring ethical philosophy to solve: for example p(y=malicious|x=code for an AGI).