In a recent article, we introduced the concept of a high probability lower bound (HPLB) of the TV distance, based on our article on arXiv:
https://arxiv.org/abs/2005.06006,
joint work with Loris Michel.
In the current article, we dive into a detailed treatment of this topic and on the way touch upon some very useful (and beautiful) statistical concepts. In particular, we will need to draw circles without replacement from an urn containing m circles and n squares. Many thanks to Maybritt Schillinger for a wealth of constructive comments!
Outline
For some time now it has been known that powerful classifiers (such as the Random Forest classifier) can be used in two-sample or A/B testing, as explained here: We observe two independent groups of samples, one coming from a distribution P (e.g., blood pressure before treatment) and one coming from a distribution Q (e.g., blood pressure of an independent group of people, after treatment), and we want to test H_0: P=Q. Given these two sets of data, we give one a label of 0 and the other a label of 1, train a classifier, and then evaluate this classifier on some independent data. It then seems intuitive that the better the classifier can differentiate the two groups, the more evidence against the Null there is. This can be made formal, leading to a valid p-value and to a rejection decision when the p-value is smaller than a prespecified alpha.
This is nice because today's classifiers are powerful, and thus this approach leads to powerful two-sample tests that can potentially detect any difference between P and Q. On the other hand, we have all heard about the problems with p-values and classical testing. In particular, a significant p-value does not tell us how different P and Q are (this is related to the notion of effect size in medicine). The picture below illustrates an example where P and Q get progressively more different. In each case, even a strong two-sample test would only give a binary rejection decision.
So it would be more interesting if we could somehow meaningfully quantify how different P is from Q, ideally still using a powerful classifier in the process. Here we construct such a method, based on a lower bound for the TV distance between P and Q.
In the following we assume we observe an i.i.d. sample X_1, …, X_m from P and an independent i.i.d. sample Y_1, …, Y_n from Q. We then use the probability estimate of a classifier (e.g., the estimated probability of belonging to class 1) as a "projection" that takes the vectors of data and maps them onto the real line. Building the univariate order statistic of these values and finding a connection to TV(P,Q), we will then be able to construct our lower bound. In the following, we will also sometimes just write lambda for TV(P,Q).
Using a (powerful) Classifier to get a Univariate Problem
Concepts in this section: Drawing circles without replacement from an urn, with m circles and n squares: using the hypergeometric distribution for two-sample testing.
In general, the samples from P and Q are d-dimensional random vectors. Here the classifier already comes into play: most classifiers can take these m+n sample points and transform them into a sequence of real numbers, namely the predicted probability that observation i has label 1. Thus, we can focus on real numbers between 0 and 1 to construct our estimator. Of course, what we then really lower bound is the TV distance between the distributions of the probability estimates. It is thus important that the classifier is strong, to make sure we do not lose too much information.
So, let's assume we have a sample of N=m+n real numbers, and we know for each of them whether the original observation comes from P or Q. Then we can build this magical thing called the order statistic. That is, we take the N numbers and order them from smallest to largest. To illustrate this, let's represent samples from P as circles and samples from Q as squares. Then the order statistic may look like this:
Now here is the important point: the classifier tries to estimate, as accurately as possible, the probability that an observation is in class 1, i.e., from Q, i.e., a square. Thus, if the classifier is good, we should expect to see more squares on the right, because the estimated probability should be larger for squares than for circles! The order statistic therefore acts like a centrifuge: if there is a discernible difference between P and Q, and the classifier is able to detect it, ordering by the estimated probabilities pushes the circles to the left and the squares to the right. Since there is still randomness and estimation error, this will not look perfect in general. However, we want it to be "sufficiently different from randomness".
One very elegant way to quantify this is the statistic we call V_z, the number of circles below z. This statistic has been used for (univariate) testing for a long time. That is, at any point z=1,…,N, we simply count how many circles we have below z:
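In code, building the order statistic and counting the circles is just sorting and a cumulative sum. A minimal sketch, using the same names t (labels, with 0 marking samples from P) and rho (the classifier's probability estimates) that appear in the code examples further below:

# Sketch: order statistic of the probability estimates and the count of
# circles (observations from P) below each position z
# Assumed inputs: rho = estimated probabilities of label 1, t = labels (0 = P, 1 = Q)
ord <- order(rho)             # positions sorted by the probability estimate
t_sorted <- t[ord]            # the circle/square pattern in the order statistic
V_z <- cumsum(t_sorted == 0)  # V_z = number of circles below z, for z = 1, ..., N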
What should we expect if P and Q are the same? In this case, we simply have N i.i.d. draws from a single distribution, so there should be just a random arrangement of circles and squares in the order statistic, with no pattern. In this case, V_z behaves exactly like drawing z times without replacement from an urn with m circles and n squares and counting the circles. Thus this connects back to the very basics of probability. Mathematically, under H_0: P=Q, V_z is hypergeometric:
P(V_z = k) = choose(m, k) * choose(n, z-k) / choose(m+n, z),
that is, the distribution of the number of circles you get when you draw z times without replacement from an urn with m circles and n squares (see e.g., this nice article).
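As a quick sanity check of this claim, one can compare a small simulation under the Null with R's built-in hypergeometric distribution (the values of m, n and z below are just made up for the illustration):

# Under H0 the circles occupy uniformly random positions in the ordering,
# so V_z at a fixed z should follow the hypergeometric distribution
set.seed(1)
m <- 6; n <- 4; z <- 5
V_z_sim <- replicate(1e4, sum(sample(c(rep(1, m), rep(0, n)))[1:z]))  # 1 = circle
round(table(V_z_sim) / 1e4, 3)   # simulated P(V_z = k)
round(dhyper(0:z, m, n, z), 3)   # exact hypergeometric probabilities for k = 0, ..., z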
What is cool is that we can even find a function of z, q(z, alpha), for any alpha, such that when P=Q:
P( max_z ( V_z - q(z, alpha) ) > 0 ) <= alpha.
Finding this q(z, alpha) can be done using asymptotic theory (see e.g. our paper and the references therein) or simply by simulation. The main point is that we know the distribution of V_z under the Null and it is always the same, no matter what P and Q exactly are. So even if we don't have a closed-form distribution for the maximum, we can still approximate q(z, alpha) quite readily. This can be directly used for a (univariate) two-sample test: if max_z ( V_z - q(z, alpha) ) overshoots zero, we can reject that P=Q.
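For illustration, here is one simple simulation-based way to calibrate such a q(z, alpha), using a constant offset above the expected path of V_z under the Null. This is just a sketch and not necessarily the construction used in the paper:

# Calibrate q(z, alpha) by simulation: under H0 the ordering of circles and
# squares is a uniform random permutation
set.seed(1)
m <- 100; n <- 100; N <- m + n; alpha <- 0.05
mean_V <- (1:N) * m / N                           # E[V_z] under H0
sim_max <- replicate(5000, {
  V <- cumsum(sample(c(rep(1, m), rep(0, n))))    # one simulated path of V_z
  max(V - mean_V)
})
c_alpha <- quantile(sim_max, 1 - alpha)           # simultaneous offset
q_z <- mean_V + c_alpha
# Two-sample test: with an observed path V_z (e.g. computed from classifier
# probabilities as above), reject H0: P = Q if max(V_z - q_z) > 0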
Ok so this is quite nice, but the whole point of this article is that we want to get away from simple rejection decisions, and instead get a lower bound for the Total Variation Distance. Unfortunately, under a general alternative where P and Q are different (i.e., TV(P,Q) > 0), the distribution of V_z is no longer known! The goal will now be to find another process B_z that is easier to analyze and such that B_z ≥ V_z for all z=1,…,N. If we can bound this process correctly, then the bound also holds for V_z.
Playing around with TV(P,Q)
Concepts in this section: Using the sampling interpretation of TV to introduce the concept of Distributional Witnesses and using this to identify an area where P=Q holds, even if P is not equal to Q in general.
By finding q(z,alpha) in the last section, we essentially found the first step of the construction of a lower bound for the case TV(P,Q)=0, i.e. when there is no difference between P and Q. We now extend this by connecting with our first article and playing around with the definition of the TV distance between P and Q. Generally, this is given as
TV(P,Q) = max_A | P(A) - Q(A) |,
where the maximum is taken over all (measurable) sets A. Thus we look for the set A, out of all possible sets, such that the difference between P(A) and Q(A) is largest. Let's make this more specific: let p and q be the densities of P and Q (as a technicality, the data does not need to be continuous in the usual sense; densities with respect to a common dominating measure always exist in this case). Then the maximal A is given as
A = { x : p(x) > q(x) },
so that
TV(P,Q) = P(A) - Q(A) = ∫_A ( p(x) - q(x) ) dx.
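As a tiny numerical illustration, take P = N(0,1) and Q = N(2,1) (a choice made up just for this check); both expressions then give the same number:

# Illustration of TV(P,Q) = P(A) - Q(A) with A = {p > q}, for P = N(0,1), Q = N(2,1)
p_dens <- function(x) dnorm(x, mean = 0)
q_dens <- function(x) dnorm(x, mean = 2)
# Integral of (p - q) over A = {x : p(x) > q(x)}
integrate(function(x) pmax(p_dens(x) - q_dens(x), 0), -Inf, Inf)$value
# The same via P(A) - Q(A); here A = {x < 1} by symmetry of the two densities
pnorm(1, mean = 0) - pnorm(1, mean = 2)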
In the following, let X have distribution P and Y distribution Q. Now here comes the crucial part: we can use this to define a new density
p_+(x) = ( p(x) - q(x) ) / TV(P,Q) for x in A, and p_+(x) = 0 otherwise.
This is a valid density (it integrates to 1), and we can similarly define a density q_+ (using q - p on the set where q is larger). What this means is best seen graphically:

The picture shows that the densities p, q can be split up into the densities p_+, q_+ and a middle part that corresponds to the pointwise minimum of the two densities and integrates exactly to 1 if we standardize it with 1-TV(P,Q):
h(x) = min( p(x), q(x) ) / ( 1 - TV(P,Q) ).

Instead of seeing X as a draw from P and Y as a draw from Q directly, we can now see X as drawn from the mixture
p(x) = TV(P,Q) * p_+(x) + ( 1 - TV(P,Q) ) * h(x),
and analogously Y from the mixture of q_+ and h.
What this means is the following: before we draw X, we flip a coin; with probability TV(P,Q), we draw X from the red density p_+, and with probability 1-TV(P,Q) we instead draw it from h. For the distribution of X it doesn't matter how we look at it; in the end, X has distribution P. But it is clearly interesting whether X actually came from p_+ or from h, because the former corresponds to the "unique" part of p. Indeed, looking at either the graphic or the densities themselves, we see that p_+ and q_+ have disjoint supports. So X either comes from p_+ or from h, and similarly Y is either drawn from q_+ or from h. Crucially, if both X and Y are drawn from the density h, they come from the same distribution and there is no way to differentiate them; it is as if we were under the Null.
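To make the coin-flip picture concrete, here is a small sketch (again with the made-up choice P = N(0,1), Q = N(2,1)) that draws X from p_+ or h via rejection sampling and checks that the result is again distributed as P:

# Sampling interpretation: drawing from p_+ with probability lambda and from h
# with probability 1 - lambda recovers the original distribution P
set.seed(1)
p_dens <- function(x) dnorm(x, mean = 0)
q_dens <- function(x) dnorm(x, mean = 2)
lambda <- integrate(function(x) pmax(p_dens(x) - q_dens(x), 0), -Inf, Inf)$value  # TV(P,Q)
# Rejection sampling from p_+ = (p - q)_+ / lambda, with proposal p
r_pplus <- function() repeat { x <- rnorm(1); if (runif(1) < max(1 - q_dens(x) / p_dens(x), 0)) return(x) }
# Rejection sampling from h = min(p, q) / (1 - lambda), with proposal p
r_h <- function() repeat { x <- rnorm(1); if (runif(1) < min(1, q_dens(x) / p_dens(x))) return(x) }
X <- replicate(5000, if (runif(1) < lambda) r_pplus() else r_h())
ks.test(X, "pnorm")  # no evidence against X ~ N(0,1) = P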
So, for the i.i.d. observations X_1, …, X_m and Y_1, …, Y_n, each observation is either drawn from the specific part (p_+ or q_+) or from the joint part h. We call observations that are drawn from the specific part p_+ (q_+) witnesses for P (Q).
Observations drawn from the specific part p_+ are called witnesses (for P). Observations drawn from h cannot be differentiated, so this corresponds to the part where P and Q are the same.
Ok so if we go back to the order statistics, we can now think of it like this:
Circles with blue crosses correspond to witnesses from P, while squares with blue crosses correspond to witnesses from Q. From the crossed observations we can learn something about the difference between P and Q, while the observations without crosses essentially come from the Null part. In a sense, all of this is just a thought experiment: we have no way of knowing whether X_i is drawn from p_+ or from h. So we don't know which points are witnesses; in fact, we don't even know how many there are. Nonetheless, this thought experiment will be helpful to construct our estimator.
In the following we will propose a candidate for TV(P,Q), say lambda_c, and then check whether this candidate fulfills a certain condition. If it does, we choose a new candidate lambda_c that is higher than the old one and check the condition again. We do that until lambda_c violates the condition.
Let’s do some cleaning
Concepts in this section: Bounding the number of witnesses with high probability and using an intuitive "cleaning operation" to get a better-behaved process B_z that is always greater than or equal to V_z.
We now want to use this idea that some points are witnesses and others come from the part where P=Q for the statistic V_z. As mentioned above, we don't actually know which points are witnesses! That would still be fine, since what we really need is just the number of witnesses, but we don't know that either. However, we can find an upper bound for this number that holds with high probability.
Recall that we assume X_1, …, X_m were sampled by drawing each with probability TV(P,Q) from p_+ and with probability 1-TV(P,Q) from h. So the number of witnesses among the m observations, denoted w_p, follows a Binomial distribution with success probability TV(P,Q). So if we have a candidate lambda_c, which we suspect to be the true TV distance, the number of witnesses should follow the distribution
w_p ~ Binomial(m, lambda_c).
This is still random, and thus we don't know the exact outcome for a given sample. But since we know the distribution, we can find an upper quantile W_p such that w_p overshoots this quantile with probability at most alpha/3. For instance, for m=10 and lambda_c=0.1, this can be found as
lambda_c <- 0.1
m <- 10
alpha <- 0.05
# Upper quantile of Binomial(m, lambda_c): P(w_p > W_p) <= alpha/3
W_p <- qbinom(p = alpha/3, size = m, prob = lambda_c, lower.tail = F)
# test: simulate witness counts and compare them to the bound
w_p <- rbinom(n = 3000, size = m, prob = lambda_c)
hist(w_p, main = "")
abline(v = W_p, col = "darkred")
We can do the same for the witnesses of Q, obtaining a bound W_q for w_q.
So we have a candidate lambda_c and, based on this candidate, two values W_p and W_q that bound w_p and w_q with high probability. In particular, W_p and W_q depend directly on lambda_c, so it would actually be more accurate to write W_p(lambda_c) and W_q(lambda_c), but that would blow up the notation too much.
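To see this dependence, one can quickly plot W_p(lambda_c) over a grid of candidates (a small illustration with made-up values m = 100 and alpha = 0.05):

# W_p as a (step) function of the candidate lambda_c
W_p_of <- function(lambda_c, m, alpha) {
  qbinom(p = alpha/3, size = m, prob = lambda_c, lower.tail = FALSE)
}
lambda_grid <- seq(0, 1, by = 0.01)
plot(lambda_grid, sapply(lambda_grid, W_p_of, m = 100, alpha = 0.05),
     type = "s", xlab = "lambda_c", ylab = "W_p(lambda_c)", col = "darkblue")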
To obtain the new process B_z, we first make up new witnesses in the order statistic above:

The red squares now denote random points that we designated to be witnesses. This is done so that the number of witnesses matches the upper bounds W_p and W_q. This is fine in our context, as it does not change V_z.
Now we perform our cleaning operation. This will get us from V_z to B_z in a way that guarantees B_z is at least as large as V_z. We go through the order statistic from left to right (and analogously from right to left for the witnesses of Q). From left to right, every time we see an observation without a cross (a non-witness), we randomly choose a witness of P (a circle with a cross) from further to the right and put it in front of that observation. We do so without changing the relative order of the squares and the circles without crosses, like so:

The first circle was already a witness, so we left it as is. The second circle was a non-witness, so we randomly moved a witness circle from further to the right in front of it. The next observation was a square that was not a witness, so we again moved a circle witness from the right in front of it, and so on. The whole idea is simply to move all witnesses of P to the left and all witnesses of Q to the right, without changing the relative order of the non-witnesses:
Now B_z simply counts the number of observations below z=1,…,N that belong to P in this new ordering! Note that for the first W_p positions, B_z just increases linearly by 1. Then there is a middle part in which B_z behaves like a hypergeometric process:

Finally, the last observations are only squares, so the value of B_z just reaches m and stays there.
## Using the function defineB_z defined below
## Define m, n and the confidence level alpha
n <- 50
m <- 100
alpha <- 0.05
# Define the candidate
lambda_c <- 0.4
# Plot several realizations of the process B_z
plot(1:(m+n), defineB_z(m, n, alpha, lambda_c), type = "l", col = "darkblue",
     xlab = "z", ylab = "B_z")
for (b in (1:100)){
  lines(defineB_z(m, n, alpha, lambda_c), col = "darkblue")
}
The key is that we moved all the circles further to the left than they were before! This is why B_z is always at least as large as V_z. In particular, for lambda_c=0 we get W_p=W_q=0, so no witnesses are made up and B_z=V_z.
library(stats)
defineB_z <- function(m, n, alpha, lambda_c){
  ## Upper bounds on the number of witnesses for the given m, n, alpha and lambda_c
  W_p <- qbinom(p = alpha/3, size = m, prob = lambda_c, lower.tail = F)
  W_q <- qbinom(p = alpha/3, size = n, prob = lambda_c, lower.tail = F)
  B_z <- matrix(0, nrow = n + m)
  # First part: B_z = z (the W_p made-up witnesses of P sit at the very left)
  if (W_p > 0) B_z[1:W_p, ] <- 1:W_p
  # Last part: B_z = m (only witnesses of Q remain at the very right)
  B_z[(m + n - W_q):(m + n), ] <- m
  # Middle part: hypergeometric process on the remaining non-witnesses
  for (z in seq_len(max(m + n - W_q - 1 - W_p, 0)) + W_p){
    B_z[z, ] <- rhyper(1, m - W_p, n - W_q, z - W_p) + W_p
  }
  return(B_z)
}
Putting it all together
Concepts in this section: Using the above to get a bound on B_z, leading to a bound on V_z and a subsequent HPLB lambdahat we can use. It is defined through an infimum, and to find it, we need to cycle through several candidates.
Next, given the true lambda=TV(P,Q), we want to find a function Q(z, alpha, lambda) such that
P( max_z ( B_z - Q(z, alpha, lambda) ) > 0 ) <= alpha.
This will be used to define the final estimator in a second. Crucially, Q(z, alpha, lambda_c) needs to be defined for any candidate lambda_c, but the probability statement only needs to be true at the true value lambda_c=lambda. So this is the "candidate" I focus on for the moment.
From above we know that for the first W_p values, B_z is just linearly increasing. So B_z=z and we can also set Q(z, alpha, lambda)=z for z=1,…,W_p. Similarly on the other side, when all m circles are counted, we know B_z=m and thus we can set Q(z, alpha, lambda)=m for all z=m+n-W_q, …, m+n. (Remember that in each case lambda=TV(P,Q) enters through W_p and W_q.)
What remains is the part in the middle that behaves as if under the Null, i.e., z strictly between W_p and m+n-W_q. But since here B_z (shifted by W_p) is again hypergeometric, we can use the same kind of q function as above, now at level alpha/3, to get
P( max_{W_p < z < m+n-W_q} ( B_z - Q(z, alpha, lambda) ) > 0 ) <= alpha/3.
We need alpha/3 here because we can additionally make mistakes with W_p and W_q: there is a probability of at most alpha/3 each that w_p overshoots W_p or that w_q overshoots W_q. A union bound over these three events then gives an overall level of alpha.
So we have all cases covered! For z=1,…,W_p we have B_z-Q(z,alpha,lambda)=0, and the same holds true for the last W_q values of z. The middle part is covered by the inequality above. But since B_z is always at least as large as V_z, we also have
P( max_z ( V_z - Q(z, alpha, lambda) ) > 0 ) <= alpha.
Using this Q function, we can define our final estimator as
lambdahat = inf { lambda_c in [0,1] : max_z ( V_z - Q(z, alpha, lambda_c) ) <= 0 }.
This looks horrible, but all it means is that, starting from lambda_c=0, you (1) calculate W_p(lambda_c), W_q(lambda_c), and thus Q(z, alpha, lambda_c), and (2) check whether
max_z ( V_z - Q(z, alpha, lambda_c) ) > 0
is true. If it is, you increase lambda_c a little bit and repeat steps (1) and (2). If it is not true, you stop and set the estimator to this lambda_c.
Mathematically, why does this inf definition of the estimator work? It just means that lambdahat is the smallest lambda_c such that
max_z ( V_z - Q(z, alpha, lambda_c) ) <= 0
is true. So if the true lambda (=TV(P,Q)) is smaller than that smallest value (our estimator), this condition cannot hold at lambda_c=lambda, and instead the > 0 condition above is true at the true lambda. But we have just seen that this > 0 condition has a probability of at most alpha of occurring, so we are fine: lambdahat <= lambda with probability at least 1-alpha.
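To make the whole recipe concrete, here is a do-it-yourself sketch of this candidate search. It is not the implementation of the HPLB package (which uses more refined quantile bounds); the middle part of Q is calibrated by brute-force simulation, and the helper names Q_of_z and hplb_diy are made up for this illustration.

# Q(z, alpha, lambda_c) in three parts; the middle part is calibrated by
# simulating the hypergeometric process of the non-witnesses
Q_of_z <- function(m, n, alpha, lambda_c, nsim = 1000) {
  N <- m + n
  W_p <- qbinom(alpha/3, size = m, prob = lambda_c, lower.tail = FALSE)
  W_q <- qbinom(alpha/3, size = n, prob = lambda_c, lower.tail = FALSE)
  Q <- numeric(N)
  if (W_p > 0) Q[1:W_p] <- 1:W_p          # first part: Q = z
  Q[(N - W_q):N] <- m                     # last part: Q = m
  if (W_p + 1 < N - W_q) {                # middle part: hypergeometric process
    m2 <- m - W_p; n2 <- n - W_q; N2 <- m2 + n2
    mean_path <- (1:N2) * m2 / N2
    sim_max <- replicate(nsim, {
      path <- cumsum(sample(c(rep(1, m2), rep(0, n2))))
      max(path - mean_path)
    })
    c_alpha <- quantile(sim_max, 1 - alpha/3)   # simultaneous offset at level alpha/3
    mid <- (W_p + 1):(N - W_q - 1)
    Q[mid] <- W_p + mean_path[mid - W_p] + c_alpha
  }
  Q
}
# Candidate search: return the smallest lambda_c on a grid for which
# max_z (V_z - Q(z, alpha, lambda_c)) <= 0
hplb_diy <- function(t, rho, alpha = 0.05, grid = seq(0, 1, by = 0.01)) {
  ord <- order(rho)
  m <- sum(t == 0); n <- sum(t == 1)
  V_z <- cumsum(t[ord] == 0)              # number of P-observations below z
  for (lambda_c in grid) {
    if (max(V_z - Q_of_z(m, n, alpha, lambda_c)) <= 0) return(lambda_c)
  }
  1
}

With labels t and probability estimates rho as in the examples below, hplb_diy(t, rho) should give a (typically somewhat conservative) lower bound in the same spirit as the HPLB package.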
All of this can be found implemented in the HPLB package on CRAN. The next section presents two examples.
Some Examples
Here, we use the estimator derived in the last section on two examples. In the first example, we use a Random Forest-induced estimate of the probability of belonging to class 1, as discussed above. In the second example, we actually use a regression function, showing that one can generalize the concepts discussed here.
In the first article, we already studied the following example
library(mvtnorm)
library(HPLB)
set.seed(1)
n <- 2000
p <- 2
# Larger delta -> more difference between P and Q
# Smaller delta -> less difference between P and Q
delta <- 0
# Simulate X~P and Y~Q for the given delta
U <- runif(n)
X <- rmvnorm(n = n, sig = diag(p))
Y <- (U <= delta) * rmvnorm(n = n, mean = rep(2, p), sig = diag(p)) +
  (1 - (U <= delta)) * rmvnorm(n = n, sig = diag(p))
plot(Y, cex = 0.8, col = "darkblue")
points(X, cex = 0.8, col = "red")
In the above simulation, the delta parameter determines how different P and Q are, from delta=0, for which P=Q, to delta=1, for which P is a bivariate normal with mean (0,0) and Q is a bivariate normal with mean (2,2). For delta > 0, even a strong two-sample test would at best simply reject, without telling us how large the difference is. With our method, using Random Forest probability estimates, we get
#Estimate HPLB for each case (vary delta and rerun the code)
t.train<- c(rep(0,n/2), rep(1,n/2) )
xy.train <-rbind(X[1:(n/2),], Y[1:(n/2),])
t.test<- c(rep(0,n/2), rep(1,n/2) )
xy.test <-rbind(X[(n/2+1):n,], Y[(n/2+1):n,])
rf <- ranger::ranger(t~., data.frame(t=t.train,x=xy.train))
rho <- predict(rf, data.frame(t=t.test,x=xy.test))$predictions
tvhat <- HPLB(t = t.test, rho = rho, estimator.type = "adapt")
tvhat
as we would have hoped: The lower bound is zero when the distributions are the same (i.e., the implicit test cannot reject) and progressively increases as P and Q get more different (i.e., delta increases).
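To reproduce the whole progression in one go, one can wrap the simulation and the estimate in a loop over delta; the following is just a convenience sketch that repeats the code above for several values:

# Rerun the simulation and the HPLB estimate for several values of delta
for (delta in c(0, 0.25, 0.5, 0.75, 1)) {
  U <- runif(n)
  X <- rmvnorm(n = n, sig = diag(p))
  Y <- (U <= delta) * rmvnorm(n = n, mean = rep(2, p), sig = diag(p)) +
    (1 - (U <= delta)) * rmvnorm(n = n, sig = diag(p))
  t.train <- c(rep(0, n/2), rep(1, n/2)); t.test <- t.train
  xy.train <- rbind(X[1:(n/2), ], Y[1:(n/2), ])
  xy.test <- rbind(X[(n/2 + 1):n, ], Y[(n/2 + 1):n, ])
  rf <- ranger::ranger(t ~ ., data.frame(t = t.train, x = xy.train))
  rho <- predict(rf, data.frame(t = t.test, x = xy.test))$predictions
  cat("delta =", delta, "\n")
  print(HPLB(t = t.test, rho = rho, estimator.type = "adapt"))
}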
We can also look at a more general example. Suppose we observe a (more or less) independent sample that, however, has a mean shift in the middle:
Let us consider the index of the observations as time t in this example; from left to right we move from time t=1 to t=1000. The code follows below. We can now check at each point of interest (for instance at each sample point) how large the TV distance is between P = the distribution of the points to the left and Q = the distribution of the points to the right. This can be done by again using a probability estimate for each t, but to speed things up, we instead use a regression of time t on the observations z_t. That is, we check whether the observation value gives us an indication of whether it lies more to the left or to the right.
The following picture shows the true TV in red and our HPLB in black:
It can be seen that the method nicely detects the increase in TV distance when we move from left to right in the graph. It peaks at the point where the change in distribution happens, indicating that the main change happens there. Of course, as can be seen in the code below, I cheated a bit here; I generated the whole process two times, once for training and once for testing. In general, in this example one has to be a bit more careful about how to choose the training and test sets.
Importantly, at each point where we calculate the TV distance, we implicitly also perform a two-sample test.
library(HPLB)
library(ranger)
library(distrEx)
n <- 500
mean.shift <- 2
t.train <- runif(n, 0 ,1)
x.train <- ifelse(t.train>0.5, stats::rnorm(n, mean.shift), stats::rnorm(n))
rf <- ranger::ranger(t~x, data.frame(t=t.train,x=x.train))
n <- 500
t.test <- runif(n, 0 ,1)
x.test <- ifelse(t.test>0.5, stats::rnorm(n, mean.shift), stats::rnorm(n))
rho <- predict(rf, data.frame(t=t.test,x=x.test))$predictions
## out-of-sample
tv.oos <- HPLB(t = t.test, rho = rho, s = seq(0.1,0.9,0.1), estimator.type = "adapt")
## total variation values
tv <- c()
for (s in seq(0.1,0.9,0.1)) {
if (s<=0.5) {
D.left <- Norm(0,1)
} else {
D.left <- UnivarMixingDistribution(Dlist = list(Norm(0,1),Norm(mean.shift,1)),
mixCoeff = c(ifelse(s<=0.5, 1, 0.5/s), ifelse(s<=0.5, 0, (s-0.5)/s)))
}
if (s < 0.5) {
D.right <- UnivarMixingDistribution(Dlist = list(Norm(0,1),Norm(mean.shift,1)),
mixCoeff = c(ifelse(s<=0.5, (0.5-s)/(1-s), 0), ifelse(s<=0.5, (0.5/(1-s)), 1)))
} else {
D.right <- Norm(mean.shift,1)
}
tv <- c(tv, TotalVarDist(e1 = D.left, e2 = D.right))
}
## plot
oldpar <- par(no.readonly =TRUE)
par(mfrow=c(2,1))
plot(t.test,x.test,pch=19,xlab="t",ylab="x")
plot(seq(0.1,0.9,0.1), tv.oos$tvhat,type="l",ylim=c(0,1),xlab="t", ylab="TV")
lines(seq(0.1,0.9,0.1), tv, col="red",type="l")
par(oldpar)
Conclusion
In this article, we took a deep dive into the construction of an HPLB for the TV distance. Of course, what we really lower-bounded was the TV distance between the distributions of the probability estimates. The challenge is therefore to find a "projection" or classifier that is powerful enough to still find the signal. Algorithms like Random Forest, as used here, are examples of such powerful methods that, moreover, don't really require any tuning.
We hope that, with the code provided here and in the HPLB package on CRAN, the basic probability considerations discussed here might actually be used on some real-world problems.