Living in the wilderness: hypothesis testing in a world that disagrees with statistical theory

Often, statistical theory withdraws its support in data science practice. But when this happens, all is not lost. We can double the bet on our data and still win — if we are careful and conscious about what we are doing.

Fabio Baccarin
Towards Data Science


Sometimes it seems paradoxical to call the famous bell curve “normal”. Among all the assumptions made by traditional statistical theory, the normality assumption is notorious for how frequently it fails to hold. My aim in this article is to show a way to test hypotheses when the normality assumption of traditional hypothesis tests is violated. In this scenario, we can't rely on theoretical results, so we need to depart from theory's ivory tower and double the bet on our data. To get there, I first briefly review what hypothesis testing is, focusing on an intuitive grasp of the reasoning behind it (no equations allowed!). Then I proceed to a case study motivated by a business problem where the normality assumption doesn't hold. This makes matters concrete and will direct our discussion. After the problem is explained, I show that bootstrapping is a good way to fill the gaps left by theory without changing anything in the reasoning at the heart of hypothesis testing. In particular, I show that bootstrapping leads to the right conclusion about the test. I conclude with a critical evaluation of bootstrapping and similar methods, pointing out their pros and cons.

Introduction: the heart of hypothesis testing

Many data scientists have trouble understanding hypothesis testing. I think the problem is that there are so many mathematical pyrotechnics that we lose sight of what we are trying to accomplish. I'm not saying that mathematics is a burden, only that its blessings come at a cost. By ensuring the truth of what we are saying in such a sophisticated way, we end up losing sight of what we wanted to say. That's all. I'm a humble servant of the Queen of the Sciences just as you are.

So let's leave our beloved Queen aside and think about hypothesis testing informally. At its core, hypothesis testing is an argument — some reasoning that leads to some conclusion. There are many kinds of arguments, so what kind of argument is a hypothesis test? It is what we call a reductio ad absurdum, a fancy Latin phrase roughly translating to "reduction to the absurd". In this kind of argument, we assume something to be true. Then we show that, if this thing is true, the conclusion we reach is so absurd or ridiculous that we must reject the validity of what we assumed. Our hypothesis must be false or we are not being reasonable.

Think about police detectives. They do this all the time. At first, everyone is a suspect: in a world of corrupt cops, even your partner could have produced the dead body you are seeing right now. After good police work, however, you may end up thinking "if my partner did kill that guy, she would have done it for absolutely no reason at all!" To think that someone would kill someone else today for no reason at all is clearly absurd. Therefore you must either conclude that your partner didn't do it or that you belong to a mental institution. You choose the first case and proceed to more likely suspects.

This is exactly what we are doing in hypothesis testing when we assume the null hypothesis is true beforehand and by default. The null hypothesis is some statement involving a quantity that we simply assume is true, such as the grades of students under different teachers being equal. The alternative hypothesis is a statement that is true if the null hypothesis is false, such as the grades being different. The point of the test statistic and the normal distribution is to show whether it is absurd to believe that the null hypothesis is true. Believing that the grades of students under different teachers are equal (on average) may be as absurd as believing that your cop partner killed some random guy for no reason at all.

But it may not be. It could be true and very reasonable. It is a fact that some teachers are better than others, and this reflects on the average grade of their students. You may end up arresting your partner for killing some guy who, you later find out, was going to tell that she is in the pocket of some drug dealer. That is why in both cases we need evidence. In statistics, we collect it in the form of data. The p-value is just a quantitative way to show how absurd it would be to believe that the null hypothesis is true after seeing the evidence. It doesn't tell us that our null hypothesis is certainly false, only that the evidence we observed would be very unlikely if it were true. That's why error rates are ubiquitous in statistics.

So far so good, but what about that normal distribution that generates a horrible table we use to get a conclusion? That normal distribution is provided by our Queen, mathematics, with her army of assumptions. Our test statistic has a normal distribution because that would be the case if the null hypothesis were true. Reductio ad absurdum here: we need to assume the validity of the null hypothesis to construct the test. It is the mathematical way of saying we need to start somewhere (why not from the null hypothesis?). That is exactly why we say that the test statistic follows a normal distribution under the null hypothesis when presenting a test.

Our beloved Queen needs one soldier in particular in her army of assumptions to provide us with such a beautiful result. This soldier is the statement that the distributions of grades under the two teachers are normal. Consequently, their difference would also be normally distributed. Our Queen assures us that we will get a normally distributed test statistic; thus we don't need to deduce it for ourselves. That is, we don't need to derive the curve ourselves for every test where there are normal distributions. The answer is already given to us by statistical theory! Much less work for us to do in this scenario.

But what happens when variables are not normally distributed? Our test continues to be a reductio ad absurdum. However, because the normality assumption is violated, we need something to replace the pre-computed normal distribution that can't be used. Without the statistical theory backing us, we need to find a replacement using only the data. We will see that the bootstrap is a good way to do this. Before talking about bootstrap, though, let's make matters more concrete by using an example of a test where variables are expected not to be normally distributed.

Case study: testing hypothesis about credit scores

Once upon a time, a data scientist was working in the credit market. She needed to test whether a credit score made by a third party could separate clients with a high probability of repaying the loan from those with a low probability of doing so. Let's refer to these two classes of clients as "good" and "bad" clients, respectively. A credit score is simply a grading system, where "good" clients typically receive higher grades than "bad" clients. The higher the credit score, the lower the probability of a client not paying the loan back. Credit scores are the core of the business in credit markets, for we want our money back (plus interest). Credit is not charity.

So, what is the fundamental characteristic of a credit score? Looking at it from the point of view of a data scientist, a credit score is just a classifier trained to predict default on loans. Thus a good credit score should have a good degree of separation between "good" and "bad" clients. In particular, the distribution of scores should be such that "good" clients are clustered on the higher scores and "bad" clients are clustered on lower scores.

Try to visualize this pattern. Can you see why this is a big problem for statistical theory? Let's look at "good" clients. Because the majority of them are located on the higher scores, the probability of having a "good" client on a high score is high. On the other hand, the probability of having a "good" client on a low score is low. Hence, this probability distribution is negatively skewed. The distribution of "bad" clients mirrors this, being thus positively skewed. If we draw these two as blue and red curves, we get the figure below.

Figure 1: distribution of credit scores for "good" and "bad" clients when the score is good. The “good” clients are clustered on the higher values (blue curve), while the "bad" clients are clustered on the lower values (red curve). This behavior is ugly for theory but beautiful for business. Notice that there is some overlap between the curves: no machine learning algorithm is perfect! Image by the author

We can't rely on the normality assumption here. But this isn't the biggest problem. The biggest problem is that, if the credit score is bad, it is very reasonable to think that the distributions for "good" and "bad" clients are normally distributed. To visualize this, think about a poor classifier. It doesn't differentiate effectively between the classes. Hence all scores should be somewhat randomly distributed among the two classes of clients. There is no particular reason why some clients would receive lower scores than others. This randomness leads to two approximately normal curves, one on top of the other, like the figure below.

Figure 2: distribution of credit scores for “good” and “bad” clients when the score is bad. The score values are somewhat random for both "good" and "bad" clients. Thus it is reasonable to assume they follow a normal distribution. The curves overlap because the algorithm doesn't separate the classes effectively. Image by the author

We are in a theoretical nightmare here: we have a problem where the normality assumption may or may not be reasonable. We can't know which case it is, because knowing that requires knowing the answer! We could indeed handle the skewness by getting more data. This way, the accuracy of the test would be good enough for the conclusions to be reliable. Yet we shouldn't need this. Getting more data should not be the first answer that comes to your mind as a data scientist, because in most cases you can't do anything about it. Frequently it is impractical, prohibitively expensive, or even impossible to get more data. The era of big data is no excuse for a good data scientist not to strive to do his or her job using as little data as possible. That is what statistics is all about.

How can we solve this problem? We need a method general enough to handle both scenarios, so that we don't need to worry about whether the credit score is indeed good or bad. The answer: the bootstrap. I will show you that, in both scenarios, the bootstrap derives a distribution of our test statistic (under the null hypothesis) that leads us to the right conclusion.

Here we have a typical thought experiment. I simulated two credit scores, a good and a bad one, so that I know beforehand which is which. This is a good way to test whether something works. We just need to check if it leads to the expected answer, which we know beforehand is correct. If the method does lead us to that answer, it works. Let's see how the bootstrap goes.
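To make the thought experiment concrete, here is a minimal sketch of how such data could be simulated. The generating distributions and variable names below are my own assumptions for illustration only; the author's actual simulation is in the repository linked at the end of the article.

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_class = 500  # 1,000 rows per score in total, split between the two classes (my assumption)

# Good score: "good" clients (label 0, no default) cluster on high values,
# "bad" clients (label 1, default) cluster on low values -> skewed classes
good_score_label0 = 100 * rng.beta(5, 2, size=n_per_class)
good_score_label1 = 100 * rng.beta(2, 5, size=n_per_class)

# Bad score: values look random for both classes -> roughly normal, heavily overlapping
bad_score_label0 = rng.normal(50, 15, size=n_per_class)
bad_score_label1 = rng.normal(50, 15, size=n_per_class)
```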

Imagining the world of the null hypothesis using bootstrap

The bootstrap is beautifully simple. It is just a series of draws of small samples taken with replacement, so that the same observation can be sampled more than once. In the case of the dataset simulated here, we have 1,000 rows of data for each type of credit score. With them, we draw a huge number of tiny samples, each of size about 20 or 50. Then we calculate the test statistic for every subsample, storing the result. Let's say we drew 10,000 subsamples (getting more data may be beyond your reach, but computing power today probably isn't!). If we draw a histogram of this array of test statistic values, we get an estimate of the probability distribution of the test statistic. All that is left to do is get the distribution's quantile corresponding to our desired level of statistical significance and voilà: problem solved.
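Here is a minimal sketch of that recipe. The helper names are mine, and I assume the test statistic is a two-sample t-type statistic (the article does not spell out its exact formula); the transformation that makes the null hypothesis true comes in the next steps.

```python
import numpy as np

def t_statistic(a, b):
    # Welch-style two-sample t statistic: difference of means divided by its
    # estimated standard error (my assumption about the article's statistic)
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    )

def bootstrap_statistic(a, b, stat, n_draws=10_000, subsample=20, rng=None):
    # Draw many tiny samples with replacement from each array and store
    # the statistic computed on every pair of subsamples
    if rng is None:
        rng = np.random.default_rng()
    values = np.empty(n_draws)
    for i in range(n_draws):
        sub_a = rng.choice(a, size=subsample, replace=True)
        sub_b = rng.choice(b, size=subsample, replace=True)
        values[i] = stat(sub_a, sub_b)
    return values
```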

Before we hit the keyboard on our computing environment, we should remember something very important: we need the distribution of the test statistic under the null hypothesis! We need to transform our data in such a way that the null hypothesis is true before doing the bootstrap. Let's state our null hypothesis conveniently as follows: the average credit scores for "good" and "bad" clients are equal. The alternative hypothesis is: the average credit score for "good" clients is greater than the average credit score for "bad" clients.

Now we can proceed with the test. Take the dataset for the good score. Let's split our 1,000 observations by the class labels. This way we get the scores of "good" and "bad" clients as two independent distributions. How can we change these data in such a way that the null hypothesis is true? We know that subtracting the mean from a variable leaves it with a mean of zero. If I subtract the mean from the "good" clients' scores and from the "bad" clients' scores, both of them will have a mean of zero, right? If they both have a mean of zero, their means are equal! The null hypothesis is true!

Let's form a mental image of the transformations we are making in our dataset. Imagine the dataset as a spreadsheet. First, we had two columns: the credit score and the class label. Then, we separated all scores associated with the class label zero (no default) and all scores associated with the class label one (default). Now we have two independent columns: one of "good" client scores and another of "bad" client scores. This is similar to the famous control-treatment group dichotomy. We can't overwrite these columns when subtracting the means of each, for we need the original data to test our null hypothesis. Hence we make a copy of them and subtract the mean from the copies. Now we have four columns: the two original arrays of scores plus the corresponding arrays of centered scores.
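In code, that spreadsheet picture amounts to a few lines (using the hypothetical arrays simulated earlier):

```python
# Original columns: scores of "good" (label 0) and "bad" (label 1) clients
scores_goods = good_score_label0
scores_bads = good_score_label1

# Centered copies: in these columns both means are zero,
# i.e. the null hypothesis is true by construction
centered_goods = scores_goods - scores_goods.mean()
centered_bads = scores_bads - scores_bads.mean()
```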

It's time for some bootstrapping. Can you see that the two arrays of centered scores are a way of imagining what the world would be like if the null hypothesis were true? Forget the details and focus instead on the sheer facts. Are the means of the two arrays equal? Yes, they are both equal to zero. So this is the world where the null hypothesis is true, period. Now the catch: if we sample from these two arrays and calculate the test statistic on the tiny subsamples, we get the distribution of our test statistic under the null hypothesis, right? That's exactly how we solve our problem. Instead of relying on theory to know the distribution under the null hypothesis, we compute it ourselves using the data, transformed in such a way that the null hypothesis is true. This is the computational, brute-force solution that we need when the elegant mathematical one can't help.
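Putting the pieces together for the good score, a sketch of the null-world bootstrap with the numbers quoted in this article (10,000 subsamples of size 20) looks like this:

```python
rng = np.random.default_rng(0)

# Distribution of the test statistic in the world where the null hypothesis is true
null_distribution = bootstrap_statistic(
    centered_goods, centered_bads, t_statistic,
    n_draws=10_000, subsample=20, rng=rng,
)

# One-sided 5% significance level -> 95% quantile of the null distribution
critical_value = np.quantile(null_distribution, 0.95)
```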

If we do this for the good score, we should get the distribution below. Notice that it is asymmetric, as expected. To arrive at a conclusion, we need to know the value of the test statistic such that the probability of observing a value equal to or lower than it (under the null hypothesis) is, say, 95%. If our scores were normally distributed, this value would be 1.64, because our test is one-sided (no need to split the error rate evenly between both tails). Yet this value is 2.37. Why?

Figure 3: distribution for the test statistic under the null hypothesis when the credit score is good. Notice the asymmetry. Alright, this is not the perfect example of an asymmetric distribution, but it still is asymmetric! If it were not, the critical value should be 1.64, not 2.37. The asymmetry is adding 0.73 points to our critical value to account for the higher likelihood of spotting large differences between two population averages. Image by the author

Because the two distributions are clustered on opposite extremes of the credit score's range of values. Hence we expect large differences to be more likely than would be the case if the scores were evenly distributed around some average value (as when following a normal distribution). It is this fact that adds 0.73 points to our critical value. This 44% increase makes a huge difference for our test. If we used 1.64, we would be ignoring the fundamental fact about good credit scores: that they group "good" and "bad" clients on opposite sides of their range of values. We would be saying that our error rate is 5% when it is actually much higher than that! The probability of wrongly concluding that our credit score is good would be much greater!

We now have our reasonable critical value. What is the observed value? Let’s take the two original arrays for the good score (those which don’t have a mean of zero) and compute the test statistic using all observations. We get the value 51.66. This is our observed value for the test statistic.
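In the sketch above, the observed value is just the same statistic computed once on the original, uncentered arrays:

```python
# All original observations, no centering: this is the evidence we collected
observed_value = t_statistic(scores_goods, scores_bads)
```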

Now let's check whether our null hypothesis (equal average credit scores) is true. Comparing our observed value with our critical value, we see that the observed value is immensely bigger. We expect that, if our null hypothesis were true, 95% of the time we would observe a value of the test statistic below the critical value (2.37). Yet our observed value is much bigger than that (51.66). What is the probability of observing such a huge value if the null hypothesis were true? My god, if 2.37 already corresponds to 95%, what is the probability of going beyond 51.66? A very small one, indeed. To keep believing in the null hypothesis, we would have to assume that we just happened to draw a sample this extreme by sheer chance, even though the null hypothesis is true. That's like winning the lottery many times over! That's not reasonable at all! Therefore our null hypothesis must be false! Reductio ad absurdum.
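The comparison itself, and an empirical p-value if we want one, follow directly from the bootstrapped null distribution in the sketch above:

```python
# Share of null-world values at least as extreme as the observed one
p_value = (null_distribution >= observed_value).mean()

# One-sided decision at the 5% level (equivalently, p_value < 0.05)
reject_null = observed_value > critical_value
```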

How does this validate the bootstrap? The catch is that we know the score is good because we constructed it to be good. Hence the bootstrap should lead us to reject the null hypothesis that the average score is the same for "good" and "bad" clients. This is what is expected of a good credit score. Did the bootstrap lead us to that conclusion? Yes, it did: the null hypothesis was rejected with a high degree of confidence. So the bootstrap works for the case when the credit score is good.

What about the case when the credit score is bad? Let's do the same thing: open the other spreadsheet, with the 1,000-row dataset for the bad score. Split it into two distributions according to the class labels. Copy them and subtract their means from the copies. Now bootstrap the centered arrays and calculate the distribution of the test statistic. The distribution should look something like the figure below.
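The same recipe applies verbatim to the bad score; only the input arrays change (again using my hypothetical names from the earlier sketches):

```python
# Centered copies of the bad score's two classes: the null-hypothesis world
centered_b0 = bad_score_label0 - bad_score_label0.mean()
centered_b1 = bad_score_label1 - bad_score_label1.mean()

null_dist_bad = bootstrap_statistic(centered_b0, centered_b1, t_statistic,
                                     n_draws=10_000, subsample=20, rng=rng)
critical_bad = np.quantile(null_dist_bad, 0.95)
observed_bad = t_statistic(bad_score_label0, bad_score_label1)
```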

Figure 4: distribution for the test statistic under the null hypothesis where the credit score is bad. Look how closely it resembles the theoretical curve! It is so similar that our critical value is 1.67, just 0.03 points higher than the theoretical 1.64. Image by the author

Here both distributions are normal, so it makes sense that the test statistic's distribution is (approximately) normal too. We are in the realm of statistical theory again. Does the bootstrap lead to the right conclusion? Our critical value is 1.67 (notice how close it is to 1.64!). Our observed value is -0.73, well below that. Thus we are observing exactly what we would expect if our null hypothesis were true! Hence it is reasonable not to reject it! We knew beforehand that the score was bad, so the bootstrap works here too! If it works in both cases, and every credit score must be either good or bad, then this method is general enough to save us from worrying about whether the third party's credit score is indeed good or bad. May the data tell us which case it is!

Conclusion: don't eat soup with a fork!

Sometimes we are forced to depart from the elegance and handy results of statistical theory. We may face the difficult decision of jumping from theory's ivory tower to achieve our goals in the wilderness of ill-behaved probability distributions. Fortunately, tools like the bootstrap give us good spots to land safely in this exotic landscape.

Like every tool, however, the bootstrap requires caution and much critical thinking. The bootstrap is a method statisticians call nonparametric. It fills the gap left by theory by placing more trust in our data accurately reflecting reality. This should raise a red flag right away, for nowhere in our test do we check whether our sample is representative. We simply assume it is, as we do in the traditional theoretical test. If our sample is not representative, no matter how big our "tiny" bootstrapped samples are, every test statistic calculated with them will be biased. Thus our entire distribution under the null hypothesis will be biased. This is a more serious problem for the bootstrapped version of the test, because in the traditional version the distribution under the null hypothesis is always the same. Hence, in the traditional test, the bias acts only on our observed value. In the bootstrapped version, however, the bias acts on both our observed value and our test statistic's distribution. More is going wrong here.

This is just the tip of the iceberg. We are also assuming that we have a reasonable amount of data. Not exactly lots of data, but enough to be able to calculate simple statistics accurately. The bootstrap may seem like the savior of small sample sizes because it approaches the problem in a way that resembles simulation. This is a dangerous illusion. The bootstrap is not Monte Carlo, where we can sample from theoretical distributions as much as we want. The bootstrap always relies on a fraction of the data collected. The problem is that a fraction of something small is even smaller. I used a mere 20 observations for every bootstrapped sample to show that we don't need observations by the hundreds to get reasonable results. Although I would surely seek to increase this value, calculating the average of 20 observations is still reasonable. If I had only 100 observations, I would be calculating averages from what, 3 to 5 observations? That is a very noisy estimation! This is a particular weakness of nonparametric methods. Because they double the bet on data, we frequently need more data than would be necessary under parametric methods. It is as if statistical theory saves us the trouble of collecting more data when we use parametric methods. Nonparametric methods instead trade data for generality: we accept the obligation of getting more and better data in exchange for greater flexibility. This added obligation means we need to be sure we have enough data to avoid noisy estimations.

Sometimes this trade is very profitable. Here we stressed the case of skewed distributions. We saw that we can work well with skewed distributions while also coming close to the theoretical results when distributions are normal. This generality doesn't end with skewness, though. We can also carry out hypothesis tests on statistics other than the average, which is very difficult to do with parametric methods. We can easily test the median, the trimmed mean, every single quantile we want, and so on. As long as we have enough unbiased data (coupled with some computing power), nonparametric methods are really powerful tools.
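For instance, testing a difference in medians instead of averages only requires swapping the statistic passed to the bootstrap sketch used earlier; nothing else in the procedure changes (the helper names remain my own):

```python
def median_difference(a, b):
    # Awkward to handle with parametric theory, trivial to bootstrap
    return np.median(a) - np.median(b)

null_dist_median = bootstrap_statistic(centered_goods, centered_bads,
                                        median_difference,
                                        n_draws=10_000, subsample=20, rng=rng)
critical_median = np.quantile(null_dist_median, 0.95)
```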

Don't be awed by the cleverness of nonparametric methods, though. This frequently leads to poor judgment about when to apply them. It causes me great pain to see fellow data scientists applying the same method to every kind of problem they face, especially if that method has strong data, computational, or theoretical requirements. I think it is part of our job description to resist the temptation of seeing everything as a nail just because we have a pretty awesome hammer. Statistical methods are just a bunch of useful tools for data analysis, just like cutlery is a bunch of useful tools for eating. Every tool must be applied to the problems it was designed to handle — and only to them. Forks are a pretty useful tool for eating, but try to eat soup with one!

Code and results used in this article are available on a GitHub repository here
