Getting Started

Bootstrap statistics — how to work around limitations of simple statistical tests

Florin Andrei
Towards Data Science
7 min readJan 14, 2021

--

You have a data sample. From it, you want to calculate a confidence interval for the population mean value. What’s the first thing you think about? It’s usually a t-test.

But the t-test has several requirements, one of which is that the sampling distribution of the mean is nearly normal (either the population is normal, or the sample is reasonably large). In practice, that’s not always true, and so the t-test may not always deliver optimal results.

To work around that kind of limitation, use the bootstrap method. It has only one important requirement: that the sample approximates the population well enough. Normality is not required.

Let’s give the t-test a try, and then compare it with the bootstrap method.

One-sample test for the confidence interval of the mean value

Let’s load first all libraries we need:

from sklearn.utils import resample
import pandas as pd
from matplotlib import pyplot as plt
import pingouin as pg
import numpy as np
import seaborn as sns
np.random.seed(42)
plt.rcParams["figure.figsize"] = (16, 9)

Note: we’re using the Pingouin library for several statistical tests here. It’s great because many important techniques are kept under the same roof, and have uniform interfaces and syntax. Give it a try:

https://pingouin-stats.org/

Now back to the main topic — this is the sample:

spl = [117, 203, 243, 197, 217, 224, 279, 301, 317, 307, 324, 357, 364, 382, 413, 427, 490, 742, 873, 1361]
plt.hist(spl, bins=20);

You can sort-of pretend that looks normal, but it has several large outliers, which the t-test does not tolerate very well. Still, let’s do the t-test for the 95% confidence interval of the population mean:

pgtt = pg.ttest(spl, 0)
print(pgtt['CI95%'][0])
[272.34 541.46]

That’s a very wide interval, reflecting the spread of values in the sample. Can we do better? The answer is yes, using the bootstrap method.

Bootstrap

If the sample was larger, or if you could extract multiple samples from the same population, the results would be better. But that’s not possible here — this sample is all we have.

However, if we assume the sample approximates the population fairly well, we can pretend to generate multiple pseudo-samples from the population: what we really do is take this sample and resample it — generate multiple samples, of the same size, using the same values. And then we work with the set of generated samples.

The number of pseudo-samples you make needs to be large, in the thousands usually.

Since the samples we generate from the original sample are the same size as the original, some values will be repeated in each pseudo-sample. That’s fine.

CI for the mean value

Let’s generate ten thousand chunks of resampled data from the original sample:

B = 10000
rs = np.zeros((B, len(spl)), dtype=np.int)
for i in range(B):
rs[i,] = resample(spl, replace = True)
print(rs)
[[ 279 1361 413 ... 357 357 490]
[ 307 427 413 ... 317 203 1361]
[ 413 279 357 ... 357 203 307]
...
[ 217 243 224 ... 279 490 1361]
[ 301 324 224 ... 224 317 317]
[ 324 301 364 ... 873 364 203]]

Each line in the array is a resampled chunk and is the same size as the original sample. There are 10k lines in total.

Now let’s build the bootstrap distribution: for each line, calculate the mean value:

bd = np.mean(rs, axis=1)
print(bd)
[376.35 515.15 342.75 ... 507.8 426.15 377.05]

The bootstrap distribution contains the means for each resampled chunk of data. It’s easy to get the 95% confidence interval for the mean from here: simply sort the bootstrap distribution array (the means), cut off the top and bottom 2.5%, and read the remaining extreme values:

bootci = np.percentile(bd, (2.5, 97.5))
print(bootci)
[300.0975 541.705]

And that’s the bootstrap 95% confidence interval for the mean. How does it compare to the t-test confidence interval?

y, x, _ = plt.hist(bd, bins=100)
ymax = y.max()
plt.vlines(pgtt['CI95%'][0], 0, ymax, colors = 'red')
plt.vlines(bootci, 0, ymax, colors = 'chartreuse')
plt.vlines(np.mean(spl), 0, ymax, colors = 'black')
plt.show();

The bootstrap CI (in green) is somewhat more narrow than the t-test CI (in red).

CI for the median value

You can use bootstrap to generate a CI for the median value as well: simply build the bootstrap distribution using np.median() instead of np.mean():

bd = np.median(rs, axis = 1)
bootci = np.percentile(bd, (5, 95))
print(bootci)
[279. 382.]

CI for variance

Let’s take a sample from a normal population of known variance (var=100):

sp = np.random.normal(loc=20, scale=10, size=50)
sns.violinplot(x=sp);

The variance of our sample is indeed close to 100:

print(np.var(sp))112.88909322502681

Let’s use bootstrap to extract a confidence interval for the population variance from our existing sample. Same idea: resample the data many times, calculate the variance for each resampled chunk, get a percentile interval for the list of variances.

rs = np.zeros((B, len(sp)), dtype=np.float)
for i in range(B):
rs[i,] = resample(sp, replace=True)
bd = np.var(rs, axis=1)
bootci = np.percentile(bd, (2.5, 97.5))
print(bootci)
[74.91889462 151.31601521]

Not bad; the interval is narrow enough, and it includes the true value for the population variance (100). Here’s the histogram of variances from the bootstrap distribution:

y, x, _ = plt.hist(bd, bins=100)
ymax = y.max()
plt.vlines(bootci, 0, ymax, colors = 'chartreuse')
plt.vlines(np.var(sp), 0, ymax, colors = 'black')
plt.show();

Two-sample difference of means, unpaired

Here are the weights of two samples of plums, grouped by color:

https://raw.githubusercontent.com/FlorinAndrei/misc/master/plumdata.csv

plumdata = pd.read_csv('plumdata.csv', index_col=0)
sns.violinplot(x='color', y='weight', data=plumdata);

Are the two colors, on average, different in terms of weight? In other words, what’s the confidence interval for the difference of mean weights? Let’s try the classic t-test first:

plum_red = plumdata['weight'][plumdata['color'] == 'red'].values
plum_yel = plumdata['weight'][plumdata['color'] == 'yellow'].values
plumtt = pg.ttest(plum_red, plum_yel)
print(plumtt['CI95%'][0])
[8.67 27.2]

The t-test says — yes, with 95% confidence, the mean weight of red plums is different from the mean weight of yellow plums (the CI does not include 0).

For bootstrap, we will literally resample both samples 10k times, calculate the mean for each resampled chunk, do the difference of means 10k times, and look at the 2.5 / 97.5 percentiles for the difference of means distribution.

This is the resampled data:

plzip = np.array(list(zip(plum_red, plum_yel)), dtype=np.int)
rs = np.zeros((B, plzip.shape[0], plzip.shape[1]), dtype=np.int)
for i in range(B):
rs[i,] = resample(plzip, replace = True)
print(rs.shape)
(10000, 15, 2)

10000 is for the number of resamples, 15 is the length of each resample, and there are 2 samples we compare in each case.

Let’s calculate the mean for each resampled chunk:

bd_init = np.mean(rs, axis=1)
print(bd_init.shape)
(10000, 2)

And the bootstrap distribution is the difference of means (red mean minus yellow mean) for each one of the 10000 resampled cases:

bd = bd_init[:, 0] - bd_init[:, 1]
print(bd.shape)
(10000,)print(bd)[12.33333333 21. 10.33333333 ... 21.46666667 11.66666667
17.06666667]

And now we can get the CI for the difference of means from the bootstrap distribution:

bootci = np.percentile(bd, (2.5, 97.5))
print(bootci)
[9.33333333 26.06666667]

Again bootstrap provides a tighter interval than the t-test:

y, x, _ = plt.hist(bd, bins=100)
ymax = y.max()
plt.vlines(plumtt['CI95%'][0], 0, ymax, colors = 'red')
plt.vlines(bootci, 0, ymax, colors = 'chartreuse')
plt.show();

Two-sample mean of differences, paired

Here’s paired data — the strength of several brands of ropes in either wet or dry conditions:

https://raw.githubusercontent.com/FlorinAndrei/misc/master/strength.csv

strength = pd.read_csv('strength.csv', index_col=0)
sns.violinplot(data=strength);

Is this enough data to conclude that the wet/dry conditions influence the strength of the ropes? Let’s calculate a 95% confidence interval for the mean difference of strength, wet vs dry. First, the t-test:

s_wet = strength['wet']
s_dry = strength['dry']
s_diff = s_wet.values - s_dry.values
strengthtt = pg.ttest(s_wet, s_dry, paired=True)
print(strengthtt['CI95%'][0])
[0.4 8.77]

The t-test indicates, with 95% confidence, that indeed the wet/dry conditions make a statistically significant difference — but just barely. One end of the CI is close to 0.

Now let’s do bootstrap for the same CI. We can simply take s_diff (the differences in strength for each rope), and go back to the one-sample bootstrap procedure:

rs = np.zeros((B, len(s_diff)), dtype=np.int)
for i in range(B):
rs[i,] = resample(s_diff, replace = True)
bd = np.mean(rs, axis=1)
bootci = np.percentile(bd, (2.5, 97.5))
print(bootci)
[1.33333333 8.5]

Both the t-test and bootstrap agree that there’s a statistically significant difference (95% confidence) that the strengths are different when conditions are different. Bootstrap is a little more confident (interval is more narrow and is further away from 0).

y, x, _ = plt.hist(bd, bins=100)
ymax = y.max()
plt.vlines(strengthtt['CI95%'][0], 0, ymax, colors = 'red')
plt.vlines(bootci, 0, ymax, colors = 'chartreuse')
plt.show();

Notebook with code:

https://github.com/FlorinAndrei/misc/blob/master/bootstrap_plain.ipynb

Final thoughts

This is just the basic bootstrap technique. It does not perform bias corrections, etc.

There is no cure for small sample sizes. Bootstrap is powerful, but it’s not magic — it can only work with the information available in the original sample.

If the samples are not representative of the whole population, then bootstrap will not be very accurate.

--

--

BS in Physics. MS in Data Science. Over a decade experience with cloud computing.