
Pulling Your Data Up By the Bootstraps

What is bootstrapping and why do we use it?

Photo by Shahadat Rahman on Unsplash

If you work with large datasets, you have probably heard of bootstrapping. If you are a burgeoning statistician or bioinformatician, it is part of your computational toolset. What is the point of using this method? More importantly, what is bootstrapping anyway?

Bradley Efron first published the idea of bootstrapping in 1979¹. This computer-intensive technique became more popular and useful as computing power became cheaper and more available. Indeed, researchers have cited the bootstrapping method more than 20,000 times.

When working with large datasets, we aim to make inferences about the population from which our data is drawn. While we can calculate a mean or median, we do not know how precise this estimate is. Increasing our sample size reduces the error and brings us closer to the population parameters. However, if we are conducting RNA sequencing or collecting large swathes of data, increasing the sample size is expensive or even impossible. Bootstrapping is a resampling method that helps us determine the error and confidence intervals of our estimates. Results from bootstrapping then inform conclusions, whether you are looking at stock market data, phylogenetic trees or gene transcript abundances.

Defining the Bootstrap

Bootstrapping is a method of resampling with replacement. We will run through an example to explain how this works as well as the assumptions for this method.

Suppose we have a dataset indicating the fees that basketball players charge for making appearances at birthday parties. It is difficult to contact more than 8 players, so our dataset, D, in this example contains 8 values. We talked with a wide spectrum of basketball players, from benchwarmers to starters, to ensure that our sample is similar enough to the entire population of players.

Herein lies our statistical assumption: our data sample approximates the population distribution.

D = {100, 200, 200, 300, 500, 1000, 1000, 750}

Here the average of our sample D is 506.25. If we bootstrap this sample a few times, we will get a better idea of the variance within this dataset. Our resampled bootstraps will have 8 values each; however, since they are resampled with replacement, the same value (e.g. 100) can appear multiple times. In this way, bootstrapping may generate different estimates each time it is run. However, with enough bootstraps, we build up an approximation of the variance within the data. Notice the following:

  1. We are not adding any new points to our dataset.
  2. Each resampled bootstrap contains the same number of values as our original sample.
  3. Since we resample with replacement, the probability of resampling any value is the same throughout the bootstrap. Each value is drawn as an independent event. If the first value that we resampled is 200, this does not change the probability that the second value in this bootstrap will also be 200.
D₁ = {100, 1000, 500, 300, 200, 200, 200, 100}
D₂ = {300, 1000, 1000, 300, 500, 100, 200, 750}
D₃ = {750, 300, 200, 200, 100, 300, 750, 1000}

The averages of D₁, D₂ and D₃ are 325, 518.75 and 450, respectively. We can then use these values to calculate the standard error, confidence intervals and other measures of interest. Using Python, R or other languages, it's simple to generate 50, 100 or even 1,000 bootstrapped samples, as in the sketch below. Knowing the bias, variance and spread of our sample helps us make better inferences about the population it's drawn from, and lets us incorporate the robustness of our sample into the rest of our inferences.
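To make this concrete, here is a minimal Python sketch of the procedure using only the standard library. The fee values match our example D; the resampled means, standard error and confidence interval will vary with the random seed (and will not reproduce D₁, D₂ and D₃ exactly).

```python
import random
import statistics

# Our sample of 8 appearance fees
D = [100, 200, 200, 300, 500, 1000, 1000, 750]

def bootstrap_means(data, n_boot=1000, seed=42):
    """Resample `data` with replacement n_boot times; return the mean of each resample."""
    rng = random.Random(seed)
    # Each resample has the same number of values as the original sample
    return [statistics.mean(rng.choices(data, k=len(data))) for _ in range(n_boot)]

means = sorted(bootstrap_means(D))

# The standard deviation of the bootstrap means estimates the standard error
se = statistics.stdev(means)

# A 95% percentile confidence interval: the 2.5th and 97.5th percentiles
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means))]

print(f"sample mean: {statistics.mean(D):.2f}")   # 506.25
print(f"bootstrap standard error: {se:.2f}")
print(f"95% CI: ({lower}, {upper})")
```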

For the sake of this example, we used a small dataset. In practice, bootstrapping is not well suited to very small datasets, datasets with many outliers, or datasets with dependent measurements (such as time series, where observations are not independent).

If you are still having trouble visualizing this method, I’ve shown the process of bootstrapping below, on a dataset of jellybeans.

Created by Simon Spichak

Using the Bootstrap for Bioinformatics

Example 1: Phylogenetic Trees²

Bootstrapping helps us determine the confidence of specific branches within a phylogenetic tree. We might be looking at an amino acid sequence from a protein or a nucleotide sequence from a gene. Our original sample can quickly be resampled 1,000 times, reconstructing 1,000 bootstrapped trees. If your original tree shows that a specific protein or gene sequence branches off, you can check your bootstrapped trees to see how often this branch occurs. If it occurs in more than 950 of them, you can be fairly certain that the branch is robust. If it occurs in only around 400, it could result from an outlier.
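The resampling step itself is simple; the tree building is where dedicated tools come in. Below is a rough Python sketch of the idea, where build_tree and the tree's branch representation are hypothetical placeholders for whatever tree-building pipeline you actually use:

```python
import random

def bootstrap_columns(alignment, rng):
    """Resample the columns (sites) of a sequence alignment with replacement.

    `alignment` is a list of equal-length sequences; each bootstrap replicate
    keeps the same length but redraws which sites appear.
    """
    n_sites = len(alignment[0])
    cols = rng.choices(range(n_sites), k=n_sites)
    return ["".join(seq[i] for i in cols) for seq in alignment]

def branch_support(alignment, branch_of_interest, build_tree, n_boot=1000, seed=0):
    """Fraction of bootstrap trees containing `branch_of_interest`.

    `build_tree` and `branch_of_interest` are hypothetical stand-ins for a
    real tree-building method and branch representation.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        tree = build_tree(bootstrap_columns(alignment, rng))
        if branch_of_interest in tree.branches:
            hits += 1
    return hits / n_boot
```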

Example 2: Estimating Gene Transcript Abundance

Sleuth³ software estimates gene transcript abundance using a bootstrap approach. By resampling our next-generation sequencing reads, we can calculate a more robust estimate of transcript abundance. Resampling gives us an idea of the technical variability within our data. This technical variation is then used alongside biological variation when estimating whether a specific gene or transcript is differentially expressed in your dataset.
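This is not sleuth's actual code, but a toy sketch of what the bootstrap replicates provide: a per-transcript spread of abundance estimates from which technical variance can be computed. The numbers below are made up for illustration.

```python
import statistics

# Hypothetical bootstrapped abundance estimates (e.g. TPM) for one transcript,
# as a quantifier might emit across bootstrap rounds of the same sample
bootstrap_estimates = [101.2, 98.7, 104.5, 99.9, 102.3, 97.8, 103.1, 100.4]

# The spread across bootstrap rounds approximates the technical variance of
# the abundance estimate; this can then be separated from the biological
# variation observed between samples
technical_variance = statistics.variance(bootstrap_estimates)
point_estimate = statistics.mean(bootstrap_estimates)

print(f"estimate: {point_estimate:.1f}, technical variance: {technical_variance:.2f}")
```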


Other uses of bootstrapping include bootstrap aggregating (bagging) for ensemble machine learning. Basically, our dataset is resampled many times, and each bootstrapped sample is used to train its own classifier or model. We then aggregate all of the outputs to generate a more accurate classifier. This helps prevent overfitting to our limited sample.
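Here is a minimal sketch using scikit-learn's BaggingClassifier, assuming scikit-learn is installed; the dataset is synthetic and stands in for your own:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# A synthetic dataset standing in for real data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 100 base learners (decision trees by default), each trained on a
# bootstrap resample of the data; their predictions are aggregated by voting
bagged = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)

# Average cross-validated accuracy of the bagged ensemble
print(cross_val_score(bagged, X, y, cv=5).mean())
```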

References

  1. Efron, B. "Bootstrap Methods: Another Look at the Jackknife." The Annals of Statistics 7.1 (1979): 1–26. doi:10.1214/aos/1176344552. https://projecteuclid.org/euclid.aos/1176344552
  2. Efron, Bradley, Elizabeth Halloran, and Susan Holmes. "Bootstrap confidence levels for phylogenetic trees." Proceedings of the National Academy of Sciences 93.23 (1996): 13429–13434.
  3. "Differential expression analysis with sleuth." HBC Training. https://hbctraining.github.io/DGE_workshop_salmon/lessons/09_sleuth.html
