
An Intuitive Guide to the Bootstrap



This article provides an overview of how to answer causal questions while making very few assumptions

The bootstrap¹ is, to me, one of the most remarkable inventions of 20th-century statistics. It is an easy-to-implement resampling technique that is profoundly powerful, and it is widely used across many domains.

From testing causal claims to improving the predictive accuracy of machine learning models, the bootstrap has vastly expanded the kinds of questions anyone can ask of their data. To fully appreciate how it works, I would like to walk you through a fun and simple example.

Imagine that Netflix is promoting a fictional action movie called "Hollywood Rising". They want to maximize the amount of time people spend watching this movie. To do so, they would like to optimize the artwork (thumbnail) a user sees for this movie when they log on to Netflix’s site.

A movie’s artwork is a user’s focal point of interaction with Netflix’s site. Unappealing thumbnails could thus deter viewers from watching a movie. Hence, the execs over at Netflix are keen on using a thumbnail that appeals to a large proportion of their user base. Let’s assume for a moment that the artwork below is one of two options under consideration.

Photo by Angela Ng on Unsplash. Do users who view thumbnail A watch Hollywood Rising for longer?

These execs would like to know how their users will respond to thumbnail A before pushing it out to everyone. To do so, they carry out an experiment.

First, they randomly select 1000 viewers from their user base. These viewers are then shown thumbnail A. Each user’s view time, essentially the number of minutes they spend watching Hollywood Rising, is then logged. This information is then used to compute various summary statistics of interest to Netflix executives.

For instance, they may be interested in the mean of this view time distribution (MVT). Say they observe that the MVT under thumbnail A is higher than the MVT under thumbnail B.

Photo by Cameron Venti on Unsplash. Alternative thumbnail B, which Netflix is considering showing its members during the Hollywood Rising premiere.

A Simple Experiment

A question on the minds of a lot of execs at Netflix is: how reliable is this one number? Should they use it to push thumbnail A over B? Can they somehow get a sense of the underlying uncertainty in their estimate?

Let us try and answer these questions with some simulated data. I have drawn 1000 observations from a normal distribution with a mean of 50 minutes and a standard deviation of 15 minutes.
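Here is a minimal sketch of that simulation, assuming NumPy; the seed and variable names are my own illustrative choices rather than the original script:

```python
import numpy as np

# Illustrative simulation: 1000 view times drawn from a normal
# distribution with mean 50 minutes and standard deviation 15 minutes.
rng = np.random.default_rng(seed=42)
view_times = rng.normal(loc=50, scale=15, size=1000)

print(f"Empirical mean view time: {view_times.mean():.1f} minutes")
```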

View times of all 1000 viewers who watched Hollywood Rising after being shown thumbnail A.

With those parameter values, we observe a wide range of view time behaviors. It seems as though some users barely watched the movie (view time ~ 0 mins) whilst a handful of them viewed it till completion (view time ~ 120 mins). All of these scenarios represent plausible viewing behavior. We also find the empirical mean viewing time to be 49.7 minutes. That is expected as I am drawing from a distribution with a mean of 50 minutes.

How can I know whether this number is unusually low or unusually high? How can I tell whether my statistic’s value is simply the result of random chance? Maybe I observed the value I did because of the specific people chosen in my sample, and their specific moods on the day the sample was collected.

To answer all of these questions, we need to measure the variability of our statistic. That is, we want to get a sense of how different the value of our statistic would have been had we chosen a different random sample. So how do we do that?

Well, there are two broad ways:

  1. Use methods from statistical theory
  2. Bootstrap

Statistical Theory

Since each of the observations in our sample is independent (drawn at random) and identically distributed (all drawn from the same normal distribution), statistical theory gives us the following:

Derivation of the standard error (the standard deviation of the sample mean) from statistical theory:

SE(x̄) = σ/√n, estimated by s/√n

where σ is the population standard deviation, s is the sample standard deviation, and n is the sample size.
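Applied to our simulated data, the plug-in computation is a one-liner; this sketch assumes the view_times array from the simulation above:

```python
import numpy as np

# Plug-in standard error of the sample mean: s / sqrt(n), where s is the
# sample standard deviation (ddof=1) and n is the sample size.
n = len(view_times)
se_theory = view_times.std(ddof=1) / np.sqrt(n)
print(f"Standard error from theory: {se_theory:.4f}")
```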

We can now use this formula to measure the variability of our sample mean statistic. Well, that was easy! Why don’t we always just do that? One reason is that these nice, closed-form formulas have only been derived for a subset of well-studied distributions (think Poisson, Normal). Thus, to make use of simple formulas like the one above, one needs to ensure that the data conform to the assumptions of one of these distributions.

This is not an appealing solution. Assumptions, like most things in real life, need to be tested. Whilst they may be reasonable at times, they end up forcing an analyst to spend a ton of time validating assumptions instead of studying the actual data at hand.

Therein lies, in my opinion, the ultimate motivation for a model-free solution. Can we somehow learn about the behavior of our estimator without imposing all of these conditions on its data generating process?

Is there a way for us to use the empirical distribution of our observed sample as our north star, our guiding compass?

Yes, there is! Let me tell you all about it in the next section.


The Bootstrap

We know that each of our simulated observations is independent and identically distributed. Hence, why don’t we just sample from our sample? Specifically, why don’t we re-sample with replacement?

Let’s think about that for a moment. Our sample of 1000 view times is essentially the only depiction of Netflix user behavior we have. If this sampling step was carried out correctly (i.e. in a stratified random fashion that represents the distribution of Netflix’s user base), it should be a fairly accurate representation of their users’ true viewing behaviors.

Why is this? Well, it is interesting to note that the shape of the distribution of a randomly drawn sample often resembles that of the population it is drawn from². I know this sounds a bit strange; I too was skeptical when I first heard it. To convince myself, I ran a few tests. Before I show you my results, however, I would like to draw your attention to another important question.

Does this mean every random sample I draw from my population will look alike? No! Nor should it. You will notice some extreme samples where the same number is either drawn many times or never drawn at all. This occurs because we sample with replacement (a quick demonstration follows the list below). Doing so ensures:

  1. Each draw is independent of the others.
  2. We draw samples in a manner that reflects the shape of the data generating process.
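
To see the with-replacement behavior concretely, here is a tiny demonstration with made-up numbers (everything in it is illustrative):

```python
import numpy as np

# Resample ten made-up view times with replacement and note the
# duplicates and omissions in the result.
rng = np.random.default_rng(seed=7)
sample = np.array([12, 25, 40, 55, 63, 71, 88, 95, 102, 110])

resample = rng.choice(sample, size=len(sample), replace=True)
print("Original sample:", sample)
print("Bootstrap draw: ", np.sort(resample))
# Some values show up several times, others not at all -- exactly what
# lets each resample act like a fresh draw from the population.
```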

These properties are once again extremely useful in the context of simulating counterfactual (what-if) realities. For instance, say the view times you observed in your original experiment were inherently influenced by people’s moods that day. Some people may have just wanted to grab some popcorn and watch a good action thriller with their partner, whilst others might have had a rough day and thus wanted to watch something funny and relaxing.

Hence, in situations like these, you would expect substantially skewed view time distributions. There could be a few small view time durations and a few unusually large ones.

What if we had carried out our experiment on a day when a large chunk of the user base wanted to watch a good action flick or vice versa? Questions like these are what make it difficult for Netflix to accurately infer what their viewers want through the lens of a single experiment.

So, should they conduct the same experiment over and over again? What if that’s costly? What if it ruins their user experience? Is there an alternative? This is one of the many instances in which bootstrapping comes to the rescue.

The bootstrap procedure is described below:

  1. Randomly select a group of Netflix viewers and show them thumbnail A.
  2. Record their viewing times.
  3. Compute the mean viewing time (MVT) of this random sample.
  4. Draw another sample (at random, with replacement) from the sample collected in step 1 and record its MVT. Call this the bootstrap MVT.
  5. Repeat step 4 B times (usually B is set to a large number like 250 or 500).
  6. Compute the sample standard deviation (standard error) of the B values of Bootstrap MVT.

I have also attached a snippet of code for creating bootstrap samples in Python.
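This is a minimal sketch of steps 4 to 6, assuming NumPy and the simulated view_times array from earlier; the function name bootstrap_mvt and the parameter choices are illustrative:

```python
import numpy as np

def bootstrap_mvt(view_times, n_boot=500, seed=0):
    """Return n_boot bootstrap estimates of the mean view time (MVT)."""
    rng = np.random.default_rng(seed)
    n = len(view_times)
    mvts = np.empty(n_boot)
    for b in range(n_boot):
        # Step 4: resample n observations with replacement and record
        # the mean view time of the resample.
        resample = rng.choice(view_times, size=n, replace=True)
        mvts[b] = resample.mean()
    return mvts

# Steps 5 and 6: repeat B times, then take the sample standard deviation
# of the bootstrap MVTs as our estimate of the standard error.
boot_mvts = bootstrap_mvt(view_times, n_boot=500)
print(f"Bootstrapped standard error: {boot_mvts.std(ddof=1):.5f}")
```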

For a comprehensive overview of the code I used to generate the plots below, please check out my _GitHub repo_. Now as promised, here are pictures of a few bootstrap samples I drew:

A few images that display bootstrap samples created by sampling 1000 observations from the observed sample.

As expected, we see quite a bit of variation in each of their shapes, despite their being drawn from the same distribution. You may also be wondering: does the bootstrapped estimate of the standard error match the one predicted by statistical theory? Well, statistical theory tells us this number is 0.4963, while our bootstrapped estimate is 0.48694. I think the numbers speak for themselves.

As shown above, bootstrapping enables us to re-sample from the data we have collected. We can thus, in essence, simulate many alternative realities. These samples allow us to pretend as though we are conducting our experiment at different points in time.

By simulating a vast and varied set of plausible realities, we can form a distribution over the different values our estimator (the mean view time in our example) can take. With this distribution, we get a much better picture of the uncertainty in our estimator.


Limitations

The bootstrap, like all statistical methods, has its shortcomings. For instance, its reliance on the empirical distribution of the data at hand can be faulty when sample sizes are extremely small. I highly encourage readers to check out _An Introduction to the Bootstrap_¹ for a comprehensive discussion of some of these limitations. For readers who prefer a quick overview, please check out this wonderful Stack Exchange³ answer.

Conclusions

And that is all there is to it! This elegant procedure allows us to do a ton of wonderful things. We can use it to measure the uncertainty in our estimates. We can build confidence intervals around our estimates and carry out hypothesis tests (e.g. are view times higher under thumbnail A than under thumbnail B?). We can also make inferential statements about complex statistics (e.g. what is the standard error of the median view time under thumbnail A?) for which such nice closed-form formulas are not available.
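As a sketch of the confidence-interval use case, a 95% percentile interval can be read straight off the bootstrap distribution; this assumes the boot_mvts array computed in the snippet earlier:

```python
import numpy as np

# 95% percentile confidence interval for the mean view time, taken
# directly from the bootstrap distribution computed earlier.
lower, upper = np.percentile(boot_mvts, [2.5, 97.5])
print(f"95% CI for MVT under thumbnail A: ({lower:.1f}, {upper:.1f}) minutes")
```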

Each of these avenues allows Netflix to get a better sense of their experimental results. It also equips them to make well-informed and meaningful decisions with minimal assumptions about their data.

Finally, for those of you concerned about the statistical properties of this procedure, worry not. Statisticians¹ have put in a lot of time and effort to demonstrate that, under fairly general conditions, it produces estimates with desirable statistical properties (such as consistency).


References

[1] Bradley Efron and Robert J. Tibshirani, An Introduction to the Bootstrap (1994), CRC Press

[2] Explaining to laypeople why bootstrapping works (2012), https://stats.stackexchange.com/questions/26088/explaining-to-laypeople-why-bootstrapping-works

[3] Pros and Cons of Bootstrapping (2017), https://stats.stackexchange.com/questions/280725/pros-and-cons-of-bootstrapping

