The world’s leading publication for data science, AI, and ML professionals.

Bootstrap Sampling in R

Booststrapping uses random sampling with replacement to estimate statistics from a sample. By resampling from this sample we can generate…

Exploring the differences in tuition between public and private schools

Bootstrapping uses random sampling with replacement to estimate statistics from a sample. By resampling from this sample we can generate novel data that can be a representative of a population. This is loosely based on the law of large numbers. Instead of estimating estimating a statistic once–from the small sample we start with–the statistic can be estimated multiple times. Hence, if we resample with replacement 10 times, we will compute estimates of a desired statistic 10 times.

The procedure in bootstrapping is as follows:

  1. Resample the data with replacement n times
  2. Compute desired statistic n times to generate a distribution of estimated Statistics
  3. Determine standard error/confidence interval for the bootstrapped statistic from the bootstrapped distribution

In this article I will demonstrate how to perform bootstrap resampling to estimate the relation between the tuition cost of private and public colleges using the historical tuition dataset from TidyTuesday.

To begin we load the data from GitHub.

library(tidyverse)
library(Tidymodels)
historical_tuition <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/historical_tuition.csv')
# A tibble: 6 x 4
  type             year    tuition_type    tuition_cost
  <chr>            <chr>   <chr>                  <dbl>
1 All Institutions 1985-86 All Constant           10893
2 All Institutions 1985-86 4 Year Constant        12274
3 All Institutions 1985-86 2 Year Constant         7508
4 All Institutions 1985-86 All Current             4885
5 All Institutions 1985-86 4 Year Current          5504
6 All Institutions 1985-86 2 Year Current          3367

Currently, the data is in long format where each row represents one observation. There are multiple observations for each year representing data from each institution type: Public, Private, All Institutions. In order for us to make use of the data with resampling we need to convert the data to a wide format where we have one column containing tuition costs for private schools and another column containing the tuition costs for public schools. Each row in this wide format represents the historical tuition cost for a particular year. Below is the code that transforms the long format data to a wider format. The columns we will be interested in for modeling are public and private representing the tuition costs for public and private schools for a particular academic year as given by the column year .

tuition_df <- historical_tuition %>% 
  pivot_wider(names_from = type,
              values_from = tuition_cost
              ) %>%
  na.omit() %>% 
  janitor::clean_names()
# A tibble: 25 x 5
   year    tuition_type    all_institutions public private
   <chr>   <chr>                      <dbl>  <dbl>   <dbl>
 1 1985-86 All Constant               10893   7964   19812
 2 1985-86 4 Year Constant            12274   8604   20578
 3 1985-86 2 Year Constant             7508   6647   14521
 4 1985-86 All Current                 4885   3571    8885
 5 1985-86 4 Year Current              5504   3859    9228
 6 1985-86 2 Year Current              3367   2981    6512
 7 1995-96 All Constant               13822   9825   27027
 8 1995-96 4 Year Constant            16224  11016   27661
 9 1995-96 2 Year Constant             7421   6623   18161
10 1995-96 All Current                 8800   6256   17208

With this wider format data, we can quickly visualize the relation between private and public school tuition costs.

tuition_df %>% 
  ggplot(aes(public, private))+
  geom_point()+
  scale_y_continuous(labels=scales::dollar_format())+
  scale_x_continuous(labels=scales::dollar_format())+
  ggtitle("Private vs Public School Tuition")+
  xlab("Public School Tuition")+
  ylab("Private School Tuition")+
  theme_linedraw()+
  theme(axis.title=element_text(size=14,face="bold"),
        plot.title = element_text(size = 20, face = "bold"))

There appears to be a direct positive relation between private and public school tuition costs. In general private schools cost more than public schools. We will now quantitatively just how much.

Image by author
Image by author

We can perform a linear fit on the data to determine the relation between private and public school tuition costs. We can accomplish this with the lm function in R. We can see the slope is estimated to be 2.38. For every dollar in tuition spent at a public school a private school is expected to pay around 2.38 times more.

tuition_fit <- lm(private ~ 0 + public,
               data = tuition_df)
# We tidy the results of the fit
tidy(tuition_fit)
# A tibble: 1 x 5
  term   estimate std.error statistic  p.value
  <chr>     <dbl>     <dbl>     <dbl>    <dbl>
1 public     2.38    0.0346      68.7 1.05e-66

Now we will use resampling to attempt to gauge a better estimate of the relation between public and private school tuition costs. With bootstrapping we will randomly draw with replacement to create new datasets from the original dataset that are the same size as the original. This essentially simulates generating new sets of data from the original set. We will resample the data 1,000 times.

# Set set.seed a starting number to generate a sequence of random numbers so that we can get reproducible results
set.seed(123)
tution_boot <- bootstraps(tuition_df,
                          times = 1e3,
                          apparent = TRUE)

Next we will now fit a linear model to the 1,000 re-samplings. These results will be stored in a new column we call model and the statistics will be stored in a column called conf_inf . We will then unnest the results to extract the estimated statistic (slope of the line), errors, and p values.

tuition_models <- tution_boot %>% 
  mutate(model = map(splits, ~lm(private ~ 0 + public,
               data = .) ),
         coef_inf = map(model, tidy))
tuition_coefs <- tuition_models %>% 
  unnest(coef_inf)
splits          id            model  term   estimate std.error statistic  p.value
   <list>          <chr>         <list> <chr>     <dbl>     <dbl>     <dbl>    <dbl>
 1 <split [72/25]> Bootstrap0001 <lm>   public     2.37    0.0354      66.7 8.41e-66
 2 <split [72/28]> Bootstrap0002 <lm>   public     2.38    0.0353      67.4 4.30e-66
 3 <split [72/23]> Bootstrap0003 <lm>   public     2.36    0.0365      64.9 6.19e-65
 4 <split [72/25]> Bootstrap0004 <lm>   public     2.30    0.0328      70.1 2.67e-67
 5 <split [72/25]> Bootstrap0005 <lm>   public     2.35    0.0364      64.7 7.15e-65
 6 <split [72/26]> Bootstrap0006 <lm>   public     2.36    0.0344      68.5 1.31e-66
 7 <split [72/25]> Bootstrap0007 <lm>   public     2.33    0.0299      77.9 1.61e-70
 8 <split [72/23]> Bootstrap0008 <lm>   public     2.34    0.0368      63.5 2.60e-64
 9 <split [72/21]> Bootstrap0009 <lm>   public     2.41    0.0349      68.9 8.70e-67
10 <split [72/26]> Bootstrap0010 <lm>   public     2.39    0.0334      71.6 6.00e-68
# ... with 991 more rows

We now have 1,000 estimates, obtained from resampling, of the relation between the tuition cost of private and public schools which is stored in the column estimate . The question now arises, what is the distribution of the estimated slope between private and public school tuition?

tuition_coefs %>% 
  ggplot(aes(estimate))+
  geom_histogram(alpha = .7)+
  ggtitle("Distribution of Estimated Slope")+
  xlab("Estimate")+
  ylab("Frequency")+
  theme_linedraw()+
  theme(axis.title=element_text(size=14,face="bold"),
        plot.title = element_text(size = 20, face = "bold"))

We can see the distribution appears normal and we can gain a feel for how broad it is. Numerically we can tabulate the description of this distribution as follows.

Image by author
Image by author
int_pctl(tuition_models,
         coef_inf)
# A tibble: 1 x 6
  term   .lower .estimate .upper .alpha .method   
  <chr>   <dbl>     <dbl>  <dbl>  <dbl> <chr>     
1 public   2.31      2.38   2.46   0.05 percentile

We can see with bootstrapping the estimated slope that described the relation between private and public school tuition costs is around 2.38–similar to what we obtained without bootstrapping. This could be because the underlying relation between public and private schools is linear and the underlying assumptions of linear models apply to this dataset. The underlying assumptions in a linear regression are:

  • The underlying relationship between the variables is linear
  • The errors are normally distirbuted
  • There is equal variance around the line of best fit (homoscedasticity of error)
  • Observations are independent

Furthermore, with bootstrapping we have lower and upper bounds for the relation as well which we did not have with just using a linear model on the un-sampled data.

Lastly, we can visualize the various estimates we calculated from the bootstrapped samples.

tuition_aug <- tuition_models %>% 
  # Sample only 200 bootstraps for visualization
  sample_n(200) %>% 
  mutate(augmented = map(model, augment)) %>% 
  unnest(augmented)
tuition_aug %>% 
  ggplot(aes(public, private))+
  geom_line(aes(y = .fitted, group = id), alpha = .1, color = 'grey')+
  geom_point()+
  scale_y_continuous(labels=scales::dollar_format())+
  scale_x_continuous(labels=scales::dollar_format())+
  ggtitle("200 of 1000 Slope Estimations from Bootstrap Resampling")+
  xlab("Public School Tuition")+
  ylab("Private School Tuition")+
  theme_linedraw()+
  theme(axis.title=element_text(size=14,face="bold"),
        plot.title = element_text(size = 20, face = "bold"))
Image by author
Image by author

In summary, this blog demonstrated how to use bootstrap resampling in R to determine the relation between private and public school tuition. This technique estimates the sampling distribution of nearly any statistic using random sampling.


Related Articles