Exploring the differences in tuition between public and private schools
Bootstrapping uses random sampling with replacement to estimate statistics from a sample. By resampling from this sample we can generate novel data that can be a representative of a population. This is loosely based on the law of large numbers. Instead of estimating estimating a statistic once–from the small sample we start with–the statistic can be estimated multiple times. Hence, if we resample with replacement 10 times, we will compute estimates of a desired statistic 10 times.
The procedure in bootstrapping is as follows:
- Resample the data with replacement n times
- Compute desired statistic n times to generate a distribution of estimated Statistics
- Determine standard error/confidence interval for the bootstrapped statistic from the bootstrapped distribution
In this article I will demonstrate how to perform bootstrap resampling to estimate the relation between the tuition cost of private and public colleges using the historical tuition dataset from TidyTuesday.
To begin we load the data from GitHub.
library(tidyverse)
library(Tidymodels)
historical_tuition <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/historical_tuition.csv')
# A tibble: 6 x 4
type year tuition_type tuition_cost
<chr> <chr> <chr> <dbl>
1 All Institutions 1985-86 All Constant 10893
2 All Institutions 1985-86 4 Year Constant 12274
3 All Institutions 1985-86 2 Year Constant 7508
4 All Institutions 1985-86 All Current 4885
5 All Institutions 1985-86 4 Year Current 5504
6 All Institutions 1985-86 2 Year Current 3367
Currently, the data is in long format where each row represents one observation. There are multiple observations for each year representing data from each institution type: Public, Private, All Institutions. In order for us to make use of the data with resampling we need to convert the data to a wide format where we have one column containing tuition costs for private schools and another column containing the tuition costs for public schools. Each row in this wide format represents the historical tuition cost for a particular year. Below is the code that transforms the long format data to a wider format. The columns we will be interested in for modeling are public
and private
representing the tuition costs for public and private schools for a particular academic year as given by the column year
.
tuition_df <- historical_tuition %>%
pivot_wider(names_from = type,
values_from = tuition_cost
) %>%
na.omit() %>%
janitor::clean_names()
# A tibble: 25 x 5
year tuition_type all_institutions public private
<chr> <chr> <dbl> <dbl> <dbl>
1 1985-86 All Constant 10893 7964 19812
2 1985-86 4 Year Constant 12274 8604 20578
3 1985-86 2 Year Constant 7508 6647 14521
4 1985-86 All Current 4885 3571 8885
5 1985-86 4 Year Current 5504 3859 9228
6 1985-86 2 Year Current 3367 2981 6512
7 1995-96 All Constant 13822 9825 27027
8 1995-96 4 Year Constant 16224 11016 27661
9 1995-96 2 Year Constant 7421 6623 18161
10 1995-96 All Current 8800 6256 17208
With this wider format data, we can quickly visualize the relation between private and public school tuition costs.
tuition_df %>%
ggplot(aes(public, private))+
geom_point()+
scale_y_continuous(labels=scales::dollar_format())+
scale_x_continuous(labels=scales::dollar_format())+
ggtitle("Private vs Public School Tuition")+
xlab("Public School Tuition")+
ylab("Private School Tuition")+
theme_linedraw()+
theme(axis.title=element_text(size=14,face="bold"),
plot.title = element_text(size = 20, face = "bold"))
There appears to be a direct positive relation between private and public school tuition costs. In general private schools cost more than public schools. We will now quantitatively just how much.

We can perform a linear fit on the data to determine the relation between private and public school tuition costs. We can accomplish this with the lm
function in R. We can see the slope is estimated to be 2.38. For every dollar in tuition spent at a public school a private school is expected to pay around 2.38 times more.
tuition_fit <- lm(private ~ 0 + public,
data = tuition_df)
# We tidy the results of the fit
tidy(tuition_fit)
# A tibble: 1 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 public 2.38 0.0346 68.7 1.05e-66
Now we will use resampling to attempt to gauge a better estimate of the relation between public and private school tuition costs. With bootstrapping we will randomly draw with replacement to create new datasets from the original dataset that are the same size as the original. This essentially simulates generating new sets of data from the original set. We will resample the data 1,000 times.
# Set set.seed a starting number to generate a sequence of random numbers so that we can get reproducible results
set.seed(123)
tution_boot <- bootstraps(tuition_df,
times = 1e3,
apparent = TRUE)
Next we will now fit a linear model to the 1,000 re-samplings. These results will be stored in a new column we call model
and the statistics will be stored in a column called conf_inf
. We will then unnest the results to extract the estimated statistic (slope of the line), errors, and p values.
tuition_models <- tution_boot %>%
mutate(model = map(splits, ~lm(private ~ 0 + public,
data = .) ),
coef_inf = map(model, tidy))
tuition_coefs <- tuition_models %>%
unnest(coef_inf)
splits id model term estimate std.error statistic p.value
<list> <chr> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 <split [72/25]> Bootstrap0001 <lm> public 2.37 0.0354 66.7 8.41e-66
2 <split [72/28]> Bootstrap0002 <lm> public 2.38 0.0353 67.4 4.30e-66
3 <split [72/23]> Bootstrap0003 <lm> public 2.36 0.0365 64.9 6.19e-65
4 <split [72/25]> Bootstrap0004 <lm> public 2.30 0.0328 70.1 2.67e-67
5 <split [72/25]> Bootstrap0005 <lm> public 2.35 0.0364 64.7 7.15e-65
6 <split [72/26]> Bootstrap0006 <lm> public 2.36 0.0344 68.5 1.31e-66
7 <split [72/25]> Bootstrap0007 <lm> public 2.33 0.0299 77.9 1.61e-70
8 <split [72/23]> Bootstrap0008 <lm> public 2.34 0.0368 63.5 2.60e-64
9 <split [72/21]> Bootstrap0009 <lm> public 2.41 0.0349 68.9 8.70e-67
10 <split [72/26]> Bootstrap0010 <lm> public 2.39 0.0334 71.6 6.00e-68
# ... with 991 more rows
We now have 1,000 estimates, obtained from resampling, of the relation between the tuition cost of private and public schools which is stored in the column estimate
. The question now arises, what is the distribution of the estimated slope between private and public school tuition?
tuition_coefs %>%
ggplot(aes(estimate))+
geom_histogram(alpha = .7)+
ggtitle("Distribution of Estimated Slope")+
xlab("Estimate")+
ylab("Frequency")+
theme_linedraw()+
theme(axis.title=element_text(size=14,face="bold"),
plot.title = element_text(size = 20, face = "bold"))
We can see the distribution appears normal and we can gain a feel for how broad it is. Numerically we can tabulate the description of this distribution as follows.

int_pctl(tuition_models,
coef_inf)
# A tibble: 1 x 6
term .lower .estimate .upper .alpha .method
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 public 2.31 2.38 2.46 0.05 percentile
We can see with bootstrapping the estimated slope that described the relation between private and public school tuition costs is around 2.38–similar to what we obtained without bootstrapping. This could be because the underlying relation between public and private schools is linear and the underlying assumptions of linear models apply to this dataset. The underlying assumptions in a linear regression are:
- The underlying relationship between the variables is linear
- The errors are normally distirbuted
- There is equal variance around the line of best fit (homoscedasticity of error)
- Observations are independent
Furthermore, with bootstrapping we have lower and upper bounds for the relation as well which we did not have with just using a linear model on the un-sampled data.
Lastly, we can visualize the various estimates we calculated from the bootstrapped samples.
tuition_aug <- tuition_models %>%
# Sample only 200 bootstraps for visualization
sample_n(200) %>%
mutate(augmented = map(model, augment)) %>%
unnest(augmented)
tuition_aug %>%
ggplot(aes(public, private))+
geom_line(aes(y = .fitted, group = id), alpha = .1, color = 'grey')+
geom_point()+
scale_y_continuous(labels=scales::dollar_format())+
scale_x_continuous(labels=scales::dollar_format())+
ggtitle("200 of 1000 Slope Estimations from Bootstrap Resampling")+
xlab("Public School Tuition")+
ylab("Private School Tuition")+
theme_linedraw()+
theme(axis.title=element_text(size=14,face="bold"),
plot.title = element_text(size = 20, face = "bold"))

In summary, this blog demonstrated how to use bootstrap resampling in R to determine the relation between private and public school tuition. This technique estimates the sampling distribution of nearly any statistic using random sampling.