Disclaimer: This article is solely for educational purposes. It does not reflect or imply any factual information about the election.

The United States presidential election is just around the corner. US presidential elections are known for their extensive use of data and analytics. From understanding voters’ demographic backgrounds to estimating results in different states, from predicting voters’ preferences to developing marketing strategies, data is everywhere. Political campaigns spend millions of dollars on data analytics to win the election. Although flawed sampling techniques have failed to predict election outcomes in the past, we cannot deny the important role data plays in the United States presidential election. In this article, we will use a resampling technique called bootstrapping to understand voters’ overall preferences in the coming presidential election.
Bootstrapping:

In statistics, the theory of sampling distributions tells us that if we take many samples from an unknown population, a statistic computed from each of those samples (the sample mean, median, standard deviation, etc.) forms a distribution called a sampling distribution. Under certain conditions, a statistic of the sampling distribution approximates the corresponding population parameter (the population mean, median, standard deviation, etc.). This means that even if we do not know the population parameter, we can still estimate or infer it using the theory of sampling distributions. One of the major challenges of building a sampling distribution is obtaining several samples from the population. Often we have only one sample, but to estimate the population parameter we need many. This is where the bootstrapping method comes in handy: it generates many new samples by resampling, with replacement, from the one sample we have. Bradley Efron, an American statistician, introduced bootstrapping in 1979.
In this article, we will take data on different election polls and apply bootstrapping to generate many samples of poll results in order to estimate the percentage of Americans supporting the Democratic Party and the Republican Party. One of the major advantages of the bootstrap sampling distribution is that we can calculate different statistics such as the mean, median, variance, etc. from it. In addition, the distribution doesn’t need to follow any specific shape. One downside to note is that the bootstrapping method generates data only from the existing sample. If the given sample of polls in our case study doesn’t represent all types of voters in the election, bootstrapping will not generate data for the groups that are missing from the sample. Therefore, it is very important to include as many polls as possible so that all voter groups are represented. We will use the R programming language for this analysis.
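Before turning to the real data, the resampling idea described above can be sketched in a few lines of base R. This is only an illustrative sketch: the vector `poll_sample` holds made-up percentages rather than values from the article's dataset, and `n_boot` is a hypothetical name for the number of resamples.

```r
# Hypothetical sample of poll percentages (made-up values, not the article's data)
set.seed(42)
poll_sample <- c(48.2, 51.0, 46.5, 49.8, 47.1, 50.3, 45.9, 52.4, 48.8, 49.1)

n_boot <- 2000  # number of bootstrap resamples (the article uses 10,000)

# Each resample draws length(poll_sample) values with replacement and records the mean
boot_means <- replicate(n_boot, mean(sample(poll_sample, replace = TRUE)))

# The collection of resampled means approximates the sampling distribution of the mean
c(sample_mean = mean(poll_sample),
  boot_mean   = mean(boot_means),
  boot_se     = sd(boot_means))

# A simple 95% percentile interval read directly off the bootstrap distribution
quantile(boot_means, c(0.025, 0.975))
```

The percentile interval shown here is the simplest bootstrap interval; the analysis later in the article uses the BCa interval from the boot package instead.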
The data can be found at this link
https://github.com/PriyaBrataSen/US-Election-Poll
Let’s look into the dataset now…
Data:
The data set used in this analysis contains poll results. It covers 51 states in total (presumably the 50 states plus the District of Columbia), and every state has several polls indicating the percentage of supporters for the Democratic Party and the Republican Party (GOP).
df<-read.csv('pres_polls.csv')
head(df)

We can see from the above table that there are 8 variables in this dataset. The Day variable represents the day of the year on which the survey was conducted. The State variable represents the state in which the survey was conducted. The Region variable gives which of the 4 regions each state falls into. EV stands for the electoral votes of that state. Dem and GOP represent the percentage of people who gave favorable responses for the Democratic Party and the Republican Party. The Date column represents the date on which the survey was conducted, and finally, Pollster is the name of the poll.
library(dplyr)
poll=unique(pull(df,Pollster))
str(poll)

Let’s look into the Pollster variable to see how many poll results we have in this sample. We can use the ‘str’ function to see how many unique polls are in the data set. From the output, there are 233 unique values, which tells us that the given sample has 233 polls in total. We will use bootstrapping to generate 10,000 samples, each containing 233 poll results. From 10,000 samples we will get 10,000 sample means, which will make up our bootstrap sampling distribution. As bootstrapping is a resampling technique with replacement, some samples may contain the same poll several times.
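To see what "with replacement" means in practice, here is a small sketch in base R. The integer labels in `poll_id` are hypothetical stand-ins for the polls, not the actual pollster names from the dataset.

```r
set.seed(1)
n <- 233                 # number of polls quoted in the article
poll_id <- seq_len(n)    # stand-in labels for the polls (hypothetical IDs)

# One bootstrap resample: n draws with replacement from the n polls
one_resample <- sample(poll_id, size = n, replace = TRUE)

length(one_resample)           # still n draws
length(unique(one_resample))   # fewer than n distinct polls appear
max(table(one_resample))       # how often the most-repeated poll was drawn

# Expected share of distinct polls in a resample: 1 - (1 - 1/n)^n, close to 1 - exp(-1)
1 - (1 - 1/n)^n
```

On average roughly 63% of the original polls show up in any single resample, and the remaining draws are repeats. This variation from one resample to the next is what produces the spread of the bootstrap distribution.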
Estimating Voters’ Preferences:
In this section, we will create the bootstrap distribution for the whole dataset, regardless of region or state, to get an estimate of voters’ preferences across the nation. After that, we will build a bootstrap distribution for each of the 4 regions of the United States of America. We will calculate the confidence interval of the mean from those bootstrap distributions to understand the level of preference.
Let’s dive in.
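library(boot)
bootmean=function(x,i){mean(x[i])}  # statistic: mean of the resampled observations x[i]
prefer_country=function(data){
  boot.object=boot(data,bootmean,R=10000)  # draw 10,000 bootstrap resamples
  boot.ci(boot.object,conf = 0.95,type = 'bca')  # 95% BCa confidence interval
}
Dem=round(prefer_country(df$Dem)$bca[,c(4,5)],4)  # columns 4 and 5 hold the lower and upper limits
GOP=round(prefer_country(df$GOP)$bca[,c(4,5)],4)
c('Democratic party:',Dem)
c('Republican party:',GOP)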

In R, to calculate the bootstrap distribution, we first load the ‘boot’ library. We have to install the library before loading it if we are running it for the first time. To build the bootstrap distribution, we first need a function that tells R which statistic to calculate for each of the samples generated from our given sample. In our case, we want to calculate the average of every new sample. Therefore, we have created a function named ‘bootmean’ in the second line, which takes a vector and a set of indices and calculates the mean of the selected values. Next, we have created a function ‘prefer_country’, which creates the bootstrap distribution and then calculates the interval at a 95% level of confidence. If we look inside the function closely, we can see that in the first line we have used the boot function to create a boot object named ‘boot.object’. The boot function takes the data that we pass and draws 10,000 new samples from it, as we set R=10000. It then calculates the statistic for each of the samples and stores the results in ‘boot.object’. That is why we have passed the ‘bootmean’ function as the second argument of the boot function.
Several things are stored in boot.object. If we want to see them, we can use str(boot.object) to view the structure of the object. In our case, however, we are interested in the confidence interval of the mean for the two parties. Therefore, in the next line, we have used the boot.ci function to calculate the interval of the mean at a 95% level of confidence. The boot.ci function can compute several types of confidence intervals from the boot object, but we want the bias-corrected and accelerated (BCa) bootstrap interval. Therefore, we set type='bca'.
Finally, we have called the prefer_country function on both the Democratic and Republican parties’ data and saved the results in the ‘Dem’ and ‘GOP’ variables. The last two lines print the results. So, from our bootstrap analysis, we can say at a 95% level of confidence that, on average across the polls, between 48.01% and 48.86% of Americans prefer the Democratic candidate, and between 43.31% and 44.20% of Americans prefer the Republican candidate in this presidential election.
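The following sketch shows how one might inspect a boot object and compare two interval types. It runs on simulated values in a hypothetical vector `toy` rather than on the article's poll data, so the numbers it produces say nothing about the election.

```r
library(boot)

set.seed(123)
# Simulated stand-in for poll percentages (hypothetical, not the article's data)
toy <- rnorm(200, mean = 48, sd = 4)

bootmean <- function(x, i) mean(x[i])   # same statistic as in the article
b <- boot(toy, bootmean, R = 2000)

b$t0           # the statistic computed on the original sample
dim(b$t)       # R x 1 matrix holding the bootstrap replicates
sd(b$t[, 1])   # bootstrap standard error of the mean

# Request both the percentile and the BCa interval in one call
ci <- boot.ci(b, conf = 0.95, type = c("perc", "bca"))
ci$percent[, c(4, 5)]   # percentile interval: lower, upper
ci$bca[, c(4, 5)]       # BCa interval: lower, upper
```

For a roughly symmetric statistic such as the mean of many polls, the two intervals are usually close. BCa additionally adjusts for bias and skewness in the bootstrap distribution.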

Let’s now look at the regional level to find the confidence intervals for both parties. We will use the same technique, but we need to generate the bootstrap samples for each region rather than for the whole nation. Below is the R code for that.
lower=c()
upper=c()
region=c()
a=unique(pull(df,Region))  # distinct region names
prefer_region=function(data){
  for (i in a){
    data_Dem=data[df$Region==i]  # keep only the polls from region i
    boot.Dem=boot(data_Dem,bootmean,R=10000)  # 10,000 resamples for this region
    p=boot.ci(boot.Dem,conf = 0.95,type = 'bca')  # request only the BCa interval
    lower=c(lower,p$bca[,c(4)])  # lower limit
    upper=c(upper,p$bca[,c(5)])  # upper limit
    region=c(region,i)
  }
  preference=data.frame(region,lower,upper)
  preference}
DEM=prefer_region(df$Dem)%>%rename(Dem_lower=lower,Dem_upper=upper)
GOP=prefer_region(df$GOP)%>%rename(GOP_lower=lower,GOP_upper=upper)
inner_join(DEM,GOP,by='region')

We start by creating 3 empty vectors: ‘lower’, ‘upper’, and ‘region’. We will save the bootstrapping outputs in these vectors to build a data frame. Next, we save the names of the distinct regions in the vector ‘a’, which has four values: South, North East, West, and Mid West. We use this vector in our for loop to filter the data for a specific region and then use the filtered data to generate bootstrap samples only for that region. We then write another function named ‘prefer_region’, which takes a column from our df data frame as input. Inside the function, a loop takes the given column and filters it for a specific region. The loop reads the region names from the vector ‘a’ and performs the bootstrap calculation in the same way as we did for the whole nation. Every time we calculate an interval, we save the lower limit in the ‘lower’ vector and the upper limit in the ‘upper’ vector. In addition, we save the region name in the ‘region’ vector. Once the calculations are done for all the regions, the for loop ends. We then combine the three vectors ‘lower’, ‘upper’, and ‘region’ into a data frame named ‘preference’, which the function returns. In the last three lines, we call the function for both the Democratic and Republican parties and join the results into one data frame, which is shown in the output.
If we look at the output, we can see that at a 95% level of confidence, the intervals of the mean percentages are higher for the Democratic Party in the West, North East, and Mid West regions. In the South, however, the intervals overlap. Between 46.19% and 47.56% of Americans in the South prefer the Democratic candidate, and between 45.41% and 46.81% of Americans in the South prefer the Republican candidate.
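As an aside, the per-region loop can also be written with split() and sapply(), and the overlap check can be done in code rather than by eye. The sketch below is one possible alternative, not the article's method. It runs on a simulated data frame `toy` with made-up values and only two regions, and the helper names `bca_interval` and `region_ci` are hypothetical.

```r
library(boot)

set.seed(2020)
# Simulated stand-in for the poll data (hypothetical values and region labels)
toy <- data.frame(
  Region = rep(c("South", "West"), each = 60),
  Dem    = c(rnorm(60, 47, 3), rnorm(60, 53, 3)),
  GOP    = c(rnorm(60, 46, 3), rnorm(60, 40, 3))
)

bootmean <- function(x, i) mean(x[i])

# 95% BCa interval (lower, upper) for one numeric vector
bca_interval <- function(x, R = 2000) {
  ci <- boot.ci(boot(x, bootmean, R = R), conf = 0.95, type = "bca")
  setNames(ci$bca[, c(4, 5)], c("lower", "upper"))
}

# One interval per region; split() replaces the explicit for loop
region_ci <- function(values, regions) {
  t(sapply(split(values, regions), bca_interval))
}

dem <- region_ci(toy$Dem, toy$Region)
gop <- region_ci(toy$GOP, toy$Region)

# Two intervals overlap when each lower limit is below the other's upper limit
overlap <- dem[, "lower"] <= gop[, "upper"] & gop[, "lower"] <= dem[, "upper"]

data.frame(region    = rownames(dem),
           Dem_lower = dem[, "lower"], Dem_upper = dem[, "upper"],
           GOP_lower = gop[, "lower"], GOP_upper = gop[, "upper"],
           overlap)
```

Because the helper functions take the values and region labels as arguments, they do not depend on a global data frame or on vectors defined outside the function.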
Conclusion:
From the above discussion, we can see how to implement bootstrapping using the R programming language. We were able to generate 10,000 samples, each having 233 observations, from a given sample of 233 observations. One important point is that we estimated the intervals from our bootstrap sampling distributions at the national and regional levels only. The results don’t necessarily tell us which party is going to win the election, because the study was not done at the state level and the electoral votes vary from state to state. This study therefore gives an overall picture of voters’ general political preferences obtained by applying the bootstrapping method.