Five Confidence Intervals for Proportions That You Should Know About

Dr. Dennis Robert MBBS, MMST
Towards Data Science
17 min read · Aug 1, 2020


Confidence intervals are crucial tools for statistical inference. They are now receiving more attention (and rightly so!) after years of being overlooked because of the obsession with p-values. Here, I discuss confidence intervals for proportions and five different statistical methods for deriving them that you, especially if you work in the healthcare data science field, should know about. I also cover the implementation of these intervals in R using base R and other existing functions, with fully reproducible code.

Note: This article is intended for those who have at least a fair idea of the concepts of confidence intervals and sample-to-population inferential statistics. Absolute beginners might find it hard to follow. Those who are more than familiar with the concept of confidence intervals can skip the initial part and jump directly to the list of confidence intervals, starting with the Wald interval. I also recommend reading this review article on confidence interval estimation¹.

Proportions and confidence intervals

In an earlier article where I detailed the binomial distribution, I spoke about how the binomial distribution, the distribution of the number of successes in a fixed number of independent trials, is inherently related to proportions. In clinical/epidemiological research, proportions are encountered in almost every study. Incidence (the number of new cases of a disease in a specific period of time in the population) and prevalence (the proportion of people having the disease during a specific period of time) are both proportions. Estimating the disease burden by estimating the true incidence and prevalence of a disease is probably the most commonly executed type of epidemiological study. In my earlier article about the binomial distribution, I tried to illustrate how binomial distributions are inherently related to the prevalence of a disease by citing a hypothetical COVID-19 seroprevalence study.

To study the proportion of any event in any population, it is not practical to collect data from the whole population. What we can do instead is take a practically feasible, smaller random subset of the population and compute the proportion of the event of interest in that sample. Now, how do we know that this proportion obtained from the sample relates to the true proportion, the proportion in the population? This is where confidence intervals come into play. This process of inferential statistics, estimating true proportions from sample data, is illustrated in the figure below.

The process of estimation of true proportions from the point estimates we get from the sample data
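To make the sampling idea concrete in code, here is a tiny simulation sketch. The numbers are purely illustrative (not from any real study): we assume a true prevalence of 15% in the population, draw one random sample of 200 people, and compute the sample proportion.

#hypothetical illustration: true prevalence is 15%, but we can only afford to test 200 people
set.seed(42)
truePrevalence <- 0.15
sampleSize <- 200
positives <- rbinom(n = 1, size = sampleSize, prob = truePrevalence) #number of positives in our one sample
pointEstimate <- positives / sampleSize #the sample proportion, our point estimate of the true prevalence
pointEstimate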

Constructing confidence intervals from point estimates obtained from sample data is most commonly done by assuming that the point estimates follow a particular probability distribution. In my earlier article about the binomial distribution, I spoke about how the binomial distribution resembles the normal distribution. This means that we know a thing or two about the probability distribution of the point estimates of proportion that we get from our sample data, which in turn means that we can make fairly reasonable estimates of the true proportion. The intuition behind the normal approximation of the binomial distribution is illustrated in the figure below.

Binomial distributions for different sample sizes (n) when the probability of success (p) is 0.1. You can see that the distribution becomes more and more normal with larger sample sizes. When p is near 0.5, the distribution can be assumed to be normal even with smaller sample sizes. Here, I wanted to illustrate a rather extreme case where p is at the extreme (0.1 here), because in reality such extreme values are more common than values near 0.5 in epidemiological studies
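If you want to reproduce this kind of plot yourself, a minimal sketch along these lines (the two panel sizes are chosen arbitrarily here) overlays the normal density on the binomial probabilities:

p <- 0.1
par(mfrow = c(1, 2))
for (n in c(10, 200)) { #two illustrative sample sizes
  k <- 0:n
  plot(k, dbinom(k, size = n, prob = p), type = "h", xlab = "Number of successes", ylab = "Probability", main = paste("n =", n, ", p =", p))
  curve(dnorm(x, mean = n*p, sd = sqrt(n*p*(1-p))), add = TRUE, col = "red") #normal approximation N(np, np(1-p))
}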

Okay, now that we know that point estimates of proportions from sample data can be assumed to follow a normal distribution because of the normal approximation of the binomial distribution, we can construct a confidence interval using the point estimate. But what exactly is this confidence interval? For now, let's assume that a 95% confidence interval means that we are 95% 'confident' that the true proportion lies somewhere in that interval. Note that this definition is not statistically correct and purists will find it hard to accept, but for practical purposes it is fine to start with.

In a normal distribution with mean 0 and standard deviation 1 (aka the standard normal distribution), 95% of the values are symmetrically distributed around the mean, as shown in the figure below. The X-axis values ranging from -1.96 to +1.96 thus constitute the 95% confidence interval in this example.

The standard normal distribution. The X-axis represents the values and the Y-axis the probability density. The shaded area represents the values that constitute 95% of all values around the mean. The unshaded areas on the left and right represent the more extreme values and are not part of the 95% confidence interval

z = 1.96 in the figure above is a magical number, because confidence intervals are usually reported at the 95% level. Since the normal distribution is symmetric, we have to exclude the 2.5% of values in the left tail and the 2.5% in the right tail of the figure above. This means we need to find the thresholds that cut off these two tails, and for a 95% confidence interval they turn out to be ±1.96. In other words, 95% of the values of the distribution lie within 1.96 standard deviations of the mean (on either side). In the case of the standard normal distribution, where the mean is 0 and the standard deviation is 1, this interval is simply (-1.96, +1.96).

Similarly, if p̂ is your point estimate of the proportion from a sample and n is the sample size, then the confidence interval for the proportion is:

p̂ ± z√( p̂(1 − p̂) / n )

Here, the 'hat' symbol above 'p' is included just to indicate that it is a point estimate from the sample and not the true proportion. For a 95% confidence interval, z is 1.96. This confidence interval is commonly known as the Wald interval.

In the case of a 95% confidence interval, the value of 'z' in the above equation is 1.96, as described above. For a 99% confidence interval, 'z' would be 2.58. This captures the intuition that if you want to increase your confidence from 95% to 99%, the range of your interval has to grow so that you can be more confident. Similarly, for a 90% confidence interval, 'z' is smaller than 1.96 (it happens to be 1.64) and hence you get a narrower interval.
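These multipliers are simply quantiles of the standard normal distribution, so you can verify them in R with qnorm():

qnorm(1 - 0.10/2) #z for a 90% CI: ~1.64
qnorm(1 - 0.05/2) #z for a 95% CI: ~1.96
qnorm(1 - 0.01/2) #z for a 99% CI: ~2.58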

Now that the basics of confidence intervals have been covered, let's delve into five different methodologies used to construct confidence intervals for proportions.

1. Wald Interval

The Wald interval is the most basic confidence interval for proportions. It relies entirely on the normal approximation of the binomial distribution, with no modifications or corrections applied. It is the most direct confidence interval that can be constructed from this normal approximation.

Wald Interval

However, it performs very poorly in practical scenarios. What 'poor performance' means here is that the coverage of the 95% Wald interval is in many cases less than 95%! That is not good: we need reasonable coverage when we construct a confidence interval. For example, we would expect a 95% confidence interval to 'cover' the true proportion 95% of the time, or at least close to it. If it is much less, we are in trouble. The Wald interval is infamous for low coverage in practical scenarios because, in many of them, the value of 'p' is extreme (near 0 or 1) and/or the sample size (n) is not that large.

We can explore the coverage of the Wald interval using R for various values of p. Note that base R does not seem to provide a function that returns the Wald interval for proportions, probably because the Wald interval is generally considered a poor interval owing to its coverage. So, I define a simple function in R that takes 'x' and 'n' as arguments, where x is the number of successes in n Bernoulli trials. The sample proportion is then simply the ratio of x to n. Based on the formula described above, it is straightforward to return the lower and upper bounds of the confidence interval using the Wald method.

waldInterval <- function(x, n, conf.level = 0.95){
  p <- x/n
  sd <- sqrt(p*((1-p)/n))
  z <- qnorm(c((1 - conf.level)/2, 1 - (1 - conf.level)/2)) #z thresholds at which the tails are cut off; for a 95% CI these are -1.96 and +1.96
  ci <- p + z*sd #lower and upper bounds
  return(ci)
}
#example
waldInterval(x = 20, n = 40) #this will return 0.345 and 0.655

Okay, now we have a function that returns the lower and upper bounds of the 95% Wald interval. The next step is to simulate random sampling, estimate a confidence interval for each random sample, and check whether or not the constructed confidence intervals actually 'cover' (include) the true proportion. For this, we pre-define a set of true population proportions and, for each of them, compute the coverage percentage. Ideally, for a 95% confidence interval, this coverage should always be more or less around 95%. Let's see if that is true for the Wald interval. All of these steps are implemented in the R code shown below.

numSamples <- 10000 #number of samples to be drawn from the population
numTrials <- 100 #this is the sample size (size of each sample)
probs <- seq(0.001, 0.999, 0.01) #true population proportions; for each value in this vector we will construct 95% confidence intervals
coverage <- as.numeric() #empty vector to store the coverage for each of the probs defined above
for (i in 1:length(probs)) {
  x <- rbinom(n = numSamples, size = numTrials, prob = probs[i]) #draw numSamples random samples and record the number of successes in each; x has length numSamples
  isCovered <- as.numeric() #boolean vector denoting whether the true population proportion (probs[i]) is covered by the constructed CI
  #since we have numSamples different values of x, we will have numSamples different CIs
  for (j in 1:numSamples) {
    ci <- waldInterval(x = x[j], n = numTrials)
    isCovered[j] <- (ci[1] < probs[i]) & (probs[i] < ci[2]) #1 if the true proportion (probs[i]) is covered by the constructed CI, else 0
  }
  coverage[i] <- mean(isCovered)*100 #coverage for each true proportion; ideally, for a 95% CI, this should be more or less 95%
}
plot(probs, coverage, type="l", ylim = c(75,100), col="blue", lwd=2, frame.plot = FALSE, yaxt='n', main = "Coverage of Wald Interval",
     xlab = "True Proportion (Population Proportion)", ylab = "Coverage (%) for 95% CI")
abline(h = 95, lty=3, col="maroon", lwd=2)
axis(side = 2, at=seq(75,100, 5))

Below is the coverage plot obtained for the Wald Interval

Coverage for Wald Interval against different population proportions

The above plot is testament to the fact that the Wald interval performs very poorly. In fact, 95% coverage is only achieved for proportions more or less around 0.5, and the coverage is awfully low for extreme values of p.

2. Clopper-Pearson Interval (Exact Interval)

The Clopper-Pearson interval (also known as the exact interval) was devised with the objective of keeping the coverage at a minimum of 95% for all values of p and n. As the alternative name 'exact' suggests, this interval is based on the exact binomial distribution and not on the large-sample normal approximation used by the Wald interval. For those interested in the math, please refer to the original article published by Clopper and Pearson² in 1934. The interval is considered too conservative at times (in many cases its coverage can be ~99%!). In R, the popular 'binom.test' function, which performs the exact binomial test, returns Clopper-Pearson confidence intervals. Similar to what we did for the Wald interval, we can explore the coverage of the Clopper-Pearson interval; a quick single-call example and then the fully reproducible coverage simulation are given below.
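As a quick example on the same toy counts used earlier (20 successes out of 40 trials; the numbers are purely illustrative), binom.test() returns the Clopper-Pearson limits directly:

binom.test(x = 20, n = 40)$conf.int #roughly (0.34, 0.66), slightly wider than the Wald interval for the same data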

numSamples <- 10000 #number of samples to be drawn from the population
numTrials <- 100 #this is the sample size (size of each sample)
probs <- seq(0.001, 0.999, 0.01) #true population proportions; for each value in this vector we will construct 95% confidence intervals
coverage <- as.numeric() #empty vector to store the coverage for each of the probs defined above
for (i in 1:length(probs)) {
  x <- rbinom(n = numSamples, size = numTrials, prob = probs[i]) #draw numSamples random samples and record the number of successes in each; x has length numSamples
  isCovered <- as.numeric() #boolean vector denoting whether the true population proportion (probs[i]) is covered by the constructed CI
  #since we have numSamples different values of x, we will have numSamples different CIs
  for (j in 1:numSamples) {
    ci <- binom.test(x = x[j], n = numTrials)$conf
    isCovered[j] <- (ci[1] < probs[i]) & (probs[i] < ci[2]) #1 if the true proportion (probs[i]) is covered by the constructed CI, else 0
  }
  coverage[i] <- mean(isCovered)*100 #coverage for each true proportion; ideally, for a 95% CI, this should be more or less 95%
}
plot(probs, coverage, type="l", ylim = c(75,100), col="blue", lwd=2, frame.plot = FALSE, yaxt='n', main = "Coverage of Clopper-Pearson Interval",
     xlab = "True Proportion (Population Proportion)", ylab = "Coverage (%) for 95% CI")
abline(h = 95, lty=3, col="maroon", lwd=2)
axis(side = 2, at=seq(75,100, 5))

And here is the coverage plot for Clopper-Pearson interval

Coverage for Clopper-Pearson (exact) interval against different population proportions

Wow, this looks like the exact opposite of the Wald interval coverage! In fact, the coverage reaches almost 100% in many scenarios and never drops below 95%. This looks very promising, and it is. But it is also too conservative, in that the confidence intervals are likely to be wider. This is the drawback of the Clopper-Pearson interval.
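To see this conservatism in terms of interval width, we can compare the Clopper-Pearson limits with the Wald limits for the same toy counts. This sketch reuses the waldInterval() function defined earlier:

diff(waldInterval(x = 20, n = 40)) #width of the 95% Wald interval
diff(binom.test(x = 20, n = 40)$conf.int) #width of the 95% Clopper-Pearson interval: a little wider
#the gap grows when the proportion is extreme
diff(waldInterval(x = 2, n = 40))
diff(binom.test(x = 2, n = 40)$conf.int)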

3. Wilson Interval (Score Interval)

The Wilson score interval³ is an extension of the normal approximation that compensates for the loss of coverage typical of the Wald interval. It can therefore be considered a direct improvement over the Wald interval, obtained by applying a transformation to the normal approximation formula³. Those interested in the math can refer to the original article by Wilson³.

In R, the popular 'prop.test' function for testing proportions returns the Wilson score interval by default. Note that the Wilson score interval can be computed in two ways: without continuity correction and with continuity correction. The latter uses Yates' continuity correction, and the 'correct' argument of 'prop.test' can be set to TRUE or FALSE to apply the correction or not, respectively. Yates' continuity correction is recommended if the sample size is rather small or if the values of p are extreme (near 0 or 1). It is considered a bit conservative, although not as conservative as the Clopper-Pearson interval. A quick single-call comparison is shown below.
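Here is the effect of the correction on the same toy counts (20 successes out of 40 trials):

prop.test(x = 20, n = 40, correct = FALSE)$conf.int #plain Wilson score interval
prop.test(x = 20, n = 40, correct = TRUE)$conf.int #with Yates' continuity correction: slightly wider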

The R code below is fully reproducible and generates coverage plots for the Wilson score interval with and without Yates' continuity correction.

#let's first define a custom function that will make our job easier
getCoverages <- function(numSamples = 10000, numTrials = 100, method, correct = FALSE){
  probs <- seq(0.001, 0.999, 0.01)
  coverage <- as.numeric()
  for (i in 1:length(probs)) {
    x <- rbinom(n = numSamples, size = numTrials, prob = probs[i])
    isCovered <- as.numeric()
    for (j in 1:numSamples) {
      if (method == "wilson"){
        if (correct){
          ci <- prop.test(x = x[j], n = numTrials, correct = TRUE)$conf
        } else {
          ci <- prop.test(x = x[j], n = numTrials, correct = FALSE)$conf
        }
      } else if (method == "clopperpearson"){
        ci <- binom.test(x = x[j], n = numTrials)$conf
      } else if (method == "wald"){
        ci <- waldInterval(x = x[j], n = numTrials)
      } else if (method == "agresticoull"){
        ci <- waldInterval(x = x[j] + 2, n = numTrials + 4) #add 2 successes and 2 failures, then apply the Wald formula
      }
      isCovered[j] <- (ci[1] < probs[i]) & (probs[i] < ci[2])
    }
    coverage[i] <- mean(isCovered)*100 #coverage for each true proportion; ideally, for a 95% CI, this should be more or less 95%
  }
  return(list("coverage" = coverage, "probs" = probs))
}

The code below uses the function defined above to generate the Wilson score coverages and the two corresponding plots shown below.

out <- getCoverages(method = "wilson")
out2 <- getCoverages(method = "wilson", correct = TRUE)
plot(out$probs, out$coverage, type="l", ylim = c(80,100), col="blue", lwd=2, frame.plot = FALSE, yaxt='n',
     main = "Coverage of Wilson Score Interval without continuity correction",
     xlab = "True Proportion (Population Proportion)", ylab = "Coverage (%) for 95% CI")
abline(h = 95, lty=3, col="maroon", lwd=2)
axis(side = 2, at=seq(80,100, 5))
plot(out2$probs, out2$coverage, type="l", ylim = c(80,100), col="blue", lwd=2, frame.plot = FALSE, yaxt='n',
     main = "Coverage of Wilson Score Interval with continuity correction",
     xlab = "True Proportion (Population Proportion)", ylab = "Coverage (%) for 95% CI")
abline(h = 95, lty=3, col="maroon", lwd=2)
axis(side = 2, at=seq(80,100, 5))
Wilson score interval coverage with and without continuity correction. The coverage with Yates' continuity correction (figure on the right) is very good, similar to Clopper-Pearson, but it can be a bit too conservative in extreme scenarios

4. Agresti-Coull Interval

Agresti and Coull proposed a simple solution⁴ to improve the coverage of the Wald interval. It is also considered to perform better than the Clopper-Pearson (exact) interval, in that the Agresti-Coull interval is less conservative while still having good coverage. The solution might seem too simple, because all it does is add two successes and two failures to the original observations! Yes, that's right: x (the number of successes) becomes x+2 and n (the sample size) becomes n+4 for a 95% confidence interval. That's all, and yet this very simple adjustment works very well in practical scenarios. That's the beauty of it. By adding these fake observations, the distribution of 'p' is pulled towards 0.5, which takes care of the skewness of the distribution of 'p' when it is at the extremes. So, in a way, you can say that this is also a sort of continuity correction. Another surprising fact is that the original paper was published in 1998, as opposed to the pre-WWII papers of Clopper-Pearson and Wilson; it is a relatively much newer methodology.
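Because the adjustment is just 'add 2 successes and 2 failures, then use the Wald formula', we can sketch it with the waldInterval() function defined earlier (the +2/+4 adjustment corresponds to the 95% level):

agrestiCoullInterval <- function(x, n) {
  waldInterval(x = x + 2, n = n + 4) #add 2 successes and 2 failures, then apply the plain Wald formula
}
agrestiCoullInterval(x = 20, n = 40)
#the 'binom' package also offers binom.agresti.coull(x = 20, n = 40) if you prefer a packaged version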

The coverage for the Agresti-Coull interval is depicted in the figure below.

Coverage for Agresti-Coull interval

The Agresti-Coull interval is a very simple fix for the very poor performance of the Wald interval, yet it yields a drastic improvement in coverage, as shown above. The R code for generating this coverage plot is given below. Note that it uses the custom function 'getCoverages' defined earlier.

ac <- getCoverages(method = "agresticoull")
plot(ac$probs, ac$coverage, type="l", ylim = c(80,100), col="blue", lwd=2, frame.plot = FALSE, yaxt='n',
     main = "Coverage of Agresti-Coull Interval",
     xlab = "True Proportion (Population Proportion)", ylab = "Coverage (%) for 95% CI")
abline(h = 95, lty=3, col="maroon", lwd=2)
axis(side = 2, at=seq(80,100, 5))

5. Bayesian HPD Interval

The Bayesian HPD interval is the last one in this list, and it stems from an entirely different school of thought known as Bayesian statistical inference.

All four confidence intervals discussed above are based on frequentist statistics. Frequentist statistics is the school of statistics in which inference about population parameters is made from sample data by focusing on the frequency of the data. The assumption is that one hypothesis is true, that the data follow some known probability distribution, and that we are collecting samples from that distribution.

Bayesian statistical inference is an entirely different school of statistical inference. Here, inference about a parameter requires assuming a prior distribution for the parameter; the observed (sampled) data enter through the likelihood, which is used to derive the distribution of the parameter given the data. This posterior distribution is what we are really interested in and what we want to estimate. We know the likelihood from the data, and we know the prior distribution because we assume it. Using the likelihood, we update our conclusions from prior to posterior; that is, the data throw some light on the problem and let us update our existing (assumed) knowledge, the prior.

Bayesian statistical inference was highly popular before the 20th century, after which frequentist statistics came to dominate the world of statistical inference. One of the reasons Bayesian inference lost its popularity was that producing robust Bayesian inferences requires a lot of computing power. However, the world has seen a monumental rise in computing capability over the last one or two decades, and Bayesian statistical inference is gaining a lot of popularity again. This is why the popular 'Bayesian vs Frequentist' debates keep emerging in the statistical literature and on social media.

p-values and confidence intervals are frequentist constructs, so the Bayesian HPD (highest posterior density) interval is in fact not a confidence interval at all! It is instead known as a credible interval.

As discussed above, we can summarise the Bayesian inference as

Posterior ∝ Likelihood × Prior

For proportions, the beta distribution is generally the distribution of choice for the prior. The beta distribution depends on two parameters, alpha and beta. When alpha = beta = 0.5, it is known as the Jeffreys prior. The Jeffreys prior is said to have some theoretical benefits and is the most commonly used prior distribution for estimating credible intervals of proportions.

Jeffreys prior

The 'binom' package in R provides the 'binom.bayes' function, which estimates Bayesian credible intervals for proportions. The best credible intervals are obtained by cutting the posterior density with a horizontal line, so that the narrowest interval containing the desired probability mass is chosen; these are known as highest posterior density (HPD) intervals.
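A minimal usage sketch with the same toy counts looks like this ('binom' needs to be installed from CRAN first):

#install.packages("binom") #if not already installed
library(binom)
binom.bayes(x = 20, n = 40, conf.level = 0.95,
            type = "highest", #"highest" gives the HPD interval
            prior.shape1 = 0.5, prior.shape2 = 0.5) #alpha = beta = 0.5, i.e. the Jeffreys prior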

Let’s look at the coverage of Bayesian HPD credible interval

Bayesian HPD credible interval coverage with the Jeffreys prior

The coverage of the Bayesian HPD credible interval seems better than that of the Wald interval, but not better than the other three frequentist confidence intervals. One advantage of credible intervals, though, lies in their interpretation.

The frequentist definition of a 95% confidence interval:

If we were to take many independent random samples from the population and construct a confidence interval from each sample, then 95 out of 100 of those confidence intervals would contain the true parameter (the true proportion, in this context)

Oops, the above definition seems far more complicated, or perhaps even confusing, compared to our original notion of a confidence interval. Like I said before, it is still reasonably safe to think of it as being 95% confident that the true proportion lies somewhere within the confidence interval.

But when it comes to Bayesian credible intervals, the actual statistical definition is itself very intuitive.

The Bayesian definition of a 95% credible interval:

The probability that the true proportion will lie within the 95% credible interval is 0.95

Wow, the above definition seems far more 'likeable' than the frequentist one. This is one definite advantage of Bayesian statistical inference: the definitions are far more intuitive from a practical point of view, whereas the *actual* definitions of frequentist quantities such as p-values and confidence intervals are hard for the human mind to digest.

Putting it all together

Let us summarize all five types of confidence intervals we have covered. The plot below puts all the coverages together.
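If you want to recreate such a combined plot yourself, here is one possible sketch using the getCoverages() function defined earlier (the Bayesian HPD coverage would need a separate loop around binom.bayes and is omitted here; reduce numSamples if the run is too slow):

wald   <- getCoverages(method = "wald")
exact  <- getCoverages(method = "clopperpearson")
wilson <- getCoverages(method = "wilson")
ac     <- getCoverages(method = "agresticoull")
plot(wald$probs, wald$coverage, type = "l", col = "red", lwd = 2, ylim = c(75, 100), frame.plot = FALSE,
     main = "Coverage of four intervals", xlab = "True Proportion (Population Proportion)", ylab = "Coverage (%) for 95% CI")
lines(exact$probs, exact$coverage, col = "blue", lwd = 2)
lines(wilson$probs, wilson$coverage, col = "darkgreen", lwd = 2)
lines(ac$probs, ac$coverage, col = "purple", lwd = 2)
abline(h = 95, lty = 3, col = "maroon", lwd = 2)
legend("bottomright", legend = c("Wald", "Clopper-Pearson", "Wilson", "Agresti-Coull"),
       col = c("red", "blue", "darkgreen", "purple"), lwd = 2, bty = "n")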

In a nutshell…

  • The Clopper-Pearson interval has by far the highest coverage, but it is too conservative, especially at extreme values of p
  • The Wald interval performs very poorly and, in extreme scenarios, does not provide acceptable coverage by any means
  • The Bayesian HPD credible interval has acceptable coverage in most scenarios, but with the Jeffreys prior it does not provide good coverage at extreme values of p. This may depend on the prior distribution used and can change with different priors. One advantage of the credible interval is its intuitive statistical definition, unlike the confidence intervals
  • The Agresti-Coull interval provides good coverage with a very simple modification of the Wald formula
  • The Wilson score interval, with and without correction, also has very good coverage, although with the correction applied it tends to be a bit too conservative

Here is a table summarizing some of the important points about the five different confidence intervals

Brown, Cai and DasGupta¹ recommend the Wilson score interval when the sample size is 40 or less, and the Agresti-Coull interval for larger samples.

References

  1. Brown, Lawrence D.; Cai, T. Tony; DasGupta, Anirban (2001). Interval Estimation for a Binomial Proportion. Statistical Science, 16(2), 101-133. doi:10.1214/ss/1009213286. https://projecteuclid.org/euclid.ss/1009213286
  2. Clopper, C. J., and Pearson, E. S. (1934). "The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial". Biometrika, 26, 404-413.
  3. Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22, 209-212. doi:10.2307/2276774
  4. Agresti, A., and Coull, B. A. (1998). Approximate is better than 'exact' for interval estimation of binomial proportions. The American Statistician, 52, 119-126. doi:10.2307/2685469

