Why You Should Prefer Confidence Interval over p-value

Communicating Results of Your Statistical Analysis

Vivekananda Das
Towards Data Science



The journey of an analyst in the world of data science, fundamentally, consists of three phases: learning, doing, and presenting. These phases are not necessarily linear; most of us go back and forth. Which of these three phases do you perceive to be the most difficult one?

Personally, I think the third one is the most complicated. The first two phases certainly have their challenges; however, after a while, you realize that learning and applying new methods is mostly a matter of time. After all, methodologists derived all the equations, package developers converted those equations into ready-to-use functions, and generous instructors created an enormous amount of free tutorials to make your life easier. In the last phase, by contrast, plenty of factors seem to be beyond our control.

Think about the presentation phase for a moment. Before you prepare the presentation of your findings (as part of a report and/or an oral presentation), ask yourself at least three questions:

1. To whom am I presenting the results of my analysis?

2. What are their limitations (e.g., lack of statistical training/misunderstanding of jargon)?

3. Given the limitations, how can I present the findings in a way that is both accessible to my readers/audiences and technically correct?

There are, indeed, numerous challenges in terms of effectively and accurately presenting your statistical findings.

Today, we are going to focus on a particular issue that I personally struggled with and have watched many others struggle with: how should we communicate the significance of the findings of statistical tests? Should we rely on the p-value, or should we prefer the confidence interval?

A real-world example

Let’s pretend that our research question is: to what extent are retired people more likely to experience frequent memory loss compared to non-retired people?

To investigate this question, we are going to use publicly available data from the Financial Well-Being Survey 2016 conducted by the Consumer Financial Protection Bureau (CFPB).

Let’s import the data directly from the CFPB website to R:

#Import data
data <- read.csv("https://www.consumerfinance.gov/documents/5614/NFWBS_PUF_2016_data.csv")
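Before modeling, it is worth a quick sanity check that the variables we will use (MEMLOSS, EMPLOY1_8, and the weight finalwt, all named in the CFPB codebook) look as expected; a minimal sketch:

#Quick look at the variables used below
dim(data)
table(data$MEMLOSS)
table(data$EMPLOY1_8)
summary(data$finalwt)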

The question on memory loss (the variable name is MEMLOSS) asks:

During the past 12 months, have you experienced confusion or memory loss that is happening more often or is getting worse?

Response options: 1 = Yes, 0 = No

And the question on retirement (the variable name is EMPLOY1_8) asks:

Which of the following describe(s) your current employment or work status?

Response options: 1 = Self-employed, 2 = Work full-time for an employer or the military, 3 = Work part-time for an employer or the military, 4 = Homemaker, 5 = Full-time student, 6 = Permanently sick, disabled, or unable to work, 7 = Unemployed or temporarily laid off, 8 = Retired

Conveniently, the CFPB already created a retirement dummy (EMPLOY1_8), which takes a value of 1 if somebody is retired and 0 otherwise.

Let’s estimate the difference in the proportion of respondents reporting frequent memory loss in the last 12 months between retired and non-retired people. Several statistical tests would lead us to the same conclusion; here, let’s use weighted least squares (i.e., estimate a linear regression that incorporates the survey weights):

#finalwt is the weight variable that accounts for the complex survey design
summary(lm(MEMLOSS ~ EMPLOY1_8, data = data, weights = finalwt))
(Model output, image by the author: the coefficient on EMPLOY1_8 is about 0.0524, with a p-value below 0.05.)
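As a quick cross-check, the coefficient on a 0/1 dummy in a weighted regression is simply the difference in weighted group proportions. Assuming MEMLOSS is strictly 0/1 in the file, we can reproduce the estimate directly:

#Cross-check: difference in weighted group proportions
#(the intercept equals the non-retired proportion)
retired <- data$EMPLOY1_8 == 1
p_ret <- weighted.mean(data$MEMLOSS[retired], data$finalwt[retired])
p_non <- weighted.mean(data$MEMLOSS[!retired], data$finalwt[!retired])
p_ret - p_non #should match the coefficient, roughly 0.0524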

If you follow the p-value approach, this is how you would interpret the above finding:

The proportion of respondents experiencing frequent memory loss in the last 12 months is 5.24 percentage points higher among the retired than among the non-retired. This difference is significantly different from 0 at the 5% significance level, as the p-value is less than 0.05. Simply put, retired people are significantly more likely to experience frequent memory loss than non-retired people; that is, there is a statistically significant difference between the two groups in the probability of experiencing frequent memory loss.
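For reference, the exact estimate and p-value quoted above can be pulled out of the fitted model directly (a minimal sketch using base R):

#Extract the estimate and p-value for the retirement dummy
fit <- lm(MEMLOSS ~ EMPLOY1_8, data = data, weights = finalwt)
coef(summary(fit))["EMPLOY1_8", c("Estimate", "Pr(>|t|)")]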

If your readers are experts in inferential statistics, the above conclusion should work fine. However, what if you are communicating the above result to people who never took an inferential statistics course?

Before I started learning inferential statistics, had somebody presented the above finding to me, I would have skipped the earlier part and focused only on the part after “simply put”. I would have interpreted the phrase “statistically significant” as meaning a substantial/noteworthy/notable/remarkable/important finding discovered using a statistical method.

But what does a statistically significant finding really mean? (I discuss this in detail in another article.) In our example, it means that we have substantial evidence to believe that the true population-level difference in frequent memory loss between the retired and the non-retired is different from 0.

Whether I am just a dilettante reader or a policy maker, the above finding is neither interesting nor useful to me. Why?

For someone not trained in statistics, perhaps the most confusing part is that, at first, you are saying the difference is 5.24 percentage points. And then, you are saying the difference is significantly different from 0. In mathematics, isn’t it obvious that 5.24 is different from 0? 🤔

If I focus on the fact that the difference is non-zero, the finding seems obvious: I already know that retired people tend to be older, and older people are more likely to experience frequent memory loss. Moreover, I get no sense of how big the difference in the outcome between the two groups is. I would like to know “the extent” or “the magnitude” of the difference so that I can judge the real-world significance (and not just the so-called statistical significance) of the finding.

The p-value approach, as traditionally used, fails in that regard. Unfortunately, the convention in many academic disciplines is to explain the results of statistical tests using the p-value approach. And in many cases, journalists and bloggers misinterpret these results (not their fault 🤷‍♂️), which eventually leads to a large-scale misunderstanding of the magnitude of a real-world phenomenon 😕.

This is precisely why I prefer the confidence interval approach. Let’s estimate the confidence interval of the difference:

confint(lm(MEMLOSS ~ EMPLOY1_8, data = data, weights = finalwt))
(Model output, image by the author: the 95% CI for the EMPLOY1_8 coefficient is roughly [0.0325, 0.0722].)
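If you want to see where these numbers come from, confint() is essentially computing estimate ± t-quantile × standard error; a sketch that reconstructs the interval by hand:

#Reconstruct the 95% CI: estimate +/- t-quantile * standard error
fit <- lm(MEMLOSS ~ EMPLOY1_8, data = data, weights = finalwt)
est <- coef(fit)["EMPLOY1_8"]
se <- coef(summary(fit))["EMPLOY1_8", "Std. Error"]
tq <- qt(0.975, df = df.residual(fit))
c(lower = est - tq * se, upper = est + tq * se) #roughly [0.0325, 0.0722]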

As an analyst, if you report that the 95% confidence interval (CI) of the difference in frequent memory loss between the retired and the non-retired is [3.25 percentage points, 7.22 percentage points], as a non-statistician, intuitively, this is how I would think about it: “there is a 95% chance that the true difference lies somewhere between 3.25 percentage points and 7.22 percentage points.”

**The above interpretation is intuitive but wrong. Interestingly, many researchers explain their 95% CIs this way (O’Brien and Yi, 2016). And if you google the interpretation of a 95% CI, you will find plenty of resources repeating it. Still, as misinterpretations go, this one is relatively tolerable because it is mostly harmless.**

The 95% CI approach is far more useful. We understand that statistics is a scientific tool to reduce uncertainty, and ideally, any statistical test should:

  1. Produce an estimate of a target parameter and
  2. Attach a range of possible values on either side of the estimate.

Most importantly, knowing the range of possibilities of a certain outcome is a prerequisite in policymaking under uncertainty. Beyond inferring that the true difference in outcome between the two groups is non-zero (as the interval does not contain 0), the 95% CI approach helps us understand the range of possibilities of the true difference.

To be “technically correct” 🤓, the interpretation of the 95% confidence interval is the opposite of what we intuitively perceive 😒. Here is how to think about it:

If we drew 100 different random samples from the same population (i.e., if the CFPB ran the same survey 100 times) and constructed a 95% CI for each random sample, we would expect that 95 of these 100 CIs would contain the (unknown) true population-level difference in the outcome between the two groups.

In reality, we draw only one sample (i.e., the CFPB ran the survey only once) from the population of interest. Strictly speaking, we cannot say there is a 95% probability that this particular interval, [3.25 percentage points, 7.22 percentage points], contains the true population-level difference in the outcome between the two groups; the realized interval either contains it or it does not. The 95% describes the long-run success rate of the procedure that produced the interval.

A simplified version of the above is:

Correct one: We are 95% confident that the interval [3.25 percentage points, 7.22 percentage points] contains the true population-level difference in the outcome between the two groups.

Wrong one: We are 95% confident that the true population-level difference in the outcome between the two groups is between 3.25 percentage points and 7.22 percentage points.
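The repeated-sampling idea can be made concrete with a small simulation (a sketch on fabricated data, not the CFPB survey; the group sizes, baseline rate, and true difference are arbitrary assumptions):

#Simulate the coverage property of a 95% CI on made-up data:
#draw many samples, build a CI each time, and count how often
#the interval captures the (known) true difference
set.seed(42)
true_diff <- 0.05 #assumed true population-level difference
covered <- replicate(1000, {
  y_ret <- rbinom(500, 1, 0.15 + true_diff) #retired group outcomes
  y_non <- rbinom(500, 1, 0.15)             #non-retired group outcomes
  y <- c(y_ret, y_non)
  g <- rep(c(1, 0), each = 500)
  ci <- confint(lm(y ~ g))["g", ]
  ci[1] <= true_diff && true_diff <= ci[2]
})
mean(covered) #should be close to 0.95

Running this, the empirical coverage hovers around 0.95, which is exactly the guarantee the procedure, not any single interval, provides.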

Overall, here is the dilemma:

  1. If you report your statistical finding using the p-value/statistical significance approach, readers learn whether the estimate is statistically distinguishable from zero. However, knowing only that something is non-zero has little practical utility in many contexts. Worse, a statistically significant finding may be practically insignificant, yet readers may intuitively assume the finding is substantial.
  2. If you report your statistical finding using the 95% CI approach, readers get an idea of the range of possibilities of the estimate and the practical significance of the finding. Nevertheless, despite your best efforts, the readers may intuitively interpret the 95% CI in a way that is wrong from a strictly technical point of view.

So, a misinterpretation is highly likely in both cases. However, the misinterpretation in the case of the confidence interval approach is potentially harmless 😐, whereas the misinterpretation in the case of the p-value approach can be gross 😡.

Update: I wrote another article that tries to further explain the “correct” interpretation of the confidence interval.


References

The CFPB Financial Well-Being Survey 2016: https://www.consumerfinance.gov/data-research/financial-well-being-survey-data/

User’s Guide: https://files.consumerfinance.gov/f/documents/cfpb_nfwbs-puf-user-guide.pdf

Codebook: https://files.consumerfinance.gov/f/documents/cfpb_nfwbs-puf-codebook.pdf

O’Brien, S. F., & Yi, Q. L. (2016). How do I interpret a confidence interval? Transfusion, 56(7), 1680–1683.

