Statistics — Part I: How Bayesian Can Complement Frequentist.

Julien Hernandez Lallement
Towards Data Science
Sep 4, 2020


For many years, academics have been using so-called frequentist statistics to evaluate whether experimental manipulations have significant effects.

Frequentist statistics are based on the concept of hypothesis testing, a mathematically grounded estimation of whether your results could have been obtained by chance alone. The lower the p-value, the more significant the result would be (in frequentist terms). By the same token, you can obtain non-significant results using the same approach. Most of these “negative” results are disregarded in research, although there is tremendous added value in also knowing which manipulations do not have an effect. But that’s for another post ;)

The thing is, in such cases where no effect can be found, frequentist statistics are limited in their explanatory power, as I will argue in this post.

Below, I will explore one limitation of frequentist statistics and propose an alternative to frequentist hypothesis testing: Bayesian statistics. I will not go into a direct comparison between the two approaches; there is quite some reading out there if you are interested. I will rather explore why the frequentist approach presents some shortcomings, and how the two approaches can be complementary in some situations (rather than seeing them as mutually exclusive, as is sometimes argued).

This is the first of two posts, where I will be focusing on the inability of frequentist statistics to disentangle between the absence of evidence and the evidence of absence.

Absence of evidence vs evidence of absence

Background

In the frequentist world, statistical tests typically output some test statistics (t, F, Z values… depending on your test), and the almighty p-value. I discuss the limitations of only using p-values in another post, which you can read to get familiar with some concepts behind its computation. Briefly, a significant p-value (i.e., one below an arbitrarily decided threshold, called the alpha level, typically set at 0.05) indicates that your manipulation most likely has an effect.
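As a minimal illustration of that decision rule (the sample below is made up for the sake of the example, and 0.05 is just the conventional threshold):

import numpy as np
from scipy import stats

# Made-up sample standing in for experimental measurements
np.random.seed(1)
sample = np.random.normal(loc=0.3, scale=1, size=50)

alpha = 0.05  # conventional (and arbitrary) significance threshold
t, p = stats.ttest_1samp(a=sample, popmean=0)
if p < alpha:
    print('p = ' + str(round(p, 4)) + ' < alpha: the manipulation most likely had an effect')
else:
    print('p = ' + str(round(p, 4)) + ' >= alpha: non-significant, and this is where things get tricky')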

However, what if (and that happens a lot) your p-value is > 0.05? In the frequentist world, such p-values do not allow you to distinguish between an absence of evidence and evidence of the absence of an effect.

Let that sink in for a little bit, because it is the crucial point here. In other words, frequentist statistics are pretty effective at quantifying the presence of an effect, but are quite poor at quantifying evidence for the absence of an effect. See here for literature.

The demonstration below is taken from some work that was performed at the Netherlands Institute for Neuroscience, back when I was working in neuroscience research. A very nice paper was recently published on this topic, which I encourage you to read. The code below is inspired by the paper’s repository, written in R.

Simulated Data

Say we generate a random sample from a normal distribution with mean=0.5 and standard deviation=1.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)
mean = 0.5; sd = 1; sample_size = 1000
exp_distribution = np.random.normal(loc=mean, scale=sd, size=sample_size)
plt.hist(exp_distribution)
plt.show()
Figure 1 | Histogram depicting a random draw from a normal distribution centered at 0.5

That would be our experimental distribution, and we want to know whether that distribution is significantly different from 0. We could run a one-sample t-test (which would be okay since the distribution looks very Gaussian, but you should theoretically verify that the assumptions of parametric testing are fulfilled; let’s assume they are, although a quick check is sketched below).
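As a side note, a quick sanity check of the normality assumption could look like this (just a sketch using scipy’s Shapiro-Wilk test, not part of the original analysis):

# Shapiro-Wilk test of normality: a non-significant p-value means
# there is no evidence against normality (note that the same
# absence-of-evidence caveat discussed in this post applies here too!)
W, p_norm = stats.shapiro(exp_distribution)
print('Shapiro-Wilk W = ' + str(W) + ', p = ' + str(p_norm))

Now, the t-test itself: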

t, p = stats.ttest_1samp(a=exp_distribution, popmean=0)
print('t-value = ' + str(t))
print('p-value = ' + str(p))

Quite a nice p-value that would make every PhD student’s spine shiver with happiness ;) Note that with that kind of sample size, almost anything gets significant, but let’s move on with the demonstration.

Now let’s try a distribution centered at 0, which should not be significantly different from 0:

mean = 0; sd = 1; sample_size = 1000
exp_distribution = np.random.normal(loc=mean, scale=sd, size=sample_size)
plt.hist(exp_distribution)
plt.show()
t, p = stats.ttest_1samp(a=exp_distribution, popmean=0)
print('t-value = ' + str(t))
print('p-value = ' + str(p))

Here we have, as expected, a distribution that does not significantly differ from 0. And here is where things get a bit tricky: in some situations, frequentist statistics cannot really tell whether a p-value > 0.05 reflects an absence of evidence or evidence of absence, although that is a crucial distinction, since evidence of absence would allow you to completely rule out an effect of your experimental manipulation.

Let’s take a hypothetical situation:

You want to know whether a manipulation has an effect. It might be a novel marketing approach in your communication, an interference with biological activity, or a “picture vs no picture” test in an email you are sending. You of course have a control group to compare your experimental group to.

When collecting your data, you could see different patterns:

  • (i) the two groups differ.
  • (ii) the two groups behave similarly.
  • (iii) you do not have enough observations to conclude (sample size too small).

While option (i) is evidence against the null hypothesis H0 (i.e., you have evidence that your manipulation had an effect), situations (ii) (= evidence for H0, i.e., evidence of absence) and (iii) (= no evidence, i.e., absence of evidence) cannot be disentangled using frequentist statistics. But maybe the Bayesian approach can add something to this story...

How p-values are affected by effect and sample sizes

The first thing is to illustrate the situations where frequentist statistics have shortcomings.

Approach background

What I will be doing is plotting how frequentist p-values behave when changing both effect size (i.e., the difference between your control, here with a mean=0, and your experimental distributions) and sample size (number of observations or data points).

Let’s first write a function that would compute these p-values:

def run_t_test(m, n, iterations):
    """
    Runs a one-sample t-test against 0 for a given effect size (m) and
    sample size (n), repeated `iterations` times, and returns the p-values
    """
    my_p = np.zeros(iterations)
    for i in range(iterations):
        x = np.random.normal(loc=m, scale=1, size=n)
        # One-sample (two-tailed) t-test against a population mean of 0
        t, p = stats.ttest_1samp(a=x, popmean=0)
        my_p[i] = p
    return my_p

We can then define the parameters of the space we want to test, with different sample and effect sizes:

# Defines parameters to be tested
sample_sizes = [5,8,10,15,20,40,80,100,200]
effect_sizes = [0, 0.5, 1, 2]
nSimulations = 1000

We can finally run the function and visualize:

# Run the function and store all p-values in the array "my_pvalues"
my_pvalues = np.zeros((len(effect_sizes), len(sample_sizes), nSimulations))
for mi in range(len(effect_sizes)):
    for i in range(len(sample_sizes)):
        my_pvalues[mi, i, :] = run_t_test(m=effect_sizes[mi],
                                          n=sample_sizes[i],
                                          iterations=nSimulations)

Let’s quickly inspect a few of the p-values to make sure they seem correct.
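Here is a sketch of how one might print such a summary with pandas (the exact layout of the original notebook’s output may differ):

import pandas as pd

# Show the first two simulated p-values per effect size (columns),
# for a small (n=5) and a larger (n=15) sample size
for n in [5, 15]:
    idx = sample_sizes.index(n)
    df = pd.DataFrame(my_pvalues[:, idx, :2].T, columns=effect_sizes)
    print('p-values for sample size = ' + str(n))
    print('Effect sizes:')
    print(df)
    print('\n')

The output would be: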

p-values for sample size = 5
Effect sizes:
          0       0.5       1.0         2
0  0.243322  0.062245  0.343170  0.344045
1  0.155613  0.482785  0.875222  0.152519


p-values for sample size = 15
Effect sizes:
          0       0.5       1.0             2
0  0.004052  0.010241  0.000067  1.003960e-08
1  0.001690  0.000086  0.000064  2.712946e-07

I would make two main observations here:

  1. When you have a high enough sample size (lower section), the p-values behave as expected and decrease with increasing effect size (since you have more statistical power to detect the effect).
  2. However, we also see that the p-values are not significant for small sample sizes, even when the effect sizes are quite large (upper section). That is quite striking, since the effect sizes are the same; only the number of data points differs.

Let’s visualize that.

Visualization

For each sample size (5, 8, 10, 15, 20, 40, 80, 100, 200), we will count the number of p-values falling in significance level bins.
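Here is a sketch of how this counting and plotting could be done (the bin edges and the bar layout are my assumptions; the original figures may have been produced differently):

def plot_pvalue_bins(pvalues_for_effect, sample_sizes, title):
    """
    Counts, for each sample size, how many simulated p-values fall into
    each 'significance' bin and shows the counts as grouped bars
    """
    # Bin edges are an assumption; adapt to the thresholds you care about
    bins = [0, 0.001, 0.01, 0.05, 1]
    labels = ['p < 0.001', 'p < 0.01', 'p < 0.05', 'p >= 0.05']
    counts = np.array([np.histogram(pvalues_for_effect[i, :], bins=bins)[0]
                       for i in range(len(sample_sizes))])
    x = np.arange(len(sample_sizes))
    width = 0.2
    for j, label in enumerate(labels):
        plt.bar(x + j * width, counts[:, j], width=width, label=label)
    plt.xticks(x + 1.5 * width, sample_sizes)
    plt.xlabel('Sample size')
    plt.ylabel('Number of p-values')
    plt.title(title)
    plt.legend()
    plt.show()

# One plot per effect size
for mi, m in enumerate(effect_sizes):
    plot_pvalue_bins(my_pvalues[mi], sample_sizes, 'Effect size = ' + str(m))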

Let’s first compare two distributions of equal mean, that is, we have an effect size = 0.

Figure 2 | Number of p-values located in each “significance” bin for effect size = 0

As we can see from the plot above, most of the p-values computed by the t-test are not significant for an experimental distribution of mean 0. That makes sense, since these two distributions are not different in their means.

We can, however, see that in some cases we do obtain significant p-values, which can happen when the particular data points drawn from the overall population are unusual. These are typically false positives, and the reason why it is important to repeat experiments and replicate results ;)
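In fact, under the null (effect size = 0), the proportion of such false positives should hover around the alpha level of 0.05. A quick check on the simulated p-values stored above:

# Fraction of 'significant' p-values when there is truly no effect
# (effect_sizes[0] = 0); this should be close to the 0.05 alpha level
false_positive_rate = (my_pvalues[0] < 0.05).mean()
print('False positive rate across all sample sizes: ' + str(round(false_positive_rate, 3)))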

Let’s see what happens if we use a distribution whose mean differs by 0.5 compared to the control:

Figure 3 | Number of p-values per “significance” bin for effect size = 0.5

Now we clearly see that increasing the sample size dramatically increases the ability to detect the effect, although there are still many non-significant p-values for low sample sizes.

Below, as expected, you see that for highly different distributions (effect size = 2), the number of significant p-values increases:

Figure 4 | Number of p-values per “significance” bin for effect size = 2

OK, so that was it for an illustrative example of how p-values are affected by sample and effect sizes.

Now, the problem is that when you have a non-significant p-value, you are not always sure whether you might have missed the effect (say, because you had a low sample size, due to limited observations or budget) or whether your data really suggest the absence of an effect. As a matter of fact, most scientific research has a problem of statistical power, because it relies on limited observations (due to experimental constraints, budget, time, publishing time pressure, etc.).

Since the reality of research data is rather low sample sizes, you still might want to draw meaningful conclusions from non-significant results.

Here, Bayesian statistics could help you make one more step with your data ;)

Stay tuned for the following post where I explore the Titanic and Boston data sets to demonstrate how Bayesian statistics can be useful in such cases!

You can find this notebook in the following repo: https://github.com/juls-dotcom/bayes



I have been working with data since 2010, mainly in neuroscience research, but also economics, behavioral sciences and lately marketing.