Statistics Part II: Bayesian To The Rescue

Julien Hernandez Lallement
Towards Data Science
9 min read · Sep 7, 2020


In this post, I will compare the outputs of frequentist and Bayesian statistics, and explain how the two approaches can complement each other, in particular when a frequentist analysis yields unclear results.

For a first proof of concept, I will use the famous Titanic data set, which every new Kaggle user is exposed to upon registration. These statistics can of course be applied to any other data set. I selected the Titanic data set because it offers a wide range of variables, and because readers might already know the data.

For those not familiar with this data set, it offers a range of variables that can be used to predict the likelihood of having survived the accident that sank the ship. You will find all kinds of approaches online to analyze this data set, as well as machine learning techniques to predict survival.

I downloaded it from some source on the net, and you can find the exact data set I used here.

FYI, the variables are listed below:

[print(i) for i in df.columns]

PassengerId
Survived
Pclass
Name
Sex
Age
SibSp
Parch
Ticket
Fare
Cabin
Embarked
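
If you want to follow along in Python, a minimal sketch for loading the data could look like this (the file name titanic.csv is a placeholder, not the actual path from the repo; adjust it to wherever you stored the file):

# Minimal sketch: load the Titanic data and list its columns
import pandas as pd

df = pd.read_csv('titanic.csv')  # placeholder file name
print(df.columns.tolist())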

Absence of Evidence

If you do the analysis yourself, you will find out that some variables are pretty good at predicting survival. For the sake of argumentation, and because I think it offers nice explanatory power, let’s look at the variable age:

df.Age.plot(kind='hist')
Figure 1 | Histogram of the Age column in the Titanic dataset.

Since we want to investigate the effect of age on survival, let’s split that accordingly:

(df.groupby('Survived')
   .apply(lambda d: pd.Series({
       "std": d.Age.std(),
       "sem": d.Age.sem(),  # standard error = std / sqrt(n)
       "avg": d.Age.mean()
   }))
   .plot(kind='barh',
         y="avg",
         legend=False,
         title="Mean Age per Survival Class +/- std",
         xerr="std"
   )
);
Figure 2 | Mean age per survival category. Error bars show the standard deviation of the distribution.

From a simple bar plot, there does not seem to be a dramatic difference in the age of passengers who survived and those who did not survive the accident. Looking at the error bars, we might think that these distributions are not significantly different.

Let’s test that statistically.

For the demonstration of Bayesian statistics, I will be using the open-source software JASP, which offers a user-friendly interface. There are many other packages out there that would allow you to run Bayesian stats from code, but since readers might not be well versed in code, I use this software to show how to run basic Bayesian testing.

Let’s first load the Titanic data set in JASP:

Figure 3 | Screenshot of the Titanic dataset when loaded in JASP

Above, you can see that JASP automatically organizes the data into columns in a nice, readable way.

JASP allows you to perform basic statistical testing using both frequentist and Bayesian approaches. While I typically run my stats using SciPy, it’s kind of nice to have both approaches embedded in one piece of software, so that you can compare the outputs easily.

Let’s first start with the classic frequentist approach.

Below, you see a screenshot of the JASP window that pops up when you want to run an independent samples t-test, which is what we should be doing if we want to test whether passengers who survived had a significantly different age from passengers who died in the tragedy.

Figure 4 | Frequentist Independent Sample t-test in JASP.

As you can see, there are a lot of options that one could change, such as the type of test (Student, Welch, or Mann-Whitney if you want a non-parametric test) and whether you have a directional hypothesis (one- or two-sided test). Additionally, you can obtain more descriptive statistics if you want to explore your dataset using JASP.

I will just run a standard Student t-test, that is, a parametric test. Before doing that, we should check whether the assumptions for parametric testing are fulfilled by the distributions, but for the sake of the demo, let’s assume they are.
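
If you did want to check those assumptions in code, a rough sketch with SciPy could look like this (Shapiro-Wilk for normality within each group, Levene for equality of variances):

# Rough sketch of the assumption checks skipped above, using SciPy
from scipy import stats

age_survived = df.loc[df.Survived == 1, 'Age'].dropna()
age_died = df.loc[df.Survived == 0, 'Age'].dropna()

print(stats.shapiro(age_survived))            # normality, survivors
print(stats.shapiro(age_died))                # normality, non-survivors
print(stats.levene(age_survived, age_died))   # equality of variances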

Quite surprisingly, the test shows a significant difference between the two distributions (t(712) = 2.067, p = 0.039), i.e., the observed difference in age is unlikely under the null hypothesis. The p-value really flirts with the typical 0.05 alpha level: the effect is significant according to frequentist statistics, but not very convincing, if I might add.
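
For readers who prefer to stay in code, the same frequentist test can be reproduced with SciPy; a minimal sketch (Age has missing values, hence the dropna):

# Sketch of the same independent samples Student t-test with SciPy
from scipy import stats

age_survived = df.loc[df.Survived == 1, 'Age'].dropna()
age_died = df.loc[df.Survived == 0, 'Age'].dropna()

t, p = stats.ttest_ind(age_survived, age_died, equal_var=True)
print(f"t = {t:.3f}, p = {p:.3f}")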

Before moving on, we should also look at measures other than the p-value (effect size, see here), but since this post is about comparing the frequentist and Bayesian approaches, I will just move on.
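
That said, here is a quick sketch of what such an effect size could look like in code, using Cohen’s d with a pooled standard deviation:

# Quick sketch of an effect size (Cohen's d) for the Age comparison,
# using a pooled standard deviation across the two groups
import numpy as np

age_survived = df.loc[df.Survived == 1, 'Age'].dropna()
age_died = df.loc[df.Survived == 0, 'Age'].dropna()

n1, n2 = len(age_survived), len(age_died)
pooled_sd = np.sqrt(((n1 - 1) * age_survived.var(ddof=1) +
                     (n2 - 1) * age_died.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (age_survived.mean() - age_died.mean()) / pooled_sd
print(f"Cohen's d = {cohens_d:.3f}")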

Now let’s look at what a Bayesian Independent Sample t-test would show.

Figure 5 | Bayesian Independent Samples t-test in JASP.

The Bayesian test outputs a so-called Bayes factor (BF), which quantifies the relative predictive performance of the null hypothesis H0 versus the alternative hypothesis H1. See here for more information on Bayes factors.

While I do not like the concept of arbitrary threshold, these ideas can be useful to draw meaningful conclusions about the data.

In the frequentist world, the typical arbitrary threshold is 0.05, below which the effect is said to be significant.

In the Bayesian world, and according to the initial classification by Jeffreys, the following nomenclature could be used for BF10:

  • BF10 < 1/3: evidence for the null hypothesis H0
  • 1/3 < BF10 < 3: anecdotal evidence, no clear conclusion either way
  • BF10 > 3: evidence for the alternative hypothesis H1

In our case, we select BF10, which represents the likelihood of the data under the alternative hypothesis H1 relative to the likelihood of the data under the null hypothesis H0 (in math terms: p(data | H1) / p(data | H0)).

Back to our test. We find interesting options in the main window that allow you to perform a one- or two-sided test (“Alt. Hypothesis”, indicated by “+” in JASP), as well as BF manipulations that let you compute the ratio in either direction: BF10 (H1 vs. H0) and BF01 (the reverse comparison).

I suggest that you also explore the nice plot options that will allow you to visualize your prior and posterior distributions (that is for a separate post though…).

Let’s run the test:

In our case, we obtain BF10 = 0.685, meaning that our data are 0.685 times as likely under H1 as under H0. According to the classification by Jeffreys, this is anecdotal evidence, an absence of evidence: we cannot conclude that age does not affect the likelihood of survival in the Titanic accident. I insist here: since the BF is not below 1/3, we cannot claim that we have obtained evidence of the absence of an effect of age on survival. In such situations, more data might be needed to observe how the BF evolves as data accumulate.
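
JASP is not the only way to get this number; if you prefer code, packages such as pingouin expose a JZS Bayes factor for t-tests. A sketch, assuming pingouin is installed (this is my suggestion, not part of the JASP workflow, and the exact value may differ slightly depending on the prior settings):

# Sketch of computing BF10 in Python rather than JASP, using pingouin
import pingouin as pg
from scipy import stats

age_survived = df.loc[df.Survived == 1, 'Age'].dropna()
age_died = df.loc[df.Survived == 0, 'Age'].dropna()

t, p = stats.ttest_ind(age_survived, age_died)
bf10 = pg.bayesfactor_ttest(t, nx=len(age_survived), ny=len(age_died))
print(f"t = {t:.3f}, p = {p:.3f}, BF10 = {bf10}")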

I like to think of the Bayes factor in the following terms: “How much should I change my belief that age has an impact on the likelihood of surviving the Titanic disaster?” The answer is the Bayes factor. Depending on your prior belief (the so-called prior, which you can adapt based on what you already know about the data), the BF will differ, depending on the strength of the effect.

Here, it seems that I should not change my belief by much.
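
To make that dependence on the prior concrete, here is a small sketch that recomputes the Bayes factor for a few widths of the default Cauchy prior (using pingouin again; wider priors expect larger effects, which typically pulls BF10 down when the observed effect is small):

# Sketch: sensitivity of BF10 to the Cauchy prior scale r, using the
# t statistic reported above and the non-missing group sizes
import pingouin as pg

n_survived = df.loc[df.Survived == 1, 'Age'].dropna().shape[0]
n_died = df.loc[df.Survived == 0, 'Age'].dropna().shape[0]

for r in (0.5, 0.707, 1.0):
    bf10 = pg.bayesfactor_ttest(2.067, nx=n_survived, ny=n_died, r=r)
    print(f"r = {r}: BF10 = {bf10}")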

As you can see, while the frequentist approach would conclude that there is an effect of age, the Bayesian one would say that more data are required before concluding anything. In such cases, the best option would be to collect more data, to obtain either evidence of an effect or evidence of the absence of an effect.

Evidence of Effect

Let’s now use another famous dataset, the Boston housing dataset, to explore another situation. You can find this dataset in the repo linked in this post.

Below, I am plotting the price of the houses (which you are supposed to predict in the initial competition), split by whether the location is on or away from the Charles River.

(df_housing.groupby('chas')
   .apply(lambda d: pd.Series({
       "std": d.medv.std(),
       "sem": d.medv.sem(),  # standard error = std / sqrt(n)
       "avg": d.medv.mean()
   }))
   .plot(kind='barh',
         y="avg",
         legend=False,
         title="Mean price per river location +/- std",
         xerr="std"
   )
);
Figure 6 | Mean housing price per river location. Error bars show the standard deviation of the distribution.

As we can see, the prices are higher for locations close to the Charles River (True) than for locations away from it (False).
Now, let’s explore this observation statistically.

I won’t be showing the screenshots of the JASP results here, but only the results.

A frequentist independent samples t-test shows a highly significant difference between the two distributions (t(504) = -3.996, p < 0.001), i.e., the observed difference in housing price between accommodations located on and away from the Charles River is significant.

Let’s see what we obtain with a Bayesian test.

The Bayesian approach confirms this observation with a very high BF value, suggesting that the data are 270 times more likely under H1 than under H0.
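
As a side note, if you want both numbers in one place from code, pingouin’s t-test function reports the frequentist p-value and BF10 side by side; a sketch (again my own suggestion, not what was used for the JASP results above):

# Sketch: frequentist p-value and BF10 in one call with pingouin
# correction=False forces a plain Student test (rather than Welch)
import pingouin as pg

medv_on_river = df_housing.loc[df_housing.chas == 1, 'medv']
medv_off_river = df_housing.loc[df_housing.chas == 0, 'medv']

res = pg.ttest(medv_on_river, medv_off_river, correction=False)
print(res[['T', 'dof', 'p-val', 'BF10']])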

Now, we have one last case to explore: the one where Bayesian statistics would allow us to conclude that an effect is absent.

Evidence of Absence

So far, so good. In the previous example, we saw that both approaches made sense and concurred when the effect is strong. Now let’s look at another variable where we might be able to use the power of the Bayesian approach a bit more clearly.

Instead of looking at a variable that clearly differs between the two river locations, such as price, let’s look at another one: the “zn” variable, i.e., the proportion of residential land zoned for lots over 25,000 sq. ft.

First, let’s see what a frequentist would say:

And then, what a Bayesian would say:

The frequentist would say that there is no effect, but again, we do not know whether there actually is no effect, or whether we are lacking statistical power…

The Bayesian test gives BF10 = 0.284. This value is below 1/3, so according to the nomenclature mentioned above, our data this time provide moderate evidence for H0, i.e., locations on or away from the Charles River do not differ in the proportion of residential land zoned for large lots. In other words, we have evidence of the absence of an effect, and we can rule this variable out of further models and interpretations to simplify our analysis.

This conclusion is the opposite of the “absence of evidence” situation that we found previously in the Titanic dataset with the variable Age. In that particular situation, we should not remove Age from our models and analysis, since we do not have clear evidence that it does not play a role in the observed effect.

Combining the Frequentist and Bayesian Approaches

The most powerful approach to such statistical testing is probably to report both frequentist and Bayesian results. This is something we did in a recent publication, to accommodate both frequentist and Bayesian reviewers, and to justify why some variables were removed from further covariate analysis while others were kept.

Cheers and thanks for reading :)

Ju

You can find this notebook at the following repo: https://github.com/juls-dotcom/bayes

