BCN Causal ALGO

Bayesian Statistics of Efficacy of the Pfizer-BioNTech COVID-19 Vaccine — part I

Vaccine Efficacy and Beta-binomial model

Bartek Skorulski

Published in

Towards Data Science

12 min read · Apr 10, 2021

Part I (this post)
Introduction
What is Vaccine Efficacy
Credibility of results
Bayesian Inference
Beta-binomial model
Statistics of Vaccine Efficacy using simulations
Vaccine and Placebo Incidence Rates
Monte Carlo methods
Posterior probabilities and 95% Credible Interval

Part II (next post)
Reproducing statistics from the article
Additional parameter θ
Prior distribution of θ and an adjustment of occurrences
Credible Interval for posterior Vaccine Efficacy
COVID-19 occurrence in participants with and those without prior evidence of infection
Final note
References

Introduction

On December 10th, 2020 the New England Journal of Medicine published a paper entitled Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine. The article presented the results of a Randomised Controlled Trial in which 43,548 participants were randomly assigned to receive two doses of the BNT162b2 vaccine (now called the Pfizer-BioNTech COVID-19 Vaccine) or two doses of placebo. The doses were given three weeks apart. Participants were then followed for the development of COVID-19 for about 2 months. The results were astonishing. Vaccine efficacy was 95%, with a 95% credible interval of (90.3–97.6%). A day after publication, the U.S. Food and Drug Administration issued the first emergency use authorisation for the vaccine. It was a new hope for a world that was going through a very difficult period of rising deaths, lockdowns, closed borders, rising unemployment, job losses and economic recession…

Apart from the importance of the result, what caught my attention was that the authors of the paper reported their statistics using Credible Intervals. This means that the researchers decided to use Bayesian rather than Frequentist Inference. Since I am involved in designing Randomised Controlled Trials, I wanted to understand how it was done. Unfortunately, not all the details were included in either the article or the study protocol, so I needed to do some detective work.

First, I took the sizes of the vaccine and placebo groups and the number of infected participants in each group, and did some calculations. The resulting Credible Interval was close to the one from the article, but not the same. So I looked more carefully at the protocol and tried to reproduce the steps described there. This time the calculated Credible Interval was closer to the original one, but still not quite the same.

In the end I realised that additional adjustments to the sizes of the vaccine and placebo groups were needed. And “Voilà!”: the numbers were the same. But how could I be sure that it was not a coincidence? Fortunately, the article also reports another Credible Interval. With that I could validate that those were very likely the right steps.

I have to admit that I had some fun trying to figure out how to reproduce the results from the article. If you are interested in learning how the statistics were calculated in one of the most important studies of 2020, in this note you can find all you need to follow my steps. I expect you to be familiar with statistics, but you do not need to be an expert. I have tried to explain a few concepts from Bayesian Inference that should help you understand the calculations. I have also included Python code that you can run yourself.

I have divided the note into two parts. In Part I, I give a minimal summary of the Bayesian Inference needed to understand the following sections. Then I present a straightforward method for calculating the statistics from the article. With this method you can get results close to the original ones. If this is not enough for you, you should also read Part II, which is slightly more difficult. There I show a method that reproduces the results to one decimal place, and then validate it against another Credible Interval that can be found in the article.

What is Vaccine Efficacy

Before moving to Bayesian statistics, let me explain how Vaccine Efficacy is estimated. First we randomly assign the study participants to two groups: a vaccine group and a placebo group. People in the vaccine group receive the vaccine, and those in the placebo group receive a placebo. Neither the participants nor the personnel who administer the doses know whether a participant is getting the vaccine or the placebo. Then, when the study is finished, the Vaccine Efficacy (VE) is estimated by the formula

$$\mathrm{VE} = 100 \times (1 - \mathrm{IRR}),$$

where IRR is the Incidence Rate Ratio given by the formula

$$\mathrm{IRR} = \frac{\text{Vaccine Incidence Rate}}{\text{Placebo Incidence Rate}}.$$

The Vaccine Incidence Rate is the number of confirmed cases of COVID-19 illness divided by the number of people in the vaccine group, and the Placebo Incidence Rate is the same quantity for the placebo group (see the Statistical Analysis section in the paper).

Now, let us look at the numbers from the paper and redo these calculations. The numbers can be found in the following table that I have copied from the original article.

Table 1: Vaccine Efficacy against covid (reproduced from https://www.nejm.org/doi/full/10.1056/NEJMoa2034577)

You can see in the first row of the table that, for participants without evidence of infection before vaccination, the vaccine group had 8 cases of COVID-19 and the number of participants, adjusted for surveillance time, is 17,411 (see Table 1). Note that we do not use the total number of participants, which is 18,198. An adjustment for surveillance time is needed because not every person participated in the study for the same length of time. For example, a person monitored for 2 months had more chances of developing symptomatic COVID-19 than a person monitored for 1 month. So the total number has to be modified, which gives 17,411. We are not able to reproduce this computation since we do not have access to the data of individual patients. However, we refer the reader who is interested in how it could be done to this notebook.

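Hence we can calculate the Vaccine Incidence Rate as follows

$$\text{Vaccine Incidence Rate} = \frac{8}{17411} \approx 0.00046,$$

and in the same way the incidence rate for the placebo group, which had 162 cases among 17,511 participants:

$$\text{Placebo Incidence Rate} = \frac{162}{17511} \approx 0.00925.$$

Putting this together

$$\mathrm{VE} = 100 \times \left(1 - \frac{8/17411}{162/17511}\right) \approx 100 \times (1 - 0.0497) \approx 95\%.$$

We can interpret this result in the following way. Among vaccinated people, 95% of those who would otherwise have developed symptomatic COVID-19 did not.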
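
As a quick check, the arithmetic above can be sketched in a few lines of Python. The variable names are my own; the case counts and group sizes are the ones quoted in this post from Table 1.

```python
# Figures quoted in this post from Table 1 of the paper (first row).
vaccine_cases, vaccine_n = 8, 17_411
placebo_cases, placebo_n = 162, 17_511

# Incidence rate = confirmed cases / (adjusted) number of participants.
vaccine_rate = vaccine_cases / vaccine_n
placebo_rate = placebo_cases / placebo_n

# Incidence Rate Ratio and Vaccine Efficacy in percent.
irr = vaccine_rate / placebo_rate
vaccine_efficacy = 100 * (1 - irr)

print(f"Vaccine Incidence Rate: {vaccine_rate:.6f}")
print(f"Placebo Incidence Rate: {placebo_rate:.6f}")
print(f"IRR: {irr:.4f}")
print(f"Vaccine Efficacy: {vaccine_efficacy:.1f}%")
```

This is only a point estimate. It says nothing yet about how uncertain the 95% figure is, which is the subject of the next sections.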

Credibility of results

This 95% efficacy looks very good indeed. But how much confidence do we have in it? In Randomised Controlled Trials we answer this question by estimating the probability that a vaccine that is not efficacious, or not efficacious enough, would reach the market. We call this “controlling type I error”.

In the study, the assessment of vaccine efficacy was based on the probability that the vaccine efficacy is greater than 30% (see Section 9.1.2.1 of the protocol, page 107). The investigators decided that, in order to approve the vaccine, this probability has to be larger than 97.5%. That is

$$P(\mathrm{VE} > 30\% \mid \text{data}) > 97.5\%.$$

In other words, the probability of making a type I error, which in this case means approving a vaccine with efficacy lower than 30%, should be lower than 2.5%. And what are the results? The last column of Table 1 shows that the probability that the vaccine efficacy was greater than 30% is higher than 99.99%. Equivalently, the probability of making a type I error is lower than 0.01%.

In fact the results are much better than that. The 95% Credible Interval was equal to (90.3–97.6%). I will explain in detail what a 95% Credible Interval is later on. For now, let me tell you that it implies that the probability that the vaccine efficacy is greater than 90.3% is 97.5%. So the initial threshold of 30% was quite pessimistic.

*The 30% threshold was rather pessimistic, since it turned out that the distribution of Vaccine Efficacy lies far to the right of it. This is great news for us.*

Now it is our turn to dive into Bayesian Inference in order to calculate these results.

Bayesian Inference

It is impossible to explain Bayesian Statistics in this short note. There are many books that do this very well (for example, Bayesian Statistics: An Introduction by Peter Lee). What I will try to do instead is to present enough of it to explain and reproduce the results from the article. I will also point the reader to where additional explanations can be found.

The central role in Bayesian Statistics is of course played by Bayes’ Theorem. For our needs we can rephrase it as

$$\text{Posterior beliefs} \propto (\text{Prior beliefs}) \times (\text{Likelihood of observed data}).$$

What does it mean? Prior beliefs represent what we believe before collecting data and are expressed as a probability distribution over the possible values of a parameter. The likelihood is the probability of obtaining the observed data for each possible parameter value. The product (Prior beliefs) × (Likelihood of observed data) is the rule for updating our beliefs: if the observed data are more likely under some parameter values, those values get more weight in the distribution of Posterior beliefs. The symbol ∝ (“proportional to”) means that we can ignore constants, that is, we do not care whether the right-hand side is multiplied by 0.01 or by 100. We are dealing with probability distributions, so we only need to know the relative weights of the different possibilities.

I know that this formalisation is a little awkward when seen for the first time. But it represents something that is quite intuitive. For example, if we have a new vaccine, which we do not know anything about, our prior beliefs are very weak. In this situation we could say that, according to our prior beliefs, the probability that the Vaccine Efficacy is lower than 30% or greater than 30% is the same. However, after seeing the results of a study, our posterior beliefs will change. With results like the one we have here, we can say that it is more likely that Vaccine Efficacy is greater than 30%.
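
To make the idea of updating beliefs up to a constant more concrete, here is a minimal sketch on a grid of candidate parameter values. The data (3 events among 10 people), the flat prior and all names are hypothetical choices of mine for illustration, not numbers from the trial.

```python
from math import comb

# Hypothetical toy data (not from the trial): k = 3 events among n = 10 people.
n, k = 10, 3

# Candidate values of the unknown probability theta on an evenly spaced grid.
grid = [i / 100 for i in range(101)]

# Flat prior: every candidate value is equally plausible before seeing data.
prior = [1.0 for _ in grid]

# Likelihood of the observed data for each candidate theta.
likelihood = [comb(n, k) * t**k * (1 - t) ** (n - k) for t in grid]


def normalise(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


# Posterior is proportional to prior * likelihood; normalising fixes the constant.
posterior = normalise([p * l for p, l in zip(prior, likelihood)])

# Multiplying by any positive constant (here 100) leaves the posterior unchanged.
posterior_scaled = normalise([100 * p * l for p, l in zip(prior, likelihood)])

posterior_mean = sum(t * w for t, w in zip(grid, posterior))
print(f"Posterior mean of theta: {posterior_mean:.3f}")
```

The two normalised lists agree, which is exactly why constants can be dropped from the formula.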

Beta-binomial model

We also need to explain how the Beta-binomial model works. In this model we assume that each person has a certain probability of getting COVID-19. Let’s say that this probability is θ. Then, if we observe n people, the number k who get sick follows a Binomial distribution. That is

$$P(k \mid \theta) = \binom{n}{k}\, \theta^{k} (1-\theta)^{n-k}.$$

We assume as our prior beliefs that θ follows a Beta distribution:

$$p(\theta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)},$$

where α and β are parameters greater than zero and B(α, β) is the Beta function, a normalising constant that does not depend on θ. The most important reason for this assumption is the mathematical link between the Beta and Binomial distributions that we explain below: the Beta distribution is a conjugate prior for the Binomial likelihood.

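Indeed, since in the formula

$$p(\theta \mid k) \propto p(\theta) \times P(k \mid \theta)$$

we can ignore constants (which means that we can ignore all factors that do not contain θ), we get

$$p(\theta \mid k) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1} \times \theta^{k} (1-\theta)^{n-k} = \theta^{\alpha+k-1} (1-\theta)^{\beta+n-k-1}.$$

This means that the posterior beliefs also follow a Beta distribution, with parameters α+k and β+n−k. When the posterior and the prior follow the same type of distribution, we say that the prior is a conjugate prior for the likelihood function (see also section 3.1.1 of Lee’s Bayesian Statistics book).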

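Summing up, if the likelihood function of our model follows a Binomial distribution and the prior follows a Beta distribution, we call the model a Beta-binomial model and write it as

$$\theta \sim \mathrm{Beta}(\alpha, \beta), \qquad k \mid \theta \sim \mathrm{Binomial}(n, \theta) \quad\Longrightarrow\quad \theta \mid k \sim \mathrm{Beta}(\alpha + k,\ \beta + n - k).$$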
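
In code, the whole update reduces to two additions. The sketch below uses helper names of my own (update_beta, beta_mean) and a hypothetical example with a flat Beta(1, 1) prior and 3 events among 10 people.

```python
def update_beta(alpha, beta, k, n):
    """Return the posterior Beta parameters after observing k events among n people."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return alpha + k, beta + n - k


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical example (not trial data): flat Beta(1, 1) prior, 3 events among 10 people.
post_alpha, post_beta = update_beta(1, 1, k=3, n=10)

print(f"Posterior: Beta({post_alpha}, {post_beta})")
print(f"Posterior mean: {beta_mean(post_alpha, post_beta):.3f}")
```

No integration or simulation is needed for this step, which is the practical benefit of choosing a conjugate prior.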

Statistics of Vaccine Efficacy using simulations

First I will calculate the statistics for Vaccine Efficacy slightly differently from how it was done in the article. I think this approach is a little simpler and easier to understand on a first read. In the second part of this note we will reproduce the original computations.

Vaccine and Placebo Incidence Rates

Now we come back to our Vaccine and Placebo Incidence Rates. First we assume as our prior belief that the probability of getting COVID-19 is the same with the vaccine and with the placebo. This means that both follow the same Beta distribution with the same parameters α and β. Now, what should the values of these parameters be? Well, we can assume that on average around 1% of people get sick. Moreover, we would like our prior beliefs to be relatively weak, which means that they can easily be changed once we collect sufficient data. A natural choice is then to take β=1 and α≤1 such that the mean of the Beta(α, 1) distribution is 0.01:

$$\frac{\alpha}{\alpha + 1} = 0.01.$$

Rewriting it as

$$\alpha = 0.01\,(\alpha + 1),$$

we get that

$$\alpha = \frac{0.01}{0.99} \approx 0.010101.$$

Hence we assume that the prior distributions of the Vaccine and Placebo Incidence Rates both follow the same Beta(0.010101, 1) distribution.
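
The same value of α can be obtained numerically. This is a small sketch with variable names of my own; it relies only on the fact that the mean of a Beta(α, β) distribution is α/(α+β).

```python
prior_mean = 0.01  # assumed average probability of getting sick (about 1%)
beta = 1.0         # weak prior, as chosen in the text

# Solve alpha / (alpha + beta) = prior_mean for alpha.
alpha = prior_mean * beta / (1 - prior_mean)

print(f"alpha = {alpha:.6f}")
print(f"check: mean of Beta(alpha, beta) = {alpha / (alpha + beta):.4f}")
```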

[Figure: Prior distribution of Vaccine and Placebo Incidence Rates]

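Now, after running the trial, in the vaccine group we have observed 8 cases of COVID-19 out of 17,411 participants (let me remind you that this number is adjusted, since participants’ surveillance time varied). Hence, using the formula from the previous section, the posterior distribution of the Vaccine Incidence Rate is Beta(0.010101+8, 1+17411−8). Similarly, with 162 cases out of 17,511 participants in the placebo group, the posterior distribution of the Placebo Incidence Rate is Beta(0.010101+162, 1+17511−162). We sum it up as follows.

$$\text{Vaccine Incidence Rate} \mid \text{data} \sim \mathrm{Beta}(8.010101,\ 17404), \qquad \text{Placebo Incidence Rate} \mid \text{data} \sim \mathrm{Beta}(162.010101,\ 17350).$$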

Then the plot of these distributions is the following:

[Figure: Posterior distributions of the Vaccine and the Placebo Incidence Rates]

These plots clearly show that the posterior distribution of the Vaccine Incidence Rate is concentrated at much lower values than that of the Placebo Incidence Rate. This already suggests that it is extremely unlikely that we got these results by chance. Anyway, let us calculate the statistics.
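
To put rough numbers on this separation, here is a small sketch that uses the standard closed-form mean and variance of a Beta distribution. The helper name beta_mean_sd and the "gap in standard deviations" summary are my own illustrative choices, not something taken from the article.

```python
from math import sqrt


def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total**2 * (total + 1))
    return mean, sqrt(variance)


# Prior Beta(0.010101, 1) updated with the trial counts quoted in the text.
prior_alpha, prior_beta = 0.010101, 1.0
vaccine_mean, vaccine_sd = beta_mean_sd(prior_alpha + 8, prior_beta + 17_411 - 8)
placebo_mean, placebo_sd = beta_mean_sd(prior_alpha + 162, prior_beta + 17_511 - 162)

# Distance between the two posterior means, in units of their combined spread.
gap_in_sds = (placebo_mean - vaccine_mean) / sqrt(vaccine_sd**2 + placebo_sd**2)

print(f"Vaccine: mean {vaccine_mean:.5f}, sd {vaccine_sd:.5f}")
print(f"Placebo: mean {placebo_mean:.5f}, sd {placebo_sd:.5f}")
print(f"Gap between means: {gap_in_sds:.1f} combined standard deviations")
```

A gap of many standard deviations is consistent with the two posterior curves barely overlapping.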

Monte Carlo method

So now we are going to calculate the probability that the Vaccine Incidence Rate is lower than the Placebo Incidence Rate. That is

$$P(\text{Vaccine Incidence Rate} < \text{Placebo Incidence Rate} \mid \text{data}).$$

And we will do this using the Monte Carlo method.

The Monte Carlo method is a way to get numerical estimates of quantities whose formulas are complicated or impossible to write in explicit form. In order to estimate our probability we are going to sample 1,000,000 pairs of numbers from the posterior distributions of the Vaccine Incidence Rate and the Placebo Incidence Rate and then count in how many of them the Vaccine Incidence Rate is the lower one. Here we provide Python code for that.
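
A minimal sketch of this simulation, assuming NumPy is available. The seed and the variable names are arbitrary choices of mine, so treat this as one possible way to write it.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for reproducibility
n_samples = 1_000_000

# Prior Beta(0.010101, 1) updated with the trial counts quoted in the text.
prior_alpha, prior_beta = 0.010101, 1.0

# Independent draws from the two posterior Beta distributions.
vaccine_rate = rng.beta(prior_alpha + 8, prior_beta + 17_411 - 8, size=n_samples)
placebo_rate = rng.beta(prior_alpha + 162, prior_beta + 17_511 - 162, size=n_samples)

# Fraction of sampled pairs in which the vaccine rate is the lower one.
prob_vaccine_lower = np.mean(vaccine_rate < placebo_rate)
print(prob_vaccine_lower)
```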

The outcome is 1.0. It means that the probability is almost one.

Posterior probabilities and 95% Credible Interval

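Let us recall that the vaccine efficacy is given by the formula

$$\mathrm{VE} = 100 \times (1 - \mathrm{IRR}) = 100 \times \left(1 - \frac{\text{Vaccine Incidence Rate}}{\text{Placebo Incidence Rate}}\right).$$

First, we calculate the probability that the vaccine efficacy is greater than 30%. Although it is possible to obtain an explicit formula (see for example this paper), here again we use simulations.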
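
One way to sketch this simulation, again assuming NumPy, is to turn every sampled pair of incidence rates into a sampled Vaccine Efficacy. The seed and names are my own choices.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for reproducibility
n_samples = 1_000_000

# Prior Beta(0.010101, 1) updated with the trial counts quoted in the text.
prior_alpha, prior_beta = 0.010101, 1.0
vaccine_rate = rng.beta(prior_alpha + 8, prior_beta + 17_411 - 8, size=n_samples)
placebo_rate = rng.beta(prior_alpha + 162, prior_beta + 17_511 - 162, size=n_samples)

# Vaccine Efficacy (in percent) for every sampled pair of incidence rates.
vaccine_efficacy = 100 * (1 - vaccine_rate / placebo_rate)

# Fraction of samples in which the efficacy exceeds the 30% threshold.
prob_above_30 = np.mean(vaccine_efficacy > 30)
print(prob_above_30)
```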

Hence we can write, as in Table 1:

$$P(\mathrm{VE} > 30\% \mid \text{data}) > 0.9999.$$

In the same way we can calculate the 95% Credible Interval. There are several ways of defining Credible Intervals (see for example this wikipedia article). In the article, the 95% Credible Interval is defined as the interval that contains 95% of the posterior distribution and is bounded by the 2.5th and 97.5th percentiles. Here is the Python code for calculating it.
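
A minimal sketch of the percentile calculation, assuming NumPy and the same Beta posteriors as before. The seed and names are mine, and the exact digits depend on the random draws.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for reproducibility
n_samples = 1_000_000

# Prior Beta(0.010101, 1) updated with the trial counts quoted in the text.
prior_alpha, prior_beta = 0.010101, 1.0
vaccine_rate = rng.beta(prior_alpha + 8, prior_beta + 17_411 - 8, size=n_samples)
placebo_rate = rng.beta(prior_alpha + 162, prior_beta + 17_511 - 162, size=n_samples)

# Simulated posterior distribution of Vaccine Efficacy (in percent).
vaccine_efficacy = 100 * (1 - vaccine_rate / placebo_rate)

# Equal-tailed 95% Credible Interval: 2.5th and 97.5th percentiles.
lower, upper = np.percentile(vaccine_efficacy, [2.5, 97.5])
print(f"95% Credible Interval: ({lower:.1f}%, {upper:.1f}%)")
```

The lower bound is also the value that the efficacy exceeds with probability 97.5%, which is how the interval was read in the section on credibility of results.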

[Figure: Posterior distribution with 95% Credible Interval: (90.8–97.9%)]

Finally, our result is that the vaccine efficacy is 95% with a 95% Credible Interval equal to (90.8–97.9%). You may get slightly different numbers, since the simulation is random, but they should be very close. These results are very similar to those from the article, where the 95% Credible Interval is equal to (90.3–97.6%), but they are not identical. In the second part of this note we will show how to redo the calculations from the article.

End of part I

In this part we have learnt how to calculate Vaccine Efficacy and how to estimate its credibility using Bayesian Statistics. The second part is slightly more complicated. We will take a closer look at the study protocol in order to show you how you can calculate the statistics similarly to how it was done in the article.

All visualisations, unless otherwise noted, are by the author.