Bayesian A/B Testing in R

Analyze social media performance with Bayesian vs. Frequentist statistics

Hannah Roos
Towards Data Science



Professionals have always wondered how they can improve their products and services. Marketing practitioners ask how modifications to their website would change online purchases and maximize sales. Similarly, scientists test different drug versions against each other to identify the most effective one for future patients. Every time comparable options need to be tested against each other, we can run a little experiment to answer the following question: How does version A differ from version B?

What is A/B testing?

A/B testing is just a fancy term for a classical hypothesis test in which A and B represent different variants of the same thing. For instance, they can refer to a control condition (A) and a treatment condition (B) in clinical trial data. Another field that has widely adopted A/B testing is marketing.

Professionals often would like to know how many customers have engaged with the website for A) an original version and B) a slightly changed version. To find out, they design a randomized experiment: people are randomly assigned to two groups that are presented with either of the two versions. To determine which one is better, they test which one scores higher on a specific metric of success. For example, this can be a conversion rate, i.e., the percentage of users who completed a desired action: the number of users who clicked on an advertisement divided by the overall size of the audience. A click on the advertisement is considered a success, while not clicking is a failure.

The final goal is to make a data-driven decision to improve the website: whichever version attracts more users will be adopted. To make sure they exploit the full potential and end up with a near-optimal result, they iterate through this process multiple times: collect data on the current version and test it against a potential improvement.

Contents of this article

Usually, this process is statistically backed up by something called NHST (null-hypothesis significance testing), aka Frequentist statistics, which is the traditional approach taught all around the world. However, this paradigm has been criticised by practitioners and scientists alike for its misleading interpretations. In search of a more straightforward way to test hypotheses, people often turn towards Bayesian statistics. When so much money is at stake behind marketing campaigns, we should be pretty confident about our choices and ideally build upon previous data. Does the Bayesian approach live up to this? I think so.

In this article, you will find hands-on information about:

  • The key differences between Bayesian and Frequentist A/B testing both in theory and in practice with a coding tutorial in R.
  • The steps to plan your experiment, specify priors, test hypotheses and assess the magnitude of evidence.
  • The practical side of A/B testing by comparing the user engagement of video vs. photo posts on a real-world dataset.

The pain of hypothesis testing — Frequentist vs. Bayesian

Bayesian statistics is about updating one's belief about something as new information becomes available. It allows the analyst to incorporate prior knowledge (e.g., based on prior data, how much difference in user engagement can we expect between publication types?). As new data arrive, we continuously re-allocate credibility between the different options until we have enough evidence. This procedure is an ideal candidate for meta-analysis in science and for continuous analyses in the business context — possibly it will save you the resources spent on an analysis that would not turn out to be fruitful anyway. This is a big plus in terms of efficiency!

The most fundamental difference between the two paradigms is their reversed understanding of what probability means. The term frequentist refers to long-run frequencies across (hypothetical) repeated samples. Under the hood, the computation of p-values relies on a distribution of imaginary test statistics (e.g., t-values) that we would obtain if there were no difference between A and B to begin with. Thus, we can say something about how unlikely it is to observe such data under the null hypothesis. By looking at the proportion of outcomes that are at least as extreme as those in our sample, we assess how unusual our observations are. In fact, this is what the famous p-value estimates: the probability of observing data at least as extreme as ours, given that the null hypothesis is true. In contrast, Bayesian probability expresses the degree of reasonable belief in a specific hypothesis given the data. So, the interpretation is exactly reversed.

You would probably agree that this interpretation of significance is more intuitive and in line with our natural way of thinking: we start with the data and make inferences about the presence of effects in nature instead of the other way around. After all, we are usually not interested in the likelihood of the data in light of the null hypothesis — something we do not actually believe in either. There is no need to perform these mental gymnastics with Bayesian statistics. Moreover, it is a bit humbler because we embrace the degree of uncertainty that comes with every statistical estimation. The reason behind this lies in something called probability distributions. If we take user engagement as an example, the analysis is based on information about the specific values it usually takes and with what probabilities. We have an idea about the success of a certain course of action (e.g., a piece of social media content) that is based on our expert knowledge, and we collect data to update our belief.

Bayesians analyse data exactly in the way learning works: we adjust our beliefs as we observe and thus continuously grow our knowledge.

If you are interested in more theoretical details on this topic, read my article on Frequentist vs. Bayesian statistics to predict the weather on my wedding.

The key differences in a nutshell

Bayesian A/B testing…

  • …quantifies the difference between the success of A and B. In contrast, the frequentist approach only tells us whether there is a difference at all.
  • …differentiates between competing hypotheses and can provide evidence for the absence of an effect (e.g., A and B do not differ) in light of the data. A non-significant p-value, by contrast, only tells us that the data are not unusual enough to reject the null hypothesis; it cannot distinguish inconclusive data from a truly absent effect.
  • …informs economic decision-making. Because frequentist statistics do not estimate the probability of parameter values (e.g., the conversion rates of A vs. B), we cannot translate the results into relevant business outcomes like expected profit or survival rates.
  • …is intuitive and straightforward even for non-statisticians. It naturally complements the human way of thinking about evidence. For frequentists, the interpretation of probability is reversed, which makes it hard to grasp directly.
  • …requires much less data and is more time-efficient. Unlike frequentists, Bayesians do not have to follow a fixed sampling plan that would make an analysis of the existing data illegitimate before the full sample size is reached.

A short note on R packages

By now, there are two packages in R that specialise in Bayesian A/B testing, and they are based on different assumptions. The bayesAB package (Portman, 2017) follows the ‘Independent Beta Estimation’ (IBE) approach, which assumes:

  • The success probabilities of A and B are independent, so learning about the success rate of one experimental condition does not affect our knowledge about the success rate of the other.
  • The presence of an effect: the experimental manipulation must be effective in some way, so we cannot obtain evidence in favour of the null hypothesis.

These restrictions are overcome by the Logit Transformation Testing (LTT) approach, which is implemented in the abtest package (Gronau, 2019) for R (R Core Team, 2020). This is what we will use for our upcoming case study to obtain more informative results; if you are interested in the technical details, I recommend the paper by Hoffmann, Hofman & Wagenmakers (2021).


Benchmark social media performance

A typical example from the field of marketing is analysing user engagement on a company’s website to improve its popularity. In this case study, we will look at just that. Technically, user or customer engagement is defined as voluntary and potentially profitable behaviour towards a firm. For example, it can manifest itself in a person being willing to draw the attention of prospective customers to a firm by word of mouth, writing a comment, sharing information or referring to a product. This should ring a bell for marketers, because user engagement is essentially free, creates a beneficial relationship with a brand, and may increase sales in the long run. But why not simply use the click-through rate, i.e., the number of users who have actually clicked on the post or advertisement?

Because user engagement is a stronger indicator of the degree to which our social media content triggers the kind of interest that requires the user to pay attention. Specifically, we would like to know whether videos engage users more than photos. Databox even suggests that videos require users to focus more because they are not digested as easily as photos, and therefore generate twice as many clicks and 20–30% more conversions.

The dataset we will use stems from an openly available dataset analysed by Moro, Rita and Vala (2016); it includes 500 posts from the Facebook page of a worldwide renowned cosmetics brand, collected between the 1st of January and the 31st of December 2014. There are 12 outcome variables analysed by the authors, such as the following:

  • Lifetime post total reach: The number of people who saw a page post (unique users).
  • Lifetime post total impressions: Impressions are the number of times a post from a page is displayed, whether the post is clicked or not.
  • Lifetime engaged users: The number of people who clicked anywhere in a post (unique users).
  • Total interactions: The sum of “likes,” “comments,” and “shares” of the post.

To further characterise the posts we deal with, there are five other variables available:

  • Type: Categorization into whether the post is a Link, Photo, Status or Video.
  • Category: Categorization into whether the post refers to an action (special offers and contests), product (direct advertisement, explicit brand content), or inspiration (non-explicit brand related content).
  • Paid: Categorization into whether the brand paid Facebook for advertising (yes/no).

Load data and look at descriptive stats

After having defined an R project, we can use the neat here package to set up our paths in a simple way that is more robust to changes on your local machine (Müller & Bryan, 2020). I recommend using data.table’s fread() function to load the data because it is much faster than the read_csv() counterpart and gets the data formats right more often (Dowle, 2021). Let’s take a look at our raw data.
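A minimal sketch of the loading step; the file name, the data/ folder and the semicolon separator are assumptions about the local project layout and the raw file from Moro et al. (2016):

library(here)        # project-relative file paths
library(data.table)  # fast fread()
library(skimr)       # quick descriptive overview

# file location and separator are assumptions - adjust to your setup
facebook <- fread(here("data", "dataset_Facebook.csv"), sep = ";")
facebook[, Type := factor(Type)]  # treat the post type as a factor

# descriptive statistics, histograms and missing values per variable
skim(facebook)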

Disclaimer: All graphics are made by the author unless specified differently.

Only 5 observations are missing, and the summary that the skim() function gives us shows some descriptive statistics including mean, standard deviation and percentiles, as well as a histogram for each of the variables (18 numeric; 1 factor).

To get an overview of how many publications of each type were posted, we load the dplyr package and count the number of observations by Type.
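A quick sketch of the counting step, assuming the data object from above is called facebook:

library(dplyr)

# count the number of posts per content type
facebook %>%
  group_by(Type) %>%
  count()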

# A tibble: 4 x 2
# Groups: Type [4]
Type n
<fct> <int>
1 Link 22
2 Photo 426
3 Status 45
4 Video 7

It turns out that the vast majority of content consists of photos (426), whereas links (22), status updates (45) and videos (7) were less common. Now this should not discourage you, because what counts is the number of observations behind each of the posts: the number of successes by the number of trials, or stated differently, the number of engaged users by the reach for each of the content types, which we will save for later.

How to calculate the engagement rate

Now the raw number of users who have engaged with a post may fluctuate a lot depending on its reach, so we need a way to standardize it in order to benchmark performance. Therefore, we will create a percentage that represents the number of engaged users relative to the total number of people who have seen the post. There is a whole guide on different ways to calculate engagement rates by Hootsuite, which I highly recommend reading. Facebook gives you various metrics that could be interesting for marketers, but for simplicity’s sake we will stick to one metric.
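One way to build this summary with dplyr; the column names follow the raw dataset and are assumptions:

library(dplyr)

# aggregate engaged users and reach per content type and compute the
# engagement rate as successes (engaged users) divided by trials (reach)
ER_summary <- facebook %>%
  group_by(Type) %>%
  summarise(
    Engagement = sum(`Lifetime Engaged Users`, na.rm = TRUE),
    Reach      = sum(`Lifetime Post Total Reach`, na.rm = TRUE),
    ER         = Engagement / Reach
  )

ER_summary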

Design the experiment with prior knowledge

Since frequentist statistics are highly data-dependent, we need to run a power analysis first to find out how many data points would be sufficient to capture the effect of interest. We will use the powerMediation package in R (Qiu, 2021) for our problem, since we will run a logistic regression later.

The function takes the following arguments:

  • baseline value: the success rate of the current control condition
  • desired value: the expected success rate of the test condition
  • proportion of the data from the test condition (ideally 0.5)
  • significance threshold (alpha): the level below which we consider an effect statistically significant (generally 0.05)
  • power (1 - beta): the probability of correctly rejecting the null hypothesis (generally 0.8)

According to analysis results published by Hootsuite, the average engagement rate for social media content on Facebook amounts to 0.06%. This engagement rate was calculated relative to the number of followers rather than the number of people who have seen the post. Nevertheless, we use it as the benchmark for our baseline condition (the engagement rate for photos). Let’s assume that videos engage 50% more users than photos, which would result in an engagement rate of 0.09% (the desired value). Previously, we calculated the reach of the content per type, so we know the number of observations that the logistic regression will be based on later. Unfortunately, we do not have equal group sizes; the proportion of the data in the test condition is only about 0.06 (6%). The significance level and power are set to the default values.
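A sketch of this calculation with powerMediation’s SSizeLogisticBin(); treating the benchmark rates as the proportions 0.06 and 0.09 and setting the test-group share to B = 0.06 are assumptions based on the values discussed above:

library(powerMediation)

# sample size for a logistic regression with a binary predictor
total_n <- SSizeLogisticBin(
  p1    = 0.06,   # baseline engagement rate (photos)
  p2    = 0.09,   # desired engagement rate (videos, i.e. +50%)
  B     = 0.06,   # share of observations in the test (video) condition
  alpha = 0.05,   # significance threshold
  power = 0.8     # 1 - beta
)
total_n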

According to the power analysis, we will need a total sample size of 9,947 unique users who have been reached. Theoretically, this would keep the probability of correctly rejecting the null hypothesis at 80%. We have a total of 5,955,149 users who were reached across the posts of both types, which far exceeds the minimum sample size. So, we are good!

Now let’s specify our prior for the Bayesian A/B test. As stated before, we assume that video content engages 50% more users than photo content. Assuming that this 50% benefit corresponds to the prior median, the expectation translates into a median absolute risk (i.e., difference of success probabilities) of 0.03. We construct a 95% uncertainty interval ranging from 0.01 (almost no difference) to 0.12 (twice as many successes).
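One way to encode this judgement with abtest’s elicit_prior(), assuming the three values map onto the 2.5%, 50% and 97.5% quantiles of the absolute risk:

library(abtest)

# translate quantiles of the absolute risk (difference in success
# probabilities) into prior parameters for the test-relevant log odds ratio
prior_par <- elicit_prior(
  q    = c(0.01, 0.03, 0.12),
  prob = c(0.025, 0.5, 0.975),
  what = "arisk"
)

# visual sanity check of the implied prior
plot_prior(prior_par, what = "arisk")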

Test your hypothesis and update your knowledge

For our frequentist approach to hypothesis testing, we would like to run a logistic regression. But because the dataset as we have loaded it does not contain a binary outcome variable (engaged YES/NO), we first need to create one. The rbinom() function that we will use requires us to know the proportion of successes (1 instead of 0) for each content type, so we can use the engagement summary that we created previously.

> ER_summary
# A tibble: 4 x 4
Type Engagement Reach ER
<fct> <int> <int> <dbl>
1 Link 7542 407981 0.0185
2 Photo 348871 5596709 0.0623
3 Status 91810 588550 0.156
4 Video 11949 358440 0.0333

Huh? It looks as if videos actually resulted in a lower engagement rate than photos. Let’s plug these values into the rbinom() function to generate two objects that represent these distributions. Then we add a character string to each of them to keep track of the content type and merge both objects into one large dataframe.

We use the glm() (generalized linear model) function and set the family argument to binomial to run a logistic regression. To get a cleaner model output, we use the tidy() function from the broom package.
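A sketch of the simulation and modelling steps, using the counts and rates from ER_summary (the seed is arbitrary):

library(broom)

set.seed(42)

# one binary outcome per reached user, with the observed engagement rates
# as success probabilities
photo <- rbinom(n = 5596709, size = 1, prob = 0.0623)
video <- rbinom(n = 358440,  size = 1, prob = 0.0333)

ab_data <- data.frame(
  Type    = factor(c(rep("Photo", length(photo)), rep("Video", length(video)))),
  engaged = c(photo, video)
)

# logistic regression: does content type predict the odds of engagement?
model <- glm(engaged ~ Type, data = ab_data, family = "binomial")
tidy(model)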

# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.75 0.00178 -1546. 0
2 TypeVideo -0.703 0.00986 -71.3 0

We can see that exposure to videos makes the probability of success (user engagement) lower compared to exposure to photos. Specifically, the log odds of a user engaging with the content significantly decrease by 0.7 compared to photos. By taking the exponential, we can convert the log odds back to a more interpretable value: the odds ratio. The odds of success are defined as the ratio of the probability of success over the probability of failure. We get an odds ratio of 0.5 to 1, which indicates that, compared to photos, video content decreases the odds of user engagement by a factor of about 0.5. By comparison, if the probability of success were distributed equally across both conditions, the odds would be 1 to 1 (odds ratio of 1).
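Using the hypothetical model object from the sketch above, the back-transformation is a one-liner:

# convert the log odds coefficient for video content into an odds ratio
exp(coef(model)["TypeVideo"])   # roughly 0.5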

The fact that our p-value falls below .05 means that the null hypothesis — photo and video content being equally successful — seems a bit at odds with the data. Not more, not less. It does not say anything about whether or not the alternative hypothesis (e.g., a difference between the versions) is true in light of the data. Therefore, let’s turn to the Bayesian A/B test to answer this question.

We know from our ER_summary that the observed success probabilities are p1 = .062 for photo and p2 = .033 for video content. Consequently, the observed success probabilities suggest a negative effect of exposure to videos of about 3 percentage points. To find out whether this is statistically compelling, the Bayesian A/B test from the abtest package is used:
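A sketch of the call, assuming the counts from ER_summary (photos as group 1, videos as group 2) and the prior parameters elicited above; the prior probabilities are passed in the package’s hypothesis order (H1, H+, H-, H0):

library(abtest)

# successes (engaged users) and trials (reach) per condition
counts <- list(
  y1 = 348871, n1 = 5596709,   # photos
  y2 = 11949,  n2 = 358440     # videos
)

# give some prior weight to a positive effect (0.4), less to a negative
# effect (0.1) and half to the null hypothesis
ab <- ab_test(
  data       = counts,
  prior_par  = prior_par,
  prior_prob = c("H1" = 0, "H+" = 0.4, "H-" = 0.1, "H0" = 0.5)
)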

> print(ab)
Bayesian A/B Test Results:

Bayes Factors:
BF10: Inf
BF+0: 0.0001890052
BF-0: Inf

Prior Probabilities Hypotheses:
H+: 0.4
H-: 0.1
H0: 0.5

Posterior Probabilities Hypotheses:
H+: 0
H-: 1
H0: 0

The first part of the output presents Bayes factors in favour of the hypotheses H1, H+ and H-, where the reference hypothesis (i.e., the denominator of the Bayes factor) is H0. The Bayes factor quantifies the degree to which credibility is reallocated from the prior towards the posterior distribution; in other words, it tells us how much we need to update our beliefs in light of the new data. Specifically, the Bayes factor is the relative evidence in the data favouring one of two hypotheses: the degree to which one hypothesis predicts the observed data better than the other. Please note that the Bayes factor is not the plausibility of one hypothesis over the other after seeing the data (this would be the posterior odds).

Since the Bayes factor for the hypothesis of a positive effect (BF+0) is much smaller than 1, it indicates evidence in favour of the null hypothesis relative to a positive effect. But we get a more nuanced picture here: the infinitely large Bayes factor for H- indicates overwhelming evidence against the null hypothesis once we allow for a negative effect of video content on user engagement. The next part of the output displays the prior probabilities of the hypotheses: previously, we assumed that the presence of an effect and the null hypothesis are equally likely (0.5 each), but within the alternative we gave a positive effect more credibility (0.4 vs. 0.1) based on our prior knowledge.

A very clear picture emerges if we look at the posterior probabilities: in light of the data, we have to completely revise our assessment of H-, since the plausibility of a negative effect has increased from 0.1 to 1. Conversely, both H+ and H0 appear very unlikely given the data.

Estimate magnitude of evidence

Our logistic regression does not allow any inference about which success probabilities are plausible for each of the content types; it only fits the model to our sample data and leaves us with a point estimate. We can only describe the observed engagement rates and try to eyeball the actual difference between the two types. Let’s plot the data to get a rough impression:
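A quick sketch with ggplot2; the column names follow the raw dataset and are assumptions:

library(dplyr)
library(ggplot2)

# per-post engagement rates for the two content types of interest
facebook %>%
  filter(Type %in% c("Photo", "Video")) %>%
  mutate(post_ER = `Lifetime Engaged Users` / `Lifetime Post Total Reach`) %>%
  ggplot(aes(x = Type, y = post_ER, colour = Type)) +
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(x = "Content type", y = "Engagement rate per post")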

Okay, we can see the difference between the content types across different posts and get an actual feeling that complements the numbers we already know. We can also see directly that the video content is backed by far fewer data points than the photo content.

Nevertheless, these are only descriptive statistics. In contrast, we can actually infer the posterior success probabilities with our Bayesian framework and thus make a statement about the magnitude of evidence and direction of the effect:
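With the fitted ab object, abtest can plot the posterior success probabilities of both conditions directly (a sketch):

# posterior distributions of p1 (photos) and p2 (videos)
plot_posterior(ab, what = "p1p2")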

In our example, p1 and p2 correspond to the probability of a user engaging with the content for photos and videos, respectively. The graph indicates that the posterior median for p1 (photos) is 0.062, with a 95% credible interval ranging from 0.062 to 0.063, and the posterior median for p2 (videos) is 0.033, with a 95% credible interval ranging from 0.033 to 0.034. We can also see that the credible interval for p2 (videos) is actually wider, which is not reflected in the bare numbers because they are rounded to three decimal places by default. A reason for this could be that the estimate is more uncertain when it is based on relatively few data points compared to the baseline.

In sum, our data provide strong evidence in favour of the alternative hypothesis that there is a negative effect of videos on user engagement when compared to photos. In addition, we have strong evidence against the null hypothesis that there is no effect between the two content types as well as against the hypothesis of a positive effect.

Look at the emergence of evidence

For our frequentist analysis, there is no way to watch how the evidence actually emerges across trials, because null-hypothesis significance testing assumes that we complete data collection before analysing any results. But what we can do is alter the sampling plan and allow for interim analyses to save resources and prevent p-hacking while still preserving our (in)tolerance for false alarms. We thus need to define certain stopping rules to run a sequential analysis. This is a procedure in which a statistical test (e.g., a logistic regression) is conducted repeatedly over time as the data are collected. After each observation, the cumulative data are analysed and one of the following three decisions is taken:

  • Stop the data collection, reject the null hypothesis and claim “statistical significance” (e.g., data are compelling).
  • Stop the data collection, do not reject the null hypothesis and state that the results are not statistically significant (e.g., data are inconclusive).
  • Continue the data collection, since as yet the cumulated data are still inadequate to draw a conclusion.

We can use the gsDesign package (Anderson, 2022) to find out how we need to adjust the threshold that indicates a significant p-value and at which points of the data collection we can peek at the results.
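A sketch of such a design; the Pocock-type spending function is an assumption about the exact boundary used here:

library(gsDesign)

seq_design <- gsDesign(
  k         = 3,         # three analyses in total
  test.type = 1,         # one-sided test
  alpha     = 0.05,
  beta      = 0.2,
  sfu       = "Pocock"   # assumption: Pocock-type error spending
)

# cumulative user counts at which we may look at the data
max_n <- 5955149
ceiling(max_n * seq_design$timing)

# nominal one-sided p-value threshold at each look
pnorm(-seq_design$upper$bound)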

Given that we would like to analyse the data at three points and run a one-sided significance test, we should stop after 1,985,050, 3,970,099 and 5,955,149 total users have seen the content. Also, instead of the more liberal significance threshold of p < .05, we need to adjust the alpha level to 0.0232 to account for the fact that we look at the data in between.

Now let us see how the evidence for and against each hypothesis evolves as the data come in for the Bayesian framework.
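To do this, ab_test() needs the counts as cumulative vectors rather than single totals. A hypothetical reconstruction, ordering posts by the available time columns as a proxy for posting date (column names and the ordering are assumptions):

library(abtest)
library(dplyr)

seq_counts <- facebook %>%
  filter(Type %in% c("Photo", "Video")) %>%
  arrange(`Post Month`, `Post Weekday`, `Post Hour`) %>%
  mutate(
    y1 = cumsum(ifelse(Type == "Photo", `Lifetime Engaged Users`, 0)),
    n1 = cumsum(ifelse(Type == "Photo", `Lifetime Post Total Reach`, 0)),
    y2 = cumsum(ifelse(Type == "Video", `Lifetime Engaged Users`, 0)),
    n2 = cumsum(ifelse(Type == "Video", `Lifetime Post Total Reach`, 0))
  ) %>%
  filter(n1 > 0, n2 > 0)   # ab_test() needs observations in both groups

ab_seq <- ab_test(
  data       = with(seq_counts, list(y1 = y1, n1 = n1, y2 = y2, n2 = n2)),
  prior_par  = prior_par,
  prior_prob = c("H1" = 0, "H+" = 0.4, "H-" = 0.1, "H0" = 0.5)
)

# evidence trajectory and probability wheels for prior and posterior
plot_sequential(ab_seq)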

This tracks the evidence for each hypothesis in chronological order. After about 300,000 users have seen the respective content, the evidence for H- exceeds the evidence for H+ and H0. After about 600,000 trials, the evidence for H- becomes very strong, so we could probably have stopped the data collection here. In a real-life marketing campaign, this would save us from advertising the content to 400,000 additional users.

The posterior probabilities of the hypotheses are also shown as a probability wheel at the top of the figure. As for the prior probabilities, the green area visualizes the posterior probability of the alternative hypotheses and the grey area visualizes the posterior probability of the null hypothesis. The data have increased the plausibility of H- from 0.1 to almost 1, while the posterior plausibility of both the null hypothesis and H+ has correspondingly decreased to almost 0.

A or B? Think again.

Limitations

Before wrapping up, let’s face the limitations of our case study on user engagement. Firstly, we have no randomized controlled experiment, in the sense that we cannot know for sure that a) users only saw either photo or video content, b) both referred to exactly the same product or inspiration, and c) the groups were equally split among users. Instead, we have highly unbalanced group sizes, and video content makes up only a fraction of the observations available for photos. So the question remains whether videos are actually a poor choice for marketing content on social media. We also do not know whether the same unique users may have seen both photos and videos, which would give us a within-subject design that should be modelled accordingly. If this were a real data-science project, the company should be encouraged to collect more data on video posts to arrive at a fairer comparison before deciding to focus on different types of content.

Bayesian A/B testing tells us more about user engagement

Nevertheless, we have a good comparison between the two frameworks of A/B testing: The Frequentist approach led us to reject the null hypothesis and the sample estimates pointed towards a negative effect of video content on user engagement. But we have no way to know if the data support the alternative hypothesis — in fact, p-values cannot distinguish between absence of evidence (i.e., data are too messy) and evidence of absence (i.e., no effect) (Keysers et al., 2020; Robinson, 2019). This however is what we are really interested in, right?

The Bayesian A/B test provides clear evidence for a negative effect, while the two other hypotheses seem implausible in light of the data. Interestingly, both approaches allowed us to cap the amount of data needed to capture the effect, even if the requirements differ: with the Bayesian framework, we could nicely monitor the evidence as the data accumulated and could have stopped our campaign after 600,000 views. For the frequentist framework, the power analysis gave us a minimum sample size, but keeping an eye on the data along the way would have come at a cost: we would have needed to correct for multiple looks by adjusting the significance threshold. Finally, without the Bayesian framework we would not have been able to say anything about typical user engagement values. To be perfectly honest with you, the Bayesian framework does demand more work from the analyst, but in return we become better experts on the problem itself. The prior specification required me to engage with user engagement and social media metrics in much more depth than the traditional approach would have.

Since I am a behavioural scientist and not a marketer, I had to do some research to find out which hypotheses would be reasonable and which parameter values would be plausible in practice. I have learnt much more about user engagement this way compared to simply running the logistic regression. You probably love the computational part of data science and so do I. Nevertheless, this should not distract us from the actual problem we want to solve.

At the end of the journey, we have a very targeted and convenient testing procedure. I am convinced that the Bayesian framework includes the tools we need to customize our hypothesis testing and become even better experts in our field.

References

[1] F. Portman, bayesAB: Fast Bayesian methods for A/B testing (2017), R package.

[2] Q. F. Gronau, abtest: Bayesian A/B Testing (2019), R package.

[3] T. Hoffmann, A. Hofman & E. J. Wagenmakers, A Tutorial on Bayesian Inference for the A/B Test with R and JASP (2021), Psyarxiv.

[4] S. Moro, P. Rita & B. Vala, Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. (2016), Journal of Business Research, 69(9), 3341–3351.

[5] K. Müller & J. Bryan, here: A Simpler Way to Find Your Files (2020), R package.

[6] M. Dowle, data.table: Extension of ‘data.frame’ (2021), R package.

[7] W. Qiu, powerMediation: Power/Sample Size Calculation for Mediation Analysis (2021), R package.

[8] R. Morey, “What is a Bayes factor?” (2014), BayesFactor blog.

[9] K. Anderson, gsDesign: Group Sequential Design (2022), R package.

[10] C. Keysers, V. Gazzola, & E.-J. Wagenmakers, Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence (2020), Nature Neuroscience, 23, 788–799.

[11] G. K. Robinson, What properties might statistical inferences reasonably be expected to have? — crisis and resolution in statistical inference (2019), The American Statistician, 73, 243–252.
