
My Easy Guide to Pre vs. Post Treatment Tests

A quick introduction to Before and After Tests with code.

Photo by Towfiqu barbhuiya on Unsplash

Introduction

I will start this post by saying that A/B Testing was never a strong skill of mine. "Ok, but then why are you writing a post about it?", you might think.

The fact is that I had to study and learn as much as possible about the subject to find a way to perform those tests quickly and smoothly at work. For many months I avoided these tests because I could not find any straightforward content that immediately clicked with my understanding.

I have read many conceptual posts about A/B Testing, some very much aimed at marketing professionals and thus mostly covering the tasks of getting the sample size and the test period right. But when it comes to the _Before and After Test_ (also referred to as the Pre-Post Test, which we could say is a type of A/B Testing), the knowledge is even harder to find.

So, this post is hopefully a good introduction to the Before and After Test (or Pre-Post Test) for anyone seeking an easy-to-follow tutorial on performing this type of test using Python.

Let’s dive into it.

The Concept

A Before and After Test is like an A/B Test. It compares two different sets of data to determine whether they are different and how large that difference is. The Before and After Test simply adds a time component: it compares how the Test and Control sets performed before and after a treatment is applied to the Test set.

The Before and After Test compares the state of something before an intervention and after it.

I think one of the reasons there aren’t too many posts about Before and After Testing is that it is not always trivial to design this test and prepare the data for it.

First, we need to split our data into Control and Treatment samples. The Control sample consists of the observations that won't change anything: they just keep doing what they do for the whole period of the test. The Treatment group, on the other hand, is the one receiving the intervention.

Let’s illustrate with an example: a grocery store chain observes a sales spike in some coffee brands and wants to test whether doubling the shelf facings of these best-performing brands increases sales. To make that happen, they can select some stores at random as a treatment group and make that change (a minimal sketch of such a random split appears after the summary below).

After the intervention has run for some time, the business will assess the results of both groups to determine whether the new design is performing better.

Summarizing:

  • Control Group: stores with no change in the coffee section
  • Test Group: stores with the redesigned coffee section
  • Pre-Period: the period before the intervention. Its length must be chosen taking into consideration the seasonality of the business and any other aspects that can affect the results, like sales, promotions, holidays, and weekends.
  • Post-Period: the period after the intervention. The same considerations apply when determining its length.
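Below is a minimal sketch of how such a random split could be made, assuming a hypothetical list of 1,000 store IDs (an illustration only, not the code used for the dataset in this post):

import numpy as np

# Hypothetical store IDs
store_ids = np.arange(1, 1001)

# Shuffle the stores and split them in half: one half receives the intervention
rng = np.random.default_rng(42)
shuffled = rng.permutation(store_ids)

treatment_stores = shuffled[:500]  # stores that get the redesigned coffee section
control_stores = shuffled[500:]    # stores that keep the current layout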

Dataset

In our case, we will create a dataset that simulates a Pre-Post intervention scenario.

Let’s import some modules.

import numpy as np
import pandas as pd
import scipy.stats as scs
import matplotlib.pyplot as plt
import seaborn as sns
import pingouin as pg

Next, we create the dataset seen in figure 1 below. The code I used to create it can be found in this GitHub repository, together with the entire code for the post; a simplified sketch of how such data could be simulated also follows the variable list below.

Variables

  • dt: Date (from 2024-01-01 to 2024-01-30)
  • store_id: 1 to 1000
  • group: Control and Treatment
  • sales: sales by store by day.
Fig 1: Dataset created. Image by the author.
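For reference, here is a minimal sketch of how a similar dataset could be simulated with the modules imported above (the baseline sales level, the noise, and the size of the lift are assumptions for illustration; the exact code is in the repository):

# Illustrative simulation of the Pre-Post dataset (assumed parameters)
rng = np.random.default_rng(12)
dates = pd.date_range('2024-01-01', '2024-01-30', freq='D')
stores = np.arange(1, 1001)
store_group = rng.choice(['Control', 'Treatment'], size=stores.size)

df = pd.DataFrame([(d, s, g) for s, g in zip(stores, store_group) for d in dates],
                  columns=['dt', 'store_id', 'group'])

# Baseline sales plus an assumed lift for Treatment stores after 2024-01-15
lift = np.where((df['group'] == 'Treatment') & (df['dt'] > '2024-01-15'), 3.5, 0)
df['sales'] = rng.normal(loc=50, scale=5, size=len(df)) + lift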

Let’s check how many stores we have in each group.

# Check how many stores in each group
df.groupby('group').store_id.nunique()

---
[OUT]
group
Control      503
Treatment    497
Name: store_id, dtype: int64

Nice. Almost a 50/50 split. Next, let’s see if this is enough for our test.

Size of the Sample

Performing a Before and After Test, just like an A/B Test, requires that we check whether the result can be generalized to a larger population and did not happen just by chance. Therefore, it is important to check the sample size needed to capture the desired effect.

For each kind of test, there is a calculation that can be done for the size of the sample. Such a calculation is based on the Power of the test. Now, to explain what Power is, I need to step back a little and remind us of a couple of other concepts.

A Test of Means in Statistics is nothing more than a Hypothesis Test. Every Hypothesis Test works with a null hypothesis Ho (everything stays as is) and an alternative hypothesis Ha (something has changed). If I can’t show that the alternative is valid, I keep accepting what I have as truth. Failing to reject a false null hypothesis is a Type II error (a false negative); rejecting a true null hypothesis, that is, accepting a false alternative, is a Type I error (a false positive).

Coming back to our subject, for the Before and After Test, Ho says that the intervention is not effective, while Ha says that it is. As we are testing the effectiveness of the treatment, two things matter: the significance level (alpha) limits the chance of declaring an effect that is not really there, and the Power tells us how likely we are to detect an effect that really exists.

Power is calculated as [1 – P(Type II error)]. So, I am removing the probability of missing a real effect.

Translating it to words, I am removing from my whole probability space (100%) the probability that I will fail to detect a true treatment effect.

The code below calculates the number of observations we need in each group to detect an effect size of 0.2 (a small effect, by Cohen's convention) with 80% power.

from statsmodels.stats.power import TTestIndPower

# Parameter for the power analysis
effect = 0.2 # effect size must be positive
alpha = 0.05
power = 0.8

# Perform power analysis
pwr = TTestIndPower()

result = pwr.solve_power(effect, power = power, nobs1= None, 
                         ratio = 1, alpha=alpha)

print(result)

---
[OUT]

393.4056989990335

Since we have around 500 stores in each group, we’re covered.
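As a sanity check, the same solve_power call can be turned around: fixing nobs1 at roughly 500 stores per group and leaving power as None estimates the power we actually achieve with our sample (it should come out above the 0.8 target, since power only grows with the sample size):

# Solve for power instead of sample size
achieved_power = pwr.solve_power(effect_size=0.2, nobs1=500,
                                 ratio=1, alpha=0.05, power=None)
print(achieved_power)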

Preparing the Data

Let’s prepare the data for input in our test.

First, we will set the cut between the Pre and Post periods at 2024-01-15. This is arbitrary here, just for the sake of the example. Next, we group the data by store_id, group, and before/after, taking the mean of daily sales. Then we pivot the values and add a column with the difference between the Post and Pre periods.

# split between Pre and Post periods
df['after'] = np.where(df['dt'] > '2024-01-15', 1, 0)

# pre_post data
df_pre_post = (df # dataset
               .groupby(['store_id','group','after']) # groupings
               .sales.mean() # calculate sales means
               .reset_index()
               .pivot(index=['store_id', 'group'], columns='after', values='sales') # pivot the data to put pre and post in columns
               .reset_index()
               .rename(columns={0:'pre', 1:'post'}) # rename
               )

# create col difference post-pre
df_pre_post = df_pre_post.assign(dif_pp= df_pre_post.post - df_pre_post.pre) 

Here is the resulting table.

Fig 2: Data ready for Test input. Image by the author.

Every time we perform a Before and After Test, this is more or less how the input table should look.

We can start getting some insights, if we want.

# Pre and Post periods means
df_pre_post.groupby('group')[['pre', 'post']].mean()
Pre vs Post. Image by the author.

We can see that the Treatment group has a higher increase in the Post period. If we want to visualize that, here are two plots that I like.

# Visualization
plot = df.assign(after= np.where(df['dt'] > '2024-01-15', 1, 0))
sns.stripplot(x='group', y='sales', hue= 'after', dodge= True, data=plot);

# BoxenPlot
sns.boxenplot(x='group', y='sales', hue= 'after', data=plot);
The Treatment group shows a Post period with a higher median and a more compact distribution. Image by the author.

Once again, we can notice that the Control group keeps practically the same distribution, while the Treatment group shows a higher median and a distribution more concentrated around it. Additionally, the boxenplots suggest that the distributions are close to normal.

With that, we close this section. Let’s go to the Test itself.

Pre-Post Test

Levene and T-Test

For this whole exercise, our significance level is 5%.

We start with Levene’s Test, which checks whether the samples have equal variances. This result will be important to determine the next step. Let's run it with pingouin.

# Checking for equal variances of the samples (Pre and Post)
# This can be done with Levene's test.
pg.homoscedasticity(df_pre_post, dv='dif_pp', 
                               group='group', 
                               method='levene', alpha=0.05)

---
[OUT]:
         W     pval      equal_var
levene 18.701  0.000017  False

Next, we should check if the difference found between the groups means is statistically significant. For a Pre-Post test with only two groups, we can use:

  • Paired two-sample t-test if the distributions are normal
  • Wilcoxon signed-rank test if they are not

Let’s test the data for normality.

# Let's test for normality of the distributions of Pre and Post periods.
# Ho = Data is normally distributed

print(scs.shapiro(df_pre_post.query('group == "Control"')['pre']))
[OUT]: ShapiroResult(statistic=0.9967775344848633, pvalue=0.41802579164505005)

print(scs.shapiro(df_pre_post.query('group == "Control"')['post']))
[OUT]:ShapiroResult(statistic=0.9979024529457092, pvalue=0.7954561114311218)

print(scs.shapiro(df_pre_post.query('group == "Treatment"')['pre']))
[OUT]:ShapiroResult(statistic=0.9956731796264648, pvalue=0.1869998276233673)

print(scs.shapiro(df_pre_post.query('group == "Treatment"')['post']))
[OUT]:ShapiroResult(statistic=0.9935430884361267, pvalue=0.032079797238111496)

For the last one – the Post period of the Treatment group – the p-value is 0.03, which is below our alpha of 0.05, so we cannot consider that sample normally distributed. We can use the Wilcoxon test for that one to check for a difference.

# Paired 2 sample test
# This is a test for the null hypothesis that two related or repeated samples have identical average (expected) values.
print(scs.ttest_rel(df_pre_post.query('group == "Control"')['pre'], df_pre_post.query('group == "Control"')['post']))
print(scs.ttest_rel(df_pre_post.query('group == "Treatment"')['pre'], df_pre_post.query('group == "Treatment"')['post']))

# Running Wilcoxon for the Not Normal sample
print(scs.wilcoxon(df_pre_post.query('group == "Treatment"')['pre'], df_pre_post.query('group == "Treatment"')['post'])) 

TtestResult(statistic=-0.9901946209840918, pvalue=0.3225560292205154, df=502)
TtestResult(statistic=-14.576067836748276, pvalue=2.5969980216983843e-40, df=496)
WilcoxonResult(statistic=21562.0, pvalue=2.544753810592371e-36)

As you can see, we also ran the t-test on the non-normal sample; since it was so close to normality, the result is very similar.

Results:

  • The difference between the Pre and Post means of the Control group is not statistically significant (p-value = 0.32)
  • The difference between the Pre and Post means of the Treatment group is statistically significant (p-value < 0.00001)

Games-Howell

Now that we know that there is a difference and it is statistically significant, let’s quantify it.

We can use Tukey’s test when the variances are equal, or the Games-Howell test when they are not (our case, according to Levene's test).

  • Difference in Differences (DiD): since there is a difference between Test and Control and also a difference between Before and After, to isolate the real effect of the intervention we first calculate the Before vs. After difference for each store (our dif_pp column) and then compare that difference between the two samples, Control and Treatment.
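To make the DiD idea concrete with made-up numbers: if a Control store averages 50 dollars in sales before and 51 after (a difference of +1), while a Treatment store averages 50 before and 54.65 after (+4.65), the difference in differences is 4.65 – 1 = 3.65 dollars. That per-store Before vs. After difference is exactly the dif_pp column we built, and the Games-Howell test below compares its mean between the two groups.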
# Performing Games-Howell Test
pg.pairwise_gameshowell(data=df_pre_post, dv='dif_pp',
                        between='group').round(2)
Before and After Test result. Image by the author.

And there it is, the final result. It is statistically significant, since the p-value is below our threshold of 5%. The difference between the group means is $3.65, and the effect size is at least medium (Hedges' g > 0.5).
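If you want to double-check where that effect size comes from, one way (a sketch using the pingouin package we already imported) is to compute Hedges' g directly from the dif_pp values of the two groups; up to the sign convention, it should be close to the hedges value reported by the Games-Howell test:

# Hedges' g for the post-pre difference: Treatment vs Control
g = pg.compute_effsize(df_pre_post.query('group == "Treatment"')['dif_pp'],
                       df_pre_post.query('group == "Control"')['dif_pp'],
                       paired=False, eftype='hedges')
print(g)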

Confidence Interval

Photo by Dose Juice on Unsplash

We have two ways to check the confidence interval of our test. The easiest one uses scipy: we take the diff and se values from the Games-Howell result and calculate the interval at 95% confidence. Here, we observe that the difference between the group means can be anywhere between 2.80 and 4.50 dollars.

# create 95% confidence interval for the Test 
scs.norm(loc=3.65, scale=0.43).interval(confidence= 0.95)

(2.8072154866477765, 4.492784513352223)

Another way of doing this is to simulate the data many times to make sure our result is robust. We can bootstrap the data N times: take a sample with replacement, calculate the group means, and append the result to a data frame. This allows us to actually see the distribution of the differences between means.

I am using samples of 400 because that is approximately the required sample size (394) calculated earlier.

# Let's simulate 5000 times a Control versus a Test sample. 

# We can Bootstrap
boot_means = []

# Simulation N times
N = 5_000
for i in range(N):
  # take 400 stores from each group
  control = df_pre_post.query('group == "Control"').sample(n=400, replace=True)
  test =  df_pre_post.query('group == "Treatment"').sample(n=400, replace=True)
  final_data = pd.concat([control, test])

  # mean dif_pp for each group in this bootstrap sample
  boot_sample = (
      final_data
      .groupby('group')['dif_pp']
      .mean()
      )
  #append
  boot_means.append(boot_sample)

# To Dataframe
boot_means = pd.DataFrame(boot_means)

# kde plot
boot_means.plot(kind='kde')
plt.title('Difference of Means Control vs Test');
Result of the test. Image by the author.

The Treatment group is clearly performing better. We can look at the distribution of the differences as well.

# create a new column, test_control_diff: the difference between the two groups' bootstrap means
boot_means['test_control_diff'] = (boot_means['Treatment'] - boot_means['Control'])

# plot the bootstrap sample difference
ax = boot_means['test_control_diff'].plot(kind = 'kde')
ax.set_xlabel("$ diff in means")
plt.title('Distribution of the Difference in Means');
Distribution of means diff. Image by the author.

The difference between the groups is centered at approximately 3.6 dollars, as we saw earlier.

Finally, if we calculate the confidence interval of the distribution in the graphic above, the result is almost the same as the Games-Howell test.

# Mean and Std.
b_mu = boot_means.test_control_diff.mean()
b_std = boot_means.test_control_diff.std()

# create 95% confidence interval for the Test 
scs.norm(loc=b_mu, scale=b_std).interval(confidence= 0.95)

(2.7251897754505348, 4.587888504398052)

The results match what we expected: the difference between the group means lies between 2.72 and 4.58 dollars, at a 95% confidence level.
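As an alternative to the normal approximation, we could also read the interval straight from the bootstrap distribution with percentiles (a minimal sketch reusing the boot_means data frame built above); the result should be very similar:

# 95% percentile interval from the bootstrap differences
print(np.percentile(boot_means['test_control_diff'], [2.5, 97.5]))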

Another Way – Normal Distribution

There is yet another way to perform this test, using the normal distribution. As we saw in the simulations in the previous section, when we repeat the bootstrap process many times, the distribution of the means converges to a normal distribution, as per the Central Limit Theorem.

So, we can leverage that and skip the simulation: if we collect the mean and standard deviation of each sample and then generate a normal distribution from those parameters, we are doing essentially the same thing.

# Collecting Mean and Standard Deviation of the samples
sample_means = (
    df_pre_post
    .groupby('group')
    .agg({'dif_pp' : ['mean', 'std']})
    .reset_index()
)

# Rename cols
sample_means.columns = ['group', 'mean', 'std']

# Get Mean and Std
mean_control = sample_means.query('group == "Control"')['mean'].values[0]
std_control = sample_means.query('group == "Control"')['std'].values[0]

mean_treatment = sample_means.query('group == "Treatment"')['mean'].values[0]
std_treatment = sample_means.query('group == "Treatment"')['std'].values[0]

N= 10_000
# Plot data
plot = pd.DataFrame({'group': np.repeat(['Control', 'Treatment'], N),
                     'vals': np.concatenate([np.random.normal(loc=mean_control, scale=std_control, size=N),
                                             np.random.normal(loc=mean_treatment, scale=std_treatment, size=N)])   })
# Plot
sns.kdeplot(data=plot, x='vals', hue='group');

The resulting graphic is shown below. Again, we see that the Treatment group is doing better.

Normal distribution of the groups. Image by the author.

Next, we will generate the normal distribution based on the differences in difference between samples. We collect:

  • The sample sizes
  • The difference between [Treatment Mean] -[ Control Mean]
  • And the standard error of the difference between means, calculated as SE = sqrt(std_control² / nA + std_treatment² / nB)
N= 10_000
# Plot parameters
nA = df_pre_post.query('group == "Control"').shape[0]
nB = df_pre_post.query('group == "Treatment"').shape[0]
group_means = df_pre_post.groupby('group').dif_pp.mean()
dif_mu = group_means.loc['Treatment'] - group_means.loc['Control'] # [mean diff Treatment] - [mean diff Control]
dif_std = np.sqrt( ((std_control**2) / nA ) + ( (std_treatment**2) / nB) )

# Plot
sns.kdeplot( np.random.normal(loc= dif_mu,
                              scale= dif_std,
                              size=N) );

print('95% Confidence Interval')
scs.norm(loc= dif_mu, scale= dif_std).interval(confidence= 0.95)
95% confidence interval of Before and After Test. Image by the author.

Before You Go

This post is meant to be an introductory tutorial to Before and After Testing. I hope that you can get the most out of it, and start creating your own tests.

In the dataset I created, the difference was about 3 dollars, and I was able to confirm that difference with the statistical tests, indicating that I may be on the right track in learning how to perform a solid test.

This tool has many applications, and not only for online businesses as one might think: it works for anything from the grocery business to the health industry, security, and many more.

All you need is a correct split of the groups, a time frame for capturing the data, and the sample size required to detect the effect. That's it.

If you liked this content, follow me for more. I am also on LinkedIn. Let’s connect.

Gustavo Santos – Medium

Here is the link to the GitHub repository with all the code.

GitHub – gurezende/Before-and-After-Testing: How to Perform a Before and After Stats test.

References

Power Analysis, Statistical Significance, & Effect Size

Before-After A/B Testing Statistics for Marketing Analytics With Both R and Python

scipy.stats.ttest_rel – SciPy v1.13.1 Manual


A Practical Guide To A/B Tests in Python

Paired t-Test

Guidance for Pre – Post Tests.

