Practical Computer Simulations for Product Analysts

Part 2: Using bootstrap for observations and A/B tests

Mariya Mansurova
Towards Data Science


Image by DALL-E 3

In the first part of this series, we discussed the basic ideas of computer simulations and how you can leverage them to answer "what-if" questions. It's impossible to talk about simulations without mentioning bootstrap.

Bootstrap in statistics is a practical computer method for estimating the statistics of probability distributions. It is based on the repeated generation of samples using the Monte Carlo method from an existing sample. This method allows for simple and fast estimation of various statistics (such as confidence intervals, variance, correlation, etc.) for complex models.

When I learned about bootstrap in a statistics course, it felt a bit hacky. Instead of memorising multiple formulas and criteria for different cases, you can just write a couple of lines of code and get confidence interval estimations for any custom and complicated use case. It sounds like magic.

And it really is. Now, when even your laptop can run thousands of simulations in minutes or even seconds, bootstrap is a powerful tool in your analytical toolkit that can help you in many situations. So, I believe that it's worth learning or refreshing your knowledge about it.

In this article, we will talk about the idea behind bootstrap, understand when you should use it, learn how to get confidence intervals for different metrics and analyse the results of A/B tests.

What is bootstrap?

Actually, bootstrap is exceptionally straightforward. We need to run simulations drawing elements from our sample distribution with replacement, and then we can make conclusions based on this distribution.

Let's look at a simple example where we have four elements: 1, 2, 3 and 4. Then, we can simulate many other collections of 4 elements, where each element might be 1, 2, 3 or 4 with equal probability, and use these simulations to understand, for example, how the mean value might change.
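In plain NumPy, a minimal sketch of this idea might look like the snippet below (the number of iterations is an arbitrary choice):

import numpy as np

sample = np.array([1, 2, 3, 4])
rng = np.random.default_rng(42)

# draw 10,000 bootstrap samples of the same size (with replacement) and record their means
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(10_000)]

# the spread of these means tells us how much the sample mean might vary
print(np.percentile(boot_means, [2.5, 97.5]))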

The statistical idea behind bootstrap is that we treat the actual population as having exactly the same distribution as our sample (or as consisting of an infinite number of copies of our sample). Then, we act as if we know the general population and use it to understand the variability in our data.

Usually, when using a classical statistical approach, we assume that our variable follows some known distribution (for example, normal). However, with bootstrap, we don't need to make any assumptions about the nature of the distribution. It's pretty handy and helps to analyse even very complex custom metrics.

It's almost impossible to mess up the bootstrap estimations. So, in many cases, I would prefer it to the classical statistical methods. The only drawback is computational time. If you're working with big data, simulations might take hours, while you can get classical statistics estimations within seconds.

However, there are cases when it's pretty challenging to get estimations without bootstrap. Let's discuss the best use cases for bootstrap:

  • if you have outliers or influential points in your data;
  • if your sample is relatively small (roughly less than 100 cases);
  • if your data distribution is quite far from normal or any other theoretical distribution, for example, if it has several modes;
  • if you're working with custom metrics (for example, the share of cases closed within SLA or percentiles).

Bootstrap is a wonderful and powerful statistical concept. Let's try to use it for descriptive statistics.

Working with observational data

First, let's start with the observational data and work with a synthetic dataset. Imagine we are helping a fitness club to set up a new fitness program that will help clients prepare for the London Marathon. We got the first trial group of 12 customers and measured their results.

Here is the data we have.

We collected just three fields for each of the 12 customers:

  • races_before — the number of races the customer had participated in before our program,
  • kms_during_program — the kilometres the client ran during our program,
  • finished_marathon — whether the customer finished the London Marathon (i.e. whether the program was successful).

We aim to set up a goal-focused fair program that incentivises our clients to train with us more and achieve better results. So, we would like to return the money if the client has run at least 150 kilometres during the preparation but couldn't complete the marathon. However, before launching this program, we would like to make some estimations: what distance clients cover during preparation and the estimated share of refunds. We need it to ensure that our business is profitable and sustainable.

Estimating average

Let's start with estimating the average distance. We can try to leverage our knowledge of mathematical statistics and use formulas for confidence intervals.

To do so, we need to make an assumption about the distribution of this variable. The most commonly used is a normal distribution. Let's try it.

import numpy as np
from scipy.stats import norm, t

def get_normal_confidence_interval(data, confidence=0.95):
    # Calculate sample mean and standard deviation
    sample_mean = np.mean(data)
    sample_std = np.std(data, ddof=1)
    n = len(data)

    # Calculate the critical value (z) based on the confidence level
    z = norm.ppf((1 + confidence) / 2)

    # Calculate the margin of error using the standard error
    margin_of_error = z * sample_std / np.sqrt(n)

    # Calculate the confidence interval
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    return lower_bound, upper_bound

get_normal_confidence_interval(df.kms_during_program.values)
# (111.86, 260.55)

The other option often used with real-life data is the Student's t-distribution, which gives a broader confidence interval (since it assumes fatter tails than the normal distribution).

def get_ttest_confidence_interval(data, confidence=0.95):
    # Calculate sample mean and standard deviation
    sample_mean = np.mean(data)
    sample_std = np.std(data, ddof=1)
    n = len(data)

    # Calculate the critical value (t) based on the confidence level
    t_crit = t.ppf((1 + confidence) / 2, df=n - 1)

    # Calculate the margin of error using the standard error
    margin_of_error = t_crit * sample_std / np.sqrt(n)

    # Calculate the confidence interval
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    return lower_bound, upper_bound

get_ttest_confidence_interval(df.kms_during_program.values)
# (102.72, 269.69)

We have only a few examples in our sample. Also, there's an outlier: a client with 12 previous races who managed to run almost 600 km preparing for the marathon, while most other clients ran less than 200 km.

So, it's an excellent case to use the bootstrap technique to understand the distribution and confidence interval better.

Let's create a function to calculate and visualise the confidence interval:

  • We run num_batches simulations, drawing samples with replacement and calculating the average distance for each.
  • Then, based on these variables, we can get a 95% confidence interval: 2.5% and 97.5% percentiles of this distribution.
  • Finally, we can visualise the distribution on a chart.

import pandas as pd
import tqdm
import matplotlib.pyplot as plt

def get_kms_confidence_interval(num_batches, confidence = 0.95):
    # Running simulations
    tmp = []
    for i in tqdm.tqdm(range(num_batches)):
        tmp_df = df.sample(df.shape[0], replace = True)
        tmp.append(
            {
                'iteration': i,
                'mean_kms': tmp_df.kms_during_program.mean()
            }
        )

    # Saving data
    bootstrap_df = pd.DataFrame(tmp)

    # Calculating confidence interval
    lower_bound = bootstrap_df.mean_kms.quantile((1 - confidence)/2)
    upper_bound = bootstrap_df.mean_kms.quantile(1 - (1 - confidence)/2)

    # Creating a chart
    ax = bootstrap_df.mean_kms.hist(bins = 50, alpha = 0.6, color = 'purple')
    ax.set_title('Average kms during the program, iterations = %d' % num_batches)

    plt.axvline(x=lower_bound, color='navy', linestyle='--',
        label='lower bound = %.2f' % lower_bound)
    plt.axvline(x=upper_bound, color='navy', linestyle='--',
        label='upper bound = %.2f' % upper_bound)

    ax.annotate('CI lower bound: %.2f' % lower_bound,
        xy=(lower_bound, ax.get_ylim()[1]),
        xytext=(-10, -20),
        textcoords='offset points',
        ha='center', va='top',
        color='navy', rotation=90)
    ax.annotate('CI upper bound: %.2f' % upper_bound,
        xy=(upper_bound, ax.get_ylim()[1]),
        xytext=(-10, -20),
        textcoords='offset points',
        ha='center', va='top',
        color='navy', rotation=90)
    plt.xlim(ax.get_xlim()[0] - 20, ax.get_xlim()[1] + 20)
    plt.show()

Let's start with a small number of batches to see the first results quickly.

get_kms_confidence_interval(100)

With bootstrap, we got a slightly narrower confidence interval that is shifted to the right, which is in line with our actual distribution: (139.31, 297.99) vs (102.72, 269.69).

However, with 100 bootstrap simulations, the shape of the distribution is not very clear. Let's add more iterations. We can see that our distribution consists of multiple modes — corresponding to samples with one occurrence of the outlier, two occurrences, three, and so on.
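For example, we can simply rerun the same function with more iterations (the exact number below is an arbitrary choice):

get_kms_confidence_interval(10000)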

With more iterations, we can see more modes (since samples with many occurrences of the outlier are rarer), but all the confidence intervals are pretty close.

In the case of bootstrap, adding more iterations doesn't lead to overfitting (because each iteration is independent). I would think about it as increasing the resolution of your image.

Since our sample is small, running many simulations doesn't take much time. Even 1 million bootstrap iterations take around 1 minute.

Estimating custom metrics

As we discussed, bootstrap is handy when working with metrics that are not as straightforward as averages. For example, you might want to estimate the median or share of tasks closed within SLA.

You might even use bootstrap for something more unusual. Imagine you want to give customers discounts if your delivery is late: a 5% discount for a 15-minute delay, 10% for a 1-hour delay, and 20% for a 3-hour delay.

Getting a confidence interval for such cases theoretically using plain statistics might be challenging, so bootstrap will be extremely valuable.
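For illustration, here is a rough sketch of how you could bootstrap such a custom "average discount per order" metric; the delays_min array below is made up just for this example:

import numpy as np

rng = np.random.default_rng(0)
delays_min = rng.exponential(scale=30, size=500)  # made-up delivery delays in minutes

def avg_discount(delays):
    # 20% for 3+ hours, 10% for 1+ hour, 5% for 15+ minutes, otherwise no discount
    return np.select(
        [delays >= 180, delays >= 60, delays >= 15],
        [0.20, 0.10, 0.05],
        default=0.0
    ).mean()

boot_discounts = [
    avg_discount(rng.choice(delays_min, size=len(delays_min), replace=True))
    for _ in range(10_000)
]
print(np.percentile(boot_discounts, [2.5, 97.5]))  # 95% CI for the average discount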

Let's return to our running program and estimate the share of refunds (when a customer ran 150 km but didn't manage to finish the marathon). We will use a similar function but will calculate the refund share for each iteration instead of the mean value.

import tqdm
import matplotlib.pyplot as plt

def get_refund_share_confidence_interval(num_batches, confidence = 0.95):
    # Running simulations
    tmp = []
    for i in tqdm.tqdm(range(num_batches)):
        tmp_df = df.sample(df.shape[0], replace = True)
        tmp_df['refund'] = list(map(
            lambda kms, passed: 1 if (kms >= 150) and (passed == 0) else 0,
            tmp_df.kms_during_program,
            tmp_df.finished_marathon
        ))

        tmp.append(
            {
                'iteration': i,
                'refund_share': tmp_df.refund.mean()
            }
        )

    # Saving data
    bootstrap_df = pd.DataFrame(tmp)

    # Calculating confidence interval
    lower_bound = bootstrap_df.refund_share.quantile((1 - confidence)/2)
    upper_bound = bootstrap_df.refund_share.quantile(1 - (1 - confidence)/2)

    # Creating a chart
    ax = bootstrap_df.refund_share.hist(bins = 50, alpha = 0.6, color = 'purple')
    ax.set_title('Share of refunds, iterations = %d' % num_batches)
    plt.axvline(x=lower_bound, color='navy', linestyle='--',
        label='lower bound = %.2f' % lower_bound)
    plt.axvline(x=upper_bound, color='navy', linestyle='--',
        label='upper bound = %.2f' % upper_bound)
    ax.annotate('CI lower bound: %.2f' % lower_bound,
        xy=(lower_bound, ax.get_ylim()[1]),
        xytext=(-10, -20),
        textcoords='offset points',
        ha='center', va='top',
        color='navy', rotation=90)
    ax.annotate('CI upper bound: %.2f' % upper_bound,
        xy=(upper_bound, ax.get_ylim()[1]),
        xytext=(-10, -20),
        textcoords='offset points',
        ha='center', va='top',
        color='navy', rotation=90)
    plt.xlim(-0.1, 1)
    plt.show()
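Let's call this function; the number of iterations below is an arbitrary choice, since, as we've seen, more iterations only sharpen the picture:

get_refund_share_confidence_interval(10000)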

Even with only 12 examples, we got a confidence interval that is more than two times narrower. We can conclude with 95% confidence that less than 42% of customers will be eligible for a refund.

That's a good result with such a small amount of data. However, we can go even further and try to get an estimation of causal effects.

Estimation of effects

We have data about clients' previous races before this marathon, and we can check how this value correlates with the distance covered during the program. We can use bootstrap for this as well. We only need to add a linear regression step to our current process.

import statsmodels.formula.api as smf

def get_races_coef_confidence_interval(num_batches, confidence = 0.95):
    # Running simulations
    tmp = []
    for i in tqdm.tqdm(range(num_batches)):
        tmp_df = df.sample(df.shape[0], replace = True)
        # Linear regression model
        model = smf.ols('kms_during_program ~ races_before', data = tmp_df).fit()

        tmp.append(
            {
                'iteration': i,
                'races_coef': model.params['races_before']
            }
        )

    # Saving data
    bootstrap_df = pd.DataFrame(tmp)

    # Calculating confidence interval
    lower_bound = bootstrap_df.races_coef.quantile((1 - confidence)/2)
    upper_bound = bootstrap_df.races_coef.quantile(1 - (1 - confidence)/2)

    # Creating a chart
    ax = bootstrap_df.races_coef.hist(bins = 50, alpha = 0.6, color = 'purple')
    ax.set_title('Coefficient between kms during the program and previous races, iterations = %d' % num_batches)
    plt.axvline(x=lower_bound, color='navy', linestyle='--', label='lower bound = %.2f' % lower_bound)
    plt.axvline(x=upper_bound, color='navy', linestyle='--', label='upper bound = %.2f' % upper_bound)
    ax.annotate('CI lower bound: %.2f' % lower_bound,
        xy=(lower_bound, ax.get_ylim()[1]),
        xytext=(-10, -20),
        textcoords='offset points',
        ha='center', va='top',
        color='navy', rotation=90)
    ax.annotate('CI upper bound: %.2f' % upper_bound,
        xy=(upper_bound, ax.get_ylim()[1]),
        xytext=(10, -20),
        textcoords='offset points',
        ha='center', va='top',
        color='navy', rotation=90)
    plt.xlim(ax.get_xlim()[0] - 5, ax.get_xlim()[1] + 5)
    plt.show()

    return bootstrap_df
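For example, we can call it like this (again, the number of iterations is an arbitrary choice):

races_coef_df = get_races_coef_confidence_interval(10000)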

We can look at the distribution. The confidence interval is above 0, so we can say there's an effect with 95% confidence.

You can spot that the distribution is bimodal, and each mode corresponds to one of the scenarios:

  • The component around 12 is related to samples without an outlier — it's an estimation of the effect of previous races on the expected distance during the program if we disregard the outlier.
  • The second component corresponds to the samples when one or several outliers were in the dataset.

So, it's super cool that we can even make estimations for different scenarios just by looking at the bootstrap distribution.

We've learned how to use bootstrap with observational data, but its bread and butter is A/B testing. So, let's move on to our second example.

Simulations for A/B testing

Another everyday use case for bootstrap is designing and analysing A/B tests. Let's look at an example. It will also be based on a synthetic dataset that shows the effect of a discount on customer retention. Imagine we are working on an e-grocery product and want to test whether our marketing campaign with a 20 EUR discount will affect customers' spending.

For each customer, we know their country of residence, the number of family members that live with them, the average annual salary in their country, and how much money they spend on products in our store.

Power analysis

First, we need to design the experiment and understand how many clients we need in each experiment group to make conclusions confidently. This step is called power analysis.

Let's quickly recap the basic statistical theory behind A/B tests and the main metrics. Every test is based on the null hypothesis (which represents the current status quo). In our case, the null hypothesis is "the discount does not affect customers' spending on our product". Then, we need to collect customers' spending data for the control and experiment groups and estimate the probability of seeing such results (or more extreme ones) if the null hypothesis is valid. This probability is called the p-value, and if it's small enough, we can conclude that we have enough data to reject the null hypothesis and say that the treatment affects customers' spending or retention.

In this approach, there are three main metrics:

  • effect size — the minimal change in our metric that we would like to be able to detect;
  • statistical significance — the false positive rate, i.e. the probability of rejecting the null hypothesis when there was actually no effect. The most commonly used significance level is 5%. However, you might choose other values depending on your false-positive tolerance. For example, if implementing the change is expensive, you might want to use a lower significance threshold;
  • statistical power — the probability of rejecting the null hypothesis given that there actually was an effect equal to or higher than the effect size. People often use an 80% threshold, but in some cases (i.e. if you want to be more confident that there are no negative effects), you might use 90% or even 99%.

We need all these values to estimate the number of clients in the experiment. Let's try to define them in our case to understand their meaning better.

We will start with effect size:

  • we expect the retention rate to change by at least 3% points as a result of our campaign,
  • we would like to spot changes in customers' spending by 20 or more EUR.

For statistical significance, I will use the default 5% threshold (so if we see the effect as a result of A/B test analysis, we can be confident with 95% that the effect is present). Let's target a 90% statistical power threshold so that if there's an actual effect equal to or bigger than the effect size, we will spot this change in 90% of cases.

Let's start with statistical formulas that allow us to get estimations quickly. Statistical formulas assume that our variable follows a particular distribution, but they can usually help you estimate the order of magnitude of the required number of samples. Later, we will use bootstrap to get more accurate results.

For retention, we can use the standard test of proportions. We need to know the actual value to estimate the normed effect size. We can get it from the historical data before the experiment.

import statsmodels.stats.power as stat_power
import statsmodels.stats.proportion as stat_prop

base_retention = before_df.retention.mean()
ret_effect_size = stat_prop.proportion_effectsize(base_retention + 0.03,
    base_retention)

sample_size = 2*stat_power.tt_ind_solve_power(
    effect_size = ret_effect_size,
    alpha = 0.05, power = 0.9,
    nobs1 = None, # we specified nobs1 as None to get an estimation for it
    alternative='larger'
)

# ret_effect_size = 0.0632, sample_size = 8573.86

We used a one-sided test because, from the business perspective, there's no difference between a negative effect and no effect: in either case, we won't implement the change. Using a one-sided instead of a two-sided test increases the statistical power.

We can similarly estimate the sample size for the customer value metric, assuming a normal distribution. However, the distribution is actually not normal, so we should expect more precise results from bootstrap.

Let's write code.

val_effect_size = 20/before_df.customer_value.std()

sample_size = 2*stat_power.tt_ind_solve_power(
    effect_size = val_effect_size,
    alpha = 0.05, power = 0.9,
    nobs1 = None,
    alternative='larger'
)
# val_effect_size = 0.0527, sample_size = 12324.13

We got estimations for the needed sample sizes for each test. However, there are cases when you have a limited number of clients and want to understand the statistical power you can get.

Suppose we have only 5K customers (2.5K in each group). Then, we will be able to achieve 72.2% statistical power for retention analysis and 58.7% — for customer value (given the desired statistical significance and effect sizes).

The only difference in the code is that this time, we've specified nobs1 = 2500 and left power as None.

stat_power.tt_ind_solve_power(
    effect_size = ret_effect_size,
    alpha = 0.05, power = None,
    nobs1 = 2500,
    alternative='larger'
)
# 0.7223

stat_power.tt_ind_solve_power(
    effect_size = val_effect_size,
    alpha = 0.05, power = None,
    nobs1 = 2500,
    alternative='larger'
)
# 0.5867

Now, it's time to use bootstrap for the power analysis, and we will start with the customer value test since it's easier to implement.

Let's discuss the basic idea and steps of power analysis using bootstrap. First, we need to define our goal clearly. We want to estimate the statistical power depending on the sample size. If we put it in more practical terms, we want to know the percentage of cases when there was an increase in customer spending by 20 or more EUR, and we were able to reject the null hypothesis and implement this change in production. So, we need to simulate a bunch of such experiments and calculate the share of cases when we can see statistically significant changes in our metric.

Let's look at one experiment and break it into steps. The first step is to generate the experimental data. For that, we need to get a random subset from the population equal to the sample size, randomly split these customers into control and experiment groups, and add an effect equal to the effect size for the treatment group. All this logic is implemented in the get_sample_for_value function below.

def get_sample_for_value(pop_df, sample_size, effect_size):
    # getting a sample of the needed size
    sample_df = pop_df.sample(sample_size)

    # randomly assigning treatment
    sample_df['treatment'] = sample_df.index.map(
        lambda x: 1 if np.random.uniform() > 0.5 else 0)

    # adding the effect for the treatment group
    sample_df['predicted_value'] = sample_df['customer_value'] \
        + effect_size * sample_df.treatment

    return sample_df

Now, we can treat this synthetic experiment data the same way we usually treat A/B test results: run a bunch of bootstrap simulations, estimate the effect, and get a confidence interval for this effect.

We will be using linear regression to estimate the effect of the treatment. As discussed in the previous article, it's worth adding features that explain the outcome variable (customers' spending) to the linear regression. We will add the number of family members and the average salary to the regression since they are positively correlated with spending.

import statsmodels.formula.api as smf

val_model = smf.ols('customer_value ~ num_family_members + country_avg_annual_earning',
    data = before_df).fit()
val_model.summary().tables[1]

We will put all the logic of doing multiple bootstrap simulations and estimating treatment effects into the get_ci_for_value function.

def get_ci_for_value(df, boot_iters, confidence_level):
    tmp_data = []

    for iter in range(boot_iters):
        sample_df = df.sample(df.shape[0], replace = True)
        val_model = smf.ols('predicted_value ~ treatment + num_family_members + country_avg_annual_earning',
            data = sample_df).fit()
        tmp_data.append(
            {
                'iteration': iter,
                'coef': val_model.params['treatment']
            }
        )

    coef_df = pd.DataFrame(tmp_data)
    return (coef_df.coef.quantile((1 - confidence_level)/2),
        coef_df.coef.quantile(1 - (1 - confidence_level)/2))

The next step is to put this logic together, run a bunch of such synthetic experiments, and save results.

def run_simulations_for_value(pop_df, sample_size, effect_size,
        boot_iters, confidence_level, num_simulations):
    tmp_data = []

    for sim in tqdm.tqdm(range(num_simulations)):
        sample_df = get_sample_for_value(pop_df, sample_size, effect_size)
        num_users_treatment = sample_df[sample_df.treatment == 1].shape[0]
        value_treatment = sample_df[sample_df.treatment == 1].predicted_value.mean()
        num_users_control = sample_df[sample_df.treatment == 0].shape[0]
        value_control = sample_df[sample_df.treatment == 0].predicted_value.mean()

        ci_lower, ci_upper = get_ci_for_value(sample_df, boot_iters, confidence_level)

        tmp_data.append(
            {
                'experiment_id': sim,
                'num_users_treatment': num_users_treatment,
                'value_treatment': value_treatment,
                'num_users_control': num_users_control,
                'value_control': value_control,
                'sample_size': sample_size,
                'effect_size': effect_size,
                'boot_iters': boot_iters,
                'confidence_level': confidence_level,
                'ci_lower': ci_lower,
                'ci_upper': ci_upper
            }
        )

    return pd.DataFrame(tmp_data)

Let's run this simulation for sample_size = 100 and see the results.

val_sim_df = run_simulations_for_value(before_df, sample_size = 100,
    effect_size = 20, boot_iters = 1000, confidence_level = 0.95,
    num_simulations = 20)
val_sim_df.set_index('experiment_id')[['sample_size', 'ci_lower', 'ci_upper']].head()

We've got the following data for 20 simulated experiments. We know the confidence interval for each experiment, and now we can estimate the power.

We would have rejected the null hypothesis if the lower bound of the confidence interval was above zero, so let's calculate the share of such experiments.

val_sim_df['successful_experiment'] = val_sim_df.ci_lower.map(
    lambda x: 1 if x > 0 else 0)

val_sim_df.groupby(['sample_size', 'effect_size']).aggregate(
    {
        'successful_experiment': 'mean',
        'experiment_id': 'count'
    }
)

We've started with just 20 simulated experiments and 1000 bootstrap iterations to estimate each confidence interval. Such a small number of simulations helps us get a low-resolution picture quite quickly. Keeping in mind the estimation we got from classical statistics, we should expect that sample sizes around 10K will give us the desired statistical power.

tmp_dfs = []
for sample_size in [100, 250, 500, 1000, 2500, 5000, 10000, 25000]:
    print('Simulation for sample size = %d' % sample_size)
    tmp_dfs.append(
        run_simulations_for_value(before_df, sample_size = sample_size, effect_size = 20,
            boot_iters = 1000, confidence_level = 0.95, num_simulations = 20)
    )

val_lowres_sim_df = pd.concat(tmp_dfs)

We got results similar to our theoretical estimations. Let's try to run the estimations with more simulated experiments (100 and 500 experiments). We can see that 12.5K clients will be enough to achieve 90% statistical power.

I've added all the power analysis results to the chart so that we can see the relation clearly.

At this point, you can already see that bootstrap can take a significant amount of time. For example, accurately estimating power with 500 experiment simulations for just 3 sample sizes took me almost 2 hours.

Now, we can estimate the relationship between effect size and power for a 12.5K sample size.

tmp_dfs = []
for effect_size in [1, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100]:
    print('Simulation for effect size = %d' % effect_size)
    tmp_dfs.append(
        run_simulations_for_value(before_df, sample_size = 12500, effect_size = effect_size,
            boot_iters = 1000, confidence_level = 0.95, num_simulations = 100)
    )

val_effect_size_sim_df = pd.concat(tmp_dfs)
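As before, we can flag the experiments where the lower bound of the confidence interval is above zero and aggregate the results, this time by effect size (a sketch mirroring the earlier snippet):

val_effect_size_sim_df['successful_experiment'] = val_effect_size_sim_df.ci_lower.map(
    lambda x: 1 if x > 0 else 0)

val_effect_size_sim_df.groupby('effect_size').aggregate(
    {
        'successful_experiment': 'mean',
        'experiment_id': 'count'
    }
)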

We can see that if the actual effect on customers’ spending is higher than 20 EUR, we will get even higher statistical power, and we will be able to reject the null hypothesis in more than 90% of cases. But we will be able to spot the 10 EUR effect in less than 50% of cases.

Let's move on and conduct power analysis for retention as well. The complete code is structured similarly to the customer spending analysis. We will discuss nuances in detail below.

import tqdm

def get_sample_for_retention(pop_df, sample_size, effect_size):
    base_ret_model = smf.logit('retention ~ num_family_members', data = pop_df).fit(disp = 0)
    tmp_pop_df = pop_df.copy()
    tmp_pop_df['predicted_retention_proba'] = base_ret_model.predict()
    sample_df = tmp_pop_df.sample(sample_size)
    sample_df['treatment'] = sample_df.index.map(lambda x: 1 if np.random.uniform() > 0.5 else 0)
    sample_df['predicted_retention_proba'] = sample_df['predicted_retention_proba'] \
        + effect_size * sample_df.treatment
    sample_df['retention'] = sample_df.predicted_retention_proba.map(
        lambda x: 1 if x >= np.random.uniform() else 0)
    return sample_df

def get_ci_for_retention(df, boot_iters, confidence_level):
    tmp_data = []

    for iter in range(boot_iters):
        sample_df = df.sample(df.shape[0], replace = True)
        ret_model = smf.logit('retention ~ treatment + num_family_members', data = sample_df).fit(disp = 0)
        tmp_data.append(
            {
                'iteration': iter,
                'coef': ret_model.params['treatment']
            }
        )

    coef_df = pd.DataFrame(tmp_data)
    return (coef_df.coef.quantile((1 - confidence_level)/2),
        coef_df.coef.quantile(1 - (1 - confidence_level)/2))

def run_simulations_for_retention(pop_df, sample_size, effect_size,
        boot_iters, confidence_level, num_simulations):
    tmp_data = []

    for sim in tqdm.tqdm(range(num_simulations)):
        sample_df = get_sample_for_retention(pop_df, sample_size, effect_size)
        num_users_treatment = sample_df[sample_df.treatment == 1].shape[0]
        retention_treatment = sample_df[sample_df.treatment == 1].retention.mean()
        num_users_control = sample_df[sample_df.treatment == 0].shape[0]
        retention_control = sample_df[sample_df.treatment == 0].retention.mean()

        ci_lower, ci_upper = get_ci_for_retention(sample_df, boot_iters, confidence_level)

        tmp_data.append(
            {
                'experiment_id': sim,
                'num_users_treatment': num_users_treatment,
                'retention_treatment': retention_treatment,
                'num_users_control': num_users_control,
                'retention_control': retention_control,
                'sample_size': sample_size,
                'effect_size': effect_size,
                'boot_iters': boot_iters,
                'confidence_level': confidence_level,
                'ci_lower': ci_lower,
                'ci_upper': ci_upper
            }
        )

    return pd.DataFrame(tmp_data)

First, since we have a binary outcome for retention (whether the customer returns next month or not), we will use a logistic regression model instead of linear regression. We can see that retention is correlated with the size of the family. It might be the case that when you buy many different types of products for family members, it's more difficult to find another service that will cover all your needs.

base_ret_model = smf.logit('retention ~ num_family_members', data = before_df).fit(disp = 0)
base_ret_model.summary().tables[1]

Also, the function get_sample_for_retention has slightly trickier logic for adjusting the results for the treatment group. Let's look at it step by step.

First, we fit a logistic regression on the whole population data and use this model to predict the probability of retention for each customer.

base_ret_model = smf.logit('retention ~ num_family_members', data = pop_df)\
.fit(disp = 0)
tmp_pop_df = pop_df.copy()
tmp_pop_df['predicted_retention_proba'] = base_ret_model.predict()

Then, we take a random sample of the specified size and split it into control and treatment groups.

sample_df = tmp_pop_df.sample(sample_size)
sample_df['treatment'] = sample_df.index.map(
lambda x: 1 if np.random.uniform() > 0.5 else 0)

For the treatment group, we increase the probability of retaining by the expected effect size.

sample_df['predicted_retention_proba'] = sample_df['predicted_retention_proba'] \
+ effect_size * sample_df.treatment

The last step is to define, based on probability, whether the customer is retained or not. We used uniform distribution (random number between 0 and 1) for that:

  • if a random value from a uniform distribution is below probability, then a customer is retained (it happens with specified probability),
  • otherwise, the customer has churned.

sample_df['retention'] = sample_df.predicted_retention_proba.map(
    lambda x: 1 if x >= np.random.uniform() else 0)

You can run a few simulations to ensure our sampling function works as intended. For example, with the call below, we can see that retention in the control group equals 64%, like in the population, and it's 93.7% for the experiment group (as expected with effect_size = 0.3).

get_sample_for_retention(before_df, 10000, 0.3)\
.groupby('treatment', as_index = False).retention.mean()

# | | treatment | retention |
# |---:|------------:|------------:|
# | 0 | 0 | 0.640057 |
# | 1 | 1 | 0.937648 |

Now, we can also run simulations to find the number of samples needed to reach 90% statistical power for retention. We can see that the 12.5K sample size will also be good enough for retention.
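A sketch of that loop, mirroring what we did for customer value (the list of sample sizes and the number of simulations here are illustrative choices), might look like this:

tmp_dfs = []
for sample_size in [1000, 2500, 5000, 10000, 12500, 25000]:
    print('Simulation for sample size = %d' % sample_size)
    tmp_dfs.append(
        run_simulations_for_retention(before_df, sample_size = sample_size, effect_size = 0.03,
            boot_iters = 1000, confidence_level = 0.95, num_simulations = 100)
    )

ret_sim_df = pd.concat(tmp_dfs)
ret_sim_df['successful_experiment'] = ret_sim_df.ci_lower.map(lambda x: 1 if x > 0 else 0)
ret_sim_df.groupby('sample_size').successful_experiment.mean()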

Analysing results

We can use linear or logistic regression to analyse results or leverage the functions we already have for bootstrap CI.

value_model = smf.ols(
    'customer_value ~ treatment + num_family_members + country_avg_annual_earning',
    data = experiment_df).fit()
value_model.summary().tables[1]

So, we got a statistically significant effect on customer spending equal to 25.84 EUR, with a 95% confidence interval of (16.82, 34.87).

With the bootstrap function, the CI will be pretty close.

get_ci_for_value(experiment_df.rename(
columns = {'customer_value': 'predicted_value'}), 1000, 0.95)
# (16.28, 34.63)

Similarly, we can use logistic regression for retention analysis.

retention_model = smf.logit('retention ~ treatment + num_family_members',
data = experiment_df).fit(disp = 0)
retention_model.summary().tables[1]

Again, the bootstrap approach gives close estimations for CI.

get_ci_for_retention(experiment_df, 1000, 0.95)
# (0.072, 0.187)

With logistic regression, it might be tricky to interpret the coefficient directly. However, we can use a hacky approach: for each customer in our dataset, calculate the predicted probability as if the customer were in the control group and as if they were in the treatment group, and then look at the average difference between these probabilities.

experiment_df['treatment_eq_1'] = 1
experiment_df['treatment_eq_0'] = 0

experiment_df['retention_proba_treatment'] = retention_model.predict(
experiment_df[['retention', 'treatment_eq_1', 'num_family_members']]\
.rename(columns = {'treatment_eq_1': 'treatment'}))

experiment_df['retention_proba_control'] = retention_model.predict(
experiment_df[['retention', 'treatment_eq_0', 'num_family_members']]\
.rename(columns = {'treatment_eq_0': 'treatment'}))

experiment_df['proba_diff'] = experiment_df.retention_proba_treatment \
- experiment_df.retention_proba_control

experiment_df.proba_diff.mean()
# 0.0281

So, we can estimate the effect on retention to be 2.8%.

Congratulations! We’ve finally finished the full A/B test analysis and were able to estimate the effect both on average customer spending and retention. Our experiment is successful, so in real life, we would start thinking about rolling it to production.

You can find the full code for this example on GitHub.

Summary

Let me quickly recap what we’ve discussed today:

  • The main idea of bootstrap is to run simulations, sampling with replacement from your data, assuming that the general population has the same distribution as your sample.
  • Bootstrap shines in cases when you have few data points, your data has outliers or is far from any theoretical distribution. Bootstrap can also help you estimate custom metrics.
  • You can use bootstrap to work with observational data, for example, to get confidence intervals for your values.
  • Also, bootstrap is broadly used for A/B testing analysis — both to estimate the impact of treatment and do a power analysis to design an experiment.

Thank you for reading this article. If you have any follow-up questions or comments, please leave them in the comments section.

Reference

All the images are produced by the author unless otherwise stated.

This article was inspired by the book “Behavioral Data Analysis with R and Python” by Florent Buisson.
