Introduction
One of my references in the Data Science field is Julia Silge. In her Tidy Tuesday videos, she records code-along sessions that teach a given technique, helping other analysts upskill and add it to their repertoire.
Last Tuesday, the topic was Empirical Bayes (her blog post), which caught my attention.
But, what is that?
Empirical Bayes
Empirical Bayes is a statistical method used when we work with ratios like [success]/[total tries]. When working with such variables, we frequently face cases like 1/2 successes, which translates to a 50% success rate, or 3/4 (75%), or 0/1 (0%).
Those extreme percentages do not represent the long-term reality: with so few tries it is very hard to tell whether there is a real trend, so these cases are often just ignored or deleted. It takes more tries to tell what the real success rate is, something like 30/60, 500/1,000, or whatever makes sense for the business.
Using Empirical Bayes, though, we can use the distribution of the data itself as a prior and produce a better estimate of each observation's ratio, whether that observation has few or many tries, as we will see next in this post.
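To build some intuition before the real analysis, here is a toy sketch of what the method does. The prior parameters α and β below are made up purely for illustration; we will fit the real ones from the data later in the post.

```python
# Toy illustration: shrinking raw ratios toward a prior (hypothetical alpha=1, beta=2)
alpha, beta = 1, 2  # made-up prior parameters, just for intuition

for successes, total in [(1, 2), (30, 60)]:
    raw = successes / total
    shrunk = (successes + alpha) / (total + alpha + beta)
    print(f"{successes}/{total}: raw = {raw:.0%}, empirical Bayes estimate = {shrunk:.0%}")
```

Both customers have a raw ratio of 50%, but the one with only 2 tries is pulled much closer to the prior, while the one with 60 tries barely moves. That is exactly the behavior we want.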
Analysis
Let’s jump to the analysis. The steps to follow are:
- Load the data
- Define success and calculate the success ratio
- Determine the distribution’s parameters
- Calculate Bayes estimates
- Calculate the Credible Interval
Let’s move on.
Imports
# Imports
import pandas as pd
import numpy as np
import scipy.stats as scs
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from distfit import distfit
Dataset
The dataset used in this post is a log of customers with their counts of e-commerce orders and brick-and-mortar orders. It was generated from real numbers, but the whole dataset was reduced, randomized, and had its proportions modified so it could be anonymized and shared publicly; a minimal loading sketch follows the variable list below.
The variables are:
- customer: Customer ID
- brick_mortar: How many transactions happened in a physical store.
- ecomm: How many transactions happened online (e-commerce).
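The loading step is not shown in the original notebook, so here is a minimal sketch assuming the data sits in a hypothetical CSV file with those three columns (adjust the file name and path to your own data):

```python
import pandas as pd

# Hypothetical file name; the original data source is not public
df = pd.read_csv('customers_orders.csv')  # columns: customer, brick_mortar, ecomm
df.head()
```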
Success Ratio
Suppose our company wants to increase its online penetration. For this analysis, then, success is defined as the customer purchasing online: the customer bought online [success]; the customer bought brick and mortar [not success].
This way, our metric can simply be e-commerce transactions over the total: [ecomm]/[ecomm + brick_mortar]. It gives us a ratio. Our question now becomes: how different are these two customers, 808 and 672?
Customer 808 has a ratio of 1/11 (9%) and customer 672 has a ratio of 1/2 (50%). Which one is better? Where should I invest more time for growth? Well, customer 672 has 50% penetration, but that is only 2 transactions total with us. Customer 808 has a longer history with the company, but has just started buying online.
Let’s calculate all the ratios. Here, we are going to discard the customers that never bought online, simply because it is not possible to calculate a meaningful estimate for them.
# Creating success ratio
df2c = (
    df
    .query('ecomm > 0 & brick_mortar > 0')
    .assign(total_trx=lambda x: x['brick_mortar'] + x['ecomm'],
            ratio=lambda x: x['ecomm'] / (x['brick_mortar'] + x['ecomm']))
)
display(df2c)
And we have the result.
Next, let’s find out the distribution parameters.
Find Out the Distribution Parameters
The next step here is to calculate the parameters of the distribution of our ratios. But before we jump into that, let’s build our intuition on why we even need to perform this step.
Well, Bayes' theorem says that probability is conditional: the likelihood of something occurring should be based on a previous outcome having occurred in similar circumstances.
If something has happened 10 times, it is more likely to happen an 11th time than something that failed 10 times is to happen on its 11th try.
We will use the beta distribution to represent our prior expectations, or what has been happening so far. It is like looking at our data and understanding the pattern to be able to give a more precise estimate for any data point in any stage, considering it would follow that same distribution.
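To make that concrete, here is a small sketch of how a beta prior gets updated by one observation's successes and failures. The α, β and the counts here are made up just for illustration; the actual α and β come from the fit below.

```python
import scipy.stats as scs

# Hypothetical prior parameters, only for illustration
alpha_prior, beta_prior = 1.0, 1.5

# One customer with 3 successes (e-comm orders) out of 10 total orders
successes, failures = 3, 7

# Conjugacy: the posterior is also a beta, with updated shape parameters
posterior = scs.beta(alpha_prior + successes, beta_prior + failures)

print(posterior.mean())           # point estimate of the customer's true ratio
print(posterior.interval(0.95))   # 95% credible interval
```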
To find which α and β make the beta distribution fit our data, we will use the distfit module in Python.
To look at the distribution, we can simply plot a histogram.
# Looking at the distribution
px.histogram(df2c,
x='ratio', nbins=20,
template="simple_white", width = 1000,
title='Histogram of the Ratio Brick & Mortar vs e-Comm')
Now, let’s see which parameters make the beta distribution fit the data.
# Our distribution
X = df2c['ratio']
# Alternatively limit the search for only a few theoretical distributions.
dfit = distfit(method='parametric', todf=True, distr=['beta'])
# Fit model on input data X.
dfit.fit_transform(X)
# Print the best model results.
dfit.model
--------------
[OUT]
[distfit] >INFO> fit
[distfit] >INFO> transform
[distfit] >INFO> [beta] [0.11 sec] [RSS: 0.823271] [loc=0.011 scale=0.941]
[distfit] >INFO> Compute confidence intervals [parametric]
{'name': 'beta',
'score': 0.8232713059833795,
'loc': 0.011363636363636362,
'scale': 0.9411483584238516,
>>>'arg': (0.850939343634336, 1.4553354599535102),<<<
'params': (0.850939343634336,
1.4553354599535102,
0.011363636363636362,
0.9411483584238516),
'model': <scipy.stats._distn_infrastructure.rv_continuous_frozen at 0x7838c17ad570>,
'bootstrap_score': 0,
'bootstrap_pass': None,
'color': '#e41a1c',
'CII_min_alpha': 0.030238213140192628,
'CII_max_alpha': 0.8158034848017729}
I marked the line >>>'arg': (0.850939343634336, 1.4553354599535102)<<< because those values are the α and β we need.
In case we want to see the fit, we can use the code below, which creates a figure with 3 subplots showing the Probability Density Function, the Cumulative Density Function, and the QQ plot.
import matplotlib.pyplot as plt
# Create subplot
fig, ax = plt.subplots(1,3, figsize=(18, 7))
# Plot PDF with histogram
dfit.plot(chart='PDF', ax=ax[0])
# Plot the CDF
dfit.plot(chart='CDF', ax=ax[1]);
# QQ plot comparing the data to the fitted distribution
dfit.qqplot(X, ax=ax[2])
Ok. Time to calculate our estimates.
Bayes Estimates
To calculate the estimates, we set alpha (a) and beta (b) to the values just discovered.
# Alpha and Beta values
a,b = 0.850939343634336, 1.4553354599535102
With the estimated parameters of the Beta distribution, let’s model our ratio of e-commerce transactions with Empirical Bayes. We’ll start with our overall prior and update it based on each customer’s individual evidence. The calculation simply adds a to the number of successes (e-comm orders) and a + b to the total number of orders.
# Calculating Bayes Estimates
df2c = (df2c
.assign(emp_bayes = (df2c['ecomm'] + a)/(df2c['total_trx'] + a + b) )
)
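To see what this formula does to the two customers discussed earlier, here is a quick sanity check using the fitted a and b (customer 808 had 1 e-comm order out of 11 total; customer 672 had 1 out of 2):

```python
# Quick check of the shrinkage on the two customers from earlier
print((1 + a) / (11 + a + b))  # customer 808: ~0.14, pulled up from the raw 9%
print((1 + a) / (2 + a + b))   # customer 672: ~0.43, pulled down from the raw 50%
```

The customer with little history moves much more toward the overall prior than the one with a longer history.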
Now, let’s plot the simple ratio against the Bayes estimate for all customers and see how the model is working.
# Scatterplot
fig = sns.scatterplot(data = df2c,
x= 'ratio', y= 'emp_bayes',
s=50)
# Slope line
plt.plot([0,1], [0,1], linestyle='--', color='red');
Which results in the graphic.
Here, the closer the dots get to the red line, the more reliable that ratio is. It means that the simple ratio [success]/[total] is very close to the Bayes estimate, which already takes the prior probability into account.
Let's look at some numbers in the table.
Looking at the table, we notice that the higher the number of e-commerce and total transactions, the closer the Bayes estimate is to the real ratio. This means that those 1/2 cases will be really off the mark and, therefore, much less reliable as a trend.
We can also calculate the credible interval for the estimates. That’s what we will do next.
Credible Interval
The credible interval is an extra step to make our analysis even more complete. We can give the decision maker a range they can trust.
First, let’s calculate a1 and b1 for each observation; these are the posterior shape parameters of the beta distribution after the prior has been updated with each customer’s data.
# Calculate a1 and b1
df2 = (df2c
.assign(a1 = df2c.ecomm + a,
b1 = df2c.total_trx - df2c.ecomm + b,)
)
Then, the next task can be performed with the scipy.stats.beta module, using the interval method. For a 95% credible interval, the output is a tuple where index [0] holds the low boundary and [1] holds the high boundary.
# Calculating the CI and add to the dataset
df2 = (df2
.assign( low = scs.beta.interval(.95, df2.a1, df2.b1)[0],
high = scs.beta.interval(.95, df2.a1, df2.b1)[1] )
.sort_values('total_trx', ascending=False)
)
And, finally, let’s plot the interval by observation. We create a figure, then a scatterplot for the point estimate, and two bar plots: one for the low boundary, drawn in white (so it disappears against the white background), and one for the high boundary.
# customer ID to string for better visualization
df2['customer'] = df2.customer.astype('string')
# Number of customers to plot
n_cust = 100
# Plot
plt.figure(figsize=(25,7))
plt.scatter(df2.customer[:n_cust], df2.emp_bayes[:n_cust], color='black', s=100 )
plt.bar(df2.customer[:n_cust], df2.low[:n_cust], label='low', color='white', zorder=1 )
plt.bar(df2.customer[:n_cust], df2.high[:n_cust], label='high', zorder=0 )
plt.legend();
Here is the result.
This graphic is ordered by decreasing total number of transactions. Going from more (left) to fewer (right), we notice that the interval only gets wider, showing that when we have too few transactions, it is really difficult to produce a reliable estimate.
Coming back to one of our questions (should I invest more time in growing e-comm for customer 672 or 808?), here are the intervals.
Customer 672 has a much wider interval, so its estimate is less reliable. We should spend more time developing customer 808, which aligns with their longer history with the brand.
Before You Go
Well, this tool shows itself to be powerful. We saw that it brought good results for our comparison case here. Many times we will just discard customers with few transactions, but other times that may not be possible, or we may actually want to analyze the small clients. It can also help us compare two clients who order a lot, looking at their potential based on the credible interval.
Empirical Bayes works well for larger datasets, and the estimate could even serve as an extra feature for other models.
I bet you are already thinking about how useful this could be in your job. So, we end here.
If you liked this content, follow my blog, subscribe to my page and get the posts once they’re published.
Also, find me on LinkedIn.
Here is the complete code, for reference.
Studying/Python/statistics/Empirical_Bayes.ipynb at master · gurezende/Studying
Reference
Empirical Bayes for #TidyTuesday Doctor Who episodes | Julia Silge
Understanding empirical Bayes estimation (using baseball statistics)
Understanding credible intervals (using baseball statistics)