
As a data scientist, you will most likely come across the effect size while working on some kind of A/B testing. A possible scenario is that the company wants to make a change to the product (be it a website, mobile app, etc.) and your task is to make sure that the change will – to some degree of certainty – result in better performance in terms of the specified KPI.
This is when hypothesis testing comes into play. However, a statistical test can only inform us about the likelihood that an effect exists. By effect, I simply mean a difference – it can be a difference in either direction, or a more precise, directional hypothesis stating that one sample is actually better/worse than the other (in terms of the given metric). And to know how big the effect is, we need to calculate the effect size.
In this article, I will provide a brief theoretical introduction to the effect size and then show some practical examples of how to calculate it in Python.
Theoretical introduction
Formally, the effect size is the quantified magnitude of a phenomenon we are investigating. As mentioned before, statistical tests result in the probability of observing an effect; however, they do not specify how big that effect actually is. This can lead to situations in which we detect a statistically significant effect that is so small that, from a practical (business) point of view, it is negligible and not interesting at all.
Additionally, when planning A/B tests, we want to estimate the expected duration of the test. This is connected to the topic of power analysis, which I covered in another article. To quickly summarize it, in order to calculate the required sample size, we need to specify three things: the significance level, the power of the test, and the effect size. Keeping the other two constant, the smaller the effect size, the harder it is to detect with any certainty, and thus the larger the sample size required for the statistical test.
In general, there are potentially hundreds of different measures of the effect size, each one with its own advantages and drawbacks. In this article, I will present only a selection of the most popular ones. Before diving deeper into the rabbit hole, note that the measures of the effect size can be grouped into three categories, based on their approach to defining the effect. The groups are:
- Metrics based on the correlation
- Metrics based on differences (for example, between means)
- Metrics for categorical variables
The first two families cover continuous random variables, while the last one is used for categorical/binary features. To give a real-life example, we could apply the first two to a metric such as time spent in an app (in minutes), while the third family could be used for conversion or retention – expressed as a boolean.
I will describe some of the measures of effect size below, together with the Python implementation.

Examples in Python
In this part, I will describe in more detail a few examples from each of the effect size families and show how to calculate them in Python using popular libraries. Of course, we could just as well code these functions ourselves, but I do believe there is no need to reinvent the wheel.
As the first step, we need to import the required libraries:
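The original import cell is not reproduced here, so the snippet below is my assumption of the libraries and aliases used by the examples that follow:
# libraries and aliases assumed by the snippets in this article
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg
import statsmodels.api as sm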
1. The correlation family
The name of this group (also known as the "r family") comes from the measure of association between two variables – correlation. And by far the most popular measure of correlation is the Pearson’s correlation coefficient (Pearson’s r).
Before diving into the metrics, we will generate some random, correlated variables coming from the multivariate Normal distribution. They have different means, so we can actually detect some effect, while we keep the variance at 1 for simplicity.
Remember that the more random observations we generate, the more their distribution will resemble the one we specified.
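One way to generate such a dataset is sketched below – the exact means, correlation, and sample size are illustrative assumptions, as the original snippet is not reproduced here:
np.random.seed(42)                 # for reproducibility
mean = [1, 2]                      # different means, so there is an effect to detect
cov = [[1, 0.6], [0.6, 1]]         # unit variances, assumed correlation of 0.6
x = np.random.multivariate_normal(mean, cov, size=100000)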
Pearson’s r
This should not come as a surprise, as the name of the family is based on this metric. Pearson’s correlation coefficient measures the degree of linear association between two real-valued variables. The metric is unit-free and is expressed as a number in the range of [-1, 1]. For brevity, I will only describe the interpretation of extreme cases:
- a value of -1 indicates a perfect negative relationship between variables,
- a value of 0 indicates no linear relationship,
- a value of 1 indicates a perfect positive relationship.
As this is one of the most commonly used metrics in general, there are many ways to calculate the correlation coefficient in Python:
- pearsonr from scipy.stats – in addition to the correlation coefficient, we also receive the p-value of the correlation test. Quoting the documentation: "The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets."
stats.pearsonr(x[:,0], x[:,1])
# (0.6023670412294826, 0.0)
- numpy.corrcoef – returns the correlation matrix; use rowvar to indicate whether the observations of the random variables are stored in rows or columns.
np.corrcoef(x, rowvar=False)
# array([[1. , 0.60236704],
# [0.60236704, 1. ]])
- the corr method of a pandas DataFrame/Series.
- pingouin's corr function – by default it returns the Pearson's r coefficient (other measures of correlation are also available). In contrast to scipy, the function returns a bit more detailed results of the correlation test. We can also use the one-sided variant of the test.
pg.corr(x[:, 0], x[:, 1])

Coefficient of determination (R²)
The second measure of the effect size in this family is the coefficient of determination, also known as R². It states what proportion of the dependent variable’s variance is explained (predictable) by the independent variable(s). In other words, it is a measure of how well the observed outcomes are replicated by the model.
There are several definitions of the coefficient of determination; however, the most relevant one for us right now is the one connected to Pearson's r. When using simple linear regression (with a single independent variable) with the intercept included, the coefficient of determination is simply the square of Pearson's r. If there are multiple independent variables, R² is the square of the coefficient of multiple correlation.
In either of the mentioned cases, the coefficient of determination normally covers the range between 0 and 1. However, if another definition is used, the values can become negative as well. Due to the fact that we square the correlation coefficient, the coefficient of determination does not convey any information about the direction of the correlation.
We can calculate the coefficient of determination by running a simple linear regression and inspecting the reported R² value:
- using pingouin:
pg.linear_regression(x[:, 0], x[:, 1])

- using statsmodels:
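A minimal OLS sketch could look as follows (note that statsmodels requires adding the intercept explicitly):
X = sm.add_constant(x[:, 0])        # add the intercept term
results = sm.OLS(x[:, 1], X).fit()  # regress the second variable on the first
results.rsquared
# ~0.36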
In both cases, the coefficient of determination is close to 0.36, which is the square of the correlation coefficient (0.6).
Eta-squared (_η_²)
The last considered metric in this family is eta-squared. It is the ratio of the variance explained in the dependent variable by a predictor, while controlling for other predictors, which makes it similar to R².

η² = SS_effect / SS_total

where SS stands for the sum of squares. _η_² is a biased estimator of the population's variance explained by the model, as it only estimates the effect size in the considered sample. This means that eta-squared will always overestimate the actual effect size, although this bias becomes smaller as the sample size grows. Eta-squared also shares the weakness of R² – each additional variable automatically increases the value of _η_².
To calculate eta-squared in Python, we can use the pingouin library:
pg.compute_effsize(x[:, 0], x[:, 1], eftype='eta-square')
# 0.057968511053166284
Additionally, the library contains a useful function called convert_effsize, which allows us to convert the effect size measured by Pearson's r or Cohen's d into, among others, eta-squared.
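For example, converting a correlation coefficient of 0.6 into eta-squared might look as follows (a sketch; the type names follow pingouin's eftype conventions):
pg.convert_effsize(0.6, 'r', 'eta-square')  # convert Pearson's r = 0.6 into eta-squared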
2. The "difference" family
The second family is called the difference family, after perhaps the most common method of measuring the effect size – calculating the difference between the mean values of the samples. Usually, that difference is also standardized by dividing it by the standard deviation (of either or both populations).
In practice, the population values are not known and have to be estimated from sample statistics. That is why there are multiple methods for calculating the effect size as the difference between means – they differ in terms of which sample statistics they use.
On a side note, this form of estimating the effect size resembles calculating the t-statistic, the difference being that in the t-statistic's denominator the standard deviation is divided by the square root of n. Unlike the t-statistic, the effect size aims to estimate a population-level value and is not directly affected by the sample size.
This family is also known as the "d family", named after the most common method of estimating the effect size as a difference between means – Cohen’s d.
Before diving into the metrics, we define two random variables coming from the Normal distribution. We use different means and standard deviations to make sure that the variables differ enough to obtain reasonable effect sizes.
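A sketch of such a setup is shown below – the means, standard deviations, and sample sizes are illustrative assumptions, not necessarily the exact values behind the printed results:
np.random.seed(42)
x = np.random.normal(loc=2.0, scale=1.0, size=5000)   # first sample (assumed parameters)
y = np.random.normal(loc=2.5, scale=0.8, size=5000)   # second sample (assumed parameters)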

Cohen’s d
Cohen’s d measures the difference between the means of two variables. The difference is expressed in terms of the number of standard deviations, hence the division in the formula. Cohen’s d is defined as:
d = (x̄_1 − x̄_2) / s,   s = √[((n_1 − 1)s_1² + (n_2 − 1)s_2²) / (n_1 + n_2 − 2)]
where s is the pooled standard deviation, s_1 and s_2 are the standard deviations of the two independent samples, and n_1, n_2 are their sizes.
Note: Some sources use a different formulation of the pooled standard deviation and do not include the -2 in the denominator.
The most common interpretation of the magnitude of the effect size is as follows:
- Small Effect Size: d=0.2
- Medium Effect Size: d=0.5
- Large Effect Size: d=0.8
Cohen’s d is very frequently used in estimating the required sample size for an A/B test. In general, a lower value of Cohen’s d indicates the necessity of a larger sample size and vice versa.
The easiest way to calculate Cohen's d in Python is to use the pingouin library:
pg.compute_effsize(x, y, eftype='cohen')
# -0.5661743543595718
Glass’ Δ
Glass’ Δ is very similar to Cohen’s d, with the difference that the standard deviation of the second sample (the control group in an A/B test) is used instead of the pooled standard deviation.
Δ = (x̄_1 − x̄_2) / s_2
The rationale for using only the standard deviation of the control group was based on the fact that if we were to compare multiple treatment groups to the same control, this way we would have the common denominator in all the cases.
pg.compute_effsize(x, y, eftype='glass')
# -0.6664041092152272
Hedges’ g
Cohen’s d is a biased estimator of the population-level effect size, especially for small samples (n < 20). That is why Hedges’ g corrects for this by multiplying Cohen’s d by a correction factor (based on the gamma function):

g = d × Γ(m/2) / (√(m/2) × Γ((m − 1)/2)),   where m = n_1 + n_2 − 2
pg.compute_effsize(x, y, eftype='hedges')
# -0.5661722311818571
We can see that the difference between Cohen’s d and Hedges’ g is very small. It would be more pronounced for smaller sample sizes.
3. The categorical family
This family is used for estimating the effect size of categorical (and in some simpler cases – binary) random variables. We generate the random variables for this section by running the following snippet:
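A possible version of that snippet is shown below – the success probabilities and the sample size are assumptions, chosen only to illustrate the approach:
np.random.seed(42)
# two independent binary samples, e.g. conversions in the treatment (x) and control (y) groups
x = np.random.binomial(1, 0.58, size=100000)
y = np.random.binomial(1, 0.50, size=100000)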
φ (phi coefficient)
The phi coefficient is a measure of association between two binary variables introduced by Karl Pearson, and is related to the chi-squared statistic of a 2×2 contingency table. In Machine Learning terms, a contingency table is basically the same as the confusion matrix.
φ = (n_11 · n_00 − n_10 · n_01) / √(n_1• · n_0• · n_•1 · n_•0)

where n_ij denotes the count in the corresponding cell of the 2×2 contingency table and the terms under the square root are the row and column totals.
Two binary random variables are positively associated when most of the data falls along the diagonal of the contingency table (think about true positives and true negatives). Conversely, the variables are negatively associated when most of the data falls off the diagonal (think about false positives and false negatives).
As a matter of fact, the Pearson’s correlation coefficient (r) calculated for two binary variables will result in the phi coefficient (we will prove that in Python). However, the attainable range of the phi coefficient can differ from that of the correlation coefficient, especially when at least one of the variables takes more than two values.
What is more, in machine learning we see the increasing popularity of the Matthews correlation coefficient as a measure for evaluating the performance of classification models. In fact, the MCC is nothing other than Pearson’s phi coefficient.
Phi/MCC considers all the elements of the confusion matrix/contingency table, that is why it is considered a balanced evaluation metric that can also be used in cases of class imbalance.
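One way to compute all three values is sketched below (scikit-learn is assumed to be available for the MCC; the phi coefficient is calculated directly from the 2×2 contingency table):
from sklearn.metrics import matthews_corrcoef  # assumed extra dependency

# phi coefficient from the 2x2 contingency table
(a, b), (c, d) = pd.crosstab(x, y).values.astype(float)
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(f"Phi coefficient: {phi:.6f}")
print(f"Matthews Correlation: {matthews_corrcoef(x, y):.6f}")
print(f"Pearson's r: {stats.pearsonr(x, y)[0]:.6f}")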
Running the code results in the following output:
Phi coefficient: 0.000944
Matthews Correlation: 0.000944
Pearson's r: 0.000944
Cramér’s V
Cramér’s V is another measure of association between categorical variables (not restricted to the binary case).
V = √(φ² / min(k − 1, r − 1))
where k and r stand for the number of columns and rows in the contingency table and φ is the phi coefficient as calculated above.
Cramér’s V takes values ranging from 0 (no association between the variables) to 1 (complete association). Note that for the case of a 2×2 contingency table (two binary variables), Cramér’s V is equal to the phi coefficient, as we will soon see in practice.
The most common interpretation of the magnitude of the Cramér’s V is as follows:
- Small Effect Size: V ≤ 0.2
- Medium Effect Size: 0.2 < V ≤ 0.6
- Large Effect Size: 0.6 < V
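A sketch of the calculation from the contingency table (the continuity correction is disabled so that, in the 2×2 case, the result matches the phi coefficient):
table = pd.crosstab(x, y)
chi2 = stats.chi2_contingency(table, correction=False)[0]   # chi-squared statistic
n = table.values.sum()
n_rows, n_cols = table.shape
cramers_v = np.sqrt((chi2 / n) / min(n_cols - 1, n_rows - 1))
print(f"Cramer's V: {cramers_v:.6f}")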
Cramer's V: 0.000944
We have indeed obtained the same value as in the case of the phi coefficient.
Cohen’s w
Cohen suggested another measure of the effect size, which "increases with the degree of discrepancy between the distribution specified by the alternate hypothesis and that which represents the null hypothesis" (for more details, see page 216 in [1]). In this case, we are dealing with proportions (so fractions of all observations), in contrast to the contingency tables for the previous metrics.
w = √( Σ_{i=1}^{m} (p_{1i} − p_{0i})² / p_{0i} )
where:
- p_{0i} – the proportion in cell i under the null hypothesis,
- p_{1i} – the proportion in cell i under the alternative hypothesis,
- m – number of cells.
The effect size measured by Cohen’s w is considered small for values close to 0.1, medium for around 0.3, and large for around 0.5.
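A sketch of the calculation, assuming that the proportions observed in the y sample play the role of the null hypothesis and those in the x sample the role of the alternative:
p0 = np.array([1 - y.mean(), y.mean()])   # proportions under H0 (control group)
p1 = np.array([1 - x.mean(), x.mean()])   # proportions under H1 (treatment group)
cohens_w = np.sqrt(np.sum((p1 - p0) ** 2 / p0))
print(f"Cohen's w: {cohens_w:.6f}")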
Cohen's w: 0.173820
Cohen’s h
Another measure used for comparing proportions from two independent samples is Cohen’s h, defined as follows:
h = 2·arcsin(√p_1) − 2·arcsin(√p_2)
where p_1 stands for the proportion of the positive cases in the first sample (and p_2 for the proportion in the second one). To assess the magnitude of the effect size, the author suggests the same range of indicative values as in the case of Cohen’s d.
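A direct implementation of the formula, using the sample proportions of the two groups:
p1, p2 = x.mean(), y.mean()   # proportions of positive cases in each sample
cohens_h = 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
print(f"Cohen's h: {cohens_h:.6f}")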
Cohen's h: 0.174943
Odds Ratio
The odds ratio expresses the effect size by stating that the odds of an event happening in the treatment group are X times higher/lower than in the control group.
OR = [p_1 / (1 − p_1)] / [p_2 / (1 − p_2)]
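A sketch of the calculation based on the sample proportions:
p1, p2 = x.mean(), y.mean()
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))   # odds in the x group relative to the y group
print(f"Odds Ratio: {odds_ratio:.6f}")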
Odds Ratio: 1.374506
The odds of an event (for example, conversion) happening are ~1.37 times higher in the x group than in the y one, which is in line with the probabilities provided while generating the data.
BONUS: Common language effect size
The last metric is quite interesting because – as the name suggests – it aims to express the effect size in plain language understood by everyone. Also, it does not belong to any of the families mentioned above. The authors described this metric in [2] as:
the probability that a score sampled at random from one distribution will be greater than a score sampled from some other distribution.
To make the description as clear as possible, I will paraphrase the example mentioned in the paper. Imagine that we have samples of heights of adult men and women, and the CLES is 0.8. This would mean that in 80% of randomly selected pairs, the man will be taller than the woman. Or to put it differently, on 8 out of 10 blind dates, the man will be taller than the woman.
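We can calculate the CLES with pingouin’s compute_effsize (the snippet below assumes that x and y once again refer to the two continuous samples defined in the difference family section):
pg.compute_effsize(x, y, eftype='CLES')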

Conclusions
In this article, I introduced different measures of the effect size and showed how to calculate them in Python. Knowing these, or at the very least one key metric per family, will definitely come in handy while planning A/B tests using the frequentist approach.
You can find the code used for this article on my GitHub. As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.
In case you found this article interesting, you might also like:
The new kid on the statistics-in-Python block: pingouin
References
[1] Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Academic press.
[2] McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111(2), 361–365. https://doi.org/10.1037/0033-2909.111.2.361