Analysis of Variance – ANOVA with Python
Introduction
In a series of weekly articles, I will cover some important statistics topics with a twist.
The goal is to use Python to help us get intuition on complex concepts, empirically test theoretical proofs, or build algorithms from scratch. In this series, you will find articles covering topics such as random variables, sampling distributions, confidence intervals, significance tests, and more.
At the end of each article, you can find exercises to test your knowledge. The solutions are shared in the following week's article.
Articles published so far:
- Bernoulli and Binomial Random Variables with Python
- From Binomial to Geometric and Poisson Random Variables with Python
- Sampling Distribution of a Sample Proportion with Python
- Confidence Intervals with Python
- Significance Tests with Python
- Two-sample Inference for the Difference Between Groups with Python
- Inference for Categorical Data
- Advanced Regression
- Analysis of Variance – ANOVA
As usual, the code is available on my GitHub.
Analysis of Variance step by step
Once again, we are working with the salaries of Data Scientists. This time, we are not interested in predicting the salary from some independent feature. We are focused on understanding whether the mean salary differs between 3 groups of Data Scientists with distinct backgrounds: the first group holds degrees in Computer Science, the second in Economics, and the third in Informatics Engineering (notice that the salary unit is 10,000€).

import pandas as pd
import numpy as np
from scipy.stats import f

# each column is a group of 7 sampled salaries (unit: 10,000€)
df = pd.DataFrame.from_dict({'g1': [5, 9, 10, 12, 8, 8, 9],
                             'g2': [5, 4, 4, 5, 5, 4, 8],
                             'g3': [9, 8, 5, 6, 7, 7, 6]})
df

The first step in performing an ANOVA test is to calculate the SST (total sum of squares), the SSW (sum of squares within), and the SSB (sum of squares between), along with the corresponding degrees of freedom. They are calculated as follows:
- SST is the sum of the squared distances between each data point and the mean of the dataset. The degrees of freedom are the number of groups m times the number of data points n in each group, minus 1, i.e., m•n − 1.
- SSW is the sum of the squared distances between each data point and the respective group mean. The degrees of freedom are m•(n − 1).
- SSB is the sum, over every data point, of the squared distance between the respective group mean and the mean of the dataset. The degrees of freedom are m − 1.
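Before the vectorized version below, the three definitions can be translated almost literally into plain Python. This is a minimal from-scratch sketch using the same three groups of salaries; it also checks the classic identity SST = SSW + SSB:

```python
# from-scratch sketch of the three sums of squares, using plain loops
groups = [[5, 9, 10, 12, 8, 8, 9],   # g1: Computer Science
          [5, 4, 4, 5, 5, 4, 8],     # g2: Economics
          [9, 8, 5, 6, 7, 7, 6]]     # g3: Informatics Engineering

all_points = [x for g in groups for x in g]
grand_mean = sum(all_points) / len(all_points)
group_means = [sum(g) / len(g) for g in groups]

# SST: squared distance from every point to the grand mean
SST = sum((x - grand_mean) ** 2 for x in all_points)
# SSW: squared distance from every point to its own group mean
SSW = sum((x - mean) ** 2 for g, mean in zip(groups, group_means) for x in g)
# SSB: squared distance from each group mean to the grand mean, counted once per point
SSB = sum(len(g) * (mean - grand_mean) ** 2
          for g, mean in zip(groups, group_means))

print(SST, SSW, SSB)  # SST equals SSW + SSB (up to floating-point error)
```

The total variation always partitions exactly into the within-group and between-group components, which is a useful sanity check on any ANOVA calculation.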
m = df.shape[1]  # number of groups
n = df.shape[0]  # number of data points per group
SST = np.sum(np.sum((df - np.mean(np.mean(df)))**2))  # np.mean(np.mean(df)) is the grand mean
SST
98.57142857142858
df_sst = m*n-1
df_sst
20
SSW = np.sum(np.sum((df - np.mean(df))**2))  # np.mean(df) gives each group (column) mean
SSW
50.28571428571429
df_ssw = m*(n-1)
df_ssw
18
SSB = np.sum(np.sum((np.tile(np.mean(df), (n, 1)) - np.mean(np.mean(df)))**2))  # repeat the group means n times
SSB
48.285714285714285
df_ssb = m-1
df_ssb
2
Hypothesis Test
Let’s define our hypothesis test. Our null hypothesis is the scenario where the background does not make a difference. In contrast, our alternative hypothesis states that background makes a difference in the salary of a Data Scientist.

As usual, we will assume that our null hypothesis is true and figure out the probability of getting a statistic as extreme as, or more extreme than, the one observed in the data. For that, we will use an F-statistic, which is the ratio of two chi-square statistics: two of the sums of squares calculated above, each divided by its respective degrees of freedom:

F = (SSB/df_SSB) / (SSW/df_SSW) = (SSB/(m − 1)) / (SSW/(m•(n − 1)))

The idea is that if the numerator is significantly larger than the denominator, this should make us believe that there is a difference between the true population means. Conversely, if the denominator is significantly larger, the variation within each sample is a bigger share of the total variation than the variation between the samples, so any difference we observe in the means is probably just a result of random chance.
F = (SSB/df_ssb)/(SSW/df_ssw)
F
8.642045454545455
f.ppf(0.95, dfn=df_ssb, dfd=df_ssw)  # upper 5% critical value of F(2, 18)
3.554557145661787
Now we can calculate our p-value. Let’s use a significance level of 0.1. Note that the F-test is one-sided: only large values of F count against the null hypothesis, so the p-value is just the right-tail probability.
p_value = 1 - f.cdf(F, dfn=df_ssb, dfd=df_ssw)
round(p_value, 5)
0.00234
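Before making the decision, we can build intuition for what this p-value machinery means by simulating the null hypothesis directly: draw three groups of 7 salaries from the same distribution many times and count how often a test at the 5% level rejects. Under H_0 the statistic follows an F(2, 18) distribution, so the rejection rate should be close to 5%. The normal distribution and its parameters below are illustrative assumptions, not estimates from our data:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
n_sims = 20_000
rejections = 0
for _ in range(n_sims):
    # three groups of 7, all drawn from the SAME distribution (H_0 is true)
    g1, g2, g3 = rng.normal(loc=7, scale=2, size=(3, 7))
    _, p_sim = f_oneway(g1, g2, g3)
    rejections += p_sim < 0.05
print(rejections / n_sims)  # close to 0.05
```

In other words, when there really is no difference between the groups, a p-value this small is a rare event, which is exactly why it counts as evidence against H_0.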
if p_value < 0.1:
    print('Reject H_0')
else:
    print('Fail to reject H_0')
Reject H_0
We see that the p-value is smaller than the significance level, so we reject the null hypothesis. There is enough evidence to support a difference between the population means that does not come from chance alone or from the variance within each group. With that said, we can conclude that the salary of a Data Scientist differs depending on the academic background.
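As a cross-check on the step-by-step calculation, scipy ships the whole test as a one-liner: `scipy.stats.f_oneway` takes the groups directly and returns the F-statistic together with its one-sided p-value:

```python
from scipy.stats import f_oneway

F_stat, p_val = f_oneway([5, 9, 10, 12, 8, 8, 9],   # g1
                         [5, 4, 4, 5, 5, 4, 8],     # g2
                         [9, 8, 5, 6, 7, 7, 6])     # g3
print(F_stat, p_val)  # F ≈ 8.64, p ≈ 0.0023
```

Doing the arithmetic by hand first and then confirming it against a library implementation is a good habit whenever you build a statistic from scratch.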
Conclusion
This article covered analysis of variance (ANOVA), a collection of methods for comparing multiple means across different groups. We also introduced a new statistic, the F-statistic, which we used to conduct a hypothesis test on the difference of the means of our groups.
This is the final article of this series on "College Statistics with Python." I hope that you enjoyed it!
Keep in touch: LinkedIn
Answers from last week
- Márcia collected data on the battery life and price of a random sample of Portable Computers. Based on the data presented below, what is the test statistic for the null hypothesis that the population slope is 0?
data = {'Intercept': [200.312, 92.618],
        'Battery': [7.546, 4.798]}
df = pd.DataFrame.from_dict(data, columns=['Coef', 'SE Coef'], orient='index')
df

t = (df['Coef']['Battery'] - 0) / df['SE Coef']['Battery']  # (slope - hypothesized 0) / SE
t
1.5727386411004585
- Rui obtained a random sample of colleagues at work and noticed a positive linear relationship between their ages and the number of kilometers they said they walked yesterday. A 95% confidence interval for the slope of the regression line was (15.4, 155.2). Rui wants to use this interval to test H_0: β=0 vs. H_1: β ≠ 0 at the 5% significance level. Assume that all conditions for inference have been met. What should Rui conclude?
Since the 95% confidence interval (15.4, 155.2) does not contain 0, Rui should reject H_0 at the 5% significance level, i.e., the data suggest a linear relationship between age and the number of kilometers walked yesterday.
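The reasoning behind this answer is the duality between confidence intervals and two-sided tests at matching levels: a test at the 5% level rejects H_0 exactly when the hypothesized value falls outside the 95% interval. A minimal sketch (the variable names are just for illustration):

```python
ci_low, ci_high = 15.4, 155.2             # 95% CI for the slope, from the problem
reject_h0 = not (ci_low <= 0 <= ci_high)  # is beta = 0 outside the interval?
print(reject_h0)  # True -> reject H_0 at the 5% significance level
```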