Chi-square Tests with Python
Introduction
In a series of weekly articles, I will cover some important statistics topics with a twist.
The goal is to use Python to help us get intuition on complex concepts, empirically test theoretical proofs, or build algorithms from scratch. In this series, you will find articles covering topics such as random variables, sampling distributions, confidence intervals, significance tests, and more.
At the end of each article, you can find exercises to test your knowledge. The solutions will be shared in the article of the following week.
Articles published so far:
- Bernoulli and Binomial Random Variables with Python
- From Binomial to Geometric and Poisson Random Variables with Python
- Sampling Distribution of a Sample Proportion with Python
- Confidence Intervals with Python
- Significance Tests with Python
- Two-sample Inference for the Difference Between Groups with Python
- Inference for Categorical Data
- Advanced Regression
- Analysis of Variance – ANOVA
As usual, the code is available on my GitHub.
Chi-square Distribution
Let’s start by defining a standard normal distribution Random Variable (RV) X_1:

X_1 ∼ N(0, 1)
To define our first Chi-square RV, we sample from our standard normal distribution X_1 and square the result. Since we are summing the square of a single standard normally distributed variable, the resulting Chi-square distribution has one degree of freedom. To define a Chi-square RV with 2 degrees of freedom, we follow the same idea: this time, we sample from two independent standard normally distributed RVs, square the respective samples, and sum the result.

Q_1 = X_1² ∼ χ²(1)
Q_2 = X_1² + X_2² ∼ χ²(2)
from scipy.stats import norm, chi2
import matplotlib.pyplot as plt
import math
import numpy as np
import seaborn as sns
from scipy import stats
import tabulate
import pandas as pd
from IPython.display import HTML, display
# Plot the PDF of the standard normal distribution
mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
sns.lineplot(x=x, y=norm.pdf(x, loc=mu, scale=sigma));

# Draw 100,000 squared samples from a standard normal for each of three RVs
norm_dist = stats.norm(0, 1)
x1 = norm_dist.rvs(size=100000)**2
x2 = norm_dist.rvs(size=100000)**2
x3 = norm_dist.rvs(size=100000)**2
x = [x1, x2, x3]
_, ax = plt.subplots(1, 3, figsize=(18, 7))
f = np.zeros(100000)
for i in range(3):
    # f accumulates the sum of i+1 squared standard normal samples
    f += x[i]
    ax[i].hist(f, 60, density=True, label=f'Sum of {i+1} squared standard normal RVs')
    d = np.arange(0, 10, .05)
    ax[i].plot(d, stats.chi2.pdf(d, df=i+1), color='r', lw=2, label='Chi-square dist.')
    ax[i].set_xlim(0, 10)
    ax[i].set_title(f'Chi-square dist. with df={i+1}')
    ax[i].legend()

Goodness of Fit
Rui works remotely for a tech company. He likes to work in coffee shops if the number of people is not too high. A new coffee shop opened recently, and he wanted to understand the distribution of customers per day of the week; this way, he could choose the days with smaller percentages to work there. Based on his experience with similar places, he drew the distribution of the number of customers for each day of the week. To test this assumption, over the next 3 months he randomly chose one sample of each day of the week and recorded the observed number of customers.
table = [["Day",'M','T', 'W', 'T', 'F', 'S', 'S'],
["Expected (%)",10,10, 10, 20, 30, 15, 5],
["Observed",30, 14, 34, 45, 57, 20, 10]]
display(HTML(tabulate.tabulate(table, tablefmt='html')))

Before moving any further, we need to ensure that the conditions for the Chi-square goodness-of-fit test are met. Let’s enumerate them first:
- The sample has to be random.
- The expected count for each category of outcomes has to be greater than or equal to 5 (also called the large counts condition).
- The samples are required to be independent. The rule of thumb is that if you are sampling without replacement, your sample size should be less than 10% of the population size.
We were told that Rui randomly chose each of the days of the week, so the first condition is fulfilled. For the large counts condition, let’s calculate the expected number of customers per day of the week.
n = 7      # number of days in a week
alpha = 0.05
table = np.asarray(table)[1:, 1:]          # drop the header row and the label column
table = table.astype(np.float32)
table[0] = table[0]/100                    # expected percentages -> proportions
total_number_customers = np.sum(table[1])  # 210 customers observed in total
expected_num = table[0]*total_number_customers
table = np.concatenate((table, expected_num.reshape(1, -1)))
table[2]
array([21. , 21. , 21. , 42. , 63. , 31.5, 10.5], dtype=float32)
Notice that none of the expected counts is less than 5. Finally, we have the independence condition. Rui sampled each weekday from a population of 3 months, which contains about 12 occurrences of each weekday. Since he sampled one of each, he sampled less than 10% of the population size, and we can assume independence even though he sampled without replacement.
With this data, Rui defined the following hypothesis test:

H_0: The distribution of customers per day of the week follows the distribution Rui drew from similar places.
H_1: The distribution of customers per day of the week does not follow that distribution.
To start, he has to calculate a statistic that compares the expected and the observed number of customers; it follows approximately a chi-square distribution. Using this statistic, he can calculate the probability of observing that specific value, or values even more extreme, given that the distribution based on similar places is correct. If this probability is smaller than the significance level (let’s use α=0.05), we can reject H_0 and conclude in favor of the alternative hypothesis that the new coffee shop’s distribution of customers per day differs from that of the similar places considered by Rui.
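For k categories with observed counts O_i and expected counts E_i, the statistic takes the standard form

χ² = Σ_{i=1..k} (O_i − E_i)² / E_i

and, under H_0, approximately follows a chi-square distribution with k − 1 degrees of freedom.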
chi_sq_statistic = np.sum((table[2]-table[1])**2/table[2])
chi_sq_statistic
19.246035
(1-chi2.cdf(chi_sq_statistic, df=n-1))<alpha
True
if (1-chi2.cdf(chi_sq_statistic, df=n-1)) < alpha:
    print('Reject H_0')
Reject H_0
We rejected H_0 based on the observations that Rui had been collecting, which consisted of the number of customers of the new coffee shop per day. The fact that we rejected H_0 means that the new place does not follow the distribution Rui assumed based on his experience with similar places.
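For reference, SciPy ships this goodness-of-fit test as stats.chisquare. A minimal cross-check of the manual computation above, reusing Rui’s observed counts and expected proportions:
# Cross-check the manual computation with SciPy's built-in test
observed = np.array([30, 14, 34, 45, 57, 20, 10])
expected = np.array([0.10, 0.10, 0.10, 0.20, 0.30, 0.15, 0.05]) * 210
stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)  # statistic ≈ 19.25, p-value ≈ 0.0038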
Contingency Table Chi-square Test
A government agency wanted to know if the vaccines against Covid-19 currently being administered produce any effect against the new Delta variant. They separated the sample of people into 3 different groups: the first one took Pfizer’s vaccine, the second took Janssen, and the third took a placebo.

table = [['Sick', 15, 10, 30],['Not sick', 100, 110, 90]]
alpha = 0.05
df = pd.DataFrame(table)
df.columns = ['Effect', 'Pfizer', 'Janssen', 'Placebo']
df = df.set_index('Effect')
df

Let’s define our hypothesis test:

H_0: Getting sick is independent of the group (Pfizer, Janssen, or placebo), i.e., the vaccines have no effect.
H_1: Getting sick is not independent of the group, i.e., the vaccines have some effect.
Like in any hypothesis test, we will assume the null hypothesis is true and calculate the probability of getting data at least as extreme as the data collected above. If that probability is lower than the significance level, we reject the null hypothesis.
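Under H_0, the expected count for the cell in row i and column j comes from the marginal totals of the table:

E_{ij} = (row_i total × column_j total) / grand total

For example, the expected number of sick people in the Pfizer group is 55 × 115 / 355 ≈ 17.82, which matches the first value appended below.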
arr = df.to_numpy()
# Append the expected counts: row total / grand total * column totals
arr = np.concatenate((arr, (arr.sum(axis=1)[0]/arr.sum() * arr.sum(axis=0)).reshape(1, -1)))
arr = np.concatenate((arr, (arr.sum(axis=1)[1]/arr.sum() * arr.sum(axis=0)).reshape(1, -1)))
arr
array([[ 15. , 10. , 30. ],
[100. , 110. , 90. ],
[ 17.81690141, 18.5915493 , 18.5915493 ],
[ 97.18309859, 101.4084507 , 101.4084507 ]])
chi_sq_statistic = np.sum((arr[2] - arr[0])**2/arr[2]) + np.sum((arr[3] - arr[1])**2/arr[3])
The number of degrees of freedom that we should use equals the number of rows of our table minus one, times the number of columns minus one.
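For our 2×3 table:

df = (r − 1)(c − 1) = (2 − 1) × (3 − 1) = 2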
print('P-value = ' + str(np.round(1-chi2.cdf(chi_sq_statistic, df=2*1), 4)))
P-value = 0.0012
if 1-chi2.cdf(chi_sq_statistic, df=2*1) < alpha:
    print('Reject H_0')
Reject H_0
We reject H_0, meaning the data provide evidence that getting sick is not independent of the group: the vaccines impacted the number of sick people in our experiment.
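For reference, SciPy bundles this whole procedure in stats.chi2_contingency, which returns the statistic, the p-value, the degrees of freedom, and the table of expected counts. A minimal cross-check (no continuity correction is applied here since df = 2):
# Cross-check the contingency table test with SciPy
observed = np.array([[15, 10, 30], [100, 110, 90]])
stat, p_value, dof, expected = stats.chi2_contingency(observed)
print(np.round(p_value, 4))  # ≈ 0.0012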
Chi-square Test for Homogeneity
The same agency decided to test Pfizer’s vaccine, but the goal was to test the effect on men and women this time.
This is a homogeneity test, which can be translated to the following hypothesis:

H_0: The distribution of sick and not sick people is the same for men and women.
H_1: The distribution of sick and not sick people differs between men and women.
table = [['Sick', 25, 12],['Not sick', 92, 88]]
alpha = 0.05
df = pd.DataFrame(table)
df.columns = ['Effect', 'Men', 'Women']
df = df.set_index('Effect')
df

arr = df.to_numpy()
arr = np.concatenate((arr, (arr.sum(axis=1)[0]/arr.sum() * arr.sum(axis=0)).reshape(1,-1)))
arr = np.concatenate((arr, (arr.sum(axis=1)[1]/arr.sum() * arr.sum(axis=0)).reshape(1,-1)))
arr
array([[25. , 12. ],
[92. , 88. ],
[19.94930876, 17.05069124],
[97.05069124, 82.94930876]])
chi_sq_statistic = np.sum((arr[2] - arr[0])**2/arr[2]) + np.sum((arr[3] - arr[1])**2/arr[3])
print('P-value = ' + str(np.round(1-chi2.cdf(chi_sq_statistic, df=1*1), 4)))
P-value = 0.0674
if 1-chi2.cdf(chi_sq_statistic, df=1*1) < alpha:
    print('Reject H_0')
else:
    print('Fail to reject H_0')
Fail to reject H_0
Notice that despite the probability of observing these values, or values even more extreme, being quite low (about 6.7%), it is not below the 5% significance level, so we failed to reject H_0. This means we do not have enough evidence to state that the vaccine’s effect differs between men and women.
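A cross-check with stats.chi2_contingency needs one extra argument here: for 2×2 tables (one degree of freedom), SciPy applies Yates’ continuity correction by default, which would not match the uncorrected statistic computed above. Disabling it reproduces our p-value:
# correction=False disables Yates' continuity correction for this 2x2 table
observed = np.array([[25, 12], [92, 88]])
stat, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
print(np.round(p_value, 4))  # ≈ 0.0674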
Chi-squared Test for Association
Finally, let’s build a chi-squared test for the association between two variables. In this case, we want to test if there is an association between a surfer’s height X and the maximum wave size they ever surfed Y. Notice that this particular test uses a random sample of a single population.

H_0: There is no association between a surfer’s height and the maximum wave size surfed.
H_1: There is an association between the two variables.
table = [['x<1.6m', 25, 22, 28],['1.6m<=x<1.9m', 10, 21, 35], ['x>=1.9m', 5, 10, 34]]
alpha = 0.05
df = pd.DataFrame(table)
df.columns = ['height', 'y<2m', '2m<=y<4m', 'y>=4m']
df = df.set_index('height')
df

arr = df.to_numpy()
# Append one row of expected counts per original row of the table
for i in range(arr.shape[0]):
    arr = np.concatenate((arr, (arr.sum(axis=1)[i]/arr.sum() * arr.sum(axis=0)).reshape(1, -1)))
arr
array([[25. , 22. , 28. ],
[10. , 21. , 35. ],
[ 5. , 10. , 34. ],
[15.78947368, 20.92105263, 38.28947368],
[13.89473684, 18.41052632, 33.69473684],
[10.31578947, 13.66842105, 25.01578947]])
chi_sq_statistic = np.sum((arr[3] - arr[0])**2/arr[3]) + np.sum((arr[4] - arr[1])**2/arr[4]) + np.sum((arr[5] - arr[2])**2/arr[5])
print('P-value = ' + str(np.round(1-chi2.cdf(chi_sq_statistic, df=2*2), 4)))
P-value = 0.0023
if 1-chi2.cdf(chi_sq_statistic, df=2*2) < alpha:
    print('Reject H_0')
else:
    print('Fail to reject H_0')
Reject H_0
We reject H_0, i.e., there is evidence of an association between the surfer’s height and maximum wave size.
Conclusion
This article covered part of the family of chi-square tests. They are useful to test hypotheses about distributions of categorical data. We assessed goodness-of-fit tests, where the sample data is tested to see if it fits a hypothesized distribution. We also saw different types of independence tests between two variables. In cases where we sampled our data from two different populations, we test homogeneity; in cases where we sampled from a single population, we test the association between the variables.
Keep in touch: LinkedIn
Exercises
You will get the solutions in next week’s article.
- According to a distributor of surfboards, 66% of the boards are common, 25% are uncommon, and 9% are rare. José wondered if the rarity levels of the boards he and his friends owned followed this distribution, so he took a random sample of 500 boards and recorded their rarity levels. The results are presented in the table below. Carry out a goodness-of-fit test to determine if the distribution of rarity levels of surfboards José and his friends own disagrees with the claimed percentages.
table = [['Boards', 345, 125, 30]]
alpha = 0.05
df = pd.DataFrame(table)
df.columns = ['Rarity level', 'Common', 'Uncommon', 'Rare']
df = df.set_index('Rarity level')
df

Answers from last week
- Physicians hypothesized that the vaccine changed the mean time patients spent in the hospital due to Covid-19. A group of 1,000 patients was randomized between a treatment group and a control group; the treatment group had already taken the vaccine, while the control group had not. The results show that the treatment group’s mean time spent in the hospital was 10 days less than the control group’s. The table below summarizes the results for the 1,000 re-randomizations of the data. Based on the data, what is the probability that the treatment group’s mean is smaller than the control group’s by 10 days or more? What can you conclude from the experiment’s result (assuming a 5% significance level)?
diff = [[-17.5,1],
[-15.0, 6],
[-12.5, 15],
[-10.0, 41],
[-7.5, 82],
[-5.0, 43],
[-2.5, 150],
[0., 167],
[2.5, 132],
[5.0, 127],
[7.5, 173],
[10.0, 38],
[12.5, 18],
[15.0, 6],
[17.5, 1]]
# Highlight the re-randomizations with a difference of -10 days or less
plt.bar(x=np.asarray(diff)[:,0], height=np.asarray(diff)[:,1], width=2, color='C0')
plt.bar(x=np.asarray(diff)[:4,0], height=np.asarray(diff)[:4,1], width=2, color='C1');

diff = np.asarray(diff)
np.sum(diff[diff[:,0]<=-10][:,1])/np.sum(diff[:,1]) < 0.05
False
The probability that the treatment group’s mean is smaller by 10 days or more is (1 + 6 + 15 + 41)/1000 = 6.3%. Based on the significance level of 5%, the result is not significant: the difference that was measured in the experiment could have resulted from random chance alone.