Statistical Hypothesis Testing with Python

Using Pingouin to Examine a Case Study of ANOVA

Giannis Tolios
Towards Data Science


Hypothesis testing is an inferential statistics method that lets us determine population characteristics by analyzing a sample dataset. The mathematical tools necessary for hypothesis testing were formalized in the early 20th century by the statisticians Ronald Fisher, Jerzy Neyman and Egon Pearson¹. Their influential work established concepts like the null hypothesis and p-values, and those tools became a fundamental part of modern scientific research. It should be noted that Fisher and Neyman-Pearson were academic rivals, but a combination of their differing approaches eventually became established as the modern form of hypothesis testing². Apart from academic research, hypothesis testing is particularly useful to data scientists, as it lets them conduct A/B tests and other experiments. In this article, we are going to examine a case study of hypothesis testing on the seeds dataset, using the Pingouin Python library.

The Basic Steps of Hypothesis Testing

The first step in hypothesis testing is formulating the research hypothesis, a statement that can be tested statistically and involves a comparison of variables, e.g. drug X lowers blood pressure more than a placebo. After doing that, we specify the null hypothesis H₀, which states that the effect is not present in the population. In contrast, the alternative hypothesis H₁ states that the effect is actually present in the population. The next step is data collection, which can be accomplished with experiments, surveys, interviews and other methods, depending on the type of research. For example, A/B tests collect user feedback from different website versions to evaluate their performance. You may also use datasets that have been created for other purposes, a method known as secondary data analysis.

Overview of Statistical Tests — Image by Philipp Probst (MIT License)

Afterwards, we need to decide which test is most suitable for our hypothesis. There are numerous tests available, including the t-test, analysis of variance (ANOVA), Chi-squared, Kruskal-Wallis and many more. Choosing the appropriate test depends on a number of factors, including the type of variables and their distribution. Parametric tests like ANOVA are based on various assumptions, so we need to assess whether our dataset satisfies them. The table above provides an overview of the basic hypothesis tests, and can be a valuable tool when trying to find the most suitable one. Keep in mind that more hypothesis tests are available, but this table covers the fundamental cases.

Type I and Type II Errors — Image by Author

Afterwards, we need to specify the significance level α (alpha), which is a threshold for rejecting the null hypothesis, typically set to 0.05. Therefore, a hypothesis test resulting in a p-value > 0.05 means the null hypothesis can’t be rejected. In contrast, a p-value ≤ 0.05 allows us to reject the null hypothesis and accept the alternative hypothesis. More specifically, the p-value is the probability of observing an effect at least as extreme as the one in our sample, assuming the null hypothesis is true. Furthermore, the significance level α is equal to the probability of committing a type I error, i.e. rejecting the null hypothesis when it is true (false positive). In addition, β (beta) is the probability of committing a type II error, i.e. failing to reject the null hypothesis when it is false (false negative). Another important concept is statistical power, which is the probability of correctly rejecting a false null hypothesis, and is defined as 1 − β. After completing the previous steps, we execute the hypothesis test and state our conclusion, either rejecting the null hypothesis or not.
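
To make the relationship between α, β and statistical power more concrete, here is a minimal sketch that uses the Pingouin library (introduced in the next section) to solve for power and sample size. The effect size and group size below are made-up numbers, chosen purely for illustration.

import pingouin as pg

# Hypothetical scenario: a two-sample t-test with a medium effect size
# (Cohen's d = 0.5) and 30 observations per group, tested at alpha = 0.05.
# Leaving power=None tells Pingouin to solve for the achieved power (1 - beta).
power = pg.power_ttest(d=0.5, n=30, power=None, alpha=0.05)
print(f'Achieved power: {power:.2f}')

# Conversely, leaving n=None solves for the sample size per group
# required to reach 80% power with the same effect size.
n_required = pg.power_ttest(d=0.5, n=None, power=0.8, alpha=0.05)
print(f'Required sample size per group: {n_required:.1f}')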

The Pingouin Library

Pingouin logo — image by https://pingouin-stats.org/

Pingouin is an open source Python library that supports a wide variety of hypothesis tests and statistical models³. The library includes numerous tests like ANOVA, t-test, Chi-squared, Kruskal-Wallis, Mann-Whitney, Wilcoxon signed-rank and others, thus covering a broad range of use cases. Furthermore, Pingouin lets you calculate the correlation coefficient between two variables, as well as create linear and logistic regression models. Pingouin is user-friendly yet powerful, as it returns extensive results for all tests, making it a great addition to the scientific Python ecosystem. In the rest of this article, we are going to use Pingouin to run hypothesis tests and interpret the provided results. Feel free to check the official library documentation for extensive details about its functionality, and consider making a contribution if you want.
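
As a quick taste of the API, the following sketch runs a t-test and a correlation on synthetic data. The arrays and their parameters are made up for demonstration, but they show how Pingouin returns its results as pandas dataframes containing multiple statistics.

import numpy as np
import pingouin as pg

# Generate two synthetic samples (purely illustrative numbers)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=11.0, scale=2.0, size=50)

# Independent two-sample t-test: the resulting dataframe includes the
# t-statistic, p-value, confidence interval, effect size and power
print(pg.ttest(group_a, group_b))

# Correlation between the two variables, also returned as a dataframe
print(pg.corr(group_a, group_b))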

The Seeds Dataset

The case study of this article is based on the seeds dataset, which is freely provided by the UCI Machine Learning Repository. This dataset contains information about samples of 3 wheat varieties, i.e. Kama, Rosa and Canadian⁴. Furthermore, the dataset includes various geometric properties for each wheat kernel, including area, perimeter, compactness, kernel length, kernel width and more. This dataset is widely used for machine learning tasks, such as classification and clustering, but we are utilizing it for hypothesis testing. More specifically, our goal is to assess the geometric differences between wheat varieties.
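
In case you want to reproduce the dataframe yourself, the sketch below shows one possible way to load the raw file directly from the UCI repository. The URL, column names and label mapping are my own assumptions based on the dataset description, whereas the rest of the article reads a prepared local CSV file.

import pandas as pd

# Assumed location and structure of the raw seeds dataset on the UCI repository
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt'
columns = ['area', 'perimeter', 'compactness', 'kernel_length',
           'kernel_width', 'asymmetry', 'groove_length', 'variety']

# The raw file is whitespace-separated, so sep='\s+' handles irregular spacing
df_raw = pd.read_csv(url, sep=r'\s+', header=None, names=columns)

# Map the numeric class labels to the wheat variety names
df_raw['variety'] = df_raw['variety'].map({1: 'Kama', 2: 'Rosa', 3: 'Canadian'})
print(df_raw.head())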

A Case Study of ANOVA

We are now going to examine a practical example of hypothesis testing, by utilizing the Pingouin library and the seeds dataset. Our research hypothesis is that compactness values are related to wheat variety, so we state the null and alternative hypotheses:

H₀: All wheat varieties have the same mean compactness.

H₁: At least one wheat variety has a different mean compactness.

After stating our hypotheses, we move on to the coding part, which is based on Python 3.9 and Anaconda. In case you are interested, the full code of this article is available as a Jupyter notebook, so I encourage you to clone the associated GitHub repository.

# Import the required libraries for data handling, statistics and plotting
import pandas as pd
import pingouin as pg
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

# Increase the figure resolution and apply a clean plotting style
# (on Matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid')
mpl.rcParams['figure.dpi'] = 300
plt.style.use('seaborn-whitegrid')

# Load the seeds dataset into a pandas dataframe
df = pd.read_csv('data/seeds.csv')

# Define the dependent (compactness) and independent (variety) variables
dv = 'compactness'
iv = 'variety'

# Calculate the mean of every numeric column for each wheat variety
df.groupby(iv).mean()

We begin by importing the necessary Python libraries and loading the seeds dataset to a pandas dataframe. Afterwards, we use the groupby() function to group the dataset rows by wheat variety, and calculate the mean value of each column. As we can see, the mean values of most variables differ noticeably between varieties. Compactness seems to be an exception, with all wheat varieties having similar mean values, so we are going to examine this variable in detail.

# Create a box plot of compactness for each wheat variety
df.boxplot(column = dv, by = iv, figsize = (8, 6), grid = False)

# Remove the surrounding frame and display the plot
plt.box(False)
plt.show()

We use the boxplot() pandas function to create box plots for the compactness variable. Evidently, the Kama and Rosa varieties have similar quartiles, with median values that are nearly identical. In contrast, the Canadian variety appears to slightly differ from the rest, but we need to verify this with a hypothesis test. We want to compare the mean compactness value of all wheat varieties, i.e. there is a numeric dependent variable and an independent variable with three categories. Hence, the most suitable test for this case is one-way ANOVA.

# Create a KDE plot of compactness for each wheat variety
fig, ax = plt.subplots(figsize = (8, 6))
ax.grid(False)
ax.set_frame_on(False)

sns.kdeplot(data = df, x = dv, hue = iv,
            fill = False, ax = ax)
plt.show()

# Run a Shapiro-Wilk normality test on each group sample
pg.normality(df, dv = dv, group = iv, method = 'shapiro')

As a parametric test, ANOVA is based on various assumptions about the dataset, one of them being that all group samples are normally distributed⁵. We can visually evaluate this by using the kdeplot() Seaborn function to create a KDE plot for each wheat variety. Furthermore, we use the Pingouin normality() function to run a Shapiro-Wilk normality test⁶, which confirms that all samples are normally distributed. Keep in mind that the p-value of the Shapiro-Wilk test may not be accurate for very large samples, so tests like Jarque-Bera or the D’Agostino-Pearson omnibus test are preferable in those cases. Moreover, research has indicated that ANOVA can be robust to violations of this assumption⁷, so slight deviations from the normal distribution are not a serious concern. Still, you should always evaluate whether your dataset satisfies the test assumptions, and consider using a non-parametric test otherwise.
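
For larger samples, the same Pingouin call can be pointed at a different normality test. To the best of my knowledge, 'jarque_bera' and 'normaltest' (the D’Agostino-Pearson omnibus test) are both accepted values of the method argument, but treat this as a sketch and check the documentation of your Pingouin version.

# Alternative normality tests that are better suited to large samples
pg.normality(df, dv = dv, group = iv, method = 'jarque_bera')
pg.normality(df, dv = dv, group = iv, method = 'normaltest')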

# Create a Q-Q plot for each wheat variety
fig, axes = plt.subplots(2, 2, figsize = (10, 8))
axes[1,1].set_axis_off()

categories = df[iv].unique()
for ax, cat in zip(axes.flatten(), categories):
    # Select the compactness values of the current variety
    mask = df[iv] == cat
    sample = df.loc[mask, dv]
    # Compare the sample quantiles against a normal distribution
    pg.qqplot(sample, ax = ax)
    ax.set_title(f"Q-Q Plot for category {cat}")
    ax.grid(False)

Apart from tests like Shapiro-Wilk, creating a Q-Q plot is another way to evaluate sample normality. This is a scatter plot that lets you easily compare the quantiles of the sample distribution with those of a theoretical normal distribution. If the sample distribution is normal, all points will lie near the line y = x. We can easily create a Q-Q plot for various theoretical distributions by using the Pingouin qqplot() function, which also includes a best-fit line based on a linear regression model. Evidently, all sample quantiles are nearly identical to those of the normal distribution, further supporting the results of the Shapiro-Wilk test and the visual assessment of the KDE plots.

# Run a Levene test to check whether the group variances are equal
pg.homoscedasticity(df, dv = dv, group = iv, method = 'levene')

The ANOVA test is also based on the assumption that all samples have equal variance, a property known as homoscedasticity. The homoscedasticity() Pingouin function lets us easily evaluate this by using the Levene test, a typical approach in assessing the equality of variances⁸. According to the Levene test results, group samples don’t satisfy the assumption of homoscedasticity, i.e. they have unequal variances. We can overcome this problem by using the Welch ANOVA test, which is more robust to violations of this assumption, in comparison to the classic ANOVA⁹.

# Run the Welch ANOVA test for compactness between wheat varieties
df_anova = pg.welch_anova(df, dv = dv, between = iv)
df_anova

After executing the Welch ANOVA test, we examine the resulting dataframe to evaluate the results. First of all, the F-value indicates that the variation between sample means is high, compared to the variation within the samples. The partial eta-squared value is a measure of effect size, which can also be used to calculate statistical power. Furthermore, the p-value is almost equal to zero, placing it well below the significance level (α = 0.05). Hence, we can reject the null hypothesis and accept the alternative hypothesis, i.e. at least one wheat variety has a different mean compactness.
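
As a rough illustration of the power calculation mentioned above, the sketch below feeds the reported effect size into the power_anova() Pingouin function, together with the 3 varieties and 70 kernels per variety of the seeds dataset. Keep in mind that this routine targets the classic balanced one-way ANOVA rather than Welch ANOVA, so the result is only an approximation, and that the effect size argument was named eta (instead of eta_squared) in older Pingouin releases.

# Extract the partial eta-squared from the Welch ANOVA results dataframe
eta_sq = df_anova.loc[0, 'np2']

# Approximate the achieved power for 3 groups of 70 samples at alpha = 0.05
achieved_power = pg.power_anova(eta_squared = eta_sq, k = 3, n = 70, alpha = 0.05)
print(f'Achieved power: {achieved_power:.3f}')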

# Run the Games-Howell post-hoc test to compare each pair of varieties
pg.pairwise_gameshowell(df, dv = dv, between = iv)

After rejecting the null hypothesis of ANOVA, it is advisable to execute a post-hoc test to determine which group differences are statistically significant. We opted for the Games-Howell test, because it is robust to heterogeneity of variances, making it complementary to Welch ANOVA¹⁰. Evidently, the difference between the Canadian variety and the other two is statistically significant. In contrast, the mean compactness values of the Kama and Rosa varieties don’t differ significantly.

Conclusion

In this article, I introduced the fundamental concepts of statistical hypothesis testing, by using the Pingouin library and the seeds dataset. Hopefully, I helped you understand those concepts, as hypothesis testing is a challenging topic that leads to numerous misconceptions. Feel free to read Introduction to Modern Statistics, an excellent book that delves deeper into the topic, while being friendly to beginners. I also encourage you to share your thoughts in the comments, or follow me on LinkedIn, where I regularly post content about data science. You can also visit my personal website or check my latest book, titled Simplifying Machine Learning with PyCaret.

References

[1] Biau, David Jean, Brigitte M. Jolles, and Raphaël Porcher. “P value and the theory of hypothesis testing: an explanation for new researchers.” Clinical Orthopaedics and Related Research® 468.3 (2010): 885–892.

[2] Lenhard, Johannes. “Models and statistical inference: The controversy between Fisher and Neyman–Pearson.” The British Journal for the Philosophy of Science (2020).

[3] Vallat, Raphael. “Pingouin: statistics in Python.” J. Open Source Softw. 3.31 (2018): 1026.

[4] Charytanowicz, Małgorzata, et al. “Complete gradient clustering algorithm for features analysis of x-ray images.” Information technologies in biomedicine (2010): 15–24.

[5] Scheffe, Henry. The analysis of variance. Vol. 72. John Wiley & Sons, 1999.

[6] Shapiro, Samuel Sanford, and Martin B. Wilk. “An analysis of variance test for normality (complete samples).” Biometrika 52.3/4 (1965): 591–611.

[7] Schmider, Emanuel, et al. “Is it really robust? Reinvestigating the robustness of ANOVA against violations of the normal distribution assumption.” Methodology: European Journal of Research Methods for the Behavioral and Social Sciences 6.4 (2010): 147.

[8] Levene, Howard. “Robust tests for equality of variances.” Contributions to probability and statistics. Essays in honor of Harold Hotelling (1961): 279–292.

[9] Liu, Hangcheng. “Comparing Welch ANOVA, a Kruskal-Wallis test, and traditional ANOVA in case of heterogeneity of variance.” (2015).

[10] Games, Paul A., and John F. Howell. “Pairwise multiple comparison procedures with unequal n’s and/or variances: a Monte Carlo study.” Journal of Educational Statistics 1.2 (1976): 113–125.
