
Hypothesis Testing in Data Science

Attempting to break down hypothesis testing using examples and Python's SciPy library.

Photo by Scott Graham on Unsplash

In statistics and data analysis, hypothesis testing matters because we rarely have access to every member of a population when we run an experiment, so we take samples of measurements and use them to make inferences about the population. These inferences are hypotheses. In essence, a statistical hypothesis test is a method for testing a hypothesis about a population parameter using data measured in a sample.

In this article, I will review the steps in hypothesis testing, define key terminology, and use examples to show the different types of hypothesis tests.


Regardless of the type of statistical hypothesis test you are performing, there are five main steps to executing them:

  1. Set up a null and alternative hypothesis
  2. Choose a significance level α (or use the one assigned)
  3. Determine the critical test statistic value or p-value (Find the rejection region for the null hypothesis)
  4. Calculate the value of the test statistic
  5. Compare the test statistic value to the critical test statistic value to reject the null hypothesis or not

Let’s break these down and define some key terminology.

1. Set up a null and alternative hypothesis

Null hypothesis: Can be thought of as the "control" of the experiment. It is the hypothesis assumed to be true before we collect data, and it usually contains some form of equality (≥, ≤, =).

Alternative hypothesis: Can be thought of as the "experiment". This is the claim we hope our collected data will support, and it usually has the opposite sign to the null hypothesis.

2. Choose a significance level α (or use the one assigned)

The significance level α is the threshold at which you are okay with rejecting the null hypothesis. It is the probability of rejecting the null hypothesis when it is true. A significance level of 0.05 is most common: it says that you are okay with rejecting the null hypothesis if a result at least this extreme would occur less than 5% of the time by random chance alone when the null hypothesis is true.
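
To make the "probability of rejecting the null hypothesis when it is true" reading concrete, here is a minimal simulation sketch. The seed, sample sizes, and number of experiments are arbitrary assumptions of mine and have nothing to do with the dataset used later. Both samples are drawn from the same distribution, so the null hypothesis is true in every simulated experiment, and the share of rejections should land close to α.

```python
import numpy as np
from scipy import stats

alpha = 0.05
n_experiments = 2000            # arbitrary number of simulated experiments
rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Each row is one experiment; both groups share the same true mean,
# so every rejection below is a false positive (Type I error).
group_a = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, 30))
group_b = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, 30))

# One independent samples t-test per row
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)

false_positive_rate = float(np.mean(p_values < alpha))
print(f"Share of true null hypotheses rejected: {false_positive_rate:.3f}")
```

The printed share will wobble from seed to seed, but it should stay in the neighborhood of 0.05.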

3. Determine the critical value (Find the rejection region for the null hypothesis)

The critical value is the point on the test distribution that the test statistic is compared against to determine whether to reject the null hypothesis.
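
As a rough illustration, critical values can be looked up with the `ppf` (inverse CDF) method of SciPy's distributions. The 20 degrees of freedom used for the t distribution are a made-up value for this sketch.

```python
from scipy import stats

alpha = 0.05

# Two-tailed z-test: alpha is split evenly between the two tails
z_crit_two_tailed = stats.norm.ppf(1 - alpha / 2)

# Right-tailed z-test: all of alpha sits in the upper tail
z_crit_right_tailed = stats.norm.ppf(1 - alpha)

# Two-tailed t-test with a hypothetical 20 degrees of freedom
t_crit_two_tailed = stats.t.ppf(1 - alpha / 2, df=20)

print(f"z (two-tailed):   {z_crit_two_tailed:.3f}")
print(f"z (right-tailed): {z_crit_right_tailed:.3f}")
print(f"t (two-tailed, df=20): {t_crit_two_tailed:.3f}")
```

The rejection region is everything beyond these values, which is why the t critical value is larger than the z one at the same α: the t distribution has heavier tails.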

4. Calculate the test statistic or p-value

The test statistic is a value calculated from the data given and then compared to the critical value to determine whether to reject the null hypothesis.

The p-value is the probability of observing a test statistic at least as extreme as the one observed, by random chance, given the null hypothesis is true.
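
Here is a small sketch of how a p-value follows from a test statistic, using the survival function (`sf`, which is 1 minus the CDF) of the standard normal distribution. The z value of 2.1 is a hypothetical number chosen only for illustration.

```python
from scipy import stats

z_observed = 2.1  # hypothetical test statistic

# Right-tailed test: probability of a z at least this large under H0
p_right_tailed = stats.norm.sf(z_observed)

# Two-tailed test: extreme values in either direction count
p_two_tailed = 2 * stats.norm.sf(abs(z_observed))

print(f"right-tailed p = {p_right_tailed:.4f}")
print(f"two-tailed p   = {p_two_tailed:.4f}")
```

For a symmetric distribution like the normal, the two-tailed p-value is simply twice the one-tailed value.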

How do we know what type of test statistic to calculate? It depends on which kind of hypothesis test we run. Four common types are the z-test, t-test, ANOVA, and chi-square test. A z-test is used when comparing population means and the population standard deviation is known. A t-test is used when comparing population means and the population standard deviation is not known. An ANOVA is used when comparing sample means among three or more groups. Lastly, a chi-square test is a non-parametric test used to test relationships between categorical variables.
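
The t-test, ANOVA, and chi-square test are all demonstrated later with SciPy, so here is a quick sketch of the remaining one, a one-sample z-test worked out by hand. The sample values, the hypothesized mean of 50, and the "known" population standard deviation of 4 are all invented for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and assumed-known population parameters
sample = np.array([52, 48, 55, 51, 49, 53, 50, 54, 47, 56], dtype=float)
hypothesized_mean = 50.0  # value of the mean under the null hypothesis
population_sd = 4.0       # assumed known, which is what makes this a z-test

# z = (sample mean - hypothesized mean) / standard error of the mean
standard_error = population_sd / np.sqrt(len(sample))
z_stat = (sample.mean() - hypothesized_mean) / standard_error

# Two-tailed p-value from the standard normal distribution
z_p_value = 2 * stats.norm.sf(abs(z_stat))

print(f"z = {z_stat:.3f}, p = {z_p_value:.3f}")
```

If the population standard deviation were not known, we would estimate it from the sample and switch to a t-test instead.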

5. Compare the test statistic value to the critical value, or the p-value to the significance level, to determine whether to reject the null hypothesis or not

If the test statistic is more extreme than the critical value (for a two-tailed test, if its absolute value exceeds the critical value), we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

If p < α, we reject the null hypothesis. If p ≥ α, we fail to reject the null hypothesis.
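
The two decision rules are two views of the same comparison, so they should agree. A minimal sketch for a two-tailed z-test follows; the helper name `two_tailed_z_decision` and the z values fed to it are hypothetical.

```python
from scipy import stats

def two_tailed_z_decision(z, alpha=0.05):
    """Return (reject_by_critical_value, reject_by_p_value) for a two-tailed z-test."""
    critical = stats.norm.ppf(1 - alpha / 2)  # boundary of the rejection region
    p_value = 2 * stats.norm.sf(abs(z))       # probability of a z at least this extreme
    return bool(abs(z) > critical), bool(p_value < alpha)

# Hypothetical test statistics
for z in (0.5, -1.7, 2.3):
    by_critical, by_p = two_tailed_z_decision(z)
    print(f"z = {z:+.1f}: reject by critical value = {by_critical}, reject by p-value = {by_p}")
```

Whichever rule you prefer, report the p-value as well, since it shows how far the result is from the threshold.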

And there you have it, five steps to conduct hypothesis testing, so let’s now conduct some hypothesis testing in Python using SciPy.


For our example, we will be using this dataset from Kaggle, which consists of responses to a questionnaire to determine if a child has traits of Autism Spectrum Disorder (ASD). Let’s import the necessary libraries and csv file.

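import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import stats
from scipy.stats import chi2_contingency
from scipy.stats import chi2

# Load the raw questionnaire responses
df_old = pd.read_csv('Toddler Autism dataset July 2018.csv')
# Note: the examples below use `df`, assumed here to be a cleaned copy of df_old
# with renamed columns (e.g. qchat_score, asd_traits); that step is not shown.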
Image by Author

The first statistical test we can conduct is an independent samples t-test, in which we compare the means of two groups along a dependent variable. For our example, using a significance level of 0.05, we will be looking at sex vs. qchat_score (the total questionnaire score used to determine if a child has ASD traits). Let’s determine our null and alternative hypotheses.

Null Hypothesis: There is no statistical difference between male and female mean qchat scores.

Alternative Hypothesis: There is a statistical difference between male and female mean qchat scores.

Next, we need to split the data by sex into two dataframes for the hypothesis test.

male_df = df.loc[(df['Sex'] == 'm')]
female_df = df.loc[(df['Sex']== 'f')]

Now we can use Python’s SciPy library, which is an open-source library for math, science, and engineering, to calculate the test statistic and p-value.

print(stats.ttest_ind(male_df['qchat_score'], female_df['qchat_score'], equal_var=False))
print('WE FAIL TO REJECT THE NULL HYPOTHESIS SINCE OUR P-VALUE IS MORE THAN 0.05')
df.groupby('Sex').asd_traits.describe()
Image by Author

As you can see, since our p-value is more than our significance level (0.05), we fail to reject the null hypothesis. Upon further inspection, we see that there is a higher number of males than females in the sample, which may have contributed to our not detecting a significant difference. We could also have compared our test statistic to the critical value to determine whether to reject the null hypothesis.
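
For the curious, here is a sketch of that critical-value route for the unequal-variance (Welch) t-test used above. It assumes the Welch-Satterthwaite approximation for the degrees of freedom, and it runs on randomly generated stand-in scores rather than the real dataset, so the numbers it prints mean nothing on their own. The helper name `welch_critical_value_test` is my own.

```python
import numpy as np
from scipy import stats

def welch_critical_value_test(a, b, alpha=0.05):
    """Return (t statistic, two-tailed critical value, p-value) for Welch's t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

    # Welch-Satterthwaite approximation of the degrees of freedom
    var_a = a.var(ddof=1) / len(a)
    var_b = b.var(ddof=1) / len(b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )

    critical = stats.t.ppf(1 - alpha / 2, dof)
    return t_stat, critical, p_value

# Stand-in data: random scores from 0 to 10 with unequal group sizes
rng = np.random.default_rng(1)
male_scores = rng.integers(0, 11, size=60)
female_scores = rng.integers(0, 11, size=30)

t_stat, critical, p_value = welch_critical_value_test(male_scores, female_scores)
reject_by_critical = abs(t_stat) > critical
print(f"t = {t_stat:.3f}, critical = {critical:.3f}, p = {p_value:.3f}")
print("Reject H0" if reject_by_critical else "Fail to reject H0")
```

With the real data you would pass `male_df['qchat_score']` and `female_df['qchat_score']` instead of the stand-in arrays.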

Now, let’s try an ANOVA. Again, an ANOVA is conducted when comparing three or more groups against a dependent variable. For our example, using a significance level of 0.05, we will be comparing Ethnicity vs. qchat_score.

Null Hypothesis: There is no statistically significant difference of mean qchat scores between groups of ethnicity.

Alternative Hypothesis: There is a statistically significant difference of mean qchat scores between groups of ethnicity.

# CALCULATE TEST STATISTIC AND PVALUE
print(stats.f_oneway(df['qchat_score'][df['Ethnicity'] == 'White European'],
                     df['qchat_score'][df['Ethnicity'] == 'asian'],
                     df['qchat_score'][df['Ethnicity'] == 'middle eastern'],
                     df['qchat_score'][df['Ethnicity'] == 'south asian'],
                     df['qchat_score'][df['Ethnicity'] == 'black'],
                     df['qchat_score'][df['Ethnicity'] == 'Hispanic'],
                     df['qchat_score'][df['Ethnicity'] == 'Others'],
                     df['qchat_score'][df['Ethnicity'] == 'Latino'],
                     df['qchat_score'][df['Ethnicity'] == 'mixed'],
                     df['qchat_score'][df['Ethnicity'] == 'Pacifica'],
                     df['qchat_score'][df['Ethnicity'] == 'Native Indian']))

print('WE REJECT THE NULL HYPOTHESIS SINCE OUR P-VALUE IS LESS THAN 0.05')
Image by Author

As you can see, since our p-value is less than our significance level of 0.05, we reject the null hypothesis: there is a statistically significant difference of mean qchat scores between ethnicity groups. How can we determine where the differences lie among the ethnicity groups? We must conduct post-hoc analysis to do so.

Using statsmodels, a Python module that provides classes and functions for the estimation of many different statistical models, we will conduct Tukey’s HSD for post-hoc analysis.

from statsmodels.stats.multicomp import MultiComparison

# Compare mean qchat scores for every pair of ethnicity groups
mc = MultiComparison(df['qchat_score'], df['Ethnicity'])
mc_results = mc.tukeyhsd()
print(mc_results)
print('POST HOC TESTS SHOW THAT THE MEAN QCHAT SCORES OF THE ASIAN AND WHITE EUROPEAN GROUPS ARE SIGNIFICANTLY DIFFERENT')
Image by Author

As we can see from the output, the differences lie between the White European and Asian groups.

Lastly, let’s perform a chi-square analysis on our current data. Again, a chi-square analysis is used to test for a relationship between two categorical variables.

Null Hypothesis: There is no relationship between Sex and Ethnicity of those individuals with ASD traits.

Alternative Hypothesis: There is a relationship between Sex and Ethnicity of those individuals with ASD traits.

In order to conduct our analysis, we will have to create a new dataframe.

df_both = df.groupby('Sex').Ethnicity.value_counts()
df_new = df_both.unstack()
values = {'Native Indian': 0}
df_new.fillna(value=values,inplace=True)
Image by Author
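
As an aside, a contingency table like this can also be built in one step with `pd.crosstab`, which fills missing combinations with 0 on its own. The sketch below uses a tiny made-up dataframe (`toy`) in place of the real data, so the counts are purely illustrative.

```python
import pandas as pd

# Tiny made-up stand-in for the article's dataframe
toy = pd.DataFrame({
    'Sex': ['m', 'm', 'f', 'f', 'm', 'f', 'm'],
    'Ethnicity': ['asian', 'Latino', 'asian', 'asian',
                  'Latino', 'Latino', 'Native Indian'],
})

# Rows are Sex, columns are Ethnicity, cells are counts;
# combinations that never occur are filled with 0 automatically.
toy_table = pd.crosstab(toy['Sex'], toy['Ethnicity'])
print(toy_table)
```

Either approach gives a Sex by Ethnicity table of counts, which is the shape `chi2_contingency` expects.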

Awesome, now we can conduct our chi-square analysis.

# Run the test on the Sex x Ethnicity contingency table built above
stat, p, dof, expected = chi2_contingency(df_new)
print(expected)

# Interpret the test statistic
prob = 0.95
critical = chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

# Interpret the p-value
alpha = 1 - prob
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
Image by Author

From these results, we fail to reject our null hypothesis: we do not have evidence of a relationship between Sex and Ethnicity of those individuals with ASD traits.


And there you have it! The steps to conduct a hypothesis test, definitions of key terminology, and some examples in Python using SciPy. Thank you for reading. Here is a link to the notebook used, and stay safe everyone!

