Suppose you are exploring a dataset and you want to examine if two categorical variables are dependent on each other.
The motivation could be a better understanding of the relationship between an outcome variable and a predictor, identification of dependent predictors, etc.
In this case, a Chi-square test can be an effective statistical tool.
In this post, I will discuss how to do this test in Python (both from scratch and using SciPy) with examples on a popular HR analytics dataset – the IBM Employee Attrition & Performance dataset.
Table of Curiosities
- What is the Chi-square test?
- What are the categorical variables that we want to examine?
- How to perform this test from scratch?
- Is there a shortcut to do this?
- What else can we do?
- What are the limitations?
Overview
The Chi-square test is a statistical hypothesis test that applies when the test statistic is Chi-square distributed under the null hypothesis; in particular, the Chi-square test for independence is commonly used to examine whether two categorical variables are independent [1].
The key assumptions associated with this test are: 1. the data are a random sample from the population; 2. each subject belongs to exactly one category of each variable.
To better illustrate this test, I have chosen the IBM HR dataset from Kaggle (link), which includes a sample of employee HR information regarding attrition, work satisfaction, performance, etc. People often use it to uncover insights about the relationship between employee attrition and other factors.
Note that this is a fictional data set created by IBM data scientists [2].
To see the full Python code, check out my Kaggle kernel.
Without further ado, let’s get to the details!
Exploration
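The code snippets below assume a minimal setup like the following. Note that the CSV filename is my assumption based on the Kaggle dataset page, so adjust the path as needed:

import numpy as np
import pandas as pd
from scipy import stats

# Load the IBM HR dataset (assumed filename from Kaggle; adjust if needed)
data = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')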
Let’s first check out the number of employees and the number of attributes:
data.shape
--------------------------------------------------------------------
(1470, 35)
There are 1470 employees and 35 attributes.
Next, we can check what these attributes are and see if there is any missing value associated with each of them:
data.isna().any()
--------------------------------------------------------------------
Age False
Attrition False
BusinessTravel False
DailyRate False
Department False
DistanceFromHome False
Education False
EducationField False
EmployeeCount False
EmployeeNumber False
EnvironmentSatisfaction False
Gender False
HourlyRate False
JobInvolvement False
JobLevel False
JobRole False
JobSatisfaction False
MaritalStatus False
MonthlyIncome False
MonthlyRate False
NumCompaniesWorked False
Over18 False
OverTime False
PercentSalaryHike False
PerformanceRating False
RelationshipSatisfaction False
StandardHours False
StockOptionLevel False
TotalWorkingYears False
TrainingTimesLastYear False
WorkLifeBalance False
YearsAtCompany False
YearsInCurrentRole False
YearsSinceLastPromotion False
YearsWithCurrManager False
dtype: bool
Identify Categorical Variables
Suppose we want to examine if there is a relationship between ‘Attrition’ and ‘JobSatisfaction’.
Counts for the two categories of ‘Attrition’:
data['Attrition'].value_counts()
--------------------------------------------------------------------
No 1233
Yes 237
Name: Attrition, dtype: int64
Counts for the four categories of ‘JobSatisfaction’ ordered by frequency:
data['JobSatisfaction'].value_counts()
--------------------------------------------------------------------
4 459
3 442
1 289
2 280
Name: JobSatisfaction, dtype: int64
Note that for ‘JobSatisfaction’, 1 is ‘Low’, 2 is ‘Medium’, 3 is ‘High’, and 4 is ‘Very High’.
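For readability, these codes can optionally be mapped to their text labels; a small cosmetic sketch, not used in the rest of the analysis:

# Optional: map the ordinal codes to text labels
data['JobSatisfaction'].map({1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'})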
Null Hypothesis and Alternate Hypothesis
For our Chi-square test for independence here, the null hypothesis is that there is no significant relationship between ‘Attrition’ and ‘JobSatisfaction’.
The alternative hypothesis is that there is a significant relationship between ‘Attrition’ and ‘JobSatisfaction’.
Contingency Table
In order to compute the Chi-square test statistic, we would need to construct a contingency table.
We can do that using the ‘crosstab’ function from pandas:
ct = pd.crosstab(data.Attrition, data.JobSatisfaction, margins=True)
ct
--------------------------------------------------------------------
JobSatisfaction    1    2    3    4   All
Attrition
No               223  234  369  407  1233
Yes               66   46   73   52   237
All              289  280  442  459  1470
The numbers in this table represent frequencies. For example, the ’46’ at the intersection of ‘2’ in ‘JobSatisfaction’ and ‘Yes’ in ‘Attrition’ means that, out of the 1470 employees, 46 rated their job satisfaction as ‘Medium’ and did leave the company.
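As a quick sanity check, we can pull that cell directly from the table with standard pandas indexing (the column labels here are the integer satisfaction codes):

ct.loc['Yes', 2]
--------------------------------------------------------------------
46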
Chi-square Statistic
The formula for calculating the Chi-square statistic (X²) is:
X² = Σ [(observed - expected)² / expected]
where the sum runs over all cells of the contingency table.
The term ‘observed‘ refers to the numbers we have seen in the contingency table, and the term ‘expected‘ refers to the expected numbers when the null hypothesis is true.
Under the null hypothesis, there is no significant relationship between ‘Attrition’ and ‘JobSatisfaction’, which means the percentage of attrition should be consistent across the four categories of job satisfaction. As an example, the expected frequency for the cell with ‘JobSatisfaction’ of ‘4’ and ‘Attrition’ of ‘Yes’ should be the number of employees who rated their job satisfaction as ‘Very High’ multiplied by the overall attrition rate (total attrition / total employee count), which is 459 * 237 / 1470, or about 74.
Let’s compute all the expected numbers and store them in a list called ‘exp’:
row_sum = ct.iloc[0:2, 4].values        # row totals: [1233, 237]
exp = []
for j in range(2):
    for val in ct.iloc[2, 0:4].values:  # column totals: [289, 280, 442, 459]
        exp.append(val * row_sum[j] / ct.loc['All', 'All'])  # grand total: 1470
print(exp)
--------------------------------------------------------------------
[242.4061224489796,
234.85714285714286,
370.7387755102041,
384.99795918367346,
46.593877551020405,
45.142857142857146,
71.26122448979592,
74.00204081632653]
Note that the last term (about 74) matches the expected frequency we calculated by hand above, which verifies our computation.
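As a side note, the same table of expected frequencies can be computed in one step as an outer product of the margins; a compact sketch equivalent to the loop above:

# expected[i][j] = (row i total) * (column j total) / grand total
np.outer(row_sum, ct.iloc[2, 0:4].values) / ct.loc['All', 'All']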
Now we can compute X². Note that ‘obs’ below holds the eight observed frequencies from the contingency table, flattened in the same row order as ‘exp’:
obs = np.concatenate([ct.iloc[0, 0:4].values, ct.iloc[1, 0:4].values])
chi_sq_stats = ((obs - np.array(exp))**2 / np.array(exp)).sum()
chi_sq_stats
--------------------------------------------------------------------
17.505077010348
Degrees of Freedom
One parameter we need apart from X² is the degrees of freedom, computed as (number of categories in the first variable - 1) * (number of categories in the second variable - 1), which is (2 - 1) * (4 - 1) in this case, or 3.
dof = (len(row_sum)-1) * (len(ct.iloc[2, 0:4].values)-1)
dof
--------------------------------------------------------------------
3
Interpretation
With both X² and the degrees of freedom, we can use a Chi-square table or calculator to determine the corresponding p-value and conclude whether there is a significant relationship at a specified significance level alpha.
In other words, under the null hypothesis the ‘observed’ values should be close to the ‘expected’ values, which means X² should be reasonably small. When X² exceeds a threshold, the p-value (the probability of observing such a large X² under the null hypothesis) becomes very low, and we reject the null hypothesis.
In Python, we can compute the p-value as follows:
1 - stats.chi2.cdf(chi_sq_stats, dof)
--------------------------------------------------------------------
0.000556300451038716
Suppose the significance level is 0.05. Since the p-value (about 0.00056) is well below 0.05, we reject the null hypothesis and conclude that there is a significant relationship between ‘Attrition’ and ‘JobSatisfaction’.
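Equivalently, instead of comparing the p-value with alpha, we can compare X² directly with the critical value of the Chi-square distribution at that alpha; a small sketch:

alpha = 0.05
crit = stats.chi2.ppf(1 - alpha, dof)   # critical value, about 7.81 for dof = 3
chi_sq_stats > crit                     # True, so we reject the null hypothesis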
Using SciPy
There is a shortcut to perform this test in Python, which leverages the SciPy library (documentation).
obs = np.array([ct.iloc[0][0:4].values,
                ct.iloc[1][0:4].values])
stats.chi2_contingency(obs)[0:3]
--------------------------------------------------------------------
(17.505077010348, 0.0005563004510387556, 3)
Note that the three values are the X² statistic, the p-value, and the degrees of freedom, respectively. These results are consistent with the ones we computed by hand earlier.
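As a bonus, ‘chi2_contingency’ also returns the table of expected frequencies as its fourth output, which lets us double-check the ‘exp’ values we computed by hand:

chi2, p, dof, expected = stats.chi2_contingency(obs)
print(expected)   # should match the expected frequencies computed from scratch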
‘Attrition’ and ‘Education’
It is somewhat intuitive that whether an employee leaves the company is related to job satisfaction. Now let’s look at another example, where we examine if there is a significant relationship between ‘Attrition’ and ‘Education’:
ct = pd.crosstab(data.Attrition, data.Education, margins=True)
obs = np.array([ct.iloc[0][0:5].values,
                ct.iloc[1][0:5].values])
stats.chi2_contingency(obs)[0:3]
--------------------------------------------------------------------
(3.0739613982367193, 0.5455253376565949, 4)
The p-value is over 0.5, so at the significance level of 0.05, we fail to reject the null hypothesis that there is no relationship between ‘Attrition’ and ‘Education’.
Break Down the Analysis by Department
We can also check whether a significant relationship holds when we break the data down by department. For example, we know there is a significant relationship between ‘Attrition’ and ‘WorkLifeBalance’, but we want to examine whether that is agnostic to department. First, let’s see what the departments are and the number of employees in each of them:
dep_counts = data['Department'].value_counts()
dep_counts
--------------------------------------------------------------------
Research & Development 961
Sales 446
Human Resources 63
Name: Department, dtype: int64
To ensure enough samples for the Chi-square test, we will only focus on R&D and Sales in this analysis.
alpha = 0.05
for i in dep_counts.index[0:2]:
    sub_data = data[data.Department == i]
    ct = pd.crosstab(sub_data.Attrition, sub_data.WorkLifeBalance, margins=True)
    obs = np.array([ct.iloc[0][0:4].values,
                    ct.iloc[1][0:4].values])
    print("For " + i + ": ")
    print(ct)
    print('With an alpha value of {}:'.format(alpha))
    if stats.chi2_contingency(obs)[1] <= alpha:
        print("Dependent relationship between Attrition and Work Life Balance")
    else:
        print("Independent relationship between Attrition and Work Life Balance")
    print("")
--------------------------------------------------------------------
For Research & Development:
WorkLifeBalance 1 2 3 4 All
Attrition
No 41 203 507 77 828
Yes 19 32 68 14 133
All 60 235 575 91 961
With an alpha value of 0.05:
Dependent relationship between Attrition and Work Life Balance
For Sales:
WorkLifeBalance 1 2 3 4 All
Attrition
No 10 78 226 40 354
Yes 6 24 50 12 92
All 16 102 276 52 446
With an alpha value of 0.05:
Independent relationship between Attrition and Work Life Balance
From this output, we can see that there is a significant relationship in the R&D department, but not in the Sales department.
Caveats and Limitations
There are a few caveats when conducting this analysis as well as some limitations of this test:
- In order to draw a meaningful conclusion, the number of samples in each scenario needs to be sufficiently large, which might not be the case in reality.
- A significant relationship does not imply causality.
- The Chi-square test by itself does not provide insight beyond ‘significant relationship or not’. For example, it does not tell us that as job satisfaction increases, the proportion of employees who leave the company tends to decrease (a quick descriptive check like the one sketched below can help).
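For that last point, a simple complementary check (purely descriptive, not part of the test itself) is to look at the attrition rate within each job satisfaction level, e.g. with a row-normalized crosstab:

# Share of 'No'/'Yes' attrition within each JobSatisfaction level
pd.crosstab(data.JobSatisfaction, data.Attrition, normalize='index')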
Summary
Let’s quickly recap.
We performed a Chi-square test for independence to examine the relationship between variables in the IBM HR Analytics dataset. We discussed two ways to do it in Python, both from scratch and using SciPy. Last, we showed that when a significant relationship exists, we can also stratify the analysis and check whether it holds within each subgroup.
I hope you enjoyed this blog post and please share any thoughts that you may have 🙂
Check out my other post on building an image classification through Streamlit and PyTorch:
Create an Image Classification Web App using PyTorch and Streamlit
References
[1] Chi-squared test, Wikipedia: https://en.wikipedia.org/wiki/Chi-squared_test
[2] IBM HR Analytics Employee Attrition & Performance, Kaggle: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset