
Chi-Square Test for Independence in Python with Examples from the IBM HR Analytics Dataset

Is employee attrition dependent on factor "X"?

Photo by Campaign Creators on Unsplash

Suppose you are exploring a dataset and you want to examine if two categorical variables are dependent on each other.

The motivation could be a better understanding of the relationship between an outcome variable and a predictor, identification of dependent predictors, etc.

In this case, a Chi-square test can be an effective statistical tool.

In this post, I will discuss how to do this test in Python (both from scratch and using SciPy) with examples on a popular HR analytics dataset – the IBM Employee Attrition & Performance dataset.


Table of Curiosities

  1. What is the Chi-square test?
  2. What are the categorical variables that we want to examine?
  3. How to perform this test from scratch?
  4. Is there a shortcut to do this?
  5. What else can we do?
  6. What are the limitations?

Overview

The Chi-square test is a statistical hypothesis test in which the test statistic follows a Chi-square distribution under the null hypothesis. In particular, the Chi-square test for independence is often used to examine whether two categorical variables are independent of each other [1].

The key assumptions associated with this test are: 1. the data are a random sample from the population; 2. each subject belongs to exactly one category of each variable.

To better illustrate this test, I have chosen the IBM HR dataset from Kaggle (link), which includes a sample of employee HR information regarding attrition, work satisfaction, performance, etc. People often use it to uncover insights about the relationship between employee attrition and other factors.

Note that this is a fictional data set created by IBM data scientists [2].

To see the full Python code, check out my Kaggle kernel.

Without further ado, let’s get to the details!
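The snippets below assume the usual imports and the dataset loaded into ‘data’. Here is a minimal setup sketch; the CSV filename is the one used on the Kaggle dataset page, so adjust the path to wherever you downloaded it:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Load the IBM HR dataset downloaded from Kaggle; adjust the path as needed.
data = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
```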


Exploration

Let’s first check out the number of employees and the number of attributes:

data.shape
--------------------------------------------------------------------
(1470, 35)

There are 1470 employees and 35 attributes.

Next, we can check what these attributes are and see if there is any missing value associated with each of them:

data.isna().any()
--------------------------------------------------------------------
Age                         False
Attrition                   False
BusinessTravel              False
DailyRate                   False
Department                  False
DistanceFromHome            False
Education                   False
EducationField              False
EmployeeCount               False
EmployeeNumber              False
EnvironmentSatisfaction     False
Gender                      False
HourlyRate                  False
JobInvolvement              False
JobLevel                    False
JobRole                     False
JobSatisfaction             False
MaritalStatus               False
MonthlyIncome               False
MonthlyRate                 False
NumCompaniesWorked          False
Over18                      False
OverTime                    False
PercentSalaryHike           False
PerformanceRating           False
RelationshipSatisfaction    False
StandardHours               False
StockOptionLevel            False
TotalWorkingYears           False
TrainingTimesLastYear       False
WorkLifeBalance             False
YearsAtCompany              False
YearsInCurrentRole          False
YearsSinceLastPromotion     False
YearsWithCurrManager        False
dtype: bool

Identify Categorical Variables

Suppose we want to examine if there is a relationship between ‘Attrition’ and ‘JobSatisfaction’.

Counts for the two categories of ‘Attrition’:

data['Attrition'].value_counts()
--------------------------------------------------------------------
No     1233
Yes     237
Name: Attrition, dtype: int64

Counts for the four categories of ‘JobSatisfaction’ ordered by frequency:

data['JobSatisfaction'].value_counts()
--------------------------------------------------------------------
4    459
3    442
1    289
2    280
Name: JobSatisfaction, dtype: int64

Note that for ‘JobSatisfaction’, 1 is ‘Low’, 2 is ‘Medium’, 3 is ‘High’, and 4 is ‘Very High’.

Null Hypothesis and Alternate Hypothesis

For our Chi-square test for independence here, the null hypothesis is that there is no significant relationship between ‘Attrition’ and ‘JobSatisfaction’.

The alternative hypothesis is that there is a significant relationship between ‘Attrition’ and ‘JobSatisfaction’.

Contingency Table

In order to compute the Chi-square test statistic, we would need to construct a contingency table.

We can do that using the ‘crosstab’ function from pandas:

ct = pd.crosstab(data.Attrition, data.JobSatisfaction, margins=True)
ct

The numbers in this table represent frequencies. For example, the ‘46’ at the intersection of ‘2’ in ‘JobSatisfaction’ and ‘Yes’ in ‘Attrition’ means that out of the 1470 employees, 46 rated their job satisfaction as ‘Medium’ and did leave the company.

Chi-square Statistic

The formula for calculating the Chi-square statistic (X²) is shown as follows:

X² = sum of [(observed-expected)² / expected]

The term ‘observed‘ refers to the numbers we have seen in the contingency table, and the term ‘expected‘ refers to the expected numbers when the null hypothesis is true.

Under the null hypothesis, there is no significant relationship between ‘Attrition’ and ‘JobSatisfaction’, which means the percentage of attrition should be consistent across the four categories of job satisfaction. As an example, the expected frequency for the cell (‘4’, ‘Yes’) is the number of employees who rated their job satisfaction as ‘Very High’ multiplied by the overall attrition rate (total attrition / total employee count), i.e. 459 × 237/1470, or about 74.

Let’s compute all the expected numbers and store them in a list called ‘exp’:

row_sum = ct.iloc[0:2, 4].values        # row totals: No, Yes
exp = []
for j in range(2):                      # loop over the two Attrition rows
    for val in ct.iloc[2, 0:4].values:  # column totals per JobSatisfaction level
        exp.append(val * row_sum[j] / ct.loc['All', 'All'])
print(exp)
--------------------------------------------------------------------
[242.4061224489796,
 234.85714285714286,
 370.7387755102041,
 384.99795918367346,
 46.593877551020405,
 45.142857142857146,
 71.26122448979592,
 74.00204081632653]

Note that the last term (74) verifies that our calculation is correct.
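Equivalently, the whole expected table can be built in one step with an outer product of the marginal totals shown in the value_counts outputs above. A compact sketch:

```python
import numpy as np

# Marginal totals taken from the value_counts outputs above.
row_totals = np.array([1233, 237])           # Attrition: No, Yes
col_totals = np.array([289, 280, 442, 459])  # JobSatisfaction: 1, 2, 3, 4
total = 1470

# Under independence, E[i, j] = row_total[i] * col_total[j] / grand_total.
expected = np.outer(row_totals, col_totals) / total
print(expected.round(2))
```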

Now we can compute X², flattening the observed counts into the same row-by-row order as ‘exp’:

obs = ct.iloc[0:2, 0:4].values.flatten()
exp = np.array(exp)
((obs - exp)**2/exp).sum()
--------------------------------------------------------------------
17.505077010348

Degrees of Freedom

Apart from X², we also need the degrees of freedom, computed as (number of categories in the first variable - 1) × (number of categories in the second variable - 1); in this case, that is (2 - 1) × (4 - 1), or 3.

(len(row_sum)-1)*(len(ct.iloc[2,0:4].values)-1)
--------------------------------------------------------------------
3

Interpretation

With both X² and degrees of freedom, we can use a Chi-square table/calculator to determine its corresponding p-value and conclude if there is a significant relationship given a specified significance level of alpha.

In other words, under the null hypothesis the ‘observed’ counts should be close to the ‘expected’ counts, so X² should be reasonably small. When X² exceeds a threshold, the p-value (the probability of observing such a large X² under the null hypothesis) is very low, and we reject the null hypothesis.

In Python, we can compute the p-value as follows (here ‘chi_sq_stats’ holds the X² statistic and ‘dof’ the degrees of freedom computed above):

1 - stats.chi2.cdf(chi_sq_stats, dof)
--------------------------------------------------------------------
0.000556300451038716

Suppose the significance level is 0.05. Since the p-value is well below it, we can conclude that there is a significant relationship between ‘Attrition’ and ‘JobSatisfaction’.
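As a side note, SciPy’s survival function stats.chi2.sf gives the same tail probability as 1 - cdf, with better numerical behavior for very small p-values. A quick check using the statistic and degrees of freedom computed above:

```python
from scipy import stats

chi_sq_stats = 17.505077010348  # the X² statistic computed above
dof = 3
p_value = stats.chi2.sf(chi_sq_stats, dof)
print(p_value)  # ~0.000556
```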

Using Scipy

There is a shortcut to perform this test in Python, which leverages the SciPy library (documentation).

obs = np.array([ct.iloc[0][0:4].values,
                  ct.iloc[1][0:4].values])
stats.chi2_contingency(obs)[0:3]
--------------------------------------------------------------------
(17.505077010348, 0.0005563004510387556, 3)

Note that the three returned values are the X² statistic, the p-value, and the degrees of freedom, respectively. These results are consistent with the ones we computed by hand earlier.
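chi2_contingency also returns the table of expected frequencies as its fourth element, which spares the manual computation. A sketch on a small hypothetical 2×2 table (not from the HR dataset); note that for 2×2 tables SciPy applies Yates’ continuity correction by default:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table, for illustration only.
obs = np.array([[30, 10],
                [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(obs)
print(expected)  # the marginals imply [[20, 20], [30, 30]]
```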

‘Attrition’ and ‘Education’

It is somewhat intuitive that whether an employee leaves the company is related to job satisfaction. Now let’s look at another example, where we examine if there is a significant relationship between ‘Attrition’ and ‘Education’:

ct = pd.crosstab(data.Attrition, data.Education, margins=True)
obs = np.array([ct.iloc[0][0:5].values,
                  ct.iloc[1][0:5].values])
stats.chi2_contingency(obs)[0:3]
--------------------------------------------------------------------
(3.0739613982367193, 0.5455253376565949, 4)

The p-value is over 0.5, so at a significance level of 0.05, we fail to reject the null hypothesis that there is no relationship between ‘Attrition’ and ‘Education’.

Break Down the Analysis by Department

We can also check whether a significant relationship holds when we break the data down by department. For example, we know there is a significant relationship between ‘Attrition’ and ‘WorkLifeBalance’, but we want to examine whether it holds in every department. First, let’s see what the departments are and the number of employees in each of them:

dep_counts = data['Department'].value_counts()
dep_counts
--------------------------------------------------------------------
Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64

To ensure enough samples for the Chi-square test, we will only focus on R&D and Sales in this analysis.
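Before running the test on a small group such as Human Resources, it is worth checking the common rule of thumb that every expected cell count is at least 5. A sketch on hypothetical small-department counts (illustration only, not the real HR numbers):

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts for a small department.
obs = np.array([[3, 12, 30, 8],
                [1,  2,  5, 2]])

# The fourth return value of chi2_contingency is the expected table.
expected = stats.chi2_contingency(obs)[3]
print((expected >= 5).all())  # False -> the Chi-square approximation is shaky
```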

alpha = 0.05
for i in dep_counts.index[0:2]:
    sub_data = data[data.Department == i]
    ct = pd.crosstab(sub_data.Attrition, sub_data.WorkLifeBalance, margins=True)
    obs = np.array([ct.iloc[0][0:4].values,ct.iloc[1][0:4].values])
    print("For " + i + ": ")
    print(ct)
    print('With an alpha value of {}:'.format(alpha))
    if stats.chi2_contingency(obs)[1] <= alpha:
        print("Dependent relationship between Attrition and Work Life Balance")
    else:
        print("Independent relationship between Attrition and Work Life Balance")
    print("")
--------------------------------------------------------------------
For Research & Development: 
WorkLifeBalance   1    2    3   4  All
Attrition                             
No               41  203  507  77  828
Yes              19   32   68  14  133
All              60  235  575  91  961
With an alpha value of 0.05:
Dependent relationship between Attrition and Work Life Balance

For Sales: 
WorkLifeBalance   1    2    3   4  All
Attrition                             
No               10   78  226  40  354
Yes               6   24   50  12   92
All              16  102  276  52  446
With an alpha value of 0.05:
Independent relationship between Attrition and Work Life Balance

From this output, we can see that there is a significant relationship in the R&D department, but not in the Sales department.


Caveats and Limitations

There are a few caveats when conducting this analysis as well as some limitations of this test:

  1. In order to draw a meaningful conclusion, the number of samples in each cell needs to be sufficiently large (a common rule of thumb is an expected count of at least 5 per cell), which might not be the case in reality.
  2. A significant relationship does not imply causality.
  3. The Chi-square test itself does not provide insights beyond ‘significant relationship or not’. For example, it does not tell us that as job satisfaction increases, the proportion of employees who leave the company tends to decrease.
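To recover that direction on top of the test result, one can simply compute the attrition rate per satisfaction level. A sketch on a hypothetical mini-frame (on the real data, replace ‘df’ with ‘data’):

```python
import pandas as pd

# Hypothetical mini-frame mimicking the two columns, for illustration only.
df = pd.DataFrame({
    'JobSatisfaction': [1, 1, 2, 2, 3, 3, 4, 4],
    'Attrition':       ['Yes', 'No', 'Yes', 'No', 'No', 'No', 'No', 'No'],
})

# Share of leavers within each satisfaction level.
rate = df['Attrition'].eq('Yes').groupby(df['JobSatisfaction']).mean()
print(rate)
```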

Summary

Let’s quickly recap.

We performed a Chi-square test for independence to examine the relationship between variables in the IBM HR Analytics dataset. We discussed two ways to do it in Python, both from scratch and using SciPy. Last, we showed that when a significant relationship exists, we can also stratify it and check if it is true for each level.

I hope you enjoyed this blog post and please share any thoughts that you may have 🙂

Check out my other post on building an image classification through Streamlit and PyTorch:

Create an Image Classification Web App using PyTorch and Streamlit


References

[1] https://en.wikipedia.org/wiki/Chi-squared_test
[2] https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
