Introduction
So far we have covered the Chi-Square Distribution and have used that to explain the Chi-Square goodness of fit test. You can check out both of these articles here:
I would highly recommend reading over those articles before this one!
In this post we will cover the other well-known Chi-Square test, the test of independence. This test determines whether two categorical variables are related in some way, i.e. whether they are independent or dependent. You can think of it loosely as the categorical analogue of the correlation between two continuous variables.
In this article we will run through the procedure for carrying out the test of independence and end with an example problem to show how to implement it in practice!
But first, make sure to subscribe to my YouTube Channel!
Click on the link for video tutorials that teach you core Data Science concepts in a digestible manner!
Assumptions
- Both variables are CATEGORICAL
- Observations are INDEPENDENT
- The EXPECTED COUNT for each cell is GREATER THAN 5
- The categories are MUTUALLY EXCLUSIVE, so each observation belongs to only one cell
- Data is chosen RANDOMLY
Hypothesis Testing Steps
Here we will layout the basic steps involved in almost every hypothesis test:
- State the null, H₀, and alternative, H₁, hypotheses.
- Determine your significance level and calculate the corresponding critical value (or critical probability) for your distribution.
- Calculate the test statistic for your test, in our case this will be the Chi-Square statistic.
- Compare the test statistic (or P-value) to the critical value to either reject or fail to reject the null hypothesis.
These are the basic surface-level steps for any hypothesis test. I haven’t gone into detail explaining every topic as that would make this article very long! However, for the unfamiliar reader, I have linked sites for each step so you can gain some deeper intuition about these ideas.
I also have other posts that cover the concepts in hypothesis testing in a more broken down format that you can check out here:
Chi-Square Test Statistic and Degrees of Freedom
For the Chi-Square Test, the test statistic we need to compute is:
\chi^2_v = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}
- v is the degrees of freedom
- O_i is the observed (sampled) count for cell i
- E_i is the computed expected count for cell i
- n is the number of cells in the contingency table (we sum over every combination of the two variables’ categories)
Note: the Chi-Square distribution comes from the squaring of the numerator and this also ensures we only have positive values which ‘add’ to the statistic.
The degrees of freedom, v, is computed as:
v = (r - 1)(c - 1)
- r is the number of rows in the contingency table (the number of categories in variable 1)
- c is the number of columns in the contingency table (the number of categories in variable 2)
Both of these formulas will make much more sense when we go over an example problem next.
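Before that, for readers who prefer code, here is a minimal NumPy sketch of both formulas. The counts are made up purely for illustration and are not the data from this article; the expected counts are assumed to have already been computed from the row and column totals (covered in the example problem below).

```python
import numpy as np

# Made-up observed counts for a 2 x 3 contingency table (illustration only)
observed = np.array([[25, 30, 20],
                     [20, 15, 10]])

# Matching expected counts under independence (how these come from the
# row and column totals is covered in the example problem below)
expected = np.array([[28.125, 28.125, 18.75],
                     [16.875, 16.875, 11.25]])

# Chi-Square statistic: sum of (O - E)^2 / E over every cell
chi_square_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
r, c = observed.shape
dof = (r - 1) * (c - 1)

print(chi_square_stat, dof)
```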
Example Problem
We want to see if age has an impact on what political party you vote for.
Data
We collect a random sample of 135 people and display it in the following contingency table broken down by age and political party:

Note: This is purely synthetic data I made up myself and is of no relation to any real political party.
Hypothesis
Let’s start by stating our hypotheses:
- H₀: Age has no impact on the political party you vote for. The two variables are independent.
- H₁: Age does have an impact on the political party you vote for. The two variables are dependent.
Significance Level and Critical Value
For this example we will use a 5% significance level. As we have 2 degrees of freedom (using the formula above):
v = (r - 1)(c - 1) = 2
Using the significance level, the degrees of freedom and a Chi-Square probability table, we find our critical value to be 5.991. This means our Chi-Square statistic needs to be greater than 5.991 for us to reject the null hypothesis and conclude that the variables are not independent.
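If you don’t have a probability table to hand, the same critical value can be found with SciPy, since it is just the inverse CDF of the Chi-Square distribution (nothing here is specific to this example beyond the significance level and degrees of freedom):

```python
from scipy.stats import chi2

significance = 0.05  # 5% significance level
dof = 2              # degrees of freedom from the formula above

# Critical value: the point beyond which only 5% of the distribution lies
critical_value = chi2.ppf(1 - significance, df=dof)
print(round(critical_value, 3))  # ~5.991
```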
Calculating Expected Counts
We now need to determine the expected count for each cell in our contingency table. These are the counts we would expect if the null hypothesis were true, and they are calculated using the following formula:

E_{r,c} = \frac{n_r \times n_c}{n_T}
Where n_r and n_c are the row and column totals for the relevant categories and n_T is the total number of counts.
For example, the expected count for ages 18–30 who voted Liberals is:

We can then populate the contingency table with these expected values (in brackets):

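As a rough sketch, the expected counts for a whole table can be produced in one go with NumPy, since every cell is just (row total × column total) / grand total. The counts below are made-up placeholders, not the survey data from this article:

```python
import numpy as np

# Hypothetical observed counts (rows = age groups, columns = parties)
observed = np.array([[25, 30, 20],
                     [20, 15, 10]])

row_totals = observed.sum(axis=1)  # n_r for each row
col_totals = observed.sum(axis=0)  # n_c for each column
grand_total = observed.sum()       # n_T

# Expected count for every cell under independence: (n_r * n_c) / n_T
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```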
Chi-Square Statistic
It is now time to calculate the Chi-Square statistic using the formula above:

This equals 37.2!
Therefore, our statistic is much greater than the critical value (37.2 > 5.991) and so we can reject the null hypothesis!
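In practice you would rarely do all of this by hand. Here is a hedged end-to-end sketch using SciPy’s chi2_contingency (again with placeholder counts rather than the article’s data), which computes the statistic, p-value, degrees of freedom and expected counts in one call:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Placeholder contingency table (age groups x parties), not the real survey data
observed = np.array([[25, 30, 20],
                     [20, 15, 10]])

# Statistic, p-value, degrees of freedom and expected counts in one call
stat, p_value, dof, expected = chi2_contingency(observed)

critical_value = chi2.ppf(0.95, df=dof)  # 5% significance level

if stat > critical_value:  # equivalently: p_value < 0.05
    print("Reject the null hypothesis: the variables appear dependent")
else:
    print("Fail to reject the null hypothesis: no evidence of dependence")
```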
Conclusion
In this article we have described and shown an example of the Chi-Square test of independence. This test measures whether two categorical variables are dependent on each other. It is used in Data Science for Feature Selection, where we only want to keep modelling features that have an effect on the target.
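As a loose illustration of that feature-selection use case, you could run the test between each categorical feature and the target and keep only the features whose p-value falls below your significance level. The function name, column handling and threshold below are hypothetical, a minimal sketch rather than a production recipe:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def select_categorical_features(df, target, alpha=0.05):
    """Keep categorical features whose Chi-Square test of independence
    against the target has a p-value below alpha (rough sketch only)."""
    selected = []
    for feature in df.columns.drop(target):
        table = pd.crosstab(df[feature], df[target])  # contingency table
        _, p_value, _, _ = chi2_contingency(table)
        if p_value < alpha:
            selected.append(feature)
    return selected
```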
Another Thing!
I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.