
Chi-Square Test of Independence

A simple and concise explanation of the Chi-Square Test of Independence

Photo by Tra Nguyen on Unsplash

Introduction

So far we have covered the Chi-Square Distribution and have used that to explain the Chi-Square goodness of fit test. You can check out both of these articles here:

Chi-Square Distribution Simply Explained

Chi-Square Goodness of Fit Test

I would highly recommend reading over those articles before this one!

In this post we will cover the other well-known Chi-Square test, the test of independence. This test determines whether two categorical variables are related in some way, i.e. whether they are independent or dependent. You can think of this loosely as the categorical version of the correlation between two continuous variables.

In this article we will run through the procedure for carrying out the test of independence and end with an example problem to show how to implement it in practice!



Egor Howell


Assumptions

  • Both variables are CATEGORICAL
  • Observations are INDEPENDENT
  • The EXPECTED COUNT in each cell is AT LEAST 5
  • Categories are MUTUALLY EXCLUSIVE: each observation is counted in exactly one cell
  • Data is sampled RANDOMLY

Hypothesis Testing Steps

Here we will lay out the basic steps involved in almost every hypothesis test:

  • State the null, H_0, and alternate, H_1, hypotheses.
  • Determine your significance level and calculate the corresponding critical value (or critical probability) for your distribution.
  • Calculate the test statistic for your test; in our case, this will be the Chi-Square statistic.
  • Compare the test statistic to the critical value (or the P-value to the significance level) to either reject or fail to reject the null hypothesis.

These are the basic, surface-level steps for any hypothesis test. I haven’t gone into detail on every topic, as that would make this article very long! However, for the unfamiliar reader, I have linked sites for each step so you can explore these ideas in more depth.

I also have other posts that cover the concepts in hypothesis testing in a more broken down format that you can check out here:

Z-Test Simply Explained

Confidence Intervals Simply Explained

Chi-Square Test Statistic and Degrees of Freedom

For the Chi-Square Test, the test statistic we need to compute is:

$$\chi^2_v = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

Equation generated by author in LaTeX.
  • v is the degrees of freedom
  • O is the observed sampled values
  • E is the computed expected values
  • n is the number of cells in the contingency table (one for each combination of categories)

Note: the statistic approximately follows a Chi-Square distribution because it is a sum of squared terms. Squaring the numerator also ensures that every term is non-negative, so each cell can only ‘add’ to the statistic.

The degrees of freedom, v, is computed as:

$$v = (r - 1)(c - 1)$$

Equation generated by author in LaTeX.
  • r is the number of rows in the contingency table (the number of categories in variable 1)
  • c is the number of columns in the contingency table (the number of categories in variable 2)

Both of these formulas will make much more sense when we go over an example problem next.
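
Before the worked example, here is a minimal Python sketch of the two formulas above. The function names and the small 2x2 set of counts are my own hypothetical choices for illustration and are not the data used later in this article.

```python
from typing import Sequence


def chi_square_statistic(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Sum of (O - E)^2 / E over every cell of the contingency table."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(n_rows: int, n_cols: int) -> int:
    """v = (r - 1)(c - 1) for an r x c contingency table."""
    return (n_rows - 1) * (n_cols - 1)


# Hypothetical 2x2 table, flattened row by row (illustration only).
observed = [10, 20, 30, 40]
expected = [12, 18, 28, 42]

stat = chi_square_statistic(observed, expected)
dof = degrees_of_freedom(2, 2)
print(f"statistic = {stat:.4f}, degrees of freedom = {dof}")
```

Each cell contributes its own squared, scaled difference, so a single badly fitting cell can be enough to push the statistic up.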

Example Problem

We want to see if age has an impact on what political party you vote for.

Data

We collect a random sample of 135 people and display it in the following contingency table broken down by age and political party:

Table created by author.

Note: This is purely synthetic data I made up myself and bears no relation to any real political party.

Hypothesis

Let's start by stating our hypotheses:

  • H_0: Age has no impact on the political party you vote for. The two variables are independent.
  • H_1: Age does have an impact on the political party you vote for. The two variables are dependent.

Significance Level and Critical Value

For this example we will use a 5% significance level. Using the formula above, we have 2 degrees of freedom:

$$v = (r - 1)(c - 1) = 2$$

Equation generated by author in LaTeX.

Using the significance level, the degrees of freedom and a Chi-Square probability table, we find our critical value to be 5.991. This means our Chi-Square statistic needs to be greater than 5.991 for us to reject the null hypothesis and conclude that the variables are not independent.
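
If you don't have a probability table to hand, the critical value can be checked in a couple of lines. The sketch below relies on a shortcut that holds only for 2 degrees of freedom, where the upper-tail probability of the Chi-Square distribution is exp(-x / 2). The function names are hypothetical, and for other degrees of freedom you would need a table or a statistics library instead.

```python
import math


def critical_value_df2(alpha: float) -> float:
    """Upper-tail critical value of the Chi-Square distribution with 2 degrees of freedom.

    Valid for 2 degrees of freedom ONLY: there P(X > x) = exp(-x / 2),
    so solving exp(-x / 2) = alpha gives x = -2 * ln(alpha).
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    return -2.0 * math.log(alpha)


def reject_null(statistic: float, critical: float) -> bool:
    """Reject H_0 when the test statistic exceeds the critical value."""
    return statistic > critical


critical = critical_value_df2(0.05)
print(round(critical, 3))  # 5.991
```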

Calculating Expected Counts

We now need to determine the expected count for each cell in our contingency table. These are the values we would expect if the null hypothesis were true, and they are calculated using the following formula:

$$E = \frac{n_r \, n_c}{n_T}$$

Equation generated by author in LaTeX.

Where n_r and n_c are the row and column totals for the cell's categories and n_T is the total number of counts.

For example, the expected count for ages 18–30 who voted Liberals is:

Equation generated by author in LaTeX.

We can then populate the contingency table with these expected values (in brackets):

Table produced by author.
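
The same calculation can be sketched in Python for every cell at once. The counts below are hypothetical numbers I picked for illustration (they are not the table from this article), and `expected_counts` is simply a name I chose for the helper.

```python
from typing import List, Sequence


def expected_counts(table: Sequence[Sequence[float]]) -> List[List[float]]:
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical 2 x 3 table of observed counts (illustration only).
observed = [
    [20, 30, 10],
    [10, 20, 30],
]

expected = expected_counts(observed)
for row in expected:
    print(row)
```

Here both rows have the same total, so under independence they get identical expected counts, even though the observed counts differ quite a bit between the rows.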

Chi-Square Statistic

It is now time to calculate the Chi-Square statistic using the formula above:

Equation generated by author in LaTeX.

This equals 37.2!

Therefore, our statistic is much greater than the critical value and so we can reject the null hypothesis!
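
We can reach the same decision through the P-value instead of the critical value. The sketch below again assumes 2 degrees of freedom, where the upper-tail probability is exp(-x / 2), and plugs in the statistic of 37.2 reported above. The function name is a hypothetical one of my own.

```python
import math


def p_value_df2(statistic: float) -> float:
    """P(X > statistic) for a Chi-Square variable with 2 degrees of freedom ONLY."""
    if statistic < 0:
        raise ValueError("the Chi-Square statistic cannot be negative")
    return math.exp(-statistic / 2.0)


statistic = 37.2  # value reported in this article
alpha = 0.05      # 5% significance level

p_value = p_value_df2(statistic)
reject = p_value < alpha
print(f"p-value = {p_value:.2e}, reject H_0: {reject}")
```

The P-value comes out far below 0.05, which agrees with the critical value comparison. If SciPy is available, `scipy.stats.chi2_contingency` runs the whole test on a table of observed counts and handles any degrees of freedom.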

Conclusion

In this article we have described and shown an example of the Chi-Square test of independence. This test assesses whether two categorical variables are dependent on each other. It is used in Data Science for feature selection, where we only want to keep modelling features that are associated with the target.
