
Chi-square Test for Independence

Use of pingouin library for Chi-square analysis implementation

Image from unsplash

Introduction

Data scientists often need to examine whether one categorical variable is related to another in the same population. For continuous data, one can calculate the correlation coefficient and judge from its magnitude how strongly the variables are related. The Chi-square test plays the corresponding role for categorical variables. For example, we may want to check whether gender plays a role in heart disease, or whether education is related to marital status. In cases like these, the Chi-square test of independence is an appropriate analysis tool.

Background

Before jumping into the Chi-square test, I would like to provide a short refresher on the related terminology. Both the analysis and the interpretation of its output in Python require an understanding of these terms.

P-value, Confidence Interval and Significance Level

The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. If the p-value is small, data like ours would rarely arise by chance alone under the null hypothesis, and therefore we conclude that there is a statistically significant difference between the observed data and the expected data.

A confidence interval (CI) is a range, computed from the sample, that is intended to contain the true value of the quantity being estimated. A 95% CI does not mean that we are 95% confident about one particular test outcome. Rather, if we repeated the study many times and built an interval each time, about 95 out of 100 of those intervals would contain the true value. The confidence level is set at 95% in most cases.

The significance level (alpha) is the probability of rejecting the null hypothesis when it is in fact true. It is usually set at 5%, which corresponds to a 95% confidence level.

Chi-square test

There are a few types of Chi-square tests. The goodness-of-fit test checks whether the observed distribution of one categorical variable matches an expected (population) distribution. The Chi-square test of independence checks whether two categorical variables are independent of each other. In this article, I will go through the Chi-square test of independence and check whether one categorical variable is related to another by examining the Chi-square statistic as well as the p-value.

Implementation in Python

Let’s import the data for heart disease. It contains heart-related variables such as systolic and diastolic blood pressure, diabetes, BMI, heart rate, glucose level, smoking habit and more for several thousand individuals.

Heart Disease Dataset

SciPy provides the chi2_contingency function in scipy.stats, which expects a contingency table as input. A contingency table summarizes the relationship between two categorical variables by counting the observations in each combination of categories. The Pingouin library goes one step further: it builds the contingency table itself when we provide only the raw data.
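As a rough sketch of the SciPy route, the contingency table can be passed to chi2_contingency directly. The integer counts below are an assumption: they are inferred from the observed table reported for this dataset by undoing the 0.5 continuity adjustment, and the row/column layout (gender in rows, TenYearCHD in columns) is chosen for illustration.

```python
from scipy.stats import chi2_contingency

# Contingency table: rows = gender group 0 and 1, columns = TenYearCHD 0 and 1.
# Assumption: raw counts inferred from the article's observed table
# (2118.5, 301.5, 1477.5, 342.5) with the 0.5 correction undone.
observed = [
    [2119, 301],
    [1477, 343],
]

# correction=True applies Yates' continuity correction for a 2x2 table
chi2, p_value, dof, expected = chi2_contingency(observed, correction=True)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3g}")
print(expected)
```

With SciPy we have to build the table ourselves, for example with pandas.crosstab, before running the test.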

From our data, let’s say we want to check whether Coronary Heart Disease (CHD) depends on gender. Using pingouin, the code is a one-liner.

Use of pingouin to get Chi-square statistics

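The chi2_independence function returns three tables: expected, observed and stats. The expected table holds the counts we would expect in each cell of the contingency table if the two categorical variables of interest were independent. It is computed from the row and column totals of the initial data.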

To interpret the expected table, we first need the gender ratio in the initial data. Our data shows that the ratio between group 0 and group 1 is 2420:1820, or about 1.33. If gender were a bad predictor for CHD, the gender ratio would be similar in both CHD groups.

We get the same ratio of about 1.33:1 in the expected table when we take the ratio between gender group 0 and group 1. For example, the gender ratio in group 0 of TenYearCHD is 2052.42/1543.56, which is approximately 1.33, and the same ratio holds for the other group.
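The expected counts can be reproduced by hand from the marginal totals, as sketched here. The gender totals are the ones quoted above. The TenYearCHD totals (3596 and 644) are obtained by adding up the observed counts reported for this dataset, and the variable names are only illustrative.

```python
# Marginal totals of the contingency table
gender_totals = {0: 2420, 1: 1820}  # gender group 0 and group 1
chd_totals = {0: 3596, 1: 644}      # TenYearCHD group 0 and group 1
n = sum(gender_totals.values())     # total number of individuals

# Expected count under independence: row total * column total / grand total
expected = {
    (g, c): gender_totals[g] * chd_totals[c] / n
    for g in gender_totals
    for c in chd_totals
}

# Within each CHD group the expected gender ratio equals the overall ratio
for c in chd_totals:
    ratio = expected[(0, c)] / expected[(1, c)]
    print(f"TenYearCHD={c}: expected gender ratio = {ratio:.3f}")
```

Because every expected cell is proportional to its row total, the gender ratio is identical in both CHD groups by construction.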

The null hypothesis states that we expect the same ratio in the observed table. We test the null hypothesis with the Chi-square statistic, which is compared with the critical value from the Chi-square table for the given degrees of freedom and significance level. The observed table shows the relationship between gender and CHD that is actually observed in the data. (The half counts appear because pingouin by default applies Yates' continuity correction when there is a single degree of freedom.) If we calculate the ratio of gender group 0 to group 1 from the observed table, we obtain 2118.5/1477.5 = 1.434 in group 0 of TenYearCHD and 301.5/342.5 = 0.880 in group 1, both of which differ from the expected ratio. Next, we need to find the test statistic and p-value from the stats table.

Chi-square statistic and p-value

The Pearson Chi-square statistic is the most common test statistic. It is built from Pearson residuals: a Pearson residual is the difference between the observed and expected value, normalized by the square root of the expected value, and the statistic is the sum of the squared residuals over all cells of the table.

Pearson residual = (Observed – Expected)/(sqrt(Expected))
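To make the formula concrete, here is a small sketch that computes the Pearson residuals and sums their squares. It uses the observed counts reported for this dataset (which already include the 0.5 continuity adjustment) and derives the expected counts from the row and column totals. The layout of the nested lists is my own choice.

```python
import math

# Observed table as reported: rows = gender group 0 and 1,
# columns = TenYearCHD group 0 and 1
observed = [
    [2118.5, 301.5],
    [1477.5, 342.5],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson residual = (Observed - Expected) / sqrt(Expected)
residuals = [
    [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# The Chi-square statistic is the sum of squared residuals
chi2 = sum(r ** 2 for row in residuals for r in row)
print(f"chi2 = {chi2:.3f}")
```

The sign of each residual shows whether a cell holds more or fewer cases than independence would predict.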

With a single degree of freedom and a 5% significance level, the critical value for the Chi-square statistic is 3.841, and the test statistic obtained here is 32.618, which is much higher. This statistic measures the extent to which the observed data deviates from the expected values. We also observe a very small p-value, which provides evidence against the null hypothesis: the smaller the p-value, the less likely it is that a deviation this large would arise by chance alone if the variables were independent. Therefore, in this case, we have strong evidence to reject the null hypothesis and state that the observed difference is real. Essentially, we can conclude that gender and CHD are not independent in this data, which makes gender a candidate predictor for CHD.
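The comparison with the critical value can be checked with the standard library alone. The sketch relies on the fact that a Chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability can be written with the complementary error function. The helper name chi2_sf_1dof is hypothetical, and the shortcut only holds for one degree of freedom.

```python
import math

def chi2_sf_1dof(x: float) -> float:
    """Upper-tail probability P(X > x) for a Chi-square variable with 1 dof."""
    # X = Z**2 with Z standard normal, so P(X > x) = P(|Z| > sqrt(x))
    return math.erfc(math.sqrt(x / 2.0))

alpha = 0.05
critical_value = 3.841    # Chi-square table, 1 dof, 5% significance level
test_statistic = 32.618   # Pearson statistic from the stats table

p_at_critical = chi2_sf_1dof(critical_value)  # should be close to alpha
p_value = chi2_sf_1dof(test_statistic)
reject_null = test_statistic > critical_value

print(f"p at critical value = {p_at_critical:.4f}")
print(f"p-value = {p_value:.3g}, reject null: {reject_null}")
```

For tables with more degrees of freedom, a general implementation such as scipy.stats.chi2.sf would be the safer choice.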

Extension of A/B test

The Chi-square test can be considered an extension of the simple A/B test, which is performed between two groups to check whether there is any observed difference between them. One group is called the control group and the other the treatment group. Sometimes we are interested in checking multiple treatments at once, and the Chi-square test indicates the extent to which the groups, including the control group, deviate from what we would expect if all versions performed the same. For example, to compare the number of clicks on multiple versions of a redesigned webpage, we can create more than two groups and show each version to a different user group. The contingency table should then hold the number of clicks or final purchases for each new version of the webpage along with the original page.
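A sketch of such a multi-version test follows. The click counts are invented purely for illustration and do not come from any real experiment. Each row is one page version and the columns count users who clicked and users who did not.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts, made up for illustration only
#         clicked, did not click
table = [
    [120, 880],  # original page (control)
    [150, 850],  # new version A
    [135, 865],  # new version B
]

chi2, p_value, dof, expected = chi2_contingency(table)

# dof = (number of rows - 1) * (number of columns - 1)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
```

A small p-value here would only say that the versions do not all perform the same. Finding out which version differs would need follow-up pairwise comparisons.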

Conclusion

In this article, I have described the background of the Chi-square test and demonstrated its implementation in Python. The Chi-square test is a simple statistical test for checking the independence of categorical variables. When multiple treatments need to be compared, we need to go beyond the simple A/B test and perform the Chi-square test.

Github page

Reference:

  1. pingouin documentation
  2. Heart disease data
