Python for Data Science

Chi-square Test of Independence

The $\chi^2$ test of independence tests for dependence between categorical variables and is an omnibus test, meaning that if a significant relationship is found and one wants to test for differences between specific groups, post-hoc testing will need to be conducted. Typically, a proportions test is used as the follow-up post-hoc test.

The $\chi^2$ test of independence analyzes a cross tabulation table of the variables of interest, with $r$ rows and $c$ columns. Based on the cell counts, it is possible to test whether there is a relationship (dependence) between the variables and to estimate the strength of that relationship. This is done by testing the difference between the expected count, $E$, and the observed count, $O$, in each cell. The subscript $i$ will denote the row group, i.e. $\text{row group}_{i}$, and $j$ will denote the column group, i.e. $\text{column group}_{j}$; a cell is therefore denoted with both subscripts, $\text{cell}_{i,j}$. Let's take a look at an example cross tabulation.

General Observed Contingency Table

                          Column Variable
                          C1           C2           Row Total
Row Variable    R1        $O_{1,1}$    $O_{1,2}$    $n_{1,.}$
                R2        $O_{2,1}$    $O_{2,2}$    $n_{2,.}$
Column Total              $n_{.,1}$    $n_{.,2}$    $\text{Grand Total}$

The expected counts, which are needed to calculate the $\chi^2$ test statistic, are estimated using the following formula: $$\hat{E}_{i,j} = \frac{(n_{i,.})(n_{.,j})}{\text{Grand Total}}$$ For example, to estimate the expected count for cell $O_{1,1}$ one would use the following formula: $$\hat{E}_{1,1} = \frac{(n_{1,.})(n_{.,1})}{\text{Grand Total}}$$ The expected cell frequency is generalized below to create a general expected frequency table.

General Expected Contingency Table

                     C1                                                                C2                                                                Row Total
Row Variable    R1   $\hat{E}_{1,1} = \frac{(n_{1,.})(n_{.,1})}{\text{Grand Total}}$   $\hat{E}_{1,2} = \frac{(n_{1,.})(n_{.,2})}{\text{Grand Total}}$   $n_{1,.}$
                R2   $\hat{E}_{2,1} = \frac{(n_{2,.})(n_{.,1})}{\text{Grand Total}}$   $\hat{E}_{2,2} = \frac{(n_{2,.})(n_{.,2})}{\text{Grand Total}}$   $n_{2,.}$
Column Total         $n_{.,1}$                                                         $n_{.,2}$                                                         $\text{Grand Total}$
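
To make the formula concrete, the expected counts for an entire table can be computed in a few lines of numpy; the observed counts below are hypothetical and only for illustration.

import numpy as np

# Hypothetical observed counts for a 2x2 table
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1)    # n_i.
col_totals = observed.sum(axis=0)    # n_.j
grand_total = observed.sum()         # Grand Total

# E_ij = (n_i. * n_.j) / Grand Total, computed for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)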

Before the example is conducted, let's touch on the assumptions, the hypotheses, and the test statistic.

$\chi^2$ test of independence assumptions

  • The two samples are independent
  • No expected cell count is 0
  • No more than 20% of the cells have an expected cell count < 5

Hypotheses

  • $H_0: \text{Variables are independent}$
  • $H_A: \text{Variables are dependent}$

Test statistic

  • $\chi^2 = \sum_{i,j}\frac{(O_{i,j} - \hat{E}_{i,j})^2}{\hat{E}_{i,j}}$

One would reject the null hypothesis, $H_0$, if the calculated $\chi^2$ test statistic is greater than the critical $\chi^2$ value for the given degrees of freedom and $\alpha$ level. Degrees of freedom are calculated as $(r-1)(c-1)$, where $r$ is the number of rows and $c$ is the number of columns.

  • One needs to look up the critical $\chi^2$ value using the calculated degrees of freedom and the set $\alpha$ value; this is typically calculated for the user when using statistical software, as shown in the sketch below.
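
As a minimal sketch of that look-up, scipy.stats.chi2.ppf returns the critical value for a given $\alpha$ and degrees of freedom; the 2x2 table and $\alpha = 0.05$ below are illustrative choices.

import scipy.stats as stats

alpha = 0.05
dof = (2 - 1) * (2 - 1)    # (r-1)(c-1) for a 2x2 table

# The critical value is the (1 - alpha) quantile of the chi-square distribution
critical_value = stats.chi2.ppf(1 - alpha, dof)
print(critical_value)      # approximately 3.84 for dof = 1, alpha = 0.05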

Before making the decision to reject or fail to reject $H_0$, check the assumptions.

Chi-Square ($\chi^2$) test of independence with Python

Don't forget to check the assumptions before interpreting the results! This demonstration will cover how to conduct a $\chi^2$ test of independence using scipy.stats and researchpy. First, let's import pandas, statsmodels.api, scipy.stats, researchpy, and the data for this demonstration.

The data used in this example comes from Stata and is 1980 U.S. census data from 956 cities.

import pandas as pd
import researchpy as rp
import scipy.stats as stats

# To load a sample dataset for this demonstration
import statsmodels.api as sm

df = sm.datasets.webuse("citytemp2")

Let's take a high level look at the data.

df.info()
Int64Index: 956 entries, 0 to 955
Data columns (total 7 columns):
division    956 non-null category
region      956 non-null category
heatdd      953 non-null float64
cooldd      953 non-null float64
tempjan     954 non-null float32
tempjuly    954 non-null float32
agecat      956 non-null category
dtypes: category(3), float32(2), float64(2)
memory usage: 33.3 KB

The research question is the following: is there a relationship between region and age category? Before testing this relationship, let's look at some basic univariate statistics.

rp.summary_cat(df[["agecat", "region"]])
Variable Outcome Count Percent
0 agecat 19-29 507 53.03
1 30-34 316 33.05
2 35+ 133 13.91
3 region N Cntrl 284 29.71
4 West 256 26.78
5 South 250 26.15
6 NE 166 17.36

The majority of cities fall in the 19-29 age group, while the regions are fairly evenly represented, with the Northeast region having the fewest cities.

Chi-square test of independence with Scipy.Stats

The method that needs to be used is scipy.stats.chi2_contingency, and its official documentation can be found here. This method requires one to pass a cross tabulation table, which can be created using pandas.crosstab.

crosstab = pd.crosstab(df["region"], df["agecat"])

crosstab
agecat 19-29 30-34 35+
region
NE 46 83 37
N Cntrl 162 92 30
South 139 68 43
West 160 73 23
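
If the row and column totals from the general observed contingency table are wanted as well, pandas.crosstab accepts a margins argument:

pd.crosstab(df["region"], df["agecat"], margins= True)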

Now to pass this contingency table to the scipy.stats method. The output isn't formatted well, but all the information is there. The information is returned in a tuple where the first value is the $\chi^2$ test statistic, the second value is the p-value, and the third value is the degrees of freedom. An array is also returned which contains the expected cell counts.

stats.chi2_contingency(crosstab)
(61.28767688406036, 2.463382670201326e-11, 6,
 array([[ 88.03556485,  54.87029289,  23.09414226],
        [150.61506276,  93.87447699,  39.51046025],
        [132.58368201,  82.63598326,  34.78033473],
        [135.76569038,  84.61924686,  35.61506276]]))

There is a relationship between region and the age distribution, $\chi^2$(6) = 61.29, p < 0.0001.
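
For more readable output, the returned tuple can be unpacked into named objects and formatted; this is one minimal way to do it.

chi2_stat, p_value, dof, expected_counts = stats.chi2_contingency(crosstab)

print(f"Chi-square({dof}) = {chi2_stat:.2f}, p-value = {p_value:.4g}")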

Chi-square test of independence with Researchpy

Now to conduct the $\chi^2$ test of independence using Researchpy. The method that needs to be used is researchpy.crosstab and the official documentation can be found here.

By default, the method returns the requested objects in a tuple that is just as ugly as scipy.stats's output. For cleaner output, one can assign each requested object from the tuple to its own name and then display those separately. The expected cell counts will be requested and used later while checking the assumptions for this statistical test. Additionally, the cross tabulation will be requested with cell percentages instead of cell counts.

crosstab, test_results, expected = rp.crosstab(df["region"], df["agecat"],
                                               test= "chi-square",
                                               expected_freqs= True,
                                               prop= "cell")

crosstab
agecat
19-29 30-34 35+ All
region
NE 4.81 8.68 3.87 17.36
N Cntrl 16.95 9.62 3.14 29.71
South 14.54 7.11 4.50 26.15
West 16.74 7.64 2.41 26.78
All 53.03 33.05 13.91 100.00
test_results
Chi-square test results
0 Pearson Chi-square ( 6.0) = 61.2877
1 p-value = 0.0000
2 Cramer's V = 0.1790

The one piece of information that researchpy calculates that scipy.stats does not is a measure of the strength of the relationship, which is akin to a correlation statistic such as Pearson's correlation coefficient. A good peer-reviewed article on interpreting these measures that is not behind a paywall is Akoglu (2018). The following table is reproduced from that article.

Phi and Cramer's V Interpretation

Coefficient value    Interpretation
> 0.25               Very strong
> 0.15               Strong
> 0.10               Moderate
> 0.05               Weak
> 0                  No or very weak
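
Cramer's V can also be computed by hand from the $\chi^2$ statistic using $V = \sqrt{\frac{\chi^2}{n(\min(r, c) - 1)}}$, where $n$ is the grand total. A minimal sketch follows; the count table is rebuilt with pandas.crosstab since the crosstab object above now holds cell percentages.

import numpy as np

count_table = pd.crosstab(df["region"], df["agecat"])
chi2_stat, p_value, dof, _ = stats.chi2_contingency(count_table)

n = count_table.to_numpy().sum()        # grand total
min_dim = min(count_table.shape) - 1    # min(r, c) - 1

cramers_v = np.sqrt(chi2_stat / (n * min_dim))
print(round(cramers_v, 4))              # 0.179, matching researchpy's output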

Assumption Check

Checking the assumptions for the $\chi^2$ test of independence is easy. Let's recall what they are:

  • The two samples are independent

    • The variables were collected independently of each other, i.e. the answer to one variable was not dependent on the answer to the other

  • No expected cell count is 0
  • No more than 20% of the cells have an expected cell count < 5

The last two assumptions can be checked by looking at the expected frequency table.

expected
            agecat
             19-29       30-34        35+
region
NE       88.035565   54.870293  23.094142
N Cntrl 150.615063   93.874477  39.510460
South   132.583682   82.635983  34.780335
West    135.765690   84.619247  35.615063
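
These checks can also be done programmatically. A minimal sketch, using the expected-count array from scipy.stats since it is a plain numpy array:

count_table = pd.crosstab(df["region"], df["agecat"])
_, _, _, expected_counts = stats.chi2_contingency(count_table)

print((expected_counts == 0).any())     # any expected cell count equal to 0?
print((expected_counts < 5).mean())     # proportion of cells with expected count < 5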

All expected cell counts are greater than 5 and none are 0, so the assumptions are met and the statistical test results can be considered reliable.

References

Akoglu, H. (2018). User's guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3), 91-93.
Ott, R. L., and Longnecker, M. (2010). An introduction to statistical methods and data analysis. Belmont, CA: Brooks/Cole.
Rosner, B. (2015). Fundamentals of Biostatistics (8th ed.). Boston, MA: Cengage Learning.