Python for Data Science

Chi-square Test of Independence

The $\chi^2$ test of independence tests for dependence between categorical variables and is an omnibus test, meaning that if a significant relationship is found and one wants to test for differences between specific groups, post-hoc testing will need to be conducted. Typically, a proportions test is used as the follow-up post-hoc test.

The $\chi^2$ test of independence analyzes a cross tabulation table of the variables of interest, with $r$ rows and $c$ columns. Based on the cell counts, it is possible to test whether there is a relationship (dependence) between the variables and to estimate the strength of that relationship. This is done by testing the difference between the expected count, $E$, and the observed count, $O$, in each cell. The subscript $i$ will denote the row group, i.e. $\text{row group}_{i}$, and $j$ will denote the column group, i.e. $\text{column group}_{j}$; a cell is therefore denoted with both subscripts, $\text{cell}_{i,j}$. Let's take a look at an example cross tabulation.

General Observed Contingency Table

                          Column Variable
                          C1           C2           Row Total
Row Variable    R1        $O_{1,1}$    $O_{1,2}$    $n_{1,.}$
                R2        $O_{2,1}$    $O_{2,2}$    $n_{2,.}$
Column Total              $n_{.,1}$    $n_{.,2}$    $\text{Grand Total}$

The expected counts, which are needed to calculate the $\chi^2$ test statistic, are estimated using the following formula: $$\hat{E}_{i,j} = \frac{(n_{i,.})(n_{.,j})}{\text{Grand Total}}$$ For example, to estimate the expected count for cell $O_{1,1}$ one would use the following formula: $$\hat{E}_{1,1} = \frac{(n_{1,.})(n_{.,1})}{\text{Grand Total}}$$ The expected cell frequency is generalized below to create a general expected frequency table.

General Expected Contingency Table

                     C1                                                                C2                                                                Row Total
Row Variable    R1   $\hat{E}_{1,1} = \frac{(n_{1,.})(n_{.,1})}{\text{Grand Total}}$   $\hat{E}_{1,2} = \frac{(n_{1,.})(n_{.,2})}{\text{Grand Total}}$   $n_{1,.}$
                R2   $\hat{E}_{2,1} = \frac{(n_{2,.})(n_{.,1})}{\text{Grand Total}}$   $\hat{E}_{2,2} = \frac{(n_{2,.})(n_{.,2})}{\text{Grand Total}}$   $n_{2,.}$
Column Total         $n_{.,1}$                                                         $n_{.,2}$                                                         $\text{Grand Total}$
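
To make the formula concrete, the expected counts for an entire table can be computed in a few lines of numpy; the observed counts below are hypothetical and only for illustration.

import numpy as np

# Hypothetical observed counts for a 2x2 table
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1)    # n_i.
col_totals = observed.sum(axis=0)    # n_.j
grand_total = observed.sum()         # Grand Total

# E_ij = (n_i. * n_.j) / Grand Total, computed for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)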

Before the example is conducted, let's touch on the assumptions, the hypotheses, and the test statistic.

$\chi^2$ test of independence assumptions

  • The two samples are independent
  • No expected cell count is 0
  • No more than 20% of the cells have an expected cell count < 5

Hypotheses

  • $H_0: \text{Variables are independent}$
  • $H_A: \text{Variables are dependent}$

Test statistic

  • $\chi^2 = \sum_{i,j}\frac{(O_{i,j} - \hat{E}_{i,j})^2}{\hat{E}_{i,j}}$

One would reject the null hypothesis, $H_0$, if the calculated $\chi^2$ test statistic is greater than the critical $\chi^2$ value for the given degrees of freedom and $\alpha$ level. Degrees of freedom are calculated as $(r-1)(c-1)$, where $r$ is the number of rows and $c$ is the number of columns.

  • One needs to look up the critical $\chi^2$ value using the calculated degrees of freedom and the set $\alpha$ value; this is typically calculated for the user when using statistical software, as shown in the sketch below.
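
As a minimal sketch of that look-up, scipy.stats.chi2.ppf returns the critical value for a given $\alpha$ and degrees of freedom; the 2x2 table and $\alpha = 0.05$ below are illustrative choices.

import scipy.stats as stats

alpha = 0.05
dof = (2 - 1) * (2 - 1)    # (r-1)(c-1) for a 2x2 table

# The critical value is the (1 - alpha) quantile of the chi-square distribution
critical_value = stats.chi2.ppf(1 - alpha, dof)
print(critical_value)      # approximately 3.84 for dof = 1, alpha = 0.05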

Before making the decision to reject or fail to reject $H_0$, check the assumptions.

Chi-Square ($\chi^2$) test of independence with Python

Don't forget to check the assumptions before interpreting the results! This demonstration will cover how to conduct a $\chi^2$ test of independence using scipy.stats and researchpy. First, let's import pandas, statsmodels.api, scipy.stats, researchpy, and the data for this demonstration.

The data used in this example comes from Stata and is 1980 U.S. census data from 956 cities.

import pandas as pd
import researchpy as rp
import scipy.stats as stats

# To load a sample dataset for this demonstration
import statsmodels.api as sm

df = sm.datasets.webuse("citytemp2")

Let's take a high level look at the data.

df.info()
Int64Index: 956 entries, 0 to 955
Data columns (total 7 columns):
division    956 non-null category
region      956 non-null category
heatdd      953 non-null float64
cooldd      953 non-null float64
tempjan     954 non-null float32
tempjuly    954 non-null float32
agecat      956 non-null category
dtypes: category(3), float32(2), float64(2)
memory usage: 33.3 KB

The research question is the following: is there a relationship between region and age category? Before testing this relationship, let's look at some basic univariate statistics.

rp.summary_cat(df[["agecat", "region"]])
Variable Outcome Count Percent
0 agecat 19-29 507 53.03
1 30-34 316 33.05
2 35+ 133 13.91
3 region N Cntrl 284 29.71
4 West 256 26.78
5 South 250 26.15
6 NE 166 17.36

The majority of cities fall in the 19-29 age group, while the regions are fairly evenly represented, with the Northeast region having the fewest cities.

Chi-square test of independence with Scipy.Stats

The method that needs to be used is scipy.stats.chi2_contingency, and its official documentation can be found here. This method requires one to pass a cross tabulation table, which can be created using pandas.crosstab.

crosstab = pd.crosstab(df["region"], df["agecat"])

crosstab
agecat 19-29 30-34 35+
region
NE 46 83 37
N Cntrl 162 92 30
South 139 68 43
West 160 73 23
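
If the row and column totals from the general observed contingency table are wanted as well, pandas.crosstab accepts a margins argument:

pd.crosstab(df["region"], df["agecat"], margins= True)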

Now to pass this contingency table to the scipy.stats method. The output isn't formatted well, but all the information is there. The information is returned in a tuple where the first value is the $\chi^2$ test statistic, the second value is the p-value, and the third value is the degrees of freedom. An array is also returned which contains the expected cell counts.

stats.chi2_contingency(crosstab)
(61.28767688406036, 2.463382670201326e-11, 6,
 array([[ 88.03556485,  54.87029289,  23.09414226],
        [150.61506276,  93.87447699,  39.51046025],
        [132.58368201,  82.63598326,  34.78033473],
        [135.76569038,  84.61924686,  35.61506276]]))

There is a relationship between region and the age distribution, $\chi^2$(6) = 61.29, p < 0.0001.
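
For more readable output, the returned tuple can be unpacked into named objects and formatted; this is one minimal way to do it.

chi2_stat, p_value, dof, expected_counts = stats.chi2_contingency(crosstab)

print(f"Chi-square({dof}) = {chi2_stat:.2f}, p-value = {p_value:.4g}")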

Chi-square test of independence with Researchpy

Now to conduct the $\chi^2$ test of independence using Researchpy. The method that needs to be used is researchpy.crosstab and the official documentation can be found here.

By default, the method returns the requested objects in a tuple that is just as ugly as scipy.stats's output. For cleaner output, one can assign each requested object from the tuple to its own name and then display those separately. The expected cell counts will be requested and used later while checking the assumptions for this statistical test. Additionally, the cross tabulation will be requested with cell percentages instead of cell counts.

crosstab, test_results, expected = rp.crosstab(df["region"], df["agecat"],
                                               test= "chi-square",
                                               expected_freqs= True,
                                               prop= "cell")

crosstab
agecat
19-29 30-34 35+ All
region
NE 4.81 8.68 3.87 17.36
N Cntrl 16.95 9.62 3.14 29.71
South 14.54 7.11 4.50 26.15
West 16.74 7.64 2.41 26.78
All 53.03 33.05 13.91 100.00
test_results
Chi-square test results
0 Pearson Chi-square ( 6.0) = 61.2877
1 p-value = 0.0000
2 Cramer's V = 0.1790

The one piece of information that researchpy calculates that scipy.stats does not is a measure of the strength of the relationship, which is akin to a correlation statistic such as Pearson's correlation coefficient. A good peer-reviewed article on interpreting these measures that is not behind a paywall is Akoglu (2018). The following table is reproduced from that article.

Phi and Cramer's V Interpretation

Coefficient value    Interpretation
> 0.25               Very strong
> 0.15               Strong
> 0.10               Moderate
> 0.05               Weak
> 0                  No or very weak
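
Cramer's V can also be computed by hand from the $\chi^2$ statistic using $V = \sqrt{\frac{\chi^2}{n(\min(r, c) - 1)}}$, where $n$ is the grand total. A minimal sketch follows; the count table is rebuilt with pandas.crosstab since the crosstab object above now holds cell percentages.

import numpy as np

count_table = pd.crosstab(df["region"], df["agecat"])
chi2_stat, p_value, dof, _ = stats.chi2_contingency(count_table)

n = count_table.to_numpy().sum()        # grand total
min_dim = min(count_table.shape) - 1    # min(r, c) - 1

cramers_v = np.sqrt(chi2_stat / (n * min_dim))
print(round(cramers_v, 4))              # 0.179, matching researchpy's output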

Assumption Check

Checking the assumptions for the $\chi^2$ test of independence is easy. Let's recall what they are:

  • The two samples are independent

    • The variables were collected independently of each other, i.e. the answer to one variable was not dependent on the answer to the other

  • No expected cell count is 0
  • No more than 20% of the cells have an expected cell count < 5

The last two assumptions can be checked by looking at the expected frequency table.

expected
            agecat
             19-29       30-34        35+
region
NE       88.035565   54.870293  23.094142
N Cntrl 150.615063   93.874477  39.510460
South   132.583682   82.635983  34.780335
West    135.765690   84.619247  35.615063
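
These checks can also be done programmatically. A minimal sketch, using the expected-count array from scipy.stats since it is a plain numpy array:

count_table = pd.crosstab(df["region"], df["agecat"])
_, _, _, expected_counts = stats.chi2_contingency(count_table)

print((expected_counts == 0).any())     # any expected cell count equal to 0?
print((expected_counts < 5).mean())     # proportion of cells with expected count < 5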

All expected cell counts are greater than 5 and none are 0, so the assumptions are met and the statistical test results can be considered reliable.

References

Akoglu, H. (2018). User's guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3), 91-93.
Ott, R. L., and Longnecker, M. (2010). An introduction to statistical methods and data analysis. Belmont, CA: Brooks/Cole.
Rosner, B. (2015). Fundamentals of Biostatistics (8th ed.). Boston, MA: Cengage Learning.