# Chi-square Test of Independence

The $\chi^2$ test of independence tests for dependence between categorical variables and is an omnibus test, meaning that if a significant relationship is found and one wants to know which groups differ, post-hoc testing will need to be conducted. Typically, a proportions test is used as the follow-up post-hoc test.
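As a sketch of what such a post-hoc comparison could look like, the following implements a two-sided two-proportion z-test by hand with scipy; the group counts are hypothetical and purely illustrative:

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions (pooled SE)."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    p_value = 2 * stats.norm.sf(abs(z))   # two-sided p-value
    return z, p_value

# Hypothetical counts: 30 of 100 in group A vs. 50 of 100 in group B
z, p = two_proportion_ztest(30, 100, 50, 100)
print(round(z, 3), round(p, 4))
```

When several pairwise comparisons are made, the $\alpha$ level is usually adjusted (e.g. with a Bonferroni correction) to control the family-wise error rate.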

The $\chi^2$ test of independence analysis uses a cross tabulation
table of the variables of interest, with $r$ rows and $c$ columns.
Based on the cell counts, it is possible to test whether there is a relationship
(dependence) between the variables and to estimate the strength of the
relationship. This is done by testing the difference between the expected
count, $E$, and the observed count, $O$. The subscript *i* will be used
to denote the row group and *j* will be used to denote the column group,
so the cell at $\text{row group}_{i}$ and $\text{column group}_{j}$
will be denoted $\text{cell}_{i,j}$.
Let's take a look at an example cross
tabulation.

| Row Variable \ Column Variable | C1 | C2 | Row Total |
|---|---|---|---|
| R1 | $O_{1,1}$ | $O_{1,2}$ | $n_{1,.}$ |
| R2 | $O_{2,1}$ | $O_{2,2}$ | $n_{2,.}$ |
| Column Total | $n_{.,1}$ | $n_{.,2}$ | $\text{Grand Total}$ |

The expected counts, which are needed to calculate the $\chi^2$ test statistic, are estimated using the following formula: $$\hat{E}_{i,j} = \frac{(n_{i,.})(n_{., j})}{\text{Grand Total}}$$ For example, to estimate the expected count for cell $O_{1,1}$ one would use the following formula: $$\hat{E}_{1,1} = \frac{(n_{1,.})(n_{.,1})}{\text{Grand Total}}$$ The expected cell frequency is generalized below to create a general expected frequency table.

| Row Variable \ Column Variable | C1 | C2 | Row Total |
|---|---|---|---|
| R1 | $\hat{E}_{1,1} = \frac{(n_{1,.})(n_{.,1})}{\text{Grand Total}}$ | $\hat{E}_{1,2} = \frac{(n_{1,.})(n_{.,2})}{\text{Grand Total}}$ | $n_{1,.}$ |
| R2 | $\hat{E}_{2,1} = \frac{(n_{2,.})(n_{.,1})}{\text{Grand Total}}$ | $\hat{E}_{2,2} = \frac{(n_{2,.})(n_{.,2})}{\text{Grand Total}}$ | $n_{2,.}$ |
| Column Total | $n_{.,1}$ | $n_{.,2}$ | $\text{Grand Total}$ |
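The margin-product formula above can be verified numerically. The sketch below computes $\hat{E}_{i,j}$ for a hypothetical 2x2 table of observed counts using numpy (the counts are invented for illustration):

```python
import numpy as np

# Hypothetical 2x2 observed counts (rows: R1, R2; columns: C1, C2)
observed = np.array([[30, 20],
                     [10, 40]])

row_totals = observed.sum(axis=1)    # n_{i,.}
col_totals = observed.sum(axis=0)    # n_{.,j}
grand_total = observed.sum()         # Grand Total

# E_hat[i, j] = n_{i,.} * n_{.,j} / Grand Total for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

Note that the expected table always has the same row and column totals as the observed table.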

Before working through the example, let's touch on the assumptions, the hypotheses, and the test statistic.

$\chi^2$ test of independence assumptions

- The two samples are independent
- No expected cell count equals 0
- No more than 20% of the cells have an expected cell count < 5

Hypothesis

- $H_0: \text{Variables are independent}$
- $H_A: \text{Variables are dependent}$

Test statistic

- $\chi^2 = \sum_{i,j}\frac{(O_{i,j} - \hat{E}_{i,j})^2}{\hat{E}_{i,j}}$

One would reject the null hypothesis, $H_0$, if the calculated $\chi^2$ test statistic is greater than the critical $\chi^2$ value for the given degrees of freedom and $\alpha$ level. Degrees of freedom are calculated as $(r-1)(c-1)$, where $r$ is the number of rows and $c$ is the number of columns.

- One needs to look up the critical $\chi^2$ value using the calculated degrees of freedom and the set $\alpha$ value; this is typically done for the user by statistical software.
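Putting the statistic, the degrees of freedom, and the critical-value lookup together, here is a sketch using scipy.stats.chi2.ppf on a hypothetical 2x2 table (the counts are illustrative only):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 observed counts (rows: R1, R2; columns: C1, C2)
observed = np.array([[30, 20],
                     [10, 40]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (r - 1)(c - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Critical value at alpha = 0.05; reject H0 if chi2_stat exceeds it
critical = stats.chi2.ppf(1 - 0.05, dof)
print(chi2_stat, dof, critical)
```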

Before deciding whether to reject $H_0$, check the assumptions.

## Chi-Square ($\chi^2$) test of independence with Python

Don't forget to check the assumptions before interpreting the results! This demonstration will cover how to conduct a $\chi^2$ test of independence using scipy.stats and researchpy. First, let's import pandas, statsmodels.api, scipy.stats, researchpy, and the data for this demonstration.

The data used in this example comes from Stata and is 1980 U.S. census data from 956 cities.

```python
import pandas as pd
import researchpy as rp
import scipy.stats as stats

# To load a sample dataset for this demonstration
import statsmodels.api as sm

df = sm.datasets.webuse("citytemp2")
```

Let's take a high level look at the data.

`df.info()`

The research question is the following: is there a relationship between region and age? Before testing this relationship, let's look at some basic univariate statistics.

`rp.summary_cat(df[["agecat", "region"]])`

| | Variable | Outcome | Count | Percent |
|---|---|---|---|---|
| 0 | agecat | 19-29 | 507 | 53.03 |
| 1 | | 30-34 | 316 | 33.05 |
| 2 | | 35+ | 133 | 13.91 |
| 3 | region | N Cntrl | 284 | 29.71 |
| 4 | | West | 256 | 26.78 |
| 5 | | South | 250 | 26.15 |
| 6 | | NE | 166 | 17.36 |

The majority of the data falls in the 19-29 age group, while the regions are fairly similar in size except for the Northeast region, which has the fewest observations.

### Chi-square test of independence with scipy.stats

The method that needs to be used is *scipy.stats.chi2_contingency*;
see its official documentation for details.
This method requires one to pass a cross tabulation table, which can be produced with
*pandas.crosstab*.

```python
crosstab = pd.crosstab(df["region"], df["agecat"])
crosstab
```

| region \ agecat | 19-29 | 30-34 | 35+ |
|---|---|---|---|
| NE | 46 | 83 | 37 |
| N Cntrl | 162 | 92 | 30 |
| South | 139 | 68 | 43 |
| West | 160 | 73 | 23 |

Now to pass this contingency table to the scipy.stats method. The output isn't the best formatted, but all the information is there. The results are returned in a tuple where the first value is the $\chi^2$ test statistic, the second value is the p-value, and the third is the degrees of freedom. An array containing the expected cell counts is also returned.

`stats.chi2_contingency(crosstab)`

There is a relationship between region and the age distribution, $\chi^2$(6) = 61.29, *p*< 0.0001.
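For more readable output, the returned tuple can be unpacked into named variables. The sketch below does this with the counts transcribed from the crosstab above, so it stands alone without downloading the dataset:

```python
import numpy as np
from scipy import stats

# Observed counts transcribed from the crosstab above
# rows: NE, N Cntrl, South, West; columns: 19-29, 30-34, 35+
observed = np.array([[ 46,  83,  37],
                     [162,  92,  30],
                     [139,  68,  43],
                     [160,  73,  23]])

# Unpack the tuple into statistic, p-value, degrees of freedom, expected counts
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```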

### Chi-square test of independence with Researchpy

Now to conduct the $\chi^2$ test of independence using researchpy. The method that needs to be used is *researchpy.crosstab*; see its official documentation for details.

By default, the method returns the requested objects in a tuple that is just as ugly as scipy.stats's output. For cleaner output, one can assign each requested object from the tuple to its own name and then display those separately. The expected cell counts will be requested and used later when checking the assumptions of this statistical test. Additionally, the cross tabulation will be requested with cell percentages instead of cell counts.

```python
crosstab, test_results, expected = rp.crosstab(df["region"], df["agecat"],
                                               test="chi-square",
                                               expected_freqs=True,
                                               prop="cell")
crosstab
```

| region \ agecat | 19-29 | 30-34 | 35+ | All |
|---|---|---|---|---|
| NE | 4.81 | 8.68 | 3.87 | 17.36 |
| N Cntrl | 16.95 | 9.62 | 3.14 | 29.71 |
| South | 14.54 | 7.11 | 4.50 | 26.15 |
| West | 16.74 | 7.64 | 2.41 | 26.78 |
| All | 53.03 | 33.05 | 13.91 | 100.00 |

`test_results`

| | Chi-square test | results |
|---|---|---|
| 0 | Pearson Chi-square ( 6.0) = | 61.2877 |
| 1 | p-value = | 0.0000 |
| 2 | Cramer's V = | 0.1790 |

The one piece of information that researchpy calculates and scipy.stats does not is a measure of the strength of the relationship, Cramer's V - this is akin to a correlation statistic such as Pearson's correlation coefficient. A good peer-reviewed article on interpreting these measures that is not behind a paywall is Akoglu (2018). The following table is reproduced from that article.

| Phi and Cramer's V | Interpretation |
|---|---|
| >0.25 | Very strong |
| >0.15 | Strong |
| >0.10 | Moderate |
| >0.05 | Weak |
| >0 | No or very weak |
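Cramer's V itself can be computed from the $\chi^2$ statistic as $V = \sqrt{\chi^2 / (n(\min(r, c) - 1))}$, where $n$ is the grand total. A sketch using the counts transcribed from the earlier crosstab:

```python
import numpy as np
from scipy import stats

# Observed counts transcribed from the earlier crosstab
# rows: NE, N Cntrl, South, West; columns: 19-29, 30-34, 35+
observed = np.array([[ 46,  83,  37],
                     [162,  92,  30],
                     [139,  68,  43],
                     [160,  73,  23]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Cramer's V = sqrt(chi2 / (n * (min(r, c) - 1)))
n = observed.sum()
r, c = observed.shape
cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))
print(round(cramers_v, 4))
```

This reproduces researchpy's reported value of 0.1790.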

## Assumption Check

Checking the assumptions for the $\chi^2$ test of independence is easy. Let's recall what they are:

- The two samples are independent
  - The variables were collected independently of each other, i.e. the answer from one variable was not dependent on the answer of the other
- No expected cell count equals 0
- No more than 20% of the cells have an expected cell count < 5

The last two assumptions can be checked by looking at the expected frequency table.

`expected`

| region \ agecat | 19-29 | 30-34 | 35+ |
|---|---|---|---|
| NE | 88.035565 | 54.870293 | 23.094142 |
| N Cntrl | 150.615063 | 93.874477 | 39.510460 |
| South | 132.583682 | 82.635983 | 34.780335 |
| West | 135.765690 | 84.619247 | 35.615063 |

It can be seen that all the assumptions are met, which indicates the statistical test results are reliable.
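The two count-based assumptions can also be checked programmatically on the expected-frequency array; a sketch using the counts transcribed from the earlier crosstab:

```python
import numpy as np
from scipy import stats

# Observed counts transcribed from the earlier crosstab
observed = np.array([[ 46,  83,  37],
                     [162,  92,  30],
                     [139,  68,  43],
                     [160,  73,  23]])
_, _, _, expected = stats.chi2_contingency(observed)

# Assumption: no expected cell count equals 0
assert (expected > 0).all()

# Assumption: no more than 20% of cells have an expected count below 5
pct_below_5 = (expected < 5).mean() * 100
assert pct_below_5 <= 20
print(pct_below_5)
```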

## References

Akoglu, H. (2018). User's guide to correlation coefficients. *Turkish Journal of Emergency Medicine, 18*(3), 91-93.

Ott, R. L., and Longnecker, M. (2010). *An introduction to statistical methods and data analysis.* Belmont, CA: Brooks/Cole.

Rosner, B. (2015). *Fundamentals of Biostatistics* (8th ed.). Boston, MA: Cengage Learning.