Python for Data Science

Type 3 Sum of Squares with StatsModels

For an easy primer on the differences between the types of sums of squares, see here. The examples in that primer are written in R, but the explanation is clear.

Unlike Researchpy, StatsModels needs the formula entered a bit differently to get the correct Type 3 sums of squares: each categorical term has to be given a sum-to-zero contrast, written as C(variable, Sum). It's not a major change, but it has to be known, because without it the Type 3 results are incorrect. Let's get to it.
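To make "entered a bit differently" concrete, here is a minimal sketch of the two codings involved. It assumes the installed StatsModels builds its formulas with patsy, and the four-level factor is made up for illustration. Treatment coding is the default for C(...); Sum is the sum-to-zero coding used in the formula later in this post.

```python
from patsy.contrasts import Sum, Treatment

# Hypothetical factor with four levels (e.g. four drugs).
levels = [1, 2, 3, 4]

# Default coding: each column is a 0/1 dummy against the first level.
treatment = Treatment(reference=0).code_without_intercept(levels)

# Sum-to-zero coding: the last level is coded -1 in every column,
# so each column of the contrast matrix sums to zero.
sum_to_zero = Sum().code_without_intercept(levels)

print("Treatment coding:")
print(treatment.matrix)
print("Sum coding:")
print(sum_to_zero.matrix)
```

With sum-to-zero coding, each main-effect coefficient is a deviation from the overall mean rather than a difference from a reference level, which is what the Type 3 tests of main effects rely on when an interaction is in the model.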

StatsModels ANOVA with Type 3 Sum of Squares

This example uses a Stata data set called systolic, which can be loaded in a few ways. One way is to load it directly from Stata's website; since this demonstration is about StatsModels, it uses StatsModels' own loader instead. First, load the required libraries and the data.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

stata = sm.datasets.webuse('systolic')

stata.head()
   drug  disease  systolic
0     1        1        42
1     1        1        44
2     1        1        36
3     1        1        13
4     1        1        19
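The choice between sum of squares types matters mainly when the cells of the design hold unequal numbers of observations, so it can be worth checking the cell counts before fitting. The sketch below is only an illustration: cell_counts is a hypothetical helper, and the small frame is made-up data that borrows the systolic column names so the example runs without a download. Pass the real data frame in place of toy to check the actual design.

```python
import pandas as pd


def cell_counts(df, row_factor, col_factor):
    """Count the observations in each combination of two factors."""
    return pd.crosstab(df[row_factor], df[col_factor])


# Made-up stand-in data with the same column names as the systolic set.
toy = pd.DataFrame({
    "drug":     [1, 1, 1, 2, 2, 3],
    "disease":  [1, 1, 2, 1, 2, 2],
    "systolic": [42, 44, 36, 13, 19, 25],
})

counts = cell_counts(toy, "drug", "disease")

# A design is balanced when every cell has the same number of observations.
balanced = counts.to_numpy().min() == counts.to_numpy().max()

print(counts)
print("balanced:", balanced)
```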

Now to run the ANOVA with Type 3 sum of squares using StatsModels.

model = ols('systolic ~ C(drug, Sum) + C(disease, Sum) + C(drug, Sum):C(disease, Sum)', data=stata).fit()
aov_table = sm.stats.anova_lm(model, typ=3)
aov_table
                                    sum_sq    df           F        PR(>F)
Intercept                     20037.613011   1.0  181.413788  1.417921e-17
C(drug, Sum)                   2997.471860   3.0    9.046033  8.086388e-05
C(disease, Sum)                 415.873046   2.0    1.882587  1.637355e-01
C(drug, Sum):C(disease, Sum)    707.266259   6.0    1.067225  3.958458e-01
Residual                       5080.816667  46.0         NaN           NaN

That's all it takes! With the Sum contrasts in the formula, the Type 3 sums of squares are calculated as they should be.