Python for Data Science

N-way ANOVA

If you are looking for how to run the code, jump ahead to the demonstration section; if you would like some theory or a refresher first, start with this section.

ANOVA stands for analysis of variance and is an omnibus parametric test, meaning it tests for an overall difference between the groups in the model, i.e. at least one of the groups is statistically significantly different from the others. However, a significant ANOVA alone cannot tell which groups differ; in order to tell which groups are different, planned or post-hoc comparisons need to be made. As with all parametric tests, there are certain assumptions that need to be met in order for the test results to be considered reliable.

Recall that when working within the ANOVA framework, independent variables are sometimes referred to as factors and the groups within each variable are called levels, i.e. one variable with 3 categories can be referred to as a factor with 3 levels. Since this is an N-way ANOVA there will be at least 2 independent variables, and they don't all have to be categorical. If some variables are categorical and others continuous, the analysis is called an analysis of covariance (ANCOVA). The approach with an ANCOVA is no different than an N-factor ANOVA, but ANCOVA nonetheless has its own demonstration.

Ideally one should be comfortable with conducting and interpreting a one-way, a.k.a. one-factor, ANOVA before conducting an N-way, a.k.a. N-factor, ANOVA. When a model contains more than 2 factors the analysis can get complex quickly - a 3-factor ANOVA is not that much more complex, but anything beyond 3 factors definitely is. This will be seen shortly.

Before the factors themselves can be analyzed, the overall model needs to be significant at the pre-determined significance level, a.k.a. alpha level ($\alpha$), which is commonly set to 0.05. The statistic being evaluated is the F-statistic.

Parametric test assumptions

  • Population distributions are normal
  • Samples have equal variances
  • Independence
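
As a quick preview, the first two assumptions are commonly screened with the Shapiro-Wilk and Levene tests. A minimal sketch using scipy (the group arrays here are hypothetical placeholders, not data from this page):

import scipy.stats as stats

# Hypothetical samples - swap in the observations for each group
group_a = [23, 25, 28, 31, 22]
group_b = [30, 33, 29, 35, 31]
group_c = [18, 20, 22, 19, 21]

# Shapiro-Wilk test of normality, run per group
for name, grp in [("a", group_a), ("b", group_b), ("c", group_c)]:
    w, p = stats.shapiro(grp)
    print(f"Group {name}: W = {w:.4f}, p = {p:.4f}")

# Levene's test for equality of variances across groups
stat, p = stats.levene(group_a, group_b, group_c)
print(f"Levene: statistic = {stat:.4f}, p = {p:.4f}")
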
Hypotheses

  1. $H_0: \mu_1 = \mu_2 = \mu_3 = \; ... \; = \mu_k$
  2. $H_A: \text{At least one of the group means differs}$

The test statistic is the F-statistic, which compares the mean square between samples ($MS_B$) to the mean square within samples ($MS_W$). The F-statistic is calculated using the following formula:

  • $F = \frac{MS_B}{MS_W}$

  • Where,
    • $MS_B = \frac{\text{sum of squares between samples } (SS_B)}{(k - 1)}$

    • $MS_W = \frac{\text{sum of squares within samples } (SS_W)}{(n_T - k)}$

    • $k$ is the number of groups
    • $n_T$ is the total number of observations

  • and where,
    • Sum of squares between samples: $SS_B = \sum_k n_k(\bar{x}_k - \bar{x})^2$
    • Sum of squares within samples: $SS_W = \sum_{i, k}(x_{i, k} - \bar{x}_k)^2$, which can also be calculated as $\sum_k (n_k - 1)s_k^2$

One rejects the null hypothesis, $H_0$, if the calculated F-statistic is greater than the critical F-statistic. The critical F-statistic is determined by the degrees of freedom and the alpha ($\alpha$) level.

  1. Reject $H_0$ if $\text{calculated F-statistic} > \text{critical F-statistic}$
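
To make the decision rule concrete, the critical F-statistic and p-value can be computed with scipy. A minimal sketch using the degrees of freedom from the drug-and-disease example below (alpha, $k$, and $n_T$ are the symbols defined above):

import scipy.stats as stats

alpha = 0.05
k = 12               # number of groups (here, 4 drugs x 3 diseases = 12 cells)
n_T = 58             # total number of observations
f_calculated = 3.51  # overall F-statistic from the completed table below

# Critical F at the given alpha with (k - 1) and (n_T - k) degrees of freedom
f_critical = stats.f.ppf(1 - alpha, dfn=k - 1, dfd=n_T - k)

# p-value: probability of an F at least this large under H0
p_value = stats.f.sf(f_calculated, dfn=k - 1, dfd=n_T - k)

print(f"critical F = {f_critical:.4f}, p-value = {p_value:.4f}")
print("Reject H0" if f_calculated > f_critical else "Fail to reject H0")
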

Before the decision is made to reject or fail to reject the null hypothesis, the assumptions need to be checked. See this page on how to check the parametric assumptions in detail - how to check the assumptions for this example will be demonstrated near the end.

Let's make sense of all these mathematical terms. To do that, let's start with a generic ANOVA table excluding the factor information. The factors are being excluded for now because there are various ways to calculate the sums of squares (SS) for the factors, each with its own theoretical reasoning. The most common type is Type III, which is returned by default in Stata, SAS, and SPSS.

Basic ANOVA Table

| Source          | Sum of Squares      | Degrees of Freedom | Mean Square                     | F-statistic         |
|-----------------|---------------------|--------------------|---------------------------------|---------------------|
| Between samples | $SS_B$              | $k - 1$            | $MS_B = \frac{SS_B}{(k - 1)}$   | $\frac{MS_B}{MS_W}$ |
| Within samples  | $SS_W$              | $n_T - k$          | $MS_W = \frac{SS_W}{(n_T - k)}$ |                     |
| Total           | $TSS = SS_B + SS_W$ | $n_T - 1$          |                                 |                     |

Note: TSS means total sum of squares
Data Table: DV = Systolic

| Variable         | Groups    | Sample Size | Sample Mean         | Sample Variance  |
|------------------|-----------|-------------|---------------------|------------------|
| Drug ($IV_1$)    | 1 ($k_1$) | 15 ($n_1$)  | 26.07 ($\bar{x}_1$) | 136.35 ($s^2_1$) |
|                  | 2 ($k_2$) | 15 ($n_2$)  | 25.53 ($\bar{x}_2$) | 134.98 ($s^2_2$) |
|                  | 3 ($k_3$) | 12 ($n_3$)  | 8.75 ($\bar{x}_3$)  | 100.39 ($s^2_3$) |
|                  | 4 ($k_4$) | 16 ($n_4$)  | 13.50 ($\bar{x}_4$) | 86.93 ($s^2_4$)  |
| Disease ($IV_2$) | 1 ($k_1$) | 19 ($n_1$)  | 22.79 ($\bar{x}_1$) | 173.18 ($s^2_1$) |
|                  | 2 ($k_2$) | 19 ($n_2$)  | 18.21 ($\bar{x}_2$) | 183.73 ($s^2_2$) |
|                  | 3 ($k_3$) | 20 ($n_3$)  | 15.80 ($\bar{x}_3$) | 127.75 ($s^2_3$) |
| Systolic ($DV$)  |           | 58 ($n_T$)  | 18.88 ($\bar{x}$)   | 163.86 ($s^2$)   |
Cross Tabulation Data Table: DV = Systolic

| Drug ($IV_1$) | Disease ($IV_2$) | Sample Size | Sample Mean | Sample Variance |
|---------------|------------------|-------------|-------------|-----------------|
| 1 ($k_{1,1}$) | 1 ($k_{2,1}$)    | 6           | 29.33       | 169.46          |
|               | 2 ($k_{2,2}$)    | 4           | 28.25       | 34.25           |
|               | 3 ($k_{2,3}$)    | 5           | 20.40       | 178.80          |
| 2 ($k_{1,2}$) | 1 ($k_{2,1}$)    | 5           | 28.00       | 120.50          |
|               | 2 ($k_{2,2}$)    | 4           | 33.50       | 4.33            |
|               | 3 ($k_{2,3}$)    | 6           | 18.17       | 156.97          |
| 3 ($k_{1,3}$) | 1 ($k_{2,1}$)    | 3           | 16.33       | 201.33          |
|               | 2 ($k_{2,2}$)    | 5           | 4.40        | 47.80           |
|               | 3 ($k_{2,3}$)    | 4           | 8.50        | 81.00           |
| 4 ($k_{1,4}$) | 1 ($k_{2,1}$)    | 5           | 13.60       | 111.30          |
|               | 2 ($k_{2,2}$)    | 6           | 12.83       | 106.96          |
|               | 3 ($k_{2,3}$)    | 5           | 14.20       | 79.70           |
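
For reference, a summary table like the one above can be produced with a pandas groupby. A minimal sketch, assuming Stata's systolic example data set (which appears to match the summaries above) loaded through statsmodels:

import statsmodels.api as sm

# Assumption: Stata's 'systolic' example data with columns drug, disease, systolic
df = sm.datasets.webuse("systolic")

# Cell sizes, means, and variances for each drug-by-disease combination
summary = df.groupby(["drug", "disease"])["systolic"].agg(["count", "mean", "var"])
print(summary.round(2))
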

Now, using the formulas and data from above, the ANOVA table can be filled in - the table below reports Type III sums of squares. The sum of squares calculations for the individual factors have been omitted since they are a bit out of scope for the topic at hand; an easy-to-read primer on the types of sums of squares can be found here. Ignore the code part since it's written for R, but the theory behind the different types is explained well.

Completed ANOVA Table

| Source          | Sum of Squares | Degrees of Freedom | Mean Square | F-statistic |
|-----------------|----------------|--------------------|-------------|-------------|
| Between samples | 4259.34        | 11                 | 387.21      | 3.51        |
| Drug            | 2997.47        | 3                  | 999.16      | 9.05        |
| Disease         | 415.87         | 2                  | 207.94      | 1.88        |
| Drug:Disease    | 707.27         | 6                  | 117.88      | 1.07        |
| Within samples  | 5080.82        | 46                 | 110.45      |             |
| Total           | 9340.16        | 57                 | 163.86      |             |
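
As a quick arithmetic check of the table: each mean square is its sum of squares divided by its degrees of freedom, and each F-statistic is that mean square divided by $MS_W$. Reproducing the numbers above:

# Arithmetic check of the completed ANOVA table (values copied from above)
ms_w = 5080.82 / 46                      # within-samples mean square = 110.45
print(round(4259.34 / 11, 2))            # MS between samples = 387.21
print(round((4259.34 / 11) / ms_w, 2))   # overall F = 3.51
print(round((2997.47 / 3) / ms_w, 2))    # Drug F = 9.05
print(round((415.87 / 2) / ms_w, 2))     # Disease F = 1.88
print(round((707.27 / 6) / ms_w, 2))     # Drug:Disease F = 1.07
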

In order to tell if the calculated F-statistic is statistically significant, one would look up the critical F-statistic based on the degrees of freedom and alpha level - when using statistical software this doesn't need to be done by hand since the p-value is provided.

Fear not if math is not your strong suit. All of this is calculated by the methods of a statistical software package or programming language, but it's good to know what is going on behind the scenes. References for this section are provided at the end of the page.

N-way ANOVA Approach

When conducting an ANOVA with multiple factors, as in the current demonstration, the factors should be tested for an interaction before looking at their individual main effects. If the interaction is non-significant, remove a variable from the interaction term (or, with only 2 factors, drop the term entirely) and conduct the analysis again. First a 2-factor ANOVA example will be discussed, then the discussion will be expanded to a 3-factor ANOVA, which will exemplify how complex ANOVAs can get when using multiple factors.

2-Factor ANOVA

To discuss a 2-factor ANOVA, a hypothetical example will be used. A researcher is testing whether there is a difference in effect between drug types and disease types on systolic blood pressure. Here systolic blood pressure is the dependent variable, and drug type and disease type are the independent variables. The model will look like

systolic ~ drug + disease + drug:disease

where drug:disease denotes the test for an interaction between the two. If the interaction term is significant, it indicates that the effect of the type of drug on systolic blood pressure depends on the type of disease. To determine which combinations of drug and disease type are significantly different from the others, one has to conduct planned comparisons or post-hoc tests.

If the interaction is non-significant, then it is removed from the model and the model is re-run. The reduced model will now be

systolic ~ drug + disease

where one is now testing for the main effects of the independent variables themselves.
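
In StatsModels syntax, the full and reduced models above could be fit as follows. This is only a sketch, assuming the same systolic data as before; the full mechanics (including the ", Sum" coding needed for Type III sums of squares) are demonstrated in the Python section below:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Assumption: df holds the systolic data with columns drug, disease, systolic
df = sm.datasets.webuse("systolic")

# Full model: main effects plus the drug:disease interaction
full = ols("systolic ~ C(drug, Sum) + C(disease, Sum) + C(drug, Sum):C(disease, Sum)", data=df).fit()
print(sm.stats.anova_lm(full, typ=3))

# Reduced model after dropping a non-significant interaction
reduced = ols("systolic ~ C(drug, Sum) + C(disease, Sum)", data=df).fit()
print(sm.stats.anova_lm(reduced, typ=3))
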

3-Factor ANOVA

To discuss a 3-factor ANOVA, another hypothetical example will be used. In this example a farmer wants to test the effects of different combinations of fertilizer, amount of water, and amount of sunlight on crop yield. The farmer decided to bin the amounts of water and sunlight into categories labeled simply low, medium, and high, without specifying how much each category represents. Thus, the study looks like the following:

| Variable   | Categories         |
|------------|--------------------|
| Fertilizer | A, B               |
| Water      | Low, Medium, High  |
| Sunlight   | Low, Medium, High  |
| Crop Yield | Continuous measure |

The first analysis will look at the interaction of the 3 factors:

crop_yield ~ fertilizer + water + sunlight + fertilizer:water:sunlight

Crop Yield ANOVA Table

| Source                    | Sum of Squares | Degrees of Freedom | Mean Square | F-statistic |
|---------------------------|----------------|--------------------|-------------|-------------|
| Between samples           | ##             | ##                 | ##          | ##***       |
| Fertilizer                | ##             | ##                 | ##          | ##          |
| Water                     | ##             | ##                 | ##          | ##          |
| Sunlight                  | ##             | ##                 | ##          | ##          |
| Fertilizer:Water:Sunlight | ##             | ##                 | ##          | ##**        |
| Within samples            | ##             | ##                 | ##          |             |
| Total                     | ##             | ##                 | ##          |             |

The important rows are "Between samples" and the interaction term row, marked with *** and ** respectively. If the between-samples F-statistic is significant, it means the overall model explains a statistically significant amount of variance, which then permits one to look at the interaction term. Note that, coming from the ANOVA framework, not all packages report the model's F-statistic; some only report the interaction and main effect F-statistics and their associated p-values.

If the interaction term is significant, the next step would be to check the model's assumptions and proceed from there. However, if the interaction term is non-significant, then the next step is to break the interaction term down into multiple 2-way interactions. The next model run would look like:

crop_yield ~ fertilizer + water + sunlight + fertilizer:water + fertilizer:sunlight + water:sunlight

The focus is now on the multiple interaction terms.

Crop Yield ANOVA Table

| Source              | Sum of Squares | Degrees of Freedom | Mean Square | F-statistic |
|---------------------|----------------|--------------------|-------------|-------------|
| Between samples     | ##             | ##                 | ##          | ##***       |
| Fertilizer          | ##             | ##                 | ##          | ##          |
| Water               | ##             | ##                 | ##          | ##          |
| Sunlight            | ##             | ##                 | ##          | ##          |
| Fertilizer:Water    | ##             | ##                 | ##          | ##**        |
| Fertilizer:Sunlight | ##             | ##                 | ##          | ##**        |
| Water:Sunlight      | ##             | ##                 | ##          | ##**        |
| Within samples      | ##             | ##                 | ##          |             |
| Total               | ##             | ##                 | ##          |             |

As before, the important rows are "Between samples" and the interaction term rows, marked with *** and ** respectively: a significant between-samples F-statistic permits one to look at the individual interaction terms.

From here, if all the interaction terms are statistically significant, then one checks the assumptions and proceeds from there. However, if any of the interaction terms are non-significant, then they need to be removed and the model needs to be re-run.

This process is repeated until all non-significant interaction terms are removed. In the final model, if a factor is not part of an interaction term then one can interpret the main effect of that factor, whereas if a factor is part of an interaction term then it needs to be interpreted as such. A final crop yield table is shown next, and the interpretation follows.

Final Crop Yield ANOVA Table

| Source           | Sum of Squares | Degrees of Freedom | Mean Square | F-statistic |
|------------------|----------------|--------------------|-------------|-------------|
| Between samples  | ##             | ##                 | ##          | ##*         |
| Fertilizer       | ##             | ##                 | ##          | ##          |
| Water            | ##             | ##                 | ##          | ##          |
| Sunlight         | ##             | ##                 | ##          | ##*         |
| Fertilizer:Water | ##             | ##                 | ##          | ##*         |
| Within samples   | ##             | ##                 | ##          |             |
| Total            | ##             | ##                 | ##          |             |

* Indicates statistical significance at $\alpha$ = 0.05

In the Final Crop Yield ANOVA Table the only remaining significant interaction is between fertilizer and water. This indicates the effect of fertilizer and water on crop yield is synergistic, and that different combinations of their levels produce different crop yields. The effect of sunlight on crop yield is statistically significant but does not interact with the other factors.
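
The iterative term-removal process described above can be captured in a small helper. This is a sketch only, assuming a DataFrame with the crop yield columns from this hypothetical example (crops, crop_yield, fertilizer, water, and sunlight are all made-up names):

import statsmodels.api as sm
from statsmodels.formula.api import ols

def prune_interactions(data, response, main_effects, interactions, alpha=0.05):
    """Refit the model, dropping the least significant interaction each
    round, until every remaining interaction is significant."""
    terms = list(interactions)
    while terms:
        formula = response + " ~ " + " + ".join(main_effects + terms)
        aov = sm.stats.anova_lm(ols(formula, data=data).fit(), typ=3)
        pvals = aov.loc[terms, "PR(>F)"]  # anova_lm rows are labeled by term name
        if (pvals <= alpha).all():
            break
        terms.remove(pvals.idxmax())  # drop the least significant interaction
    return response + " ~ " + " + ".join(main_effects + terms)

# Hypothetical usage with the crop yield example:
# final_formula = prune_interactions(
#     crops, "crop_yield",
#     ["C(fertilizer, Sum)", "C(water, Sum)", "C(sunlight, Sum)"],
#     ["C(fertilizer, Sum):C(water, Sum)",
#      "C(fertilizer, Sum):C(sunlight, Sum)",
#      "C(water, Sum):C(sunlight, Sum)"])
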

3-way ANOVA with Python

Don't forget to check the assumptions before interpreting the results! First, load the libraries and data needed. Below, Pandas, Researchpy, StatsModels, and the data set will be loaded. Any further libraries needed for a specific method will be imported within that method's demonstration.

import pandas as pd
import researchpy as rp
import statsmodels.api as sm  # used to load the data and for anova_lm later

Now to load the data set and take a high-level look at the variables. This data comes from Stata and is used in their ANOVA documentation. The data is similar to the above discussion on the 3-factor ANOVA, so time to dive in. A study was designed to determine the optimal operating conditions to maximize yield. The study assessed three temperature settings, two chemical supply companies, and two mixing methods.

manufac = sm.datasets.webuse('manuf')

manufac.info()

RangeIndex: 36 entries, 0 to 35
Data columns (total 4 columns):
temperature    36 non-null int16
chemical       36 non-null int16
method         36 non-null int16
yield          36 non-null int16
dtypes: int16(4)
memory usage: 368.0 bytes
rp.summary_cat(manufac[["temperature", "chemical", "method"]])

| Variable    | Outcome | Count | Percent |
|-------------|---------|-------|---------|
| temperature | 3       | 12    | 33.33   |
|             | 2       | 12    | 33.33   |
|             | 1       | 12    | 33.33   |
| chemical    | 2       | 18    | 50.00   |
|             | 1       | 18    | 50.00   |
| method      | 2       | 18    | 50.00   |
|             | 1       | 18    | 50.00   |

rp.summary_cont(manufac["yield"])

| Variable | N    | Mean     | SD       | SE      | 95% Conf. | Interval |
|----------|------|----------|----------|---------|-----------|----------|
| yield    | 36.0 | 7.583333 | 3.237062 | 0.53951 | 6.488069  | 8.678598 |

3-Factor ANOVA using StatsModels

The official documentation for conducting an ANOVA with StatsModels can be found here. Type III sums of squares are what is typically desired and are the default output for SPSS, SAS, and Stata. To get Type III sums of squares from StatsModels, a minor extra step needs to be taken: specifying ", Sum" (sum-to-zero contrast coding) inside each C() term in the formula. Without this, only Type I and Type II sums of squares will be calculated correctly.

from statsmodels.formula.api import ols

# "yield" is a reserved word in Python, which breaks the formula parser,
# so copy the column to a usable name
manufac["Yield"] = manufac["yield"]

model = ols("Yield ~ C(temperature, Sum) + C(chemical, Sum) + C(method, Sum) + C(temperature, Sum)*C(chemical, Sum)*C(method, Sum)", data=manufac).fit()

aov_table = sm.stats.anova_lm(model, typ=3)
aov_table
|                                                     | sum_sq  | df   | F          | PR(>F)       |
|-----------------------------------------------------|---------|------|------------|--------------|
| Intercept                                           | 2070.25 | 1.0  | 299.313253 | 4.675019e-15 |
| C(temperature, Sum)                                 | 30.50   | 2.0  | 2.204819   | 1.321133e-01 |
| C(chemical, Sum)                                    | 12.25   | 1.0  | 1.771084   | 1.957540e-01 |
| C(method, Sum)                                      | 42.25   | 1.0  | 6.108434   | 2.093730e-02 |
| C(temperature, Sum):C(chemical, Sum)                | 24.50   | 2.0  | 1.771084   | 1.916714e-01 |
| C(temperature, Sum):C(method, Sum)                  | 87.50   | 2.0  | 6.325301   | 6.216723e-03 |
| C(chemical, Sum):C(method, Sum)                     | 0.25    | 1.0  | 0.036145   | 8.508161e-01 |
| C(temperature, Sum):C(chemical, Sum):C(method, Sum) | 3.50    | 2.0  | 0.253012   | 7.785036e-01 |
| Residual                                            | 166.00  | 24.0 |            |              |

The interaction between temperature, chemical type, and method is statistically non-significant. This indicates that different level combinations of the three factors do not produce a significant difference in yield. Thus, this term should be removed from the ANOVA model and the model re-run, looking at the 2-factor interactions.

model = ols("Yield ~ C(temperature, Sum) + C(chemical, Sum) + C(method, Sum) + C(temperature, Sum):C(chemical, Sum) + C(temperature, Sum):C(method, Sum) + C(chemical, Sum):C(method, Sum)", data=manufac).fit()

aov_table = sm.stats.anova_lm(model, typ=3)
aov_table
|                                      | sum_sq  | df   | F          | PR(>F)       |
|--------------------------------------|---------|------|------------|--------------|
| Intercept                            | 2070.25 | 1.0  | 317.560472 | 4.292571e-16 |
| C(temperature, Sum)                  | 30.50   | 2.0  | 2.339233   | 1.163633e-01 |
| C(chemical, Sum)                     | 12.25   | 1.0  | 1.879056   | 1.821599e-01 |
| C(method, Sum)                       | 42.25   | 1.0  | 6.480826   | 1.717637e-02 |
| C(temperature, Sum):C(chemical, Sum) | 24.50   | 2.0  | 1.879056   | 1.728955e-01 |
| C(temperature, Sum):C(method, Sum)   | 87.50   | 2.0  | 6.710914   | 4.467613e-03 |
| C(chemical, Sum):C(method, Sum)      | 0.25    | 1.0  | 0.038348   | 8.462683e-01 |
| Residual                             | 169.50  | 26.0 |            |              |

The only 2-factor interaction that is statistically significant is between temperature and method; the other 2-factor interactions should be removed and the model re-run.

model = ols("Yield ~ C(temperature, Sum) + C(chemical, Sum) + C(method, Sum) + C(temperature, Sum):C(method, Sum)", data=manufac).fit()

aov_table = sm.stats.anova_lm(model, typ=3)
aov_table
|                                    | sum_sq  | df   | F          | PR(>F)       |
|------------------------------------|---------|------|------------|--------------|
| Intercept                          | 2070.25 | 1.0  | 309.072072 | 5.239664e-17 |
| C(temperature, Sum)                | 30.50   | 2.0  | 2.276705   | 1.206672e-01 |
| C(chemical, Sum)                   | 12.25   | 1.0  | 1.828829   | 1.867181e-01 |
| C(method, Sum)                     | 42.25   | 1.0  | 6.307593   | 1.784464e-02 |
| C(temperature, Sum):C(method, Sum) | 87.50   | 2.0  | 6.531532   | 4.552060e-03 |
| Residual                           | 194.25  | 29.0 |            |              |

The sums of squares that each factor accounted for did not change as non-significant terms were removed from the model, due to the nature of Type III sums of squares. The information that did change is the F-statistics and their respective p-values, as well as the residual sum of squares and residual degrees of freedom.
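
As noted earlier, anova_lm does not report an overall model F-statistic. It is, however, available on the fitted OLS results object if it's needed for reporting:

# Overall model F-test for the final fitted model
print(f"F({model.df_model:.0f}, {model.df_resid:.0f}) = {model.fvalue:.4f}, p = {model.f_pvalue:.4f}")
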

Interpretation

A study was designed to determine the optimal operating conditions to maximize yield. The study assessed temperature settings, chemical supply companies, and mixing methods. The overall interaction between the three factors was statistically non-significant, F(2, 24) = 0.2530, p-value = 0.7785. After looking at the reduced model which included all possible 2-factor interactions, the only significant interaction was between temperature and method. The non-significant interactions were removed, and the final model showed that temperature and method have a statistically significant interaction effect on yield, F(2, 29) = 6.5315, p-value = 0.0046.

Note: It's important to include the other models in documentation in order to show the full results of all models tested.

Assumption Check and Post-hoc Testing

To see how to check the assumptions using an ANOVA model, please refer to the assumption check section of the one-way ANOVA. For a more in-depth look at the assumptions and some potential remedies, please check out this page.

To see how to conduct post-hoc testing, please refer to the post-hoc testing section of the one-way ANOVA page. Before conducting post-hoc testing, note that an easy way to test for group differences on the interaction term using StatsModels is to create a column in the data frame representing the group combinations of the significant interaction term, and to use this column instead of a groupby object. A quick demonstration is below, but still refer to the post-hoc testing page for a more in-depth demonstration.

import statsmodels.stats.multicomp as mc

interaction_groups = "Temp_" + manufac.temperature.astype(str) + " & " + "Method_" + manufac.method.astype(str)

comp = mc.MultiComparison(manufac["Yield"], interaction_groups)
post_hoc_res = comp.tukeyhsd()
post_hoc_res.summary()
Multiple Comparison of Means - Tukey HSD, FWER=0.05

| group1            | group2            | meandiff | lower   | upper   | reject |
|-------------------|-------------------|----------|---------|---------|--------|
| Temp_1 & Method_1 | Temp_1 & Method_2 | -2.0     | -6.6071 | 2.6071  | False  |
| Temp_1 & Method_1 | Temp_2 & Method_1 | -1.5     | -6.1071 | 3.1071  | False  |
| Temp_1 & Method_1 | Temp_2 & Method_2 | 1.5      | -3.1071 | 6.1071  | False  |
| Temp_1 & Method_1 | Temp_3 & Method_1 | -1.5     | -6.1071 | 3.1071  | False  |
| Temp_1 & Method_1 | Temp_3 & Method_2 | 4.0      | -0.6071 | 8.6071  | False  |
| Temp_1 & Method_2 | Temp_2 & Method_1 | 0.5      | -4.1071 | 5.1071  | False  |
| Temp_1 & Method_2 | Temp_2 & Method_2 | 3.5      | -1.1071 | 8.1071  | False  |
| Temp_1 & Method_2 | Temp_3 & Method_1 | 0.5      | -4.1071 | 5.1071  | False  |
| Temp_1 & Method_2 | Temp_3 & Method_2 | 6.0      | 1.3929  | 10.6071 | True   |
| Temp_2 & Method_1 | Temp_2 & Method_2 | 3.0      | -1.6071 | 7.6071  | False  |
| Temp_2 & Method_1 | Temp_3 & Method_1 | 0.0      | -4.6071 | 4.6071  | False  |
| Temp_2 & Method_1 | Temp_3 & Method_2 | 5.5      | 0.8929  | 10.1071 | True   |
| Temp_2 & Method_2 | Temp_3 & Method_1 | -3.0     | -7.6071 | 1.6071  | False  |
| Temp_2 & Method_2 | Temp_3 & Method_2 | 2.5     | -2.1071 | 7.1071  | False  |
| Temp_3 & Method_1 | Temp_3 & Method_2 | 5.5      | 0.8929  | 10.1071 | True   |
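
The Tukey results object can also plot simultaneous confidence intervals, which makes the significant pairs easy to spot visually:

import matplotlib.pyplot as plt

# Confidence interval plot for all pairwise comparisons
post_hoc_res.plot_simultaneous(xlabel="Mean yield", ylabel="Group")
plt.show()
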

References

Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. (2004). Applied linear statistical models (5th ed.). New York, NY: McGraw-Hill Irwin.
Rosner, B. (2015). Fundamentals of biostatistics (8th ed.). Boston, MA: Cengage Learning.
Ott, R. L., and Longnecker, M. (2010). An introduction to statistical methods and data analysis. Belmont, CA: Brooks/Cole.