# One-way ANOVA

If you are looking for how to run the code jump to the next section or
if you would like some theory/refresher then start with this section or
see a publicly available peer reviewed article such as this one.

ANOVA stands for "Analysis of Variance" and is an omnibus test, meaning
it tests for a difference overall between all groups.
The one-way ANOVA, also referred to as one factor ANOVA,
is a parametric test used to test for a statistically significant difference
of an outcome between 3 or more groups. Since it is an omnibus test, it
tests for a difference overall, i.e. at least one of the groups is
statistically significantly different than the others. However, if the
ANOVA is significant one cannot tell which group is different. In order
to tell which group is different, one has to conduct planned or post-hoc
comparisons. As with all parametric tests, there are certain conditions
that need to be met in order for the test results to be considered reliable.

The reason why it's called an one-way or one factor ANOVA even though
there are 3 or more groups being tested is because those groups are
under one categorical variable, such as race or education level, and the name is referring to the number
of variables in the analysis and not the number of groups. If there are
two variables being compared it would technically be called a two-way,
or two factor, ANOVA if both variables are categorical, or it could be
called an ANCOVA if the 2^{nd} variable is continuous. The "C"
doesn't stand for continuous, it stands for covariate.

When working from the ANOVA framework, independent variables are sometimes
referred to as *factors* and the number of groups within each variable
are called *levels*, i.e. one variable with 3 categories could
be referred to as a factor with 3 levels.

Parametric test assumptions

- Population distributions are normal
- Samples have equal variances
- Independence

- $H_0: \bar{x}_1 = \bar{x}_2 = \bar{x}_3 = \; ... \; = \bar{x}_k$
- $H_A: \text{At least one of the groups means differ}$

The test statistic is the F-statistic and compares the mean square
between samples ($MS_B$) to the mean square within sample ($MS_W$).
This *F-statistic* can be calculated using the following formula:

- $F = \frac{MS_B}{MS_W}$
- Where,
- $MS_B= \frac{\text{Sum of square between sample } (SS_B)}{(k - 1)}$
- $MS_W= \frac{\text{Sum of square within sample } (SS_W)}{(n_T - k)}$
- $k$ is the number of groups
- $n_T$ is the total number of observations
- and where,
- Sum of square between sample ($SS_B$) = $\sum_k n_k(\bar{x}_k - \bar{x})^2$
- Sum of square within sample ($SS_W$) = $\sum_{i, k}(x_{i, k} - \bar{x}_k)^2$ or can be calculated as $\sum_k (n_k - 1)s_k^2$

One rejects the the null hypothesis, $H_0$, if the computed F-static is greater than the critical F-statistic. The critical F-statistic is determined by the degrees of freedom and alpha, $\alpha$, value.

- Reject $H_0$ if $\text{calucated F-statistic} > \text{critical F-statistic}$

Before the decision is made to accept or reject the null hypothesis the assumptions need to be checked. See this page on how to check the parametric assumptions in detail - how to check the assumptions for this example will be demonstrated near the end.

Let's make sense of all these mathmatical terms. In order to do that, let's start with a generic ANOVA table filled in with symbols and the data set used in this example for now.

Source | Sum of Squares | Degrees of Freedom | Mean Square | F-statistic |
---|---|---|---|---|

Between samples | $SS_B$ | $k - 1$ | $MS_B = \frac{SS_B}{(k - 1)}$ | $\frac{MS_B}{MS_W}$ |

Within samples | $SS_W$ | $n_T - k$ | $MS_W = \frac{SS_W}{(n_T - k)}$ | |

Total | $TSS = SS_B + SS_W$ | $n_T - 1$ | ||

Note: TSS means total sum of squares |

Drug Dose | Libido | Sample Size | Sample Means | Sample Variance | ||||
---|---|---|---|---|---|---|---|---|

Placebo ($k_1$) | 3 | 2 | 1 | 1 | 4 | 5 ($n_1$) | 2.2 ($\bar{x}_1$) | 1.7 ($s^2_1$) |

Low ($k_2$) | 5 | 2 | 4 | 2 | 3 | 5 ($n_2$) | 3.2 ($\bar{x}_2$) | 1.7 ($s^2_2$) |

High ($k_3$) | 7 | 4 | 5 | 3 | 6 | 5 ($n_3$) | 5.0 ($\bar{x}_3$) | 2.5 ($s^2_3$) |

Total ($k= 3$) | 15 ($n_T$) | 3.5 ($\bar{x}$) | 3.1 ($s^2$) |

Now using the formulas from above, the ANOVA table can be filled in.

Source | Sum of Squares | Degrees of Freedom | Mean Square | F-statistic |
---|---|---|---|---|

Between samples | 20.15 | 2 | 10.07 | 5.11 |

Within samples | 23.6 | 12 | 1.97 | |

Total | 43.75 | 14 |

In order to tell if the calcualted F-statistic is statistically significant, one would look up the F-statistic based on the degress of freedom and alpha level - using statistical software this doesn't need to be done since it'll be provided.

Fear not if math is not your strong suit. All this is being calucated when using the methods of a statistical software or programming language. It's good to know what is going on behind the scences. References for this section are provided at the end of the page.

## One-way ANOVA with Python

Don't forget to check the assumptions before interpreting the results!
First to load the libraries and data needed. Below, *Pandas*, *Researchpy* and the
data set will be loaded. Specific libraries for each demonstrated method
below will contain any further libraries that are need is using that
demonstration.

```
import pandas as pd
import researchpy as rp
```

Now to load the data set and take a high level look at the variables.

```
df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/difficile.csv")
df.drop('person', axis= 1, inplace= True)
# Recoding value from numeric to string
df['dose'].replace({1: 'placebo', 2: 'low', 3: 'high'}, inplace= True)
df.info()
```

`rp.summary_cont(df['libido'])`

Variable | N | Mean | SD | SE | 95% Conf. | Interval | |
---|---|---|---|---|---|---|---|

0 | libido | 15.0 | 3.466667 | 1.76743 | 0.456349 | 2.487896 | 4.445437 |

`rp.summary_cont(df['libido'].groupby(df['dose']))`

N | Mean | SD | SE | 95% Conf. | Interval | |
---|---|---|---|---|---|---|

dose | ||||||

high | 5 | 5.0 | 1.581139 | 0.707107 | 3.450484 | 6.549516 |

low | 5 | 3.2 | 1.303840 | 0.583095 | 1.922236 | 4.477764 |

placebo | 5 | 2.2 | 1.303840 | 0.583095 | 0.922236 | 3.477764 |

### One-way ANOVA using scipy.stats

Conducting an one-way ANOVA using scipy.stats is quick and only returns the restuling F-statistic and p-value of the test.

```
import scipy.stats as stats
stats.f_oneway(df['libido'][df['dose'] == 'high'],
df['libido'][df['dose'] == 'low'],
df['libido'][df['dose'] == 'placebo'])
```

Before the results should be interpreted, the assumptions of the test should be checked. For example purposes, the results will be interpreted before checking the assumptions.

#### Interpretation

A new medication was developed to increase the libido of those who take the medication. The purpose of this study was to test for a difference between the dosage levels. The overall average libido was 3.5 95% CI(2.5, 4.4) with group averages of 2.2 95% CI(0.9, 3.5) for the placebo group; 3.2 95% CI(1.9, 4.5) for the low dose group; and 5.0 95% CI(3.5, 6.5) for the high dose group. There is a statistically significant difference between the groups and their effects the libido, F= 5.12, p-value= 0.0247.

### One-way ANOVA using StatsModels

This method conducts a one-way ANOVA in two steps:

- Fit the model using an estimation method,
- The default estimation method in most statistical software packages is ordinary least squares
- Not going to dive into estimation methods as it's out of scope of this section's topic
- If you are not familiar with it and don't care to really dive into it, then just know it's one of many types of estimation methods that aim to provide estimates of the parameter (mean, propertion, etc.) being tested
- Pass fitted model into ANOVA method to produce ANOVA table

Here is the official StatsModels documentation on an ANOVA. The general structure for entering the equation is:

ols("outcome_variable ~ independent_variable", data= data_frame).fit()

In the case of an ANOVA, the independent variable will be categorical. The pseudo code above would work if you were conducting a simple linear regression, but that's not what we are here for! Have to modify the pseudo code which would make it look like:

ols("outcome_variable ~ C(independent_variable)", data= data_frame).fit()

Now to use real code. In the code below there is an argument "typ" in the anova_lm method, this determines how the sum of squares is calculated. The calculation differences is a bit out of scope, but it's encouraged to learn more about them. An easy to read primer can be found here. Additionally, to see how to conduct an ANOVA with type 3 sum of squares see this page - this requires one additional step.

```
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('libido ~ C(dose)', data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
aov_table
```

sum_sq | df | F | PR(>F) | |
---|---|---|---|---|

C(dose) | 20.133333 | 2.0 | 5.118644 | 0.024694 |

Residual | 23.600000 | 12.0 | NaN | NaN |

Note: C(dose)= between samples and Residual= within samples |

This table provides all the information one needs in order to interprete if the results are significant; however, it does not provide any effect size measures to tell if the statistical significance is meaningful. The function below calculates eta-squared ($\eta^2$) and omega-squared ($\omega^2$). A quick note, $\eta^2$ is the exact same thing as $R^2$ except when coming from the ANOVA framework people call it $\eta^2$; $\omega^2$ is considered a better measure of effect size since it is unbiased in it's calculation by accounting for the degrees of freedom in the model.

sum_sq | df | mean_sq | F | PR(>F) | eta_sq | omega_sq | |
---|---|---|---|---|---|---|---|

C(dose) | 20.133333 | 2.0 | 10.066667 | 5.118644 | 0.024694 | 0.460366 | 0.354486 |

Residual | 23.600000 | 12.0 | 1.966667 | NaN | NaN | NaN | NaN |

#### Interpretation

A new medication was developed to increase libido. The purpose of this study was to test for a difference between
the dosage levels. The overall average libido was 3.5 95% CI(2.5, 4.4) with group
averages of 2.2 95% CI(0.9, 3.5) for the placebo group; 3.2 95% CI(1.9, 4.5)
for the low dose group; and 5.0 95% CI(3.5, 6.5) for the high dose group.
There is a statistically significant difference between the groups and their
effects the libido, F= 5.12, p-value= 0.0247, with an overall large effect,
$\omega^2$= 0.35.

In order to tell which groups differed significantly, post-hoc
tests need to be conducted. Before one goes through that work, the assumptions
should be checked first in case any modifications need to be made to the
model.

## Assumption check

The assumptions in this section need to be met in order for the test results to be considered valid. A more in-depth look at parametric assumptions is provided here, which includes some potential remedies.

### Independence

This assumption is tested when the study is designed. What this means is that all groups are mutually exclusive, i.e. an individual can only belong in one group. Also, this means that the data is not repeated measures (not collected through time). In this example, this condition is met.

### Normality

The assumption of normality is tested on the residuals of the model when coming from an ANOVA or regression framework. One method for testing the assumption of normality is the Shapiro-Wilk test. This can be completed using the shapiro() method from scipy.stats. Ensure that scipy.stats is imported for the following method to work. Unfortunately the output is not labelled, but it's (W-test statistic, p-value).

```
import scipy.stats as stats
stats.shapiro(model.resid)
```

The test is non-significant, W= 0.9167, p= 0.1715, which indicates that the residuals are normally
distributed.

Another way to test the assumption is through a visual check- this is helpful
when the sample is large. The reason this is true is that as the sample size
increases, the statistical test's ability to reject the null hypothesis increases,
i.e. it gains power to detect smaller differences as the sample size n increases.

One method of visually checking the distribution is to use a probability plot
with or without the correlation value, $R^2$, to assess the observed values
correlation with the theoretical distribution in question - in the current case it would
be the Gaussian (a.k.a the normal) distribution. This can be completed
by using the probplot()
method from Scipy.stats. If using the $R^2$ measure, one can refer to the
NIST/SEMATECH e-handbook of statistical methods
to see if the value is significant.

```
import matplotlib.pyplot as plt
fig = plt.figure(figsize= (10, 10))
ax = fig.add_subplot(111)
normality_plot, stat = stats.probplot(model.resid, plot= plt, rvalue= True)
ax.set_title("Probability plot of model residual's", fontsize= 20)
ax.set
plt.show()
```

This is a case where the statistical testing method indicated the residuals were normally distributed, but the probability plot correlation coefficient (PPCC) indicated non-normality. Given the current example's sample size is small, N= 15, the Shapiro-Wilk test indicated normality, and that the calculated PPCC, $R^2$= 0.9349, is ever so slightly smaller than the table PPC, $R^2$= 0.9376, it is reasonable to state this assumption is met. However, looking at the plotted probability plot and the residual structure it would also be reasonable to transform the data for the analysis, or to use a non-parametric statistical test such as Welch's ANOVA or the Kruskal-Wallis ANOVA.

### Homogeneity of variance

The final assumption is that all groups have equal variances. One method for testing this assumption is the Levene's test of homogeneity of variances. This can be completed using the levene() method from Scipy.stats.

```
stats.levene(df['libido'][df['dose'] == 'high'],
df['libido'][df['dose'] == 'low'],
df['libido'][df['dose'] == 'placebo'])
```

The Levene's test of homogeneity of variances is not significant which indicates that the groups have non-statistically significant difference in their varability. Again, it may be worthwhile to check this assumption visually as well.

```
fig = plt.figure(figsize= (10, 10))
ax = fig.add_subplot(111)
ax.set_title("Box Plot of Libido by Dosage", fontsize= 20)
ax.set
data = [df['libido'][df['dose'] == 'placebo'],
df['libido'][df['dose'] == 'low'],
df['libido'][df['dose'] == 'high']]
ax.boxplot(data,
labels= ['Placebo', 'Low', 'High'],
showmeans= True)
plt.xlabel("Drug Dosage")
plt.ylabel("Libido Score")
plt.show()
```

The graphical testing of homogeneity of variances supports the statistical
testing findings which is the groups have equal variance.

By default box plots show the median (orange line in graph above). The green
triangle is the mean for each group which was an additional argument that
was passed into the method.

There are different ways to handle heteroskedasticity (unequal variance) and
a decision needs to be made. Some options include, but is not limited to,
transformming the dependent variable (outcome), could use trimmed means,
robust standard errors, or use a parametric test suchs as the Welch's t-test.
For a more in-depth look at the assumptions and some potential remedies, please
check out this page.

## Post-hoc Testing

By conducting post-hoc tests or planned comparisons it allows one to see which group(s) significantly differ from each other; remember that the ANOVA is an omnibus test! There are a few different approaches that can be taken while conducting these tests, ones that are implemented in StatsModels currently are:

- Tukey Honestly Significant Difference (HSD)
- Tests all pairwise group comparisons while controlling for the multiple comparisons which protects the familywise error rate and from making a Type I error
- Not technically a "post-hoc" test since this test can be used as a test independently of the ANOVA and can be planned before hand
- More in-depth information about this statistical method can be found here
- Bonferroni
- Tests groups for a diffence while controlling for the multiple comparisons which protects the familywise error rate and from making a Type I error. It should be noted that some statistical software reports the Bonferroni adjusted confidence interval, however this is not the case in Python at this time (unless one were to program a function to do so)
- This method is common because it is fast to calculate - take the number of groups to be compared and divide that by the initial alpha value $\alpha_{Bonf}= \frac{\text{Number of groups}}{\alpha}$. In the current example there are 3 groups being compared (placebo vs. low, placebo vs. high, and low vs. high) which had $\alpha$= 0.05 making the equation become $\alpha_{Bonf}= \frac{\text{3}}{0.05}= 0.0167$. Thus, in order for a comparison to be considered significant a p-value would need to be < 0.0167 in order to be considered statistically significant.
- More in-depth information about this statistical method can be found here
- Šidák (a.k.a. Dunn-Šidák)
- Tests groups for a diffence while controlling for the multiple comparisons which protects the familywise error rate and from making a Type I error. It should be noted that some statistical software reports the Šidák adjusted confidence interval, however this is not the case in StatsModels at this time (unless one were to program a function to do so)
- This method is common because it is pretty fast to calculate, the formula is $\alpha_{Sid}= 1 - (1 - \alpha)^{\frac{1}{\text{Number of groups}}}$. In the current example there are 3 groups being compared (placebo vs. low, placebo vs. high, and low vs. high) which had $\alpha$= 0.05 making the equation become $\alpha_{Sid}= 1 - (1 - 0.05)^{\frac{1}{3}}= 0.0170$. Thus, in order for a comparison to be considered significant a p-value would need to be < 0.0170 to be considered statistically significant.
- More in-depth information about this statistical method can be found here.

### Tukey Honestly Significant Difference (HSD)

Have to use a library that has not been imported yet; please see the official documentation about this method for more information if interested.

```
import statsmodels.stats.multicomp as mc
comp = mc.MultiComparison(df['libido'], df['dose'])
post_hoc_res = comp.tukeyhsd()
post_hoc_res.summary()
```

group1 | group2 | meandiff | p-adj | lower | upper | reject |
---|---|---|---|---|---|---|

high | low | -1.8 | 0.1472 | -4.1651 | 0.5651 | False |

high | placebo | -2.8 | 0.0209 | -5.1651 | -0.4349 | True |

low | placebo | -1.0 | 0.5171 | -3.3651 | 1.3651 | False |

Now to make sense of the table.

- At the top the table testing information is provided
- FWER is the familywise error rate, i.e. what $\alpha$ is being set to and controlled at
- group1 and group2 columns are the groups being compared
- meandiff is the difference between the group means
- p-adj is the corrected p-value which takes into account the multiple comparisons being conducted
- lower is the lower band of the confidence interval. In the current example the confidence interval at the 95% level since $\alpha$= 0.05.
- upper is the upper band of the confidence interval. In the current example the confidence interval at the 95% level since $\alpha$= 0.05.
- reject is the decision rule based on the corrected p-value

It is possible to plot the difference using this method as well!

`post_hoc_res.plot_simultaneous(ylabel= "Drug Dose", xlabel= "Score Difference")`

Using Tukey HSD to test for differences between groups indicates that there is a statistically significant difference in libido score between those who took the placebo and those who took the high dosage of the medication, no other groups differed significantly. What this indicates is that the high dosage of the medication is effective at increasing libido, but the low dosage is not.

### Bonferroni Correction

Have to use a library that has not been imported yet (if you didn't do the
Tukey HSD example above); please see the
official documentation
about this method for more information if interested.

The documentation for allpairtest
is not in the best shape at the time of writing this. The method returns 3
objects, one is a completed table object, the second is the data of the table, and the third
is the data of the table with the table headings - it is not understood why
the developers of StatsModels did this. All that is needed is the first object.

Before jumping into the code, let's take a look at pseudo code to make sense
of this method.

The documentation shows one needs to supply this method with a statistical test method, which can either be a user defined function or a function from another Python library - in this case independent sample t-tests will be conducted. One also has to state the correction method to be applied to the p-value to adjust for the multiple comparisons taking place. Now to see the method in action.

```
import statsmodels.stats.multicomp as mc
comp = mc.MultiComparison(df['libido'], df['dose'])
tbl, a1, a2 = comp.allpairtest(stats.ttest_ind, method= "bonf")
tbl
```

group1 | group2 | stat | pval | pval_corr | reject |
---|---|---|---|---|---|

high | low | 1.964 | 0.0851 | 0.2554 | False |

high | placebo | 3.0551 | 0.0157 | 0.0471 | True |

low | placebo | 1.2127 | 0.2598 | 0.7795 | False |

Now to make sense of the table.

- At the top the table testing information is provided
- FWER is the familywise error rate, i.e. what $\alpha$ is being set to and controlled at
- method is the correction method that is being applied to the p-values
- Then there is the adjusted p-value (adjusted $\alpha$) for both the Sidak and Bonferroni correction methods
- group1 and group2 columns are the groups being compared
- stat is the test statistic value; in this case it would be the t statistic
- pval is the uncorrected p-value returned from the supplied "statistical_test_method"
- pval_corr is the corrected p-value which has been corrected using whichever "correction_method" was supplied
- reject is the decision rule based on the corrected p-value

Conducting comparisons using the Bonferroni correction indicates that the only groups that differed significantly are those who took the high dose and the placebo dose.

### ŠidÁk Correction (a.k.a. Dunn-ŠidÁk Correction)

Have to use a library that has not been imported yet (if you didn't do the
Tukey HSD or Bonferroni examples above); please see the
official documentation
about this method for more information if interested.

The documentation for allpairtest
is not in the best shape at the time of writing this. The method returns 3
objects, one is a completed table object, the second is the data of the table, and the third
is the data of the table with the table headings - it is not understood why
the developers of StatsModels did this. All that is needed is the first object.

Before jumping into the code, let's take a look at pseudo code to make sense
of this method.

The documentation shows one needs to supply this method with a statistical test method, which can either be a user defined function or a function from another Python library - in this case independent sample t-tests will be conducted. One also has to state the correction method to be applied to the p-value to adjust for the multiple comparisons taking place. Now to see the method in action.

```
import statsmodels.stats.multicomp as mc
comp = mc.MultiComparison(df['libido'], df['dose'])
tbl, a1, a2 = comp.allpairtest(stats.ttest_ind, method= "sidak")
tbl
```

group1 | group2 | stat | pval | pval_corr | reject |
---|---|---|---|---|---|

high | low | 1.964 | 0.0851 | 0.2343 | False |

high | placebo | 3.0551 | 0.0157 | 0.0464 | True |

low | placebo | 1.2127 | 0.2598 | 0.5945 | False |

Now to make sense of the table.

- At the top the table testing information is provided
- FWER is the familywise error rate, i.e. what $\alpha$ is being set to and controlled at
- method is the correction method that is being applied to the p-values
- Then there is the adjusted p-value (adjusted $\alpha$) for both the Sidak and Bonferroni correction methods
- group1 and group2 columns are the groups being compared
- stat is the test statistic value; in this case it would be the t statistic
- pval is the uncorrected p-value returned from the supplied "statistical_test_method"
- pval_corr is the corrected p-value which has been corrected using whichever "correction_method" was supplied
- reject is the decision rule based on the corrected p-value

Conducting comparisons using the Šidák correction indicates that the only groups that differed significantly are those who took the high dose and the placebo dose.

## References

Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. (2004).*Applied linear statistical models*(5

^{th}). New York, NY: McGraw-Hill Irwin.

Rosner, B. (2015).

*Fundamentals of Biostatistics*(8

^{th}). Boston, MA: Cengage Learning.

Ott, R. L., and Longnecker, M. (2010).

*An introduction to statistical methods and data analysis.*Belmon, CA: Brooks/Cole.