Python for Data Science

One-way ANOVA

If you are looking for how to run the code, jump to the next section; if you would like some theory or a refresher, start with this section, or see a publicly available peer-reviewed article such as this one.

ANOVA stands for "Analysis of Variance" and is an omnibus test, meaning it tests for a difference overall between all groups. The one-way ANOVA, also referred to as the one factor ANOVA, is a parametric test used to test for a statistically significant difference in an outcome between 3 or more groups. Since it is an omnibus test, it tests for a difference overall, i.e. at least one of the groups is statistically significantly different from the others. However, a significant ANOVA by itself cannot tell which group differs. In order to tell which group is different, one has to conduct planned or post-hoc comparisons. As with all parametric tests, there are certain conditions that need to be met in order for the test results to be considered reliable.

The reason why it's called a one-way, or one factor, ANOVA even though there are 3 or more groups being tested is that those groups fall under one categorical variable, such as race or education level; the name refers to the number of variables in the analysis, not the number of groups. If two variables are being compared, it would technically be called a two-way, or two factor, ANOVA if both variables are categorical, or it could be called an ANCOVA if the 2nd variable is continuous. The "C" doesn't stand for continuous, it stands for covariate.

When working from the ANOVA framework, independent variables are sometimes referred to as factors and the number of groups within each variable are called levels, i.e. one variable with 3 categories could be referred to as a factor with 3 levels.

Parametric test assumptions

  • Population distributions are normal
  • Samples have equal variances
  • Independence

Hypothesis

  1. $H_0: \mu_1 = \mu_2 = \mu_3 = \; ... \; = \mu_k$
  2. $H_A: \text{At least one of the group means differs}$

The test statistic is the F-statistic, which compares the mean square between samples ($MS_B$) to the mean square within samples ($MS_W$). The F-statistic can be calculated using the following formula:

  • $F = \frac{MS_B}{MS_W}$

  • Where,
    • $MS_B= \frac{\text{Sum of squares between samples } (SS_B)}{(k - 1)}$

    • $MS_W= \frac{\text{Sum of squares within samples } (SS_W)}{(n_T - k)}$

    • $k$ is the number of groups
    • $n_T$ is the total number of observations

  • and where,
    • Sum of squares between samples ($SS_B$) = $\sum_k n_k(\bar{x}_k - \bar{x})^2$
    • Sum of squares within samples ($SS_W$) = $\sum_{i, k}(x_{i, k} - \bar{x}_k)^2$, which can also be calculated as $\sum_k (n_k - 1)s_k^2$

One rejects the null hypothesis, $H_0$, if the computed F-statistic is greater than the critical F-statistic. The critical F-statistic is determined by the degrees of freedom and the alpha, $\alpha$, value.

  1. Reject $H_0$ if $\text{calculated F-statistic} > \text{critical F-statistic}$

Before the decision is made to reject or fail to reject the null hypothesis, the assumptions need to be checked. See this page on how to check the parametric assumptions in detail - how to check the assumptions for this example will be demonstrated near the end.

Let's make sense of all these mathematical terms. In order to do that, let's start with a generic ANOVA table filled in with symbols, alongside the data set used in this example.

ANOVA Table

| Source | Sum of Squares | Degrees of Freedom | Mean Square | F-statistic |
| --- | --- | --- | --- | --- |
| Between samples | $SS_B$ | $k - 1$ | $MS_B = \frac{SS_B}{(k - 1)}$ | $\frac{MS_B}{MS_W}$ |
| Within samples | $SS_W$ | $n_T - k$ | $MS_W = \frac{SS_W}{(n_T - k)}$ | |
| Total | $TSS = SS_B + SS_W$ | $n_T - 1$ | | |

Note: TSS means total sum of squares
Data Table

| Drug Dose | Libido Scores | Sample Size | Sample Mean | Sample Variance |
| --- | --- | --- | --- | --- |
| Placebo ($k_1$) | 3, 2, 1, 1, 4 | 5 ($n_1$) | 2.2 ($\bar{x}_1$) | 1.7 ($s^2_1$) |
| Low ($k_2$) | 5, 2, 4, 2, 3 | 5 ($n_2$) | 3.2 ($\bar{x}_2$) | 1.7 ($s^2_2$) |
| High ($k_3$) | 7, 4, 5, 3, 6 | 5 ($n_3$) | 5.0 ($\bar{x}_3$) | 2.5 ($s^2_3$) |
| Total ($k = 3$) | | 15 ($n_T$) | 3.5 ($\bar{x}$) | 3.1 ($s^2$) |

Now using the formulas from above, the ANOVA table can be filled in.

$$
\begin{aligned}
&\text{Between samples row} \\
& \quad SS_B = \sum_k n_k(\bar{x}_k - \bar{x})^2 = 5(2.2 - 3.5)^2 + 5(3.2 - 3.5)^2 + 5(5.0 - 3.5)^2 = 20.15 \\
& \quad \text{Degrees of freedom} = k - 1 = 3 - 1 = 2 \\
& \quad \text{Mean square } (MS_B) = \frac{SS_B}{k - 1} = \frac{20.15}{2} = 10.07 \\
&\text{Within samples row} \\
& \quad SS_W = \sum_{i, k}(x_{i, k} - \bar{x}_k)^2 \text{ or } \sum_k (n_k - 1)s_k^2 = (5 - 1)1.7 + (5 - 1)1.7 + (5 - 1)2.5 = 23.6 \\
& \quad \text{Degrees of freedom} = n_T - k = 15 - 3 = 12 \\
& \quad \text{Mean square } (MS_W) = \frac{SS_W}{n_T - k} = \frac{23.6}{12} = 1.97 \\
&\text{Total row} \\
& \quad TSS = SS_B + SS_W = 20.15 + 23.6 = 43.75 \\
& \quad \text{Degrees of freedom} = n_T - 1 = 15 - 1 = 14 \\
&\text{F-statistic} \\
& \quad F = \frac{MS_B}{MS_W} = \frac{10.07}{1.97} = 5.11
\end{aligned}
$$
ANOVA Table

| Source | Sum of Squares | Degrees of Freedom | Mean Square | F-statistic |
| --- | --- | --- | --- | --- |
| Between samples | 20.15 | 2 | 10.07 | 5.11 |
| Within samples | 23.6 | 12 | 1.97 | |
| Total | 43.75 | 14 | | |

In order to tell if the calculated F-statistic is statistically significant, one would look up the critical F-statistic based on the degrees of freedom and alpha level - using statistical software this doesn't need to be done since the p-value will be provided.

Fear not if math is not your strong suit. All of this is calculated when using the methods of a statistical software package or programming language, but it's good to know what is going on behind the scenes - the short sketch below works through the same calculations in plain Python. References for this section are provided at the end of the page.
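
As a sanity check on the table above, here is a minimal sketch (not part of the original example) that reproduces the calculations from the raw libido scores and looks up the critical F-statistic with scipy.stats' f.ppf method:

import scipy.stats as stats

# Libido scores from the data table above
groups = {"placebo": [3, 2, 1, 1, 4],
          "low":     [5, 2, 4, 2, 3],
          "high":    [7, 4, 5, 3, 6]}

k = len(groups)                             # number of groups
n_T = sum(len(g) for g in groups.values())  # total number of observations
grand_mean = sum(sum(g) for g in groups.values()) / n_T

# Sum of squares between samples: sum of n_k * (group mean - grand mean)^2
ss_b = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())

# Sum of squares within samples: sum of (observation - group mean)^2
ss_w = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups.values())

ms_b = ss_b / (k - 1)    # mean square between samples
ms_w = ss_w / (n_T - k)  # mean square within samples
f_stat = ms_b / ms_w

# Critical F-statistic for alpha = 0.05 with (k - 1, n_T - k) degrees of freedom
f_crit = stats.f.ppf(1 - 0.05, dfn= k - 1, dfd= n_T - k)

print(f"SS_B= {ss_b:.2f}, SS_W= {ss_w:.2f}, F= {f_stat:.2f}, critical F= {f_crit:.2f}")

This prints SS_B= 20.13, SS_W= 23.60, F= 5.12, critical F= 3.89; since the calculated F-statistic is greater than the critical F-statistic, $H_0$ is rejected. The small differences from the hand calculations come from rounding the group means above.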

One-way ANOVA with Python

Don't forget to check the assumptions before interpreting the results! First, the needed libraries and data will be loaded. Below, Pandas, Researchpy, and the data set will be loaded. Any additional libraries required by a specific method will be imported in that method's demonstration.

import pandas as pd
import researchpy as rp

Now to load the data set and take a high level look at the variables.

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/difficile.csv")
df.drop('person', axis= 1, inplace= True)

# Recoding value from numeric to string
df['dose'].replace({1: 'placebo', 2: 'low', 3: 'high'}, inplace= True)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 2 columns):
dose      15 non-null object
libido    15 non-null int64
dtypes: int64(1), object(1)
memory usage: 320.0+ bytes
rp.summary_cont(df['libido'])
Variable N Mean SD SE 95% Conf. Interval
0 libido 15.0 3.466667 1.76743 0.456349 2.487896 4.445437
rp.summary_cont(df['libido'].groupby(df['dose']))
N Mean SD SE 95% Conf. Interval
dose
high 5 5.0 1.581139 0.707107 3.450484 6.549516
low 5 3.2 1.303840 0.583095 1.922236 4.477764
placebo 5 2.2 1.303840 0.583095 0.922236 3.477764

One-way ANOVA using scipy.stats

Conducting a one-way ANOVA using scipy.stats is quick, but it only returns the resulting F-statistic and p-value of the test.

import scipy.stats as stats

stats.f_oneway(df['libido'][df['dose'] == 'high'],
               df['libido'][df['dose'] == 'low'],
               df['libido'][df['dose'] == 'placebo'])
F_onewayResult(statistic=5.11864406779661, pvalue=0.024694289538222603)

Before the results should be interpreted, the assumptions of the test should be checked. For example purposes, the results will be interpreted before checking the assumptions.

Interpretation

A new medication was developed to increase the libido of those who take the medication. The purpose of this study was to test for a difference between the dosage levels. The overall average libido was 3.5, 95% CI (2.5, 4.4), with group averages of 2.2, 95% CI (0.9, 3.5), for the placebo group; 3.2, 95% CI (1.9, 4.5), for the low dose group; and 5.0, 95% CI (3.5, 6.5), for the high dose group. There is a statistically significant difference between the groups and their effect on libido, F= 5.12, p-value= 0.0247.

One-way ANOVA using StatsModels

This method conducts a one-way ANOVA in two steps:

  1. Fit the model using an estimation method,
    • The default estimation method in most statistical software packages is ordinary least squares
      • Not going to dive into estimation methods as it's out of scope of this section's topic
      • If you are not familiar with it and don't care to really dive into it, then just know it's one of many estimation methods that aim to provide estimates of the parameter (mean, proportion, etc.) being tested
  2. Pass fitted model into ANOVA method to produce ANOVA table

Here is the official StatsModels documentation on an ANOVA. The general structure for entering the equation is:

ols("outcome_variable ~ independent_variable", data= data_frame).fit()

In the case of an ANOVA, the independent variable will be categorical. The pseudo code above would work if you were conducting a simple linear regression, but that's not what we are here for! The pseudo code has to be modified so the independent variable is treated as categorical, which makes it look like:

ols("outcome_variable ~ C(independent_variable)", data= data_frame).fit()

Now to use real code. In the code below there is an argument "typ" in the anova_lm method; this determines how the sum of squares is calculated. The differences between the calculation types are a bit out of scope, but it's encouraged to learn more about them. An easy-to-read primer can be found here. Additionally, to see how to conduct an ANOVA with type 3 sum of squares see this page - it requires one additional step.

import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('libido ~ C(dose)', data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
aov_table
sum_sq df F PR(>F)
C(dose) 20.133333 2.0 5.118644 0.024694
Residual 23.600000 12.0 NaN NaN
Note: C(dose)= between samples and Residual= within samples

This table provides all the information one needs to interpret whether the results are significant; however, it does not provide any effect size measures to tell if the statistical significance is meaningful. The function below calculates eta-squared ($\eta^2$) and omega-squared ($\omega^2$). A quick note: $\eta^2$ is the exact same thing as $R^2$, except when coming from the ANOVA framework people call it $\eta^2$; $\omega^2$ is considered a better measure of effect size since it is unbiased in its calculation by accounting for the degrees of freedom in the model.
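
For reference, using the ANOVA table notation from earlier (with $MS_W$ being the residual mean square), the formulas the function implements are:

$$ \eta^2 = \frac{SS_B}{SS_B + SS_W} \qquad \omega^2 = \frac{SS_B - (k - 1)MS_W}{SS_B + SS_W + MS_W} $$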

""" The function below was created specifically for the one-way ANOVA table results returned for Type II sum of squares """ def anova_table(aov): aov['mean_sq'] = aov[:]['sum_sq']/aov[:]['df'] aov['eta_sq'] = aov[:-1]['sum_sq']/sum(aov['sum_sq']) aov['omega_sq'] = (aov[:-1]['sum_sq']-(aov[:-1]['df']*aov['mean_sq'][-1]))/(sum(aov['sum_sq'])+aov['mean_sq'][-1]) cols = ['sum_sq', 'df', 'mean_sq', 'F', 'PR(>F)', 'eta_sq', 'omega_sq'] aov = aov[cols] return aov anova_table(aov_table)
sum_sq df mean_sq F PR(>F) eta_sq omega_sq
C(dose) 20.133333 2.0 10.066667 5.118644 0.024694 0.460366 0.354486
Residual 23.600000 12.0 1.966667 NaN NaN NaN NaN

Interpretation

A new medication was developed to increase libido. The purpose of this study was to test for a difference between the dosage levels. The overall average libido was 3.5, 95% CI (2.5, 4.4), with group averages of 2.2, 95% CI (0.9, 3.5), for the placebo group; 3.2, 95% CI (1.9, 4.5), for the low dose group; and 5.0, 95% CI (3.5, 6.5), for the high dose group. There is a statistically significant difference between the groups and their effect on libido, F= 5.12, p-value= 0.0247, with an overall large effect, $\omega^2$= 0.35.

In order to tell which groups differed significantly, post-hoc tests need to be conducted. Before one goes through that work, the assumptions should be checked first in case any modifications need to be made to the model.

Assumption check

The assumptions in this section need to be met in order for the test results to be considered valid. A more in-depth look at parametric assumptions is provided here, which includes some potential remedies.

Independence

This assumption is tested when the study is designed. What this means is that all groups are mutually exclusive, i.e. an individual can only belong in one group. Also, this means that the data is not repeated measures (not collected through time). In this example, this condition is met.

Normality

The assumption of normality is tested on the residuals of the model when coming from an ANOVA or regression framework. One method for testing the assumption of normality is the Shapiro-Wilk test. This can be completed using the shapiro() method from scipy.stats. Ensure that scipy.stats is imported for the following method to work. Unfortunately the output is not labelled, but it's (W-test statistic, p-value).

import scipy.stats as stats

stats.shapiro(model.resid)
(0.9166916012763977, 0.17146942019462585)

The test is non-significant, W= 0.9167, p= 0.1715, which indicates that the residuals are normally distributed.

Another way to test the assumption is through a visual check - this is helpful when the sample is large. The reason is that as the sample size increases, a statistical test's ability to reject the null hypothesis increases, i.e. it gains power to detect ever smaller differences, so with a large n the test may flag even trivial departures from normality.
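
To illustrate this point (this simulation is an aside and not part of the original example), the sketch below draws mildly skewed gamma data and will typically show the Shapiro-Wilk p-value shrinking as the sample size grows, even though the departure from normality stays trivial:

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# gamma(shape=50) is only mildly skewed, yet with enough observations
# the test tends to reject normality anyway
for n in (15, 150, 1500):
    sample = rng.gamma(shape= 50, scale= 1, size= n)
    w, p = stats.shapiro(sample)
    print(f"n= {n:5d}  W= {w:.4f}  p= {p:.4g}")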

One method of visually checking the distribution is to use a probability plot, with or without the correlation value, $R^2$, to assess the observed values' correlation with the theoretical distribution in question - in the current case, the Gaussian (a.k.a. the normal) distribution. This can be completed by using the probplot() method from scipy.stats. If using the $R^2$ measure, one can refer to the NIST/SEMATECH e-handbook of statistical methods to see if the value is significant.

import matplotlib.pyplot as plt

fig = plt.figure(figsize= (10, 10))
ax = fig.add_subplot(111)

normality_plot, stat = stats.probplot(model.resid, plot= plt, rvalue= True)
ax.set_title("Probability plot of model residuals", fontsize= 20)

plt.show()
[Figure: probability plot of the model residuals]

This is a case where the statistical testing method indicated the residuals were normally distributed, but the probability plot correlation coefficient (PPCC) indicated non-normality. Given that the current example's sample size is small, N= 15, that the Shapiro-Wilk test indicated normality, and that the calculated PPCC, $R^2$= 0.9349, is only slightly smaller than the critical table PPCC, $R^2$= 0.9376, it is reasonable to state this assumption is met. However, looking at the plotted probability plot and the residual structure, it would also be reasonable to transform the data for the analysis, or to use an alternative test such as Welch's ANOVA or the non-parametric Kruskal-Wallis ANOVA.

Homogeneity of variance

The final assumption is that all groups have equal variances. One method for testing this assumption is Levene's test of homogeneity of variances. This can be completed using the levene() method from scipy.stats.

stats.levene(df['libido'][df['dose'] == 'high'],
             df['libido'][df['dose'] == 'low'],
             df['libido'][df['dose'] == 'placebo'])
LeveneResult(statistic=0.11764705882352934, pvalue=0.8900225182757423)

Levene's test of homogeneity of variances is not significant, which indicates that the groups do not have a statistically significant difference in their variability. Again, it may be worthwhile to check this assumption visually as well.

fig = plt.figure(figsize= (10, 10))
ax = fig.add_subplot(111)
ax.set_title("Box Plot of Libido by Dosage", fontsize= 20)

data = [df['libido'][df['dose'] == 'placebo'],
        df['libido'][df['dose'] == 'low'],
        df['libido'][df['dose'] == 'high']]

ax.boxplot(data, labels= ['Placebo', 'Low', 'High'], showmeans= True)

plt.xlabel("Drug Dosage")
plt.ylabel("Libido Score")

plt.show()
[Figure: box plot of libido by dosage]

The graphical check of homogeneity of variances supports the statistical testing findings, namely that the groups have equal variances.

By default box plots show the median (orange line in graph above). The green triangle is the mean for each group which was an additional argument that was passed into the method.

There are different ways to handle heteroskedasticity (unequal variance), and a decision needs to be made. Some options include, but are not limited to: transforming the dependent variable (outcome), using trimmed means, using robust standard errors, or using a test such as Welch's ANOVA. For a more in-depth look at the assumptions and some potential remedies, please check out this page; a minimal sketch of one remedy is shown below.
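
As a minimal illustration of the transformation option (not needed for this data, since the variances were homogeneous), one could refit the model on a log-transformed outcome directly in the formula:

import numpy as np

# Hypothetical remedy: refit the one-way ANOVA on log(libido);
# np.log is evaluated inside the StatsModels formula
model_log = ols('np.log(libido) ~ C(dose)', data= df).fit()
sm.stats.anova_lm(model_log, typ= 2)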

Post-hoc Testing

Conducting post-hoc tests or planned comparisons allows one to see which group(s) significantly differ from each other; remember that the ANOVA is an omnibus test! There are a few different approaches that can be taken while conducting these tests; the ones currently implemented in StatsModels are:

  • Tukey Honestly Significant Difference (HSD)
    • Tests all pairwise group comparisons while controlling for multiple comparisons, which protects the familywise error rate and guards against making a Type I error
    • Not technically a "post-hoc" test, since it can be used independently of the ANOVA and can be planned beforehand
    • More in-depth information about this statistical method can be found here

  • Bonferroni
    • Tests groups for a difference while controlling for multiple comparisons, which protects the familywise error rate and guards against making a Type I error. It should be noted that some statistical software reports the Bonferroni-adjusted confidence interval; however, this is not the case in Python at this time (unless one were to program a function to do so)
    • This method is common because it is fast to calculate - divide the initial alpha value by the number of comparisons, $\alpha_{Bonf}= \frac{\alpha}{\text{Number of comparisons}}$. In the current example there are 3 comparisons being made (placebo vs. low, placebo vs. high, and low vs. high) with $\alpha$= 0.05, making the equation $\alpha_{Bonf}= \frac{0.05}{3}= 0.0167$. Thus, a comparison's p-value would need to be < 0.0167 to be considered statistically significant; a short sketch computing the adjusted alphas appears after this list
    • More in-depth information about this statistical method can be found here

  • Šidák (a.k.a. Dunn-Šidák)
    • Tests groups for a difference while controlling for multiple comparisons, which protects the familywise error rate and guards against making a Type I error. It should be noted that some statistical software reports the Šidák-adjusted confidence interval; however, this is not the case in StatsModels at this time (unless one were to program a function to do so)
    • This method is also fairly fast to calculate; the formula is $\alpha_{Sid}= 1 - (1 - \alpha)^{\frac{1}{\text{Number of comparisons}}}$. In the current example there are 3 comparisons being made (placebo vs. low, placebo vs. high, and low vs. high) with $\alpha$= 0.05, making the equation $\alpha_{Sid}= 1 - (1 - 0.05)^{\frac{1}{3}}= 0.0170$. Thus, a comparison's p-value would need to be < 0.0170 to be considered statistically significant
    • More in-depth information about this statistical method can be found here.
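
As a quick check of the two correction formulas above, here is a short sketch (not part of the original page) computing both adjusted alpha levels:

alpha = 0.05
m = 3  # number of pairwise comparisons

alpha_bonf = alpha / m                    # Bonferroni correction
alpha_sidak = 1 - (1 - alpha) ** (1 / m)  # Sidak correction

print(round(alpha_bonf, 4), round(alpha_sidak, 4))
# 0.0167 0.017

These correspond to the alphacBonf and alphacSidak values that StatsModels prints in the comparison tables below (shown there with different rounding).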

Tukey Honestly Significant Difference (HSD)

Have to use a library that has not been imported yet; please see the official documentation about this method for more information if interested.

import statsmodels.stats.multicomp as mc

comp = mc.MultiComparison(df['libido'], df['dose'])
post_hoc_res = comp.tukeyhsd()
post_hoc_res.summary()
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
high low -1.8 0.1472 -4.1651 0.5651 False
high placebo -2.8 0.0209 -5.1651 -0.4349 True
low placebo -1.0 0.5171 -3.3651 1.3651 False

Now to make sense of the table.

  • At the top the table testing information is provided
    • FWER is the familywise error rate, i.e. what $\alpha$ is being set to and controlled at
  • group1 and group2 columns are the groups being compared
  • meandiff is the difference between the group means
  • p-adj is the corrected p-value which takes into account the multiple comparisons being conducted
  • lower is the lower bound of the confidence interval. In the current example the confidence interval is at the 95% level since $\alpha$= 0.05.
  • upper is the upper bound of the confidence interval. In the current example the confidence interval is at the 95% level since $\alpha$= 0.05.
  • reject is the decision rule based on the corrected p-value

It is possible to plot the difference using this method as well!

post_hoc_res.plot_simultaneous(ylabel= "Drug Dose", xlabel= "Score Difference")
[Figure: plot of simultaneous group mean comparisons]

Using Tukey HSD to test for differences between groups indicates that there is a statistically significant difference in libido score between those who took the placebo and those who took the high dosage of the medication; no other groups differed significantly. This indicates that the high dosage of the medication is effective at increasing libido, but the low dosage is not.

Bonferroni Correction

Have to use a library that has not been imported yet (if you didn't do the Tukey HSD example above); please see the official documentation about this method for more information if interested.

The documentation for allpairtest is not in the best shape at the time of writing this. The method returns 3 objects, one is a completed table object, the second is the data of the table, and the third is the data of the table with the table headings - it is not understood why the developers of StatsModels did this. All that is needed is the first object.

Before jumping into the code, let's take a look at pseudo code to make sense of this method.

allpairtest(statistical_test_method, method= "correction_method")

The documentation shows one needs to supply this method with a statistical test method, which can either be a user defined function or a function from another Python library - in this case independent sample t-tests will be conducted. One also has to state the correction method to be applied to the p-value to adjust for the multiple comparisons taking place. Now to see the method in action.

import statsmodels.stats.multicomp as mc

comp = mc.MultiComparison(df['libido'], df['dose'])
tbl, a1, a2 = comp.allpairtest(stats.ttest_ind, method= "bonf")
tbl
Test Multiple Comparison ttest_ind FWER=0.05
method=bonf alphacSidak=0.02, alphacBonf=0.017
group1 group2 stat pval pval_corr reject
high low 1.964 0.0851 0.2554 False
high placebo 3.0551 0.0157 0.0471 True
low placebo 1.2127 0.2598 0.7795 False

Now to make sense of the table.

  • At the top the table testing information is provided
    • FWER is the familywise error rate, i.e. what $\alpha$ is being set to and controlled at
    • method is the correction method that is being applied to the p-values
    • Then there is the adjusted p-value (adjusted $\alpha$) for both the Sidak and Bonferroni correction methods
  • group1 and group2 columns are the groups being compared
  • stat is the test statistic value; in this case it would be the t statistic
  • pval is the uncorrected p-value returned from the supplied "statistical_test_method"
  • pval_corr is the corrected p-value which has been corrected using whichever "correction_method" was supplied
  • reject is the decision rule based on the corrected p-value

Conducting comparisons using the Bonferroni correction indicates that the only groups that differed significantly are those who took the high dose and the placebo dose.

Šidák Correction (a.k.a. Dunn-Šidák Correction)

Have to use a library that has not been imported yet (if you didn't do the Tukey HSD or Bonferroni examples above); please see the official documentation about this method for more information if interested.

The documentation for allpairtest is not in the best shape at the time of writing this. The method returns 3 objects, one is a completed table object, the second is the data of the table, and the third is the data of the table with the table headings - it is not understood why the developers of StatsModels did this. All that is needed is the first object.

Before jumping into the code, let's take a look at pseudo code to make sense of this method.

allpairtest(statistical_test_method, method= "correction_method")

The documentation shows one needs to supply this method with a statistical test method, which can either be a user defined function or a function from another Python library - in this case independent sample t-tests will be conducted. One also has to state the correction method to be applied to the p-value to adjust for the multiple comparisons taking place. Now to see the method in action.

import statsmodels.stats.multicomp as mc

comp = mc.MultiComparison(df['libido'], df['dose'])
tbl, a1, a2 = comp.allpairtest(stats.ttest_ind, method= "sidak")
tbl
Test Multiple Comparison ttest_ind FWER=0.05
method=sidak alphacSidak=0.02, alphacBonf=0.017
group1 group2 stat pval pval_corr reject
high low 1.964 0.0851 0.2343 False
high placebo 3.0551 0.0157 0.0464 True
low placebo 1.2127 0.2598 0.5945 False

Now to make sense of the table.

  • At the top the table testing information is provided
    • FWER is the familywise error rate, i.e. what $\alpha$ is being set to and controlled at
    • method is the correction method that is being applied to the p-values
    • Then there is the adjusted p-value (adjusted $\alpha$) for both the Sidak and Bonferroni correction methods
  • group1 and group2 columns are the groups being compared
  • stat is the test statistic value; in this case it would be the t statistic
  • pval is the uncorrected p-value returned from the supplied "statistical_test_method"
  • pval_corr is the corrected p-value which has been corrected using whichever "correction_method" was supplied
  • reject is the decision rule based on the corrected p-value

Conducting comparisons using the Šidák correction indicates that the only groups that differed significantly are those who took the high dose and the placebo dose.

References

Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. (2004). Applied linear statistical models (5th ed.). New York, NY: McGraw-Hill Irwin.
Rosner, B. (2015). Fundamentals of biostatistics (8th ed.). Boston, MA: Cengage Learning.
Ott, R. L., and Longnecker, M. (2010). An introduction to statistical methods and data analysis. Belmont, CA: Brooks/Cole.