Python for Data Science

Parametric assumptions

Parametric tests have the same assumptions, or conditions, that need to be met in order for the analysis to be considered reliable.

Parametric test assumptions

  1. Independence
  2. Population distributions are normal
  3. Samples have equal variances

It is best to check the assumptions in the order above since some equal variance tests are sensitive to the distribution being normal.

Independence

This assumption is checked during the setup of the study. It means that each observation is independent of another; if there are 2 or more groups being compared, then it refers to that fact that groups are mutually exclusive, i.e. each individual belongs to only 1 group; and that the data is not repeated over time.

If data is repeated measures, longitudinal/panel or time series, there are appropriate methods to account for the repeated data points from individuals. This is typically handled in the math of the statistical test that is designed to analyze this type of data.

Normality

The normality assumption is applied differently depending on the statistical method being used. For example, it applies to the shape of the sampling distribution for the dependent variable (outcome variable) if it's a univariate test, to the difference scores if it's mean comparison (ex. independent sample t-test or paired t-test), or to the residuals if it's a regression framework. This assumption can be checked with a formal test or graphically. There are a few formal tests that can be used to check this assumption. Before that gets touched, the Central Limit Theorem needs to be discussed. Ott and Longnecker (2010, p.188) provide this overview about the Central Limit Theorem:

"Central Limit Theorem for $\bar{y}$
Let $\bar{y}$ denote the sample mean computed from a random sample of n measurements from a population having mean $\mu$, and finite standard deviation $\sigma$. Let $\mu_\bar{y}$ and $\sigma_\bar{y}$ denote the mean and standard deviation of the sampling distribution or $\bar{y}$, respectively. Based on repeated samples of size n from the population, we can conclude the following:

  1. $\mu_\bar{y} = \mu$
  2. $\sigma_\bar{y} = \sigma$
  3. When n is large, the sampling distribution of $\bar{y}$ will be approximately normal (with the approximation becoming more precise as n increases)
  4. When the population distribution is normal, the sampling distribution of $\bar{y}$ is exactly normal for any sample size n

No specific shape was required for these measurements for the Central Limit Theorem to be validated."

Great, what does it mean though? It means that if the sample size is large enough then normality may not be a concern even if the test for normality indicates that normality is not present. Simulation studies have been conducted and the general rule of thumb is n ≥ 30. However, if it's clear that there is heavy skewness or kurtosis, one might consider transforming the data and/or using a non-parametric statistical test. Now let's test the normality assumption.

Normality tests

There are many out there to choose from. These include Shapiro-Wilk test, Kolmogorov-Smirnov (K-S) test, Lilliefors corrected K-S test, Anderson-Darling test, D'Agostino skewness test, D'Agostino-Pearson omnibus test, and the Jarque-Bera test. This section will not cover all of these, but will demonstrate common normality tests. Additionally, normality tests should be accompanied by a visual test as well; when the sample size increases all statistical tests increase in power to detect smaller differences.

Yap and Sim (2011) is a nice peer-reviewed article (and it's free) for review about various normality tests.

Shapiro-Wilk (SW) test

The null hypothesis is that the data is normal. The Shapiro-Wilk test is an omnibus test (D'Agostino, 1971). It evaluates normality by comparing the data's distribution (values ordered) to the hypothesized normal distribution (Shapiro & Wilk, 1965). This can be completed in Python by using the shapiro() method from Scipy.stats.

# Coming from independent t-test framework

import pandas as pd
import scipy.stats as stats

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

sampling_difference = df['bp_after'][df['sex'] == 'Male'].values - \
                      df['bp_after'][df['sex'] == 'Female'].values

stats.shapiro(sampling_difference)
(0.98586106300354, 0.7147841453552246)
# From regression or ANOVA framework

import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

model = smf.ols("bp_after ~ C(sex) + C(agegrp)", data= df).fit()

stats.shapiro(model.resid)
(0.9816310405731201, 0.10094476491212845)

Unfortunately it's not labelled, but the output is (W test statistic, p-value).

D'Agostino-Pearson K2 Test

The null hypothesis is that the data is normal. The D'Agostino-Pearson test is an omnibus test and evaluates normality by assessing the skewness and kurtosis (D'Agostino, 1971).

# Coming from independent t-test framework

import pandas as pd
import scipy.stats as stats

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

sampling_difference = df['bp_after'][df['sex'] == 'Male'].values - \
                    df['bp_after'][df['sex'] == 'Female'].values

stats.normaltest(sampling_difference)
NormaltestResult(statistic=0.003175119018886905, pvalue=0.9984136999965528)
# From regression or ANOVA framework

import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

model = smf.ols("bp_after ~ C(sex) + C(agegrp)", data= df).fit()

stats.normaltest(model.resid)
NormaltestResult(statistic=2.9626037518315704, pvalue=0.2273415251854784)

Kolmogorov-Smirnov (KS) test

The null hypothesis is that the data is normal (matches compared distribution). The Kolmogorov-Smirnov test is a distance test (D'Agostino, 1971). It evaluates normality by comparing the data's empirical distribution function to the expected cumulative distribution function of the comparison distribution (Öztuna D., Elhan A., & Tüccar, 2006).

If testing the difference between two groups, then this can be completed using the ks_2samp() method from Scipy.stats.

# Coming from independent t-test framework

import pandas as pd
import scipy.stats as stats

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

stats.ks_2samp(df['bp_after'][df['sex'] == 'Male'],
               df['bp_after'][df['sex'] == 'Female'])
Ks_2sampResult(statistic=0.3666666666666667, pvalue=0.00041326763403322234)

If coming from a regression or ANOVA approach then the 1 sample version of the test can be ran on the residuals using the kstest() method from Scipy.stats.

# From regression or ANOVA framework

import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

model = smf.ols("bp_after ~ C(sex) + C(agegrp)", data= df).fit()

stats.kstest(model.resid, 'norm')
KstestResult(statistic=0.4502179708439199, pvalue=1.0174625716715393e-22)

Anderson-Darling

The null hypothesis is that the data is normal (matches compared distribution). The Anderson-Darling test assess normality by comparing the data's empirical distribution function to a specified distribution's cumulative distribution function (Stephen, 1974). This can be completed using the anderson() method from Scipy.stats.

A note about this method, it returns 3 parts: the test statistic, an array of significance levels, and an array of corresponding critical values. The test statistic needs to be compared to the critical value for the significance level of interest- this is most commonly set to 0.05.

# Coming from mean comparison framework (independent sample t-test)

import pandas as pd
import scipy.stats as stats

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

sampling_difference = df['bp_after'][df['sex'] == 'Male'].values - \
                      df['bp_after'][df['sex'] == 'Female'].values

result = stats.anderson(sampling_difference)

print(f'Test statistic: {result.statistic: .4f}')

for i in range(len(result.critical_values)):
    sig, crit = result.significance_level[i], result.critical_values[i]

    if result.statistic < result.critical_values[i]:
        print(f"At {sig}% significance,{result.statistic: .4f} <{result.critical_values[i]: .4f} data looks normal (fail to reject H0)")
    else:
        print(f"At {sig}% significance,{result.statistic: .4f} >{result.critical_values[i]: .4f} data does not look normal (reject H0)")
Test statistic: 0.2987 At 15.0% significance, 0.2987 < 0.5440 data looks normal (fail to reject H0) At 10.0% significance, 0.2987 < 0.6190 data looks normal (fail to reject H0) At 5.0% significance, 0.2987 < 0.7430 data looks normal (fail to reject H0) At 2.5% significance, 0.2987 < 0.8660 data looks normal (fail to reject H0) At 1.0% significance, 0.2987 < 1.0300 data looks normal (fail to reject H0
# From regression or ANOVA framework

import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

model = smf.ols("bp_after ~ C(sex) + C(agegrp)", data= df).fit()

result = stats.anderson(model.resid)

print(f'Test statistic: {result.statistic: .4f}')
for i in range(len(result.critical_values)):
    sig, crit = result.significance_level[i], result.critical_values[i]

    if result.statistic < result.critical_values[i]:
        print(f"At {sig}% significance,{result.statistic: .4f} <{result.critical_values[i]: .4f} data looks normal (fail to reject H0)")
    else:
        print(f"At {sig}% significance,{result.statistic: .4f} >{result.critical_values[i]: .4f} data does not look normal (reject H0)")
Test statistic: 0.4467 At 15.0% significance, 0.4467 < 0.5580 data looks normal (fail to reject H0) At 10.0% significance, 0.4467 < 0.6360 data looks normal (fail to reject H0) At 5.0% significance, 0.4467 < 0.7630 data looks normal (fail to reject H0) At 2.5% significance, 0.4467 < 0.8900 data looks normal (fail to reject H0) At 1.0% significance, 0.4467 < 1.0590 data looks normal (fail to reject H0)

Graphical Methods

There are a few different plots that can be used to check the assumption of normality. There are probability plot, the p-p plot, q-q plot, and histogram.

Probability plot

The probability plot plots the ordered data against the theoretical distribution. The better the ordered data fits the theoretical distribution, the less deviation there will be from the fit line that is in the middle of the graph.

The theoretical distribution can be any of them, but there is a special case if the theoretical distribution in question is the Gaussian (normal) distribution. It's special in the sense that it's been studied so much that their is a correlation coefficient that can be used to test normality along side the graph. Filliben (1975) conducted this initial test and provided critical values based on the sample size and $\alpha$ level. The table can be found online within the NIST/SEMATECH e-handbook of statistical methods.

This can be conducted using probplot() from Scipy.stats.

# Coming from mean comparison framework (independent sample t-test)

import pandas as pd
import scipy.stats as stats

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

sampling_difference = df['bp_after'][df['sex'] == 'Male'] - \
                      df['bp_after'][df['sex'] == 'Female']

fig = plt.figure(figsize= (10, 10))
ax = fig.add_subplot(111)

normality_plot, stat = stats.probplot(sampling_difference, plot= plt, rvalue= True)
ax.set_title("Probability plot of sampling difference \n with R value")
ax.set

plt.show()
python homogeneity of variance levene statistics normality prob plot probability researchpy
# From regression or ANOVA framework

import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

model = smf.ols("bp_after ~ C(sex) + C(agegrp)", data= df).fit()

fig = plt.figure(figsize= (10, 10))
ax = fig.add_subplot(111)

normality_plot, stat = stats.probplot(model.resid, plot= plt, rvalue= True)
ax.set_title("Probability plot of regression residuals \n with R value")
ax.set

plt.show()
python homogeneity of variance levene statistics normality prob plot probability regression anova researchpy

The next step would be to reference the table and compare the calculated r-value to the r-value in the table. If the calculated r-value is less than the table r-value you reject the null hypothesis of normality.

What if Normality is Violated?

If the assumption of normality is violated fear not! There are a few routes to consider, these are in no particular order:

  • Rely on the Central Limit Theorem if the sample size is large enough ($\ge 30$)
  • Use a non-parametric statistical test
  • Transform the data
Since transforming the data can also help if the assumption of equal variances is violated, common data transformation methods will be provided at the end of this page.

Homogeneity of Variances

As with testing the assumption of normality, there are a few statistical tests available to test the assumption of equal variances. Some common methods are the Barlett test and Levene's test for equality of variances. Choosing the correct test is also dependent on the assumption of normality. For example, Barlett's test has been found to be sensitive to departures from normality whereas Levene's test is less sensitive to this (Conover, 1981).

It doesn't matter if one is coming from a 2 sample approach or a regression/ANOVA approach, the test steps are the same. One needs to test for equality of variance for each categorical variable on the outcome.

Equality of Variance Tests

Barlett's test

Barlett's test is used to test if the groups, which can be referred to as k, have equal variances. Barlett's test can test for equality between 2 or more groups.

This can be done with the bartlet method from scipy.stats.

import pandas as pd
import scipy.stats as stats

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

stats.bartlett(df['bp_after'][df['sex'] == 'Male'],
               df['bp_after'][df['sex'] == 'Female'])
BartlettResult(statistic=3.9379638422812793, pvalue=0.047207884641474476)

Levene's test

Levene's test tests if the different goups have equal variances (Levene, 1960). Levene's test is less sensitive than Barlett's test to departures from normality and power (Conover, 1981). Levene's test of homogeneity of variances can test for equality between 2 or more groups. The original suggestion was to use the mean, Brown and Forsythe (1974) provided an expansion and testing can also be calculated using the median or the trimmed mean which have been found to be robust under nonnormality.

This can be done with the levene method from scipy.stats.

import pandas as pd
import scipy.stats as stats

df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")

# Using the mean

stats.levene(df['bp_after'][df['sex'] == 'Male'],
             df['bp_after'][df['sex'] == 'Female'])
LeveneResult(statistic=5.0464151793144625, pvalue=0.026537264851214513)
# Using the median

stats.levene(df['bp_after'][df['sex'] == 'Male'],
             df['bp_after'][df['sex'] == 'Female'], center= "median")
LeveneResult(statistic=5.0464151793144625, pvalue=0.026537264851214513)
# Using the trimmed mean (Default is to cut 10% total - 5% from each tail)

stats.levene(df['bp_after'][df['sex'] == 'Male'],
             df['bp_after'][df['sex'] == 'Female'], center= "trimmed")
LeveneResult(statistic=7.769755793226307, pvalue=0.006297605035462623)

What if homogeneity of variances is violated?

If there is not equal variances between groups there are a few routes to consider, these are in no particular order:

  • Use a non-parametric statistical test
  • Transform the data
Common data transformation methods will be provided in the next section.

Data Transformation

If the assumption(s) of normality or homogeneity of variances are not met then a route to consider is transforming the data. If this is the route that is decided on to try and correct not meeting one or both of the assumptions don't forget to back-transform the data for reporting and interpreting. Some data transformations effect normality and variances differently so different transformations should be tested. There are of course, common transformations that are used.
Transformation Effect Back Transformation
Log (numpy.log()) Taking the logarithm ($\log(x)$) of a value effects larger values more than smaller values. This is useful if the distribution has a positive skew because it brings larger values closer to the center. However, one cannot take the log of a negative number or of 0. If the data has negative numbers or 0s then a constant needs to be added to all values - don't forget to subtract the constant when back-transforming. numpy.exp()
Square-root (numpy.sqrt()) Taking the square-root ($\sqrt{x}$) of a value effects larger values more than smaller numbers. This is useful if the distribution has a positive skew because it brings larger values closer to the center. However, one cannot take the square-root of negative numbers. If the data has negative numbers then a constant needs to be added to all values - don't forget to subtract the constant when back-transforming. numpy.square()
Reciprocal (numpy.reciprocal()) Taking the reciprocal ($\frac{1}{x}$) of a value effects larger values more than smaller values and flips the scale in the process by making larger values closer to 0, ex. 1/1= 1 and 1/10= 0.1. This transformation is useful if the distribution has a positive skew. There is not a built-in method to get the inverse of the reciprocal to convert the value back to the original scale - a lambda is a good way to do this if back-transforming a series or array.
variable = [x**-1 for x in variable_reciprocal_transformed]
If back-transforming a single value use
x**-1

References

  1. Brown, M. B., and Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the American Statistical Association, 69(346), 364-367
  2. Conover, W. J., Johnson, M. E., and Johnson, M. M. (1981). A comparative study of tests for homogeneity of variances, with applications to the outer continental shelf bidding data. Technometrics, 23(4), 351-361.
  3. D'Agostino, R. (1971). An omnibus test of normality for moderate and large size samples. Biometrika, 58(2), 341
  4. Filliben, J. J. (1975). The probability plot correlation coefficient test for normality. Technometrics, 17(1), 111-117
  5. Levene, H. (1960). Contributions to probability and statistics: essays in honor of harold hotelling. I. Olkin (Ed.). Stanford, CA: Standford University Press.
  6. Rosner, B. (2015). Fundamentals of Biostatistics (8th). Boston, MA: Cengage Learning.
  7. Ott, R. L., and Longnecker, M. (2010). An introduction to statistical methods and data analysis. Belmon, CA: Brooks/Cole.
  8. Öztuna D., Elhan A., and Tüccar (2006). Investigation of four different normality tests in terms of type 1 error rate and power under different distributions. Turkish Journal of Medical Sciences, 36(3), 171–6.
  9. Shapiro, and Wilk (1965). An analysis of variance test for normality (complete samples). Biometrika, 52, 591-611
  10. Stephens, M. A. (1974). EDF Statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347), 730-737
  11. Yap. B. W., and Sim (2011). Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81(12), 2141-2155 doi:10.1080/00949655.2010.520163