Python for Data Science

Linear Regression

If you are looking for how to run the code, jump to the next section; if you would like some theory or a refresher first, start with this section.

Linear regression is used to test the relationship between one or more independent variables and a continuous dependent variable. The overall regression model needs to be significant before one looks at the individual coefficients themselves. The model's significance is measured by the F-statistic and a corresponding p-value. If the overall F-statistic is not significant, it indicates that the current model is no better at predicting the outcome than simply using the mean value of the dependent variable.

Regression models are useful because they allow one to see which variables are important while taking into account other variables that could also influence the outcome. Furthermore, once a regression model is decided on, a good amount of additional post-estimation work can be done to further explore the relationships that may be present.

Since linear regression is a parametric test, it has the typical parametric testing assumptions. In addition to these, there is the concern of multicollinearity. While multicollinearity is not an assumption of the regression model, it is an aspect that needs to be checked. Multicollinearity occurs when an independent variable can be predicted, with good accuracy, from another independent variable in the same model. It is a concern because it weakens the significance of the independent variables. How to test for this will be demonstrated later on.

For this demonstration, the conventional p-value of 0.05 will be used.

Parametric test assumptions

  • Population distributions are normal
  • Samples have equal variances
  • Independence

Hypothesis

  1. $H_0: \beta_1 = \beta_2 = \; ... \; = \beta_k = 0$
  2. $H_A: \text{At least one } \beta_k \neq 0$

The test statistic is the F-statistic and compares the regression mean square ($MS_R$) to the error mean square ($MS_E$). $MS_R$ is also known as the model's mean square. This F-statistic can be calculated using the following formula:

  • $F = \frac{MS_R}{MS_E}$

  • Where,
    • $MS_R= \frac{SS_R}{(k - 1)}$

    • $MS_E= \frac{SS_E}{(n_T - k)}$

    • $k$ is the number of estimated parameters (the independent variables plus the intercept)
    • $n_T$ is the total number of observations

  • and where,
    • Regression model sum of square ($SS_R$) = $\sum (\hat{y}_i - \bar{y})^2$
      • which is the sum of the squared differences between the predicted value of y ($\hat{y}$) and the mean of y ($\bar{y}$).

    • Error sum of square ($SS_E$) = $\sum(y_{i} - \hat{y}_i)^2$
      • which is the sum of the squared differences between the predicted value of y ($\hat{y}$) and the actual value of y.
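
As a minimal sketch of how these pieces fit together, the F-statistic can be computed directly from NumPy arrays of the observed values (y) and the model's predicted values (y_hat):

import numpy as np

def f_statistic(y, y_hat, k):
    """Overall F-statistic given observed y, predicted y_hat, and
    k estimated parameters (the independent variables plus the intercept)."""
    n_t = len(y)                              # total number of observations
    ss_r = np.sum((y_hat - np.mean(y)) ** 2)  # regression sum of squares
    ss_e = np.sum((y - y_hat) ** 2)           # error sum of squares
    ms_r = ss_r / (k - 1)                     # regression mean square
    ms_e = ss_e / (n_t - k)                   # error mean square
    return ms_r / ms_e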

One rejects the null hypothesis, $H_0$, if the computed F-statistic is greater than the critical F-statistic. The critical F-statistic is determined by the degrees of freedom and the alpha ($\alpha$) value.

  1. Reject $H_0$ if $\text{calculated F-statistic} > \text{critical F-statistic}$
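
For reference, the critical F-statistic can be looked up with SciPy; the sketch below uses an alpha of 0.05 and the degrees of freedom from the model fitted later in this article (2 and 71):

import scipy.stats as stats


# Critical F-statistic at alpha = 0.05 with 2 numerator and 71 denominator degrees of freedom
f_critical = stats.f.ppf(1 - 0.05, dfn = 2, dfd = 71)
f_critical

Since the F-statistic reported by that model (69.75) is far larger than this critical value, the null hypothesis would be rejected, pending the assumption checks.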

Before the decision is made to accept or reject the null hypothesis, the assumptions need to be checked. See this page on how to check the parametric assumptions in detail; how to check the assumptions for this example will be demonstrated near the end.

Linear Regression with Python

Don't forget to check the assumptions before interpreting the results! First, load the libraries and data needed. Below, Pandas, Researchpy, StatsModels, and the data set will be loaded.

import pandas as pd
import researchpy as rp
import statsmodels.api as sm


df = sm.datasets.webuse('auto')
df.info()

Let's look at the variables in the data set.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74 entries, 0 to 73
Data columns (total 12 columns):
make            74 non-null object
price           74 non-null int32
mpg             74 non-null int32
rep78           74 non-null int32
headroom        74 non-null float32
trunk           74 non-null int32
weight          74 non-null int32
length          74 non-null int32
turn            74 non-null int32
displacement    74 non-null int32
gear_ratio      74 non-null float32
foreign         74 non-null int16
dtypes: float32(2), int16(1), int32(8), object(1)
memory usage: 3.7+ KB

For this example, the research question is: do weight and brand nationality (domestic or foreign) significantly affect miles per gallon? Let's get some descriptives on these variables.

rp.summary_cont(df[['weight', 'mpg']])

  Variable     N         Mean          SD         SE    95% Conf.     Interval
0   weight  74.0  3019.459459  777.193567  90.346917  2839.398314  3199.520605
1      mpg  74.0    21.297297    5.785503   0.672551    19.956905    22.637690

rp.summary_cat(df['foreign'])

  Variable Outcome  Count  Percent
0  foreign       0     52    70.27
1                1     22    29.73

1 indicates the observation is from a foreign manufacturer whereas 0 indicates a domestic (non-foreign) manufacturer.

From the descriptive statistics it can be seen that the average weight is 3,019.46, we'll say the unit of measure is pounds (lbs.), and the average miles per gallon (mpg) is 21.30. The sample contains roughly 40 percentage points more cases from domestic makers than from foreign makers (about 70% vs. 30%). Since foreign is a categorical variable, let's look at mpg by manufacturer.

rp.summary_cont(df.groupby('foreign')['mpg'])

          N       Mean        SD        SE  95% Conf.   Interval
foreign
0        52  19.826923  4.743297  0.657777  18.525102  21.128744
1        22  24.772727  6.611187  1.409510  21.945076  27.600379

The amount of variability in mpg between the manufacturer types appears to be fairly equal. There is a way to formally test this, as sketched below, but for now this eyeball test will suffice.
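
A minimal sketch of such a formal check, using Levene's test for equality of variances from SciPy:

import scipy.stats as stats


# Levene's test for equal variances of mpg across manufacturer type
domestic_mpg = df.loc[df["foreign"] == 0, "mpg"]
foreign_mpg = df.loc[df["foreign"] == 1, "mpg"]

stats.levene(domestic_mpg, foreign_mpg)

A non-significant p-value would support the eyeball assessment that the variances are roughly equal. Now for the linear regression model.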

Linear Regression using StatsModels

NOTE

StatsModels' formula API uses Patsy to handle the model formulas. The pseudo code looks like the following:

smf.ols("dependent_variable ~ independent_variable1 + independent_variable2 + independent_variablen",
        data = df).fit()

To tell the model that a variable is categorical, it needs to be wrapped in C(independent_variable). The pseudo code with a categorical independent variable looks like:

smf.ols("dependent_variable ~ independent_variable1 + C(independent_variable2)",
        data = df).fit()

By default, Patsy uses the first level of the categorical variable as the reference category; it's possible to change the reference category if desired. To do this, one specifies the reference category at the same time the variable is marked as categorical. Pseudo code is as follows:

smf.ols("dependent_variable ~ independent_variable1 + C(independent_variable2, Treatment(categorical_group))",
        data = df).fit()

Where categorical_group is the desired reference group.
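
For example, with the data used in this article, making the foreign group (coded 1) the reference category could look like the following sketch (this alternative coding is not used in the rest of the article):

import statsmodels.formula.api as smf

# Treat foreign == 1 as the reference category instead of the default level
model_alt = smf.ols("mpg ~ weight + C(foreign, Treatment(1))", data = df).fit()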


First, one needs to import the package; the official documentation for this method can be found here. The StatsModels package also supports other distributions for regression models besides the Gaussian (a.k.a. normal) distribution.

import statsmodels.formula.api as smf

Now that the package is imported, the model can be fit and the results reviewed.

model = smf.ols("mpg ~ weight + C(foreign)", df).fit()
model.summary()

                            OLS Regression Results
Dep. Variable:            mpg    R-squared:             0.663
Model:                    OLS    Adj. R-squared:        0.653
Method:         Least Squares    F-statistic:           69.75
Date:        Thu, 26 Dec 2019    Prob (F-statistic):    1.76e-17
Time:                13:31:18    Log-Likelihood:        -194.18
No. Observations:          74    AIC:                   394.4
Df Residuals:              71    BIC:                   401.3
Df Model:                   2
Covariance Type:    nonrobust

                    coef   std err         t    P>|t|    [0.025    0.975]
Intercept        41.6797     2.166    19.247    0.000    37.362    45.998
C(foreign)[T.1]  -1.6500     1.076    -1.533    0.130    -3.796     0.495
weight           -0.0066     0.001   -10.340    0.000    -0.008    -0.005

Omnibus:          37.830    Durbin-Watson:          2.421
Prob(Omnibus):     0.000    Jarque-Bera (JB):      92.107
Skew:              1.727    Prob(JB):            9.98e-21
Kurtosis:          7.236    Cond. No.            1.81e+04

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.81e+04. This might indicate that there are strong multicollinearity or other numerical problems.

StatsModels provides some useful model diagnostics with the summary output; let's go over them since they will be used to evaluate the model.

  • Omnibus: Tests for normality of the residuals
    • Prob(Omnibus): P-value associated with the omnibus test
  • Skew: Indicates the asymmetry (slant) of the residual distribution
  • Kurtosis: Indicates the sharpness of the peak (heaviness of the tails) of the residual distribution
  • Durbin-Watson: Tests for autocorrelation in the residuals. A table of critical values is not provided, but tables are readily available online
  • Jarque-Bera (JB): Another test for normality of the residuals
    • Prob(JB): P-value associated with the JB test statistic
  • Cond. No.: An indicator of multicollinearity and estimation instability

Using this information, one can evaluate the regression model. The current overall model is significant, which indicates it's better than using the mean to predict mpg, and no observations were dropped. However, based on the condition number, StatsModels' diagnostic information indicates there may be a concern of multicollinearity or other numerical problems, such as estimation instability.
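
If desired, most of these quantities can also be pulled from the fitted results object directly; below is a minimal sketch, assuming the StatsModels attribute and function names shown:

from statsmodels.stats.stattools import durbin_watson, jarque_bera


# Overall model F-statistic and its p-value
print(model.fvalue, model.f_pvalue)

# Condition number of the design matrix
print(model.condition_number)

# Residual diagnostics: autocorrelation and normality
print(durbin_watson(model.resid))
print(jarque_bera(model.resid))    # JB statistic, p-value, skew, kurtosis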

Assumption Check

For how to check the parametric assumptions, please refer to the parametric assumptions page for a detailed walkthrough. Additionally, please refer to ANOVA post-hoc testing for how to test differences between multiple groups from a regression/ANOVA model. How to test for multicollinearity is demonstrated below.

Multicollinearity

Multicollinearity is a concern because it weakens the significance of the independent variables. There are a few approaches to test for multicollinearity, each with its own weaknesses. Common approaches are to use either

  • Correlation coefficient ($R$), or
  • Variance inflation factor ($VIF$)

The issue with using $R$ or $VIF$ is that neither is able to detect situations where 3 or more variables are collinear as a group. This is where the condition number and the condition index are better suited.

Variance Inflation Factor (VIF)

The variance inflation factor is widely used when diagnosing regression models for concerns of collinearity. Kutner, Nachtsheim, Neter, and Li (2004) suggest using VIF ≥ 10 as an indication of multicollinearity. See O'Brien (2007) for an assessment of the use of VIF and these rules of thumb.

To calculate VIF using StatsModels, one needs to import a function and a package that haven't been imported yet and then re-create the design matrix used in the regression model. Since StatsModels uses Patsy, it's recommended to use Patsy here as well, although this is by no means required; it's simply easier.

from statsmodels.stats.outliers_influence import variance_inflation_factor
import patsy


# This creates the design matrix used by StatsModels
x = patsy.dmatrix("weight + C(foreign)", data = df)


vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(x, i) for i in range(x.shape[1])]
vif["Features"] = x.design_info.term_names

vif

   VIF Factor    Features
0   29.895625   Intercept
1    1.541895  C(foreign)
2    1.541895      weight

Ignore the intercept row since it is not useful when interpreting VIFs.
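
Applying the VIF ≥ 10 rule of thumb programmatically could look like the following sketch, which simply filters the table built above while excluding the intercept row:

# Any non-intercept terms that meet the VIF >= 10 rule of thumb
vif[(vif["Features"] != "Intercept") & (vif["VIF Factor"] >= 10)]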

The VIFs do not indicate that there is a concern of multicollinearity given that all VIFs are less than 10. This could then suggest that the high condition number has something to do with the conditioning of the design matrix.

Condition Number

While there is no official threshold for what counts as a high condition number, Belsley, Kuh, and Welsch (1980) conducted experiments and found condition indexes of 5-10 to be associated with weak dependencies and condition indexes of 30 or more to be associated with moderate to strong dependencies; since there is no official threshold, these suggested values are good starting points to use while one gains experience. For a full treatment, please see chapter 3 of Regression Diagnostics: Identifying Influential Data and Sources of Collinearity by Belsley, Kuh, and Welsch (1980).

In addition to evaluating collinearity using the condition index, the eigenvalues will also be calculated and a collinearity diagnostics table will be produced.

First, the design matrix used in the regression has to be re-created since StatsModels does not provide it. It's recommended to use Patsy to do this because it's easier and it's the package StatsModels uses itself, so the same formula can be entered to get the same design matrix.

import numpy as np
import patsy


# Note that the order of the columns are the same as the regression formula
x = np.asarray(patsy.dmatrix("weight + C(foreign)", data = df))

Next is to get the singular values of the design matrix - don't worry if you're not familiar with these. In this step, the condition index will also be calculated. The condition number for the matrix is the largest value of the condition index.

# Getting the singular values from SVD

_, sing_as, _ = np.linalg.svd(x)


# Calculating the condition index

condition_index = []

for n in sing_as:
    ci = sing_as.max() / n
    condition_index.append(ci)

condition_index
[1.0, 5887.959998687699, 18094.091357242447]

Now to put the pieces together to create the collinearity diagnostics table.

# The singular values squared are the eigenvalues
eigen_vals = (sing_as * sing_as).round(3)

pd.DataFrame(np.c_[eigen_vals, condition_index],
             columns = ["Eigenvalues", "Condition Index"]).round(1)

Collinearity diagnostics table

   Eigenvalues  Condition Index
0  718762273.1              1.0
1         20.7           5888.0
2          2.2          18094.1

Here it can be seen that the largest value of the condition index is indeed the condition number returned by StatsModels. In addition, there are 2 cases where the condition index is large (> 30). Given that the condition number is a method to identify collinearity, by definition of multicollinearity there need to be at least 2 independent variables with some form of linear dependency if it exists, which is the current case.

Now to assess the eigenvalues: small eigenvalues indicate instability in the estimation of the regression coefficients, and multiple small eigenvalues indicate intercorrelation. The largest issue here is that there are currently no agreed-upon criteria for what constitutes a small eigenvalue. Kendall (1957) and Silvey (1969) define small as close to 0, while Chatterjee and Price (1977) define small by comparison to the others (as cited in Belsley, Kuh, & Welsch, 1980, p. 96).

Whichever definition one prefers, it should be clear that there is concern about the stability of the estimates. A common way to help stabilize the estimates is to transform the continuous independent variables that have been flagged, commonly using the z-score transformation, although other transformations work as well. Below is the transformed model using the z-score transformation on weight, followed by the condition index.

df["weightZ"] = (df["weight"] - df["weight"].mean()) / df["weight"].std()

modelz = smf.ols("mpg ~ weightZ + C(foreign)", df).fit()
modelz.summary()

                            OLS Regression Results
Dep. Variable:            mpg    R-squared:             0.663
Model:                    OLS    Adj. R-squared:        0.653
Method:         Least Squares    F-statistic:           69.75
Date:        Tue, 31 Dec 2019    Prob (F-statistic):    1.76e-17
Time:                09:42:36    Log-Likelihood:        -194.18
No. Observations:          74    AIC:                   394.4
Df Residuals:              71    BIC:                   401.3
Df Model:                   2
Covariance Type:    nonrobust

                    coef   std err         t    P>|t|    [0.025    0.975]
Intercept        21.7878     0.509    42.796    0.000    20.773    22.803
C(foreign)[T.1]  -1.6500     1.076    -1.533    0.130    -3.796     0.495
weightZ          -5.1201     0.495   -10.340    0.000    -6.107    -4.133

Omnibus:          37.830    Durbin-Watson:          2.421
Prob(Omnibus):     0.000    Jarque-Bera (JB):      92.107
Skew:              1.727    Prob(JB):            9.98e-21
Kurtosis:          7.236    Cond. No.                3.21

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

# Getting the design matrix
xz = np.asarray(patsy.dmatrix("weightZ + C(foreign)", data = df))


# Getting the singular values from SVD
_, sing_as, _ = np.linalg.svd(xz)


# Calculating the condition index
condition_index = []

for n in sing_as:
    ci = sing_as.max() / n
    condition_index.append(ci)


# The singular values squared are the eigenvalues
eigen_vals = (sing_as * sing_as).round(3)

pd.DataFrame(np.c_[eigen_vals, condition_index],
             columns = ["Eigenvalues", "Condition Index"]).round(1)

   Eigenvalues  Condition Index
0         87.1              1.0
1         73.4              1.1
2          8.5              3.2

In the z-score model, the eigenvalues and their associated condition indexes indicate that the design matrix is well conditioned, i.e. no condition index is ≥ 30, and that the concern of collinearity/estimation stability has been addressed.
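
As a quick sanity check, the z-score model is the same fit on a rescaled predictor: the weightZ coefficient should equal the original weight coefficient multiplied by the standard deviation of weight, and the R-squared is unchanged. A minimal sketch:

# The slope on the standardized predictor equals the original slope
# times the standard deviation of weight (roughly -0.0066 * 777.19 = -5.12)
print(model.params["weight"] * df["weight"].std())
print(modelz.params["weightZ"])

# The overall model fit is unchanged by the rescaling
print(model.rsquared, modelz.rsquared)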

References

Allen, M. P. (1997). Understanding regression analysis. New York, NY: Springer.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. Hoboken, NJ: John Wiley & Sons.
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2004). Applied linear statistical models (5th ed.). New York, NY: McGraw-Hill Irwin.
O'Brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41, 673-690.