# Linear Regression

If you are looking for how to run the code, jump to the
next section; if you would like some theory or a refresher,
start with this section.

Linear regression is used to test the relationship between
independent variable(s) and a continuous dependent variable. The overall
regression model needs to be significant before one looks at the
individual coefficients themselves. The model's significance is measured
by the F-statistic and its corresponding p-value. If the overall F-statistic
is not significant, it indicates that the current model is no better
at predicting the outcome than simply using the mean value of the
dependent variable.

Regression models are useful because they allow one to see which variable(s)
are important while taking into account other variables that could also influence
the outcome. Furthermore, once a regression model is decided on,
there is a good amount of additional post-estimation work that can be
done to further explore the relationship(s) that may be present.

Since linear regression is a parametric test, it has the typical parametric
testing assumptions. In addition, there is the concern of
*multicollinearity*. While multicollinearity is not an assumption of
the regression model, it's an aspect that needs to be checked.
Multicollinearity occurs when an independent variable
can be predicted, with good accuracy, by another independent variable
in the same model. Multicollinearity is a concern because it weakens the
significance of the independent variables.
How to test for this will be demonstrated later on.

For this demonstration, the conventional p-value of 0.05 will be used.

Parametric test assumptions

- Population distributions are normal
- Samples have equal variances
- Independence

- $H_0: \beta_1 = \beta_2 = \; ... \; = \beta_k = 0$
- $H_A: \text{at least one } \beta_i \neq 0$

The test statistic is the F-statistic and compares the regression mean square
($MS_R$) to the error mean square ($MS_E$). $MS_R$ is also known as
the model's mean square.
This *F-statistic* can be calculated using the following formula:

- $F = \frac{MS_R}{MS_E}$
- Where,
- $MS_R = \frac{SS_R}{k - 1}$
- $MS_E = \frac{SS_E}{n_T - k}$
- $k$ is the number of estimated parameters (the intercept plus the independent variables)
- $n_T$ is the total number of observations
- and where,
- Regression sum of squares ($SS_R$) = $\sum (\hat{y}_i - \bar{y})^2$
- which is the sum of the squared differences between the predicted values of y ($\hat{y}_i$) and the mean of y ($\bar{y}$).
- Error sum of squares ($SS_E$) = $\sum(y_{i} - \hat{y}_i)^2$
- which is the sum of the squared differences between the actual values of y ($y_i$) and the predicted values of y ($\hat{y}_i$).

One rejects the null hypothesis, $H_0$, if the computed F-statistic is greater than the critical F-statistic. The critical F-statistic is determined by the degrees of freedom and the alpha ($\alpha$) value.

- Reject $H_0$ if $\text{calculated F-statistic} > \text{critical F-statistic}$
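As a quick numeric sketch of the formulas above (toy numbers, not the auto data used later), the F-statistic and its p-value can be computed by hand; `scipy.stats.f` supplies the F distribution:

```python
import numpy as np
from scipy import stats

# Toy data: observed y and fitted values y_hat from a hypothetical model
# with k = 3 estimated parameters (intercept + 2 predictors) and n = 8 cases
y = np.array([10.0, 12.0, 9.0, 15.0, 14.0, 11.0, 13.0, 16.0])
y_hat = np.array([10.5, 11.5, 9.5, 14.5, 13.5, 11.5, 13.0, 16.0])
n, k = len(y), 3

ss_r = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
ss_e = np.sum((y - y_hat) ** 2)         # error sum of squares

ms_r = ss_r / (k - 1)                   # regression mean square
ms_e = ss_e / (n - k)                   # error mean square

f = ms_r / ms_e
p = stats.f.sf(f, k - 1, n - k)         # survival function = P(F > f)
print(f, p)
```

With these toy numbers the calculated F is far above any conventional critical value, so $H_0$ would be rejected.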

Before the decision is made to reject or fail to reject the null hypothesis, the assumptions need to be checked. See this page for how to check the parametric assumptions in detail; how to check the assumptions for this example will be demonstrated near the end.

## Linear Regression with Python

Don't forget to check the assumptions before interpreting the results!
First, the needed libraries and data have to be loaded. Below, *Pandas*, *Researchpy*,
*StatsModels*, and the data set will be loaded.

```
import pandas as pd
import researchpy as rp
import statsmodels.api as sm

# Load Stata's example auto data set
df = sm.datasets.webuse('auto')
df.info()
```

Let's look at the variables in the data set.

For this example, the research question is: do weight and manufacturer origin (domestic or foreign) significantly affect miles per gallon? Let's get some descriptive statistics on these variables.

`rp.summary_cont(df[['weight', 'mpg']])`

 | Variable | N | Mean | SD | SE | 95% Conf. | Interval
---|---|---|---|---|---|---|---
0 | weight | 74.0 | 3019.459459 | 777.193567 | 90.346917 | 2839.398314 | 3199.520605
1 | mpg | 74.0 | 21.297297 | 5.785503 | 0.672551 | 19.956905 | 22.637690

`rp.summary_cat(df['foreign'])`

 | Variable | Outcome | Count | Percent
---|---|---|---|---
0 | foreign | 0 | 52 | 70.27
1 | | 1 | 22 | 29.73

1 indicates the observation is from a foreign manufacturer whereas 0 indicates a domestic manufacturer.

From the descriptive statistics it can be seen that the average weight is 3,019.46 (we'll say the unit of measure is pounds) and the average miles per gallon (mpg) is 21.30. Roughly 70% of the cases are from domestic makers and 30% from foreign makers. Since foreign is a categorical variable, let's look at mpg by manufacturer origin.

`rp.summary_cont(df.groupby('foreign')['mpg'])`

foreign | N | Mean | SD | SE | 95% Conf. | Interval
---|---|---|---|---|---|---
0 | 52 | 19.826923 | 4.743297 | 0.657777 | 18.525102 | 21.128744
1 | 22 | 24.772727 | 6.611187 | 1.409510 | 21.945076 | 27.600379

The amount of variability in mpg between the manufacturer types appears to be fairly equal. There is a way to formally test this, but right now this eyeball test will suffice. Now for the linear regression model.
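The formal test alluded to above is commonly Levene's test for equality of variances. A minimal sketch with simulated stand-ins for the two mpg groups (with the real data one would pass `df.loc[df['foreign'] == 0, 'mpg']` and `df.loc[df['foreign'] == 1, 'mpg']`):

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for the domestic/foreign mpg samples summarized above
rng = np.random.default_rng(0)
domestic_mpg = rng.normal(loc=19.8, scale=4.7, size=52)
foreign_mpg = rng.normal(loc=24.8, scale=6.6, size=22)

# Levene's test: H0 is that the group variances are equal
stat, p = stats.levene(domestic_mpg, foreign_mpg)
print(stat, p)
```

A p-value above 0.05 here would support the equal-variance assumption.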

### Linear Regression using StatsModels

#### NOTE

StatsModels' formula API uses Patsy to handle the model formulas. The pseudo-code looks like the following:

```
smf.ols("dependent_variable ~ independent_variable_1 + independent_variable_2 + independent_variable_n",
        data = df).fit()
```

To tell the model that a variable is categorical, it needs to be wrapped in `C(independent_variable)`.
The pseudo-code with a categorical independent variable looks like:

```
smf.ols("dependent_variable ~ independent_variable_1 + C(independent_variable_2)",
        data = df).fit()
```

By default, Patsy chooses the first level of a categorical variable as the reference category; it's possible to change the reference category if desired. In order to do this, one specifies the reference category while declaring the variable as categorical. The pseudo-code is as follows:

```
smf.ols("dependent_variable ~ independent_variable_1 + C(independent_variable_2, Treatment(categorical_group))",
        data = df).fit()
```

Where `categorical_group` is the desired reference group.
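As a concrete sketch of the reference-category syntax, using a simulated data frame that mirrors the auto example (with the real data, `Treatment(1)` would make the foreign cars the reference group instead of the domestic ones):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the auto data
rng = np.random.default_rng(1)
sim = pd.DataFrame({
    "mpg": rng.normal(21, 5, 74),
    "weight": rng.normal(3019, 777, 74),
    "foreign": rng.integers(0, 2, 74),
})

# Treatment(1) makes group 1 the reference, so the dummy estimated is for group 0
m_ref = smf.ols("mpg ~ weight + C(foreign, Treatment(1))", data=sim).fit()
print(m_ref.params.index.tolist())
```

The printed parameter names show the dummy labeled relative to the new reference level.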

First, one needs to import the formula API; the official documentation for this method can be found on the StatsModels site. The StatsModels package also supports other distributions for regression models besides the Gaussian (a.k.a. normal) distribution.

`import statsmodels.formula.api as smf`

Now that the package is imported, the model can be fit and the results reviewed.

```
model = smf.ols("mpg ~ weight + C(foreign)", df).fit()
model.summary()
```

Dep. Variable: | mpg | R-squared: | 0.663
---|---|---|---
Model: | OLS | Adj. R-squared: | 0.653
Method: | Least Squares | F-statistic: | 69.75
Date: | Thu, 26 Dec 2019 | Prob (F-statistic): | 1.76e-17
Time: | 13:31:18 | Log-Likelihood: | -194.18
No. Observations: | 74 | AIC: | 394.4
Df Residuals: | 71 | BIC: | 401.3
Df Model: | 2 | |
Covariance Type: | nonrobust | |

 | coef | std err | t | P>\|t\| | [0.025 | 0.975]
---|---|---|---|---|---|---
Intercept | 41.6797 | 2.166 | 19.247 | 0.000 | 37.362 | 45.998
C(foreign)[T.1] | -1.6500 | 1.076 | -1.533 | 0.130 | -3.796 | 0.495
weight | -0.0066 | 0.001 | -10.340 | 0.000 | -0.008 | -0.005

Omnibus: | 37.830 | Durbin-Watson: | 2.421
---|---|---|---
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 92.107
Skew: | 1.727 | Prob(JB): | 9.98e-21
Kurtosis: | 7.236 | Cond. No. | 1.81e+04

StatsModels provides some useful model diagnostics with the summary output; let's go over them since they will be used to evaluate the model.

- Omnibus: tests for normality of the residuals
- Prob(Omnibus): p-value associated with the omnibus test
- Skew: indicates, loosely speaking, the slant (asymmetry) of the distribution
- Kurtosis: indicates the sharpness of the peak of the distribution
- Durbin-Watson: tests for autocorrelation in the residuals. Unfortunately a table of critical values is not provided, but some are available through a Google search
- Jarque-Bera (JB): another test for normality of the residuals
- Prob(JB): p-value associated with the JB test statistic
- Cond. No.: a diagnostic for multicollinearity and estimation instability

Using this information, one can evaluate the regression model. The current overall model is significant, which indicates it's better than using the mean to predict mpg, and no observations were dropped; however, the diagnostic information indicates there may be a concern of multicollinearity, or other numerical problems such as estimation instability, based on the condition number.
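The quantities used in that evaluation don't have to be read off the summary table; the fitted results object exposes them directly as attributes. A minimal sketch with simulated data (the `model` object fit above has the same attributes):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data shaped like the auto example
rng = np.random.default_rng(42)
sim = pd.DataFrame({"weight": rng.normal(3000, 800, 74),
                    "foreign": rng.integers(0, 2, 74)})
sim["mpg"] = 40 - 0.006 * sim["weight"] - 1.5 * sim["foreign"] + rng.normal(0, 3, 74)

m = smf.ols("mpg ~ weight + C(foreign)", data=sim).fit()

# Overall F-test and the condition number flagged in the summary
print(m.fvalue, m.f_pvalue, m.condition_number)
```

Here too the condition number is large, purely because of the scale of the simulated weight column.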

## Assumption Check

For how to check the parametric assumptions, please refer to the parametric assumptions page for a detailed look. Additionally, please refer to the ANOVA post-hoc testing page for how to test differences between multiple groups from a regression/ANOVA model. How to test for multicollinearity is discussed below.

### Multicollinearity

Multicollinearity is a concern because it weakens the significance of the independent variables. There are a few approaches to test for multicollinearity, each with its own weaknesses. Common approaches use either

- Correlation coefficient ($R$), or
- Variance inflation factor ($VIF$)

The issue with using $R$ or $VIF$ is that neither is well suited to detecting cases where 3 or more variables are collinear. This is where the condition number and the condition index are better suited.
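For the correlation-coefficient approach, a quick first check is the pairwise correlation matrix of the predictors. A sketch with hypothetical numbers (with the real data this would be `df[['weight', 'foreign']].corr()`, assuming `foreign` is coded 0/1 as above):

```python
import pandas as pd

# Hypothetical predictor columns: the "foreign" cars deliberately made lighter
demo = pd.DataFrame({"weight": [2100, 2800, 3200, 3500, 4000, 2300],
                     "foreign": [1, 1, 0, 0, 0, 1]})

# Pairwise Pearson correlations; values near +/-1 flag collinear pairs
print(demo.corr())
```

In this toy frame the strong negative correlation between weight and foreign is exactly the kind of pairwise collinearity this check can catch.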

#### Variance Inflation Factor (VIF)

The variance inflation factor is widely used when diagnosing regression models for collinearity concerns. Kutner, Nachtsheim, Neter, and Li (2004) suggest using a VIF ≥ 10 as an indication of multicollinearity. See O'brien (2007) for an assessment of the use of VIF and these rules of thumb.

To calculate VIF using StatsModels, one needs to import a function that hasn't been imported yet and then re-create the design matrix used in the regression model. Since StatsModels uses Patsy, it's recommended to use Patsy here as well, although this is by no means required; it's simply easier.

```
from statsmodels.stats.outliers_influence import variance_inflation_factor
import patsy
# This creates the design matrix used by StatsModels
x = patsy.dmatrix("weight + C(foreign)", data = df)
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(x, i) for i in range(x.shape[1])]
vif["Features"] = x.design_info.term_names
vif
```

 | VIF Factor | Features
---|---|---
0 | 29.895625 | Intercept
1 | 1.541895 | C(foreign)
2 | 1.541895 | weight

Ignore the intercept row since it is not useful when interpreting VIFs.

The VIFs do not indicate a concern of multicollinearity, given that all VIFs are less than 10. This suggests the high condition number has something to do with the conditioning of the design matrix itself.

#### Condition Number

While there is no official threshold for what counts as a high
condition number, Belsley, Kuh, and Welsch (1980) conducted
some experiments and found condition indexes of 5-10 to be associated with
weak dependencies and condition indexes of 30 or more to be associated with
moderate to strong dependencies; they note that since there is no
official condition number threshold, the suggested values are good
starting points to use while one gains experience. For a full overview of
the content, please see chapter 3 in *Regression Diagnostics: Identifying Influential
Data and Sources of Collinearity* by Belsley, Kuh, and Welsch (1980).

In addition to evaluating collinearity using the condition index, the eigenvalues will also be calculated and a collinearity diagnostics table will be produced.

First, the design matrix used in the regression has to be re-created, since StatsModels does not provide it. It's recommended to use the package Patsy to do this because it's easier and it's the package StatsModels itself uses, so the same formula can be entered to get the same design matrix.

```
import numpy as np
import patsy

# Note that the order of the columns is the same as in the regression formula
x = np.asarray(patsy.dmatrix("weight + C(foreign)", data = df))
```

Next is to get the singular values of the design matrix - don't worry if you're not familiar with these. In this step, the condition index will also be calculated. The condition number of the matrix is the largest value of the condition index.

```
# Getting the singular values from SVD
_, sing_as, _ = np.linalg.svd(x)

# Calculating the condition index
condition_index = []

for n in sing_as:
    ci = sing_as.max() / n
    condition_index.append(ci)

condition_index
```

Now to put the pieces together to create the collinearity diagnostics table.

```
# The singular values squared are the eigenvalues
eigen_vals = (sing_as * sing_as).round(3)

pd.DataFrame(np.c_[eigen_vals, condition_index],
             columns = ["Eigenvalues", "Condition Index"]).round(1)
```

 | Eigenvalues | Condition Index
---|---|---
0 | 718762273.1 | 1.0
1 | 20.7 | 5888.0
2 | 2.2 | 18094.1

Here it can be seen that the largest value of the condition index is indeed the condition number returned by StatsModels. In addition, there are 2 cases where the condition index value is large (> 30). Given that the condition number is a method for identifying collinearity, by the definition of multicollinearity there need to be at least 2 independent variables with some form of linear dependency if it exists, which is the current case.

Now to assess the eigenvalues: small eigenvalues indicate instability in the estimation of the regression coefficients, and multiple small eigenvalues indicate intercorrelation. The largest issue here is that, currently, there are no agreed-upon criteria for what constitutes small. Kendall (1957) and Silvey (1969) define small as close to 0, while Chatterjee and Price (1977) define small by comparison to the other eigenvalues (as cited in Belsley, Kuh, & Welsch, 1980, p. 96).

Whichever definition one prefers, it should be clear that there is concern about the stability of the estimates. A common method to help stabilize the estimates is to transform the continuous independent variables that have been flagged, commonly using the z-score transformation, although other transformations work as well. Below is the transformed model using the z-score transformation on weight, along with its condition index.

```
df["weightZ"] = (df["weight"] - df["weight"].mean()) / df["weight"].std()
modelz = smf.ols("mpg ~ weightZ + C(foreign)", df).fit()
modelz.summary()
```

Dep. Variable: | mpg | R-squared: | 0.663
---|---|---|---
Model: | OLS | Adj. R-squared: | 0.653
Method: | Least Squares | F-statistic: | 69.75
Date: | Tue, 31 Dec 2019 | Prob (F-statistic): | 1.76e-17
Time: | 09:42:36 | Log-Likelihood: | -194.18
No. Observations: | 74 | AIC: | 394.4
Df Residuals: | 71 | BIC: | 401.3
Df Model: | 2 | |
Covariance Type: | nonrobust | |

 | coef | std err | t | P>\|t\| | [0.025 | 0.975]
---|---|---|---|---|---|---
Intercept | 21.7878 | 0.509 | 42.796 | 0.000 | 20.773 | 22.803
C(foreign)[T.1] | -1.6500 | 1.076 | -1.533 | 0.130 | -3.796 | 0.495
weightZ | -5.1201 | 0.495 | -10.340 | 0.000 | -6.107 | -4.133

Omnibus: | 37.830 | Durbin-Watson: | 2.421
---|---|---|---
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 92.107
Skew: | 1.727 | Prob(JB): | 9.98e-21
Kurtosis: | 7.236 | Cond. No. | 3.21

```
# Getting the design matrix
xz = np.asarray(patsy.dmatrix("weightZ + C(foreign)", data = df))

# Getting the singular values from SVD
_, sing_as, _ = np.linalg.svd(xz)

# Calculating the condition index
condition_index = []

for n in sing_as:
    ci = sing_as.max() / n
    condition_index.append(ci)

# The singular values squared are the eigenvalues
eigen_vals = (sing_as * sing_as).round(3)

pd.DataFrame(np.c_[eigen_vals, condition_index],
             columns = ["Eigenvalues", "Condition Index"]).round(1)
```

 | Eigenvalues | Condition Index
---|---|---
0 | 87.1 | 1.0
1 | 73.4 | 1.1
2 | 8.5 | 3.2

In the z-score model, the eigenvalues and their associated condition index values indicate that the design matrix is well conditioned, i.e. no condition index value is ≥ 30, and that the concern about collinearity/estimation stability has been addressed.
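A sanity check worth knowing: standardizing a predictor rescales its coefficient and conditions the design matrix, but does not change the model's fit. A minimal sketch with simulated data (the two auto models above behave the same way, which is why R-squared and the F-statistic are identical in both summaries):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a large-scale predictor
rng = np.random.default_rng(7)
sim = pd.DataFrame({"weight": rng.normal(3000, 800, 74)})
sim["mpg"] = 40 - 0.006 * sim["weight"] + rng.normal(0, 3, 74)
sim["weightZ"] = (sim["weight"] - sim["weight"].mean()) / sim["weight"].std()

raw = smf.ols("mpg ~ weight", data=sim).fit()
std = smf.ols("mpg ~ weightZ", data=sim).fit()

print(raw.rsquared, std.rsquared)                  # identical fit
print(raw.condition_number, std.condition_number)  # far smaller after scaling
```

Because standardization is a linear transformation, only the scale-dependent quantities (coefficient, condition number) change.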

## References

Allen, M. P. (1997). *Understanding regression analysis*. New York, NY: Springer.

Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). *Regression diagnostics: Identifying influential data and sources of collinearity*. Hoboken, NJ: John Wiley & Sons, Inc.

Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. (2004). *Applied linear statistical models* (5th ed.). New York, NY: McGraw-Hill Irwin.

O'brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. *Quality & Quantity, 41*, 673-690.