Logistic Regression
If you are looking for how to run the code, jump to the next section; if you
would like some theory or a refresher, start with this section.
Logistic regression is a nonlinear regression model used when the dependent
variable (outcome) is binary (0 or 1). The binary value 1 is typically used to
indicate that the event (or desired outcome) occurred, whereas 0 is typically
used to indicate that the event did not occur. The interpretation of the
coefficients is not as straightforward as it is for a linear regression
model - this is due to the transformation of the data that is made in the
logistic regression algorithm. In logistic regression, the coefficients
are a measure of the log of the odds. Given this, the interpretation of a
categorical independent variable with two groups would be
"those who are in group-A have an increase/decrease of ##.## in the log odds
of the outcome compared to group-B" - that's not intuitive at all.
Commonly, researchers take the exponential of the coefficients
because it allows for a much easier interpretation, since the exponentiated
coefficients represent the odds ratio (OR). This changes the interpretation to,
"the odds of the outcome for group-A are ##.## times that of group-B", where
- OR = 1, same odds as the reference group
- OR < 1, lower odds than the reference group
- OR > 1, greater odds than the reference group
For continuous independent variables, the interpretation of the odds ratio is worded slightly differently because there is no comparison group. In this case, the interpretation would be "the odds of the outcome increase/decrease by a factor of ##.## for every one unit increase in the independent variable."
If one were to use the logistic regression model to make predictions, the predicted Y, ($\hat{Y}$), would represent the probability of the outcome occurring given the specific values of the independent variables, e.g. $\hat{Y} = 0.56$ would mean there is a 56% chance the outcome will occur.
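As a quick recap of how these quantities fit together (a summary added here for reference, with $X\beta$ denoting the linear predictor of the model):
$$ \begin{align*} \log\left(\frac{\pi}{1 - \pi}\right) &= X\beta & & \text{the log odds (logit) modeled by the regression} \\ \\ \frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} &= e^{\beta_j} & & \text{the odds ratio for a one unit increase in } x_j \\ \\ \hat{Y} = \hat{\pi} &= \frac{e^{X\hat{\beta}}}{1 + e^{X\hat{\beta}}} & & \text{the predicted probability of the outcome} \end{align*} $$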
For this demonstration, the conventional significance level of 0.05 will be used.
Logistic regression test assumptions
- Linearity of the logit for continuous variables
- Independence of errors
Maximum likelihood estimation is used to obtain the coefficients, and the model is typically assessed using a goodness-of-fit (GoF) test - currently, the Hosmer-Lemeshow GoF test is commonly used. Hosmer and Lemeshow's (1980) method is as follows:
- Order the observations based on their estimated probabilities
- Partition ordered observations into 10 groups ($g$ = 10) by either of the following grouping strategies:
- sample size, defined as $n_g^{'} = \frac{n}{10}$, or
- by using cutpoints ($k$), defined as $\frac{k_g}{10}$
- These groupings are known as 'deciles of risk'. Either grouping strategy can be used to calculate the Hosmer-Lemeshow goodness-of-fit statistic ($\hat{C}$)
- Run the Pearson chi-squared ($\chi^2$) test on the $g$ x 2 deciles-of-risk table. Hosmer and Lemeshow (1980) express this using the following formula:
$$ \begin{align*} \hat{C} &= \sum_{k=1}^{g}\frac{(o_k - n_k^{'}\bar{\pi}_k)^{2}}{n_k^{'}\bar{\pi}_k(1 - \bar{\pi}_k)} \\ \\ \text{with } o_k &= \sum_{j=1}^{c_k}y_j & & \text{being the observed number of responses, and} \\ \\ \bar{\pi}_k &= \sum_{j=1}^{c_k}\frac{m_j\hat{\pi}_j}{n_k^{'}} & & \text{being the average estimated probability} \end{align*} $$
- Where,
  - $n_k^{'}$ is the total number of participants in the $k^{th}$ group
  - $c_k$ is the number of covariate patterns in the $k^{th}$ decile
  - $m_j\hat{\pi}_j$ is the expected number of responses for the $j^{th}$ covariate pattern
One rejects the null hypothesis, $H_0$, if the computed $\hat{C}$ statistic is greater than the critical $\chi^2$ statistic for the given degrees of freedom.
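For readers who want to see the deciles-of-risk version of $\hat{C}$ in code, below is a minimal sketch of the calculation. It is illustrative rather than a polished implementation; the function name hosmer_lemeshow and its arguments y (the observed 0/1 outcomes) and p_hat (the estimated probabilities) are made up for this example.
import pandas as pd
from scipy import stats

def hosmer_lemeshow(y, p_hat, g = 10):
    # Order observations by their estimated probabilities and
    # partition them into g roughly equal-sized groups (deciles of risk)
    dat = pd.DataFrame({"y": y, "p": p_hat}).sort_values("p")
    dat["group"] = pd.qcut(dat["p"], q = g, labels = False, duplicates = "drop")

    grouped = dat.groupby("group")
    o_k = grouped["y"].sum()         # observed number of responses per group
    n_k = grouped["y"].count()       # number of observations per group
    pi_bar = grouped["p"].mean()     # average estimated probability per group

    # Pearson chi-squared statistic over the g x 2 deciles-of-risk table
    C_hat = (((o_k - n_k * pi_bar) ** 2) / (n_k * pi_bar * (1 - pi_bar))).sum()
    dof = grouped.ngroups - 2        # g - 2 degrees of freedom
    return C_hat, stats.chi2.sf(C_hat, dof)
Once the model in the section below has been fit, something like hosmer_lemeshow(df['admit'], model.predict()) would return $\hat{C}$ and its p-value.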
Logistic Regression with Python
Don't forget to check the assumptions before interpreting the results! First, load the libraries and data that are needed. Below, Pandas, Researchpy, StatsModels, and the data set will be loaded. This data set is hosted by the UCLA Institute for Digital Research & Education for their demonstration on logistic regression within Stata.
import pandas as pd
import researchpy as rp
import statsmodels.api as sm

df = pd.read_stata("https://stats.idre.ucla.edu/stat/stata/dae/binary.dta")
df.info()
Let's look at the variables in the data set.
For this example, the hypothetical research question is "What factors affect the chances of being admitted?" One of the departments has some data from the previous semester and would like to use it to test this research question. Now to take a look at the descriptive statistics of the factors that will be included in the model: gre, gpa, and rank. Rank is a factor variable that measures the prestige of the institution from which the applicant is applying, with 1 indicating the highest prestige and 4 indicating the lowest prestige.
rp.summary_cont(df[['gre', 'gpa']])
|  | Variable | N | Mean | SD | SE | 95% Conf. | Interval |
|---|---|---|---|---|---|---|---|
| 0 | gre | 400.0 | 587.700012 | 115.516663 | 5.775827 | 576.345156 | 599.054868 |
| 1 | gpa | 400.0 | 3.389901 | 0.380567 | 0.019028 | 3.352493 | 3.427309 |
rp.summary_cat(df[['admit', 'rank']])
|  | Variable | Outcome | Count | Percent |
|---|---|---|---|---|
| 0 | admit | 0.0 | 273 | 68.25 |
| 1 |  | 1.0 | 127 | 31.75 |
| 2 | rank | 2.0 | 151 | 37.75 |
| 3 |  | 3.0 | 121 | 30.25 |
| 4 |  | 4.0 | 67 | 16.75 |
| 5 |  | 1.0 | 61 | 15.25 |
From the descriptive statistics it can be seen that the average GRE score is 587.7, the average GPA is 3.389, applicants applying from institutions with a prestige rank of 2 are the most common, and the majority of the applicants were not admitted to the program.
Logistic Regression using StatsModels
NOTE
StatsModels' formula API uses Patsy to handle the formulas that are passed in. The pseudo code looks like the following:
smf.logit("dependent_variable ~ independent_variable1 + independent_variable2 + independent_variablen",
data = df).fit()
To tell the model that a variable is categorical, it needs to be wrapped in C(independent_variable). The pseudo code with a categorical independent variable looks like:
smf.logit("dependent_variable ~ independent_variable1 + C(independent_variable2)",
data = df).fit()
By default, Patsy chooses the first level of the categorical variable as the reference category; it's possible to change the reference category if desired. In order to do this, one needs to specify the reference category while specifying that the variable is categorical. The pseudo code is as follows:
smf.logit("dependent_variable ~ independent_variable1 + C(independent_variable2, Treatment(categorical_group))",
data = df).fit()
Where categorical_group is the desired reference group.
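For example, with the admissions data loaded above (assuming rank is stored as the numeric values 1.0-4.0 shown in the descriptives), one could make rank 4 institutions the reference group. This is purely to illustrate the syntax; the model fit later in this section keeps the default reference of rank 1.
smf.logit("admit ~ gre + gpa + C(rank, Treatment(reference = 4.0))",
          data = df).fit()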
First, one needs to import the formula API; this method is covered in more detail in the official StatsModels documentation.
import statsmodels.formula.api as smf
Now that the package is imported, the model can be fit and the results reviewed.
model = smf.logit("admit ~ gre + gpa + C(rank)", data = df).fit()
model.summary()
| Dep. Variable: | admit | No. Observations: | 400 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 394 |
| Method: | MLE | Df Model: | 5 |
| Date: | Mon, 20 Jan 2020 | Pseudo R-squ.: | 0.08292 |
| Time: | 10:24:43 | Log-Likelihood: | -229.26 |
| converged: | True | LL-Null: | -249.99 |
|  |  | LLR p-value: | 7.578e-08 |
|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.9900 | 1.140 | -3.500 | 0.000 | -6.224 | -1.756 |
| C(rank)[T.2.0] | -0.6754 | 0.316 | -2.134 | 0.033 | -1.296 | -0.055 |
| C(rank)[T.3.0] | -1.3402 | 0.345 | -3.881 | 0.000 | -2.017 | -0.663 |
| C(rank)[T.4.0] | -1.5515 | 0.418 | -3.713 | 0.000 | -2.370 | -0.733 |
| gre | 0.0023 | 0.001 | 2.070 | 0.038 | 0.000 | 0.004 |
| gpa | 0.8040 | 0.332 | 2.423 | 0.015 | 0.154 | 1.454 |
Interpretation
Using this information, one can evaluate the regression model. The current
overall model is statistically significant, which indicates it is better than using the
mean to predict being admitted. Interpreting the coefficients right now would be premature since the
model's diagnostics have not been evaluated; however, for demonstration purposes
they will be interpreted.
The overall model test indicates the model is better than using the mean of
admission to predict an applicant's admission decision (likelihood-ratio test, LLR p-value = 7.578e-08 in the summary above).
For every unit increase in GRE there is a 0.0023 increase in the log odds
of being admitted; additionally, for every unit increase in
GPA there is a 0.8040 increase in the log odds of being admitted. Applicants
applying from institutions with a rank of 2, 3, or 4 have a decrease of
0.6754, 1.3402, and 1.5515, respectively, in the log odds of being admitted
compared to applicants applying from a rank 1 institution.
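As a side note, the likelihood-ratio test of the overall model quoted above can be read directly off the fitted results object; a quick check using attributes StatsModels stores on the fit:
# Likelihood-ratio test of the full model against the intercept-only model:
# llr = 2 * (llf - llnull), with df_model degrees of freedom
print(model.llr, model.df_model, model.llr_pvalue)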
That interpretation is valid, but log odds are not intuitive to interpret. Let's convert the coefficients to odds ratios and interpret the model again. To convert the log-odds coefficients and confidence intervals, one needs to take the exponential of the values.
import numpy as np

# Exponentiate the coefficients and confidence intervals to get odds ratios
model_odds = pd.DataFrame(np.exp(model.params), columns= ['OR'])
model_odds['p-value'] = model.pvalues
model_odds[['2.5%', '97.5%']] = np.exp(model.conf_int())
model_odds
|  | OR | p-value | 2.5% | 97.5% |
|---|---|---|---|---|
| Intercept | 0.018500 | 0.000465 | 0.001981 | 0.172783 |
| C(rank)[T.2.0] | 0.508931 | 0.032829 | 0.273692 | 0.946358 |
| C(rank)[T.3.0] | 0.261792 | 0.000104 | 0.133055 | 0.515089 |
| C(rank)[T.4.0] | 0.211938 | 0.000205 | 0.093443 | 0.480692 |
| gre | 1.002267 | 0.038465 | 1.000120 | 1.004418 |
| gpa | 2.234545 | 0.015388 | 1.166122 | 4.281877 |
Interpretation
The overall model test indicates the model is better than using the mean of admission to predict an applicant's admission decision (likelihood-ratio test, LLR p-value = 7.578e-08). The odds of being admitted increase by a factor of 1.002 for every unit increase in GRE; likewise, the odds of being admitted increase by a factor of 2.235 for every unit increase in GPA. The odds of being admitted for those applying from an institution with a rank of 2, 3, or 4 are 0.5089, 0.2618, and 0.2119 times, respectively, that of those applying from an institution with a rank of 1.
Converting to odds ratios (OR) makes the interpretation much more intuitive. Where,
- OR = 1, same odds
- OR < 1, fewer/decrease in odds
- OR > 1, greater/increase in odds
Also note that ORs are multiplicative in their interpretation, which is why the phrasing includes "... times more likely/less likely ..." or "... by a factor of ...".
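To see that multiplicative behaviour with the numbers from this model, consider a 100-point change in GRE (the 100-point increment is just an illustrative choice); the corresponding odds ratio is the one-unit odds ratio raised to the 100th power, or equivalently $e^{100\beta}$. A small sketch of that calculation:
import numpy as np

# The odds ratio for a k-unit change is exp(k * coefficient)
gre_coef = model.params["gre"]     # log-odds change per 1-point increase in GRE
print(np.exp(100 * gre_coef))      # roughly 1.25, i.e. about 25% higher odds per 100 GRE points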
Assumption Check
Logistic regression does not make the same assumptions as linear regression, and the diagnostics of the model are different as well. A lot of the methods used to diagnose linear regression models cannot be used to diagnose logistic regression models; with logistic regression, the focus is on assessing the model's adequacy.
Logistic Regression Residuals
In linear regression, one assesses the residuals as they are; however, the residuals from a logistic regression model need to be transformed to be useful. This is because the dependent variable is binary (0 or 1). Due to the binary nature of the outcome, the residuals will not be normally distributed and their distribution is unknown (Nachtsheim, Neter, & Li, 2004). The residuals that are assessed are therefore the Pearson residuals, the studentized Pearson residuals, and/or the deviance residuals.
A plot that is helpful for diagnosing a logistic regression model is a plot of
the studentized Pearson residuals, or the deviance residuals,
against the estimated probability or linear predictor values, with a lowess smooth.
Nachtsheim, Neter, and Li (2004) show that, under the assumption that the logistic regression model
is correct, the expected error (difference) between the observed value ($Y_i$)
and the predicted value ($\hat{\pi}_i$) is equal to 0, i.e.
$$E\{Y_i - \hat{\pi}_i\} = 0$$
They conclude that this suggests that a lowess smooth of one of the plots
mentioned above would be approximately a horizontal line with zero intercept -
unfortunately, they do not provide a suggestion of what "approximately"
looks like.
While looking at visualizations, it's important to keep in mind that the image
size and scale will affect how the visualization looks and thus will affect
one's interpretation.
Now, to demonstrate this. StatsModels calculates the studentized Pearson
residuals (model.resid_pearson) as well as the
deviance residuals (model.resid_dev) by default, which saves us some time.
import matplotlib.pyplot as plt
import seaborn as sns

## Plotting multiple plots in the same figure
fig, (axL, axR) = plt.subplots(2, figsize=(15, 15))
plt.suptitle("Logistic Regression Residual Plots \n using Seaborn lowess line (N = 400)")

# Deviance residuals against the linear predictor values
sns.regplot(x= model.fittedvalues, y= model.resid_dev, ax= axL,
            color="black", scatter_kws={"s": 5},
            line_kws={"color": "b", "alpha": 1, "lw": 2}, lowess=True)

axL.set_title("Deviance Residuals \n against Fitted Values")
axL.set_xlabel("Linear Predictor Values")
axL.set_ylabel("Deviance Residuals")

# Studentized Pearson residuals against the linear predictor values
sns.regplot(x= model.fittedvalues, y= model.resid_pearson, ax= axR,
            color="black", scatter_kws={"s": 5},
            line_kws={"color": "g", "alpha": 1, "lw": 2}, lowess=True)

axR.set_title("Studentized Pearson Residuals \n against Fitted Values")
axR.set_xlabel("Linear Predictor Values")
axR.set_ylabel("Studentized Pearson Residuals")

plt.show()
For the current example, the plots appear to approximate a horizontal line with a zero intercept. This suggests that there is no significant model inadequacy.
References
Hosmer, D. W., Jr., and Lemeshow, S. (1980). Goodness-of-fit tests for the multiple logistic regression model. Communications in Statistics: Theory and Methods, Part A, 9, 1043-1069.
Hosmer, D. W., Jr., and Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York, NY: Wiley.
Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. (2004). Applied linear statistical models (5th ed.). New York, NY: McGraw-Hill Irwin.