Python for Data Science

Logistic Regression

If you are looking for how to run the code, jump to the next section; if you would like some theory or a refresher first, start with this one.

Logistic regression is a nonlinear regression model used when the dependent variable (outcome) is binary (0 or 1). The value 1 is typically used to indicate that the event (or desired outcome) occurred, whereas 0 indicates that it did not. The interpretation of the coefficients is not as straightforward as it is for a linear regression model - this is due to the transformation of the data that is made in the logistic regression algorithm. In logistic regression, the coefficients are a measure of the log of the odds. Given this, the interpretation of a categorical independent variable with two groups would be "those who are in group-A have an increase/decrease of ##.## in the log odds of the outcome compared to group-B" - that's not intuitive at all. Commonly, researchers take the exponential of the coefficients because it allows for a much easier interpretation: the exponentiated coefficients represent the odds ratio (OR). This changes the interpretation to "the odds of the outcome for group-A are ##.## times those of group-B", where

  • OR = 1, same odds between groups
  • OR < 1, lower odds compared to the reference group
  • OR > 1, greater odds compared to the reference group

For continuous independent variables, the interpretation of the odds ratio is worded slightly differently because there is no comparison group. In this case, the interpretation would be "the odds of the outcome increase/decrease by a factor of ##.## for every one unit increase in the independent variable."

If one were to use the logistic regression model to make predictions, the predicted Y ($\hat{Y}$) would represent the probability of the outcome occurring given the specific values of the independent variables, i.e. $\hat{Y} = 0.56$ would mean there is a 56% chance the outcome will occur.
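To make the link between log odds, odds ratios, and predicted probabilities concrete, here is a minimal sketch using made-up values (the intercept, coefficient, and x below are purely illustrative and do not come from any fitted model):

import numpy as np

b0, b1 = -1.5, 0.8                    # hypothetical intercept and coefficient on the log odds scale
x = 2.0                               # hypothetical value of the independent variable

log_odds = b0 + b1 * x                # linear predictor
prob = 1 / (1 + np.exp(-log_odds))    # inverse logit gives the predicted probability, Y-hat
odds_ratio = np.exp(b1)               # exponentiating the coefficient gives the odds ratio

print(f"Predicted probability = {prob:.2f}; OR per unit increase = {odds_ratio:.2f}")

Here exp(0.8) ≈ 2.23, i.e. each one unit increase in x multiplies the odds of the outcome by roughly 2.23.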

For this demonstration, the conventional p-value of 0.05 will be used.

Logistic regression test assumptions

  • Linearity of the logit for continuous variables (see the sketch after this list)
  • Independence of errors
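A hedged sketch of one way to probe the first assumption, linearity of the logit: a Box-Tidwell style check adds each continuous predictor's interaction with its own natural log to the model and looks at whether that added term is significant. This check is not part of the original demonstration, and the formula below assumes the admit, gre, and gpa columns from the data frame df that is loaded later in this post.

import numpy as np
import statsmodels.formula.api as smf

# Box-Tidwell style check: a significant x:log(x) interaction suggests
# the logit is not linear in x. Assumes df holds admit, gre, and gpa.
bt_check = smf.logit("admit ~ gre + gpa + gre:np.log(gre) + gpa:np.log(gpa)",
                     data=df).fit()
print(bt_check.summary())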

Maximum likelihood estimation is used to obtain the coefficients, and the model is typically assessed using a goodness-of-fit (GoF) test - currently, the Hosmer-Lemeshow GoF test is the one most commonly used. Hosmer and Lemeshow's (1980) method is as follows:

  1. Order the observations based on their estimated probabilities
  2. Partition ordered observations into 10 groups ($g$ = 10) by either of the following grouping strategies:
    • sample size, defined as $n_g^{'} = \frac{n}{10}$, or
    • by using cutpoints ($k$), defined as $\frac{k_g}{10}$
    • These groupings are known as 'deciles of risk'. Either grouping strategy can be used to calculate the Hosmer-Lemeshow goodness-of-fit statistic ($\hat{C}$)
  3. Run the Pearson chi-squared ($\chi^2$) test on the $g \times 2$ deciles-of-risk table. Hosmer and Lemeshow (1980) express this using the following formula:
  4. $$ \begin{align*} \hat{C} = \sum_{k=1}^{g}\frac{(o_k - n_k^{'}\bar{\pi}_k)^{2}}{n_k^{'}\bar{\pi}_k(1 - \bar{\pi}_k)} \\ \\ \text{with,} \\ \\ o_k = \sum_{j=1}^{c_k}y_j \quad \text{being the observed number of responses} \\ \\ \bar{\pi}_k = \sum_{j=1}^{c_k}\frac{m_j\hat{\pi}_j}{n_k^{'}} \quad \text{being the average estimated probability} \end{align*} $$

  5. Where,
    • $n_k^{'}$ is the total number of participants in the $k^{th}$ group
    • $c_k$ is the number of covariate patterns in the $k^{th}$ decile
    • $m_j\hat{\pi}_j$ is the expected number of responses for the $j^{th}$ covariate pattern

One rejects the null hypothesis, $H_0$, if the computed $\hat{C}$ statistic is greater than the critical $\chi^2$ statistic for the given degrees of freedom.
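The formula above translates into code fairly directly. Below is a minimal, hedged sketch of the Hosmer-Lemeshow statistic using the equal-sized grouping strategy; the function name and the g - 2 degrees of freedom convention are my own assumptions rather than anything prescribed in this post.

import numpy as np
import pandas as pd
from scipy import stats

def hosmer_lemeshow(y, p_hat, g=10):
    """Hosmer-Lemeshow goodness-of-fit statistic using equal-sized
    'deciles of risk' formed from the ordered estimated probabilities."""
    d = pd.DataFrame({"y": y, "p": p_hat})
    d["group"] = pd.qcut(d["p"], q=g, duplicates="drop")
    groups = d.groupby("group", observed=True)
    o_k = groups["y"].sum()          # observed number of responses per group
    n_k = groups["y"].count()        # group sizes, n'_k
    pi_k = groups["p"].mean()        # average estimated probability per group
    c_hat = (((o_k - n_k * pi_k) ** 2) / (n_k * pi_k * (1 - pi_k))).sum()
    dof = len(o_k) - 2               # common convention: g - 2 degrees of freedom
    return c_hat, stats.chi2.sf(c_hat, dof)

Once the model later in this post has been fit, this could be called as hosmer_lemeshow(df['admit'], model.predict()).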

Logistic Regression with Python

Don't forget to check the assumptions before interpreting the results! First, load the libraries and data needed. Below, Pandas, Researchpy, and the data set will be loaded. The data set is hosted by the UCLA Institute for Digital Research & Education for their demonstration on logistic regression within Stata.

import pandas as pd
import researchpy as rp
import statsmodels.api as sm


df = pd.read_stata("https://stats.idre.ucla.edu/stat/stata/dae/binary.dta")
df.info()

Let's look at the variables in the data set.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 0 to 399
Data columns (total 4 columns):
admit    400 non-null float32
gre      400 non-null float32
gpa      400 non-null float32
rank     400 non-null float32
dtypes: float32(4)
memory usage: 9.4 KB

For this example, the hypothetical research question is "What factors affect the chances of being admitted?" One of the departments has some data from the previous semester and would like to use it to test this research question. Now to take a look at the descriptives of the factors that will be included in the model: gre, gpa, and rank. Rank is a factor variable that measures the prestige of the institution from which the applicant is applying, with 1 indicating the highest prestige and 4 indicating the lowest.

rp.summary_cont(df[['gre', 'gpa']])
  Variable      N        Mean          SD        SE   95% Conf.    Interval
0      gre  400.0  587.700012  115.516663  5.775827  576.345156  599.054868
1      gpa  400.0    3.389901    0.380567  0.019028    3.352493    3.427309
rp.summary_cat(df[['admit', 'rank']])
  Variable  Outcome  Count  Percent
0    admit      0.0    273    68.25
1               1.0    127    31.75
2     rank      2.0    151    37.75
3               3.0    121    30.25
4               4.0     67    16.75
5               1.0     61    15.25

From the descriptive statistics it can be seen that the average GRE score is 587.7, the average GPA is 3.389, applicants applying from institutions with a prestige rank of 2 are the most common, and the majority of the applicants were not admitted to the program.

Logistic Regression using StatsModels

NOTE

StatsModels' formula API uses Patsy to handle the model formulas. The pseudo code looks like the following:

smf.logit("dependent_variable ~ independent_variable1 + independent_variable2 + independent_variablen",
        data = df).fit()

To tell the model that a variable is categorical, it needs to be wrapped in C(independent_variable). The pseudo code with a categorical independent variable looks like:

smf.logit("dependent_variable ~ independent_variable1 + C(independent_variable2)",
        data = df).fit()

By default, Patsy chooses the first level of the categorical variable as the reference category; it's possible to change the reference category if desired. To do this, one specifies the reference category at the same time the variable is marked as categorical. Pseudo code is as follows:

smf.logit("dependent_variable ~ independent_variable1 + C(independent_variable2, Treatment(categorical_group))",
        data = df).fit()

Where categorical_group is the desired reference group.
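For example, with the rank variable used later in this post, one could make the lowest-prestige institutions the reference group. The specific level value 4.0 is just an illustration and assumes rank is stored as a float, as it is in this data set:

smf.logit("admit ~ gre + gpa + C(rank, Treatment(4.0))", data = df).fit()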


First, one needs to import the package; the official documentation for this method can be found in the StatsModels documentation.

import statsmodels.formula.api as smf

Now that the package is imported, the model can be fit and the results reviewed.

model = smf.logit("admit ~ gre + gpa + C(rank)", data = df).fit()

model.summary()
                           Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  400
Model:                          Logit   Df Residuals:                      394
Method:                           MLE   Df Model:                            5
Date:                Mon, 20 Jan 2020   Pseudo R-squ.:                 0.08292
Time:                        10:24:43   Log-Likelihood:                -229.26
converged:                       True   LL-Null:                       -249.99
                                        LLR p-value:                 7.578e-08
==================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -3.9900      1.140     -3.500      0.000      -6.224      -1.756
C(rank)[T.2.0]    -0.6754      0.316     -2.134      0.033      -1.296      -0.055
C(rank)[T.3.0]    -1.3402      0.345     -3.881      0.000      -2.017      -0.663
C(rank)[T.4.0]    -1.5515      0.418     -3.713      0.000      -2.370      -0.733
gre                0.0023      0.001      2.070      0.038       0.000       0.004
gpa                0.8040      0.332      2.423      0.015       0.154       1.454
==================================================================================

Interpretation

Using this information, one can evaluate the regression model. The current overall model is significant, which indicates it's better than using the mean to predict being admitted. Interpreting the coefficients right now would be premature since the model's diagnostics have not been evaluated; however, for demonstration purposes they will be interpreted.

The overall model indicates the model is better than using the mean of admission to predict an applicant's admission decision, LR $\chi^2$(5) = 41.46, p < 0.0001. For every unit increase in GRE there is a 0.0023 increase in the log odds of being admitted; additionally, for every unit increase in GPA there is a 0.8040 increase in the log odds of being admitted. Applicants applying from institutions with a rank of 2, 3, or 4 have a decrease in the log odds of being admitted of 0.6754, 1.3402, and 1.5515, respectively, compared to applicants applying from a rank 1 institution.

That interpretation is valid, but log odds are not intuitive. Let's convert the results to odds ratios and interpret the model again. To convert the log odds coefficients and confidence intervals, one takes the exponential of the values.

import numpy as np

# Exponentiate the coefficients and confidence intervals to get odds ratios
model_odds = pd.DataFrame(np.exp(model.params), columns=['OR'])
model_odds['p-value'] = model.pvalues
model_odds[['2.5%', '97.5%']] = np.exp(model.conf_int())

model_odds
                      OR   p-value      2.5%     97.5%
Intercept       0.018500  0.000465  0.001981  0.172783
C(rank)[T.2.0]  0.508931  0.032829  0.273692  0.946358
C(rank)[T.3.0]  0.261792  0.000104  0.133055  0.515089
C(rank)[T.4.0]  0.211938  0.000205  0.093443  0.480692
gre             1.002267  0.038465  1.000120  1.004418
gpa             2.234545  0.015388  1.166122  4.281877

Interpretation

The overall model indicates the model is better than using the mean of admission to predict an applicant's admission decision, LR $\chi^2$(5) = 41.46, p < 0.0001. The odds of being admitted increase by a factor of 1.002 for every unit increase in GRE; likewise, the odds of being admitted increase by a factor of 2.235 for every unit increase in GPA. The odds of being admitted for those applying from an institution with a rank of 2, 3, or 4 are 0.5089, 0.2618, and 0.2119, respectively, times those of applicants applying from an institution with a rank of 1.

Converting to odds ratios (OR) makes the interpretation much more intuitive. Where,

  • OR = 1, same odds
  • OR < 1, lower/decrease in odds
  • OR > 1, greater/increase in odds

Also note that ORs are multiplicative in their interpretation, which is why the phrasing includes "... times more/less likely ..." or "... by a factor of ...".
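As mentioned in the theory section, the fitted model can also be used to produce predicted probabilities. A quick, hedged illustration using the fitted model object from above; the applicant profile below is made up:

# Estimated probability of admission for a single hypothetical applicant
new_applicant = pd.DataFrame({"gre": [600.0], "gpa": [3.5], "rank": [2.0]})
model.predict(new_applicant)

The returned value is the estimated probability of admission for that profile, read the same way as $\hat{Y}$ in the theory section.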

Assumption Check

Since logistic regression does not share the assumptions of linear regression, the diagnostics of the model are different as well. A lot of the methods used to diagnose linear regression models cannot be used to diagnose logistic regression models; with logistic regression, the focus is on assessing the model's adequacy.

Logistic Regression Residuals

In linear regression, one assesses the raw residuals as they are; however, the residuals from a logistic regression model need to be transformed to be useful. This is because the dependent variable is binary (0 or 1). Due to the binary nature of the outcome, the residuals will not be normally distributed and their distribution is unknown (Nachtsheim, Neter, & Li, 2004). The residuals assessed instead are the Pearson residuals, studentized Pearson residuals, and/or the deviance residuals.

A plot that is helpful for diagnosing a logistic regression model is the studentized Pearson residuals, or the deviance residuals, plotted against the estimated probabilities or linear predictor values with a Lowess smooth. Nachtsheim, Neter, and Li (2004) show that, under the assumption that the logistic regression model is correct, the error (difference) between the observed value ($Y_i$) and predicted value ($\hat{\pi}_i$) has expectation zero, i.e. $$E\{Y_i - \hat{\pi}_i\} = 0$$ They conclude that a Lowess smooth of one of the plots mentioned above should therefore be approximately a horizontal line with zero intercept - unfortunately, they do not suggest what "approximately" looks like. While looking at these visualizations, keep in mind that the image size and scale will affect how the plot looks and thus will affect one's interpretation.

Now, to demonstrate this. StatsModels calculates the studentized Pearson residuals (model.resid_pearson) as well as the deviance residuals (model.resid_dev) by default - which saves us some time.

import matplotlib.pyplot as plt
import seaborn as sns

## Plotting multiple plots on the same figure
fig, (axL, axR) = plt.subplots(2, figsize=(15, 15))
plt.suptitle("Logistic Regression Residual Plots \n using Seaborn Lowess line (N = 400)")


# Deviance residuals against the linear predictor values
sns.regplot(x=model.fittedvalues, y=model.resid_dev, ax=axL,
            color="black", scatter_kws={"s": 5},
            line_kws={"color": "b", "alpha": 1, "lw": 2}, lowess=True)

axL.set_title("Deviance Residuals \n against Fitted Values")
axL.set_xlabel("Linear Predictor Values")
axL.set_ylabel("Deviance Residuals")

# Studentized Pearson residuals against the linear predictor values
sns.regplot(x=model.fittedvalues, y=model.resid_pearson, ax=axR,
            color="black", scatter_kws={"s": 5},
            line_kws={"color": "g", "alpha": 1, "lw": 2}, lowess=True)

axR.set_title("Studentized Pearson Residuals \n against Fitted Values")
axR.set_xlabel("Linear Predictor Values")
axR.set_ylabel("Studentized Pearson Residuals")

plt.show()
[Figure: Deviance and studentized Pearson residual plots against the linear predictor values, each with a Lowess smooth line]

For the current example, it appears the plots do approximate a horizontal line with a zero intercept. This suggests that there is no significant model inadequacy.

References

Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. (2004). Applied linear statistical models (5th ed.). New York, NY: McGraw-Hill Irwin.
Hosmer, D. W., Jr., and Lemeshow, S. (1980). Goodness-of-fit tests for the multiple logistic regression model. Communications in Statistics: Theory and Methods, Part A, 9, 1043-1069.
Hosmer, D. W., Jr., and Lemeshow, S. (2000). Applied Logistic Regression (2nd ed.). New York: Wiley.