
Assumptions of Logistic Regression, Clearly Explained

Understand and implement assumption checks (in Python) for one of the most important data science modeling techniques

Photo by Sebastian Staines on Unsplash

Logistic regression is a highly effective modeling technique that has remained a mainstay in statistics since its development in the 1940s.

Given its popularity and utility, data practitioners should understand the fundamentals of logistic regression before using it to tackle data and business problems.

In this article, we explore the key assumptions of logistic regression with theoretical explanations and practical Python implementation of the assumption checks.


Contents

(1) Theoretical Concepts & Practical Checks
(2) Comparison with Linear Regression
(3) Summary and GitHub repo link

Photo by Glenn Carstens-Peters on Unsplash

Theoretical Concepts & Practical Checks

For the implementation of the assumption checks in Python, we will be using the classic Titanic dataset. For the complete code, please have a look at the GitHub repo of this project.


Assumption 1 – Appropriate Outcome Type

Logistic regression generally works as a classifier, so the type of logistic regression utilized (binary, multinomial, or ordinal) must match the outcome (dependent) variable in the dataset.

By default, logistic regression assumes that the outcome variable is binary, where the number of outcomes is two (e.g., Yes/No).

If the dependent variable has three or more outcomes, then multinomial or ordinal logistic regression should be used.

How to Check?

We can check this assumption by getting the number of different outcomes in the dependent variable. If we want to use binary logistic regression, then there should only be two unique outcomes in the outcome variable.
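A quick sketch of this check on the Titanic data might look like the following, assuming the data is loaded into a DataFrame named `df` with the binary outcome column `Survived` (the file path is illustrative):

```python
import pandas as pd

# Load the Titanic data (file path assumed)
df = pd.read_csv('titanic.csv')

# For binary logistic regression, the outcome should have exactly 2 unique values
print(df['Survived'].nunique())   # expect 2
print(df['Survived'].unique())    # expect something like [0, 1]
```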


Assumption 2 – Linearity of independent variables and log-odds

One of the critical assumptions of logistic regression is that the relationship between the logit (aka log-odds) of the outcome and each continuous independent variable is linear.

The logit is the natural logarithm of the odds, i.e., logit(p) = ln(p / (1 − p)), where p = probability of a positive outcome (e.g., survived the Titanic sinking).

How to Check?

(i) Box-Tidwell Test

The Box-Tidwell test is used to check for linearity between the predictors and the logit. This is done by adding interaction terms between each continuous independent variable and its natural logarithm (i.e., terms of the form x · ln(x)) into the model.

For example, if one of your continuous independent variables is Age, then the interaction term to add as a new variable will be `Age * ln(Age)`.

As part of the Box-Tidwell test, we filter our dataset to keep just the continuous independent variables.

Note: While R has the `car` library to perform Box-Tidwell with a single line of code, I could not find any Python package that can do something similar.

If you have more than one continuous variable, you should include the same number of interaction terms in the model. With the interaction terms included, we can re-run the logistic regression and review the results.
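Below is a minimal sketch of this step using statsmodels, assuming the Titanic data in `df`, the outcome column `Survived`, and the continuous predictors `Age` and `Fare` (the exact column names and preprocessing in the repo may differ):

```python
import numpy as np
import statsmodels.api as sm

# Keep only the outcome and the continuous predictors (column names assumed)
df_bt = df[['Survived', 'Age', 'Fare']].dropna()
df_bt = df_bt[(df_bt['Age'] > 0) & (df_bt['Fare'] > 0)].copy()  # ln(x) needs x > 0

# Box-Tidwell interaction terms: x * ln(x) for each continuous predictor
df_bt['Age:LogAge'] = df_bt['Age'] * np.log(df_bt['Age'])
df_bt['Fare:LogFare'] = df_bt['Fare'] * np.log(df_bt['Fare'])

# Re-run the logistic regression with the interaction terms included
X = sm.add_constant(df_bt[['Age', 'Fare', 'Age:LogAge', 'Fare:LogFare']])
result = sm.Logit(df_bt['Survived'], X).fit(disp=0)
print(result.summary())
```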

Sample Logit Regression Results involving Box-Tidwell transformations | Image by author

What we need to do is check the statistical significance of the interaction terms (`Age:LogAge` and `Fare:LogFare` in this case) based on their p-values.

The `Age:LogAge` interaction term has a p-value of 0.101 (not statistically significant since p>0.05), implying that the independent variable Age is linearly related to the logit of the outcome variable and that the assumption is satisfied.

On the contrary, `Fare:LogFare` is statistically significant (i.e., p≤0.05), indicating the presence of non-linearity between Fare and the logit.

One solution is to perform transformations by incorporating higher-order polynomial terms to capture the non-linearity (e.g., Fare²).

(ii) Visual check

Another way that we can check logit linearity is by visually inspecting the scatter plot between each predictor and the logit values.
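As a rough sketch (reusing the filtered `df_bt` frame and column names assumed above), we can fit a logistic model on the continuous predictors, convert its predicted probabilities into log-odds, and plot them against each predictor:

```python
import numpy as np
import seaborn as sns
import statsmodels.api as sm

# Fit a logistic model on the continuous predictors only (no interaction terms)
X_cont = sm.add_constant(df_bt[['Age', 'Fare']])
logit_model = sm.Logit(df_bt['Survived'], X_cont).fit(disp=0)

# Convert predicted probabilities into log-odds (the logit)
pred_prob = logit_model.predict(X_cont)
log_odds = np.log(pred_prob / (1 - pred_prob))

# Scatter plot of Fare against the log-odds, with a LOWESS trend line
sns.regplot(x=df_bt['Fare'], y=log_odds, lowess=True,
            scatter_kws={'s': 10}, line_kws={'color': 'red'})
```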

Scatter plot of Fare variable vs. log-odds of outcome | Image by author

The above scatter plot shows a clear non-linear pattern of Fare vs. the log-odds, thereby implying that the assumption of logit linearity is violated.


Assumption 3 – No strongly influential outliers

Logistic regression assumes that there are no highly influential outlier data points, as they distort the outcome and accuracy of the model.

Note that not all outliers are influential observations. Rather, outliers have the potential to be influential. To assess this assumption, we need to check whether both criteria are satisfied, i.e., influential and outlier.

How to Check?

(i) Influence

We can use Cook’s Distance to determine the influence of a data point, and it is calculated based on its residual and leverage. It summarizes the changes in the regression model when that particular (i-th) observation is removed.

There are different opinions regarding what cut-off values to use. One standard threshold is 4/N (where N = number of observations), meaning that observations with Cook’s Distance > 4/N are deemed as influential.

The statsmodels package also allows us to visualize influence plots for GLMs, such as the index plot ([influence.plot_index](https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLMResults.get_influence.html)) for influence attributes:
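A hedged sketch of this step, re-fitting the logistic model as a GLM so that the influence measures are available (variable names reused from the earlier snippets; the `y_var='cooks'` option name is how I recall the statsmodels API, so treat it as an assumption):

```python
import statsmodels.api as sm

# Re-fit the logistic model as a GLM with a Binomial family so that
# statsmodels' influence measures are available (reusing df_bt from above)
X = sm.add_constant(df_bt[['Age', 'Fare']])
glm_result = sm.GLM(df_bt['Survived'], X, family=sm.families.Binomial()).fit()

influence = glm_result.get_influence()
cooks_d = influence.cooks_distance[0]   # Cook's Distance per observation

# Index plot of Cook's Distance with the 4/N cut-off marked
n = len(df_bt)
influence.plot_index(y_var='cooks', threshold=4 / n)
```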

Sample of an Index Plot of Cook’s Distance. The dotted red line indicates the Cook’s Distance cut-off, above which are points considered influential | Image by author

(ii) Outliers

We use standardized residuals to determine whether a data point is an outlier or not. Data points with absolute standardized residual values greater than 3 represent possible extreme outliers.

(iii) Putting Both Together

We can identify the strongly influential outlier data points by finding the top observations based on thresholds defined earlier for Cook’s Distance and standardized residuals.

When outliers are detected, they should be treated accordingly, such as removing or transforming them.
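A minimal sketch of how the two criteria can be combined, reusing the `influence` object from the previous snippet (the attribute names are as I recall them from statsmodels’ GLMInfluence, so verify against the docs):

```python
import pandas as pd

# Combine Cook's Distance with standardized residuals from the same influence object
diagnostics = pd.DataFrame({
    'cooks_d': influence.cooks_distance[0],
    'std_resid': influence.resid_studentized,   # studentized/standardized residuals
}, index=df_bt.index)

# Strongly influential outliers: Cook's Distance > 4/N AND |standardized residual| > 3
mask = (diagnostics['cooks_d'] > 4 / n) & (diagnostics['std_resid'].abs() > 3)

# Top 5 most influential outliers, sorted by Cook's Distance
print(diagnostics[mask].sort_values('cooks_d', ascending=False).head(5))
```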

Top 5 most influential outliers (with the corresponding indices) | Image by author

Assumption 4 – Absence of Multicollinearity

Multicollinearity corresponds to a situation where the data contain highly correlated independent variables. This is a problem because it reduces the precision of the estimated coefficients, which weakens the statistical power of the logistic regression model.

How to Check?

Variance Inflation Factor (VIF) measures the degree of multicollinearity in a set of independent variables.

Mathematically, the VIF for a given predictor equals 1 / (1 − R²), where R² is obtained by regressing that predictor on all the other independent variables.

The smallest possible value for VIF is 1 (i.e., a complete absence of collinearity). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of multicollinearity.
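Here is a minimal sketch of the VIF computation with statsmodels, assuming a set of numeric predictor columns from the Titanic data (the exact feature set in the repo may differ); the constant’s own VIF is not meaningful and can be ignored:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Numeric predictor columns (assumed feature set); add a constant so the
# VIFs of the predictors are computed correctly
X = sm.add_constant(df[['Age', 'Fare', 'Pclass', 'SibSp', 'Parch']].dropna())

vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)   # the 'const' row can be ignored
```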

Below is a sample of the calculated VIF values. Since no VIF values exceed 5, the assumption is satisfied.

VIF values | Image by author

Another way to check is to generate a correlation matrix heatmap:
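A minimal sketch with seaborn, assuming the same numeric predictor columns as above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations between the independent variables (assumed feature set)
corr = df[['Age', 'Fare', 'Pclass', 'SibSp', 'Parch']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
```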

Correlation Matrix | Image by author

The problem with this method is that the heatmap can be challenging to interpret when many independent variables are present.

More importantly, collinearity can exist between three or more variables even if no pair of variables is seen to have an exceptionally high correlation. Hence, VIF is a better way to assess multicollinearity.



Assumption 5 – Independence of observations

The observations must be independent of each other, i.e., they should not come from repeated or paired data. This means that each observation is not influenced by or related to the rest of the observations.

How to Check?

This independence assumption is automatically met for our Titanic example dataset since the data consists of individual passenger records.

This assumption would be more of a concern when dealing with time-series data, where the correlation between sequential observations (auto-correlation) can be an issue.

Nonetheless, there are still ways to check for the independence of observations for non-time series data. In such cases, the ‘time variable’ is the order of observations (i.e., index numbers).

In particular, we can create the Residual Series plot where we plot the deviance residuals of the logit model against the index numbers of the observations.
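A sketch of the plot, assuming the fitted GLM logistic model (`glm_result`) from the earlier influence check:

```python
import matplotlib.pyplot as plt

# Deviance residuals of the fitted GLM logistic model, plotted in observation order
resid_dev = glm_result.resid_deviance

fig, ax = plt.subplots()
ax.scatter(range(len(resid_dev)), resid_dev, s=10)
ax.axhline(0, color='red', linestyle='--')   # centerline at zero
ax.set_xlabel('Observation index')
ax.set_ylabel('Deviance residual')
plt.show()
```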

Residual Series Plot | Image by author

Since the residuals in the plot above appear to be randomly scattered around the centerline of zero, we can infer (visually) that the assumption is satisfied.

Note: If you wish to find out more about interpreting the traditional residual vs. fit plot in logistic regression, check out the articles [here](https://bookdown.org/jefftemplewebb/IS-6489/logistic-regression.html#fn40) and here.


Assumption 6 – Sufficiently large sample size

There should be an adequate number of observations for each independent variable in the dataset to avoid creating an overfit model.

How to Check?

As with Cook’s Distance, there are numerous opinions on the rule of thumb for determining a ‘sufficiently large’ sample.

One rule of thumb is that there should be at least 10 observations with the least frequent outcome for each independent variable. We can check this by retrieving the value counts for each variable.

Another rule of thumb for a sufficiently large sample is that the total number of observations should be greater than 500. We can check this by getting the length of the entire data frame.
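A quick sketch of both checks, assuming the outcome column `Survived` and a hypothetical count of five independent variables in the model:

```python
# Rule of thumb 1: at least 10 of the least frequent outcome per independent variable
outcome_counts = df['Survived'].value_counts()
n_predictors = 5   # hypothetical number of independent variables in the model
print(outcome_counts)
print(outcome_counts.min() / n_predictors >= 10)   # True if the rule of thumb is met

# Rule of thumb 2: more than 500 observations in total
print(len(df) > 500)
```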


Comparison with Linear Regression

Although the assumptions for logistic regression differ from linear regression, several assumptions still hold for both techniques.

Differences

  • Logistic regression does not require a linear relationship between the dependent and independent variables. However, it still needs independent variables to be linearly related to the log-odds of the outcome.
  • Homoscedasticity (constant variance) is required in linear regression but not for logistic regression.
  • The error terms (residuals) must be normally distributed in linear regression, but this is not required in logistic regression.

Similarities

  • Absence of multicollinearity
  • Observations are independent of each other
Photo by Robert Anasch on Unsplash

Summary

Here’s a recap of the assumptions we have covered:

  1. Appropriate outcome type
  2. Linearity of independent variables and log-odds
  3. No strongly influential outliers
  4. Absence of multicollinearity
  5. Independence of observations
  6. Sufficiently large sample size

I also recommend exploring the accompanying GitHub repo to view the complete Python implementation of these six assumption checks.

If you have any feedback/suggestions on this topic, I look forward to hearing from you in the Comments section.

Before you go

I welcome you to join me on a data science learning journey! Follow this Medium page and check out my GitHub to stay in the loop with more exciting educational data science content. Meanwhile, have fun running logistic regression!


References

Complete list of references collated in GitHub repo README

