
Logistic regression is a highly effective modeling technique that has remained a mainstay in statistics since its development in the 1940s.
Given its popularity and utility, data practitioners should understand the fundamentals of logistic regression before using it to tackle data and business problems.
In this article, we explore the key assumptions of logistic regression with theoretical explanations and practical Python implementation of the assumption checks.
Contents
(1) Theoretical Concepts & Practical Checks
(2) Comparison with Linear Regression
(3) Summary and GitHub repo link

Theoretical Concepts & Practical Checks
For the implementation of the assumption checks in Python, we will be using the classic Titanic dataset. For the complete code, please have a look at the GitHub repo of this project.
Assumption 1 – Appropriate Outcome Type
Logistic regression generally works as a classifier, so the type of logistic regression utilized (binary, multinomial, or ordinal) must match the outcome (dependent) variable in the dataset.
By default, logistic regression assumes that the outcome variable is binary, where the number of outcomes is two (e.g., Yes/No).
If the dependent variable has three or more outcomes, then multinomial or ordinal logistic regression should be used.
How to Check?
We can check this assumption by getting the number of different outcomes in the dependent variable. If we want to use binary logistic regression, then there should only be two unique outcomes in the outcome variable.
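Here is a minimal sketch of this check in pandas (the file path and the `Survived` column name are assumptions based on the standard Titanic dataset, not the repo's exact code):

```python
import pandas as pd

# Hypothetical file path; the actual loading step is in the project's GitHub repo
df = pd.read_csv('titanic.csv')

# Count the unique outcomes of the dependent variable
print(df['Survived'].nunique())        # 2 -> binary logistic regression is appropriate
print(df['Survived'].value_counts())   # e.g. 0 = did not survive, 1 = survived
```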
Assumption 2 – Linearity of independent variables and log-odds
One of the critical assumptions of logistic regression is that the relationship between the logit (aka log-odds) of the outcome and each continuous independent variable is linear.
The logit is the natural logarithm of the odds, i.e., logit(p) = ln(p / (1 − p)), where p = probability of a positive outcome (e.g., survived Titanic sinking).

How to Check?
(i) Box-Tidwell Test
The Box-Tidwell test is used to check for linearity between the continuous predictors and the logit. This is done by adding interaction terms between each continuous independent variable and its natural logarithm (i.e., x · ln(x)) to the model.
For example, if one of your continuous independent variables is Age, then the interaction term to add as a new variable will be `Age * ln(Age)`.
As part of the Box-Tidwell test, we filter our dataset to keep just the continuous independent variables.
Note: While R has the `car` library to perform Box-Tidwell with a single line of code, I could not find any Python package that can do something similar.
If you have more than one continuous variable, you should include the same number of interaction terms in the model. With the interaction terms included, we can re-run the logistic regression and review the results.
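Below is a minimal sketch of this step with statsmodels, assuming the continuous predictors are `Age` and `Fare` and reusing the DataFrame `df` from above (variable names are illustrative, not the repo's exact code):

```python
import numpy as np
import statsmodels.api as sm

# Keep the outcome and the continuous predictors; ln(x) requires positive values
df_bt = df[['Survived', 'Age', 'Fare']].dropna()
df_bt = df_bt[(df_bt['Age'] > 0) & (df_bt['Fare'] > 0)]

# Add the Box-Tidwell interaction terms x * ln(x)
df_bt['Age:LogAge'] = df_bt['Age'] * np.log(df_bt['Age'])
df_bt['Fare:LogFare'] = df_bt['Fare'] * np.log(df_bt['Fare'])

# Re-fit the logistic regression with the interaction terms included
X_bt = sm.add_constant(df_bt[['Age', 'Fare', 'Age:LogAge', 'Fare:LogFare']])
bt_results = sm.Logit(df_bt['Survived'], X_bt).fit(disp=0)
print(bt_results.summary())  # check the p-values of the interaction terms
```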

What we need to do is check the statistical significance of the interaction terms (Age:LogAge and Fare:LogFare in this case) based on their p-values.
The Age:LogAge interaction term has a p-value of 0.101 (not statistically significant since p>0.05), implying that the independent variable Age is linearly related to the logit of the outcome variable and that the assumption is satisfied.
On the contrary, Fare:LogFare is statistically significant (i.e., p≤0.05), indicating the presence of non-linearity between Fare and the logit.
One solution is to perform transformations by incorporating higher-order polynomial terms to capture the non-linearity (e.g., Fare²).
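For example, a hedged sketch reusing `df_bt` from the earlier snippet could add a squared Fare term and re-fit the model:

```python
# Add a quadratic term for Fare to capture the non-linearity, then re-fit
df_bt['Fare2'] = df_bt['Fare'] ** 2
X_poly = sm.add_constant(df_bt[['Age', 'Fare', 'Fare2']])
poly_results = sm.Logit(df_bt['Survived'], X_poly).fit(disp=0)
print(poly_results.summary())
```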
(ii) Visual check
Another way that we can check logit linearity is by visually inspecting the scatter plot between each predictor and the logit values.
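A minimal sketch of this visual check, fitting the model on the continuous predictors from `df_bt` above and plotting each predictor against the resulting log-odds (seaborn's lowess smoother is an optional choice here):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# Fit a logistic regression on the continuous predictors only (df_bt from the sketch above)
X_cont = sm.add_constant(df_bt[['Age', 'Fare']])
logit_results = sm.Logit(df_bt['Survived'], X_cont).fit(disp=0)

# Convert predicted probabilities to log-odds
pred_prob = logit_results.predict()
log_odds = np.log(pred_prob / (1 - pred_prob))

# Scatter plot of each continuous predictor against the log-odds
for col in ['Age', 'Fare']:
    sns.regplot(x=df_bt[col], y=log_odds, lowess=True, line_kws={'color': 'red'})
    plt.xlabel(col)
    plt.ylabel('Log-odds of survival')
    plt.show()
```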

The above scatter plot shows a clear non-linear pattern of Fare vs. the log-odds, thereby implying that the assumption of logit linearity is violated.
Assumption 3 – No strongly influential outliers
Logistic regression assumes that there are no highly influential outlier data points, as they distort the outcome and accuracy of the model.
Note that not all outliers are influential observations. Rather, outliers have the potential to be influential. To assess this assumption, we therefore need to check whether a data point meets both criteria, i.e., it is both an outlier and influential.
How to Check?
(i) Influence
We can use Cook’s Distance to determine the influence of a data point, and it is calculated based on its residual and leverage. It summarizes the changes in the regression model when that particular (i-th) observation is removed.
There are different opinions regarding what cut-off values to use. One standard threshold is 4/N (where N = number of observations), meaning that observations with Cook’s Distance > 4/N are deemed as influential.
The statsmodels package also allows us to visualize influence plots for GLMs, such as the index plot ([influence.plot_index](https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLMResults.get_influence.html)) for influence attributes:
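A sketch of how this could look, re-fitting the model as a GLM with a binomial family so that influence measures are available (variable names reuse the earlier sketches and are assumptions, not the repo's exact code):

```python
# Re-fit the model as a GLM (binomial family) to obtain influence measures
glm_results = sm.GLM(df_bt['Survived'], X_cont, family=sm.families.Binomial()).fit()
influence = glm_results.get_influence()

# Cook's Distance (first element of the returned tuple) vs. the 4/N threshold
cooks_d = influence.cooks_distance[0]
threshold = 4 / len(df_bt)
print(f'{(cooks_d > threshold).sum()} observations exceed the 4/N threshold')

# Index plot of Cook's Distance
influence.plot_index(y_var='cooks', threshold=threshold)
```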

(ii) Outliers
We use standardized residuals to determine whether a data point is an outlier or not. Data points with absolute standardized residual values greater than 3 represent possible extreme outliers.
(iii) Putting Both Together
We can identify the strongly influential outlier data points by finding the top observations based on thresholds defined earlier for Cook’s Distance and standardized residuals.
When outliers are detected, they should be treated accordingly, such as removing or transforming them.
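A hedged sketch of this combined check, reusing the `influence` object from the GLM fit above:

```python
import pandas as pd

# Combine Cook's Distance and standardized residuals into one diagnostic table
diagnostics = pd.DataFrame({
    'cooks_d': influence.cooks_distance[0],
    'std_resid': influence.resid_studentized,
})

# Flag observations that are both influential (> 4/N) and outliers (|resid| > 3)
flagged = diagnostics[
    (diagnostics['cooks_d'] > 4 / len(diagnostics)) &
    (diagnostics['std_resid'].abs() > 3)
]
print(flagged.sort_values('cooks_d', ascending=False))
```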

Assumption 4 – Absence of Multicollinearity
Multicollinearity corresponds to a situation where the data contain highly correlated independent variables. This is a problem because it reduces the precision of the estimated coefficients, which weakens the statistical power of the logistic regression model.
How to Check?
Variance Inflation Factor (VIF) measures the degree of multicollinearity in a set of independent variables.
Mathematically, the VIF of an independent variable is the ratio of the variance of its coefficient estimate in the full model to the variance of its estimate in a model containing only that variable. Equivalently, VIF = 1 / (1 − R²), where R² comes from regressing that variable on all the other predictors.
The smallest possible value for VIF is 1 (i.e., a complete absence of collinearity). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of multicollinearity.
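A minimal sketch with statsmodels, assuming the numeric predictors used in the model are collected in a DataFrame `X_num` (the constant term is added because the VIF calculation expects an intercept):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X_vif = add_constant(X_num)  # add the intercept column ('const')
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif.drop('const'))  # values above 5-10 suggest problematic multicollinearity
```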
Below is a sample of the calculated VIF values. Since no VIF values exceed 5, the assumption is satisfied.

Another way to check is to generate a correlation matrix heatmap:
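For instance, a short sketch assuming the predictors are in `X_num` as above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the independent variables
corr = X_num.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation matrix of independent variables')
plt.show()
```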

The problem with this method is that the heatmap can be challenging to interpret when many independent variables are present.
More importantly, collinearity can exist between three or more variables even if no pair of variables is seen to have an exceptionally high correlation. Hence, VIF is a better way to assess multicollinearity.
Assumption 5 – Independence of observations
The observations must be independent of each other, i.e., they should not come from repeated or paired data. This means that each observation is not influenced by or related to the rest of the observations.
How to Check?
This independence assumption is automatically met for our Titanic example dataset since the data consists of individual passenger records.
This assumption would be more of a concern when dealing with time-series data, where the correlation between sequential observations (auto-correlation) can be an issue.
Nonetheless, there are still ways to check for the independence of observations for non-time series data. In such cases, the ‘time variable’ is the order of observations (i.e., index numbers).
In particular, we can create the Residual Series plot where we plot the deviance residuals of the logit model against the index numbers of the observations.
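A minimal sketch, reusing the fitted GLM `glm_results` from the influence check above:

```python
import matplotlib.pyplot as plt

# Deviance residuals plotted against the observation index
deviance_resid = glm_results.resid_deviance

plt.scatter(range(len(deviance_resid)), deviance_resid, s=10)
plt.axhline(y=0, color='red', linestyle='--')  # centerline at zero
plt.xlabel('Observation index')
plt.ylabel('Deviance residual')
plt.show()
```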

Since the residuals in the plot above appear to be randomly scattered around the centerline of zero, we can infer (visually) that the assumption is satisfied.
Note: If you wish to find out more about interpreting the traditional residual vs. fit plot in logistic regression, check out the articles [here](https://bookdown.org/jefftemplewebb/IS-6489/logistic-regression.html#fn40) and here.
Assumption 6 – Sufficiently large sample size
There should be an adequate number of observations for each independent variable in the dataset to avoid creating an overfit model.
How to Check?
As with the Cook’s Distance cut-off, there are numerous opinions on the rule of thumb for determining a ‘sufficiently large’ sample.
One rule of thumb is that there should be at least 10 observations with the least frequent outcome for each independent variable. We can check this by retrieving the value counts for each variable.
Another way to determine a large sample size is that the total number of observations should be greater than 500. We can check this by getting the length of the entire data frame.
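Both rules of thumb can be checked in a few lines (a sketch assuming `df` holds the modelling data with the outcome column `Survived` and the remaining columns as predictors):

```python
# Rule of thumb 1: at least 10 cases of the least frequent outcome per predictor
n_predictors = df.drop(columns=['Survived']).shape[1]
least_frequent = df['Survived'].value_counts().min()
print(least_frequent >= 10 * n_predictors)

# Rule of thumb 2: more than 500 observations in total
print(len(df) > 500)
```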
Comparison with Linear Regression
Although the assumptions for logistic regression differ from linear regression, several assumptions still hold for both techniques.
Differences
- Logistic regression does not require a linear relationship between the dependent and independent variables. However, it still needs independent variables to be linearly related to the log-odds of the outcome.
- Homoscedasticity (constant variance) is required in linear regression but not for logistic regression.
- The error terms (residuals) must be normally distributed for linear regression but not required in logistic regression.
Similarities
- Absence of multicollinearity
- Observations are independent of each other

Summary
Here’s a recap of the assumptions we have covered:
- Appropriate outcome type
- Linearity of independent variables and log-odds
- No strongly influential outliers
- Absence of multicollinearity
- Independence of observations
- Sufficiently large sample size
I also recommend exploring the accompanying GitHub repo to view the complete Python implementation of these six assumption checks.
If you have any feedback/suggestions on this topic, I look forward to hearing from you in the Comments section.
Before you go
I welcome you to join me on a Data Science learning journey! Follow this Medium page and check out my GitHub to stay in the loop on more exciting educational data science content. Meanwhile, have fun running logistic regression!
References
Complete list of references collated in GitHub repo README