Are you interpreting your logistic regression correctly?

Regression coefficients alone do not tell you what you need to know

Christian Leschinski
Towards Data Science


Logistic regression is often praised for its interpretability, but in practice it is frequently misunderstood. The most common mistake is to assume that the coefficients of the logistic model work the same way as those of a multiple linear regression model. They do not. Below, we take a detailed look at the correct interpretation of logistic regression and at the mechanics that determine the effect of the features.

Logistic regression was introduced in a 1958 paper by David Roxbee Cox to model binary outcomes. Today it is widely used in medical research, the social sciences and, of course, data science. When data scientists start a new classification project, logistic regression is often the first model we try. We use it to get a feeling for the most important features and the direction of their influence. Afterwards, we may switch to a less interpretable classifier such as gradient boosted trees or a random forest if we want to gain performance, or we may stick with the logistic regression if it is important for our stakeholders to be able to reason about the model.

Sir David Roxbee Cox invented logistic regression and proportional hazards models for survival analysis (the latter named Cox regression after him). Source: Wikimedia Commons.

In a linear regression model the coefficient β tells you by how many units the dependent variable changes when the feature changes by one unit. This change does not depend on the values of the other features or coefficients, and the coefficient itself directly describes the magnitude of the change. By contrast, in a logistic model, the change in the dependent variable in response to a change in a feature is a function of the value of that feature, the values of all other features and all other coefficients in the model!

Let’s consider the example of a simple model with two features. In the linear model, the prediction ŷᵢ is just a weighted sum of the feature values and coefficients. To simplify the notation, we will abbreviate this weighted sum by μᵢ = β₀ + β₁ x₁ᵢ + β₂ x₂ᵢ. The logistic regression uses the same weighted sum μᵢ, but wraps the logistic function Λ(x) = exp(x)/[1+exp(x)] around it, so that all predictions are between 0 and 1 and can be interpreted as probabilities.

Linear regression: ŷᵢ= μᵢ
Logistic regression: ŷᵢ = Λ(μᵢ)
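As a minimal sketch of the two prediction rules, here is the corresponding NumPy code; the coefficient and feature values are made up purely for illustration:

```python
import numpy as np

def logistic(x):
    # Lambda(x) = exp(x) / (1 + exp(x)), written equivalently as 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

# made-up coefficients and feature values
beta0, beta1, beta2 = 1.0, 0.5, -0.3
x1, x2 = 2.0, 1.0

mu = beta0 + beta1 * x1 + beta2 * x2   # the linear predictor mu_i

y_hat_linear = mu               # linear regression prediction
y_hat_logistic = logistic(mu)   # logistic regression prediction, a probability between 0 and 1

print(y_hat_linear, y_hat_logistic)
```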

Generally, coefficients are interpreted as the change in the dependent variable that occurs when the value of the feature changes slightly and all other features stay the same. Mathematically, that means we are considering the partial derivative. So here are the partial derivatives of the two models with respect to the first feature x₁ᵢ:

Linear regression: ∂ ŷᵢ / ∂ x₁ᵢ = β₁
Logistic regression: ∂ ŷᵢ / ∂ x₁ᵢ = Λ(μᵢ)[1-Λ(μᵢ)] β₁
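If you want to convince yourself that the logistic formula is correct, a finite-difference check takes only a few lines; the coefficient and feature values are again made up, matching the sketch above:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# made-up coefficients and feature values
beta0, beta1, beta2 = 1.0, 0.5, -0.3
x1, x2 = 2.0, 1.0
mu = beta0 + beta1 * x1 + beta2 * x2

# analytic marginal effect of x1: Lambda(mu) * (1 - Lambda(mu)) * beta1
analytic = logistic(mu) * (1.0 - logistic(mu)) * beta1

# finite-difference approximation of d y_hat / d x1
eps = 1e-6
numeric = (logistic(beta0 + beta1 * (x1 + eps) + beta2 * x2) - logistic(mu)) / eps

print(analytic, numeric)  # the two values agree up to numerical error
```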

As the formulas show, β₁ itself is the partial derivative in the linear regression model. No complications here. In contrast, the marginal effect associated with the coefficient β₁ in the logistic regression depends on μᵢ and therefore on x₁ᵢ, β₀, β₂ and x₂ᵢ.

Since Λ(μᵢ) and [1-Λ(μᵢ)] are both positive, the sign of β₁ determines the sign of the marginal effect. We can therefore interpret the direction of the effect of a feature in the logistic model just as we are used to from multiple linear regression. Only the magnitude of β₁ cannot be interpreted directly.

To see how much the effect of changes in x₁ᵢ on the predicted outcome depends on the level of x₁ᵢ itself, have a look at Figure 1 below, which shows the marginal effect of x₁ᵢ for different values of β₁, holding β₀, β₂ and x₂ᵢ constant at 1. In a multiple linear regression model the marginal effect is independent of the level, so each of the three lines would simply be horizontal at the level of the respective coefficient. Instead, we see that the magnitude of the effect is much smaller than the coefficient itself: the maximum of the blue curve is 0.25, even though the coefficient is β₁ = 1. Furthermore, the larger the coefficient, the more pronounced the non-linearity.

Figure 1: Marginal effect of x₁ depending on the feature value. Image by the author.
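A figure along these lines is easy to reproduce. The sketch below plots Λ(μᵢ)[1-Λ(μᵢ)] β₁ over a grid of x₁ values; since the exact coefficient values behind the original figure are not given, the choices of β₁ here are illustrative only:

```python
import numpy as np
import matplotlib.pyplot as plt

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x1 = np.linspace(-10, 10, 400)
beta0, beta2, x2 = 1.0, 1.0, 1.0           # held constant at 1, as in the text

for beta1 in [0.5, 1.0, 2.0]:              # illustrative coefficient values
    mu = beta0 + beta1 * x1 + beta2 * x2
    marginal_effect = logistic(mu) * (1.0 - logistic(mu)) * beta1
    plt.plot(x1, marginal_effect, label=f"beta1 = {beta1}")

plt.xlabel("x1")
plt.ylabel("marginal effect of x1")
plt.legend()
plt.show()
```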

Figure 2, below, illustrates the dependence of the marginal effect on the coefficients and values of other features. This time, we keep x₁ᵢ=1, but vary the value of the sum β₀ + β₂ x₂ᵢ. As before, we can observe a pronounced non-linearity.

Figure 2: Marginal effect of x₁ depending on the sum of coefficients and other features. Image by the author.

To gain some more insight into the interpretation of logistic regression, remember that ŷᵢ is the prediction for yᵢ, which means that ŷᵢ gives the predicted probability that yᵢ = 1. The equation ŷᵢ = Λ(μᵢ) can be inverted to see that

μᵢ = Λ⁻¹(ŷᵢ) = ln (ŷᵢ/(1-ŷᵢ)).

That means the logistic regression implies a linear relationship between the features in μᵢ and the logarithm of the odds ŷᵢ/(1-ŷᵢ).
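This inversion is easy to verify numerically: applying the log-odds transformation to the predicted probabilities recovers the linear predictor μᵢ exactly. A small sketch with arbitrary values of μᵢ:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

mu = np.array([-2.0, 0.0, 1.5])       # arbitrary values of the linear predictor
y_hat = logistic(mu)                  # predicted probabilities

log_odds = np.log(y_hat / (1.0 - y_hat))
print(np.allclose(log_odds, mu))      # True: the log-odds recover the linear predictor
```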

Therefore, the effect of x₁ᵢ and x₂ᵢ in μᵢ on the log-odds (also called logits) is directly given by the coefficients β₁ and β₂. Unfortunately, the log-odds are a little unintuitive to humans — so this does not provide a good basis for interpretation.

How to do it right: average marginal effects and marginal effects at the average

Based on ∂ ŷᵢ / ∂ x₁ᵢ = Λ(μᵢ)[1-Λ(μᵢ)] β₁ with μᵢ = β₀ + β₁ x₁ᵢ + β₂ x₂ᵢ, there are two ways to quantify the marginal effect of x₁ᵢ. The first is to evaluate the derivative at the sample means of the features. This is the marginal effect at the mean. The idea can be extended at will to marginal effects at any set of values that is deemed representative, for example medians or cluster means. Alternatively, one can calculate the partial derivative for each observation in the sample and report the average of the values obtained this way. This is the average marginal effect. Both approaches are valid, and in practice they often give similar results.
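Both quantities are straightforward to compute once the model is fitted. The sketch below simulates a toy data set, fits a logistic regression with statsmodels, and computes the marginal effect of the first feature at the mean as well as the average marginal effect, both by hand and via get_margeff; the data are simulated, so the numbers themselves carry no meaning:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# simulate a toy data set with two features and known coefficients
n = 5000
X = rng.normal(size=(n, 2))
true_mu = 0.5 + 1.0 * X[:, 0] - 0.5 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_mu)))

# fit the logistic regression
X_const = sm.add_constant(X)
result = sm.Logit(y, X_const).fit(disp=0)
b0, b1, b2 = result.params

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# marginal effect at the mean: evaluate the derivative at the sample means of the features
mu_at_mean = b0 + b1 * X[:, 0].mean() + b2 * X[:, 1].mean()
mem = logistic(mu_at_mean) * (1.0 - logistic(mu_at_mean)) * b1

# average marginal effect: evaluate the derivative for every observation, then average
mu_i = X_const @ result.params
ame = np.mean(logistic(mu_i) * (1.0 - logistic(mu_i)) * b1)

print("MEM:", mem, "AME:", ame)

# statsmodels computes both directly
print(result.get_margeff(at="mean").summary())
print(result.get_margeff(at="overall").summary())
```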

Methods to determine marginal effects are readily available, for example in the margins package in R or via the get_margeff method in statsmodels in Python. scikit-learn does not report marginal effects directly, but related tools such as partial dependence plots are available in sklearn.inspection. Unfortunately, I was not able to find an implementation in pyspark.ml.

A special case that is worth mentioning is when one or more features are dummy variables (or one-hot encoded). In this case the derivative ∂ ŷᵢ / ∂ x₁ᵢ with respect to the feature is not meaningful, since the feature only takes the values 0 and 1. What we want instead is the difference in the predicted outcome when the feature takes the value 1 compared to when it takes the value 0. As before, this difference can be calculated while keeping the other features at their averages, or it can be calculated for every observation and then averaged.
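For a dummy feature, the analogue of the marginal effect is therefore a discrete difference in predicted probabilities. A minimal sketch with made-up coefficients, where x₁ is the dummy:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# made-up coefficients: x1 is a dummy variable, x2 is continuous
beta0, beta1, beta2 = -0.5, 0.8, 0.3
x2 = np.array([0.2, -1.0, 1.5, 0.7])  # observed values of the continuous feature

# effect of the dummy for each observation: P(y=1 | x1=1, x2) - P(y=1 | x1=0, x2)
effect_per_obs = (logistic(beta0 + beta1 + beta2 * x2)
                  - logistic(beta0 + beta2 * x2))

print(effect_per_obs.mean())  # averaged over all observations

# alternatively, evaluate the difference at the average of the other feature
print(logistic(beta0 + beta1 + beta2 * x2.mean())
      - logistic(beta0 + beta2 * x2.mean()))
```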

In practice, we often use tools such as partial dependence plots, accumulated local effects, LIME or Shapley values to better understand and interpret our models. These are model-agnostic and also work well for logistic regression, although one might argue that some of them are overkill for a model that can be interpreted directly via marginal effects. Nevertheless, each of these tools gives a different insight into the model, so they can all enhance our understanding. It is often especially illuminating when different approaches to model interpretation give you different indications about the role of certain features.
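As an illustration of the first of these tools, a partial dependence plot for a logistic regression fitted with scikit-learn takes only a few lines; the data below are simulated for the example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import PartialDependenceDisplay

# simulated data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
p = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0] - 0.5 * X[:, 1])))
y = rng.binomial(1, p)

clf = LogisticRegression().fit(X, y)

# partial dependence of the model output on the first feature
PartialDependenceDisplay.from_estimator(clf, X, features=[0])
plt.show()
```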

I personally really like marginal effects, because they give you a concise summary of the effect of each of your features, even if you use more complicated transformations such as squared terms or interactions with other features. In those cases, the influence of the original variable can only be interpreted by considering several terms at once, and marginal effects extend nicely to this situation. For example, if we include a squared term so that μᵢ = β₀ + β₁ x₁ᵢ + β₂ x₁ᵢ², we have ∂ ŷᵢ / ∂ x₁ᵢ = Λ(μᵢ)[1-Λ(μᵢ)] (β₁+2β₂ x₁ᵢ). The equation is a bit bigger, but in practice we can still look at a one-number summary of the influence of x₁ᵢ on ŷᵢ.
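A quick sketch of this case, with made-up coefficients and simulated feature values, shows how the influence of x₁ᵢ still collapses to a single number:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# made-up coefficients for mu_i = beta0 + beta1*x1 + beta2*x1^2
beta0, beta1, beta2 = 0.2, 1.0, -0.4
x1 = np.random.default_rng(1).normal(size=1000)  # simulated feature values

mu = beta0 + beta1 * x1 + beta2 * x1**2
# marginal effect of x1 per observation: Lambda(mu)[1 - Lambda(mu)](beta1 + 2*beta2*x1)
marginal_effect = logistic(mu) * (1.0 - logistic(mu)) * (beta1 + 2.0 * beta2 * x1)

print(marginal_effect.mean())  # average marginal effect: one number summarising the influence of x1
```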

This is where we come full circle on the relationship between coefficient interpretation in multiple linear regression and logistic regression. It is not correct to simply interpret the coefficients of a logistic regression model as marginal effects, as we would in a multiple linear regression model. But in practice things quickly become more complicated anyway, since your model most likely contains polynomial and interaction terms. In that case, multiple linear regression and logistic regression are treated the same way: we have to calculate marginal effects to interpret them.

References:

W. H. Greene (2012): Econometric Analysis, 7th edition, Pearson Education.

T. Hastie, R. Tibshirani, and J. Friedman (2009): The Elements of Statistical Learning, 2nd Edition, Springer Science+Business Media.
