
Logistic Regression: Deceptively Flawed

When can large odds ratios and perfectly separated data bite you?

Photo by Alvan Nee on Unsplash

This is a follow-up to a previous post on the conceptual understanding of logistic regression:

Logistic regression: Faceoff and Conceptual Understanding

Last time we visualized and explained fitting log-losses in logistic regression. We also showed that this process cannot fit perfectly separated data. In other words, unlike linear regression with an ordinary least squares fit, logistic regression actually works better if the data is a little bit noisy!

In practice, does this actually matter? It depends:

  • It matters if our goal is to use the output for statistical inference: for example, accurately estimating model coefficients, calculating confidence intervals, and testing hypotheses using p-values.
  • It does not matter much, or at all, if our goal is to use the output of a logistic model to create a predictive classification model.

Statistical inference: statsmodels

Sample data

For this part, we will use Python’s statsmodels library. Keep in mind that statsmodels and scikit-learn (used later) parametrize the probability using _β_s instead of k and x₀:

p(x) = 1 / (1 + e^(−(β₀ + β₁x)))

where the relationship between k, x₀ and β₁, β₀ is:

β₁ = k,   β₀ = −k·x₀
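
As a quick sanity check, here is a minimal sketch of this conversion, using the k = 3 and x₀ = 2.5 values that generate our sample data:

import numpy as np

k, x0 = 3, 2.5       # parameters used to generate the sample data
beta_1 = k           # the slope is the same in both parametrizations
beta_0 = -k * x0     # the intercept absorbs the midpoint x0

x = np.linspace(0, 5, 11)
p_k_form = 1 / (1 + np.exp(-k * (x - x0)))
p_beta_form = 1 / (1 + np.exp(-(beta_0 + beta_1 * x)))

print(beta_1, beta_0)
#> 3 -7.5
print(np.allclose(p_k_form, p_beta_form))
#> True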

We will continue with the datasets we generated in the first part on logistic regression, starting with the "imperfect" data sample_df, using statsmodels’ formula API:

import statsmodels.formula.api as smf

# fit y as a function of x on the "imperfect" dataset from the first post
model = smf.logit('y ~ x', sample_df).fit()
model.summary()

#> Optimization terminated successfully.
#>         Current function value: 0.230366
#>         Iterations 8

Our model parameters are k = 3 and x₀ = 2.5, so those translate to β₁ = 3 and β₀ = -7.5. We can compare those with the fitted parameters by reading them off the coef column of the summary’s bottom table:

We have very few data points and the seed was intentionally chosen to showcase outliers, so the fit is a little bit off, but it is still in the right ballpark. The quality of the fit is reported as "Log-Likelihood", which is just the negative of the total log-loss and equals -6.911 (the "Current function value" of 0.230366 printed above is the average log-loss per point: 0.230366 × 30 ≈ 6.911).
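
Instead of eyeballing the summary table, we can also read the fitted values directly off the results object (a quick sketch; .params, .conf_int(), .pvalues and .llf are standard statsmodels results attributes):

# fitted coefficients: Intercept is beta_0, x is beta_1
print(model.params)

# 95% confidence intervals and p-values for the coefficients
print(model.conf_int())
print(model.pvalues)

# total log-likelihood, i.e. the negative of the total log-loss (-6.911 here)
print(model.llf)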

Perfectly separated data

What happens when we run our perfectly separated dataset in statsmodels? Again, it depends!

perfect_sep_model = smf.logit('y ~ x', perfect_sep_df).fit()

#> ...
#> PerfectSeparationError: Perfect separation detected, results not available

In this case we got an error – no results are produced. In other cases of perfect separation, we may get only a warning. For example, if we use the same parameters but a different random seed:

perfect2_df = create_sample_data(k=3, x0=2.5, n_points=30, seed=154)
perfect_sep2_df, perfect_sep_logloss2_df = fit_data_to_logloss(
    perfect2_df,
    k=3,
    x0=2.5
)

perfect_sep_model_2 = smf.logit('y ~ x', perfect_sep2_df).fit()
perfect_sep_model_2.summary()

#> Warning: Maximum number of iterations has been exceeded.
#>         Current function value: 0.000000
#>         Iterations: 35

#> /.../statsmodels/base/model.py:604: ConvergenceWarning: 
#>   Maximum Likelihood optimization failed to converge. Check mle_retvals

Since the second model did not converge either, one could argue that it should also have raised an error rather than an innocuous warning. R’s logistic regression function, glm(..., family=binomial), behaves the same way. To quote The R Inferno, Circle 5, Consistency:

There is a problem with warnings. No one reads them. People have to read error messages because no food pellet falls into the tray after they push the button. With a warning the machine merely beeps at them but they still get their food pellet. Never mind that it might be poison.

Therefore, be careful when doing inference using effect sizes and p-values! In this case the data is perfectly separable – we have a perfect predictor – yet the reported p-value (P > |z|) is 0.998. Ignoring or misunderstanding these warnings may cause you to miss some obvious features in the data.
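
Before trusting coefficients and p-values, it is worth checking for separation directly. With a single predictor this is a simple comparison; a minimal sketch, assuming the perfect_sep2_df frame with columns x and y from above:

# with one predictor, the classes are perfectly separated if all x values
# with y = 0 lie entirely on one side of all x values with y = 1
x_when_0 = perfect_sep2_df.loc[perfect_sep2_df['y'] == 0, 'x']
x_when_1 = perfect_sep2_df.loc[perfect_sep2_df['y'] == 1, 'x']

perfectly_separated = (x_when_0.max() < x_when_1.min()) or (x_when_1.max() < x_when_0.min())
print(perfectly_separated)
#> True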

Alternatively, consider using a different model. There is a brave new world outside of logistic regression! 🌍

Prediction: scikit-learn

Sample data

scikit-learn automates and abstracts much of the calculation while providing a simple and consistent way to run many different models. However, in doing so, it hides a number of things happening in the background. When it is not obvious what is happening under the hood, learning data science concepts from the scikit-learn documentation can lead to misunderstandings.

One of these misunderstandings is that running LogisticRegression() from sklearn.linear_model and calling its .predict() method makes it seem as if we are running a classification model:

from sklearn.linear_model import LogisticRegression

def sklearn_logistic_fit(df):
    # penalty=None disables the default L2 regularization, so the
    # coefficients are directly comparable with statsmodels
    sklearn_logistic = LogisticRegression(penalty=None).fit(
        df['x'].values.reshape(-1, 1),
        df['y'].values
    )
    # class labels obtained by thresholding the fitted probabilities at 0.5
    predicted_outcomes = sklearn_logistic.predict(
        df['x'].values.reshape(-1, 1)
    )

    print("beta_1 or log-odds-ratio: ", sklearn_logistic.coef_[0][0])
    print("beta_0 or y intercept: ", sklearn_logistic.intercept_[0])
    print("Original outcomes: ")
    print(df['y'].values)
    print("Predicted outcomes: ")
    print(predicted_outcomes)

    return predicted_outcomes

sklearn_logistic_fit(sample_df)

#> beta_1 or log-odds-ratio:  2.6187462367743803
#> beta_0 or y intercept:  -5.682896087072998
#> Original outcomes: 
#> [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
#> Predicted outcomes: 
#> [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1]

However, we can verify that this is still the same old logistic regression either by calling its .predict_proba() method or by digging the model parameters out of the model object, which is what we printed above.

Under the hood, .predict() predicts outcomes based on the larger of two probabilities: p or 1-p. If p > 1-p then it predicts 1, otherwise it predicts 0. It does not calculate confidence intervals, does not have p-values and does not do any kind of statistical inference or hypothesis testing. It is still a regression model, just with another thresholding layer on top of it.
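
We can reproduce what .predict() does from .predict_proba() with an explicit cutoff; a minimal sketch, reusing sample_df and an unpenalized fit as above:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = sample_df['x'].values.reshape(-1, 1)
clf = LogisticRegression(penalty=None).fit(X, sample_df['y'].values)

# column 1 of predict_proba holds p = P(y = 1 | x)
p = clf.predict_proba(X)[:, 1]

# predict 1 whenever p > 1 - p, i.e. whenever p > 0.5
manual_predictions = (p > 0.5).astype(int)
print(np.array_equal(manual_predictions, clf.predict(X)))
#> True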

Note: LogisticRegression() by default uses L2 regularization, which adds an extra term to the log-loss function. To make it comparable with statsmodels, use LogisticRegression(penalty=None).

Perfectly separable data

The advantage of not caring about the accuracy of the fitted coefficients is that even if the coefficients are inaccurate, or just plain wrong, the model can still be good enough! As the saying goes, "let not the perfect be the enemy of the good".

By default, scikit-learn uses a different fitting method (the "lbfgs" solver) than statsmodels. In my very limited set of tests, lbfgs did not error out on perfectly separated data. To see what happens, let us run the same two datasets we used previously with statsmodels:

sklearn_logistic_fit(perfect_sep_df)

#> beta_1 or log-odds-ratio:  22.65352734156491
#> beta_0 or y intercept:  -53.25978472434583
#> Original outcomes: 
#> [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
#> Predicted outcomes: 
#> [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

sklearn_logistic_fit(perfect_sep2_df)

#> beta_1 or log-odds-ratio:  180.11386979983104
#> beta_0 or y intercept:  -469.3714808877968
#> Original outcomes: 
#> [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
#> Predicted outcomes: 
#> [0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

This solver also stops at some point without an error, returning large but finite values of β₁ (or k), and it still predicts the correct outcomes. These are perfectly usable fits for classification!
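
The huge coefficients simply mean that the fitted sigmoid has become extremely steep, pushing the predicted probabilities toward 0 and 1; a quick sketch, reusing perfect_sep2_df from above:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = perfect_sep2_df['x'].values.reshape(-1, 1)
y = perfect_sep2_df['y'].values
clf = LogisticRegression(penalty=None).fit(X, y)

# how far each fitted probability is from the "correct" extreme (0 or 1)
p = clf.predict_proba(X)[:, 1]
print("largest deviation from 0/1:", np.abs(p - y).max())

# thresholding at 0.5 reproduces the original labels exactly
print((clf.predict(X) == y).all())
#> True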

Log-odds and loose ends

There is one last topic we have not touched on much. The coefficient k (or β₁), which multiplies x, has another interpretation – it is a log odds ratio. The odds ratio describes the multiplicative change in odds when x increases by 1:

Odds Ratio = [p(x+1) / (1 − p(x+1))] / [p(x) / (1 − p(x))] = e^β₁

This definition states that the odds ratio is a ratio of ratios of probabilities. If increasing x by one unit increases the probability of y = 1 from 0.1 (odds = 0.1 / 0.9 ≈ 0.11) to 0.2 (odds = 0.2 / 0.8 = 0.25), that corresponds to an odds ratio of 0.25 / 0.11 ≈ 2.25.

When probabilities are small, a large odds ratio does not reflect large absolute probabilities. If increasing x by one unit increases the probability of y = 1:

  • from p(x) = 0.0001 (odds = 0.0001 / (1–0.0001) ≈ 0.0001)
  • to p(x + 1) = 0.001 (odds = 0.001 / (1–0.001) ≈ 0.001)

then this gives us the odds ratio of 0.001/0.0001 = 10. Even though the odds ratio is 10, the final absolute probability is still very small: 0.001. The good news is that when probability is small, odds ratio becomes easier to interpret – it is approximately equal to a ratio of two probabilities p(x + 1) / p(x).

Compare this to another example where the probability increases by a substantial amount, say by 25 percentage points from p(x) = 0.5 to p(x + 1) = 0.75: in this case the odds ratio is only 3!
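
The arithmetic in these examples can be packed into a small helper function (a sketch using the same numbers as above):

def odds(p):
    # convert a probability into odds
    return p / (1 - p)

def odds_ratio(p_before, p_after):
    # multiplicative change in odds when the probability moves from p_before to p_after
    return odds(p_after) / odds(p_before)

print(round(odds_ratio(0.1, 0.2), 2))       # moderate probabilities
#> 2.25
print(round(odds_ratio(0.0001, 0.001), 2))  # tiny probabilities: OR ~ p(x+1)/p(x)
#> 10.01
print(round(odds_ratio(0.5, 0.75), 2))      # large absolute jump, modest OR
#> 3.0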

Therefore, a high odds ratio needs to be taken with a grain of salt when the probability of y = 1 is small. There are alternatives to the odds ratio, such as relative risk, which can be more interpretable but may not be as simple to calculate.

Conclusion

In the last two posts, we gave an intuitive explanation of logistic regression and showed how to run regression models in Python’s statsmodels and scikit-learn libraries.

Take home points:

  • Logistic regression outputs a probability – therefore it is a regression algorithm.
  • Logistic regression is often used for classification. For example, by predicting the outcome with a larger probability – or by setting a custom probability threshold.
  • Parameters are estimated numerically by minimizing the log-loss cost function – visualized in the first post as two crossed hockey-stick curves evaluated on the data.
  • Use statsmodels to run a logistic regression when you are interested in statistical inference (care about accuracy of coefficients, hypothesis testing, p-values, …).
  • Use scikit-learn to run a logistic regression when you just want to predict the outcome or when you need to run a larger model.
  • Perfectly separable data breaks statistical inference: best-fit coefficients do not exist or "are infinite" and p-values will be large (because of large confidence intervals). If we only care about using the model for classification/prediction, then scikit-learn’s implementation works better for perfectly separated data.
  • Do not blindly trust large (log-)odds ratios. Consider calculating additional metrics, such as absolute probabilities.

I hope this helps you in your data science journey!

