
What to Do If the Logit Decision Boundary Fails?

Feature engineering for classification models using Bayesian Machine Learning

Logistic regression is by far the most widely used machine learning model for binary classification problems. The model is relatively simple and rests on a key assumption: the existence of a linear decision boundary (a line, or a hyperplane in a higher-dimensional feature space) that can separate the classes of the target variable y based on the features in the model.

In a nutshell, the decision boundary can be interpreted as a threshold at which the model assigns a data point to one class or the other, based on the predicted probability of belonging to a class.
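To make this concrete, here is a minimal sketch (using scikit-learn purely for illustration; it is not necessarily the library used later in this post) showing that thresholding the predicted probability at 0.5 is equivalent to checking on which side of the linear boundary a point lies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]                    # P(y = 1 | x)

# Class from thresholding the probability at 0.5 ...
pred_from_threshold = (proba >= 0.5).astype(int)
# ... equals the class from the sign of the linear score x'beta + intercept
pred_from_score = (X @ clf.coef_.ravel() + clf.intercept_ >= 0).astype(int)

print(np.mean(pred_from_threshold == pred_from_score))  # 1.0
```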

The figure below presents a schematic representation of the decision boundary that separates the target variable into two classes. In this case the model is based on a set of two features (x1 and x2). The target variable can be clearly separated into two classes based on the values of the features.

However, in your daily modeling activities, the situation might look rather similar to the figure below.

Again, there are two features in this model. However, the estimated decision boundary does not separate the classes based on the features. This is one of the biggest issues with the logistic model, and its impact is often underestimated by data scientists. The problem arises from significant overlap between the classes y=0 and y=1 in the feature space, which prevents the decision boundary from cleanly separating them. This situation is commonly observed in imbalanced datasets, which can result in skewed decision boundaries.


In such cases, logistic regression struggles because it assumes linear separability, and the decision boundary will not cleanly divide the classes. A major consequence of this problem concerns feature selection: during feature engineering, features without strong separability may be deemed unimportant or insignificant by the model. The typical recommendation for feature selection in logistic regression, found in many books and articles, is recursive feature elimination. This, however, is of little use for a simple reason: when some variables are removed from the model, the estimated parameters of the remaining variables change accordingly, see:

Integrating Feature Selection into the Model Estimation

The relationships between predictors and the target variable are often interconnected. The coefficients in the reduced model will no longer reflect their values from the full model, which can ultimately lead to biased interpretations of the model parameters or predictions. Moreover, when the classes are only weakly separated by the decision boundary, neither recursive feature elimination nor regularization will help.
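The effect is easy to reproduce. The sketch below (a hypothetical example with scikit-learn, not the code attached to this post) fits a full and a reduced logistic model on correlated predictors and shows that the coefficient of x1 shifts once x2 is dropped:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 is correlated with x1
x3 = rng.normal(size=n)
X_full = np.column_stack([x1, x2, x3])
y = (x1 + x2 + rng.logistic(size=n) > 0).astype(int)

full = LogisticRegression().fit(X_full, y)
reduced = LogisticRegression().fit(X_full[:, [0, 2]], y)   # x2 removed

print("x1 coefficient, full model:   ", full.coef_[0, 0])
print("x1 coefficient, reduced model:", reduced.coef_[0, 0])
```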

What is the remedy for this problem?


With nearly 20 years of experience in modeling categorical data, I would suggest that the most effective solution to this issue is to perform feature engineering using probit regression rather than logistic regression (the logit model). This approach has proven to be highly effective, and I have tested it many times in practice!

Similarly to logistic regression, the probit model can be used for modeling categorical data. Instead of the logistic function

$$\Lambda(x_i'\beta) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)},$$

it uses the normal cumulative distribution function

$$\Phi(x_i'\beta) = \int_{-\infty}^{x_i'\beta} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{t^2}{2}\right) dt.$$

The models are quite similar, although the probit model is often preferred over the logit model in terms of the decision boundary because it assumes a normal cumulative distribution, which can lead to a more natural and smoother boundary. In contrast, the logit model’s logistic function may produce decision boundaries that are less flexible and more prone to overshooting or underfitting, especially when the data are not well suited to a logistic transformation.

As a result, the probit model can better capture the nuances in the data, leading to more accurate and meaningful decision boundaries, even for nonlinear and non-normal data.
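For readers who want to compare the two link functions directly, the sketch below (using statsmodels purely as an illustration) fits a logit and a probit model to the same simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))            # intercept + two features
y = (X @ np.array([0.2, 1.0, -0.5]) + rng.normal(size=n) > 0).astype(int)

logit_fit = sm.Logit(y, X).fit(disp=0)
probit_fit = sm.Probit(y, X).fit(disp=0)

# The fitted probabilities are typically very close; the logit coefficients are
# roughly 1.6-1.8 times larger than the probit ones on comparable data.
print(logit_fit.params)
print(probit_fit.params)
```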


Let us consider the following latent variable representation of the probit model:

$$z_i = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, 1),$$

$$y_i = \begin{cases} 1 & \text{if } z_i > 0, \\ 0 & \text{if } z_i \le 0, \end{cases}$$

where the value of the binary variable y_i is observed, as are the values of the explanatory variables x_i. The latent data z_i, however, are unobserved.

The model training can be performed by drawing sequentially from two conditional distributions,

$$p(\beta \mid z, y) \quad \text{and} \quad p(z \mid \beta, y),$$

where (under a flat prior on $\beta$)

$$\beta \mid z \sim N\!\left((X'X)^{-1}X'z,\; (X'X)^{-1}\right)$$

and

$$z_i \mid \beta, y_i \sim \begin{cases} TN_{(0,\,\infty)}(x_i'\beta,\, 1) & \text{if } y_i = 1, \\ TN_{(-\infty,\,0]}(x_i'\beta,\, 1) & \text{if } y_i = 0, \end{cases}$$

with the notation TN for a truncated normal distribution.
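A minimal sketch of this two-block Gibbs sampler is given below (it assumes a flat prior on β and uses NumPy/SciPy; the code attached to this post additionally contains the stochastic-search feature selection step):

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=2000, seed=0):
    """Albert-Chib style Gibbs sampler for a probit model with a flat prior on beta."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(XtX_inv)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for it in range(n_iter):
        # 1) z_i | beta, y_i ~ TN(x_i'beta, 1), truncated to (0, inf) if y_i = 1
        #    and to (-inf, 0] if y_i = 0; bounds are expressed on the standard scale.
        mean = X @ beta
        lower = np.where(y == 1, -mean, -np.inf)
        upper = np.where(y == 1, np.inf, -mean)
        z = mean + truncnorm.rvs(lower, upper, size=n, random_state=rng)
        # 2) beta | z ~ N((X'X)^{-1} X'z, (X'X)^{-1})
        beta_hat = XtX_inv @ (X.T @ z)
        beta = beta_hat + chol @ rng.standard_normal(p)
        draws[it] = beta
    return draws
```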

The model is implemented in the code attached to this post. The implementation shows how well the feature selection mechanism built into the model identifies the relevant features, even when the decision boundary does not separate the classes in the feature space.

In the code, the set of features is created by sampling from a normal distribution. Then, a mask vector is created to simulate the situation where only a subset of the features is relevant for the data-generating process. In this case, features 1, 3, and 7 are selected. The latent variable z is simulated from the features selected by the mask and their corresponding coefficients from the β vector. The observed variable y is derived from the latent variable z using a threshold of 0: if z is greater than 0, y is set to 1 (class 1), and if z is less than or equal to 0, y is set to 0 (class 0). The result is a binary outcome (0 or 1), which is typical for a categorical model, where the class is determined by whether the latent variable crosses the threshold.
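A minimal sketch of this simulation is shown below (the coefficient values in the β vector are hypothetical; the exact numbers in the attached code may differ):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 1000, 8
X = rng.normal(size=(n, p))                 # eight standard-normal features

mask = np.zeros(p)
mask[[0, 2, 6]] = 1.0                       # features 1, 3 and 7 (1-based) are relevant
beta = np.array([1.0, 0.0, -1.5, 0.0, 0.0, 0.0, 2.0, 0.0])   # hypothetical coefficients

z = X @ (mask * beta) + rng.normal(size=n)  # latent variable with standard-normal noise
y = (z > 0).astype(int)                     # threshold at zero gives the binary outcome
```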

We make it difficult for the model to identify the relevant features within the larger set of variables by using features with unclear separability, as shown in the figure below. Boxplots show highly overlapping classes for x1 and x3.

Similar conclusions can be made based on density and scatter plots.
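Such diagnostics are easy to produce. The sketch below (assuming matplotlib and the X and y arrays from the previous snippet) draws the boxplots of x1 and x3 split by class:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, idx, name in zip(axes, [0, 2], ["x1", "x3"]):
    # one box per class: strong overlap means weak separability for this feature
    ax.boxplot([X[y == 0, idx], X[y == 1, idx]], labels=["y = 0", "y = 1"])
    ax.set_title(name)
plt.tight_layout()
plt.show()
```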

As explained above, the code simulates data for a binary classification problem using a subset of explanatory variables and tests whether the model selection procedure can identify the relevant features. The feature engineering procedure based on stochastic search has previously been described for the linear regression model and then successfully applied in the context of a more advanced mixture-of-regressions model.

In this case, the results are just as good as before!

The model is able to identify the proper features within the broader dataset, as shown in the figure below. The inclusion probability MCMC sampling trajectory for x1, x3, and x7 is either exactly 1 or mostly 1, whereas for all the other features it remains low. (The individual probabilities computed over the full trajectory of 10,000 MCMC draws are 1.0000, 0.1624, 0.9752, 0.1269, 0.2757, 0.1839, 1.0000, and 0.1220.)
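For reference, if the sampler stores a 0/1 inclusion indicator per feature and per draw (here in a hypothetical array called gamma_draws), the reported inclusion probabilities are simply the means of those indicators over the trajectory:

```python
import numpy as np

def inclusion_probabilities(gamma_draws: np.ndarray) -> np.ndarray:
    # gamma_draws: hypothetical (n_draws, n_features) array of 0/1 indicators
    # produced by the stochastic-search sampler; the mean over draws is the
    # posterior inclusion probability of each feature.
    return gamma_draws.mean(axis=0)
```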

These results have been achieved despite the weak separability of x1 and x3. As shown in the figure below, all of the parameters beyond the truly selected features oscillate around 0, indicating that those variables have no impact on the target!

The full code is presented below.

Unless otherwise noted, all images are by the author.

