Logistic Regression and the Feature Scaling Ensemble

A new groove toward optimality for linear classification

Dr. Dave Guggenheim
Towards Data Science


Principal Researcher: Dave Guggenheim / Contributor: Utsav Vachhani

The Down Beat

This work is a continuation of our earlier research into feature scaling (see here: The Mystery of Feature Scaling is Finally Solved | by Dave Guggenheim | Towards Data Science). In this project, we’ll examine the effect of 15 different scaling methods across 60 datasets using ridge-regularized logistic regression.

We will show how powerful regularization can be, with accuracy on many datasets unaffected by the choice of feature scaling. But, as with the original work, feature scaling ensembles offer dramatic improvements, in this case especially with multiclass targets.

Model Definition

We chose L2 (ridge, or Tikhonov-Miller) regularization for logistic regression to satisfy the scaled-data requirement. It adds a penalty equal to the sum of the squared coefficient values, which is particularly useful for dealing with multicollinearity and shrinks the coefficients of less informative variables without removing them from the model.

All models in this research were constructed using the LogisticRegressionCV algorithm from the scikit-learn library. All models were also 10-fold cross-validated with stratified sampling. Each binary classification model was run with the following hyperparameters (a configuration sketch follows the list):

1) penalty = ‘l2’

2) Cs = 100

3) solver = ‘liblinear’

4) class_weight = ‘balanced’

5) cv = 10

6) max_iter = 5000

7) scoring = ‘accuracy’

8) random_state = 1
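
As a point of reference, here is a minimal sketch of that binary configuration using scikit-learn's LogisticRegressionCV; the variable names (binary_clf, X_train_scaled, and so on) are ours and not taken from the original code:

```python
from sklearn.linear_model import LogisticRegressionCV

# Binary-classification settings used in this study; everything else
# stays at its scikit-learn default.
binary_clf = LogisticRegressionCV(
    penalty='l2',
    Cs=100,                    # 100 candidate values of the inverse regularization strength
    solver='liblinear',
    class_weight='balanced',
    cv=10,                     # 10-fold CV; stratified by default for classifiers
    max_iter=5000,
    scoring='accuracy',
    random_state=1,
)

# binary_clf.fit(X_train_scaled, y_train)
# test_accuracy = binary_clf.score(X_test_scaled, y_test)
```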

Multiclass classification models (indicated with an asterisk in the results tables) were tuned in this fashion (a corresponding sketch follows the list):

1) penalty = ‘l2’

2) Cs = 100

3) solver = ‘lbfgs’

4) class_weight = ‘balanced’

5) cv = 10

6) max_iter = 20000

7) scoring = ‘accuracy’

8) random_state = 1

9) multi_class = ‘multinomial’
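
The corresponding multiclass sketch differs only in the solver, the iteration cap, and the explicit multinomial setting (again, the variable name is ours):

```python
from sklearn.linear_model import LogisticRegressionCV

# Multiclass settings: lbfgs solver, a higher iteration cap, and an
# explicit multinomial (softmax) formulation.
multiclass_clf = LogisticRegressionCV(
    penalty='l2',
    Cs=100,
    solver='lbfgs',
    class_weight='balanced',
    cv=10,
    max_iter=20000,
    scoring='accuracy',
    random_state=1,
    multi_class='multinomial',
)
```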

The L2 penalizing factor here addresses the performance gap between training data and testing data. Models with a large number of predictors are prone to overfitting, and ridge regression, which uses the L2 regularizer, applies the squared-coefficient penalty to prevent it.

All other hyperparameters were left at their respective default values. All models were constructed with feature-scaled data using these scaling algorithms (scikit-learn classes are named in parentheses; a sketch of the solo scalers follows the list):

a. Standardization (StandardScaler)

b. L2 Normalization (Normalizer; norm=‘l2’)

c. Robust (RobustScaler; quantile_range=(25.0, 75.0), with_centering=True, with_scaling=True)

d. Normalization (MinMaxScaler; feature_range = multiple values (see below))

d1. feature_range = (-1, 1)

d2. feature_range = (0, 1)

d3. feature_range = (0, 2)

d4. feature_range = (0, 3)

d5. feature_range = (0, 4)

d6. feature_range = (0, 5)

d7. feature_range = (0, 6)

d8. feature_range = (0, 7)

d9. feature_range = (0, 8)

d10. feature_range = (0, 9)

e. Ensemble w/StackingClassifier: StandardScaler + Norm(0,9) [see Feature Scaling Ensembles for more info]

f. Ensemble w/StackingClassifier: StandardScaler + RobustScaler [see Feature Scaling Ensembles for more info]
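
For the thirteen solo scalers, a compact way to collect them for iteration looks like the sketch below; the dictionary and its keys are our own convenience, while the class names and parameters match the list above:

```python
from sklearn.preprocessing import StandardScaler, Normalizer, RobustScaler, MinMaxScaler

# The 13 solo feature scaling algorithms compared in this study.
scalers = {
    'standard': StandardScaler(),
    'l2_norm': Normalizer(norm='l2'),
    'robust': RobustScaler(quantile_range=(25.0, 75.0),
                           with_centering=True, with_scaling=True),
    'norm_-1_1': MinMaxScaler(feature_range=(-1, 1)),
}
# Normalization variants Norm(0,1) through Norm(0,9)
for upper in range(1, 10):
    scalers[f'norm_0_{upper}'] = MinMaxScaler(feature_range=(0, upper))
```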

With the exception of the ensembles, scaling methods were applied to the training predictors using ‘fit_transform’ and then to the test predictors using ‘transform’, as specified by numerous sources (e.g., Géron, 2019, pg. 66; Müller & Guido, 2016, pg. 139; Shmueli, Bruce, et al., 2019, pg. 33; Should scaling be done on both training data and test data for machine learning? — Quora) and supported by scikit-learn for all feature scaling algorithms. However, because of their design, the ensembles had to predict on raw test data: chaining scaling algorithms sequentially leaves only the final stage in effect, and even that would not reproduce the two parallel scaling paths that are combined only at the modeling stage.
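
As a minimal illustration of that convention (X_train and X_test are assumed to be already-partitioned predictor matrices):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the scaling parameters from the training predictors only...
X_train_scaled = scaler.fit_transform(X_train)
# ...then apply those same learned parameters to the test predictors.
X_test_scaled = scaler.transform(X_test)
```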

In cases where there were fewer than 12 samples per predictor, we limited the test partition to no less than 10% of the population (Shmueli, Bruce, et al., 2019, pg. 29). In cases where there were enough samples for reasonable predictive accuracy as determined by the sample complexity generalization error, we used a uniform 50% test partition size. Between these two boundaries, we adjusted the test size to limit the generalization test error in a tradeoff with training sample size (Abu-Mostafa, Magdon-Ismail, & Lin, 2012, pg. 57). For multiclass data, if there were fewer than 12 samples per categorical level in the target variable, those levels were dropped prior to modeling.

Any missing values were imputed using the MissForest algorithm due to its robustness in the face of multicollinearity, outliers, and noise. Categorical predictors were one-hot encoded using the pandas get_dummies function with the first level of each dropped (drop_first=True). Low-information variables (e.g., ID numbers) were dropped prior to the train/test partition.
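
A sketch of this preprocessing, assuming a pandas DataFrame df; the missingpy implementation of MissForest and the ordering of the steps are our choices, since the original code is not reproduced here:

```python
import pandas as pd
from missingpy import MissForest  # one available MissForest implementation

# Drop low-information columns before partitioning (column name is illustrative).
df = df.drop(columns=['ID'])

# One-hot encode categorical predictors, dropping the first level of each.
df_encoded = pd.get_dummies(df, drop_first=True)

# Impute any remaining missing values with MissForest.
imputer = MissForest()
df_imputed = pd.DataFrame(imputer.fit_transform(df_encoded),
                          columns=df_encoded.columns)
```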

These models were constructed for the purpose of comparing feature-scaling algorithms rather than tuning a model to achieve the best results. For this reason, we kept as many default values in the models as possible to create a level ground for said comparisons. All performance metrics are computed as the overall accuracy of predictions on the test data, and this metric was examined through two thresholds: 1) within 3% of best performance as a measure of generalizability, and 2) within 0.5% of best performance as a measure of predictive accuracy. This is why we use many datasets: variance and its inherent randomness are a part of everything we research.
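
Reading the two thresholds as relative (percentage-of-best) differences, consistent with the comparison tables later in the article, they can be expressed as a simple check; the function and its labels are our own illustration:

```python
def performance_tier(accuracy, best_solo_accuracy):
    """Classify a result relative to the best solo scaling accuracy."""
    ratio = accuracy / best_solo_accuracy
    if ratio >= 0.995:       # within 0.5% of the best solo accuracy
        return 'predictive'
    if ratio >= 0.97:        # within 3% of the best solo accuracy
        return 'generalized'
    return 'below threshold'

# Example: 0.848 against a best solo accuracy of 0.850 -> 'predictive'
print(performance_tier(0.848, 0.850))
```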

Datasets

The sixty datasets used in this analysis are presented in Table 1, with a broad range of predictor types and classes (binary and multiclass). Most datasets may be found at the UCI index (UCI Machine Learning Repository: Data Sets). Datasets not in the UCI index are all open source and found at Kaggle:

Boston Housing: Boston Housing | Kaggle; HR Employee Attrition: Employee Attrition | Kaggle; Lending Club: Lending Club | Kaggle; Telco Churn: Telco Customer Churn | Kaggle; Toyota Corolla: Toyota Corolla | Kaggle

Thirty-eight of the datasets are binomial and 22 are multinomial classification problems. All models were created and checked against all datasets. The number of predictors listed in the table counts the original, unencoded (categorical) variables, including non-informational ones prior to exclusion.

Table 1 Datasets (image by author)

Feature Scaling Misfit Exploration

Our prior research indicated that, for predictive models, the proper choice of feature scaling algorithm involves finding a misfit with the learning model to prevent overfitting. Here is the equation that defines the log loss cost function with an L2 penalty factor added:

Figure 1 The log loss cost function (image by author)
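In text form, a standard way to write this objective for the binary case is the following, where $\hat{p}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$ is the predicted probability and $\lambda$ controls the penalty strength (scikit-learn parameterizes it inversely, as $C = 1/\lambda$, and does not penalize the intercept $b$):

$$J(\mathbf{w}, b) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{p}_i + (1 - y_i)\log\big(1 - \hat{p}_i\big) \Big] + \lambda \lVert \mathbf{w} \rVert_2^2$$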

Unlike distance-based measures, for which normalization is a fit (it maintains relative spacing) and standardization is a misfit, the right choice for the regularized log loss cost function is not as easily determined. Regardless of the embedded logit function and what that might indicate in terms of misfit, the added penalty factor ought to minimize any differences in model performance.

To test for this condition of bias control, we built identical normalization models that sequentially cycled from feature_range = (0, 1) to feature_range = (0, 9). Training and test set accuracies at each stage were captured and plotted with training in blue and test in orange.
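
A sketch of that sweep, reusing the binary_clf configuration from the earlier hyperparameter sketch and assuming a fixed train/test split (X_train, X_test, y_train, y_test):

```python
from sklearn.base import clone
from sklearn.preprocessing import MinMaxScaler

train_acc, test_acc = [], []
for upper in range(1, 10):                         # Norm(0,1) through Norm(0,9)
    scaler = MinMaxScaler(feature_range=(0, upper))
    X_tr = scaler.fit_transform(X_train)           # fit the scaler on training data only
    X_te = scaler.transform(X_test)
    model = clone(binary_clf).fit(X_tr, y_train)   # fresh copy of the L2 model each pass
    train_acc.append(model.score(X_tr, y_train))
    test_acc.append(model.score(X_te, y_test))
```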

The left panel of each comparison graph shows the effect of raising the feature range by one unit from zero to nine with the non-regularized support vector classifier. The right panel shows the same data and model selection parameters but with an L2-regularized logistic regression model.

As we increase the feature range without changing any other aspect of the data or model, lower bias is the result for the non-regularized learning model whereas there is little effect on the regularized version. Note that the y-axes are not identical and should be consulted individually. Please refer to Figures 2–7 for examples of this phenomenon.

Figure 2 Australian Credit Binary Model Comparison (image by author)
Figure 3 Car Multiclass Model Comparison (image by author)
Figure 4 Credit Approval Binary Model Comparison (image by author)
Figure 5 German Credit Binary Model Comparison (image by author)
Figure 6 Letter Recognition Multiclass Model Comparison (image by author)
Figure 7 HR Churn (Employee Attrition) Binary Model Comparison (image by author)

True, the two distinct learning models perhaps do not respond in the same way to an extension of normalization range, but the regularized models do demonstrate a bias control mechanism regardless.

Feature Scaling Ensembles

Based on the results generated with the 13 solo feature scaling models, these are the two ensembles constructed to satisfy both generalization and predictive performance outcomes (see Figure 8). Scaling paths were constructed using the make_pipeline function in scikit-learn to create the three estimators: 1) standardization + L2 logistic regression, 2) Norm(0,9) + L2 logistic regression, and 3) robust scaling + L2 logistic regression. Voting classifiers as the final stage were tested but rejected due to poor performance, hence the use of stacking classifiers as the final estimator for both ensembles. The StackingClassifiers were 10-fold cross-validated in addition to the 10-fold cross-validation on each pipeline. All other hyperparameters were set to their previously specified or default values.
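
A hedged sketch of the two ensembles as described above; because the original code is not reproduced here, details such as the final estimator's configuration are assumptions:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

def l2_logreg():
    # Same L2-regularized configuration used for the solo binary models.
    return LogisticRegressionCV(penalty='l2', Cs=100, solver='liblinear',
                                class_weight='balanced', cv=10, max_iter=5000,
                                scoring='accuracy', random_state=1)

# STACK_ROB: standardization and robust-scaling paths combined by a stacking classifier.
stack_rob = StackingClassifier(
    estimators=[
        ('standard', make_pipeline(StandardScaler(), l2_logreg())),
        ('robust',   make_pipeline(RobustScaler(), l2_logreg())),
    ],
    final_estimator=l2_logreg(),   # assumed; the article does not name the final estimator's settings
    cv=10,                         # 10-fold CV for the stacking stage
)

# Second ensemble: standardization plus Norm(0,9).
stack_norm = StackingClassifier(
    estimators=[
        ('standard', make_pipeline(StandardScaler(), l2_logreg())),
        ('norm_0_9', make_pipeline(MinMaxScaler(feature_range=(0, 9)), l2_logreg())),
    ],
    final_estimator=l2_logreg(),
    cv=10,
)

# As noted earlier, the ensembles fit and predict on raw (unscaled) data:
# stack_rob.fit(X_train, y_train); stack_rob.score(X_test, y_test)
```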

Figure 8 Feature Scaling Ensembles Construction

Preliminary Results

Refer to Figure 9 for details about generalized performance for the 15 feature scaling algorithms (13 solo and 2 ensembles).

Figure 9 Generalization Performance (image by author)

As expected, there was scant difference between solo feature scaling algorithms regarding generalized performance. In Figure 9, one can see an equality enforced through regularization such that, excluding L2 normalization, there is only a four-dataset difference between the lowest-performing solo algorithm (Norm(0,9) = 41) and the best (Norm(0,4) = 45). The STACK_ROB feature scaling ensemble improved on the best count by another eight datasets to 53, meaning the ensemble generalized on 88% of the 60 datasets.

In the case of predictive performance, there is a larger difference between solo feature scaling algorithms. In Figure 10, one can see a wider range of counts across the datasets.

Figure 10 Predictive Performance (image by author)

Excluding L2 normalization, the maximum difference between the lowest-performing solo algorithm and the best solo is 11 datasets (StandardScaler = 21 versus Norm(0,5) = 32) instead of the four presented by the generalization metrics. The STACK_ROB feature scaling ensemble improved the best count by another 12 datasets to 44, or a 20% improvement across all 60 from the best solo algorithm.

This unusual phenomenon, the boosting of predictive performance, is not explained by examining the overall performance graph for the feature scaling ensembles (see Figure 11). Rather, to understand better, we’ll need to dive into the raw comparison data.

Figure 11 Feature Scaling Ensembles Overall Performance (image by author)

What you are seeing is correct — the feature scaling ensembles delivered new best accuracy metrics for more than half of all datasets in this study!

Full Results and New Findings

Table 2 is color-coded in many ways. First, the left-hand column denotes the 60 datasets, and multiclass targets are identified with yellow highlighting and an asterisk after the dataset name. Next, the color-coded cells represent percentage differences from the best solo method, with that method being the 100% point. To be clear, the color-coded cells do not show absolute differences but rather percentage differences. The color green in a cell signifies achieving best case performance against the best solo method, or within 0.5% of the best solo accuracy. The color yellow in a cell indicates generalization performance, or within 3% of the best solo accuracy. If a dataset shows green or yellow all the way across, it demonstrates the effectiveness of regularization in that there were minimal differences in performance. The color red in a cell shows performance that is outside of the 3% threshold, with the number in the cell showing how far below it is from the target performance in percentage from best solo method. Lastly, the color blue, the Superperformers, shows performance in percentage above and beyond the best solo algorithm.

Table 2 Full Comparative Results (image by author)

In reviewing the comparative data, we noticed something interesting — positive differential predictive performance on multiclass target variables.

Binary versus Multiclass Performance

Out of 38 binary classification datasets, the STACK_ROB feature scaling ensemble scored 33 datasets for generalization performance and 26 datasets for predictive performance (see Table 3). These results represent 87% generalization and 68% predictive performance for binary targets, or a 19-point differential between those two metrics.

Table 3 Binary Classification Comparative Results (image by author)

Out of 22 multiclass datasets, the feature scaling ensembles scored 20 datasets for generalization performance, only one more than most of the solo algorithms (see Figure 13). Yet, those same feature scaling ensembles scored 18 datasets for predictive performance (see Figure 14), closing the gap between generalization and prediction. In other words, the feature scaling ensembles achieved 91% generalization and 82% predictive accuracy across the 22 multiclass datasets, a nine-point differential instead of the 19-point difference with binary target variables.

And of those 18 datasets at peak performance, 15 delivered new best accuracy metrics (the Superperformers). See Table 4 for the multiclass comparative analysis.

Figure 13 Generalization Performance for Multiclass (image by author)
Figure 14 Predictive Performance for Multiclass (image by author)
Table 4 Multiclass Comparative Results (image by author)

The STACK_ROB Feature Scaling Ensemble

A comparative inspection of the performance offered by combining standardization and robust scaling across all 60 datasets is shown in Figure 15. Numbers below zero mark the datasets for which STACK_ROB could not match the accuracy of the best solo algorithm, expressed as a percentage of that best solo accuracy. Numbers at zero indicate achieving 100% of the best solo accuracy, whereas numbers above zero indicate Superperformers, and the y-axis denotes the percentage improvement over the best solo method.

Figure 15 STACK_ROB Performance (image by author)

Conclusions

Our work has shown that regularization is effective at minimizing accuracy differences between feature scaling schemes, such that the choice of scaling isn't as critical as it is for a non-regularized model. Despite the bias control effect of regularization, the predictive performance results indicate that standardization is a fit and normalization is a misfit for logistic regression.

But, as we confirmed with our earlier research, feature scaling ensembles, especially STACK_ROB, deliver substantial performance improvements. And in this case, there is a definitive improvement in multiclass predictive accuracy, with predictive performance closing the gap with generalized metrics. If you had not considered logistic regression for solving multinomial problems, the STACK_ROB feature scaling ensemble may change your mind.

If your project can't absorb the processing time required by feature scaling ensembles for an L2-regularized logistic regression model, then normalization with a feature range of zero to four or five (Norm(0,4) or Norm(0,5)) offers decent performance for both generalization and prediction. At least, it's a good place to start in your search for optimality.

Acknowledgements

I would like to express my deepest thanks for the tireless effort expended for over a year by Utsav Vachhani toward solving the mystery of feature scaling, which led to the creation of feature scaling ensembles. Quite simply, without his contribution, this paper and all future work into feature scaling ensembles would not exist.

References

Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H.-T. (2012). Learning from data (Vol. 4). AMLBook, New York, NY, USA.

Géron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O’Reilly Media.

Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python: a guide for data scientists. O’Reilly Media, Inc.

Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2019). Data mining for business analytics: concepts, techniques and applications in Python. John Wiley & Sons.

Scikit-learn developers. (n.d.). sklearn.linear_model.LogisticRegressionCV. Retrieved from the scikit-learn 1.0.2 documentation.

Contact Info:

Dave Guggenheim: See author info and bio, dguggen@gmail.com

Utsav Vachhani: LinkedIn bio, uk.vachhani@gmail.com
