Benchmarking simple models with feature extraction against modern black-box methods

Martin Dittgen
13 min read · Oct 13, 2019


People in the financial industry love logistic regression. And it appears they are not alone: according to the 2017 Kaggle survey The State of ML and Data Science, the most popular answer to the question "What data science methods are used at work?" was logistic regression with 63.5%. In second place we find the decision tree with 49.9%. These two methods belong to the class of highly interpretable models.

However, to win a Kaggle competition one usually needs to employ a more modern, black-box algorithm such as boosted trees. In the Kaggle survey boosted trees ranked only 8th with 23.9%.
(Note that we will focus here on structured — spreadsheet-like — data. For unstructured inputs like audio and images the picture is entirely different and probably dominated by deep learning.)

I myself have implemented many logistic regressions for banks and corporations and know that many practitioners value the interpretability of simple models. You know the model's inner cogs and bolts and can therefore explain how it works. This becomes especially important when you have users of the model who want to understand how it works, or when — e.g. in the financial or pharmaceutical industry — you are dealing with regulators who scrutinize your models and your whole development process.

The question this blogpost tries to answer

Is the difference in performance between a simple, interpretable model (like logistic regression) and a modern method (like boosted trees) really that large in practice? Previous benchmark studies suggest that this is indeed the case; see for example (Lim et al. 2000), (Caruana et al. 2006), (Caruana et al. 2008), (Bischl et al. 2014).

However, these benchmarks were usually run on the raw datasets. In practice, what you often do with simpler models is manual feature extraction: you combine certain columns, remove noisy or highly correlated features, and transform certain features in an attempt to make them more useful or more linear with respect to your target.

Might it be possible that the difference between a manually crafted simple model and a modern machine learning method applied to raw, unprocessed data is not as large as one might naively assume?

In order to answer this question we will run a benchmark that includes a feature extraction step for the interpretable models. We will compare their performance with that of less interpretable, modern machine learning models applied without feature extraction.

Note that this work was originally done as a Master’s Thesis at the University of Oxford. In this blogpost we will focus on the main ideas and findings. If you are interested in a more thorough discussion including all details you can find the thesis here.

Setup of the Benchmark

For our following discussion we will classify all algorithms into two different classes:

1. White Box Model: A statistical or machine learning model which can be put into a form understandable by a human being such that an economic interpretation can be attached to the inner workings of the prediction process.

2. Black Box Model: Any machine learning model which defies a simple explanation of its inner workings, therefore making it essentially a black box.

The class of white box models will be allowed to use an additional feature extraction step. In order to keep this comparison objective we cannot include any manual steps in it. We will instead rely on certain (unsupervised) algorithms like principal component analysis, hoping that they at least approximate the feature extraction a human would do. We will call these algorithms extractors.

List of all models and algorithms used in this benchmark. The technical name given in this table will be used in the result plots further below. Footnote 1: The MARS model will be considered grey box. We treat it as if it has an inherent extractor and don’t combine it with the other extractors.
The feature extractors used in this benchmark are applied to a 3-dimensional S-curve. For this example all feature extractors are reconfigured to extract exactly 2 components. The colour coding visualizes the effect of the transformation. This figure was inspired by a similar plot from Jake Vanderplas.

Datasets

The comparison will be performed over 32 different publicly available datasets:

  • 16 datasets concern binary classification,
  • 5 are multiclass classification tasks,
  • 11 involve regression tasks with predicting continuous values.

To make this comparison computationally feasible we will downsample each large dataset to 10,000 observations without replacement.
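As a rough sketch of this step (assuming the data has been loaded into a pandas DataFrame; the dummy data below is purely illustrative):

```python
import numpy as np
import pandas as pd

# df stands in for one of the larger benchmark datasets (dummy data here)
df = pd.DataFrame(np.random.default_rng(0).normal(size=(50_000, 5)))

# Downsample large datasets to at most 10,000 observations, drawn without replacement
if len(df) > 10_000:
    df = df.sample(n=10_000, replace=False, random_state=42)
```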

The datasets used in this benchmark. Most of them are from the UCI Machine Learning Repository, some are (training) datasets from Kaggle. The AmesHousing dataset was taken directly from (De Cock 2011). The BostonHousing2, a corrected and amended version of BostonHousing, was taken from StatLib.

The model pipeline

The benchmark will be done in scikit-learn using its convenient pipelines. For all models we will use a basic pipeline: Each metric feature is first imputed with the median and then standardized. Each categorical feature is one-hot encoded and then also standardized to put it on an equal footing with the metric features. The metric and categorical features are then merged again. Afterwards a feature extractor may be used, if applicable for the current setup, and then the model is fitted to the training data.

The data processing pipeline
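A minimal sketch of such a pipeline in scikit-learn could look like this (the column names are placeholders, and PCA stands in for whichever extractor is chosen; in the benchmark every model/extractor combination is assembled in the same way):

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "income"]   # hypothetical column names
categorical_features = ["region"]      # hypothetical column names

# Metric features: impute with the median, then standardize
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical features: one-hot encode, then standardize as well
# (on scikit-learn < 1.2 use sparse=False instead of sparse_output=False)
categorical_pipe = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ("scale", StandardScaler()),
])

# Merge the metric and categorical branches again
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_features),
    ("cat", categorical_pipe, categorical_features),
])

# Optional extractor (PCA as an example), followed by the model itself
model = Pipeline([
    ("preprocess", preprocess),
    ("extract", PCA(n_components=5)),
    ("model", LogisticRegression(max_iter=1000)),
])
```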

The metrics

The metrics used to compare the models are:

  • Explained variance for regression problems
  • Area under the ROC curve (AUROC) for binary classification problems, which we convert to Somers’ D
  • F1 score computed per class and weighted by class counts for multiclass classification problems.

The shift from AUROC to Somers’ D is simply done via:

Somers’ D = 2 · AUROC - 1

The main advantage is that the dummy performance of randomly guessing the target is now 0% (instead of the 50% level of AUROC). This will make our normalization and aggregation of results much easier.
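In code this is a one-liner; the scorer-name mapping below is my own shorthand for the three metrics, not necessarily the exact configuration used in the thesis:

```python
from sklearn.metrics import roc_auc_score

def somers_d(y_true, y_score):
    """Rescale AUROC so that random guessing scores 0 instead of 0.5."""
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0

# scikit-learn scorer names corresponding to the three metrics above
SCORING = {
    "regression": "explained_variance",
    "binary": "roc_auc",        # converted to Somers' D afterwards
    "multiclass": "f1_weighted",
}
```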

Hyperparameter Optimization

In order to assess the optimal set of hyperparameters for each algorithm we use 60 iterations of random search, sampling each hyperparameter from a reasonable distribution. We evaluate the random search with 5-fold cross-validation for datasets with fewer than 2,000 observations and with an 80/20 train-test split for larger datasets. The metrics used in the hyperparameter optimization are the same as the ones used in the comparison.
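A sketch of this search for a single model (the data, pipeline and parameter ranges below are illustrative stand-ins, not the exact search spaces used in the thesis):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Stand-in data and a reduced pipeline; in the benchmark the full
# preprocessing pipeline from above is tuned per dataset
X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
pipe = Pipeline([("extract", PCA()), ("model", LogisticRegression(max_iter=1000))])

# Hypothetical search space: sample each hyperparameter from a distribution
param_distributions = {
    "extract__n_components": list(range(2, 16)),
    "model__C": loguniform(1e-3, 1e3),
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=60,          # 60 random draws from the search space
    cv=5,               # 5-fold CV for small datasets; 80/20 split for larger ones
    scoring="roc_auc",
    random_state=42,
)
search.fit(X, y)
```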

Results

We evaluate each algorithm on 10 repeats of 10-fold cross-validation. We chose repeated cross-validation based on the studies by (Molinaro, Simon, and Pfeiffer 2005) and (Kim 2009), which both compared different model evaluation strategies. We thus obtain 100 values of each metric per dataset. Each algorithm is configured with the set of optimal hyperparameters as described before.
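In scikit-learn this evaluation can be sketched as follows (again with stand-in data and a stand-in estimator; in the benchmark it is the tuned pipeline of each algorithm):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Stand-in data and model
X, y = make_classification(n_samples=500, random_state=0)
estimator = LogisticRegression(max_iter=1000)

# 10 repeats of 10-fold CV -> 100 scores per metric and dataset
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
auroc_scores = cross_val_score(estimator, X, y, cv=cv, scoring="roc_auc")
somers_d_scores = 2.0 * auroc_scores - 1.0   # convert AUROC to Somers' D
assert auroc_scores.shape == (100,)
```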

In order to assess the performance of the algorithms we use two comparisons inspired by the previous literature: First we perform a comparison of all algorithms based on their relative ranking per dataset, presented in the next section. The subsequent section presents a comparison based on a normalized score for each metric. This normalization harmonizes the scores over all datasets and makes an aggregation and comparison possible.

Comparison of Relative Ranking

In order to first assess the algorithms we award medals per dataset to the models which performed best. The model with the best mean performance over all folds on a dataset is awarded a gold medal, the second and third a silver and bronze medal. The medal table over all models can be seen below. The algorithms are sorted by medal count, gold medals first, followed by silver and bronze. This is similar to the sorting used in sporting events like the Olympic Games. All algorithms which won at least one medal are shown.
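A rough sketch of how such a medal table can be computed from the per-dataset mean scores (the DataFrame layout and values are hypothetical):

```python
import pandas as pd

# One row per (dataset, algorithm) with the mean score over all CV folds (dummy values)
results = pd.DataFrame({
    "dataset":    ["d1", "d1", "d1", "d2", "d2", "d2"],
    "algorithm":  ["xgboost", "ridge", "rf", "xgboost", "ridge", "rf"],
    "mean_score": [0.91, 0.88, 0.90, 0.75, 0.80, 0.78],
})

# Rank algorithms within each dataset and translate ranks 1-3 into medals
results["rank"] = results.groupby("dataset")["mean_score"].rank(ascending=False)
medals = results[results["rank"] <= 3].copy()
medals["medal"] = medals["rank"].map({1.0: "gold", 2.0: "silver", 3.0: "bronze"})

# Medal table: count medals per algorithm, sort by gold first, then silver, then bronze
medal_table = (
    pd.crosstab(medals["algorithm"], medals["medal"])
    .reindex(columns=["gold", "silver", "bronze"], fill_value=0)
    .sort_values(["gold", "silver", "bronze"], ascending=False)
)
print(medal_table)
```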

Overall medals

For regression problems xgboost and the extra trees dominate the results with 4 gold medals each. The three white box models ridge regression, lasso and linear SVM were able to score medals, with ridge regression even winning one gold medal. The random forest showed a very consistent performance, scoring 7 silver and 1 bronze medal. Out of the 33 medals awarded for regression problems 5 were won by white box algorithms, yielding a rate of 15%, and 28 by black box methods. In terms of the medal count, regression problems were clearly dominated by the black box methods, which won 85% of all medals.

For multiclass classification 15 medals were awarded with 6 going to white box methods, yielding a rate of 40%, and 9 going to black box methods. The two white box methods QDA and linear SVC each won 1 gold medal.

Out of the 48 medals awarded for binary classification 21 were won by white box models, yielding a rate of 44%, and 27 by black box models. The linear SVC and logistic ridge methods performed well and scored 19 of the 21 medals won by interpretable models. Thus for binary and multiclass classification the gap between white box and black box methods was small in terms of medal count.

The dominance of xgboost for regression and binary classification appears to be weaker for multiclass problems. Interestingly, if we award medals for binary classification based on the F1 score, xgboost only wins 2 gold and 1 bronze medal (check the appendix of the linked thesis for these supplementary results). The weaker multiclass showing might therefore be (partly) an effect of the F1 metric. However, if we distribute medals based on accuracy, xgboost is again the highest-ranking model for binary classification with 3 gold, 1 silver and 2 bronze medals, while for multiclass classification it again ranks in the middle of the medal table. Note that these comparisons should be taken with a grain of salt, as the hyperparameter optimization was performed for the metrics shown here.

Best extractors

We also award medals to the best extractors per dataset (assessed on the white box models only), as displayed in the table below. We see that, besides using the raw features (None), only the PCA and Kernel-PCA methods won any medals.

Best extractors medals

White box ranking

As the overall medal tables are all dominated by black box algorithms we also perform a ranking on the white box models only: The first table shows the ranking of white box algorithms if no additional extractor is used. The second table shows the same comparison if all combinations with feature extractors are included.

Medal ranking for white box models without any feature extractor (Note that we also list MARS as having no feature extractor, though we consider it to have the equivalent of a feature extractor built into the algorithm itself. This probably explains why MARS dominates the table when no feature extraction methods are included but ranks lower when the other white box models are allowed to utilize a feature extraction algorithm.)
Medal ranking for white box models including feature extractors

We see that some algorithms such as elastic net perform strongly among the white box models alone but then disappear from the overall table. This is because they perform well on datasets where the black box algorithms performed even better and therefore took the medals.

For multiclass classification the linear SVC does not appear in the table at all if no feature extraction is applied but wins most medals with extractor methods.

Comparison of Normalized Scores

The results over the different datasets and algorithms have to be normalized in order to make them comparable to each other. For this purpose we process the data with the following three steps:

1. Remove unstable algorithms: Algorithms which did not converge and yielded results far below dummy performance were excluded. Some algorithm-extractor combinations were unstable on only one dataset. In order not to bias the result, we remove an algorithm only on those datasets where its median score is equal to or below -0.5.

2. Winsorize the worst possible performance for regression problems: The metric for the binary classification problems is bounded between -1.0 (absolute worst performance, i.e. always giving the best observations the worst scores) and +1.0 (perfect prediction). Similarly, the weighted F1 score for multiclass classification ranges from 0.0 to 1.0. In contrast, the regression metric can range from minus infinity to +1.0. Some regression algorithms were unstable on only a small number of folds and were therefore not removed in the previous step. In order not to let any single outlier influence the total result by a significant amount, and to put the comparison between regression and classification on a more equal footing, we winsorize the metric for each fold of the regression problems by setting every value below -1.0 equal to -1.0.

3. Normalize for idiosyncratic effects: Each dataset differs in how difficult it is to predict new values, because the magnitude of the idiosyncratic effects on each observation differs. In order to make the scores for each dataset more comparable the following normalization is performed:

For each combination of dataset and metric we determine the 95th percentile of the metric, taken over all algorithms and folds. Each CV fold of that dataset and metric combination is then normalized by dividing its result by this percentile:

normalized score(d, a, c) = score(d, a, c) / P95(d)

where 'd' denotes the dataset, 'a' the algorithm-extractor pair and 'c' a specific CV fold. The percentile P95(d) is taken over all algorithms and all folds, per metric and dataset. (A short code sketch of all three steps follows below.)

After this normalization each algorithm should show a similar result structure over all datasets. We can confirm that this is indeed the case with a simple boxplot, shown below. Note that this normalization can lead to values which are above +1.0.

Results for the regression random forest before and after normalization for the metric explained variance by dataset. We see that after the normalization the results are of similar range. An aggregation of the results per algorithm over all datasets is now possible.
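As a rough illustration, the three processing steps above can be condensed into the following pandas sketch (the table layout, column names and values are hypothetical):

```python
import pandas as pd

# Long-format results: one row per (dataset, algorithm, fold) with the raw metric value
scores = pd.DataFrame({
    "dataset":   ["d1"] * 4 + ["d2"] * 4,
    "task":      ["regression"] * 4 + ["binary"] * 4,
    "algorithm": ["ridge", "ridge", "xgboost", "xgboost"] * 2,
    "score":     [0.6, -3.0, 0.8, 0.7, 0.55, 0.60, 0.70, 0.65],
})

# Step 1: drop algorithm/dataset combinations with a median score at or below -0.5
medians = scores.groupby(["dataset", "algorithm"])["score"].transform("median")
scores = scores[medians > -0.5].copy()

# Step 2: winsorize regression folds at -1.0 so single outliers cannot dominate
reg = scores["task"] == "regression"
scores.loc[reg, "score"] = scores.loc[reg, "score"].clip(lower=-1.0)

# Step 3: divide each fold result by the dataset's 95th percentile,
# taken over all algorithms and folds for that dataset and metric
p95 = scores.groupby("dataset")["score"].transform(lambda s: s.quantile(0.95))
scores["normalized_score"] = scores["score"] / p95
```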

Next, we aggregate all the normalized scores over all datasets, calculate the mean scores per algorithm and rank the algorithms accordingly. We also determine 95% confidence intervals by bootstrapping the normalized scores.
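A minimal sketch of such a bootstrap for one algorithm's normalized scores (dummy values; the number of resamples is an assumption):

```python
import numpy as np

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    boot_means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

# Example: normalized scores of one algorithm over all datasets and folds (dummy values)
normalized_scores = np.random.default_rng(0).normal(loc=0.9, scale=0.05, size=300)
low, high = bootstrap_ci(normalized_scores)
```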

In the plots below we will only show the best white-box/extractor pair for each model. The method without any extractor is always included as well (a comprehensive plot with all combinations can be found in the linked thesis). Black box algorithms are displayed in dark blue. White box algorithms are shown in cyan if applied to the raw data and in red if an extractor is used. Confidence intervals are shown as grey bars.

Selection of the best binary classification algorithms in terms of their normalized Somers’ D values.
Selection of the best multiclass classification algorithms in terms of their normalized weighted F1 scores.
Selection of the best regression algorithms in terms of their normalized explained variance.

For regression problems we see that all black box algorithms, with the exception of the neural network, performed significantly better than the white box methods. The xgboost algorithm was the best-performing algorithm: it was able to explain almost 10% more variance than the best white box model, ridge regression (with RBF-based kernel-PCA as its extractor).

For multiclass classification we see that all white box algorithms without any extractor performed significantly worse and could not compete with the other methods. While xgboost and the neural network have the highest mean scores, they only perform roughly 1% better than the three best white box algorithms: logistic regression, logistic lasso, and linear SVC. These three even outperformed the three black box methods random forest, kernel-SVC, and bagged trees.

For binary classification the xgboost algorithm achieved a 1% higher Somers’ D than the second-best model, a linear SVC with RBF kernel-PCA as its extractor. The following methods cluster in a similar range, from a normalized Somers’ D of 0.93 for the linear SVC down to 0.89 for the extra trees classifier. This cluster comprises the other black box methods (besides xgboost) and many white box algorithms, all with extractors. The next best algorithm (a decision tree without any extractor) already shows a significant drop in performance to 0.82.

Conclusion

From the benchmark we can draw the following three conclusions:

Extractors increase the performance of white box models. Principal component analysis and its kernel variants were the best feature extractors:

For most algorithms their use results in a substantial gain in performance. For many datasets, especially for binary classification, using the raw features without any extractor produced competitive results for white box models as well. Other extractors based on manifold learning, k-means and random projections performed worse. We speculate that the manifold methods might achieve a better extraction on unstructured datasets (texts, images) for which they were originally developed; on structured data the more straightforward transformations of PCA appear more appropriate.

The best algorithms: All black box algorithms performed well, especially xgboost. Lasso, ridge and elastic-net were generally the best white box models:

The boosting algorithm xgboost performed best in our analysis, which might explain its ubiquity in machine learning competitions. Generally, all black box algorithms performed well and achieved similar scores. The neural network performed well on classification tasks but showed lower performance on regression datasets (potentially due to the chosen network structure and parameters, so no premature general conclusions should be drawn). The support vector machine showed lower performance on regression and binary classification problems than most other black box algorithms. The extra trees were in a similarly lower range for the binary classification problems.

The models of linear structure, like linear and logistic regression, performed best of all white box models on both regression and classification tasks. Often the regularized methods (lasso, ridge, elastic-net) performed better than their unregularized versions. Linear support vector machines performed well on classification problems, but their regression version performed worse. LDA, though structurally similar to logistic regression, was not able to show a similarly high performance on classification tasks. The non-linear methods naive Bayes, QDA and the single decision tree were not able to compete, apparently unable to fully utilize the extracted features.

White box models with feature extraction were able to compete with black box models on classification problems; on regression problems the black box algorithms appeared superior

While the black box models typically showed the best performance, the top white box models (including feature extraction) had only marginally lower scores. This was apparent for both binary and multiclass classification.

We can conclude that for classification problems a simple model with proper (ideally manual) feature extraction can compete with a black box model. As a practical suggestion we would recommend benchmarking a manually constructed simple model against black box models. This can provide an indication of whether the manual feature extraction is adequate or could be improved.

An interesting follow-up study could examine whether feature extractors improve the performance of black box methods as well. One could also test whether applying different feature extractors to metric and categorical variables leads to different results, as this might better handle the different nature of these two variable types.

I hope that you found this benchmark interesting and can take away some lessons or inspirations for your next machine learning project. Let me know in the comments below if this coincides with your hands-on experience as well. Maybe some of you have experimented with your own benchmarks: What conclusions did you draw from them?

Disclaimer: All things stated in this article are of my own opinion and not of any employer.


I’m a Manager at d-fine GmbH. I primarily work with financial data and spend my day creating statistics and machine learning models.