Random Forest or XGBoost? It is Time to Explore LCE

LCE: Local Cascade Ensemble

Kevin Fauvel, PhD, CFA, CAIA
Towards Data Science



Over the past few years, Random Forest [Breiman, 2001] and XGBoost [Chen and Guestrin, 2016] have emerged as the best-performing machine learning methods for many classification and regression challenges. Practitioners face a recurrent question: which one should we choose for a given dataset? Local Cascade Ensemble (LCE) [Fauvel et al., 2022] is a new machine learning method that proposes an answer to this question. It combines the strengths of both methods and adopts a complementary diversification approach to obtain a better-generalizing predictor. LCE thus further enhances the prediction performance of both Random Forest and XGBoost.

This article presents LCE and the corresponding Python package, along with some code examples. The LCE package is compatible with scikit-learn: it passes check_estimator, so it can interact with scikit-learn pipelines and model selection tools.
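The compatibility claim can be verified directly with scikit-learn's estimator checks. A minimal sketch, assuming the package exposes LCEClassifier from the lce module:

from lce import LCEClassifier  # assumed import path for the lcensemble package
from sklearn.utils.estimator_checks import check_estimator

# Runs scikit-learn's suite of estimator checks; it raises an exception
# if the estimator does not comply with the scikit-learn API conventions.
check_estimator(LCEClassifier())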

LCE Presentation

The construction of an ensemble method involves combining accurate and diverse individual predictors. There are two complementary ways to generate diverse predictors: (i) by changing the training data distribution and (ii) by learning different parts of the training data.

We adopted these two diversification approaches in developing LCE. First, (i) LCE combines the two well-known methods that modify the distribution of the original training data with complementary effects on the bias-variance trade-off: bagging [Breiman, 1996] (variance reduction) and boosting [Schapire, 1990] (bias reduction). Then, (ii) LCE learns different parts of the training data to capture new relationships that cannot be discovered globally based on a divide-and-conquer strategy (a decision tree). Before detailing how LCE combines these methods, we introduce the key concepts behind them that will be used in the explanation of LCE.

The bias-variance trade-off defines the capacity of a learning algorithm to generalize beyond the training set. Bias is the component of the prediction error that results from systematic errors of the learning algorithm; a high bias means that the algorithm is not able to capture the underlying structure of the training set (underfitting). Variance measures the sensitivity of the learning algorithm to changes in the training set; a high variance means that the algorithm fits the training set too closely (overfitting). The objective is to minimize both bias and variance.

Bagging has its main effect on variance reduction: it is a method for generating multiple versions of a predictor (from bootstrap replicates of the training set) and aggregating them. The current state-of-the-art method that employs bagging is Random Forest. Boosting, in contrast, has its main effect on bias reduction: it is a method for iteratively learning weak predictors and adding them to create a final strong one. After a weak learner is added, the data weights are readjusted so that future weak learners focus more on the examples that previous weak learners mispredicted. The current state-of-the-art method that uses boosting is XGBoost. Figure 1 illustrates the difference between bagging and boosting.

Figure 1. Bagging versus Boosting on a dataset of plant diseases. n — number of estimators. Image by the author.
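To make the contrast concrete, the snippet below compares a bagging ensemble (Random Forest) and a boosting ensemble (XGBoost) with cross-validation. It is an illustrative sketch only: the dataset and hyperparameters are arbitrary, and it assumes the xgboost package is installed alongside scikit-learn.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)

# Bagging: fully-grown trees learned on bootstrap replicates, then aggregated
bagging_model = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting: shallow trees added sequentially, each focusing on previous errors
boosting_model = XGBClassifier(n_estimators=100, max_depth=3, random_state=0)

for name, model in [("Random Forest", bagging_model), ("XGBoost", boosting_model)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")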

LCE combines these boosting and bagging approaches to handle the bias-variance trade-off faced by machine learning models; in addition, it adopts a divide-and-conquer approach to individualize predictor errors on different parts of the training data. LCE is represented in Figure 2.

Figure 2. Local Cascade Ensemble on a dataset of plant diseases, following the color scheme of Figure 1 (Bagging in cadet blue, Boosting in red). n — number of trees, XGB — XGBoost. Image by the author.

Specifically, LCE is based on cascade generalization: it uses a set of predictors sequentially and, at each stage, adds new attributes to the input dataset. The new attributes are derived from the output of a predictor called a base learner (e.g., class probabilities for a classifier). LCE applies cascade generalization locally, following a divide-and-conquer strategy (a decision tree), and it reduces bias across the decision tree through the use of boosting-based predictors as base learners. The current best-performing state-of-the-art boosting algorithm is adopted as the base learner (XGBoost, e.g., XGB¹⁰, XGB¹¹ in Figure 2). When growing the tree, boosting is propagated down the tree by adding the output of the base learner at each decision node as new attributes to the dataset (e.g., XGB¹⁰() in Figure 2). Prediction outputs indicate the ability of the base learner to correctly predict a sample; at the next tree level, the outputs added to the dataset are exploited by the base learner as a weighting scheme to focus more on previously mispredicted samples.

Then, the overfitting generated by the boosted decision tree is mitigated by the use of bagging. Bagging provides variance reduction by creating multiple predictors from random sampling with replacement of the original dataset (e.g., D¹, D² in Figure 2). Finally, the trees are aggregated with a simple majority vote. In order to be applied as a predictor, LCE stores in each node the model generated by the base learner.
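The following deliberately simplified sketch illustrates a single level of this local cascade idea. It is not the LCE implementation (which adds the weighting scheme, bagging, and majority vote described above), and it uses scikit-learn's GradientBoostingClassifier as a stand-in for the XGBoost base learner to keep the sketch self-contained:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Root node: fit the boosting base learner and append its output
# (class probabilities) to the dataset as new attributes.
root_learner = GradientBoostingClassifier(n_estimators=10, random_state=0)
root_learner.fit(X, y)
X_aug = np.hstack([X, root_learner.predict_proba(X)])

# Divide: a depth-1 decision tree defines the split of the node.
splitter = DecisionTreeClassifier(max_depth=1, random_state=0)
children = splitter.fit(X_aug, y).apply(X_aug)  # leaf id of each sample

# Conquer: each child node exploits the augmented attributes (the previous
# predictions) when fitting its own boosting base learner.
for child_id in np.unique(children):
    mask = children == child_id
    if len(np.unique(y[mask])) > 1:  # a pure child simply becomes a leaf
        child_learner = GradientBoostingClassifier(n_estimators=10, random_state=0)
        child_learner.fit(X_aug[mask], y[mask])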

Missing Data

We opted to handle missing data natively. Similar to XGBoost, LCE excludes missing values when considering a split and uses block propagation: during a node split, block propagation sends all samples with missing data to the side of the decision node with fewer errors.
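As a rough illustration of the block-propagation idea (not LCE's actual implementation), the helper below, with a hypothetical name and signature, routes the whole block of missing-value samples to whichever child would misclassify fewer of them:

import numpy as np

def route_missing_block(x_feature, y, y_pred_left, y_pred_right):
    """Return 'left' or 'right' for the block of samples with missing values.

    x_feature   : values of the split feature (may contain np.nan)
    y           : true labels of the node's samples
    y_pred_left : label predicted for samples falling in the left child
    y_pred_right: label predicted for samples falling in the right child
    """
    missing = np.isnan(x_feature)
    errors_left = np.sum(y[missing] != y_pred_left)
    errors_right = np.sum(y[missing] != y_pred_right)
    return "left" if errors_left <= errors_right else "right"

# Example: three samples have a missing split feature; the left child
# predicts class 0 and the right child predicts class 1.
x = np.array([1.2, np.nan, 3.4, np.nan, np.nan])
y = np.array([0, 0, 1, 1, 0])
print(route_missing_block(x, y, y_pred_left=0, y_pred_right=1))  # 'left'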

Hyperparameters

The hyperparameters of LCE are the classical ones in tree-based learning (e.g., max_depth, max_features, n_estimators). In addition, LCE learns a specific XGBoost model at each node of a tree, and it only requires the ranges of the XGBoost hyperparameters to be specified. The hyperparameters of each XGBoost model are then set automatically by Hyperopt [Bergstra et al., 2011], a sequential model-based optimization that uses the tree of Parzen estimators (TPE) algorithm. Hyperopt chooses the next hyperparameters to evaluate based on the previous evaluations and this tree-based optimization algorithm. TPE meets or exceeds the performance of grid search and random search for hyperparameter setting. The full list of LCE hyperparameters is available in its documentation.
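LCE runs this optimization internally, so users only provide the ranges. As a standalone illustration of how Hyperopt's TPE explores a range, here is a toy objective standing in for the validation loss of an XGBoost base learner:

from hyperopt import fmin, hp, tpe

def objective(params):
    depth = int(params["max_depth"])
    # In LCE, this would be the loss of the XGBoost base learner trained
    # at the current node with these hyperparameters; here, a toy loss
    # that is minimal at max_depth = 4.
    return (depth - 4) ** 2

best = fmin(
    fn=objective,
    space={"max_depth": hp.quniform("max_depth", 1, 10, 1)},
    algo=tpe.suggest,
    max_evals=25,
)
print(best)  # e.g., {'max_depth': 4.0}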

Published Results

We initially designed LCE for a specific application [Fauvel et al., 2019] and then evaluated it on public UCI datasets [Dua and Graff, 2017] in [Fauvel et al., 2022]. The results show that, on average, LCE obtains better prediction performance than state-of-the-art classifiers, including Random Forest and XGBoost.

Python Package and Code Examples


Installation

LCE is available as a Python package (Python ≥ 3.7). It can be installed using pip:

pip install lcensemble

or conda:

conda install -c conda-forge lcensemble

Code Examples

As mentioned above, the LCE package is compatible with scikit-learn and can interact with scikit-learn pipelines and model selection tools. The following examples illustrate the use of LCE on public datasets for a classification task and a regression task; an example of LCE on a dataset including missing values is also shown.

  • Classification with LCE on the Iris Dataset

This example illustrates how to train an LCE model on the Iris dataset and apply it as a predictor. It also demonstrates the compatibility of LCE with scikit-learn model selection tools through the use of cross_val_score.
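A minimal version of this example, assuming the package exposes LCEClassifier from the lce module:

from lce import LCEClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Load data and generate a train/test split
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

# Train an LCEClassifier with default hyperparameters
clf = LCEClassifier(n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)
print("Accuracy: {:.1%}".format(accuracy_score(y_test, y_pred)))

# Compatibility with scikit-learn model selection tools
print(cross_val_score(clf, data.data, data.target, cv=3))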

  • Classification with LCE on the Iris Dataset with missing values

This example illustrates the robustness of LCE to missing values: the Iris train set is modified to contain 20% of missing values per variable.
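A minimal version of this example, under the same assumptions as above; the way the missing values are injected here is illustrative:

import numpy as np
from lce import LCEClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

# Introduce 20% of missing values per variable in the train set
rng = np.random.default_rng(0)
n_missing = int(0.2 * X_train.shape[0])
for j in range(X_train.shape[1]):
    rows = rng.choice(X_train.shape[0], size=n_missing, replace=False)
    X_train[rows, j] = np.nan

# LCE handles the missing values natively: no imputation step is needed
clf = LCEClassifier(n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print("Accuracy: {:.1%}".format(accuracy_score(y_test, clf.predict(X_test))))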

  • Regression with LCE

Finally, this example shows how LCE can be used for a regression task.
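A minimal version of this example, assuming the package exposes LCERegressor; the Diabetes dataset is used here purely as an illustration:

from lce import LCERegressor
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load data and generate a train/test split
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

# Train an LCERegressor with default hyperparameters
reg = LCERegressor(n_jobs=-1, random_state=0)
reg.fit(X_train, y_train)

# Make predictions on the test set
mse = mean_squared_error(y_test, reg.predict(X_test))
print("Mean squared error: {:.0f}".format(mse))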

Conclusion

This article introduces LCE, a new ensemble method for general classification and regression tasks, and the corresponding Python package. For more information about LCE, please refer to the associated paper published in the journal Data Mining and Knowledge Discovery.

References

J. Bergstra, R. Bardenet, Y. Bengio and B. Kégl. Algorithms for Hyper-Parameter Optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems, 2011.

L. Breiman. Bagging Predictors. Machine Learning, 24(2):123–140, 1996.

L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.

T. Chen and C. Guestrin. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

D. Dua and C. Graff. UCI Machine Learning Repository, 2017.

K. Fauvel, V. Masson, E. Fromont, P. Faverdin and A. Termier. Towards Sustainable Dairy Management — A Machine Learning Enhanced Method for Estrus Detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

K. Fauvel, E. Fromont, V. Masson, P. Faverdin and A. Termier. XEM: An Explainable-by-Design Ensemble Method for Multivariate Time Series Classification. Data Mining and Knowledge Discovery, 36(3):917–957, 2022.

R. Schapire. The Strength of Weak Learnability. Machine Learning, 5(2):197–227, 1990.
