
Feature Preprocessor in Automated Machine Learning


Getting Started

PCA and Feature Selection Strategies in AutoML

Photo by Markus Spiske on Unsplash

The performance of an automated machine learning (AutoML) workflow depends on how we process and feed different types of variables to the model, because most machine learning models accept only numerical variables. Categorical feature encoding is therefore a necessary step for any AutoML approach: it not only improves model quality but also enables better feature engineering.
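A minimal sketch of the encoding step this paragraph describes, on assumed toy data: scikit-learn estimators expect numeric input, so string categories are one-hot encoded before model fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical feature: a single column of string labels.
X = np.array([["red"], ["green"], ["blue"], ["green"]])

# handle_unknown="ignore" encodes categories unseen at fit time as all-zero
# rows at transform time instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(X).toarray()  # one dummy column per category
```

The four string values become a 4x3 numeric matrix (one column per distinct category), which any downstream model can consume.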

There are two major feature reduction strategies in AutoML: principal component analysis (PCA) and feature selection.

PCA strategy:

PCA is widely used in current AutoML frameworks because it reduces the dimensionality of large, inherently high-dimensional datasets to a size where applying machine learning becomes practical. However, it relies on linear relationships between feature elements, and those relationships are often hard to interpret. Because PCA "hides" feature elements that contribute little to the variance in the data, it can sometimes eliminate a small but significant differentiator that would affect the performance of a machine learning model.
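A hedged sketch of this behavior on synthetic data: PCA orders its components by explained variance, so directions carrying little variance are dropped, even though such directions might still matter to a downstream model.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Build 5-dimensional data whose variance lives almost entirely in 2
# directions: three extra columns are linear combinations of the first two,
# plus a small amount of noise.
X = np.hstack([
    base,
    base @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(200, 3)),
])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# The top 2 components capture nearly all the variance; the low-variance
# noise directions are discarded.
```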

The drawback of PCA becomes more apparent when an AutoML system copes with categorical features. Most AutoML frameworks use one-hot encoding, which easily generates high-dimensional dummy features when a categorical feature has many categories. That causes information loss and makes the pipeline hard to tune without manual diagnosis and intervention.
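A toy illustration of the blow-up described above: one-hot encoding a single high-cardinality column yields one dummy column per distinct category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with 1,000 distinct values (an ID-like field).
X = np.array([str(i % 1000) for i in range(5000)]).reshape(-1, 1)

n_dummies = OneHotEncoder().fit_transform(X).shape[1]
# A single input column becomes 1,000 sparse dummy columns.
```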

A typical PCA-based feature preprocessor uses a single encoder to handle categorical features and at least one PCA algorithm to implement feature reduction. This preprocessor design is widely applied in AutoML frameworks, e.g. Auto-ML and H2O AutoML. Auto-Sklearn also has a PCA ensemble component, which allows multiple PCA algorithms to generate input datasets for different pipelines.

PCA-based Feature Preprocessor(Image by Author)
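The single-encoder-plus-PCA pattern can be sketched with scikit-learn; the column names and sizes here are illustrative, not taken from any specific framework.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
    "color": rng.choice(list("abcd"), size=100),  # one categorical column
})

# One encoder for the categorical features, then a single PCA reduction step.
preprocessor = Pipeline([
    ("encode", ColumnTransformer([
        ("num", StandardScaler(), ["f1", "f2"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ])),
    ("reduce", PCA(n_components=3)),
])

X_out = preprocessor.fit_transform(X)  # 6 encoded columns -> 3 components
```

After this step the downstream model sees only principal components, so the link back to the original dummy columns is lost.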

Feature selection strategy:

To design a robust automated feature preprocessor, OptimalFlow chooses feature selection as the alternative strategy and introduces an ensemble encoding mechanism to cope with categorical features in the input data. This selection-based feature preprocessor with ensemble encoding (SPEE) system improves AutoML's adaptivity to multiple categorical features through the ensemble encoding method. Instead of PCA algorithms, it preserves feature relationship information and variance through ensemble feature selection algorithms.

SPEE Preprocessor(Image by Author)
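A minimal sketch of the selection-based idea (not OptimalFlow's actual implementation): a selector keeps a subset of the original columns, so the surviving features remain interpretable, unlike PCA components.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features

# Keep the 10 features with the highest ANOVA F-value against the label.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)  # indices of the original columns
```

The selected matrix is literally a subset of the original columns, so feature meaning, relationships, and variance are preserved for the rest of the workflow.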

For the feature selection ensemble component, SPEE uses 3 major algorithms:

  • Select K Best Features (SelectKBest) with, e.g., the ANOVA F-value between label and feature, or chi-squared statistics of non-negative features
  • Recursive Feature Elimination (RFE) with a logistic regression, SVM, or decision tree estimator
  • Recursive Feature Elimination with Cross-Validated selection (RFECV) with a logistic regression, SVM, or decision tree estimator
Algorithms summary in SPEE(Image by Author)
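The three selector families above can be combined with a simple majority vote; this sketch is illustrative and is not OptimalFlow's exact ensemble mechanism.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, RFECV, SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

selectors = [
    SelectKBest(score_func=f_classif, k=10),              # univariate ANOVA F
    RFE(DecisionTreeClassifier(random_state=0),
        n_features_to_select=10),                         # recursive elimination
    RFECV(DecisionTreeClassifier(random_state=0), cv=3),  # RFE + cross-validation
]

# Each selector casts one vote per feature it keeps; retain features
# chosen by at least two of the three selectors.
votes = np.zeros(X.shape[1], dtype=int)
for s in selectors:
    votes += s.fit(X, y).get_support().astype(int)

consensus = np.flatnonzero(votes >= 2)
```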

The figures below show that SPEE copes better with multiple categorical features in perceiving trends and patterns, which also benefits the downstream AutoML workflow. Here's the performance comparison plot based on the breast cancer dataset augmented with categorical features. OptimalFlow's SPEE approach performs favorably against the typical PCA-based preprocessor in AutoML, preserving feature trend information and variance across transformations.

(Image by Author)

References:

[1] OptimalFlow GitHub. https://github.com/tonyleidong/OptimalFlow.

[2] Jake Lever, Martin Krzywinski, and Naomi Altman. Principal component analysis. Nature Methods 14, pages 641–642, 2017.

[3] M. Pechenizkiy, A. Tsymbal, and S. Puuronen. PCA-based Feature Transformation for Classification: Issues in Medical Diagnostics. Proceedings of the 17th IEEE Symposium on Computer-Based Medical Systems, pages 535–540. IEEE, 2004.

[4] H2O.ai GitHub. https://github.com/h2oai/h2o-3.

[5] Auto-Sklearn GitHub. https://github.com/automl/auto-sklearn.
