
Auto-Sklearn: Scikit-Learn on Steroids

Automate the "boring" stuff. Accelerate your model development lifecycle.

Photo by Alexander Redl on Unsplash

Motivation

A typical machine learning workflow is an iterative cycle of data processing, feature processing, model training, and evaluation. Imagine having to experiment with different combinations of data processing methods, model algorithms, and hyperparameters until we reach a satisfactory model performance. This laborious and time-consuming search is known as hyperparameter optimization.

Model development lifecycle. Image by author.

Hyperparameter Optimization

The objective of hyperparameter optimization is to find the optimal model pipeline components and their associated hyperparameters. Let’s assume a simple model pipeline with two components: an imputer step followed by a random forest classifier.

Image by author

The imputer step has a hyperparameter called "strategy" which determines how the imputation is performed, e.g. using the mean, median, or mode. The random forest classifier has a hyperparameter called "depth" which determines the maximum depth of each decision tree in the forest. Our objective is to find the combination of hyperparameters across model pipeline components that gives the best result. Two common ways to do hyperparameter tuning are Grid Search and Random Search.
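Such a pipeline is straightforward to express in Scikit-Learn. Below is a minimal sketch; the step names and initial values are illustrative, and note that Scikit-Learn names these hyperparameters strategy and max_depth.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Two-step pipeline: impute missing values, then classify.
# "strategy" and "max_depth" are the hyperparameters we want to tune.
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('classifier', RandomForestClassifier(max_depth=4)),
])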

Grid Search

For each hyperparameter, we make a list of possible values and try all possible combinations of values. In the case of our simple example, we have 3 imputer strategies and 3 different random forest classifier depths to try, hence there are 9 different combinations in total.

Grid Search. Image by author.
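With the pipe defined earlier, grid search could look like the following sketch; the candidate values are illustrative (Scikit-Learn's SimpleImputer calls the mode strategy 'most_frequent').

from sklearn.model_selection import GridSearchCV

# Exhaustively evaluate all 3 x 3 = 9 combinations.
param_grid = {
    'imputer__strategy': ['mean', 'median', 'most_frequent'],
    'classifier__max_depth': [2, 4, 6],
}
grid_search = GridSearchCV(pipe, param_grid, cv=5)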

Random Search

In random search, we define a range or a set of choices for each hyperparameter, and sets of hyperparameters are randomly sampled within these boundaries. In the case of our simple example, the range for depth is between 2 and 6, and the choices for imputer strategy are mean, median, or mode.

Random Search. Image by author.
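A corresponding sketch with the same illustrative search space, again reusing pipe:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Sample 9 hyperparameter sets at random from the defined choices and range.
param_distributions = {
    'imputer__strategy': ['mean', 'median', 'most_frequent'],
    'classifier__max_depth': randint(2, 7),  # integers from 2 to 6 inclusive
}
random_search = RandomizedSearchCV(pipe, param_distributions, n_iter=9, cv=5, random_state=1)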

Notice that the sets of hyperparameters in Grid and Random Search are selected independently of one another. Neither method uses the results of prior training and evaluation trials to improve the next trial. A more efficient approach is to use the results of prior trials to guide the selection of hyperparameters for the next trial. This is the approach taken by Bayesian optimization.

Bayesian Optimization

Bayesian optimization stores previously searched hyperparameters and the corresponding results of a predefined objective function (e.g. binary cross-entropy loss) and uses them to build a surrogate model. The purpose of the surrogate model is to quickly estimate how well the actual model would perform given a particular set of candidate hyperparameters. This lets us decide whether a set of candidate hyperparameters is worth training the actual model with. As the number of trials increases, the surrogate model, updated with additional trial results, improves and starts to recommend better candidate hyperparameters.
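To make the idea concrete, here is a sketch of a Bayesian optimization loop using the scikit-optimize library with a Gaussian process surrogate, reusing the pipe from the earlier sketch. This illustrates the general technique, not Auto-Sklearn's internals (Auto-Sklearn uses SMAC); the toy data and search space are assumptions.

from skopt import gp_minimize
from skopt.space import Categorical, Integer
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X_train, y_train = make_classification(n_samples=200, random_state=0)  # toy data

# Search space for the two hyperparameters of the pipe defined earlier.
space = [Categorical(['mean', 'median', 'most_frequent']),  # imputer strategy
         Integer(2, 6)]                                     # tree depth

def objective(params):
    strategy, depth = params
    pipe.set_params(imputer__strategy=strategy, classifier__max_depth=depth)
    # gp_minimize minimizes, so return the negated cross-validation score.
    return -cross_val_score(pipe, X_train, y_train, cv=5).mean()

# Each new candidate is proposed by the surrogate model based on prior trials.
result = gp_minimize(objective, space, n_calls=30, random_state=1)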

Bayesian optimization suffers from a cold-start problem because it requires trial data to build the surrogate model before it can recommend good candidate hyperparameters for the next trial. Since there are no historical trials for the surrogate model to learn from at the beginning, the initial candidate hyperparameters are selected at random, which leads to a slow start in finding well-performing hyperparameters.

To overcome the cold-start problem, Auto-Sklearn, an open-source AutoML library, incorporates a warm start into Bayesian optimization through a process called meta-learning, which provides hyperparameter instantiations that are better than random.

Auto-Sklearn

Automated Machine Learning (AutoML) is the process of automating tasks in the machine learning pipeline such as data preprocessing, feature preprocessing, hyperparameter optimization, model selection, and evaluation. Auto-Sklearn automates these tasks for the popular Scikit-Learn machine learning framework. The image below shows how Auto-Sklearn works in a nutshell.

Auto-Sklearn. Image from [1].

Auto-Sklearn uses Bayesian optimization with warm start (meta-learning) to find the optimal model pipeline and build an ensemble from the individual model pipelines at the end. Let’s examine the different components in the Auto-Sklearn framework.

Meta-Learning

The purpose of meta-learning is to find good instantiations of hyperparameters for Bayesian optimization so that it performs better than random at the start. The intuition behind meta-learning is simple: datasets with similar meta features perform similarly on the same set of hyperparameters. Meta features, as defined by the Auto-Sklearn authors, are "characteristics of the dataset that can be computed efficiently and that help to determine which algorithm to use on a new dataset".

During offline training, a total of 38 meta features, such as skewness, kurtosis, number of features, and number of classes, were tabulated for 140 reference datasets from OpenML. Each reference dataset was run through the Bayesian optimization process and the results were evaluated. The hyperparameters that gave the best results for each reference dataset are stored, and they serve as instantiations for the Bayesian optimizer on new datasets with similar meta features.

When training a model on a new dataset, the meta features of the new dataset are tabulated and the reference datasets are ranked by their L1 distance to the new dataset in meta-feature space. The stored hyperparameters from the 25 nearest reference datasets are used to instantiate the Bayesian optimizer.
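The ranking step itself is simple. Here is a toy sketch, with random numbers standing in for real meta features:

import numpy as np

# Toy stand-ins: 140 reference datasets x 38 meta features, plus one new dataset.
rng = np.random.default_rng(0)
meta_features_ref = rng.random((140, 38))
meta_features_new = rng.random(38)

# Rank reference datasets by L1 distance to the new dataset in meta-feature space.
l1_distances = np.abs(meta_features_ref - meta_features_new).sum(axis=1)
nearest_25 = np.argsort(l1_distances)[:25]  # indices of the 25 closest datasets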

The authors experimented with different variants of Auto-Sklearn on the reference datasets and compared them using the average ranking across different training durations. A lower rank indicates better performance. Variants with meta-learning (blue and green) show a steep drop in rank at the start due to good initialization of the Bayesian optimizer.

Figure 1: Comparison of different Auto-Sklearn variants. Image from [1].

Data Pre-processors

Auto-Sklearn preprocesses the data in the following order [2]:

  1. One-hot encoding of categorical features
  2. Imputation using mean, median or mode
  3. Rescaling of features
  4. Balancing of the dataset using class weights

Feature Pre-processors

After data pre-processing, features may be optionally pre-processed with one or more of the following categories of feature pre-processors [2].

  1. Matrix decomposition using PCA, truncated SVD, kernel PCA or ICA
  2. Univariate feature selection
  3. Classification-based feature selection
  4. Feature clustering
  5. Kernel approximations
  6. Polynomial feature expansion
  7. Feature embeddings
  8. Sparse representation and transformation

Ensemble

During the training process, Auto-Sklearn trains multiple individual models which can be used to construct an ensemble model. Ensemble models combine the weighted outputs of multiple trained models to provide a final prediction. They are known to be less prone to overfitting and generally outperform single models.

Figure 1 shows that the variants that use an ensemble perform better than those without one (black vs red and green vs blue). The variant with both meta-learning and an ensemble (green) performs best.

Code

Let’s take a look at some practical examples of Auto-Sklearn in action.

Install the package

pip install auto-sklearn==0.13

Import packages

import pandas as pd
import sklearn.metrics
from joblib import dump, load
from sklearn.model_selection import train_test_split, StratifiedKFold
from autosklearn.classification import AutoSklearnClassifier
from autosklearn.metrics import (accuracy,
                                 f1,
                                 roc_auc,
                                 precision,
                                 average_precision,
                                 recall,
                                 log_loss)

Load the dataset

We will be using a dataset from UCI which describes a bank’s marketing campaign offering clients a term deposit. The target variable is yes if the client agrees to place a term deposit and no otherwise. You can find the original dataset here.

We read the dataset as a Pandas dataframe.

df = pd.read_csv('bank-additional-full.csv', sep = ';')

Prepare the data

Auto-Sklearn requires us to identify whether a column is numerical or categorical, either in the pandas dataframe or later in the fit function. Let’s convert the columns now.

num_cols = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
cat_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']
df[num_cols] = df[num_cols].apply(pd.to_numeric)
df[cat_cols] = df[cat_cols].apply(pd.Categorical)
y = df.pop('y')
X = df.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1, stratify=y)

Instantiate the classifier

skf = StratifiedKFold(n_splits=5)

clf = AutoSklearnClassifier(time_left_for_this_task=600,
                            max_models_on_disc=5,
                            memory_limit=10240,
                            resampling_strategy=skf,
                            ensemble_size=3,
                            metric=average_precision,
                            scoring_functions=[roc_auc, average_precision, accuracy, f1, precision, recall, log_loss])

Below are some of the parameters used in AutoSklearnClassifier.

time_left_for_this_task: Limit the total training time (in seconds)

max_models_on_disc: Limit the number of models to keep

memory_limit: The amount of memory (in MB) which we want to utilize

resampling_strategy: holdout or various kinds of cross validation; refer to the Auto-Sklearn API documentation [3]

ensemble_size: Number of models to include in the ensemble. Auto-Sklearn can build an ensemble after the individual models are trained by combining the top ensemble_size models in a weighted fashion.

metric: A metric which we want to optimize

scoring_functions: One or more metrics which we want to evaluate the model on

Fit the classifier

clf.fit(X = X_train, y = y_train)

Under the hood, Auto-Sklearn constructs a Scikit-Learn pipeline during each trial. A Scikit-Learn pipeline assembles a series of data processing and feature processing steps followed by an estimator (classifier or regressor). The fit function triggers Auto-Sklearn to construct, fit, and evaluate multiple Scikit-Learn pipelines until the stopping criterion time_left_for_this_task is met.
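Once fitting completes, we can peek at the constructed pipelines. For example, show_models() prints the pipelines that made it into the final ensemble (the exact output format varies across auto-sklearn versions):

print(clf.show_models())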

Results

We can view the results and the chosen hyperparameters.

df_cv_results = pd.DataFrame(clf.cv_results_).sort_values(by = 'mean_test_score', ascending = False)
df_cv_results
Cross validation results and parameters. Image by author.

We can also compare all the trials on the leaderboard.

clf.leaderboard(detailed = True, ensemble_only=False)
Leaderboard. Image by author.

We can view which pipelines were selected for the ensemble using

clf.get_models_with_weights()

This method returns a list of tuples [(weight_1, model_1), ..., (weight_n, model_n)]. Each weight indicates how much the ensemble weighs the output of the corresponding model, and all weights sum up to 1.
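As a sketch of what these weights mean (assuming the fitted pipelines expose predict_proba), the ensemble output can be reconstructed manually as a weighted average:

import numpy as np

# Weighted average of each ensemble member's predicted probabilities.
manual_proba = np.sum(
    [weight * model.predict_proba(X_test)
     for weight, model in clf.get_models_with_weights()],
    axis=0)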

We can also view additional training statistics.

clf.sprint_statistics()

Refit with all the training data

During k-fold cross validation, Auto-Sklearn fits each model pipeline k times on the dataset for evaluation only; it does not keep any of the trained models. We therefore need to call the refit method to fit the model pipelines found during cross validation on all of the training data.

clf.refit(X = X_train, y = y_train)

Save Model

dump(clf, 'model.joblib')

Load Model and Predict

Let’s load the saved model pipeline for inference.

clf = load('model.joblib')
y_probas = clf.predict_proba(X_test)
pos_label = 'yes'
y_proba = y_probas[:, clf.classes_.tolist().index(pos_label)]
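We can then evaluate the metric we optimized for on the held-out test set, for example with Scikit-Learn's average_precision_score:

from sklearn.metrics import average_precision_score

# Binarize the labels against the positive class before scoring.
test_ap = average_precision_score(y_test == pos_label, y_proba)
print(f'Test average precision: {test_ap:.3f}')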

Conclusion

Searching for the optimal model pipeline components and hyperparameters is a non-trivial task. Fortunately, there are AutoML solutions such as Auto-Sklearn which can help automate the process. In this article, we examined how Auto-Sklearn uses meta-learning and Bayesian optimization to find the optimal model pipeline and construct a model ensemble. Auto-Sklearn is one of many AutoML packages out there; check out other alternatives such as H2O AutoML.


You can find the demo code used in this article here.

References

[1] M. Feurer et al., Efficient and Robust Automated Machine Learning (2015), NeurIPS

[2] M. Feurer et al., Supplementary Material for Efficient and Robust Automated Machine Learning (2015), NeurIPS

[3] Auto-Sklearn API documentation
