
Auto-Sklearn: An AutoML tool based on Bayesian Optimization

Searching for the optimal pipeline through Meta-Learning, Bayesian Optimization, and Ensemble Learning

Figure 1. Auto-Sklearn | Image by Author | Icons taken from source

There are plenty of alternatives when trying to find the right ML model as well as the right set of hyperparameters, so which one is the best option? Maybe there is no unique answer. This time we are going to talk about Auto-Sklearn, the AutoML tool that implements Bayesian Optimization to search for the optimal pipeline configuration and Ensemble Selection to choose the right model. So, this blog will be divided as follows:

  • What is Auto-Sklearn?
  • Auto-Sklearn in practice

What is Auto-Sklearn?

Auto-Sklearn is an open-source project developed by Matthias Feurer, et al. [1] and made public in 2015 in their paper "Efficient and Robust Automated Machine Learning". As an AutoML tool, Auto-Sklearn tries to provide the optimal pipeline for a given dataset, specifically by covering the data transformation, model selection, and hyperparameter optimization tasks. Auto-Sklearn is mainly built from scikit-learn models; specifically, it is composed of 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods.

Finding the optimal pipeline is a complex task due to the diversity of models and parameters that must be considered. An "optimal pipeline" can be obtained through exhaustive techniques such as Grid Search; however, this is not a suitable solution because the search space is limited to a set of fixed values. Other techniques based on more sophisticated optimization algorithms have been proposed as well, such as TPOT, which aims to find the optimal pipeline configuration through Genetic Algorithms [2]. TPOT can find a good pipeline in a reasonable time, but for datasets with certain characteristics the optimization can take days. In contrast, Auto-Sklearn implements Bayesian Optimization to search for the optimal pipeline; Bayesian Optimization can be thought of as a slow technique, but Auto-Sklearn addresses this in a proper manner, as we will see next.

Figure 2. Auto-Sklearn Architecture | Image by Author

The Auto-Sklearn architecture is composed of 3 phases: meta-learning, Bayesian optimization, and ensemble selection. The key idea of the meta-learning phase is to reduce the search space by learning from models that performed well on similar datasets. Right after, the Bayesian optimization phase takes the search space produced in the meta-learning step and builds Bayesian models to find the optimal pipeline configuration. Finally, an ensemble selection model is created by reusing the most accurate models found in the Bayesian optimization step. Figure 2 describes the Auto-Sklearn architecture.

Auto-Sklearn is a robust tool that integrates these 3 stages in the search for the optimal pipeline. However, it is important to mention that both phase 1 (meta-learning) and phase 3 (ensemble selection) can be configured according to different needs; we will see this in detail in the next section.

Great, so far we know what Auto-Sklearn is, what its components are, and how it works. Now let's see how to use it in practice, let's go for it!

Auto-Sklearn in practice

The idea of the following example is to show the usability of the autosklearn library as well as some configurations for manipulating phase 1 (meta-learning) and phase 3 (ensemble selection). For this example, we are going to make use of the "breast cancer" toy dataset.

So first we are going to import some libraries and split the dataset into train and test:
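A minimal sketch of this setup could look like the following (the test size and random seed are illustrative assumptions; the loader is scikit-learn's load_breast_cancer):

# Load the breast cancer toy dataset and split it into train and test sets
from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)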

As you can notice, we are importing AutoSklearnClassifier since this is a classification problem. Then we only need to instantiate the classifier and we will be able to train the model, pretty easy, right?
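As a minimal sketch, instantiation and training with the defaults looks like this:

# Instantiate the classifier with default parameters and train it
model = AutoSklearnClassifier()
model.fit(x_train, y_train)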

Since we are not passing any argument to the classifier, AutoSklearn uses the default parameters, which is not a good practice. As was mentioned in the previous section, AutoSklearn allows us to manipulate the meta-learning step as well as the ensemble selection.

In order to manipulate the number of configurations obtained from the meta-learning step, we need to provide a value for the parameter:

initial_configurations_via_meta_learning : int (default=25)

As we can observe, the default value is 25, which means that 25 configurations will be taken as the starting point of the Bayesian optimization step. If you don't want to take any configuration as a starting point (i.e. if you want to start the optimization from scratch), you can set this value to zero.

On the other hand, if you want to manipulate the number of models to be considered in the Ensemble Selection, you only need to modify this parameter:

ensemble_size : int (default=50)

As we can observe, the default number of models to be added to the ensemble is 50 (which is a large number, at least for small datasets). You can try different values in order to find the optimal one according to your needs (remember that ensemble learning is a good technique for improving accuracy; however, there is a risk of overfitting the model).

So, if we don't want to use any configuration as a starting point (meta-learning) and want only one model in the ensemble, the initialization would look like this:
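Here is a sketch of that initialization, using the two parameters described above:

# No meta-learning warm starts; keep a single model instead of an ensemble
model = AutoSklearnClassifier(
    initial_configurations_via_meta_learning=0,
    ensemble_size=1,
)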

Now let's talk about "time limits". AutoSklearn provides a set of parameters to control the time budget for the entire optimization as well as the time allowed for each model evaluation. These flags are:

# Time limit for the entire optimization
time_left_for_this_task: int (default=3600)
# Time limit for each model evaluation
per_run_time_limit: int (1/10 of time_left_for_this_task)

If your dataset is small, you should consider decreasing these flags; otherwise, the optimization process may take a while. So, say we want to set 300 seconds as the time limit for the entire optimization and only 30 seconds per model evaluation; the class initialization would look like this:
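A sketch of that initialization:

# 300-second budget for the whole search, 30 seconds per candidate model
model = AutoSklearnClassifier(
    time_left_for_this_task=300,
    per_run_time_limit=30,
)
model.fit(x_train, y_train)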

Finally, testing is quite simple, exactly as you would do with sklearn models:

>>> model.score(x_train, y_train)
0.960093896713615
>>> model.score(x_test, y_test)
0.965034965034965

If you want to see a summary, you only need to type:

>>> print(model.sprint_statistics())
auto-sklearn results:
  Dataset name: ff54bc0cfe4c3dc32e4cbba909d41e5a
  Metric: accuracy
  Best validation score: 0.964539
  Number of target algorithm runs: 62
  Number of successful target algorithm runs: 55
  Number of crashed target algorithm runs: 4
  Number of target algorithms that exceeded the time limit: 3
  Number of target algorithms that exceeded the memory limit: 0

If you want to learn more about the AutoSklearn parameters, it is worth taking a look at the documentation: https://automl.github.io/auto-sklearn/master/api.html

From my side, that is it!

Conclusion

In this blog we have seen what Auto-Sklearn is, what its components are, how it works, and a practical example.

In my opinion, Auto-Sklearn works as one more alternative for finding the optimal configuration of your pipeline, considering the risks this entails; that is, it is necessary to provide an adequate set of parameters to avoid blowing up the computational time or overfitting the model. If you are working with serious datasets, you can consider this kind of tool as a baseline.

Thank you so much for reading, see you next time!

References

[1] M. Feurer et al., "Efficient and Robust Automated Machine Learning," NeurIPS 2015. https://proceedings.neurips.cc/paper/2015/file/11d0e6287202fced83f79975ec59a3a6-Paper.pdf

[2] TPOT, Epistasis Lab. https://epistasislab.github.io/tpot/

