Hyper Parameter Tuning’s mission is to pick the best model configuration with respect to an objective. We have seen in a previous post that even though it seems a challenging task, it’s not that hard to code when using a model-based approach.
But wait! If HP Tuning can pick the best parameters for a given ML task, can we reuse it to pick the best model? I.e., is it possible to treat the model type as a parameter like any other, and let Hyper Parameter Optimisation choose the right one for us?
Wouldn’t that be a huge speedup in solving ML problems? Not only is the answer a definitive yes, but it’s also not very complex to implement.
We’ll explore that in detail in this article.
Quick reminder on model-based HP Tuning
Hyper Parameter Tuning’s (HPT, sometimes also referred to as Hyper Parameter Optimisation, HPO) goal is to find the configuration that maximises or minimises a given objective. There are various ways to perform Hyper Parameter Tuning: Brute force, Random Search, Bayesian Search, or model-based methods.
I’ve previously advocated for the use of model-based methods, which turned out to be very efficient on some problems I encountered. You can find more detail on SMAC, an efficient library for HPO, here:
You might also be interested in having a better understanding of how these methods work and building your own Hyper Parameter Optimisation library.
Here is a fun way to do it, exploiting your knowledge of standard models like XGBoost, CatBoost or RandomForest:
Tuning XGBoost with XGBoost: Writing your own Hyper Parameters Optimization engine
In addition to being mind-blowing (using ML models to tune ML models), model-based Hyper Parameter Optimisation offers a very interesting advantage over other solutions: it supports categorical parameters. This means that we can encode model type as a categorical parameter.
That said, let’s see how we can hack standard Hyper Parameter Tuning to perform Model Selection and speed up model building.
Hyper Model
As mentioned in the introduction, the idea behind this hack is the following: can we consider model_type (i.e. XGBoost, Prophet, S/ARIMA, …) as a parameter like any other, and let the Hyper Parameter Tuning method do the job for us?
Happily, as stated above, Hyper Parameter Optimisation using model-based methods supports categorical parameters. After all, their underlying model is usually a Boosted (or not) Decision Tree.
A direct consequence of this property is that we can create a HyperModel that would be defined by, amongst other parameters, a model type. The model_type parameter will encode the underlying model used for training. In Python, this gives:
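Below is a minimal sketch of what such a HyperModel could look like (the exact class in the original code may differ; the parameter names and whitelists here are illustrative):

```python
# Minimal sketch of a HyperModel regressor, assuming xgboost and scikit-learn
# are installed. Parameter names and whitelists are illustrative.
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor


class HyperModel(BaseEstimator, RegressorMixin):
    """A regressor whose underlying model is itself a (categorical) parameter."""

    # whitelist of parameters applicable to each underlying model
    WHITELIST = {
        "xgboost": {"n_estimators", "max_depth", "learning_rate", "gamma"},
        "random_forest": {"n_estimators", "max_depth", "max_features"},
    }

    def __init__(self, model_type="xgboost", n_estimators=100, max_depth=3,
                 learning_rate=0.1, gamma=0.0, max_features=1.0):
        self.model_type = model_type
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.max_features = max_features

    def _make_model(self):
        # keep only the parameters that apply to the chosen underlying model
        params = {k: v for k, v in self.get_params().items()
                  if k in self.WHITELIST[self.model_type]}
        if self.model_type == "xgboost":
            return XGBRegressor(**params)
        return RandomForestRegressor(**params)

    def fit(self, X, y):
        self.model_ = self._make_model()
        self.model_.fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)
```

Note that inheriting from BaseEstimator gives us get_params and set_params for free, which is exactly what the search methods below rely on.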
This implementation of HyperModel follows the scikit-learn model interface, i.e. it offers fit, predict, set_params and get_params methods. For the sake of simplicity, we only support two models: XGBoost and RandomForest, but adding another one would only take a few more lines. You should give SVR a try, for instance.
As you have probably noted, we use a whitelist to keep only those parameters that are applicable to a given model. This is not the best way to handle the fact that parameters differ from one model to the other. The right way to do it would be to use conditional configuration. This is supported by the ConfigSpace python library, but that doesn’t work out of the box with scikit-learn.
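For reference, a conditional search space with ConfigSpace could be sketched as follows (hyperparameter names and ranges are illustrative, and plugging this into scikit-learn search classes would still require extra glue code):

```python
# Sketch of a conditional configuration space with the ConfigSpace library.
from ConfigSpace import ConfigurationSpace
from ConfigSpace.hyperparameters import (
    CategoricalHyperparameter,
    UniformFloatHyperparameter,
    UniformIntegerHyperparameter,
)
from ConfigSpace.conditions import EqualsCondition

cs = ConfigurationSpace()
model_type = CategoricalHyperparameter("model_type", ["xgboost", "random_forest"])
n_estimators = UniformIntegerHyperparameter("n_estimators", 10, 500)
gamma = UniformFloatHyperparameter("gamma", 1e-6, 1.0, log=True)
max_features = UniformFloatHyperparameter("max_features", 0.1, 1.0)
cs.add_hyperparameters([model_type, n_estimators, gamma, max_features])

# gamma is only meaningful for XGBoost, max_features only for RandomForest
cs.add_condition(EqualsCondition(gamma, model_type, "xgboost"))
cs.add_condition(EqualsCondition(max_features, model_type, "random_forest"))
```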
Using our new class for training and prediction is straightforward. Let’s say that we want to use XGBoost as the underlying model. This gives:
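A hypothetical usage example, assuming the HyperModel sketch above and pre-split arrays X_train, y_train and X_test:

```python
# Use the HyperModel sketch with XGBoost as the underlying model
# (X_train, y_train, X_test are assumed to be already prepared).
model = HyperModel(model_type="xgboost", n_estimators=200, max_depth=4)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```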
All we have to do now is use this model in an HPT/HPO step and let it choose the best candidate model type.
Finding the best model
Armed with our HyperModel whose main parameter is a model type, we can use standard Hyper Parameter Tuning methods to identify the best model.
An important point to remember is that there is no "best model" in absolute terms. When I write "the best model", I mean the best model for a given score. In this article, we are going to use Mean Absolute Error as the score.
Though I’ve been using (and advising) SMAC or a custom Hyper Parameter Optimisation implementation to perform HP tuning in my two previous posts on the subject, we’ll try another method in this article. Never miss an opportunity to try something new 🙂
This time, we are going to use BayesSearchCV to explore the configuration space. The underlying principle of Bayesian Search is to build a surrogate model, using Gaussian Processes, that estimates the score of a configuration.
Each new training updates the posterior knowledge of the surrogate model. The surrogate is then evaluated on randomly picked configurations, and the one predicted to give the best score is selected for the next real training.
As it uses a Gaussian Process model to learn the relation between Hyper Parameters and the score of candidate models, it can be considered a model-based approach.
Putting it all together gives the following lines of code:
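Here is a sketch of what that could look like with scikit-optimize’s BayesSearchCV, reusing the HyperModel sketch above (the Boston data is fetched from OpenML since load_boston was removed from recent scikit-learn; the ranges are illustrative):

```python
# Sketch: model selection with BayesSearchCV over the HyperModel defined above.
from sklearn.datasets import fetch_openml
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer, Real

# Workaround: load the Boston housing data from OpenML
# (sklearn.datasets.load_boston is no longer available in recent versions).
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data.astype(float).to_numpy()
y = boston.target.astype(float).to_numpy()

search_spaces = {
    "model_type": Categorical(["xgboost", "random_forest"]),
    "n_estimators": Integer(10, 500),                       # uniform
    "max_depth": Integer(2, 10),                            # uniform
    "max_features": Real(0.1, 1.0),                         # uniform
    "learning_rate": Real(1e-3, 1.0, prior="log-uniform"),  # spans orders of magnitude
    "gamma": Real(1e-6, 1.0, prior="log-uniform"),          # spans orders of magnitude
}

opt = BayesSearchCV(
    HyperModel(),
    search_spaces,
    n_iter=50,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
opt.fit(X, y)
print(opt.best_params_, -opt.best_score_)
```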
The configuration space is defined mostly using uniform distributions for parameters. This means that the probability of picking a value in a given range is the same everywhere. This is the case for max_features, n_estimators or max_depth for instance. In contrast, gamma and learning_rate, which span multiple orders of magnitude, are picked using a log-uniform distribution.
Running this piece of code will show you that XGBoost seems to be the best option for this dataset.
Checking
Like me, you probably don’t trust the result of an algorithm before having performed some checks.
Fortunately, the Boston dataset challenge has already been studied by many other data scientists. More specifically, it has been studied on Kaggle by Shreayan Chaudhary, who comes to the same conclusion as our algorithm. That’s good news.
Nevertheless, we are never too cautious. Let’s perform another simple check to ensure that RandomForest cannot outperform XGBoost if we give more iterations to the HP Tuning exploration and optimize only for Random Forest:
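A sketch of this check, reusing the hypothetical search space above but restricting model_type to Random Forest:

```python
# Same search as before, but restricted to Random Forest only (sketch).
rf_spaces = dict(search_spaces)
rf_spaces["model_type"] = Categorical(["random_forest"])
rf_spaces.pop("learning_rate")  # XGBoost-only parameters are no longer needed
rf_spaces.pop("gamma")

rf_opt = BayesSearchCV(
    HyperModel(),
    rf_spaces,
    n_iter=50,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
rf_opt.fit(X, y)
print(-rf_opt.best_score_)  # mean absolute error of the best Random Forest
```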
We simply reuse our HyperModel, but this time we force the exploration to focus on a single model: RandomForestRegressor. We also effectively give it a bigger budget: all 50 iterations are devoted to Random Forest alone, whereas the previous 50 iterations were shared between both models.
The conclusion remains the same. XGBoost remains more accurate with respect to mean absolute error: 2.68 for Random Forest vs 2.57 for XGBoost.
We might also have been "lucky": the fact that XGBoost outperforms RandomForestRegressor could be purely random and linked to the initial seed used to initialize the Bayesian Search: random_state=0.
Running the code multiple times with various random_state values should convince you that we were not lucky and that XGBoost is, in this case, the best option with respect to the score that we chose: Mean Absolute Error.
How fast does our automated model selection discard bad options?
Another aspect that is very interesting to consider is the speed at which our model selection discards lame ducks. To illustrate this point, we are going to add another model to the list of models supported by HyperModel: LinearRegression.
Intuitively, this is the worst option to consider. Let’s see how much time our automatic model selection spends exploring this possibility. First, we add it to HyperModel:
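A sketch of this extension, built on the hypothetical HyperModel above (subclassing is used here for brevity; LinearRegression gets an empty whitelist since we don’t tune it):

```python
# Sketch: extend the hypothetical HyperModel with LinearRegression.
from sklearn.linear_model import LinearRegression


class HyperModelV2(HyperModel):
    """HyperModel extended with LinearRegression as a third candidate."""

    WHITELIST = {**HyperModel.WHITELIST, "linear_regression": set()}

    def _make_model(self):
        if self.model_type == "linear_regression":
            return LinearRegression()  # no hyperparameters tuned here
        return super()._make_model()


# add the new candidate to the search space defined earlier
search_spaces["model_type"] = Categorical(
    ["xgboost", "random_forest", "linear_regression"]
)
# the BayesSearchCV call is unchanged, except that it wraps HyperModelV2()
```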
We are going to use a dirty hack to measure the time spent exploring each configuration: we just print it (in line 42 of the code, believe it or not; see the sketch after the stats), and performing a few greps on the output gives us the following stats:
- LinearRegression was explored 3 times.
- Random Forest 17 times.
- XGBoost 30 times.
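For illustration, the "dirty hack" could be as simple as a print inside fit plus a grep on the captured output (a sketch with hypothetical names):

```python
# Sketch of the "dirty hack": print the chosen model type and the fit time at
# every training, then count occurrences with grep on the captured output.
import time


class VerboseHyperModel(HyperModelV2):
    def fit(self, X, y):
        start = time.time()
        result = super().fit(X, y)
        print(f"model_type={self.model_type} fit_time={time.time() - start:.2f}s")
        return result

# Then, for example:
#   python run_search.py | grep -c "model_type=linear_regression"
```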
It’s remarkable to see that in only 50 iterations, our model selection has learned to focus on the most promising model, XGBoost, and discarded Linear Regression after only 3 iterations.
We could have excluded Linear Regression even more quickly if we had used conditional configuration with ConfigSpace instead of a whitelist of allowed parameters; the whitelist approach artificially increases the number of unnecessary trials.
Conclusion
Extending Hyper Parameter Tuning to model selection works, and it only took a few lines of code to demonstrate it.
We have been using Bayesian search to explore the configuration space, but we could have used any other effective Hyper Parameter Tuning method.
Another idea that could be worth trying is to use Hyper Parameter Optimisation methods for feature selection. Odds are that it would work efficiently too.