Automated Machine Learning for Gradient Boosting Machines

How you can level up your hyperparameter tuning for Gradient Boosting Machines with NNI

Nicolas Kuhaupt
Towards Data Science


Gradient Boosting Machines like XGBoost and LightGBM are among the most popular machine learning algorithms today and are often found among Kaggle competition winners. Although they work well out of the box, some performance gains can be achieved by tuning the parameters of the algorithm. At the same time, there are too many parameters to set them intuitively or to rely on quick trial and error. AutoML comes to the rescue.

AutoML — Automated Machine Learning — aims to automate machine learning. Here, we focus on AutoML for hyperparameter optimization. We will look at three different algorithms, each of which offers different advantages: Metis, Bayesian Optimization Hyperband (BOHB), and Population Based Training (PBT). To the best of my knowledge, the last one has not been adapted for Gradient Boosting Machines yet. As we will see, the combination of PBT and GBM has some interesting implications.

You can find the full code on GitHub: https://github.com/NKDataConv/AutoML_Gradient_Boosting_Machines. We will use the AutoML algorithms from the tool NNI. You can find the instructions for installing NNI here. Furthermore, we will need LightGBM; you can find its installation guide here.

Let us start with Metis and understand how to adapt the algorithm training for NNI.

Metis

The idea of Metis is simple. Metis builds a model of parameters versus performance. After some initial random trials, the model can extrapolate to promising parameter configurations. Those are tried, and the model is updated with the achieved performance. Under the hood, Metis uses Gaussian Processes as its model, and the update happens via Bayesian optimization.

To adapt your algorithm for Metis, all you need to do is add two files that describe the experiment and integrate two methods into your existing code. Let's look at the methods first: nni.get_next_parameter() gives the next parameter configuration to try out. Secondly, nni.report_final_result() reports the achieved performance back.

In the following code snippet, you can see how those two methods play together with the rest of the training process. The full code is available on GitHub.
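Since the embedded snippet is not reproduced here, the following is a minimal sketch of such a trial script. It assumes a hypothetical load_data() helper and a binary classification task; the exact parameter names and splits in the repository may differ.

```python
import lightgbm as lgb
import nni
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Default parameters, including some that are not tuned by AutoML.
params = {"objective": "binary", "boosting_type": "gbdt", "verbose": -1}

# Ask the tuner for the next hyperparameter configuration and merge it in.
params.update(nni.get_next_parameter())

# Hypothetical helper: load features X and labels y.
X, y = load_data()

# Split into train, two validation sets, and a held-out test set
# (the fractions are illustrative).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4)
X_val1, X_rest, y_val1, y_rest = train_test_split(X_rest, y_rest, test_size=0.5)
X_val2, X_test, y_val2, y_test = train_test_split(X_rest, y_rest, test_size=0.5)

train_set = lgb.Dataset(X_train, label=y_train)
val1_set = lgb.Dataset(X_val1, label=y_val1)

# Train with early stopping on the first validation set.
booster = lgb.train(
    params,
    train_set,
    num_boost_round=1000,
    valid_sets=[val1_set],
    callbacks=[lgb.early_stopping(stopping_rounds=20)],
)

# Measure accuracy on the second validation set and report it to the tuner.
preds = (booster.predict(X_val2) > 0.5).astype(int)
nni.report_final_result(accuracy_score(y_val2, preds))
```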

Near the top of the snippet, some default parameters are set. These include parameters that are not tuned by AutoML.

You will also notice that the data is split into four parts: train, test, and two validation sets. Now you may wonder why we need two validation sets. Maybe this is not strictly necessary. Nevertheless, I think it is a good idea. Let me elaborate on this point. The training set is for the machine learning algorithm itself. After the training procedure, the performance is measured on the validation set. Normally, you carry out a few iterations of training and validation. In this process, you manually check the performance on the validation set. After enough cycles between train and validation, over-fitting on the validation dataset might happen. That is the reason why one should hold out a test set. With AutoML, this cycle between train and validation becomes more sophisticated and is carried out many times. Therefore, it is more important than ever to have a test set. This test set is only used after the whole AutoML process is finished.

Now to the two validation sets. The first one is used for early stopping. Early stopping is a very powerful idea. In combination with Gradient Boosting Machines, it makes sure that no more trees are added once the performance on the validation dataset stops improving. But this is itself an optimization on the validation data. Remember, we move from the validation dataset to the test dataset because an optimization has happened on the validation data and we do not want to over-fit. The same applies here. We optimize on the first validation set with early stopping. Then we move to hyperparameter optimization, and that is where we use the second validation dataset. To conclude: splitting the validation data in two means the hyperparameter optimization does not simply reward parameters that happen to score well on the same data used for early stopping. Instead, it optimizes for parameters that yield good results after a training process that includes early stopping on a separate part of the data.

A very important thing to notice is the nature of the final score which is reported back. It does not have to be differentiable. Here, it is accuracy. But the score could be anything, even your self-defined loss. The hyperparameters are optimized on this score. Normally, for optimization in machine learning, the loss must be differentiable so that a minimum can be found. This is not the case here, which gives a lot of freedom in stating the problem and the performance measure. We can state what we actually care about (and not focus on mathematical properties).
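To make this concrete, here is a small, made-up example that continues the sketch above: a cost-weighted score in which false negatives are penalized more heavily than false positives. The weights are purely illustrative.

```python
import numpy as np

def business_score(y_true, y_pred):
    # Hypothetical, non-differentiable score: false negatives cost
    # five times as much as false positives (weights are made up).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return -(1.0 * fp + 5.0 * fn)

# Report this score instead of accuracy; NNI only needs a single number.
nni.report_final_result(business_score(y_val2, preds))
```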

Now, let us go back to the code. One additional file search_space.json is needed to describe the parameter search space, from which the hyperparameters are sampled.
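As an illustration, a search space for a few common LightGBM parameters could look like this (the parameter names and ranges are examples, not necessarily those used in the repository):

```json
{
    "num_leaves": {"_type": "randint", "_value": [20, 150]},
    "learning_rate": {"_type": "uniform", "_value": [0.01, 0.3]},
    "bagging_fraction": {"_type": "uniform", "_value": [0.5, 1.0]},
    "feature_fraction": {"_type": "uniform", "_value": [0.5, 1.0]}
}
```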

This is a JSON file. The name of the parameter is the key, and the value specifies the strategy for sampling new values (only random integers and the uniform distribution are used here; more strategies are available, see here).

The second file config.yml describes the AutoML configuration.
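The sketch below follows the legacy (v1) NNI configuration schema that nnictl uses; the file name main.py, the trial limits, and the durations are placeholders, so check the repository and the NNI documentation for the exact values.

```yaml
authorName: default
experimentName: lightgbm_metis
trialConcurrency: 1
maxExecDuration: 2h
maxTrialNum: 100
trainingServicePlatform: local
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: MetisTuner
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 main.py
  codeDir: .
  gpuNum: 0
```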

This is a YAML file. The most important thing to take care of is getting the optimize_mode correct. As we optimize for accuracy here, we want to maximize it. If you used a loss like mean squared error, you would set this to minimize.

Now the experiment can be started from the command line in the corresponding folder:

nnictl create --config config.yml

Finally, you can observe the results on the NNI UI.

NNI UI for Hyperparameter Tuning

BOHB

This algorithm also performs Bayesian optimization, but it has another cool feature built into it, called successive halving. The general idea is that training every configuration to completion is computationally expensive. So you start by training each configuration for only a few boosting rounds. After those boosting rounds, you compare the performance and only continue with the most successful configurations. They are trained for a few more rounds, and their performance is compared again. This continues until only one configuration, the most successful one, is left. This procedure saves resources and time and is therefore a good approach for large and complex datasets that take too long to train many times over.

We need to do some minor changes to adapt to BOHB.
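Again, since the embedded snippet is not reproduced here, the following is a hedged sketch of how the trial code could look for BOHB. It reuses the same data split as in the Metis sketch, and the details may differ from the repository.

```python
import lightgbm as lgb
import nni
from sklearn.metrics import accuracy_score

# Default parameters, as before.
params = {"objective": "binary", "boosting_type": "gbdt", "verbose": -1}

tuned_params = nni.get_next_parameter()
# BOHB injects the budget for this trial alongside the hyperparameters.
trial_budget = int(tuned_params.pop("TRIAL_BUDGET"))
params.update(tuned_params)

# X_train, y_train, X_val1, y_val1, X_val2, y_val2 come from the same
# split as in the Metis sketch above.
train_set = lgb.Dataset(X_train, label=y_train)
val1_set = lgb.Dataset(X_val1, label=y_val1)

booster = None
for i in range(trial_budget):
    if i == 0:
        # First iteration: create the booster from scratch.
        booster = lgb.train(params, train_set, num_boost_round=10,
                            valid_sets=[val1_set])
    else:
        # Later iterations: continue training the existing booster.
        booster = lgb.train(params, train_set, num_boost_round=10,
                            valid_sets=[val1_set], init_model=booster)
    # Intermediate results show up as learning curves in the NNI UI.
    preds = (booster.predict(X_val2) > 0.5).astype(int)
    nni.report_intermediate_result(accuracy_score(y_val2, preds))

# Report the final performance after the full budget has been used.
preds = (booster.predict(X_val2) > 0.5).astype(int)
nni.report_final_result(accuracy_score(y_val2, preds))
```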

The first thing to notice is the parameter TRIAL_BUDGET. This one is set by BOHB itself and does not need to be included in search_space.json. The TRIAL_BUDGET keeps early trials small and makes later trials longer. The idea here is to train in multiples of 10 boosting rounds: with TRIAL_BUDGET = 1, there are 10 boosting rounds; with TRIAL_BUDGET = 2, there are 20 boosting rounds, and so on. This has the downside that the number of boosting rounds is no longer part of the hyperparameter optimization.

The second thing to notice is the method nni.report_intermediate_result. This is not strictly necessary but gives some nice visualizations in the UI.
Last, there is a difference in the for-loop between the first iteration and all other iterations. This is needed to create the booster (i.e., the model) in the first iteration. In further iterations, training continues from the already trained model, which is passed via the init_model argument.

The two other files remain largely unchanged.

PBT

Now let us look at PBT. The idea is similar to evolutionary algorithms. PBT always trains a population of algorithms. The most promising ones copy their hyperparameter configuration to the least promising ones in the population. During copying, minor changes to the hyperparameters are introduced. With a certain probability, the least promising ones are replaced by newly sampled configurations. The copying of hyperparameters from one algorithm to another has a great implication: at the end of the whole AutoML training, we do not end up with a single hyperparameter configuration. Instead, we get a history of the different hyperparameters the algorithm was trained with, and we also get a readily trained model. With PBT, the algorithm could be trained with a large learning rate at the start and a lower one at the end. Or maybe it is advantageous to have a low tree depth at the beginning and a higher one at the end. All this becomes possible with PBT, and it is carried out automatically.

One downside in comparison to the other algorithms is the following: normally, you find a good hyperparameter configuration on the training and validation sets. Then you evaluate it on your test set and, if you are satisfied, train again on the whole dataset. It is not clear whether this makes sense with PBT. Since PBT returns the trained algorithm, it might not make sense to replicate the history of hyperparameter configurations on the whole dataset.

Finally, let us look again at the implementation.
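As before, here is a hedged sketch rather than the exact code from the repository. The parameter keys load_checkpoint_dir and save_checkpoint_dir follow my reading of NNI's PBTTuner, and the file name model.txt is my own choice, so verify both against the repository and the NNI documentation.

```python
import os

import lightgbm as lgb
import nni
from sklearn.metrics import accuracy_score

params = {"objective": "binary", "boosting_type": "gbdt", "verbose": -1}

tuned_params = nni.get_next_parameter()
# PBT additionally tells the trial where to load the previous model from
# and where to save the current one (key names as I understand PBTTuner).
load_dir = tuned_params.pop("load_checkpoint_dir")
save_dir = tuned_params.pop("save_checkpoint_dir")
params.update(tuned_params)

# Same data split as in the earlier sketches.
train_set = lgb.Dataset(X_train, label=y_train)
val1_set = lgb.Dataset(X_val1, label=y_val1)

# Continue from the checkpoint of the previous PBT step, if one exists.
model_path = os.path.join(load_dir, "model.txt")
init_model = model_path if os.path.exists(model_path) else None

booster = lgb.train(
    params,
    train_set,
    num_boost_round=10,
    valid_sets=[val1_set],
    init_model=init_model,
)

# Save the model so the next PBT step of this population member can load it.
os.makedirs(save_dir, exist_ok=True)
booster.save_model(os.path.join(save_dir, "model.txt"))

preds = (booster.predict(X_val2) > 0.5).astype(int)
nni.report_final_result(accuracy_score(y_val2, preds))
```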

As an additional parameter from PBT, we get the checkpoint directories. They are necessary to save the models of the population at the end of one iteration and to load them again at the start of the next. PBT and NNI automatically take care of saving the models of the various members of the population in dedicated folders.

Now you are ready to start your own AutoML experiments. I would suggest starting with Metis. If your data is large and you are short on time, switch to BOHB. If you can afford to leave some data out of the training process, try out PBT. I would be happy if you let me know how your AutoML experiments go.
