
Tuning XGBoost Hyperparameters with Scikit Optimize

Using automated hyperparameter tuning to improve model performance

XGBoost is no longer an exotic model that only a select few could understand and use. It has become a benchmark to compare against in many scenarios, and interest in it has increased dramatically in the three and a half years since the paper first proposing the algorithm was published. Google Trends suggests that interest in XGBoost is currently at an all-time high.

Relative popularity of XGBoost since April 2016 (Google Trends)

What is the challenge?

Now, while XGBoost has gained popularity in the last few years, some challenges make it a little difficult to use. The most pressing one is deciding which hyperparameter values to use: what should the number of estimators be, what maximum depth should be allowed, and so on. What makes XGBoost harder to manage than, say, a linear/logistic regression model or a decision tree is that it has many more hyperparameters than most other models. Simply hand-tuning them is going to take a lot of time and patience, and you still can’t be sure you are moving in the right direction.


Is there a solution?

You might think that if only we could somehow automate this boring and tiring ritual of tuning hyperparameters, our lives would be a lot easier. If you are wondering that, rest assured: yes, there are ways to improve the performance of XGBoost models without doing all of that manual labor, by letting the computer figure it out for us.

This process of selecting good hyperparameter values is called hyperparameter tuning, and there are prebuilt libraries that can do it for us. All we need to do is write some code, provide some input values, and let the computer figure out the best ones.


How do we do it?

Before we get into code and muddy our hands, let us pause here for a minute and ask ourselves: if we were a computer and had been given the same problem, how would we do it?

Let’s think this through. I’ll assume we have only two hyperparameters in this situation because that makes it easier to visualize. The first thing I am going to do is build a model with random values for our two hyperparameters and see how the model performs.

The next thing I would do is increase one hyperparameter while keeping the other fixed, just to see how model performance responds. If performance improves, I am moving that hyperparameter in the right direction; if not, I now know I need to decrease its value. In the next iteration, I’ll change the value of the other hyperparameter and see how the model reacts to that change. Once I have seen the model react to these changes enough times, I’ll start changing both simultaneously.
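As a rough illustration, here is a toy version of that one-at-a-time loop in Python. It is purely a sketch: train_and_score is a hypothetical helper that fits a model with the given hyperparameter values and returns its test score.

    def hand_tune(params, steps, train_and_score, n_rounds=10):
        best = train_and_score(params)  # score the starting point
        for _ in range(n_rounds):
            for name, step in steps.items():
                # nudge one hyperparameter, keep the others fixed
                trial = dict(params, **{name: params[name] + step})
                score = train_and_score(trial)
                if score > best:
                    params, best = trial, score  # better: keep moving this way
                else:
                    steps[name] = -step          # worse: reverse direction
        return params, best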

So what we are doing here is leveraging the output of the previous model and changing the values of our hyperparameters accordingly. In the sections below, we are going to use a package developed specifically to help us do this.


The Idea

In this tutorial, we are going to use the scikit-optimize package to help us pick suitable values for our models. Now, to be able to use these algorithms, we need to define a region for them to search in: a search space. Imagine you needed to find the highest point in a region and had a friend willing to do that task for you. For your friend to be able to do it efficiently, you would have to tell him in what area you want him to search; you would have to define a space within which you want him to spend his time searching.

The same goes for skopt and other such algorithms. The algorithm is your friend in this case, and you need it to search for the best set of hyperparameter values. For the algorithm to be successful, you need to define a search area.

These are the steps that we are going to follow:

  • create a dummy dataset using sklearn’s make_classification functionality. We are going to be performing binary classification, so our labels will contain zeros and ones
  • define a search space over which you want the model to search for the best hyperparameter values
  • define a function that fits the model with different hyperparameter values and measures the model performance
  • define how many times you want to train your model
  • use skopt’s gp_minimize algorithm to search our space and give us the results

And that’s it. That’s all you need to do. So let’s take a look at the code now.


The Code

You’ll need the skopt package, which you can install by entering pip install scikit-optimize at the command line.

Note: links to the full code snippets are provided as GitHub gists. If any code looks incomplete, click on the gist link to find the full code.

  • Imports

https://gist.github.com/vectosaurus/6ae1b455c7527bf25954e834b8b49a89
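For reference, a minimal set of imports that covers the rest of this tutorial might look like the following (this is my own summary; the gist has the exact list):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    from xgboost import XGBClassifier
    from skopt import gp_minimize
    from skopt.space import Integer, Real
    from skopt.utils import use_named_args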

  • Creating the dataset

https://gist.github.com/vectosaurus/a53683c4d6fde396ce43b92c045a5a00
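A sketch of what this step might look like; the dataset sizes here are illustrative choices, not necessarily the ones in the gist:

    # 5,000 rows and 20 features, of which 10 carry real signal
    X, y = make_classification(
        n_samples=5000,
        n_features=20,
        n_informative=10,
        random_state=42,
    )
    # hold out a test set to measure performance on unseen data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )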

  • Defining the search space. Here we define the search space by providing the minimum and maximum values each hyperparameter can take.

https://gist.github.com/vectosaurus/dc922b0b8d9db080bd75b736c2e09b0b
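One way to express such a space is with skopt’s dimension objects, each named after the model argument it controls. The hyperparameters and bounds below are illustrative choices:

    search_space = [
        Integer(50, 500, name="n_estimators"),
        Integer(2, 10, name="max_depth"),
        Real(0.01, 0.3, prior="log-uniform", name="learning_rate"),
        Real(0.5, 1.0, name="subsample"),
    ]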

  • Function to fit the model and return its performance

https://gist.github.com/vectosaurus/73d257cb0be61d88ab788bb70f22b857
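A sketch of what such a function might look like, using skopt’s use_named_args decorator to map a list of dimension values onto keyword arguments (the hyperparameters are the illustrative ones from the space above):

    @use_named_args(search_space)
    def evaluate_model(**params):
        model = XGBClassifier(**params)
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        test_score = f1_score(y_test, preds)
        # gp_minimize looks for a minimum, hence 1 - F1 (explained below)
        return 1 - test_score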

Here, in the return statement, I return 1 - test_score instead of the test_score itself. That is because skopt’s gp_minimize works to find the minimum, while we aim to find the maximum F1 score we can get. So by getting gp_minimize to minimize 1 - test_score, we are maximizing the test_score.

  • Running the algorithm

https://gist.github.com/vectosaurus/f686bfc43fdec8f378d0eeae074e5fe1
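Something along these lines kicks off the search; n_calls, the total number of model fits, is an assumed budget here:

    result = gp_minimize(
        func=evaluate_model,
        dimensions=search_space,
        n_calls=50,
        random_state=42,
    )
    print("Best objective (1 - F1):", result.fun)
    print("Best hyperparameters:", result.x)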


Results

https://gist.github.com/vectosaurus/2f6ff57ac4cb4e728d32c65a155ceb78
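Numbers like the ones below can be read off the result object: result.fun holds the best objective value, and result.func_vals holds the objective (i.e. 1 - F1) at every iteration. For example:

    best_f1 = 1 - result.fun
    print(f"Best F1 score: {best_f1:.3f}")
    # reconstruct the optimization trace, iteration by iteration
    for i, v in enumerate(result.func_vals, start=1):
        print(f"Iteration {i}: F1 = {1 - v:.3f}")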

The XGBoost model starts with a test_score of 0.762 on the first iteration but ends up at an F1 score of 0.837, an improvement of over seven percentage points.


Conclusion

While automated hyperparameter tuning helps improve model performance in many circumstances, it is still necessary to pay close attention to the data. Probing the data and engineering informative variables can in many cases be much more effective.

Also, it is important that the search space we select for the model be meaningful. Simply declaring a very large space will hurt the quality of the search these algorithms can perform.

Hyperparameter tuning is quite effective, but we need to make sure we provide it a sensible search space and a reasonable number of iterations to perform. Automated hyperparameter tuning reduces human effort, but it doesn’t reduce the complexity of the program.

