Why you should learn CatBoost now

CatBoost is changing the game of Machine Learning forever, for the better.

Félix Revert
Towards Data Science


Introduction

As I was designing the content for a training on Machine Learning, I ended up digging through the documentation of CatBoost. And there I was, baffled by this immensely capable framework. Not only does it build one of the most accurate models on whatever dataset you feed it — requiring minimal data prep — CatBoost also provides by far the best open-source interpretation tools available today, AND a way to put your model into production fast.

That’s why CatBoost is revolutionising the game of Machine Learning, forever. And that’s why learning to use it is a fantastic opportunity to up-skill and remain relevant as a data scientist. But more interestingly, CatBoost poses a threat to the status quo of data scientists (like myself) who enjoy a position where building a highly accurate model from a given dataset is supposedly tedious work. CatBoost is changing that. It’s making highly accurate modeling accessible to everyone.

Image taken from CatBoost official documentation: https://catboost.ai/

Building highly accurate models at blazing speeds

Installation

Have you ever tried to install XGBoost on your laptop? Then you know how painful it can be. Installing CatBoost, on the other hand, is a piece of cake. Just run

pip install catboost

and that will do.

Data prep needed

Unlike most Machine Learning models available today, CatBoost requires minimal data preparation. It handles:

  • Missing values for Numeric variables
  • Non-encoded Categorical variables. You just tell CatBoost which columns are categorical (see the sketch after this list).
    Note that missing values have to be filled beforehand for Categorical variables. Common approaches replace NAs with a new ‘missing’ category or with the most frequent category.
  • For GPU users only, it handles Text variables as well.
    Unfortunately I couldn’t test this feature as I am working on a laptop with no GPU available. [EDIT: an upcoming version will handle Text variables on CPU. See the comments for more info from the head of the CatBoost team.]
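Here is a minimal sketch of what that looks like in practice. The column names are made up for illustration, and I’m assuming X_train is a pandas DataFrame:

from catboost import CatBoostClassifier

# hypothetical categorical column names, just for illustration
cat_cols = ['gender', 'city', 'product_type']

model_cb = CatBoostClassifier()
# tell CatBoost which columns are categorical; no one-hot or label encoding needed
model_cb.fit(X_train, y_train, cat_features=cat_cols)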

Building models

As with XGBoost, you have the familiar sklearn syntax with some additional features specific to CatBoost.

from catboost import CatBoostClassifier # Or CatBoostRegressor
model_cb = CatBoostClassifier()
model_cb.fit(X_train, y_train)

Or if you want a cool, sleek visual of how the model learns and whether it starts overfitting, use plot=True and pass your test set to the eval_set parameter:

from catboost import CatBoostClassifier # Or CatBoostRegressor
model_cb = CatBoostClassifier()
model_cb.fit(X_train, y_train, plot=True, eval_set=(X_test, y_test))

Note that you can display multiple metrics at the same time, including more human-friendly metrics like Accuracy or Precision. Supported metrics are listed here. See the example below:

Monitoring both Logloss and AUC at training time on both training and test sets
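If you want that kind of chart yourself, a sketch: pass an eval_metric and extra custom_metric values when creating the model (the metric names must come from the supported list):

model_cb = CatBoostClassifier(
    eval_metric='AUC',                      # metric used for overfitting detection
    custom_metric=['Logloss', 'Accuracy'],  # extra metrics computed and plotted
)
model_cb.fit(X_train, y_train, eval_set=(X_test, y_test), plot=True)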

You can even use cross-validation and observe the average & standard deviation of accuracies of your model on the different splits:
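A sketch using the built-in cv function (the parameter values are just examples):

from catboost import Pool, cv

cv_results = cv(
    pool=Pool(X_train, y_train),
    params={'loss_function': 'Logloss', 'custom_metric': 'Accuracy'},
    fold_count=5,  # number of folds
    plot=True,     # draws the interactive chart in Jupyter
)
# cv_results is a DataFrame with the mean and std of each metric per iteration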

More information on the official documentation: https://catboost.ai/docs/features/visualization_jupyter-notebook.html#visualization_jupyter-notebook

Finetuning

CatBoost is quite similar to XGBoost, about which I already wrote an article; XGBoost’s parameters are very similar to CatBoost’s, so for more information about the key parameters of Gradient Boosting, like the learning rate and the number of estimators, you may read my previous article. To fine-tune the model appropriately, first set early_stopping_rounds to a finite number (like 10 or 50) and then start tweaking the model’s parameters.
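In code, that first step looks roughly like this (the numbers are only examples):

model_cb = CatBoostClassifier(learning_rate=0.05, n_estimators=2000)
model_cb.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),
    early_stopping_rounds=50,  # stop once the eval metric stops improving for 50 rounds
    plot=True,
)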

Training made fast. Real fast.

Without GPU

From their benchmark, you can see that CatBoost trains faster than XGBoost and at roughly the same speed as LightGBM, which is known to train very fast.

With GPU

When it comes to GPU though, the real magic happens.

Even with a relatively old GPU like the K40 (released in 2013), training time is divided by at least 4. With a more recent GPU, training time can be divided by up to 40. Source: https://catboost.ai/news/catboost-enables-fast-gradient-boosting-on-decision-trees-using-gpus
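Switching to GPU training is a one-parameter change. A sketch, assuming a CUDA-capable GPU is available:

model_cb = CatBoostClassifier(task_type='GPU', devices='0')  # train on the first GPU
model_cb.fit(X_train, y_train)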

Interpreting your model

One thing that CatBoost’s authors have understood is that it’s not just an accuracy game. Why use CatBoost when XGBoost and LightGBM are available? Well, when it comes to interpretation, nothing matches what CatBoost provides out of the box.

Feature Importances

While the plot itself can be understood by most people, the underlying method used to compute the feature importances can sometimes be misleading. CatBoost provides 3 different methods: PredictionValuesChange, LossFunctionChange and InternalFeatureImportance. Everything is detailed here.
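For instance, here is how you would request the loss-based importances instead of the default. A sketch, assuming model_cb has already been fitted:

from catboost import Pool

# default method: average change in predictions when the feature value changes
fi_default = model_cb.get_feature_importance(type='PredictionValuesChange')

# loss-based method: requires a dataset, since it measures the change in the loss function
fi_loss = model_cb.get_feature_importance(Pool(X_train, y_train), type='LossFunctionChange')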

Local explanation

For local explanations, CatBoost comes with SHAP, commonly viewed as the most reliable interpretation method out there.

from catboost import Pool

shap_values = model_cb.get_feature_importance(Pool(X, y), type='ShapValues')

Following this tutorial, you can produce the classic SHAP output for local explanations as well as feature importances.
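If you have the shap package installed, you can feed these values straight into its plotting functions. A sketch; note that CatBoost appends the expected (base) value as a last column, which you drop before plotting:

import shap

# shap_values has shape (n_samples, n_features + 1); the last column is the base value
shap.summary_plot(shap_values[:, :-1], X)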

Marginal impact

That’s by far the thing I like most. With high accuracy being commoditized (notably with the rise of AutoML), what becomes increasingly important nowadays is to understand these highly accurate models on a deeper level.

Based on experience, the following plot has become a standard in model analysis, and CatBoost provides it directly in its package (see the sketch after the list below).

On this plot you can observe:

  • in green the distribution of the data
  • in blue the average target value on each bin
  • in orange the average predicted value on each bin
  • in red the partial dependence
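A sketch of the call that produces this chart, using a made-up feature name ('age'); see calc_feature_statistics in the documentation:

# computes per-bin statistics for a single feature and draws the chart in Jupyter
model_cb.calc_feature_statistics(X_train, y_train, feature='age', plot=True)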

Using your CatBoost model in production

Putting your model into production has never been this easy. Here’s how to export your CatBoost model.

The documentation of the .save_model() method lists the available export formats.

Python & C++ exports

model_cb.save_model('model_CatBoost.py', format='python', pool=X_train)

That’s it. You’ll have a nice .py file in your repo containing the whole model.

Model’s ready for production! And you don’t need to set up a specific environment on the machine making new scores. Just Python 3 will do!
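The generated file exposes an apply function you can import directly. Here is roughly what using it looks like, assuming the file is named model_CatBoost.py, the model is a binary classifier, and the feature values are made up; the exported function returns a raw score, so you apply a sigmoid yourself to get a probability:

import math
from model_CatBoost import apply_catboost_model  # the file generated above

# one observation: numeric feature values first, categorical values (as strings) second
raw_score = apply_catboost_model([34.0, 2.5], ['Paris'])
probability = 1 / (1 + math.exp(-raw_score))  # sigmoid, since the raw score is a log-odds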

Note that the documentation mentions that the Python scoring “method is inferior in performance compared to the native CatBoost application methods, especially on large models and datasets.”

Binary export

This is apparently the fastest option for scoring new data. Save your model as a .cbm file:
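model_cb.save_model('model_CatBoost.cbm', format='cbm')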

Reload your model using the following code:

from catboost import CatBoost

model = CatBoost()
model.load_model('filename', format='cbm')

Other useful tips

Verbose = 50

There’s usually a verbose parameter in most models that lets you see how the training process is going. CatBoost has it too, but it’s slightly better than the others’: using verbose=50, for instance, displays the training error every 50 iterations instead of at every iteration, which can be annoying if you have many trees.
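In practice:

# print metrics every 50 iterations instead of every single one
model_cb.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=50)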

Using verbose=True
Training the same model with verbose=10. Much nicer to check.

Note the remaining time is also displayed.

Model comparison

Fine-tuning a model takes time, and you often end up with several sets of good parameters that make your model accurate. To take it to the next level, you can compare how models learn with these different sets of parameters, which helps you decide on the final set to keep.

Compare CatBoost models easily. Instructions from the official documentation here

Saving your model while training

You have a big dataset and you’re afraid of training for too long? Fear no more. You can save your model while training so any interruption in the training process doesn’t have to imply a full retraining of your model! More info on the snapshot option here.
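A sketch of how to turn it on (the file name and interval are just examples):

model_cb = CatBoostClassifier(iterations=5000)
model_cb.fit(
    X_train, y_train,
    save_snapshot=True,
    snapshot_file='catboost_training.snapshot',  # where training progress is stored
    snapshot_interval=300,                       # save progress every 300 seconds
)
# if training is interrupted, re-running the same fit call resumes from the snapshot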

Education materials

Who’s too cool for school? The training material available in the docs is really helpful, even if you think you know everything about Gradient Boosted trees. They have notebooks and videos on how to use and understand CatBoost. My favourite is surely the talk from the NeurIPS 2018 conference (2nd video on this link).

Conclusion

It looks like we all waited way too long before something like this made it to the open source world. If you found this article helpful, don’t hesitate to give some claps, it always feels good to receive them ;)
