
A Journey through XGBoost: Milestone 2

Classification with XGBoost

Photo by Martin Adams on Unsplash

Welcome to the second article of the "A Journey through XGBoost" series. Today, we will build our first XGBoost model on the "heart disease" dataset and make a small (but useful) web app to communicate our results to end-users. Here are the topics we discuss today.

Topics we discuss

  • Formulate a classification problem
  • Identify the feature matrix and the target vector
  • Build the XGBoost model (Scikit-learn compatible API)
  • Describe ‘accuracy’ and ‘area under the ROC curve’ metrics
  • Explain XGBoost Classifier hyperparameters
  • Build the XGBoost model (Non-Scikit-learn compatible API)
  • XGBoost’s DMatrix
  • Create a small web app for our XGBoost model with the Shapash Python library
  • Make some fancy visualizations

Prerequisites

Before proceeding, make sure that you’ve read the first article of the XGBoost series (A Journey through XGBoost: Milestone 1 – Setting up the background). It will help you set up your own computer to run and experiment with the code discussed here. In addition, I assume you have basic knowledge of the Python Scikit-learn ML library.

Let’s get started!

The heart disease dataset

Today, we build our first XGBoost model on the "heart disease" dataset (download here). The following image shows the first 5 rows of the dataset.

The first 5 rows of the "heart disease" dataset (Image by author)

The following image shows the dataset information returned by the Pandas info() method.

Useful information on the "heart disease" dataset (Image by author)

The dataset has no missing values, and all the values are numerical. So, no preprocessing step is required and the dataset is ready to use. The dataset has 303 observations and 14 columns (13 features plus the "target" column).

Now, we define our problem statement.

The problem statement

Based on age, sex, cp, …, thal, we want to predict whether a given person (a new instance) has heart disease (class 1) or not (class 0). This is a classification problem because the outcome is a discrete value (a known class). The algorithm we use to solve this classification problem is XGBoost (XGBClassifier). So, we will build an XGBoost model for this classification problem and evaluate its performance on test data (unseen data/new instances) using evaluation metrics such as Accuracy and Area under the ROC curve.

We provide the feature matrix and the target vector as the input data for the XGBoost model.

  • Feature matrix: Called X. Includes all the columns except the "target" column. This matrix can be in the form of a Pandas DataFrame or a 2-dimensional numpy array.
  • Target vector: Called y. Includes the "target" column of the dataset. This vector can be in the form of a Pandas Series or a 1-dimensional numpy array (see the short sketch after this list).
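
As a quick illustration (assuming the dataset is saved as heart.csv in the current working directory), X and y can be created like this:

import pandas as pd

# Load the heart disease dataset (assumed file name: 'heart.csv')
df = pd.read_csv('heart.csv')

X = df.drop(columns='target')   # feature matrix: every column except 'target'
y = df['target']                # target vector: 1 = heart disease, 0 = no heart disease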

After that, the XGBoost model (with user-defined parameters) will learn the rules based on X and y. Based on those rules, we make predictions on new or unseen data.

Let’s get hands-on experience by writing the Python code to build our first XGBoost model.

Building the XGBoost model

Here, we discuss two scenarios.

  • Building the model with Scikit-learn compatible API
  • Building the model with XGBoost’s own non-Scikit-learn compatible API

You will see the difference between the two APIs as we progress.

Building the model with scikit-learn compatible API

The easiest way to build an XGBoost model is to use its Scikit-learn compatible API. "Scikit-learn compatible" means that you can use the Scikit-learn .fit() / .predict() paradigm with XGBoost. If you have used Scikit-learn before, there is nothing new here. Let’s write the complete Python code to build the XGBoost model.
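
Code Snippet-1 below is a minimal sketch of this workflow. The file name heart.csv and the 80/20 train/test split are assumptions made here for illustration.

# Code Snippet-1: XGBoost classification with the Scikit-learn compatible API
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Load the data and create the feature matrix (X) and the target vector (y)
df = pd.read_csv('heart.csv')
X = df.drop(columns='target')
y = df['target']

# Shuffle and split into train and test sets (an 80/20 split is assumed here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Create the classifier with the 6 hyperparameters explained below
xgb_clf = xgb.XGBClassifier(max_depth=3,
                            n_estimators=100,
                            objective='binary:logistic',
                            booster='gbtree',
                            n_jobs=2,
                            random_state=1)

# Train on the train sets and predict on the test set
xgb_clf.fit(X_train, y_train)
y_pred = xgb_clf.predict(X_test)

# Evaluate the model with accuracy and area under the ROC curve
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Area under the ROC curve:", roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:, 1]))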

The output of the above code segment is:

Image by author

Accuracy: 85.2% – This is the number of correct predictions on the test set divided by the total number of observations in the test set. Our model correctly predicts about 85 out of 100 observations. This accuracy score is not bad for an initial model because we haven’t tuned the model yet with an optimal hyperparameter combination (model tuning will be discussed in the 5th article in the XGBoost series).

Note: If you set a different integer value for the random_state parameter in the train_test_split() function, you will get slightly different accuracy scores. We will address this issue also in the 4th article in the XGBoost series.

Area under the ROC curve: 91% – The ROC curve is a probability curve, and the area under it (AUC) is a measure of class separability: it tells how well the model can distinguish between the classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. An AUC near 0 means the model is predicting the classes in reverse, and an AUC of 0.5 means the model has no class separation at all. An AUC of 91% is a very good value for our model; it distinguishes the two classes well.

Let’s explain the above Python code line by line.

First, we import all the necessary libraries with community standard conventions (xgboost → xgb, numpy → np, etc.). Then we load the dataset with the Pandas read_csv() function. The dataset is in the same working directory. Then we create X and y. After shuffling the dataset, we split X and y into train and test sets. This is because we need to train our model on the train sets and evaluate it on the test sets (new or unseen data). We never test our model with the same data used in the training phase. If you do so, you will get much better accuracy scores, but the model will fail to generalize to new or unseen data, and the predictions on such data will not be accurate.

Then we create an XGBoost classifier object (called xgb_clf) from the XGBClassifier() class. The XGBoost model for classification is called XGBClassifier. We have specified 6 hyperparameters inside the XGBClassifier() class.

  • max_depth=3: Here, XGBoost uses decision trees as base learners. Setting max_depth=3 limits each tree to a maximum depth of 3 levels of splits.
  • n_estimators=100: There are 100 decision trees in the ensemble.
  • objective=’binary:logistic’: The loss function used by the model. binary:logistic is the standard option for binary classification in XGBoost.
  • booster=’gbtree’: The type of base learner the model uses in every round of boosting. ‘gbtree’ is XGBoost’s default base learner; with it, the model uses decision trees, which is the best option for non-linear data.
  • n_jobs=2: Use 2 processor cores to run XGBoost’s computations in parallel.
  • random_state=1: Controls the randomness involved in creating the trees. You may use any integer. By specifying a value for random_state, you will get the same result across different executions of your code.

After creating the model with the above hyperparameters, we train it using train sets. Then, we make the predictions on test sets. Finally, we evaluate the model using two evaluation metrics – Accuracy and Area under the ROC curve.

Next, we obtain the same result using XGBoost’s own non-Scikit-learn compatible API.

Building the model with XGBoost’s own non-Scikit-learn compatible API

Another way to create an XGBoost model is to use XGBoost’s own non-Scikit-learn compatible API. "Non-Scikit-learn compatible" means that you cannot use the Scikit-learn .fit() / .predict() paradigm and some other Scikit-learn classes with XGBoost. Let’s write the complete Python code to build the XGBoost model.
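
The following is a minimal sketch of the same workflow with the native API, reusing the train and test sets from Code Snippet-1.

# XGBoost classification with the native (non-Scikit-learn compatible) API
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Wrap the train and test sets in XGBoost's internal DMatrix data structure
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Training parameters mirroring the hyperparameters used above
params = {'max_depth': 3,
          'objective': 'binary:logistic',
          'booster': 'gbtree',
          'nthread': 2}

# Train for 100 boosting rounds (the counterpart of n_estimators=100)
bst = xgb.train(params, dtrain, num_boost_round=100)

# predict() returns probabilities; convert them into class labels 0 and 1
y_pred_proba = bst.predict(dtest)
y_pred = (y_pred_proba > 0.5).astype(int)

print("Accuracy:", accuracy_score(y_test, y_pred))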

The output of the above code segment is:

The accuracy score is exactly the same as the previous one! The major difference in this API is that we explicitly create a special data structure called a DMatrix, which is an internal data structure used by XGBoost. The XGBoost DMatrix() function converts array-like objects into DMatrices. In the Scikit-learn compatible API, this conversion happens behind the scenes, so we do not need to create DMatrices explicitly. When using DMatrices, the algorithm is optimized for both memory efficiency and training speed.

Other differences in this API are:

  • To train the model, we use XGBoost’s own train() function. Previously, we used the Scikit-learn fit() method to train the model.
  • The predictions are returned as probabilities. We need to convert them into classes (integers: 0 and 1).
  • With this XGBoost model, we cannot use some Scikit-learn functions. For example, we cannot use Scikit-learn’s ROC curve plotting helper with an XGBoost model created with this API, whereas it works with the Scikit-learn compatible model (see the short example after this list).
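
With the Scikit-learn compatible model (xgb_clf), the ROC curve can be plotted as in the following sketch. Note that scikit-learn versions from 1.2 onwards replace plot_roc_curve() with RocCurveDisplay.from_estimator(), which is what this sketch uses.

# Plot the ROC curve for the Scikit-learn compatible model
# (older scikit-learn versions exposed this as plot_roc_curve)
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(xgb_clf, X_test, y_test)
plt.show()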

Note: Whenever possible, I recommend using the XGBoost Scikit-learn compatible API. Its syntax is very consistent and easy to use, and we can always take full advantage of the Scikit-learn library functions.

Creating a small web app to communicate the XGBoost model’s results to end-users

Here, we take advantage of the Shapash Python library, which aims to make Machine Learning models interpretable and understandable by end-users who don’t have much technical knowledge but are interested in seeing the results as visualizations. With just a few lines of code (maybe 5 or 6), we can make some fancy visualizations with Shapash effortlessly. Let’s get started.

Note: To learn more about the Shapash Python library, read its official documentation.

Installation

Just run the following command in your Anaconda command prompt to install Shapash. After installation, you can use Shapash in your Jupyter Notebook with Python.

pip install shapash --user

Making the web app

After creating your XGBoost classification model with the XGBoost Scikit-learn compatible API (run Code Snippet-1 above), execute the following code to create the web app. The compile() method of the xpl object takes the test data (X_test), the XGBoost model (xgb_clf), and the predictions as a Pandas Series with the same index as X_test. The predictions (y_pred_as_series) must have an integer or float dtype (you may need to cast them explicitly, e.g., with dtype=int or dtype=float); otherwise, you will get an error.
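
The following is a minimal sketch, assuming xgb_clf, X_test and y_pred from Code Snippet-1. The import path and the compile() signature follow the Shapash 1.x API that was current when this article was written; newer Shapash releases pass the model to the SmartExplainer constructor instead.

# Create the Shapash web app for the trained XGBoost classifier
import pandas as pd
from shapash.explainer.smart_explainer import SmartExplainer

# Predictions as a Pandas Series with the same index as X_test and an integer dtype
y_pred_as_series = pd.Series(y_pred, index=X_test.index, dtype=int)

xpl = SmartExplainer()
xpl.compile(x=X_test, model=xgb_clf, y_pred=y_pred_as_series)

# Launch the interactive web app (the link appears in the Jupyter output)
app = xpl.run_app()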

After running the code, the web app link should appear in your Jupyter output (as the second link in the following image). Click on it to launch the web app.

Image by author

You can see some fancy visualizations there (Features Importance plot, Feature Contribution plot, Local Explanation plot). The great thing is that you can interact with those plots. Watch the following 50-second video to see how I interact with the plots in this web app.

In this web app, you can even download individual plots and save them on your local machine. You may interpret them in light of the problem you are trying to solve. For example, you can use the Feature Contribution plot to answer questions like "How does a feature in my model influence the prediction?".

The following are some of the plots created from the Shapash library.

Features Importance Plot

Image by author

Feature Contribution Plot (Violin plot for a categorical variable)

Image by author

Feature Contribution Plot (Scatter plot for a continuous variable)

Image by author

To finalize this article, I will visualize an individual decision tree – a base learner in our XGBoost model. The following code shows and saves the first decision tree (at index 0) in our model.
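
The sketch below assumes the xgb_clf model from Code Snippet-1 and that the graphviz package is installed (xgboost’s plot_tree() depends on it).

# Visualize and save the first decision tree (index 0) of the ensemble
import matplotlib.pyplot as plt
import xgboost as xgb

fig, ax = plt.subplots(figsize=(30, 15))
xgb.plot_tree(xgb_clf, num_trees=0, ax=ax)   # num_trees=0 selects the first tree
fig.savefig('first_tree.png', dpi=300)
plt.show()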

The output is:

Visualizing an XGBoost base learner (Image by author)

Summary

In the first article of the XGBoost series, we just began the XGBoost journey. There, we set up the environment to run XGBoost on our own computers. Taking another step forward, in this article, we have done a classification task with XGBoost. We have discussed the difference between the two APIs that XGBoost offers. Not only that, but we also made a small (but useful) web app to communicate our model’s results to end-users. The great thing is that we wrote just 5 additional lines of code to create the web app, and the web app is very interactive. At the end of the article, I added some fancy visualizations.

What is next? We haven’t discussed the mathematical background of XGBoost yet. In the next article, I will build a regression model with XGBoost. There, I will also discuss the mathematical background, such as formulating XGBoost’s learning objective. In this article, more emphasis was given to the technical and coding parts, which are very useful for implementing the algorithm with real-world datasets. Now you have hands-on experience in implementing the algorithm and visualizing the results for end-users.

Stay tuned for the updates about the next article of the XGBoost series!

Thanks for reading!

This tutorial was designed and created by Rukshan Pramoditha, the Author of Data Science 365 Blog.

Read my other articles at https://rukshanpramoditha.medium.com

2021–03–07

