On-Premise Machine Learning with XGBoost Explained

Step by step guide to run Machine Learning model in your own environment using Docker container

Andrej Baranovskij
Towards Data Science



You can run Machine Learning (ML) models in the cloud (Amazon SageMaker, Google Cloud Machine Learning, etc.), but I believe it is important to understand how to run Machine Learning in your own environment too. Without this knowledge, your ML skill set is not complete. There are multiple reasons for this. Not everyone uses the cloud, and you may have to provide an on-premise solution. And without getting your hands dirty and configuring the environment yourself, you would miss an exciting opportunity to learn more about ML.

On-premise ML model training is about more than installing and setting up the environment. When you train an ML model in the cloud, you use a vendor API (Amazon SageMaker, Google, etc.). This API usually helps you solve the problem quicker, but it also hides some interesting details which would help you understand the ML process better. In this post, I will go step by step through an ML model which can be trained without a cloud API, using only the APIs that come directly from open source libraries.

Let’s dive in. The first thing you need to start with on-premise ML is a Docker image (while you could configure an ML environment without Docker, I recommend going with Docker for easier maintenance and simpler setup).

Go with the official Jupyter Notebook Data Science Stack image. Create a container with the docker run command (check all available parameters in the image docs). I recommend paying attention to where you map the working directory with the -v parameter. The first part of this parameter points to a folder on your OS, and the second part after : points to a folder inside the Docker container (usually /home/jovyan/work).
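As a minimal sketch, the run command could look like the one below; the container name ml-model and the local folder path are placeholders to adjust for your setup:

docker run -it -p 8888:8888 --name ml-model -v /your/local/folder:/home/jovyan/work jupyter/datascience-notebook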

XGBoost installation in the Jupyter Notebook container.

You must enter the Docker container prompt with the command docker exec -it containername bash, then run the commands below:

conda install -y gcc

pip install xgboost

With XGBoost installed, we can move on to the ML model, the core part of any ML implementation. I’m using Jupyter Notebook to build and train the ML model; that’s why my choice was the Docker image from Jupyter. A Jupyter notebook provides a structured way to implement Python code: a developer can re-run each notebook section separately, which gives great flexibility, especially when coding and debugging, because there is no need to re-run the entire script every time. First, we start with imports. I recommend keeping all imports at the beginning of the notebook (yes, you can import in any section of the notebook). This improves code readability, since it is always clear which imports are being used:
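A minimal set that covers the steps below would look roughly like this (matplotlib and pickle are only needed for the plotting and model-saving sections at the end):

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
import pickle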

The first step is to read the training data with the Pandas library. Download the training data (invoice_data_prog_processed.csv) used in this example from my GitHub repo. Read more about the data structure in my previous post, Machine Learning — Date Feature Transformation Explained. The data contains information about invoice payments: it indicates whether an invoice was paid on time and, if it was delayed, how long the delay was. The decision column is assigned 0 if the invoice was paid on time or the delay was small.
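Reading the file is a one-liner with Pandas; the path below assumes the CSV sits next to the notebook:

# Read invoice payment data into a Pandas data frame
data = pd.read_csv('invoice_data_prog_processed.csv')
data.head()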

After the data is loaded from the file into a Pandas data frame, we should check the data structure and see how the decision column values are distributed:
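A quick way to see the class balance (the column name decision is assumed here, check the CSV header for the exact name):

# Count how many rows fall into each decision class
data['decision'].value_counts()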

XGBoost works with numerical (continuous) data, so categorical features must be translated to a numeric representation. The Pandas library provides the get_dummies function, which encodes categorical data into an array of (0, 1) indicator columns. Here we translate the categorical feature customer_id:
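A sketch of the encoding step; only the customer_id column name comes from the dataset, the rest is standard Pandas usage:

# Replace the categorical customer_id column with 0/1 indicator columns
data = pd.get_dummies(data, columns=['customer_id'])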

After encoding, the data structure contains 44 columns.

Before running model training, it is useful to see how the features are correlated with the decision feature. In our case, as expected, the most correlated/influential features are the dates and the total. This is a good sign, meaning the ML model should train properly:
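One simple way to check this, again assuming the decision column name:

# Correlation of every feature with the decision column, strongest first
data.corr()['decision'].sort_values(ascending=False)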

Next, we need to identify the X/Y pair. Y is the decision feature, which is the first column in the dataset. All other columns are used to predict the decision feature. This means we need to split the data into X/Y as follows:
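Since the decision feature is the first column, a positional split does the job:

# Y is the decision column (first column), X is everything else
X = data.iloc[:, 1:]
y = data.iloc[:, 0]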

Here we split the data into train/test datasets, using the train_test_split function from the sklearn library. The dataset is small, so we use a larger part of it, 90%, for training. The split is done with the stratify option, to make sure the decision feature is well represented in both the training and test collections. train_test_split conveniently returns the X/Y data in separate variables:
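A sketch of the split; test_size=0.1 gives the 90/10 proportion, and random_state is optional, added here only for reproducibility:

# 90% of rows for training, 10% for testing, stratified on the decision values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)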

And here is the moment of truth: running the ML model training step with XGBoost. %%time prints the time spent on training. XGBoost supports both classification and regression; here we are using classification with XGBClassifier. The parameters depend on the dataset, and with a different dataset you will need to adjust them. The ones included are the ones to pay attention to, based on my findings (read more about each parameter in the XGBoost documentation).
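Purely for illustration, the classifier construction looks roughly like this; the parameter values below are placeholders rather than the tuned values from my notebook:

# Parameter values here are placeholders, tune them for your own dataset
model = xgb.XGBClassifier(max_depth=3,
                          learning_rate=0.1,
                          n_estimators=500,
                          subsample=0.8,
                          colsample_bytree=0.8)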

We are not simply running model training; we also use XGBoost’s self-evaluation and early stopping features to avoid overfitting. Along with the training data, we pass the test data into the model build function, model.fit. The function is given 10 early stopping rounds: if there is no improvement in 10 rounds, training stops and the most optimal model is chosen. We use the logloss metric to evaluate training quality, and training runs with the verbose=True flag to print detailed output for each training iteration:
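A sketch of the training call; note that in newer XGBoost releases eval_metric and early_stopping_rounds are expected in the XGBClassifier constructor rather than in fit():

%%time
# Evaluate on both the training and the test set after every boosting round
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train,
          eval_set=eval_set,
          eval_metric='logloss',
          early_stopping_rounds=10,
          verbose=True)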

Based on the output from model training, you can see that the best iteration was Nr. 71.

To evaluate training accuracy, we execute the model.predict function, passing it the X testing data frame. The function returns an array of predictions, one per row of the X set. We then match each row of the prediction array with the actual decision feature value. This is how accuracy is calculated:
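With sklearn's accuracy_score this boils down to a few lines:

# Predict on the held-out test set and compare with the actual decisions
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))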

We executed model.predict with test data, but how do we execute model.predict with new data? Here is an example which feeds model.predict with a Pandas data frame constructed from static data. The payment is one day late (payment after 4 days since the invoice vs. 3 days of expected payment terms), but since the amount is less than 80, such a payment delay is not considered risky. XGBoost model.predict returns the decision, but it is often useful to call model.predict_proba instead, which returns the probabilities behind the decision:
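The exact column names come from the processed dataset, so the sketch below only shows the pattern: build a one-row frame with the same columns as the training X set, fill in the values you want to score, and call both prediction functions (the commented column names are hypothetical):

# One-row frame with the same columns as the training data, filled with zeros
new_row = pd.DataFrame(0, index=[0], columns=X_train.columns)
# new_row['total'] = 75.0            # hypothetical name: amount below 80
# new_row['payment_after_days'] = 4  # hypothetical name: paid 4 days after invoice
print(model.predict(new_row))        # class decision (0 or 1)
print(model.predict_proba(new_row))  # probability for each class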

Once the model is trained, it is good practice to save it. In my next post, I will explain how to access the trained model through a Flask REST interface from the outside and expose the ML functionality to a Web app built with Node.js and JavaScript. The model can be saved using the pickle library:
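Saving with pickle takes a couple of lines; the file name here is just an example:

# Persist the trained model to disk
with open('xgboost_invoice_model.pkl', 'wb') as f:
    pickle.dump(model, f)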

Finally, we plot the training results based on the output for logloss and classification error. This helps to understand whether the training iteration which was selected as the best one was actually a good choice. Based on the plot, we can see that iteration 71 was one of the most optimal in terms of training and testing errors, which means the XGBoost decision to pick this iteration was good:
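A sketch of the log loss plot; to also draw the classification error curve, both metrics have to be recorded during training (eval_metric=['error', 'logloss']), and the same plotting pattern is then repeated for the 'error' key:

# Per-iteration metrics recorded for train (validation_0) and test (validation_1)
results = model.evals_result()
epochs = len(results['validation_0']['logloss'])
x_axis = range(epochs)

fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
pyplot.show()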

A solution for XGBoost early stopping and results plotting was inspired by this blog post — Avoid Overfitting By Early Stopping With XGBoost In Python.

The complete Jupyter notebook for this post can be downloaded from my GitHub repo. The training data can be downloaded from here.
