XGBoost has established itself as one of the most important machine learning algorithms due to its versatility and impressive performance.
Having used XGBoost for previous projects, I am always open to making its implementation faster, better, and cheaper. My curiosity was piqued when I came across AutoXGB, which claims to be an automated tool for simplifying the training and deployment of XGBoost models.
Given my work in the financial services sector, where fraud is a significant concern, this seemed like an excellent opportunity to use credit card fraud data to assess how AutoXGB fares against the standard XGBoost setup I typically use.
Contents
(1) Data Acquisition and Understanding
(2) Handling Class Imbalance
(3) Choice of Performance Metric
(4) Baseline – XGBoost with RandomizedSearchCV
(5) Putting AutoXGB to the Test
(6) Final Verdict
Click here to view the GitHub repo for this project
(1) Data Acquisition and Understanding
Overview
This project uses the credit card transaction data from the research collaboration between Worldline and the Machine Learning Group of the Université Libre de Bruxelles on fraud detection (used under the GNU Public License).
The dataset is a realistic simulation of real-world credit card transactions and has been designed to include complicated fraud detection issues.
These issues include class imbalance (only <1% of transactions are fraudulent), a mix of numerical and (high cardinality) categorical variables, and time-dependent fraud occurrences.
Data Transformation
We will be using the transformed dataset instead of the raw one to align with the project objective. The details of the baseline feature engineering are out of scope, so here is a brief summary (with a small illustrative sketch after the list):
- Indicate whether the transaction occurred (i) during the day or night; (ii) on a weekday or weekend. This is done because fraudulent patterns differ based on time of day and day of week
- Characterize customer spending behavior (e.g., average spending, number of transactions) by using the RFM (Recency, Frequency, Monetary value) metric
- Classify risk associated with each payment terminal by calculating its average count of fraud cases over a time window
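As a small illustration of the first point, the weekend and night flags can be derived directly from the transaction timestamp. The snippet below is a minimal sketch of that idea, assuming a TX_DATETIME column; the full RFM and terminal-risk features from the handbook are not reproduced here.

```python
import pandas as pd

# Minimal sketch of the day/night and weekday/weekend flags.
# The TX_DATETIME column name is an assumption based on the source dataset.
def add_time_flags(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    tx_dt = pd.to_datetime(df["TX_DATETIME"])
    # Weekend flag: Saturday (5) or Sunday (6)
    df["TX_DURING_WEEKEND"] = (tx_dt.dt.weekday >= 5).astype(int)
    # Night flag: transactions between midnight and 6 am
    df["TX_DURING_NIGHT"] = (tx_dt.dt.hour <= 6).astype(int)
    return df
```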

Predictor and Target Features
Upon completion of the feature engineering, we have a dataset with the following features:
- TX_AMOUNT: Transaction amount in dollars [float]
- TX_DURING_WEEKEND: Whether transaction took place on the weekend [boolean]
- TX_DURING_NIGHT: Whether transaction took place at night [boolean]
- CUSTOMER_ID_NB_TX_<n>DAY_WINDOW: Number of transactions by each customer over the last 1, 7, and 30 days [integer]
- CUSTOMER_ID_AVG_AMOUNT_<n>DAY_WINDOW: Average amount (dollars) spent by each customer over the last 1, 7, and 30 days [float]
- TERMINAL_ID_NB_TX_<n>DAY_WINDOW: Number of transactions on the terminal over the last 1, 7, and 30 days [integer]
- TERMINAL_ID_RISK_<n>DAY_WINDOW: Average number of fraudulent transactions on the terminal over the last 1, 7, and 30 days [float]
- TX_FRAUD: Indicator for whether the transaction is legitimate (0) or fraudulent (1) [boolean]
From the target variable TX_FRAUD, we can see that we are dealing with a binary classification task.
Train-Test Split with Delay Period
An important aspect to consider in fraud detection is the delay period (aka feedback delay). In real-world situations, a fraudulent transaction is only known some time after a complaint or investigation has been made.
Therefore, we need to introduce a sequential delay period (e.g., one week) to separate the train and test sets, where the test set should occur at least one week after the last transaction of the train set.
The train-test split is as follows (a minimal code sketch follows the list):
- Train Set: 8 weeks (2018-Jul-01 to 2018-Aug-27)
- Delay Period: 1 week (2018-Aug-28 to 2018-Sep-03)
- Test Set: 1 week (2018-Sep-04 to 2018-Sep-10)
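The sketch below shows one way such a delayed split could be expressed, assuming the TX_DATETIME column from the dataset; the handbook's own helper functions are not shown here.

```python
import pandas as pd

# Sketch of a train/test split separated by a one-week delay period.
# The TX_DATETIME column and the dates are taken from the split described above.
def delayed_split(df: pd.DataFrame):
    tx_date = pd.to_datetime(df["TX_DATETIME"]).dt.date
    train = df[(tx_date >= pd.Timestamp("2018-07-01").date()) &
               (tx_date <= pd.Timestamp("2018-08-27").date())]
    # The one-week delay period (2018-08-28 to 2018-09-03) is deliberately excluded
    test = df[(tx_date >= pd.Timestamp("2018-09-04").date()) &
              (tx_date <= pd.Timestamp("2018-09-10").date())]
    return train, test
```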

(2) Handling Class Imbalance
Fraudulent transactions do not happen regularly, so it is no surprise that we have a heavily imbalanced dataset on our hands.
From the value count of the target variable, there were only 4,935 fraudulent cases (0.9%) out of the 550k+ transactions.

We can use the Synthetic Minority Oversampling Technique (SMOTE) to handle this class imbalance. The authors of the original SMOTE paper combined SMOTE and random under-sampling in their implementation, so I used that combination in this project.
The specific sampling strategy is as follows:
- SMOTE over-sampling to increase the minority class to 5% of the total dataset (a roughly 500% increase from the original 0.9%), then
- Random under-sampling to make the majority class twice the size of the minority class (i.e., minority class 50% the size of majority class)
Although the SMOTE authors showed that varying combinations gave comparable results, I chose the 500%/50% combination because the paper showed that it gave the highest accuracy on minority examples in the Oil dataset.
After this sampling, the dataset is more balanced, with the minority class increasing from 0.9% to 33.3% of the entire dataset.
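A minimal sketch of this resampling strategy with the imbalanced-learn library is shown below. The sampling_strategy values are my translation of the 5% / 50% targets above into imblearn's minority-to-majority ratios, and X_train / y_train are assumed to come from the train set described earlier.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# SMOTE first brings the minority class up to ~5% of the dataset
# (minority-to-majority ratio of about 0.05 / 0.95), then random under-sampling
# shrinks the majority class until the minority is 50% of its size.
resampler = Pipeline(steps=[
    ("smote", SMOTE(sampling_strategy=0.05 / 0.95, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
])

X_resampled, y_resampled = resampler.fit_resample(X_train, y_train)
```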

(3) Choice of Performance Metric
Before we begin modeling, we have to decide what is the ideal metric to assess model performance.
The typical ones are threshold-based metrics like accuracy and F1-score. While they can assess the degree of misclassification, they depend on the definition of a specific decision threshold, e.g., probability > 0.5 = fraud.
These metrics’ dependence on a decision threshold makes it challenging to compare different models. Therefore, a better choice would be threshold-free metrics such as AUC-ROC and Average Precision.
Average Precision summarizes the precision-recall curve as the weighted mean of precisions achieved at each threshold, with the weight being the increase in recall from the previous threshold.
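In other words, AP = Σ (Rₙ − Rₙ₋₁) · Pₙ over the thresholds n. The toy snippet below, with made-up labels and scores, shows that scikit-learn's average_precision_score matches this weighted-mean definition computed from the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy example: AP is the weighted mean of precisions at each threshold,
# weighted by the increase in recall from the previous threshold.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

ap = average_precision_score(y_true, y_score)

precision, recall, _ = precision_recall_curve(y_true, y_score)
# Recompute AP from the curve: sum over (R_n - R_{n-1}) * P_n
ap_manual = -np.sum(np.diff(recall) * precision[:-1])

print(ap, ap_manual)  # the two values agree
```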
While AUC-ROC is more common, Average Precision is chosen as the primary reporting metric in this project for the following reasons:
- Average Precision is more informative than AUC-ROC for imbalanced datasets. While we have applied sampling to balance our data, I felt that there was still a degree of residual imbalance (67:33 instead of 50:50)
- Too many false positives can overburden the fraud investigation team, so we want to assess the recall (aka sensitivity) at low false-positive rate (FPR) values. The advantage of using PR curves (and AP) over ROC is that PR curves can effectively highlight model performance for low FPR values.

(4) Baseline – XGBoost with RandomizedSearchCV
The setup for the baseline model (XGBoost with RandomizedSearchCV) is the one that I tend to use as a first-line approach for classification tasks.
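The exact search space lives in the GitHub repo; the sketch below shows the general shape of this setup. The parameter ranges are illustrative assumptions rather than the ones used in the project, and X_resampled / y_resampled are the outputs of the resampling step above.

```python
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space only; the actual ranges used are in the repo.
param_distributions = {
    "n_estimators": [100, 200, 400, 600],
    "max_depth": [3, 4, 5, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

xgb = XGBClassifier(
    objective="binary:logistic",
    eval_metric="aucpr",   # AUC under the PR curve, aligned with Average Precision
    random_state=42,
)

search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_distributions,
    n_iter=50,
    scoring="average_precision",
    cv=5,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_resampled, y_resampled)
```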
Here are the test set prediction results from the baseline model, where the key metric of Average Precision is 0.776. The time taken for training was 13 minutes.

(5) Putting AutoXGB to the Test
In line with the recent rise of AutoML solutions, AutoXGB is a library that automatically trains, evaluates, and deploys XGBoost models from tabular data in CSV format.
The hyperparameter tuning is done automatically using Optuna, and the deployment is carried out with FastAPI.
AutoXGB was developed by Abhishek Thakur, a researcher at HuggingFace who holds the title of the world’s first 4x Kaggle Grandmaster. In his own words, AutoXGB is a no-brainer for setting up XGBoost models.
Therefore, I was keen to tap into his expertise to explore and improve the way my XGBoost models are usually implemented.
To install AutoXGB, run the following command:
pip install autoxgb
The AutoXGB framework significantly simplifies the steps required to set up XGBoost training and prediction. Here is the code used to set up AutoXGB for the binary classification task:
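The original setup code is in the repo; a minimal sketch following the AutoXGB README looks roughly like this. The file and column names are illustrative assumptions, and parameter names may differ slightly between AutoXGB versions.

```python
from autoxgb import AutoXGB

# Sketch based on the AutoXGB README; file and column names are assumptions.
axgb = AutoXGB(
    train_filename="train.csv",   # transformed training data in CSV format
    test_filename="test.csv",     # optional hold-out set for predictions
    output="axgb_output",         # folder where models and artifacts are written
    targets=["TX_FRAUD"],         # binary target column
    idx="TRANSACTION_ID",         # unique row identifier
    use_gpu=False,
    seed=42,
)
axgb.train()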
Here are the test set results from AutoXGB, where the key metric of Average Precision is 0.782. The time taken for training was 9 minutes.

(6) Final Verdict
Results Comparison

AutoXGB delivered a slightly higher Average Precision score of 0.782 as compared to the baseline score of 0.776.
The time taken for AutoXGB training is approximately 30% shorter, taking just 9 minutes as compared to the baseline of 13 minutes.
Another key advantage of AutoXGB is that we are only one command line away from serving the model as a FastAPI endpoint. This setup reduces the time to model deployment, which is a critical factor beyond training time.
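For reference, serving looks roughly like the line below, pointing at the output folder from training; flag names follow the AutoXGB README at the time of writing, so check autoxgb serve --help for your version.
autoxgb serve --model_path axgb_output --port 8080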
The following factors are likely the ones that drive AutoXGB’s better performance:
- Use of Bayesian optimization with Optuna for hyperparameter tuning, which is faster than randomized search as it uses information from previous iterations to find the best hyperparameters in fewer iterations (a generic sketch of this idea follows the list)
- Careful selection of XGBoost hyperparameters (type and range) for tuning based on the author’s extensive Data Science experience
- Optimization of memory usage with specific typecasting, e.g., converting values with data type int64 to int8 (which consumes 8x less memory)
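To make the first point concrete, an Optuna tuning loop generally looks like the sketch below. This is a generic illustration, not AutoXGB's internal search space, and X_resampled / y_resampled / X_valid / y_valid are assumed training and validation splits.

```python
import optuna
from xgboost import XGBClassifier
from sklearn.metrics import average_precision_score

# Generic Optuna (TPE/Bayesian) tuning loop; not AutoXGB's actual search space.
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = XGBClassifier(**params, eval_metric="logloss")
    model.fit(X_resampled, y_resampled)
    preds = model.predict_proba(X_valid)[:, 1]
    return average_precision_score(y_valid, preds)

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=50)
print(study.best_params)
```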
Important Caveats
While the performance metrics give AutoXGB an edge, one of its most significant issues is the loss of granular control in the parameter settings.
If you have been following closely, you may have realized that we did not apply the sequential delay period (used for the train/test split) to the cross-validation folds, which we should have done.
In this case, AutoXGB does not allow us to specify the validation folds we want to use for cross-validation (CV), since the only CV-related parameter is n_folds (the number of CV folds).
Another issue with this lack of control is that we cannot specify the evaluation metric for the XGBoost classifier.
In the baseline model, I was able to set the eval_metric to aucpr (AUC under the PR curve), which aligns with our primary metric of Average Precision. However, the evaluation metric hard-coded within AutoXGB's XGBoost classifier is logloss for binary classification.
At the end of the day, while solutions like AutoXGB do well in simplifying (and possibly improving) XGBoost implementation, data scientists need to understand its limitations by knowing what goes on under the hood.
Feel free to check out the code in the GitHub repo here.
Before You Go
I welcome you to join me on a data science learning journey! Follow my Medium page and GitHub to stay in the loop of more exciting data science content. Meanwhile, have fun using AutoXGB in your ML tasks!