
In my previous article on Tidymodels, I showed how to build a customer churn prediction model using logistic regression and random forest. In this article, I want to show how cross-validation can make the modeling process more robust by ensuring that performance does not depend on a single train/test split.
What is cross-validation?
Cross-validation works by splitting the data into several subsets, or folds: the model is fit on all but one fold and evaluated on the held-out fold, and this is repeated until every fold has served as the validation set once. The goal is to estimate how well the model generalizes to unseen data. It also helps detect overfitting, where the model performs significantly better on the training data than on the test data, and makes the evaluation of the model more robust.
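To make the idea concrete, here is a minimal sketch of what a 5-fold split looks like with rsample, using the built-in mtcars data purely for illustration (the churn data is introduced below).
# Minimal illustration: 5-fold cross-validation with rsample
library(rsample)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)
folds

# Each fold pairs an "analysis" set (used for fitting) with a held-out
# "assessment" set (used to evaluate that fold)
nrow(analysis(folds$splits[[1]]))
nrow(assessment(folds$splits[[1]]))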
Since I covered logistic regression and random forest in the last article, I will showcase cross-validation with XGBoost in this one, using the same dataset: Binary Classification of Bank Churn Synthetic Data by Simarpreet Singh, available under a Creative Commons Attribution 4.0 International License (CC BY 4.0) from Kaggle. The dataset includes the column "Exited", which denotes whether the customer left or stayed; this is the variable I will predict.
bankchurn_df <- read.csv("./data/bank_churn.csv")

bankchurn_df |>
  glimpse()
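Since Exited is the outcome I will predict and stratify on later, a quick look at its class balance can be useful. This is an optional check, assuming the 0/1 coding from the Kaggle dataset.
# Optional: check the class balance of the target variable
bankchurn_df |>
  count(Exited) |>
  mutate(prop = n / sum(n))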

Code
Code for everything in this article can be found in my GitHub Repo.
Below I will cover all the steps needed to perform cross-validation with XGBoost. If you have read my previous article on Tidymodels classification modeling, you can skip to Step 4, as you will already be familiar with the environment setup, data cleaning, data splitting and feature engineering.
Let’s get started!
Step 0: Set up the environment
Install the required packages by running install.packages("package_name") prior to loading them.
# Loading packages
library(tidyverse)
library(tidymodels)
Step 1: Clean the data
I will remove the columns that contain the term 'Surname' because they won't be relevant as features for the model. I will also convert the output variable, Exited, to a factor variable for binary classification.
bankchurn_df_upd <- bankchurn_df |>
  select(Exited, everything()) |>
  mutate(Exited = as.factor(Exited)) |>
  select(-contains("Surname"))
Step 2: Split the data
I will use the initial_split() function to split the dataset into training and testing sets, and access each split with the training() and testing() functions.
# Splitting the data into training and test sets
set.seed(123)
bc_split <- initial_split(bankchurn_df_upd, prop = 3/4, strata = Exited)
train_data <- training(bc_split)
test_data <- testing(bc_split)
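To check the result, you can print the split object and the dimensions of each set; this is a small optional check.
# Optional: inspect the split object and the resulting set sizes
bc_split
dim(train_data)
dim(test_data)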
This is what the split looks like:

Step 3: Feature engineering
I will use the recipe() function to set up a series of steps for common feature engineering tasks. In this scenario, I will turn nominal variables into dummy variables, remove any columns with zero variance and normalize the numeric predictors.
# Feature engineering recipe (left untrained; the workflow preps it during fitting)
bc_recipe <- recipe(Exited ~ ., data = bankchurn_df_upd) |>
  step_dummy(all_nominal(), -all_outcomes()) |>
  step_zv(all_numeric()) |>
  step_normalize(all_numeric())

Step 4: Create XGBoost model specification
I will set up a model specification for XGBoost by picking some initial parameter values. I will not cover tuning in this article, but it is a great technique for arriving at the best values for some of these parameters and improving performance.
# XGBoost model specification
xgb_mod <- boost_tree(
  mode = "classification", trees = 1000, tree_depth = 6, learn_rate = 0.01,
  loss_reduction = 0, sample_size = 1, mtry = 3, min_n = 10) |>
  set_engine("xgboost")

Step 5: Create workflow
I will now create a workflow that combines the recipe with the model for training. A workflow bundles the pre-processing steps and the model specification so that the same steps are applied consistently during both training and validation.
xgb_workflow <- workflow() |>
  add_model(xgb_mod) |>
  add_recipe(bc_recipe)

Step 6: Cross-validation setup
To create the cross-validation folds, I will use the vfold_cv() function, where I provide the dataset and the value of v, which denotes the number of folds I want to create. For this scenario, I will set it to 5 for a quicker run, although the default value is 10. Cross-validation will train and evaluate the model on different subsets of the data to assess its performance more reliably.
# Cross-validation setup
set.seed(123)
bc_folds <- vfold_cv(bankchurn_df_upd, v = 5, strata = Exited)

Step 7: Fit and evaluate the cross-validation folds
Now that I have created the model workflow, I will fit the model on the training dataset, but instead of using a single data split I will use the cross-validation folds.
xgb_res <- xgb_workflow |>
  fit_resamples(
    resamples = bc_folds,
    metrics = metric_set(roc_auc, accuracy, sensitivity, specificity),
    control = control_resamples(save_pred = TRUE)
  )
In this step:
- The fit_resamples() function fits the workflow to each of the resampling folds in bc_folds.
- The metric_set() function specifies the metrics I want to collect in the next step.
- The control_resamples(save_pred = TRUE) option saves the predictions for each fold, which I can use for building and comparing the ROC curves.

I will use collect_metrics() to assess the average metrics across all the folds. Note that the metrics here are the same ones that I passed to the fit_resamples() function earlier.
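A minimal call on the resampling results looks like this (output not shown).
# Average metrics across the five cross-validation folds
xgb_res |>
  collect_metrics()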

I will also take a look at the confusion matrix for the resampled training sets by running the conf_mat_resampled() function on the results.
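This relies on the predictions saved with save_pred = TRUE; a minimal call looks like this (output not shown).
# Confusion matrix averaged across the resampled folds
xgb_res |>
  conf_mat_resampled()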

I can use autoplot() to plot the ROC curve for all the folds simultaneously by grouping the predictions by fold id.
xgb_res |>
  collect_predictions() |>
  group_by(id) |>
  roc_curve(Exited, .pred_1) |>
  autoplot()

Now that I have seen how XGBoost performs on different subsets of the data through cross-validation and checked that its results are consistent across folds, I will fit the XGBoost model on the training portion of my original split.
Step 8: Fit the training dataset and evaluate on testing dataset
I will extract the best-performing configuration based on the ROC AUC metric and use last_fit() to train the model on the entire training split. This function fits a model to the training dataset and also evaluates it on the test dataset in a single step. I will then use collect_predictions() on the final fit to get the predictions on the test dataset.
best_params <- xgb_res |>
  select_best(metric = "roc_auc")

final_xgb_workflow <- xgb_workflow |>
  finalize_workflow(best_params)

final_fit <- final_xgb_workflow |>
  last_fit(bc_split)

test_predictions <- final_fit |>
  collect_predictions()

As with cross-validation, I will extract the metrics from this final fitted model using collect_metrics() and build an ROC curve using autoplot().
# Collecting the final fitted results
final_fit |>
  collect_metrics()

# Visualizing the ROC curve
test_predictions |>
  roc_curve(truth = Exited, .pred_1) |>
  autoplot()


I hope this article helped you understand how cross-validation is implemented with Tidymodels. A great next step is to also tune the models to find the best-fit parameters, which I will cover in another article.
Code for everything in this article can be found in my GitHub Repo.
If you’d like, find me on LinkedIn.
All images in this article are by the author, unless mentioned otherwise.