Hands-on Tutorials

Caret vs Tidymodels: How to Use Both Packages for Machine Learning?

An example of building models with two popular packages together in R to predict bike sharing demand

Yu-En Hsu
Towards Data Science
9 min read · Jan 8, 2021


Photo by Chris Barbalis on Unsplash

Max Kuhn is the author of both packages (with contributions from many other talented people). The caret package (short for Classification And REgression Training) streamlines the process of creating predictive models and has long been the top choice among R users. It has been around for a long time, and there are numerous resources, answers, and solutions to almost any question. On the other hand, tidymodels is newer and is built on tidyverse principles. RStudio hired Max to design a tidy version of caret.

I have been using caret for predictive modelling. While I was aware of tidymodels, I only began exploring it last week. As with everything in life, adopting a new ecosystem takes time and patience, so this post is by no means an exhaustive analysis. The complete code is available on GitHub, and an HTML version of the Markdown is published.

Overview

caret is a single package with various functions for machine learning: for example, createDataPartition for splitting data and trainControl for setting up cross-validation.
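
To give a rough sense of how those pieces fit together, here is a minimal, hypothetical caret sketch (my_data and outcome are placeholder names, not objects from this post):

# A hypothetical caret workflow: split, set up resampling, train.
idx <- createDataPartition(my_data$outcome, p = 0.8, list = FALSE)
train_set <- my_data[idx, ]
test_set <- my_data[-idx, ]

ctrl <- trainControl(method = "cv", number = 10)
fit <- train(outcome ~ ., data = train_set, method = "rpart", trControl = ctrl)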

tidymodels is a collection of packages for modelling. When I execute the library(tidymodels) command, the following packages are loaded:

  • rsample: for data splitting and resampling
  • parsnip: for trying out a range of models
  • recipes: for pre-processing
  • workflows: for putting everything together
  • yardstick: for evaluating models
  • broom: for converting the information in common statistical R objects into user-friendly, predictable formats
  • dials: for creating and managing tuning parameters

Some common libraries from the tidyverse, such as dplyr, are also loaded. As shown, tidymodels breaks the machine learning workflow into multiple stages and provides a specialised package for each stage. This benefits users through increased flexibility and possibilities. However, for a beginner, it might be intimidating (at least it was for me).

Import data

The data come from the Bike Sharing Dataset in the UCI Machine Learning Repository. The goal is to predict the total count of rental bikes based on environmental and seasonal settings.

library(tidymodels) 
library(caret)
library(lubridate)
library(tidyverse)
library(moments)
library(corrr)
library(randomForest)
bike <- read_csv("Bike-Sharing-Dataset/hour.csv")
bike %>% dim()
## [1] 17379 17

There are 17,379 cases and 17 features. I removed instant, changed the formatting for year, and renamed some variables.

bike %>%
  mutate(instant = NULL, yr = yr + 2011) %>%
  rename(
    date = dteday,
    year = yr,
    month = mnth,
    hour = hr,
    weather = weathersit,
    humidity = hum,
    total = cnt
  ) ->
  bike
head(bike)
Data frame header preview

Explore data

Target variable

bike %>%
  pivot_longer(
    cols = c(casual, registered, total),
    names_to = "usertype",
    values_to = "count"
  ) %>%
  ggplot(aes(count, colour = usertype)) +
  geom_density() +
  labs(
    title = "Distribution of the number of rental bikes",
    x = "Number per hour", y = "Density"
  ) +
  scale_colour_discrete(
    name = "User type",
    breaks = c("casual", "registered", "total"),
    labels = c("Non-registered", "Registered", "Total")
  )
Target variable distribution

The distributions of rental counts are positively skewed. A roughly normal distribution is desirable, since many modelling techniques perform better when the dependent variable is not heavily skewed. I address the skewness later.

Correlation

I used the correlate() function from the corrr package, which is part of tidymodels but is not automatically loaded by the library(tidymodels) command. I am constantly surprised by how many packages there are in the tidy ecosystem. I typically prioritise tidy packages over independent ones because of the pipeline integration and consistent aesthetics, and corrr is no exception.

bike %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  rearrange(absolute = FALSE) %>%
  shave() ->
  bike_cor
rplot(bike_cor, print_cor = TRUE)
Correlation plot

Prepare data

Since I have not split the data yet, this step does not include scaling or centring, which should be fitted on the training set and then applied to the testing set. Here, I focus on processing that applies to all of the data and has no learned parameters, such as factorising or simple mathematical calculations. For example, if I take the square root of a number, I can square it to recover the original number. However, for normalisation, I need to know the minimum and maximum values of a variable, both of which may differ between the training and testing sets.
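
As a minimal illustration of why such learned parameters matter (the vectors here are made up, not taken from the bike data):

# Normalisation parameters must come from the training data only.
x_train <- c(2, 5, 9)
x_test <- c(1, 6, 12)

rng <- range(x_train) # min and max learned from the training data
scale01 <- function(x) (x - rng[1]) / (rng[2] - rng[1])

scale01(x_train) # always falls within [0, 1]
scale01(x_test)  # can fall outside [0, 1], which is expected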

Target variable

I focused on the total count, so the casual and registered variables are removed. As noted earlier, the target variable is positively skewed and requires transformation. I tried several common techniques for positively skewed data and applied the one with the lowest skewness: the cube root.

bike_all <- bike %>%
select(-casual, -registered)

# Original
skewness(bike_all$total)
## [1] 1.277301

# Log
skewness(log10(bike_all$total))
## [1] -0.936101

# Log + constant
skewness(log1p(bike_all$total))
## [1] -0.8181098

# Square root
skewness(sqrt(bike_all$total))
## [1] 0.2864499

# Cubic root
skewness(bike_all$total^(1 / 3))
## [1] -0.0831688

# Transform with cubic root
bike_all$total <- bike_all$total^(1 / 3)

Predictors

Categorical variables are converted to factors according to the attribute information provided by UCI.

bike_all$season <- factor(
  bike_all$season,
  levels = c(1, 2, 3, 4),
  labels = c("spring", "summer", "autumn", "winter")
)
bike_all$holiday <- factor(
  bike_all$holiday,
  levels = c(0, 1), labels = c(FALSE, TRUE)
)
bike_all$workingday <- factor(
  bike_all$workingday,
  levels = c(0, 1), labels = c(FALSE, TRUE)
)
bike_all$weather <- factor(
  bike_all$weather,
  levels = c(1, 2, 3, 4),
  labels = c("clear", "cloudy", "rainy", "heavy rain"),
  ordered = TRUE
)
head(bike_all)
Data frame header preview

Split data (Train/Test, Cross-Validation)

Both packages provide functions for common data-splitting strategies, such as k-fold, grouped k-fold, leave-one-out, and bootstrapping. But tidymodels appears to be more comprehensive, since it includes Monte Carlo cross-validation (I don’t know what this is, but it sounds cool) and nested cross-validation. I paid particular attention to the method because a research paper found that “nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size” (Vabalas et al., 2019).
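
For reference, a hedged sketch of what those extra strategies look like in rsample (I do not use them in this post; the argument values are illustrative):

# Monte Carlo cross-validation: repeated random 80/20 splits.
mc_folds <- mc_cv(bike_all, prop = 0.8, times = 25)

# Nested cross-validation: an inner resampling scheme inside each outer fold.
nested_folds <- nested_cv(
  bike_all,
  outside = vfold_cv(v = 5),
  inside = bootstraps(times = 25)
)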

tidymodels

The tidymodels rsample package handles data splitting. The training and testing split is done as shown below, along with 10-fold cross-validation.

set.seed(25)
split <- initial_split(bike_all, prop = 0.8)
train_data <- training(split)
train_data %>% dim()
## [1] 13904 14

test_data <- testing(split)
test_data %>% dim()
## [1] 3475 14

train_cv <- vfold_cv(train_data, v = 10)

caret

There are two options available:

  • Use caret's native functions, such as createDataPartition.
set.seed(25)
train_index <- createDataPartition(
  bike_all$total, p = 0.8, times = 1, list = FALSE
)
train_data <- bike_all[train_index, ]
test_data <- bike_all[-train_index, ]

fold_index <- createFolds(
  train_data$total,
  k = 10, returnTrain = TRUE, list = TRUE
)
train_cv <- trainControl(method = "cv", index = fold_index)
  • Use tidymodels’ rsample2caret function, which returns a list that mimics the index and indexOut elements of a trainControl object.
train_cv_caret <- rsample2caret(train_cv)
ctrl_caret <- trainControl(
  method = "cv",
  index = train_cv_caret$index,
  indexOut = train_cv_caret$indexOut
)

The two packages are pretty similar. It’s interesting that trainControl only specifies the cross-validation strategy, not the data. Nonetheless, thanks to the rsample2caret() and caret2rsample() functions from tidymodels, it’s easy to set up resampling in whichever package you prefer. Here I used rsample2caret() to generate the 10-fold indices for caret so that the cross-validation folds are identical for both.

Preprocess data

In caret, one function, preProcess, covers all pre-processing for numerical features, including imputation, centring, scaling, and power transformations. For categorical features, I can either use dummyVars to create dummy variables or leave it to train, which handles factors during model training. One thing I find frustrating is that I cannot specify different pre-processing for different variables. For example, I may want to use the Box-Cox transformation to normalise one extremely skewed variable, but preProcess performs the transformation on all predictors. If you’re familiar with sklearn in Python: I am saying that I want ColumnTransformer.
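
For context, a minimal sketch of that pattern, fitted on the training set and applied to both sets (the methods and formula here are illustrative, not what I use later):

# preProcess() estimates transformations on one data set and applies them to
# another, but it works on all eligible numeric predictors at once.
pp <- preProcess(train_data, method = c("center", "scale"))
train_scaled <- predict(pp, train_data)
test_scaled <- predict(pp, test_data)

# Factors can be expanded to dummy variables separately:
dv <- dummyVars(~ season + weather, data = train_data)
head(predict(dv, newdata = train_data))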

tidymodels

recipes grants that wish, allowing me to define a recipe, or blueprint, that sequentially specifies the encodings and preprocessing of the data (i.e. “feature engineering”). Moreover, recipes seems to offer more preprocessing methods. However, unlike caret, recipes does not automatically handle categorical variables, and I need to create dummy variables myself. Still, the ability to tailor preprocessing trumps the small inconvenience of having to generate dummy variables.

prep_recipe <-
  recipe(total ~ ., data = train_data) %>%
  step_rm(year, month, weekday) %>%
  step_date(date) %>%
  step_corr(all_numeric(), threshold = 0.8) %>%
  step_dummy(all_nominal())

caret

Similarly, I can use caret’s preProcess() function. But I always find it frustrating because all numerical variables are processed, and there is not much flexibility.

# preProcess() needs the data and the method; cutoff applies to the
# correlation filter, so the rough equivalent of step_corr() above is:
prep <- preProcess(train_data, method = "corr", cutoff = 0.8)

Again, tidymodels’ recipe can be used for caret. Here, I use prep() and bake() to transform the data, because caret does not have a workflow function.

train_data_caret <-
  prep(prep_recipe) %>% bake(new_data = NULL)

test_data_caret <-
  prep(prep_recipe) %>% bake(new_data = test_data)

Train models

Two custom functions I will use later:

# Generate prediction tables
predict_table <- function(model, data, tidy_flag) {
  if (tidy_flag == TRUE) {
    result <- model %>%
      predict(data) %>%
      rename(pred = .pred) %>%
      mutate(
        actual = data$total,
        pred_real = pred^3,
        actual_real = actual^3
      )
  } else {
    result <- model %>%
      predict(data) %>%
      as_tibble_col(column_name = "pred") %>%
      mutate(
        actual = data$total,
        pred_real = pred^3,
        actual_real = actual^3
      )
  }
  result
}

# Extract RMSE for models
pull_rmse <- function(result_table) {
  rmse_result <- rmse(result_table, pred, actual) %>%
    pull(.estimate)
  rmse_result_real <- rmse(result_table, pred_real, actual_real) %>%
    pull(.estimate)
  result <- c(rmse = rmse_result, real_rmse = rmse_result_real)
  result
}

Baseline

The baseline is the average of the total.

base_train_pred <-
  tibble(
    actual = train_data$total,
    actual_real = train_data$total^3
  ) %>%
  mutate(pred = mean(actual), pred_real = mean(actual_real))
base_test_pred <-
  tibble(
    actual = test_data$total,
    actual_real = test_data$total^3
  ) %>%
  mutate(pred = mean(actual), pred_real = mean(actual_real))
base_train_rmse <- pull_rmse(base_train_pred)
print(base_train_rmse)
## rmse real_rmse
## 2.032927 181.063306
base_test_rmse <- pull_rmse(base_test_pred)
print(base_test_rmse)
## rmse real_rmse
## 2.02608 182.61370

Decision trees with tidymodels

parsnip for modelling, workflows for, well… the workflow, tune for parameter tuning, and yardstick for performance metrics. I was also curious about the timing, so I recorded the time as well.

# Cost complexity for decision tree parameter
tree_cp <- seq(0.01, 0.1, 0.01)

set.seed(25)
tree_tidy_time1 <- Sys.time()

# Specify model
tree_engine <-
  decision_tree(mode = "regression", cost_complexity = tune()) %>%
  set_engine("rpart")

# Set workflow (Preprocess & model)
tree_workflow <-
  workflow() %>%
  add_recipe(prep_recipe) %>%
  add_model(tree_engine)

# Tune parameters with cross-validation
tree_tune <- tune_grid(
  tree_workflow,
  resamples = train_cv,
  grid = data.frame(cost_complexity = tree_cp),
  metrics = metric_set(rmse)
)

# Fit again with the best parameter
tree_best <-
  finalize_workflow(tree_workflow, select_best(tree_tune)) %>%
  fit(train_data)

tree_tidy_time2 <- Sys.time()
print(tree_tidy_time2 - tree_tidy_time1)
## Time difference of 1.376683 mins

It takes around 1 minute and 20 seconds to cross-validate ten parameter values. Once it’s done, I can predict the target variable and examine model performance with RMSE. Here I used the custom functions predict_table and pull_rmse to complete the task.

tree_tidy_train_pred <- predict_table(tree_best, train_data, TRUE)
tree_tidy_train_rmse <- pull_rmse(tree_tidy_train_pred)
print(tree_tidy_train_rmse)
## rmse real_rmse
## 1.078724 116.106006
tree_tidy_test_pred <- predict_table(tree_best, test_data, TRUE)
tree_tidy_test_rmse <- pull_rmse(tree_tidy_test_pred)
print(tree_tidy_test_rmse)
## rmse real_rmse
## 1.074347 118.205989

Decision trees with caret

set.seed(25)
tree_caret_time1 <- Sys.time()
tree_caret <- train(
  total ~ .,
  data = train_data_caret,
  method = "rpart",
  trControl = ctrl_caret,
  metric = "RMSE",
  tuneGrid = data.frame(cp = tree_cp)
)
tree_caret_time2 <- Sys.time()
print(tree_caret_time2 - tree_caret_time1)
## Time difference of 4.469931 secs

Wooooow! It only takes 4.5 seconds. Moreover, the code is much shorter. The train function includes the model (method = "rpart"), the cross-validation (trControl = ctrl_caret), and the parameter tuning (tuneGrid = data.frame(cp = tree_cp)).

tree_caret_train_pred <- predict_table(tree_caret, train_data_caret, FALSE)
tree_caret_train_rmse <- pull_rmse(tree_caret_train_pred)
print(tree_caret_train_rmse)
## rmse real_rmse
## 1.078724 116.106006
tree_caret_test_pred <- predict_table(tree_caret, test_data_caret, FALSE)
tree_caret_test_rmse <- pull_rmse(tree_caret_test_pred)
print(tree_caret_test_rmse)
## rmse real_rmse
## 1.074347 118.205989

Compare models

rbind(
  base_train_rmse, base_test_rmse,
  tree_tidy_train_rmse, tree_tidy_test_rmse,
  tree_caret_train_rmse, tree_caret_test_rmse
)
## rmse real_rmse
## base_train_rmse 2.032927 181.0633
## base_test_rmse 2.026080 182.6137
## tree_tidy_train_rmse 1.078724 116.1060
## tree_tidy_test_rmse 1.074347 118.2060
## tree_caret_train_rmse 1.078724 116.1060
## tree_caret_test_rmse 1.074347 118.2060

As you can see, the decision tree results are the same regardless of the package, since I split the data and set up cross-validation in the same way, and both tidymodels and caret use rpart as the underlying engine. So it seems strange that tidymodels takes over a minute while caret needs only 4–5 seconds to run the decision tree.
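
One quick sanity check, not part of the original analysis, is to confirm that both frameworks selected the same cost-complexity value:

# Compare the selected cost-complexity parameter from each framework.
select_best(tree_tune) # tidymodels (tune)
tree_caret$bestTune    # caret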

Conclusion

I have been using tidymodels for a few weeks now, and I really enjoy the integration with the tidyverse. But I find it confusing to have so many steps and objects; for example, I keep trying to get RMSE from a workflow object. I probably need a bit more time to get familiar with the new ecosystem.
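
For what it’s worth, the pattern that eventually clicked for me is that metrics live on the tuning or resampling results rather than on the workflow object itself; a small sketch using the objects above:

# Cross-validated RMSE for each cost_complexity value comes from the tuning results.
collect_metrics(tree_tune)

# Hold-out RMSE comes from predictions, as in pull_rmse() earlier.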

The complete code is available on GitHub, and an HTML version of the Markdown is published. As always, I hope this post is helpful. Have a wonderful day!

Reference

Vabalas, A., Gowen, E., Poliakoff, E., & Casson, A. J. (2019). Machine learning algorithm validation with a limited sample size. PloS one, 14(11), e0224365.
