Finding optimal parameter values can significantly improve machine learning models. I enjoy experimenting with different values during tuning and watching my models get better at prediction. However, it can be troublesome to figure out the names, ranges, and valid values for each parameter. Thankfully, the dials library, which is part of tidymodels, was created to make parameter tuning a lot easier.
In this post, I present three ways to tune parameters with tidymodels and provide example code. Most tidymodels examples I have seen so far do not include dials in the modelling process. While dials is not required, it is helpful for parameter tuning.
Overview
I use tidymodels for the demonstration, including rsample for splitting data, parsnip for modelling, workflow for bundling the process, tune for tuning, and dials for parameter management. If you’re reading this, I assume you have some knowledge of tidymodels. Overall, there are three ways to work with parameters (that I know of):
- Apply default values in parsnip
- Use tune to create a tuning grid and cross-validate the parsnip model
- Create a tuning grid with dials to be used by tune to cross-validate the parsnip model
I always stress this in my posts, but I will do it again here: this is by no means an exhaustive guide. Data science, as well as tidymodels, is continuously evolving and changing. I am sharing what I have learnt so far and, hopefully, helping people understand how to use these libraries. Please let me know if you have any questions or if you find this article helpful.
Import Data
I use the penguins¹ dataset included in tidymodels. The raw data has 344 rows and 7 columns and is glimpsed below; before modelling, I drop the rows with NA values and exclude the island feature. The dataset is relatively small but has a mix of nominal and numerical variables. The target variable is body_mass_g, which is numerical, making this a regression task.
library(tidymodels)
data("penguins")

# Glimpse the raw data before preprocessing
penguins %>% glimpse()

# Drop the island feature and the rows with missing values
penguins <- penguins %>%
  select(-island) %>%
  drop_na()
## Rows: 344
## Columns: 7
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie...
## $ island <fct> Torgersen, Torgersen, Torgersen...
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, ...
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8,...
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195,...
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625...
## $ sex <fct> male, female, female, NA, female, male...
As the goal is to demonstrate the various approaches to parameter tuning, I am not doing the exploratory data analysis, preprocessing, feature engineering, and other work typically performed in a machine learning project. Here, I split the data into training and testing sets and set up 5-fold cross-validation with 2 repetitions.
set.seed(300)
split <- initial_split(penguins, prop = 0.75)
penguins_train <- training(split)
penguins_test <- testing(split)
folds_5 <- vfold_cv(penguins_train, v = 5, repeats = 2)
Create a random forest model
I usually need to consider the project goal, data type, and other factors before shortlisting model options. Here, I decide to use a random forest because it has several tunable parameters and I like it.
In tidymodels, parsnip provides a tidy, unified interface to models. One challenge in R is that function arguments and parameter names differ across packages. For example, both randomForest and ranger build random forest models, but randomForest has mtry and ntree while ranger has mtry and num.trees. Even though ntree and num.trees refer to the same idea, they are named differently. I frequently found myself calling ?randomForest or ?ranger to look up argument names while coding. parsnip was created to solve this problem.
Find available engines
First, I look up the name of the random forest model on the parsnip reference page. Surprisingly, it is not rf but rand_forest. parsnip provides several engines for each model, and I can call show_engines("rand_forest") to list all the available engines:
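show_engines("rand_forest")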
## # A tibble: 6 x 2
## engine mode
## <chr> <chr>
## 1 ranger classification
## 2 ranger regression
## 3 randomForest classification
## 4 randomForest regression
## 5 spark classification
## 6 spark regression
Next, show_model_info("rand_forest") shows the modes, arguments, fit modules, and prediction modules. The output below is cropped to display only the arguments, which clearly translate the parameter names between rand_forest and the three engines:
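show_model_info("rand_forest")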
## arguments:
## ranger:
## mtry --> mtry
## trees --> num.trees
## min_n --> min.node.size
## randomForest:
## mtry --> mtry
## trees --> ntree
## min_n --> nodesize
## spark:
## mtry --> feature_subset_strategy
## trees --> num_trees
## min_n --> min_instances_per_node
By now, I have selected rand_forest as the model and randomForest as the engine, and I know that the model has three parameters: mtry, trees, and min_n. In the subsequent sections, I discuss the three ways outlined in the overview to work with these parameters.
1. Use default parameters in parsnip
rf_spec is a random forest model specification created with parsnip. I do not specify values for any parameters, so the engine’s default values are used. As always, I then fit the model on the training data; the default parameters are printed along with the fitted model.
# Create random forest specification
rf_spec <-
  rand_forest(mode = "regression") %>%
  set_engine("randomForest")

# Fit training data
model_default <-
  rf_spec %>%
  fit(body_mass_g ~ ., data = penguins_train)

model_default
## parsnip model object
##
## Fit time: 133ms
##
## Call:
## randomForest(x = maybe_data_frame(x), y = y)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 89926.71
## % Var explained: 85.51
The model using default parameters is not cross-validated. Because there is only one set of parameters, cross-validation would not improve the model; it would only give a performance estimate closer to what we can expect on the testing data. Once the training is done, I make predictions on the testing data and calculate the performance metrics with yardstick.
model_default %>%
  predict(penguins_test) %>%
  bind_cols(penguins_test) %>%
  metrics(body_mass_g, .pred)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 317.
## 2 rsq standard 0.877
## 3 mae standard 242.
2. Use tune to tune the parsnip model
Earlier, rf_spec used the default parameter values. To tune the parameters, I need to mark the corresponding arguments for tuning. I provide two ways to do so below.
# Option 1: Update the existing specification
rf_spec <-
  rf_spec %>%
  update(mtry = tune(), trees = tune())

# Option 2: Start again
rf_spec_new <-
  rand_forest(
    mode = "regression",
    mtry = tune(),
    trees = tune()
  ) %>%
  set_engine("randomForest")
tune() is a placeholder for values waiting to be tuned. So now, rf_spec needs to try different parameter values and find the best ones, and this is where cross-validation comes in. For such a small dataset, cross-validated performance is a more representative estimate of how the model will do on new data.
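As a quick optional check, I can list which arguments are now flagged for tuning. This assumes a reasonably recent tidymodels installation where extract_parameter_set_dials() is available (older releases used parameters() for the same purpose).

# Show the parameters marked with tune() in the specification
rf_spec %>% extract_parameter_set_dials()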
Sounds pretty neat, right? tidymodels makes it pretty neat, too, with workflow, which bundles pre-processing, modelling, and post-processing requests. A workflow is commonly used to include data preprocessing with recipes, but I skip that here. Instead, I specify the outcomes and predictors directly and add the model specification.
# Create a workflow
rf_workflow <-
  workflow() %>%
  add_variables(
    outcomes = body_mass_g, predictors = everything()
  ) %>%
  add_model(rf_spec)
Manually provide values
Now, we have finally reached the tuning stage! The [tune_grid()](https://tune.tidymodels.org/reference/tune_grid.html) function takes the cross-validation resamples and the parameter grid, which should be a data frame of parameter combinations. expand.grid() is an easy way to generate all combinations of its inputs. I list three values each for mtry and trees, generating nine combinations (three times three).
I don’t know whether this is a wake-up call that I should replace my five-year-old MacBook or whether modelling simply takes time, but tuning always takes a significant amount of time. After it completes, I call collect_metrics() to check the results.
set.seed(300)
manual_tune <-
  rf_workflow %>%
  tune_grid(
    resamples = folds_5,
    grid = expand.grid(
      mtry = c(1, 3, 5),
      trees = c(500, 1000, 2000)
    )
  )
collect_metrics(manual_tune)
## # A tibble: 18 x 8
## mtry trees .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 500 rmse standard 306. 10 9.84 Preprocessor1_Model1
## 2 1 500 rsq standard 0.858 10 0.0146 Preprocessor1_Model1
## 3 3 500 rmse standard 301. 10 14.3 Preprocessor1_Model2
## 4 3 500 rsq standard 0.854 10 0.0178 Preprocessor1_Model2
## 5 5 500 rmse standard 303. 10 14.5 Preprocessor1_Model3
## 6 5 500 rsq standard 0.852 10 0.0180 Preprocessor1_Model3
## 7 1 1000 rmse standard 305. 10 9.82 Preprocessor1_Model4
## 8 1 1000 rsq standard 0.859 10 0.0143 Preprocessor1_Model4
## 9 3 1000 rmse standard 300. 10 14.5 Preprocessor1_Model5
## 10 3 1000 rsq standard 0.854 10 0.0180 Preprocessor1_Model5
## 11 5 1000 rmse standard 304. 10 14.5 Preprocessor1_Model6
## 12 5 1000 rsq standard 0.851 10 0.0180 Preprocessor1_Model6
## 13 1 2000 rmse standard 306. 10 10.1 Preprocessor1_Model7
## 14 1 2000 rsq standard 0.858 10 0.0144 Preprocessor1_Model7
## 15 3 2000 rmse standard 300. 10 14.5 Preprocessor1_Model8
## 16 3 2000 rsq standard 0.854 10 0.0179 Preprocessor1_Model8
## 17 5 2000 rmse standard 304. 10 14.7 Preprocessor1_Model9
## 18 5 2000 rsq standard 0.851 10 0.0181 Preprocessor1_Model9
Too many rows to read? I agree. Let’s focus on the best-performing combination using show_best(). It suggests that mtry = 3 and trees = 2000 are the best parameters.
show_best(manual_tune, n = 1)
## # A tibble: 1 x 8
## mtry trees .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 3 2000 rmse standard 300. 10 14.5 Preprocessor1_Model8
manual_tune holds tuning results rather than a fitted model, so I need to finalise the workflow with the best arguments (mtry = 3 and trees = 2000) and fit it on the entire training set. The testing RMSE is 296, lower than the 317 from the model with default parameters, suggesting that the tuning yields an improvement.
manual_final <-
  finalize_workflow(rf_workflow, select_best(manual_tune)) %>%
  fit(penguins_train)

manual_final %>%
  predict(penguins_test) %>%
  bind_cols(penguins_test) %>%
  metrics(body_mass_g, .pred)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 296.
## 2 rsq standard 0.881
## 3 mae standard 238.
Specify grid size for automatic generation
Instead of manually providing values with expand.grid() for the grid argument, I can pass an integer that specifies the number of candidate parameter sets to try. Here, I ask for five sets of parameters. As collect_metrics() returns both RMSE and R-squared, each set produces two rows of output, so five sets lead to ten rows of results.
set.seed(300)
random_tune <-
  rf_workflow %>%
  tune_grid(
    resamples = folds_5, grid = 5
  )
collect_metrics(random_tune)
## # A tibble: 10 x 8
## mtry trees .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 1879 rmse standard 304. 10 14.5 Preprocessor1_Model1
## 2 5 1879 rsq standard 0.851 10 0.0181 Preprocessor1_Model1
## 3 2 799 rmse standard 298. 10 13.6 Preprocessor1_Model2
## 4 2 799 rsq standard 0.857 10 0.0171 Preprocessor1_Model2
## 5 3 1263 rmse standard 300. 10 14.5 Preprocessor1_Model3
## 6 3 1263 rsq standard 0.854 10 0.0179 Preprocessor1_Model3
## 7 2 812 rmse standard 297. 10 13.7 Preprocessor1_Model4
## 8 2 812 rsq standard 0.858 10 0.0171 Preprocessor1_Model4
## 9 4 193 rmse standard 302. 10 14.9 Preprocessor1_Model5
## 10 4 193 rsq standard 0.852 10 0.0182 Preprocessor1_Model5
Again, I use show_best() to focus on the best result. As a reminder, the best cross-validated RMSE from manual tuning was 300, with mtry = 3 and trees = 2000.
show_best(random_tune, n = 1)
## # A tibble: 1 x 8
## mtry trees .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 812 rmse standard 297. 10 13.7 Preprocessor1_Model4
Similarly, I finalise the workflow, fit the model on the training data, and evaluate it on the testing data. Nice! RMSE decreases from 296 to 295.
random_final <-
  finalize_workflow(rf_workflow, select_best(random_tune)) %>%
  fit(penguins_train)

random_final %>%
  predict(penguins_test) %>%
  bind_cols(penguins_test) %>%
  metrics(body_mass_g, .pred)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 295.
## 2 rsq standard 0.883
## 3 mae standard 233.
3. Create parameter values with dials
dials works with parameter objects. In my random forest model, these are mtry and trees. Each object contains information about the range, possible values, type, and so on. I find dials to be an extremely powerful and helpful tool because, too often, I had to read through the documentation, go through my books, or google to find out the range and valid values for a given parameter.
Check parameter information
Let’s focus on mtry first. I can see that it is a quantitative parameter that refers to the number of randomly selected predictors. The range goes from 1 to … wait… a question mark? Let’s try range_get() to see the range again. It is still unknown.
mtry()
## # Randomly Selected Predictors (quantitative)
## Range: [1, ?]
mtry() %>% range_get()
## $lower
## [1] 1
##
## $upper
## unknown()
Well, this is because the upper limit of the range depends on the number of predictors in the data, and I need to specify the value manually. I do so by taking the number of columns and subtracting one (the outcome variable). There are two ways to do this, as shown below.
# Option 1: Use range_set
mtry() %>% range_set(c(1, ncol(penguins_train) - 1))
## # Randomly Selected Predictors (quantitative)
## Range: [1, 5]
# Option 2: Include in the argument
mtry(c(1, ncol(penguins_train) - 1))
## # Randomly Selected Predictors (quantitative)
## Range: [1, 5]
Let’s try trees. This parameter does not depend on the data; therefore, the range is already provided.
trees()
## # Trees (quantitative)
## Range: [1, 2000]
Create values for a parameter
So now that I know about the parameter, how do I create values for it? There are two ways. I can use value_seq() to generate a sequence of n numbers spanning the range. Here, I ask for 4, 5, and 10 numbers. As you can see, the minimum and maximum values for trees are included.
trees() %>% value_seq(n = 4)
## [1] 1 667 1333 2000
trees() %>% value_seq(n = 5)
## [1] 1 500 1000 1500 2000
trees() %>% value_seq(n = 10)
## [1] 1 223 445 667 889 1111 1333 1555 1777 2000
Or I can use value_sample() to generate random numbers.
set.seed(300)
trees() %>% value_sample(n = 4)
## [1] 590 874 1602 985
trees() %>% value_sample(n = 5)
## [1] 1692 789 553 1980 1875
trees() %>% value_sample(n = 10)
## [1] 1705 272 461 780 1383 1868 1107 812 460 901
Create a grid for parameters
Let’s recall tune_grid(), the function that tunes the parameters. It requires the grid to be a data frame, and the two approaches above return vectors. So how can I generate a grid? Of course, I could simply combine the vectors into a data frame myself (a sketch of that follows), but dials has better ways to do so. Again, there are two methods: creating a sequence of numbers and creating a set of random numbers.
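For reference, here is a minimal sketch of that manual route, combining value_seq() output with expand.grid(); manual_grid is just an illustrative name and is not used later, and the grid_* helpers below are the tidier alternative.

# Manual alternative: build the grid yourself from dials-generated values
manual_grid <- expand.grid(
  mtry  = mtry(c(1, ncol(penguins_train) - 1)) %>% value_seq(n = 3),
  trees = trees() %>% value_seq(n = 3)
)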
To create a grid from a regular sequence, use grid_regular(). Add the parameters as arguments, mtry() and trees(), and specify the number of levels for each. I want three levels per parameter, resulting in nine combinations (three times three).
set.seed(300)
dials_regular <- grid_regular(
  mtry(c(1, ncol(penguins_train) - 1)),
  trees(),
  levels = 3
)
dials_regular
## # A tibble: 9 x 2
## mtry trees
## <int> <int>
## 1 1 1
## 2 3 1
## 3 5 1
## 4 1 1000
## 5 3 1000
## 6 5 1000
## 7 1 2000
## 8 3 2000
## 9 5 2000
For random numbers, use grid_random() and specify the size.
set.seed(300)
dials_random <- grid_random(
  mtry(c(1, ncol(penguins_train) - 1)),
  trees(),
  size = 6
)
dials_random
## # A tibble: 6 x 2
## mtry trees
## <int> <int>
## 1 2 1980
## 2 2 1875
## 3 1 1705
## 4 4 272
## 5 5 461
## 6 1 780
Use dials with tune_grid()
Either approach creates a data frame that is ready to be used with tune_grid(). For grid_regular():
dials_regular_tune <-
  rf_workflow %>%
  tune_grid(
    resamples = folds_5, grid = dials_regular
  )
show_best(dials_regular_tune, n = 1)
## # A tibble: 1 x 8
## mtry trees .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 3 1000 rmse standard 300. 10 14.4 Preprocessor1_Model5
dials_regular_final <-
  finalize_workflow(
    rf_workflow, select_best(dials_regular_tune)
  ) %>%
  fit(penguins_train)

dials_regular_final %>%
  predict(penguins_test) %>%
  bind_cols(penguins_test) %>%
  metrics(body_mass_g, .pred)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 296.
## 2 rsq standard 0.881
## 3 mae standard 237.
For grid_random():
dials_random_tune <-
  rf_workflow %>%
  tune_grid(
    resamples = folds_5, grid = dials_random
  )
show_best(dials_random_tune, n = 1)
## # A tibble: 1 x 8
## mtry trees .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 1875 rmse standard 297. 10 13.6 Preprocessor1_Model2
dials_random_final <-
  finalize_workflow(
    rf_workflow, select_best(dials_random_tune)
  ) %>%
  fit(penguins_train)

dials_random_final %>%
  predict(penguins_test) %>%
  bind_cols(penguins_test) %>%
  metrics(body_mass_g, .pred)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 296.
## 2 rsq standard 0.882
## 3 mae standard 234.
Conclusion
In this post, I explained three ways to work with tuning parameters:
- Apply default values in parsnip: without specifying values for the parameters, parsnip uses the default values that come with the selected engine.
- Use tune with parsnip: the tune_grid() function cross-validates sets of parameters. It can work with a pre-defined data frame of values or generate candidate sets on its own.
- Create values with dials to be used by tune to cross-validate the parsnip model: dials provides information about parameters and generates values for them. The values can be a sequence of numbers spanning the range or a set of random numbers.
I am a big fan of short and concise code. When I started playing with dials, it felt redundant and unnecessary because I could have just used tune. However, as I became more familiar with dials, I started to understand why it was created in the first place. I no longer have to go back to the documentation to see whether a parameter is an integer or a floating-point number, whether there is a range limit, and whether it requires a transformation.
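For instance (just an illustrative aside, not something used in the model above), printing a parameter object such as dials::penalty() reports its transformation along with its range, so there is nothing left to guess.

# penalty() prints its transformation and range directly
penalty()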
It took me a while to piece everything together, from rsample and recipes to tune and dials. I hope you enjoy the article, and have a wonderful day!
Reference
¹ Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081