
Dials, Tune, and Parsnip: Tidymodels’ Way to Create and Tune Model Parameters

Three tidy approaches to managing parameter values, with example code to predict penguin body mass

Photo by Ian Parker on Unsplash

Finding optimal parameter values can significantly improve machine learning models. I enjoy experimenting with different values during tuning and watching my models get better at prediction. However, it can be troublesome to figure out the names, ranges, and possible values of each parameter. Thankfully, the dials package, part of tidymodels, was created to make parameter tuning a lot easier.

In this post, I present three ways to tune parameters with tidymodels and provide example code. Most tidymodels examples I have seen so far do not include dials in the modelling process. While dials is not required, it is helpful for parameter tuning.


Overview

I use tidymodels for the demonstration: rsample for splitting the data, parsnip for modelling, workflows for bundling the process, tune for tuning, and dials for parameter management. If you are reading this, I assume you have some knowledge of tidymodels. Overall, there are three ways (that I know of) to work with parameters:

  1. Apply default values in parsnip
  2. Use tune to create a tuning grid and cross-validate parsnip model
  3. Create a tuning grid with dials to be used by tune to cross-validate parsnip model

I always stress this in my posts, but I will do it again here: this is by no means an exhaustive guide. Data science, as well as tidymodels, is continuously evolving and changing. I am sharing what I have learnt so far and hopefully helping people understand how to use these libraries. Please let me know if you have any questions or if you find this article helpful.


Import Data

I use the penguins¹ dataset included in tidymodels. The raw data has 344 rows and 7 columns; after a quick look with glimpse(), I drop the rows with NA values and exclude the island feature, leaving 333 rows and 6 columns. The dataset is relatively small but has a mix of nominal and numerical variables. The target variable is body_mass_g, which is numerical, making this a regression task.

library(tidymodels)
data("penguins")
penguins %>% glimpse()
## Rows: 344
## Columns: 7
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie...
## $ island            <fct> Torgersen, Torgersen, Torgersen...
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, ...
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8,...
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195,...
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625...
## $ sex               <fct> male, female, female, NA, female, male...

penguins <- penguins %>%
  select(-island) %>%
  drop_na()

As the goal is to demonstrate different approaches to parameter tuning, I skip exploratory data analysis, preprocessing, feature engineering, and other work typically performed in a machine learning project. Here, I split the data into training and testing sets and set up 5-fold cross-validation with 2 repeats.

set.seed(300)
split <- initial_split(penguins, prop = 0.75)
penguins_train <- training(split)
penguins_test <- testing(split)
folds_5 <- vfold_cv(penguins_train, v = 5, repeats = 2)

Create a random forest model

I usually need to consider the project goal, data type, and other factors before shortlisting model options. Here, I decide to use a random forest because it has several tunable parameters, and I like it.

In tidymodels, parsnip provides a tidy, unified interface to models. One of the challenges in R is that function arguments and parameter names differ across packages. For example, both randomForest and ranger fit random forest models, but randomForest has mtry and ntree while ranger has mtry and num.trees. Even though ntree and num.trees refer to the same idea, they are named differently. I frequently found myself calling ?randomForest or ?ranger just to look up argument names while coding. parsnip was created to solve this problem.

Find available engines

Firstly, I find the name of the random forest model on the parsnip reference page. Surprisingly, it is not rf but rand_forest. parsnip provides several engines for each model, and I can call show_engines("rand_forest") to list all the available engines.

show_engines("rand_forest")
## # A tibble: 6 x 2
##   engine       mode          
##   <chr>        <chr>         
## 1 ranger       classification
## 2 ranger       regression    
## 3 randomForest classification
## 4 randomForest regression    
## 5 spark        classification
## 6 spark        regression

Next, show_model_info("rand_forest") shows the modes, arguments, fit modules, and prediction modules. The following output is cropped to display only the arguments, which clearly shows how the parameter names translate between rand_forest and the three engines.

show_model_info("rand_forest")
##  arguments: 
##    ranger:       
##       mtry  --> mtry
##       trees --> num.trees
##       min_n --> min.node.size
##    randomForest: 
##       mtry  --> mtry
##       trees --> ntree
##       min_n --> nodesize
##    spark:        
##       mtry  --> feature_subset_strategy
##       trees --> num_trees
##       min_n --> min_instances_per_node
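
To see this translation in action for a concrete specification, parsnip provides translate(), which prints the call that will be generated for the chosen engine. Below is a minimal sketch, separate from the model built in the rest of this post; the argument values are arbitrary.

# The same parsnip argument names sent to two different engines;
# translate() reveals each engine's own names (num.trees vs ntree)
rand_forest(mode = "regression", trees = 1000, min_n = 5) %>%
  set_engine("ranger") %>%
  translate()

rand_forest(mode = "regression", trees = 1000, min_n = 5) %>%
  set_engine("randomForest") %>%
  translate()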

By now, I have selected rand_forest as the model and randomForest as the engine, and I know that the model has three parameters: mtry, trees, and min_n. In the subsequent sections, I discuss the three ways, outlined in the overview, to work with these parameters.


1. Use default parameters in parsnip

rf_spec is a random forest model specification created with parsnip. I do not specify values for any parameters, so the engine's default values are used. As always, I then fit the model on the training data. Printing the fitted model shows the defaults.

# Create random forest specification
rf_spec <- 
  rand_forest(mode = "regression") %>%
  set_engine("randomForest")
# Fit training data
model_default <-
  rf_spec %>%
  fit(body_mass_g~., data = penguins_train)

model_default 
## parsnip model object
## 
## Fit time:  133ms 
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 89926.71
##                     % Var explained: 85.51

The model with default parameters is not cross-validated. Since there is only one set of parameter values, cross-validation would not change the model; it would only give a performance estimate closer to what we should expect on new data. Once training is done, I make predictions on the testing data and calculate the performance metrics with yardstick.

model_default %>% 
  predict(penguins_test) %>% 
  bind_cols(penguins_test) %>% 
  metrics(body_mass_g, .pred) 
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     317.   
## 2 rsq     standard       0.877
## 3 mae     standard     242.

2. Use tune to tune parsnip model

Earlier, rf_spec used the default parameter values. To tune the parameters, I need to mark them for tuning. I show two ways to do so below.

# Option 1: Update the existing specification
rf_spec <-
  rf_spec %>%
  update(mtry = tune(), trees = tune())
# Option 2: Start again
rf_spec_new <-
  rand_forest(
    mode = "regression",
    mtry = tune(),
    trees = tune()
  ) %>%
  set_engine("randomForest")

tune() is a placeholder for a value that is yet to be tuned. rf_spec now needs to try different parameter values and find the best set, and this is where cross-validation comes in. For such a small dataset, cross-validated model performance is likely more representative of how the model will do on new data.
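
As a quick check (a small aside, not required for tuning), printing the specification confirms which arguments are flagged with tune().

# The printed spec should list mtry = tune() and trees = tune()
# under "Main Arguments", with randomForest as the computational engine
rf_spec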

Sounds pretty neat, right? tidymodels makes it pretty neat, too, with workflows, which bundles pre-processing, modelling, and post-processing requests together. A workflow commonly includes data preprocessing with recipes, but I skip that here. Instead, I specify the outcome and predictors with add_variables() and add the model specification.

# Create a workflow
rf_workflow <-
  workflow() %>%
  add_variables(
    outcomes = body_mass_g, predictors = everything()
  ) %>%
  add_model(rf_spec)

Manually provide values

Now, we have finally reached the tuning stage! The tune_grid() function (https://tune.tidymodels.org/reference/tune_grid.html) takes the cross-validation resamples and a parameter grid, which should be a data frame of parameter combinations. expand.grid() is an easy way to generate all combinations of its inputs. I list three values each for mtry and trees, generating nine combinations (three times three).
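
For reference, this is roughly what that grid looks like on its own, a quick sketch run outside the tuning call:

# All combinations of the listed values: a 9-row data frame
# with one column per parameter (mtry and trees)
expand.grid(
  mtry = c(1, 3, 5),
  trees = c(500, 1000, 2000)
)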

I don’t know whether this is a wake-up call that I should replace my five-year-old MacBook or whether modelling simply takes time, but tuning always takes a significant amount of time. After it completes, I call collect_metrics() to check the results.

set.seed(300)
manual_tune <-
  rf_workflow %>%
  tune_grid(
    resamples = folds_5, 
    grid = expand.grid(
      mtry = c(1, 3, 5), 
      trees = c(500, 1000, 2000)
    )
  )
collect_metrics(manual_tune) 
## # A tibble: 18 x 8
##     mtry trees .metric .estimator    mean     n std_err .config             
##    <dbl> <dbl> <chr>   <chr>        <dbl> <int>   <dbl> <chr>               
##  1     1   500 rmse    standard   306.       10  9.84   Preprocessor1_Model1
##  2     1   500 rsq     standard     0.858    10  0.0146 Preprocessor1_Model1
##  3     3   500 rmse    standard   301.       10 14.3    Preprocessor1_Model2
##  4     3   500 rsq     standard     0.854    10  0.0178 Preprocessor1_Model2
##  5     5   500 rmse    standard   303.       10 14.5    Preprocessor1_Model3
##  6     5   500 rsq     standard     0.852    10  0.0180 Preprocessor1_Model3
##  7     1  1000 rmse    standard   305.       10  9.82   Preprocessor1_Model4
##  8     1  1000 rsq     standard     0.859    10  0.0143 Preprocessor1_Model4
##  9     3  1000 rmse    standard   300.       10 14.5    Preprocessor1_Model5
## 10     3  1000 rsq     standard     0.854    10  0.0180 Preprocessor1_Model5
## 11     5  1000 rmse    standard   304.       10 14.5    Preprocessor1_Model6
## 12     5  1000 rsq     standard     0.851    10  0.0180 Preprocessor1_Model6
## 13     1  2000 rmse    standard   306.       10 10.1    Preprocessor1_Model7
## 14     1  2000 rsq     standard     0.858    10  0.0144 Preprocessor1_Model7
## 15     3  2000 rmse    standard   300.       10 14.5    Preprocessor1_Model8
## 16     3  2000 rsq     standard     0.854    10  0.0179 Preprocessor1_Model8
## 17     5  2000 rmse    standard   304.       10 14.7    Preprocessor1_Model9
## 18     5  2000 rsq     standard     0.851    10  0.0181 Preprocessor1_Model9 

Too many things to read? I agree. Let’s focus on the one with the best performance using show_best(). It suggests that mtry = 3 and trees = 2000 are the best parameters.

show_best(manual_tune, n = 1)
## # A tibble: 1 x 8
##    mtry trees .metric .estimator  mean     n std_err .config             
##   <dbl> <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1     3  2000 rmse    standard    300.    10    14.5 Preprocessor1_Model8 

manual_tune is not a model but a set of tuning results, so I need to finalise the workflow with the best parameters (mtry = 3 and trees = 2000) and fit it on the entire training data. The testing RMSE is 296, lower than the 317 from the model with default parameters, suggesting that the tuning yields an improvement.

manual_final <-
  finalize_workflow(rf_workflow, select_best(manual_tune)) %>%
  fit(penguins_train)

manual_final %>% 
  predict(penguins_test) %>% 
  bind_cols(penguins_test) %>% 
  metrics(body_mass_g, .pred) 
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     296.   
## 2 rsq     standard       0.881
## 3 mae     standard     238.

Specify grid size for automatic generation

Instead of manually entering values with expand.grid() for the grid argument, I can pass an integer to specify the number of candidate parameter sets to try. Here, I ask for five sets of parameters. Since collect_metrics() returns both RMSE and R-squared, each set produces two rows of output, so five sets lead to ten rows of results.

set.seed(300)
random_tune <-
  rf_workflow %>%
  tune_grid(
    resamples = folds_5, grid = 5
  )
collect_metrics(random_tune)
## # A tibble: 10 x 8
##     mtry trees .metric .estimator    mean     n std_err .config             
##    <int> <int> <chr>   <chr>        <dbl> <int>   <dbl> <chr>               
##  1     5  1879 rmse    standard   304.       10 14.5    Preprocessor1_Model1
##  2     5  1879 rsq     standard     0.851    10  0.0181 Preprocessor1_Model1
##  3     2   799 rmse    standard   298.       10 13.6    Preprocessor1_Model2
##  4     2   799 rsq     standard     0.857    10  0.0171 Preprocessor1_Model2
##  5     3  1263 rmse    standard   300.       10 14.5    Preprocessor1_Model3
##  6     3  1263 rsq     standard     0.854    10  0.0179 Preprocessor1_Model3
##  7     2   812 rmse    standard   297.       10 13.7    Preprocessor1_Model4
##  8     2   812 rsq     standard     0.858    10  0.0171 Preprocessor1_Model4
##  9     4   193 rmse    standard   302.       10 14.9    Preprocessor1_Model5
## 10     4   193 rsq     standard     0.852    10  0.0182 Preprocessor1_Model5

Again, I use show_best() to focus on the best result. As a reminder, the best cross-validated RMSE from manual tuning is 300 with mtry = 3 and trees = 2000.

show_best(random_tune, n = 1)
## # A tibble: 1 x 8
##    mtry trees .metric .estimator  mean     n std_err .config             
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1     2   812 rmse    standard    297.    10    13.7 Preprocessor1_Model4

Similarly, I finalise the workflow, fit the model on training data, and test it. Nice! RMSE decreases from 296 to 295.

random_final <-
  finalize_workflow(rf_workflow, select_best(random_tune)) %>%
  fit(penguins_train)

random_final %>% 
  predict(penguins_test) %>% 
  bind_cols(penguins_test) %>% 
  metrics(body_mass_g, .pred) 
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     295.   
## 2 rsq     standard       0.883
## 3 mae     standard     233.

3. Create parameter values with dials

dials works with parameter objects. In my random forest model, the parameters are mtry and trees. Each parameter object contains information about the range, possible values, type, and so on. I find dials to be an extremely powerful and helpful tool because, too often, I had to read through documentation, go through my books, or google to find out the range and possible values of a given parameter.

Check parameter information

Let’s focus on mtry first. I can see that it is a quantitative parameter that refers to the number of randomly selected predictors. The range is from 1 to … wait…? A question mark? Let’s try range_get() to see the range again. It’s still unknown.

mtry()
## # Randomly Selected Predictors (quantitative)
## Range: [1, ?]

mtry() %>% range_get()
## $lower
## [1] 1
## 
## $upper
## unknown()

Well, this is because the upper limit of the range depends on the number of predictors in the data, and I need to specify that value myself. I do so by taking the number of columns in the training data and subtracting one (the outcome variable). There are two ways to do so, as shown below.

# Option 1: Use range_set
mtry() %>% range_set(c(1, ncol(penguins_train) - 1))
## # Randomly Selected Predictors (quantitative)
## Range: [1, 5]
# Option 2: Include the range in the argument
mtry(c(1, ncol(penguins_train) - 1))
## # Randomly Selected Predictors (quantitative)
## Range: [1, 5]
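
As an aside, dials can also work out the data-dependent limit for me: finalize() takes a parameter object and a data frame of predictors and fills in the unknown part of the range. A small sketch, assuming the predictors are every column except body_mass_g:

# Option 3: Let dials derive the upper limit from the predictor columns
# (the finalised parameter should again report Range: [1, 5])
mtry() %>% finalize(penguins_train %>% select(-body_mass_g))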

Let’s try trees. This parameter does not depend on the data, so the full range is already provided.

trees()
## # Trees (quantitative)
## Range: [1, 2000]

Create values for a parameter

Now that I know about the parameters, how do I create values for them? There are two ways. I can use value_seq() to generate a sequence of n numbers spanning the range. Here, I ask for 4, 5, and 10 numbers. As you can see, the minimum and maximum values for trees are included.

trees() %>% value_seq(n = 4)
## [1]    1  667 1333 2000

trees() %>% value_seq(n = 5)
## [1]    1  500 1000 1500 2000

trees() %>% value_seq(n = 10)
##  [1]    1  223  445  667  889 1111 1333 1555 1777 2000

Or I can use value_sample() to generate random numbers.

set.seed(300)
trees() %>% value_sample(n = 4)
## [1]  590  874 1602  985

trees() %>% value_sample(n = 5)
## [1] 1692  789  553 1980 1875

trees() %>% value_sample(n = 10)
##  [1] 1705  272  461  780 1383 1868 1107  812  460  901

Create a grid for parameters

Let’s recall tune_grid(), the function for tuning parameters. It requires the grid to be a data frame, while the two approaches above return vectors. So how can I generate a grid? Of course, I could simply combine the vectors into a data frame (a quick sketch of that manual route follows below), but dials has better ways to do so. Again, there are two methods: creating a grid from sequences of values and creating a grid of random values.
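
For completeness, the manual route would look something like this, a small sketch that combines the value helpers above with expand.grid():

# Turn dials-generated sequences into a plain data frame of combinations
manual_dials_grid <- expand.grid(
  mtry  = mtry(c(1, ncol(penguins_train) - 1)) %>% value_seq(n = 3),
  trees = trees() %>% value_seq(n = 3)
)
# manual_dials_grid is a 9-row data frame that tune_grid() would accept,
# but grid_regular() and grid_random() below get there more directly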

To create a grid from sequences, use grid_regular(). Add the parameters, mtry() and trees(), as arguments and specify the number of levels for each. I want three levels per parameter, resulting in nine combinations (three times three).

set.seed(300)
dials_regular <- grid_regular(
  mtry(c(1, ncol(penguins_train) - 1)),
  trees(),
  levels = 3
)
dials_regular
## # A tibble: 9 x 2
##    mtry trees
##   <int> <int>
## 1     1     1
## 2     3     1
## 3     5     1
## 4     1  1000
## 5     3  1000
## 6     5  1000
## 7     1  2000
## 8     3  2000
## 9     5  2000

For random numbers, use grid_random() and specify the size.

set.seed(300)
dials_random <- grid_random(
  mtry(c(1, ncol(penguins_train) - 1)),
  trees(),
  size = 6
)
dials_random
## # A tibble: 6 x 2
##    mtry trees
##   <int> <int>
## 1     2  1980
## 2     2  1875
## 3     1  1705
## 4     4   272
## 5     5   461
## 6     1   780

Use dials with tune_grid()

Either approach creates a data frame that is ready to be used with tune_grid(). For grid_regular():

dials_regular_tune <-
  rf_workflow %>%
  tune_grid(
    resamples = folds_5, grid = dials_regular
  )
show_best(dials_regular_tune, n = 1)
## # A tibble: 1 x 8
##    mtry trees .metric .estimator  mean     n std_err .config             
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1     3  1000 rmse    standard    300.    10    14.4 Preprocessor1_Model5
dials_regular_final <-
  finalize_workflow(
    rf_workflow, select_best(dials_regular_tune)
  ) %>%
  fit(penguins_train)

dials_regular_final %>% 
  predict(penguins_test) %>% 
  bind_cols(penguins_test) %>% 
  metrics(body_mass_g, .pred)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     296.   
## 2 rsq     standard       0.881
## 3 mae     standard     237.

For grid_random():

dials_random_tune <-
  rf_workflow %>%
  tune_grid(
    resamples = folds_5, grid = dials_random
  )
show_best(dials_random_tune, n = 1)
## # A tibble: 1 x 8
##    mtry trees .metric .estimator  mean     n std_err .config             
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1     2  1875 rmse    standard    297.    10    13.6 Preprocessor1_Model2
dials_random_final <-
  finalize_workflow(
    rf_workflow, select_best(dials_random_tune)
  ) %>%
  fit(penguins_train)

dials_random_final %>% 
  predict(penguins_test) %>% 
  bind_cols(penguins_test) %>% 
  metrics(body_mass_g, .pred)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     296.   
## 2 rsq     standard       0.882
## 3 mae     standard     234.

Conclusion

I explained three ways to work with tuning parameters in this post:

  1. Apply default values in parsnip: Without specifying the values for parameters, parsnip would use the default values that come with the selected engine.
  2. Use tune with parsnip: The tune_grid() function cross-validates a set of parameters. It can work with a pre-defined data frame of values or generate a given number of candidate parameter sets automatically.
  3. Create values with dials to be used by tune to cross-validate the parsnip model: dials provides information about parameters and generates values for them. The values can be a sequence of numbers spanning the range or a set of random numbers.

I am a big fan of short and concise code. When I started playing with dials, it felt redundant and unnecessary because I could have just used tune. However, as I became more familiar with dials, I started to understand why it was created in the first place. I no longer have to go back to the documentation to check whether a parameter is an integer or a floating-point number, whether there is a range limit, or whether it requires a transformation.

It took me a while to piece everything together, from rsample and recipes to tune and dials. I hope you enjoy the article, and have a wonderful day! All the code from this post is collected in a gist.

Reference

¹ Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081

https://github.com/allisonhorst/palmerpenguins

