
If Marie Kondo did modeling in R

She would use tidymodels

The output of many models in R can be inconsistent. Often we are given more information than we need, and in some cases less than we need. Function arguments and output formats vary a lot, and we sometimes need to dig into different parts of the output to find the specific statistics we are looking for.
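
As a quick sketch of the problem (using the built-in mtcars data rather than the datasets in this article), compare how you retrieve coefficient p-values from a binomial glm versus a proportional odds model from MASS, which does not report p-values in its summary at all:

# p-values from a glm live inside the summary object
glm_fit <- glm(am ~ mpg + wt, family = "binomial", data = mtcars)
summary(glm_fit)$coefficients[, "Pr(>|z|)"]
# a polr model gives no p-values in its summary, so we compute them
# ourselves from the t values
library(MASS)
polr_fit <- polr(ordered(gear) ~ mpg + wt, data = mtcars)
polr_coefs <- coef(summary(polr_fit))
2 * pnorm(abs(polr_coefs[, "t value"]), lower.tail = FALSE)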

The tidymodels meta-package is a collection of packages which collectively apply the principles of tidy data to the construction of statistical models. More information and learning resources on tidymodels can be found here. Within tidymodels there are two packages which are particularly useful in controlling the output of models in R: the broom and parsnip packages.

The broom package

Consistent with its name, broom aims to tidy up the output of models into a predictable format. It works with over 100 different types of models in R. To illustrate its use, let’s run a binomial logistic regression model on a set of data about salespeople to see if we can understand their promotion likelihood.

# obtain salespeople data
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read.csv(url)
# convert promoted to factor
salespeople$promoted <- as.factor(salespeople$promoted)
# build model to predict promotion based on sales and customer_rate
promotion_model <- glm(formula = promoted ~ sales + customer_rate,
                       family = "binomial",
                       data = salespeople)

We now have our model sitting in memory. We can use three key functions in the broom package to view a variety of model statistics. First, the tidy() function allows us to see the coefficient statistics of the model.

library(tidymodels)
broom::tidy(promotion_model)
## # A tibble: 3 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -19.5      3.35        -5.83 5.48e- 9
## 2 sales           0.0404   0.00653      6.19 6.03e-10
## 3 customer_rate  -1.12     0.467       -2.40 1.63e- 2

The glance() function allows us to see a row of overall model statistics:

broom::glance(promotion_model)
## # A tibble: 1 x 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          440.     349  -32.6  71.1  82.7     65.1         347   350

And the augment() function augments the observations in the dataset with a range of observation-level model statistics such as residuals:

head(broom::augment(promotion_model))
## # A tibble: 6 x 9
##   promoted sales customer_rate .fitted   .resid .std.resid      .hat .sigma  .cooksd
##   <fct>    <int>         <dbl>   <dbl>    <dbl>      <dbl>     <dbl>  <dbl>    <dbl>
## 1 0          594          3.94  0.0522 -1.20      -1.22    0.0289     0.429 1.08e- 2
## 2 0          446          4.06 -6.06   -0.0683    -0.0684  0.00212    0.434 1.66e- 6
## 3 1          674          3.83  3.41    0.255      0.257   0.0161     0.434 1.84e- 4
## 4 0          525          3.62 -2.38   -0.422     -0.425   0.0153     0.433 4.90e- 4
## 5 1          657          4.4   2.08    0.485      0.493   0.0315     0.433 1.40e- 3
## 6 1          918          4.54 12.5     0.00278    0.00278 0.0000174  0.434 2.24e-11
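
Because augment() returns an ordinary tibble, these observation-level statistics drop straight into tidyverse pipelines. As a small illustrative sketch (not from the original example), here is how you might flag the most influential observations by Cook’s distance:

# view the five observations with the largest Cook's distance
broom::augment(promotion_model) %>% 
  dplyr::arrange(dplyr::desc(.cooksd)) %>% 
  dplyr::slice_head(n = 5)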

These functions are model-agnostic for a very wide range of common models in R. For example, here we use them on a proportional odds logistic regression model on a soccer data set to understand what factors relate to in-game referee disciplinary action.

# get soccer data
url <- "http://peopleanalytics-regression-book.org/data/soccer.csv"
soccer <- read.csv(url)
# convert discipline to ordered factor
soccer$discipline <- ordered(soccer$discipline, 
                             levels = c("None", "Yellow", "Red"))
# run proportional odds model
library(MASS)
soccer_model <- polr(
  formula = discipline ~ n_yellow_25 + n_red_25 + position + country + level + result, 
  data = soccer
)
# view model statistics
broom::glance(soccer_model)
## # A tibble: 1 x 7
##     edf logLik   AIC   BIC deviance df.residual  nobs
##   <int>  <dbl> <dbl> <dbl>    <dbl>       <dbl> <dbl>
## 1    10 -1722. 3465. 3522.    3445.        2281  2291

broom functions integrate well with other tidyverse methods and make it easy to run models over nested subsets of data. For example, if we want to run our soccer model separately for each country in the dataset and see all the model statistics in a neat table, we can use typical tidyverse grammar to do so with dplyr.

library(tidyverse)
# define function to run soccer model and glance at results
soccer_model_glance <- function(df) {
  broom::glance(
    polr(
      formula = discipline ~ n_yellow_25 + n_red_25 + position + level + result,
      data = df
    )
  )
}
# run it nested by country
soccer %>% 
  dplyr::nest_by(country) %>% 
  dplyr::summarise(soccer_model_glance(data))
## # A tibble: 2 x 8
## # Groups:   country [2]
##   country   edf logLik   AIC   BIC deviance df.residual  nobs
##   <chr>   <int>  <dbl> <dbl> <dbl>    <dbl>       <dbl> <dbl>
## 1 England     9  -843. 1704. 1749.    1686.        1123  1132
## 2 Germany     9  -877. 1773. 1818.    1755.        1150  1159

The parsnip package

The parsnip package aims to create a unified interface for running models, so that users do not need to learn the differing terminology and other minutiae of each underlying modeling function. It also takes a more hierarchical approach to defining models, similar in nature to the object-oriented approaches that Python users will be familiar with.

Again let’s use our salesperson model example to illustrate. We start by defining a model family that we wish to use, in this case logistic regression, and define a specific engine and mode.

model <- parsnip::logistic_reg() %>% 
  parsnip::set_engine("glm") %>% 
  parsnip::set_mode("classification")

We can use the translate() function to see what kind of model we have created:

model %>% 
  parsnip::translate()
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm 
## 
## Model fit template:
## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), 
##     family = stats::binomial)

Now with our model defined, we can fit it using a formula and data and then use broom to view the coefficients:

model %>% 
  parsnip::fit(formula = promoted ~ sales + customer_rate,
               data = salespeople) %>% 
  broom::tidy()
## # A tibble: 3 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -19.5      3.35        -5.83 5.48e- 9
## 2 sales           0.0404   0.00653      6.19 6.03e-10
## 3 customer_rate  -1.12     0.467       -2.40 1.63e- 2

parsnip is primarily geared toward machine learning model workflows, in a similar way to scikit-learn in Python, but it can also offer an attractive approach to coding inferential models, particularly where common families of models are used.
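
As a small sketch of what that unified interface buys you (assuming the glmnet package is installed; the penalty value here is purely illustrative), the same logistic regression specification can be pointed at a regularised engine with a one-line change:

# same model family, different computational engine
parsnip::logistic_reg(penalty = 0.01) %>% 
  parsnip::set_engine("glmnet") %>% 
  parsnip::set_mode("classification") %>% 
  parsnip::translate()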

Data and examples in this article are taken from my open source book on regression analysis here.

_Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter. Also check out my blog on drkeithmcnulty.com._
