Anyone who codes in R knows about dplyr. It’s really the defining package of R, designed to make operations on dataframes more intuitive for those who buy into the principles of ‘tidy data’ (which would be most data scientists, I suspect). In fact, many people can code in dplyr better than they can code in base R. That’s how central dplyr has become in the R ecosystem, along with the other packages that currently make up the tidyverse.
So the fact that a new version has been released is exciting for most R users. The fact that it’s version 1.0.0 means it’s a real event. Hadley Wickham and the team of open-source developers behind dplyr would not give it this version number lightly. A huge amount of effort has gone into strengthening dplyr’s functionality: unifying a number of previously distinct functions under a more abstracted umbrella and, above all, offering day-to-day users solutions to their most common dataframe-wrangling problems.
dplyr 1.0.0 can now be installed using install.packages("dplyr"). You may need to update your R version to ensure that this update installs. I recommend upgrading to R 4.0.0 in any case.
In this article I’m going to move through the major new features in increasing order of complexity, by my reckoning. I’ll use built-in datasets – mostly mtcars – to demonstrate what I mean.
1. Built-in tidyselect
You can now use tidyselect helper functions inside certain dplyr verbs. For example:
library(dplyr)
mtcars %>%
  select(starts_with("c")) %>%
  head(3)

mtcars %>%
  select(any_of(c("mpg", "cyl", "trash"))) %>%
  head(3)

tidyselect helper functions like this work inside any selecting function, including some new ones that we will look at later. The full range of helpers is listed in the tidyselect documentation.
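As a small sketch of how the helpers differ in strictness (the column name trash is deliberately absent from mtcars, as in the example above): any_of() ignores names that do not exist, whereas all_of() raises an error. The tryCatch() wrapper is only there to make the error visible without stopping the script.

```r
library(dplyr)

# any_of() quietly skips names that are not columns of the data
lenient <- mtcars %>%
  select(any_of(c("mpg", "cyl", "trash")))
names(lenient)

# all_of() is strict: a missing name is an error, caught here for illustration
strict <- tryCatch(
  mtcars %>% select(all_of(c("mpg", "cyl", "trash"))),
  error = function(e) "error"
)
strict
```

Reaching for any_of() makes sense when a column may or may not be present; all_of() is the safer choice when a missing column should stop the pipeline.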
2. Simple but so useful – the relocate() function
Often people want a specific order to the columns in their dataframe, and previously the only way to do that was to order the columns within a select() verb, which was tedious if there were a lot of columns involved.
By default relocate() will move your column or columns to the left of the dataframe. If you want to move them to a specific place, you can use the .before or .after arguments. For example:
mtcars %>%
  dplyr::relocate(disp) %>%
  head(3)

mtcars %>%
  relocate(starts_with("c"), .after = disp) %>%
  head(3)
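To make the comparison with the older approach concrete, here is a short sketch that moves two columns with relocate() and checks the result against the select() plus everything() idiom. The object names (moved, old_way, mpg_last) are made up for illustration.

```r
library(dplyr)

# Move two columns in front of mpg ...
moved <- mtcars %>%
  relocate(gear, carb, .before = mpg)

# ... which gives the same column order as the older select() idiom
old_way <- mtcars %>%
  select(gear, carb, everything())

identical(names(moved), names(old_way))

# Send a column to the far right using the last_col() helper
mpg_last <- mtcars %>%
  relocate(mpg, .after = last_col())
tail(names(mpg_last), 1)
```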

3. Incredibly powerful expansion of the summarise() function
Summarise – the original workhorse of dplyr – has been made even more flexible in this new release. First, it can now return vectors to form multiple rows in the output. Second, it can return dataframes to form multiple rows and columns in the output. This might be a little mind-bending for some, so I’ll spend a little time on it here to illustrate how this could work.
If you want to summarise using a function that returns a vector, this is now easy. For example, you can easily summarise a range:
mtcars %>%
  group_by(cyl) %>%
  summarise(range = range(mpg))

You could then combine with tidyr::pivot_wider() if you wish:
library(tidyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(range = range(mpg)) %>%
  mutate(name = rep(c("min", "max"), length(unique(cyl)))) %>%
  pivot_wider(names_from = name, values_from = range)

This would provide the equivalent of:
mtcars %>%
  group_by(cyl) %>%
  summarise(min = min(mpg), max = max(mpg))

The second option in this case is much easier, but where this comes in useful is where you have longer outputs. Here’s one simple way you could compute deciles:
decile <- seq(0, 1, 0.1)
mtcars %>%
  group_by(cyl) %>%
  summarise(deciles = quantile(mpg, decile)) %>%
  mutate(name = rep(paste0("dec_", decile), length(unique(cyl)))) %>%
  pivot_wider(names_from = name, values_from = deciles)
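One way to avoid the manual rep() labelling step is to return the labels and values together as a dataframe. This is only a sketch under stated assumptions: multi_row is a hypothetical alias introduced here, and it assumes that dplyr 1.1.0 and later prefer reframe() for results with several rows per group (summarise() still works there but warns), while 1.0.x uses summarise().

```r
library(dplyr)
library(tidyr)

decile <- seq(0, 1, 0.1)

# Hypothetical alias: reframe() on dplyr >= 1.1.0, summarise() on 1.0.x
multi_row <- if (utils::packageVersion("dplyr") >= "1.1.0") reframe else summarise

mpg_deciles <- mtcars %>%
  group_by(cyl) %>%
  # Each group returns an 11-row dataframe: a label and the matching decile
  multi_row(data.frame(
    name = paste0("dec_", decile),
    value = quantile(mpg, decile, names = FALSE),
    stringsAsFactors = FALSE
  )) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value)

mpg_deciles
```

Keeping the label next to the value means the two cannot drift out of step if the number of groups or quantiles changes.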

Now your summarise output can be a dataframe. Let’s look at a simple example. Recently I wrote a function that identified all unique unordered pairs of elements in a vector. Now I want to apply that to map a network of connections between characters of Friends based on appearing in the same scene.
Here’s a simple version of a dataframe I might be working from:
friends_episode <- data.frame(
  scene = c(1, 1, 1, 2, 2, 2),
  character = c("Joey", "Phoebe", "Chandler", "Joey", "Chandler", "Janice")
)
friends_episode

Now I’m going to write my function, which accepts a vector and produces a two-column dataframe, and apply it by scene:
unique_pairs <- function(char_vector = NULL) {
  vector <- as.character(unique(char_vector))
  # start from an empty two-column dataframe
  df <- data.frame(from = character(), to = character(), stringsAsFactors = FALSE)
  if (length(vector) > 1) {
    for (i in 1:(length(vector) - 1)) {
      # pair element i with every element that follows it
      from <- rep(vector[i], length(vector) - i)
      to <- vector[(i + 1):length(vector)]
      df <- df %>%
        dplyr::bind_rows(
          data.frame(from = from, to = to, stringsAsFactors = FALSE)
        )
    }
  }
  df
}
friends_episode %>%
  group_by(scene) %>%
  summarise(unique_pairs(character))
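If the explicit loop feels heavy, base R’s combn() can generate the same unordered pairs. The sketch below is an alternative, not the function used in this article; unique_pairs_combn is a name made up for illustration.

```r
# Hypothetical alternative to unique_pairs(), built on utils::combn()
unique_pairs_combn <- function(char_vector) {
  items <- as.character(unique(char_vector))
  # Fewer than two distinct items means there are no pairs to return
  if (length(items) < 2) {
    return(data.frame(from = character(), to = character(), stringsAsFactors = FALSE))
  }
  pairs <- utils::combn(items, 2)  # 2-row matrix, one column per pair
  data.frame(from = pairs[1, ], to = pairs[2, ], stringsAsFactors = FALSE)
}

unique_pairs_combn(c("Joey", "Phoebe", "Chandler"))
```

The length check matters because combn() raises an error when asked for pairs from fewer than two items.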

As you might see, the dataframe which is the output of my summarise() function has been unpacked and forms two columns in the final output. What happens if we name the output of our summarise() function?
friends_pairs <- friends_episode %>%
  group_by(scene) %>%
  summarise(pairs = unique_pairs(character))
friends_pairs

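So this is an important watchout: when the output is named, the dataframe is kept packed as a single dataframe-column called pairs (holding from and to) rather than being spread into separate columns. If you want your summarise() output unpacked, don’t name it.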
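The difference is easy to check on a smaller case. In this sketch quartile_df is a hypothetical helper that returns a one-row dataframe, so each group still produces a single row; the last line uses tidyr::unpack() to spread a packed column back out.

```r
library(dplyr)

# Hypothetical helper: a ONE-row dataframe with one column per quartile
quartile_df <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  data.frame(q25 = q[1], q50 = q[2], q75 = q[3])
}

# Unnamed: the helper's columns are unpacked into the result
unpacked <- mtcars %>%
  group_by(cyl) %>%
  summarise(quartile_df(mpg))
names(unpacked)

# Named: the whole dataframe is stored as one packed column called q
packed <- mtcars %>%
  group_by(cyl) %>%
  summarise(q = quartile_df(mpg))
names(packed)

# tidyr::unpack() spreads a packed column back out if needed
tidyr::unpack(packed, q)
```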
4. More powerful colwise wrangling with across()
With these more powerful summarise capabilities, and with the built-in tidyselect toolkit, we are set up for much more powerful and abstracted ways to work with the columns of our data and perform a wider range of tasks. The new across() adverb enables this.
In short, the new function across() operates across multiple columns and multiple functions within existing dplyr verbs such as summarise() or mutate(). This makes it extremely powerful and time-saving. There is no longer any need for the scoped variants such as summarise_at(), mutate_if(), etc.
First, you can replicate summarise_at() by manually defining a set of columns to summarise using a character vector of column names, or by using column numbers:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c("mpg", "hp"), mean))
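The column-number form mentioned above can be sketched as follows. I have left out group_by() here so that the positions map straightforwardly onto the columns of mtcars as printed (mpg is column 1 and hp is column 4); the object names are made up for illustration.

```r
library(dplyr)

# Select by position: column 1 is mpg and column 4 is hp in mtcars
by_position <- mtcars %>%
  summarise(across(c(1, 4), mean))

# Select the same columns by name
by_name <- mtcars %>%
  summarise(across(c("mpg", "hp"), mean))

all.equal(by_position, by_name)
```

Names are usually the more robust choice, since positions silently point at different columns if the dataframe is reordered.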

across() is a selecting function, and so you can use the tidyselect syntax inside it. You can replicate mutate_if() by wrapping a predicate function in where() to select your columns. Here we turn the name and status columns in the dplyr::storms dataset from character to factor.
storms %>%
  dplyr::mutate(across(where(is.character), as.factor)) %>%
  dplyr::select(name, status)
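The same predicate-based selection works in summarise() as well as mutate(). A brief sketch on the built-in iris dataset, using where() to wrap the predicate (object names are illustrative):

```r
library(dplyr)

# Summarise every numeric column of iris, skipping the Species factor
iris_means <- iris %>%
  summarise(across(where(is.numeric), mean))
names(iris_means)

# Convert every factor column to character, leaving the others untouched
iris_chr <- iris %>%
  mutate(across(where(is.factor), as.character))
class(iris_chr$Species)
```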

You can also apply multiple named functions to your multiple columns by using a list. The across() function will by default glue your function and column names together with an underscore:
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c("mpg", "hp"), list(mean = mean, median = median, sd = sd)))

And if you want to use a different glueing formula, you can do so using glue syntax:
mtcars %>%
  group_by(cyl) %>%
  summarise(across(starts_with("d"),
                   list(mean = mean, sd = sd),
                   .names = "{col}_{fn}_summ"))

If you need to add optional arguments into your functions, you can use formulas:
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c("mpg", "hp"),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        median = ~median(.x, na.rm = TRUE),
                        sd = ~sd(.x, na.rm = TRUE)),
                   .names = "{col}_{fn}_summ"))

And similarly you can use formulas to combine functions to avoid unnecessary extra mutating:
mtcars %>%
  group_by(cyl) %>%
  summarise(across(mpg,
                   list(minus_sd = ~(mean(.x) - sd(.x)),
                        mean = mean,
                        plus_sd = ~(mean(.x) + sd(.x)))
  ))
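Formulas and .names work the same way inside mutate(). As a sketch, this adds standardised copies of two columns alongside the originals; the _z suffix and the object name mtcars_z are my own choices for illustration.

```r
library(dplyr)

# Add standardised (z-score) copies of mpg and hp, keeping the originals
mtcars_z <- mtcars %>%
  mutate(across(c(mpg, hp),
                ~ (.x - mean(.x)) / sd(.x),
                .names = "{col}_z"))

mtcars_z %>%
  select(mpg, mpg_z, hp, hp_z) %>%
  head(3)
```

Without .names, mutate() with across() overwrites the selected columns in place, so supplying a naming pattern is how you keep both versions.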

5. rowwise() comes to life in the new dplyr
dplyr previously had limited friendliness to working across rows. It behaved somewhat counter-intuitively when you wanted to sum or average across values in the same row. Here’s an example, which some of you might recognize as the source of a previous headache:
WorldPhones_df <- WorldPhones %>%
  as.data.frame()
# mutate an average column
WorldPhones_df %>%
  dplyr::mutate(avg = mean(N.Amer:Mid.Amer))

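This has returned the same single value in every row, which is of course not what was intended. Inside mutate(), N.Amer:Mid.Amer is not treated as a range of columns: it is evaluated as base R’s sequence operator on the two column vectors, and mean() then returns one number that is recycled down the whole column.
A common workaround was to write the calculation out by hand and avoid using functions in this way, so you would write (N.Amer + Europe + Asia + S.Amer + Oceania + Africa + Mid.Amer)/7 which was pretty darn tedious.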
rowwise() creates a different structure called a rowwise_df which prepares your data to perform operations across the rows – it basically groups your data by row.
rowwise() is super-powered by the new c_across() adverb to allow you to work in a similar way to how you would work colwise with the across() adverb. Now you can write:
WorldPhones_df %>%
  rowwise() %>%
  dplyr::mutate(avg = mean(c_across(N.Amer:Mid.Amer)))
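A quick way to convince yourself that the rowwise version really is a per-row average is to compare it with base R’s rowMeans(). This is a sketch; phones and with_avg are names made up for illustration, and ungroup() is added to drop the rowwise structure afterwards.

```r
library(dplyr)

phones <- as.data.frame(WorldPhones)

with_avg <- phones %>%
  rowwise() %>%
  mutate(avg = mean(c_across(N.Amer:Mid.Amer))) %>%
  ungroup()  # drop the rowwise structure once the row-level work is done

# Base R's rowMeans() gives the same numbers for an all-numeric dataframe
all.equal(with_avg$avg, unname(rowMeans(phones)))
```

For a plain mean or sum over numeric columns, rowMeans() and rowSums() are likely to be faster; rowwise() with c_across() earns its keep when the row-level function is arbitrary.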

6. Running different models inside your dataframe
The new rowwise_df object is designed to work with list-columns, which allow the storage of any type of data you want inside a column in a dataframe. Where I find this particularly valuable is where you want to run different models on subsets of your data according to the value of certain variables. Here’s an example of how you can store different subsets of mtcars in a rowwise dataframe and then run a model on them.
model_coefs <- function(formula, data) {
  coefs <- lm(formula, data)$coefficients
  data.frame(coef = names(coefs), value = coefs)
}
mtcars %>%
  dplyr::group_by(cyl) %>%
  tidyr::nest() %>%
  dplyr::rowwise() %>%
  dplyr::summarise(model_coefs(mpg ~ wt + disp + hp, data = data)) %>%
  tidyr::pivot_wider(names_from = coef, values_from = value)

7. The nest_by() function
Of course, the developers behind dplyr 1.0.0 noticed the power of this row-wise modelling capability and so created the nest_by() function as a shortcut for the code above. For a dataframe df, df %>% nest_by(x) is equivalent to:
df %>%
  dplyr::group_by(x) %>%
  tidyr::nest() %>%
  dplyr::rowwise()
So now you can do the modeling above using:
mtcars %>%
  nest_by(cyl) %>%
  dplyr::summarise(model_coefs(mpg ~ wt + disp + hp, data = data)) %>%
  tidyr::pivot_wider(names_from = coef, values_from = value)
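The fitted models themselves can also be kept in a list-column and queried later. The sketch below uses a simpler formula (mpg ~ wt) than the example above, purely to keep it short; models and fit_quality are illustrative names.

```r
library(dplyr)

models <- mtcars %>%
  nest_by(cyl) %>%
  # Wrap the model in list() so it is stored in a list-column
  mutate(model = list(lm(mpg ~ wt, data = data)))

# Because the data is rowwise, model refers to one fitted model at a time
fit_quality <- models %>%
  summarise(r_squared = summary(model)$r.squared)

fit_quality
```

Storing the model objects rather than only their coefficients means you can come back for predictions, residuals or diagnostics without refitting.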

_Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter. Also check out my blog on drkeithmcnulty.com._
