Whether or not you are a fan of the tidyverse, there is no doubt that this collection of R packages offers some neat and attractive ways of wrangling data that are often very intuitive to users. In earlier versions of the tidyverse packages, some user control over output was sacrificed in favor of simpler functions that newcomers could pick up and use easily. Recent updates to dplyr and tidyr have made significant progress in restoring some of this control.
This means that there are new functions and methods available in the tidyverse that you may not be aware of. They allow you to better transform your data how you want, and to perform operations more flexibly. They also provide new and alternative ways to perform tasks like nesting, modeling or graphing that make your code more readable and understandable. In fact, I am convinced that users are only just scratching the surface of what can be done with the latest updates to this important set of packages.
It’s incumbent on any programmer to stay up to date with methods. Here are ten examples of new approaches to common data tasks that are offered by the latest tidyverse updates. For these examples, I will use the new Palmer Penguins dataset, an alternative to the controversial Iris dataset, which is known to have been used by Fisher in his work around eugenics. As we will see, it’s actually a better all-round dataset for teaching and illustrating data wrangling, and I’d encourage you to use and explore it.
First let’s load our tidyverse packages and the Palmer Penguins dataset and take a quick look at it. I’d encourage you to install the latest versions of these packages before you try to replicate the work in this article.
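A minimal sketch of this setup step, assuming the dplyr, tidyr, ggplot2 and palmerpenguins packages are installed. Loading the individual packages rather than the tidyverse meta-package is a choice made for this sketch.

```r
# Load the packages used in the sketches in this article.
library(dplyr)
library(tidyr)
library(ggplot2)
library(palmerpenguins)

# Reference the dataset explicitly from its package to avoid name clashes
# with any other object called `penguins`.
penguins <- palmerpenguins::penguins

# Column-wise preview: names, types and the first few values of each column.
glimpse(penguins)
```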
We can see that the dataset presents several measurements of various anatomical features of penguins of different species, sexes and native locations, as well as the year in which the measures were taken.
1. Selecting columns in data
tidyselect helper functions are now built in, so you can save time by selecting columns with dplyr::select() based on common conditions. In this case, if I want to reduce the dataset to just bill measurements I can use this (noting that all measurement columns contain an underscore):
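One way this selection might be written, as a sketch: keep every column without an underscore, then add the columns whose names start with "bill". The name penguins_bill is introduced here for illustration only.

```r
library(dplyr)
library(palmerpenguins)

penguins <- palmerpenguins::penguins

# Keep the non-measurement columns (no underscore in the name),
# followed by the measurement columns whose names start with "bill".
penguins_bill <- penguins %>%
  select(!contains("_"), starts_with("bill"))

names(penguins_bill)
```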
A full set of tidyselect helper functions can be found in the tidyselect documentation.
2. Reordering columns in data
dplyr::relocate() offers a new way to reorder specific columns or sets of columns. For example, if I want to make sure that all of my measurement columns are at the end of the dataset, I can use this (noting that my last column is year):
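A minimal sketch of the reordering described above, moving every column with an underscore in its name to sit after year. The name penguins_reordered is illustrative.

```r
library(dplyr)
library(palmerpenguins)

penguins <- palmerpenguins::penguins

# Move all measurement columns (those containing "_") after `year`,
# which is the last column of the original dataset.
penguins_reordered <- penguins %>%
  relocate(contains("_"), .after = year)

names(penguins_reordered)
```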
As well as .after, you can also use .before as an argument here.
3. Controlling mutated column location
You’ll note in the penguins dataset that there are no unique identifiers for each penguin. This can be problematic when you have multiple penguins of the same species, island, sex and year in the dataset. To address this and prepare for later examples, let’s add a unique identifier using dplyr::mutate(), and here we can illustrate how mutate() now allows you to position your new column in a similar way to relocate():
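A sketch of how such an identifier might be added, numbering penguins within each combination of species, island, sex and year. The column name studyno and the final ungroup() are assumptions made for this example. The measurement columns are first moved to the end so that the new column lands directly in front of them.

```r
library(dplyr)
library(palmerpenguins)

penguins <- palmerpenguins::penguins %>%
  relocate(contains("_"), .after = year)

# Number the penguins within each group and place the new column
# immediately before the first measurement column.
penguins_id <- penguins %>%
  group_by(species, island, sex, year) %>%
  mutate(studyno = row_number(), .before = contains("_")) %>%
  ungroup()

names(penguins_id)
```

The identifier is only unique in combination with the four grouping columns, which is enough to pin down each penguin in the reshaping steps.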
4. Transforming from wide to long
The penguins
dataset is clearly in a wide form – it gives multiple observations across the columns. For many reasons we may want to transform data from wide to long. In long data, each observation has its own row. The older function gather()
in tidyr
was popular for this sort of task but its new version pivot_longer()
is even more powerful. In this case we have different body parts, measures and units inside these column names, but we can break them out very simply like this:
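A minimal sketch of this pivot, assuming an identifier column has been added as in the previous step (the names studyno, part, measure and unit are illustrative). Each measurement column name is split on the underscore into three new columns.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Rebuild the per-group identifier so that this sketch is self-contained.
penguins_id <- palmerpenguins::penguins %>%
  group_by(species, island, sex, year) %>%
  mutate(studyno = row_number()) %>%
  ungroup()

penguins_long <- penguins_id %>%
  pivot_longer(
    contains("_"),                            # the four measurement columns
    names_to = c("part", "measure", "unit"),  # pieces of each column name
    names_sep = "_",                          # split the names on underscores
    values_to = "value"
  )

head(penguins_long)
```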
5. Transforming from long to wide
It’s just as easy to move back from long to wide. pivot_wider() gives much more flexibility compared to the older spread():
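A sketch of the reverse transformation under the same assumptions as the previous step (an illustrative studyno identifier and part, measure and unit columns). The three name columns are pasted back together with an underscore to recover the original column names.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Recreate the long data so that this sketch is self-contained.
penguins_id <- palmerpenguins::penguins %>%
  group_by(species, island, sex, year) %>%
  mutate(studyno = row_number()) %>%
  ungroup()

penguins_long <- penguins_id %>%
  pivot_longer(contains("_"),
               names_to = c("part", "measure", "unit"),
               names_sep = "_")

penguins_wide <- penguins_long %>%
  pivot_wider(
    names_from = c(part, measure, unit),  # columns that form the new names
    values_from = value,                  # column that supplies the cell values
    names_sep = "_"                       # glue the name pieces with underscores
  )

names(penguins_wide)
```

Note that the integer measurement columns come back as doubles, because pivot_longer() stacked them into a single value column alongside the double columns.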
6. Running group statistics across multiple columns
dplyr can now apply multiple summary functions to grouped data using the across adverb, helping you be more efficient. If we wanted to summarise all bill and flipper measurements in our penguins we would do this:
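A minimal sketch of such a summary, assuming the mean and standard deviation are the statistics of interest. The functions are deliberately left unnamed here, so the output columns get default names.

```r
library(dplyr)
library(palmerpenguins)

penguins <- palmerpenguins::penguins

# Apply two unnamed summary functions to every column ending in "mm",
# separately for each species.
penguin_stats <- penguins %>%
  group_by(species) %>%
  summarise(across(
    ends_with("mm"),
    list(~ mean(.x, na.rm = TRUE), ~ sd(.x, na.rm = TRUE))
  ))

names(penguin_stats)
```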
7. Control how output columns are named when summarising across multiple columns
You’ll see above how the multiple columns in penguin_stats have been given default names which are not that intuitive. If you name your summary functions, you can then use the .names argument to control precisely how you want these columns named. This uses glue notation. For example, here I want to construct the new column names by taking the existing column names, removing any underscores or ‘mm’ metrics, and pasting to the summary function name using an underscore:
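A sketch of this naming scheme. The regular expression "_mm$|_" is one possible way to strip the unit suffix and the remaining underscores, chosen for this example. Inside the glue string, .col and .fn stand for the current column name and function name.

```r
library(dplyr)
library(palmerpenguins)

penguins <- palmerpenguins::penguins

penguin_stats <- penguins %>%
  group_by(species) %>%
  summarise(across(
    ends_with("mm"),
    # Naming the functions makes their names available as {.fn}.
    list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE)),
    # Drop the trailing "_mm" and any other underscores, then append the function name.
    .names = "{gsub('_mm$|_', '', .col)}_{.fn}"
  ))

names(penguin_stats)
```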
8. Running models across subsets of data
The output of summarise() can now be almost anything, because dplyr now allows different column types. You can generate summary vectors, dataframes or other objects like models or graphs.
If you wanted to run a model for each species you could do it like this:
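A minimal sketch of a per-species model stored in a list column. The choice of a linear model of body mass on the three length measurements is an assumption made for illustration.

```r
library(dplyr)
library(palmerpenguins)

penguins <- palmerpenguins::penguins

# Fit one linear model per species. Wrapping the model in list() lets
# summarise() store it as a single cell of a list column.
penguin_models <- penguins %>%
  group_by(species) %>%
  summarise(model = list(
    lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm)
  ))

penguin_models
```

The formula finds its variables in the current group because it is evaluated inside summarise(). Rows with missing measurements are dropped by lm() by default.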
It’s not usually that useful to keep model objects in a dataframe, but you could use other tidy-oriented packages to summarise the statistics of the models and return them all as nicely integrated dataframes:
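As a sketch of that idea, assuming the broom package is installed: broom::glance() returns a one-row dataframe of model statistics, and an unnamed dataframe result inside summarise() is unpacked into separate columns. The model formula is again illustrative.

```r
library(dplyr)
library(broom)
library(palmerpenguins)

penguins <- palmerpenguins::penguins

# One row of model-level statistics (R squared, AIC and so on) per species.
penguin_model_stats <- penguins %>%
  group_by(species) %>%
  summarise(glance(
    lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm)
  ))

penguin_model_stats
```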
9. Nesting data
Often we have to work with subsets of our data, and it can be useful to group data by subset so that we can apply a common function or operation across all subsets of the data. For example, maybe we want to take a look at our different species of penguins and make some different graphs of them. Grouping based on subsets could previously be achieved by the following somewhat awkward combination of tidyverse functions.
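A sketch of what that older combination might look like: group, nest with tidyr, then switch to row-wise operation so that later steps act on one subset at a time. The name penguins_nested_old is illustrative.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

penguins <- palmerpenguins::penguins

# Three steps: group by species, nest the remaining columns into a
# list column called `data`, then treat each row as its own group.
penguins_nested_old <- penguins %>%
  group_by(species) %>%
  nest() %>%
  rowwise()

penguins_nested_old
```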
The new function nest_by() provides a more intuitive and faster way to do the same thing:
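A minimal sketch of the nest_by() equivalent, which returns a row-wise dataframe with one row per species. The name penguins_nested is illustrative.

```r
library(dplyr)
library(palmerpenguins)

penguins <- palmerpenguins::penguins

# One call: nest all non-grouping columns by species, with row-wise behaviour.
penguins_nested <- penguins %>%
  nest_by(species)

penguins_nested
```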
Note that the nested data will be stored in a column called data unless you specify otherwise using a .key argument.
10. Graphing across subsets
Armed with nest_by() and the fact that we can now summarise or mutate virtually any type of object, we can generate graphs across subsets and store them in a dataframe for later use. Let’s draw scatter plots of bill length against bill depth for our three penguin species:
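A sketch of how the plots might be built and stored, assuming ggplot2 is installed. The column name plot, the object name penguin_scatters and the plot titles are illustrative. Because nest_by() returns a row-wise dataframe, data refers to the subset for the current species.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

penguins <- palmerpenguins::penguins

# Build one scatter plot per species and keep it in a list column.
penguin_scatters <- penguins %>%
  nest_by(species) %>%
  mutate(plot = list(
    ggplot(data, aes(x = bill_length_mm, y = bill_depth_mm)) +
      geom_point() +
      labs(title = as.character(species))
  ))

penguin_scatters
```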
Now we can easily display the different scatter plots to show, for example, that our penguins exemplify Simpson’s Paradox:
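A sketch of one way to display the stored plots, printing them one at a time rather than arranging them in a single figure. The correlation summaries at the end are an addition for this example: they put numbers on the pattern the plots show, where the pooled relationship between bill length and depth runs in the opposite direction to the relationship within each species.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

penguins <- palmerpenguins::penguins

# Rebuild the stored plots so that this sketch is self-contained.
penguin_scatters <- penguins %>%
  nest_by(species) %>%
  mutate(plot = list(
    ggplot(data, aes(x = bill_length_mm, y = bill_depth_mm)) +
      geom_point() +
      labs(title = as.character(species))
  ))

# Scatter plot for all species pooled together.
p_all <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  labs(title = "All species")
print(p_all)

# Print each stored per-species plot in turn.
for (p in penguin_scatters$plot) print(p)

# Correlation of bill length and depth: pooled, then within each species.
cor_all <- cor(penguins$bill_length_mm, penguins$bill_depth_mm,
               use = "complete.obs")
cor_by_species <- penguins %>%
  group_by(species) %>%
  summarise(r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs"))

cor_all
cor_by_species
```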

_Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter. Also check out my blog on drkeithmcnulty.com._
