
How to Group Data in R: Going Beyond “group_by”

Go from beginner to advanced with these grouping workflows

Photo by Camille San Vicente on Unsplash

Grouping data allows you to perform operations on subsets of a dataset, rather than the entire dataset. Working with grouped data is a crucial aspect of Data Analysis, and has near-limitless uses in data science.

There are many ways to create and manipulate groups with R. In this article, I’ll explain grouping workflows from the dplyr package, from the fundamentals to more advanced functions.

By the end, you should have all the tools needed to extract valuable insights from grouped data. All of the code in this article is also available on GitHub.

Basic grouping in dplyr

To group data in dplyr, you’ll mainly use the group_by function. You can use this to specify one or more variables to group the data by. Here’s an example with the penguins dataset from the palmerpenguins package. You can install this package by running install.packages("palmerpenguins"). Once loaded with library(palmerpenguins), you’ll be able to access the penguins dataset by name, as seen below.

library(tidyverse)
library(palmerpenguins)

# The Palmer Penguins dataset
penguins

A quick look at the dataset allows us to identify categorical variables that are suitable for grouping. Here, we can group by species, a factor with three levels. When you print the grouped data in the console, the grouping structure appears just above the column names.

# Grouping by species
penguins_species <- penguins %>%
  group_by(species)

penguins_species

We can also access the names and levels of grouping variables in our data with the group_keys function. Using this function on our grouped data returns a tibble with each grouping variable as a column, and each group level as a row.

# Getting the grouping structure with group_keys
group_keys(penguins_species) 

Now that the data is grouped, we can apply other functions to it. A common use of grouped data is calculating summary statistics with the summarise function. In the example below, summarise returns the mean body mass for each species of penguin, giving us a neat summary table with little effort.

# Getting the mean body mass for each group
penguins_species %>%
  summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE))

You can also use other dplyr functions like mutate, filter, select, and more on grouped data. While diving deep into all of these functions could take up a whole article by itself, the dplyr grouped data vignette is a helpful guide to how these functions behave with grouped data.
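To make the behaviour of these other verbs concrete, here's a minimal sketch using a small toy tibble (not the penguins data, so it's fully self-contained). It shows that mutate and filter both operate within each group once the data is grouped:

```r
library(dplyr)

# A small toy dataset to illustrate grouped mutate and filter
toy <- tibble(
  species = c("A", "A", "B", "B", "B"),
  mass    = c(10, 20, 30, 40, 50)
)

# mutate computes within each group: each row's mass
# relative to its own group's mean
toy %>%
  group_by(species) %>%
  mutate(mass_vs_group_mean = mass - mean(mass))

# filter also works per group: keep only rows above the group mean
above_avg <- toy %>%
  group_by(species) %>%
  filter(mass > mean(mass)) %>%
  ungroup()
```

Without the group_by step, filter would compare every mass against the overall mean; with it, each group is filtered against its own mean.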

Grouping by more than one variable

The process of grouping data by more than one variable is simple: just add another variable name inside group_by. For instance, we can group the penguins data by both species and sex.

# Grouping by more than one variable
penguins %>%
  filter(!is.na(sex)) %>%
  group_by(species, sex) %>%
  summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE))

First, we filter out NA values in the sex column, then group by species and sex. Using the same summary function on this newly grouped data gives a mean body mass measure for each combination of levels across our two grouping variables.
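One detail worth knowing here: when you summarise data grouped by two or more variables, dplyr drops only the last grouping level by default (and prints a message about it), so the result may still be grouped. The .groups argument of summarise controls this explicitly. A minimal sketch with a toy tibble (not the penguins data):

```r
library(dplyr)

# Toy data with two grouping variables
toy <- tibble(
  species = c("A", "A", "B", "B"),
  sex     = c("m", "f", "m", "f"),
  mass    = c(1, 2, 3, 4)
)

# By default (.groups = "drop_last"), summarise drops only the
# last grouping variable, so the result is still grouped by species
by_species_sex <- toy %>%
  group_by(species, sex) %>%
  summarise(mean_mass = mean(mass), .groups = "drop_last")

# .groups = "drop" returns a fully ungrouped tibble instead
ungrouped <- toy %>%
  group_by(species, sex) %>%
  summarise(mean_mass = mean(mass), .groups = "drop")
```

Setting .groups explicitly also silences the "summarise() has grouped output" message, which keeps pipelines quiet in scripts.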

Creating variables within group_by

Moving on from the basic use of group_by, we can get into more advanced grouping workflows.

One useful yet underrated dplyr feature is that you can create new grouping variables within group_by.

Let’s say we want to get summary statistics for penguins at all levels of a factor that isn’t already coded in the penguins data. For instance, the penguins dataset is made up of observations from three research studies, but the study identifier isn’t included in the cleaned version of the data. How can we calculate the mean body mass for penguins in each study?

An obvious solution would be to create a new variable with the study identifier for each row in the data, group by that variable, and then summarise.

# Add the study identifier from penguins_raw (rows align one-to-one)
penguins %>%
  mutate(StudyName = penguins_raw$studyName) %>%
  group_by(StudyName) %>%
  summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE))

However, you can combine the first two steps into one by creating the grouping variable inside group_by with the following syntax:

penguins %>%
  group_by(StudyName = penguins_raw$studyName) %>%
  summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE))

This gives exactly the same output as the long-form version while saving valuable space, making it a great trick for shortening long pipe sequences in your analysis.
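The variable you create inside group_by doesn't have to come from another data frame; it can be any expression computed from the columns at hand. As a hedged illustration (a hypothetical size_class variable on a toy tibble, not part of the penguins data), you can group by a condition on the fly:

```r
library(dplyr)

# Toy data: group by a size class computed inside group_by itself,
# so no separate mutate step is needed
toy <- tibble(mass = c(2800, 3500, 4200, 5100, 5600))

size_summary <- toy %>%
  group_by(size_class = if_else(mass >= 4000, "large", "small")) %>%
  summarise(mean_mass = mean(mass), n = n())

size_summary
```

This is equivalent to mutate followed by group_by, just in one step.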

Splitting up data by groups with group_split

You might run into situations where you want to split up groups into separate tibbles. For instance, you could write each species in the penguins data as a separate data file. For this, you’ll need group_split.

As its name suggests, group_split splits up data into separate tibbles, one for each level of the grouping variable species. It returns these tibbles in a list, which we can then feed into a separate function that writes each file one by one.

# Split up penguins data by species
species_list <- penguins %>%
  group_split(species)

# Get the names of each species
species_names <- group_keys(penguins_species) %>%
  pull(species)

# Write the separate datasets to csv, giving unique names
map2(species_list, species_names, ~ write_csv(.x, paste0(.y, ".csv")))

To write the filenames here, we make further use of the group_keys function to get the names of each level of our grouping variable. Using pull then turns these names into a vector that we can put into our write_csv function.

We then apply the write_csv function to each of the datasets inside species_list, giving each csv an appropriate filename from species_names. The map2 function repeats this operation for each pair of dataset and filename; since we only care about the side effect of writing files, purrr's walk2 would also work here and returns its input invisibly.

Using group_split like this saves a lot of manual filtering and writing. What could otherwise be a laborious task (especially in a dataset with even more groups) becomes achievable with minimal effort.
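To see the shape of what group_split returns without touching the filesystem, here's a minimal sketch on a toy tibble. It also shows a handy trick: since group_split returns pieces in the same order as the rows of group_keys, you can name the list by its keys:

```r
library(dplyr)
library(purrr)

toy <- tibble(
  grp = c("x", "x", "y"),
  val = 1:3
)

# group_split returns one tibble per group, in the same order
# as the rows returned by group_keys
pieces <- toy %>% group_split(grp)
keys   <- toy %>% group_by(grp) %>% group_keys()

# Name the list elements by their group keys for easier access
named_pieces <- set_names(pieces, keys$grp)
```

A named list like this is often easier to work with downstream than relying on positional indexing.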

Grouping temporarily using with_groups

Sometimes when grouping data for one purpose, we then want to drop the grouping structure to continue with further analyses. The standard way to do this is by using the ungroup function. In the example below, we filter the largest three penguins by body mass in each species. Ungrouping after this operation gets rid of the grouping structure.

heavy_penguins <- penguins %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 3, with_ties = FALSE) %>%
  ungroup()

group_keys(heavy_penguins)
# A tibble: 1 × 0

However, grouping by a variable, manipulating the data, and then ungrouping can add unnecessary steps to your code. Luckily, there’s a shorter way of temporarily performing a grouped operation using the with_groups function.

First, you specify a grouping variable in the .groups argument. Then, you specify a function to apply to each group, using tidyverse-specific syntax. In the example below, I’ve denoted the function I want to apply with the tilde (~) symbol, then written the function call itself, using the "." symbol as a placeholder for the data.

This syntax will be familiar to those who have learned the map functions from the tidyverse’s purrr package, which apply a function to each element of a list or vector. They’re well worth learning, and they help with understanding advanced features in other tidyverse packages.
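If you haven't met the map functions before, here's a two-line taste of the same ~ formula shorthand in purrr (a standalone sketch, unrelated to the penguins data):

```r
library(purrr)

# map applies a function to each element, using the same
# ~ . formula shorthand as with_groups; it returns a list
squares <- map(1:3, ~ .x^2)

# map_dbl returns a plain numeric vector instead of a list
means <- map_dbl(list(a = 1:3, b = 4:6), ~ mean(.x))
```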


Running with_groups gives us the same output as the longer "group_by, slice, ungroup" workflow. Inspecting the output, we can see that with_groups has dropped the grouping structure in the data after applying our function just like the previous example, allowing for further analysis to be done on the full dataset.

heavy_penguins_temp <- penguins %>%
  with_groups(.groups = species, ~ slice_max(., body_mass_g, n = 3, with_ties = FALSE))

group_keys(heavy_penguins_temp)
# A tibble: 1 × 0

Bonus: applying grouped functions with group_map

While with_groups shares some syntax with map functions, there’s also a special case of map that’s built for grouped data; group_map. We can examine its behaviour by using it to repeat the same slice_max function as the previous example.

group_map applies to data that’s already grouped. It returns the results for each grouped operation as separate tibbles in a list, much like the results of group_split. You may also notice that there is no species column in the output. This is because group_map drops the grouping variables from its output by default. There is an option to keep them by adding the argument .keep = TRUE, however.

penguins %>%
  group_by(species) %>%
  group_map(~ slice_max(., body_mass_g, n = 3, with_ties = FALSE))
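To make the effect of the .keep argument concrete, here's a minimal sketch on a toy tibble (self-contained, not the penguins data) comparing the default behaviour with .keep = TRUE:

```r
library(dplyr)

toy <- tibble(
  grp = c("x", "x", "y"),
  val = c(10, 20, 30)
)

# By default, group_map drops the grouping column from each result
no_keys <- toy %>%
  group_by(grp) %>%
  group_map(~ slice_max(., val, n = 1))

# With .keep = TRUE, the grouping column is retained in each tibble
with_keys <- toy %>%
  group_by(grp) %>%
  group_map(~ slice_max(., val, n = 1), .keep = TRUE)
```

Keeping the grouping column is useful when the list elements will later be combined or written out and you still need to know which group each came from.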

In my own code, I don’t use group_map very often, usually opting for more conventional map functions or simpler grouping workflows instead. That said, in cases where you start with grouped data and want to end with transformed, separated datasets, it’s a tidy shortcut.

Summary: When to use each dplyr grouping function

In sum, the grouping functions in dplyr are a great way of extracting a lot of value from data with little effort. To recap their uses:

  • group_by adds one or more groupings to a dataset. You can also create grouping variables within group_by itself, saving a separate mutate step
  • group_keys returns the grouping structure of a tibble
  • ungroup removes groupings from data
  • group_split separates a dataset into separate tibbles by group
  • with_groups temporarily groups data to perform a single operation
  • group_map applies a function to grouped data and returns the results for each group in a list

Even if you only use group_by, you can do all kinds of summary statistics, within-group filtering, and much more. Lots of R users get on fine using this function alone.

That said, going further with the other grouping workflows we’ve explored gives you even more options. Use them the next time you’re grouping data and you’ll save space and reduce the number of steps in your analysis.


Want to read all my articles on programming, Data Science, and more? Sign up for a Medium membership at this link and get full access to all my writing and every other story on Medium. This also helps me directly, as I get a small contribution from your membership fee at no extra cost to you.

You can also get all my new articles delivered straight to your inbox whenever I post by subscribing here. Thanks for reading!

