In this post, I compare the syntax of R‘s two most powerful data manipulation libraries: dplyr and data.table. I am not here to advocate any package. However, while working on a project with unusually large datasets, my preferred package became data.table, for speed and memory efficiency. I took a legacy project in dplyr and recreated it in data.table. I will show the comparison of syntax and how to translate from dplyr to data.table and vice versa.
"
data table
provides a high-performance version of base R’s data frame with syntax and feature enhancements for ease of use, convenience, and programming speed." according to Matt Dowle, author of data.table"
dplyr
is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges." according to Hadley Wickham, author of dplyr
Libraries and dataset
I will use R’s built-in AutoClaims dataset of automobile insurance claims. Let’s install and load the libraries and prepare the dataset for the exercise.
#install.packages("dplyr")
#install.packages("data.table")
#install.packages("insuranceData")
library(dplyr)
library(data.table)
library(insuranceData)
data("AutoClaims") #load the dataset
data <- AutoClaims #rename the dataset
data <- data.table(data) #convert it to data.table
Data has 6773 rows and 5 columns. Let’s see how the data looks.
> head(data)
STATE CLASS GENDER AGE PAID
1 STATE 14 C6 M 97 1134.44
2 STATE 15 C6 M 96 3761.24
3 STATE 15 C11 M 95 7842.31
4 STATE 15 F6 F 95 2384.67
5 STATE 15 F6 M 95 650.00
6 STATE 15 F6 M 95 391.12
TOP 10 most frequently used data manipulation functions
- Filter by rows
Let’s filter for men aged 79 or younger
data[AGE <= 79 & GENDER == "M"] #datatable
data %>% filter(AGE <= 79 & GENDER == "M") #dplyr
2. Select by columns
Let’s select the columns GENDER, AGE, and PAID
data[, .(GENDER, AGE, PAID)] #datatable
data %>% select(GENDER, AGE, PAID) #dplyr
3. Add new columns
Let’s convert PAID from Dollars to Euros using a 0.85 Dollar/Euro conversion rate. The resulting new variable is added to the dataset
data[, PAID.IN.EURO := PAID * 0.85] #datatable
data %>% mutate(PAID.IN.EURO = PAID * 0.85) #dplyr
4. Delete column
Let’s delete the newly created variable
data[, !c("PAID.IN.EURO")] #datatable
data[, PAID.IN.EURO:= NULL] #datatable (alternative way)
select(data, -PAID.IN.EURO) #dplyr
5. Create a new column
Now create a new variable and drop the existing ones. As a result, the dataset will have only a new variable
data[, .(PAID.IN.EURO = PAID * 0.85)] #datatable
data %>% transmute(PAID.IN.EURO = PAID * 0.85) #dplyr
6. Summarize by column
What is the average insurance claim paid?
The following code aggregates data and returns a single-record named AVG.PAID
data[, .(AVG.PAID = mean(PAID))] #datatable
data %>% summarise(AVG.PAID = mean(PAID)) #dplyr
7. Rename variable
With data.table we use the setnames function, which references the old names and then the new names. With dplyr, the order is reversed
setnames(data, c("GENDER", "CLASS"), c("gender", "class")) #dt
data %>% rename("gender" = "GENDER", "class" = "CLASS") #dplyr
8. Sort data in ascending or descending order
Let’s sort the data in ascending and descending order using the setorder
function in data.table and arrange
in dplyr.
setorder(data, PAID) #datatable ascending order
setorder(data, -PAID) #datatable descending order
data %>% arrange(PAID) #dplyr ascending
data %>% arrange(desc(PAID)) #dplyr descending
9. Group by column
Grouping data alone does not do anything; you should combine it with another function, for example, group by State and summarize the data, or filter the data.
data[, .(mean.paid.by.state = mean(PAID)), by = "STATE"]
data %>% group_by(STATE) %>%
summarize(mean.paid.by.state = mean(PAID))
10. Counting observations
Let’s count how many observations we have for each class by using n() function for dplyr and .N for data.table
data[, .N, by= "class"] #datatable
data %>% #dplyr
group_by(class) %>%
summarise(n())
Now let’s see more examples of how to chain functions within dplyr and data.table.
Chaining Functions
In dplyr, we chain functions using the %>% pipe operator. In data.table, several functions can be written concisely in one line of code, or by using square brackets for more complicated chaining.
- Let’s select three variables: STATE, AGE, and PAID, then create a new column which is 20% of PAID, and sort the result in ascending order according to the column AGE.
#datatable
data[, .(STATE, AGE, PAID)][, .paid.20 := PAID * 0.2, keyby = "AGE"]
#dplyr
data %>% select(STATE, AGE, PAID) %>%
mutate(paid.20 = PAID * 0.2) %>%
arrange(AGE)
Conclusion
Today we compared dplyr
and data.table
syntax side-by-side, used 10 common functions, and chained functions. There is so much more you can do with both libraries.
The syntax. Data.table uses shorter syntax than dplyr, but is often more nuanced and complex. dplyr use a pipe operator, which is more intuitive for beginners to read and debug. Moreover, many other libraries use pipe operators, such as ggplot2 and tidyr. While data.table and dplyr are both widely used in the R community, dplyr is used more broadly and offers more opportunities for collaboration.
Are memory and speed important? When building complex reports and dashboards, and especially when working with very large datasets, data.table has distinct advantages.
It is up to you which library to use for your analysis!
You can find more materials in the reference section.
Thank you for reading!