Speeding up data wrangling with dtplyr
I recently saw a tweet by Hadley Wickham about the release of dtplyr. It is a package that enables working with dplyr syntax on data.table objects. dtplyr automatically translates the dplyr syntax to the data.table equivalent, which in the end results in a performance boost.
Marvel: Infinity War is the most ambitious crossover event in history. Hadley Wickham: Hold my beer.
I have always liked the ease and readability of dplyr and was eager to test the performance of the new package. Let’s see how it works in practice!
Loading libraries
For this article, we need to install dtplyr from GitHub by running devtools::install_github("tidyverse/dtplyr"). We also use microbenchmark for performance comparison.
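Assuming the packages have been installed as described, the setup boils down to loading them:

```r
library(dplyr)          # the familiar verb syntax
library(data.table)     # the fast backend
library(dtplyr)         # translates dplyr calls to data.table
library(microbenchmark) # for timing the comparisons
```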
Generating the dataset
We generate an artificial dataset. The first thing that came to my mind is an order registry, in which we store:
- id of the client
- name of the product
- date of purchase
- amount of product purchased
- the unit price of a given product
As this is only a toy example, we do not dive deeply into the logic behind the dataset. We can agree that it vaguely resembles a real-life scenario. For testing the performance of different approaches, we generate 10 million rows of data.
By using lazy_dt() we trigger lazy evaluation – no computation is performed until we explicitly request it with as.data.table(), as.data.frame() or as_tibble(). For the sake of comparison, we store one data.frame, one data.table and one "lazy" data.table.
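The exact generation code is not essential here; a minimal sketch consistent with the columns listed above might look as follows (the object names, pool sizes and value ranges are my assumptions):

```r
library(data.table)
library(dtplyr)

set.seed(42)
n <- 1e7  # 10 million rows, as in the article

# random three-letter codes, e.g. "DHQ"
random_codes <- function(k) {
  replicate(k, paste(sample(LETTERS, 3, replace = TRUE), collapse = ""))
}

orders_df <- data.frame(
  id      = sample(random_codes(50), n, replace = TRUE),   # client id
  product = sample(random_codes(100), n, replace = TRUE),  # product name
  date    = sample(seq(as.Date("2019-01-01"), as.Date("2019-12-31"),
                       by = "day"), n, replace = TRUE),    # purchase date
  amount  = sample(1:10000, n, replace = TRUE),            # units purchased
  price   = round(runif(n, 1, 100), 2),                    # unit price
  stringsAsFactors = FALSE
)

# one data.frame, one data.table and one "lazy" data.table
orders_dt   <- as.data.table(orders_df)
orders_lazy <- lazy_dt(orders_dt)
```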
We can preview the transformation, as well as the generated data.table code, by printing the result:
Source: local data table [?? x 3]
Call: `_DT3`[date < as.Date("2019-02-01"), .(id, product, date)][order(date)]
id product date
<chr> <chr> <date>
1 DHQ GVF 2019-01-01
2 NUB ZIU 2019-01-01
3 CKW LJH 2019-01-01
4 AZO VIQ 2019-01-01
5 AQW AGD 2019-01-01
6 OBL NPC 2019-01-01
Generally, this should be used for debugging. We should indicate what kind of object we want to receive at the end of the pipeline to clearly show that we are done with the transformations.
Use-case 1: Filtering, Selecting and Sorting
Let’s say we want to have a list of transactions that happened before 2019-02-01, sorted by date, and we do not care about either the amount or the price.
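A self-contained sketch of this comparison (the object names and the small stand-in dataset below are my assumptions; the article benchmarks the 10-million-row registry):

```r
library(dplyr)
library(data.table)
library(dtplyr)
library(microbenchmark)

# small stand-in for the 10M-row registry described earlier
set.seed(42)
n <- 1e5
orders_df <- data.frame(
  id      = sample(sprintf("C%02d", 1:50), n, replace = TRUE),
  product = sample(sprintf("P%03d", 1:100), n, replace = TRUE),
  date    = sample(seq(as.Date("2019-01-01"), as.Date("2019-12-31"),
                       by = "day"), n, replace = TRUE),
  amount  = sample(1:10000, n, replace = TRUE),
  price   = runif(n, 1, 100)
)
orders_dt   <- as.data.table(orders_df)
orders_lazy <- lazy_dt(orders_dt)

microbenchmark(
  dplyr      = orders_df %>%
                 filter(date < as.Date("2019-02-01")) %>%
                 select(id, product, date) %>%
                 arrange(date),
  dtplyr     = orders_lazy %>%
                 filter(date < as.Date("2019-02-01")) %>%
                 select(id, product, date) %>%
                 arrange(date) %>%
                 as_tibble(),
  data.table = orders_dt[date < as.Date("2019-02-01"),
                         .(id, product, date)][order(date)],
  times = 10
)
```

Note the trailing as_tibble() in the dtplyr variant: without it, the lazy pipeline would not be evaluated at all.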

We see that dtplyr is slightly slower than data.table, but by looking at the median time it is ~4x faster than dplyr.
Use-case 2: Adding new variables after filtering
In this example, we want to filter orders with a number of products over 5000 and calculate the order value, which is amount * price.
Most of the expressions using mutate() must make a copy (they do not modify in place), which would not be necessary when using data.table directly. To counter that, we can specify immutable = FALSE in lazy_dt() to opt out of this behavior.
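A sketch of this use case, including the immutable = FALSE flag (the stand-in data and object names below are my assumptions):

```r
library(dplyr)
library(data.table)
library(dtplyr)

set.seed(42)
n <- 1e5  # small stand-in for the 10M-row registry
orders_dt <- data.table(
  id     = sample(sprintf("C%02d", 1:50), n, replace = TRUE),
  amount = sample(1:10000, n, replace = TRUE),
  price  = runif(n, 1, 100)
)

# immutable = FALSE allows dtplyr to use data.table's in-place `:=`
# instead of copying the table before mutate()
orders_lazy <- lazy_dt(orders_dt, immutable = FALSE)

order_values <- orders_lazy %>%
  filter(amount > 5000) %>%
  mutate(value = amount * price) %>%
  as_tibble()
```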

This time the difference is not so pronounced. This, of course, depends on the complexity of operations done to the tables.
Use-case 3: Aggregation on top
Let’s say we want to:
- Filter all orders with amount <= 4000
- Calculate the average order value per customer
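Expressed in dplyr syntax on a lazy data.table, the two steps above might look like this (a sketch with stand-in data; the names are my assumptions):

```r
library(dplyr)
library(data.table)
library(dtplyr)

set.seed(42)
n <- 1e5  # small stand-in for the 10M-row registry
orders_dt <- data.table(
  id     = sample(sprintf("C%02d", 1:50), n, replace = TRUE),
  amount = sample(1:10000, n, replace = TRUE),
  price  = runif(n, 1, 100)
)

avg_order_value <- lazy_dt(orders_dt) %>%
  filter(amount <= 4000) %>%
  mutate(value = amount * price) %>%
  group_by(id) %>%
  summarise(avg_value = mean(value)) %>%
  as_tibble()
```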
This time we get ~3x improvement in median execution time.
Use-case 4: Joining
In the last example, we consider a case of joining datasets. For that, we create a new data.frame
/data.table
called product
by selecting 75% of the available products and assigning a random letter to them. We can assume that the letter corresponds to a distribution center (variable called origin
), from which the item is shipped.
We want to calculate the average order value per distribution center. In case we do not have data regarding the distribution center, we discard the row.
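A sketch of the join, where an inner join discards the orders without a known distribution center (the stand-in data and names are my assumptions):

```r
library(dplyr)
library(data.table)
library(dtplyr)

set.seed(42)
n <- 1e5  # small stand-in for the 10M-row registry
all_products <- sprintf("P%03d", 1:100)

orders_dt <- data.table(
  product = sample(all_products, n, replace = TRUE),
  amount  = sample(1:10000, n, replace = TRUE),
  price   = runif(n, 1, 100)
)

# 75% of the products get a distribution-center letter
product <- data.table(
  product = sample(all_products, 75),
  origin  = sample(LETTERS, 75, replace = TRUE)
)

avg_per_center <- lazy_dt(orders_dt) %>%
  inner_join(product, by = "product") %>%  # discard rows without a center
  mutate(value = amount * price) %>%
  group_by(origin) %>%
  summarise(avg_value = mean(value)) %>%
  as_tibble()
```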

Again we see a ~3x speedup in the median execution time.
Conclusions
dtplyr is (and always will be) slightly slower than data.table. That is because:
- Each dplyr verb must be converted to a data.table equivalent. For large datasets, this overhead should be negligible, as the translation takes time proportional to the complexity of the input code rather than the amount of data.
- Some data.table expressions have no direct dplyr equivalent.
- The immutability issue mentioned in use-case 2.
Summing up, I believe that dtplyr is a valuable addition to the tidyverse, as with only small changes to the dplyr code we can achieve significant performance improvements.
As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments. You can find the code used for this article on my GitHub.