Preface
Full disclaimer, I know the title is a little “clickbaity”, but after years of writing R scripts for data analysis, I believe I’ve arrived at a solid milestone of what a fully reproducible R workflow should aspire to be. This article is NOT intended for beginners but rather for advanced R users who write functionalized code and may already have personal workflows for managing many scripts. Also, please keep in mind that this article is entirely my opinion, as everyone may have their own notion of an “ultimate” workflow. Hopefully, this article gives you some new ideas!
Imagine you have a long and complex data analysis pipeline with many steps and thus lots of lines of code. After doing some research you realize something in the early stages of your pipeline needs to be fixed. You make the change. You run all the code. You wait because there’s a lot of code. An error message pops up. Crap, your change wasn’t exactly correct. You have to go back and try something different. You rerun the code. You wait. Another error message pops up in a different location because it was dependent on what you had just changed. Crap. You think to yourself, how long is this going to take me, as you buckle down for a long session of coding ahead.
It doesn’t have to be this way.
Workflow
This article will discuss the core packages used to build this workflow, the engine of the workflow, `targets`, and why you should consider using it, and a sample workflow using the `mtcars` dataset as an example.
Packages Used:
In the "Working Example" section, I’ll provide some sample code that showcases each of these packages in action! These are the main packages that constitute the skeleton of the workflow:
- [pacman](https://www.rdocumentation.org/packages/pacman/versions/0.5.1): package management tool that reduces the amount of code you have to write. No more repeated `install.packages` and `library` calls that clutter your code
- [renv](https://rstudio.github.io/renv/): environment dependency management tool that allows your existing workflows to “just work” as they did before and on other machines
- [targets](https://github.com/ropensci/targets): workflow engine that fundamentally allows you to write functionalized code that only runs components that have changed (more on why this is good below)
- [validate](https://github.com/data-cleaning/validate): makes it easy to check whether your data is clean. You don’t want to load a new dataset into your workflow, run through everything, and discover down the line that there were some suspicious data entries that may have affected your results (e.g. a BMI of 1000, a price of -$47, etc.)
- `tidyverse` (a personal preference): this is just my data wrangling tool of choice
And of course, use any packages that you’ll need for your downstream data profiling, statistical analyses, modeling, or visualizations.
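As a quick illustration of how the first two slot in (a sketch, with the loaded package names purely as examples):

```r
# install pacman itself once, then let p_load() handle the rest:
# it installs any missing packages and attaches them in one call
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(targets, validate, tidyverse)

# renv pins package versions per project so the code runs the same elsewhere
# renv::init()      # set up a project-local library (run once)
# renv::snapshot()  # record the current versions in renv.lock
# renv::restore()   # rebuild that exact library on another machine
```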
Targets
So I’ve been talking a lot about `targets`, but before I get into the benefits of using it, I first have to make sure we’re on the same page about the importance of functional programming and Makefile-like workflows.
R is not a functional programming language per se, but à la tidyverse principles, I do use R in a functional programming style. The nuance here is that the style follows a principle of decomposing a big problem into smaller chunks. Each chunk should be an isolated function that is simple and straightforward to understand. Turning your code into functions (and thus not copying-and-pasting code) reduces code clutter, the number of times you have to update code, and the possibility of making a mistake. It also makes debugging arguably easier (you only have to debug the function as opposed to multiple places if you copied and pasted code).
In a similar vein, if you have a data analysis pipeline with many steps (i.e. clean the data, visualize the data, model the data, etc.), a common practice would be to break up those steps into scripts so that you don’t have one mega-file thousands of lines long that would be hard to navigate (imagine returning to a 3,000-line script a year from today and trying to understand it). Furthermore, many R users I know write scripts in an imperative way as opposed to a functional way. This means that within their scripts they save global variables (e.g. `x <- runif(1000)`), and they may save paths globally. This results in cluttered global environments, more difficulty in debugging, and a monolithic system that may grow outdated and become difficult to maintain over time.
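As a toy contrast (hypothetical code, not from the example pipeline):

```r
# imperative style: globals and hard-coded paths sprinkled through a script
# raw <- read.csv("C:/me/project/data/cars.csv")
# raw$wt_kg <- raw$wt * 453.6

# functional style: the same step wrapped in a small, reusable function
convert_weights <- function(df) {
  df$wt_kg <- df$wt * 453.6  # mtcars' wt is recorded in 1000s of lbs
  df
}

cars_kg <- convert_weights(mtcars)
```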
Makefiles originated in Unix-like systems; a makefile is a file containing a set of directives. The power behind makefiles is that they only recompile when necessary (i.e. when code has changed), which saves costly runtime spent rerunning long and complex scripts.
OK, now that I’ve covered those two things, let’s get into why you should consider using `targets`. Fundamentally, `targets` only reruns code when it itself or its dependencies have changed – this will save you a lot of time. It organizes your code by codifying your data analysis steps as `tar_target`s. It relies on the functional programming style by expecting you to place all your functions in separate scripts and then `source`-ing them into your `_targets.R` file (this is the main script that runs your entire pipeline, more on this below). `targets` comes with useful tools to visualize a graph network of how the dependencies relate to one another, and it provides logs of which steps were rerun and which steps were skipped because they hadn’t changed. In workflows that are used repeatedly, where the only thing that changes is the dataset, `targets` is perfect because you’ll be able to rerun the pipeline half-asleep since everything is already automated.
So what data analysis step would constitute a good `target`? Ideally, it should be large enough that skipping it spares you a long runtime, and it should return values that are easy to understand and use in downstream functions.
If you have been using `drake`, see this article on why `drake` has been superseded by `targets`. For a more detailed look into `targets`, please read the manual that the maintainers compiled.
Working Example
First, set up your project directory as follows:
```
├── _targets.R
├── R/
├──── functions.R
├── data/
└────
```
where `_targets.R` (must be spelled exactly like this) lives at the root of your project directory, and “functions.R” can be named whatever you like. You will then set up your `_targets.R` file as follows:
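A minimal sketch of such a file, assuming the helper functions `clean_data()` and `validate_data()` (names chosen here for illustration) live in `R/functions.R`:

```r
# _targets.R
library(targets)

source("R/functions.R")  # pull in the user-defined functions

# packages every target can assume are attached when it runs
tar_option_set(packages = c("tidyverse", "validate"))

# the file must end with a list of targets
list(
  tar_target(data, clean_data(mtcars)),        # load/clean/transform step
  tar_target(validation, validate_data(data))  # rule-based sanity checks
)
```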
You specify your commands in a `list()` using `tar_target(name, command)`. We’re using `mtcars`, which is a pretty clean and small dataset, but ideally you want to define some targets that load in your data and clean and transform it. I’m also adding a validation step. The code for my user-defined functions is here:
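A hedged sketch of what those helpers could look like (the specific cleaning step and validation rules are illustrative stand-ins, not the original listing):

```r
# R/functions.R

clean_data <- function(raw) {
  raw %>%
    tibble::rownames_to_column("model") %>%  # keep the car names as a column
    dplyr::mutate(cyl = as.integer(cyl))
}

validate_data <- function(df) {
  rules <- validate::validator(
    mpg > 0,
    cyl %in% c(4, 6, 8),
    wt < 10   # wt is in 1000s of lbs, so anything >= 10 is suspicious
  )
  validate::confront(df, rules)  # returns a confrontation object to inspect
}
```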
Next, you ideally want to add some targets that visualize and profile your data. Here we’re just going to make a simple histogram as a proof of concept. Note that `tar_target(hist, create_plot(data))` references the name `data` from the previous `tar_target`. Again, this is a simple example, but the commands that you pass into targets would likely be more complex, so you’d source them in from your “functions.R” script.
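Concretely, that means the pipeline list and `R/functions.R` each gain one piece, sketched below with `ggplot2` (bundled with the tidyverse); the binning choice is arbitrary:

```r
# in _targets.R, the list() gains one more target:
# list(
#   tar_target(data, clean_data(mtcars)),
#   tar_target(validation, validate_data(data)),
#   tar_target(hist, create_plot(data))
# )

# in R/functions.R:
create_plot <- function(df) {
  ggplot2::ggplot(df, ggplot2::aes(x = mpg)) +
    ggplot2::geom_histogram(bins = 10) +
    ggplot2::labs(x = "Miles per gallon", y = "Count")
}
```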
You can always inspect your pipeline in a tibble format with `tar_manifest()` or graphically with `tar_visnetwork()` (Note: you will need to run `install.packages("visNetwork")`).
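In the console, that inspection step is just:

```r
library(targets)

install.packages("visNetwork")  # one-time install for the graph view

tar_manifest()    # tibble of target names and the commands they will run
tar_visnetwork()  # interactive dependency graph, with out-of-date targets highlighted
```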
Running `tar_visnetwork()` on the target workflow above produces a graph network showing how each target depends on the others and which ones are currently out of date.
Once you’ve ensured that everything looks good, run `tar_make()` in your console to run the entire pipeline. The return values are saved as files in the `_targets/objects/` folder, and you’re done!
You can view your saved objects by using `tar_read()` or load them into your global environment (just to make exploratory data analysis easier) using `tar_load()`.
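Put together, a typical console session looks something like this (target names follow the example sketched above):

```r
library(targets)

tar_make()      # builds data, validation, and hist; results go to _targets/objects/

tar_read(hist)  # returns the stored histogram without touching the global environment
tar_load(data)  # creates an object called `data` in the global environment

# change a function or swap in a new dataset, then rerun:
tar_make()      # only the affected targets are rebuilt; everything else is skipped
```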
Concluding Thoughts
The motivation behind this article was to demonstrate a fully reproducible workflow that makes your code more organized and saves you time. You want to be able to return to past code and not have to spend extra time trying to figure out what everything did or updating bits and pieces. If you’re collaborating with people, you want to ensure that your code works for them just as it works for you. This is the essence of data analysis reproducibility and I hope you guys took something away!