
Data science workflows with the Targets package in R: End-to-end example with code

A structured and reproducible approach to analysis

Photo by Markus Spiske on Unsplash

Introduction

Any task that we do at work, or as part of our daily chores, has a high likelihood of becoming repetitive. This isn’t to say that the task will remain exactly the same; changes may be required along the way. However, structuring and organizing subtasks so that the overall objective is accomplished more efficiently, in a streamlined manner and with minimal issues, is always ideal.

Over time, we have come to apply this "assembly line" concept to our data tasks (for example, data loading, data preparation, model building and presentation of results are structured into workflows and pipelines), and the resulting improvements make a strong case for this approach. In fact, given the iterative nature of our jobs and the need for ideation, it is imperative for data scientists to bring a certain level of organization and streamlining to such tasks.


Why are workflows/pipelines and reproducibility recommended?

Why do we need to organize our code into a structured workflow/pipeline format? How does it help?

  1. Data science work is computation heavy and takes time to run – any change along the way can imply a rerun of the whole code or parts of it. With cost implications for both time and computation, it becomes necessary to have a streamlined approach to reruns, where only the updated code, and only those parts of the workflow that depend on it, are run again. Intelligent caching and storage make regeneration of results easier, more efficient and error free.
  2. It is easy to get lost – most diagrams of data science workflows include a loop. This iterative nature comes from the various parties involved, for example domain experts, data scientists, engineers and business stakeholders. Almost all data science projects go through rounds of changes, not only to objectives but also to data, models, definitions, parameters and so on, which can have a downstream effect on the work done to date. In the midst of all these changes, with rough, unstructured code, it is very easy to get lost among the multiple updates. While tools like GitHub can help you manage code versions, it is still extremely important to put a pipeline structure around your analysis and modularize it for better management of the overall workflow. To give an example, in the image below, even just preparing the data and performing exploratory data analysis ("EDA") to make a business case and recommendations for an AI solution can be very iterative until we get to the final result, as illustrated by the circular loops at almost every stage.
Image source: Author
  3. Clean code and collaboration – by turning your code into smaller modular functions, you can break the work into small tasks, organize them in a logical order and reduce the clutter and noise in the code. This is good practice in general. When working in a team and reproducing exactly the same results, such practices also become essential for debugging, reviewing and correcting.

The Drake and Targets package

While Python has always been considered more evolved in this space, R has been catching up fast. The first popular package here was Drake. It analyzes your workflow, skips steps with up-to-date results, and orchestrates the rest with optional distributed computing. At the end, Drake provides evidence that your results match the underlying code and data, which increases your ability to trust your research.

Image source: Author

In January 2021, Drake was superseded by Targets, which is more robust and easier to use. It addresses a number of gaps around data management, collaboration, dynamic branching, and parallel efficiency¹. Further details on the enhancements and benefits of Targets over Drake are covered here.

One of the main enhancements in Targets is its metadata management system, which only updates information on global objects when the pipeline actually runs. This makes it possible to understand which specific changes to your code could have invalidated your results. In large projects with long runtimes, this feature contributes significantly to reproducibility and peace of mind¹. The flow examples given below show which parts of a workflow are up to date or outdated (based on user changes); this is detected automatically and reflected in the workflow visualization. The images show a simplistic view of workflows that can become really complex and lengthy in real-world data science projects, so data scientists will appreciate the ability to quickly visualize and understand changes and dependencies.


Walk-through Example

In our example we will be using the popular Titanic data set. While the workflow can be as granular and complex as the user’s work requires, for the purpose of illustrating the utility of Targets we will keep it simple and standard. Our workflow includes:

A. Loading the data
B. Pre-processing of the data
C. EDA markdown notebook generation
D. Xgboost model building and prediction on the test set, with some results on model diagnostics in a markdown

These steps cover the standard process data scientists follow; however, there are iterations within each stage, and we will illustrate different aspects of Targets through this example below.

1. Folder structure: Let us create a root folder called "Targets" with the structure below (this is an example; different data scientists may follow different conventions). Within Targets we have:
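For example, a layout along the following lines works well (the sub-folder names here are just one possible convention; only _targets.R is required by the package):

```
Targets/
├── _targets.R       <- pipeline definition (must sit in the root folder)
├── R/
│   └── functions.R  <- helper functions used by the targets
├── data/
│   ├── train.csv
│   └── test.csv
└── reports/
    ├── eda.Rmd
    └── model.Rmd
```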

    An important thing to note here is that _targets.R must be in the root folder.

2. Creating functions to be used in targets: Here we create some functions to start with, which will become part of our workflow. As part of the example, refer to the functions below, which load and pre-process the data (steps A and B).
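As a minimal sketch of such helper functions, assuming the Titanic CSVs sit in a data/ folder and that we keep a handful of modelling columns, R/functions.R could look like this:

```r
# R/functions.R -- a sketch; file paths and column choices are assumptions
library(dplyr)

load_data <- function(path) {
  # Read a Titanic CSV, e.g. data/train.csv or data/test.csv
  read.csv(path, stringsAsFactors = FALSE)
}

process_data <- function(raw) {
  # Keep a few modelling columns, encode Sex as a factor and drop incomplete rows
  raw %>%
    select(any_of(c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"))) %>%
    mutate(Sex = as.factor(Sex)) %>%
    na.omit()
}

bar_plot <- function(data, var) {
  # Simple bar chart helper (not yet used by any target)
  ggplot2::ggplot(data, ggplot2::aes(x = .data[[var]])) +
    ggplot2::geom_bar()
}
```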

3. Defining, visualizing and executing the workflow: First we create a pipeline with just tasks A and B, i.e. load and pre-process the data.
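A sketch of the corresponding _targets.R, assuming the helper functions above (the raw data target names are assumptions; the processed target names match the ones referenced later in this walkthrough):

```r
# _targets.R -- sketch of steps A and B only
library(targets)
source("R/functions.R")  # load_data(), process_data(), bar_plot()

list(
  tar_target(raw_data_train, load_data("data/train.csv")),
  tar_target(raw_data_test,  load_data("data/test.csv")),
  tar_target(process_data_train, process_data(raw_data_train)),
  tar_target(process_data_test,  process_data(raw_data_test))
)
```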

Once the targets are defined, let’s have a look at the flow:

tar_glimpse() 

This gives a directed acyclic graph of the targets and does not account for metadata or progress information.

Image source: Author
tar_visnetwork()

This gives a directed acyclic graph of the targets that accounts for metadata, progress information, and global functions and objects¹. As we can see below, Targets has automatically detected the dependencies, as well as functions that are not yet used anywhere, for example bar_plot. We also see that all of the targets below are outdated, since we haven’t run them yet; that will be our next step.

Image source: Author

Along with the above commands, you can also use tar_manifest() to ensure that you have constructed your pipeline correctly.
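For example:

```r
# One row per planned target, showing the command each one will run
tar_manifest()
```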

Now we run the pipeline:

tar_make()
Image source: Author

This runs the required targets in the correct order and stores their return values in the root folder by creating a new folder, _targets, which contains _targets/objects and _targets/meta. Now when we visualize the pipeline using tar_visnetwork(), all targets have changed to "Up to date".

Image source: Author

4. Accessing files: To access files, you can use tar_read() or tar_load() from the Targets package.
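For example, assuming the target names from the sketch above:

```r
# tar_read() returns the stored object; tar_load() assigns it into the
# session under the target's own name
head(tar_read(process_data_train))

tar_load(process_data_train)  # now available as `process_data_train`
```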

This gives us the following dataset:

Image source: Author

As mentioned in the previous section, the files are also stored in the _targets/objects folder:

Image source: Author

You can also specify the format in which you want the files stored, e.g. rds. A target can return multiple objects, which can be bundled into a named list and returned by the target, then retrieved from the loaded target with the usual list accessors.
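A sketch of both ideas, assuming the helper functions and target names used above (the bundled target name is an assumption):

```r
# Declared inside the list() in _targets.R: an explicit storage format
# ("rds" is the default) and a target that bundles several objects in a list
tar_target(
  processed_data,
  list(train = process_data(raw_data_train),
       test  = process_data(raw_data_test)),
  format = "rds"
)

# After tar_make(), individual elements come back via the usual list accessors
tar_read(processed_data)$train
```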

5. Changes to workflow

I. Changes within the same workflow
First, let us start by making a change in the overall workflow, for example a change in the load dataset function:
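A hypothetical edit of this kind, based on the load_data() sketch above:

```r
# load_data() now reports how many rows it read. Even this small change
# alters the function's hash, so every target that depends on it is invalidated.
load_data <- function(path) {
  data <- read.csv(path, stringsAsFactors = FALSE)
  message("Loaded ", nrow(data), " rows from ", path)  # newly added line
  data
}
```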

Once we have made this update and checked the workflow via tar_visnetwork(), we can see below that Targets automatically detects the dependency and marks all downstream targets as outdated. On the next tar_make(), it will rerun all the outdated targets.

Image source: Author

Similarly, suppose we make a change in Step B, the data processing function, and remove the na.omit() call:
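Based on the process_data() sketch above, the edited function would look something like this:

```r
# process_data() with the na.omit() call removed (hypothetical edit)
process_data <- function(raw) {
  raw %>%
    select(any_of(c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"))) %>%
    mutate(Sex = as.factor(Sex))
}
```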

Reviewing the workflow again:

Image source: Author

We can see that only "process_data_train" and "process_data_test" are outdated and need to be rerun; Targets will therefore skip the upstream components in tar_make().

Image source: Author

II. Addition to workflow: EDA
Now let us add a short EDA markdown notebook (Step C) to the workflow. Below is a sample of the code:
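As a rough sketch, the R chunks inside such an eda.Rmd could look like this (the file name and the plots chosen are assumptions; the YAML header and prose are omitted):

```r
# R chunks inside reports/eda.Rmd
library(targets)
library(ggplot2)

# tarchetypes scans the Rmd for tar_read()/tar_load() calls to wire up dependencies
data <- tar_read(process_data_train)

summary(data)

ggplot(data, aes(x = factor(Survived), fill = Sex)) +
  geom_bar(position = "dodge") +
  labs(x = "Survived", title = "Survival counts by sex")
```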

This can be added to _targets.R as below:
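A sketch of the addition, using tarchetypes::tar_render() (the report target name and file path are assumptions):

```r
# In _targets.R: add the report as another target in the list()
library(tarchetypes)

# ... existing targets ..., then:
tar_render(eda_report_train, "reports/eda.Rmd")
```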

This is reflected in the new process flow:

Image source: Author

This gives us the EDA markdown for the train data set. It can easily be reused for the test data set as well, by simply changing the source target loaded in the code and creating a new target in _targets.R.
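For instance (target name and file name assumed):

```r
# A second report target, pointing at a copy of the Rmd that reads
# process_data_test instead of process_data_train
tar_render(eda_report_test, "reports/eda_test.Rmd")
```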

Image source: Author

III. Addition to workflow: Modelling markdown and prediction
Similar to the EDA markdown, we can now create a markdown notebook that builds a model (this could also be put into a separate function) and shows our model results. A sample Rmd is given below (it builds the model, generates model diagnostic visuals, makes predictions on the test set and saves the results), followed by the new process flow:
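As a rough sketch, the R chunks inside such a model.Rmd could read the processed targets, fit an xgboost classifier and save predictions (the file name, feature handling, hyperparameters and output path are all assumptions):

```r
# R chunks inside reports/model.Rmd
library(targets)
library(xgboost)

# Complete cases only, to keep this simple sketch well defined
train <- na.omit(tar_read(process_data_train))
test  <- na.omit(tar_read(process_data_test))

# Fit a simple binary classifier on a dummy-encoded feature matrix
x_train <- model.matrix(Survived ~ . - 1, data = train)
y_train <- train$Survived
fit <- xgboost(data = x_train, label = y_train,
               nrounds = 50, objective = "binary:logistic", verbose = 0)

# Quick model diagnostic: feature importance
xgb.plot.importance(xgb.importance(model = fit))

# Predict on the test set and save the results
x_test <- model.matrix(~ . - 1, data = test)
preds  <- predict(fit, x_test)
dir.create("output", showWarnings = FALSE)
write.csv(data.frame(prediction = preds), "output/predictions.csv", row.names = FALSE)
```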

We then add this to the targets flow:
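For example (target name assumed):

```r
# In _targets.R, alongside the EDA report target
tar_render(model_report, "reports/model.Rmd")
```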

Image source: Author

The resulting markdown looks like this:

Image source: Author

There are many other functionalities that help with ordering, prioritizing and destroying targets, and many more can be found on the Targets page.


A note on renv & Docker: towards complete reproducibility and productionizing

To ensure your work is finally production ready, a few more components get involved, and it is always best to have a working knowledge of these to master reproducibility in both the development and deployment phases.

  1. renv brings project-local R dependency management, which enables your colleagues and reviewers to recreate the same environment as your development setup and reproduce your results easily. R users will appreciate this, given the many versions and updates that come through for various packages. A brilliant blog tutorial by my colleague Liu Chaoran, which provides details on how to use renv, can be accessed here.
  2. Docker is the final step and works beautifully with renv. Previously, when we wanted to containerize R code with Docker for production/deployment, we needed to create a separate R script listing all the install.packages() commands. Now we can conveniently call one line of code⁴ using renv (a minimal sketch is shown after this list). The article by my colleague above covers Docker as well.
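A minimal sketch of the typical renv calls involved:

```r
# Typical renv workflow (a sketch)
renv::init()       # create a project-local library and lockfile (renv.lock)
renv::snapshot()   # record the exact package versions used by the project
renv::restore()    # recreate that library elsewhere, e.g. inside a Docker build
```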

With a pipeline toolkit such as Targets, together with dependency and containerization toolkits such as renv and Docker, R as a language is progressing fast in the production-ready deployment space.


Some other options and references:

There are some other (though not many) resources on Targets, and more on Drake, which is a much older package. On Targets, some good videos by the package author can be accessed here, and another good watch is this video by Bruno.

Apart from Targets, there are some other packages that can be explored, such as the Remake package. Although it appears to be very similar to Targets, I have not used it yet and so cannot comment.

Another option is Ruigi, which is similar to its Python counterpart, Luigi. Python users jumping to R may prefer this package.

Finally, for a curated list of pipeline toolkits across different languages and aspects of work, refer to:

GitHub – pditommaso/awesome-pipeline: A curated list of awesome pipeline toolkits inspired by…

Concluding note

While the above example is rather simplistic and may not reflect the complexities, iterations and range of outputs that data science work can have, it illustrates the utility of Targets and goes to show how useful it can be as work gets more complex and code gets messier.

References

  1. https://books.ropensci.org/targets/
  2. https://towardsdatascience.com/this-is-what-the-ultimate-r-data-analysis-workflow-looks-like-8e7139ee708d
  3. https://rstudio.github.io/renv/articles/renv.html
  4. https://6chaoran.github.io/data-story/data-engineering/introduction-of-renv/
  5. https://swcarpentry.github.io/r-novice-inflammation/02-func-R/
