Put your Data Analysis in an R Package — Even if You Don’t Publish it

How to leverage R’s package development environment to organize, document and test your work

Denis Gontcharov
Towards Data Science


A data analysis project consists of many different files: raw data, R scripts, R Markdown reports and Shiny apps. We need a sensible project folder structure to stay organized. Why not take advantage of R’s established package development workflow?

Working on our analysis inside an R package offers four benefits:

  1. R packages provide a standardized folder structure to organize your files
  2. R packages provide functionality to document data and functions
  3. R packages provide a framework to test your code
  4. Putting effort into points 1–3 enables you to reuse and share your code

In this article we will work out a data analysis example inside an R package step-by-step. We will see that these benefits require nearly no overhead.

For the data analysis example we will:

  1. download and save data files
  2. write a script and an R-function to process data
  3. create a minimal exploratory data analysis report
  4. develop a minimal interactive R flexdashboard with Shiny

The final result can be found on GitHub.

Requirements

I assume you use RStudio. Experience with R package development is recommended but not required. We’ll need the following R packages:

  • R package development: devtools, usethis, roxygen2
  • R code testing: assertive, testthat
  • Data analysis: ggplot2, dplyr
  • R Markdown and Shiny: knitr, rmarkdown, flexdashboard, shiny

1. Creating the R package

In RStudio, navigate to the folder on your computer where you want to create the package. The following command will open a new RStudio session and create a bare-bones package named “WineReviews”.

devtools::create("WineReviews")

Notice how an additional tab Build becomes available.

Additional tools for R package development

If you have version control configured in RStudio, navigate to Tools, Version Control, Project Setup and select your version control software from the drop-down menu. We select Git, which initializes a local Git repository with a .gitignore file.

Enabling Git for this project

We will use the R package roxygen2 to automatically generate documentation for our functions and data. Navigate to Tools, Project Options, Build Tools and tick the checkbox before “Generate documentation with Roxygen”. We will use the following configuration:

Configuring Roxygen

In the DESCRIPTION file we write basic information about the project:

Package: WineReviews
Title: Your data analysis project belongs in an R package
Version: 0.1.0
Authors@R: person("Denis", "Gontcharov", role = c("aut", "cre"),
    email = "gontcharovd@gmail.com")
Description: Demonstrates how to leverage the R package workflow to organize a data science project.
License: CC0
Encoding: UTF-8
LazyData: true

Our initial package structure looks like this:

.
├── .gitignore
├── .Rbuildignore
├── .Rhistory
├── DESCRIPTION
├── NAMESPACE
├── R
└── WineReviews.Rproj

Let’s run a check on our package build to see if everything is okay. We will do this regularly to ensure the project integrity throughout our work.

devtools::check()
Our initial package structure passes the check

2. Working with data

Saving and accessing raw data

In this example we use two csv files from the Wine Reviews dataset on Kaggle. We create a folder inst/ with a subfolder extdata/ in which we save the two csv files. (The contents of inst/ are copied to the top level of the package when it is installed.)

.
├── .gitignore
├── .Rbuildignore
├── .Rhistory
├── DESCRIPTION
├── inst
│   └── extdata
│       ├── winemag-data-130k-v2.csv
│       └── winemag-data_first150k.csv
├── NAMESPACE
├── R
└── WineReviews.Rproj

To make these files available we have to clean and rebuild the package. This will install the R package on your computer.

Clean and Rebuild to access files in the inst/ folder

To retrieve the path to a file in inst/ we use system.file() as follows:

system.file(
  "extdata",
  "winemag-data-130k-v2.csv",
  package = "WineReviews"
)
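
One thing to keep in mind: system.file() does not raise an error when a file is missing, it silently returns an empty string. The sketch below illustrates this with the stats package, which ships with R and only stands in for WineReviews here; the file name no-such-file.csv is made up for the illustration.

```r
# "stats" ships with every R installation; it stands in for WineReviews here
found <- system.file("DESCRIPTION", package = "stats")
absent <- system.file("extdata", "no-such-file.csv", package = "stats")

nzchar(found)    # TRUE: a full path was returned
nzchar(absent)   # FALSE: a missing file yields "" rather than an error

# mustWork = TRUE turns the silent "" into an error we can catch
result <- tryCatch(
  system.file("extdata", "no-such-file.csv", package = "stats", mustWork = TRUE),
  error = function(e) "not found"
)
result
```

Passing mustWork = TRUE in a data-raw script is one way to fail early if the raw files have not been downloaded yet.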

Manipulating data

We will now process this raw data into a useful form. The wine reviews are split over the two csv files. Let’s assume that we want to merge these two files into one data frame with four variables: “country”, “points”, “price” and “winery”. Let’s call this data frame “wine_data”.

The code below creates the data-raw/ folder with a script wine_data.R:

usethis::use_data_raw(name = "wine_data")

The code below creates our desired data frame. Let’s add it to the wine_data.R script and run it.

## code to prepare `wine_data` dataset goes here
# retrieve paths to datafiles
first.file <- system.file(
  "extdata",
  "winemag-data-130k-v2.csv",
  package = "WineReviews"
)
second.file <- system.file(
  "extdata",
  "winemag-data_first150k.csv",
  package = "WineReviews"
)
# read the two .csv files
data.part.one <- read.csv(
  first.file,
  stringsAsFactors = FALSE,
  encoding = "UTF-8"
)
data.part.two <- read.csv(
  second.file,
  stringsAsFactors = FALSE,
  encoding = "UTF-8"
)
# select 4 variables and merge the two files
wine.variables <- c("country", "points", "price", "winery")
data.part.one <- data.part.one[, wine.variables]
data.part.two <- data.part.two[, wine.variables]
wine_data <- rbind(data.part.one, data.part.two)
# save the wine_data dataframe as an .rda file in WineReviews/data/
usethis::use_data(wine_data, overwrite = TRUE)
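
The select-then-stack step can be tried on a small scale first. The following is a minimal sketch with two made-up data frames (the values and the extra columns taster and id are invented for the illustration); it checks that the shared variables exist before calling rbind().

```r
# two toy data frames with different extra columns (made-up values)
part.one <- data.frame(
  country = c("France", "Italy"),
  points = c(90, 87),
  price = c(25, NA),
  winery = c("A", "B"),
  taster = c("x", "y"),
  stringsAsFactors = FALSE
)
part.two <- data.frame(
  id = 1:2,
  country = c("Spain", "France"),
  points = c(85, 92),
  price = c(12, 40),
  winery = c("C", "D"),
  stringsAsFactors = FALSE
)

# keep only the shared variables, in the same order, then stack the rows
keep <- c("country", "points", "price", "winery")
stopifnot(all(keep %in% names(part.one)), all(keep %in% names(part.two)))
merged <- rbind(part.one[, keep], part.two[, keep])

dim(merged)
```

Selecting the same columns in the same order matters because rbind() on data frames expects matching column names.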

The final line creates the data/ folder with the data frame stored as wine_data.rda. After we Clean and Rebuild again this data can be loaded into the global environment like any other R data:

data("wine_data", package = "WineReviews")
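
If you would rather not place the data in the global environment, data() also accepts an envir argument. A small sketch, using the mtcars dataset from the datasets package (which ships with R) as a stand-in for wine_data:

```r
# "datasets" ships with R; mtcars stands in for wine_data here
e <- new.env()
data("mtcars", package = "datasets", envir = e)

# the data frame now lives in `e`, not in the global environment
exists("mtcars", envir = e, inherits = FALSE)
is.data.frame(e$mtcars)
```

The same call with "wine_data" and package = "WineReviews" should behave the same way once the package is installed.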

Documenting data

It’s good practice to document your created data. All documentation for data should be stored in a single R script saved in the R/ folder. Let’s create this script with the following content and name it data.R:

#' Wine reviews for 51 countries.
#'
#' A dataset containing wine reviews.
#'
#' @format A data frame with 280901 rows and 4 variables:
#' \describe{
#' \item{country}{The country that the wine is from.}
#' \item{points}{The number of points rated by WineEnthusiast
#' on a scale of 1–100.}
#' \item{price}{The cost for a bottle of the wine.}
#' \item{winery}{The winery that made the wine.}
#' }
#' @source \url{https://www.kaggle.com/zynicide/wine-reviews}
"wine_data"

To create documentation based on this script we run:

devtools::document()

Roxygen transforms the code above into a wine_data.Rd file and adds it to the man/ folder. We can view this documentation in the help pane by typing ?wine_data in the R console.

Our package structure now looks like this:

.
├── .gitignore
├── .Rbuildignore
├── .Rhistory
├── data
│   └── wine_data.rda
├── data-raw
│   └── wine_data.R
├── DESCRIPTION
├── inst
│   └── extdata
│       ├── winemag-data-130k-v2.csv
│       └── winemag-data_first150k.csv
├── man
│   └── wine_data.Rd
├── NAMESPACE
├── R
│   └── data.R
└── WineReviews.Rproj

3. Intermediate checking

It’s good practice to regularly check the integrity of our package by running:

devtools::check()

We get one warning and one note.

Regularly check your package to spot problems early

The warning about R version dependence is introduced after calling usethis::use_data(wine_data, overwrite = TRUE) and is resolved by adding Depends: R (>= 2.10) to the DESCRIPTION file.

Add the Depends line to your DESCRIPTION file

The note warns us about the package size because the data in inst/extdata exceeds 1 MB. We don’t want to publish this R package on CRAN so we can ignore this. However, we will resolve this note by adding the inst/extdata/ folder to .Rbuildignore.

^WineReviews\.Rproj$
^\.Rproj\.user$
^data-raw$
^inst/extdata$
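
Each line of .Rbuildignore is a Perl-like regular expression that is matched case-insensitively against paths relative to the package root. The sketch below mimics that matching in plain R to check which paths the four patterns would exclude; the helper is_ignored() and the list of example paths are made up for the illustration and are not part of R's build tooling.

```r
# the four patterns from .Rbuildignore (backslashes doubled for R strings)
patterns <- c(
  "^WineReviews\\.Rproj$",
  "^\\.Rproj\\.user$",
  "^data-raw$",
  "^inst/extdata$"
)

# example paths relative to the package root
paths <- c("WineReviews.Rproj", "data-raw", "inst/extdata", "inst/rmd", "R", "DESCRIPTION")

# hypothetical helper: TRUE when at least one pattern matches the path
is_ignored <- function(path) {
  any(vapply(
    patterns,
    function(pattern) grepl(pattern, path, perl = TRUE, ignore.case = TRUE),
    logical(1)
  ))
}

ignored <- vapply(paths, is_ignored, logical(1))
ignored
```

The anchors ^ and $ are what keep inst/rmd out of the ignore list even though it shares the inst/ prefix.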

Now devtools::check() shows everything is fine:

The second check passes flawlessly

4. Data analysis in a vignette

Now we can explore the processed data. I like to write reports in R package vignettes because they have a clean layout that R users are familiar with. If you prefer not to use vignettes, you can follow the same steps in a standard R Markdown document instead, as shown in section 7.

The function below creates a vignettes/ folder with a wine_eda.Rmd vignette that we will use for our exploratory data analysis.

usethis::use_vignette(name = "wine_eda", title = "Wine Reviews EDA")

We will use the popular dplyr R package to manipulate data. It’s important to declare each R package we use in the DESCRIPTION file. Don’t worry: if you forget this, devtools::check() will throw a note. Declare the package with:

usethis::use_package("dplyr")

Let’s add some code to the vignette to load the data and show a summary:

---
title: "Wine Reviews EDA"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Wine Reviews EDA}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup}
library(WineReviews)
```
```{r}
# load the previously created wine_data data frame
data(wine_data)
```
```{r}
summary(wine_data)
```
The wine_data summary shows 22691 missing values for price
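
Besides reading them off the summary, missing values can be counted per column directly. A minimal sketch on a made-up stand-in for wine_data (the values are invented):

```r
# toy stand-in for wine_data (made-up values)
toy <- data.frame(
  country = c("France", "Italy", "Spain", NA),
  points = c(90, 87, 85, 88),
  price = c(25, NA, 12, NA),
  stringsAsFactors = FALSE
)

# number of missing values per column
na.counts <- colSums(is.na(toy))
na.counts
```

Running colSums(is.na(wine_data)) in the vignette would give the same kind of overview for the real data.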

There are 22691 missing values for the wine price. Let’s transform our data to fill these missing values. R is a functional programming language: we should transform our data by applying functions.

“To understand computations in R, two slogans are helpful: Everything that exists is an object. Everything that happens is a function call.”

— John Chambers

This is the part where our decision to put our analysis in an R package will pay its dividends: package development makes the process of writing, documenting and testing functions incredibly smooth.

5. Working with functions

Let’s replace a missing price with the mean price of that country’s wines. This is not necessarily a good idea statistically speaking, but it will serve to drive the point home.

Writing and documenting functions

The replacement will be done by a function called “fill_na_mean”. We create a new R script variable_calculation.R and save it in the R/ folder. Notice how we can conveniently write the function’s documentation above its code. (Keep the @export tag; otherwise the function is not exported and the @examples cannot find it when the package is checked.)

#' Replace missing values with the mean
#'
#' @param numeric.vector A vector with at least one numeric value.
#'
#' @return The input whose NAs are replaced by the input's mean.
#'
#' @export
#'
#' @examples fill_na_mean(c(1, 2, 3, NA, 5, 3, NA, 6))
fill_na_mean <- function(numeric.vector) {
  ifelse(
    is.na(numeric.vector),
    mean(numeric.vector, na.rm = TRUE),
    numeric.vector
  )
}
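
To see how the function behaves when applied per country, here is a small sketch on made-up data. It uses base R's ave() for the grouping, which is a swapped-in stand-in for illustration only; the country codes and prices are invented.

```r
# the function from above, repeated so the sketch runs on its own
fill_na_mean <- function(numeric.vector) {
  ifelse(
    is.na(numeric.vector),
    mean(numeric.vector, na.rm = TRUE),
    numeric.vector
  )
}

# toy data: "XX" has no known price at all (made-up values)
toy <- data.frame(
  country = c("FR", "FR", "FR", "IT", "IT", "XX"),
  price = c(10, NA, 20, 8, NA, NA),
  stringsAsFactors = FALSE
)

# apply fill_na_mean within each country
toy$price.filled <- ave(toy$price, toy$country, FUN = fill_na_mean)
toy
```

Note the last row: a group without a single known price has no mean to fall back on, so its value stays missing (as NaN).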

To create the documentation we again run:

devtools::document()

As was the case with documenting data, Roxygen will transform this documentation into a file fill_na_mean.Rd and save it in the man/ folder. We can view this documentation in the help pane by typing ?fill_na_mean in the R console.

If you wish to resume work after closing or restarting your R session, run the following command to load your functions as well as the packages declared in the DESCRIPTION file:

devtools::load_all() 

Next we will write two tests for our newly created function:

  1. A run-time test that runs every time the function is called to warn the user about faulty input.
  2. A development-time test that runs on command to warn the developer about bugs while writing or modifying the function.

Run-time testing

The R package assertive offers functions for run-time testing.

usethis::use_package("assertive")

Our simple one-line test checks whether the input is a numeric vector and throws an error when it’s not. Note that run-time tests don’t rely on our package development environment because they are a part of the function.

fill_na_mean <- function(numeric.vector) {
  assertive::assert_is_numeric(numeric.vector) # -> the run-time test
  ifelse(
    is.na(numeric.vector),
    mean(numeric.vector, na.rm = TRUE),
    numeric.vector
  )
}
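
If you would rather not add a dependency for a single check, a comparable run-time test can be written in base R. The following is a sketch of that alternative, not the approach used in the rest of this article; the wording of the error message is made up.

```r
fill_na_mean <- function(numeric.vector) {
  # base-R run-time check; stands in for assertive::assert_is_numeric()
  if (!is.numeric(numeric.vector)) {
    stop(
      "`numeric.vector` must be numeric, not ", class(numeric.vector)[1],
      call. = FALSE
    )
  }
  ifelse(
    is.na(numeric.vector),
    mean(numeric.vector, na.rm = TRUE),
    numeric.vector
  )
}

# faulty input: capture the error message instead of halting the script
outcome <- tryCatch(
  fill_na_mean(c("a", NA)),
  error = function(e) conditionMessage(e)
)
outcome
```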

Development-time testing

Contrary to run-time testing, development-time testing with the testthat R package requires an active package development environment. Let’s write a unit test for our function that tests if fill_na_mean(c(2, 2, NA, 5, 3, NA)) returns the vector c(2, 2, 3, 5, 3, 3).

First we set up the testthat framework:

usethis::use_testthat() 

The following command creates the script test-variable_calculations.R that contains the unit tests for the functions defined in variable_calculation.R:

usethis::use_test("variable_calculations")

We modify the code in test-variable_calculations.R to represent our test:

context("Unit tests for fill_na_mean")
test_that("fill_na_mean fills the NA with 3", {
  actual <- fill_na_mean(c(2, 2, NA, 5, 3, NA))
  expected <- c(2, 2, 3, 5, 3, 3)
  expect_equal(actual, expected)
})
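
A few more cases could be added to the same file, for example input without missing values, input with only missing values, and non-numeric input. The sketch below is one possible set of such tests; the function is defined inline (with a base-R check standing in for the assertive call) only so that the snippet runs on its own. Inside the package, devtools::test() loads the real function for you.

```r
library(testthat)

# defined inline only to keep the sketch self-contained
fill_na_mean <- function(numeric.vector) {
  stopifnot(is.numeric(numeric.vector))
  ifelse(
    is.na(numeric.vector),
    mean(numeric.vector, na.rm = TRUE),
    numeric.vector
  )
}

test_that("input without NA is returned unchanged", {
  expect_equal(fill_na_mean(c(1, 2, 3)), c(1, 2, 3))
})

test_that("all-NA numeric input yields NaN because there is no mean", {
  expect_true(all(is.nan(fill_na_mean(c(NA_real_, NA_real_)))))
})

test_that("non-numeric input raises an error", {
  expect_error(fill_na_mean(c("a", "b")))
})
```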

The following command runs all your tests and returns a pretty report:

devtools::test() 
Our unit test passes successfully

6. (Re)organize your code

Now we can fill the missing values in our data analysis wine_eda.Rmd. But what if we wanted to do the same thing in other reports or applications? Should we repeat this procedure in each file?

In fact, this function deals with data processing rather than exploratory analysis. It therefore belongs in the wine_data.R script that generates wine_data.rda. Putting this code in wine_data.R instead of wine_eda.Rmd has two advantages:

  1. Our code becomes modular: code for data processing is clearly separated from our data analysis report.
  2. There is only one script that generates the processed data that is used by all other files.

After filling the missing values, six observations still have no wine price, so we choose to remove them. This is what the final script looks like:

## code to prepare `wine_data` dataset goes here
# run devtools::load_all() first so that fill_na_mean() is available
library(dplyr) # attaches the %>% pipe used below
# retrieve paths to datafiles
first.file <- system.file(
  "extdata",
  "winemag-data-130k-v2.csv",
  package = "WineReviews"
)
second.file <- system.file(
  "extdata",
  "winemag-data_first150k.csv",
  package = "WineReviews"
)
# read the two .csv files
data.part.one <- read.csv(
  first.file,
  stringsAsFactors = FALSE,
  encoding = "UTF-8"
)
data.part.two <- read.csv(
  second.file,
  stringsAsFactors = FALSE,
  encoding = "UTF-8"
)
# select 4 variables and merge the two files
wine.variables <- c("country", "points", "price", "winery")
data.part.one <- data.part.one[, wine.variables]
data.part.two <- data.part.two[, wine.variables]
wine_data <- rbind(data.part.one, data.part.two)
# fill missing prices with the mean price per country
# ungroup afterwards so the saved data carries no leftover grouping
wine_data <- wine_data %>%
  dplyr::group_by(country) %>%
  dplyr::mutate(price = fill_na_mean(price)) %>%
  dplyr::ungroup()
# some countries don't have any non-missing price
# we omit these observations from the data
wine_data <- wine_data %>%
  dplyr::filter(!is.na(price))
# save the wine_data dataframe as an .rda file in WineReviews/data/
usethis::use_data(wine_data, overwrite = TRUE)

Because we changed how the raw data is processed, we have to document these changes in the data.R file in the R/ folder:

#' Wine reviews for 49 countries.
#'
#' A dataset containing processed data of wine reviews.
#' Missing price values have been filled with the mean price for
#' that country.
#' Six observations coming from countries with no wine price were
#' deleted.
#'
#' @format A data frame with 280895 rows and 4 variables:
#' \describe{
#' \item{country}{The country that the wine is from.}
#' \item{points}{The number of points WineEnthusiast
#' rated the wine on a scale of 1–100.}
#' \item{price}{The cost for a bottle of the wine.}
#' \item{winery}{The winery that made the wine.}
#' }
#' @source \url{https://www.kaggle.com/zynicide/wine-reviews}
"wine_data"

And update the documentation:

devtools::document()

Let’s knit our vignette. Notice that there are no more missing values even though we didn’t change the code in the vignette. The missing values are handled in the wine_data.R script. In the vignette we just load the processed wine_data.rda.

The wine_data summary contains no missing values

7. Working with R Markdown

We can add any R Markdown file to our package. Analogously to raw data, we create a sub folder rmd/ in the inst/ folder. Let’s create a simple dashboard to view the wine price distribution per country. We create an R Markdown file wine_dashboard.Rmd in rmd/ with the following content:

---
title: "Wine dashboard"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
runtime: shiny
---
```{r setup, include=FALSE}
library(flexdashboard)
library(ggplot2)
```
```{r}
data(wine_data, package = "WineReviews")
```
Inputs {.sidebar data-width=150}
-------------------------------------
```{r}
shiny::selectInput(
  "country",
  label = "select country",
  choices = sort(unique(wine_data$country))
)
```
Column
-------------------------------------
### Wine points versus price

```{r}
shiny::renderPlot(
  ggplot(wine_data[wine_data$country == input$country, ],
         aes(points, log10(price), group = points)) +
    geom_boxplot() +
    labs(x = "points scored by the wine on a 100 scale",
         y = "Base-10 logarithm of the wine price") +
    theme_bw()
)
```
Column
-------------------------------------
### Wine price distribution

```{r}
shiny::renderPlot(
  ggplot(wine_data[wine_data$country == input$country, ], aes(price)) +
    geom_density() +
    labs(y = "Price density function") +
    theme_bw()
)
```

Let’s view the dashboard with Run Document:

Press Run Document to view the dashboard
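
The dashboard can also be started from the console once the package is installed. Below is a minimal sketch of a small launcher function; the name run_wine_dashboard is hypothetical and not part of the original package. It assumes the dashboard was saved as inst/rmd/wine_dashboard.Rmd and that rmarkdown is installed.

```r
# hypothetical helper, not part of the original package
run_wine_dashboard <- function() {
  # inst/rmd/ ends up as rmd/ in the installed package
  dashboard <- system.file("rmd", "wine_dashboard.Rmd", package = "WineReviews")
  if (!nzchar(dashboard)) {
    stop("wine_dashboard.Rmd not found; is WineReviews installed?", call. = FALSE)
  }
  # rmarkdown::run() serves documents that use runtime: shiny
  rmarkdown::run(dashboard)
}

# usage (starts a local Shiny server):
# run_wine_dashboard()
```

Saved in the R/ folder with an @export tag, such a function would let others open the dashboard without knowing where the file lives.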

8. Using version control

Our data analysis example is now finished. This is what our final project structure looks like:

.
├── .gitignore
├── .Rbuildignore
├── .Rhistory
├── DESCRIPTION
├── data
│   └── wine_data.rda
├── data-raw
│   └── wine_data.R
├── inst
│   ├── extdata
│   │   ├── winemag-data-130k-v2.csv
│   │   └── winemag-data_first150k.csv
│   └── rmd
│       └── wine_dashboard.Rmd
├── man
│   ├── fill_na_mean.Rd
│   └── wine_data.Rd
├── NAMESPACE
├── R
│   ├── data.R
│   └── variable_calculation.R
├── tests
│   ├── testthat
│   │   └── test-variable_calculations.R
│   └── testthat.R
├── vignettes
│   └── wine_eda.Rmd
└── WineReviews.Rproj

This structure is suitable for version control. Files with code are tracked, whereas raw data in inst/extdata/ and processed data in data/ are not tracked. We can generate the processed data from the (re)downloaded raw data using the R scripts in data-raw/. I recommend adding the following entries to .gitignore:

.Rproj.user
.Rhistory
data/
inst/extdata/
inst/doc

Recap

Here is how we made use of the R package components:

  • DESCRIPTION: gives an overview of the project and its dependencies
  • R/: contains R scripts with functions used throughout the package
  • tests/: contains development-time tests for our functions
  • inst/extdata/: contains our raw data files
  • data-raw/: contains R scripts that process the raw data into tidy data
  • data/: contains tidy data stored as .rda files
  • man/: contains documentation for our objects and functions
  • vignettes/: contains data analysis reports as package vignettes
  • inst/rmd/: contains R Markdown files for reports or applications

I hope you have enjoyed the article. Would you consider using R packages to store your work? How do you like to organize your projects?
Let me know in the comments!

Photo by chuttersnap on Unsplash
