Harnessing programming techniques to improve R scripts:

Automating repetitive tasks with loops and functions

Alan Davies
Towards Data Science


Photo by Quang Nguyen Vinh from Pexels

Many R users come to R from a statistics background rather than a programming/software engineering one, having previously used software such as SPSS or Excel. As such, they may not be familiar with some of the programming techniques that can be leveraged to improve code. These include making code more modular, which in turn makes it easier to find and resolve bugs, and automating repetitive tasks such as producing tables and plots.

This short post includes some of the basic programming techniques that can be used to improve the quality and maintainability of R scripts. This will also save you a whole lot of time if you are carrying out repetitive tasks that are only marginally different. We assume that you have a basic understanding of writing simple scripts in R.

Let’s start with a simple example. Say we have some data from several different groups: in this case three animals (tigers, swans and badgers), for which we have collected some data (a score and a value of some kind).

Image by author

We could read this into R as a CSV file or recreate it as a data frame like so:

df <- data.frame(group = rep(c("tiger","swan","badger"), 6), 
score = c(12,32,43,53,26,56,56,32,23,53,24,65,23,78,23,56,25,75),
val = c(24,67,32,21,21,56,54,21,35,67,34,23,32,36,74,24,24,74))

We can represent lists/columns of values with a vector in R. This is done by placing the values in a comma-separated list inside the c (combine) function, which combines them into a vector. For example, the numbers 1 to 4:

my_list <- c(1,2,3,4)

We can output the vector by typing its name:

my_list

Which outputs:

[1] 1 2 3 4
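
As an aside, individual elements of a vector can be picked out by position using square brackets; we will rely on this later when indexing a vector inside a loop. For example, the second element:

my_list[2]

Which outputs:

[1] 2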

We use this to provide the values for the score and val columns of the data frame. We could have done the same for the group column and typed out six lots of “tiger”, “swan” and “badger”. Instead we use the rep (replicate) function to repeat these values 6 times.

rep(c("tiger","swan","badger"), 6)

The number of values in each column of the data frame needs to be the same or R will generate an error.
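
For example, a minimal sketch of what happens if the score column is given only 4 values instead of 18 (the exact error wording may vary between R versions):

data.frame(group = rep(c("tiger","swan","badger"), 6),
           score = c(12, 32, 43, 53))

Which produces an error along the lines of:

Error in data.frame(...) : arguments imply differing number of rows: 18, 4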

We can view the dataframe in R in several ways. We could enter the name of the dataframe:

df

Which outputs:

  group score val
1 tiger 12 24
2 swan 32 67
3 badger 43 32
4 tiger 53 21
5 swan 26 21
6 badger 56 56
7 tiger 56 54
8 swan 32 21
9 badger 23 35
10 tiger 53 67
11 swan 24 34
12 badger 65 23
13 tiger 23 32
14 swan 78 36
15 badger 23 74
16 tiger 56 24
17 swan 25 24
18 badger 75 74

If you are using RStudio, the View function (with a capital V) will open the dataframe in a new tab and display it in a tabular format:

View(df)
The output of the ‘View’ function. Image by author

You can sort the data by a column by clicking on the up/down arrows. There is also another View function in the utils package (which contains utility functions) that can be accessed like so:

utils::View(df)

This will open the dataframe in a pop-up window, as seen in the image below. This is useful when you want to compare your code and data at the same time without switching tabs, and it will also display larger amounts of data than the other View function.

Output of ‘View’ function from ‘utils’. Image by author

Now let’s say we want to plot the scores and values for each of the three groups.

First we can import some useful libraries.

library(ggplot2)
library(tidyverse)

The ggplot2 library is used to produce publication-quality plots. The tidyverse library provides useful functionality for filtering and piping data. For example, we can select the data from the dataframe that corresponds to a specific group, e.g. “tiger”.

selected_group <- df %>% filter(group == "tiger")

Here we pipe (%>%) the dataframe into the filter function, keeping only the rows whose group value is “tiger”, and store the result in the variable selected_group. Next we can create a simple scatter plot like so:

plt1 <- selected_group %>% ggplot(aes(x = val, y = score)) +
  geom_point(size=2) +
  labs(x = "Val", y = "Score") +
  ggtitle("Plot of score/val for tiger group")

The ggplot2 package works by adding layers to a graph with different information as depicted in the image below:

Plot layers. Image by author

Here we add an aesthetic (aes) defining the x and y axis values.

ggplot(aes(x = val, y = score))

We then add the points (dots), setting them to size 2.

geom_point(size=2)

Finally we add the axis labels:

labs(x = "Val", y = "Score")

And a plot title:

ggtitle("Plot of score/val for tiger group")

We can then output the plot:

plt1
Plot of tiger group showing score and value (Image by author)

We can then simply cut and paste this and alter it for the different groups:

selected_group <- df %>% filter(group == "tiger")
plt1 <- selected_group %>% ggplot(aes(x = val, y = score)) +
  geom_point(size=2) +
  labs(x = "Val", y = "Score") +
  ggtitle("Plot of score/val for tiger group")
plt1

selected_group <- df %>% filter(group == "swan")
plt2 <- selected_group %>% ggplot(aes(x = val, y = score)) +
  geom_point(size=2) +
  labs(x = "Val", y = "Score") +
  ggtitle("Plot of score/val for swan group")
plt2

selected_group <- df %>% filter(group == "badger")
plt3 <- selected_group %>% ggplot(aes(x = val, y = score)) +
  geom_point(size=2) +
  labs(x = "Val", y = "Score") +
  ggtitle("Plot of score/val for badger group")
plt3

Although this works, there is unnecessary repetition of code. Also imagine that you have many more groups: 10, 100, or more. This approach is not scalable, and any changes would need to be applied to each plot (for example, changing the size of the points would need to be done for every plot). Ideally you want to reuse as much code as possible. This cuts down on maintenance and scales your code to cope with any number of potential groups.

Using loops

One way to improve this is to use a ‘loop’. Loops are programming structures that allow us to repeat blocks of code a certain number of times or until a specific condition is met. To repeat code a set number of times, a for loop is typically used.

First we will create a vector containing the names of the 3 groups:

groups <- c("tiger","swan","badger")

Next we can create a loop that starts at 1 and repeats 3 times (once for each group).

for(i in 1:3)
{
}

Every line of code between the braces ({ }) is repeated 3 times. We also have a loop counter called i, which takes the next value in the sequence (here 1, 2, then 3) each time the contents of the loop are executed. It is traditional in programming to name loop counters things like i, j, k, x or y, although you can give your loop counter whatever name you like.
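
A minimal example that just prints the counter shows the values it takes:

for(i in 1:3)
{
  print(i)
}

Which outputs:

[1] 1
[1] 2
[1] 3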

Next you need to think about how you will modify the original block of code to make it work within a loop. Which bits are generic to all the groups and which bits need to change to reflect the different groups? The main changes are in the group we filter by and the graph title.

selected_group <- df %>% filter(group == groups[i])

We can change the selected group to be filtered by the group of interest in our groups vector. The loop counter i can be used to point to different group names in the vector starting at 1 (tiger). Next we can make a variable for the title and update it to display the name of the relevant group using the paste0 function to concatenate (join together) the title string of text:

group_title <- paste0("Plot of score/val for ", groups[i], " group")
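
On the first pass of the loop (i = 1), for example, this produces:

paste0("Plot of score/val for ", groups[1], " group")

Which outputs:

[1] "Plot of score/val for tiger group"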

The finished loop with plot would look like this:

groups <- c("tiger","swan","badger")for(i in 1:3)
{
selected_group <- df %>% filter(group == groups[i])
group_title <- paste0("Plot of score/val for ", groups[i], "group")
plt <- selected_group %>% ggplot(aes(x = val, y = score)) +
geom_point(size=2) +
labs(x = "Val", y = "Score") +
ggtitle(group_title)
print(plt)
}

The only other change is using the print function to display the plot. This is because ggplot objects are only drawn when they are printed; at the top level of the console R auto-prints results for you, but inside loops and functions this auto-printing does not happen, so print must be called explicitly.

The three plots output using the loop (image by author)

Another way to make this even more robust is not to ‘hard code’ the number of times the loop should run, as the list of groups may be expanded in the future or, conversely, have items removed. Instead we can use the length function to return the number of items in the groups vector. This way the loop will always run the right number of times if items are added to or removed from the vector.

for(i in 1:length(groups))
{

}
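
As an aside, a slightly safer base R idiom is seq_along, which produces the same sequence of indices but also handles the edge case of an empty vector (1:length(groups) would evaluate to 1:0 and run the loop twice, whereas seq_along(groups) would not run it at all):

for(i in seq_along(groups))
{

}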

We could also replace the explicit group names in the vector by finding all the unique group names in the group column. This doesn’t seem like a big deal with 3 groups, but again consider how much more useful this is with larger numbers of groups and/or larger datasets, to ensure no groups are missed.

This would change this:

groups <- c("tiger","swan","badger")

Into this:

groups <- unique(df$group)

This essentially returns a vector of all the unique items in the group column of the data frame, so you end up with a more robust version of the original that automatically picks up groups that are added to or removed from the dataframe.
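
With our example dataframe (where group is a character column, the default in R 4.0 and later) this gives:

unique(df$group)

Which outputs:

[1] "tiger" "swan" "badger"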

Using functions

The other way this can be improved further is to put the code into a function. Functions allow you to modularize your code, making it easier to read and to track down problems, as well as grouping related functionality together in a logical way. Blocks of code placed in a function are only run (executed) when the function is called. Data can be passed into a function for processing, and a function can return data, although not every function needs inputs and not every function returns a result. An example would be passing a vector of numbers to a function that computes their average and returns the result, as illustrated below.
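
A minimal sketch of that averaging example (the function name average_of is purely illustrative and not from the original article):

average_of <- function(numbers)
{
  return(mean(numbers))
}

average_of(c(1, 2, 3, 4))

Which outputs:

[1] 2.5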

Here we make a function called generateGroupScatterPlots and pass in the dataframe and the group of interest via the parameter current_group, updating the code to use the group that we pass in.

generateGroupScatterPlots <- function(df, current_group)
{
  selected_group <- df %>% filter(group == current_group)
  group_title <- paste0("Plot of score/val for ", current_group, " group")
  plt <- selected_group %>% ggplot(aes(x = val, y = score)) +
    geom_point(size=2) +
    labs(x = "Val", y = "Score") +
    ggtitle(group_title)
  print(plt)
}

The function won’t run until it’s called. To execute the function we need to invoke its name and pass in any arguments that may be required. We call the function in the loop passing in the dataframe and the group of interest groups[i].

groups <- unique(df$group)

for(i in 1:length(groups))
{
  generateGroupScatterPlots(df, groups[i])
}

Another issue people often run into when trying to leverage loops and functions is accessing columns in a dataframe dynamically. For example, let’s say we want to sum the columns containing numerical data. You would typically access a column in R by specifying the dataframe and then the column name, separated by the dollar symbol $:

df$score

We could then view this column with the print function or one of the view functions:

utils::View(df$score)

But how do we modify this to use a variable instead? Another way of doing the same thing in R is to use the list notation, which consists of a set of double square brackets:

df[["score"]]

This gives the same result as using the dollar notation. We can then make a variable to hold a column name and use it inside the double brackets (without the double quotes). For example:

score_col <- "score"
utils::View(df[[score_col]])
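
A quick way to convince yourself that the two notations refer to the same data:

identical(df$score, df[[score_col]])

Which outputs:

[1] TRUE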

We could then apply this in a loop to sum all the columns of interest. Again, in this example there are only two columns of interest, so writing the sums out by hand would be fine:

print(sum(df$score))
print(sum(df$val))

Imagine however that you had many more columns and that maybe you also wanted to compute other metrics, such as standard deviation, mean, sum etc. This is where the power of loops comes in. First we need to store the names of the columns of interest. We could do:

col_names <- c("score", "val")

Another way is using the colnames function and specifying which column names we want to keep by number (i.e. numbers 2 and 3, ignoring the first column of group names).

col_names <- colnames(df)[2:3]
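
With our example dataframe, colnames(df) returns all three column names and the [2:3] index keeps just the numeric ones:

colnames(df)

Which outputs:

[1] "group" "score" "val"

colnames(df)[2:3]

Which outputs:

[1] "score" "val"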

We can then loop over the columns outputting the sum, mean and standard deviations:

for(i in 1:length(col_names))
{
  col <- col_names[i]
  cat("\n\nColumn: ", col, "\n")
  print(sum(df[[col]]))
  print(mean(df[[col]]))
  print(sd(df[[col]]))
}

Which produces the following output:

Column: score
[1] 755
[1] 41.94444
[1] 19.99551
Column: val
[1] 719
[1] 39.94444
[1] 19.65428

Note that we take the name of the column and store it in a variable called col (for column). We then use the list notation [[ ]] with the variable col, which is updated on each pass of the loop to point at the next column name in the vector. The cat function (concatenate and print) is used to show the name of each column in the output, and \n indicates that we want a new line (return) so the text doesn’t all end up on the same line.

Let’s finish with a final example putting a few of these ideas together. Let’s say we have a different dataset that looks like this.

Image by author

Here we have 5 participants with different classifications of disease, such as cardiac (heart), respiratory and metabolic issues. A 1 indicates they have a condition in that category and a 0 means they do not. Similarly, a 1 in the death column indicates they died and a 0 that they are still alive (or were when the data was collected). Again, you can imagine having many more participants and many more classifications of disease. Let’s say we want to capture the number of people who died with each classification of disease. In this case it would be relatively easy to do this just by looking at the table, but we want to write code that scales to larger datasets and problems with many rows and columns.
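
The table itself only appears as an image in the original article, so as an illustration here is a small made-up dataframe with the same structure (columns id, cardiac, respiratory, metabolic and death). The values are purely hypothetical, chosen only so that the death counts match the output shown later, and are not the author’s original data:

df <- data.frame(id = 1:5,
                 cardiac     = c(1, 0, 0, 1, 0),
                 respiratory = c(0, 1, 0, 0, 1),
                 metabolic   = c(0, 1, 1, 0, 0),
                 death       = c(1, 1, 1, 0, 0))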

Let’s write some code to output the number of people who died in each category. We will start with a function to compute the deaths in each category:

deathsPerCategory <- function(df, col_name)
{
  cols_to_keep <- c("id", "death", col_name)
  selected_col <- df %>% select(cols_to_keep)
  filtered_col <- subset(selected_col, selected_col[[col_name]] == 1 & selected_col$death == 1)
  table <- filtered_col %>% group_by(death) %>% summarise(n = n(), .groups = "drop")

  return(table$n)
}

Similarly to the previous example, we pass in the dataframe and the name of the column of interest. Next we create a vector containing the columns we want to keep in the data frame: the id and death columns as well as the column of interest. Next we use the select function to select just these columns from the dataframe. Then we subset this further using the subset function, keeping only the rows with a ‘1’ in both the death column and the disease category column of interest. Next we generate a table by grouping by death and summarising the number of deaths. Finally we return this summary value for outputting.
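
For example, calling the function directly for a single category looks like this (the result matches the cardiac figure in the article’s output below):

deathsPerCategory(df, "cardiac")

Which outputs:

[1] 1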

We can then use this function in a loop to output summaries for the 3 disease categories. First we need to obtain the disease category names from the column names:

col_names <- colnames(df)[2:4]

We can then use this in the loop to output the column name and number of deaths for each category.

for (i in 1:length(col_names))
{
  cat("\n", col_names[i], "deaths = ", deathsPerCategory(df, col_names[i]), "\n")
}

Which produces the following output which we can confirm by looking at the table:

cardiac deaths = 1
respiratory deaths = 1
metabolic deaths = 2

The full code:

library(tidyverse)

deathsPerCategory <- function(df, col_name)
{
  cols_to_keep <- c("id", "death", col_name)
  selected_col <- df %>% select(cols_to_keep)
  filtered_col <- subset(selected_col, selected_col[[col_name]] == 1 & selected_col$death == 1)
  table <- filtered_col %>% group_by(death) %>% summarise(n = n(), .groups = "drop")
  return(table$n)
}

col_names <- colnames(df)[2:4]

for (i in 1:length(col_names))
{
  cat("\n", col_names[i], "deaths = ", deathsPerCategory(df, col_names[i]), "\n")
}

Although the examples presented here use small datasets, hopefully you can see the advantages of thinking about, and building in, the ability to scale your code up to deal with larger datasets and more complex problems. This makes the code easier to manage, maintain and modify without having to repeat code unnecessarily. One of the skills of a data scientist is being able to leverage programming techniques in their analysis to process large volumes of data in ways that are just not possible in software such as Excel.

If you are moving into R from more visual, spreadsheet-style software, one of the main advantages of tools like R and Python is the ability to scale your code in this way. I would often start, for example, by writing code to produce a single plot or computation, check that I am getting the right/expected output, and then refactor the code using functions, loops and so on to handle the remaining tasks. It might seem more complex to begin with, but with large datasets it will save you time in the long run, as a single change cascades down over all your plots, tables and computations, making changes easy and fast.
