Iterate Your R Code Efficiently!

A step-by-step guide to performing clean and efficient iterations in R.

Manasi Mahadik
Towards Data Science

--

The Inuit do not actually have a hundred words for snow; that turns out to be part myth and part misunderstanding. Stretching the analogy to programming, though, there really do seem to be about a hundred ways to iterate code! On top of that, the syntax can be confusing, and some of us run the risk of lazily copy-pasting the same code four times. That route is sub-optimal and frankly impractical. As a general rule of thumb, if we need to run a block of code more than twice, it is a good idea to iterate. Here are two reasons why this will make your code better:

  1. It draws attention to the part of the code that is different, thus making it easier to spot the intent of the operation.
  2. Due to its concise nature, you’re likely to encounter fewer bugs.

It might take some time to wrap your head around the idea of iterating, but trust me, it’s worth the investment.

Now that you’re convinced to iterate — let’s jump right in!

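Let’s pick the built-in R dataset airquality. The first few complete rows are shown below. (The raw dataset contains missing values, so the examples that follow assume incomplete rows have been dropped, for example with airquality = na.omit(airquality); otherwise sd() returns NA for Ozone and Solar.R.)

Ozone  Solar.R  Wind  Temp  Month  Day
   41      190   7.4    67      5    1
   36      118   8.0    72      5    2
   12      149  12.6    74      5    3
   18      313  11.5    62      5    4
   23      299   8.6    65      5    7
   19       99  13.8    59      5    8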

Problem 1: You want to find the standard deviation for each variable in the dataset-

You could copy-paste the same code for each column-

sd(airquality$Ozone)
33.27597
sd(airquality$Solar.R)
91.1523
sd(airquality$Wind)
3.557713
sd(airquality$Temp)
9.529969
sd(airquality$Month)
1.473434
sd(airquality$Day)
8.707194

This is however impractical for large data sets and breaks the rule of thumb of not running the same operation more than twice. We have a solid case to iterate!

We can write a for() loop-

# Preallocate the output vector, one slot per column
stddev = vector("double", ncol(airquality))
for (i in seq_along(airquality)) {
  stddev[[i]] = sd(airquality[[i]])
}
stddev
33.275969 91.152302 3.557713 9.529969 1.473434 8.707194

The loop does away with any repetition and is indeed more efficient than the first approach.

It is also worth pausing to note that while seq_along(x) and 1:length(x) are often used interchangeably to build the sequence for a for loop, there is one key difference. For a zero-length vector, seq_along() does the right thing and returns an empty sequence, whereas 1:length(x) evaluates to 1:0, which yields the values 1 and 0. Although you probably won’t create a zero-length vector deliberately, it’s easy to create one accidentally. If you use 1:length(x) instead of seq_along(x), you’re likely to get a confusing error message.
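To see the difference concretely, here is a minimal sketch. The empty vector x and the counters n_safe and n_unsafe are hypothetical names used only for illustration.

```r
# Hypothetical zero-length vector, used only for illustration
x <- numeric(0)

seq_along(x)   # empty integer sequence: a loop over it never runs
1:length(x)    # same as 1:0, i.e. the two values 1 and 0

# Count how many times each loop body runs
n_safe <- 0
for (i in seq_along(x)) n_safe <- n_safe + 1

n_unsafe <- 0
for (i in 1:length(x)) n_unsafe <- n_unsafe + 1
```

With seq_along() the loop body is skipped entirely, while the 1:length(x) version runs twice, with i = 1 and then i = 0. Indexing an empty vector at those positions is where the confusing errors tend to come from.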

Or you could skip the loop altogether and do the trick with a single line of code using sapply() from base R’s apply family-

sapply(airquality, sd)
    Ozone   Solar.R      Wind      Temp     Month       Day
33.275969 91.152302  3.557713  9.529969  1.473434  8.707194

This is a great application of R’s functional programming capabilities and indeed does the job very neatly.
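One detail worth knowing is that sapply() forwards any extra arguments to the function it applies. The sketch below is an illustration rather than part of the original walkthrough. It uses the raw built-in airquality data, which contains missing values, and the variable names raw, sd_default and sd_na_rm are made up for this example.

```r
# Raw built-in dataset; Ozone and Solar.R contain NA values
raw <- datasets::airquality

# Without extra arguments, columns with missing values give NA
sd_default <- sapply(raw, sd)

# na.rm = TRUE is passed through to sd() for every column
sd_na_rm <- sapply(raw, sd, na.rm = TRUE)
```

Note that dropping NAs column by column is not the same as dropping whole rows first, so these values can differ slightly from results computed on complete rows only.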

Let’s now take another step on the ladder of complexity and look at another problem.

Problem 2: You want to find the standard deviation and median of each column in your dataset.

Since we have established that the first approach of copy-pasting is impractical, we weigh in on our iteration options.

We start by writing a for()loop-

stddev = vector("double", ncol(airquality))
med = vector("double", ncol(airquality))   # named med so it does not shadow median()
for (i in seq_along(airquality)) {
  stddev[[i]] = sd(airquality[[i]])
  med[[i]] = median(airquality[[i]])
}
stddev
33.275969 91.152302 3.557713 9.529969 1.473434 8.707194
med
31.0 207.0 9.7 79.0 7.0 16.0

Next, we take the functional programming route. Here, unlike the earlier example where we could directly use R’s inbuilt sd() function to calculate the standard deviation and pass it through sapply() , we need to create a custom function, as we need to calculate both the standard deviation and the median.

f <- function(x){
  list(sd(x), median(x))
}
sapply(airquality, f)
   Ozone  Solar.R     Wind     Temp    Month      Day
33.27597  91.1523 3.557713 9.529969 1.473434 8.707194
      31      207      9.7       79        7       16

This is a very solid idea! The ability to pass a user-built function to another function is thrilling and clearly showcases R’s functional programming capabilities. In fact, many seasoned R users prefer functional programming techniques over explicit loops for iterative tasks. As used above, the apply family of functions in base R (apply(), lapply(), tapply(), etc.) is a great way to go about this, but even in the functional programming universe one package has emerged as a favorite: purrr. The purrr family of functions has a more consistent syntax and has built-in functionality for a wide variety of common iterative tasks.

The map() functions form the cornerstone of purrr’s iterative capabilities. Here are some of the forms they take-

  • map() makes a list.
  • map_lgl() makes a logical vector.
  • map_int() makes an integer vector.
  • map_dbl() makes a double vector.
  • map_chr() makes a character vector.
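As a quick illustration of how these variants differ only in the type of their output, here is a minimal sketch. The data frame df and the result names are hypothetical and chosen just for this example.

```r
library(purrr)

# Small hypothetical data frame, used only for illustration
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

means_list <- map(df, mean)            # list with one element per column
means_dbl  <- map_dbl(df, mean)        # named double vector
is_num     <- map_lgl(df, is.numeric)  # named logical vector
n_values   <- map_int(df, length)      # named integer vector
classes    <- map_chr(df, class)       # named character vector
```

Each typed variant raises an error if the function does not return a single value of the expected type, which makes type mistakes surface early.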

Let’s use this idea to solve our earlier problem of calculating the median and standard deviation of each column-

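library(purrr)
# One row per column of airquality; .id stores the column name in "variable"
map_df(airquality, ~list(med = median(.x), sd = sd(.x)), .id = "variable")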

Next, to take another leap up the complexity ladder, let’s pick the gapminder dataset from the gapminder library. A snippet of the same is presented below. (P.S. If you haven’t heard about the Gapminder Foundation, do check out its website. The foundation does some groundbreaking work on putting basic global facts into context.)

country      continent  year  lifeExp       pop  gdpPercap
Afghanistan  Asia       1952     28.8   8425333       779.
Afghanistan  Asia       1957     30.3   9240934       821.
Afghanistan  Asia       1962     32.0  10267083       853.
Afghanistan  Asia       1967     34.0  11537966       836.
Afghanistan  Asia       1972     36.1  13079460       740.
Afghanistan  Asia       1977     38.4  14880372       786.

Problem 3: I want to know which country has the highest GDP per capita in each continent and, separately, in each year.

Using the for() loop approach-

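library(dplyr)
library(gapminder)

list = c("continent", "year")   # the two grouping variables
DF = data.frame()
for (i in list) {
  df = gapminder %>%
    group_by_at(i) %>%
    top_n(1, gdpPercap) %>%
    mutate(Remark = paste0("Country with the max GDP Per capita in the ", i)) %>%
    data.frame()
  DF = rbind(DF, df)   # append, so continent rows come before year rows
}
DF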

Using the apply approach (lapply())-

do.call(rbind, lapply(list, function(x) {
  gapminder %>%
    group_by_at(x) %>%
    top_n(1, gdpPercap) %>%
    mutate(Remark = paste0("Country with the max GDP Per capita in the ", x)) %>%
    data.frame()
}))

Using the purrr::map() approach-

gapminder$year = as.character(gapminder$year)
map_dfr(list, ~gapminder %>%
  group_by(!!sym(.x)) %>%
  top_n(1, gdpPercap) %>%
  mutate(Remark = paste0("Country with the max GDP Per capita in the ", .x)) %>%
  data.frame())

All three approaches above lead to the same output (for the sake of brevity, I am not including the output here; you can have a look at it on my Github). Again, while you can take your pick of iterative route, the functional programming way is a clear winner on conciseness.

purrr also has some built-in functions to deal with everyday iterative tasks! Below I have listed a few popular ones.

Task: Fit a separate regression for each segment of the data (here, each continent):

Purrr solution:

gapminder %>%
  split(.$continent) %>%
  map(~lm(gdpPercap ~ lifeExp, data = .x))
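Once you have a list of fitted models, the same map() family can pull a summary out of each one. The sketch below is an illustration under stated assumptions: it uses the built-in mtcars data split by cyl as a stand-in for gapminder, so that it runs without extra packages beyond purrr, and the names models, slopes and r_squared are hypothetical.

```r
library(purrr)

# mtcars (split by number of cylinders) stands in for gapminder here
models <- map(split(mtcars, mtcars$cyl), ~ lm(mpg ~ wt, data = .x))

# One slope and one R-squared per group, returned as named double vectors
slopes    <- map_dbl(models, ~ coef(.x)[["wt"]])
r_squared <- map_dbl(models, ~ summary(.x)$r.squared)
```

The result names come from the split, so each value stays labelled with the group it belongs to.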

Task: Keep variables based on an arbitrary condition (here, if the variable is a factor):

Purrr solution:

gapminder %>% 
keep(is.factor) %>%
str()

Task: Check if any variable meets an arbitrary condition (here, if any variable is a character):

Purrr solution:

gapminder %>%
  some(is.character)

Task: Check if every variable meets an arbitrary condition (here, if every variable is an integer):

Purrr solution:

gapminder %>%
  every(is.integer)
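If you want to try these three helpers without installing gapminder, here is a minimal sketch on a made-up data frame. The data frame toy and the result names are hypothetical and exist only for this illustration.

```r
library(purrr)

# Hypothetical data frame with mixed column types, for illustration only
toy <- data.frame(
  id    = 1:3,
  name  = c("a", "b", "c"),
  group = factor(c("x", "y", "x")),
  stringsAsFactors = FALSE
)

factor_cols   <- names(keep(toy, is.factor))  # names of the factor columns
any_character <- some(toy, is.character)      # is at least one column character?
all_integer   <- every(toy, is.integer)       # are all columns integer?
```

Here only group is a factor, name is the one character column, and not every column is an integer, so the three helpers return "group", TRUE and FALSE respectively.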

Once you get accustomed to the syntax of purrr, you will need less time to actually write iterative code in R. However, one must never ever feel bad about writing loops in R. In fact, they are one of the fundamental blocks of programming and are profusely used throughout other languages. Some people even call loops slow. They’re wrong! (Well at least they’re rather out of date, as for loops haven’t been slow for many years). The chief benefit of using functions like map() and apply() is not speed, but clarity: they make your code easier to write and read.
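If you would like to check the speed claim on your own machine, a rough sketch along these lines can help. The input xs is made-up data, the names loop_time and apply_time are hypothetical, and the timings will vary from machine to machine, so treat this as an experiment rather than a benchmark.

```r
# Hypothetical input: 10,000 short numeric vectors (illustration only)
set.seed(1)
xs <- replicate(1e4, rnorm(10), simplify = FALSE)

# Preallocated for loop
loop_time <- system.time({
  out_loop <- vector("double", length(xs))
  for (i in seq_along(xs)) out_loop[[i]] <- mean(xs[[i]])
})

# Functional equivalent using sapply()
apply_time <- system.time(
  out_apply <- sapply(xs, mean)
)

# Both approaches compute the same values
same_result <- identical(out_loop, out_apply)
```

Printing loop_time and apply_time shows the elapsed time of each version. The point is that both produce identical results, so the choice between them can be made on readability.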

"The important thing is that you solve the problem that you’re working on, not write the most concise and elegant code (although that’s definitely something you want to strive towards!)" - Hadley Wickham

Thanks for reading! You can view the code on my Github here or reach out to me here.
