Dealing with Apply functions in R

vikashraj luhaniwal
Towards Data Science
7 min read · Mar 27, 2019


Iterative control structures (loops such as for, while, and repeat) repeat a set of instructions a number of times. In large-scale data processing, however, these loops can cost considerable time and memory. R offers a more concise and often faster way to iterate: the apply family of functions.

In this post, I first compare apply functions with loops visually, using a profiler, and then walk through the members of the apply family.

Before looking at the apply functions themselves, let us see how an iteration written with an apply function takes less time to execute than the same iteration written as a basic loop.

Consider the FARS (Fatality Analysis Reporting System) dataset available in the gamclass package of R. It contains 151158 observations of 17 different features. The dataset includes every accident in which there was at least one fatality, and the data is limited to vehicles where the front passenger seat was occupied.

Now let us assume we want to calculate the mean of the age column. This can be done using a traditional loop and also using an apply function.

Method 1: Using for loop

library("gamclass")
data(FARS)
total <- 0 # running sum of the ages
for (i in seq_along(FARS$age)) {
  total <- sum(total, FARS$age[i])
}
mean_age <- total / length(FARS$age)
mean_age

Method 2: Using apply() function

apply(FARS[3],2, mean)

Now let us compare the two approaches visually with the help of the profvis package.

profvis is a code-profiling tool that provides an interactive graphical interface for visualizing the memory and time consumption of instructions throughout their execution.

To use profvis, enclose the instructions in profvis(). It opens an interactive profile visualizer in a new tab inside RStudio.

#for method 1
library("profvis")
profvis({
  total <- 0
  for (i in seq_along(FARS$age)) {
    total <- sum(total, FARS$age[i])
  }
  mean_age <- total / length(FARS$age)
  mean_age
})

Output using method 1

Under the Flame Graph tab, we can inspect the time taken (in ms) by each instruction.

#for method 2
profvis({
apply(FARS[3],2, mean)
})

Output using method 2

Here, one can easily see that method 1 takes almost 1990 ms (1960 + 30), whereas method 2 takes only 20 ms. This shows how much time apply() functions can save.
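The comparison can also be roughly reproduced without profvis or the FARS data. The following is a minimal sketch, assuming a synthetic `ages` data frame as a stand-in for the real dataset (the name and the values are made up for illustration). It times both methods with base R's system.time() and checks that they agree with a direct call to mean().

```r
# Synthetic stand-in for the FARS age column (an assumption, not the real data)
set.seed(42)
ages <- data.frame(age = sample(16:90, 1e5, replace = TRUE))

# Method 1: explicit for loop
loop_time <- system.time({
  total <- 0
  for (i in seq_along(ages$age)) {
    total <- sum(total, ages$age[i])
  }
  loop_mean <- total / length(ages$age)
})

# Method 2: apply() over the single column
apply_time <- system.time({
  apply_mean <- apply(ages["age"], 2, mean)
})

# Reference value computed directly with mean()
direct_mean <- mean(ages$age)

# All three should agree (unname() drops the column name apply() attaches)
all.equal(loop_mean, direct_mean)
all.equal(unname(apply_mean), direct_mean)

# Elapsed seconds; the exact values vary by machine and R version
loop_time["elapsed"]
apply_time["elapsed"]
```

The timings will differ from the profvis figures above, since the data and the machine are different; only the relative difference is of interest.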

Benefits of apply functions over traditional loops

  1. More efficient and faster in execution, as in the comparison above.
  2. Easier-to-follow syntax: a single line of code using an apply function replaces a whole block of instructions.

Apply family in R

The apply family contains several functions suited to different data structures, such as lists, matrices, arrays, and data frames. Its members include apply(), lapply(), sapply(), tapply(), and mapply(). These functions are alternatives to loops.

Each of the apply functions requires a minimum of two arguments: an object and another function. The function can be any built-in function (like mean, sum, or max) or a user-defined function.

Explore the members

1. apply() function

The syntax of apply() is as follows:

apply(X, MARGIN, FUN, ...)

where X is an input data object, MARGIN indicates whether the function is applied row-wise (MARGIN = 1) or column-wise (MARGIN = 2), FUN is a built-in or user-defined function, and ... carries any further arguments passed on to FUN.

The output object type depends on the input object and the function specified. apply() can return a vector, list, matrix, or array for different input objects, as the examples below show.

#---------- apply() function ---------- 
#case 1. matrix as an input argument
m1 <- matrix(1:9, nrow =3)
m1
result <- apply(m1, 1, mean) #mean of elements for each row
result
class(result) #"numeric": a plain vector
result <- apply(m1, 2, sum) #sum of elements for each column
result
class(result) #"integer": a plain vector
result <- apply(m1, 1, cumsum) #cumulative sum of elements for each row
result #each row's result is stored as a column
class(result) #class is a matrix
matrix(apply(m1, 1, cumsum), nrow = 3, byrow = TRUE) #for row-wise order
#user-defined function
check <- function(x) {
  return(x[x > 5])
}
result <- apply(m1, 1, check) #user-defined function as an argument
result
class(result) #class is a list, because the results differ in length
#case 2. data frame as an input
ratings <- c(4.2, 4.4, 3.4, 3.9, 5, 4.1, 3.2, 3.9, 4.6, 4.8, 5, 4, 4.5, 3.9, 4.7, 3.6)
employee.mat <- matrix(ratings,byrow=TRUE,nrow=4,dimnames = list(c("Quarter1","Quarter2","Quarter3","Quarter4"),c("Hari","Shri","John","Albert")))
employee <- as.data.frame(employee.mat)
employee
result <- apply(employee,2,sum) #sum of elements for each column
result
class(result) #class is a vector
result <- apply(employee,1,cumsum) #cumulative sum of elements for each row
result #by default column-wise order
class(result) #class is a matrix
#user defined function
check<-function(x){
return(x[x>4.2])
}
result <- apply(employee,2,check) #user defined function as an argument
result
class(result) #class is a list
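apply() forwards any arguments given after FUN on to FUN itself, and FUN can also be an anonymous function written inline. A short sketch of both, using a small made-up matrix `m` that contains a missing value:

```r
# Small illustrative matrix with one missing value (made-up data)
m <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)
# Rows are (1, 3, 5) and (NA, 4, 6)

# Without na.rm, the second row's mean is NA
row_means_raw <- apply(m, 1, mean)

# na.rm = TRUE is passed through apply() to mean()
row_means <- apply(m, 1, mean, na.rm = TRUE)
row_means # 3 5

# Anonymous function: count the non-missing values in each column
non_missing <- apply(m, 2, function(col) sum(!is.na(col)))
non_missing # 1 2 2
```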

2. lapply() function

lapply() always returns a list; the ‘l’ in lapply() stands for ‘list’. It accepts vectors, lists, and data frames as input. No MARGIN argument is needed: the function is applied to each element of a vector or list, and to each column of a data frame. The examples below show the output for each kind of input.

#---------- lapply() function ---------- 
#case 1. vector as an input argument
result <- lapply(ratings,mean)
result
class(result) #class is a list
#case 2. list as an input argument
list1<-list(maths=c(64,45,89,67),english=c(79,84,62,80),physics=c(68,72,69,80),chemistry = c(99,91,84,89))
list1
result <- lapply(list1,mean)
result
class(result) #class is a list
#user defined function
check<-function(x){
return(x[x>75])
}
result <- lapply(list1,check) #user defined function as an argument
result
class(result) #class is a list
#case 3. dataframe as an input argument
result <- lapply(employee,sum) #sum of elements for each column
result
class(result) #class is a list
result <- lapply(employee,cumsum) #cumulative sum of elements within each column
result
class(result) #class is a list
#user defined function
check<-function(x){
return(x[x>4.2])
}
result <- lapply(employee,check) #user defined function as an argument
result
class(result) #class is a list
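As with apply(), the function given to lapply() can be an anonymous one, and extra arguments are passed along to it. A brief sketch, assuming a small `scores` list modelled on list1 above:

```r
# Illustrative list of marks per subject (a shortened copy of list1)
scores <- list(maths = c(64, 45, 89, 67), english = c(79, 84, 62, 80))

# Anonymous function: spread between the highest and lowest mark
spread <- lapply(scores, function(x) max(x) - min(x))
spread$maths   # 89 - 45 = 44
spread$english # 84 - 62 = 22

# Extra argument: decreasing = TRUE is forwarded to sort()
ranked <- lapply(scores, sort, decreasing = TRUE)
ranked$maths # 89 67 64 45

class(spread) # "list"
```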

apply() vs. lapply()

  • lapply() always returns a list, whereas apply() can return a vector, list, matrix, or array.
  • lapply() has no MARGIN argument; it always works element by element (column by column for a data frame).
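The difference in return type is easiest to see by running both functions on the same input. A minimal sketch, assuming a small made-up data frame `df`:

```r
# Small illustrative data frame (made-up values)
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# apply() with MARGIN = 2 simplifies the column sums to a named vector
sums_apply <- apply(df, 2, sum)
is.list(sums_apply) # FALSE

# lapply() returns the same numbers wrapped in a named list
sums_lapply <- lapply(df, sum)
is.list(sums_lapply) # TRUE

# Flattening the list recovers the same values
all.equal(sums_apply, unlist(sums_lapply))
```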

3. sapply() function

sapply() is a simplified form of lapply(). It has one additional argument, simplify, whose default value is TRUE. If simplify = FALSE, sapply() returns a list just like lapply(); otherwise, it returns the simplest form possible: a vector, a matrix, or, when the results cannot be simplified, a list.

The examples below show the output for each kind of input.

#---------- sapply() function ---------- 
#case 1. vector as an input argument
result <- sapply(ratings,mean)
result
class(result) #class is a vector
result <- sapply(ratings,mean, simplify = FALSE)
result
class(result) #class is a list
result <- sapply(ratings,range)
result
class(result) #class is a matrix
#case 2. list as an input argument
result <- sapply(list1,mean)
result
class(result) #class is a vector
result <- sapply(list1,range)
result
class(result) #class is a matrix
#user defined function
check<-function(x){
return(x[x>75])
}
result <- sapply(list1,check) #user defined function as an argument
result
class(result) #class is a list
#case 3. dataframe as an input argument
result <- sapply(employee,mean)
result
class(result) #class is a vector
result <- sapply(employee,range)
result
class(result) #class is a matrix
#user defined function
check<-function(x){
return(x[x>4])
}
result <- sapply(employee,check) #user defined function as an argument
result
class(result) #class is a list
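The relationship between sapply() and lapply() can be checked directly. The sketch below assumes a small `scores` list modelled on list1 above and shows the three shapes sapply() can produce:

```r
# Illustrative list of marks per subject (a shortened copy of list1)
scores <- list(maths = c(64, 45, 89, 67), english = c(79, 84, 62, 80))

# One value per element: simplified to a named vector
means <- sapply(scores, mean)
means # maths 66.25, english 76.25

# simplify = FALSE gives exactly what lapply() gives
identical(sapply(scores, mean, simplify = FALSE), lapply(scores, mean)) # TRUE

# Two values per element: simplified to a matrix with one column per element
ranges <- sapply(scores, range)
dim(ranges) # 2 2

# Results of different lengths cannot be simplified, so a list comes back
high <- sapply(scores, function(x) x[x > 75])
is.list(high) # TRUE
```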

4. tapply() function

tapply() is helpful when dealing with categorical variables: it applies a function to numeric data grouped by category. The simplest form of tapply() can be understood as

tapply(column 1, column 2, FUN)

where column 1 is the numeric column on which the function is applied, column 2 is a factor (or a vector that can be converted to one), and FUN is the function to be applied.

#---------- tapply() function ---------- 
salary <- c(21000,29000,32000,34000,45000)
designation<-c("Programmer","Senior Programmer","Senior Programmer","Senior Programmer","Manager")
gender <- c("M","F","F","M","M")
result <- tapply(salary,designation,mean)
result
class(result) #class is an array
result <- tapply(salary,list(designation,gender),mean)
result
class(result) #class is a matrix
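In practice the two columns usually come from one data frame. The following sketch reuses the salary data above inside a hypothetical `staff` data frame, and also shows that a combination of categories with no observations comes back as NA:

```r
# The salary data from above, gathered into a data frame (the name is illustrative)
staff <- data.frame(
  salary = c(21000, 29000, 32000, 34000, 45000),
  designation = c("Programmer", "Senior Programmer", "Senior Programmer",
                  "Senior Programmer", "Manager"),
  gender = c("M", "F", "F", "M", "M")
)

# Mean salary per designation
mean_by_role <- tapply(staff$salary, staff$designation, mean)
mean_by_role[["Senior Programmer"]] # (29000 + 32000 + 34000) / 3

# Any function of a vector works, for example a head count per group
count_by_role <- tapply(staff$salary, staff$designation, length)

# Two grouping variables give a matrix; empty cells are NA
by_role_gender <- tapply(staff$salary, list(staff$designation, staff$gender), mean)
by_role_gender["Senior Programmer", "F"] # (29000 + 32000) / 2 = 30500
is.na(by_role_gender["Manager", "F"])    # TRUE: there is no female manager in the data
```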

5. by() function

by() does a similar job to tapply(), i.e. it applies an operation to numeric values grouped by category. by() is a wrapper around tapply() that also accepts data frames.

#---------- by() function ---------- 
result <- by(salary,designation,mean)
result
class(result) #class is of "by" type
result[2] #accessing as a vector element
as.list(result) #converting into a list
result <- by(salary,list(designation,gender),mean)
result
class(result) #class is of "by" type
library("gamclass")
data("FARS")
by(FARS[2:4], FARS$airbagAvail, colMeans)
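A quick way to see the relationship between the two is to compare their results on the same input. The sketch below reuses the salary data and adds a made-up `years` column (an assumption for illustration only) to show that, for a data frame, by() hands each group's rows to the function as a whole.

```r
salary <- c(21000, 29000, 32000, 34000, 45000)
designation <- c("Programmer", "Senior Programmer", "Senior Programmer",
                 "Senior Programmer", "Manager")

# On a plain vector, by() and tapply() compute the same group means
by_means <- by(salary, designation, mean)
tapply_means <- tapply(salary, designation, mean)
all.equal(as.numeric(by_means), as.numeric(tapply_means))

# On a data frame, the function receives all columns of each group
staff <- data.frame(salary = salary,
                    years = c(1, 4, 5, 6, 10)) # hypothetical years of experience
per_year <- by(staff, designation, function(d) sum(d$salary) / sum(d$years))
as.numeric(per_year) # groups in alphabetical order of designation
```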

6. mapply() function

The ‘m’ in mapply() stands for ‘multivariate’. It applies the specified function to the first elements of each argument, then to the second elements, and so on. Note that here the function is the first argument, whereas in the other apply functions it is the second or third.

#---------- mapply() function ---------- 
result <- mapply(rep, 1:4, 4:1)
result
class(result) #class is a list
result <- mapply(rep, 1:4, 4:4) #4:4 is just 4, so every result has length 4
result
class(result) #class is a matrix
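The element-by-element pairing is easier to follow with a function that names its arguments. A small sketch with made-up inputs, which also shows recycling of a shorter argument and the MoreArgs parameter for arguments that should stay fixed across calls:

```r
# Pairs up 1 with 4, 2 with 5, 3 with 6
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums # 5 7 9

# A length-one argument is recycled for every call
powers <- mapply(function(base, exp) base^exp, 2:4, 2)
powers # 4 9 16

# MoreArgs holds arguments that are the same in every call
labels <- mapply(paste, c("a", "b"), c("x", "y"), MoreArgs = list(sep = "-"))
unname(labels) # "a-x" "b-y"
```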

Conclusion

I believe I have covered the most useful and popular apply functions, along with the common combinations of input objects. If you think something is missing or more detail is required, let me know in the comments and I’ll add it in!

AI practitioner and technical consultant with five years of work experience in the field of data science, machine learning, big data, and programming.