A Simple Guide to Understand the apply() Functions in R

Learn how to use these helpful functions once and for all


Introduction

I will start this post by saying that I work daily with both R and Python. Honestly, I find the way the apply functions are used in Python easier and more intuitive.

Thinking about the reason behind that, I believe it is because there aren’t many options in Python. R, on the other hand, offers several variants: the family of apply functions, as I like to call them.

I remember reading somewhere that someone always goes straight to loops to solve a problem because they can never remember what each of the apply functions does or which one is best suited to a given case.

Well, I hope that kind of problem will end after this post. My intention is that those who read this article finish it with a good understanding of the family of functions, and of how and when to use each one.

To perform the exercises, let’s quickly create a sample data frame without worrying too much about the details: an ID, the product name, the quantity sold, and the dollar amount for two different periods.

# Create dataset
# Note: no seed is set, so the random values will differ on every run
dtf <- data.frame(
  id = 1:100,
  product = sample(c('product A', 'product B', 'product C', 'product D'), size = 100, replace = TRUE),
  qty = as.integer( rnorm(100, 10, 2) ),
  amt = rnorm(100, 1280, 300),
  amt2 = rnorm(100, 1280, 300)
)

# See first 5 rows
head(dtf, 5)

[OUT]:
     id   product qty    amt   amt2
1     1 product A   9  954.1 1418.5
2     2 product B  12 1606.9  877.7
3     3 product D   7 1241.6 1433.5
4     4 product A  11 1413.2 1203.8
5     5 product B  10 1623.3 1451.1
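
Since the data is random, the numbers you get will not match the output shown in this post. If you want the same data on every run, one option is to set a seed first. Here is a minimal sketch, assuming a helper called make_dtf and a seed of 42, both made up for this illustration and not part of the original code.

```r
# Hypothetical helper (not in the original post) that rebuilds the sample data
make_dtf <- function(seed) {
  set.seed(seed)  # fix the state of the random number generator
  data.frame(
    id = 1:100,
    product = sample(c('product A', 'product B', 'product C', 'product D'),
                     size = 100, replace = TRUE),
    qty = as.integer(rnorm(100, 10, 2)),
    amt = rnorm(100, 1280, 300),
    amt2 = rnorm(100, 1280, 300)
  )
}

dtf_a <- make_dtf(42)
dtf_b <- make_dtf(42)

# Same seed, same data
identical(dtf_a, dtf_b)
```

Any other seed works just as well; what matters is using the same one each time.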

No more small talk, let’s dive in.

Functions

The apply family in R has 4 main functions: apply(), lapply(), sapply(), and tapply().

apply( )

The apply() function gives the family its name. It is probably the most straightforward one to use. It applies the same function to each row or each column of a matrix or data frame. Here is the syntax.

apply(X, MARGIN, FUN)

Apply the function FUN to the matrix X, over its rows (MARGIN = 1) or its columns (MARGIN = 2).
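
To make the MARGIN argument concrete before we move to the sales data, here is a tiny sketch on a 2 x 3 matrix. The matrix m is made up just for this illustration.

```r
# A small matrix with 2 rows and 3 columns, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, MARGIN = 1, FUN = sum)  # one result per row
col_sums <- apply(m, MARGIN = 2, FUN = sum)  # one result per column

row_sums  # 9 12
col_sums  # 3 7 11
```

Notice that the result has one value per row when MARGIN is 1 and one value per column when MARGIN is 2.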

The easiest example is calculating the mean. We have a dataset with a bunch of product sales. But what’s the mean quantity sold and the mean dollar amount sold?

To quickly see that, we can use apply and choose the function mean to be applied to every numeric column in the dataset.

So, our X is the slice of the dataset with the numeric columns 3 (qty), 4 (amt), and 5 (amt2), since we can’t take the mean of product A and product B. MARGIN can be 1 to apply the function to rows or 2 to apply it to columns. FUN is the function to be applied to each of those rows or columns. The code is as follows.

# Apply: Apply a function to all the columns
apply( X= dtf[,c(3,4,5)], MARGIN= 2, FUN= mean)

[OUT]:
qty     amt    amt2 
9.55 1303.42 1267.46 

Common Errors

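  • We must use the slice notation for X when using apply() because apply() converts X to a matrix, and a matrix can hold only one data type. If we pass the whole dataset dtf, the string column product forces every value to become text, so mean returns NA with a warning for every column, even the numeric ones. We can’t take the mean of strings.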
apply( X= dtf, MARGIN= 2, FUN= mean)

     id product     qty     amt    amt2 
     NA      NA      NA      NA      NA 
Warning messages:
1: In mean.default(newX[, i], ...) :
  argument is not numeric or logical: returning NA
  • Passing a single column, such as dtf$qty or dtf$amt, is not going to work either. With just one variable, it is easier to simply use mean(dtf$qty), right?
  • When running apply(), R first reads the dimensions of the object X before running the function. If you try dim(dtf$qty), you will see that the output is NULL, because a single column extracted with $ is a plain vector with no dimensions. That is why you get the error below. X needs to have dimensions, like a matrix or a data frame.
apply( X= dtf$amt, MARGIN= 2, FUN= mean)

Error in apply(X = dtf$amt, MARGIN = 2, FUN = mean) : 
  dim(X) must have a positive length
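
Both errors above come from how apply() looks at X. The sketch below reproduces the two situations on a small data frame df, made up for this illustration. It also shows two possible workarounds, which are my suggestions and not the only way to do it: selecting the numeric columns by type, and keeping the dimensions of a single column with drop = FALSE.

```r
# Small made-up data frame, used only for illustration
df <- data.frame(
  product = c('A', 'B', 'C'),
  qty = c(9L, 12L, 7L),
  amt = c(954.1, 1606.9, 1241.6)
)

# apply() works on a matrix, and a matrix holds a single type,
# so one character column turns every value into text
typeof(as.matrix(df))           # "character"

# Select the numeric columns by type instead of by position
num_cols <- sapply(df, is.numeric)
col_means <- apply(df[, num_cols], MARGIN = 2, FUN = mean)

# A column extracted with $ is a plain vector with no dimensions
dim(df$amt)                     # NULL

# With drop = FALSE the single column stays a data frame (3 rows, 1 column)
dim(df[, 'amt', drop = FALSE])
one_col_mean <- apply(df[, 'amt', drop = FALSE], MARGIN = 2, FUN = mean)
```

For a single column, mean(df$amt) is still the simpler choice. The drop = FALSE version is only there to show what apply() needs.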

Great. Nice and easy.

Apply function by row

Now let’s take another look at the data. If we want to calculate the mean by row, to know the average between amt and amt2, we can use the same code, just changing to MARGIN=1. The function will then calculate the simple average of each row: (amt + amt2) / 2.

     id   product qty    amt   amt2
1     1 product A   9  954.1 1418.5
2     2 product B  12 1606.9  877.7
3     3 product D   7 1241.6 1433.5
4     4 product A  11 1413.2 1203.8
5     5 product B  10 1623.3 1451.1

# Apply: Apply a function to all the rows
apply( X= dtf[, c(4,5)], MARGIN= 1, FUN= mean)

[OUT]:
  [1] 1186.3 1242.3 1337.6 1308.5 1537.2 1007.1 1094.0 1465.9 1602.8 1204.4 1155.4 1190.6  812.3 1565.5 1118.5
 [16] 1346.4 1259.2 1319.0 1293.2 1402.1 1471.2 1491.8 1248.5 1154.7 1693.5 1358.8 1396.8 1262.8 1383.0 1270.0
 [31] 1621.8  933.7  850.7  892.6 1482.3 1191.3 1612.7 1677.2 1496.7 1504.8  947.7  865.5  953.5 1151.8  947.6
 [46] 1763.6 1229.1 1328.0  893.1 1386.8 1004.8  975.0  931.7 1665.4 1417.0 1482.3  974.0 1444.9 1233.7 1548.7
 [61] 1469.9 1612.0 1159.9 1130.9 1617.9 1290.0 1227.9 1072.6 1367.1 1027.3 1472.1 1263.1 1347.1 1463.3 1324.0
 [76] 1361.6 1330.7 1380.8 1699.4 1389.0 1165.8 1146.8 1358.7 1326.3 1213.6  983.1 1385.5  919.4 1212.9 1226.1
 [91] 1003.2 1643.5 1327.8 1566.7  966.4 1270.6 1359.5 1252.3 1216.1 1405.6

We can also add that result to the dataset as a new column.

# Average of the amounts (avg by row)
dtf$avg_amounts <- apply( X= dtf[, c(4,5)], MARGIN= 1, FUN= mean)

# See first 5 rows
dtf |> head(5)

  id product qty    amt   amt2 avg_amounts
1  1       A   9  954.1 1418.5        1186
2  2       B  12 1606.9  877.7        1242
3  3       D   7 1241.6 1433.5        1338
4  4       A  11 1413.2 1203.8        1309
5  5       B  10 1623.3 1451.1        1537
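
As a side note, for this specific case of row means, Base R also has the dedicated function rowMeans(). The sketch below, on a small made-up data frame df, suggests that both approaches give the same numbers.

```r
# Small made-up data frame with the two amount columns
df <- data.frame(
  amt = c(954.1, 1606.9, 1241.6),
  amt2 = c(1418.5, 877.7, 1433.5)
)

via_apply <- apply(df, MARGIN = 1, FUN = mean)  # generic: any function by row
via_rowmeans <- rowMeans(df)                    # specialized: row means only

# Compare the values only, ignoring any names on the results
all.equal(unname(via_apply), unname(via_rowmeans))
```

The advantage of apply() is that FUN can be any function, not only the mean.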

If we want to add a column with amt/qty, we can calculate it by row (MARGIN=1) using a custom function within apply. With MARGIN=1, the custom function receives each row as a vector, named mtrx here, and divides its second value (amt) by its first value (qty).

# Average AMT by product QTY
dtf$amt_by_qty <- apply(
  X= dtf[, c(3,4)],
  MARGIN= 1,
  FUN= function(mtrx){ mtrx[2]/mtrx[1] }
)

# See first 5 rows
dtf |> head(5)

[OUT]:
  id product qty    amt   amt2 avg_amounts amt_by_qty
1  1       A   9  954.1 1418.5        1186      106.0
2  2       B  12 1606.9  877.7        1242      133.9
3  3       D   7 1241.6 1433.5        1338      177.4
4  4       A  11 1413.2 1203.8        1309      128.5
5  5       B  10 1623.3 1451.1        1537      162.3
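
It is worth knowing that arithmetic in R is already vectorized, so for a simple division like this one the columns can also be divided directly. The sketch below compares the two approaches on a small made-up data frame df. The row function here picks the values by column name instead of by position, which is just a variation of the code above.

```r
# Small made-up data frame with quantity and amount
df <- data.frame(
  qty = c(9L, 12L, 7L),
  amt = c(954.1, 1606.9, 1241.6)
)

# Row by row: each row arrives in the function as a named vector
by_row <- apply(
  df[, c('qty', 'amt')],
  MARGIN = 1,
  FUN = function(row) row['amt'] / row['qty']
)

# Vectorized: divide one column by the other
vectorized <- df$amt / df$qty

# Compare the values only, ignoring any names on the results
all.equal(unname(by_row), vectorized)
```

The custom function inside apply() becomes more useful when the calculation by row is something that a simple column operation cannot express.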

lapply( )

In R, a list is not quite the same concept as it is in Python. In R, it is a collection of objects, not necessarily of the same type, which sets it apart from R’s regular vectors, where every element shares one type. There will be occasions when we need to apply a function to each element in a list. That’s what lapply() is used for.

It is important to note that lapply() returns a list object with the same length as X.

lapply(X, FUN)

The simplest example I can think of is if you have a list and you want to check the type of each object in that list. Let’s see the code for this simple example.

# Creating a list
l <- list(c(1,2), 'l', 1.6, TRUE)

# Check the type of each object in the list
lapply(l, class)

[OUT]:
[[1]]
[1] "numeric"

[[2]]
[1] "character"

[[3]]
[1] "numeric"

[[4]]
[1] "logical"

As expected, the result has the same length (4 objects), and the function class was applied to every element in the list. Basically, you should use that method if you’re dealing with a list object or if you need to have a list returned.
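
One more thing that may help: a data frame is itself a list of columns, so lapply() also works column by column. The sketch below uses a small made-up data frame df, and also shows that extra arguments written after FUN are passed along to the function.

```r
# Small made-up data frame, used only for illustration
df <- data.frame(
  product = c('A', 'B', 'C'),
  qty = c(9L, 12L, 7L),
  amt = c(954.1, 1606.9, 1241.6),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so we get one result per column
col_classes <- lapply(df, class)
col_classes$qty  # "integer"

# Arguments after FUN are passed on to it: here, digits goes to round()
rounded <- lapply(list(1.234, 5.678), round, digits = 1)
```

In both cases the result is a list with one element for each element of the input.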

sapply( )

As per the documentation, sapply() is a user-friendly version of lapply() that, by default, returns a vector or matrix instead of a list. This happens when the argument simplify=TRUE, which is the default, and the results can be simplified; otherwise, sapply() returns a list, just like lapply().

sapply( ) is a user-friendly version of lapply( ) that returns a simpler object, like a vector or matrix, when possible.

sapply(X, FUN, simplify = TRUE)

# Use sapply for the same list
sapply(l, FUN = class)

[OUT]: a character vector
[1] "numeric"   "character" "numeric"   "logical" 
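
What sapply() returns depends on the shape of the individual results. The sketch below, with small made-up inputs, shows the three cases I see most often. The last line uses vapply(), a Base R relative of sapply() where you declare the expected type of each result; I add it only as an optional, stricter alternative.

```r
# Every result has length 1: sapply() returns a vector
vec <- sapply(1:3, function(i) i^2)         # 1 4 9

# Every result has length 2: sapply() returns a matrix, one column per input
mat <- sapply(1:3, function(i) c(i, i^2))   # 2 rows, 3 columns

# Results have different lengths: nothing to simplify, so we get a list
lst <- sapply(1:3, function(i) seq_len(i))

# vapply() fails if a result is not exactly one character string
chk <- vapply(list(c(1, 2), 'l', TRUE), class, FUN.VALUE = character(1))
```

So if your code depends on always getting the same kind of object back, lapply() or vapply() are the more predictable choices.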

Going back to our dataset dtf, the values in the column product are all preceded by the word ‘product’. If we want to remove that, sapply can be a good helper. Instead of writing a loop or loading the Tidyverse just for that, remember that the apply functions from Base R can help.

Observe that we’re using a custom function here to split the string in two pieces at the space character and take the second piece.

# Using sapply to remove the word 'product' from the description
dtf$product <- sapply(dtf$product,
                      FUN= function(x){ strsplit(x, ' ')[[1]][2] })

[OUT]:
  id product qty    amt   amt2
1  1       A   9  954.1 1418.5
2  2       B  12 1606.9  877.7
3  3       D   7 1241.6 1433.5
4  4       A  11 1413.2 1203.8
5  5       B  10 1623.3 1451.1
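
Two details about this step may be useful. When X is a character vector, sapply() uses the original strings as names for the result unless USE.NAMES = FALSE is set. Also, many string functions in Base R are already vectorized, so sub() could do the same job here. The sketch below compares both on a small made-up vector called products.

```r
# Small made-up vector of product names
products <- c('product A', 'product B', 'product D')

# Default behavior: the original strings become the names of the result
named <- sapply(products, FUN = function(x) strsplit(x, ' ')[[1]][2])
names(named)

# Same call without the names
via_sapply <- sapply(products,
                     FUN = function(x) strsplit(x, ' ')[[1]][2],
                     USE.NAMES = FALSE)

# Vectorized alternative: remove the prefix 'product ' from every string
via_sub <- sub('^product ', '', products)

identical(via_sapply, via_sub)
```

The sapply() version is still a good pattern to remember for functions that only accept one value at a time.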

tapply( )

Finally, tapply() is a function to be used when you’re dealing with factors (or groups).

tapply(X, INDEX, FUN, simplify = TRUE)

Use tapply to apply a function to every group of values.

The obvious example for this function is checking the mean of amt by product. Without a group by function at hand, tapply can be handy.

  • X is the vector of values to summarize.
  • INDEX is the column with the groups, with the same length as X.
  • FUN is the function to be applied.
# Using tapply to calculate the mean of amt by product 
tapply(X= dtf$amt, INDEX = dtf$product, FUN = mean)

[OUT]:
   A    B    C    D 
1295 1290 1304 1323 
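
To see exactly what tapply() is doing, here is a sketch with just four made-up values, small enough to check by hand. It also shows the same calculation written as split() followed by sapply(), which is roughly how I think about tapply().

```r
# Made-up values and their groups
amt <- c(10, 20, 30, 40)
product <- c('A', 'B', 'A', 'B')

# Group A holds 10 and 30, group B holds 20 and 40
means <- tapply(X = amt, INDEX = product, FUN = mean)
means  # A = 20, B = 30

# Roughly the same idea in two steps: split by group, then apply
by_split <- sapply(split(amt, product), mean)
```

Both results have one value per group, named after the group.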

That’s about it for the most used functions from the apply family in R.

Before You Go

Well, the apply functions are great mapping functions. In your daily job as a data scientist, there will be several occasions where a mapping function is a better choice than a loop, whether for performance or simply for better code readability.

With this quick intro, I am sure you can now apply the family of apply functions to your data.

  • apply: use to apply the same function to every row or column of a matrix or data frame.
  • lapply: use to apply a function to each element of a list. Returns a list of the same length as result.
  • sapply: friendly version of lapply that returns a vector or matrix when possible.
  • tapply: use to apply a function to each group of values. Almost like a group by.

Code in GitHub:

Studying/R/apply at master · gurezende/Studying

Reference

apply(), lapply(), sapply(), tapply() Function in R with Examples

https://www.datacamp.com/tutorial/r-tutorial-apply-family

SANTOS, Gustavo R. 2023. Data Wrangling with R. 1 ed. Packt Publishing.

