Introduction
I will start this post by saying that I work daily with both R and Python. Honestly, I find the way apply functions are used in Python easier and more intuitive.
Thinking about the reason behind that, I believe it is because there aren’t many options in Python. R, in contrast, offers several: the family of apply functions, as I like to refer to them.
I remember once reading that someone always goes straight to loops because they can never remember what each of the apply functions does or which one is best suited to the case at hand.
Well, I hope that kind of problem ends with this post. My intention is that readers finish this article with a good understanding of the family of functions, and of how and when to use them.
To perform the exercises, let’s quickly create a sample data frame without much thought put into its design: an ID, the product name, the quantity sold, and the dollar amount for two different periods.
# Create dataset
dtf <- data.frame(
id = 1:100,
product = sample(c('product A', 'product B', 'product C', 'product D'), size=100, replace=TRUE),
qty = as.integer( rnorm(100, 10, 2) ),
amt = rnorm(100, 1280, 300),
amt2 = rnorm(100, 1280, 300)
)
# Show the first 5 rows
head(dtf, 5)
[OUT]:
id product qty amt amt2
1 1 product A 9 954.1 1418.5
2 2 product B 12 1606.9 877.7
3 3 product D 7 1241.6 1433.5
4 4 product A 11 1413.2 1203.8
5 5 product B 10 1623.3 1451.1
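Because the data is random, your numbers will differ from the ones shown in this post. If you want the same values on every run, one option is to fix the random seed first. The sketch below assumes an arbitrary seed of 42 and uses the hypothetical name dtf_demo so it does not overwrite dtf.

```r
# Assumption: the seed value 42 is arbitrary; any fixed integer works
set.seed(42)

# Same structure as dtf, stored under a hypothetical name
dtf_demo <- data.frame(
  id      = 1:100,
  product = sample(c('product A', 'product B', 'product C', 'product D'),
                   size = 100, replace = TRUE),
  qty     = as.integer(rnorm(100, 10, 2)),
  amt     = rnorm(100, 1280, 300),
  amt2    = rnorm(100, 1280, 300)
)

# Inspect the structure and the first rows
str(dtf_demo)
head(dtf_demo, 5)
```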
No more small talk, let’s dive in.
Functions
The apply family in R has 4 main functions: apply(), lapply(), sapply(), and tapply().
apply( )
The apply() function gives the family its name. It is probably the most straightforward one to use. It applies the same function to every row or every column of a matrix. Here is the syntax.
apply(X, MARGIN, FUN)
Apply to the matrix X, on the rows (1) or columns (2), the function specified.
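To see what MARGIN does before touching the real data, here is a minimal sketch on a tiny matrix whose sums are easy to check by hand. The names m, row_sums, and col_sums are made up for this illustration.

```r
# A small 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# MARGIN = 1: one result per row
row_sums <- apply(X = m, MARGIN = 1, FUN = sum)
row_sums   # 9 12

# MARGIN = 2: one result per column
col_sums <- apply(X = m, MARGIN = 2, FUN = sum)
col_sums   # 3 7 11
```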
The easiest example is calculating a mean. We have a dataset with a bunch of product sales, but what are the mean quantity sold and the mean dollar amount sold?
To quickly see that, we can use apply and choose the function mean to be applied to every numeric column in the dataset.
So, our X is the slice of the dataset with the numeric columns 3 (qty), 4 (amt), and 5 (amt2), since we can’t take the mean of product A and product B. The MARGIN can be 1 to apply the function to rows or 2 to apply it to columns. FUN is the function to be applied to each row or column. The code will be as follows.
# Apply: Apply a function to all the columns
apply( X= dtf[,c(3,4,5)], MARGIN= 2, FUN= mean)
[OUT]:
qty amt amt2
9.55 1303.42 1267.46
Common Errors
- We must subset X to the numeric columns when using apply(). If we pass the whole dataset dtf, which contains a string column, apply() converts everything to a character matrix, and mean returns NA with a warning for every column. We can’t take the mean of strings.
apply( X= dtf, MARGIN= 2, FUN= mean)
id product qty amt amt2
NA NA NA NA NA
Warning messages:
1: In mean.default(newX[, i], ...) :
argument is not numeric or logical: returning NA
- Even if we try to use a single column, such as dtf$qty, to get the mean of the quantities (which is numeric), it is not going to work either. There is just one variable, so it is easier to simply use mean(dtf$qty), right?
- When running apply(), R first reads the dimensions of the object X before running the function. If you try dim(dtf$qty), you will see that the output is NULL, because a single column extracted with $ is a plain vector. That is why you get the error below: X needs dimensions, as a matrix or a data frame has.
apply( X= dtf$amt, MARGIN= 2, FUN= mean)
Error in apply(X = dtf$amt, MARGIN = 2, FUN = mean) :
dim(X) must have a positive length
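Both errors above can be reproduced and avoided on a tiny example. The sketch below uses a hypothetical data frame called mini. Selecting numeric columns with sapply(mini, is.numeric) and keeping dimensions with drop = FALSE are two possible workarounds, not the only ones.

```r
# Hypothetical mini data frame with one text column and two numeric ones
mini <- data.frame(
  product = c('A', 'B', 'C'),
  qty     = c(9, 12, 7),
  amt     = c(954.1, 1606.9, 1241.6)
)

# apply() converts its input to a matrix; one text column turns the
# whole matrix into character, so mean() cannot work on any column
class(as.matrix(mini)[, 'qty'])   # "character"

# Select the numeric columns by type instead of by position
num_cols  <- sapply(mini, is.numeric)
col_means <- apply(X = mini[, num_cols], MARGIN = 2, FUN = mean)

# A single column extracted with $ is a plain vector without dimensions
dim(mini$amt)                     # NULL

# drop = FALSE keeps the data frame structure, so apply() works again
one_col_mean <- apply(X = mini[, 'amt', drop = FALSE], MARGIN = 2, FUN = mean)
```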
Great. Nice and easy.
Apply function by row
Now let’s take another look at the data. If we want to calculate the mean by row, to know the average of amt and amt2, we can use the same code, just changing to MARGIN=1. The function will then calculate the simple average of each row: (amt + amt2) / 2.
id product qty amt amt2
1 1 product A 9 954.1 1418.5
2 2 product B 12 1606.9 877.7
3 3 product D 7 1241.6 1433.5
4 4 product A 11 1413.2 1203.8
5 5 product B 10 1623.3 1451.1
# Apply: Apply a function to all the rows
apply( X= dtf[, c(4,5)], MARGIN= 1, FUN= mean)
[OUT]:
[1] 1186.3 1242.3 1337.6 1308.5 1537.2 1007.1 1094.0 1465.9 1602.8 1204.4 1155.4 1190.6 812.3 1565.5 1118.5
[16] 1346.4 1259.2 1319.0 1293.2 1402.1 1471.2 1491.8 1248.5 1154.7 1693.5 1358.8 1396.8 1262.8 1383.0 1270.0
[31] 1621.8 933.7 850.7 892.6 1482.3 1191.3 1612.7 1677.2 1496.7 1504.8 947.7 865.5 953.5 1151.8 947.6
[46] 1763.6 1229.1 1328.0 893.1 1386.8 1004.8 975.0 931.7 1665.4 1417.0 1482.3 974.0 1444.9 1233.7 1548.7
[61] 1469.9 1612.0 1159.9 1130.9 1617.9 1290.0 1227.9 1072.6 1367.1 1027.3 1472.1 1263.1 1347.1 1463.3 1324.0
[76] 1361.6 1330.7 1380.8 1699.4 1389.0 1165.8 1146.8 1358.7 1326.3 1213.6 983.1 1385.5 919.4 1212.9 1226.1
[91] 1003.2 1643.5 1327.8 1566.7 966.4 1270.6 1359.5 1252.3 1216.1 1405.6
We can also add that result to the dataset as a new column.
# Average of the amounts (avg by row)
dtf$avg_amounts <- apply( X= dtf[, c(4,5)], MARGIN= 1, FUN= mean)
# See first 5 rows
dtf |> head(5)
id product qty amt amt2 avg_amounts
1 1 A 9 954.1 1418.5 1186
2 2 B 12 1606.9 877.7 1242
3 3 D 7 1241.6 1433.5 1338
4 4 A 11 1413.2 1203.8 1309
5 5 B 10 1623.3 1451.1 1537
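For the specific case of row means, Base R also ships rowMeans(), which should give the same numbers as apply() with MARGIN=1 and FUN=mean. The sketch below checks that on a hypothetical three-row data frame, amounts, built from the first rows of the output above.

```r
# Hypothetical two-column example mirroring amt and amt2
amounts <- data.frame(
  amt  = c(954.1, 1606.9, 1241.6),
  amt2 = c(1418.5, 877.7, 1433.5)
)

# Row-wise mean with apply()
avg_apply <- apply(X = amounts, MARGIN = 1, FUN = mean)

# Same calculation with the built-in rowMeans()
avg_rowmeans <- rowMeans(amounts)

# Compare the two results, ignoring any names
all.equal(unname(avg_apply), unname(avg_rowmeans))
```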
If we want to add a column that calculates amt/qty, we can do that by row (MARGIN=1) using a custom function within apply. The custom function receives each row as a vector, called mtrx in the code, and divides its second value (amt) by its first value (qty).
# Average AMT by product QTY
dtf$amt_by_qty <- apply(
X= dtf[, c(3,4)],
MARGIN= 1,
FUN= function(mtrx){ mtrx[2]/mtrx[1]}
)
# See first 5 rows
dtf |> head(5)
[OUT]:
id product qty amt amt2 avg_amounts amt_by_qty
1 1 A 9 954.1 1418.5 1186 106.0
2 2 B 12 1606.9 877.7 1242 133.9
3 3 D 7 1241.6 1433.5 1338 177.4
4 4 A 11 1413.2 1203.8 1309 128.5
5 5 B 10 1623.3 1451.1 1537 162.3
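It may help to see that, with MARGIN=1, the custom function gets one row at a time as a named vector, so the values can be picked by name as well as by position. For simple arithmetic like this, dividing the columns directly should give the same result. The sketch uses a hypothetical data frame, sales, with values from the first rows above.

```r
# Hypothetical values taken from the first rows of the sample output
sales <- data.frame(
  qty = c(9L, 12L, 7L),
  amt = c(954.1, 1606.9, 1241.6)
)

# With MARGIN = 1 the function receives one row at a time as a named vector
by_row <- apply(X = sales, MARGIN = 1,
                FUN = function(one_row) one_row['amt'] / one_row['qty'])

# Arithmetic on columns is already vectorized, so plain division
# gives the same numbers without apply()
vectorized <- sales$amt / sales$qty

all.equal(unname(by_row), vectorized)
```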
lapply( )
In R, a list is a different concept than it is in Python. It is a collection of objects, not necessarily of the same type. There will be occasions when we need to apply a function to each element of a list, and that’s what lapply() is used for.
It is important to note that lapply returns a list object with the same length as X.
lapply(X, FUN)
The simplest example I can think of is if you have a list and you want to check the type of each object in that list. Let’s see the code for this simple example.
# Creating a list
l <- list(c(1,2), 'l', 1.6, TRUE)
# Check the type of each object in the list
lapply(l, class)
[OUT]:
[[1]]
[1] "numeric"
[[2]]
[1] "character"
[[3]]
[1] "numeric"
[[4]]
[1] "logical"
As expected, the result has the same length (4 objects), and the function class was applied to every element in the list. Basically, you should use that method if you’re dealing with a list object or if you need to have a list returned.
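A data frame is itself a list of columns, so lapply() also works column by column on a dataset. The sketch below shows that on a hypothetical data frame, sales, and also shows that extra arguments placed after FUN are passed on to the function.

```r
# A data frame is a list of columns, so lapply() visits each column
sales <- data.frame(
  product = c('A', 'B', 'C'),
  qty     = c(9L, 12L, 7L),
  amt     = c(954.1, 1606.9, 1241.6)
)

col_classes <- lapply(sales, class)
str(col_classes)

# Extra arguments after FUN are passed on to the function
values <- list(a = c(1, NA, 3), b = c(4, 5, 6))
means  <- lapply(values, mean, na.rm = TRUE)
means$a   # 2
means$b   # 5
```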
sapply( )
As per the documentation, sapply() is a user-friendly version of lapply() that returns a vector or matrix instead of a list whenever the results can be simplified. With the argument simplify=TRUE, which is the default, it returns the simplified object. If the results can’t be simplified, it still returns a list.
sapply( ) is a user-friendly version of lapply( ) that returns a simpler object, like a vector or matrix.
sapply(X, FUN, simplify = TRUE)
# Use sapply for the same list
sapply(l, FUN = class)
[OUT]: a character vector
[1] "numeric" "character" "numeric" "logical"
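What sapply() returns depends on the shape of the individual results. The sketch below, with hypothetical names, shows a vector, a matrix, and the fallback to a list. It also shows vapply(), a stricter relative from Base R in which you declare the expected result type.

```r
values <- list(a = c(1, 2, 3), b = c(10, 20, 30))

# One value per element: sapply() simplifies to a named vector
sapply(values, mean)            # a = 2, b = 20

# Two values per element: sapply() simplifies to a 2 x 2 matrix
ranges <- sapply(values, range)
dim(ranges)                     # 2 2

# Results of different lengths cannot be simplified, so a list comes back
uneven <- sapply(list(1:2, 1:3), function(x) x * 2)
class(uneven)                   # "list"

# vapply() fails loudly if a result does not match FUN.VALUE
means <- vapply(values, mean, FUN.VALUE = numeric(1))
```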
Going back to our dataset dtf, the values in the column product are preceded by the word ‘product’. If we want to remove that, sapply can be a good helper. Before reaching for a loop or loading the Tidyverse, remember that the apply functions from Base R can help.
Observe that we’re using a custom function here that splits the string into two pieces at the space character and takes the second piece.
# Using sapply to remove the word 'product' from the description
dtf$product <- sapply(dtf$product,
FUN= function(x){ strsplit(x, ' ')[[1]][2] })
[OUT]:
id product qty amt amt2
1 1 A 9 954.1 1418.5
2 2 B 12 1606.9 877.7
3 3 D 7 1241.6 1433.5
4 4 A 11 1413.2 1203.8
5 5 B 10 1623.3 1451.1
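One detail worth knowing: when sapply() runs over a character vector, it uses the original strings as names for the result unless USE.NAMES = FALSE is set. As an alternative sketch, not the approach used in this post, the vectorized sub() function can strip the prefix in one call. The vector products below is made up for the illustration.

```r
# Hypothetical product labels in the same format as the dataset
products <- c('product A', 'product B', 'product D')

# sapply() keeps the original strings as names unless USE.NAMES = FALSE
via_sapply <- sapply(products,
                     FUN = function(x) strsplit(x, ' ')[[1]][2],
                     USE.NAMES = FALSE)

# sub() is vectorized: it removes the prefix from every element at once
via_sub <- sub('^product ', '', products)

identical(via_sapply, via_sub)   # TRUE
```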
tapply( )
Finally, tapply() is the function to use when you’re dealing with factors (or groups).
tapply(X, INDEX, FUN, simplify = TRUE)
Use tapply to apply a function to every group of values.
The obvious example for this function is checking the mean of amt by product. When we want that without a group by function, tapply can be handy.
- X is the vector or matrix with the values.
- INDEX is the column with the groups.
- FUN is the function to be applied.
# Using tapply to calculate the mean of amt by product
tapply(X= dtf$amt, INDEX = dtf$product, FUN = mean)
[OUT]:
A B C D
1295 1290 1304 1323
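To check the grouping logic by hand, here is a minimal sketch on a hypothetical four-row data frame, sales. It also shows two other Base R ways of getting the same group means, split() with sapply() and aggregate(), for comparison.

```r
# Hypothetical mini dataset with two product groups
sales <- data.frame(
  product = c('A', 'B', 'A', 'B'),
  amt     = c(100, 200, 300, 400)
)

# Mean of amt within each product group
group_means <- tapply(X = sales$amt, INDEX = sales$product, FUN = mean)
group_means                      # A = 200, B = 300

# The same grouping written with split() and sapply()
split_means <- sapply(split(sales$amt, sales$product), mean)

# aggregate() returns the result as a data frame instead
aggregate(amt ~ product, data = sales, FUN = mean)
```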
That’s about it for the most used functions from the apply family in R.
Before You Go
Well, the apply functions are great mapping functions. In your daily job as a data scientist, there will be several occasions where a mapping function is a better choice than a loop, whether for performance or simply for more readable code.
With this quick intro, I am sure you can now apply the family of apply functions to your data.
- apply: use to apply the same function to every row or column of a dataset.
- lapply: use to apply a function to each element of a list. Returns a list of the same length.
- sapply: friendly version of lapply that returns a vector or matrix when the result can be simplified.
- tapply: use to apply a function to each group of values. Almost like a group by.
Code in GitHub: