Writing code is, mostly, a collaborative effort. Even if you are working alone, writing clean and readable code will save your future self from some really tough headaches.
If you don’t have a background in Software Engineering (which most Data Scientists and Analysts don’t), it may take you a while to understand that writing clean code is one of the most important traits of an excellent data professional. That sense also tends to develop as you become more senior and need to look into other people’s code more often.
If you code in R or Python, you have surely stumbled upon the concept of a function. Functions are the main feature that makes our code reusable and modular, making it easier to build something from scratch. Without functions, you are probably building messy scripts that are hard to maintain, debug and hand over to other developers.
Nevertheless, writing functions is not a silver bullet for clean code. Even if you put all your code into functions, there are still a lot of things that can make your script difficult to understand and tweak.
So, in this post, we’ll explore some cool best practices that will power up your R Functions. These tips range from documentation and styling to overall code structure and should help you the next time you have to build some R script.
Let’s do this!
Indent, Indent, Indent
Contrary to Python, R doesn’t need proper indentation to work. For instance, the following function works just fine:
print_names <- function(vector_names) {
for (name in vector_names) {
print(paste('Hello',name))
}
}
But it’s super hard to read.
Where does the loop start? What is inside the loop? What are we iterating on? Indenting is an elegant way to guide the reader through your code, such as:
print_names <- function(vector_names) {
  for (name in vector_names) {
    print(paste('Hello', name))
  }
}
The function works exactly the same, but it’s cleaner and easier to understand. You can follow the spaces to check where the loop starts and how it works inside the function.
Another cool usage of indentation is when we call functions with a lot of arguments or use nested functions. For example, imagine the following function that uses the mtcars dataframe to create a brand column based on the rownames:
get_car_brand <- function(car_model) {
  car_brand <- sapply(strsplit(car_model, ' '), '[', 1)
  return(car_brand)
}
The nested call is a bit hard to read, as we are using a combination of sapply and strsplit. We can leverage indentation to make the function a bit more readable:
get_car_brand <- function(car_model) {
  car_brand <- sapply(
    strsplit(car_model, ' '),
    '[',
    1
  )
  return(car_brand)
}
It may take a while to adjust to this way of feeding arguments into a function. But, if you don’t end up using the pipe operator %>%, it’s a cleaner way to read nested functions.
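As a sketch of that alternative, here is how the same logic could look with the magrittr pipe (assuming the magrittr package is installed; the function name get_car_brand_piped is hypothetical):

```r
library(magrittr)

# Pipe-based version of get_car_brand: each step
# reads top-to-bottom instead of inside-out
get_car_brand_piped <- function(car_model) {
  car_model %>%
    strsplit(' ') %>%
    sapply('[', 1)
}

get_car_brand_piped(rownames(mtcars))
```

Both styles avoid the deeply nested one-liner; which one you pick is largely a matter of team convention.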
Short Functions over Long Functions
If you have a function that is too long, break it down into small sub-functions that are easier to understand and debug.
Let’s use the mtcars dataframe as an example, again. With the function below we are:
- Creating the brand of the car on the mtcars dataframe;
- Aggregating hp by brand;
- Plotting the horsepower by brand in a scatter plot.
plot_average_horsepower_brand <- function(cars_df) {
  cars_df_copy <- cars_df
  cars_df_copy['car_brand'] <- sapply(
    strsplit(rownames(cars_df_copy), ' '),
    '[',
    1
  )
  aggregate_brand <- aggregate(
    cars_df_copy$hp,
    by = list(cars_df_copy$car_brand),
    FUN = mean
  )
  sort_order <- factor(
    aggregate_brand[order(aggregate_brand[,'x']),]$Group.1
  )
  ggplot(
    data = aggregate_brand,
    aes(x = factor(Group.1, levels = sort_order), y = x, color = 'darkred')
  ) +
    geom_point() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
}
If I call plot_average_horsepower_brand(mtcars), this function will work and yield the following plot:

The main issue with this function? It tries to do everything in the same block of code, making it harder to debug. Arguably, it’s also a bit overwhelming to look at, as we are doing so much inside it.
There are several tactics we can use to solve this. One cool way is to break this code down into smaller chunks, making it reusable and modular. An example:
create_brand <- function(cars_df) {
  brands <- sapply(
    strsplit(rownames(cars_df), ' '),
    '[',
    1
  )
  return(brands)
}

mean_by_variable <- function(df, agg_var, by_var) {
  aggregate_brand <- aggregate(
    df[,agg_var],
    by = list(df[,by_var]),
    FUN = mean
  )
  return(aggregate_brand)
}

plot_sorted_scatter <- function(cars_data, agg_var, by_var) {
  # Add Brand
  cars_data$brand <- create_brand(cars_data)
  # Create Aggregation
  agg_data <- mean_by_variable(cars_data, agg_var, by_var)
  # Sort
  sort_order <- factor(
    agg_data[order(agg_data[,'x']),]$Group.1
  )
  ggplot(
    data = agg_data,
    aes(x = factor(Group.1, levels = sort_order), y = x, color = 'darkred')
  ) +
    geom_point() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
}

plot_sorted_scatter(mtcars, 'hp', 'brand')
You may be wondering if we are not over-complicating our code. But by making it modular we can:
- Debug more easily. If I need to change something in how the brand is created, I just look into create_brand. If I need to change the aggregation function, I just change mean_by_variable.
- Expand the logic to other variables or plots.
- Add additional steps to plot_sorted_scatter more easily.
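To make that last point concrete, here is a sketch of that reuse, assuming the three functions above (create_brand, mean_by_variable and plot_sorted_scatter) are already defined in the session:

```r
# The same pipeline works for any numeric column of mtcars:
plot_sorted_scatter(mtcars, 'mpg', 'brand')  # average mpg per brand
plot_sorted_scatter(mtcars, 'wt', 'brand')   # average weight per brand

# The intermediate helpers are also usable on their own
mtcars_copy <- mtcars
mtcars_copy$brand <- create_brand(mtcars_copy)
mean_by_variable(mtcars_copy, 'mpg', 'brand')
```

None of this required touching the original functions, which is exactly the payoff of keeping them small and generic.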
Breaking your function down into smaller ones is a general best practice for most programming languages. In R, it’s no different and you will feel the difference when you pick up your scripts again after letting them rest for a while.
Leverage Docstrings
At its core, you can document R functions by putting comments on top of them. For more complex functions, where you want to explain the arguments in detail, that is not practical. Another setback is that R can’t access that documentation when we type ?custom_function, unless we use a small trick (explained next!).
For instance, if we write a function that returns the multiplicative inverse of a number, such as:
return_inverse <- function(x) {
  #' Computes the multiplicative inverse of the input
  return(1/x)
}
And later call:
?return_inverse
We will get an error in the R console:

We would like to access the Computes the multiplicative inverse of the input text just like we access the documentation of other built-in (or external library) functions.
Luckily, we can use docstring to provide more advanced documentation for our functions, which will help others debug your code a bit better. For instance:
library(docstring)
?return_inverse
Note: you may have to restart R to access the function documentation (after you load docstring). The code above will now return:

Although not perfect, we now have access to specific documentation for return_inverse anywhere in our environment. The cool thing is that we can even leverage parameters to make the documentation better:
return_inverse <- function(x) {
  #' Multiplicative Inverse of Number
  #'
  #' @description Computes the multiplicative inverse of the input
  #'
  #' @param x Real number.
  return(1/x)
}
?return_inverse
This will yield something really cool: extensive documentation of our function!

docstring is super cool, as it helps you build documentation for a function that mimics the extensive documentation of base R code. You can check more about docstring here, including extra parameters I didn’t cover.
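As a sketch of what those extra parameters look like, docstring understands roxygen2-style tags such as @return and @examples in addition to @description and @param:

```r
# library(docstring)  # load once at the top of the script to enable ?return_inverse

return_inverse <- function(x) {
  #' Multiplicative Inverse of Number
  #'
  #' @description Computes the multiplicative inverse of the input
  #'
  #' @param x Real number (non-zero).
  #' @return The multiplicative inverse 1/x.
  #' @examples
  #' return_inverse(2)
  #' return_inverse(10)
  return(1/x)
}
```

With docstring loaded, ?return_inverse will now render these extra sections just like a help page for a base R function.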
Explicit vs. Implicit Return
You might notice that, throughout my functions, I mostly used explicit returns (using the return keyword to wrap the last object at the end of the function). Keep in mind that this is more of a personal preference than a proper "best practice". But, arguably, explicit returns make it easier for the reader to understand what each function yields. Looking at the last instruction of the function can sometimes get confusing.
For instance, the following two functions will yield exactly the same:
create_brand_explicit <- function(cars_df) {
  brands <- sapply(
    strsplit(rownames(cars_df), ' '),
    '[',
    1
  )
  return(brands)
}

create_brand_implicit <- function(cars_df) {
  sapply(
    strsplit(rownames(cars_df), ' '),
    '[',
    1
  )
}
Another argument for explicit returns is that the variable name can give the reader cues about what your code does. In more complex functions, it’s easier to draw the reader’s attention to the explicit return statement. An example of this is the function create_brand_explicit: by stating that the function returns a brands variable, the reader expects this object to contain data regarding the brands of the cars in the dataframe.
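To be clear that the two styles are functionally equivalent, here is a minimal, hypothetical pair of functions (f_explicit and f_implicit are illustrative names, not from the post):

```r
# Explicit return: the named variable signals intent
f_explicit <- function(x) {
  doubled <- x * 2
  return(doubled)
}

# Implicit return: the value of the last expression is returned
f_implicit <- function(x) {
  x * 2
}

identical(f_explicit(21), f_implicit(21))
```

The choice between the two only affects readability, never the result.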
Don’t Load Libraries Inside Functions
Finally, don’t load libraries inside your functions; always put your dependencies at the top of your scripts.
For instance, in our plot_sorted_scatter, where we’ve used ggplot2, there might be a temptation to do the following:
plot_sorted_scatter <- function(cars_data, agg_var, by_var) {
  library(ggplot2)
  # Add Brand
  cars_data$brand <- create_brand(cars_data)
  # Create Aggregation
  agg_data <- mean_by_variable(cars_data, agg_var, by_var)
  # Sort
  sort_order <- factor(
    agg_data[order(agg_data[,'x']),]$Group.1
  )
  ggplot(
    data = agg_data,
    aes(x = factor(Group.1, levels = sort_order), y = x, color = 'darkred')
  ) +
    geom_point() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
}
Although nothing prevents you from doing this, it is considered bad practice, as you want your dependencies stated at the beginning of your code. This lets users know how they must set up their system to run your script from top to bottom. And as most R scripts depend on external libraries, this is key to helping others achieve the same results you achieved on your local machine or server.
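A common pattern, shown here only as a sketch, is to declare every dependency in one place at the very top of the script and fail fast with a clear message if any package is missing:

```r
# Top of the script: all dependencies in one place.
# requireNamespace() checks availability without attaching the package.
deps <- c('ggplot2', 'docstring')
missing <- deps[!sapply(deps, requireNamespace, quietly = TRUE)]
if (length(missing) > 0) {
  stop('Missing packages: ', paste(missing, collapse = ', '))
}

library(ggplot2)
library(docstring)
```

Anyone opening the script now sees at a glance what they need to install before running it.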
Thank you for taking the time to read this post! I hope this is useful for your future R Scripts.
Functions are a really exciting concept in programming languages. They are flexible and can power up your scripts in terms of speed and modularity. Learning these best practices will help you share code more often and make it easier for others (or, more importantly, your future self!) to understand and debug your functions.
I’ve set up an introduction to R and a Bootcamp on learning Data Science on Udemy. Both courses are tailored for beginners and I would love to have you around!

Here is a gist with the code from this post: