
Writing Better R Functions – Best Practices and Tips

Learn common best practices and tips that make your code more readable and easier for others to debug

Photo by Patrick @Unsplash.com

Writing code is, mostly, a collaborative effort. Even if you are working alone, writing clean and readable code will save your future self from some really tough headaches.

If you don’t have a background in Software Engineering (which most Data Scientists and Analysts don’t), it may take you a while to understand that writing clean code is one of the most important traits of an excellent data professional. That sense also tends to develop as you become more senior and need to look into other people’s code more often.

If you code in R or Python, you have surely stumbled upon the concept of a function. Functions are the main feature that makes our code reusable and modular, making it easier to build something from scratch. Without functions, you would probably be building messy scripts that are hard to maintain, debug and hand over to other developers.

Nevertheless, writing functions is not a silver bullet for clean code. Even if you put all your code into functions, there are still plenty of things that can make your script difficult to understand and tweak.

So, in this post, we’ll explore some cool best practices that will power up your R functions. These tips range from documentation and styling to overall code structure and should help you the next time you build an R script.

Let’s do this!


Indent, Indent, Indent

Unlike Python, R doesn’t need indentation to work. For instance, the following function works just fine:

print_names <- function(vector_names) {
for (name in vector_names) {
print(paste('Hello',name))
}
}

But it’s super hard to read.

Where does the loop start? What is inside the loop? What are we iterating on? Indenting functions is an elegant way to let a reader look into your code – such as:

print_names <- function(vector_names) {
  for (name in vector_names) {
    print(paste('Hello',name))
  }
}

The function works exactly the same, but it’s cleaner and easier to understand. You can follow the indentation to see where the loop sits and how it works within the function.

Another cool use of indentation is when we call functions with a lot of arguments or use nested functions. For example, imagine the following function that uses the mtcars dataframe to create a brand column based on the rownames:

get_car_brand <- function(car_model) {
  car_brand <- sapply(strsplit(car_model, ' '), '[', 1)
  return (car_brand)
}

The nested call is a bit hard to read as we are combining sapply and strsplit. We can leverage indentation to make the function a bit more readable:

get_car_brand <- function(car_model) {
  car_brand <- sapply(
    strsplit(car_model, ' '),
    '[',
    1
  )
  return (car_brand)
}

It may take a while to adjust to this way of laying out arguments. But, if you don’t end up using the pipe operator (%>%), it’s a cleaner way to write nested calls.
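For reference, the same extraction can also be written as a pipeline instead of a nested call. This sketch assumes the magrittr package (which provides %>%) is installed; the function name is hypothetical:

```r
# Same brand extraction written as a pipeline instead of a nested call.
# Assumes the magrittr package (which provides %>%) is available.
library(magrittr)

get_car_brand_piped <- function(car_model) {
  car_model %>%
    strsplit(' ') %>%   # split each model name on spaces
    sapply('[', 1)      # keep the first token: the brand
}

get_car_brand_piped(c('Mazda RX4', 'Datsun 710'))
# "Mazda"  "Datsun"
```

Each step reads left to right, which is why many R users prefer the pipe over deeply nested calls.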


Short Functions over Long Functions

If you have a function that is too long, break it down into small sub-functions that are easier to understand and debug.

Let’s use the mtcars dataframe as an example, again. With the function below we are:

  • Creating a brand column on the mtcars dataframe;
  • Aggregating the hp by brand;
  • Plotting the average horsepower by brand in a scatter plot.

plot_average_horsepower_brand <- function(cars_df) {

  cars_df_copy <- cars_df

  cars_df_copy['car_brand'] <- sapply(
    strsplit(rownames(cars_df_copy), ' '), 
    '[', 
    1
  )

  aggregate_brand <- aggregate(
    cars_df_copy$hp,
    by = list(cars_df_copy$car_brand),
    FUN = mean
  )

  sort_order <- factor(
    aggregate_brand[order(aggregate_brand[,'x']),]$Group.1
  )

  ggplot(
    data = aggregate_brand,
    aes(x=factor(Group.1, levels=sort_order), y=x, color='darkred')
  ) + geom_point() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) 
}

If I call plot_average_horsepower_brand(mtcars), this function will work and yield the following plot:

Average HorsePower by Brand – image by author

The main issue with this function? It tries to do everything in a single block of code, making it harder to debug. It’s also a bit overwhelming to look at, as so much is going on inside.

There are several tactics we can use to solve this. A good one is to break the code down into smaller chunks, making it reusable and modular. An example:

create_brand <- function(cars_df) {

  brands <- sapply(
    strsplit(rownames(cars_df), ' '),
    '[',
    1
  )

  return (brands)
}

mean_by_variable <- function(df, agg_var, by_var) {

  aggregate_brand <- aggregate(
    df[,agg_var],
    by = list(df[,by_var]),
    FUN = mean
  )

  return (aggregate_brand)

}

plot_sorted_scatter <- function(cars_data, agg_var, by_var) {

  # Add Brand
  cars_data$brand <- create_brand(cars_data)

  # Create Aggregation
  agg_data <- mean_by_variable(cars_data, agg_var, by_var)

  # Sort 
  sort_order <- factor(
    agg_data[order(agg_data[,'x']),]$Group.1
  )

  ggplot(
    data = agg_data,
    aes(x=factor(Group.1, levels=sort_order), y=x, color='darkred')
  ) + geom_point() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

}

plot_sorted_scatter(mtcars, 'hp', 'brand')

You may be wondering whether we are over-complicating our code. But by making it modular we can:

  • Debug more easily. If I need to change something in the car-brand creation, I just look into create_brand. If I need to change the aggregation function, I just change mean_by_variable.
  • Expand the logic for other variables or plots.
  • Add additional steps to the plot_sorted_scatter more easily.
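As a quick illustration of that reuse (a sketch, assuming the helper functions defined above are in scope), the same helpers handle a different variable with no new code:

```r
# Reusing the modular helpers for a different variable:
# average mpg (instead of hp) by brand, with no new plotting code.
plot_sorted_scatter(mtcars, 'mpg', 'brand')

# The aggregation helper is also usable on its own:
mtcars_copy <- mtcars
mtcars_copy$brand <- create_brand(mtcars_copy)
mean_by_variable(mtcars_copy, 'mpg', 'brand')
```

Nothing inside the helpers had to change – only the arguments we pass them.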

Breaking your functions down into smaller ones is a general best practice in most programming languages. R is no different, and you will feel the difference when you pick up your scripts again after letting them rest for a while.


Leverage Docstrings

At its core, you can document R functions by putting comments on top of them. For more complex functions where you want to explain the arguments in detail, that is not practical. Another setback is that R can’t surface that documentation when we type ?custom_function, unless we use a small trick (explained next!).

For instance, if we write a function that returns the multiplicative inverse of a number, such as:

return_inverse <- function(x) {
  #' Computes the multiplicative inverse of the input
  return(1/x)  
}

And later call:

?return_inverse

We will get an error in the R console stating that no documentation for return_inverse is available.

We would like to access the "Computes the multiplicative inverse of the input" description just like we access documentation from other built-in (or external library) functions.

Luckily, we can use the docstring package to provide more advanced documentation for our functions, which will help others debug your code a bit better. For instance:

library(docstring)
?return_inverse

Note: you may have to restart R to access the function documentation (after you load docstring). The code above will now bring up the documentation page:

Although not perfect, we now have access to specific documentation for return_inverse anywhere in our environment. The cool thing is that we can leverage roxygen-style parameters to make the documentation even better:

return_inverse <- function(x) {
  #' Multiplicative Inverse of Number
  #' 
  #' @description Computes the multiplicative inverse of the input
  #' 
  #' @param x: Real number.
  return(1/x)

}

?return_inverse

This will yield something really cool: extensive documentation of our function!

docstring is super cool as it helps you build documentation that mimics the extensive documentation of base R functions. You can check more about docstring here, including extra parameters I didn’t cover.
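As one more sketch of what those extra parameters look like, docstring also understands other roxygen-style tags such as @return and @examples (the exact wording of the docs below is illustrative):

```r
library(docstring)

return_inverse <- function(x) {
  #' Multiplicative Inverse of Number
  #'
  #' @description Computes the multiplicative inverse of the input
  #'
  #' @param x Real number (must be non-zero).
  #'
  #' @return The multiplicative inverse of x, i.e. 1/x.
  #'
  #' @examples
  #' return_inverse(4)    # 0.25
  return(1/x)
}

?return_inverse
```

Readers of your help page then see not only what the function does, but also what it returns and how to call it.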


Explicit vs. Implicit Return

You might have noticed that throughout my functions I mostly used explicit returns (wrapping the last object at the end of the function in the return keyword). Keep in mind that this is more of a personal preference than a proper "best practice".

But, arguably, explicit returns make it easier for the reader to understand what each function yields. Looking at the last instruction of a function can sometimes get confusing.

For instance, the following two functions yield exactly the same result:

create_brand_explicit <- function(cars_df) {
  brands <- sapply(
    strsplit(rownames(cars_df), ' '), 
    '[', 
    1
  )
  return (brands)
}

create_brand_implicit <- function(cars_df) {
  sapply(
    strsplit(rownames(cars_df), ' '), 
    '[', 
    1
  )  
}

Another argument for explicit returns is that the variable name can give the reader cues about what your code does. In more complex functions, it’s easier to draw the reader’s attention to the explicit return statement. An example is create_brand_explicit: by stating that the function returns a brands variable, the reader can expect this object to contain data about the brands of the cars in the dataframe.


Don’t Load Libraries Inside Functions

Finally, don’t load libraries inside your functions – always put your dependencies at the top of your scripts.

For instance, in our plot_sorted_scatter, where we use ggplot2, there might be a temptation to do the following:

plot_sorted_scatter <- function(cars_data, agg_var, by_var) {
  library(ggplot2)

  # Add Brand
  cars_data$brand <- create_brand(cars_data)

  # Create Aggregation
  agg_data <- mean_by_variable(cars_data, agg_var, by_var)

  # Sort 
  sort_order <- factor(
    agg_data[order(agg_data[,'x']),]$Group.1
  )

  ggplot(
    data = agg_data,
    aes(x=factor(Group.1, levels=sort_order), y=x, color='darkred')
  ) + geom_point() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

}

Although nothing will prevent you from doing this, it is considered bad practice: you want your dependencies stated at the beginning of your code. This lets users know how they must set up their system to run your script from top to bottom – and since most R scripts depend on external libraries, this is key to helping others reproduce the results you achieved on your local machine or server.
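A minimal script skeleton following this convention might look like the sketch below (function bodies elided; the comments mark where the earlier definitions would go):

```r
# All dependencies declared once, at the very top of the script,
# so a reader immediately knows what must be installed.
library(ggplot2)
library(docstring)

# --- function definitions (no library() calls inside them) ---

plot_sorted_scatter <- function(cars_data, agg_var, by_var) {
  # ... same body as before, without library(ggplot2) ...
}

# --- script entry point ---
plot_sorted_scatter(mtcars, 'hp', 'brand')
```

Anyone opening the file can tell at a glance which packages they need before running anything.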


Thank you for taking the time to read this post! I hope it’s useful for your future R scripts.

Functions are a really exciting concept in programming languages. They are flexible and can power up your scripts in terms of speed and modularity. Learning these best practices will help you share code more often and make it easier for others (or, more importantly, your future self!) to understand and debug your functions.

I’ve set up an introduction to R and a Bootcamp on learning Data Science on Udemy. Both courses are tailored for beginners and I would love to have you around!

Data Science Bootcamp: Your First Step as a Data Scientist - Image by Author



