Cleaner R Code with Functional Programming

Published in Towards Data Science, Feb 19, 2019

Introduction

Due to a job switch, I am a recent R-to-Python convert. However, a few side-projects keep me switching between the two languages daily. In addition to giving me a headache, this daily back-and-forth has made me think a lot about programming paradigms. Specifically, I’ve really become an evangelist for the functional programming paradigm in R. I want to give a little insight into what functional programming (FP) is and how it can give you superpowers (sort of).

Why Does it Matter?

R has a unique place in the programming world. It is used by hundreds of thousands of people every day around the world for analyzing and manipulating data. Its users are rarely trained in pure computer science and, in many cases, R code is only run once. This combination can lead R programs to be sloppy and inefficient. How did we get here? The reasoning is usually: if it works it works, right?

If this “Why change?” mentality sounds familiar, this blog post is for you. R is in fact a full (albeit domain-specific) programming language influenced by rich mathematical theory. Learning the basics of FP will help you write better code, and thus make you a better statistician, data scientist, or whatever we’ve decided to call ourselves by the time you’re reading this.

What is Functional Programming?

I’m not going to give a rigorous definition. You can go to Wikipedia for that.

Simply put, FP is exactly what it sounds like. If you are doing something more than once, it belongs in a function. In FP, functions are the primary method with which you should carry out tasks. All actions are just (often creative) implementations of functions you’ve written.

Once you get into it, the advantages become clear. Your code is easier to fix and maintain since you’ve segmented it into easily serviceable pieces. Your code is easier to read since, if you named everything right, it can look closer to plain English. Replacing long blocks of code with function calls can also help you cut down on spaghetti and pyramid of doom code.

Ok, how can we retrain our brains for that sweet, sweet FP?

Step 0: Learn the Basics

In order to write a truly “functional” function, it must be pure. A pure function has two rules:

It must be deterministic
That is, every time you run this function with the same inputs, it must have the same output. Every. Single. Time. “But what about functions and statistical processes with random components?” you ask? Simply set a seed, either inside the function, or let the seed be a parameter to the function. This is important for reproducible science, anyway.
It can’t have side effects
This means that your function cannot touch or change anything outside of it. This means you should probably never be using the global assignment (<<-) operator. Curiously, this also means the print() function disobeys FP.
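To make the two rules concrete, here is a minimal sketch. The names noisy_mean_impure, noisy_mean, call_count, and the seed argument are made up for illustration; they are not from any package.

```r
# Impure: the result depends on hidden RNG state, and the function
# changes a variable outside itself with <<- (a side effect).
call_count <- 0
noisy_mean_impure <- function(x) {
  call_count <<- call_count + 1
  mean(x + rnorm(length(x)))
}

# Deterministic: the seed is an explicit input, so the same arguments
# always give the same output.
# Caveat: set.seed() still resets the session's RNG state, so this is
# deterministic but not perfectly side-effect free.
noisy_mean <- function(x, seed = 42) {
  set.seed(seed)
  mean(x + rnorm(length(x)))
}

identical(noisy_mean(1:10), noisy_mean(1:10))
# [1] TRUE
```

Passing the seed as a parameter (rather than hard-coding it) keeps the function reproducible while still letting the caller vary the randomness on purpose.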

Step 1: Ditch the Loops

As my grad school adviser once told me,

If you’re writing loops in R, you’re probably doing something wrong.

(He told me this, of course, as he was debugging my third layer of nested for loops.)

But… loops are so fundamental! Why should we seek to use them as sparingly as possible? There are two reasons, the first of which is specific to R.

The Whole Language is Already Vectorized
Even if you’ve never heard that word before, you knew this already. Vectorization is the reason you write this:

x <- 1:10
y <- 2 * x

instead of

x <- 1:10
y <- numeric(length(x))      # preallocate the result
for (i in seq_along(x)) {
    y[i] <- 2 * x[i]         # fill element i (not overwrite y)
}
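The same holds well beyond arithmetic. As a quick sketch (the vector x here is just an illustrative input), a few more base R calls that work on a whole vector at once:

```r
x <- c(1, 4, 9, 16)

sqrt(x)                         # element-wise square root
# [1] 1 2 3 4

x > 5                           # element-wise comparison
# [1] FALSE FALSE  TRUE  TRUE

ifelse(x > 5, "big", "small")   # element-wise if/else
# [1] "small" "small" "big"   "big"

paste0("item_", x)              # element-wise string building
# [1] "item_1"  "item_4"  "item_9"  "item_16"
```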

Loops are Slow — Use Applies!
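The apply() function, and functions like it, are the building blocks upon which R’s FP capabilities are fully realized. In most languages, loops and applies (often called “maps”) run at about the same speed. In R, the gap is widest against loops that grow a vector or data frame on every iteration; against a loop that preallocates its output, an apply is often only marginally faster, but it is shorter and harder to get wrong.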

R’s base has a few applies, but the really nifty ones are found in purrr. More on this later.
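Before getting to purrr, here is a small base R sketch of swapping a loop for an apply. The list called samples is invented for the example; vapply() is the base apply that also checks the type of each result.

```r
samples <- list(a = c(2, 4, 6), b = c(10, 20), c = 5)   # illustrative data

# Loop version: preallocate, then fill one element at a time.
means_loop <- numeric(length(samples))
for (i in seq_along(samples)) {
  means_loop[i] <- mean(samples[[i]])
}

# Apply version: call mean() on each element; numeric(1) says every
# result must be a single number, or vapply() throws an error.
means_apply <- vapply(samples, mean, numeric(1))

means_apply
#  a  b  c 
#  4 15  5 

identical(means_loop, unname(means_apply))
# [1] TRUE
```

Note that the apply version keeps the names of the list for free, while the loop version has to track them separately.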

Step 2: Pipes, The Tidyverse, and More Pipes

If you haven’t heard of the Tidyverse yet, get ready to meet your new best friend. But first, let’s meet the star of the Tidyverse, the pipe operator:

The pipe (%>%) comes from the magrittr package and is re-exported by several others, but it is most commonly accessed by loading either dplyr or tidyverse. Oh, and if you think it’s a pain to type %>% repeatedly, RStudio gives you a shortcut: Ctrl-Shift-M (Cmd-Shift-M on a Mac).

So, what does it do? Simply put, the pipe takes what’s on the left, and makes it the first argument of what’s on the right. For example:

library(magrittr)  # provides %>% (also loaded by dplyr and tidyverse)
add <- function(x, y) x + y
3 %>% add(5)
# 8

This may seem more verbose than simply typing add(3, 5), but this allows you to write complex operations as pipelines:

3 %>%
  add(5) %>%
  add(1) %>%
  add(3) %>%
  add(7)
 
# 19
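For comparison, here is a sketch of the same computation written as nested calls. Nothing here is new API; it only shows that the pipeline and the nested form are the same thing, read in opposite directions.

```r
library(magrittr)  # provides %>%

add <- function(x, y) x + y

# Nested calls: you have to read from the inside out.
nested <- add(add(add(add(3, 5), 1), 3), 7)

# Pipeline: the same steps, read top to bottom.
pipelined <- 3 %>%
  add(5) %>%
  add(1) %>%
  add(3) %>%
  add(7)

identical(nested, pipelined)
# [1] TRUE
```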

Too trivial? Check out this actual snippet from one of my consulting projects:

data_clean <- data_raw %>%
  isFinal() %>%
  dropLastFiling() %>%
  getAccStats() %>%
  getPctIncs() %>%
  capOrDrop(inc_vars, cap = 3)

You don’t need to see what the functions do to know I’m hiding a lot of complexity here. However, you can almost read this in English:

Take the raw data
Get whether or not it is the last tax filing
Drop the last tax filing for each organization
Get accounting statistics
Get the year-over-year percent increases
Drop or cap these variables where appropriate (I use a cap of 300%)

Without this modularization, this code would be nearly impossible to debug. Got a problem dropping the last tax filings for each organization? You’ll have to read through hundreds of lines of spaghetti code. Here, you simply find where dropLastFiling is defined, and fix it there. Furthermore, you can more clearly see what the steps are to prepare your data.
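To give a feel for what one of these steps might look like, here is a hypothetical stand-in for a function like dropLastFiling(). It is not the code from my project: the function name drop_last_filing, the columns org_id and year, and the toy data are all assumptions made up for this sketch. The pattern is what matters: data frame in as the first argument, data frame out.

```r
library(dplyr)

# Hypothetical pipeline step: drop each organization's most recent filing.
# Assumes columns named org_id and year (illustrative only).
drop_last_filing <- function(data) {
  data %>%
    group_by(org_id) %>%
    filter(year < max(year)) %>%   # keep everything before the latest year
    ungroup()
}

# Toy data, invented for the example.
filings <- data.frame(
  org_id = c("A", "A", "A", "B", "B"),
  year   = c(2015, 2016, 2017, 2016, 2017)
)

kept <- filings %>% drop_last_filing()
kept$year
# [1] 2015 2016 2016
```

Because the step takes and returns a data frame, it can be tested on a five-row toy table like this one and then dropped into a longer pipeline unchanged.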

Now, we’re ready to get started with the Tidyverse. tidyverse is actually a collection of packages, and you may not need them all. Most of what we need is actually contained in dplyr.

Anyway, tidyverse is brimming with easy-to-pipe functions specifically built for common data manipulation tasks. Here are some of the most commonly used ones:

select(): Select which columns to keep (or drop)
filter(): Select which rows to keep (or drop)
arrange(): Sort rows by the values of given columns
rename(): Rename columns
mutate(): Make new columns from existing ones
group_by(): Organize data so that it is grouped by some categorical variable
summarize(): Similar to mutate(), but collapses group_by()ed data into summary statistics

Example:

mtcars %>%
  filter(am == 0) %>%         # Consider automatic cars only (am = 0)
  group_by(cyl) %>%           # Group them by the number of cylinders
  summarize(                  # Get the mean and sd of fuel
    mean_mpg = mean(mpg),     # economy by cylinder
    sd_mpg = sd(mpg)
  ) %>%
  ungroup()                   # Undo effects of group_by()
                              # (Not always req, but good practice) 
 
# Output:
# A tibble: 3 x 3
#     cyl mean_mpg sd_mpg
#   <dbl>    <dbl>  <dbl>
# 1     4     22.9   1.45
# 2     6     19.1   1.63
# 3     8     15.0   2.77
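The remaining verbs from the list chain together the same way. A short sketch on the same mtcars data (the new column names weight and weight_lbs are my own choices for illustration; wt in mtcars is recorded in units of 1000 lbs):

```r
library(dplyr)

top3 <- mtcars %>%
  select(mpg, cyl, wt) %>%                 # keep three columns
  rename(weight = wt) %>%                  # give wt a clearer name
  mutate(weight_lbs = weight * 1000) %>%   # new column from an existing one
  arrange(desc(mpg)) %>%                   # sort rows, highest mpg first
  head(3)                                  # keep the top three rows

names(top3)
# [1] "mpg"        "cyl"        "weight"     "weight_lbs"
```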

Step 3: Get Comfortable with Applies and Maps

The package `purrr` is short for “Pure R”. The third R was added for the cat mascot, I suppose.

We still have a gap in our toolkit: we’re not allowed to use loops, but there are some tasks that aren’t vectorized for us already! What’s a data analyst to do?

The solution is to use applies (also called maps). Maps take a collection of things, and apply some function to each one of those things. Here’s a diagram taken directly from RStudio’s purrr cheat sheet (Credit: Mara Averick):

[Figure: map diagram from RStudio’s purrr cheat sheet]

Sidenote: The dplyr package actually gets its name from applies. dplyr = data + apply + R.

The purrr package contains a ridiculous number of maps from which to choose. Seriously, check out that cheatsheet!
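As a small taste, here is a sketch of the most common variants. The list called scores is invented for the example; the suffix on each map function says what type of vector you get back.

```r
library(purrr)

scores <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# map() always returns a list, one element per input.
as_list <- map(scores, mean)

# map_dbl() returns a numeric vector: a = 2, b = 5, c = 8
means <- map_dbl(scores, mean)

# map_lgl() returns a logical vector: a = FALSE, b = TRUE, c = TRUE
has_big <- map_lgl(scores, ~ any(.x > 5))

# map_chr() returns a character vector: "1-2-3", "4-5-6", "7-8-9"
labels <- map_chr(scores, ~ paste(.x, collapse = "-"))
```

The ~ .x shorthand is purrr’s way of writing a small anonymous function; function(v) any(v > 5) would work just as well.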

Example, bringing it all together: Suppose I had a vector of strings, and I wanted to extract the longest word in each. There is no vectorized function that will do this for me. I will need to split the string by the space character and get the longest word. For dramatic effect, I also upper-case the strings and paste them back together:

library(tidyverse)
library(purrr)
 
sentences <- c(
  "My head is not functional",
  "Programming is hard",
  "Too many rules"
)
 
getLongestWord <- function(words) {
  word_lengths <- str_length(words)                # characters per word
  longest_word <- words[which.max(word_lengths)]   # first word of maximal length
  return(longest_word)
}
 
sentences %>% 
  toupper() %>% 
  str_split(' ') %>% 
  map_chr(getLongestWord) %>% 
  str_c(collapse = ' ')
 
# [1] "FUNCTIONAL PROGRAMMING RULES"

Bonus Step: Know the Lingo

In other languages, some of the lingo of FP is built-in. Specifically, there are three higher-order functions that make their way into almost every language, functional or not: map (which we’ve already covered), reduce, and filter.

A higher-order function is a function that either takes a function as an argument, returns a function, or both.
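A minimal sketch of all three cases, in base R. The names apply_twice, multiplier, times2, and compose2 are made up for illustration.

```r
# Takes a function as an argument.
apply_twice <- function(f, x) f(f(x))
apply_twice(sqrt, 16)
# [1] 2

# Returns a function.
multiplier <- function(k) function(x) k * x
times2 <- multiplier(2)
times2(21)
# [1] 42

# Does both: takes two functions, returns their composition.
compose2 <- function(f, g) function(x) g(f(x))
sqrt_then_double <- compose2(sqrt, times2)
sqrt_then_double(9)
# [1] 6
```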

Filtering in R is easy. For data frames, we can use dplyr::filter() from the Tidyverse. For most other things, we can simply use R’s vectorization. However, when all else fails, base R does indeed have a Filter() function. Example:

Filter(function(x) x %% 2 == 0, 1:10)
# [1]  2  4  6  8 10
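For a plain vector like this one, the vectorized route mentioned above is just logical indexing. A quick sketch showing the two give the same answer:

```r
x <- 1:10

# Vectorized: build a logical vector, then index with it.
evens_indexed <- x[x %% 2 == 0]
evens_indexed
# [1]  2  4  6  8 10

# Same result as the higher-order version.
evens_filtered <- Filter(function(v) v %% 2 == 0, x)
identical(evens_indexed, evens_filtered)
# [1] TRUE
```

Filter() earns its keep when the predicate is not vectorized, for example when the elements are themselves lists or data frames.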

Similarly, you probably won’t ever need Reduce() in R. But just in case, here’s how it works: Reduce() takes a collection and a binary function (i.e., one that takes two parameters) and successively applies that function two-at-a-time along the collection, cumulatively. Example:

wrap <- function(a, b) paste0("(", a, " ", b, ")")
Reduce(wrap, c("A", "B", "C", "D", "E"))
# [1] "((((A B) C) D) E)"
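Two of Reduce()’s optional arguments make the “cumulatively” part easier to see. This sketch uses only base R; wrap is redefined here so the block runs on its own.

```r
wrap <- function(a, b) paste0("(", a, " ", b, ")")

# Default: fold from the left and return only the final value.
Reduce(`+`, 1:5)
# [1] 15

# accumulate = TRUE keeps every intermediate result (a running total).
Reduce(`+`, 1:5, accumulate = TRUE)
# [1]  1  3  6 10 15

# right = TRUE folds from the right instead of the left.
Reduce(wrap, c("A", "B", "C"), right = TRUE)
# [1] "(A (B C))"
```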

Another beloved FP topic is currying. Currying is the act of taking a function with many arguments and breaking it into a chain of functions that each take only some of those arguments. Fixing some of a function’s arguments ahead of time like this is often called partial application. The following example uses a function factory to make such partially applied functions:

# Adder is a "function factory" - a function that makes new functions.
adder <- function(a) {
    return(function(b) a + b)
}
 
# Function factory pumping out new functions.
add3 <- adder(3)
add5 <- adder(5)
 
add3(add5(1))
# 9

Do you find this concept hard to follow? You’re not alone. To make this slightly more readable, the functional library gives you an explicit currying builder:

library(functional)
add <- function(a, b) a + b
add3 <- Curry(add, a = 3)
add5 <- Curry(add, a = 5)
 
add3(add5(1))
# 9
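If you are already loading purrr, its partial() function can do the same job. This is a sketch of an alternative to the functional package, not something the example above depends on:

```r
library(purrr)

add <- function(a, b) a + b

# partial() pre-fills the named argument and returns a new function.
add3 <- partial(add, a = 3)
add5 <- partial(add, a = 5)

add3(add5(1))
# [1] 9
```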

Sidenote: The verb “currying” comes from Haskell Curry, famed Mathematician/Computer Scientist/Fellow Penn Stater.

Summary

Do you feel smarter? More powerful? Ready to torture your data with some of your new FP skills? Here are some of the big takeaways:

No more loops! Ever!
Anytime you want to use a loop, find the appropriate apply/map.
Integrate the Tidyverse into your workflow wherever possible.
Use the pipe (%>%) when applying several functions to one thing (e.g., a data frame being manipulated in the Tidyverse).

Adhering to these mindsets while coding can greatly reduce ugly, difficult-to-maintain spaghetti code. Bottling things in functions can leave you with clean, readable, modular ravioli code. I’ll leave you with a famous quote from John Woods:

Always code as if the [person] who ends up maintaining your code will be a violent psychopath who knows where you live.