
R is Slow – and It’s Your Fault!

Understanding your tools is critical to your success


Photo by Luca Ambrosi on Unsplash

Anyone who works in the Data Science space is familiar with R. You’ve surely come across someone arguing that R is a slow language that can’t handle larger data. That simply isn’t the case. A lot of the R code I’ve seen in the wild shows a lack of fundamental understanding of how the language works. Let’s look at one example of how to optimize your code so it works with you instead of against you.



Why are Loops Slow?

Let’s start with how the R programming language works. It is what is referred to as an interpreted language. This means you don’t have to compile anything before running code; the computer simply interprets and runs it, giving you results. This helps speed up how quickly you can write and test your code, but has the drawback of generally being slower to execute. There is a lot of overhead in the processing because R needs to check the type of a variable nearly every time it looks at it. This makes it easy to change types and reuse variable names, but slows down computation for very repetitive tasks, like performing an action in a loop.
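
To see that flexibility in action, here’s a quick illustration (the variable name and values are arbitrary). The same name can hold a number one moment and a string the next, with no declarations required:

val <- 42          # numeric
class(val)         # "numeric"

val <- "forty-two" # reassigned as a character string, no complaints
class(val)         # "character"

That convenience is exactly what costs you inside a loop: since nothing guarantees a variable is still a number, R has to re-check its type on every pass.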

Let’s take an example of a simple programming problem: find all of the factors of a given number. To do this, we can look at every number starting with 1 all the way up to, and including, the given number. If our number is 2048, this would be every number from 1 to 2048. We would divide 2048 by each of those possible factors. If the remainder is 0, the number is a factor.

A beginner programmer could tackle this pretty easily with a for loop. We assign 2048 to a variable, x. Then we can create a blank vector to store the factors in as we find them.

x <- 2048       # the number we want to factor
factors <- c()  # empty vector to collect factors as we find them

Now the for loop. We’ll go from 1 to x. Inside the for loop, we’ll create an if statement. The condition for the if statement will be x modulo (%%) i being equal to 0. The modulo gives us the remainder of dividing x by each number. When the remainder is 0, we’ll execute the next line, which adds the value of i to our vector of factors.

for (i in 1:x) {
    if (x %% i == 0) {            # remainder of 0 means i divides x evenly
        factors <- c(factors, i)  # append the factor we found
    }
}

There isn’t anything wrong with this approach. Running it shouldn’t take any time at all on a pretty standard computer. You’ll get the correct values in the factors vector as well. So what’s wrong with this approach?
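
Printing the vector is a quick sanity check. Since 2048 is 2 to the 11th power, its factors are exactly the powers of two:

factors
# [1]    1    2    4    8   16   32   64  128  256  512 1024 2048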

When R uses loops, it essentially has to check the types of x and i each and every time the loop runs. To our minds it seems obvious that these will always be numeric, but R still needs to check every time. On top of that, growing the factors vector with c() forces R to copy the entire vector every time a new factor is found. This starts to become a problem when you want to find the factors of a larger number, such as 2,048,000. On my computer, the same loop took about 26(!) seconds to find all the factors of 2,048,000.

If you want to run it on your machine, use the rbenchmark package. Use the benchmark function and give the code you want to time a name, like Loop. Set that name equal to the loop code and you’ll get back the execution time in the elapsed column of the output. Note that the code is run multiple times and the elapsed time is the cumulative time for all runs.

library(rbenchmark)
benchmark(
    "Loop" = {
        loop_factors <- c(1)  # 1 is always a factor, so start the loop at 2
        for (i in 2:x) {
            if (x %% i == 0) {
                loop_factors <- c(loop_factors, i)
            }
        }
    }
)
# Output
# test replications elapsed relative user.self sys.self user.child sys.child
# Loop          100    27.5        1     27.44        0         NA        NA

There is Another Way

Now I know what you’re saying: use Python (or another faster language)! But there is no reason R can’t handle this task much faster, if you know how the language works. All we need to do is eliminate the overhead of R checking every variable’s type on every single iteration. Something like this, maybe:

factors <- (1:x)[(x %% 1:x) == 0]

There you go, problem solved! That runs about 7 to 8 times faster and gives the same correct answer. But this is all about understanding why, and I like code to be readable, so let’s dive a bit deeper as we rewrite this optimized statement.

First off, we already established that part of why R was taking so long to run the loop was that it was checking the types of the variables on every iteration. Lucky for us, R was designed to leverage vectors to avoid this. Every element of a vector is required to have the same type, so R doesn’t need to check each element while it is computing. So to optimize things, we’ll use vectorized operations.
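
You can see the single-type rule for yourself: if you try to mix types in one vector, R coerces everything to a common type rather than store a mixture.

v <- c(1, 2, 3)
class(v)    # "numeric" (every element shares one type)

v <- c(1, "two", 3)
class(v)    # "character" (the numbers were coerced to strings)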


First, we’ll take our number, x, and use the modulo operator (%%) to get the remainder of dividing it by every number from 1 to x. We can use 1:x to create the vector of all the numbers from 1 through x. When we take the modulo of the single value x by a vector of numbers, R divides x by each element of the vector, returning a vector of the remainders.

remainders <- x %% 1:x

Next, we want to test which remainders are 0. Similarly to how the remainders operation worked, we can compare the entire remainders vector to 0. This will return a vector of TRUE/FALSE, where TRUE represents remainders that are 0.

true_false <- remainders == 0

Now, we can use the true_false vector to filter our possible numbers, 1 to x, giving us the final answer.

(1:x)[true_false]
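
To make the three steps concrete, here’s the same pipeline run on a small number, 12, with the intermediate values shown as comments:

x <- 12
remainders <- x %% 1:x    # 0 0 0 0 2 0 5 4 3 2 1 0
true_false <- remainders == 0
(1:x)[true_false]
# [1]  1  2  3  4  6 12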

Let’s Drag Race

Let’s check the speed of our calculations again, comparing all three versions: the for loop version, the single-line vector-optimized version, and the readable vector-optimized version. I’ll use the benchmark function again, this time adding Vector and Vector Readable entries. x is once again set to 2,048,000.

benchmark(
    "Loop" = {
        loop_factors <- c(1)
        for (i in 2:x) {
            if (x %% i == 0) {
                loop_factors <- c(loop_factors, i)
            }
        }
    },
    "Vector" = {
        loop_factors <- (1:x)[(x %% 1:x) == 0]
    },
    "Vector Readable" = {
        remainders <- x %% 1:x
        true_false <- remainders == 0
        loop_factors <- (1:x)[true_false]
    }
)
# Output
#              test replications elapsed relative user.self sys.self user.child sys.child
# 1            Loop          100   27.52    7.519     27.38     0.00         NA        NA
# 2          Vector          100    3.66    1.000      2.99     0.66         NA        NA
# 3 Vector Readable          100    3.77    1.030      3.24     0.50         NA        NA

In the output, the fastest-running code will have a relative value of 1. Every other snippet shows how many times longer it took relative to that fastest snippet. In my case, the single-line version was slightly faster than the readable version, most likely because the readable version stores intermediate results between steps. You can also see that the for loop was about 7.5 times slower than the vector versions. Vectors really do help optimize your code!

That’s awesome! R can do anything much faster using vectors, right? Well, R still has some limitations. While using vectors can greatly speed up calculations, R still does the majority of its calculations in memory. So once we reach a sufficiently large number, such as 2,048,000,000, R won’t be able to allocate vectors of the size required to do the calculation. Here’s what happens when we try:
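
You hit R’s standard allocation error. The exact size in the message depends on your machine and which intermediate vector fails first, so treat this output as illustrative:

x <- 2048000000
vector_factors <- (1:x)[(x %% 1:x) == 0]
# Error: cannot allocate vector of size 7.6 Gb
# (the size reported will differ on your system)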


Conclusion

So in the end, it is important to know your tools. R is a very approachable language that is pretty easy to pick up. Learning a few tricks, such as leveraging vectors, can speed up your calculations and help keep R relevant in your toolkit. You also need to know the limitations of your tools. Learning to optimize your code with vectors is useful, but once you hit a certain size of data and calculations, you may want to start looking at another language.

