R vs. Python vs. Julia

How easy is it to write efficient code?

Daniel Moura
Towards Data Science


Photo by Austris Augusts on Unsplash

TL;DR: just jump to the Overall Comparison section

If you are a Data Scientist, chances are that you program in either Python or R. But there is a new kid on the block named Julia that promises C-like performance without compromising the way Data Scientists write code and interact with data.

In my last post, I compared R with Julia, showing how Julia brings a refreshing programming mindset to the Data Science community. The main takeaway is that with Julia you no longer need to vectorize to improve performance; in fact, good use of loops may deliver the best performance.

In this post, I am adding Python to the mix: the language of choice of many Data Scientists deserves a say. We will solve a very simple problem for which built-in implementations are available and for which programming the algorithm from scratch is straightforward. The goal is to understand our options when we need to write efficient code.

Membership testing via linear search

Let us consider the problem of membership testing on an unsorted vector of integers.

julia> 10 ∈ [71,38,10,65,38]
true
julia> 20 ∈ [71,38,10,65,38]
false

In principle, this problem is solved via linear search: the algorithm runs through the elements of the input vector until it finds the value being searched (successful search) or reaches the end of the vector (unsuccessful search).
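For reference, here is a minimal element-based linear search in Julia (a sketch only; the actual benchmarked implementations for each language are in the repository linked at the end):

function linear_search(value, vector)
    for element in vector                 # element-based looping
        element == value && return true   # stop at the first match
    end
    return false                          # reached the end: not found
end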

For evaluating different implementations in R, Python, and Julia, I generated a dataset with 1,000,000 unique integers ranging from 1 to 2,000,000 and performed 1,000 searches with all the integers from 1 to 1,000. The probability of a search being successful is ~50%, so half the time the algorithm scans the complete vector only to conclude that the search was unsuccessful. In the remaining cases, the algorithm requires (n+1)/2 evaluations on average to find the element, with n being the length of the vector.
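In Julia, the setup could be sketched along these lines (an assumed reconstruction, not the exact benchmark code):

using Random
data = shuffle(collect(1:2_000_000))[1:1_000_000]  # 1,000,000 unique integers in 1:2,000,000
queries = 1:1_000                                  # ~50% of these are present in data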

I measured the performance of each implementation by taking the median CPU time of 3 runs. Additional information about the hardware and software for running experiments can be found here.

Please note that the goal of these experiments is not to make an accurate benchmark of the different languages and implementations. The goal is to highlight the hurdles that languages pose to data scientists when performance is important.

C implementation

I implemented the linear search in C to get a sense of the performance of a statically typed, compiled language and to set the baseline. The binary executable took 0.26 seconds of CPU time to execute the 1,000 searches.
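A minimal sketch of the baseline search in C (the benchmarked program, including file parsing and timing, is in the repository):

#include <stdbool.h>
#include <stddef.h>

bool linear_search(int value, const int *vector, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (vector[i] == value)
            return true;   /* found: stop scanning */
    }
    return false;          /* reached the end: not found */
}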

R implementation

I tried different flavors of the membership test in R, from a specialized operator (%in%) to a C-like implementation using loops, passing through a vectorized approach.
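The three flavors look roughly like this (assumed reconstructions; the exact benchmarked code is in the repository):

in_search  <- function(value, vector) value %in% vector    # specialized operator
vec_search <- function(value, vector) any(vector == value) # vectorized full scan
for_search <- function(value, vector) {                    # C-like loop, stops at first match
  for (element in vector) {
    if (element == value) return(TRUE)
  }
  FALSE
}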

As we move from in_search to for_search we gain more control over the algorithm. However, as control increases, performance degrades in R: a vectorized operation like vec_search, which always scans the full array, is an order of magnitude faster than looping through the elements until a match is found. As in my previous post, vectorization paid off despite requiring more memory and (redundant) operations. As expected, the specialized operator %in% delivers the best performance and the cleanest code.

I also tried Map-Reduce operations, but did not have the patience to wait until they finished… not an option if you seek performance.
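For the record, such a flavor might look like this (an assumed sketch, building on Reduce with no early exit):

reduce_search <- function(value, vector) {
  Reduce(function(acc, x) acc || x == value, vector, FALSE)
}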

Python implementation

To be honest, the initial goal was to use only native functions and native data structures, but the in operator on Python’s native lists was ~10x slower than R. So I also included results with NumPy arrays (which bring vectorized operations to Python). CPU time went from 9.13 to 0.57 seconds, about 2x the baseline. When moving to a loop-based approach, however, native lists took the lead by an order of magnitude.

I then gave NumPy a second chance by adding JIT compilation with the Numba package. Numba has its limitations, but it is simple to use: you just need to import the package and tag the functions you want JIT-compiled (and carefully read the manual). Loops over NumPy arrays plus Numba deliver performance comparable to (or better than) vectorized/specialized operations, but getting here was not easy, as there are several traps along the way. Speaking of traps, looping over native lists with Numba was disappointing: once again, I stopped execution because it was taking over 5 minutes to finish.
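Sketches of the main Python flavors discussed above (hypothetical reconstructions; the function names are mine):

import numpy as np
from numba import njit

def list_search(value, lst):     # native `in` operator on a Python list
    return value in lst

def numpy_search(value, arr):    # vectorized full scan on a NumPy array
    return bool((arr == value).any())

@njit
def numba_search(value, arr):    # JIT-compiled loop over a NumPy array
    for element in arr:
        if element == value:
            return True
    return False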

Julia implementation

In Julia, I included a couple more flavors to show the diversity and performance of natively available functionalities.

Except for vectorized operations, the performance was pretty close to the C implementation, with a degradation of 20-50%. The vectorized performance was decent, at about 4x C’s CPU time, but also about 2x NumPy’s CPU time on vectorized operations. The freedom you get is incredible, as you can code virtually any algorithm in Julia! To achieve top performance on for loops, I used hints telling the compiler not to check whether indexes are within the bounds of the array (the @inbounds macro) and giving it extra liberty in the order it executes iterations (the @simd macro). In case you are wondering, not providing these hints would bring the loops’ performance close to that of in_search.
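Sketches along these lines (assumed reconstructions, not the exact benchmarked code; note that @simd forbids early exits, so that variant always scans the full vector):

function for_search(value, vector)
    @inbounds for i in eachindex(vector)  # skip bounds checking
        vector[i] == value && return true
    end
    return false
end

function simd_search(value, vector)
    found = false
    @inbounds @simd for i in eachindex(vector)
        found |= (vector[i] == value)     # branch-free update: iterations may be reordered
    end
    return found
end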

Overall comparison

(plot made in R using ggplot2 and ggrepel)

Looking at the results side by side for this simple problem, we observe that:

  • Julia’s performance is close to C almost independently of the implementation;
  • The exception in Julia is when writing R-like vectorized code, with performance degrading by about 3x;
  • When adding JIT compilation (Numba) to Python, loop-based implementations got close to Julia’s performance; still, Numba imposes constraints on your Python code, making this option a compromise;
  • In Python, choose carefully between native lists and NumPy arrays, and when to use Numba: for the less experienced it is not obvious which data structure performs best, and there is no clear winner (especially if you include the use case of adding elements dynamically, not covered here);
  • R is not the fastest, but its behavior is more consistent than Python’s: the slowest implementation in R is ~24x slower than the fastest, while in Python it is ~343x (in Julia, ~3x);
  • Native R always performed better than native Python;
  • Whenever you cannot avoid looping in Python or R, element-based looping is more efficient than index-based looping.

Details matter…

I could stop the article right here and praise how seamless it is to write efficient code in Julia. Still, details matter, and the programmer needs to be mindful of Julia’s internals. Can you guess which line of code most affects performance? Here is a hint: you will not find it in any of the snippets presented so far…

Here it is:

map(line -> parse(Int, line), eachline(f))

This line of code parses the input text file f, which contains one number per line (note that reading the file is not part of the benchmark). So, what is so special about this line of code? In a nutshell, Julia infers that:

  • the type of the elements returned by the anonymous function (first argument of map) is (always) integer;
  • the output of the mapping is, therefore, an array of integers.

Since Julia knows that it is storing an array of integers, it allocates a contiguous block of memory where each item contains an integer. This allows for efficient read operations.
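You can confirm the inferred element type in the REPL:

> typeof(map(line -> parse(Int, line), eachline(f)))
Array{Int64,1}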

How can we mess up? Here is one way:

a = []
for line in eachline(f)
    push!(a, parse(Int, line))
end

Seems similar, right? However:

> typeof(a)
Array{Any,1}

The statement a = [], as convenient as it might look, creates an array of Any, meaning that each element of the array can store any type of data. Internally, Julia stores an array of pointers to cope with the flexibility that Any provides. As a result, Julia can no longer scan a sequential, contiguous block of memory when processing the array. What is the impact on performance? The code becomes about 50 to 100 times slower!

Fixing this code is fairly simple: a = Int[] (instead of a = []) does the job, as it specifies the type of the elements.
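The fixed version of the loop:

a = Int[]                      # concrete element type: contiguous storage
for line in eachline(f)
    push!(a, parse(Int, line))
end
# typeof(a) is now Array{Int64,1}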

Bottom line

Of all the languages covered in this article, Julia is clearly the one where it is easiest to write efficient code. Still, you need to know what you are doing. Fortunately, performance tips are available that can put you on the right track.

Happy coding!

p.s. I use all three languages regularly and I love them all. Each one has its place.

Code available at github.com/dcmoura/blogposts

You can find me on Twitter, LinkedIn and Vimeo (check out my data visualizations!)

If you liked this article, you might also enjoy Freeing the data scientist mind from the curse of vectoRization
