Python and R for Data Wrangling: Compare pandas and tidyverse Code Side by Side, and Learn Speed-Up Tips.
Skill up by becoming a bilingual data scientist: learn tips to speed up your code, and write bilingual notebooks with interoperable Python and R cells.

A couple of years back, you would write your data analysis program exclusively in one of these two languages: Python or R. Both languages offer great functionality, from data exploration to modeling, and have very loyal fans. However, things have changed [1]. These days, there are libraries, such as reticulate and PypeR, that allow you to incorporate Python code in R Markdown and R code in a Jupyter notebook, respectively. Knowing basic functionality, such as data wrangling, in both languages expands your programming horizons, allows you to work with people of either language, and lets you create bilingual notebooks that leverage the best of each language. In this article, we will discuss data wrangling in Python and R, using the pandas and tidyverse libraries respectively, along with tips to speed up your code. After you read the article, it will become obvious that Python and R are quite similar in many of their expressions, at least in the area of data wrangling. So, with a little extra effort, you can master data wrangling in both languages and become a "supeR-Pythonista"!
Be a supeR-Pythonista (instead of just a Pythonista!)
A Bilingual R Markdown file
This article is accompanied by an R Markdown file, which you can find on GitHub. In this file, data wrangling operations are implemented twice, in Python and R cells adjacent to each other. This is facilitated by importing the reticulate library. For now, the Python and R cells work independently; in my next article, I will show how to pass arguments between the Python and R cells and achieve inter-language communication.
The R Markdown file can also serve as two independent cheat sheets for Python and R data wrangling operations. You can also download a knitted version of the R Markdown file. The data wrangling for both languages is performed on similar structures: the R data frame and the pandas DataFrame. The specific implemented operations are:
A. Create/Read Data Frame.
B. Summarize.
C. Select rows, columns, elements using indices, names, logical conditions, regular expressions. The filter() functions in Python and R will be presented.
D. Delete/add rows and columns. We will discuss the mutate() function in R and map() in Python.
E. Apply a function to rows/columns, including lambda functions in Python.
F. Speed up code. We will discuss techniques for code speed-up, such as parallelization and function compilation.
Data Wrangling Operations
In the aforementioned R Markdown file, the Python code is enclosed in a Python chunk (a line of three backticks followed by the header {python}, with a closing line of three backticks), while the R code is enclosed in an R chunk (three backticks followed by the header {r}). These chunk delimiters are automatically generated by RStudio, by clicking on the Insert tab and selecting the language you want.
Programming tip: Before executing any Python cells in the R Markdown file, execute the cell that imports the reticulate library. Note that in the same cell, we instruct RStudio which Python interpreter to use. It is also a good idea to check the Python configuration in the RStudio console, using the command py_config().
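A minimal setup chunk might look like the sketch below; the interpreter path is a placeholder, so point use_python() at your own installation:

```r
library(reticulate)

# Tell reticulate which Python interpreter to use
# (the path below is a placeholder; replace it with your own)
use_python("/usr/local/bin/python3", required = TRUE)

# Report the Python configuration that reticulate has picked up
py_config()
```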
It is important to note that in both languages, there are multiple ways to perform operations, many of which are presented in the aforementioned R Markdown file. Due to space constraints, in this article, we will only present a portion of them. Let us get started:
Read a DataFrame From a CSV File
As we see in the gist below, in both languages we call a function that reads the CSV file; in Python, this function is called through the pandas library. The heart.csv file is from Kaggle and the UCI Machine Learning Repository.
Notable in the Python code below is the use of a lambda function (an anonymous inline function) to exclude from reading the columns defined in a list of columns to drop.
Programming tip: It is a good idea to define the encoding for the read.csv() function in R, as shown below, to ensure that all column names are read properly. When I did not specify the encoding, the name of the first column was read incorrectly. I had no need to specify an encoding in Python.
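As a rough R sketch (assuming heart.csv sits in the working directory and that a byte-order mark caused the mangled column name):

```r
# Read heart.csv with an explicit encoding, so that a byte-order mark
# does not pollute the name of the first column.
# "UTF-8-BOM" is an assumption; plain "UTF-8" may be enough on your system.
heart <- read.csv("heart.csv", fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
head(heart)
```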
Create DataFrame From Scratch
As we see below, in both languages a Data Frame can be created from a lower-order structure: a NumPy array in Python and a matrix in R.
In the R section, the DepositFrame Data Frame contains the bank deposits (in thousands) of 3 persons. The deposits are generated using the rnorm() function, which generates 6 random numbers with mean 12 and standard deviation 3.
Programming tip: As you will note in the R code below, we import the tidyverse library, which bundles many packages, such as dplyr. The functions we need in this segment of R code are defined only in the dplyr package. So why import tidyverse instead of just dplyr? One reason is future extensibility (we might need functions from the other tidyverse packages). The other reason is that when we import library(dplyr), we get a warning that its important function filter() is masked by another library; if we import library(tidyverse), we do not have this problem.
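A rough sketch of the R side is shown below; the 2 x 3 layout and the person names are assumptions made for illustration:

```r
library(tidyverse)

# Six random deposit amounts (in thousands), mean 12, standard deviation 3
set.seed(1)
deposits <- matrix(rnorm(6, mean = 12, sd = 3), nrow = 2, ncol = 3)

# Build the Data Frame from the lower-order matrix structure
DepositFrame <- as.data.frame(deposits)
colnames(DepositFrame) <- c("Amy", "Bob", "Carl")   # person names are assumptions
DepositFrame
```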
Summarization
Here we will examine two types of summarization: (a) Functions that summarize basic information about the Data Frame. (b) Functions that allow grouping and tailored insights in slices of the data.
Functions That Describe General Info About the Data Frame
In the following gist, in the R section, the head() function shows the first few rows (observations) of the Data Frame. The glimpse() function displays the number of observations, and variables (columns) along with the type, name, and values of the latter. Similar information to the glimpse() function is displayed by the str() function. A more useful function is the describe() function from the prettyR package, which displays basic statistics and number of valid cases for each variable. Variable statistical information is also displayed by the summary() function. Similarly to R, in the Python section, the head() function shows the first rows, while the describe() function displays basic statistics for each column, such as mean, min, and max.
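For reference, the R calls described above look roughly like this (assuming heart is the data frame read earlier and the prettyR package is installed):

```r
library(tidyverse)
library(prettyR)

head(heart)       # first few observations
glimpse(heart)    # number of observations and variables, with types and values
str(heart)        # similar structural information
describe(heart)   # basic statistics and number of valid cases per variable
summary(heart)    # per-variable statistical summary
```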
Grouping and Summarizing Information Inside Groups
Both R and Python have functions for grouping based on a variable or a set of variables. As shown in the gist below, in R we can use the function tapply() from base R to compute the mean cholesterol of men and women separately. Alternatively, we can use the function group_by() followed by summarization functions such as count(), mean(), or summarize_all(). Similarly, as shown below, in Python group summarization is performed by the function groupby() followed by summarization functions such as count().
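A minimal R sketch of both approaches, assuming the heart data contains the columns sex and chol:

```r
library(dplyr)

# Base R: mean cholesterol for each sex, with tapply()
tapply(heart$chol, heart$sex, mean)

# dplyr: group_by() followed by summarization verbs
heart %>%
  group_by(sex) %>%
  summarize(count = n(), mean_chol = mean(chol))
```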
Additional examples of grouping and summarizing, in both R and Python, can be found in the accompanying R Markdown file.
Index-Based Selection of Rows, Columns, Elements
An important note here is that indices in R start at 1, while in Python at 0. In Python, we use the iloc indexer to access integer locations.
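On the R side, index-based selection looks like this (the row and column numbers are arbitrary examples):

```r
heart[1, ]            # first row (R indices start at 1)
heart[, 1]            # first column
heart[2:4, c(1, 5)]   # rows 2-4, columns 1 and 5
heart[3, 5]           # a single element
```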
Column/Row Selection Using Names, Regular Expressions, Logical Conditions
Some highlights to note here:
- The use of the %>% pipe operator in R, which allows you to write succinct functional code.
- The use of the select() function in R to select columns, using column names, regular expressions, and/or logical conditions. For example, select() statements can be used to select all columns that contain the letter m and whose max() is less than 12 (see the sketch after this list).
- The function filter(), which is an alternate way to make selections in R. Its extension filter_all(), shown below, allows us to select all rows that fulfill a criterion.
- Python offers a similar filter() function for selection. The axis argument specifies whether the operation will be applied to columns or rows.
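A minimal R sketch of the selection idioms described above; the column names and thresholds are illustrative assumptions:

```r
library(dplyr)

# select(): by name, by a string-matching helper, and by a condition on the column
heart %>% select(age, chol)           # by column name
heart %>% select(contains("m"))       # columns whose name contains the letter m
heart %>% select_if(~ max(.) < 12)    # columns whose max() is less than 12

# filter(): rows that satisfy a logical condition
heart %>% filter(sex == 1, chol > 280)

# filter_all(): rows in which every column fulfills the criterion
heart %>% filter_all(all_vars(. >= 0))
```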
Row/Column Deletion and Addition
Below, notable are the following:
- In R, column deletion is performed using the minus operator with the column name. We can also use string-matching functions, such as starts_with(), which is shown below. Other similar functions are ends_with(), matches(), and contains().
- In R, a column can be added by simply using its name (column Brandon below). An alternative popular way is the mutate() function from the dplyr library. Here, we add a column sumdep using mutate(), where this new column is the sum of two existing columns. Notice the difference from the transmute() function, which keeps only the new columns (it does not preserve the old ones). As also shown below, a particularly interesting way to use mutate() is to combine it with group_by(), to perform calculations within groups (see the sketch after this list).
- In the Python section, notable is the addition of a column using logic implemented with the map() function.
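A rough R sketch of the deletion and addition idioms above; the columns Amy, Bob, and Brandon refer to the hypothetical deposits frame sketched earlier:

```r
library(dplyr)

# Delete columns with the minus operator or with a string-matching helper
heart %>% select(-chol)
heart %>% select(-starts_with("ch"))

# Add a column simply by naming it (Brandon, in the hypothetical deposits frame)
DepositFrame$Brandon <- rnorm(2, mean = 12, sd = 3)

# mutate() adds a column and keeps the existing ones;
# transmute() would keep only the newly created columns
DepositFrame <- DepositFrame %>% mutate(sumdep = Amy + Bob)

# mutate() after group_by() performs the calculation within each group
heart %>%
  group_by(sex) %>%
  mutate(chol_dev = chol - mean(chol))
```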
Apply a Function to Columns/Rows
Both Python and R offer apply() functions that allow us to apply a function to rows or columns, depending on an extra argument. For example, below in R, if the middle (MARGIN) argument is 2, the function is applied to columns, while if it is 1, it is applied to rows. In Python, the determining argument of the apply() function is axis (0 applies it to columns, 1 to rows). Another interesting tidbit below is the use of a lambda function to multiply all elements by 5.
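On the R side, a minimal sketch of apply() might look like this (the Python analogue, including the multiply-by-5 lambda, uses apply() with the axis argument):

```r
# MARGIN = 2 applies the function over columns, MARGIN = 1 over rows
apply(DepositFrame, 2, mean)   # column means
apply(DepositFrame, 1, sum)    # row sums

# R counterpart of the Python lambda: multiply every element by 5
apply(DepositFrame, 2, function(x) x * 5)
```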
Speed-Up Considerations
Here, we will briefly discuss three options for speeding up code in R: data.table, parallelization, and function compilation. For Python, the speed-up option that will be presented is modin.

R: data.table
data.table is an extension of the data frame [2], and its advantages are twofold:
(a) For large files, much faster loading of data than a Data Frame. An example is shown in the gist below: I read a large file (66.9 MB) using the function read.csv(), which returns a Data Frame, and fread(), which returns a data.table. The difference is quite significant (e.g., in user+system time, data.table is approximately 30 times faster!). The reference for the data file I used is [3] in the references section.
(b) Concise, compact filtering expressions, which also execute quite fast. An example of filtering with data.table and a comparison with dplyr filtering is shown in the gist below. In this gist, we compute the mean age of men with cholesterol > 280 using filter() from the dplyr library. We also compute it with data.table, which is done in one concise line of code.
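A rough sketch of both points, with a placeholder file name standing in for the large file from [3]:

```r
library(data.table)
library(dplyr)

# (a) Loading a large file: read.csv() returns a Data Frame, fread() a data.table
# ("big_file.csv" is a placeholder for the 66.9 MB file referenced in [3])
system.time(df <- read.csv("big_file.csv"))
system.time(dt <- fread("big_file.csv"))

# (b) Filtering: mean age of men with cholesterol > 280
heart_dt <- as.data.table(heart)
heart_dt[sex == 1 & chol > 280, mean(age)]   # one concise data.table line

heart %>%                                    # the dplyr equivalent
  filter(sex == 1, chol > 280) %>%
  summarize(mean_age = mean(age))
```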
R: Parallelization
In R, parallelization is implemented using the parallel library [4], which allows our R program to utilize all the cores in our computer. We can use the function detectCores() to find out how many cores we have, and then the function makeCluster() to create a cluster with these cores. This is shown in the gist below. Then we can use the parallel versions of the apply() family of functions to perform various operations. For example, in the gist below, the sapply() function and its parallel version parSapply() are used to square a large set of numbers. We can see that the user+system time for parSapply() is 7.95+0.66=8.61, while for plain sapply() it is 11.80+0.03=11.83.
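A minimal sketch of the parallelization workflow described above (the size of the number set is an assumption):

```r
library(parallel)

# Create a cluster that uses all available cores
cl <- makeCluster(detectCores())

nums <- 1:5e6   # the size of the number set is an assumption

system.time(res_serial   <- sapply(nums, function(x) x^2))         # plain sapply()
system.time(res_parallel <- parSapply(cl, nums, function(x) x^2))  # parallel version

stopCluster(cl)
```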
R: Function Compilation
An alternative way to speed up operations in R is to compile functions. This is achieved through the compiler library and its cmpfun() function [5]. The gist below shows this, as well as a comparison of parallelization and compilation in terms of time consumed. We see that the fastest user+system time (7.73) is achieved when we combine parallelization and compilation. On their own, parallelization alone yields a faster user+system time (8.61) than function compilation alone (8.87).
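A minimal sketch that compares the plain, compiled, parallel, and combined variants (the helper function and vector size are assumptions):

```r
library(compiler)
library(parallel)

nums <- 1:5e6
square_one   <- function(x) x^2
square_one_c <- cmpfun(square_one)   # byte-compiled version of the function

cl <- makeCluster(detectCores())

system.time(sapply(nums, square_one))            # neither technique
system.time(sapply(nums, square_one_c))          # compilation only
system.time(parSapply(cl, nums, square_one))     # parallelization only
system.time(parSapply(cl, nums, square_one_c))   # parallelization + compilation

stopCluster(cl)
```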
To conclude our discussion on speeding up R, it is worth mentioning the gpuR library [6]. Using the ViennaCL library, gpuR allows parallel computation on any GPU, in contrast to previous R packages that depended on an NVIDIA CUDA backend.
Python: modin
As described in [7], modin is a DataFrame library that has the same API as pandas and allows significant speed-up of workflows (about 4 times faster on an 8-core machine). To use it, we only need to change one line of code. Instead of:

import pandas as pd

we will use:

import modin.pandas as pd
One word of caution. modin is built on top of Ray, so ensure that you have the right version of Ray.
An additional freebie you can find in my GitHub repository is another bilingual R Markdown file, which implements array creation (1-D and 2-D) and math functions using arrays (dot product, eigenvalues, etc.) in both Python and R. You can also download its knitted version.
Thanks for reading!
References
[1] Pandey, P., From R vs. Python, to R and Python, https://towardsdatascience.com/from-r-vs-python-to-r-and-python-aa25db33ce17
[2] Introduction to data.table, https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
[3] MITx and HarvardX, 2014, "HMXPC13_DI_v2_5-14-14.csv", HarvardX-MITx Person-Course Academic Year 2013 De-Identified dataset, version 2.0, https://doi.org/10.7910/DVN/26147/OCLJIV, Harvard Dataverse, V10
[4] Treadway, A., Running R Code in Parallel, https://www.r-bloggers.com/running-r-code-in-parallel/
[5] Ross, N., FasteR! HigheR! StrongeR! - A Guide to Speeding Up R Code for Busy People, https://www.r-bloggers.com/faster-higher-stonger-a-guide-to-speeding-up-r-code-for-busy-people/
[6] Package gpuR, https://cran.r-project.org/web/packages/gpuR/gpuR.pdf
[7] Pandey, P., Get faster pandas with Modin, even on your laptops, https://towardsdatascience.com/get-faster-pandas-with-modin-even-on-your-laptops-b527a2eeda74