Knowing and learning multiple programming languages has always been important whether you are a data scientist, data engineer, data analyst, or any developer of sorts. Being able to be a versatile programmer means your skills will never be outdated and you can quickly adapt to industry trends.
That being said, we’ve all faced this familiar dilemma as programmers:
You’ve got a big deadline to build a data pipeline and your boss keeps asking for its ETA. Or you have to work on that machine learning class project and there are only a few days before final presentations. You want to continue your commitment to improving your R skills (or Python) but it’s so much easier, faster, and less-stressful to stick to the language you know best.
However, the reality is that for everything you can do in Python when it comes to working with data frames, there is a way to accomplish it in R. And many times, the approach is quite similar except for the syntax.
In this article, we will cover 7 useful operations that you may have encountered with Python in a typical data wrangling process. But we’ll also look at the equivalent code side-by-side to accomplish it in R.
We will use the Rain in Australia dataset available on Kaggle. We first need to load in needed packages and read the csv file into dataframes using functions.
Python
import pandas as pd
df = pd.read_csv('data/weatherAUS.csv')
df.head(5)
R
library(dplyr)
library(ggplot2)
library(tidyr)
library(pivottabler)
library(reshape2)
df <- read.csv('data/weatherAUS.csv')
head(df)

Finding Missing Values (NA)
Python
We can find missing values by checking each column to see if at least one NA
value exists using boolean checks.
df.isna().any()
R
In R, we can apply a similar logic using the sapply
function.
sapply(df, function(x)any(is.na(x)))

Replace Missing Values (NA)
Python
We can replace NAs with a specific value such as zero for select columns.
df['Rainfall'] = df['Rainfall'].fillna(0)
R
In R, we do this by using dplyr
pipes %>%
to chain data transformations.
df$Rainfall <- df$Rainfall %>% replace_na(0)
Drop Missing Values (NA)
Python
Here, we can drop all observations that contain at least one NA value.
df.dropna(axis=0, inplace=True)
R
df <- df %>% drop_na()

Apply Group by & Aggregation
Python
In Python, we can group data and apply aggregate calculations by chaining commands to the respective dataframe.
df[['Location', 'MinTemp', 'MaxTemp',
'Rainfall', 'WindGustSpeed']].groupby('Location').mean()
R
No surprise here. We can use dplyr’s group_by
and summarise
functions.
df %>% group_by(Location) %>%
summarise(across(c('MinTemp', 'MaxTemp', 'Rainfall',
'WindGustSpeed'), mean))

Apply Custom Function To Dataframe Column
Python
The pythonic way to apply a function to a column is using apply
andlambda
functions.
# define function
def convertToFahrenheit(c_degrees):
f = c_degrees * 1.8 + 32
return f
df[['MinTemp', 'MaxTemp']] =
df[['MinTemp', 'MaxTemp']].apply(lambda x: convertToFahrenheit(x))
R
R-way of doing this using mutate
and sapply
. Mutate can be used to create new columns or update existing ones.
convertToFahrenheit <- function(c_degrees){
f <- c_degrees * 1.8 + 32
return(f)
}
df %>% mutate(MinTemp = sapply(MinTemp, convertToFahrenheit),
MaxTemp = sapply(MaxTemp, convertToFahrenheit))

Create a Pivot Table
Python
For those familiar with Excel’s famous pivot tables, we can do this easily on Pandas dataframes to quickly summarise and aggregate specific data.
pd.pivot_table(data=df,
values='WindGustSpeed',
index='Location',
columns='WindGustDir',
aggfunc='mean',
fill_value=0)
R
Although not as intuitive, we can do this using reshape2’s dcast
function.
dcast(data=df, formula=Location ~ WindGustDir,
value.var='WindGustSpeed', fun.aggregate=mean, fill=0)

Detect Presence of Outliers in Dataset
Python
Though many approaches exist to identify outliers, one common way we can do this with a pandas dataframe is to plot a boxplot for columns we are interested in exploring.
df.boxplot(column=['Rainfall'])
R
When it comes to exploratory Data Analysis and visualization, R’s ggplot2
package allows us to easily generate a boxplot.
ggplot(df, aes(x=Rainfall)) + geom_boxplot()

Summary
The examples above are by no means exhaustive of all the operations on dataframes that you can perform in both R and Python. But hopefully, it gives you a glimpse of how easy it can be to overcome your hesitation and perhaps try a new language on your next data wrangling task.
For next steps, if you’re really comfortable in working with pandas in Python, challenge yourself to work with the similar dplyr, tidyr, and reshape2 packages in R. And if you’re comfortable with these in R but are rusty in Python, hop right into using pandas as there are plenty more useful capabilities to explore.
Thank you for reading. Hope this was helpful and please let me know if you have any questions or interested in any other specific parallel series how-tos like this.