The world’s leading publication for data science, AI, and ML professionals.

7 Key Operations for Data Wrangling in R or Python

Switching between the two is easier than it may seem

Photo by Sarah Pflug from Burst
Photo by Sarah Pflug from Burst

Knowing and learning multiple programming languages has always been important whether you are a data scientist, data engineer, data analyst, or any developer of sorts. Being able to be a versatile programmer means your skills will never be outdated and you can quickly adapt to industry trends.

That being said, we’ve all faced this familiar dilemma as programmers:

You’ve got a big deadline to build a data pipeline and your boss keeps asking for its ETA. Or you have to work on that machine learning class project and there are only a few days before final presentations. You want to continue your commitment to improving your R skills (or Python) but it’s so much easier, faster, and less-stressful to stick to the language you know best.

However, the reality is that for everything you can do in Python when it comes to working with data frames, there is a way to accomplish it in R. And many times, the approach is quite similar except for the syntax.

In this article, we will cover 7 useful operations that you may have encountered with Python in a typical data wrangling process. But we’ll also look at the equivalent code side-by-side to accomplish it in R.


We will use the Rain in Australia dataset available on Kaggle. We first need to load in needed packages and read the csv file into dataframes using functions.

Python

import pandas as pd
df = pd.read_csv('data/weatherAUS.csv')
df.head(5)

R

library(dplyr)
library(ggplot2)
library(tidyr)
library(pivottabler)
library(reshape2) 
df <- read.csv('data/weatherAUS.csv')
head(df)
Initial R dataframe
Initial R dataframe

Finding Missing Values (NA)

Python

We can find missing values by checking each column to see if at least one NA value exists using boolean checks.

df.isna().any()

R

In R, we can apply a similar logic using the sapply function.

sapply(df, function(x)any(is.na(x)))
Resulting boolean checks for dataframe columns with at least one missing NA
Resulting boolean checks for dataframe columns with at least one missing NA

Replace Missing Values (NA)

Python

We can replace NAs with a specific value such as zero for select columns.

df['Rainfall'] = df['Rainfall'].fillna(0)

R

In R, we do this by using dplyr pipes %>% to chain data transformations.

df$Rainfall <- df$Rainfall %>% replace_na(0)

Drop Missing Values (NA)

Python

Here, we can drop all observations that contain at least one NA value.

df.dropna(axis=0, inplace=True)

R

df <- df %>% drop_na()
Resulting R dataframe with missing values dropped
Resulting R dataframe with missing values dropped

Apply Group by & Aggregation

Python

In Python, we can group data and apply aggregate calculations by chaining commands to the respective dataframe.

df[['Location', 'MinTemp', 'MaxTemp', 
    'Rainfall', 'WindGustSpeed']].groupby('Location').mean()

R

No surprise here. We can use dplyr’s group_by and summarise functions.

df %>% group_by(Location) %>% 
           summarise(across(c('MinTemp', 'MaxTemp', 'Rainfall',  
                               'WindGustSpeed'), mean))
Resulting group by aggregation set in R
Resulting group by aggregation set in R

Apply Custom Function To Dataframe Column

Python

The pythonic way to apply a function to a column is using apply andlambda functions.

# define function 
def convertToFahrenheit(c_degrees):
    f = c_degrees * 1.8 + 32
    return f
df[['MinTemp', 'MaxTemp']] = 
df[['MinTemp', 'MaxTemp']].apply(lambda x: convertToFahrenheit(x))

R

R-way of doing this using mutate and sapply . Mutate can be used to create new columns or update existing ones.

convertToFahrenheit <- function(c_degrees){
     f <- c_degrees * 1.8 + 32
     return(f)
}
df %>% mutate(MinTemp = sapply(MinTemp, convertToFahrenheit),
              MaxTemp = sapply(MaxTemp, convertToFahrenheit))
Resulting dataframe in R with mutated columns for min and max temps
Resulting dataframe in R with mutated columns for min and max temps

Create a Pivot Table

Python

For those familiar with Excel’s famous pivot tables, we can do this easily on Pandas dataframes to quickly summarise and aggregate specific data.

pd.pivot_table(data=df,
               values='WindGustSpeed',
               index='Location',
               columns='WindGustDir',
               aggfunc='mean',
               fill_value=0)

R

Although not as intuitive, we can do this using reshape2’s dcast function.

dcast(data=df, formula=Location ~ WindGustDir,   
      value.var='WindGustSpeed', fun.aggregate=mean, fill=0)
Resulting pivot table in R
Resulting pivot table in R

Detect Presence of Outliers in Dataset

Python

Though many approaches exist to identify outliers, one common way we can do this with a pandas dataframe is to plot a boxplot for columns we are interested in exploring.

df.boxplot(column=['Rainfall'])

R

When it comes to exploratory Data Analysis and visualization, R’s ggplot2 package allows us to easily generate a boxplot.

ggplot(df, aes(x=Rainfall)) + geom_boxplot()
Resulting ggplot boxplot of rainfall data in R
Resulting ggplot boxplot of rainfall data in R

Summary

The examples above are by no means exhaustive of all the operations on dataframes that you can perform in both R and Python. But hopefully, it gives you a glimpse of how easy it can be to overcome your hesitation and perhaps try a new language on your next data wrangling task.

For next steps, if you’re really comfortable in working with pandas in Python, challenge yourself to work with the similar dplyr, tidyr, and reshape2 packages in R. And if you’re comfortable with these in R but are rusty in Python, hop right into using pandas as there are plenty more useful capabilities to explore.

Thank you for reading. Hope this was helpful and please let me know if you have any questions or interested in any other specific parallel series how-tos like this.


Related Articles