Python and R are the dominant programming languages in the data science ecosystem. Both provide numerous packages and frameworks for efficient data analysis and manipulation.
In this article, we will compare two highly popular libraries in terms of data manipulation and transformation tasks.
- Pandas: Data analysis and manipulation library for Python
- dplyr: Data manipulation package for R
The following examples consist of simple tasks solved with both pandas and dplyr. There are many ways to use these packages; I'm using the RStudio IDE for R and Google Colab for Python.
The first step is to import the dependencies and read the data.
#Python
import numpy as np
import pandas as pd
marketing = pd.read_csv("/content/DirectMarketing.csv")
#R
library(readr)
library(dplyr)
marketing <- read_csv("/home/soner/Downloads/datasets/DirectMarketing.csv")
We have a dataset about a marketing campaign stored in a csv file. We read this dataset into a dataframe for pandas and a tibble for dplyr.


We can now start on the tasks.
Filter based on a condition
Task: Filter the rows in which the amount spent is more than 2000.
The following code creates a new dataframe or tibble based on the given condition.
#pandas
high_amount = marketing[marketing.AmountSpent > 2000]
#dplyr
high_amount <- filter(marketing, AmountSpent > 2000)
For pandas, we apply the filtering condition on the dataframe like we are indexing. For dplyr, we pass both the dataframe and the condition to the filter function.
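To make the pandas indexing step concrete, here is a minimal sketch using a hypothetical sample with the same column name; the condition first produces a boolean mask, and indexing with that mask keeps only the matching rows:

```python
import pandas as pd

# Hypothetical sample standing in for the marketing dataset
marketing = pd.DataFrame({'AmountSpent': [500, 2500, 3000, 1200]})

# The condition evaluates to a boolean Series (a "mask")
mask = marketing.AmountSpent > 2000
print(mask.tolist())  # [False, True, True, False]

# Indexing the dataframe with the mask keeps only the True rows
high_amount = marketing[mask]
print(high_amount['AmountSpent'].tolist())  # [2500, 3000]
```

The mask-based style is exactly what the one-liner in the article does; splitting it in two just makes the intermediate boolean Series visible.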
Filter on multiple conditions
Task: Filter the rows in which the amount spent is more than 2000 and the history is high.
#pandas
new = marketing[(marketing.AmountSpent > 2000) & (marketing.History == 'High')]
#dplyr
new <- filter(marketing, AmountSpent > 2000 & History == 'High')
For both libraries, we can combine the multiple conditions using logical operators.
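On the pandas side, the same multi-condition filter can also be written with `DataFrame.query`, which takes the condition as a string and reads a little closer to dplyr's `filter`. A sketch with hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample standing in for the marketing dataset
marketing = pd.DataFrame({
    'AmountSpent': [500, 2500, 3000, 1200],
    'History': ['Low', 'High', 'Medium', 'High'],
})

# Boolean indexing, as in the article
new = marketing[(marketing.AmountSpent > 2000) & (marketing.History == 'High')]

# Equivalent filter expressed as a query string
new_q = marketing.query("AmountSpent > 2000 and History == 'High'")

print(new.equals(new_q))  # True
```

With `query`, the conditions are combined with `and`/`or` instead of `&`/`|`, and no extra parentheses are needed around each condition.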
We will create a sample dataframe and tibble for the following two examples.
#pandas
df = pd.DataFrame({
'cola':[1,3,2,4,6],
'colb':[12, 9, 16, 7, 5],
'colc':['a','b','a','a','b']
})
#dplyr
df <- tibble(cola = c(1,3,2,4,6),
             colb = c(12, 9, 16, 7, 5),
             colc = c('a','b','a','a','b'))

Sort based on a column
Task: Sort the rows in df based on cola.
#pandas
df.sort_values('cola')
#dplyr
arrange(df, cola)
We use the sort_values function in pandas and the arrange function in dplyr. Both of them sort the values in ascending order by default.
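To illustrate overriding that default on the pandas side, a short sketch (reusing the small `cola` column from the sample dataframe) that passes `ascending=False` to `sort_values`:

```python
import pandas as pd

df = pd.DataFrame({'cola': [1, 3, 2, 4, 6]})

# ascending=True is the default; pass ascending=False for descending order
desc_sorted = df.sort_values('cola', ascending=False)
print(desc_sorted['cola'].tolist())  # [6, 4, 3, 2, 1]
```

The dplyr counterpart would wrap the column in `desc()`, as shown in the next task.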

Sort based on multiple columns
Task: Sort the rows based on colc in descending order first and cola in ascending order.
This is more complicated than the previous example, but the logic is the same. We also need to override the default ascending order for one of the columns.
#pandas
df.sort_values(['colc','cola'], ascending=[False, True])
#dplyr
arrange(df, desc(colc), cola)

For pandas, the values are sorted based on the columns in the given list. The order of the columns in the list matters. We also pass a list to the ascending parameter if we want to change the default behavior.
For dplyr, the syntax is a little simpler. We can change the default behavior by using the desc keyword.
Selecting a subset of columns
We may only need some of the columns in a dataset. Both pandas and dplyr provide simple ways to select a column or a list of columns.
Task: Create a subset of the dataset by selecting the location, salary, amount spent columns.
#pandas
subset = marketing[['Location','Salary','AmountSpent']]
#dplyr
subset <- select(marketing, Location, Salary, AmountSpent)

The logic is the same and the syntax is pretty similar.
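For completeness, pandas offers a couple of equivalent ways to express the same column selection besides double-bracket indexing; a sketch with hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample standing in for the marketing dataset
marketing = pd.DataFrame({
    'Location': ['Far', 'Close'],
    'Salary': [50000, 60000],
    'AmountSpent': [700, 1400],
    'History': ['Low', 'High'],
})

# Double-bracket indexing, as in the article
subset = marketing[['Location', 'Salary', 'AmountSpent']]

# Equivalent selections using loc and filter
subset_loc = marketing.loc[:, ['Location', 'Salary', 'AmountSpent']]
subset_filter = marketing.filter(items=['Location', 'Salary', 'AmountSpent'])

print(subset.equals(subset_loc) and subset.equals(subset_filter))  # True
```

All three return a new dataframe containing only the requested columns, in the order they were listed.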
Creating a new column based on the existing ones
In some cases, we need to combine columns with a transformation to create a new column.
Task: Create a new column called spent_ratio which is the ratio of the amount spent to the salary.
#pandas
subset['spent_ratio'] = subset['AmountSpent'] / subset['Salary']
#dplyr
mutate(subset, spent_ratio = AmountSpent / Salary)
We use the mutate function of dplyr whereas we can directly apply simple math operations on the columns with pandas.
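A side note on the pandas line above: assigning into a subset that was sliced from another dataframe can trigger a `SettingWithCopyWarning`. The `assign` method returns a new dataframe instead of modifying in place, which mirrors dplyr's mutate more closely. A sketch with hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample standing in for the selected subset
subset = pd.DataFrame({
    'Salary': [50000, 60000],
    'AmountSpent': [700, 1400],
})

# assign returns a new dataframe with the extra column,
# leaving the original subset untouched (mutate-like behavior)
ratio = subset.assign(spent_ratio=subset['AmountSpent'] / subset['Salary'])

# spent_ratio = AmountSpent / Salary for each row
print(ratio.columns.tolist())  # ['Salary', 'AmountSpent', 'spent_ratio']
```

Like mutate, `assign` leaves the input unchanged unless you reassign the result.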

Conclusion
We have compared how simple data manipulation tasks are done with pandas and dplyr. These are just the basic operations, but they are essential for understanding more complex and advanced ones.
These libraries provide many more functions and methods. In fact, both are quite versatile and powerful data analysis tools.
In the following articles, I will compare these two libraries based on complex data manipulation and transformation tasks.
Stay tuned for the upcoming articles!
Thank you for reading. Please let me know if you have any feedback.