
R Equivalent of 7 Common Pandas Operations

Hands-on tutorial for the Pandas and data.table libraries

Photo by NordWood Themes on Unsplash

Python and R are the two key players in the Data Science ecosystem. Both of these programming languages offer a rich selection of highly useful libraries.

When it comes to data analysis and manipulation, two libraries stand out: "data.table" for R and Pandas for Python.

I have been using both, but I cannot really declare one superior to the other. Although I personally like "data.table" better, I haven’t come across any task that cannot be done with both.

In this article, I will walk you through 7 typical data analysis operations with both Pandas and "data.table", so it will also serve as good practice for each library.

In an earlier article, I generated mock sales data using NumPy and Pandas. I will be using that data for the examples here. Let’s start by importing the libraries and reading the dataset.

#Pandas
import numpy as np
import pandas as pd
sales = pd.read_csv("/content/mock_sales.csv")
sales.head()
(image by author)
#data.table
library(data.table)
sales <- fread("~/Downloads/mock_sales.csv")
head(sales)
(image by author)

Example 1

Find the unique values in a column.

#Pandas
sales.ProductGroup.unique()
array(['D', 'A', 'C', 'B'], dtype=object)
#data.table
unique(sales$ProductGroup)
[1] "D" "A" "C" "B"

Example 2

Find the number of unique values in a column.

That is not the case in our example, but when a column has many unique values, you are probably more interested in how many there are than in displaying all of them.

We can perform this task just by adding the letter "n", although where it goes differs between Pandas and "data.table".

#Pandas
sales.ProductGroup.nunique()
4
#data.table
uniqueN(sales$ProductGroup)
4

Example 3

Calculate the average daily sales for each product group.

For this task, we use the groupby function of Pandas. It is a little simpler with "data.table".

#Pandas
sales.groupby("ProductGroup").agg(
   avg_daily_sales = ("SalesAmount", "mean")
)
#data.table
sales[, .(avg_daily_sales = mean(SalesAmount)), by=ProductGroup]
data.table output (image by author)
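As a small side note (my addition, not part of the original example), if you want the Pandas output to have ProductGroup as a regular column like the data.table result, you can pass as_index=False to groupby:

#Pandas (alternative sketch)
#as_index=False keeps ProductGroup as a regular column instead of the index
sales.groupby("ProductGroup", as_index=False).agg(
   avg_daily_sales = ("SalesAmount", "mean")
)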

Example 4

Find the top 3 stores in terms of total amount of sales.

The calculation is similar to the previous example; we just change the column names and the aggregation function. Unlike the previous example, we also sort the results in descending order and select the top 3.

#Pandas
sales.groupby("StoreCode").agg(
   total_sales = ("SalesAmount", "sum")
).sort_values(
   by="total_sales", ascending=False
)[:3]
#data.table
sales[, .(total_sales = sum(SalesAmount)), 
      by=StoreCode][order(-total_sales)][1:3]
Pandas output (image by author)
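As an alternative sketch on the Pandas side (my addition, not from the original article), nlargest can replace the explicit sort-and-slice:

#Pandas (alternative sketch)
#nlargest picks the top rows by a column without a separate sort step
sales.groupby("StoreCode").agg(
   total_sales = ("SalesAmount", "sum")
).nlargest(3, "total_sales")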

Example 5

Create a new column named "IsHigh" which takes the value 1 if sales amount is more than 40 and 0 otherwise.

There are several ways to do this operation. For instance, we can use the where function of numpy as follows:

#Pandas
sales["IsHigh"] = np.where(sales["SalesAmount"] > 40, 1, 0)
sales.head()
(image by author)
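Another option in Pandas, shown here as a quick sketch rather than part of the original example, is to cast the boolean comparison to an integer directly:

#Pandas (alternative sketch)
#the comparison returns booleans; astype(int) turns True/False into 1/0
sales["IsHigh"] = (sales["SalesAmount"] > 40).astype(int)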

On the R side, the ifelse function can be used for this task.

#data.table
sales[, IsHigh := ifelse(SalesAmount>40,1,0)]
head(sales)
(image by author)

Example 6

Find the number of distinct days in which the "IsHigh" column takes the value 1.

We have already covered how to find the number of distinct values in a column. This task involves an additional filtering operation.

#Pandas
sales[sales["IsHigh"]==1]["Date"].nunique()
731
#data.table
uniqueN(sales[IsHigh==1]$Date)
731
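On the Pandas side, the same result can also be written with .loc, which avoids the chained indexing above (a sketch, not from the original code):

#Pandas (alternative sketch)
#.loc filters the rows and selects the Date column in a single step
sales.loc[sales["IsHigh"] == 1, "Date"].nunique()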

Example 7

Find the number of occurrences of each distinct value in the product group column.

The value_counts function of Pandas returns the distinct values along with the number of occurrences (i.e. row count).

With "data.table", we can group the rows based on the product group column and count the rows using the N function.

#Pandas
sales.ProductGroup.value_counts()
B    164475 
D    164475 
A    164475 
C    164475
#data.table
sales[,.N,ProductGroup]
   ProductGroup      N
1:            D 164475
2:            A 164475
3:            C 164475
4:            B 164475
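For comparison, a Pandas sketch that mirrors the data.table grouping approach (my addition, not in the original article) combines groupby with size:

#Pandas (alternative sketch)
#size() counts the rows in each group, like .N in data.table
sales.groupby("ProductGroup").size()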

The number of occurrences is the same for each group because I generated the data that way. There is nothing wrong with the methods 😊


Conclusion

Both R and Python have several third-party libraries that help data scientists a lot. In most cases, a task can be done with libraries from either language, so declaring one superior to the other would be a controversial discussion.

In this article, we went through 7 examples that demonstrate the R and Python versions of some common data analysis tasks.

Thank you for reading. Please let me know if you have any feedback.

