Python and R are the predominant languages in the Data Science ecosystem. They both provide numerous libraries to perform efficient data wrangling and analysis.
When it comes to data analysis, Pandas has always been the first choice for me. However, I feel like some of my preferences might change because of the outstanding capabilities of R packages.
In this article, we will do a slightly complicated task using R packages. I will explain it step-by-step to make the operation in each step clear. At the end, you will see how each step is combined to accomplish the task in one big step.
It is worth noting that the same operation can also be done with Pandas. In fact, I will also provide the Pandas code that does it so you will have the chance to make a comparison.
We will use a part of the Netflix dataset on Kaggle. We start by importing the libraries and reading the dataset.
library(data.table)
library(stringr)
netflix <- fread(
"~/Downloads/netflix_titles.csv", select = c("type", "title",
"date_added", "duration", "release_year", "country")
)
head(netflix)

The dataset contains some features of the TV shows and movies on Netflix such type, length, country, and so on.
The task is to find the top 5 country in terms of the average movie length in movies. As an additional condition, I want to filter out the countries that have less than 10 titles.
The first step is to create a column that contains the movie length in minutes. We can obtain it by extracting the numerical part from the duration column.
netflix$duration_qty <- as.numeric(str_split_fixed(netflix$duration, " ", 2)[,1])
We use the str_split_fixed function of the stringr package to split the values at the space character. We then take the first part and convert it to a number using the as.numeric function.
The next step is to filter the movie titles because we are not interested in the TV shows. The filtering with the data.table package can be done as follows:
netflix[type == "Movie"]
The average movie length can be calculated as follows:
netflix[
type == "Movie",
.(avg_length = mean(duration_qty)
]
However, we want to calculate the average movie length for each category in the country column separately. This task involves a "group by" operation. We can implement it using the by parameter.
netflix[
type == "Movie",
.(avg_length = mean(duration_qty),
by = "country"
]
Since we would like to filter out the countries that have less than 10 movies, we need a variable to hold the quantity information. It can easily be done by adding the count (N) function.
netflix[
type == "Movie",
.(avg_length = mean(duration_qty, .N),
by = "country"
]
The remaining parts of the task are:
- Filter the countries with less than 10 movies
- Sort the countries according to the average movie length in descending order
- Select the first 5
[N>10] # filter
[order(avg_length, decreasing = TRUE)] # sort
[1:5] # select
We can now combine each step in a single operation.
netflix[, duration_qty := as.numeric(str_split_fixed(netflix$duration, " ", 2)[,1])][type == "Movie", .(avg_length = mean(duration_qty), .N), by='country'][N>10][order(avg_length, decreasing = TRUE)][1:5]

Here is how the same task is done with Pandas (You need to import pandas and read the dataset into a dataframe first).
netflix["duration_qty"] = netflix.duration.str.split(" ")
.str.get(0).astype('int')
netflix[netflix["type"] == "Movie"].groupby("country")
.agg(avg_length = ("duration_qty", "mean"),
N = ('duration_qty', 'count'))
.query("N > 10")
.sort_values(by="avg_length", ascending=False)[:5]

Conclusion
Both Pandas and data.table libraries allow for chaining multiple operations. Thus, we can break a complicated task into several small steps and then combine them. Such modularity also provides a great flexibility which I think is very important for certain data wrangling tasks.
Both of these libraries are very powerful and capable of successfully completing highly complicated operations. Thus, you can pick any of them for the data wrangling and analysis process.
Although I have used Pandas far more often, the syntax of data.table seems more appealing to me.
Thank you for reading. Please let me know if you have any feedback.