
Why It Matters to Choose Python or R for Data Analysis

A comparative practical guide

Photo by Javier Allegue Barros on Unsplash

Python and R are the predominant programming languages in the Data Science ecosystem. Both provide various libraries to perform efficient data wrangling and analysis.

One of the most common questions that aspiring data scientists or analysts face is which programming language to choose for learning data science.

There are numerous articles that compare Python and R from many different perspectives. This article is also a comparison. However, the goal is not to declare one superior to the other.

What I want to emphasize in this article is that both Python and R libraries are capable of efficient data wrangling and analysis. I will go over several examples that accomplish the same operation using both Python and R.

The libraries we will be using are Pandas for Python, and data.table and stringr for R. We will work through the examples using the Netflix dataset on Kaggle. We start by importing the libraries and reading the dataset.

# Pandas
import pandas as pd
netflix = pd.read_csv("/content/netflix_titles.csv")

# data.table
library(data.table)
netflix <- fread("Downloads/netflix_titles.csv")

The dataset contains detailed information on 7787 titles. Here is a list of the features (i.e. columns).

['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']

A title can be a movie or TV show. Let’s first check the number of titles in each category. In Pandas, we can use the groupby or the value_counts function.

netflix['type'].value_counts()
Movie      5377 
TV Show    2410
netflix.groupby('type').agg(number_of_titles = ('show_id','count'))
         number_of_titles
type    
Movie                5377
TV Show              2410

In data.table, we can accomplish this task as follows:

netflix[, .(number_of_titles = .N), by=type]
      type number_of_titles
1: TV Show             2410
2:   Movie             5377

The by parameter acts like the groupby function of Pandas, and the .N symbol counts the number of rows in each group.
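For completeness, data.table's .N maps most closely to groupby(...).size() in Pandas. A minimal sketch on a toy frame (the values here are made up for illustration, not taken from the Netflix data):

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are made up)
df = pd.DataFrame({"type": ["Movie", "TV Show", "Movie", "Movie"]})

# data.table's .N corresponds most closely to groupby(...).size() in Pandas:
# it returns one row count per group
counts = df.groupby("type").size()
print(counts["Movie"])    # 3
print(counts["TV Show"])  # 1
```

Unlike agg with a count on a specific column, size() counts all rows in the group, which matches .N exactly.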


We may want to calculate the average length of movies on Netflix. The current format of the duration column is not suitable for calculation. It contains both the quantity and the unit:

netflix.duration[:5] # Pandas
0    4 Seasons 
1       93 min 
2       78 min 
3       80 min 
4      123 min

We will create two new columns that hold the duration quantity and the unit separately.

# Pandas
netflix['duration_qty'] = netflix.duration.str.split(" ").str.get(0).astype('int')
netflix['duration_unit'] = netflix.duration.str.split(" ").str.get(1)

# stringr and data.table
library(stringr)
netflix[, duration_qty := as.numeric(str_split_fixed(duration, " ", 2)[, 1])]
netflix[, duration_unit := str_split_fixed(duration, " ", 2)[, 2]]

What we have done is basically the same operation with different syntax: we split the duration column at the space character, then use the first part as the quantity and the second part as the unit.

The quantity part needs to be converted to a numeric data type. In Pandas, we use the astype function; the same conversion is done with the as.numeric function in R.
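As an aside, Pandas can also perform the split once and return the parts as separate columns with str.split(expand=True). A sketch on made-up durations mimicking the Netflix column:

```python
import pandas as pd

# Made-up durations mimicking the Netflix column
df = pd.DataFrame({"duration": ["4 Seasons", "93 min", "78 min"]})

# expand=True returns the split parts as columns of a new frame,
# so both assignments come out of a single split
parts = df["duration"].str.split(" ", expand=True)
df["duration_qty"] = parts[0].astype("int")
df["duration_unit"] = parts[1]
print(df["duration_qty"].tolist())   # [4, 93, 78]
print(df["duration_unit"].tolist())  # ['Seasons', 'min', 'min']
```

This avoids calling str.split twice on the same column, which matters on larger datasets.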

Let’s take a look at the derived columns along with the original one.

# data.table
head(netflix[, c("duration", "duration_qty", "duration_unit")]) 

We can now calculate the average movie length in minutes.

# Pandas
netflix[netflix['type'] == 'Movie']['duration_qty'].mean()
99.30798
# data.table
netflix[type == "Movie", mean(duration_qty)]
99.30798

The average movie length is about 100 minutes. Movie length likely varies substantially across countries, so let's check the top 5 countries in terms of average movie length.

# data.table
netflix[type == 'Movie', .(avg_length = mean(duration_qty)), by='country'][order(avg_length, decreasing = TRUE)][1:5]
# Pandas
(netflix[netflix['type'] == 'Movie'].groupby('country')
 .agg(avg_length=('duration_qty', 'mean'))
 .sort_values(by='avg_length', ascending=False)[:5])

Let’s elaborate on the syntax. We group the rows by country and calculate the average movie length for each group in the country column. Then we sort the calculated average values in descending order and select the top 5.
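The sort-then-slice step can also be written with nlargest, which is equivalent for this purpose. A sketch on a hypothetical mini-dataset with the same columns as above:

```python
import pandas as pd

# Hypothetical mini-dataset with the same columns as the example above
df = pd.DataFrame({
    "type": ["Movie"] * 4,
    "country": ["A", "A", "B", "C"],
    "duration_qty": [100, 120, 200, 90],
})

# nlargest(n, column) replaces sort_values(...)[:n] for the top-n pattern
top = (df[df["type"] == "Movie"]
       .groupby("country")
       .agg(avg_length=("duration_qty", "mean"))
       .nlargest(2, "avg_length"))
print(top.index.tolist())  # ['B', 'A']
```

Here country B has one movie of 200 minutes, so it tops country A's average of 110; nlargest returns the rows already sorted in descending order.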

The top 5 averages are around 200 minutes, which is much higher than the overall average of 99 minutes. These values may be extreme cases: the countries may contain very few movies, so the average is not very meaningful.

One way to confirm our suspicion is to also add the number of movies per country, or to filter out countries with fewer than a certain number of movies.

# Pandas
(netflix[netflix['type'] == 'Movie'].groupby('country')
 .agg(avg_length=('duration_qty', 'mean'),
      qty=('duration_qty', 'count'))
 .query('qty > 10').sort_values(by='avg_length', ascending=False)[:5])

# data.table
netflix[type == 'Movie', .(avg_length = mean(duration_qty), .N), by='country'][N > 10][order(avg_length, decreasing = TRUE)][1:5]

We have added a column that indicates the number of movies in each country, and filtered the rows based on that column before sorting the values.

On average, the longest movies are from Pakistan, India, and South Korea.


Conclusion

We have worked through some examples that demonstrate typical data wrangling and analysis tasks. The point is that both Python and R libraries provide highly efficient ways of accomplishing them.

Whatever programming language you choose for learning data science, you will be fine. Once you learn or practice with one of them, it becomes relatively easy to get used to the other; the logic and the way of implementation are very similar for most tasks.

Thank you for reading. Please let me know if you have any feedback.

