When it comes to data analysis, Pandas dominates the Data Science ecosystem. Its intuitive Python syntax and highly practical functions play a key role in Pandas' popularity.
The second-best player in the data analysis game is R. My favorite R package for this purpose is data.table, but the packages in the tidyverse are also highly popular.
I use both Pandas and data.table at my job. I can't really declare one superior to the other; they both have some pros and cons. Which one to use is a matter of your preferences, your team, or your company.
In this article, we will go over the Pandas and data.table versions of common data manipulation and analysis tasks. The examples will definitely be helpful if you plan to transition from one to the other or use both.
We, of course, need a dataset to work on. I prepared a sample sales dataset that you can download from my GitHub page. We will be using the one called "sample-sales-data.csv".
I would like to also mention which IDEs I prefer for working with these libraries. I'm using:
- RStudio for R
- PyCharm and Jupyter Notebook for Python.
Let’s start with reading the datasets.
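A minimal Pandas sketch of this step might look as follows (the file name matches the one mentioned above; the data.table equivalent would use fread):

```python
import pandas as pd

# Read the sample sales data into a DataFrame
# (adjust the path to wherever you saved the file)
sales = pd.read_csv("sample-sales-data.csv")

# Take a quick look at the first few rows
sales.head()
```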
The dataset contains daily sales information for some products and stores.
We usually start with checking the size of the dataset in terms of the number of rows and columns.
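In Pandas, the shape attribute gives the number of rows and columns; a quick sketch using the sales DataFrame from above:

```python
# (number of rows, number of columns)
sales.shape
```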
The dataset consists of 224,229 rows and 5 columns.
A common operation is to check the distribution of values in columns. For categorical columns, we can count the number of occurrences of each distinct value.
In Pandas, there is a specific function for this operation, value_counts. In data.table, we can use the group-by syntax. Let's do an example on the store column.
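A minimal Pandas sketch, assuming the column is simply named store:

```python
# Number of rows (i.e. occurrences) for each distinct store
sales["store"].value_counts()
```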
In data.table, we group the rows based on the distinct values in the store column and count the rows that belong to each group, which is essentially what the value_counts function does.
We have just seen how to count the number of data points in each group. There are many other aggregations we can perform for comparing different groups, such as mean, min, max, number of unique values, and so on.
We can calculate the total sales quantity for each store as follows:
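In Pandas, this can be sketched as below; the column names store and sales_qty are assumptions about this dataset:

```python
# Total sales quantity per store
sales.groupby("store")["sales_qty"].sum()
```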
The groupby function is used in Pandas. On the other hand, we don't need a specific function for grouping in data.table. It has a special syntax structure as follows:
dt[filtering or sorting, aggregations, grouping]
Both libraries allow for grouping by multiple columns and calculating multiple aggregations.
Here is a more complicated example. We will find the average and total sales per product and then select the top 5 products in terms of average sales.
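A Pandas sketch of this task could look as follows (again assuming the columns are named product and sales_qty):

```python
# Average and total sales quantity per product,
# sorted by the average in descending order, keeping the top 5
(
    sales.groupby("product")["sales_qty"]
         .agg(["mean", "sum"])
         .sort_values(by="mean", ascending=False)
         .head(5)
)
```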
The sort_values and order functions are used for sorting the values in Pandas and data.table, respectively. Both sort in ascending order by default. To sort in descending order instead, we can set the ascending parameter to False in Pandas. In data.table, we add a minus sign in front of the column name.
It is quite simple to add new columns in both libraries. For instance, we can extract the month information from the date and save it in a column named month.
It is important to note that for these operations to work, the date column must have an appropriate date data type (e.g., datetime in Pandas) rather than being stored as plain strings.
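Here is a minimal Pandas sketch, assuming the date column is named date:

```python
# Make sure the date column has a datetime data type
sales["date"] = pd.to_datetime(sales["date"])

# Extract the month part of the date into a new column
sales["month"] = sales["date"].dt.month
```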
Another common task in data analysis and manipulation is filtering. We filter observations (i.e. rows) based on a condition or a set of conditions.
The following lines of code select the first month (i.e. January) sales of store number 2.
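In Pandas, this could be sketched as follows, using the store and month columns from above (the column names are assumptions):

```python
# Sales of store number 2 in January (month == 1)
jan_store2 = sales[(sales["store"] == 2) & (sales["month"] == 1)]
jan_store2.head()
```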
The column names need to be written along with the name of the dataframe in Pandas, whereas we just write the column names in data.table.
When working with real-life datasets, we often have to deal with missing values. Handling them is typically a two-step task: the first step is to find the missing values and the second is to replace them with appropriate values.
Let’s say there are some missing values in the sales quantity column and we want to replace them with 0. Here is how we do this task:
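A Pandas sketch of this step, with sales_qty again assumed as the name of the sales quantity column:

```python
# Replace missing values in the sales quantity column with 0
sales["sales_qty"] = sales["sales_qty"].fillna(0)
```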
Both Pandas and data.table are highly efficient and practical libraries for data analysis and manipulation. We have seen how some typical tasks are done with them, and both are capable of much more complicated operations. I hope this article serves as a basic comparison and motivates you to try out both.
Don’t forget to subscribe if you’d like to get an email whenever I publish a new article.
You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you do so using the following link, I will receive a portion of your membership fee at no additional cost to you.
Thank you for reading. Please let me know if you have any feedback.