The Data Science ecosystem consists of numerous software tools and packages that make our lives easier. Some of them are optimized to perform better and more efficient at certain tasks. However, we have many options for typical data analysis and manipulation tasks.
In this article, we will compare Python, R, and SQL with respect to typical operations in exploratory data analysis. The examples can be considered a basic level. The goal of the article is to emphasize the similarities and differences between these tools.
I also wanted to point out how same operations can be done with a different set of tools. Although there are syntactical differences, the logic behind the operations and the approach for handling a particular task is quite similar.
In the following examples, I will define a task and complete it using Pandas library (Python), Data.table library (R), and SQL.
Here is a snapshot of the dataset that will be used in the examples.
Example 1
Find the average price of items for each store id.
Sql: We select the store id and price columns. The aggregation on the price column is specified while selecting it. We then group the values by the store id column.
mysql> select store_id, avg(price)
-> from items
-> group by store_id;
+----------+------------+
| store_id | avg(price) |
+----------+------------+
| 1 | 1.833333 |
| 2 | 3.820000 |
| 3 | 3.650000 |
+----------+------------+
Pandas: We select the columns and apply the group by function. The last step is the aggregate function which is the mean.
items[['store_id','price']].groupby('store_id').mean()
price
store_id
1 1.833333
2 3.820000
3 3.650000
Data.table: The syntax is kind of a mixture of Pandas and SQL. We apply the aggregation and specify the grouping column while selecting the columns.
> items[, .(mean(price)), by = .(store_id)]
store_id V1
1: 1 1.833333
2: 2 3.820000
3: 3 3.650000
Example 2
What is the price of the most expensive item in store 3?
It is similar to the previous example with additional filtering. We are only interested in store 3.
SQL: We select the price column and apply the max function. The filtering is done by using the where clause.
mysql> select max(price) from items
-> where store_id = 3;
+------------+
| max(price) |
+------------+
| 7.50 |
+------------+
Pandas: We first apply the filter and select the column of interest. Then the max function is applied.
items[items.store_id == 3]['price'].max()
7.5
Data.table: The filtering is similar to Pandas but the aggregation is similar to the SQL syntax.
> items[store_id == 3, max(price)]
[1] 7.5
You may have noticed a small difference in the syntax for data.table. The aggregation function is specified with a dot (.(mean(price)) in the previous example but without a dot in this example (max(price)).
Using the notation with dot returns a table whereas an array is returned if used without the dot.
Example 3
List the items and their prices in store 1 and sort them based on the price in descending order.
SQL: In addition to what we have seen up to this point, the order by clause is added at the end to sort the results. It sorts in ascending order by default so we need to change it using the desc keyword.
mysql> select description, price
-> from items
-> where store_id = 1
-> order by price desc;
+-------------+-------+
| description | price |
+-------------+-------+
| banana | 3.45 |
| apple | 2.45 |
| lettuce | 1.80 |
| cucumber | 1.20 |
| bread | 1.15 |
| tomato | 0.95 |
+-------------+-------+
Pandas: Sorting is done using the sort_values function. Pandas also sorts in ascending order by default which can be changed with the ascending parameter.
items[items.store_id == 1][['description','price']]
.sort_values(by='price', ascending=False)
description price
1 banana 3.45
0 apple 2.45
4 lettuce 1.80
11 cucumber 1.20
14 bread 1.15
7 tomato 0.95
Data.table: The sorting operation is done by using the order function as below. We change the default behavior of sorting in ascending order by adding a minus sign.
> items[store_id == 1, .(description, price)][order(-price)]
description price
1: banana 3.45
2: apple 2.45
3: lettuce 1.80
4: cucumber 1.20
5: bread 1.15
6: tomato 0.95
Example 4
Show all the rows in which the description of item contains the word "egg".
SQL: This task includes a filtering based on strings. Since we are not making an exact comparison, we will use the like keyword.
mysql> select * from items
-> where description like '%egg%';
+---------+-------------+-------+----------+
| item_id | description | price | store_id |
+---------+-------------+-------+----------+
| 9 | egg 15 | 4.40 | 3 |
| 11 | egg 30 | 7.50 | 3 |
+---------+-------------+-------+----------+
Pandas: We will use the contains function of str accessor.
items[items.description.str.contains("egg")]
item_id description price store_id 8 9 egg 15 4.4 3
10 11 egg 30 7.5 3
Data.table: The filtering is quite similar to the SQL syntax. We will use the like keyword as below.
> items[description %like% "egg"]
V1 item_id description price store_id
1: 8 9 egg 15 4.4 3
2: 10 11 egg 30 7.5 3
Example 5
Find the number of items sold in each store.
SQL: The count function can be used as below:
mysql> select store_id, count(description) as item_count
-> from items
-> group by store_id;
+----------+------------+
| store_id | item_count |
+----------+------------+
| 1 | 6 |
| 2 | 5 |
| 3 | 4 |
+----------+------------+
Pandas: There is dedicated function for such tasks. The value_counts function returns the number of occurrences for each distinct value.
items.store_id.value_counts()
1 6
2 5
3 4
Name: store_id, dtype: int64
Data.table: We use the N option for the aggregation which does the same operation as the count function in SQL.
> items[, .N, by=(store_id)]
store_id N
1: 1 6
2: 2 5
3: 3 4
Conclusion
We have done some basic data analysis and manipulation operations. There is, of course, much more we can do with these tools. In fact, they provide versatile and powerful functions to complete advanced and complex tasks.
The goal of this article is to show the similarities and differences between these tools. Having a broad selection of tools might be intimidating but they all are capable of handling most of what you need. After a while, it comes down to a decision based on your taste.
Thank you for reading. Please let me know if you have any feedback.