Translating SQL Queries to Pandas Operations
SQL queries and their corresponding methods in Pandas
Both Pandas and Structured Query Language (SQL) are undoubtedly the go-to tools for Data Scientists for tabular data management, processing, and analysis. While Pandas is a popular Python library for data analysis, SQL is an entire programming language of its own to interact with Databases. One thing that stands out in common between them is that both are excellent tools for working on Tabular Data.
Since both Pandas and SQL are essentially used to handle and operate tabular data, similar operations can be performed using both. Therefore, this post attempts to translate the most commonly used SQL queries by Data Scientists to their equivalent operations in Pandas. The entire code accompanying this post can be found here: GitHub.
For experimentation purposes, I created a random dataset of 10 employees working in an organization. The dataset comprises seven columns: Employee ID, First Name, Last Name, Gender, Salary (in $), Level, and Date of Joining. This is shown in the image below:
Furthermore, I used ipython-sql to run SQL queries directly from a Jupyter Notebook and created a table in SQLite database as follows:
Next, let’s proceed towards executing commonly used queries in SQL for data analysis and translate them to their corresponding operations in Pandas.
Display First 5 (or k) Rows of the Table
In SQL, we can use limit after select and specify the number of records we want to display, as shown below:
Note that not all database systems support the limit clause. Below, I have summarized the syntax for a few database management systems.
In Pandas, using the head method and passing the number of rows as an argument does this job.
Select with Condition on One (or more) Columns
Say we wish to select only female employees working in the organization. In SQL, we can accomplish this using the where clause as shown below:
This would translate to the following in Pandas:
Select a Subset of Columns
Next, consider that we want to select only the ID, first name, and last name from the table. We can do this in SQL as follows:
Similar filtering can be executed in Pandas as follows:
Note: SQL syntax is case insensitive. Therefore, the following two queries will return the same result:
However, the syntax of Pandas is case-sensitive. Thus, column names should be specified in the same format and case as they appear in the original DataFrame.
Sorting the records based on Column values
In this task, assume that we want to sort the employee data in descending order of their salaries. This can be executed using the order by clause in SQL as follows:
In Pandas, we can use the sort_values() method and pass the column as an argument, as shown below:
Grouping records
Grouping is another widely-used operation while analyzing tabular data. Say we want to obtain the average salary of employees across levels and the number of employees working on each level. This can be accomplished using the group by clause as shown below:
On the other hand, we can use the groupby method in Pandas and pass the aggregations we need to perform as follows:
Number of Distinct records
Next, let’s see how we can display the distinct entries in a column and their count. In SQL, this can be done using the DISTINCT keyword as shown below:
To do this in Pandas, we can call the unique() and nunique() method on the series as shown below:
Number of Rows in the Table
Lastly, let’s see how we can find the number of rows in a table using SQL and Pandas. In SQL, we can do this using the count() method as shown below:
In Pandas, this translates to the shape attribute of a DataFrame which returns both the number of rows and the columns.
To conclude, in this post, we dived into a handful of SQL queries and looked into how to achieve equivalent results in Pandas. Since both Pandas and SQL are essentially used to handle and operate tabular data, we can perform similar operations using both. However, as per my experience, implementing complicated operations such as computing crosstabs and complex aggregation after groupby can be more challenging in SQL than in Pandas because of Pandas’ pythonic syntax.
Looking back to the last SQL-to-Pandas example, we notice that SQL only returns the number of rows in the table. If one wishes to compute the number of columns in a table using SQL, the method is complicated. On the other hand, doing this in Pandas requires printing a single attribute of the DataFrame (shape), and we get both the number of rows and columns in the DataFrame.
If you enjoyed reading this post and learned something new, I am sure you would enjoy my other articles, which are listed below: