Translating SQL Queries to Pandas Operations

SQL queries and their corresponding methods in Pandas

Avi Chawla

Follow

Published in

Towards Data Science

6 min readJul 3, 2022

--

Both Pandas and Structured Query Language (SQL) are undoubtedly the go-to tools for Data Scientists for tabular data management, processing, and analysis. While Pandas is a popular Python library for data analysis, SQL is an entire programming language of its own to interact with Databases. One thing that stands out in common between them is that both are excellent tools for working on Tabular Data.

Since both Pandas and SQL are essentially used to handle and operate tabular data, similar operations can be performed using both. Therefore, this post attempts to translate the most commonly used SQL queries by Data Scientists to their equivalent operations in Pandas. The entire code accompanying this post can be found here: GitHub.

For experimentation purposes, I created a random dataset of 10 employees working in an organization. The dataset comprises seven columns: Employee ID, First Name, Last Name, Gender, Salary (in $), Level, and Date of Joining. This is shown in the image below:

Random Employee Dataset (Image by author)

Furthermore, I used ipython-sql to run SQL queries directly from a Jupyter Notebook and created a table in SQLite database as follows:

Code snippet for creating a table using SQL (Image by author)

Next, let’s proceed towards executing commonly used queries in SQL for data analysis and translate them to their corresponding operations in Pandas.

Display First 5 (or k) Rows of the Table

In SQL, we can use limit after select and specify the number of records we want to display, as shown below:

SQL syntax for displaying the first five rows of a table (Image by author)

Note that not all database systems support the limit clause. Below, I have summarized the syntax for a few database management systems.

SQL syntax for displaying the first k rows in different DBMS (Image by author)

In Pandas, using the head method and passing the number of rows as an argument does this job.

Pandas Syntax for displaying the first five rows (Image by author)

Result of the first five rows (Image by author)

Select with Condition on One (or more) Columns

Say we wish to select only female employees working in the organization. In SQL, we can accomplish this using the where clause as shown below:

SQL syntax for filtering using condition (Image by author)

This would translate to the following in Pandas:

Pandas syntax for filtering using condition (Image by author)

Select a Subset of Columns

Next, consider that we want to select only the ID, first name, and last name from the table. We can do this in SQL as follows:

SQL syntax for selecting a subset of columns (Image by author)

Similar filtering can be executed in Pandas as follows:

Pandas syntax for selecting a subset of columns (Image by author)

Note: SQL syntax is case insensitive. Therefore, the following two queries will return the same result:

SQL syntax depicting its case-insensitive behavior (Image by author)

However, the syntax of Pandas is case-sensitive. Thus, column names should be specified in the same format and case as they appear in the original DataFrame.

Sorting the records based on Column values

In this task, assume that we want to sort the employee data in descending order of their salaries. This can be executed using the order by clause in SQL as follows:

SQL syntax for ordering data on column values (Image by author)

In Pandas, we can use the sort_values() method and pass the column as an argument, as shown below:

Pandas syntax for ordering data on column values (Image by author)

Result of ordering data on column values (Image by author)

Grouping records

Grouping is another widely-used operation while analyzing tabular data. Say we want to obtain the average salary of employees across levels and the number of employees working on each level. This can be accomplished using the group by clause as shown below:

On the other hand, we can use the groupby method in Pandas and pass the aggregations we need to perform as follows:

Pandas syntax for grouping records (Image by author)

Result for grouping records (Image by author)

Number of Distinct records

Next, let’s see how we can display the distinct entries in a column and their count. In SQL, this can be done using the DISTINCT keyword as shown below:

SQL syntax for finding distinct records (Image by author)

To do this in Pandas, we can call the unique() and nunique() method on the series as shown below:

Pandas syntax for finding distinct records (Image by author)

Result for distinct records in a table (Image by author)

Number of Rows in the Table

Lastly, let’s see how we can find the number of rows in a table using SQL and Pandas. In SQL, we can do this using the count() method as shown below:

SQL syntax for displaying the number of rows in a table (Image by author)

In Pandas, this translates to the shape attribute of a DataFrame which returns both the number of rows and the columns.

Pandas syntax for displaying the shape of the DataFrame (Image by author)

To conclude, in this post, we dived into a handful of SQL queries and looked into how to achieve equivalent results in Pandas. Since both Pandas and SQL are essentially used to handle and operate tabular data, we can perform similar operations using both. However, as per my experience, implementing complicated operations such as computing crosstabs and complex aggregation after groupby can be more challenging in SQL than in Pandas because of Pandas’ pythonic syntax.

Looking back to the last SQL-to-Pandas example, we notice that SQL only returns the number of rows in the table. If one wishes to compute the number of columns in a table using SQL, the method is complicated. On the other hand, doing this in Pandas requires printing a single attribute of the DataFrame (shape), and we get both the number of rows and columns in the DataFrame.