Introduction to PandaSQL: The Downsides No One Talks About

Pandas + SQL = PandaSQL = A big mess

Avi Chawla
Towards Data Science


Photo by Joshua Hoehne on Unsplash

Both Structured Query Language (SQL) and Pandas are undoubtedly the go-to tools for Data Scientists for tabular data management, processing, and analysis.

While Pandas is a popular Python library for data analysis, SQL is an entire query language of its own, used to interact with databases across many domains of computer science. What they have in common is that both are excellent tools for handling tabular data.

PandaSQL, a popular Python package, blends Pandas and SQL, letting you leverage the power of SQL syntax in a Pythonic environment. In other words, PandaSQL lets you query Pandas DataFrames using SQL syntax, which is a big win for anyone unfamiliar or uncomfortable with Pandas syntax.

This post serves as an introduction to PandaSQL. I’ll demonstrate how to use PandaSQL with example queries, and towards the end I’ll compare the run-times of Pandas and PandaSQL. A brief outline is given below:

Getting Started With PandaSQL
The caveats of PandaSQL

You can find the notebook for this article here.

Let’s begin 🚀!

Getting Started With PandaSQL

As mentioned above, PandaSQL is a Python library that gives you the flexibility to execute SQL queries over a Pandas DataFrame. The step-by-step process is demonstrated in detail below.

Step 1: Install PandaSQL

First and foremost, install the PandaSQL library using pip. You can find the instructions here or use the command below to install it:
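The package is available on PyPI as pandasql:

pip install pandasql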

Step 2: Import Requirements

Next, we import the library installed above, along with the other pre-installed libraries required in this project.
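A minimal set of imports for this walkthrough could look as follows (NumPy is only used to generate the dummy data in the next step):

import numpy as np
import pandas as pd
from pandasql import sqldf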

Step 3: Create a Dummy Dataset

To understand how to use SQL in a pythonic environment using PandaSQL, we shall create a dummy dataset of ten thousand employees. The image following the code below shows a glimpse of the dataset.
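A sketch along these lines works; the column names (Employee_ID, Organization, Salary) are my own assumptions for illustration:

np.random.seed(42)  # for reproducibility

data = pd.DataFrame({
    "Employee_ID":  np.arange(1, 10_001),                                      # hypothetical column
    "Organization": np.random.choice(["Org_A", "Org_B", "Org_C"], size=10_000),  # hypothetical column
    "Salary":       np.random.randint(40_000, 150_000, size=10_000),             # hypothetical column
})

data.head()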

First five rows of the DataFrame (Image by author).

Step 4: Create the PandaSQL Environment

The next step is to set up the SQL environment, which is done as follows:
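A common convention is a one-line wrapper around sqldf (the helper name pysqldf is arbitrary):

pysqldf = lambda q: sqldf(q, globals())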

We create a lambda function that takes a query q and the variables in the program’s scope, retrieved using globals(), which returns a Python dictionary.

Essentially, SQL queries are written as strings in Python. For PandaSQL to resolve the table name and the column names specified in the string, we retrieve all the global variables with the built-in globals() function and pass them to the defined lambda function. The data DataFrame defined in Step 3 can be referenced using globals() as follows:
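For instance, the following expression returns the very same DataFrame object as data:

globals()["data"]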

Step 5: Run SQL Queries on Pandas DataFrame

Once you have set up the SQL environment, you can proceed with executing SQL queries. Under the hood, PandaSQL runs your queries through SQLite, so the syntax is the familiar, standard SQL syntax.

Query 1:

For example, let’s write a query to count the number of records in the DataFrame. This is demonstrated below:
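Using the pysqldf helper defined above, the query might look like this:

pysqldf("SELECT COUNT(*) AS num_records FROM data")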

Query 2:

Next, if you want to query the DataFrame based on a condition, you can do so using the WHERE clause in SQL, as shown below:
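A sketch of such a filter, using the hypothetical Salary column from the dummy data above:

pysqldf("SELECT * FROM data WHERE Salary > 100000")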

Query 3:

Lastly, if you want to obtain the number of records corresponding to every organization in the given data, you can do so using the GROUP BY clause:
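With the assumed Organization column, the query could be:

query = """
SELECT Organization, COUNT(*) AS num_employees
FROM data
GROUP BY Organization
"""
pysqldf(query)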

Similar to what I demonstrated in the above three sample queries, you can write any SQL query of your choice and execute it over the Pandas DataFrame using the PandaSQL library.

The caveats of PandaSQL

PandaSQL is undoubtedly an excellent alternative for someone from a SQL background who is unfamiliar with the Pythonic syntax that Pandas demands. For instance, filtering DataFrames, or grouping them on a particular column while aggregating several others, might feel a little overwhelming or confusing. These operations, however, are easy to implement and interpret in SQL.

However, this ease of leveraging SQL in Pandas comes at a significant run-time cost. To understand this better, let’s compare the run-times of PandaSQL and Pandas: we’ll write a query in SQL and then time its corresponding implementation in Pandas. If you are unfamiliar with how fundamental SQL queries translate to Pandas operations, I recommend reading my post below:

Syntactical Caveat:

Python already reserves numerous names as keywords, such as for, while, in, if, else, elif, import, and as. Bringing SQL into the mix adds further restrictions on variable naming, since keywords such as create, like, where, and having can no longer be used as variable names either. I have implemented such a scenario below:
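For instance, suppose we store a copy of the DataFrame in a variable named group, which clashes with the SQL GROUP keyword; querying it then fails unless the name is quoted:

group = data.copy()

# Fails: "group" is a reserved SQL keyword and cannot be used
# as an unquoted table name by the underlying SQLite engine.
pysqldf("SELECT * FROM group")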

Execution Run-time Caveat

While the syntactical caveat can be handled once and for all by renaming variables, the run-time overhead is far more concerning. Consider the queries below to understand this better.

Query 1:

First, let’s compare the run-time of computing the number of records in the Dataframe. This is demonstrated below using both PandaSQL and Pandas:
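A rough benchmark with IPython’s %timeit magic might look like this (the exact timings will of course vary by machine):

%timeit pysqldf("SELECT COUNT(*) FROM data")
%timeit data.shape[0]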

To count the rows, PandaSQL takes close to 10⁵ times longer than the Pandas shape attribute.

Query 2:

Next, let’s compare the run-time to find the number of records corresponding to each organization in the given DataFrame. This is implemented below:
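One possible comparison, again assuming the Organization column from the dummy data:

%timeit pysqldf("SELECT Organization, COUNT(*) FROM data GROUP BY Organization")
%timeit data.groupby("Organization").size()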

Once again, Pandas outperforms PandaSQL by a considerable margin. In this case, Pandas is over 70 times faster than PandaSQL.

Query 3:

Now, I will demonstrate the difference in run-time for filtering the dataframe based on a condition using PandaSQL and Pandas. The SQL query and its corresponding Pandas method are implemented below:
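Using the hypothetical Salary column again:

%timeit pysqldf("SELECT * FROM data WHERE Salary > 100000")
%timeit data[data["Salary"] > 100000]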

PandaSQL takes over 160 times longer for the filtering operation than the Pandas-based method.

Query 4:

Lastly, let’s also consider the run-time differences for joins. We shall implement a self-join on the given DataFrame. This is demonstrated below:
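A sketch of the comparison, joining on the assumed Employee_ID column:

join_query = """
SELECT *
FROM data AS d1
JOIN data AS d2
  ON d1.Employee_ID = d2.Employee_ID
"""
%timeit pysqldf(join_query)
%timeit data.merge(data, on="Employee_ID")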

Yet again, Pandas is the clear winner, outperforming PandaSQL by 27 times.

To conclude, I agree that PandaSQL is a handy tool if you are uncomfortable with Pandas. However, it is definitely NOT a long-term solution to rely on in your data pipeline. Instead, spending meaningful time learning Pandas and its methods is the better investment.

But learning the whole Pandas library can be challenging, no? Don’t worry. Read my blog below to get started.

Thanks for reading.

Meme created by the author on imgflip.com


👉 Get a Free Data Science PDF (550+ pages) with 320+ tips by subscribing to my daily newsletter today: https://bit.ly/DailyDS.