
Query Pandas DataFrame with SQL

Can you use SQL in Pandas? Yes, and this is how.

Photo by .M.Q Huang from Pexels

Motivation

The Python Pandas library and Structured Query Language (SQL) are among the most essential tools in a Data Scientist's toolbox. While Pandas is a powerful tool for data manipulation, many Data Scientists are more familiar with SQL and prefer to use it for data manipulation instead. In this article we will examine how to manipulate a Pandas Dataframe using SQL with the pandasql library.

What is Pandasql?

Pandasql is a Python library that allows manipulation of a Pandas Dataframe using SQL. Under the hood, Pandasql creates an SQLite table from the Pandas Dataframe of interest and allows users to query that table using SQL.
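Conceptually, the round trip looks something like this. The sketch below uses the standard-library sqlite3 module and a toy DataFrame to illustrate the idea, not pandasql's actual internals:

```python
import sqlite3

import pandas as pd

# A toy DataFrame standing in for the one we want to query.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Copy the DataFrame into an in-memory SQLite database...
conn = sqlite3.connect(":memory:")
df.to_sql("df", conn, index=False)

# ...run the SQL against it, and read the result back as a DataFrame.
result = pd.read_sql("SELECT * FROM df WHERE a > 1", conn)
conn.close()

print(result)
```

Pandasql automates exactly this copy-query-return cycle for you, which is also why its output is always a fresh DataFrame rather than a view of the original.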

How Does Pandasql Work?

Install Pandasql package.

!pip install -U pandasql

Import the necessary packages.

from pandasql import sqldf
import pandas as pd
from sklearn import datasets

We are using the iris dataset as our example. df_feature is a Dataframe containing the features, while df_target is a Series containing the target. Pandasql works on both Pandas DataFrames and Series.

df_feature = datasets.load_iris(as_frame = True)['data']
df_target = datasets.load_iris(as_frame = True)['target']
print (type(df_feature))
print (type(df_target))
>> <class 'pandas.core.frame.DataFrame'>
>> <class 'pandas.core.series.Series'>

The sqldf method is used to query the Dataframes, and it requires two inputs:

  1. The SQL query string
  2. The globals() or locals() function

A typical query would look like this, where q is the SQL query string. sqldf returns the result in a Dataframe.

q = "SELECT * FROM df_target LIMIT 3"
sqldf(q, globals())
Image by Author

globals() and locals() are built-in functions in Python that return the functions and variables currently in scope. Let's examine what the globals() function does.

globals()
Image by Author

The globals() function returns a dictionary of the variables created in this session, such as df_feature and df_target. The keys of the dictionary are the variable names, and the values contain the actual values of the variables.

print (globals().keys())
>> dict_keys(['__name__', '__doc__', '__package__', '__loader__', '__spec__', '__builtin__', '__builtins__', '_ih', '_oh', '_dh', '_sh', 'In', 'Out', 'get_ipython', 'exit', 'quit', '_', '__', '___', '_i', '_ii', '_iii', '_i1', '_exit_code', '_i2', 'sqldf', 'pd', 'datasets', '_i3', 'df_feature', 'df_target', '_i4', '_4', '_i5', '_5', '_i6'])

Since the globals() function outputs a dictionary, we can access the value of a variable through it in the following manner:

globals()['df_feature']

This will return the df_feature Dataframe.

Image by Author

Examples

Now that we understand how the globals() and locals() functions work with Pandasql, let's take a look at some examples. We create a new function named pysqldf to avoid passing in globals() or locals() for every query.

pysqldf = lambda q: sqldf(q, globals())

We can simply query Dataframes in the following manner.

query = 'SELECT * FROM df_feature LIMIT 3'
pysqldf(query)
Image by Author
query = 'SELECT * FROM df_target LIMIT 3'
pysqldf(query)
Image by Author

Let’s join the two Dataframes df_feature and df_target. In Pandas, this can be done with the pd.concat method.

pd.concat([df_feature, df_target], axis = 1).head()
Image by Author

If we are using SQL, we need a running row number column to join the two tables on. Since Pandasql uses SQLite under the hood, every SQLite table has a rowid column created by default. This column contains incremental integer values starting from 1.

query = 'SELECT rowid, * FROM df_feature LIMIT 3'
pysqldf(query)
Image by Author

We can now join both tables on the rowid column. The output can be assigned to another variable, which can later be queried with Pandasql again.

query = 'SELECT * FROM df_feature INNER JOIN df_target ON df_feature.rowid = df_target.rowid'
df = pysqldf(query)
df.head()
Image by Author

Here are examples of other operations that we can perform.

Finding the mean sepal length for each target class. Notice that "sepal length (cm)" is wrapped in double quotes. This is only necessary when a column name contains spaces.

query = 'SELECT target, AVG("sepal length (cm)") AS mean_sepal_length FROM df GROUP BY target'
pysqldf(query)
Image by Author
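The same aggregation can be cross-checked in plain Pandas with groupby. Here is a quick sketch using a small stand-in frame with the same column names (illustrative values, not the real iris measurements):

```python
import pandas as pd

# Stand-in for the joined df; same column names, toy values.
df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 7.0, 6.4],
    "target": [0, 0, 1, 1],
})

# Plain-Pandas equivalent of:
# SELECT target, AVG("sepal length (cm)") AS mean_sepal_length
# FROM df GROUP BY target
mean_sepal = (
    df.groupby("target")["sepal length (cm)"]
      .mean()
      .reset_index(name="mean_sepal_length")
)
print(mean_sepal)
```

Comparing the two results is a handy way to sanity-check a Pandasql query while learning the syntax.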

We can also make use of f-strings in Python to create dynamic SQL query strings.

COL_NAME = '"sepal length (cm)"'
ALIAS = 'sepal_length'
AGG = 'MAX'
query = f"SELECT {AGG}({COL_NAME}) AS {ALIAS} FROM df"
pysqldf(query)
Image by Author
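Because Pandasql ultimately hands the string to SQLite, the same dynamic query can be tried out directly against an in-memory SQLite table. This sketch uses the standard-library sqlite3 module and toy data standing in for the iris frame:

```python
import sqlite3

import pandas as pd

# Toy data with a space in the column name, like the iris features.
df = pd.DataFrame({"sepal length (cm)": [5.1, 4.9, 7.0]})

# Build the query string dynamically with an f-string.
COL_NAME = '"sepal length (cm)"'
ALIAS = "sepal_length"
AGG = "MAX"
query = f"SELECT {AGG}({COL_NAME}) AS {ALIAS} FROM df"

conn = sqlite3.connect(":memory:")
df.to_sql("df", conn, index=False)
result = pd.read_sql(query, conn)
conn.close()

print(result)
```

Note that f-string interpolation is fine for trusted, hard-coded fragments like these; never build queries this way from user-supplied input.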

Limitations of Pandasql

  1. As Pandasql uses SQLite, it is subject to all the limitations of SQLite. For example, older versions of SQLite do not implement right outer joins or full outer joins.
  2. Pandasql performs queries only; it cannot perform SQL operations such as UPDATE, INSERT, or ALTER TABLE.
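The missing right outer join has a well-known workaround: swap the table order and use a LEFT JOIN instead. A sketch of the idea using the standard-library sqlite3 module with hypothetical toy tables:

```python
import sqlite3

import pandas as pd

# Toy tables; left_t RIGHT JOIN right_t is what we actually want.
left = pd.DataFrame({"id": [1, 2], "a": ["p", "q"]})
right = pd.DataFrame({"id": [2, 3], "b": ["r", "s"]})

conn = sqlite3.connect(":memory:")
left.to_sql("left_t", conn, index=False)
right.to_sql("right_t", conn, index=False)

# "left_t RIGHT JOIN right_t" returns the same rows as
# "right_t LEFT JOIN left_t", so we swap the table order.
result = pd.read_sql(
    "SELECT right_t.id, a, b "
    "FROM right_t LEFT JOIN left_t ON left_t.id = right_t.id "
    "ORDER BY right_t.id",
    conn,
)
conn.close()
print(result)
```

Every row of the right-hand table is kept (id 3 has no match, so its `a` value comes back as NULL), which is exactly what a right outer join would have produced.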

Conclusion

Pandasql is a great addition to the toolbox of Data Scientists who prefer SQL syntax over Pandas. In this article we discussed how to query a Pandas Dataframe with SQL using Pandasql, as well as some of its limitations.

