Motivation
The Python Pandas library and Structured Query Language (SQL) are among the most essential tools in a Data Scientist's toolbox. While Pandas is a powerful tool for data manipulation, many Data Scientists are more familiar with SQL and prefer to use it for data manipulation instead. In this article we will examine how to manipulate a Pandas Dataframe using SQL with the pandasql library.
What is Pandasql?
Pandasql is a Python library that allows manipulation of a Pandas Dataframe using SQL. Under the hood, Pandasql creates an SQLite table from the Pandas Dataframe of interest and allows users to query that SQLite table using SQL.
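Before diving in, it helps to see roughly what Pandasql does for us. The snippet below is not Pandasql's actual implementation, just a sketch of the general idea using plain Pandas and Python's built-in sqlite3 module: the Dataframe is copied into a throwaway SQLite database, the SQL runs there, and the result comes back as a new Dataframe.
import sqlite3
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Copy the Dataframe into an in-memory SQLite database,
# run the query there, and read the result back as a Dataframe.
conn = sqlite3.connect(':memory:')
df.to_sql('df', conn, index=False)
result = pd.read_sql('SELECT * FROM df WHERE a > 1', conn)
conn.close()

print(result)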
How Does Pandasql Work?
Install the Pandasql package.
!pip install -U pandasql
Import the necessary packages.
from pandasql import sqldf
import pandas as pd
from sklearn import datasets
We are using the iris dataset as our example. df_feature is a Dataframe containing the features while df_target is a Series containing the target. Pandasql can work on both a Pandas DataFrame and a Series.
df_feature = datasets.load_iris(as_frame = True)['data']
df_target = datasets.load_iris(as_frame = True)['target']
print (type(df_feature))
print (type(df_target))
>> <class 'pandas.core.frame.DataFrame'>
>> <class 'pandas.core.series.Series'>
The sqldf method is used to query the Dataframes and it requires 2 inputs:
- The SQL query string
- The globals() or locals() function
A typical query would look like this, where q is the SQL query string. sqldf returns the result as a Dataframe.
q = "SELECT * FROM df_target LIMIT 3"
sqldf(q, globals())

globals() and locals() are built-in functions in Python that return dictionaries of the variables defined in the global and local scope respectively. Let's examine what the globals() function does.
globals()

The globals() function returns a dictionary of the variables created in this session, such as df_feature and df_target. The keys of the dictionary are the variable names and the values contain the actual values of the variables.
print (globals().keys())
>> dict_keys(['__name__', '__doc__', '__package__', '__loader__', '__spec__', '__builtin__', '__builtins__', '_ih', '_oh', '_dh', '_sh', 'In', 'Out', 'get_ipython', 'exit', 'quit', '_', '__', '___', '_i', '_ii', '_iii', '_i1', '_exit_code', '_i2', 'sqldf', 'pd', 'datasets', '_i3', 'df_feature', 'df_target', '_i4', '_4', '_i5', '_5', '_i6'])
Since the globals() function outputs a dictionary, we can access the value of a variable through the globals() function in the following manner:
globals()['df_feature']
This will return the df_feature Dataframe.
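The same applies to locals(): if the Dataframe only exists inside a function, we can pass locals() so that sqldf can resolve the table name. A minimal sketch (the helper name top_rows is purely for illustration):
def top_rows(df):
    # Inside the function, df is a local variable, so we hand sqldf
    # the locals() dictionary to look it up by name.
    return sqldf("SELECT * FROM df LIMIT 3", locals())

top_rows(df_feature)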

Examples
Now that we understand how the globals() and locals() functions work with Pandasql, let’s take a look at some examples. We create a new function named pysqldf to avoid passing in globals() or locals() for every query.
pysqldf = lambda q: sqldf(q, globals())
We can simply query Dataframes in the following manner.
query = 'SELECT * FROM df_feature LIMIT 3'
pysqldf(query)

query = 'SELECT * FROM df_target LIMIT 3'
pysqldf(query)

Let’s join the two Dataframes df_feature and df_target. In Pandas this can be done using the pd.concat function.
pd.concat([df_feature, df_target], axis = 1).head()

If we are using SQL, we need a running row number column to join the two tables on. Since Pandasql utilises SQLite under the hood, every SQLite table has a rowid column created by default. This column contains incremental integer values starting from 1.
query = 'SELECT rowid, * FROM df_feature LIMIT 3'
pysqldf(query)

We can now join both tables on the rowid column. The output can be assigned to another variable, which can later be queried with Pandasql again.
query = 'SELECT * FROM df_feature INNER JOIN df_target ON df_feature.rowid = df_target.rowid'
df = pysqldf(query)
df.head()

Here are examples of other operations that we can perform.
Finding the mean sepal length for each target class. Notice that "sepal length (cm)" is wrapped in double quotes. This is only necessary when a column name contains spaces.
query = 'SELECT target, AVG("sepal length (cm)") AS mean_sepal_length FROM df GROUP BY target'
pysqldf(query)

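For comparison, the equivalent aggregation in plain Pandas on the joined df created earlier would look something like this:
df.groupby('target')['sepal length (cm)'].mean()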
We can also make use of f-strings in Python to create dynamic SQL query strings.
COL_NAME = '"sepal length (cm)"'
ALIAS = 'sepal_length'
AGG = 'MAX'
query = f"SELECT {AGG}({COL_NAME}) AS {ALIAS} FROM df"
pysqldf(query)

Limitations of Pandasql
- As Pandasql uses SQLite, it is subject to all the limitations of SQLite. For example, SQLite does not implement RIGHT OUTER JOIN or FULL OUTER JOIN; a common workaround is sketched after this list.
- Pandasql performs queries only; it cannot perform SQL operations such as UPDATE, INSERT or ALTER TABLE.
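One common workaround, using the same df_feature and df_target tables as above: since SQLite does support LEFT JOIN, the equivalent of a right join can be written by swapping the table order.
# df_feature RIGHT JOIN df_target is equivalent to
# df_target LEFT JOIN df_feature.
query = '''SELECT * FROM df_target
           LEFT JOIN df_feature ON df_feature.rowid = df_target.rowid'''
pysqldf(query).head()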
Conclusion
Pandasql is a great addition to the toolbox of Data Scientists who prefer SQL syntax over Pandas. In this article we discussed how to query a Pandas Dataframe with SQL using Pandasql, as well as some of its limitations.