How to Use SQL in Pandas

Add another relational database skill into your data science toolkit

Published in

Towards Data Science

3 min readOct 24, 2020

https://images.unsplash.com/photo-1489875347897-49f64b51c1f8?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=800&q=60

If you consider the structure of a Pandas DataFrame and the structure of a table from a SQL Database, they are structured very similarly. They both consist of data points, or values, with every row having a unique index and each column having a unique name. Because of this, SQL allows you to rapidly access the specific information you need for whatever project you are working on. But, very similar queries can be made using Pandas! In this blog post, I will show you how to do just that, along with explaining which library you’ll need to make it happen.

.query()

When using SQL, obtaining the information we need is called querying the data. In Pandas, there is a built-in querying method that allows you to do the exact same thing, which is called .query(). This both saves time and makes your queries much more coherent in your code because you don’t have to use slicing syntax. For instance, a brief example to query data in Pandas using the .query() method would be:

query_df = df.query("Col_1 > Col_2")

Otherwise, if you didn’t use this method to obtain your data and used slicing syntax instead, it would look something like this:

query_df = df[df[df['Col_1'] > df['Col_2']]]

Like I said, the .query() method makes your code look more professional and more efficient. One important thing I want to note, is if/when you decide to use “and” or “or” in your Pandas query, you can’t actually use the words “and” or “or” — you have to use the symbols for “and” (&) and “or” (|) instead. Below is an example using “&” to help clarify:

query_df = df.query("Col_1 > Col_2 & Col_2 <= Col_3")

The pandasql Library

As is well known, the ability to use SQL and/or all of its varieties are some of the most in demand job skills on the market for data scientists — even during a pandemic. Luckily, there is a library in Python now called pandasql that allows you to write SQL-style syntax to gather data from Pandas DataFrames! This is great for both aspiring data scientists who want to practice their SQL skills and experienced data scientists who are comfortable gathering data using SQL-style syntax. To install it onto your computer, just use !pip install:

!pip install pandasql

Then, to import it into your notebook, you want to import a sqldf object from pandasql:

from pandasql import sqldf

After you’ve imported everything, it’s a good idea to write a quick lambda function that can make writing your queries easier. The reason for doing this is so that you don’t have to pass in global variables every time an object is used. Below is the lambda function that I was taught and have success with:

pysqldf = lambda q: sqldf(q, globals())

Now, whenever you pass a query into pysqldf, the global variables will be passed along in the lambda so that you don’t have to do that over and over again for each object that’s used.

Now that you have everything set up and ready to go, you can query data in your DataFrames using the same syntax as SQL! Here’s an example — this query will return the first 10 names from a df:

q = """SELECT Name 
       FROM df 
       LIMIT 10;"""

names = pysqldf(q)
names

The complexity of your queries is dependent on your needs and your skills as a data scientist. So if you’re comfortable using SQL-style syntax, or are looking to improve your SQL syntax skills, using pandasql can be a great way to continue organizing your data & practicing your skills. Thank you for reading!

How to Use SQL in Pandas

Add another relational database skill into your data science toolkit

.query()

The pandasql Library

Written by Acusio Bivona