Motivation
Pandas is being increasingly used by Data Scientists and Data Analysts for data analysis purposes, and it has the advantage of being part of the wider Python universe, making it accessible to many people. SQL on the other hand is known for its performance, being human-readable, and can be easily understood even by non-technical people.
What if we could find a way to combine the benefits of both Pandas and SQL statements? Here is where pandasql
comes in handy.
From this conceptual blog, you will understand what pandasql
is before diving deep into hands-on practices, which will be even much easier if you are already familiar with SQL.
Keep in mind that table and dataframe will be used interchangeably to mean the same thing.
So, what is pandasql?
It is basically used to query pandas
DataFrames using SQL syntax. The same process can be performed using sqldf to interact with R DataFrames.
Prerequisites
The installation is straightforward with the syntaxes below depending on your environment:
Before installing, it is important to note that installing libraries using conda, it is necessary to specify the --yes
flag so that you don’t get stuck to the (Proceed ([y]/n)?) prompt.
- Using Python console
pip install pandasql
- Using Jupyter Notebook
!conda install --yes pandasql
After successfully installing the library, you should be able to import the library using the following statement.
from pandasql import sqldf
Let’s get started
Before diving deep, let’s start by creating the datasets StudentTable
and TeachingAssistantTable
that will be used for hands-on practice.
# Import the pandas library for creating the dataframes
import pandas as pd
# Create the Student Data Frame
students= {
'Students':["Sira","Ibrahim","Moussa","Mamadou","Nabintou"],
'Gender':['Female','Male','Male', "Male", "Female"],
'Age':[18, 27, 19, 22, 21],
'Email': ["[email protected]", "[email protected]", "[email protected]",
"[email protected]", "[email protected]"]
}
students_df = pd.DataFrame(students)
students_df

# Create the Teaching Assistant Data Frame
teaching_assistant= {
'Teacher':["Ibrahim","Nabintou","Mamadou","Fatim","Aziz"],
'Email':['[email protected]','[email protected]','[email protected]',
"[email protected]", "[email protected]"],
'Degree':["M.S in Data Science", "B.S in Statistics",
"B. Comp Sc", "M.S. Architecture", "B.S in Accounting"],
'Department': ["Business", "Statistics", "Comp Sc",
"Engineering", "Business"]
}
teaching_assistant_df = pd.DataFrame(teaching_assistant)
teaching_assistant_df

These are the main concepts that will be covered in this section: (1) column section, (2) data filtering, (3) data aggregation, and finally, (4) data joining.
1- Column Selection
This corresponds to selecting part or all the columns of your database/data frame. It is performed with the keyword SELECT col_1, col_2,... col_X FROM tableName
→ 1, 2, ...X
correspond to the columns you are interested in having in the final result.
→ tableName
is the name of the dataframe/table.
The result from the following code includes all the columns of the student dataframe.
all_students = sqldf("SELECT * FROM students_df")
all_students

all_students
is also a DataFrame, which can be helpful if you want to perform quick Pandas tasks such as column types. Below is an illustration.
# Check the type of all_students
print(type(all_students))
# Run Pandas Statement to show the type of the columns
print("---"*10)
print(all_students.dtypes)

Sometimes when you have complicated queries, you can proceed step by step as follow:
- Define the query as a string. When doing so, make sure to use the triple quote sign
"""
so that you can write the query on multiple lines. - Apply the
sqldf
function to the query to get the result.
Let’s say we want the student’s name, their email and limit the result to the first 3.
# Query definition
query = """ SELECT Students, Email
FROM students_df
LIMIT 3
"""
# Query execution
name_email = sqldf(query)
name_email

2- Data Filtering
Data filtering is when the WHERE
statement comes into the equation to perform custom filtering on the result of the SELECT
statement.
Let’s imagine that we want all the female students.
# Define the query
query = """SELECT *
FROM students_df
WHERE Gender = 'Female'
"""
# Execute the query
female_student = sqldf(query)
female_student

Let’s have a look at this query that aims to retrieve the Name, Email, and Degree of all the Teaching Assistants with a Master’s Degree.
query = """ SELECT Teacher, Email, Degree
FROM teaching_assistant_df
WHERE Degree LIKE 'M.S%'
"""
ms_students = sqldf(query)
ms_students

I have decided to perform the same request, this time using Pandas statement, and this is what it looks like!
cols_to_get = ['Teacher', 'Email', 'Degree']
teaching_assistant_df[teaching_assistant_df.Degree.str.startswith('M.S')][cols_to_get]
This looks like. scary statement 🥺. Let’s understand what happened:
cols_to_get
corresponds to the columns we want in the final dataframe. It’s basically equivalent to the columns we specify when using theSELECT
statement.teaching_assistant_df.Degree.str.startswith('M.S')
returnsTrue
is the value in theDegree
column starts with'M.S'
.False
Otherwise.- Finally, those boolean values are transmitted to the starting
teaching_assistant_df
to get rows where the result isTrue.
But… we performed the same task with SQL
query, and was simple and easy to understand. Without going into too much comparative analysis, just wanted to emphasize how complicated can be some queries that are simple to write in SQL.
3- Data Aggregation
Aggregation in SQL is performed with the help of aggregation functions, and these are some of the most commonly used: COUNT
, SUM
, MAX & MIN
, and AVG
. For instance, you can get the age of students, based on their gender.
query = """ SELECT AVG(Age), Gender
FROM students_df
GROUP BY Gender
"""
avg_age = sqldf(query)
avg_age

4- Data Joining
The join concept becomes useful when we want to combine data from at least two tables. This section goes straight to the point with only one type of join. You can learn more about all the join concepts from my article titled 4 Types of SQL JOIN Every Data Scientist Should Know.
Imagine you want to know who is both a student and also a teaching assistant. The answer to this requires joining our two tables as follows, using an INNER JOIN
and the final result contains the following information:
- Student Name, Gender, Email, and Age from the
students_df
table. - Department from the
teaching_assistant_df
dataframe.
query = """ SELECT st.Students, st.Gender, st.Email, st.Age, tat.Department
FROM students_df st INNER JOIN teaching_assistant_df tat
ON st.Email = tat.Email;
"""
result = sqldf(query)
result

Conclusion
Congratulations!🎉🍾 You have just learned how to leverage the power of pandasql,
a great tool that allows you to apply both SQL and Pandas queries on your dataframes.
If you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $ 5-a-month commitment, you unlock unlimited access to stories on Medium.
Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!