The world’s leading publication for data science, AI, and ML professionals.

How to Write All of Your SQL Queries in Pandas

A Comprehensive SQL to Pandas dictionary

Photo by Ilona Froehlich on Unsplash
Photo by Ilona Froehlich on Unsplash

Introduction

What makes SQL so amazing is that it’s so easy to learn – the reason why it’s so easy to learn is that the code syntax is so intuitive.

Pandas on the other hand isn’t so intuitive, especially if you started out with SQL first like I did.

Personally, what I found really helpful was thinking about how I would manipulate my Data in SQL, and then replicate it in Pandas. And so, if you’re trying to be more proficient in Pandas, I highly recommend that you take this approach as well.

Thus, this article serves as a cheatsheet, dictionary, a guide, whatever you want to call it so that you can refer to this when using Pandas.

And with that said, let’s dive into it!


Table of Content

  1. Selecting Rows
  2. Combining tables
  3. Filtering tables
  4. Sorting values
  5. Aggregate functions

1. Selecting Rows

SELECT * FROM

If you want to select an entire table, simply call the name of the table:

# SQL
SELECT * FROM table_df
# Pandas
table_df

SELECT a, b FROM

If you want to select specific columns from a table, list the columns that you want in double brackets:

# SQL
SELECT column_a, column_b FROM table_df
# Pandas
table_df[['column_a', 'column_b']]

SELECT DISTINCT

Simply use .drop_duplicates() to get distinct values:

# SQL
SELECT DISTINCT column_a FROM table_df
# Pandas
table_df['column_a'].drop_duplicates()

SELECT a as b

If you want to rename a column, use .rename():

# SQL
SELECT column_a as Apple, column_b as Banana FROM table_df
# Pandas
table_df[['column_a', 'column_b']].rename(columns={'column_a':
'Apple', 'column_b':'Banana'})

SELECT CASE WHEN

For the equivalent of SELECT CASE WHEN, you can use np.select() where you first specify your choices and the values for each choice.

If you want to include multiple conditions per choice, check out this article.

# SQL
SELECT CASE WHEN column_a > 30 THEN "Large"
            WHEN column_a <= 30 THEN "Small"
            END AS Size
FROM table_df
# Pandas
conditions = [table_df['column_a']>30, table_df['column_b']<=30]
choices = ['Large', 'Small']
table_df['Size'] = np.select(conditions, choices)

2. Combining Tables

INNER/LEFT/RIGHT JOIN

Simply use .merge() to join tables, You can specify whether you want it to be a LEFT, RIGHT, INNER, or OUTER join using the "how" parameter.

# SQL
SELECT * FROM table_1 t1
         LEFT JOIN table_2 t1 on t1.lkey = t2.rkey 
# Pandas
table_1.merge(table_2, left_on='lkey', right_on='rkey', how='left')

UNION ALL

Simply use pd.concat():

# SQL
SELECT * FROM table_1
UNION ALL
SELECT * FROM table_2
# Pandas
final_table = pd.concat([table_1, table_2])

3. Filtering Tables

SELECT WHERE

When filtering a DataFrame the same way you would use a WHERE clause in SQL, you simply need to define the criteria within square brackets:

# SQL
SELECT * FROM table_df WHERE column_a = 1
# Pandas
table_df[table_df['column_a'] == 1]

SELECT column_a WHERE column_b

When you want to select a certain column from a table and filter it with a different column, follow the format below (if you want to read more about it, check out this link):

# SQL
SELECT column_a FROM table_df WHERE column_b = 1
# Pandas
table_df[table_df['column_b']==1]['column_a']

SELECT WHERE AND

If you want to filter by multiple criteria, simply wrap each condition in parenthesis and separate each condition using ‘&’. For more variations of this, check out this link.

# SQL
SELECT * FROM table_df WHERE column_a = 1 AND column_b = 2
# Pandas
table_df[(table_df['column_a']==1) &amp; (table_df['column_b']==2)]

SELECT WHERE LIKE

The equivalent of LIKE in SQL is .str.contains(). If you want to apply case insensitivity, simply add case=False in the parameters (see here).

# SQL
SELECT * FROM table_df WHERE column_a LIKE '%ball%'
# Pandas
table_df[table_df['column_a'].str.contains('ball')]

SELECT WHERE column IN()

The equivalent of IN() in SQL is .isin().

# SQL
SELECT * FROM table_df WHERE column_a IN('Canada', 'USA')
# Pandas
table_df[table_df['column_a'].isin(['Canada', 'USA'])]

4. Sorting Values

ORDER BY one column

The equivalent of ORDER BY in SQL is .sort_values(). Use the ‘ascending’ parameter to specify whether you want to sort the values in ascending or descending order – the default is ascending just like SQL.

# SQL
SELECT * FROM table_df ORDER BY column_a DESC
# Pandas
table_df.sort_values('column_a', ascending=False)

ORDER BY multiple columns

If you want to ORDER BY multiple columns, list the columns in brackets, and specify the direction of ordering in the ‘ascending’ parameter in brackets as well. Make sure that you follow the respective order of the columns that you list.

# SQL
SELECT * FROM table_df ORDER BY column_a DESC, column_b ASC
# Pandas
table_df.sort_values(['column_a', 'column_b'], ascending=[False, True])

5. Aggregate Functions

COUNT DISTINCT

You’ll notice a common pattern for aggregate functions.

To replicate COUNT DISTINCT, simply use .groupby() and .nunique(). See here for more info:

# SQL
SELECT column_a, COUNT DISTINCT(ID) 
FROM table_df
GROUP BY column_a
# Pandas
table_df.groupby('column_a')['ID'].nunique()

SUM

# SQL
SELECT column_a, SUM(revenue) 
FROM table_df
GROUP BY column_a 
# Pandas
table_df.groupby(['column_a', 'revenue']).sum()

AVG

# SQL
SELECT column_a, AVG(revenue) 
FROM table_df
GROUP BY column_a
# Pandas
table_df.groupby('column_a')['revenue'].mean()

Thanks for Reading!

Hopefully, this serves as a useful guide when manipulating data with Pandas. Don’t feel like you have to memorize all of this either! This is something that I constantly come back to when I work with Pandas.

Eventually, with enough practice, you’ll feel more comfortable with Pandas and will understand the underlying mechanics enough where you won’t need to rely on cheat sheets like this one.

As always, I wish you the best in your endeavors! 🙂

Not sure what to read next? I’ve picked another article for you:

A Complete Pandas Glossary for Data Science

Terence Shin


Related Articles