Introduction
What makes SQL so appealing is how easy it is to learn, and the reason it's so easy to learn is that its syntax is so intuitive.
Pandas on the other hand isn’t so intuitive, especially if you started out with SQL first like I did.
Personally, what I found really helpful was thinking about how I would manipulate my data in SQL, and then replicating it in Pandas. So if you're trying to become more proficient in Pandas, I highly recommend that you take this approach as well.
Thus, this article serves as a cheatsheet, a dictionary, a guide, whatever you want to call it, so that you can refer back to it when using Pandas.
And with that said, let’s dive into it!
Table of Contents
- Selecting Rows
- Combining tables
- Filtering tables
- Sorting values
- Aggregate functions
1. Selecting Rows
SELECT * FROM
If you want to select an entire table, simply call the name of the table:
# SQL
SELECT * FROM table_df
# Pandas
table_df
SELECT a, b FROM
If you want to select specific columns from a table, list the columns that you want in double brackets:
# SQL
SELECT column_a, column_b FROM table_df
# Pandas
table_df[['column_a', 'column_b']]
SELECT DISTINCT
Simply use .drop_duplicates() to get distinct values:
# SQL
SELECT DISTINCT column_a FROM table_df
# Pandas
table_df['column_a'].drop_duplicates()
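As a minimal runnable sketch, using a small hypothetical table:

```python
import pandas as pd

# Hypothetical table for illustration
table_df = pd.DataFrame({'column_a': ['x', 'y', 'x', 'z']})

# Equivalent of SELECT DISTINCT column_a FROM table_df
distinct_values = table_df['column_a'].drop_duplicates()
print(distinct_values.tolist())  # ['x', 'y', 'z']
```

Note that .drop_duplicates() keeps the first occurrence of each value and preserves the original row order.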
SELECT a as b
If you want to rename a column, use .rename():
# SQL
SELECT column_a as Apple, column_b as Banana FROM table_df
# Pandas
table_df[['column_a', 'column_b']].rename(columns={'column_a':
'Apple', 'column_b':'Banana'})
SELECT CASE WHEN
For the equivalent of SELECT CASE WHEN, you can use np.select(), where you first specify your list of conditions and then the value to assign for each condition.
If you want multiple criteria within a single condition, you can combine them with '&' inside that condition.
# SQL
SELECT CASE WHEN column_a > 30 THEN "Large"
WHEN column_a <= 30 THEN "Small"
END AS Size
FROM table_df
# Pandas
conditions = [table_df['column_a'] > 30, table_df['column_a'] <= 30]
choices = ['Large', 'Small']
table_df['Size'] = np.select(conditions, choices)
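Here's the same pattern end to end, with hypothetical data so you can see the resulting column:

```python
import numpy as np
import pandas as pd

# Hypothetical table for illustration
table_df = pd.DataFrame({'column_a': [45, 10, 30]})

# CASE WHEN column_a > 30 THEN 'Large' WHEN column_a <= 30 THEN 'Small' END
conditions = [table_df['column_a'] > 30, table_df['column_a'] <= 30]
choices = ['Large', 'Small']
table_df['Size'] = np.select(conditions, choices)
print(table_df['Size'].tolist())  # ['Large', 'Small', 'Small']
```

Like CASE WHEN, np.select() evaluates the conditions in order and uses the first one that matches.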
2. Combining Tables
INNER/LEFT/RIGHT JOIN
Simply use .merge() to join tables. You can specify whether you want a LEFT, RIGHT, INNER, or OUTER join using the "how" parameter.
# SQL
SELECT * FROM table_1 t1
LEFT JOIN table_2 t2 ON t1.lkey = t2.rkey
# Pandas
table_1.merge(table_2, left_on='lkey', right_on='rkey', how='left')
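A quick runnable sketch with two hypothetical tables, showing the LEFT JOIN behavior for unmatched rows:

```python
import pandas as pd

# Hypothetical tables for illustration
table_1 = pd.DataFrame({'lkey': ['a', 'b', 'c'], 'val_1': [1, 2, 3]})
table_2 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'val_2': [10, 20, 40]})

# LEFT JOIN: keep every row of table_1; rows with no match get NaN
joined = table_1.merge(table_2, left_on='lkey', right_on='rkey', how='left')
print(joined)
```

Row 'c' has no match in table_2, so its val_2 comes back as NaN, just as it would come back as NULL in SQL.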
UNION ALL
Simply use pd.concat():
# SQL
SELECT * FROM table_1
UNION ALL
SELECT * FROM table_2
# Pandas
final_table = pd.concat([table_1, table_2])
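A minimal runnable example with hypothetical tables; passing ignore_index=True rebuilds a clean 0..n-1 index for the stacked result:

```python
import pandas as pd

# Hypothetical tables for illustration
table_1 = pd.DataFrame({'column_a': [1, 2]})
table_2 = pd.DataFrame({'column_a': [3, 4]})

# UNION ALL keeps all rows, including duplicates
final_table = pd.concat([table_1, table_2], ignore_index=True)
print(final_table['column_a'].tolist())  # [1, 2, 3, 4]
```

For the equivalent of plain UNION (which removes duplicates), you could chain .drop_duplicates() after the concat.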
3. Filtering Tables
SELECT WHERE
When filtering a DataFrame the same way you would use a WHERE clause in SQL, you simply need to define the criteria within square brackets:
# SQL
SELECT * FROM table_df WHERE column_a = 1
# Pandas
table_df[table_df['column_a'] == 1]
SELECT column_a WHERE column_b
When you want to select a certain column from a table and filter it using a different column, follow the format below:
# SQL
SELECT column_a FROM table_df WHERE column_b = 1
# Pandas
table_df[table_df['column_b']==1]['column_a']
SELECT WHERE AND
If you want to filter by multiple criteria, simply wrap each condition in parentheses and separate the conditions with '&' (or '|' for the equivalent of OR).
# SQL
SELECT * FROM table_df WHERE column_a = 1 AND column_b = 2
# Pandas
table_df[(table_df['column_a']==1) & (table_df['column_b']==2)]
SELECT WHERE LIKE
The equivalent of LIKE in SQL is .str.contains(). If you want the match to be case-insensitive, simply add case=False in the parameters.
# SQL
SELECT * FROM table_df WHERE column_a LIKE '%ball%'
# Pandas
table_df[table_df['column_a'].str.contains('ball')]
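A runnable sketch with hypothetical data, showing the case=False option in action:

```python
import pandas as pd

# Hypothetical table for illustration
table_df = pd.DataFrame({'column_a': ['Basketball', 'Tennis', 'baseBALL']})

# WHERE column_a LIKE '%ball%', matched case-insensitively
matches = table_df[table_df['column_a'].str.contains('ball', case=False)]
print(matches['column_a'].tolist())  # ['Basketball', 'baseBALL']
```

If the column contains missing values, also pass na=False so that NaN rows are treated as non-matches instead of breaking the boolean filter.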
SELECT WHERE column IN()
The equivalent of IN() in SQL is .isin().
# SQL
SELECT * FROM table_df WHERE column_a IN('Canada', 'USA')
# Pandas
table_df[table_df['column_a'].isin(['Canada', 'USA'])]
4. Sorting Values
ORDER BY one column
The equivalent of ORDER BY in SQL is .sort_values(). Use the 'ascending' parameter to specify whether you want to sort the values in ascending or descending order; the default is ascending, just like in SQL.
# SQL
SELECT * FROM table_df ORDER BY column_a DESC
# Pandas
table_df.sort_values('column_a', ascending=False)
ORDER BY multiple columns
If you want to ORDER BY multiple columns, pass the columns as a list, and pass a list of booleans to the 'ascending' parameter as well. Make sure the order of the booleans matches the order of the columns you list.
# SQL
SELECT * FROM table_df ORDER BY column_a DESC, column_b ASC
# Pandas
table_df.sort_values(['column_a', 'column_b'], ascending=[False, True])
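A small runnable example with hypothetical data, so you can see the two sort directions interact:

```python
import pandas as pd

# Hypothetical table for illustration
table_df = pd.DataFrame({'column_a': [2, 1, 2], 'column_b': [5, 9, 3]})

# ORDER BY column_a DESC, column_b ASC
ordered = table_df.sort_values(['column_a', 'column_b'],
                               ascending=[False, True])
print(ordered['column_b'].tolist())  # [3, 5, 9]
```

Within the tied column_a group (the two rows with column_a = 2), column_b is sorted ascending, just as the SQL would do.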
5. Aggregate Functions
COUNT DISTINCT
You’ll notice a common pattern for aggregate functions: group the rows with .groupby(), then apply the aggregation.
To replicate COUNT DISTINCT, simply use .groupby() and .nunique():
# SQL
SELECT column_a, COUNT(DISTINCT ID)
FROM table_df
GROUP BY column_a
# Pandas
table_df.groupby('column_a')['ID'].nunique()
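Here is the pattern as a runnable sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical table for illustration
table_df = pd.DataFrame({
    'column_a': ['x', 'x', 'x', 'y'],
    'ID': [1, 1, 2, 3],
})

# COUNT(DISTINCT ID) per group: duplicate IDs within a group count once
distinct_counts = table_df.groupby('column_a')['ID'].nunique()
print(distinct_counts.to_dict())  # {'x': 2, 'y': 1}
```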
SUM
# SQL
SELECT column_a, SUM(revenue)
FROM table_df
GROUP BY column_a
# Pandas
table_df.groupby('column_a')['revenue'].sum()
AVG
# SQL
SELECT column_a, AVG(revenue)
FROM table_df
GROUP BY column_a
# Pandas
table_df.groupby('column_a')['revenue'].mean()
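Both aggregations follow the same groupby pattern; here's a runnable sketch with hypothetical data showing SUM and AVG side by side:

```python
import pandas as pd

# Hypothetical table for illustration
table_df = pd.DataFrame({
    'column_a': ['x', 'x', 'y'],
    'revenue': [100, 200, 50],
})

# SUM(revenue) and AVG(revenue), grouped by column_a
totals = table_df.groupby('column_a')['revenue'].sum()
averages = table_df.groupby('column_a')['revenue'].mean()
print(totals.to_dict())    # {'x': 300, 'y': 50}
print(averages.to_dict())  # {'x': 150.0, 'y': 50.0}
```

Swapping .sum() for .mean(), .min(), .max(), or .count() gives you the other standard SQL aggregates with no other changes.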
Thanks for Reading!
Hopefully, this serves as a useful guide when manipulating data with Pandas. Don’t feel like you have to memorize all of this either! This is something that I constantly come back to when I work with Pandas.
Eventually, with enough practice, you’ll feel more comfortable with Pandas and will understand the underlying mechanics enough where you won’t need to rely on cheat sheets like this one.
As always, I wish you the best in your endeavors! 🙂
Terence Shin