Sql and Pandas aren’t new technologies. Still, it’s not the easiest task to find corresponding functions for both technologies. That’s where this one and the previous article come into play – providing you with a detailed comparison between the two.
A couple of days back, I’ve covered the first part of this two-part series, dealing with more simple comparisons between the two technologies:
Reading that article first is not a prerequisite, but will definitely help you to get a better understanding of the two. The technologies aren’t designed for the same job, but it’s nice to see corresponding functions between the two. As promised, today we’ll cover more advanced topics:
- Joins
- Unions
- Groupings
Before we do so, let’s start simple with the delete statements.
Delete
DELETE
statement is used in SQL to delete or remove a row from the table. The syntax for deleting rows in SQL is as follows:
DELETE FROM table_name
WHERE condition;
Deleting a row is slightly different in Pandas. In Pandas, we do not delete a row, we just select the part that we require and discard the rest of it. Don’t worry if it seems like a riddle to you, the example will illustrate it further.
Let’s say we want to delete all the records from the Asian region.
SQL
DELETE FROM fert_data
WHERE region = 'Asia';

The rows have been successfully deleted. Now let’s see how to perform this task in Pandas.
Pandas
df = df.loc[df['region'] != 'Asia']
Here, we have selected all the rows where the region is not ‘Asia’ and then assigned the resultset to our current data frame. That means we have excluded all the rows where the region was ‘Asia’.
Joins
JOINS
are used in SQL to join or merge two or more tables together based on a specific condition. There are primarily four types of joins in SQL: LEFT
, RIGHT
, INNER
, FULL
. Here is the syntax for JOIN :
SELECT *
FROM table_name_1 as t1 JOIN table_name_2 as t2
ON t1.column_name_1 = t2.column_name_2;
In Pandas, we can join two or more data frames using merge()
. By default, it will perform an inner join. But you can customize it using the how
argument to perform other joins. The basic syntax for pd.merge()
is as follows :
merge(left_df, right_df, how='inner', on=condition)
Here is an example to illustrate joins.
SQL
Given below is a table called _country_subregion. We have to join this table with _fertdata using an inner join.
SELECT country, sub_region
FROM country_sub_region;

SELECT * FROM
fert_data as f INNER JOIN country_sub_region as c
ON f.country = c.country;

The tables have been successfully joined. Let’s see how to join them in Pandas.
Pandas
Here we have created a data frame similar to the _country_subregion table:
country_sub_reg = data = [
['country', 'subregion'],
['Kenya', 'East Africa'],
['Liberia', 'West Africa'],
['Mali', 'West Africa']
]
df_sr = pd.DataFrame(country_sub_reg[1:],columns=country_sub_reg[0])

We will merge df_sr
with df
on the country field using an inner join:
pd.merge(df, df_sr, on='country', how='inner')

Union
UNION
operator is used to club together the results of two or more SELECT
statements in SQL. There is a comrade to the Union operator called UNION ALL
. They differ in the sense that the former removes duplicate values from the combined result.
The task of a UNION ALL
operator in Pandas can be performed using pd.concat()
. While the function of the UNION
operator can be performed by first concatenating the data frames using pd.concat()
and then applying pd.drop_duplicates()
**** on it.
SQL
In order to illustrate the UNION/UNION ALL
operator in SQL, we have created an additional table called _fert_data1. The data in this table looks something as follows:

Our task is as follows – find the union of rows from _fertdata and _fert_data1 table:
SELECT * FROM fert_data_1
UNION ALL
SELECT * FROM fert_data
ORDER BY country;

You will observe that there are some duplicate values. Yes, you guessed it right. You can use the UNION
operator to remove them. Try it for yourself.
Pandas
In Pandas, we have created a data frame that is similar to the _fert_data1 table in SQL.
data = [
['country', 'region', 'tfr', 'contraceptors'],
['USA', 'North.Amer', 1.77, 20],
['UK', 'Europe', 1.79, 23],
['Bangladesh', 'Asia', 5.5, 40],
['Thailand', 'Asia', 2.3, 68]
]
df1 = pd.DataFrame(data[1:], columns=data[0])

Union of df
and df1
:
df_dupli = pd.concat([df1, df])

The data from data frames have been combined. But, in this case, we will get duplicate rows as well. For example, the goal is to have ‘Bangladesh’ listed only once:
df_dupli[df_dupli['country'] == 'Bangladesh']

We can drop duplicate records using drop_duplicates()
as shown:
df_wo_dupli = pd.concat([df1, df]).drop_duplicates()
Let’s run the same query and see if we still get two rows.
df_wo_dupli[df_wo_dupli['country'] == 'Bangladesh']

Problem solved. No more duplicate rows.
Group by
GROUP BY
clause in SQL is used to prepare summary rows by grouping records together. The clause is usually used in conjugation with aggregate functions such as AVG, SUM, COUNT, MIN, MAX, etc. Here is the basic syntax for GROUP BY
clause :
SELECT column_name_1, agg_func(column_name_2)
FROM table_name
GROUP BY column_name_1;
In Pandas, we have a groupby()
function that helps us in summarizing data along a specific column. The generic syntax is as follows:
df.groupby(['column_name_1']).agg_function()
Let’s try an example to understand it better – find the average tfr and count of contraceptors field for each region.
SQL
SELECT region, round(avg(tfr),2), count(contraceptors)
FROM fert_data
GROUP BY region;

Pandas
df.groupby('region').agg({'tfr': np.mean, 'contraceptors': np.size}).round(2)

We got the same results from both the queries. You must be wondering what that agg()
in Pandas is used for. It used to aggregate one or more operations over a specified axis.
Before you go
And this does it – you should now have a good picture behind both technologies, at least data-analysis-wise. It’s difficult to recommend one over the other, as that will depend on your previous experience, biases, and options company you work in opted for.
The good thing is – everything done in SQL can be done in Pandas – at least on this level. Feel free to choose the one you like better, you won’t make a mistake.
Thanks for reading.
Loved the article? Become a Medium member to continue learning without limits. I’ll receive a portion of your membership fee if you use the following link, with no extra cost to you.