Introduction
The Python Pandas library and SQL are both popular tools for manipulating data and have a lot of overlap in their functionality. So, what better way to improve your skill at both than to put Pandas and SQL head-to-head by working through the same coding problems for each?
This is the first of a series of articles I will write to help you directly compare Pandas and SQL. My goal is to help you:
- Learn and compare functionality
- If you already know one, then to learn the other
- Understand the strengths and weaknesses of each
- Have multiple tools for manipulating data
For the examples here I will use the infamous Titanic dataset that can be found on Kaggle. I encourage you to look at it while going through this article so you can follow along. At the end, I will present some practice problems for you to work through on your own based on the specific functionality discussed in this article.
Selecting Columns
Let’s start with the most basic functionality of how you can select columns and display them in a certain order. For our exercise, let’s say we are curious to see the pclass, fare, age, and survived columns of the dataset, in that order.
SQL
In Sql the SELECT
statement is where you list the columns you want to display and in what order. We select the columns from the titanic dataset after it is entered as a table in our database. In SQL if you want to see all the columns for a table you would use SELECT *
, but we will name the columns we want to display.
SELECT pclass, fare, age, survived
FROM titanic
Pandas
In Pandas the titanic dataset would be read into Pandas and stored as a dataframe object. From here we will slice out the desired columns. There are multiple ways to achieve the same result.
col_list = ['pclass', 'fare', 'age', 'survived']
titanic[col_list]
You could also put the list of columns directly into the brackets without creating a separate variable.
titanic[['pclass', 'fare', 'age', 'survived']]
Another method is to use loc[]
.
titanic.loc[:, ['pclass', 'fare', 'age', 'survived']]
The colon indicates we want to select all the rows and then after the comma we specify which columns we want and in what order. Again you could create a list of the columns as a variable and put that after the comma. Usually I prefer to create a variable as it is cleaner to look at with less brackets on brackets on brackets in a single line. It’s also easier to keep to PEP-8 guidelines with use of a list variable as you can keep your lines of code from getting too long. Going forward I will continue to use loc[]
and create separate variables.
It’s also worth noting here that the loc[]
option with Pandas offers the functionality to select a range of rows based on the index. If we had the desire to only look at index values 50 through 100 in the Titanic dataset, then we could put 50:100
as the first term and that inclusive range of rows would be the only ones returned. We could also use the plain colon after the comma to select all the columns or we could use it to select a range of columns, just like with the rows.
Filtering Based On Column Values
Now let’s say that you want results based a specific condition. Let’s continue on the same example from above and say we want to see the pclass, fare, age, and survived columns again but now we only want the entries for adult males.
SQL
In SQL the WHERE
clause is how you would filter based on a specific value in columns. The WHERE
clause follows the FROM
clause. Since our problem includes the need to filter on two conditions, we will list both with an AND in between. You can also use OR between two conditions if you would like to return rows that meet either condition.
SELECT pclass, fare, age, survived
FROM titanic
WHERE sex = 'male' AND age >= 18
Pandas
For Pandas, we will create variables that contain the two conditions we want to filter. Then we will use loc[]
again to select the desired rows and columns.
male_mask = titanic.sex == 'male'
age_mask = titanic.age >= 18
col_list = ['pclass', 'fare', 'age', 'survived']
titanic.loc[male_mask & age_mask, col_list]
The two mask variables are lists of boolean values for whether each row of the dataframe meets that condition. Remember how before when we used loc[]
we used a colon to specify we wanted all rows and then after the comma we specified what columns we wanted? Well here we are using the same functionality of specifying what rows we want first, then after the comma specifying what columns we want to display. Just like before, you could feed in the mask criteria directly without creating variables, but creating variables makes things look neater. With the &
between the mask variables, only rows that are True for both conditions will be returned. If you want to do "or" logic, you would use the pipe symbol |
instead.
A neat little functionality for Pandas is that you can also use a tilde ~
before the mask variables to flip the boolean values. For example, if we used ~male_mask
to filter it would flip all Trues to be False and vice versa. In this case it would be the same as filtering for females, but there may be times when this functionality comes in handy to be able to use the same variable but flip the logic.
Sorting Results By Column Values
Let’s continue with our problem and say we now want the results sorted by the fare amount first and age second. And we want the fare values sorted in descending order and the age values sorted in ascending order. This means that any entries that have the same fare value will sort those entries by the youngest individuals first.
SQL
In SQL we will use the ORDER BY
clause. The ORDER BY
clause will come after the WHERE
clause in the query and the default sorting method is ascending. To have the results sorted with the largest values at the top we will use the DESC
keyword.
SELECT pclass, fare, age, survived
FROM titanic
WHERE sex = 'male' AND age >= 18
ORDER BY fare DESC, age
Pandas
For Pandas we will use sort_values()
to sort. We can use a list of the columns to sort by, with the sorting hierarchy being in order that the columns are listed. There is an ascending parameter which is set to True by default. We can pass in a list of boolean values to this parameter with the values matched to the respective column by order that they are listed.
male_mask = titanic.sex == 'male'
age_mask = titanic.age >= 18
col_list = ['pclass', 'fare', 'survived']
df_slice = titanic.loc[male_mask & age_mask, col_list]
df_slice.sort_values(['fare', 'age'], ascending=[False, True])
An important note I want to make here is that if you plan on changing any values to the dataframe slice, it is best practice to chain on a .copy()
after the loc[]
. This way you can be sure you are not accidentally changing the original dataframe. It can be really confusing even for those experienced in Pandas to know the rules of when the original dataframe is being altered or not when making changes on a slice.
Limiting To Top Results
The last piece that we will add to our practice problem, is to limit the results to the top 10 entries.
SQL
This functionality is pretty straightforward with SQL; we simply add the LIMIT
clause and put the number of entries we want to limit to – in this case 10.
SELECT pclass, fare, age, survived
FROM titanic
WHERE sex = 'male' AND age >= 18
ORDER BY fare DESC, age
LIMIT 10
Pandas
For Pandas we can take the top values by chaining on a head()
at the end. You pass the number of entries you want to display in the parenthesis – in our case 10.
male_mask = titanic.sex == 'male'
age_mask = titanic.age >= 18
col_list = ['pclass', 'fare', 'survived']
df_slice = titanic.loc[male_mask & age_mask, col_list]
df_slice.sort_values(['fare', 'age'], ascending=[False, True]).head(10)
I want to point out here that Pandas also has nlargest()
and nsmallest()
methods which can be handy to order the columns and limit the results returned in one simple method. However, in our problem, we wanted to filter one column in descending order and another column in ascending order so the nlargest()
and nsmallest()
methods don’t really fit our needs. However, let’s say we actually want to filter both fare and age in ascending order. That would be a valid use case for nsmallest()
.
male_mask = titanic.sex == 'male'
age_mask = titanic.age >= 18
col_list = ['pclass', 'fare', 'survived']
df_slice = titanic.loc[male_mask & age_mask, col_list]
df_slice.nsmallest(10, ['fare', 'age'])
Practice Problems
- Display the age, sex, and survived columns for children under 16 who don’t have a parent traveling with them (a zero value in the parch column). Sort the results in ascending order by age.
- Display the class, age, and fare for females in class 2 or 3. Sort the results in descending order by age then fare and limit to the top 20 entries.
- Display all columns for children (under 18) in 3rd class who survived. Sort the results in ascending order by age then descending order by fare. The survived column contains 1 for yes and 0 for no.
STOP – Answers Below

I hope you didn’t peek without at least giving each problem a try in your head! Some problems have multiple methods for solving so it’s expected that your answers will be a little different from mine.
1.Display the age, sex, and survived columns for children under 16 who don’t have a parent traveling with them (a zero value in the parch column). Sort the results in ascending order by age.
SQL:
SELECT age, sex, survived
FROM titanic
WHERE age < 16 AND parch == 0
ORDER BY age
Pandas:
children = titanic.age < 16
parents = titanic.parch == 0
col_list = ['age', 'sex', 'survived']
titanic.loc[children & parents, col_list].sort_values('age')
2. Display the class, age, and fare for females in class 2 or 3. Sort the results in descending order by age then fare and show only the top 20 entries.
SQL:
SELECT pclass, age, fare
FROM titanic
WHERE sex = 'female' AND class != 1
ORDER BY age DESC, fare DESC
LIMIT 20
Pandas:
females = titanic.sex == 'female'
lower_class = titanic.pclass != 1
col_list = ['pclass', 'age', 'fare']
df_slice = titanic.loc[females & lower_class, col_list]
df_slice.nlargest(20, ['age', 'fare'])
3. Display all columns for children (under 18) in 3rd class who survived. Sort the results in ascending order by age then descending order by fare. The survived column contains 1 for yes and 0 for no.
SQL:
SELECT *
FROM titanic
WHERE age < 18 AND pclass = 3 AND survived = 1
ORDER BY age, fare DESC
Pandas:
children = titanic.age < 18
low_class = titanic.pclass == 3
survival = titanic.survived == 1
df_slice = titanic.loc[children & low_class & survival, :]
df_slice.sort_values(['age', 'fare'], ascending=[True, False])
Conclusion
As you can see, SQL and Pandas are very different but can still accomplish the same tasks. I enjoyed mentally switching between Pandas and SQL while writing this article, so I hope you enjoyed it as well. Be on the lookout for subsequent parts of this series!
Part two can be found here:
Working With SQL Versus Pandas (Part 2) Plus Practice Problems