Pandas is a powerful and versatile Python data analysis library that speeds up data analysis and exploration. One of its advantages is that it usually provides multiple ways to accomplish a task.

During data analysis, we almost always need to do some filtering, either based on a condition or by selecting a subset of the dataframe. In this post, we will go through 7 different ways to filter a Pandas dataframe.
I will run the examples on the California housing dataset, which is available in the sample data folder in Google Colab.
import numpy as np
import pandas as pd
df = pd.read_csv(
    "/content/sample_data/california_housing_train.csv",
    usecols=['total_rooms', 'total_bedrooms', 'population', 'median_income', 'median_house_value'],
)
df.head()

The most common way is to put the condition inside square brackets, the same syntax used for selecting columns.
#1
df[df['population'] > 1000][:5]

We only get the rows in which the population is greater than 1000, and the [:5] slice keeps the first five of them.
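Several conditions can be combined in the same square brackets. The following is a minimal sketch on a small made-up dataframe (the name toy and its values are assumptions for illustration, not the housing data):

```python
import pandas as pd

# Small made-up frame standing in for the housing data (illustrative values).
toy = pd.DataFrame({
    "population": [800, 1500, 2500, 400, 3200],
    "median_income": [2.1, 5.4, 3.3, 6.0, 1.8],
})

# Each condition goes in its own parentheses; & means AND, | means OR.
both = toy[(toy["population"] > 1000) & (toy["median_income"] > 3)]
either = toy[(toy["population"] > 3000) | (toy["median_income"] > 5.5)]

print(both)
print(either)
```

The parentheses matter because & and | bind more tightly than the comparison operators in Python.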
The rows with the largest values in a particular column can be selected with the nlargest function.
#2
df.nlargest(5, 'population')

Rows that have the 5 largest values in the population column are displayed.
Similarly, we can select the rows with the smallest values.
#3
df.nsmallest(5, 'population')
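When several rows tie for the last place, nlargest and nsmallest keep only the first ones by default. A short sketch of the keep parameter on a made-up dataframe (toy and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up frame with a tie for the largest population (illustrative values).
toy = pd.DataFrame({"population": [500, 900, 900, 300, 700]})

# Default keep="first": only the first of the tied rows is returned.
top1 = toy.nlargest(1, "population")

# keep="all": every row tied for the cutoff is returned, so we may get more than n rows.
top1_ties = toy.nlargest(1, "population", keep="all")

print(top1)
print(top1_ties)
```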

Another way to select the largest or smallest values based on a column is to sort the rows and then take a slice.
#4
df.sort_values(by='median_income', ascending=False)[:5]

The dataframe is sorted in descending order based on the median_income column and then the first 5 rows are selected.
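To see that sorting plus slicing picks the same rows as nlargest, here is a small sketch on made-up data (toy and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up frame with distinct incomes (illustrative values).
toy = pd.DataFrame({
    "median_income": [3.2, 8.1, 1.5, 6.7, 4.4],
    "median_house_value": [150000, 480000, 90000, 350000, 210000],
})

# Sort in descending order, then keep the first 3 rows.
by_sort = toy.sort_values(by="median_income", ascending=False).head(3)

# Ask directly for the 3 rows with the largest incomes.
by_nlargest = toy.nlargest(3, "median_income")

print(by_sort)
print(by_nlargest)
```

With distinct values, both approaches return the same rows in the same order. Sorting is the more general tool when you also need the rest of the ordering.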
The Pandas query function is a very flexible filtering method. It allows specifying the condition as a string.
#5
df.query('5000 < total_rooms < 5500')[:5]
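The query string can also refer to Python variables by prefixing them with @, and conditions can be joined with and / or. A minimal sketch on made-up data (toy, min_rooms and max_rooms are hypothetical names used only for illustration):

```python
import pandas as pd

# Made-up frame (illustrative values).
toy = pd.DataFrame({
    "total_rooms": [4800, 5200, 5450, 6100],
    "population": [1200, 900, 2100, 3000],
})

# Thresholds kept in ordinary Python variables.
min_rooms = 5000
max_rooms = 5500

# @name looks up a Python variable; "and" combines the two conditions.
in_range = toy.query(
    "(@min_rooms < total_rooms < @max_rooms) and (population > 1000)"
)

print(in_range)
```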

In some cases, we may want to randomly select a sample from the dataframe. It is more like a selection than filtering but definitely worth mentioning. The sample function returns a random sample of the specified size.
#6
df.sample(n=5)

The sample contains 5 rows. It is also possible to specify a fraction. For instance, the following code will return a sample of size equal to 1% of the original dataframe.
df.sample(frac=0.01)
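A random sample changes on every run unless a seed is fixed. The sketch below uses the random_state parameter on a made-up dataframe (toy is an assumption for illustration) to make the sample reproducible:

```python
import pandas as pd

# Made-up frame with 100 rows (illustrative values).
toy = pd.DataFrame({"population": range(100)})

# The same random_state gives the same rows each time.
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)

# frac works the same way: 10% of 100 rows is 10 rows.
tenth = toy.sample(frac=0.1, random_state=42)

print(first)
print(len(tenth))
```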
We also have the option to select a specified range of indices. Just like the sample function, this is more of a selection than filtering based on a condition. However, for sequential data (e.g. time series), it can be considered a way of filtering.
The method we will use is iloc, which returns the rows or columns within the specified position range.
#7
df.iloc[50:55, :]

The rows at positions 50 through 54 are returned, since the end of the slice (55) is excluded. We also have the option to select only some of the columns.
df.iloc[50:55, :3]
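iloc works on integer positions, not on index labels. The difference only shows when the index is not the default 0, 1, 2, ..., so the sketch below uses a made-up dataframe with custom labels (toy and its values are assumptions for illustration) and compares iloc with the label-based loc:

```python
import pandas as pd

# Made-up frame whose index labels start at 100 (illustrative values).
toy = pd.DataFrame(
    {"population": [10, 20, 30, 40, 50]},
    index=[100, 101, 102, 103, 104],
)

# iloc slices by position: positions 1 and 2, end excluded.
by_position = toy.iloc[1:3]

# loc slices by label: labels 101 through 103, end included.
by_label = toy.loc[101:103]

print(by_position)
print(by_label)
```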

You may have noticed that the indices of the returned rows did not change. They still have the same indices as in the original dataframe. If you'd like to create a new dataframe after filtering, you may want to reset the indices, which can be done with the reset_index function. By default, reset_index keeps the old index as a new column named index; pass drop=True to discard it.
#without reset_index
df_new = df.query('total_rooms > 5500')
df_new.head()

#with reset_index
df_new = df.query('total_rooms > 5500').reset_index()
df_new.head()
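The difference between keeping and dropping the old index can be sketched on a made-up dataframe (toy and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up frame (illustrative values).
toy = pd.DataFrame({"total_rooms": [4000, 6000, 5000, 7000]})

# Default: the old index is kept as a new column named "index".
kept = toy.query("total_rooms > 5500").reset_index()

# drop=True: the old index is discarded and only the new 0, 1, ... index remains.
dropped = toy.query("total_rooms > 5500").reset_index(drop=True)

print(kept)
print(dropped)
```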

We have covered different methods to filter a dataframe or select a part of it. Although the same operation can be done with many of them, you may prefer one over another because of the syntax or some other reason.
It is always good to have flexibility and multiple ways to do a task, and Pandas provides plenty of both.
Thank you for reading. Please let me know if you have any feedback.