With Selection, Slicing, Indexing and Filtering

In part 1 and part 2, we’ve learned how to inspect, describe and summarize a Pandas DataFrame. Today, we’ll learn how to extract a subset of a Pandas DataFrame. This is very useful because we often want to perform operations on subsets of our data. There are many different ways of subsetting a Pandas DataFrame. You may need to select specific columns with all rows. Sometimes, you want to select specific rows with all columns or select rows and columns that meet a specific criterion, etc.
All different ways of subsetting can be divided into 4 categories: Selection, Slicing, Indexing and Filtering.

As you continue reading this post, you’ll learn the differences between these categories.
Before discussing any of the methods of subsetting a data frame, it is worth distinguishing between a Pandas Series object and a Pandas DataFrame object.
Pandas Series and DataFrame objects
The Series and the DataFrame are two main data structures in Pandas. Simply, a Series is similar to a single column of data while a DataFrame is similar to a sheet with rows and columns. Look at the following diagram:

As you can see, a Series is one dimensional and a DataFrame is two dimensional. If we combine two or more Series objects together, we can get a DataFrame object. Let’s look at the actual view of a Series object.
import numpy as np
import pandas as pd
data = np.array([85, 90, 70, 80])
series = pd.Series(data=data,
name="marks")
print(series)

A Series consists of two components: One-dimensional data values and Index. The index provides meaningful labels for each data value. The users can use this index to select the values. By default, the index begins with 0.
Let’s look at the actual view of a DataFrame object.
import numpy as np
import pandas as pd
data = np.array([[25, 85],
[25, 90],
[26, 70],
[24, 80]])
dataframe = pd.DataFrame(data=data,
columns=["age", "marks"])
print(dataframe)

A DataFrame consists of three components: Two-dimensional data values, Row index and Column index. These indices provide meaningful labels for rows and columns. The users can use these indices to select rows and columns. By default, the indices begin with 0.
Now, we discuss different ways of subsetting a Pandas DataFrame. For explaining purposes, I’ll use the "wine dataset". Here is a part of it.

Selection
When we grab the entire column(s), it refers to as Selection. **** The selected column(s) contain all the rows.
Method 1: Selecting a single column using the column name
We can select a single column of a Pandas DataFrame using its column name. If the DataFrame is referred to as df, the general syntax is:
df['column_name']
# Or
df.column_name # Only for single column selection
The output is a Pandas Series which is a single column!
# Load some data
import pandas as pd
from sklearn.datasets import load_wine
wine = load_wine()
df = pd.DataFrame(data=wine.data,
columns=wine.feature_names)
# Select the 'alcohol column'
print(df['alcohol'])
print(type(df['alcohol']))

Method 2: Selecting multiple columns using the column names
We can select multiple columns of a Pandas DataFrame using its column names. We can define columns names inside a list:
['column_1', 'column_2', ...]
Then, we can include this list into df[]. The general syntax is:
df[['column_1', 'column_2', ...]]
This time, the output is a Pandas DataFrame!
df[['alcohol', 'ash', 'hue']]

Method 3: Selecting a single column using the .loc attribute
The same result in Method 1 can be obtained using the .loc attribute which selects Pandas data by label (column name).
df.loc[:, 'alcohol']
Method 4: Selecting multiple columns using the .loc attribute
The same result in Method 2 can be obtained using the .loc attribute which selects Pandas data by labels (column names).
df.loc[:, ['alcohol', 'ash', 'hue']]
The general syntax of the .loc attribute is:
df.loc['row_label', 'column_label']
If there are multiple labels, they should be specified inside lists:
df.loc[['row_1', 'row_2'],
['column_1', 'column_2']]
If we want to select all the rows or columns, it can be done with : symbol. The most important thing about the .loc attribute is that it selects Pandas data by label.
Method 5: Selecting a single column using the .iloc attribute
The same result in Method 1 can be obtained using the .iloc attribute which selects Pandas data by position (column index).
df.iloc[:, 0]
The alcohol variable is in position 0 (first variable).
Method 6: Selecting multiple columns using the .iloc attribute
The same result in Method 2 can be obtained using the .iloc attribute which selects Pandas data by positions (column indices).
df.iloc[:, [0, 2, 10]]
The general syntax of the .iloc attribute is:
df.iloc['row_index', 'column_index']
If there are multiple labels, they should be specified inside lists:
df.iloc[['row_index_1', 'row_index_2'],
['column_index_1', 'column_index_2']]
If we want to select all the rows or columns, it can be done with : notation. The most important thing about the .iloc attribute is that it selects Pandas data by position using numeric indices.
Method 7: Selecting consecutive columns using the .iloc attribute (The easy way)
We can select the first 5 columns of df as follows:
df.iloc[:, [0, 1, 2, 3, 4]]
We can also use the following easy method to obtain the same result.
df.iloc[:, 0:5]
To use this, columns should be positioned consecutively. The 0:5 range includes 0 (first column), excludes 5 (sixth column) and selects every column between the range.
Method 8: Selecting the last column
Selecting the last column is often useful in many cases. There are two methods:
First, we can count the number of columns in the data frame using the .shape attribute.
df.shape
# Output: (178, 13)
The last column is the 13th one that can be accessed through index 12. By using .iloc,
df.iloc[:, 12]

The second method is much easy. Here, we do not need to know the number of columns in the data frame.
df.iloc[:, -1]
The -1 represents the last column.
Slicing
When we want to extract certain rows from the DataFrame, it refers to as Slicing. The extracted rows are called slices **** and contain all the columns.
Method 9: Selecting a single row using the .iloc attribute
The easiest way to extract a single row is to use the row index inside the .iloc attribute. The general syntax is:
df.iloc[row_index]
The output is a Pandas Series which contains the row values.
df.iloc[0]

The appearance is a bit confusing as the output is a Pandas Series. If you want this as a row itself, simply use the index values inside a list as follows:
df.iloc[[0]]

This is a Pandas DataFrame which contains 1 row and all the columns!
Method 10: Selecting multiple rows using the .iloc attribute
We can extract multiple rows of a Pandas DataFrame using its row indices. We include row indices inside a list:
[row_index_1, row_index_2, ...]
Then we include this list inside df.iloc[].
df.iloc[[row_index_1, row_index_2, ...]]
The output is a Pandas DataFrame.
df.iloc[[0, 25, 100]]

Method 11: Selecting the last few rows
The negative indices count rows from the bottom.
df.iloc[[-1, -2, -3, -4, -5]]

Indexing
When we combine column selection and row slicing, it is referred to as Indexing. Here, we can use .loc and .iloc attributes of a Pandas DataFrame.
Method 12: Selecting a single value using the .iloc attribute
If we specify a single row and a single column, the intersection is a single value!
df.iloc[0, 0]

Keep in mind that we cannot use column or row names inside .iloc[]. Only the index numbers can be used.
Method 13: Selecting a single value using the .loc attribute
Here we can use row or column names inside .loc[]. Also keep in mind that, in our data, the row labels are the same as the row indices. The following code gives the same result as in Method 12.
df.loc[0, 'alcohol']
Method 14: Selecting multiple rows and columns using the .iloc attribute
The general syntax is:
df.iloc[[row_index_1, row_index_2, ...],
[column_index_1, column_index_2, ...]]
The output is a Pandas DataFrame.
df.iloc[[0, 5, 100], [0, 3, 7]]

Method 15: Selecting multiple rows and columns using the .loc attribute
The general syntax is:
df.loc[[row_name, row_name_2, ...],
[column_name_1, column_name_2, ...]]
The output is a Pandas DataFrame.
df.loc[[0, 5, 100], ['alcohol', 'ash', 'hue']]

Here we can use row or column names inside .loc[]. Also keep in mind that, in our data, the row names are the same as the row indices.
Method 16: Selecting consecutive rows and columns using the .loc and .iloc attributes (The easy way)
This can be easily done with the : notation. For this, rows and columns should be positioned consecutively.
df.iloc[0:6, 0:5]

Another example is:
df.loc[0:6, ['alcohol', 'ash']]

Filtering
When we select rows and columns based on specific criteria or conditions, it is referred to as Filtering. We can also combine the above-discussed methods with this.
Method 17: Filtering based on a single criterion with all columns
Let’s subset our data when alcohol > 14.3. Here, we select all the columns when alcohol > 14.3.
df['alcohol'] > 14.3

This is a Pandas Series of boolean data type. We can use this Series to get the required subset of the data.
df[df['alcohol'] > 14.3]

Method 18: Filtering based on a single criterion with a few columns
Let’s subset our data when alcohol > 14.3. This time, we select only 3 columns when alcohol > 14.3. For this, we can combine the above filtering technique with .loc[].
df.loc[df['alcohol'] > 14.3,
['alcohol', 'ash', 'hue']]

Method 19: Filtering based on two criteria with AND operator (Same column)
Let’s subset our data when alcohol > 14.3 AND alcohol < 14.6. Here, we use two conditions and combine them with the AND operator. Each condition should be surrounded in parentheses.
df[(df['alcohol'] > 14.3) & (df['alcohol'] < 14.6)]

Method 20: Filtering based on two criteria with the between() method
A similar type of filtering discussed in Method 19 can be achieved using the between() method.
df[df['alcohol'].between(14.3, 14.6)]

Here, the output is a bit different because the between() method includes the values of the lower bound (14.3) and upper bound (14.6) by default. However, we can pass inclusive=False if we don’t want the inclusive selection.
df[df['alcohol'].between(14.3, 14.6,
inclusive=False)]

This subset is exactly the same as the subset obtained in Method 19.
Method 21: Filtering based on two criteria with AND operator (Different columns)
Here, the two conditions are made using two different columns: alcohol and hue.
df[(df['alcohol'] > 14.3) & (df['hue'] > 1.0)]

Method 22: Filtering based on two criteria with OR operator
When we use the AND operator, the filtering happens considering both conditions to be true. If we want at least one condition to be true, we can use the OR operator.
df[(df['alcohol'] > 14.5) | (df['hue'] > 1.4)]

Method 23: Filtering based on the minimum and maximum values
Let’s subset our data based on the minimum and maximum values of the alcohol variable. First, we get the indices of the minimum and maximum:
df['alcohol'].idxmin() # Min value index
df['alcohol'].idxmax() # Max value index
Then we use .iloc[].
df.iloc[[df['alcohol'].idxmin(),
df['alcohol'].idxmax()]]

Summary
These are not the only ways of subsetting a Pandas DataFrame. There are many more. We can combine multiple methods for complex subsetting. This post helps you to be familiar with subsetting syntax. And also, you’re now familiar with the terms Selection, Slicing, Indexing and Filtering. Also, keep in mind that .iloc needs integer values (i for integer) while .loc needs label values.
This is the end of today’s post. My readers can sign up for a membership through the following link to get full access to every story I write and I will receive a portion of your membership fee.
Sign-up link: https://rukshanpramoditha.medium.com/membership
Thank you so much for your continuous support! See you in the next story. Happy learning to everyone!
Special credit goes to Hans-Peter Gauster on Unsplash, **** who provides me with a nice cover image for this post.
Rukshan Pramoditha 2021–06–04