The world’s leading publication for data science, AI, and ML professionals.

23 Efficient Ways of Subsetting a Pandas DataFrame

With Selection, Slicing, Indexing and Filtering

Photo by Hans-Peter Gauster on Unsplash
Photo by Hans-Peter Gauster on Unsplash

In part 1 and part 2, we’ve learned how to inspect, describe and summarize a Pandas DataFrame. Today, we’ll learn how to extract a subset of a Pandas DataFrame. This is very useful because we often want to perform operations on subsets of our data. There are many different ways of subsetting a Pandas DataFrame. You may need to select specific columns with all rows. Sometimes, you want to select specific rows with all columns or select rows and columns that meet a specific criterion, etc.

All different ways of subsetting can be divided into 4 categories: Selection, Slicing, Indexing and Filtering.

(Image by author)
(Image by author)

As you continue reading this post, you’ll learn the differences between these categories.

Before discussing any of the methods of subsetting a data frame, it is worth distinguishing between a Pandas Series object and a Pandas DataFrame object.

Pandas Series and DataFrame objects

The Series and the DataFrame are two main data structures in Pandas. Simply, a Series is similar to a single column of data while a DataFrame is similar to a sheet with rows and columns. Look at the following diagram:

(Image by author)
(Image by author)

As you can see, a Series is one dimensional and a DataFrame is two dimensional. If we combine two or more Series objects together, we can get a DataFrame object. Let’s look at the actual view of a Series object.

import numpy as np
import pandas as pd
data = np.array([85, 90, 70, 80])
series = pd.Series(data=data, 
                   name="marks")
print(series)
(Image by author)
(Image by author)

A Series consists of two components: One-dimensional data values and Index. The index provides meaningful labels for each data value. The users can use this index to select the values. By default, the index begins with 0.

Let’s look at the actual view of a DataFrame object.

import numpy as np
import pandas as pd
data = np.array([[25, 85],
                 [25, 90],
                 [26, 70],
                 [24, 80]])
dataframe = pd.DataFrame(data=data, 
                         columns=["age", "marks"])
print(dataframe)
(Image by author)
(Image by author)

A DataFrame consists of three components: Two-dimensional data values, Row index and Column index. These indices provide meaningful labels for rows and columns. The users can use these indices to select rows and columns. By default, the indices begin with 0.

Now, we discuss different ways of subsetting a Pandas DataFrame. For explaining purposes, I’ll use the "wine dataset". Here is a part of it.

A part of the wine dataset
A part of the wine dataset

Selection

When we grab the entire column(s), it refers to as Selection. **** The selected column(s) contain all the rows.

Method 1: Selecting a single column using the column name

We can select a single column of a Pandas DataFrame using its column name. If the DataFrame is referred to as df, the general syntax is:

df['column_name']
# Or
df.column_name # Only for single column selection

The output is a Pandas Series which is a single column!

# Load some data
import pandas as pd
from sklearn.datasets import load_wine
wine = load_wine()
df = pd.DataFrame(data=wine.data,
                  columns=wine.feature_names)
# Select the 'alcohol column'
print(df['alcohol'])
print(type(df['alcohol']))
(Image by author)
(Image by author)

Method 2: Selecting multiple columns using the column names

We can select multiple columns of a Pandas DataFrame using its column names. We can define columns names inside a list:

['column_1', 'column_2', ...]

Then, we can include this list into df[]. The general syntax is:

df[['column_1', 'column_2', ...]]

This time, the output is a Pandas DataFrame!

df[['alcohol', 'ash', 'hue']]
(Image by author)
(Image by author)

Method 3: Selecting a single column using the .loc attribute

The same result in Method 1 can be obtained using the .loc attribute which selects Pandas data by label (column name).

df.loc[:, 'alcohol']

Method 4: Selecting multiple columns using the .loc attribute

The same result in Method 2 can be obtained using the .loc attribute which selects Pandas data by labels (column names).

df.loc[:, ['alcohol', 'ash', 'hue']]

The general syntax of the .loc attribute is:

df.loc['row_label', 'column_label']

If there are multiple labels, they should be specified inside lists:

df.loc[['row_1', 'row_2'], 
       ['column_1', 'column_2']]

If we want to select all the rows or columns, it can be done with : symbol. The most important thing about the .loc attribute is that it selects Pandas data by label.

Method 5: Selecting a single column using the .iloc attribute

The same result in Method 1 can be obtained using the .iloc attribute which selects Pandas data by position (column index).

df.iloc[:, 0]

The alcohol variable is in position 0 (first variable).

Method 6: Selecting multiple columns using the .iloc attribute

The same result in Method 2 can be obtained using the .iloc attribute which selects Pandas data by positions (column indices).

df.iloc[:, [0, 2, 10]]

The general syntax of the .iloc attribute is:

df.iloc['row_index', 'column_index']

If there are multiple labels, they should be specified inside lists:

df.iloc[['row_index_1', 'row_index_2'], 
       ['column_index_1', 'column_index_2']]

If we want to select all the rows or columns, it can be done with : notation. The most important thing about the .iloc attribute is that it selects Pandas data by position using numeric indices.

Method 7: Selecting consecutive columns using the .iloc attribute (The easy way)

We can select the first 5 columns of df as follows:

df.iloc[:, [0, 1, 2, 3, 4]]

We can also use the following easy method to obtain the same result.

df.iloc[:, 0:5]

To use this, columns should be positioned consecutively. The 0:5 range includes 0 (first column), excludes 5 (sixth column) and selects every column between the range.

Method 8: Selecting the last column

Selecting the last column is often useful in many cases. There are two methods:

First, we can count the number of columns in the data frame using the .shape attribute.

df.shape
# Output: (178, 13)

The last column is the 13th one that can be accessed through index 12. By using .iloc,

df.iloc[:, 12]
The last column of the wine dataset (Image by author)
The last column of the wine dataset (Image by author)

The second method is much easy. Here, we do not need to know the number of columns in the data frame.

df.iloc[:, -1]

The -1 represents the last column.

Slicing

When we want to extract certain rows from the DataFrame, it refers to as Slicing. The extracted rows are called slices **** and contain all the columns.

Method 9: Selecting a single row using the .iloc attribute

The easiest way to extract a single row is to use the row index inside the .iloc attribute. The general syntax is:

df.iloc[row_index]

The output is a Pandas Series which contains the row values.

df.iloc[0]
(Image by author)
(Image by author)

The appearance is a bit confusing as the output is a Pandas Series. If you want this as a row itself, simply use the index values inside a list as follows:

df.iloc[[0]]
(Image by author)
(Image by author)

This is a Pandas DataFrame which contains 1 row and all the columns!

Method 10: Selecting multiple rows using the .iloc attribute

We can extract multiple rows of a Pandas DataFrame using its row indices. We include row indices inside a list:

[row_index_1, row_index_2, ...]

Then we include this list inside df.iloc[].

df.iloc[[row_index_1, row_index_2, ...]]

The output is a Pandas DataFrame.

df.iloc[[0, 25, 100]]
(Image by author)
(Image by author)

Method 11: Selecting the last few rows

The negative indices count rows from the bottom.

df.iloc[[-1, -2, -3, -4, -5]]
(Image by author)
(Image by author)

Indexing

When we combine column selection and row slicing, it is referred to as Indexing. Here, we can use .loc and .iloc attributes of a Pandas DataFrame.

Method 12: Selecting a single value using the .iloc attribute

If we specify a single row and a single column, the intersection is a single value!

df.iloc[0, 0]
(Image by author)
(Image by author)

Keep in mind that we cannot use column or row names inside .iloc[]. Only the index numbers can be used.

Method 13: Selecting a single value using the .loc attribute

Here we can use row or column names inside .loc[]. Also keep in mind that, in our data, the row labels are the same as the row indices. The following code gives the same result as in Method 12.

df.loc[0, 'alcohol']

Method 14: Selecting multiple rows and columns using the .iloc attribute

The general syntax is:

df.iloc[[row_index_1, row_index_2, ...],
        [column_index_1, column_index_2, ...]]

The output is a Pandas DataFrame.

df.iloc[[0, 5, 100], [0, 3, 7]]
(Image by author)
(Image by author)

Method 15: Selecting multiple rows and columns using the .loc attribute

The general syntax is:

df.loc[[row_name, row_name_2, ...],
        [column_name_1, column_name_2, ...]]

The output is a Pandas DataFrame.

df.loc[[0, 5, 100], ['alcohol', 'ash', 'hue']]
(Image by author)
(Image by author)

Here we can use row or column names inside .loc[]. Also keep in mind that, in our data, the row names are the same as the row indices.

Method 16: Selecting consecutive rows and columns using the .loc and .iloc attributes (The easy way)

This can be easily done with the : notation. For this, rows and columns should be positioned consecutively.

df.iloc[0:6, 0:5]
(Image by author)
(Image by author)

Another example is:

df.loc[0:6, ['alcohol', 'ash']]
(Image by author)
(Image by author)

Filtering

When we select rows and columns based on specific criteria or conditions, it is referred to as Filtering. We can also combine the above-discussed methods with this.

Method 17: Filtering based on a single criterion with all columns

Let’s subset our data when alcohol > 14.3. Here, we select all the columns when alcohol > 14.3.

df['alcohol'] > 14.3
(Image by author)
(Image by author)

This is a Pandas Series of boolean data type. We can use this Series to get the required subset of the data.

df[df['alcohol'] > 14.3]
(Image by author)
(Image by author)

Method 18: Filtering based on a single criterion with a few columns

Let’s subset our data when alcohol > 14.3. This time, we select only 3 columns when alcohol > 14.3. For this, we can combine the above filtering technique with .loc[].

df.loc[df['alcohol'] > 14.3, 
      ['alcohol', 'ash', 'hue']]
(Image by author)
(Image by author)

Method 19: Filtering based on two criteria with AND operator (Same column)

Let’s subset our data when alcohol > 14.3 AND alcohol < 14.6. Here, we use two conditions and combine them with the AND operator. Each condition should be surrounded in parentheses.

df[(df['alcohol'] > 14.3) &amp; (df['alcohol'] < 14.6)]
(Image by author)
(Image by author)

Method 20: Filtering based on two criteria with the between() method

A similar type of filtering discussed in Method 19 can be achieved using the between() method.

df[df['alcohol'].between(14.3, 14.6)]
(Image by author)
(Image by author)

Here, the output is a bit different because the between() method includes the values of the lower bound (14.3) and upper bound (14.6) by default. However, we can pass inclusive=False if we don’t want the inclusive selection.

df[df['alcohol'].between(14.3, 14.6,
                         inclusive=False)]
(Image by author)
(Image by author)

This subset is exactly the same as the subset obtained in Method 19.

Method 21: Filtering based on two criteria with AND operator (Different columns)

Here, the two conditions are made using two different columns: alcohol and hue.

df[(df['alcohol'] > 14.3) &amp; (df['hue'] > 1.0)]
(Image by author)
(Image by author)

Method 22: Filtering based on two criteria with OR operator

When we use the AND operator, the filtering happens considering both conditions to be true. If we want at least one condition to be true, we can use the OR operator.

df[(df['alcohol'] > 14.5) | (df['hue'] > 1.4)]
(Image by author)
(Image by author)

Method 23: Filtering based on the minimum and maximum values

Let’s subset our data based on the minimum and maximum values of the alcohol variable. First, we get the indices of the minimum and maximum:

df['alcohol'].idxmin() # Min value index
df['alcohol'].idxmax() # Max value index

Then we use .iloc[].

df.iloc[[df['alcohol'].idxmin(),
         df['alcohol'].idxmax()]]
(Image by author)
(Image by author)

Summary

These are not the only ways of subsetting a Pandas DataFrame. There are many more. We can combine multiple methods for complex subsetting. This post helps you to be familiar with subsetting syntax. And also, you’re now familiar with the terms Selection, Slicing, Indexing and Filtering. Also, keep in mind that .iloc needs integer values (i for integer) while .loc needs label values.


This is the end of today’s post. My readers can sign up for a membership through the following link to get full access to every story I write and I will receive a portion of your membership fee.

Sign-up link: https://rukshanpramoditha.medium.com/membership

Thank you so much for your continuous support! See you in the next story. Happy learning to everyone!

Special credit goes to Hans-Peter Gauster on Unsplash, **** who provides me with a nice cover image for this post.

Rukshan Pramoditha 2021–06–04


Related Articles