Interesting Ways to Select Pandas DataFrame Columns

Published in

Towards Data Science

4 min readApr 16, 2021

Manipulating pandas data frames is a common task during exploratory analysis or preprocessing in a Data Science project. Filtering and sub-setting the data is also common. Over time, I have found myself needing to select columns based on different criteria. I hope readers find this article as a reference.

Example Data

If you want to use the data I used to test out these methods of selecting columns from a pandas data frame, use the code snippet below to get the wine dataset into your IDE or a notebook.

from sklearn.datasets import load_wine
import pandas as pd
import numpy as np
import reX = load_wine()
df = pd.DataFrame(X.data, columns = X.feature_names)df.head()

Image of Table of Wine Data — Screenshot by Author of Wine Dataset in a Jupyter notebook

Now, depending on what you want to do, check out each one of the code snippets below and try for yourself!

Selecting columns based on their name

This is the most basic way to select a single column from a dataframe, just put the string name of the column in brackets. Returns a pandas series.

df['hue']

Passing a list in the brackets lets you select multiple columns at the same time.

df[['alcohol','hue']]

Selecting a subset of columns found in a list

Similar to the previous example, but here you can search over all the columns in the dataframe.

df[df.columns[df.columns.isin(['alcohol','hue','NON-EXISTANT COLUMN'])]]

Selecting a subset of columns based on difference of columns

Let’s say you know what columns you don’t want in the dataframe. Pass those as a list to the difference method and you’ll get back everything except them.

df[df.columns.difference([‘alcohol’,’hue’])]

Selecting a subset of columns that is not in a list

Return a data frame that has columns that are not in a list that you want to search over.

df[df.columns[~df.columns.isin(['alcohol','hue'])]]

Selecting columns based on their data type

Data types include ‘float64’ and ‘object’ and are inferred from the columns passed to the dtypes method. By matching on columns that are the same data type, you’ll get a series of True/False. Use the values method to get just the True/False values and not the index.

df.loc[:,(df.dtypes=='float64').values]

Selecting columns based on their column name containing a substring

If you have tons of columns in a data frame and their column names all have a similar substring that you are interested in, you can return the columns who’s names contain a substring. Here we want everything that has the “al” substring in it.

df.loc[:,['al' in i for i in df.columns]]

Selecting columns based on their column name containing a string wildcard

You could have hundreds of columns, so it might make sense to find columns that match a pattern. Searching for column names that match a wildcard can be done with the “search” function from the re package (see the link in the reference section for more details on using the regular expression package).

df.loc[:,[True if re.search('flava+',column) else False for column in df.columns]]

Selecting columns based on how their column name starts

If you want to select columns with names that start with a certain string, you can use the startswith method and pass it in the columns spot for the data frame location.

df.loc[:,df.columns.str.startswith('al')]

Selecting columns based on how their column name ends

Same as the last example, but finds columns with names that end a certain way.

df.loc[:,df.columns.str.endswith('oids')]

Selecting columns if all rows meet a condition

You can pick columns if the rows meet a condition. Here, if all the the values in a column is greater than 14, we return the column from the data frame.

df.loc[:,[(df[col] > 14).all() for col in df.columns]]

Selecting columns if any row of a column meets a condition

Here, if any of the the values in a column is greater than 14, we return the column from the data frame.

df.loc[:,[(df[col] > 14).any() for col in df.columns]]

Selecting columns if the average of rows in a column meet a condition

Here, if the mean of all the values in a column meet a condition, return the column.

df.loc[:,[(df[col].mean() > 7) for col in df.columns]]

Thanks for checking this out and feel free to reference it often.

References

GitHub - caseywhorton/medium-blog-code

Contribute to caseywhorton/medium-blog-code development by creating an account on GitHub.

github.com

re - Regular expression operations - Python 3.9.4 documentation

This module provides regular expression matching operations similar to those found in Perl. Both patterns and strings…

docs.python.org

pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of…

pandas.pydata.org