5 String-Based Filtering Methods Every Pandas User Should Know

A guide to filtering on string columns.

Avi Chawla
Towards Data Science

--

Photo by Andreas Palmer on Unsplash

Filtering operation in Pandas refers to selecting a subset of rows whose values in some particular column(s) comply with a specific condition.

For instance, consider the dummy DataFrame on the left in the image below. If you want to select all the rows where the value in col1 is either A or B, the filtering operation should be such that it yields the DataFrame on the right.

Filtering dataframe on col1 (Image by author).

The above type of filtering specifically refers to string-based filtering in Pandas.

In this post, I will share some of the most prevalent methods you should know as a Pandas user to filter DataFrames on string columns. The highlight of the article is as follows:

How to Identify a String Column?
String-based Filtering Methods:
#1 Filter based on a single categorical value
#2 Filter based on multiple categorical values
#3 Filter based on the length of string
#4 Filter based on the presence of a substring
#5 Filter based on the type of characters in a string

Let’s begin 🚀!

How to Identify a String Column?

Before I proceed with the popular methods in Pandas to filter data on string values, let’s understand how you can identify a column with a string data type.

In Pandas, the data type of a string column is represented as object. To determine the data type, you can use the dtype attribute of a series as follows:

Here, you should note that even if a single value in a series is a string, the whole column will be interpreted as a string-type column. For instance, let’s change the first value in col2 from 1 to “1".

This time, the data type of col2 is object rather than int64— depicting a string data type.

String-based Filtering Methods

Next, let’s proceed with understanding methods that you can use to filter DataFrames on a column with object data type.

#1 Filter based on a single categorical value

First, say you want to filter all the rows whose value in the string column belongs to a single categorical value in the column. This is demonstrated in the image below:

The above filtering is implemented below:

The above approach filters all the rows where the value in col1 is “A”.

This can also be implemented using the query() method as shown below:

Note: While filtering using the query() method on a string column, you should enclose the filter value in single quotes as demonstrated above.

#2 Filter based on multiple categorical values

Similar to the above filtering, if you want to filter multiple values in a single go, you can do so in three ways.

  • Using logical operators:

The above condition states that the value in col1 should either be “A” or “B”.

  • The second way is to use the isin() method as demonstrated below:

The isin() method used above accepts a list of values to filter.

  • Lastly, we can use the query() method as shown below:

The isin() method used above accepts a list of filter values. On the other hand, the query() method evaluates a string expression to filter rows from a DataFrame.

#3 Filter based on the length of string

Here, say you want to filter all the rows from a DataFrame where the length of the strings in a column is greater/less than a threshold.

Invoking the len() method on a series lets you compute the length of individual entries, which can then be used to filter the rows according to a threshold.

Below, we filter all the strings from col1 whose length is greater than 4.

Before executing a method on an object column, values should be retrieved as string type using the str attribute, over which you can run a range of string methods available in python, such as strip(), isupper(), upper(), len() etc.

#4 Filter based on the presence of a substring

Next, say you want to extract rows for which the values in the string column contain a particular substring.

There are three widely used methods for this.

  • Match at the beginning of the string

As the name suggests, this method will return a row only if the substring matches the beginning of the string-value column.

Say you want to find all strings which begin with the substring “Jo”. We will use the startswith() method demonstrated below. Also, recall from the previous filtering method (#3), we should first convert the object column to a string using the str attribute.

If your column has NaN values, you should specify nan=False in the startswith() method, otherwise, it will raise an error

The error block is shown below:

Specifying nan=False ignores NaN values:

  • Match at the end of the string

Matching at the end of the string has a similar syntax to startswith(). Here, we use the endswith() method as shown below:

Note: Both startswith() and endswith() are case-sensitive methods.

  • Match anywhere in the string

In contrast to the startswith() and endswith() method that only match a substring at the start and the end of the string, respectively, the contains() method can find potential matches anywhere within the string-valued column.

By default, the contains() method performs case-sensitive matches. However, it can perform case-insensitive matches as well by passing the case=False argument as shown below:

#5 Filter based on the type of characters in a string

This type of filtering is based on the type of characters present in the string, such as:

- Filter if all characters are upper-case   : isupper()
- Filter if all characters are lower-case : islower()
- Filter if all characters are alphabetic : isalpha()
- Filter if all characters are numeric : isnumeric()
- Filter if all characters are digits : isdigit()
- Filter if all characters are decimal : isdecimal()
- Filter if all characters are whitespace : isspace()
- Filter if all characters are titlecase : istitle()
-
Filter if all characters are alphanumeric : isalnum()

I have demonstrated a couple of these methods below.

  • Filter alphanumeric strings from the DataFrame:
  • Filter numeric strings from the DataFrame:

This brings us to the end of this post.

To conclude, in this post, we discussed some of the most widely used and must-know string-based filtering methods in Pandas.

String-based filtering can also be executed using Regular Expressions (RegEx) in python. In the interest of time, I have not discussed them in this post and will release another blog soon!

Interested in reading more such stories on Medium??

✉️ Sign-up to my Email list to never miss another article on data science guides, tricks and tips, Machine Learning, SQL, Python, and more. Medium will deliver my next articles right to your inbox.

Thanks for reading!

--

--

👉 Get a Free Data Science PDF (550+ pages) with 320+ tips by subscribing to my daily newsletter today: https://bit.ly/DailyDS.