The world’s leading publication for data science, AI, and ML professionals.

Boolean Masking with Pandas

Filtering Pandas Dataframes

Photo by Larry Costales on Unsplash
Photo by Larry Costales on Unsplash

One of the topics in Miki Tebeka‘s excellent "Faster Pandas" __ course was how to use Boolean masks to filter data in Pandas. I wanted to practice what I had learned, so I updated a recent project to use Boolean masks. I think this is a very useful technique, so I wanted to share how I updated my code. I’ll show you how to use Boolean masks in some _Befor_e and _Afte_r code snippets taken directly from my project.

Before we dive into the snippets, I’ll provide some context. I wanted to analyse reviews of a popular cooking app. I downloaded the reviews and focused on the ones posted in 2019 and 2020. Why this time range? I wanted to see if there might be a difference in the app reviews before and during COVID. My household increased our home cooking during 2020 and I was curious if the cooking app reviews might show an increase in app usage and/or user enjoyment.


How To Use Boolean Masks

Now, let’s learn how to use Boolean masks by checking out some Before and After code snippets.

The following code filters the dataset to only use the rows where the year is 2019 or 2020.

Before:

#reviewTimestamp is a datetime object
reviews_df = all_reviews_df[(all_reviews_df['reviewTimestamp'].dt.year == 2019) | (all_reviews_df['reviewTimestamp'].dt.year == 2020)]

After:

pre_mask = all_reviews_df['reviewTimestamp'].dt.year == 2019
during_mask = all_reviews_df['reviewTimestamp'].dt.year == 2020
reviews_df = all_reviews_df[pre_mask | during_mask]

Here’s another example below. This code filters the dataset to find the "high" and "low" reviews for 2019 and 2020.

Before:

top_reviews_2019 = reviews_df[(reviews_df['reviewTimestamp'].dt.year == 2019) & (reviews_df['score'] == 5)]['content'].to_list()
low_reviews_20191 = reviews_df[(reviews_df['reviewTimestamp'].dt.year == 2019) &amp; (reviews_df['score'] < 3)]['content'].to_list()
top_reviews_2020 = reviews_df[(reviews_df['reviewTimestamp'].dt.year == 2020) &amp; (reviews_df['score'] == 5)]['content'].to_list()
low_reviews_2020 = reviews_df[(reviews_df['reviewTimestamp'].dt.year == 2020) &amp; (reviews_df['score'] < 3)]['content'].to_list()

After:

pre_mask = reviews_df['reviewTimestamp'].dt.year == 2019
during_mask = reviews_df['reviewTimestamp'].dt.year == 2020
high_mask = reviews_df['score'] == 5
low_mask = reviews_df['score'] < 3
high_reviews_2019 = reviews_df[pre_mask &amp; high_mask]['content'].to_list()
low_reviews_2019 = reviews_df[pre_mask &amp; low_mask]['content'].to_list()

high_reviews_2020 = reviews_df[during_mask &amp; high_mask]['content'].to_list()
low_reviews_2020 = reviews_df[during_mask &amp; low_mask]['content'].to_list()

Why Use Boolean Masks?

Why did I find it useful to make these changes? The first reason is that it makes the code easier to read and to make changes. For example, I might update the analysis with a pre-COVID timeframe of 2018 and 2019 and the during-COVID timeframe of 2020 and 2021. I would just need to update the criteria in the Boolean mask definition instead of going through the code and replacing the criteria in every place I filtered the dataset.

The second reason is that I’d been told that using Boolean masks improves performance. I did performance testing on my code using the **** %timeit magic command (in cell mode) in the IPython interactive command shell. My performance check revealed that code using a Boolean mask was faster than the code that used regular conditional filtering. On my computer, the code was 7 times faster.

Image provided by Author
Image provided by Author

Now you’ve seen some examples of how to use Boolean masks and are aware of the reasons why you should consider using them in your code. Please feel free to reach out with suggestions and to let me know if this content has helped you.

Happy coding!


Related Articles