
Mastering Data Filtering with Pandas

Three Useful Functions for Data Filtering with Pandas

Photo by Ken Tomita on Pexels

Pandas is a Python library used for generating statistics, wrangling data, analyzing data, and more. In this post, I will discuss three useful functions that allow us to easily filter data using Pandas.

Let’s get started!

For our purposes, we will be working with the FIFA 19 data set, which can be found here.

To start, let’s import the pandas package:

import pandas as pd 

Next, let’s set the maximum number of displayed columns and rows to None, so that pandas does not truncate the output:

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Now, let’s read in our data:

df = pd.read_csv('fifa_data.csv')

Next we will print the first five rows of data to get an idea of the column types and their values (column results are truncated):

print(df.head())

The first function we can consider is one that filters by category. Let’s define our function ‘filter_category’. It will take a category (the name of a column) and a category value to match in that column:

def filter_category(category, category_value):
    ...  # body filled in below

We can use the ‘loc’ indexer with a boolean condition to select the rows where the value of the category is something we choose:

def filter_category(category, category_value):
    df_filter = df.loc[df[category] == category_value]
    return df_filter
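To make the boolean condition inside ‘loc’ more concrete, here is a minimal sketch on a small made-up DataFrame. The ‘toy’ frame and its rows are invented for illustration and are not taken from the FIFA 19 file:

```python
import pandas as pd

# Toy data, invented for illustration (not from the FIFA 19 file)
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C'],
    'Nationality': ['Argentina', 'Brazil', 'Argentina'],
})

# The comparison returns a boolean Series with one entry per row
mask = toy['Nationality'] == 'Argentina'
print(mask.tolist())

# loc keeps only the rows where the mask is True
toy_filter = toy.loc[mask]
print(toy_filter)
```

The function above does the same thing in a single line, with the column name and value passed in as arguments.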

Let’s filter for players whose ‘Nationality’ is ‘Argentina’:

df_filter = filter_category('Nationality', 'Argentina')
print(df_filter.head())

We see that all the values for ‘Nationality’ are ‘Argentina’. Another type of function that can be useful is one that filters by a list of values:

def filter_category_with_list(category, category_value_list):
    df_filter = df.loc[df[category].isin(category_value_list)]
    return df_filter

We can filter for rows where ‘Nationality’ is ‘Brazil’, ‘Spain’ or ‘Argentina’:

df_filter = filter_category_with_list('Nationality', ['Brazil', 'Spain', 'Argentina'])
print(df_filter.head())

We can also filter by the categorical column ‘Club’. Let’s filter for ‘Manchester City’, ‘Real Madrid’, and ‘FC Barcelona’:

df_filter = filter_category_with_list('Club', ['Manchester City', 'Real Madrid', 'FC Barcelona'])
print(df_filter.head())
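As a rough sketch of what ‘isin’ is doing, the toy example below compares it with the longer way of writing the same filter, chaining one equality test per value with ‘|’. The ‘toy’ frame and its rows are made up for illustration:

```python
import pandas as pd

# Toy data, invented for illustration (not from the FIFA 19 file)
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C', 'Player D'],
    'Nationality': ['Brazil', 'Spain', 'France', 'Argentina'],
})

wanted = ['Brazil', 'Spain', 'Argentina']

# isin marks each row whose value appears anywhere in the list
in_list = toy['Nationality'].isin(wanted)

# The same mask written as chained comparisons joined with | (element-wise OR)
chained = (
    (toy['Nationality'] == 'Brazil')
    | (toy['Nationality'] == 'Spain')
    | (toy['Nationality'] == 'Argentina')
)

# Both masks select the same rows
print(in_list.equals(chained))
print(toy.loc[in_list])
```

The list version stays readable as the number of values grows, which is why it suits a reusable function.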

We can also define a function that filters by a numerical value. The function will filter for rows based on an input value and on whether the column values should be greater than, less than, or equal to that input:

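def filter_numerical(numerical, numerical_value, relationship):
    # Keep rows whose column value is greater than, less than, or equal to the input
    if relationship == 'greater':
        df_filter = df.loc[df[numerical] > numerical_value]
    elif relationship == 'less':
        df_filter = df.loc[df[numerical] < numerical_value]
    elif relationship == 'equal':
        df_filter = df.loc[df[numerical] == numerical_value]
    else:
        # Fail loudly on a misspelled relationship instead of silently testing equality
        raise ValueError("relationship must be 'greater', 'less', or 'equal'")
    return df_filter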

We can filter for rows where the ‘Age’ of the soccer player is greater than 30:

df_filter = filter_numerical('Age', 30, 'greater')
print(df_filter.head())

Less than 30:

df_filter = filter_numerical('Age', 30, 'less')
print(df_filter.head())

Or equal to 30:

df_filter = filter_numerical('Age', 30, 'equal')
print(df_filter.head())
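One possible variation, which is not the version used in this post, is to look up the comparison in a dictionary built from Python’s ‘operator’ module. The names ‘OPS’ and ‘filter_numerical_ops’, the explicit ‘frame’ argument, and the toy data are all assumptions made for this sketch:

```python
import operator

import pandas as pd

# Map each relationship name to the matching comparison function
OPS = {
    'greater': operator.gt,
    'less': operator.lt,
    'equal': operator.eq,
}

def filter_numerical_ops(frame, numerical, numerical_value, relationship):
    # Hypothetical helper: reject unknown relationship names up front
    if relationship not in OPS:
        raise ValueError(f"unknown relationship: {relationship!r}")
    # Apply the comparison element-wise to build a boolean mask
    mask = OPS[relationship](frame[numerical], numerical_value)
    return frame.loc[mask]

# Toy data, invented for illustration (not from the FIFA 19 file)
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C'],
    'Age': [25, 30, 34],
})

print(filter_numerical_ops(toy, 'Age', 30, 'greater'))
print(filter_numerical_ops(toy, 'Age', 30, 'less'))
print(filter_numerical_ops(toy, 'Age', 30, 'equal'))
```

Passing the DataFrame in as an argument, rather than relying on a global ‘df’, also makes the function easier to reuse on other data.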

We can also filter by wage. The ‘Wage’ column is stored as text, so first let’s strip the ‘€’ prefix and the ‘K’ suffix and convert the result into a float:

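# Remove the leading euro sign
df['Wage'] = df['Wage'].str.lstrip('€')
# Remove the trailing 'K' (thousands)
df['Wage'] = df['Wage'].str.rstrip('K')
# Convert to float and scale from thousands of euros to euros
df['Wage'] = df['Wage'].astype(float)*1000.0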
print(df['Wage'].head())
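To check the conversion in isolation, here is a small sketch that applies the same steps to a few made-up wage strings. It assumes every value has the form of a euro sign, a number, and a trailing ‘K’; the sample values are invented, not taken from the data set:

```python
import pandas as pd

# Made-up wage strings in the assumed '€<number>K' format
wage_text = pd.Series(['€100K', '€25K', '€1K'])

# Strip the prefix and suffix, then convert thousands of euros to euros
wage = (
    wage_text
    .str.lstrip('€')
    .str.rstrip('K')
    .astype(float) * 1000.0
)

print(wage.tolist())
```

If some rows used a different suffix or were missing a value, the ‘astype(float)’ step would need extra handling, so it is worth inspecting the column before converting.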

Now let’s filter for wages less than 100,000 euros:

df_filter = filter_numerical('Wage', 100000, 'less')
print(df_filter.head())

I’ll stop here, but I encourage you to play around with the data and code yourself. An interesting function to consider writing is one that filters on both numerical and categorical column values.
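As one possible starting point for that exercise, the sketch below combines a categorical condition and a numerical condition with ‘&’. The function name ‘filter_category_and_numerical’, its arguments, the fixed “greater than” comparison, and the toy data are all assumptions made for illustration:

```python
import pandas as pd

def filter_category_and_numerical(frame, category, category_value,
                                  numerical, numerical_value):
    # Hypothetical helper: keep rows that match the category value
    # AND whose numerical column is greater than the given value.
    # Each condition needs its own parentheses when combined with &.
    mask = (frame[category] == category_value) & (frame[numerical] > numerical_value)
    return frame.loc[mask]

# Toy data, invented for illustration (not from the FIFA 19 file)
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C'],
    'Nationality': ['Argentina', 'Argentina', 'Brazil'],
    'Age': [31, 24, 33],
})

combined = filter_category_and_numerical(toy, 'Nationality', 'Argentina', 'Age', 30)
print(combined)
```

From here, the comparison could be made configurable in the same way as in the numerical filter.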

CONCLUSIONS

To summarize, in this post we discussed how to define three functions that allow us to easily filter data rows. First, we discussed how to filter rows by a single categorical value. We then discussed how to filter rows using a list of categorical values. Finally, we showed how to filter rows by numerical column values. If you are interested in learning more about data manipulation with pandas, machine learning, or even just some of the basics of Python programming, check out _Python for Data Science and Machine Learning: Python Programming, Pandas and Scikit-learn Tutorials for Beginners_. I hope you found this post useful and interesting. The code from this post is available on GitHub. Thank you for reading!

