The world’s leading publication for data science, AI, and ML professionals.

Mastering Data Selection with Pandas

Selecting Data with Pandas

Source
Source

Pandas is a Python library that provides several useful tools for statistical analysis, feature engineering, data selection, data filtering and much more. Data selection and data filtering are useful for exploratory data analysis, feature engineering, and predictive model building. In this post, we will discuss some useful methods in Pandas for data selection and filtering.

Let’s get started!

For our purposes, we will be working with the Video Game Sales Dataset.

To start, let’s import the Pandas library and read the data into a Pandas data frame:

import pandas as pd 
df = pd.read_csv("vgsales.csv")

Let’s also relax the limit on the number of display columns:

pd.set_option('display.max_columns', None)

Next, let’s print the columns in the data:

print(list(df.columns))

Let’s also print some basic information about the data:

print(df.info())

Now, let’s print the first five rows of data using the ‘head()’ method:

print(df.head())

Ok, now that we have a decent idea of the column names and types in the data we can perform some basic data selection operations. To start, we can use the ‘.loc[]’ method to select subsets of our data based on column values. Let’s select video games in the ‘sports’ genre:

df_sports = df.loc[df.Genre == 'Sports']

Let’s print the first five rows of our new data frame:

print(df_sports.head())

Let’s take a look at another column. Let’s select the subset of data corresponding video games for the Wii platform:

df_wii = df.loc[df.Platform == 'Wii']

Let’s print the first five rows of data:

print(df_wii.head())

We can also select subsets of subsets. For example, let’s select ‘Racing’ games played on ‘Wii’:

df_wii_racing = df_wii.loc[df.Genre == 'Racing']

We can also perform this operation with a single line of code using a joint condition:

df_wii_racing = df.loc[(df.Platform == 'Wii') & (df.Genre == 'Racing')]
print(df_wii_racing.head())

And we can add as many conditions as we like. Let’s select Wii racing games that sold more than 1 million units globally:

df_gt_1mil = df.loc[(df.Platform == 'Wii') & (df.Genre == 'Racing') & (df.Global_Sales >= 1.0)]
print(df_gt_1mil.head())

We can also select data by row using the ‘.iloc[]’ method. Let’s select the first 1000 rows of the original data:

df_filter_rows = df.iloc[:1000]
print("Length of original: ", len(df))
print("Length of filtered: ", len(df_filter_rows))

We can also select a random sample of data, using the ‘sample()’ method:

df_random_sample = df.sample(n=5000)
print("Length of sample: ", len(df_random_sample))

If we print the first five rows:

print(df_random_sample.head())

And print the first five rows in another run:

We see we get different results. If we want to consistently get the same result, we can set the ‘random_state’ parameter in the sample method:

df_random_sample = df.sample(n=5000, random_state = 42)
print("Length of sample: ", len(df_random_sample))

This should give us the same result upon each run. I’ll stop here but I encourage you to play around with the code and data yourself.

CONCLUSIONS

To summarize, in this post we discussed how to select and filter data using the Python Pandas library. We discussed how to use the ‘.loc[]’ method to select subsets of data based on column values. We also showed how to filter data frames by row using the ‘.iloc[]’ method. Finally, we discussed how to select a random sample of data from a data frame using the ‘sample()’ method. I hope you found this post interesting/useful. The code in this post is available on GitHub. Thank you for reading!


Related Articles