The world’s leading publication for data science, AI, and ML professionals.

Top 3 Lesser-Known Pandas Function

Pandas functions you need to know as a modern data scientist.

s

When it comes to Data Science libraries in Python, Pandas is probably the most useful one. It’s been around for quite some time and includes a lot of time-saving functions. This article covers three lesser-known Pandas functions in a practical manner.

Photo by XPS on Unsplash
Photo by XPS on Unsplash

Don’t feel like reading? Check out my video on the topic:

You’ll need a Housing prices dataset to follow along, so please download it from this link. Let’s load it in with Pandas:

import pandas as pd 
df = pd.read_csv('HousingPrices.csv') 
df.head()

Here’s how the dataset looks like:

Image 1 - Housing prices dataset head (image by author)
Image 1 – Housing prices dataset head (image by author)

Without much ado, let’s jump into function number one.


select_dtypes()

The select_dtypes function is used to select only the columns of a specific data type. Just imagine you want to do some work on strings – you can use the mentioned function to make a subset of non-numeric columns and perform the operations from there.

Our dataset doesn’t contain string columns, as visible from the image below:

Image 2 - Datatypes per column (image by author)
Image 2 – Datatypes per column (image by author)

We’ll use the select_dtypes function to keep only integer columns. According to the previous image, only the TransactionDate and NumStores columns should remain:

df.select_dtypes('int64').head()

And here are the results:

Image 3 - Integer columns (image by author)
Image 3 – Integer columns (image by author)

Let’s proceed with the next function.


memory_usage()

The memory_usage function does just what the name suggests – reports back the memory usage. By default, the reporting is done per column and in bytes, but we’ll see how to change it in a bit.

The function is called on the entire dataset:

df.memory_usage()

The call results in the following series for our dataset:

Image 4 - Memory usage in bytes per column (image by author)
Image 4 – Memory usage in bytes per column (image by author)

If you want the total number instead, you can call the sum() function on the result set:

df.memory_usage().sum()

Which would result in 23312 bytes. Since bytes aren’t as interpretable as megabytes, let’s make a conversion next:

df.memory_usage().sum() / 1e+6

The above code outputs 0.023312 when executed, representing how many megabytes of memory the housing dataset takes.

You can take this function a step further with logging to enable memory usage monitoring in production applications, but that’s a bit out of this article’s scope.

Let’s go over the last function.


query()

The query function is used to filter the dataset and to grab records of interest. The filtering format is what makes it special. There’s no need to manage tens of brackets if you are filtering by multiple conditions, as this function enables you to pass filter condition as a string.

Let’s take a look at a simpler example first – we are interested in houses that are between 20 and 30 years old:

df.query('20 < HouseAge <= 30').head()

Here are the results:

Image 5 - Houses between 20 and 30 years old (image by author)
Image 5 – Houses between 20 and 30 years old (image by author)

But what if we want to test for multiple conditions? Just put the and or or keywords in between:

df.query('20 < HouseAge <= 30 and NumStores == 6').head()

This is the resulting subset:

Image 6 - Houses between 20 and 30 years old with 6 stores nearby (image by author)
Image 6 – Houses between 20 and 30 years old with 6 stores nearby (image by author)

And that’s all you need to know. Let’s wrap things up in the next section.


Parting words

Pandas is among the top data science libraries in Python for a reason – there’s nothing you can’t do simply and intuitively. Things can get messy with complex filtering, but the newly learned query functions should make things easier.

What are your favorite lesser-known Pandas functions? Please let me know in the comment section below.

Thanks for reading.

Join my private email list for more helpful insights.


Loved the article? Become a Medium member to continue learning without limits. I’ll receive a portion of your membership fee if you use the following link, with no extra cost to you.

Join Medium with my referral link – Dario Radečić


Originally published at https://www.betterdatascience.com on November 16, 2020.


Related Articles