When it comes to Data Science libraries in Python, Pandas is probably the most useful one. It’s been around for quite some time and includes a lot of time-saving functions. This article covers three lesser-known Pandas functions in a practical manner.

You’ll need a Housing prices dataset to follow along, so please download it from this link. Let’s load it in with Pandas:
import pandas as pd

# Load the housing prices dataset and preview the first five rows
df = pd.read_csv('HousingPrices.csv')
df.head()
Here’s what the dataset looks like:

Without much ado, let’s jump into function number one.
select_dtypes()
The select_dtypes function is used to select only the columns of a specific data type. Imagine you want to do some work on strings – you can use this function to make a subset of the non-numeric columns and perform the operations from there.
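As a quick sketch of that workflow, you could grab the non-numeric columns like this (either form works, so treat it as illustrative rather than the one true way):
# Keep everything that isn't numeric
non_numeric = df.select_dtypes(exclude='number')

# Or explicitly ask for string/object columns
non_numeric = df.select_dtypes(include='object')
non_numeric.head()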
Our dataset doesn’t contain string columns, as visible from the image below:

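You can also verify this directly from code by listing the data type of each column:
# Show the dtype of every column in the dataset
df.dtypes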
We’ll use the select_dtypes function to keep only integer columns. According to the previous image, only the TransactionDate and NumStores columns should remain:
df.select_dtypes('int64').head()
And here are the results:

Let’s proceed with the next function.
memory_usage()
The memory_usage function does just what the name suggests – it reports back the memory usage. By default, the reporting is done per column and in bytes, but we’ll see how to change that in a bit.
The function is called on the entire dataset:
df.memory_usage()
The call results in the following series for our dataset:

If you want the total number instead, you can call the sum() function on the result set:
df.memory_usage().sum()
Which would result in 23312 bytes. Since bytes aren’t as interpretable as megabytes, let’s make a conversion next:
df.memory_usage().sum() / 1e+6
The above code outputs 0.023312 when executed, representing how many megabytes of memory the housing dataset takes.
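One caveat worth a mention: for columns that hold Python objects (strings, for example), memory_usage only counts the references by default. Passing deep=True tells Pandas to measure the actual contents, which can be noticeably larger. It makes no difference for our all-numeric dataset, but it’s good to know:
# Measure the true footprint of object columns as well (no effect on numeric data)
df.memory_usage(deep=True).sum() / 1e+6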
You can take this function a step further with logging to enable memory usage monitoring in production applications, but that’s a bit out of this article’s scope.
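Still, just to sketch the idea (the logger name and message format here are assumptions, not an official recipe):
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('memory-monitor')

# Log the dataset's memory footprint in megabytes
mem_mb = df.memory_usage(deep=True).sum() / 1e+6
logger.info('Housing dataset uses %.3f MB of memory', mem_mb)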
Let’s go over the last function.
query()
The query function is used to filter the dataset and grab records of interest. The filtering format is what makes it special. There’s no need to manage tens of brackets when filtering by multiple conditions, as this function lets you pass the filter condition as a string.
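To see the kind of bracket juggling this saves you from, here is what a two-condition filter looks like with plain boolean indexing (the conditions themselves are just for illustration):
# Plain boolean indexing – every condition needs its own parentheses
df[(df['HouseAge'] > 20) & (df['HouseAge'] <= 30) & (df['NumStores'] == 6)].head()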
Let’s take a look at a simpler example first – we are interested in houses that are between 20 and 30 years old:
df.query('20 < HouseAge <= 30').head()
Here are the results:

But what if we want to test for multiple conditions? Just connect them with the and or or keywords:
df.query('20 < HouseAge <= 30 and NumStores == 6').head()
This is the resulting subset:

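One more handy detail: if the filter boundaries live in regular Python variables, query can reference them with the @ prefix. A minimal sketch:
# Thresholds stored in ordinary Python variables (names are arbitrary)
min_age, max_age = 20, 30

# Reference local variables inside the query string with @
df.query('@min_age < HouseAge <= @max_age').head()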
And that’s all you need to know. Let’s wrap things up in the next section.
Parting words
Pandas is among the top data science libraries in Python for a reason – there’s nothing you can’t do simply and intuitively. Things can get messy with complex filtering, but the newly learned query function should make things easier.
What are your favorite lesser-known Pandas functions? Please let me know in the comment section below.
Thanks for reading.
Join my private email list for more helpful insights.
Loved the article? Become a Medium member to continue learning without limits. I’ll receive a portion of your membership fee if you use the following link, with no extra cost to you.
Originally published at https://www.betterdatascience.com on November 16, 2020.