The world’s leading publication for data science, AI, and ML professionals.

3 Not-So-common Pandas Tricks You Should Know

Making the most out of Pandas.

Photo by Joshua Chun on Unsplash
Photo by Joshua Chun on Unsplash

If you are reading this article, you must have heard of or used Pandas. Thus, I’m skipping the part where I talk about how great Pandas is. 😊

In this article, we will go over examples that demonstrate some of the not-so-commonly used Pandas features. I haven’t come across them very often in my Pandas journey of 3 years.

The best way to learn Pandas, or any other software tool, is practicing. That’s what we will do. The first step is to create a sample DataFrame to work on.

import numpy as np
import pandas as pd
df = pd.DataFrame({

   "date": pd.date_range(start="2021-11-20", periods=100, freq="D"),
   "class": ["A","B","C","D"] * 25,
   "amount": np.random.randint(10, 100, size=100)
})
df.head()
The first 5 rows of df (image by author)
The first 5 rows of df (image by author)

We have a DataFrame with 3 columns and 100 rows. The date column contains 100 consecutive dates, the class column contains 4 distinct values stored with the object data type, and the amount column contains random integers between 10 and 100.


1. To_period

We use dates with many different intervals or periods such as day, week, month, quarter, and so on. Which one to use depends on the task at hand. For instance, in a revenue dashboard, we might want to show monthly or quarterly revenues.

In Pandas, the functions for manipulating dates are available under the dt accessor. The to_period function allows for converting a date to a specific interval. The periods I frequently use are month and quarter.

The month method only returns the month which is not useful in many cases such as reporting historical revenue. We need to be able distinguish December 2020 and December 2021. This can be achieved by using the "M" period with the to_period function. Similarly, we can extract the quarter info with "Q".

Let’s create new columns for year-month and quarter.

df["month"] = df["date"].dt.to_period("M")
df["quarter"] = df["date"].dt.to_period("Q")
df.head()
The first 5 rows of df (image by author)
The first 5 rows of df (image by author)

Let’s also check the different year-month and quarter values in our DataFrame.

df["month"].value_counts()
# output
2021-12    31
2022-01    31
2022-02    27
2021-11    11
Freq: M, Name: month, dtype: int64
--------------------------
df["quarter"].value_counts()
# output
2022Q1    58
2021Q4    42
Freq: Q-DEC, Name: quarter, dtype: int64

2. Cumsum and groupby

The cumsum is a highly useful Pandas function. It calculates the cumulative sum of the values in a column. Here is how we normally use it:

df["cumulative_sum"] = df["amount"].cumsum()
df.head()
The first 5 rows of df (image by author)
The first 5 rows of df (image by author)

We now have the cumulative sum of the values in the amount column. However, it does not take the classes into consideration. In some cases, we may need to calculate the cumulative sum separately for different categories.

Thankfully, Pandas makes this a very simple task. We just need to group the rows by the class column and then apply the cumsum function.

df["class_cum_sum"] = df.groupby("class")["amount"].cumsum()

Let’s confirm the results on class A.

df[df["class"]=="A"].head()
(image by author)
(image by author)

The class cumulative sum column contains the cumulative sum values calculated separately for each class.


3. Category data type

We often need to deal with categorical data which takes on a limited, and usually fixed, number of possible values. In our DataFrame, the class column is a categorical variable with 4 distinct values: A, B, C, D.

By default, the data type of this column becomes "object".

df.dtypes
# output
date              datetime64[ns]
class                     object
amount                     int64
month                  period[M]
quarter            period[Q-DEC]
cumulative_sum             int64
class_cum_sum              int64

Pandas also has a "category" data type which consumes much less memory than the object data type. Thus, it is better to use the category data type whenever possible.

Let’s replicate the class column but with the "category" data type.

df["class_category"] = df["class"].astype("category")
df.dtypes
# output
date              datetime64[ns]
class                     object
amount                     int64
month                  period[M]
quarter            period[Q-DEC]
cumulative_sum             int64
class_cum_sum              int64
class_category          category
dtype: object

We can now compare the memory consumption of the class and class_category columns.

df.memory_usage()
# output
Index             128
date              800
class             800
amount            800
month             800
quarter           800
cumulative_sum    800
class_cum_sum     800
class_category    304
dtype: int64

The class_category column consumes less than half of the memory consumption of the class column. The difference is 496 bytes which is not much. However, when we work with large datasets, the difference will definitely matter.


You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to subscribe if you’d like to get an email whenever I publish a new article.

Thank you for reading. Please let me know if you have any feedback.


Related Articles