
If you are reading this article, you must have heard of or used Pandas. Thus, I’m skipping the part where I talk about how great Pandas is. 😊
In this article, we will go over examples that demonstrate some of the not-so-commonly used Pandas features. I haven’t come across them very often in my Pandas journey of 3 years.
The best way to learn Pandas, or any other software tool, is practicing. That’s what we will do. The first step is to create a sample DataFrame to work on.
import numpy as np
import pandas as pd
df = pd.DataFrame({
"date": pd.date_range(start="2021-11-20", periods=100, freq="D"),
"class": ["A","B","C","D"] * 25,
"amount": np.random.randint(10, 100, size=100)
})
df.head()

We have a DataFrame with 3 columns and 100 rows. The date column contains 100 consecutive dates, the class column contains 4 distinct values stored with the object data type, and the amount column contains random integers between 10 and 100.
1. To_period
We use dates with many different intervals or periods such as day, week, month, quarter, and so on. Which one to use depends on the task at hand. For instance, in a revenue dashboard, we might want to show monthly or quarterly revenues.
In Pandas, the functions for manipulating dates are available under the dt accessor. The to_period function allows for converting a date to a specific interval. The periods I frequently use are month and quarter.
The month method only returns the month which is not useful in many cases such as reporting historical revenue. We need to be able distinguish December 2020 and December 2021. This can be achieved by using the "M" period with the to_period function. Similarly, we can extract the quarter info with "Q".
Let’s create new columns for year-month and quarter.
df["month"] = df["date"].dt.to_period("M")
df["quarter"] = df["date"].dt.to_period("Q")
df.head()

Let’s also check the different year-month and quarter values in our DataFrame.
df["month"].value_counts()
# output
2021-12 31
2022-01 31
2022-02 27
2021-11 11
Freq: M, Name: month, dtype: int64
--------------------------
df["quarter"].value_counts()
# output
2022Q1 58
2021Q4 42
Freq: Q-DEC, Name: quarter, dtype: int64
2. Cumsum and groupby
The cumsum is a highly useful Pandas function. It calculates the cumulative sum of the values in a column. Here is how we normally use it:
df["cumulative_sum"] = df["amount"].cumsum()
df.head()

We now have the cumulative sum of the values in the amount column. However, it does not take the classes into consideration. In some cases, we may need to calculate the cumulative sum separately for different categories.
Thankfully, Pandas makes this a very simple task. We just need to group the rows by the class column and then apply the cumsum function.
df["class_cum_sum"] = df.groupby("class")["amount"].cumsum()
Let’s confirm the results on class A.
df[df["class"]=="A"].head()

The class cumulative sum column contains the cumulative sum values calculated separately for each class.
3. Category data type
We often need to deal with categorical data which takes on a limited, and usually fixed, number of possible values. In our DataFrame, the class column is a categorical variable with 4 distinct values: A, B, C, D.
By default, the data type of this column becomes "object".
df.dtypes
# output
date datetime64[ns]
class object
amount int64
month period[M]
quarter period[Q-DEC]
cumulative_sum int64
class_cum_sum int64
Pandas also has a "category" data type which consumes much less memory than the object data type. Thus, it is better to use the category data type whenever possible.
Let’s replicate the class column but with the "category" data type.
df["class_category"] = df["class"].astype("category")
df.dtypes
# output
date datetime64[ns]
class object
amount int64
month period[M]
quarter period[Q-DEC]
cumulative_sum int64
class_cum_sum int64
class_category category
dtype: object
We can now compare the memory consumption of the class and class_category columns.
df.memory_usage()
# output
Index 128
date 800
class 800
amount 800
month 800
quarter 800
cumulative_sum 800
class_cum_sum 800
class_category 304
dtype: int64
The class_category column consumes less than half of the memory consumption of the class column. The difference is 496 bytes which is not much. However, when we work with large datasets, the difference will definitely matter.
You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don’t forget to subscribe if you’d like to get an email whenever I publish a new article.
Thank you for reading. Please let me know if you have any feedback.