
3 Python Pandas Tricks for Efficient Data Analysis

Explained with examples.

Photo by Nick Fewings on Unsplash

Pandas is one of the predominant data analysis tools and is highly appreciated among data scientists. It provides numerous flexible and versatile functions for efficient data analysis.

In this article, we will go over 3 pandas tricks that I think will make you a happier pandas user. It is easier to explain these tricks with examples, so we start by creating a data frame to work on.

The data frame contains daily sales quantities of 3 different stores. We first create a period of 10 days using the date_range function of pandas.

import numpy as np
import pandas as pd
days = pd.date_range("2020-01-01", periods=10, freq="D")

The days variable will be used as a column. We also need a sales quantity column which can be generated by the randint function of numpy. Then, we create a data frame with 3 columns for each store.

A = pd.DataFrame({"date": days,
                  "store": "A",
                  "sales": np.random.randint(100, 200, size=10)})
B = pd.DataFrame({"date": days,
                  "store": "B",
                  "sales": np.random.randint(100, 200, size=10)})
C = pd.DataFrame({"date": days,
                  "store": "C",
                  "sales": np.random.randint(100, 200, size=10)})

We now combine these 3 data frames with the concat function. It is better to also sort the rows based on date since we are interested in the daily sales quantities.

df = pd.concat([A, B, C]).sort_values(by="date")
df.head()
(image by author)

  1. Rank function

The first trick is about the rank function. Let’s suppose we need to find the store that has the highest sales quantity for each day. This task can be done using the groupby and max functions.
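As a sketch of that approach (rebuilding the example data frame locally; `daily_max` and `top_rows` are names I made up for illustration), `max` gives the top quantity per day, while `idxmax` additionally identifies which store achieved it:

```python
import numpy as np
import pandas as pd

# rebuild the example data frame; ignore_index so row labels are unique
days = pd.date_range("2020-01-01", periods=10, freq="D")
frames = [
    pd.DataFrame({"date": days, "store": s,
                  "sales": np.random.randint(100, 200, size=10)})
    for s in ["A", "B", "C"]
]
df = pd.concat(frames, ignore_index=True).sort_values(by="date")

# the maximum daily sales quantity per date
daily_max = df.groupby("date")["sales"].max()

# idxmax returns the row label of each group's maximum,
# so loc recovers the store that sold the most each day
top_rows = df.loc[df.groupby("date")["sales"].idxmax(),
                  ["date", "store", "sales"]]
```

Note that `idxmax` relies on unique row labels, which is why the index is reset during the concatenation here.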

What if we need to sort the stores based on their sales quantities for each day? Please take a look at the screenshot above. For the date 2020-01-01, A is the first, C is the second, and B is the third store.

We can use the rank function to apply such an order for each date.

df["rank"] = (df.groupby("date")["sales"]
                .rank(ascending=False).astype("int"))
df.head(6)
(image by author)

We first group the data points (i.e. rows) by date and then apply the rank function. The generated rank is assigned to a new column. The rank function assigns ranks in ascending order by default, but this can be changed with the ascending parameter.
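For our example data the ranks happen to be distinct, but ties are possible in general. In a small illustrative sketch (not taken from the article's data), rank averages tied ranks by default, and the method parameter changes that behavior:

```python
import pandas as pd

s = pd.Series([100, 150, 150, 200])

# default (method="average"): tied values share the average of their ranks
avg_ranks = s.rank().tolist()                 # the two 150s both get 2.5

# method="dense" assigns consecutive integer ranks instead
dense_ranks = s.rank(method="dense").tolist()
```

This matters when casting ranks to integers as we did above: a tied rank like 2.5 would silently be truncated.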

The rank information can be used to make a comparison between the stores. Let’s find out the rank distribution of each store.

df.groupby(["store","rank"]).count()[["sales"]]
(image by author)

Store A has the highest number of first ranks. Store B is usually the second store, and C's ranks are fairly uniformly distributed.


  2. Naming the aggregation

When we found the rank distribution in the previous example, we selected a column after the aggregation was done. The reason is that the count aggregation is applied to all the columns that are not used in the grouping.

Here is how the result looks without selecting a column.

df.groupby(["store","rank"]).count()
(image by author)

If we do an aggregation with mean, it is applied to all the numerical columns that are not used in the grouping.

Another downside of this approach is that the aggregated column name is not informative of what it represents. Without seeing the code, we cannot tell what the sales column actually contains. We can always rename the columns, but there is a more practical way.
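A rename-based version might look like the sketch below (rebuilding the example data; rank_count is an arbitrary name I chose). It works, but it needs both a column selection and a rename step:

```python
import numpy as np
import pandas as pd

# rebuild the example data frame with the rank column
days = pd.date_range("2020-01-01", periods=10, freq="D")
frames = [
    pd.DataFrame({"date": days, "store": s,
                  "sales": np.random.randint(100, 200, size=10)})
    for s in ["A", "B", "C"]
]
df = pd.concat(frames, ignore_index=True).sort_values(by="date")
df["rank"] = (df.groupby("date")["sales"]
                .rank(ascending=False).astype("int"))

# select the counted column, then rename it: two extra steps
counts = (df.groupby(["store", "rank"])
            .count()[["sales"]]
            .rename(columns={"sales": "rank_count"}))
```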

A solution to both these issues is using a named aggregation with the agg function. We need to specify the name of the column and the aggregation function as below:

df.groupby(["store","rank"]).agg(rank_count = ("rank", "count"))
(image by author)

We only see the aggregated column names, so we do not have to select a column after the aggregation. It also lets us assign any name to the aggregated column.
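Named aggregation also scales to several output columns at once. Here is a sketch on the same kind of data frame (the output names total_sales, avg_sales, and best_rank are my own, not part of the pandas API):

```python
import numpy as np
import pandas as pd

# rebuild the example data frame with the rank column
days = pd.date_range("2020-01-01", periods=10, freq="D")
frames = [
    pd.DataFrame({"date": days, "store": s,
                  "sales": np.random.randint(100, 200, size=10)})
    for s in ["A", "B", "C"]
]
df = pd.concat(frames, ignore_index=True).sort_values(by="date")
df["rank"] = (df.groupby("date")["sales"]
                .rank(ascending=False).astype("int"))

# several named aggregations in a single agg call
summary = df.groupby("store").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
    best_rank=("rank", "min"),
)
```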


  3. Ignore the index

Recall the first screenshot.

(image by author)

We combined 3 data frames, but the indices were not updated. Each row still carries the index of its individual data frame.

The index is an important part of a data frame so it needs to be accurate. One option is to reset the index after the concatenation.

df = (pd.concat([A, B, C])
        .sort_values(by="date").reset_index(drop=True))
df.head()
(image by author)

We now have a more accurate index. If we do not set the drop parameter of the reset_index function to True, the prior index is kept as a column in the data frame.
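A tiny sketch (with made-up data frames part1 and part2) shows what happens without drop=True:

```python
import pandas as pd

part1 = pd.DataFrame({"x": [1, 2]})
part2 = pd.DataFrame({"x": [3, 4]})

# the concatenated frame keeps the labels 0, 1, 0, 1
combined = pd.concat([part1, part2])

# without drop=True, the old index survives as a new "index" column
kept = combined.reset_index()
```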

The trick here is the ignore_index parameter. It saves us from having to use the reset_index function. Both the concat and sort_values functions have this parameter. When set to True, it ignores the indices of the individual data frames and assigns a new integer index.

df = pd.concat([A, B, C]).sort_values(by="date", ignore_index=True)
df.head()
(image by author)

It is not a huge gain, but we achieve the same result with one fewer function call.


Conclusion

We have covered 3 pandas tricks that seem simple but come in handy in many cases. They also simplify the syntax and the operations.

Pandas is a great library with an easy-to-understand and intuitive syntax. Thus, you can easily learn the basic operations with some practice. After the basics, I suggest focusing on some tricks and detailed techniques for efficiency.

Thank you for reading. Please let me know if you have any feedback.

