
4 Tricks for Making Python Pandas More Efficient

Making the most out of Pandas

Photo by Pawel Czerwinski on Unsplash

Pandas is arguably the most popular data analysis and manipulation library in the Data Science ecosystem. The user-friendly and intuitive Python syntax is a significant factor in the popularity of Pandas. However, it is not the only reason why Pandas has been adopted by the vast majority of data scientists.

Pandas provides numerous functions and methods that expedite the data analysis and manipulation operations. In this article, we will go over 4 tricks for using these functions even more efficiently.

Let’s start with creating a data frame. We will use the Melbourne housing dataset available on Kaggle.

import pandas as pd
df = pd.read_csv("/content/melb_housing.csv")
df.columns
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG', 'Date', 'Postcode', 'Regionname', 'Propertycount', 'Distance', 'CouncilArea'], dtype='object')

The dataset contains 13 features (i.e. columns) about the houses in Melbourne.


First trick

The first trick is about the read_csv function. It might be the most frequently used Pandas function, yet we almost always call it without any parameters. The read_csv function has many parameters that make it even handier.

Pandas offers several methods under the dt accessor to manipulate dates. In order to use them, the dates must be of the datetime data type. We can always convert the data types afterward, but there is a more practical way: the parse_dates parameter can handle this task at read time.

Another useful parameter is usecols. If you need to work with only some of the columns in a dataset, you can pass the name of the columns to the usecols parameter. It saves both memory and computation power especially when working with large datasets.

Let's re-read the dataset with these parameters.

df = pd.read_csv(
  "/content/melb_housing.csv", 
  parse_dates=["Date"],
  usecols=["Address", "Rooms", "Date","Type", "Price",   
           "Regionname"]
)
df.head()
(image by author)
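Once the "Date" column is parsed as datetime, the dt accessor mentioned above becomes available. Here is a minimal sketch of what that enables, using a tiny synthetic frame rather than the Melbourne dataset (the column names "SaleYear" and "SaleMonth" are made up for illustration):

```python
import pandas as pd

# A tiny synthetic frame standing in for the housing data
df = pd.DataFrame({
    "Date": pd.to_datetime(["2016-03-12", "2017-06-04", "2017-11-25"]),
    "Price": [1_030_000, 850_000, 1_250_000],
})

# Because "Date" is datetime64, the dt accessor works directly
df["SaleYear"] = df["Date"].dt.year
df["SaleMonth"] = df["Date"].dt.month_name()
```

Without parse_dates (or an explicit conversion with pd.to_datetime), "Date" would be a plain object column and the dt accessor would raise an error.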

Second trick

Another frequently used Pandas function is value_counts. It counts the number of occurrences of the distinct values in a column. Thus, we get an overview of the distribution of a feature (i.e. column).

df.Regionname.value_counts()
(image by author)

We see the number of houses in each region. In some cases, it is more informative to normalize these values so that we have a better idea of the distribution of categories.

One option is to divide the number of occurrences by the total number of values. The normalize parameter of the value_counts function accomplishes this task very easily.

df.Regionname.value_counts(normalize=True)
(image by author)
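To make the behavior of the normalize parameter concrete, here is a small sketch on synthetic region labels showing that it is equivalent to dividing the raw counts by the total number of values:

```python
import pandas as pd

# Synthetic region labels (not the actual dataset)
regions = pd.Series(["South", "North", "South", "West", "South"])

raw = regions.value_counts()                 # absolute counts
ratio = regions.value_counts(normalize=True)  # relative frequencies

# normalize=True divides each count by the total number of values
assert (ratio == raw / len(regions)).all()
```

With 3 out of 5 values being "South", its normalized count is 0.6.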

Almost 28% of the houses are in the Southern Metropolitan region.


Third trick

The third trick is also about the value_counts function. It is usually used with categorical variables. However, we can also make use of it on continuous variables such as price.

It is not practical to count the occurrences of each individual price. Instead, we can create bins and check the number of values in each one. The bins parameter of the value_counts function divides the entire range into the specified number of bins. We can then see the distribution of values in terms of these bins.

df.Price.value_counts(normalize=True, bins=10)
(image by author)

56% of the houses are in the first bin which spans the range between 73k and 1.2 million.
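A small self-contained sketch (on synthetic prices, not the actual dataset) shows how the bins parameter behaves: the index becomes an IntervalIndex whose intervals span the full range of the data.

```python
import pandas as pd

# Synthetic prices for illustration
prices = pd.Series([100, 200, 300, 400, 1000])

# Split the 100-1000 range into 2 equal-width bins
binned = prices.value_counts(bins=2, normalize=True)

# Four of the five values fall into the lower bin,
# so its normalized count is 0.8
```

As with categorical value_counts, the result is sorted by count in descending order, so the most populated bin comes first.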


Fourth trick

The fourth one is about filtering a data frame. We do lots of filtering in a typical exploratory data analysis process. Or, we might simply be interested in observations (i.e. rows) that fit a set of conditions.

Consider a case where we need to filter the houses that have more than 3 rooms and cost more than 500k. We can filter the data frame based on these conditions as follows:

df_filtered = df[(df.Price > 500000) & (df.Rooms > 3)]
df_filtered.head()
(image by author)

We need to write the name of the data frame along with each column name. In the case of multiple filters, we need to put each condition in parentheses and separate them with the "&" operator.

The query function offers a simpler syntax. If you are familiar with SQL, it will definitely seem more appealing to you. We can pass the conditions as a string as follows:

df_filtered = df.query("Price > 500000 and Rooms > 3")
df_filtered.head()
(image by author)

We still need to write the conditions but the syntax is easier to type.
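One convenience of query worth knowing: you can reference Python variables inside the condition string with the "@" prefix. A minimal sketch on a synthetic frame (the variable name min_price is made up for illustration):

```python
import pandas as pd

# Synthetic frame standing in for the housing data
df = pd.DataFrame({
    "Price": [400_000, 600_000, 700_000],
    "Rooms": [2, 4, 5],
})

min_price = 500_000

# "@" lets the query string refer to the Python variable min_price
subset = df.query("Price > @min_price and Rooms > 3")
```

This keeps the condition readable while avoiding hard-coded thresholds in the query string.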


Conclusion

We have covered 4 simple tricks. They are not magic, but they will help you get more out of Pandas. In general, we tend to overlook the parameters of functions. I definitely suggest using them as much as possible. They are there for a reason.

Thank you for reading. Please let me know if you have any feedback.


Related Articles