If you are in the field of data science, you must have heard of Pandas. It is the most commonly used data analysis and manipulation library.
The functions and methods of Pandas make it very simple and efficient to work with tabular data.
In this post, we will cover 4 small tips that further expedite the data analysis process. Together with the basic operations, these small tips can help you use Pandas like a Jedi.
I will go through a few examples for each tip so that it is well understood. I will use the grocery and direct marketing datasets available on Kaggle for the examples.
Let’s start by importing Pandas along with NumPy and reading the datasets into dataframes.
import numpy as np
import pandas as pd
groceries = pd.read_csv("/content/Groceries_dataset.csv")
marketing = pd.read_csv("/content/DirectMarketing.csv")


The grocery dataset contains some shopping data. The columns show the member number, date of shopping, and purchased items.
The marketing dataset contains customer and sales information about a marketing campaign.
Tip 1: Inserting a column
We may need to create new columns when working on dataframes. By default, Pandas attaches new columns at the end. However, it might be more appropriate to insert the new column at a specific location (i.e. index).
The insert function is used to create new columns at a specific index.
Let’s assume we need to create a "year" column by extracting the year from the date. We can use the "dt" accessor to extract the year from a date, but the data type of the column needs to be datetime.
groceries.dtypes
Member_number int64
Date object
itemDescription object
dtype: object
The Date column is stored as object. We should first convert it to datetime and then extract the year.
groceries['Date'] = pd.to_datetime(groceries['Date'])
Year = groceries.Date.dt.year
We can now create a "Year" column. The first way that comes to mind is as follows:
groceries['Year'] = Year
groceries.head()

As you can see, it is attached at the end. However, it is better to have the year column next to the date column.
#drop the existing year column first
groceries.drop('Year', axis=1, inplace=True)
#insert the year column next to date
groceries.insert(2, 'Year', Year)
groceries.head()

The first parameter is the column index of the new column (starting from 0). The second parameter is the name of the column, and the last one is the array that contains the values.
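For comparison, here is a minimal sketch of an alternative that does not use insert: create the column at the end and then reorder the columns explicitly (assuming the column names from the dtypes output above, and starting from a dataframe without the Year column).
#create the year column at the end, then reorder the columns explicitly
groceries['Year'] = groceries['Date'].dt.year
groceries = groceries[['Member_number', 'Date', 'Year', 'itemDescription']]

The insert function does the same in one step and modifies the dataframe in place.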
Tip 2: Groupby NamedAgg
We can use single or multiple aggregations with the groupby function. It also allows applying different aggregations to different columns.
Consider that we need to calculate the average number of catalogs and the total amount spent for different age groups in the marketing dataframe.
One way to calculate this with the groupby function is:
marketing[['Age', 'Catalogs', 'AmountSpent']].groupby('Age').agg(
    {'Catalogs': 'mean', 'AmountSpent': 'sum'}
)

It is not clear what aggregations these columns represent. They could be mean, count, sum, and so on. We can always rename the columns after the calculation.
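For instance, a minimal sketch of that rename-afterwards approach:
#compute the aggregations first, then rename the columns
agg_df = marketing[['Age', 'Catalogs', 'AmountSpent']].groupby('Age').agg(
    {'Catalogs': 'mean', 'AmountSpent': 'sum'}
)
agg_df = agg_df.rename(columns={'Catalogs': 'Avg_catalog',
                                'AmountSpent': 'Total_spent'})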
However, there is an easier way. We will use the "NamedAgg" method inside the agg function, which lets us name the aggregated columns as we create them:
marketing[['Age', 'Catalogs', 'AmountSpent']].groupby('Age').agg(
    Avg_catalog=pd.NamedAgg('Catalogs', 'mean'),
    Total_spent=pd.NamedAgg('AmountSpent', 'sum')
)

It is more informative now.
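As a side note, the same named aggregation can also be written with plain (column, aggfunc) tuples, which pandas treats as equivalent to NamedAgg:
#tuple shorthand, equivalent to pd.NamedAgg
marketing[['Age', 'Catalogs', 'AmountSpent']].groupby('Age').agg(
    Avg_catalog=('Catalogs', 'mean'),
    Total_spent=('AmountSpent', 'sum')
)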
Tip 3: Parse dates
The "read_csv" function is one of the most commonly used pandas functions because the date is usually stored in csv files. I feel like we are not using this function efficiently.
What is usually done is to read the csv file into a dataframe and then do the analysis and manipulation. However, there are things we can accomplish while reading the csv file.
The read_csv function has a bunch of useful parameters. I will use one of them to encourage you to take a look at the other parameters.
In the first tip, we changed the data type of the date column because it was stored as object. We could have used the parse_dates parameter to change the data type while reading the data.
groceries = pd.read_csv("/content/Groceries_dataset.csv",
                        parse_dates=['Date'])
groceries.dtypes
Member_number int64
Date datetime64[ns]
itemDescription object
dtype: object
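To give a taste of those other parameters, here is a minimal sketch that combines parse_dates with usecols and dtype on the same file; take it as an illustration rather than a recipe.
#read only the columns we need, parse the date on the way in,
#and store the member number as a smaller integer type
groceries = pd.read_csv("/content/Groceries_dataset.csv",
                        usecols=['Member_number', 'Date'],
                        parse_dates=['Date'],
                        dtype={'Member_number': 'int32'})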
Tip 4: Where function
If you are familiar with SQL, you know that "where" is used for filtering. We can put a condition on the selected data using a where clause.
The where function of pandas works in a similar way: we can update the values based on a condition.
Let’s check the values in the catalogs column of the marketing dataframe.
marketing.Catalogs.value_counts()
12 282
6 252
24 233
18 233
Name: Catalogs, dtype: int64
Assume that we want to change 6 to 0 and keep the other values the same. The where function fits this task perfectly.
marketing['Catalogs'] = marketing['Catalogs'].where(marketing['Catalogs'] > 6, 0)
marketing.Catalogs.value_counts()
12 282
0 252
24 233
18 233
Name: Catalogs, dtype: int64
The values that meet the condition remain unchanged. The other ones are replaced with the specified value (0 in this case).
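A related note: the mask function is the inverse of where, replacing the values where the condition is True. A minimal sketch that would achieve the same result as above:
#mask replaces the values where the condition holds (the inverse of where)
#every 6 becomes 0, everything else is kept
marketing['Catalogs'] = marketing['Catalogs'].mask(marketing['Catalogs'] == 6, 0)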
Conclusion
What we have covered can be considered small details. However, sometimes these details make a difference. If you apply them in your analysis, you will see your efficiency increase.
There is much more that pandas offers to accelerate the data analysis and manipulation process. The best way to learn and adopt these techniques is through practice.
Thank you for reading. Please let me know if you have any feedback.