Practical Python Pandas Tricks – Part 3: Data Wrangling

This article is the 3rd part of a series of Pandas tricks. Please Stay Tune for more future articles on this topic.
Part 1: Import and Create DataFrame
Part 2: Data Preview and Subsetting
Introduction
In this article, I’m going to cover Pandas functions to clean and transform dataframe(s). I would say this is one of the most important steps in any Data Science project. Any statistical analysis and machine learning models can be as good as the quality of the data you feed into them.
Prerequisite
# Install Pandas library
!pip install pandas
# Import libraries
import pandas as pd
import numpy as np
Missing Values
Let’s first talk about handling missing values with Pandas. I’ll use the following dataframe as the data source.
data = {'emp_id': [1, 1, 2, 2],
'emp_name': ['Jane Dow', 'Jane Dow', 'Thomas Daley', 'Thomas Daley'],
'gender': [np.nan, 'M', 'F', 'F'],
'year': [2019, 2020, 2019, 2020],
'compensation': [53000, np.nan, 53000, '$56,000']}
df2 = pd.DataFrame(data)
df2
Out[107]:
emp_id emp_name gender year compensation
0 1 Jane Dow NaN 2019 53000
1 1 Jane Dow M 2020 NaN
2 2 Thomas Daley F 2019 53000
3 2 Thomas Daley F 2020 $56,000
isnull() & notnull(): isnull()
allows us to display rows with missing values based on a specified column, whereas notnull()
would return rows with non-missing values.
df2[df2['compensation'].isnull()]
Out[108]:
emp_id emp_name gender year compensation
1 1 Jane Dow M 2020 NaN
df2[df2['compensation'].notnull()]
Out[109]:
emp_id emp_name gender year compensation
0 1 Jane Dow NaN 2019 53000
2 2 Thomas Daley F 2019 53000
3 2 Thomas Daley F 2020 $56,000
fillna(): After we review the missing data, we can either remove them or fill in the missing cells with appropriate values. (You can check out my other article regarding Missing Data.)
In this dataframe, we see missing values on "gender" and "compensation" columns. We can fill in the missing gender with non-missing gender value per employee since gender doesn’t change over time. Also, we’re told by HR that the missing compensation implies the employee has left the company in that year. We should replace the missing compensation with 0.
fillna()
allows us to replace the missing value with any value. In this case, we replace missing compensation with the value of 0. We can use method = 'backfill'
and 'ffill'
with groupby()
to fill in the missing gender with previous or next valid observation for a given employee.
df2['compensation'].fillna(0, inplace = True)
df2['gender'] = df2.groupby('emp_id')['gender'].fillna(method = 'backfill')
df2['gender'] = df2.groupby('emp_id')['gender'].fillna(method = 'ffill')
print(df2)
emp_id emp_name gender year compensation
0 1 Jane Dow M 2019 53000
1 1 Jane Dow M 2020 0
2 2 Thomas Daley F 2019 53000
3 2 Thomas Daley F 2020 $56,000
String Operation
str.split(): str.split()
allows us to split text in a given column into multiple columns. In the following code, "emp_name" column is splitted into "first_name" and "last_name" columns using an option expand = True
. By default, text in a column is splitted by whitespace, but you can specify the separator using an option pat = '<str>'
.
df2[['first_name', 'last_name']] = df2['emp_name'].str.split(expand = True)
print(df2)

str.replace & re.sub(): Upon further inspection, we see that "compensation" column contains both integer and string values. The string value contains characters, such as "$" and ",".
df2.apply(lambda x: type(x['compensation']), axis = 1).value_counts()
Out[118]:
<class 'int'> 3
<class 'str'> 1
dtype: int64
df2[df2.apply(lambda x: type(x['compensation']), axis = 1)== str]

It is tempting to just use str.replace()
to remove "$" and ",". But when we do that, we notice that the integer values would be replaced with np.nan
since str.replace()
doesn’t work with integer values.
# Incorrect Method
df2['compensation'].str.replace('$|,', '')
Out[124]:
0 NaN
1 NaN
2 NaN
3 56000
Name: compensation, dtype: object
Trick 1: To correctly clean a column with a mix of string and numeric values, we need to first convert values in the column into string values using astype()
, then use either str.replace()
or re.sub()
to replace the relevant string values as needed. Lastly, we convert the column back to numeric values.
# Method 1: Using str.replace() with astype()
df2['compensation'].astype(str).str.replace('$|,', '').astype(int)
# Method 2: Using apply with re.sub()
df2['compensation'] = df2.apply(lambda x: re.sub('$|,', '', str(x['compensation'])), axis = 1).astype(int)
Window Functions
Window Functions are implemented on a set of rows called a window frame. A window frame is all the rows within the same group based on one or more columns. When a window function is implemented, a new column would be produced and the output would have the same number of rows as the original data set.
Create a New Column of Row number
df['new_column_name'] = df.groupby('{column name}').cumcount()+1
Create Count/Max/Min/Average/Sum Within a Group
In Pandas, we often use transform with a window function, such as, count, max, min, avg and sum.
df.groupby('gender')['salary'].transform('count')
df.groupby('gender')['salary'].transform('max')
df.groupby('gender')['salary'].transform('min')
df.groupby('gender')['salary'].transform('mean')
df.groupby('gender')['salary'].transform('sum')
Create Running Sum Within a Group
df.groupby('gender')['salary'].transform('cumsum')
Create Percentiles Within a Group
# create median
df.groupby('gender')['salary'].transform(lambda x: x.quantile(0.5))
Create Lag/Lead Within a Group
# create lag variable
df.groupby('gender')['salary'].transform(lambda x: x.shift(1))
# create lead variable
df.groupby('gender')['salary'].transform(lambda x: x.shift(-1))
Create Ranking Within a Group
df.groupby('gender')['salary'].rank('dense', ascending = False)
Aggregate Functions
Aggregate Functions are implemented in the same as window functions. But the result will be more compact. The number of observations in the final output would equal the number of distinct groups (i.e., unique values in group-by variables).
Collapse Rows With Count/Max/Min/Average/Sum Within a Group
In Pandas, there are many ways to implement an aggregate function. I include 3 different ways in the following code snippet.
- An aggregate function would run as the default using groupby
- Use apply to run a built-in aggregate function or a user-defined function with groupby
- Use agg to run a built-in aggregate function or a user-defined function with more flexibility, such as, naming new columns and creating more than one new column
df.groupby('gender')['salary'].mean().reset_index()
df.groupby('gender')['salary'].min().reset_index()
df.groupby('gender')['salary'].max().reset_index()
df.groupby('gender')['salary'].sum().reset_index()
df.groupby('gender').apply(lambda x: x['salary'].mean()).reset_index()
df.groupby('gender').agg(count = pd.NamedAgg('salary', 'mean')).reset_index()
Create Percentile Within a Group
df.groupby('gender')['salary'].quantile(0.9).reset_index()
You can check out more about Window Function & Aggregate Function in this article.
Trick 2: Apply with User-Defined Function – apply
function allows us to have more flexibility and control to create a user-defined function. In the following example, I’m going to use "supermarket_sales – Sheet1.csv" (Download Link) as the data source. We will create a new column of rating category based on the values on "Rating" column.
# Import data source
df = pd.read_csv('supermarket_sales - Sheet1.csv')
def rating_category(rating):
if rating >= 4 and rating <6:
return 'Low'
elif rating >= 6 and rating <8:
return 'Mid'
else:
return 'High'
df['rating_cat'] = df.apply(lambda x: rating_category(x['Rating']), axis = 1)
Trick 3: Show Progress Bar Using tqdm— When working with a big dataset, it will take a while to run apply
function. It would be nice if we know the job status and when we expect the task to complete. Python library tqdm
works seamlessly with Pandas. You just need to replace apply
with progress_apply
with everything else unchanged.
from tqdm import tqdm
tqdm.pandas()
df.progress_apply(lambda x: rating_category(x['Rating']), axis = 1)
100%|██████████| 1000/1000 [00:00<00:00, 104413.84it/s]
Trick 4: Speed up Using Swifter – Being able to fully utilize hardware power to speed up runtime is important for a data science task. Your analysis might not be helpful to decision-makers if they would like to see a result within a short period of time, but your program needs weeks or months to run.
Python library [swifter](https://medium.com/@jmcarpenter2/swiftapply-automatically-efficient-pandas-apply-operations-50e1058909f9)
can efficiently utilize hardware power to apply any function to Pandas Dataframe or Series object in the quickest possible way. The syntax is straightforward, you just need to add swifter
before apply
function.
import swifter
df.swifter.apply(lambda x: rating_category(x['Rating']), axis = 1)
Join Multiple DataFrames
Combining Rows
append() & concat(): We can use either append()
or concat()
to combine two dataframes in a row-wise fashion and produce the same output. We can use reset_index(drop=True)
to reset the index in the combined dataframe.
df3 = pd.DataFrame({'emp_id': 3, 'emp_name': 'Jason Zandi', 'gender': 'M', 'year': 2020, 'compensation': 60000}, index = [0])
df2.append(df3).reset_index(drop=True)
pd.concat([df2, df3]).reset_index(drop=True)
Out[182]:
emp_id emp_name gender year compensation
0 1 Jane Dow NaN 2019 53000
1 1 Jane Dow M 2020 NaN
2 2 Thomas Daley F 2019 53000
3 2 Thomas Daley F 2020 $56,000
4 3 Jason Zandi M 2020 60000
Combining Columns
join() & merge(): Both join()
or merge()
can combine two dataframes in a column-wise fashion. The difference is join()
is based on index, whereas, merge()
is based on the common column(s) from two dataframes. In the following example, when combining df2 and df4 using join() based on index, we have one row with matched index of [0]. When combining df2 and d5 using merge() based on the column, "emp_id", we are implementing a typical left join operation.
df4 = pd.DataFrame({'test': 300}, index = [0])
df4
Out[190]:
test
0 300
df5 = pd.DataFrame({'emp_id': [1, 2], 'title': ['Engineer', 'Analyst']})
df5
Out[192]:
emp_id title
0 1 Engineer
1 2 Analyst
df2.join(df4)
Out[188]:
emp_id emp_name gender year compensation test
0 1 Jane Dow NaN 2019 53000 300.0
1 1 Jane Dow M 2020 NaN NaN
2 2 Thomas Daley F 2019 53000 NaN
3 2 Thomas Daley F 2020 $56,000 NaN
df2.merge(df5, on = 'emp_id', how = 'right')
Out[189]:
emp_id emp_name gender year compensation title
0 1 Jane Dow NaN 2019 53000 Engineer
1 1 Jane Dow M 2020 NaN Engineer
2 2 Thomas Daley F 2019 53000 Analyst
3 2 Thomas Daley F 2020 $56,000 Analyst
Reformat DataFrame
Convert Dataframe from Long to Wide
Both pivot()
and pivot_table()
would produce a (wide) table that summarizes the data from a more extensive (long) table. The difference is pivot_table()
can also incorporate aggregate functions, such as sum, average, max, min and first. In the following example, we try to create separate compensation columns, "2019" and "2020" based on the "year" column.
df2
Out[200]:
emp_id emp_name gender year compensation
0 1 Jane Dow M 2019 53000
1 1 Jane Dow M 2020 0
2 2 Thomas Daley F 2019 53000
3 2 Thomas Daley F 2020 $56,000
df_wide = df2.pivot(index = ['emp_id', 'emp_name', 'gender'], columns = ['year'], values = 'compensation').reset_index()
print(df_wide)
year emp_id emp_name gender 2019 2020
0 1 Jane Dow M 53000 0
1 2 Thomas Daley F 53000 $56,000
df_wide2 = df2.pivot_table(index = ['emp_id', 'emp_name', 'gender'], columns = ['year'], values = 'compensation', aggfunc = 'first').reset_index()
print(df_wide2)
year emp_id emp_name gender 2019 2020
0 1 Jane Dow M 53000 0
1 2 Thomas Daley F 53000 $56,000
Convert Dataframe from Wide to Long
melt()
allows us to change dataframe format from wide to long. This is the opposite of creating a pivot table. In the following example, we want to consolidate compensation columns, "2019" and "2020" into a single column and create a new "year" column.
df_long = df_wide.melt(id_vars= ['emp_id', 'emp_name', 'gender'], var_name= 'year', value_vars= [2019, 2020], value_name="Compensation")
print(df_long)
emp_id emp_name gender year Compensation
0 1 Jane Dow M 2019 53000
1 2 Thomas Daley F 2019 53000
2 1 Jane Dow M 2020 0
3 2 Thomas Daley F 2020 $56,000
Thank you for reading !!!
If you enjoy this article and would like to Buy Me a Coffee, please click here.
You can sign up for a membership to unlock full access to my articles, and have unlimited access to everything on Medium. Please subscribe if you’d like to get an email notification whenever I post a new article.