
Introduction to Pandas – Part 3: Data Wrangling

Any statistical analysis or machine learning model can only be as good as the quality of the data you feed into it

Photo by kazuend on Unsplash

This article is the third part of a series on Pandas tricks. Please stay tuned for future articles on this topic.

Part 1: Import and Create DataFrame

Part 2: Data Preview and Subsetting

Part 3: Data Wrangling

Introduction

In this article, I’m going to cover Pandas functions to clean and transform dataframes. I would say this is one of the most important steps in any Data Science project: any statistical analysis or machine learning model can only be as good as the quality of the data you feed into it.

Prerequisite

# Install Pandas library
!pip install pandas
# Import libraries
import pandas as pd
import numpy as np

Missing Values

Let’s first talk about handling missing values with Pandas. I’ll use the following dataframe as the data source.

data = {'emp_id': [1, 1, 2, 2],
        'emp_name': ['Jane Dow', 'Jane Dow', 'Thomas Daley', 'Thomas Daley'],
        'gender': [np.nan, 'M', 'F', 'F'],
        'year': [2019, 2020, 2019, 2020],
        'compensation': [53000, np.nan, 53000, '$56,000']}
df2  = pd.DataFrame(data)
df2
Out[107]: 
   emp_id      emp_name gender  year compensation
0       1      Jane Dow    NaN  2019        53000
1       1      Jane Dow      M  2020          NaN
2       2  Thomas Daley      F  2019        53000
3       2  Thomas Daley      F  2020      $56,000

isnull() & notnull(): isnull() allows us to display rows with missing values based on a specified column, whereas notnull() would return rows with non-missing values.

df2[df2['compensation'].isnull()]
Out[108]: 
   emp_id  emp_name gender  year compensation
1       1  Jane Dow      M  2020          NaN
df2[df2['compensation'].notnull()]
Out[109]: 
   emp_id      emp_name gender  year compensation
0       1      Jane Dow    NaN  2019        53000
2       2  Thomas Daley      F  2019        53000
3       2  Thomas Daley      F  2020      $56,000

fillna(): After we review the missing data, we can either remove them or fill in the missing cells with appropriate values. (You can check out my other article regarding Missing Data.)

In this dataframe, we see missing values in the "gender" and "compensation" columns. We can fill in the missing gender with the employee’s non-missing gender value, since gender doesn’t change over time. Also, we’re told by HR that a missing compensation means the employee left the company that year, so we should replace the missing compensation with 0.

fillna() allows us to replace missing values with any value; in this case, we replace the missing compensation with 0. We can use method = 'backfill' or method = 'ffill' with groupby() to fill in the missing gender with the next or previous valid observation for a given employee. (Note that the method = argument of fillna() has been deprecated in recent pandas releases in favor of the dedicated bfill() and ffill() methods.)

df2['compensation'].fillna(0, inplace = True)
df2['gender'] = df2.groupby('emp_id')['gender'].fillna(method = 'backfill')
df2['gender'] = df2.groupby('emp_id')['gender'].fillna(method = 'ffill')
print(df2)
   emp_id      emp_name gender  year compensation
0       1      Jane Dow      M  2019        53000
1       1      Jane Dow      M  2020            0
2       2  Thomas Daley      F  2019        53000
3       2  Thomas Daley      F  2020      $56,000
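Since the method = argument of fillna() is deprecated in newer pandas releases, here is a minimal sketch of the same gender fill using the dedicated bfill() and ffill() methods (toy data mirroring the example above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'emp_id': [1, 1, 2, 2],
                   'gender': [np.nan, 'M', 'F', 'F']})

# Back-fill then forward-fill gender within each employee group,
# so a single known value propagates to that employee's other rows.
df['gender'] = df.groupby('emp_id')['gender'].transform(lambda g: g.bfill().ffill())
print(df['gender'].tolist())  # ['M', 'M', 'F', 'F']
```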

String Operation

str.split(): str.split() allows us to split text in a given column into multiple columns. In the following code, the "emp_name" column is split into "first_name" and "last_name" columns using the option expand = True . By default, text in a column is split on whitespace, but you can specify the separator using the option pat = '<str>' .

df2[['first_name', 'last_name']] = df2['emp_name'].str.split(expand = True)
print(df2)
   emp_id      emp_name gender  year compensation first_name last_name
0       1      Jane Dow      M  2019        53000       Jane       Dow
1       1      Jane Dow      M  2020            0       Jane       Dow
2       2  Thomas Daley      F  2019        53000     Thomas     Daley
3       2  Thomas Daley      F  2020      $56,000     Thomas     Daley

str.replace() & re.sub(): Upon further inspection, we see that the "compensation" column contains both integer and string values. The string value contains characters such as "$" and ",".

df2.apply(lambda x: type(x['compensation']), axis = 1).value_counts()
Out[118]: 
<class 'int'>    3
<class 'str'>    1
dtype: int64
df2[df2.apply(lambda x: type(x['compensation']), axis = 1)== str]
   emp_id      emp_name gender  year compensation first_name last_name
3       2  Thomas Daley      F  2020      $56,000     Thomas     Daley

It is tempting to just use str.replace() to remove "$" and ",". But when we do that, we notice that the integer values are replaced with NaN, since the .str accessor only operates on string values. (Also note that inside a regular expression a bare "$" is an end-of-string anchor, so a character class like '[$,]' is needed to match a literal dollar sign.)

# Incorrect Method
df2['compensation'].str.replace('[$,]', '', regex=True)
Out[124]: 
0      NaN
1      NaN
2      NaN
3    56000
Name: compensation, dtype: object

Trick 1: To correctly clean a column with a mix of string and numeric values, we need to first convert values in the column into string values using astype() , then use either str.replace() or re.sub() to replace the relevant string values as needed. Lastly, we convert the column back to numeric values.

# Method 1: Using str.replace() with astype()
df2['compensation'].astype(str).str.replace('[$,]', '', regex=True).astype(int)
# Method 2: Using apply() with re.sub()
import re
df2['compensation'] = df2.apply(lambda x: re.sub('[$,]', '', str(x['compensation'])), axis = 1).astype(int)
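As an alternative sketch (not in the original article, using made-up values matching the example), pd.to_numeric can handle the final parse after the formatting characters are stripped:

```python
import pandas as pd

# A mixed column of integers and a formatted string, as in the example above.
s = pd.Series([53000, 0, 53000, '$56,000'])

# Cast everything to string, strip '$' and ',' via a character class,
# then let pandas parse the result back into integers.
cleaned = pd.to_numeric(s.astype(str).str.replace('[$,]', '', regex=True)).astype(int)
print(cleaned.tolist())  # [53000, 0, 53000, 56000]
```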

Window Functions

Window Functions are implemented on a set of rows called a window frame. A window frame is all the rows within the same group based on one or more columns. When a window function is implemented, a new column would be produced and the output would have the same number of rows as the original data set.

Create a New Column of Row number

df['new_column_name'] = df.groupby('{column name}').cumcount()+1
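For instance, on a small illustrative dataframe (column names are assumptions for this sketch):

```python
import pandas as pd

df = pd.DataFrame({'gender': ['M', 'F', 'M', 'F'],
                   'salary': [50000, 60000, 55000, 65000]})

# cumcount() numbers rows within each group starting at 0; +1 makes it 1-based.
df['row_num'] = df.groupby('gender').cumcount() + 1
print(df['row_num'].tolist())  # [1, 1, 2, 2]
```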

Create Count/Max/Min/Average/Sum Within a Group

In Pandas, we often use transform with a window function, such as count, max, min, mean and sum.

df.groupby('gender')['salary'].transform('count')
df.groupby('gender')['salary'].transform('max')
df.groupby('gender')['salary'].transform('min')
df.groupby('gender')['salary'].transform('mean')
df.groupby('gender')['salary'].transform('sum')
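A quick sketch of what transform does, on made-up data: unlike a plain aggregation, it broadcasts the group statistic back to every row, so the output has the same length as the input.

```python
import pandas as pd

df = pd.DataFrame({'gender': ['M', 'F', 'M', 'F'],
                   'salary': [50000, 60000, 55000, 65000]})

# Each row receives the mean salary of its own gender group.
df['avg_by_gender'] = df.groupby('gender')['salary'].transform('mean')
print(df['avg_by_gender'].tolist())  # [52500.0, 62500.0, 52500.0, 62500.0]
```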

Create Running Sum Within a Group

df.groupby('gender')['salary'].transform('cumsum')

Create Percentiles Within a Group

# create median 
df.groupby('gender')['salary'].transform(lambda x: x.quantile(0.5))

Create Lag/Lead Within a Group

# create lag variable
df.groupby('gender')['salary'].transform(lambda x: x.shift(1)) 
# create lead variable
df.groupby('gender')['salary'].transform(lambda x: x.shift(-1))
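On a toy dataframe (values assumed for illustration), the lag looks like this; note that the first row of each group has no previous observation, so it becomes NaN and the column is promoted to float:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['M', 'M', 'F', 'F'],
                   'salary': [50000, 55000, 60000, 65000]})

# shift(1) returns the previous row's value within each group (the lag).
df['prev_salary'] = df.groupby('gender')['salary'].transform(lambda x: x.shift(1))
print(df['prev_salary'].tolist())  # [nan, 50000.0, nan, 60000.0]
```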

Create Ranking Within a Group

df.groupby('gender')['salary'].rank(method = 'dense', ascending = False)
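A small sketch of dense ranking on made-up data: ties share a rank and no ranks are skipped.

```python
import pandas as pd

df = pd.DataFrame({'gender': ['M', 'M', 'F', 'F'],
                   'salary': [50000, 55000, 65000, 65000]})

# Highest salary in each gender group gets rank 1; ties share a rank.
df['rank'] = df.groupby('gender')['salary'].rank(method='dense', ascending=False)
print(df['rank'].tolist())  # [2.0, 1.0, 1.0, 1.0]
```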

Aggregate Functions

Aggregate Functions are implemented in the same way as window functions, but the result is more compact: the number of observations in the final output equals the number of distinct groups (i.e., unique values in the group-by variables).

Collapse Rows With Count/Max/Min/Average/Sum Within a Group

In Pandas, there are many ways to implement an aggregate function. I include 3 different ways in the following code snippet.

  • An aggregate function would run as the default using groupby
  • Use apply to run a built-in aggregate function or a user-defined function with groupby
  • Use agg to run a built-in aggregate function or a user-defined function with more flexibility, such as, naming new columns and creating more than one new column
df.groupby('gender')['salary'].mean().reset_index()
df.groupby('gender')['salary'].min().reset_index()
df.groupby('gender')['salary'].max().reset_index()
df.groupby('gender')['salary'].sum().reset_index()
df.groupby('gender').apply(lambda x: x['salary'].mean()).reset_index()
df.groupby('gender').agg(avg_salary = pd.NamedAgg('salary', 'mean')).reset_index()
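agg with NamedAgg shines when you want several named aggregates in a single pass; a sketch on assumed data:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['M', 'F', 'M', 'F'],
                   'salary': [50000, 60000, 55000, 65000]})

# Compute two aggregates at once, each with an explicit output column name.
summary = df.groupby('gender').agg(
    avg_salary=pd.NamedAgg('salary', 'mean'),
    n_emps=pd.NamedAgg('salary', 'count'),
).reset_index()
print(summary)
```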

Create Percentile Within a Group

df.groupby('gender')['salary'].quantile(0.9).reset_index()

You can check out more about Window Function & Aggregate Function in this article.

Trick 2: Apply with a User-Defined Function – The apply function gives us more flexibility and control via a user-defined function. In the following example, I’m going to use "supermarket_sales – Sheet1.csv" (Download Link) as the data source. We will create a new column of rating category based on the values in the "Rating" column.

# Import data source
df = pd.read_csv('supermarket_sales - Sheet1.csv')
def rating_category(rating):
    if rating >= 4 and rating <6:
        return 'Low'
    elif rating >= 6 and rating <8:
        return 'Mid'
    else:
        return 'High'
df['rating_cat'] = df.apply(lambda x: rating_category(x['Rating']), axis = 1)
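As an aside, the same binning can be done without a row-wise apply. This is a vectorized sketch of my own (not from the original article) using pd.cut with bin edges mirroring rating_category():

```python
import pandas as pd

# Hypothetical ratings in the dataset's 4-10 range.
ratings = pd.Series([4.5, 6.2, 9.1, 7.9])

# right=False makes bins left-closed: [4, 6) -> Low, [6, 8) -> Mid, [8, 10) -> High.
cats = pd.cut(ratings, bins=[4, 6, 8, 10], labels=['Low', 'Mid', 'High'], right=False)
print(cats.tolist())  # ['Low', 'Mid', 'High', 'Mid']
```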

Trick 3: Show a Progress Bar Using tqdm – When working with a big dataset, the apply function can take a while to run, and it would be nice to know the job status and when to expect the task to complete. The Python library tqdm works seamlessly with Pandas: you just need to replace apply with progress_apply, with everything else unchanged.

from tqdm import tqdm
tqdm.pandas()
df.progress_apply(lambda x: rating_category(x['Rating']), axis = 1)
100%|██████████| 1000/1000 [00:00<00:00, 104413.84it/s]

Trick 4: Speed up Using Swifter – Being able to fully utilize hardware power to speed up runtime is important for a data science task. Your analysis won’t be helpful to decision-makers if they need a result quickly but your program takes weeks or months to run.

The Python library [swifter](https://medium.com/@jmcarpenter2/swiftapply-automatically-efficient-pandas-apply-operations-50e1058909f9) can efficiently utilize hardware power to apply any function to a Pandas DataFrame or Series object in the quickest available way. The syntax is straightforward: you just add swifter before the apply function.

import swifter
df.swifter.apply(lambda x: rating_category(x['Rating']), axis = 1)

Join Multiple DataFrames

Combining Rows

append() & concat(): We can use either append() or concat() to combine two dataframes row-wise and produce the same output. (Note that append() was deprecated and removed in pandas 2.0, so prefer concat() going forward.) We can use reset_index(drop=True) to reset the index in the combined dataframe.

df3 = pd.DataFrame({'emp_id': 3, 'emp_name': 'Jason Zandi', 'gender': 'M', 'year': 2020, 'compensation': 60000}, index = [0])
df2.append(df3).reset_index(drop=True)
pd.concat([df2, df3]).reset_index(drop=True)
Out[182]: 
   emp_id      emp_name gender  year compensation
0       1      Jane Dow    NaN  2019        53000
1       1      Jane Dow      M  2020          NaN
2       2  Thomas Daley      F  2019        53000
3       2  Thomas Daley      F  2020      $56,000
4       3   Jason Zandi      M  2020        60000

Combining Columns

join() & merge(): Both join() and merge() can combine two dataframes column-wise. The difference is that join() matches on index, whereas merge() matches on common column(s) from the two dataframes. In the following example, when combining df2 and df4 using join() based on index, we have one row with the matched index [0]. When combining df2 and df5 using merge() on the column "emp_id", we are implementing a typical left join operation.

df4 = pd.DataFrame({'test': 300}, index = [0]) 
df4
Out[190]: 
   test
0   300
df5 = pd.DataFrame({'emp_id': [1, 2], 'title': ['Engineer', 'Analyst']})
df5
Out[192]: 
   emp_id     title
0       1  Engineer
1       2   Analyst
df2.join(df4)
Out[188]: 
   emp_id      emp_name gender  year compensation   test
0       1      Jane Dow    NaN  2019        53000  300.0
1       1      Jane Dow      M  2020          NaN    NaN
2       2  Thomas Daley      F  2019        53000    NaN
3       2  Thomas Daley      F  2020      $56,000    NaN
df2.merge(df5, on = 'emp_id', how = 'left')
Out[189]: 
   emp_id      emp_name gender  year compensation     title
0       1      Jane Dow    NaN  2019        53000  Engineer
1       1      Jane Dow      M  2020          NaN  Engineer
2       2  Thomas Daley      F  2019        53000   Analyst
3       2  Thomas Daley      F  2020      $56,000   Analyst
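When debugging joins, merge’s indicator flag is worth knowing: it adds a _merge column telling you where each row came from. A small sketch on assumed data shaped like df5 above:

```python
import pandas as pd

left = pd.DataFrame({'emp_id': [1, 1, 2, 3]})
right = pd.DataFrame({'emp_id': [1, 2], 'title': ['Engineer', 'Analyst']})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
out = left.merge(right, on='emp_id', how='left', indicator=True)
print(out['_merge'].tolist())  # ['both', 'both', 'both', 'left_only']
```

Here emp_id 3 has no title in the right table, so its row is flagged 'left_only'.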

Reformat DataFrame

Convert Dataframe from Long to Wide

Both pivot() and pivot_table() produce a (wide) table that summarizes the data from a more extensive (long) table. The difference is that pivot_table() can also incorporate aggregate functions, such as sum, mean, max, min and first. In the following example, we try to create separate compensation columns, "2019" and "2020", based on the "year" column.

df2
Out[200]: 
   emp_id      emp_name gender  year compensation
0       1      Jane Dow      M  2019        53000
1       1      Jane Dow      M  2020            0
2       2  Thomas Daley      F  2019        53000
3       2  Thomas Daley      F  2020      $56,000
df_wide = df2.pivot(index = ['emp_id', 'emp_name', 'gender'], columns = ['year'], values = 'compensation').reset_index()
print(df_wide)
year  emp_id      emp_name gender   2019     2020
0          1      Jane Dow      M  53000        0
1          2  Thomas Daley      F  53000  $56,000
df_wide2 = df2.pivot_table(index = ['emp_id', 'emp_name', 'gender'], columns = ['year'], values = 'compensation', aggfunc = 'first').reset_index()
print(df_wide2)
year  emp_id      emp_name gender   2019     2020
0          1      Jane Dow      M  53000        0
1          2  Thomas Daley      F  53000  $56,000

Convert Dataframe from Wide to Long

melt() allows us to change dataframe format from wide to long. This is the opposite of creating a pivot table. In the following example, we want to consolidate compensation columns, "2019" and "2020" into a single column and create a new "year" column.

df_long = df_wide.melt(id_vars= ['emp_id', 'emp_name', 'gender'], var_name= 'year', value_vars= [2019, 2020], value_name="Compensation")
print(df_long)
   emp_id      emp_name gender  year Compensation
0       1      Jane Dow      M  2019        53000
1       2  Thomas Daley      F  2019        53000
2       1      Jane Dow      M  2020            0
3       2  Thomas Daley      F  2020      $56,000

Thank you for reading!!!

If you enjoy this article and would like to Buy Me a Coffee, please click here.

You can sign up for a membership to unlock full access to my articles, and have unlimited access to everything on Medium. Please subscribe if you’d like to get an email notification whenever I post a new article.

