Background
Recently I was tasked with analyzing leave taken by employees at a client’s company. In particular, I needed to understand whether an employee had taken leave in a given period, ultimately setting a benchmark for measuring employees’ compliance with return-to-office policies.
I was provided with the following two leave datasets:
- Time Off Data ("Dataset A"), which sets out shorter-term leave such as annual leave or sick leave taken by an employee. This dataset is unique at a per-employee, per-date level (i.e. each row represents a day of leave taken by a particular employee).
- Leave of Absence Data ("Dataset B"), which sets out the from and to dates of longer-term leave taken by an employee. Some examples are parental leave, maternity leave, leave without pay and career breaks. This dataset is on a ‘leave-as-you-go’ basis for employees who have taken longer-term leave, in that each row represents a date range for an employee, and the same employee may appear in multiple rows with multiple date ranges (e.g. an employee may prefer to take parental leave in batches of 3 days a week for 30 weeks, which would appear as 30 date ranges across 30 rows in the dataset).
The two datasets complement each other, as an employee could take both shorter- and longer-term leave in a given period.
For my analysis, I wanted to merge the two datasets into a common format so that all leave taken by a particular employee is accounted for. Since Dataset A is already in a structured tabular format at a per-employee, per-date level, I needed to convert Dataset B into a similar format, as shown in the image below (for a dataset I created for demonstration purposes).
Methodology
The transformation illustrated in image 1 presented a number of challenges, given that it is not ‘one-to-one’ and it involves dates. To achieve the transformation, I followed the methodology below:
- The date ranges in Dataset B (as shown in the table on the left in image 1) were broken up into unique (working) dates
- The broken-up dates associated with the same employee, which appear across multiple rows, were then collapsed into two columns (employee and date), grouped by employee and then by unique date (as shown in the table on the right in image 1)
How did I implement the above in Python? Read on for a step-by-step guide!
Datetime Transformation using Pandas
Step 1: Load libraries and data
As the name of the step implies:
# Load libraries
import pandas as pd
import numpy as np
from datetime import date, timedelta
path = r"Your_Directory_PathData.xlsx"
data = pd.read_excel(path)
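If you don’t have a similar Excel file on hand, a stand-in for Dataset B can be constructed directly in pandas. The column names below match those used throughout this guide, but the values are invented for illustration and won’t match the images exactly:
## Hypothetical stand-in for Dataset B: one leave date range per row,
## with the same employee ('C') appearing in multiple rows
data = pd.DataFrame({
    'Employee': ['A', 'B', 'C', 'D', 'E', 'C'],
    'Leave From Date': ['2023-01-09', '2023-02-01', '2023-03-06',
                        '2023-04-03', '2023-05-01', '2023-07-14'],
    'Leave To Date': ['2023-01-10', '2023-02-03', '2023-03-07',
                      '2023-04-04', '2023-05-02', '2023-07-17'],
})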
The image below shows a print of the ingested data and the data type of each column. You should ensure that the two date columns are of the datetime data type; if they are not, they can be converted using pd.to_datetime().
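For example, if the two date columns come through as plain text (as they would in the stand-in dataframe above), the conversion looks like this:
## Convert the date columns to the datetime data type
data['Leave From Date'] = pd.to_datetime(data['Leave From Date'])
data['Leave To Date'] = pd.to_datetime(data['Leave To Date'])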
Step 2: Get unique dates per row
We will then ‘explode’ the date ranges in each row of the data into unique dates. This can be achieved using the pd.date_range() function and specifying the freq parameter as ‘D’ (daily):
collate = []
for j in range(len(data)):
    start_date = data['Leave From Date'][j]
    end_date = data['Leave To Date'][j]
    # Build an inclusive daily date range for this leave period
    date_range = pd.date_range(start = start_date, end = end_date, freq = 'D')
    collate.append(date_range)

df_date = pd.DataFrame(collate)
The output of this is shown in the image below. For instance, the date range from 2023-02-01 to 2023-02-03 for employee B in row 2 is now broken up into three unique dates.
You may notice that there are some non-working dates in the last row, as 2023-07-15 and 2023-07-16 in the date range fall on a weekend. To get business dates only, add a line of code to restrict the weekday index to less than 5, per the code below. In addition, I’ve provided an option to restrict dates to a range in the commented-out code.
collate = []
for j in range(len(data)):
    start_date = data['Leave From Date'][j]
    end_date = data['Leave To Date'][j]
    date_range = pd.date_range(start = start_date, end = end_date, freq = 'D')
    ## Get business dates only (weekday 0-4 corresponds to Monday-Friday)
    work_dates = date_range[date_range.weekday < 5]
    ## Apply a date range filter if you wish
    #from_date = pd.to_datetime('2023-01-01')
    #to_date = pd.to_datetime('2023-06-30')
    #work_dates = work_dates[(work_dates >= from_date) & (work_dates <= to_date)]
    collate.append(work_dates)

df_date = pd.DataFrame(collate)
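As an aside, pandas can also generate business days directly: specifying freq = 'B' in pd.date_range (or using pd.bdate_range) excludes weekends at the source, making the weekday filter redundant. Note that, like the weekday check, this skips weekends only, not public holidays. The date range line inside the loop would become:
## Equivalent alternative: generate Monday-to-Friday dates directly
work_dates = pd.date_range(start = start_date, end = end_date, freq = 'B')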
Step 3: Row-to-column transformation
Now that we have the date ranges broken up, we will collapse them at an employee level. For instance, both rows 2 and 5 in image 3 above correspond to employee C.
To convert the dates in multiple rows into just one column by employee, we add back the ‘Employee’ identifier to the df_date dataframe generated above, and then apply pandas’ .melt() method over the ‘Employee’ column.
## Add back employee identifier
df_concat = (pd.concat([pd.DataFrame(data['Employee']), df_date], axis = 1)
             .sort_values(['Employee']).reset_index(drop = True))

## Apply .melt() over employee identifier, then remove NaT values
df_melt = df_concat.melt(id_vars = ['Employee'], var_name = 'Index', value_name = 'Date')
df_melt = df_melt.sort_values(['Employee']).dropna().reset_index(drop = True)
The output of the code above is exactly what we are after, as shown below.
Mission accomplished! This dataframe can now be appended to Dataset A like-for-like for further analysis.
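For completeness, the final append could look something like the sketch below, assuming Dataset A has been loaded into a dataframe named df_a with matching ‘Employee’ and ‘Date’ columns (both names here are hypothetical):
## Stack the transformed Dataset B underneath Dataset A
df_all_leaves = pd.concat([df_a, df_melt[['Employee', 'Date']]], ignore_index = True)
## Remove duplicates in case the same employee-date appears in both datasets
df_all_leaves = df_all_leaves.drop_duplicates().sort_values(['Employee', 'Date'])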
Concluding Thoughts
When I first encountered this problem, I contemplated solutions in various tools such as SQL, R and Python. After a bit of research, I ultimately chose Python as it appeared to be the easiest to implement, particularly for breaking up date ranges into individual dates and unpivoting columns into rows (noting that a melt function is also available in R).
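As one example of an alternative, newer versions of pandas (0.25 and above) provide a DataFrame.explode() method that can compress Steps 2 and 3 into a few lines. Here is a minimal sketch of that approach, using the same column names as above:
## Build one date range per row, then explode each range into one row per date
df_alt = data.copy()
df_alt['Date'] = [pd.date_range(s, e, freq = 'D')
                  for s, e in zip(df_alt['Leave From Date'], df_alt['Leave To Date'])]
df_alt = df_alt.explode('Date')[['Employee', 'Date']]
## Keep business days only, mirroring the weekday filter from Step 2
df_alt['Date'] = pd.to_datetime(df_alt['Date'])
df_alt = df_alt[df_alt['Date'].dt.weekday < 5].reset_index(drop = True)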
That said, I’d be more than happy to receive feedback on other approaches readers think are more practical for solving this data transformation problem; leave them in the comments!
As I ride the AI/ML wave, I enjoy writing and sharing step-by-step guides and how-to tutorials in comprehensible language with ready-to-run code. If you would like to access all my articles (and articles from other practitioners/writers on Medium), you can sign up using the link here!