
The easy choice is to drop missing or erroneous data, but at what cost?
Dealing with missing, null, or erroneous values is one of the most painful and common exercises that we encounter in Data Science and in machine learning. In some cases, it is acceptable or even preferred to drop such records – algorithms are fragile and can seize up on missing values. However, in other cases when the inclusion of every record is important, what then? The easy choice is to drop missing erroneous data, but at what cost?
What if a single record represents a real person and dropping this record means a person’s story is not counted?
If dropping data is not acceptable, then, the obvious choice is to substitute bad data with something else, i.e. impute the data. However, such decisions are often based on gut feeling and although not exactly arbitrary – they are sometimes quite arbitrary and perhaps, even, unaccountable. As a result, when the task boils down to imputing missing or erroneous dates, it helps to make traceable and purposeful tradeoffs.
In this article, I present one way to replace erroneous datetime stamps in a Python-based Pandas DataFrame. The article builds up to a solution that leverages df.apply() and a lambda function to replace the year of one column, conditionally with the year of another column. However, the method of applying a lambda function to a dataframe is transferable for a wide-range of impute conditions.
The solution is provided here for quick reference:
Impute Erroneous Datetime Stamps
Bad data often originates at the source, i.e. when a person manually types in a record or converts paper records into digital ones. In these cases, the year 2013 or 2019 might become the year 2103 or 2109. For a given domain, if we know future dates are, in fact, erroneous, then the inclusion of these records will create problems during analysis if not addressed.
Data Example
To better illustrate the problem space, consider a Pandas DataFrame that has two columns, event_date, and received_date. In this scenario, we are confident in the received_date column – it has good data. However, the event_date column is based on converted records from manual entry and has a number of future dates that, in this domain, are obviously wrong.
Consider a scenario wherein the project DataFrame contains about 1 million total records but has anywhere from a handful to hundreds of records with bad datetime stamps for multiple columns.
# example of bad datetime stamps
# given a Pandas DataFrame, df
col1 = 'event_date'
col2 = 'received_date'
print(df[[col1, col2]])
# output
event_date received_date
2022-03-03 2011-02-15
2022-02-22 2012-02-07
2022-02-22 2012-02-07
2102-12-06 2012-11-29
2102-12-06 2012-11-29
...
Solution Tutorial
In the real-world project that inspired this article, I noticed that in all other cases with good data, the event_date and received_date for each row had the same year. As a result, I made a key assumption that whenever there is a future event_date, the year portion of the datetime stamp should be the same as received_date’s year. As a result, the desired solution is somewhat simple – replace the year of every bad datetime stamp in one column with the value from another column.
Replace Pandas Datetime Year
Replacing portions of a datetime stamp in a Pandas DataFrame is actually straightforward and mechanical; however, the task becomes complex when considering runtime and traceability. For instance, the documentation provides the methods to replace year, but what about multiple conditions and scalability over millions of records?
A working solution is great but an efficient solution is always better.
To get started with working code, consider a basic method to replace part of a datetime stamp. Based on the documentation, we can use the dt.replace() method to access and replace just the year value.
# given a df, replace year of a datetime
# necessary imports and prerequisites
import pandas as pd
col1 = 'event_date'
col2 = 'received_date'
# make a col into datetime type
df[col1] = pd.to_datetime(df[col1])
df[col2] = pd.to_datetime(df[col2])
# replace year of a single cell
print(test[col1].iloc[0].replace(year=2099))
# output
2099-04-28 00:00:00
What about replacing many records, all at once?
As shown in the next code snip, there is immediately a problem when attempting to apply the dt.replace() method to a Pandas Series. In both output errors, the replace operation fails because the object is not a timestamp. As a result, we might ask how to apply this data transformation to each record in the DataFrame.
# error cases when scaling up
# try to replace year on series
df[col1].dt.replace(year=2099)
# output error 1
AttributeError: 'DatetimeProperties' object has no attribute 'replace'
# another try to replace year on series
df[col1].replace(year=2099)
# output error 2
TypeError: replace() got an unexpected keyword argument 'year'
A straightforward solution is to iterate through the DataFrame with a for-loop and some conditional statements. To loop through rows in a DataFrame, we need to leverage the iterrows() method. For scope, let’s make another assumption that this substitution is valid for dates between 2020 and 2010, or the current year and past year, respectively.
Solution 1 – A Working But Slow Solution
The following method works because, in each iteration, we access the DataFrame by row and column which gives us the timestamp and all of its methods. If the cell’s year value is greater than curr_year, then replace by column (col1) and index (idx) value with the .iloc[] indexer.
With iterrows and loops, we’ve solved the problem, right? Not quite.
Although the preceding code snip works, we have no record of what’s been changed. To make a change log, it might be good enough to export a CSV file of the edited records. Let’s try, in each iteration, to save data to a tuple, append tuples to a list, and then make a DataFrame out of the resulting values and write it to CSV.
A working solution:
So, problem solved? Not quite.
Although this is working code, there is a second issue. The iterative solution takes way too long. Based on a simple test of elapsed time on my MacOS, this loop takes about 65 seconds to complete on a dataset of 1 million records. If this job needs to run on multiple columns, we are in big trouble of wasting precious time.
What are the alternatives to slowly iterating through a DataFrame? The answer, in this case, is to leverage the built-in apply() method. With apply(), we can do something (like change the year) to a selected portion of the entire DataFrame instead of going row-by-row.
Solution 2 – A More Efficient Solution
To impute dates more efficiently and to really take advantage of apply(), we need to build some lambda functions. If lambdas are new or confusing, that’s okay, they are actually confusing, abstract, and all the weird things. In fact, I usually build a slow and messy iterative solution before making a lambdas function. In this section, let’s build a working solution with apply() and lambdas, step-by-step.
If lambdas are new or confusing, that’s okay, they are actually confusing, abstract, and all the weird things… let’s build a working solution, step-by-step.
Working Code
As a first, step, let’s get some working code into play and see how the lambda function works with apply(). To get started, we can build a useless function that really does nothing except return data – makes a copy, if you prefer.
# useless lambda function just makes a copy of itself
df[col1] = df[col1].apply(lambda x: x)
In the preceding code snip, the lambda’s x object takes on the value of every row in df[col1] which have the dtype timestamp. If we were to build the same lambdas function, the standard way, we might just return x and do nothing else.
# a standard function that does also nothing
def some_function(x):
return x
Now that we are on the board, let’s make this lambda function do something. Before, we changed and replaced datetimes for every record, just to see if it works – we can do the same with lambdas.
Design Objectives
The lambdas work, and we know we can change timestamps, now what? Let’s start building with a few objectives in mind.
- Change one column conditionally based on another column
- Replace the year value of datetime stamps that meet a condition
- Write a record of changes to file
Apply Lambda to DataFrame
Since we are changing one column conditionally on another, we have to use apply() on the DataFrame, not just a Series. Let’s test to see if we can access both columns in the lambda function, col1 and col2.
Next, replace the year of col1 with the year of col2, only if the year in col1 is greater than the curr_year. As shown in the next code snip, the year 2109 is now 2019.
With df.apply() the x in lambda x represents the DataFrame, as a result, within the lambda, x[col1] is similar to df[col1], except that x[col1] is the actual value of co1 at a given row.
How to read the lambda function: "The value if some condition, else some other value"
Now, add another conditional statement for dates that are order than a past_year.
At this point, the simple one-line lambda function has grown. To clean up the code and re-use this function (if desired), assign the lambda function to an object called impute, then call impute in df.apply().
Elapsed Time: About 25 seconds with apply() and lambdas versus 65 seconds with iteration – roughly a 60% decrease in time!
Conclusion
In this article, I present one way to impute dates instead of dropping them. The solution involves conditionally changing the year of a datetime stamp with the year of another datetime stamp. One method leverages iterrows() to loop through a DataFrame and change values, row-by-row. A second method leverages apply() and a lambda function to make the same changes more efficiently. Further, I demonstrated a simple way of providing traceability by writing a change log to CSV that documents the modifications to data.
Imputing dates is more art than science, further it is domain specific.
Imputing dates is more art than science, further it is domain specific. As a result, the method of replacing the year with another year, as shown in this article, may not apply to other projects. However, the method to make changes conditionally and efficiently with apply() and lambdas should be transferable to other conditions.
Thanks for reading. I am still working on making this solution better and would like to know if you can improve upon it. Otherwise, I hope this helps with your project. Happy imputing!