The world’s leading publication for data science, AI, and ML professionals.

Data Wrangling Solutions – Working With Dates – Part 2

Convert columns from non-date to date datatypes.

Photo by Steinar Engeland on Unsplash
Photo by Steinar Engeland on Unsplash

In the last tutorial, we looked at the various approaches to read the data files containing the date-time variables. In this tutorial, we will discuss how to handle columns of an existing dataframe that contains date-time data but are still read by Python as an object datatype.

Assumption and Recommendation

Being hands-on is the key to master programming. We recommend that you continue to implement the codes as you follow through with the tutorial. The sample data and the associated Jupiter notebook are available in the Scenario_6 folder of this GitHub link.

If you are new to GitHub and want to learn it, please go through [this](https://towardsdatascience.com/getting-started-guide-anaconda-80a4d30d3486) tutorial. To set up a new Python environment on your system, please go through this tutorial.

Following is the list of Python concepts and pandas functions/ methods used in the tutorial:

Pandas functions

  • read_csv
  • to_datetime

Challenges & Solutions

Knowing the dataset

As explained in the last tutorial, we have created a dummy dataset that contains multiple versions of a data field, _release_date_. The data dictionary of this dummy dataset is as follows:

  • release_date – Actual date column with first date value deleted.
  • release_date_int – Another column containing date information but in an integer format, for example, date 2020–02–12 is present as 20200212 in YYYYMMDD format.
  • release_date_text – Column containing dates in text format, and # as the separator. The format code is YYYY#DD#MM.
  • Year – Column containing only year values of the date data.
  • Month – Column containing only month values of the date data.
  • Day – Column containing information on days

A snapshot of the dataset is shared below:

Sample Data Snapshot (Image by Author)
Sample Data Snapshot (Image by Author)

In all our scenarios explained below, we have assumed that this dummy data file is read in Python using the vanilla _read_csv_ implementation. The code to read the dummy data and information on each column after reading it is as below:

#### Sample Code
#### Importing Pandas
import pandas as pd
#### Importing Data File - Change the Windows Folder Location
imp_data = pd.read_csv("C:UjjwalAnalyticsGitScenario_6Date_Dataset.csv")
#### Checking the dataset information
imp_data.info()
#### Sample output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   release_date       2999 non-null   object
 1   release_date_int   3000 non-null   int64 
 2   release_date_text  3000 non-null   object
 3   year               3000 non-null   int64 
 4   month              3000 non-null   int64 
 5   day                3000 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 140.8+ KB

Challenge 1 – Converting the column containing the date-time data in text format into a date-time object

In this scenario, we are trying to convert the _release_datetext column from and object datatype into the date-time datatype. The sample code to demonstrate this is as below:

#### Sample Code
#### Datatype Conversion (Assuming the data is already read into a dataframe)
imp_data["release_date_text_converted"] = pd.to_datetime(imp_data["release_date_text"], format="%Y#%d#%m")
#### Checking the dataset information
imp_data.info()
#### Sample output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   release_date                 2999 non-null   object        
 1   release_date_int             3000 non-null   int64         
 2   release_date_text            3000 non-null   object        
 3   year                         3000 non-null   int64         
 4   month                        3000 non-null   int64         
 5   day                          3000 non-null   int64         
 6   release_date_text_converted  3000 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 164.2+ KB

Explanation

  • The new column _release_date_text_converted_ is now available as a date-time object.
  • to_datetime – The function expects the first argument to be the column (series in Pandas) that we want to convert into a date-time object. As the second parameter, we have to provide the format in which the data currently exists.

In our case, the data is in YYYY#DD#MM format. In this format, we have represented the year component as %Y, the month component as %m, and the date component as %d.

Challenge 2 – Converting the column containing the date in integral form into a date-time object

This scenario is a minor diversion from the previous one. Here, rather than having the source column in text format, we have it in integral format. The code to implement this solution is as follows:

#### Sample Code
#### Datatype Conversion (Assuming the data is already read into a dataframe)
imp_data["release_date_int_converted"] = pd.to_datetime(imp_data["release_date_int"], format="%Y%m%d")
#### Checking the dataset information
imp_data.info()
#### Sample output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   release_date                 2999 non-null   object        
 1   release_date_int             3000 non-null   int64         
 2   release_date_text            3000 non-null   object        
 3   year                         3000 non-null   int64         
 4   month                        3000 non-null   int64         
 5   day                          3000 non-null   int64         
 6   release_date_text_converted  3000 non-null   datetime64[ns]
 7   release_date_int_converted   3000 non-null   datetime64[ns]
dtypes: datetime64[ns](2), int64(4), object(2)
memory usage: 187.6+ KB

Explanation

  • The new column _release_date_int_converted_ is now available as a date-time object.
  • to_datetime— The code in this solution remains the same except the input to the format parameter. Since there were no separators between the date components, the revised format we have used is %Y%m%d.

Challenge 3 – Manually adding a new column with a constant date value to the data frame

Sometimes the situation warrants us to create a new date column with some fixed values. A minor modification to the above code can help us achieve this.

#### Sample Code
#### Datatype Conversion (Assuming the data is already read into a dataframe)
imp_data["new_constant_date_column"] = pd.to_datetime("1985-17-03", format="%Y-%d-%m")
#### Checking the dataset information
imp_data.info()
#### Sample output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   release_date                 2999 non-null   object        
 1   release_date_int             3000 non-null   int64         
 2   release_date_text            3000 non-null   object        
 3   year                         3000 non-null   int64         
 4   month                        3000 non-null   int64         
 5   day                          3000 non-null   int64         
 6   release_date_text_converted  3000 non-null   datetime64[ns]
 7   release_date_int_converted   3000 non-null   datetime64[ns]
 8   new_constant_date_column     3000 non-null   datetime64[ns]
dtypes: datetime64[ns](3), int64(4), object(2)
memory usage: 211.1+ KB
Sample Output (Image by Author)
Sample Output (Image by Author)

Explanation

  • Observe the new column _new_constant_date_column_ containing the constant date values.
  • to_datetime – This time, rather than passing the column containing the date-time objects, we have passed a constant date value to the function. The format parameter is also adjusted accordingly.

Closing note

Given the approach we have taken in this tutorial, can you think of a solution to convert the three date columns: year, month, and day into a single date column? Refer to the Jupyter notebook shared through GitHub Repository for the solution.

In this tutorial, we learned about converting dataframe columns from any data type into a date-time object. In the next tutorial, we will go through the various data-wrangling activities associated with the date-time columns.

HAPPY LEARNING ! ! ! !


Related Articles