Data Wrangling Solutions — Working With Dates — Part 1

Reading files containing the dates column.

Ujjwal Dalmia
Towards Data Science

--

Photo by Elena Mozhvilo on Unsplash

The topic discussed here is a challenge that every aspiring data scientist/ analyst stumbles upon at the start of their data science journey. The challenge with this problem is that you will continue to encounter it in some scenario or the other, and every time you do that, you will google to find the solution. In this and the next few tutorials, we will learn how to work with date datatype.

Assumption and Recommendation

Being hands-on is the key to master programming. We recommend that you continue to implement the codes as you follow through with the tutorial. The sample data and the associated Jupiter notebook are available in the Scenario_5 folder of this GitHub link.

If you are new to GitHub and want to learn it, please go through this tutorial. To set up a new Python environment on your system, please go through this tutorial.

Following is the list of Python concepts and pandas functions/ methods used in the tutorial:

Pandas functions

  • read_csv

Challenges & Solutions

The various challenges with reading the date columns and their associated solutions are listed below. But, before getting started, let’s understand the dataset we are using for this purpose.

Knowing the dataset

We have created a dummy dataset that contains multiple versions of a data field, release_date. The data dictionary of this dummy dataset is as follows:

  • release_date— Actual date column with first date value deleted.
  • release_date_int — Another column containing date information but in an integer format, for example, date 2020–02–12 is present as 20200212 in YYYYMMDD format.
  • release_date_text — Column containing dates in text format, and # as the separator. The format code is YYYY#DD#MM.
  • Year — Column containing only year values of the date data.
  • Month— Column containing only month values of the date data.
  • Day — Column containing information on days

A snapshot of the dataset is shared below:

Sample Data Snapshot (Image by Author)

Challenge 1— Reading Excel/CSV files containing the date column

In this scenario, we are trying to read our dataset with a focus on the release_date column. The sample code below demonstrates how read_csv reads the date column in its vanilla form.

#### Sample Code
#### Importing Pandas
import pandas as pd
#### Importing Data File - Change the Windows Folder Location
imp_data = pd.read_csv("C:\\Ujjwal\\Analytics\\Git\\Scenario_5\\Date_Dataset.csv")
#### Checking the dataset information
imp_data.info()
#### Sample output<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 release_date 2999 non-null object
1 release_date_int 3000 non-null int64
2 release_date_text 3000 non-null object
3 year 3000 non-null int64
4 month 3000 non-null int64
5 day 3000 non-null int64
dtypes: int64(4), object(2)
memory usage: 140.8+ KB

Observe the output from the info method. The release_date column is read as the object datatype by Pandas. To read it as a date object, the read_csv function provides the option to use an additional parameter. Refer to the modified code below for the solution

#### Sample Code
#### Importing Pandas
import pandas as pd
#### Importing Data File - Change the Windows Folder Location
imp_data = pd.read_csv("C:\\Ujjwal\\Analytics\\Git\\Scenario_5\\Date_Dataset.csv", parse_dates = ["release_date"])
#### Checking the dataset information
imp_data.info()
#### Sample output<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 release_date 2999 non-null datetime64[ns]
1 release_date_int 3000 non-null int64
2 release_date_text 3000 non-null object
3 year 3000 non-null int64
4 month 3000 non-null int64
5 day 3000 non-null int64
dtypes: datetime64[ns](1), int64(4), object(1)
memory usage: 140.8+ KB

Explanation

  • The release_date column is now available as a date-time object.
  • parse_dates — The parameter expects us to pass all the date column names as a list. If the date values in these columns are correct, it will parse the columns as a date-time datatype.

Challenge 2— Reading Excel/CSV files containing year, month & day columns

In this scenario, we will try to read three separate columns, year, month, and day into a single date column. From the vanilla use of the read_csv function, we know that the read_csv read the three columns as integers. In the modified code below, we have again used the parse_dates argument to achieve the desired result.

#### Sample Code
#### Importing Pandas
import pandas as pd
#### Importing Data File - Combining 3 columns into a single date field
imp_data = pd.read_csv("C:\\Ujjwal\\Analytics\\Git\\Scenario_5\\Date_Dataset.csv", parse_dates = ["release_date", ["year","month","day"]])
#### Checking the dataset information
imp_data.info()
#### Sample output<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year_month_day 3000 non-null datetime64[ns]
1 release_date 2999 non-null datetime64[ns]
2 release_date_int 3000 non-null int64
3 release_date_text 3000 non-null object
dtypes: datetime64[ns](2), int64(1), object(1)
memory usage: 93.9+ KB

Explanation

  • Notice the creation of a new datetime64[ns] column year_month_day. It is over and above the release_date column that the function read as a date-time object.
  • parse_dates — In the code shared above, other than passing the release_date column, we have passed another list of the year, month, and day column names (individual components of date data). The function parse_dates has converted this list into the date column.

Important points —Rather than passing the column names, we can also use the column numbers to the parse_dates parameter. We can either use all the three date components (the year, month, and date) or only two (year and month). It will not accept any other combination of parameters. The column names can be anything and don’t have to be the names of the date components. Finally, the order of the columns in the list should be the year, month & date, or the columns will be available as an object datatype.

Challenge 3— Reading Excel/CSV files containing year, month & day columns and adding custom column name

In the last challenge, we successfully parsed the year, month, and day columns into a single date column but couldn’t customize its name. A minor modification to the above code can help us achieve this.

#### Sample Code
#### Importing Pandas
import pandas as pd
#### Importing Data File - Combining 3 columns into a single date field
imp_data = pd.read_csv("C:\\Ujjwal\\Analytics\\Git\\Scenario_5\\Date_Dataset.csv", parse_dates = {"combined_date":["year","month","day"]})
#### Checking the dataset information
imp_data.info()
#### Sample output<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 combined_date 3000 non-null datetime64[ns]
1 release_date 2999 non-null object
2 release_date_int 3000 non-null int64
3 release_date_text 3000 non-null object
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 93.9+ KB

Explanation

  • Notice the creation of a new datetime64[ns] column combined_date.
  • parse_dates — This time, rather than passing the three columns, year, month, and day as the list of lists, we have used the list of these columns as a dictionary. The key to the dictionary is the new column name, and the value component is the list of columns to be combined.

Closing note

In the output from the solution of challenge three, a small piece is missing. Identify it, its associated solution, and comment below. The answer is available in the Jupyter notebook shared through GitHub Repository.

In this tutorial, we learned about reading the columns containing date information. In the next tutorial, we will learn about converting the date data in non-date-time format into the date-time format.

HAPPY LEARNING ! ! ! !

--

--