Hands-on Tutorials

In data analysis, Date format and related issues are headaches. Today I would like to take an example to show what kind of problems exist in date data in reality, how to handle it with Python, to insert it into MySQL database.
Case study
A company needs to consolidate the information for employees worldwide, as the data comes from different countries, and there are different conventions to input the birthday into Excel, and there are also typos, so when the data is consolidated, the birthday looks like:

Birthday1: how the original birthday looks like (in order to show it, a character like '
is added to make it display the original format)
Birthday2: how the birthday looks like in Excel on my computer
If we summarize the case, there are situations as below table:

As you may already know, data left-aligned by default is a string, such as’2010–13–09′, data right-aligned by default is date or number, like ‘9/13/2010’, and I am not sure the type of 20100913, just leave it as it.
I categorize the date into two cases:
1. There is a number greater than 12 (except year)
Under this case, there are three cases, which are : A. The string contains forward slash ‘/’ or dash ‘-‘, for example: ‘2010/09/13’, ‘2010–09–13’ B. The string doesn’t contain forward slash ‘/’ or dash ‘-‘, but other characters like comma or Punkt, for example: ‘2010,09,13’, ‘2010.09.13’ C. There is no connector or delimiter, like **** ‘20100913’, ‘20101309’
2. If there is no number greater than 12 (except year)
Under this case, there are also three cases, which are : A. The string contains forward slash ‘/’ or dash ‘-‘, for example: ‘2010/09/02′,’2010–09–02’ B. The string doesn’t contain forward slash ‘/’ or dash ‘-‘, but other characters like comma or Punkt, for example: ‘2010,09,02’, ‘2010.09.02’ C. There is no connector or delimiter, like ‘20100902’, ‘20100209’
Now let’s import it into Python:
df = pd.read_excel('employee.xlsx')
It looks like:

So the date in Excel has been transferred into timestamp format in python, the none date data now on the right side, but it isn’t timestamped format.
The table employee_birthday has been created in MySQL as:
Let’s try to insert the data of name and Birthday2 into the database.
It stopped when the first error happened. The error information is like below:
DataError: 1292 (22007): Incorrect date value: '2010-13-09' for column 'birthday' at row 1
Solution for transferring multi-format date data with Python
So we need to correct all of the incorrect date values before inserting.
First, let’s check the type of the incorrect date value, take ‘2010–13–09’, and ‘20100913’ as examples. The result is:

- For data type is str, use datetime.strptime to make test:

- For int, first, transfer it into str, then use datetime.strptime. Let’s have a try:

It works too.
Now come to the point: there are so many different formats, how can we handle them efficiently? Let’s categorized the situation into four cases:
- If the type(birthday) is pandas.Timestamp, return the birthday
- If the type(birthday) is a string, create a format list, and combine with datetime.strptime to return the correct date
- If the type is int, first transfer it into a string, then use the method at step2
- others exception
There is a trick in the format list when the type of data is str.
- In case, the connectors are forward slash or dash, if there is no number greater than 12(except year), Excel will transfer the data in the format: ‘yyyy-mm-d’ or ‘mm/d/yyyy’ to date, which looks like: all of these dates will be transferred as pandas.Timestamp

- In case, the connectors are forward slash or dash, if there is a number greater than 12(except year), Excel will transfer the data in the format: ‘yyyy-m-d’ or ‘m/d/yyyy’ by default to date, except: if the day(greater than 12) is in the default month place, these cases need to be listed in the ‘fmt’ for forwards slash and dash:
- %Y-%d-%m, like ‘2010–13–09’
- %d-%m -%Y, like ’13–9–2010′
- %d/%m%Y, like ’13/9/2010′
- %Y/%d/%m, like ‘2010/13/9’

- For the other connectors, all of the possible cases should be listed. They are: ‘%Y.%m.%d’, ‘%Y.%d.%m’, ‘%m.%d.%Y’, ‘%d.%m.%Y’, ‘%Y,%m,%d’, ‘%Y,%d,%m’, ‘%m,%d,%Y’, ‘%d,%m,%Y’

So the fill list should look like: (‘%Y-%d-%m’,’%d-%m-%Y’, ‘%Y/%d/%m’, ‘%d/%m/%Y’, ‘%Y.%m.%d’, ‘%Y.%d.%m’, ‘%m.%d.%Y’, ‘%d.%m.%Y’, ‘%Y,%m,%d’, ‘%Y,%d,%m’, ‘%m,%d,%Y’, ‘%d,%m,%Y’)
The complete date parsing function is as below:
Results:
Now let’s try to insert into MySQL again. Use the below codes to query the data from the table:
The result is like below:

Remarks:
In this article, I focus on how to correct multi-format date data with Python automatically to insert date data from Excel to MySQL.
At last, I would like to clarify another two points:
- For data like 2010–9–2, and 2010–2–9: when imported from Excel to python, the system has transferred it in the format yyyy-m-d, if you want to parse as yyyy-d-m, this method explained here doesn’t work.
- For the incorrect date in Excel, if you want to proceed with it in Excel, there are also some hits for your reference.
- Use replace to transfer the connector except for forward-slash (‘/’) and dash (‘-‘)
- Use function
=Date+right(),left() and mid(),
to transfer numbers into date, for example: for 20100209, use ‘=DATE(LEFT(cell,4),MID(cell,5,2),RIGHT(cell,2))’ to transfer it into ‘2/9/2010’
Thanks for reading and enjoy it.