The world’s leading publication for data science, AI, and ML professionals.

A case study for transferring multi-format date data from Excel into MySQL through Python

In data analysis, date format and related issues are headaches. Today I would like to take an example to show what kind of problems exist…

Hands-on Tutorials

Photo by Cookie the Pom on Unsplash
Photo by Cookie the Pom on Unsplash

In data analysis, Date format and related issues are headaches. Today I would like to take an example to show what kind of problems exist in date data in reality, how to handle it with Python, to insert it into MySQL database.

Case study

A company needs to consolidate the information for employees worldwide, as the data comes from different countries, and there are different conventions to input the birthday into Excel, and there are also typos, so when the data is consolidated, the birthday looks like:

The table in Excel (Image by Author)
The table in Excel (Image by Author)

Birthday1: how the original birthday looks like (in order to show it, a character like 'is added to make it display the original format) Birthday2: how the birthday looks like in Excel on my computer

If we summarize the case, there are situations as below table:

Analysis of the possible case(Image by Author)
Analysis of the possible case(Image by Author)

As you may already know, data left-aligned by default is a string, such as’2010–13–09′, data right-aligned by default is date or number, like ‘9/13/2010’, and I am not sure the type of 20100913, just leave it as it.

I categorize the date into two cases:

1. There is a number greater than 12 (except year)

Under this case, there are three cases, which are : A. The string contains forward slash ‘/’ or dash ‘-‘, for example: ‘2010/09/13’, ‘2010–09–13’ B. The string doesn’t contain forward slash ‘/’ or dash ‘-‘, but other characters like comma or Punkt, for example: ‘2010,09,13’, ‘2010.09.13’ C. There is no connector or delimiter, like **** ‘20100913’, ‘20101309’

2. If there is no number greater than 12 (except year)

Under this case, there are also three cases, which are : A. The string contains forward slash ‘/’ or dash ‘-‘, for example: ‘2010/09/02′,’2010–09–02’ B. The string doesn’t contain forward slash ‘/’ or dash ‘-‘, but other characters like comma or Punkt, for example: ‘2010,09,02’, ‘2010.09.02’ C. There is no connector or delimiter, like ‘20100902’, ‘20100209’

Now let’s import it into Python:

df = pd.read_excel('employee.xlsx')

It looks like:

The data in Python (Image by Author)
The data in Python (Image by Author)

So the date in Excel has been transferred into timestamp format in python, the none date data now on the right side, but it isn’t timestamped format.

The table employee_birthday has been created in MySQL as:

Let’s try to insert the data of name and Birthday2 into the database.

It stopped when the first error happened. The error information is like below:

DataError: 1292 (22007): Incorrect date value: '2010-13-09' for column 'birthday' at row 1

Solution for transferring multi-format date data with Python

So we need to correct all of the incorrect date values before inserting.

First, let’s check the type of the incorrect date value, take ‘2010–13–09’, and ‘20100913’ as examples. The result is:

  • For data type is str, use datetime.strptime to make test:
  • For int, first, transfer it into str, then use datetime.strptime. Let’s have a try:

It works too.

Now come to the point: there are so many different formats, how can we handle them efficiently? Let’s categorized the situation into four cases:

  1. If the type(birthday) is pandas.Timestamp, return the birthday
  2. If the type(birthday) is a string, create a format list, and combine with datetime.strptime to return the correct date
  3. If the type is int, first transfer it into a string, then use the method at step2
  4. others exception

There is a trick in the format list when the type of data is str.

  1. In case, the connectors are forward slash or dash, if there is no number greater than 12(except year), Excel will transfer the data in the format: ‘yyyy-mm-d’ or ‘mm/d/yyyy’ to date, which looks like: all of these dates will be transferred as pandas.Timestamp
Image by Author
Image by Author
  1. In case, the connectors are forward slash or dash, if there is a number greater than 12(except year), Excel will transfer the data in the format: ‘yyyy-m-d’ or ‘m/d/yyyy’ by default to date, except: if the day(greater than 12) is in the default month place, these cases need to be listed in the ‘fmt’ for forwards slash and dash:
  • %Y-%d-%m, like ‘2010–13–09’
  • %d-%m -%Y, like ’13–9–2010′
  • %d/%m%Y, like ’13/9/2010′
  • %Y/%d/%m, like ‘2010/13/9’
Image by Author
Image by Author
  1. For the other connectors, all of the possible cases should be listed. They are: ‘%Y.%m.%d’, ‘%Y.%d.%m’, ‘%m.%d.%Y’, ‘%d.%m.%Y’, ‘%Y,%m,%d’, ‘%Y,%d,%m’, ‘%m,%d,%Y’, ‘%d,%m,%Y’
Image by Author
Image by Author

So the fill list should look like: (‘%Y-%d-%m’,’%d-%m-%Y’, ‘%Y/%d/%m’, ‘%d/%m/%Y’, ‘%Y.%m.%d’, ‘%Y.%d.%m’, ‘%m.%d.%Y’, ‘%d.%m.%Y’, ‘%Y,%m,%d’, ‘%Y,%d,%m’, ‘%m,%d,%Y’, ‘%d,%m,%Y’)

The complete date parsing function is as below:

Results:

Now let’s try to insert into MySQL again. Use the below codes to query the data from the table:

The result is like below:

Data_queried_from_MySql
Data_queried_from_MySql

Remarks:

In this article, I focus on how to correct multi-format date data with Python automatically to insert date data from Excel to MySQL.

At last, I would like to clarify another two points:

  1. For data like 2010–9–2, and 2010–2–9: when imported from Excel to python, the system has transferred it in the format yyyy-m-d, if you want to parse as yyyy-d-m, this method explained here doesn’t work.
  2. For the incorrect date in Excel, if you want to proceed with it in Excel, there are also some hits for your reference.
  • Use replace to transfer the connector except for forward-slash (‘/’) and dash (‘-‘)
  • Use function =Date+right(),left() and mid(), to transfer numbers into date, for example: for 20100209, use ‘=DATE(LEFT(cell,4),MID(cell,5,2),RIGHT(cell,2))’ to transfer it into ‘2/9/2010’

Thanks for reading and enjoy it.


Related Articles