The world’s leading publication for data science, AI, and ML professionals.

Traversing , cleaning batch worksheets in batch Excel files and interacting with MySQL

Cleaning only with pandas

Photo by the author
Photo by the author

Today’s story will include:

  1. How to only use pandas to traverse and clean dirty batch Excel worksheets
  2. Based on step 1, how to traverse batch Excel workbooks and consolidate the report
  3. How to save the Excel file to MySQL database and interact with it.

Read and write Excel files is very often operation in daily business. It is not complicated, but cumbersome, there are various cases you might have to face and needs patience to detail to handle them. Today I focus on reading-only Excel files (not modify Excel), cleaning the data only with Pandas, and interact with MySQL.

Introduction to datasets

Clean and dirty datasets

Sometimes, the datasets might have organized cleanly as below:

Pic 1: clean dataset
Pic 1: clean dataset

In other cases, the dataset might have a lot of additional useless information, as below:

Pic 2: dirty dataset 1
Pic 2: dirty dataset 1
Pic 3: dirty dataset 2
Pic 3: dirty dataset 2

For the first case, it is easier to consolidate different worksheets into one report, but for the second case, the data needs to be cleaned as following steps:

  • Part1: extract client info and sale date from specific cell
  • Part2: when reading excel into Dataframe, there will generate NaN between the last record (5, 56) and the footer (generated by BrandNew_sales_team), even worse, it is dynamic.
Pic 4: steps for data cleaning
Pic 4: steps for data cleaning

Tool for cleaning Excel files:

You might use Excel, VBA, or even Power BI, etc. Microsoft product to clean and consolidate the data, or you might use xlwings, Openpyxl, etc. python libraries, but I prefer to only use pandas DataFrame, because

  • Openpyxl doesn’t need to install Excel, but can’t run .xls file and it can’t open a password-protected file.
  • xlwings needs to install Excel on the machine.
  • The assumed situation is that the application will run without Excel installed, like in Cloud.
  • My skill in pandas is better than the others.

Datasets in detail

  1. The Excel files are password protected and VBA included, and there are various worksheets inside, which are needed to be consolidated.
  2. The Excel file format is exactly like the above pictures 2 and 3.
  3. A batch of Excel Files needed to be processed.

Data cleaning

I start cleaning only one workbook, named ‘sales_2020.xlsm’.

First import the libraries:

import numpy as np
import pandas as pd
from pandas import DataFrame

Get all of the worksheet’s names:

xl_file_name='sales_2020.xlsm'
xl = pd.ExcelFile(xl_file_name)
sheet_name=xl.sheet_names  # see all sheet names

The main Data Cleaning function:

The result looks like below:

Pic5: Top 5
Pic5: Top 5
Pic6: tail 4
Pic6: tail 4

Traversing batch Excel files

If there are batch files that need to be consolidated, use os library to search all files and combine them with the above code to get the report.

Suppose the files organized as below:

Pic 7: files path
Pic 7: files path

The method to explore the total files is as below:

filepath=r'C:UsersusernameDesktopdata'
for root,dirs,files in os.walk(filepath):
                for file in files:                        
                    filename=os.path.join(root,file)# get the name with absolute path                    

Integrating all of the steps, the function of consolidating all of the reports is:

Only one variable as filepath needs to be given:

filepath=r'C:Usersuser namedata'
consolidate_sales_report(filepath)

All of the reports are consolidated.

Save to Mysql

Supposed that you have the authorization to write data to the Mysql database, and a schema named ‘sales_report’ has been created.

We can use the below code to interact with MySQL:

The result is perfect!

Pic 8: interacting with MySQL
Pic 8: interacting with MySQL

Today I summarize the main process from traveling, cleaning data in Excel, in batch with pandas, and interacting with MySQL. The main used tools are:

  • Only use pandas to traversing and cleaning data in Excel, because, with this library, Excel isn’t needed to be installed (good for Cloud) and it can handle various cases as password protected and macro-enabled file, compared with Openpyxl or xlwings and it is efficient compared with Excel
  • Use _sheetnames to traverse all of the worksheets
  • Use os.walk() and os.path.join(root,file) to traves all of the workbooks
  • Use _tosql and _readsql to interact with MySQL.

Thank you for your time.


Related Articles