The world’s leading publication for data science, AI, and ML professionals.

Report Automation Tips with Python

Sharing the tips to make your automation script cleaner and easier to debug. I learned from my rookie scripts that I wrote one year ago

AUTOMATION WITH PYWIN32

Photo by Amanpreet Deol on Unsplash
Photo by Amanpreet Deol on Unsplash

Well, recently, I am debugging the Automation scripts I wrote one year ago as they cannot run properly on the new PC. I just can’t stop thinking about what am I thinking at that time. Overall, it was a funny and painful experience to debug my script. Haha!


The first lesson I learned from my painful debugging experience is:

Do Not Hardcode The Date Calculation Variable

Generally, the automation scripts are written to populate a weekly report on the day or the next day the input reports are generated from the ERP system. The input report generated will always have the same report name, and always have the date the report generated in the file name.

For example, Material Usage Report 15042022.xlsx and Material Usage Report 22042022.xlsx. The report name "Material Usage Report" should be always consistent and the date format also should be consistent.

In my case, the generated reports are shared out by saving them in Network Drive or sent out via email. Then, the automation script will be reading them and performing the needed transformation, like merging tables or pivoting.

So, I use the datetime.date.today() function to get the date, then combined with the name of the report that needs to be downloaded. If the script is scheduled on the same day the report is generated, I will use the output from the function directly, else I will use the timedelta function to calculate the actual date.

# the file is processed on the next day of the file received
file_received_date = datetime.date.today() - timedelta(days = 1)

This looks perfectly fine during the development stage, but when it comes to the testing stage, I always have to manually adjust the parameter for timedelta function to get the correct date. After countless times of testing, I finally realize, I should not code it this way. As the script run weekly, I should be able to run the script on any day of the week to get the output. For example below, the script is scheduled for every Monday.

# add 1 day for everyday delayed
day_delay = datetime.date.today().weekday()
# the process file dated the next day of the file received
date_process_file = datetime.date.today() - timedelta(days = day_delay)

The weekday() function will return the weekday of the date in number form. As the Python index starts from zero, so the first day of the week, Monday is represented by 0 as shown below.

Monday: 0
Tuesday: 1
Wednesday: 2
Thursday: 3
Friday: 4
Saturday: 5
Sunday: 6

So, if I run the script on Tuesday, which is the next day, the date_process_file needs to be minus one day.

If my script is scheduled on any other day than Monday, I just need to minus the index of the day from the day_delay calculated as follows. The day_delay returns the weekday of the date in number.

# Scheduling on Wednesday
date_process_file = datetime.date.today() - timedelta(days = day_delay - 2)

The logic behind this is shown in this image.

Image by Author. Created with Excalidraw.
Image by Author. Created with Excalidraw.

For a report that is scheduled on Monday, if we run the script on Sunday, we have to minus 6 days to the date to get the date of Monday, while the 6 is the index of Sunday and the index of Monday is 0.

For a report that is scheduled on Wednesday, if we run the script on Sunday, we have to minus 4 days to the date to get the date of Wednesday, while the 4 is equal to the index of Sunday minus the index of Wednesday.

If we derived a formula for this, it should look like something like this.

actual_date = datetime.date.today() - timedelta(days = weekday_script_run - weekday_report_scheduled)

Use The Copy Method Wisely

One of my pain points is I was too lazy back then. I abused the PyWin32 Copy Method. When I was assigned to automate the weekly report, there are already old reports with the format set. So I just copy and paste the format from the previous week’s report to the new workbook that is generated by my automation script.

# copy paste FORMAT only, which includes font colour, fill colour, and border
wb2.Sheets(sheetname).Range("A:D").Copy()
wb.Sheets(sheetname).Range("A1").PasteSpecial(Paste=win32c.xlPasteFormats)

Everything is perfectly fine until the users wish to add a new column in the new report. This means my copy-paste trick cannot be used anymore, else the format will all miss-match🤣.

So here’s the second lesson I learned:

Use The Copy Method Wisely

Before you write the script to copy-paste anything into the new report, think twice! Copy-paste is just like hard-coding any Python variables. Always review the part you try to copy-paste and think thoroughly about whether will this be affected if there are new requirements in the future.

If you are curious about how I resolve this, I code the format explicitly 🤣.

wb.Sheets(sheetname).Columns("A:A").AutoFit()
wb.Sheets(sheetname).Columns("B:B").ColumnWidth = 19.71
wb.Sheets(sheetname).Columns("C:C").ColumnWidth = 43
wb.Sheets(sheetname).Columns("D:I").AutoFit()
wb.Sheets(sheetname).Columns("J:N").ColumnWidth = 23

During resolving this, I also learned some formatting in Excel that can be achieved with Python.

Wrap Text

wb.Sheets(sheetname).Range("B1").WrapText = True

Set Borders Line Style

In line 12, the Range(select_table, select_table.End(win32c.xlDown)) will select the entire table in the Excel file, so we do not need to specify the last row. The first parameter select_table refers to the first row of the table, the second parameter select_table.End(win32c.xlDown) will select the rows based on the first column, and it will stop when the first column is empty.


Fixing System Generated Excel File That Cannot Be Read by Pandas With PyWin32

This is referring to a type error as below.

TypeError: expected <class 'openpyxl.styles.fills.Fill'>

So, the Excel file can be open directly on my laptop but not able to be read by Pandas no matter how I try. Then, eventually, I solved it, in a not so smart way I guess🤣. This is an issue that is available on the StackOverFlow platform, and people are suggesting reverting to an older version of the OpenPyXL or resaving the file as an XLS file type.

I did try to revert to the older version of the OpenPyXL version but it did not solve the error. I also tried to install all the packages with the same version that I installed on the old PC, where the script will still work, but the same error still occurs on the new PC.

Then, I try the second solution suggested, which is to resave the file as an XLS file. Unfortunately, it caused another error as below.

XLRDError: Workbook is encrypted

I try to read through the Microsoft Documentation but was still not able to save the file unencrypted. When I almost give up, I try to resave the file as a CSV file and it works perfectly well!

import win32com.client as win32
import pandas as pd
win32c = win32.constants
import pandas as pd
file_xlsx = r"the original file.xlsx"
file_csv = r"the new file.csv"
# create excel object
excel = win32.gencache.EnsureDispatch('Excel.Application')
# excel can be visible or not
excel.Visible = True  # False
wb = excel.Workbooks.Open(file_xlsx_local)
wb.SaveAs(file_csv, win32c.xlCSV, win32c.xlLocalSessionChanges, win32c.xlExclusive)
wb.Close()
excel.Quit()
# read the resaved file
df_csv = pd.read_csv(file_csv, encoding='windows-1252', low_memory=False)

The file needs a special encoding to read but everything is resolved now😉 .

The reason I added this issue to my lesson is that my senior did suggest I save it as a new file but I did not take it seriously in the beginning. It can work if I open the file directly on my laptop and save it as a different file. However, I felt that I can’t even read it with pandas, how can I save it as another file.

Then after struggling for a few more hours, I realize all I need is to use PyWin32 to open the file and save it as another file. This is because PyWin32 access the Excel file by accessing Window’s Component Object Model (COM)[1][2].

So, the lesson learned here is to take the senior’s suggestions seriously, there are some reasons that they are the senior. Moreover, I should always stay creative while writing scripts.


Above are the three lessons I learned during debugging my script.

Do you ever experience something similar?

Let me know that I am not alone! 🤣

Write A Clean Python Code

Lastly, I appreciate that I read about the clean code concept before I wrote this script. My slightly clean code allows me to catch up to my code quickly 1 year after I first wrote it (still have plenty of room for improvement😁 ).

I forget where I read about this, but I believe you saw articles or videos about writing clean code somewhere when you are learning to program. Like here’s one I read recently.

There are various versions of suggestions for writing clean code. For me, the most important concept is to keep the variable and function name interpretable. The name should be descriptive and easy to read. Hence, the ambiguous short forms should be avoided.


Stay Connected

Subscribe on YouTube

Side Note

In Automate Excel with Python, I have explained on Objects, Methods, and Properties of Excel VBA. These are the 3 main concepts you will have to know to use pywin32.

If you are interested to automate Pivot Table and extract the filtered data from Pivot Table to pandas DataFrame, here you go Automate Pivot Table with Python (Create, Filter and Extract).

If your job used Google Sheets instead of Microsoft Excel, you may refer to this article, "Automate Google Sheet Report" for the possible automation to be made.

Reference

  1. Excel COM add-ins and Automation add-ins
  2. Automating Windows Applications Using COM by Chris Moffitt

Congrats and thanks for reading to the end. Hope you enjoy this article. 😊

Photo by Courtney Hedger on Unsplash
Photo by Courtney Hedger on Unsplash

Related Articles