How to Automate Excel Files Received on Email - Increase your Productivity

A step-by-step guide to downloading files from email and processing them as per our requirements

Shravankumar Suvarna
Towards Data Science


We receive tons of emails daily: some spam, some promotional, some important, some extremely important (like my story notifications). Managing them is a tedious process. Let’s ease our daily work by outsourcing it to computers.

Let’s assume multiple banks are sending us a daily transaction summary via email and we need to collate them into one single Excel file to perform cash-flow analysis. There are a lot of similar use cases: invoices from multiple vendors, MIS (Management Information System) reports from multiple employees, job offer emails, multiple news articles, and so on.

Photo by Webaroo on Unsplash

Outsourcing this activity will not only ease our work but also increase our productivity by lowering the manual error rate. We will use a powerful ETL tool, Pentaho Data Integration (Kettle), for this automation. We don’t have to write a single line of code to perform it. Yes - No Code!

In case you have not installed PDI on your system, I recommend going through the step-by-step guide below.

User Stories - Let’s define our use case

If you have been following my stories, you will have observed a standard pattern: defining the stories in plain English, writing test cases, and then a step-by-step process.

  • I want to download bank statement CSV files from my Gmail.
  • I want to collate the information received into one single file.

Input Data

It’s a good practice to understand the input data and its little nuances. This helps us write generalized logic.

This is dummy transaction data received from Bank-1 and Bank-2. We can always play around with real datasets.

Bank - 1 Transaction Data
Bank - 2 Transaction Data

As you can observe, both banks follow a predefined structure; the only difference is in the naming convention of the headers. We can manage this in the data pipeline flow.

Please note, in the real world you will rarely see a scenario where the structures are identical. However, we can handle such complex structures using the metadata injection plugin.

Test Cases

Let’s define our test cases for this pipeline.

  • Check the output file format.
  • Cross-validate a couple of line items from the statements of both banks in the output file.

Step-1: Project Set-up

Ideally, it is good practice to set up the bare skeleton of the project first, as I have mentioned in my previous stories. I prefer to store all my projects in a folder named Work.

  1. Input - This is where we will download our statements.
  2. Output - This is where we will store our output file.
  3. Main.kjb - The PDI job file where we will configure and download files from email.
  4. Process-Bank-1-File.ktr & Process-Bank-2-File.ktr - The transformation files where we will process the downloaded CSV files.
Project Folder Structure

If you have any difficulty understanding the above steps or Spoon (the PDI desktop app), then I request you to go through the link below.

Step - 2: Set up of Application Password

Let’s assume that we receive bank statements in our Gmail account. This can be any account; we can read the emails as long as we have the necessary authentication details (such as OAuth credentials or an app password) for it.

We need to set up an application password so that an application can read Gmail without an interactive two-factor prompt. We need to follow the guide provided by Google; below is a screenshot guide. First, click on ‘Manage your Google Account’ under your profile icon.

  1. Click on the Security tab and then on App passwords.
  2. Select ‘Mail’ in the Select app option and select Other (custom name) in Select device.
  3. You will receive a pop-up with an app password. Copy and store it somewhere safe.

Please note, we cannot use our Gmail password to authenticate or access emails via PDI or any other application; we have to create an app password for this.

Step - 3: Let’s read Gmail and Download File

Once the application password is set up, we can download the files from Gmail and perform analysis on them. Below are the steps that we need to follow.

  • Drag Steps/Plugins: In the PDI Main.kjb job file, we need to search for the ‘Start’, ‘Get mails (POP3/IMAP)’, ‘Transformation’ and ‘Success’ steps.
Main job flow
  • Edit ‘Get mails’ step properties: We need to double-click on the ‘Get mails (POP3/IMAP)’ step and edit the properties below.
  1. We need to use either POP3 or IMAP to read and fetch files. You can go through the following link to understand more about them. We need to rename the step to ‘download_bank_statements’.
  2. In Source Host, put imap.gmail.com and check the Use SSL checkbox.
  3. We need to use Port 993; this is specific to IMAP over SSL. You can refer to the link in point #1 to understand the various ports.
  4. Fill in your Username and app Password (do not use your Gmail password; it will not work). Please note, medium.blog@gmail.com is a dummy address.
  5. Set the Target directory to the Input path. Make sure Get mail attachment and Different folder are checked.
  6. In the Settings tab, we need to change Retrieve to the Get all messages option, because we might receive files with the same naming convention daily. Optionally, we can also move the email to another folder such as ‘processed’, or delete it.
  7. In the Filters tab, we need to add a filter on the Received date. I have hard-coded the date here; however, we can easily use a system date variable instead. I will write a separate blog explaining variables in PDI. We can also remove this filter, which will fetch all the emails.
General tab properties
Settings tab properties
Filters tab properties
Sample Inbox Screenshot with Emails
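For readers who prefer scripting, the ‘Get mails’ configuration above can be sketched with Python’s standard imaplib and email modules. This is a hypothetical equivalent of the step, not what PDI runs internally; the function names are my own, and the account, folder and date values are placeholders.

```python
# Hypothetical Python equivalent of the 'Get mails (POP3/IMAP)' step:
# connect to imap.gmail.com over SSL (port 993), filter by received date,
# and save CSV attachments into the Input folder.
import email
import imaplib
import os

def is_csv_attachment(part) -> bool:
    """True if a MIME part is an attachment whose filename ends in .csv."""
    filename = part.get_filename()
    return bool(filename) and filename.lower().endswith(".csv")

def download_statements(user, app_password, target_dir, since="01-Jan-2021"):
    with imaplib.IMAP4_SSL("imap.gmail.com", 993) as mail:
        mail.login(user, app_password)  # app password, NOT the Gmail password
        mail.select("INBOX")
        # SINCE mirrors the Received date filter from the Filters tab.
        _, data = mail.search(None, f'(SINCE "{since}")')
        for num in data[0].split():
            _, msg_data = mail.fetch(num, "(RFC822)")
            msg = email.message_from_bytes(msg_data[0][1])
            for part in msg.walk():
                if is_csv_attachment(part):
                    path = os.path.join(target_dir, part.get_filename())
                    with open(path, "wb") as f:
                        f.write(part.get_payload(decode=True))

if __name__ == "__main__":
    # medium.blog@gmail.com is the dummy address used in the article.
    download_statements("medium.blog@gmail.com", "your-app-password", "Input")
```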

Now that we have configured the email download process, let’s collate both files.

Step - 4: Collate Information

There are various ways of collating information from multiple workbooks. We can either create a generalized flow that caters to multiple types of structures, or create a simple, case-specific transformation. We will use the latter.

If you want to generalize the processing of multiple workbooks, then I recommend going through the guide below.

The Process-Bank-1-File.ktr and Process-Bank-2-File.ktr transformations will be identical; the only difference will be the input file name.

Open Process-Bank-1-File.ktr and drag in the ‘Text file input’ and ‘Microsoft Excel writer’ steps.

  1. In the Text file input step, change the name to input.
  2. In the File tab, browse for ‘D:\Work\AutomateEmail\Input\Bank-1 - Transaction.csv’, assuming the bank observes the same naming convention every day.
  3. In the Content tab, change the Separator to ‘,’ (comma) and the Format to mixed.
  4. In the Fields tab, click on the Get Fields button and change the Name column entries to Column1, Column2, Column3… as shown in the image below.
File tab properties
Fields tab properties
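The renaming to generic Column1..N headers can also be sketched in plain Python. A minimal illustration with the standard csv module (the helper name is my own, not from the article):

```python
# Sketch of what the 'Text file input' step produces: read a bank CSV and
# map its bank-specific headers to generic names Column1, Column2, ... so
# both banks flow through the same downstream logic.
import csv

def read_with_generic_headers(path):
    """Read a comma-separated file, renaming headers to Column1..ColumnN."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # discard the bank-specific header names
        generic = [f"Column{i}" for i in range(1, len(header) + 1)]
        return [dict(zip(generic, row)) for row in reader]
```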

In the ‘Microsoft Excel writer’ plugin, in the File & Sheet tab:
  1. Add the file path of the output file (D:\Work\AutomateEmail\Output\ConsolidatedStatement).
  2. Change the Extension field to xlsx [Excel 2007 and above].
  3. Select Use existing file for writing in the If output file exists field.
  4. Select write to existing sheet in the If sheet exists in output file field.

In the ‘Microsoft Excel writer’ plugin, in the Content tab:
  1. Select shift existing cells down in When writing rows.
  2. Check Start writing at end of sheet (appending lines).
  3. Tick Omit header and click on the Get Fields button.

File & Sheet tab Properties
Content tab properties
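The append-to-existing-sheet behaviour configured above can be approximated in Python with the openpyxl library. This is a hypothetical equivalent, assuming the template workbook already exists with its header row and a sheet named Sheet1:

```python
# Hypothetical equivalent of the 'Microsoft Excel writer' settings above:
# open the existing template, append data rows at the end of the existing
# sheet, and omit the header (it already lives in the template).
from openpyxl import load_workbook

def append_rows(xlsx_path, rows, sheet="Sheet1"):
    """Append data rows at the end of an existing sheet, omitting headers."""
    wb = load_workbook(xlsx_path)
    ws = wb[sheet]
    for row in rows:
        ws.append(row)  # openpyxl writes each row after the last used row
    wb.save(xlsx_path)
```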

Now, we need to duplicate the same transformation and change only the input file name to D:\Work\AutomateEmail\Input\Bank-2 - Transaction.csv.

We need to create an empty Excel (xlsx) file in the Output folder and name it ConsolidatedStatement.xlsx with the headers mentioned below. The idea here is to create a template and overwrite it regularly.
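The one-time template creation can itself be scripted, for example with openpyxl. The header names below are placeholder assumptions; the article’s screenshot defines the real ones:

```python
# One-time creation of the output template described above: an xlsx file
# containing only the header row, which the transformations then append to.
from openpyxl import Workbook

def create_template(path, headers):
    wb = Workbook()
    ws = wb.active
    ws.title = "Sheet1"  # the sheet the transformations write into
    ws.append(headers)
    wb.save(path)

if __name__ == "__main__":
    # Placeholder headers; use the ones from the article's screenshot.
    create_template("ConsolidatedStatement.xlsx", ["Date", "Description", "Amount"])
```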

That’s it, let’s check if it’s working.

Step - 5: Connect, Execute Job and Test

In the Main.kjb file, we need to connect the two transformations in our job file by browsing to each of them sequentially.

All our efforts get realized in this step, where we execute the job and see the output. Let’s see if the pipeline we created works as per our expectations.
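Beyond clicking Run in Spoon, the job can be run unattended with Kitchen, PDI’s command-line job runner, which makes scheduling via cron or Task Scheduler straightforward. A hedged sketch; the install directory and job path are assumptions:

```python
# Hypothetical: invoke Main.kjb with Kitchen from a script. The -file and
# -level options are Kitchen's standard flags; paths are placeholders.
import subprocess

def build_kitchen_command(pdi_home, job_file, level="Basic"):
    """Assemble the Kitchen invocation as an argument list."""
    return [f"{pdi_home}/kitchen.sh", f"-file={job_file}", f"-level={level}"]

if __name__ == "__main__":
    cmd = build_kitchen_command("/opt/data-integration",
                                "/home/user/Work/AutomateEmail/Main.kjb")
    subprocess.run(cmd, check=True)  # assumes PDI is installed at pdi_home
```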

Success

Here, we can run the test cases defined earlier and analyze the file as per our requirements.
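The two test cases can even be scripted rather than checked by eye. A minimal sketch; the helper name and row shapes are my own, not from the article:

```python
# Scripted version of the earlier test cases: confirm the output file
# format (xlsx) and cross-validate that a couple of sampled line items
# from the bank statements appear in the consolidated output rows.
def validate_output(path, sampled_rows, output_rows):
    """True if path is an .xlsx file and every sampled row appears in it."""
    is_xlsx = path.lower().endswith(".xlsx")
    return is_xlsx and all(row in output_rows for row in sampled_rows)
```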

Conclusion

There are a few nuances of the above data pipeline that can be tweaked and probably generalized to cater to multiple structures and files. Since I am cherry-picking the use case for convenience, this may or may not match a real-world situation; however, you can use it as a platform to build on. We can perform a lot of similar activities using PDI. Next, we will take up a case on how to send email notifications using PDI.

Please feel free to ask questions in the comment section.

See you in the next post. Happy ETL
