How to Automate Excel Files Received on Email - Increase your Productivity

A step-by-step guide to downloading files from email and processing them as per our requirements

Shravankumar Suvarna
Towards Data Science


We receive tons of emails daily: some spam, some promotional, some important, some extremely important (like my story notifications). Managing them is a tedious process. Let’s ease our daily work by outsourcing it to computers.

Let’s assume multiple banks are sending us a daily transaction summary via email and we need to collate them into one single Excel file to perform cash-flow analysis. There are a lot of similar use cases: invoices from multiple vendors, MIS (Management Information System) reports from multiple employees, job offer emails, multiple news articles, and so on.

Photo by Webaroo on Unsplash

Outsourcing this activity will not only ease our work but also increase our productivity by lowering the manual error rate. We will use a powerful ETL tool, Pentaho Data Integration (Kettle), for this automation. We don’t have to write a single line of code to perform it. Yes - No Code!

In case you have not installed PDI on your system, I recommend going through the step-by-step guide below.

User Stories - Let’s define our use case

If you have been following my stories, you will have observed a standard pattern: defining the stories in plain English, writing test cases, and then a step-by-step process.

  • I want to download bank statement CSV files from my Gmail.
  • I want to collate the information received into one single file.

Input Data

It’s a good practice to understand the input data and its little nuances. This helps us write generalized logic.

This is dummy transaction data received from Bank-1 and Bank-2. We can always play around with real datasets.

Bank - 1 Transaction Data
Bank - 2 Transaction Data

As you can observe, both banks follow a predefined structure; the only difference is in the naming convention of the headers. We can manage this in the data pipeline flow.

Please note, in the real world you will rarely see a scenario where the structures are identical. However, we can handle such complex structures using the metadata injection plugin.

Test Cases

Let’s define our test cases for this pipeline.

  • Check the output file format.
  • Cross-validate a couple of line items from the statements of both banks in the output file.

Step-1: Project Set-up

Ideally, it is good practice to set up the bare skeleton of the project first, as I have mentioned in my previous stories. I prefer to store all my projects in a folder named Work.

  1. Input - This is where we will download our statements.
  2. Output - This is where we will store our output file.
  3. Main.kjb - The PDI job file where we will configure and download files from email.
  4. Process-Bank-1-File.ktr & Process-Bank-2-File.ktr - The transformation files where we will process the downloaded CSV files.
Project Folder Structure

If you have any difficulty understanding the above steps or Spoon (the PDI desktop app), then I request you to go through the link below.

Step - 2: Set up of Application Password

Let’s assume that we receive bank statements in our Gmail account. This can be any account; we can read the emails as long as we have the necessary authentication details (such as OAuth credentials or an app password) for it.

We need to set up an application password so that an application can read Gmail without an interactive two-factor prompt. We need to follow the guide provided by Google; below is a screenshot guide. First, click on ‘Manage your Google Account’ under your profile icon.

  1. Click on the Security tab and then on App passwords.
  2. Select ‘Mail’ in the Select app option and select Other (custom name) in Select device.
  3. You will receive a pop-up with an app password. Copy and store it somewhere safe.

Please note, we cannot use our Gmail password to authenticate or access emails via PDI or any other application; we have to create an app password for this.

Step - 3: Let’s read Gmail and Download File

Once the application password is set up, we can download the files from Gmail and perform analysis on them. Below are the steps that we need to follow.

  • Drag Steps/Plugins: In the PDI Main.kjb job file, we need to search for the ‘Start’, ‘Get mails (POP3/IMAP)’, ‘Transformation’ and ‘Success’ steps.
Main job flow
  • Edit ‘Get mails’ step properties: We need to double-click on the ‘Get mails (POP3/IMAP)’ step and edit the properties below.
  1. We need to use either POP3 or IMAP to read and fetch files. You can go through the following link to understand more about them. We need to rename the step to ‘download_bank_statements’.
  2. In Source Host, put imap.gmail.com and check the Use SSL checkbox.
  3. We need to use Port 993; this is specific to IMAP over SSL. You can refer to the link in point #1 to understand the various ports.
  4. Fill in your Username and app Password (do not use your Gmail password; it will not work). Please note, medium.blog@gmail.com is a dummy address.
  5. Set the Target directory to the Input path. Make sure Get mail attachment and Different folder are checked.
  6. In the Settings tab, we need to change Retrieve to the Get all messages option, because we might receive files with the same naming convention daily. Optionally, we can also move the email to another folder such as ‘processed’, or delete it.
  7. In the Filters tab, we need to add a filter on the Received date. I have hard-coded the date here; however, we can easily use a system date variable instead. I will write a separate blog explaining variables in PDI. We can also remove this filter, which will fetch all the emails.
General tab properties
Settings tab properties
Filters tab properties
Sample Inbox Screenshot with Emails
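For readers who prefer scripting, the ‘Get mails’ configuration above can be sketched with Python’s standard imaplib and email modules. This is a hypothetical equivalent of the step, not what PDI runs internally; the function names are my own, and the account, folder and date values are placeholders.

```python
# Hypothetical Python equivalent of the 'Get mails (POP3/IMAP)' step:
# connect to imap.gmail.com over SSL (port 993), filter by received date,
# and save CSV attachments into the Input folder.
import email
import imaplib
import os

def is_csv_attachment(part) -> bool:
    """True if a MIME part is an attachment whose filename ends in .csv."""
    filename = part.get_filename()
    return bool(filename) and filename.lower().endswith(".csv")

def download_statements(user, app_password, target_dir, since="01-Jan-2021"):
    with imaplib.IMAP4_SSL("imap.gmail.com", 993) as mail:
        mail.login(user, app_password)  # app password, NOT the Gmail password
        mail.select("INBOX")
        # SINCE mirrors the Received date filter from the Filters tab.
        _, data = mail.search(None, f'(SINCE "{since}")')
        for num in data[0].split():
            _, msg_data = mail.fetch(num, "(RFC822)")
            msg = email.message_from_bytes(msg_data[0][1])
            for part in msg.walk():
                if is_csv_attachment(part):
                    path = os.path.join(target_dir, part.get_filename())
                    with open(path, "wb") as f:
                        f.write(part.get_payload(decode=True))

if __name__ == "__main__":
    # medium.blog@gmail.com is the dummy address used in the article.
    download_statements("medium.blog@gmail.com", "your-app-password", "Input")
```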

Now that we have configured the email download process, let’s collate both files.

Step - 4: Collate Information

There are various ways of collating information from multiple workbooks. We can either create a generalized flow that caters to multiple types of structures, or create a simple, case-specific transformation. We will use the latter.

If you want to generalize the processing of multiple workbooks, then I recommend going through the guide below.

The Process-Bank-1-File.ktr and Process-Bank-2-File.ktr transformations will be identical; the only difference will be the input file name.

Open Process-Bank-1-File.ktr and drag in the ‘Text file input’ and ‘Microsoft Excel writer’ steps.

  1. In the Text file input step, change the name to input.
  2. In the File tab, browse for ‘D:\Work\AutomateEmail\Input\Bank-1 - Transaction.csv’, assuming the bank observes the same naming convention every day.
  3. In the Content tab, change the Separator to ‘,’ (comma) and the Format to mixed.
  4. In the Fields tab, click on the Get Fields button and change the Name column entries to Column1, Column2, Column3… as shown in the image below.
File tab properties
Fields tab properties
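The renaming to generic Column1..N headers can also be sketched in plain Python. A minimal illustration with the standard csv module (the helper name is my own, not from the article):

```python
# Sketch of what the 'Text file input' step produces: read a bank CSV and
# map its bank-specific headers to generic names Column1, Column2, ... so
# both banks flow through the same downstream logic.
import csv

def read_with_generic_headers(path):
    """Read a comma-separated file, renaming headers to Column1..ColumnN."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # discard the bank-specific header names
        generic = [f"Column{i}" for i in range(1, len(header) + 1)]
        return [dict(zip(generic, row)) for row in reader]
```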

In the ‘Microsoft Excel writer’ plugin, in the File & Sheet tab:
  1. Add the file path of the output file (D:\Work\AutomateEmail\Output\ConsolidatedStatement).
  2. Change the Extension field to xlsx [Excel 2007 and above].
  3. Select Use existing file for writing in the If output file exists field.
  4. Select write to existing sheet in the If sheet exists in output file field.

In the ‘Microsoft Excel writer’ plugin, in the Content tab:
  1. Select shift existing cells down in When writing rows.
  2. Check Start writing at end of sheet (appending lines).
  3. Tick Omit header and click on the Get Fields button.

File & Sheet tab Properties
Content tab properties
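The append-to-existing-sheet behaviour configured above can be approximated in Python with the openpyxl library. This is a hypothetical equivalent, assuming the template workbook already exists with its header row and a sheet named Sheet1:

```python
# Hypothetical equivalent of the 'Microsoft Excel writer' settings above:
# open the existing template, append data rows at the end of the existing
# sheet, and omit the header (it already lives in the template).
from openpyxl import load_workbook

def append_rows(xlsx_path, rows, sheet="Sheet1"):
    """Append data rows at the end of an existing sheet, omitting headers."""
    wb = load_workbook(xlsx_path)
    ws = wb[sheet]
    for row in rows:
        ws.append(row)  # openpyxl writes each row after the last used row
    wb.save(xlsx_path)
```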

Now, we need to duplicate the same transformation and change only the input file name to D:\Work\AutomateEmail\Input\Bank-2 - Transaction.csv.

We need to create an empty Excel (xlsx) file in the Output folder and name it ConsolidatedStatement.xlsx with the headers mentioned below. The idea here is to create a template and overwrite it regularly.
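The one-time template creation can itself be scripted, for example with openpyxl. The header names below are placeholder assumptions; the article’s screenshot defines the real ones:

```python
# One-time creation of the output template described above: an xlsx file
# containing only the header row, which the transformations then append to.
from openpyxl import Workbook

def create_template(path, headers):
    wb = Workbook()
    ws = wb.active
    ws.title = "Sheet1"  # the sheet the transformations write into
    ws.append(headers)
    wb.save(path)

if __name__ == "__main__":
    # Placeholder headers; use the ones from the article's screenshot.
    create_template("ConsolidatedStatement.xlsx", ["Date", "Description", "Amount"])
```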

That’s it, let’s check if it’s working.

Step - 5: Connect, Execute Job and Test

In the Main.kjb file, we need to connect the two transformations in our job file by browsing to each of them sequentially.

All our efforts get realized in this step, where we execute the job and see the output. Let’s see if the pipeline we created works as per our expectations.
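Beyond clicking Run in Spoon, the job can be run unattended with Kitchen, PDI’s command-line job runner, which makes scheduling via cron or Task Scheduler straightforward. A hedged sketch; the install directory and job path are assumptions:

```python
# Hypothetical: invoke Main.kjb with Kitchen from a script. The -file and
# -level options are Kitchen's standard flags; paths are placeholders.
import subprocess

def build_kitchen_command(pdi_home, job_file, level="Basic"):
    """Assemble the Kitchen invocation as an argument list."""
    return [f"{pdi_home}/kitchen.sh", f"-file={job_file}", f"-level={level}"]

if __name__ == "__main__":
    cmd = build_kitchen_command("/opt/data-integration",
                                "/home/user/Work/AutomateEmail/Main.kjb")
    subprocess.run(cmd, check=True)  # assumes PDI is installed at pdi_home
```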

Success

Here, we can run the test cases defined earlier and analyze the file as per our requirements.
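The two test cases can even be scripted rather than checked by eye. A minimal sketch; the helper name and row shapes are my own, not from the article:

```python
# Scripted version of the earlier test cases: confirm the output file
# format (xlsx) and cross-validate that a couple of sampled line items
# from the bank statements appear in the consolidated output rows.
def validate_output(path, sampled_rows, output_rows):
    """True if path is an .xlsx file and every sampled row appears in it."""
    is_xlsx = path.lower().endswith(".xlsx")
    return is_xlsx and all(row in output_rows for row in sampled_rows)
```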

Conclusion

There are a few nuances of the above data pipeline that can be tweaked and probably generalized to cater to multiple structures and files. Since I am cherry-picking the use case for convenience, this may or may not match a real-world situation; however, you can use it as a platform to build on. We can perform a lot of similar activities using PDI. Next, we will take up a case on how to send email notifications using PDI.

Please feel free to ask questions in the comment section.

See you in the next post. Happy ETL
