
Extracting Data from Excel and Outlook Files with Java

Code Implementation + Use-Cases & Situational Constraints

For over a year, the Coronavirus outbreak has disrupted workplace operations at both the industry and individual level. For me, working in the healthcare sector meant witnessing first-hand the existing cracks in the system. More specifically, the inefficient flow of information and the lack of a stable framework for data management, coupled with scarce technical expertise within the workplace, shaped the technical solution I eventually implemented to semi-automate some of the workflows.

Photo by Mika Baumeister on Unsplash

For the technical part of this article, I’ll be sharing the Java code snippets I implemented, which can also be found in the GitHub repository data-extraction-with-Java. Sample input data and outputs are available in the same repo.

Disclaimer: All data used in this article is dummy data to illustrate the functionalities of the application.


For starters, the most immediate issue was reducing the manual compilation of data from incoming Excel files and from data tables embedded in daily Outlook email feeds. Hence, the final implementation needed to let users extract the required datasets from separate Excel and Outlook files into a single master Excel spreadsheet. After some consideration, I decided to build the application in Java:

Image by Author | List of Java libraries used to read & write to Microsoft Excel File(s) | Note that the specific version of [Apache POI](https://poi.apache.org/) used is 3.15. Different versions of Apache POI require different versions of its dependencies.
Image by Author | List of Java libraries used to parse content from Microsoft Outlook File(s)

Finally, the application’s interface was built with Java Swing, Java’s built-in Graphical User Interface (GUI) library, and the whole application was compiled and packaged as a runnable JAR.
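To give a feel for the Swing side, here is a minimal sketch of a multi-file picker. It is not the code from the repository: the class name `FilePickerSketch` and the assumption that Outlook messages are saved with a `.msg` extension are mine, for illustration only.

```java
import java.awt.GraphicsEnvironment;
import java.io.File;
import javax.swing.JFileChooser;
import javax.swing.JOptionPane;
import javax.swing.SwingUtilities;
import javax.swing.filechooser.FileNameExtensionFilter;

// Hypothetical class name; not taken from the original application.
class FilePickerSketch {
    // Assumption: Outlook messages are saved with the .msg extension.
    static final FileNameExtensionFilter OUTLOOK_FILTER =
            new FileNameExtensionFilter("Outlook messages (*.msg)", "msg");

    // Opens a multi-select dialog; returns an empty array if the user cancels.
    static File[] chooseOutlookFiles() {
        JFileChooser chooser = new JFileChooser();
        chooser.setMultiSelectionEnabled(true);
        chooser.setFileFilter(OUTLOOK_FILTER);
        int result = chooser.showOpenDialog(null);
        return result == JFileChooser.APPROVE_OPTION
                ? chooser.getSelectedFiles()
                : new File[0];
    }

    public static void main(String[] args) {
        // Dialogs cannot be shown on a machine without a display.
        if (GraphicsEnvironment.isHeadless()) {
            System.out.println("No display available; skipping the dialog.");
            return;
        }
        // Swing components should be created on the event dispatch thread.
        SwingUtilities.invokeLater(() -> {
            File[] files = chooseOutlookFiles();
            JOptionPane.showMessageDialog(null, files.length + " file(s) selected");
        });
    }
}
```

When a class like this is set as the `Main-Class` in the JAR manifest, the packaged JAR can open its window directly instead of relying on console input.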

Image by Author | Preview of final application deliverable which interacts with users via Java's in-built GUI library, Java Swing. The image showcases the extraction of Outlook files to CSV format - (1) Multiple email files are selected; (2) After processing the Outlook files, application prompts user to save output in a ZIP archive; and (3) File output auto-opens after it is saved successfully.
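The "save output in a ZIP archive" step can be sketched with the standard `java.util.zip` package alone. This is a simplified illustration rather than the repository's code: the class name `CsvZipWriterSketch`, the entry names and the dummy CSV text are all assumptions.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical class name; not taken from the original application.
class CsvZipWriterSketch {
    // Writes each (entry name -> CSV text) pair as one file inside the archive.
    static void writeZip(Path zipFile, Map<String, String> csvByName) throws IOException {
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Map.Entry<String, String> e : csvByName.entrySet()) {
                zip.putNextEntry(new ZipEntry(e.getKey()));
                zip.write(e.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Dummy data standing in for tables extracted from two emails.
        Map<String, String> csvByName = new LinkedHashMap<>();
        csvByName.put("email_1.csv", "Name,Count\nWard A,3\n");
        csvByName.put("email_2.csv", "Name,Count\nWard B,5\n");

        Path target = Files.createTempFile("outlook_extract_", ".zip");
        writeZip(target, csvByName);
        System.out.println("Saved " + target);
    }
}
```

For the auto-open behaviour in step (3), one option on desktop systems is `java.awt.Desktop.getDesktop().open(file)`, guarded by `Desktop.isDesktopSupported()`. Whether the original app does it this way is not something I can confirm from the article.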

To consolidate every archive generated by the app, a separate module reads in all the ZIP files selected by the user and combines them into a single Excel file (a sample final output can be found at ZipToExcel_(15-Aug-2021_0234PM).xlsx).

Image by Author | Demo of consolidating all data into a single excel file output
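The reading half of that consolidation step might look roughly like the sketch below. It uses only the standard library and stops at in-memory rows. The class name `ZipConsolidatorSketch` is hypothetical, and the naive comma split assumes no quoted commas in the CSV data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical class name; not taken from the original application.
class ZipConsolidatorSketch {
    // Collects every CSV line from every archive; column 0 holds the source entry name.
    static List<String[]> consolidate(List<Path> zipFiles) throws IOException {
        List<String[]> rows = new ArrayList<>();
        for (Path zipFile : zipFiles) {
            try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(zipFile))) {
                ZipEntry entry;
                while ((entry = zip.getNextEntry()) != null) {
                    String name = entry.getName();
                    if (entry.isDirectory() || !name.toLowerCase().endsWith(".csv")) {
                        continue;
                    }
                    // Not closed on purpose: closing the reader would close the whole ZIP stream.
                    BufferedReader reader = new BufferedReader(
                            new InputStreamReader(zip, StandardCharsets.UTF_8));
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (line.isEmpty()) {
                            continue;
                        }
                        // Assumption: plain CSV without quoted commas.
                        String[] cells = line.split(",", -1);
                        String[] row = new String[cells.length + 1];
                        row[0] = name;
                        System.arraycopy(cells, 0, row, 1, cells.length);
                        rows.add(row);
                    }
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        List<Path> inputs = new ArrayList<>();
        for (String arg : args) {
            inputs.add(Paths.get(arg));
        }
        System.out.println(consolidate(inputs).size() + " row(s) collected");
    }
}
```

From here, each `String[]` could be written to a worksheet with Apache POI (for example via `Sheet.createRow` and `Cell.setCellValue`) to produce the single master spreadsheet.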

FYI: If anyone would like a copy of the above app, feel free to retrieve the runnable JAR from the same GitHub repository.


⚠ Summary of Situational Constraints Faced

Arguably, the greatest challenge I faced during this entire saga was communicating with business users and working under unprecedented constraints. While misalignment in perspectives between the IT developer and the business user is nothing new, I believe it is still worth sharing the real-world hurdles I had to face:

I. Limited platforms to run technical implementations – While most ETL processes are ideally implemented in Python, which is lightweight and has far more data-centric libraries than Java, this is only practical if the necessary setup is already in place.

Unfortunately, that just isn’t the case in many companies at this point in time (much less a public healthcare company); it applies only to workplaces where a well-established technical culture has long been present and consistently reinforced.

There is a bitter irony here: the places most in need of technological enhancements to improve their work efficiency are often the ones that not only lack technical facilities but also have many staff members who are unreceptive to "technological solutions", which in turn leads to many unpleasant ripple effects. To be fair, the majority of these workers are definitely not lacking in aptitude or attitude. In industries such as public healthcare, the short turnaround times and urgent nature of their tasks reinforce a natural inclination to be risk-averse, so they fall back on conventional (and usually manual) ways of doing things.

The stringent requirement for minimal set-up and installation made Java the most instinctive choice of programming language, thanks to its platform independence.

II. Acceptance that business users have limited technical awareness – While I initially did not intend to build a GUI for the app, it quickly became clear that a handful of users had trouble running the following command to execute the application:

java -jar <name of jar file>

as with any other packaged Java application. The rather surprising revelation that 40% of users had trouble opening their Windows CMD/Shell Terminals/PowerShell Terminal hit me hard 😨. Therefore, to avoid this issue entirely, I built a graphical interface so that the application needs only a double-click of the mouse 🖱️ to run.

III. Overall opinions on adopting a sustainable mindset towards constant improvement – From my interactions with various groups of users, the common denominator for any real progress is, without a doubt, managing user expectations and engaging users iteratively (corny, I know). More specifically, any technological change to a business process should be rolled out in phases for good reason: an abrupt and drastic change to a user’s way of working tends to meet greater resistance to any proposed solution because it feels unfamiliar. While the developer naturally brainstorms how to "get things done better", business users (at least in some of my encounters) are more inclined towards "as long as it doesn’t create more work for me, it is fine". Although continuously improving the efficiency of current work processes in the public healthcare sector (or rather all sectors, as a matter of fact) is no doubt a long and arduous journey, a little bit of patience does go a long way.

Many thanks for reading and feel free to check out the original source code and application at my GitHub! ❤

Link at: incubated-geek-cc/data-extraction-with-Java: A Java application built with Java Swing and other jar libraries to extract data from Outlook and Excel files. (github.com)

