Creating an Email Parser with Python and SQL

Boost your productivity by automatically extracting data from your emails

Photo by Solen Feyissa on Unsplash

Huh, what’s that? An email … parser?

You might be wondering what an email parser is, and why you might need one.

In short, an email parser is a piece of software that looks for and extracts data from inbound emails and attachments. More importantly, an email parser uses conditional processing to pull out the specific data that matters to you.

So why does this matter? If you’ve ever spent any time working a regular office job, you’ve probably become intimately familiar with reports, and by extension, copy-pasting lines of text from Microsoft Outlook to Excel or Word.

You might even end up doing the same report, week after week. Add in formatting and spellchecking, and this ends up as a huge time drain when you could be focusing on more important tasks.

The good news is that you can automate most of this process with Python and SQL.

In this post, I’ll cover how to open Outlook emails with Python and extract the body text as HTML. I’ll then cover how to parse this in Python and how to upload the final data to a SQL database. From there, you can write this data to Excel or transform it into a Pandas Dataframe.


Getting started

We’ll be using a few key Python libraries here, namely os, sqlite3 and pywin32, along with re, html and tkinter from the standard library.
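
For reference, the imports used throughout this post look something like this (re, html and tkinter ship with the standard library; pywin32 is installed separately, e.g. with pip install pywin32):

import os
import re
import html
import sqlite3

import win32com.client                       # provided by the pywin32 package
from tkinter.filedialog import askdirectory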

To start off, we’ll first need to decide what we want to extract from our emails. For example, let’s say we have a bunch of emails that each contain a list of news articles like this:

Let’s then say that we want to extract the header of each bullet point, which includes the title, the publication, media platforms, and URL links. In short, we want to take the entire header of each bullet point, then break it down into four different parts.

Sample header that we want to extract text from

Our pseudocode so far should look something like this:

1. Create list of emails that we want to parse
2. Open first email
3. Iterate over each bullet point
4. Extract data from bullet point
5. Upload data from bullet point to a database
6. Repeat until all data is parsed, then move to next email
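
Translated into Python, that outline maps onto a loop roughly like the one below. This is only a sketch: open_email, extract_bullets, parse_bullet and save_to_db are hypothetical placeholder names for the steps we’ll build out in the rest of this post.

# Rough sketch of the overall flow; the helper names are placeholders
for email_file in email_list:                # 1-2. loop over our list of emails
    msg = open_email(email_file)             #      and open each one
    for bullet in extract_bullets(msg):      # 3. iterate over each bullet point
        data = parse_bullet(bullet)          # 4. extract data from the bullet point
        save_to_db(data)                     # 5. upload the data to our database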

Creating a SQL Database

Before parsing our emails, we’ll first want to set up a SQL database with Python. We’ll do this by establishing a connection to the SQLite database with a connection object that we’ll call db.

# Create & connect to database
db = sqlite3.connect("emails.db")

If it doesn’t already exist, a new database will be created as emails.db. We can then create tables in our database that our email parser can write to later on.

# Create empty tables
db.execute("""
CREATE TABLE IF NOT EXISTS "articles" (
    "id" INTEGER,
    "title" TEXT UNIQUE,
    "publication" TEXT,
    PRIMARY KEY("id" AUTOINCREMENT)
);
""")

db.execute("""
CREATE TABLE IF NOT EXISTS "links" (
    "article_id" INTEGER,
    "link0" TEXT,
    "link1" TEXT,
    "link2" TEXT,
    PRIMARY KEY("article_id")
);
""")

db.execute("""
CREATE TABLE IF NOT EXISTS "platforms" (
    "article_id" INTEGER,
    "platform0" TEXT,
    "platform1" TEXT,
    "platform2" TEXT,
    PRIMARY KEY("article_id")
);
""")

In essence, we’re creating three tables, where our main table is ‘articles’, which has a one-to-many relationship with ‘platforms’ and ‘links’. In other words, this reflects how one article can have many different platforms and links.

Our database schema
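
To make that relationship concrete, here’s a quick illustration (not part of the parser itself) of how you could stitch an article back together with its platforms and links once the tables are filled in later on:

# Join the three tables on the article id
rows = db.execute("""
SELECT a.id, a.title, a.publication, p.platform0, l.link0
FROM articles a
LEFT JOIN platforms p ON p.article_id = a.id
LEFT JOIN links l ON l.article_id = a.id
""").fetchall()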

Accessing your Emails in Python

You’ll want to move the emails that you want to parse from Outlook to a folder. The simplest method to do this is by dragging and dropping.

Demonstration of the drag-and-drop method

Next, create a variable storing the folder path of your emails. You can do this manually e.g. folder_path = r'C:\Users\Username\EmailFolder' or with tkinter and os, which will generate a file explorer prompt to select a folder.

# Create a folder input dialog with tkinter
folder_path = os.path.normpath(askdirectory(title='Select Folder'))

Obtaining our folder path with tkinter

Here, we’re using a file input prompt created with tkinter to save our folder path, then normalizing the path with os to remove any redundant separators.
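
If you’re curious what that normalization step actually does, a quick check in the interpreter (on Windows) looks roughly like this:

import os

# normpath collapses redundant separators and, on Windows,
# converts forward slashes to backslashes
print(os.path.normpath("C:/Users//Username/EmailFolder/"))
# C:\Users\Username\EmailFolder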

We’ll then want to obtain the file name of each email. We can do this with os.listdir(), which gives a list of all the files in the specified directory.

# Initialise & populate list of emails
email_list = [file for file in os.listdir(folder_path) if file.endswith(".msg")]

This will save the file name of each email in a list that we can access later.

Connecting Outlook to Python

Next, we’ll want to create an object that allows us to control Outlook from Python. This is enabled through the pywin32 library, which connects Python to Outlook via the Messaging Application Programming Interface (MAPI).

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

With this, we can begin to open each item as an HTML object and use regular expressions (Regex) to extract the body text of each email.

While conventional wisdom dictates that you shouldn’t use Regex to parse HTML, we’re not worried about that here, as we’re only looking to extract very specific text snippets from a standard email format (some commercial email parsers, like Parseur, are heavily built around Regex).

From this point, Regex can be used to narrow down the specific data that you want to extract.

# Iterate through every email
for i, _ in enumerate(email_list):

    # Create variable storing info from current email being parsed
    msg = outlook.OpenSharedItem(os.path.join(folder_path, email_list[i]))

    # Search email HTML for body text
    regex = re.search(r"<body([\s\S]*)</body>", msg.HTMLBody)
    body = regex.group()

Parsing Email HTML with Python

This is how the first bullet point of our email might look as HTML:

The HTML view of our email snippet

Okay – so we can see a key characteristic here: each entry lives in a bulleted list item, marked by li class=MsoListParagraph. We can use Regex to extract each bullet point.

# Search email body text for unique entries
pattern = r"li class=MsoListParagraph([\s\S]*?)</li>"
results = re.findall(pattern, body)

Each bullet point is extracted as a string, and each string is stored in a list. Our first bullet point should look something like this with Regex:

Narrowing down our HTML body text with Regex (Regex101.com)

To retrieve our title and publication, we can use Regex again. This time, we’ll also call html.unescape() on our text to translate HTML entities back into plain characters, e.g. &#8211; → – (a Unicode dash).

regex = re.search(r"[^<>]+(?=(\||sans-serif'>)([\s\S]*?)</span>)", header)

# HTML unescape to remove remaining HTML entities
title_pub = html.unescape(regex.group())

Our regex returns the highlighted text as the variable above

From here, it’s as simple as splitting our text. We can use split_list = title_pub.split("–") to give us a list: ["New Arrival: Dell G Series Gaming Computers", "Tech4tea"].

We can then remove any redundant whitespaces and save each item as a variable.

title = split_list[0].strip()
publication = split_list[1].strip()

That’s two down!

To get our media platforms, we’ll use a more straightforward method.

# List of platforms to check for
platform_list = ["Online", "Facebook", "Instagram", "Twitter", "LinkedIn", "Youtube"]

# Create empty list to store platforms
platform = []

# Check whether each platform appears in the header
for p in platform_list:
   if p in header:
      platform.append(p)

This will give us a list of platforms: ["Online", "Facebook", "LinkedIn"]

Now for the URLs:

# Find all links using regex
links = re.findall(r'<a href="([\s\S]*?)">', header)

This will then give us the characters highlighted in green below:

Pretty neat, right?

Our data so far should look something like this:

Title: New Arrival: Dell G Series Gaming Computers
Publication: Tech4tea
Platform: ['Online', 'Facebook', 'LinkedIn']
Links: ['http://tech4tea.com/blog/2020/06/26/new-arrival-dell-g-series-gaming-computers-monitors-keyboards/', 'https://business.facebook.com/gotech4tea/posts/4598490146843826', 'https://www.linkedin.com/feed/update/urn:li:activity:6682511823100542976/']

Uploading data to a SQL database

The final step in this process is to upload each piece of data to our SQL database.

We’ll start by uploading our title and publication data. This can be accomplished with the following code:

# Insert title & publication by substituting values into each ? placeholder
db.execute("INSERT INTO articles (title, publication) VALUES (?, ?)", (title, publication))

Uploading our links and platforms is a bit trickier. First, we’ll look up the primary id of the article we just inserted, create matching rows in our ‘platforms’ and ‘links’ tables, and then fill in each platform and link individually.

# Get article id and copy to platforms & links tables
article_id = db.execute("SELECT id FROM articles WHERE title = ?", (title,))

for item in article_id:
    _id = item[0]

# Create a matching row in each child table for this article
db.execute("INSERT OR IGNORE INTO platforms (article_id) VALUES (?)", (_id,))
db.execute("INSERT OR IGNORE INTO links (article_id) VALUES (?)", (_id,))

for i, _ in enumerate(platform):
    db.execute(f"UPDATE platforms SET platform{i} = ? WHERE article_id = ?", (platform[i], _id))

for i, _ in enumerate(links):
    db.execute(f"UPDATE links SET link{i} = ? WHERE article_id = ?", (links[i], _id))

db.commit()

The last step here is to commit all these changes to the database. With that done, our email parser is complete! If you’d like, you can use something like DB Browser to check that the contents of your database have been successfully updated.

Viewing database with DB Browser
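
Alternatively, you can read the tables straight back into Python. For example, assuming you have pandas installed, something like this pulls the articles table into a DataFrame and, if you like, writes it out to Excel (the Excel/DataFrame step mentioned at the start):

import pandas as pd

# Read the articles table back into a DataFrame to verify the upload
df = pd.read_sql_query("SELECT * FROM articles", db)
print(df.head())

# Optionally export to Excel (requires the openpyxl package)
df.to_excel("articles.xlsx", index=False)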

In case you need it, I’ve uploaded the full code for this on my website and GitHub.

