
Huh, what’s that? An email … parser?
You might be wondering what an email parser is, and why you might need one.
In short, an email parser is software that looks for and extracts data from inbound emails and attachments. More importantly, an email parser uses conditional processing to pull the specific data that matters to you.
So why does this matter? If you’ve ever spent any time working a regular office job, you’ve probably become intimately familiar with reports, and by extension, copy-pasting lines of text from Microsoft Outlook to Excel or Word.
You might even end up doing the same report, week after week. Add in formatting and spellchecking, and this ends up as a huge time drain when you could be focusing on more important tasks.
The good news is that you can automate most of this process with Python and SQL.
In this post, I’ll cover how to open Outlook emails with Python and extract the body text as HTML. I’ll then cover how to parse this in Python and how to upload the final data to a SQL database. From there, you can write this data to Excel or transform it into a Pandas Dataframe.
Getting started
We’ll be using a few key Python libraries here, namely os, sqlite3 and pywin32.
To start off, we’ll first need to decide what we want to extract from our emails. For example, let’s say we have a bunch of emails that each contain a list of news articles like this:

Let’s then say that we want to extract the header of each bullet point, which includes the title, the publication, media platforms, and URL links. In short, we want to take the entire header of each bullet point, then break it down into four different parts.

Our pseudocode so far should look something like this:
1. Create list of emails that we want to parse
2. Open first email
3. Iterate over each bullet point
4. Extract data from bullet point
5. Upload data from bullet point to a database
6. Repeat until all data is parsed, then move to next email
Creating a SQL Database
Before parsing our emails, we’ll first want to set up a SQL database with Python. We’ll do this by establishing a connection to the SQLite database with a connection object that we’ll call db.
# Create & connect to database
import sqlite3

db = sqlite3.connect("emails.db")
If it doesn’t already exist, a new database will be created as emails.db. We can then create tables in our database that our email parser can write to later on.
# Create empty tables
db.execute("""
CREATE TABLE IF NOT EXISTS "articles" (
    "id" INTEGER,
    "title" TEXT UNIQUE,
    "publication" TEXT,
    PRIMARY KEY("id" AUTOINCREMENT)
);
""")
db.execute("""
CREATE TABLE IF NOT EXISTS "links" (
    "article_id" INTEGER,
    "link0" TEXT,
    "link1" TEXT,
    "link2" TEXT,
    PRIMARY KEY("article_id")
);
""")
db.execute("""
CREATE TABLE IF NOT EXISTS "platforms" (
    "article_id" INTEGER,
    "platform0" TEXT,
    "platform1" TEXT,
    "platform2" TEXT,
    PRIMARY KEY("article_id")
);
""")
In essence, we’re creating three tables, where our main table is ‘articles’, which has a one-to-many relationship with ‘platforms’ and ‘links’. In other words, this reflects how one article can have many different platforms and links.
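To see how the tables relate, here’s a minimal, self-contained sketch using an in-memory database instead of emails.db (the article and platform values are made up for the demo):

```python
import sqlite3

# Use an in-memory database so this sketch doesn't touch emails.db
db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE IF NOT EXISTS "articles" (
    "id" INTEGER,
    "title" TEXT UNIQUE,
    "publication" TEXT,
    PRIMARY KEY("id" AUTOINCREMENT)
);
""")
db.execute("""
CREATE TABLE IF NOT EXISTS "platforms" (
    "article_id" INTEGER,
    "platform0" TEXT,
    "platform1" TEXT,
    "platform2" TEXT,
    PRIMARY KEY("article_id")
);
""")

# One article row...
db.execute("INSERT INTO articles (title, publication) VALUES (?, ?)",
           ("Sample article", "Sample publication"))
# ...linked to several platforms through article_id
db.execute("INSERT INTO platforms (article_id, platform0, platform1) VALUES (?, ?, ?)",
           (1, "Online", "Facebook"))

# Join the two tables on the shared id
row = db.execute("""
SELECT a.title, p.platform0, p.platform1
FROM articles a JOIN platforms p ON a.id = p.article_id
""").fetchone()
print(row)  # ('Sample article', 'Online', 'Facebook')
```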

Accessing your Emails in Python
You’ll want to move the emails that you want to parse from Outlook to a folder. The simplest method to do this is by dragging and dropping.

Next, create a variable storing the folder path of your emails. You can do this manually, e.g. folder_path = r'C:\Users\Username\EmailFolder', or with tkinter and os, which will generate a file explorer prompt to select a folder.
# Create a folder input dialog with tkinter
import os
from tkinter.filedialog import askdirectory

folder_path = os.path.normpath(askdirectory(title='Select Folder'))

Here, we’re using a file input prompt created with tkinter to save our folder path, then normalizing the path with os to remove any redundant separators.
We’ll then want to obtain the file name of each email. We can do this with os.listdir(), which gives a list of all files in the specified directory.
# Initialise & populate list of emails
email_list = [file for file in os.listdir(folder_path) if file.endswith(".msg")]
This will save the file name of each email in a list that we can access later.
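As a quick sanity check, the same list comprehension can be tried against a throwaway folder (the file names below are invented for the demo):

```python
import os
import tempfile

# Build a temporary folder containing a mix of file types
folder_path = tempfile.mkdtemp()
for name in ["report.msg", "notes.txt", "update.msg"]:
    open(os.path.join(folder_path, name), "w").close()

# Keep only the Outlook .msg files
email_list = [file for file in os.listdir(folder_path) if file.endswith(".msg")]
print(sorted(email_list))  # ['report.msg', 'update.msg']
```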
Connecting Outlook to Python
Next, you’ll want to create an object that will allow us to control Outlook from Python. This is enabled through the pywin32 library, which connects Python to Outlook via the Microsoft Outlook Messaging API (MAPI).
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
With this, we can begin to open each item as an HTML object, and use regular expressions (i.e. Regex) to extract the body text of each email.
While conventional wisdom dictates that you shouldn’t use Regex to parse HTML, we’re not worried about this here, as we’re only looking to extract very specific text snippets out of a standard email format (Some commercial email parsers like Parseur are heavily built around Regex).
From this point, Regex can be used to narrow down the specific data that you want to extract.
import re

# Iterate through every email
for i, _ in enumerate(email_list):
    # Create variable storing info from current email being parsed
    msg = outlook.OpenSharedItem(os.path.join(folder_path, email_list[i]))

    # Search email HTML for body text
    regex = re.search(r"<body([\s\S]*)</body>", msg.HTMLBody)
    body = regex.group()
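Outlook isn’t needed to test the body-text regex itself; a hard-coded HTML snippet works just as well (this sample is fabricated and far simpler than what Outlook actually generates):

```python
import re

# A made-up stand-in for msg.HTMLBody
sample_html = "<html><head></head><body lang=EN-SG><p>Hello!</p></body></html>"

# Grab everything between the opening <body ...> tag and </body>
regex = re.search(r"<body([\s\S]*)</body>", sample_html)
body = regex.group()
print(body)  # <body lang=EN-SG><p>Hello!</p></body>
```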
Parsing Email HTML with Python
This is how the first bullet point of our email might look as HTML:

Okay – so we can see that there are several key characteristics here, namely that our data exists as a bulleted list, where each item is an li class=MsoListParagraph element. We can use Regex to extract each bullet point.
# Search email body text for unique entries
pattern = r"li class=MsoListParagraph([\s\S]*?)</li>"
results = re.findall(pattern, body)
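The pattern can be sanity-checked against a fabricated stand-in for the email body (the two bullet points here are invented):

```python
import re

# Two made-up bullet points in the MsoListParagraph style
body = (
    '<li class=MsoListParagraph>First article header</li>'
    '<li class=MsoListParagraph>Second article header</li>'
)

# Lazily capture everything between each list tag and its closing </li>
pattern = r"li class=MsoListParagraph([\s\S]*?)</li>"
results = re.findall(pattern, body)
print(len(results))  # 2
```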
Each bullet point is extracted as a string, and each string is stored in a list. Our first bullet point should look something like this with Regex:

To retrieve our title and publication, we can use Regex again. This time, we’ll also call html.unescape() on our text to help translate the HTML entities back to plain text, e.g. &#8211; → – (an en dash).
regex = re.search(r"[^<>]+(?=(<|sans-serif'>([\s\S]*?)</span>))", header)
# HTML unescape to remove remaining HTML
title_pub = html.unescape(regex.group())

From here, it’s as simple as splitting our text. We can use split_list = title_pub.split("–") to give us a list: ["New Arrival: Dell G Series Gaming Computers", "Tech4tea"].
We can then remove any redundant whitespaces and save each item as a variable.
title = split_list[0].strip()
publication = split_list[1].strip()
That’s two down!
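The unescape-and-split step can be checked in isolation (the header string below is a fabricated stand-in):

```python
import html

# &#8211; is the HTML entity for an en dash
raw = "New Arrival: Dell G Series Gaming Computers &#8211; Tech4tea"
title_pub = html.unescape(raw)

# Split on the en dash, then strip stray whitespace
split_list = title_pub.split("–")
title = split_list[0].strip()
publication = split_list[1].strip()
print(title)        # New Arrival: Dell G Series Gaming Computers
print(publication)  # Tech4tea
```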
To get our media platforms, we’ll use a more straightforward method.
# List of platforms to check for
platform_list = ["Online", "Facebook", "Instagram", "Twitter", "LinkedIn", "Youtube"]
# Create empty list to store matched platforms
platform = []
# Iterate and check for each item in platform_list
for p in platform_list:
    if p in header:
        platform.append(p)
This will give us a list of platforms: ["Online", "Facebook", "LinkedIn"]
Now for the URLs:
# Find all links using regex
links = re.findall(r'<a href="([\s\S]*?)">', header)
This will then give us the characters highlighted in green below:

Our data so far should look something like this:
Title: New Arrival: Dell G Series Gaming Computers
Publication: Tech4tea
Platform: ['Online', 'Facebook', 'LinkedIn']
Links: ['http://tech4tea.com/blog/2020/06/26/new-arrival-dell-g-series-gaming-computers-monitors-keyboards/', 'https://business.facebook.com/gotech4tea/posts/4598490146843826', 'https://www.linkedin.com/feed/update/urn:li:activity:6682511823100542976/']
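The link-extraction regex above can also be checked on its own, with a fabricated header snippet:

```python
import re

# Two made-up anchor tags standing in for the real header HTML
header = (
    '<a href="http://tech4tea.com/blog/">Online</a> '
    '<a href="https://business.facebook.com/">Facebook</a>'
)

# Lazily capture each URL between href=" and ">
links = re.findall(r'<a href="([\s\S]*?)">', header)
print(links)  # ['http://tech4tea.com/blog/', 'https://business.facebook.com/']
```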
Uploading data to a SQL database
The final step in this process is to upload each piece of data to our SQL database.
We’ll start by uploading our title and publication data. This can be accomplished with the following code:
# Insert title & pub by substituting values into each ? placeholder
db.execute("INSERT INTO articles (title, publication) VALUES (?, ?)", (title, publication))
Uploading our links and platforms is a bit trickier. First, we’ll look up the primary id from our main table, then iterate over each platform and link individually.
# Get article id and copy to platforms & links tables
_id = db.execute("SELECT id FROM articles WHERE title = ?", (title,)).fetchone()[0]

# Make sure a row exists for this article in each child table before updating it
db.execute("INSERT OR IGNORE INTO platforms (article_id) VALUES (?)", (_id,))
db.execute("INSERT OR IGNORE INTO links (article_id) VALUES (?)", (_id,))

for i, _ in enumerate(platform):
    db.execute(f"UPDATE platforms SET platform{i} = ? WHERE article_id = ?", (platform[i], _id))
for i, _ in enumerate(links):
    db.execute(f"UPDATE links SET link{i} = ? WHERE article_id = ?", (links[i], _id))
db.commit()
The last step here is to commit all these changes to the database. With that done, our email parser is complete! If you’d like, you can use something like DB Browser to check that the contents of your database have been successfully updated.

In case you need it, I’ve uploaded the full code for this on my website and Github.