
Need Data? Why Not Use Your Email Address?

How I Set Up My Email Account to Analyze Right-Wing Media Messaging

Pixabay

Although we have seen a significant rise in the popularity of messaging apps and chat options over the last few years, email continues to be one of the most prominent forms of digital communication. Today, it is estimated that over 4 billion people use email, a number that is expected to keep growing for the foreseeable future.

Moreover, the open rate of email newsletters is somewhere around 22%, which blows social media business engagement out of the water. In short, email still matters, and there are a ton of companies, groups, and organizations that are eager to get you to sign up for their email marketing newsletters.

But what is a data scientist to do with this information? 😉

One of the biggest problems that young data scientists face today is truly understanding the value of data engineering in building enterprise-scale data science solutions. Whether it is accessing data, dealing with messy data, or engineering features for downstream training, they are often not exposed to these topics because courses, blogs, and online tutorials tend to provide "canned" data sets that have already been vetted and cleaned.

Enter email: your own data collection machine that exposes you to the dirty, less-talked-about, and certainly not glamorized aspects of data science, while also opening doors to a myriad of business uses. Here are just a few business use-cases for using email to collect, analyze, and model data:

  1. Learn what your competitors are doing by signing up for their email newsletters
  2. Monitor social movements and their use of messaging to identify new trends
  3. Collect feedback from customers that doesn’t feel like another survey

Okay, cool, but how do I do it?

In the remainder of this article, I provide a process and some code to show how I set up an email account to collect data from right-wing groups, monitor extremist messaging, and analyze that data with natural language processing tools to gain insight. If you're scratching your head wondering why I chose such a narrow topic, this project is a collaboration with RivalAnalytics.Ai, a competitive intelligence Data Science group interested in setting up surveillance projects like this one.

The high-level project looks something like this:

  1. Set up a Gmail account and enable programmatic access
  2. Sign up for target newsletters using the new email address
  3. Wait while my inbox fills up…
  4. Leverage Python to access emails and extract Date, Subject, & Body
  5. Clean the data
  6. Analyze the data using key phrases, sentiment, and correlations
  7. Consider the implications

Here we go…

Step 1: Setting up the Email Account

Hopefully, I don't need to show everyone how to set up a Gmail account. What is likely more useful is to show how to enable the account for programmatic access. Once your Gmail account is created, go to your Gmail inbox, click the "Gear" icon to get to Settings, and then click "See all settings" (see black arrows in image).

From there click on the "Forwarding and POP/IMAP" tab and "Enable IMAP" in the "IMAP access" section (see red arrows in image). After selecting, scroll to the bottom and click "Save Changes" to enable the selections on the account.

Finally, click on the circle icon for your Gmail account in the upper right corner (I am using Chrome BTW) and then click on "Manage Your Google Account" (see green arrows). Go to the "Security" section and click to turn on "Less secure app access" (see black arrows). Done! Now your email is configured for programmatic access. Onward!

Steps 2 & 3: Sign Up and Wait

For steps 2 & 3 in this experiment, I chose to sign up for the Gab newsletter. Gab is a right-wing social media site and is a known platform for right-wing extremist groups.

…waiting…waiting…still waiting…

With my patience running very thin, I finally decided to pull the plug and take a look at the data I had accumulated over 2 months. In just 2 months' time, and focusing on only one newsletter source, I accumulated 28 emails. Not "Big Data" by any means, but you can imagine how including more sources and allowing a bit more time to acquire data could expand the footprint quite significantly.

Step 4: Accessing the Data with Python

In order to access the data programmatically, I used Python 3.7 on a Windows 10 laptop. First, we need to install the libraries that allow us access to the email server:
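The original listing did not survive in this copy of the article, so here is a minimal sketch of what the connection step could look like. It assumes Python's standard-library `imaplib` module (which needs no separate install) rather than whatever package the original code used; `connect_to_inbox` is a hypothetical helper name, and the address and password are whatever you chose in Step 1.

```python
import imaplib

IMAP_HOST = "imap.gmail.com"  # Gmail's IMAP endpoint
IMAP_PORT = 993               # standard port for IMAP over SSL


def connect_to_inbox(address: str, password: str) -> imaplib.IMAP4_SSL:
    """Log in to Gmail over IMAP and open the inbox in read-only mode.

    Hypothetical helper: the article's own connection code is not shown.
    """
    connection = imaplib.IMAP4_SSL(IMAP_HOST, IMAP_PORT)
    connection.login(address, password)
    # readonly=True avoids marking messages as read while we collect them.
    connection.select("INBOX", readonly=True)
    return connection
```

Opening the mailbox read-only is a design choice for a collection account: the inbox stays untouched, so the script can be re-run without changing message flags.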

Second, we search for emails from our newsletter provider and extract the Date, Subject, and Body text. Note that the code focuses only on the text found in the body and does not attempt to extract attachments, but here is a great resource for modifying my loop to save attachments to your hard drive.
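As a rough sketch of this search-and-extract step (not the author's original loop), the code below uses the standard-library `imaplib` and `email` modules. The function names `extract_fields` and `fetch_newsletter_rows` are hypothetical, `connection` is assumed to be an already logged-in `imaplib.IMAP4_SSL` object with the inbox selected, and the sample message at the bottom is made up so the parsing can be tried offline.

```python
import email
from email import policy
from email.message import EmailMessage


def extract_fields(raw_bytes: bytes) -> dict:
    """Parse one raw RFC 822 message into Date, Subject, and Body text."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer the plain-text part; fall back to HTML (returned as raw markup).
    part = msg.get_body(preferencelist=("plain", "html"))
    body = part.get_content() if part is not None else ""
    return {
        "Date": str(msg["Date"]),
        "Subject": str(msg["Subject"]),
        "Body": body.strip(),
    }


def fetch_newsletter_rows(connection, sender: str) -> list:
    """Search the selected mailbox for mail from `sender` and parse each hit."""
    status, data = connection.search(None, "FROM", f'"{sender}"')
    rows = []
    for num in data[0].split():
        status, parts = connection.fetch(num, "(RFC822)")
        rows.append(extract_fields(parts[0][1]))  # parts[0][1] holds the raw bytes
    return rows


# Offline demo on a hand-built message (made-up content, not collected data).
sample = EmailMessage()
sample["From"] = "news@example.com"
sample["Date"] = "Mon, 03 May 2021 10:00:00 +0000"
sample["Subject"] = "Weekly update"
sample.set_content("Hello reader")
row = extract_fields(sample.as_bytes())
```

The list of dictionaries returned by `fetch_newsletter_rows` can be passed straight to `pandas.DataFrame(rows)` to get one row per email.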

Now that we have a dataframe that contains the Date, Subject, and Body of each email, we are ready to perform our analysis.

Step 5: Clean the Data

To clean the data, I built a function that I commonly use across most of my NLP projects for a quick clean-up of text data. The function lets you pass a dictionary that standardizes word tokens, for example by expanding the acronym "NLP" to "Natural Language Processing" should it become relevant. The code uses tools from Python's NLTK library.

After creating a "clean_text" column in our dataframe, I then build bigrams from the text. I filter the bigrams both by their pointwise mutual information (PMI) values and by their status as nouns, since most topics in text are expressed as nouns. PMI is essentially the log of the probability of a given bigram divided by the product of the probabilities of each of its words occurring independently, so a high PMI means the two words appear together more often than chance would predict. Using these filters improves the likelihood that we will obtain meaningful bigrams. Once we have obtained our final list of bigrams, we then create a bigram frequency matrix for each email.
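The cleaning function itself is not reproduced here, so the following is only a minimal sketch of the idea. To keep it runnable without downloading NLTK resources, it swaps in a regular-expression tokenizer and a tiny hard-coded stop-word set; both are assumptions for illustration, and in practice NLTK's tokenizer and stop-word list would take their place. The `replacements` argument plays the role of the standardization dictionary described above.

```python
import re

# Tiny illustrative stop-word set (an assumption; NLTK's list is far longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}


def clean_text(text: str, replacements: dict = None) -> str:
    """Lowercase, tokenize, standardize tokens, and drop stop words."""
    replacements = replacements or {}
    # Keep alphabetic runs only, which drops punctuation and digits.
    tokens = re.findall(r"[a-z]+", text.lower())
    cleaned = []
    for token in tokens:
        token = replacements.get(token, token)  # e.g. expand an acronym
        if token not in STOP_WORDS:
            cleaned.append(token)
    return " ".join(cleaned)


example = clean_text(
    "NLP is the future of AI!",
    replacements={"nlp": "natural language processing"},
)
# example == "natural language processing future ai"
```

Applied to a dataframe, this would look something like `df["clean_text"] = df["Body"].apply(clean_text)`.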
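To make the PMI filter concrete, here is a small pure-Python sketch under stated assumptions: probabilities are estimated from raw counts over all emails, the score is PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ), and the noun filter is left out because it needs a part-of-speech tagger. The function names and the `min_count` threshold are hypothetical, and the three toy documents are made up.

```python
import math
from collections import Counter


def bigram_pmi(token_lists, min_count=2):
    """Score each bigram by PMI = log2(p(x, y) / (p(x) * p(y)))."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:  # rare bigrams give unreliable PMI estimates
            continue
        p_xy = count / n_bi
        p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def bigram_counts(tokens, vocabulary):
    """One row of the bigram frequency matrix for a single email."""
    counts = Counter(zip(tokens, tokens[1:]))
    return [counts[bigram] for bigram in vocabulary]


# Toy documents (made up) standing in for tokenized clean_text values.
docs = [["donald", "trump", "said"], ["donald", "trump", "won"], ["said", "won"]]
scores = bigram_pmi(docs)
row = bigram_counts(docs[0], [("donald", "trump"), ("said", "won")])
```

Stacking one `bigram_counts` row per email, with the surviving bigrams as the vocabulary, gives the frequency matrix used in the analysis step.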

Step 6: Analysis

Now that we have a dataframe with frequency counts for each bigram in our final bigram list, we can analyze the data. To analyze, I first join the Body and Date from the main dataframe to our clean_text dataframe. I then add a sentiment column using the HuggingFace 'sentiment-analysis' pipeline. Because BERT-based transformers limit input length (typically to 512 tokens), the function that generates sentiment scores needs to limit the text exposed to the model. I iterate over each line of the main Body and select only the first 512 characters to generate sentiment estimates, which in practice stays under that limit. Finally, I average all of the sentiment scores together to get an overall sentiment estimate for each email. Obviously, we won't capture all of the text data with this character limit in place, but one could extend this example by iterating over 512-character chunks to capture all of the data.
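The chunked extension mentioned above could be sketched as follows. This is an illustration, not the function used in the article: the scoring model is passed in as `score_fn` so the averaging logic can be tried without downloading a model, and `signed_score` assumes the pipeline's default model, which labels each text POSITIVE or NEGATIVE with a confidence score. How the original code mapped labels to numbers is not shown, so the signed mapping is an assumption.

```python
def chunk_text(text: str, size: int = 512) -> list:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def signed_score(result: dict) -> float:
    """Map {'label': 'POSITIVE'|'NEGATIVE', 'score': p} to a value in [-1, 1]."""
    return result["score"] if result["label"] == "POSITIVE" else -result["score"]


def average_sentiment(body: str, score_fn, size: int = 512) -> float:
    """Average score_fn over every non-empty chunk of the email body."""
    chunks = [c for c in chunk_text(body, size) if c.strip()]
    if not chunks:
        return 0.0  # assumption: treat an empty body as neutral
    scores = [score_fn(chunk) for chunk in chunks]
    return sum(scores) / len(scores)


# With the transformers package installed, score_fn could be built like this:
#   from transformers import pipeline
#   classifier = pipeline("sentiment-analysis")
#   score_fn = lambda chunk: signed_score(classifier(chunk)[0])
```

Chunking by characters can cut a word or sentence in half at a boundary, so the per-chunk scores are approximate; splitting on sentence boundaries would be a natural refinement.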

Once complete, we perform a quick time-series analysis that plots sentiment over time, as well as a correlation analysis of sentiment by bigram to identify which phrases are associated with overall sentiment.

From the code above we generate a few images to aid our interpretation. Let's first check out the time-series plot:

Despite the high average sentiment of the first few emails, we see a quick drop and a subsequent linear upward trend from 05–2021 to 06–2021. There is still a lot of variability; however, the analysis does suggest a trend. Such a trend leads me to hypothesize that these newsletters may be correlated with other happenings in the news. Thus, it would be interesting to go further and blend these data with other data sources such as Google Trends. Let's examine how our bigrams correlate with sentiment:
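For the correlation step, a minimal sketch might look like the code below. It assumes a Pearson correlation between each bigram's per-email count and the per-email average sentiment; the article does not state which coefficient was used, the function names are hypothetical, and the numbers are toy values rather than results from the collected emails.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one series is constant
    return cov / math.sqrt(var_x * var_y)


def correlate_bigrams(bigram_columns, sentiment):
    """Correlate each bigram's count column with the sentiment column."""
    return {name: pearson(counts, sentiment)
            for name, counts in bigram_columns.items()}


# Toy values (made up): counts of two bigrams across four emails.
toy_counts = {"bigram_a": [0, 1, 2, 3], "bigram_b": [3, 2, 1, 0]}
toy_sentiment = [0.1, 0.2, 0.3, 0.4]
correlations = correlate_bigrams(toy_counts, toy_sentiment)
```

If the counts and sentiment already live in one pandas dataframe, `DataFrame.corr()` computes the same Pearson coefficients for every pair of columns, and the resulting matrix is what a heatmap would display.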

Step 7: Interpretation

There are a few interesting nuggets of insight in the correlation heatmap, which shows the strength of the relationship between bigrams and average sentiment. Interestingly, the bigram "Donald Trump" is more positively correlated with sentiment than the bigram "President Trump." This suggests that far-right groups may use labels for the former president in different ways, perhaps depending on whether they are expressing their own enthusiasm for Donald Trump or describing how liberals treat President Trump.

Other key insights include phrases that tend to be associated with more negative sentiment like "Critical Race" and "Domestic Terrorist."


Take Home or TLDR

In summary, our email addresses are valuable sources of data and, if used properly, can lead to significant business and/or social value. By applying the tools of data science, we can add structure to the unstructured nature of emails and create new insights that drive value in a variety of use-cases.


