Adhitya Venkatraman, Nohl Patterson, Clayton North, and Wade Klein
Overview
After a contentious election season, we wanted to analyze emotions and ideas in the news and social media from the past few years. We used sentiment analysis to map the subjectivity and polarity of over 140,000 news headlines and thousands of Congressional tweets. Then, we used Latent Dirichlet Allocation to perform topic modelling on the top tweets of 2019 and 2020 to find correlations between influential ideas and people on Twitter.
Data
We used over 140,000 news headlines from 15 publications, sampled by Andrew Thompson. The articles were collected from 2015–2017. The chart below shows which publications we examined and how many articles came from each.

We then looked at thousands of tweets from Congressional representatives, collected via the congresstweets database. For the topic modelling section, we also included presidential tweets. Special thanks to Mike Izbicki for his help in sorting this dataset. Finally, for our topic modelling analysis, we used the Tweepy library for the Twitter API to sample the top tweets from 2019–2020.
Sentiment Analysis
We started by performing a sentiment analysis of each corpus using the TextBlob library. Sentiment analysis attempts to understand the subjective attributes of text, such as mood. We used TextBlob to determine a polarity and subjectivity score for each datapoint.
Polarity measures the apparent attitude conveyed by the use of a word on a continuous scale from -1 to 1. Subjectivity refers to how likely it is that a word expresses an opinion, rather than a fact. Importantly, a low subjectivity score does not imply that the information is true – just that it is presented as a fact. Subjectivity is evaluated on a continuous scale from 0 to 1. A score of 0 indicates that the word is likely part of a fact, while a higher score indicates that it seems more subjective.
After assigning every article’s headline a polarity and subjectivity score, we plotted them to see how each publication was presenting its stories. By plotting each article individually, we avoid obscuring either the variance or the mean in the visualizations. Then, we compared the plots to find differences in the way outlets craft headlines.
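One common way to draw this kind of plot is a scatter plot with translucent markers, so that regions where many points overlap render darker. The sketch below assumes matplotlib and uses synthetic scores in place of real headline scores; the distribution parameters are arbitrary and chosen only to produce a visible cluster.

```python
import random

import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic scores standing in for real headline scores (an assumption).
random.seed(0)
polarity = [max(-1.0, min(1.0, random.gauss(0.05, 0.25))) for _ in range(500)]
subjectivity = [max(0.0, min(1.0, random.gauss(0.35, 0.25))) for _ in range(500)]

fig, ax = plt.subplots(figsize=(5, 5))
# A low alpha makes each dot translucent, so stacked dots appear darker.
points = ax.scatter(polarity, subjectivity, alpha=0.1, color="black")
ax.set_xlim(-1, 1)
ax.set_ylim(0, 1)
ax.set_xlabel("Polarity")
ax.set_ylabel("Subjectivity")
ax.set_title("Headline sentiment (synthetic data)")
```

Calling fig.savefig("headline_sentiment.png") afterwards would write the figure to disk. Fixing the axis limits to the full score ranges keeps plots for different outlets directly comparable.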
When all was said and done, we found some interesting results. In particular, we want to highlight the distributions for the New York Times and Breitbart, because of each publication’s history. The New York Times is a national paper with a rich history, while Breitbart is a much newer publication with a rapidly growing readership. They have traded blows with one another on the issue of fake news, so we were especially curious to see how each performed along the subjectivity axis.
The plots below present our findings. The darker the dot, the more articles that scored that combination of polarity and subjectivity.


Though the Breitbart corpus is larger than that of the New York Times, there are clear differences in the shape of each distribution. The New York Times is more closely clustered toward the center of the plot, while Breitbart covers a much wider range in both axes. But what happens when we lay these plots on top of each other?

Now, the differences between the two sources become much clearer. The New York Times’ headlines have a lower variance in both subjectivity and polarity than Breitbart’s, which include many headlines at either end of the plot. However, on average, Breitbart actually ends up with lower subjectivity and polarity scores than the New York Times. The New York Times also writes headlines that are more measured in terms of polarity than Breitbart’s. This helps to quantify the stylistic difference between these two publications. Subsequent analyses could focus on how opting for a source with a lower variance or a lower mean of subjectivity and polarity affects willingness to share the article or the reliability of the information included.
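It may seem odd that one outlet can have the lower average and the wider spread at the same time, so here is a small worked example. The scores are toy numbers invented for illustration, not values from our data, and "A" and "B" are placeholder outlet names.

```python
from statistics import mean, pvariance

# Toy subjectivity scores, invented for illustration only.
outlet_a = [0.40, 0.45, 0.50, 0.55, 0.60]  # tightly clustered around the middle
outlet_b = [0.00, 0.00, 0.30, 0.90, 1.00]  # spread across the whole range

# Map each outlet to its (mean, population variance).
summary = {
    name: (mean(scores), pvariance(scores))
    for name, scores in {"A": outlet_a, "B": outlet_b}.items()
}

for name, (avg, var) in summary.items():
    print(f"outlet {name}: mean={avg:.3f}, variance={var:.4f}")
```

Outlet B has the lower mean because of its cluster of zero scores, yet its variance is far higher because it also reaches the top of the scale. The mean and the variance answer different questions, which is why we looked at both.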
This was just one combination of our results. We’ll be posting a notebook with the full results shortly, but for now, you can take a look at the distributions for all 15 news publications here:















Using the same technique, we proceeded to examine the tweets of nine popular Congressional representatives and conducted a similar analysis of their data. This data was sampled from 2017 onwards.
Here, we wanted to highlight two Congressional representatives who have been vocal opponents of one another: Rep. Alexandria Ocasio-Cortez and Sen. Ted Cruz. While Cruz does seem to tweet with more polarizing language, the plots are not extremely dissimilar overall. Notably, both show a general tendency toward positive words that are somewhat subjective.



Below are the results for the nine sampled congressional representatives. Again, many of the distributions seem to be quite similar, with a slight preference for somewhat positive, subjective messages. All members were more positive than negative.









Topic Modelling
Given the apparent similarities in the way certain members of Congress present their views, we wanted to dig deeper into the kinds of things they tweet about. To do so, we used Latent Dirichlet Allocation to perform topic modelling on Twitter data. Topic modelling focuses on detecting patterns in text to understand how different ideas and sources may be connected. We used a dataset of the most popular tweets from 2019–2020.
First, we looked at correlations among the most frequently used hashtags within the corpus. Some of our results were fairly predictable, such as the high correlation between #MaduroRegime and #Venezuela. The hashtag #ForThePeople was positively correlated with #LowerDrugCosts, perhaps an indication of the Medicare-For-All debate that grew in popularity during the 2020 Election. Interestingly, #CombatCOVID19Challenge was not strongly correlated with #COVID19. The full results are plotted below in a correlation matrix.
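One simple way to build such a matrix, sketched here under our own assumptions rather than as a record of the exact pipeline, is to mark which hashtags appear in each tweet and compute the Pearson correlation between those presence indicators. The tweets below are invented; only the hashtag names come from the discussion above.

```python
import numpy as np

# Each invented tweet is reduced to the set of hashtags it contains.
tweets = [
    {"#Venezuela", "#MaduroRegime"},
    {"#Venezuela", "#MaduroRegime"},
    {"#COVID19"},
    {"#ForThePeople", "#LowerDrugCosts"},
    {"#ForThePeople"},
    {"#COVID19", "#Venezuela"},
]

hashtags = sorted(set().union(*tweets))

# Rows are tweets, columns are hashtags; 1 means the hashtag is present.
indicator = np.array([[int(tag in t) for tag in hashtags] for t in tweets])

# Pearson correlation between columns, i.e. between hashtags.
corr = np.corrcoef(indicator, rowvar=False)

i = hashtags.index("#Venezuela")
j = hashtags.index("#MaduroRegime")
print(f"corr(#Venezuela, #MaduroRegime) = {corr[i, j]:.2f}")
```

Hashtags that tend to appear in the same tweets get a positive value, and hashtags that rarely appear together get a negative one. Note that a hashtag appearing in every tweet or in none would have zero variance and produce undefined entries, so such columns would need to be dropped first.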

Then, we looked at correlations among the eight most frequently mentioned political accounts, which belong to five individuals. We found that tweets mentioning Donald Trump were more similar to those mentioning Nancy Pelosi than to those mentioning Ted Cruz. Similarly, tweets mentioning Pelosi’s accounts were more similar to those mentioning Trump than to those referring to either Bernie Sanders or Alexandria Ocasio-Cortez. Perhaps this was a consequence of the impeachment vote in early 2020, during which many people may have tweeted about Trump and Pelosi together. Another finding was that tweets mentioning Alexandria Ocasio-Cortez had a relatively strong negative correlation with each of the other accounts. The visualization below plots these relationships on a matrix.

Conclusion
Overall, we found that there were large differences in the style of reporting employed by certain news organizations, highlighted by the contrast between Breitbart and the New York Times. Though some differences were present in the tweets of members of Congress, we wanted to better understand Twitter behavior, so we employed topic modelling. This last analysis focused on finding correlations among the most frequently used hashtags and the most frequently mentioned political accounts on Twitter. Further work could include an analysis of more recent news headlines, topic analysis of news articles, and comparison of news specifically shared by public officials.
Thanks to Mike Izbicki for feedback and guidance on developing this post (https://izbicki.me/).