Are you Democrat or Republican? Let your tweets define you …

It feels like magic when you analyze hundreds of tweets in just a few seconds and automatically get information such as topic, opinion and sentiment.

Kamal Chouhbi
Towards Data Science


Lately, I had a conversation about politics and the upcoming 2020 US presidential election. What caught my attention is how polarized people get about it. They started to say: “You sound just like a Democrat!” or “Are you an anti-Republican?”. Before long, you find yourself coming up with arguments to defend a position. So I quickly went and took some online tests trying to answer the question: “Are you a Democrat or a Republican?”

Unfortunately, I didn’t feel convinced, so I wanted to find a new way and thankfully we have the world of Twitter as our dataset!

In this study, I will try to find answers for the questions below:

  • What are the words (terms, hashtags…) Democrats and Republicans use the most in their tweets?
  • Who gets more re-tweets and likes?
  • Can we do a Sentiment Analysis on the extracted tweets?
  • Can we use Machine Learning algorithms to train a model and determine if the tweet was written by a Democrat or a Republican candidate?
  • Will a Neural Network (RNN) help me decide if my Twitter account is more Democratic/Republican?

Data Collection :

To start, we need to figure out which Twitter accounts belong to Democrats and which to Republicans. TweetCongress.org, a directory of members of Congress on Twitter, lists 101 Republicans and just 57 Democrats. Each party is mostly held up by a few Twitter superstars (like Democratic Sen. Claire McCaskill and Republican Sen. John McCain), but congressional Republicans overall have more followers and tweet more often.

Another option is to look directly at the @TheDemocrats and @HouseGOP member lists on Twitter. After exploring the lists, you can see that we can pull all of the fullname and username class elements. We will also save the URL of the avatar image and keep it for later.

I will use Beautiful Soup to extract the handles. It is a Python library for parsing HTML and XML documents. When the HTML or XML document is poorly formed (for example if it lacks closing tags), Beautiful Soup offers a heuristic-based approach in order to reconstruct the syntax tree without generating errors.
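
The extraction step can be sketched roughly as follows. This is a minimal illustration, not the exact scraper used here: the HTML snippet, the `account` and `avatar` class names, and the example URLs are hypothetical stand-ins for a saved copy of a member-list page. Only the `fullname` and `username` class names come from the lists described above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a saved member-list page.
html = """
<div class="account">
  <img class="avatar" src="https://example.com/jane.jpg">
  <strong class="fullname">Jane Doe</strong>
  <span class="username">@janedoe</span>
</div>
<div class="account">
  <img class="avatar" src="https://example.com/john.jpg">
  <strong class="fullname">John Roe</strong>
  <span class="username">@johnroe</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

members = []
for account in soup.find_all("div", class_="account"):
    members.append({
        # Text of the element carrying the "fullname" class
        "fullname": account.find(class_="fullname").get_text(strip=True),
        # Handle without the leading "@"
        "username": account.find(class_="username").get_text(strip=True).lstrip("@"),
        # URL of the avatar image, kept for later
        "avatar": account.find("img", class_="avatar")["src"],
    })

print(members)
```

Each dictionary in `members` maps naturally to one row of a CSV file, for example with `csv.DictWriter` from the standard library.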

After extracting the handles, I export the results to a .csv file, which looks like this :

Then, I need to create a dataset containing the tweets of these members related to the 2018 U.S. Congressional Election. I collected the tweets posted between January 22, 2018 and January 3, 2019 from the Twitter API using Social Feed Manager. See each collection’s README for dates of collection, accounts, and hashtags used in queries.

Data Collection Method

Exploratory Data Analysis (EDA):

The dataset contains 86,460 tweets related to the 2018 U.S. Congressional Election (44,392 from Republicans and 42,068 from Democrats), so the two parties are represented in roughly similar proportions.

I am going to start with a quick exploration of tweets from Democrat and Republican lawmakers in the US. The main purpose of this initial analysis will be to visualize hashtags used by both of them.
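
As a rough sketch of how hashtags can be counted before plotting, the snippet below uses only the standard library. The sample tweets are made up for illustration, and lowercasing the hashtags is an assumption on my side so that different capitalizations count as the same tag.

```python
import re
from collections import Counter

# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Proud to support the #TaxCutsAndJobsAct today!",
    "More jobs thanks to the #taxcutsandjobsact #economy",
    "The #GOPTaxScam hurts working families.",
]

HASHTAG = re.compile(r"#(\w+)")

def count_hashtags(tweets):
    """Count hashtags across tweets, ignoring case."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts

hashtag_counts = count_hashtags(sample_tweets)
print(hashtag_counts.most_common(3))
```

Running this once per party gives two frequency tables that can be fed to any plotting library.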

This is getting interesting. We can now see the Republican Party (in red) touting its policies (e.g., taxcutsandjobsact). For those less familiar with US politics, during the period covered by these tweets the Republican Party controlled the presidency and both chambers of Congress, and was therefore able to implement its agenda relatively unrestrained. On the other side, for the Democratic Party, we see some clear bitterness (e.g., goptaxscam).

In addition, I will explore the words used by the different parties. This will be achieved by cleaning the tweets, parsing them into vectors of words, removing common stopwords, aggregating words by count, and plotting the most frequent results using barplots. The choice of stopwords can be critical. I chose the STOPWORDS set from the “wordcloud” library because it offers a reasonably general-purpose list.
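
The cleaning and counting steps can be sketched like this. To keep the example self-contained, a tiny hand-written stop word set stands in for the STOPWORDS set of the wordcloud library, and the exact cleaning rules (dropping URLs and mentions, keeping only letters) are assumptions rather than the precise rules used for the plots.

```python
import re
from collections import Counter

# Small stand-in for wordcloud's STOPWORDS (assumption, for illustration).
STOP_WORDS = {"the", "a", "an", "to", "for", "and", "of", "is", "in", "we", "our", "rt", "amp"}

def clean_tweet(text):
    """Lowercase a tweet, drop URLs and mentions, and return its words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove mentions
    return re.findall(r"[a-z]+", text)         # keep alphabetic tokens only

def top_words(tweets, n=10):
    """Return the n most frequent non-stopword tokens with their counts."""
    counts = Counter(
        word
        for tweet in tweets
        for word in clean_tweet(tweet)
        if word not in STOP_WORDS
    )
    return counts.most_common(n)

# Made-up sample tweets, for illustration only.
sample = [
    "Today we passed the tax bill! https://t.co/x",
    "Great day today for American families @someone",
]
print(top_words(sample, 3))
```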

For example, the word “today” (3850 times) is the most used word in the tweets sent by Democrats. Then we find “trump” (2502 times) and “american” (2053 times).

If we look at the tweets sent by Republicans, the word “today” (4883 times) is in first place again, “tax” is in second place and “great” in third place.

I also categorized the word frequencies into 5 categories :

  • If a word is used fewer than 50 times, it is in the Very Low group.
  • If it’s used between 50 and 200 times, it is in the Low group.
  • If it’s used between 200 and 750 times, it is in the Medium group.
  • If it’s used between 750 and 1500 times, it is in the High group.
  • If usage of a word is greater than 1500, it is in the Very High group.
Created with plotly
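
The five groups can be expressed as a small helper. The list above leaves the shared boundaries (50, 200, 750) open, so this sketch assumes they belong to the higher group, while exactly 1500 stays in High because Very High is described as "greater than 1500". The function names are mine, chosen for illustration.

```python
def frequency_group(count):
    """Map a word count to one of the five frequency groups."""
    if count < 50:
        return "Very Low"
    if count < 200:
        return "Low"
    if count < 750:
        return "Medium"
    if count <= 1500:
        return "High"
    return "Very High"

def matrix_cell(republican_count, democrat_count):
    """Return the (Republican group, Democrat group) cell for one word."""
    return frequency_group(republican_count), frequency_group(democrat_count)

# Counts quoted in this article
print(matrix_cell(966, 12))     # "taxreform"
print(matrix_cell(117, 876))    # "gun"
print(matrix_cell(4883, 3850))  # "today"
```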

Interpretation :

I plotted a matrix in which the ‘X’ axis refers to the word usage frequency in Republican tweets and the ‘Y’ axis refers to the word usage frequency in Democrat tweets.

For example, the word “taxreform” is used by Republicans 966 times, so it’s placed in their High category. However, Democrats used that word only 12 times. Another example is the word “gun”. It’s used 876 times in Democrat tweets and 117 times in Republican tweets.

Also, there are no words that Republicans used very highly while Democrats used them at the medium, low or very low level, and the same holds the other way around. The words “today”, “american”, “great”, “house”, “year”, “family”, “day” and “thank” are used very highly by both Democrats and Republicans.

Dispersion Plots :

I have selected some words in the data set in order to produce a plot showing the distribution of the words through the text. These words are “vote”, “democracy”, “freedom”, “america”, “american”, “tax”, “trump” and “clinton”.
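
A dispersion plot only needs the position of each occurrence of a word in the token stream. The sketch below computes those positions with the standard library; the token list is a made-up example, and the plotting itself is left out. NLTK's `Text.dispersion_plot` draws this kind of chart directly, if you prefer not to compute the offsets by hand.

```python
def dispersion_offsets(tokens, targets):
    """Return, for each target word, the token positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Made-up token stream, for illustration only.
tokens = "vote for freedom and vote for america".split()
offsets = dispersion_offsets(tokens, ["vote", "freedom", "america", "tax"])
print(offsets)
```

Plotting each list of positions on its own horizontal line gives the dispersion plot for one party.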

Democrat Tweets Plot

Republican Tweets Plot

Sentiment Analysis

Here, I will try to categorize the tweets into organized groups or parties using different text classifiers, in addition to Sentiment Analysis for understanding if a given tweet is talking positively or negatively about a given subject.

So I began by converting the tweets of both parties into a matrix of token counts, using the CountVectorizer class from sklearn. Then I applied several ML algorithms for classification: RandomForest, LogisticRegression, Multinomial Naive Bayes, DecisionTrees and AdaBoost.
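
Here is a minimal sketch of that pipeline with one of the listed classifiers. The tweets and labels are toy data invented for the example, and the settings (`stop_words="english"`, `max_iter=1000`) are assumptions, not the configuration used for the results in this article.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, for illustration only.
tweets = [
    "Tax reform is delivering great results for families",
    "The tax cuts and jobs act is working",
    "We must act on gun violence now",
    "This tax scam hurts working families",
]
labels = ["Republican", "Republican", "Democrat", "Democrat"]

# Turn the tweets into a sparse matrix of token counts
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)

# Fit one of the classifiers mentioned above
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# Classify a new tweet using the same vocabulary
new_tweet = vectorizer.transform(["Great news on tax reform"])
prediction = clf.predict(new_tweet)[0]
print(prediction)
```

The other classifiers can be swapped in by replacing `LogisticRegression` with, for example, `RandomForestClassifier` or `MultinomialNB`, since they share the same `fit` and `predict` interface.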

After that, I tried to determine the attitude or the emotion of each party, i.e., whether it is positive or negative or neutral. So I used the famous Python library TextBlob which returns 2 important indicators:

  • The polarity score is a float within the range [-1.0, 1.0]. It represents the emotion expressed in a sentence: 1 means a positive statement and -1 means a negative statement.
  • The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. It reflects how much personal opinion, emotion or judgment the text contains.
Democrat_Tweets :
Sentiment(polarity=0.16538325109722254, subjectivity=0.4649093199451984)
Republican_Tweets:
Sentiment(polarity=0.19837398561739605, subjectivity=0.4590746992419168)

Simple neural network classification

Now let’s try to answer the last question, “is my Twitter account more Democratic or Republican?”, by using a simple neural network. I will develop an LSTM and Convolutional Neural Network model for sequence classification using the Keras deep learning library.

Let’s start off by importing the classes and functions required for this model and initializing the random number generator to a constant value (7 for example) to ensure that the results can be reproduced.

Then, we split the dataset, stratified according to the Party variable, so that the training and test sets each have approximately the same proportions of Democrats & Republicans as the whole dataset. The split is made on Twitter handles because we don’t want any individual tweeter to appear in both the training & test sets.
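
One way to sketch such a split with the standard library is to sample handles within each party, then assign every tweet to the side its author landed on. The function name, the 20% test fraction and the toy handles are assumptions for illustration. Note that this stratifies at the handle level, so tweet proportions are only approximately preserved.

```python
import random
from collections import defaultdict

def split_handles(handle_party, test_fraction=0.2, seed=7):
    """Split handles into train/test sets, stratified by party."""
    rng = random.Random(seed)
    by_party = defaultdict(list)
    for handle, party in sorted(handle_party.items()):
        by_party[party].append(handle)

    train, test = set(), set()
    for handles in by_party.values():
        rng.shuffle(handles)
        n_test = max(1, round(len(handles) * test_fraction))
        test.update(handles[:n_test])
        train.update(handles[n_test:])
    return train, test

# Toy handles, for illustration only: five per party.
handle_party = {f"dem{i}": "Democrat" for i in range(5)}
handle_party.update({f"rep{i}": "Republican" for i in range(5)})

train_handles, test_handles = split_handles(handle_party)
print(sorted(test_handles))
```

A tweet then goes to the test set if its handle is in `test_handles`, and to the training set otherwise, so no account appears on both sides.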

The tweets don’t all have the same length in terms of words, but Keras requires input vectors of the same length to perform the computation. That’s why we need to truncate and pad the input tweets so that they are all the same length for modeling. The model will learn that the zero values carry no information.
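
The idea behind padding and truncating can be shown in a few lines of plain Python. This sketch pads and truncates at the front of the sequence, which matches the default behaviour of Keras's `pad_sequences`. The helper name and the sample sequences are made up.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Return a list of exactly max_len items, padding/truncating at the front."""
    kept = list(sequence)[-max_len:]  # keep the last max_len word indices
    return [pad_value] * (max_len - len(kept)) + kept

# Word-index sequences of different lengths, for illustration only.
short_tweet = [5, 6, 7]
long_tweet = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_tweet, 5))
print(pad_or_truncate(long_tweet, 4))
```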

We can now define, compile and fit our LSTM model :

  • The first layer is the Embedding layer, which uses vectors of length 32 to represent each word.
  • We add a one-dimensional CNN and a max pooling layer after the Embedding layer which then feed the consolidated features to the LSTM.
  • The next layer is the LSTM layer with 100 memory units (smart neurons).
  • Finally, because this is a classification problem we use a Dense output layer with a single neuron and a sigmoid activation function to make 0 or 1 predictions for the two classes Democrat/Republican.
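
The layers listed above could be assembled roughly as follows. This is a sketch under assumptions: the vocabulary size, the padded tweet length, the convolution settings (32 filters, kernel size 3) and the optimizer are not given in this article and are chosen here only to make the example complete.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.random.set_seed(7)  # fixed seed for reproducibility

VOCAB_SIZE = 5000  # assumed number of distinct word indices
MAX_LEN = 50       # assumed padded tweet length, in words

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    # Each word index becomes a vector of length 32
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=32),
    # 1D convolution + max pooling consolidate local features (assumed settings)
    layers.Conv1D(filters=32, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    # LSTM layer with 100 memory units
    layers.LSTM(100),
    # Single sigmoid neuron: output near 0 or 1 for the two parties
    layers.Dense(1, activation="sigmoid"),
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```

Binary cross-entropy is the usual loss for a single sigmoid output. Training then comes down to a call to `model.fit` on the padded sequences and the 0/1 party labels.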

I used epochs=10 and batch_size=64 for fitting the model. Once fit, we estimate the performance of the model on unseen tweets:

I got an overall accuracy of 74.52%. It’s not unusual for a single tweet to be miscategorized, so perhaps we should consider the overall categorization of each person’s tweets.

With this adjustment, we get ~93% accuracy.
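
One simple way to move from tweet-level to account-level predictions is to average the predicted probabilities of each account's tweets and threshold the result. The sketch below assumes that label 1 means Republican and uses a 0.5 threshold. Both are illustrative choices, and a majority vote over per-tweet labels would be a reasonable alternative.

```python
from collections import defaultdict

def classify_accounts(handles, probabilities, threshold=0.5):
    """Average per-tweet probabilities by handle and pick a party per account."""
    by_handle = defaultdict(list)
    for handle, prob in zip(handles, probabilities):
        by_handle[handle].append(prob)
    return {
        handle: "Republican" if sum(probs) / len(probs) >= threshold else "Democrat"
        for handle, probs in by_handle.items()
    }

# Made-up per-tweet predictions, for illustration only.
handles = ["a", "a", "a", "b", "b"]
probabilities = [0.9, 0.4, 0.8, 0.2, 0.6]

account_labels = classify_accounts(handles, probabilities)
print(account_labels)
```

Here account "a" has one tweet scored below 0.5, but its average still puts the account as a whole on the Republican side.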

Conclusion :

Text classification is not only fun, but it’s also a powerful tool for extracting value from unstructured data. Twitter is a very popular social media platform used by millions of users, and the published tweets can sometimes be used to see what the public’s opinion is on a certain issue. It feels like magic when you analyze hundreds of tweets in just a few seconds and automatically get information such as topic, opinion, sentiment…

After this study, I found out that we can classify people as Democrat or Republican from their tweets, and the results were reasonably accurate. If sentiment analysis algorithms were improved, we could use them to try to predict the results of the next presidential election this year.

Even so, the developed model still has 2 main hurdles :

  • Sarcasm is a weakness of all Sentiment Analysis algorithms, and our model also fails to detect sarcasm at this stage.
  • Since tweets can be posted by anyone and can contain spelling & grammar mistakes, the misspelt key words might be incorrectly analyzed by our algorithm.

Congratulations if you managed to get here. Thanks for reading, I hope you’ve liked it. For personal contact or discussion on Machine Learning, feel free to reach out to me on LinkedIn and don’t forget to follow me on GitHub and Medium.
