
Twitter Political Compass Machine: A Natural Language Processing Approach and Analysis


How to evaluate a politician’s party affiliation and monitor political division using Twitter and a Machine Learning model

With the 2020 election going on, it is more important than ever to understand a politician’s affiliation. Today, I will teach you how to build a Machine Learning model that predicts a politician’s party affiliation based on their tweets.

Data Wrangling

To gather the data/tweets, we will be using the Twitter API. The Twitter handles for all senators are here: (https://www.sbh4all.org/wp-content/uploads/2019/04/116th-Congress-Twitter-Handles.pdf)

I also generated a list of leaders in both parties that we will be using to train our model.

Democrats: Joe Biden (Presidential Nominee), Kamala Harris (VP Nominee), Bernie Sanders, Elizabeth Warren

Republicans: Donald Trump (President), Mike Pence (Vice President), Mitch McConnell, Ted Cruz
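To give a concrete picture of the collection step, here is a minimal Tweepy sketch for pulling recent tweets from a few handles. The credentials are placeholders, the handle list is illustrative, and the dataframe it builds only approximates the date2_df used below (the Party label would still need to be attached); this is not the article's exact collection code.

import tweepy
import pandas as pd

# Placeholder credentials -- replace with your own Twitter API keys
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# Illustrative handles; the full run would cover every senator in the PDF above
handles = ['JoeBiden', 'KamalaHarris', 'Mike_Pence', 'tedcruz']

rows = []
for handle in handles:
    # Walk each account's timeline and keep the full tweet text
    for status in tweepy.Cursor(api.user_timeline, screen_name=handle,
                                tweet_mode='extended').items(200):
        rows.append({'politican': handle, 'tweet': status.full_text})

date2_df = pd.DataFrame(rows)  # the Party column still needs to be added per handle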

Data/Feature Engineering

To gather useful features, we must transform the tweets into vectors of some sort. In the diagram below, I will show the 3 main features we will be using in this model and how to get them.

Image by Author

Now I will explain in detail why I chose these features and why I think they are essential.

Readability score

The readability score indicates how well a person writes by turning his or her writing into a number that estimates the writer’s education level. Readability is essential because a number of reports have shown that college graduates and post-graduates are more likely to maintain a liberal view. Politicians may therefore tailor their tweets to their bases, and we want to capture that.

Below is the code we used to add the readability score; we use the textstat package to turn each sentence into a number (please refer to its documentation for the details/science of this conversion). Here, we must remove hyperlinks, because a URL is not a word the algorithm will recognize, and retweets, because although they may share the politician’s sentiment, the words are not the politician’s own.

import textstat

# Flag retweets and strip hyperlinks before scoring
date2_df['retweet'] = date2_df['tweet'].str.contains('RT')
date2_df['tweet_web_r'] = date2_df['tweet'].str.replace(r'http\S+', '', regex=True)

# Readability (grade-level) score for each cleaned tweet
date2_df['TS'] = date2_df['tweet_web_r'].apply(lambda x: textstat.text_standard(x, float_output=True))

# Drop retweets and reset the index
date2_df = date2_df[date2_df['retweet'] == False]
date2_df = date2_df.reset_index(drop=True)
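For a quick sense of what this feature looks like, text_standard maps a sentence to an estimated U.S. grade level; the sentence below is made up purely for illustration.

# Illustrative only: prints an estimated U.S. school grade level for the sentence
print(textstat.text_standard("We must expand affordable health care to every family.", float_output=True))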

Sentiment score

I am sure we have all heard "Make America Great Again!", and you can probably guess that this quote has a positive sentiment. Sentiment is the feeling expressed in a piece of writing. Looking at the sentiment expressed toward particular subjects gives us an idea of which party affiliation the politician likely maintains.

For example:

‘Make America Great Again’ is probably something a Republican or Trump, specifically, would say (Positive Sentiment towards America).

‘Last night, Joe Biden said that not one single person with private insurance lost their plan under Obamacare.’ is probably something a Republican would say. (Negative Sentiment towards Obamacare).

‘Last night Trump said we are "rounding the corner" on the pandemic. Really?’ is probably something Democrats would say (Negative Sentiment towards Trump).
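If you want to see what VADER (the package used below) produces for lines like these, here is a quick sketch; the compound score ranges from -1 (most negative) to +1 (most positive).

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Requires a one-time nltk.download('vader_lexicon')
analyser = SentimentIntensityAnalyzer()
print(analyser.polarity_scores('Make America Great Again!'))
print(analyser.polarity_scores('Last night Trump said we are "rounding the corner" on the pandemic. Really?'))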

Take a look at how each political leader’s tweets are categorized by sentiment (shown below). It is quite interesting that most of the Republican leaders’ tweets during this period are positive, while the Democrats’ tweets are mostly negative. If I had to guess, the Democrats are probably criticizing the Republicans or reflecting on the danger the pandemic has brought. This distinction should help our model.

Image by Author

Below is the code we used to add the sentiment compound score to each tweet. We used the VADER (Valence Aware Dictionary and sEntiment Reasoner) package to do this. Please refer to their website regarding the details of their algorithms. Also, we removed the hyperlinks and retweets before getting the scores.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize the analyzer (requires a one-time nltk.download('vader_lexicon'))
analyser = SentimentIntensityAnalyzer()

# polarity_scores returns a dict of scores; keep only the compound score
date2_df['sentiment_score'] = date2_df['tweet_web_r'].apply(analyser.polarity_scores)
date2_df['sentiment_score'] = date2_df['sentiment_score'].apply(lambda s: s['compound'])

TF-IDF (Term Frequency Inverse Document Frequency) Vectors

The last, and most critical, feature is turning the words of a tweet into TF-IDF vectors. To do so, we combine all the tweets of the Republican leaders and of the Democratic leaders into two corpora. The central concept of TF-IDF is to transform a document into a bag-of-words vector (how many times each word shows up), down-weighting words that appear across most of the corpus. If you want to read up on the details, please refer to the TF-IDF Wikipedia page. In this scenario, think of it as vectorizing how often the party leaders’ words are used, so we can see whether other senators are using the same words the leaders use.
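As a toy illustration of what the vectorizer produces (this is not the article’s corpus, just three made-up "documents" standing in for leaders’ combined tweets):

from sklearn.feature_extraction.text import TfidfVectorizer

# Each toy "document" stands in for one leader's combined tweets
toy_corpus = ['make america great again',
              'protect affordable health care',
              'america great economy']
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_corpus)

print(vectorizer.get_feature_names())  # vocabulary learned from the corpus
print(tfidf.toarray())                 # one weighted word-frequency vector per document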

To generate these TF-IDF vectors, we have to remove the hyperlinks and retweets, turn all words to lowercase, tokenize, remove stopwords, and lemmatize the words first. Once the sentences are clean, we can feed the corpus into the TF-IDF vectorizer from the scikit-learn package. Here are the purposes of each cleaning operation (a sketch of the cleaning step follows the list), and below is the vectorizing code.

  • Lowering the case – prevents double-counting a word because of case differences
  • Lemmatizing the words – turns variations of a word into one word. For example, ‘cover’, ‘covering’, and ‘covered’ will all be treated as the same word
  • Removing stopwords – removes low-information words such as prepositions
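The cleaning itself is not shown in the vectorizing snippet, so here is a minimal sketch of what it might look like with NLTK. I am assuming the cleaned text lives in the len_sentence column used by the code below, and that stop_words is the same stopword list passed to the vectorizer; the article’s actual cleaning code may differ.

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Assumes nltk.download('stopwords'), nltk.download('punkt'), and nltk.download('wordnet') have been run
stop_words = stopwords.words('english')
stop_set = set(stop_words)
lemmatizer = WordNetLemmatizer()

def clean_tweet(text):
    text = re.sub(r'http\S+', '', text).lower()                        # drop links, lowercase
    tokens = word_tokenize(text)                                       # tokenize
    tokens = [t for t in tokens if t.isalpha() and t not in stop_set]  # keep words, drop stopwords
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)           # lemmatize and rejoin

date2_df['len_sentence'] = date2_df['tweet_web_r'].apply(clean_tweet)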
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Fit one TF-IDF vectorizer per party on that party's leaders' cleaned tweets
rep_Tfidf = TfidfVectorizer(stop_words=stop_words, ngram_range=(1, 3))
rep_Tfidf.fit_transform(df[df['politican'].isin(rep_leaders)].len_sentence)
dem_Tfidf = TfidfVectorizer(stop_words=stop_words, ngram_range=(1, 3))
dem_Tfidf.fit_transform(df[df['politican'].isin(dem_leaders)].len_sentence)

# Transform every tweet with each party's vectorizer and put the result into a dataframe
rep_vectorize = pd.DataFrame(rep_Tfidf.transform(date2_df['len_sentence']).toarray(),
                             columns=rep_Tfidf.get_feature_names())
dem_vectorize = pd.DataFrame(dem_Tfidf.transform(date2_df['len_sentence']).toarray(),
                             columns=dem_Tfidf.get_feature_names())
rep_vectorize['politican'] = date2_df.politican
dem_vectorize['politican'] = date2_df.politican
rep_vectorize['Party'] = date2_df.Party
dem_vectorize['Party'] = date2_df.Party

# Add the TF-IDF vectors to the readability (TS) and sentiment scores
date2_df_final = date2_df[['politican', 'Party', 'TS', 'sentiment_score']]
rep_vectorize = rep_vectorize.drop(['politican', 'Party'], axis=1)
dem_vectorize = dem_vectorize.drop(['politican', 'Party'], axis=1)
date2_df_final = pd.concat([date2_df_final, rep_vectorize, dem_vectorize], axis=1)

To demonstrate why TF-IDF is important, here are the most common words used by the leaders. As you can see, Donald Trump uses ‘MAGA’ and ‘great’ a lot; no wonder a good chunk of his tweets has positive sentiment. The opposite trend and choice of words reflect the negative sentiment of the Democrats: they use ‘covid’, ‘pandemic’, and ‘fire’ (references to natural disasters) significantly more than the Republicans. Maybe that is why there is so much negative sentiment in their tweets.

Image by Author

Modeling

At this point, you may have noticed that we will train our model using only the party leaders. The reasons behind this are:

  1. We want to measure intra-party dissonance using this model.
  2. Our model will be overwhelmed if we use all the senators due to the sheer number of tweets.

The model of our choice is a Random Forest classifier. Here is the code for the train/test split and the cross-validated hyperparameter search used to generate our model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import plot_confusion_matrix
from scipy import stats
import seaborn as sns

# Encode the parties: Republican = 1, Democrat = 0
party_code = {'R': 1, 'D': 0}
date2_df['Party'] = date2_df['Party'].replace(party_code)

# Train only on the party leaders; keep the other senators for prediction later
date2_df_leaders = date2_df[date2_df['politican'].isin(leaders) == True]
date2_df_nonleaders = date2_df[date2_df['politican'].isin(leaders) == False]

# Split data into train and test set
X = date2_df_leaders.drop(['politican', 'Party'], axis=1)
y = date2_df_leaders.Party
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25)

# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Hyperparameter search space for the random forest
param_grid = {'n_estimators': stats.randint(100, 400), 'criterion': ['entropy'],
              'min_samples_leaf': stats.randint(1, 6), 'max_depth': stats.randint(100, 400)}

# Make the model with 5-fold cross validation
RF_model = RandomForestClassifier()
RF_model_cv = RandomizedSearchCV(RF_model, param_grid, cv=5, n_iter=60, n_jobs=4, scoring='accuracy')
RF_model_cv.fit(X_train, y_train)
y_pred = RF_model_cv.predict(X_test)

# Plot the confusion matrix on the held-out test set
plot_confusion_matrix(RF_model_cv, X_test, y_test)
Image by Author

Based on our confusion matrix, you can see that our model’s accuracy is in the 90s (percent) when guessing a leader’s party affiliation from a tweet, which is pretty good.

We will now use this model to classify every tweet made by the other senators and aggregate the predictions for each senator. Once we have gathered these counts, we plot the number of Republican-like tweets against the number of Democrat-like tweets on a scatter plot.
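That aggregation step is not shown explicitly, so here is a minimal sketch of how it might look, assuming date2_df_nonleaders carries the same feature columns the model was trained on (the actual plotting code may differ):

import matplotlib.pyplot as plt

senator_df = date2_df_nonleaders.copy()
X_senators = scaler.transform(senator_df.drop(['politican', 'Party'], axis=1))
senator_df['predicted'] = RF_model_cv.predict(X_senators)  # 1 = Republican-like, 0 = Democrat-like

# Count Republican-like and Democrat-like tweets per senator
counts = senator_df.groupby('politican')['predicted'].agg(['sum', 'count'])
counts['rep_like'] = counts['sum']
counts['dem_like'] = counts['count'] - counts['sum']

# Each point is one senator
plt.scatter(counts['rep_like'], counts['dem_like'])
plt.xlabel('Republican-like tweets')
plt.ylabel('Democrat-like tweets')
plt.show()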

Image by Author

Here you can see a clear division between the two parties. You can now use this same code, scrape a politician of your choice, and see where he or she lies on this graph. This gives you a quantitative evaluation of a politician’s affiliation and allows you to monitor any intraparty dissonance. Congratulations, you just built your own Twitter political compass machine.

For more details on how I built the machine, the data exploration I did, and how this model performs on data from another date, please refer to my GitHub, email me directly at [email protected], or contact me on LinkedIn.

