Using the Twitter API and NLP to analyze the tweets of different users.

Jan 20, 2018

2016 U.S. Presidential Candidates Donald Trump & Hillary Clinton

Yesterday I worked on arguably one of the most interesting projects in my burgeoning data science career. Some kind soul was nice enough to put a Python wrapper on the Twitter API, so we can work with it through classes and methods just like we would in any other Python application. So at this point, the possibilities with Twitter and Python are endless. And by endless I mean limited by your knowledge of Python, how well you can Google answers on Stack Overflow, and definitely by how many times you can hit the Twitter server per 15-minute interval (it’s 200).

The Project

Take two Twitter users, scrape their tweets, turn the text into numeric features with a vectorizer like CountVectorizer or TfidfVectorizer, train a classifier like Logistic Regression or KNN on those features, and finally, use the predict_proba method to determine the probability that a particular tweet came from a particular user.

Love him or hate him (for this particular reason or others), there is probably no more iconic tweeter than the current President of these United States, Donald J. Trump. The Commander-In-Chief is known for his legendary 3 a.m. tweet rants that leave his supporters cheering, and the rest of us scratching or shaking our heads.

As for Hillary Clinton, the former U.S. Senator, Secretary of State, First Lady of the United States, and 2016 Democratic presidential nominee, Trump has what you could call a less than favorable opinion of the matriarch of the Clinton political dynasty. Needless to say, there’s no love lost on her end either.

So, I compared their tweets against one another to find out which of their tweets had the highest probability of coming from either Trump or Hillary. Here’s how I did it.

Part One

Step One

Using a function called Tweet Miner, I mined the tweets of the Donald and Hillary and put each user’s tweets into its own pandas DataFrame.
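The Tweet Miner function itself isn’t shown here, so the following is only a rough sketch of what such a helper might look like, not the actual code. It assumes the api argument behaves like a tweepy.API object (a user_timeline method that accepts screen_name, count, max_id, and tweet_mode, and returns statuses with id, created_at, and full_text). The function name, column names, and defaults are all made up for illustration.

```python
import pandas as pd


def mine_tweets(api, screen_name, max_tweets=1000, page_size=200):
    """Collect up to max_tweets recent tweets for one user into a DataFrame.

    `api` is assumed to behave like a tweepy.API instance (hypothetical here).
    """
    rows = []
    max_id = None  # used to walk backwards through the timeline page by page
    while len(rows) < max_tweets:
        kwargs = {
            "screen_name": screen_name,
            "count": page_size,
            "tweet_mode": "extended",  # ask for untruncated text
        }
        if max_id is not None:
            kwargs["max_id"] = max_id
        page = api.user_timeline(**kwargs)
        if not page:
            break  # no older tweets are available
        for status in page:
            rows.append({
                "tweet_id": status.id,
                "created_at": status.created_at,
                "handle": screen_name,
                "text": status.full_text,
            })
        # The next request starts just below the oldest tweet seen so far.
        max_id = page[-1].id - 1
    columns = ["tweet_id", "created_at", "handle", "text"]
    return pd.DataFrame(rows[:max_tweets], columns=columns)
```

With an authenticated client in hand, you would call this once per user, which gives you one DataFrame for each account.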

Step Two

I then merged the two data frames into one and passed the text of the merged data frame through TfidfVectorizer, a class in Python’s scikit-learn library, to pull out the n-grams within each tweet and weight them by their TF-IDF scores.
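Here is a minimal sketch of that merge-and-vectorize step. The two small frames are placeholders with made-up text standing in for the mined timelines, and the column names and ngram_range setting are my assumptions rather than the settings used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder frames standing in for the two mined timelines (made-up text).
user_a = pd.DataFrame({
    "handle": ["user_a", "user_a"],
    "text": ["big rally tonight", "great rally with a big crowd"],
})
user_b = pd.DataFrame({
    "handle": ["user_b", "user_b"],
    "text": ["proud of our volunteers", "thank you to our volunteers today"],
})

# Stack the two frames into one and renumber the rows.
tweets = pd.concat([user_a, user_b], ignore_index=True)

# Unigrams and bigrams; each row of X is one tweet, each column one n-gram.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(tweets["text"])

# Rank n-grams by their average TF-IDF weight across all tweets.
mean_weights = np.asarray(X.mean(axis=0)).ravel()
ngrams = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
ranking = pd.Series(mean_weights, index=ngrams).sort_values(ascending=False)
print(ranking.head(10))
```

The handle column is kept alongside the text because it becomes the target the classifier is trained to predict.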

Step Three

I processed the tweets and built a model so that I could do the necessary classification of each tweet.

I did this by:

Cleaning the input text with textacy, vectorizing it, and creating a target vector
Initializing a model
Grid searching for optimal hyperparameters using GridSearchCV with a LogisticRegression model in scikit-learn
Training and fitting the optimized model
Evaluating the performance of the model by plugging tweets from each person back into it, using the predict_proba method to get the predicted probabilities, and building a confusion matrix from the predictions
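The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the project: the texts are made-up placeholders, the textacy cleaning step is left out, and the hyperparameter grid is just an example of what one might search over.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up placeholder tweets; the target vector is the author of each one.
texts = [
    "big rally tonight", "great crowd at the rally", "huge rally, great people",
    "the rally was tremendous", "big crowd, great night", "tremendous crowd tonight",
    "proud of our volunteers", "thank you volunteers", "join our volunteers today",
    "families deserve better", "proud to stand with families", "thank you to families",
]
labels = ["user_a"] * 6 + ["user_b"] * 6

# Vectorizer and classifier chained so the grid search tunes both together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# 3-fold cross-validated search; refits the best combination on all the data.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)
model = search.best_estimator_

# Plug the tweets back in: per-class probabilities and hard predictions.
probabilities = model.predict_proba(texts)  # columns follow model.classes_
predictions = model.predict(texts)
matrix = confusion_matrix(labels, predictions, labels=model.classes_)
print(search.best_params_)
print(matrix)
```

Keep in mind that scoring the same tweets the model was trained on tends to flatter it. Holding out a test set with train_test_split would likely give a more honest confusion matrix.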

Step Four

I took the array returned by the predict_proba method and turned it into its own data frame. I then merged the existing input data frame with this new data frame so that each tweet sat next to its predicted probabilities. Next, I filtered the merged data frame by person, giving each of them their own data frame. Lastly, I ran a list comprehension to find the tweets with the highest and lowest predicted probabilities in each data frame.
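For illustration, here is one way that last step could look. The probability array is a hand-written stand-in for real predict_proba output, the handles and column names are hypothetical, and I use pandas’ idxmax and idxmin in place of a list comprehension, which should land on the same tweets.

```python
import numpy as np
import pandas as pd

# Input frame: one row per tweet (placeholder text).
tweets = pd.DataFrame({
    "handle": ["user_a", "user_a", "user_b", "user_b"],
    "text": ["tweet a1", "tweet a2", "tweet b1", "tweet b2"],
})

# Stand-in for model.predict_proba(...); one column per class.
classes = ["user_a", "user_b"]
proba = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.30, 0.70],
    [0.05, 0.95],
])

# Turn the array into its own frame, then attach it to the tweets row by row.
proba_df = pd.DataFrame(proba, columns=["p_" + c for c in classes])
scored = pd.concat([tweets.reset_index(drop=True), proba_df], axis=1)


def extremes(scored, handle):
    """Return (most likely, least likely) tweet text for one user's own tweets."""
    own = scored[scored["handle"] == handle]
    column = "p_" + handle
    highest = own.loc[own[column].idxmax(), "text"]
    lowest = own.loc[own[column].idxmin(), "text"]
    return highest, lowest


for handle in classes:
    print(handle, extremes(scored, handle))
```

The side-by-side concat only works if the probability rows are in the same order as the tweets, which is why the index is reset first.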

Lo and behold, here are the results:

The HIGHEST probability Trump Tweet

The LOWEST probability Trump Tweet

The HIGHEST probability Hillary Tweet

The LOWEST probability Hillary Tweet

It was very interesting to see which tweets my model predicted to be the most and least likely to come from Trump and Hillary. So this raised the question: “What if I ran this process on the tweets of people I know?”

Part Two

I decided to do this with some of the tweets of my old colleagues at The Tab Media Inc. in Brooklyn. They have been making national headlines recently by publishing the famed story about Aziz Ansari and his behavior on a date with a woman who went by the name “Grace” (not her real name), thereby changing the conversation surrounding consent in the era of the #MeToo movement. The people whose tweets I chose to compare were Babe.net editor and writer Amanda Ross and Una Dabiero, and Tab Media editors Matt McDonald and Josh Kaplan.

I repeated all the steps above and here’s what I got.