Using the Twitter API and NLP to analyze the tweets of different users.

Eugene Aiken
Jan 20, 2018
2016 U.S. Presidential Candidates Donald Trump & Hillary Clinton

Yesterday I worked on arguably one of the most interesting projects of my burgeoning data science career. Some kind soul was nice enough to put a Python wrapper around the Twitter API, so we can work with classes and call methods on it just as we would in any other Python application. So at this point, the possibilities with Twitter and Python are endless. And by endless I mean limited by your knowledge of Python, how well you can Google answers on Stack Overflow, and definitely by how many times you can hit the Twitter server per 15-minute interval (it’s 200).

The Project

Take two Twitter users, scrape their tweets, turn the text into numeric features with a vectorizer like Count Vectorizer or TF-IDF Vectorizer, classify with a machine learning algorithm like Logistic Regression or KNN, and finally, use the predict_proba method to estimate the probability that a particular tweet came from a particular user.

Love him or hate him (for this particular reason or others), there is probably no more iconic tweeter than the current President of these United States, Donald J. Trump. The Commander-In-Chief is known for his legendary 3 a.m. tweet rants that leave his supporters cheering, and the rest of us scratching or shaking our heads.

As for former U.S. Senator, Secretary of State, First Lady of the United States, and 2016 Democratic presidential nominee Hillary Clinton, Trump has what you could call a less than favorable opinion of the matriarch of the Clinton political dynasty. Needless to say, there’s no love lost on her end.

So, I compared their tweets against one another to find out which of their tweets had the highest probability of coming from either Trump or Hillary. Here’s how I did it.

Part One

Step One

Using a function called Tweet Miner, I mined the tweets of the Donald and Hillary and put each set into its own pandas DataFrame.
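As a rough sketch of this step, the snippet below turns mined tweets into one data frame per user. It does not call the Twitter API. The helper tweets_to_frame, the record fields id and text, and the handles user_a and user_b are all hypothetical stand-ins for whatever a mining function actually returns.

```python
import pandas as pd


def tweets_to_frame(raw_tweets, handle):
    """Build a DataFrame with one row per tweet and a column naming the author."""
    # Keep only the fields we care about; extra keys in the records are ignored.
    frame = pd.DataFrame(raw_tweets, columns=["id", "text"])
    frame["handle"] = handle
    return frame


# Placeholder records (not real tweets) standing in for mined data.
sample_a = [
    {"id": 1, "text": "placeholder tweet one"},
    {"id": 2, "text": "placeholder tweet two"},
]
sample_b = [
    {"id": 3, "text": "another placeholder tweet"},
]

frame_a = tweets_to_frame(sample_a, "user_a")
frame_b = tweets_to_frame(sample_b, "user_b")
```

Keeping the author's handle as a column pays off later, because it becomes the target the classifier learns to predict.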

Step Two

I then merged the two data frames into one and passed the text of the merged data frame through a TF-IDF vectorizer to pull out, weight, and rank the n-grams within each tweet.

Step Three

I processed the tweets and built a model so that I could do the necessary classification of each tweet.

I did this by:

  • Cleaning and vectorizing the input data with textacy and creating a target vector
  • Initializing a model
  • Grid searching for optimal hyperparameters using GridSearchCV with a LogisticRegression model in scikit-learn
  • Training and fitting the optimized model
  • Evaluating the performance of the model by plugging tweets from each person back into the model, scoring them with the predict_proba method, and summarizing the predictions in a confusion matrix
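The modeling steps above can be sketched roughly as follows. This is not the project's actual code: the textacy cleaning step is left out, the texts are made-up placeholders, and the parameter grid and the cv value are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Placeholder texts (not real tweets): six per user.
texts = [
    "the rally tonight was huge", "huge crowd at the rally",
    "we will win big tonight", "big win for the rally crowd",
    "tonight we win huge", "the crowd was big and loud",
    "thank you for your support", "proud of this team today",
    "support the team and vote", "vote today and thank a friend",
    "proud to support your vote", "thank you team for today",
]
# Target vector: 0 for the first user, 1 for the second.
target = [0] * 6 + [1] * 6

# Vectorizer and classifier chained so the grid search tunes both.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid: n-gram range and regularization strength.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, target)

# best_estimator_ is refit on all the data with the winning parameters.
model = search.best_estimator_

# Plug the same texts back in and tabulate true vs. predicted labels.
predicted = model.predict(texts)
matrix = confusion_matrix(target, predicted)
```

Scoring the model on the tweets it was trained on, as done here, tends to flatter it. Holding out a test set with train_test_split would give a fairer read on performance.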

Step Four

I took the array returned by the predict_proba method and turned it into its own data frame. I then merged the existing input data frame with this new data frame of predicted probabilities. Next, I filtered the merged data frame by person so that each person's tweets ended up in their own data frame. Lastly, I ran a list comprehension over each one to find the tweets with the highest and lowest predicted probabilities.
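A small sketch of this last step, under a few assumptions: the data is placeholder text rather than real tweets, the model is a plain TF-IDF plus logistic regression pipeline, and pandas' idxmax and idxmin are swapped in for the list comprehension. The helper extremes and the p_ column prefix are hypothetical names.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder input data frame (not real tweets).
tweets = pd.DataFrame({
    "handle": ["user_a"] * 4 + ["user_b"] * 4,
    "text": [
        "huge rally tonight", "the rally was huge",
        "we will win big", "thank you for the rally",
        "thank you team", "proud of this team",
        "support the team today", "a big day for the team",
    ],
})

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(tweets["text"], tweets["handle"])

# predict_proba returns one column per class, ordered like model.classes_.
proba = pd.DataFrame(
    model.predict_proba(tweets["text"]),
    columns=[f"p_{label}" for label in model.classes_],
    index=tweets.index,
)

# Attach the probabilities to the original rows.
scored = tweets.join(proba)


def extremes(frame, handle):
    """Return the (most likely, least likely) tweet text for one user."""
    own = frame[frame["handle"] == handle]
    column = f"p_{handle}"
    highest = own.loc[own[column].idxmax(), "text"]
    lowest = own.loc[own[column].idxmin(), "text"]
    return highest, lowest


highest_a, lowest_a = extremes(scored, "user_a")
highest_b, lowest_b = extremes(scored, "user_b")
```

Because the probability frame shares the input frame's index, a simple join lines each tweet up with its scores without needing a merge key.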

Lo and behold, here are the results:

The HIGHEST probability Trump Tweet

The LOWEST probability Trump Tweet

The HIGHEST probability Hillary Tweet

The LOWEST probability Hillary Tweet

It was very interesting to see which tweets my model predicted to be the most and least likely to come from Trump and Hillary. This raised the question: “What if I ran this process on the tweets of people I know?”

Part Two

I decided to do this with some of the tweets of my old colleagues at The Tab Media Inc. in Brooklyn. They have been making national headlines recently by publishing the famed story about Aziz Ansari and his behavior on a date with a woman who went by the name “Grace” (not her real name), thereby changing the conversation surrounding consent in the era of the #MeToo movement. The people whose tweets I chose to compare were Babe.net editors and writers Amanda Ross and Una Dabiero, and Tab Media editors Matt McDonald and Josh Kaplan.

I repeated all the steps above and here’s what I got.

Amanda and Una

Left to Right: Amanda Ross and Una Dabiero

The HIGHEST probability Amanda Tweet

The LOWEST probability Amanda Tweet

The HIGHEST probability Una Tweet

The LOWEST probability Una Tweet

Matt and Josh

Left to Right: Matt McDonald & Josh Kaplan

The HIGHEST probability Matt Tweet

The LOWEST probability Matt Tweet

The HIGHEST probability Josh Tweet

The LOWEST probability Josh Tweet

If you would like to see the code for how I did all of this, you can check it out here.


Eugene Aiken

Data Scientist and Analytics Professional living in Brooklyn, NY. eugeneaiken03@gmail.com