Natural Language Processing of Social Media Content

Alternate Title: Is the world talking about The Walt Disney Company? Yes, yes it is.

Sky B.T. Williams
Towards Data Science


Photo by geralt on pixabay

The Set-up

The Walt Disney Company is a prolific powerhouse of a multinational corporation, with about 200,000 employees and a market value that I estimate at roughly $150-$160 billion, based solely on the stock price at the time of writing and the number of outstanding shares (DIS). As with most companies, you had better believe that they have a vested interest in gauging how their customers and the public at large feel about the services they provide and the products they create. But collecting that information, whether through in-person, phone, email, or mail-in surveys and questionnaires, costs money and time to create, disseminate, collect, and analyze. I wanted to see if I could use Natural Language Processing (NLP) tools and unsupervised machine learning to gauge public opinion of Disney in real time, at minimal cost.

Getting the Data

Photo by geralt on pixabay

First, I needed data in the form of text that is publicly posted and likely to contain topic-specific opinions; I turned to Twitter. Using the Twitter API (Application Program Interface), I wrote a script to download tweets related to the keyword and hashtag ‘disney’ as they are posted online. Yes, absolutely anybody can freely do this for any tweets, and older tweets can even be queried from a back catalogue. In addition to the text of the tweet, you can also download a plethora of data and metadata related to that tweet and the user who tweeted or retweeted that status, including, but absolutely not limited to: time, date, location, language, number of followers, number of accounts followed, date of account creation, profile picture, and the usernames of whoever made the original tweet and whoever retweeted it. I streamed the real-time tweets into a MongoDB collection, and since I had kind of a lot of tweets to analyze, I stored them on a personal AWS (Amazon Web Services) instance. If anyone is interested in any of these tools, or even just in looking into what kinds of information the entire computer-owning world has access to about your Twitter account, I’ll include some links at the bottom of this post that you can check out.
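To give you a feel for what that kind of streaming script looks like, here is a minimal sketch assuming tweepy (3.x-style StreamListener) and pymongo; the credentials, database names, and track list are placeholders, not my actual setup.

```python
import tweepy
from pymongo import MongoClient

# Placeholder credentials -- substitute your own Twitter API keys
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# Connect to a local (or AWS-hosted) MongoDB instance
collection = MongoClient("mongodb://localhost:27017")["twitter"]["disney_tweets"]

class DisneyListener(tweepy.StreamListener):
    def on_status(self, status):
        # Store the full tweet object (text plus all the metadata) as one document
        collection.insert_one(status._json)

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. on rate limiting (420)
        return status_code != 420

stream = tweepy.Stream(auth=auth, listener=DisneyListener())
stream.filter(track=["disney", "#disney"])  # keyword/hashtag filter
```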

Preliminary Stuff

I collected somewhere around 100,000 to 120,000 tweets each day, in over 30 different languages, all related to Disney. As I explored the data, I discovered my first source of bias. While I searched for many permutations of the keyword ‘disney’ (i.e. “disney”, “disney’s”, “#disney”, etc.), this approach does not collect all possible tweets related to my topic, only tweets that mention one of the words in my search list. For example, my search criteria would have collected a tweet that said something like “I love Disney World!”, but not “I love EPCOT!”, since the second tweet does not contain a word similar to ‘disney.’ I could have expanded my search list to include each theme park specifically, but that would have biased my tweets towards theme parks, and I simply can’t include the title of every Disney movie, TV show, and character ever created. In order to keep my analysis as objective as possible, I kept my search criteria generalized, which biased my tweets towards over-representing properties that have the word ‘Disney’ in the title, such as Walt Disney World and Disney Channel.

I took my ~500,000 Disney tweets, cut them down to the 320,000 that were in English and that I could actually read, and ran them through a Python library called TextBlob, which analyzes each data point of text, in this case a single tweet, and calculates its polarity as a value between -1 (unfavorable) and 1 (favorable), with 0 meaning the sentiment is neutral. The distribution of all of the tweets together had a mean of about 0.115, or just positive of neutral, as can be seen below.
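For anyone curious, scoring a tweet with TextBlob is essentially a one-liner; here is a minimal sketch (the example sentences are invented):

```python
from textblob import TextBlob

tweets = ["I love Disney World!", "The new ride was a huge letdown."]

for text in tweets:
    # polarity ranges from -1 (unfavorable) to 1 (favorable), 0 is neutral
    polarity = TextBlob(text).sentiment.polarity
    print(f"{polarity:+.2f}  {text}")
```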

Image by Author

And I guess that that is pretty neat, but the truth is that The Walt Disney Company is actually a massive conglomerate of other companies, including Marvel, Lucasfilm, ESPN, ABC, Disney Parks and Resorts, Disney Cruise Line, and many more, as well as a very diverse portfolio of the associated intellectual properties. I don’t think that knowing overall sentiment would be nearly as useful to a company as knowing the sentiment around specific intellectual properties or brands. To group these tweets into separate categories without looking at each one individually, we need to turn to clustering through unsupervised learning, but first we need to explore vectorization and dimensionality reduction! Exciting stuff.

Data Cleaning and Vectorization

Photo by LoboStudioHamburg on pixabay

In general I find data cleaning pretty interesting to do, but very tedious to read about. Unfortunately it is also super important, so I have to mention it. The biggest issue with tweets is duplicates. I couldn’t stop collecting retweets, because a retweeted positive tweet, I feel, still indicates a new positive sentiment on the topic and may be useful later, but you can’t cluster data that contains duplicate data points; I’ll touch on this later, but for now just take me at my word that having a ton of tweets with the exact same text is really bad for my analysis. Therefore I removed any URL, symbol, or retweet tag from the tweets, then kept only the unique set of text data points to clean out anything that was either a retweet or the output of an automated bot, resulting in about 98,000 tweets.
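A minimal sketch of that cleaning step, using simple regular expressions; the sample tweets and the exact patterns here are illustrative, not my production script.

```python
import re

raw_tweet_texts = [
    "RT @someone: I love Disney! https://t.co/abc123",
    "I love Disney!",
]

def clean_tweet(text):
    """Strip URLs, the leading retweet tag, and leftover symbols."""
    text = re.sub(r"http\S+|www\.\S+", "", text)   # URLs
    text = re.sub(r"^RT\s+@\w+:\s*", "", text)     # retweet prefix
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)     # punctuation, emoji, symbols
    return text.lower().strip()

cleaned = [clean_tweet(t) for t in raw_tweet_texts]
# Keeping only unique texts drops retweets and copy-paste bot output
unique_tweets = list(set(cleaned))
print(unique_tweets)  # both examples collapse to 'i love disney'
```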

Machine learning algorithms take numbers as input. There are many different ways to turn other forms of data into a numerical format, such as deconstructing a picture into red, green, and blue values for each pixel, or in my case, turning a tweet of words into counts for each word in the tweet, a process called tokenization and vectorization. First you define what exactly a token is, which could be each unique word, combinations of adjacent words, or even any combination of a certain number of adjacent letters; these techniques can be used separately or in combination, depending on your needs. Next you vectorize your text, which assigns a value to each of these tokens for each data point. The most basic example would be the text “I like soup”, which count-vectorizes to [i:1, like:1, soup:1]. We now have a numerical representation of our text!

Next, in order to have all of our tweets in the same table, we can’t just have a row for “I like soup” and one column each for ‘i’, ‘like’, and ‘soup’; we need a column for every word that is contained in any of our data points. For instance, if our second data point is “I do not like flying”, the vectorization of “I like soup” now has to expand to specify that it does not contain the new words, so its new value is [i:1, like:1, soup:1, do:0, not:0, flying:0], while the value of “I do not like flying” is [i:1, like:1, soup:0, do:1, not:1, flying:1]. Both data points have different values, but exactly the same format. You can see how, with 98,000 unique tweets, the number of counts for each text data point quickly becomes huge, even though most of the counts will be 0.
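Here is the same “I like soup” example as a runnable sketch with scikit-learn’s CountVectorizer; I’m assuming that library here purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like soup", "I do not like flying"]

# token_pattern keeps single-letter words like "i", which the default drops
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['do' 'flying' 'i' 'like' 'not' 'soup']
print(counts.toarray())
# [[0 0 1 1 0 1]    <- "I like soup"
#  [1 1 1 1 1 0]]   <- "I do not like flying"
```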

Photo by geralt on pixabay

Still with me? Excellent! So now we have a beautiful table/dataset of 98,000 rows, each representing a tweet, and around 10,000 columns, each representing the number of times a unique word appeared in that tweet. This table is mostly 0s, with some 1s, 2s, and the occasional 3 mixed in, which is now actually a different kind of problem. If someone (such as a computer program) were to just look at the numbers in the table and decide whether these tweets were similar or different, they would all look about the same, because pretty much every row is made up of around 10,000 0s with a handful of 1s, 2s, and 3s. This is an example of the curse of dimensionality, a problem many fields of analysis have to contend with every day.

Before moving on to dimensionality reduction, I also want to mention that I removed stop words, such as ‘a’, ‘the’, and ‘if’, which carry very little analytical value, and weighted the words so that a word like ‘disney’ that appears in a lot of the data points carries less value in determining differences between the topics of the tweets.
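That kind of down-weighting of very common terms is what TF-IDF weighting does; here is a sketch using scikit-learn’s TfidfVectorizer (an assumption on my part, shown on the unique_tweets list from the cleaning sketch above), rather than the exact weighting setup I used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words='english' drops words like 'a', 'the', 'if';
# IDF weighting down-weights terms (like 'disney') that appear in most tweets
tfidf = TfidfVectorizer(stop_words="english", max_features=10_000)
X = tfidf.fit_transform(unique_tweets)

print(X.shape)  # (number of tweets, number of token columns)
```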

Dimensionality Reduction

Wow, this post is getting long, so let’s charge ahead and try to cover some ground. In case you took a break and came back, here’s where we are: we want to group a bunch of tweets by Disney topic, we’ve converted our text into a numerical format, and now we have a dataset with ~10,000 columns of mostly zeros, which is accurate, but bad for analysis. In order to reduce the number of columns in my dataset while still retaining lots of descriptive information, I used Singular Value Decomposition (SVD). Basically, it takes all of the information (variability) within the tweets and distills it down into a smaller number of newly created columns. This process is abstract, heavy in linear algebra, and complicated. What is important to take away is that we can get a lot of the information originally contained in the 10,000 token columns compressed down into just a few dozen new columns, which is great.
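A sketch of what that reduction looks like with scikit-learn’s TruncatedSVD, which works directly on the sparse TF-IDF matrix from the previous sketch; the 50-component figure is just for illustration.

```python
from sklearn.decomposition import TruncatedSVD

# Compress ~10,000 sparse token columns into a few dozen dense components
svd = TruncatedSVD(n_components=50, random_state=42)
X_lsi = svd.fit_transform(X)  # X is the TF-IDF matrix from earlier

print(X_lsi.shape)                          # (number of tweets, 50)
print(svd.explained_variance_ratio_.sum())  # share of the original variance retained
```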

We have now created what’s called an LSI (Latent Semantic Indexing) space, which can help us determine how useful each new column is in separating the tweets into topic groups that have words in common. Unfortunately, because of the ‘distilled’ nature of the new columns, this data is a lot harder to interpret than the straight word counts, so let’s feed these columns to a clustering algorithm. Side note: the new columns will place tweets that have words in common closer together, which is why retweets and bots are bad. If the exact same 20 words appear in identical tweets 15,000 times, it artificially creates a relationship between those words that confuses the model.

Clustering

I would love to go into the ins and outs of the K-Means unsupervised clustering algorithm, but I won’t, and instead I will just say two things. First, the algorithm (or model) assigns each data point to a cluster number so that tweets that are textually similar are clustered together, ideally into some sort of logical theme or topic. Second, what makes it unsupervised is that you do not use it to create predictions and then test them against known data, which would be supervised. An example of supervised learning would be giving a model a bunch of information, like day of the week, weather conditions, time of day, etc., and having it predict whether a bus will be on time or late. Once you have a prediction, you compare it to what actually happened and decide whether it was right or wrong; pretty straightforward. With an unsupervised model like K-Means, I give it a bunch of data, such as vectorized tweets, and the algorithm groups the data points into clusters, but since there are no known labels to score the model against, I need to sift through the tweets in each grouping to see if the clusters follow some useful theme. To cut to the chase, in this case, they did not.
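Fitting K-Means on the reduced matrix looks roughly like this; the cluster count and the cluster ID being inspected are illustrative placeholders, not the values from my final tuning.

```python
import numpy as np
from sklearn.cluster import KMeans

# 15 clusters is an arbitrary choice for illustration; I tuned this by hand
km = KMeans(n_clusters=15, random_state=42, n_init=10)
labels = km.fit_predict(X_lsi)  # X_lsi is the SVD-reduced matrix from above

# Read a handful of tweets assigned to one cluster to look for a theme
cluster_id = 3
for i in np.where(labels == cluster_id)[0][:10]:
    print(unique_tweets[i])
```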

Image by Author

Above is a silhouette plot, which basically shows that the majority of my tweets are in a giant, fairly diffuse cluster (12), while the other clusters are much smaller in comparison. When looking at the tweets contained in each grouping, in some cases there were the makings of clear themes, though there was almost always a lot of overlap between topics and confusion in the signal. In other words, I was not able to pull out a single cluster and trust that it contained all tweets on a specific topic, such as brand, character, movie, etc.
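If you want to compute the numbers behind a plot like that, scikit-learn’s silhouette functions are the place to start; this is a sketch of the idea, not my exact plotting code, and cluster 12 here simply refers to the big catch-all cluster mentioned above.

```python
from sklearn.metrics import silhouette_score, silhouette_samples

# Overall average silhouette across all tweets
print(silhouette_score(X_lsi, labels))

# Per-tweet silhouette values, which a silhouette plot groups by cluster;
# a wide, low band for one cluster means a large, diffuse catch-all group
per_tweet = silhouette_samples(X_lsi, labels)
print(per_tweet[labels == 12].mean())
```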

Image by Author

Why Disney tweets are hard to work with

So after a lot of model tuning and trying different things to get my clustering algorithm to produce something really exciting and insightful (shoot me a message and I would be happy to go into all of it in painful detail), there came a time when I had to conclude that the tweets just didn’t want to work with me. That said, I think that there is still something that can be learned, and you may even find it interesting, so let’s dig into a couple of the clusters and see why Disney tweets don’t like to be separated into groups.

First off, this is a common example of a tweet in my data set:
- “I wanna go to Disney!”
As much as I appreciate, and maybe even agree with, the sentiment, if you can imagine what the vectorized version of this tweet would be, there isn’t much there for the model to work with. If you had to label this tweet with a topic, what would it be? Maybe we could infer that they are talking about a theme park, but there isn’t really mention of a specific park, or a movie, or even a character for the model to use as a topic label. There is definitely a lack of signal, or usable information, in this tweet.

Image by Author

When digging into another identified cluster, I would see the following tweets:
- “Ariel is the only Disney Princess to have had a child.”
- “How is Pocahontas not a Disney Princess?”
From these I could see some type of ‘Disney princess’ theme starting to form, which would be great! But when I dug deeper I could see that mentions of princesses were in many other clusters as well, indicating confusion between my topics.
- “Disney owns both Marvel and Lucas Films, idc what you say Shuri and Leia are Disney Princesses in my eyes”
This third tweet was placed in the same group as the first two, but it undeniably includes material related to Marvel and Lucasfilm. If I had a cluster each for Marvel and Star Wars/Lucasfilm, where would the model place this tweet?

One final example:
- “Disney Is Not Only Disney Parks! #Disneysmmc @DisneyCruise @DisneyAulani @DVCNews @Disneymoms”
In this single tweet I see ‘Parks,’ ‘smmc’ (a special event at Disney — Social Media Moms Celebration), ‘Cruise,’ ‘Aulani’ (Disney resort in Hawaii), and ‘DVC’ (Disney’s timeshare program — Disney Vacation Club). As with the previous example, I interpret this as a lot of potential signal in a tweet, but not enough distinct signal for the model to be able to place it in a single category with textually similar tweets of the same topic.

Image by Author

Closing thoughts

So maybe I wasn’t able to complete the efficient and informative, brand-focused sentiment analysis on behalf of Disney that I had hoped for, but I did take you on a quick, or not so quick, tour of how a simple NLP project is structured, as well as how and why certain steps are done along the way. As it turns out, The Walt Disney Company is really good at taking an intellectual property, like a single successful character, and spreading it across many of its businesses. It would not be at all uncommon for a character like Aladdin to appear in movies, TV series, theme parks, cruise ships, merchandise, and more, at times even appearing alongside seemingly unrelated characters. Not only that, but it seems that when somebody loves a Disney movie, they will probably also love the parks and cruise ships, and maybe even have a Disney wedding, mentioning it all in a single 280-character tweet. These are just a couple of the reasons that Disney is so incredibly successful, but also why my model had such a hard time separating these tweets into distinct topics.

Image by Author

Thanks for hanging in there for the whole thing, or at least skipping past the technical bits to read the end! I hope that you found something interesting, and feel free to reach out here or on LinkedIn to let me know what you think! (https://www.linkedin.com/in/sky-williams/)

Resources

Twitter API: https://developer.twitter.com/en/docs
Amazon Web Services: https://aws.amazon.com
MongoDB: https://www.mongodb.com
If you want details about any of the techniques, algorithms, approaches, or libraries that I used: https://www.linkedin.com/in/sky-williams/
