NLP on The Office series

Leveraging text mining techniques such as tokenization, tf-idf and sentiment analysis to analyze a television series’ transcripts.

Kristóf Rábay
Towards Data Science

--

Heads up 💬👂: to judge whether my analysis and findings are valid, you need to be ‘quite’ familiar with the show. Otherwise you may not recognize just how powerful the tools that text mining and NLP offer can be for research projects with awesome findings.

There are 2 main goals to this blogpost:

  1. Introduce text analytics methods such as tokenization, n-grams, tf-idf and different sentiment extraction and scoring tools, and cover the basics of the LDA algorithm, which can be used to find topics / ‘cluster’ text.
  2. Offer The Office fans a nostalgic feeling as they read about their favorite show and see how data science can back up beliefs and presumptions, such as Angela being an overall mean person who was nicest to Dwight, or how personal words, like Andy’s famous ‘(Big) Tuna’ or Ryan’s WUPHF, can easily be extracted from the text and used to identify characters.

I’ll be using R throughout the whole process and will be leveraging the following libraries:

  • tidyverse and data.table for data manipulation
  • tidytext, stringr, textdata, textstem, stopwords, sentimentr and topicmodels for NLP related work
  • ggplot2, igraph and visNetwork for visualizations

I won’t be doing any Deep Learning for text analysis, so please don’t expect any embedding layer knowledge or RNN / LSTM networks.

Before any analysis, I needed to get all the transcripts (every episode from every season) from the internet, for which I used web scraping. The scraper can be found in my GitHub repository.

I’m also linking the actual analysis code, hosted on my GitHub, up front. The original output of the analysis was a Shiny dashboard application, so the code is found in the app.R file. I’ll embed GitHub Gists in the blog, but if someone is particularly interested in a part of my analysis, they can find the plot and network renderings in the server side of the script. (For those who aren’t familiar with Shiny: apologies, reach out to me with questions, or pull my repo, run app.R and enjoy the nice dashboard I created!)

I’d like to mention a couple of people who gave me the idea to share my analysis with the world: (1) Eduardo Arino, a data science leader in Silicon Valley who taught me NLP (https://www.linkedin.com/in/earino), and (2) Mihaly Orsos, who held R courses and showcased quick and efficient web scraping, a very useful skill to have (https://github.com/misrori). Both courses were electives in the Business Analytics Master’s program at CEU, Budapest, Hungary (https://economics.ceu.edu/program/master-science-business-analytics).

Let’s get to it!

Source: giphy

I’m dividing this whole blogpost into 4 chunks. Here’s the ‘syllabus’:

  1. Word count by season, selecting top x people to work with (limited data for better visualization, more comprehensible outcomes)
  2. Analyzing most frequently used words, phrases (bigrams), and most personal words (tf-idf)
  3. Running sentiment analysis with and without seasonality, determining ‘who is nice to whom, who is mean to whom and to what extent’
  4. Showing what LDA can be capable of. Heads up: it’s not that useful in this use case; topic modeling works better on news articles or blogs, where there is indeed a limited number of specifiable topics

#1. Getting familiar with the data, selecting sub-data to work with

As mentioned, I’ll be working with a limited # of people to reduce the complexity of my visuals. I’ll be selecting the top 12 people with the most lines throughout the whole series. They cover almost 80% of all lines spoken, so I find this to be a good ratio. In addition, when comparing the twelve people I can use a 3 x 4 ggplot grid structure, making my plots symmetric.

Let’s see how many lines these people had overall and by season.

Overall line count of top 12 characters

It’s clear Michael had the most lines, even without participating in the final 2 seasons. If we check the line count by season, we see exactly who turned out to get the majority of the lines after Michael left.

Line count by season

ggplot, combined with tidytext’s reorder_within helper, offers a great way to automatically order your bars / columns within each category / facet (the season in this case). I’m providing sample code for you to see:
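A minimal sketch of that pattern, assuming the transcripts sit in a data frame called lines with the columns described below:

```r
library(tidyverse)
library(tidytext)   # provides reorder_within() and scale_x_reordered()

lines %>%
  count(season, name) %>%                               # line count per character per season
  group_by(season) %>%
  slice_max(n, n = 12, with_ties = FALSE) %>%           # keep the 12 busiest characters per season
  ungroup() %>%
  mutate(name = reorder_within(name, n, season)) %>%    # reorder bars within each facet
  ggplot(aes(name, n)) +
  geom_col() +
  scale_x_reordered() +                                 # strips the internal "___season" suffix
  coord_flip() +
  facet_wrap(~ season, scales = "free_y") +
  labs(x = NULL, y = "line count")
```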

The data frame has 3 columns that are used: (1) the name indicating who speaks the line, (2) the actual line, and (3) the # of the given season. What we do here is group the data by season and name, aggregate on count, select the top 12 results (people) by season and then reorder the bars — indicating names — within seasons. This simple reorder_within gives us very nice, smooth looking charts.

Now that we know which people we’ll be working with, we can start doing some text analysis.

#2: Text analysis (tokenization, bigrams, tf-idf)

A. Tokenization — checking most frequently used words

As a first step, let’s see which words people use most frequently. tidytext offers a very easy-to-use function called unnest_tokens that can be run on any string vector to take each word out of every string and store them in a ‘word’ column of a data frame. If our string is ‘My name is John’, it will unnest this sentence into a 4 x 1 string vector with the elements [‘My’, ‘name’, ‘is’, ‘John’].
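Here is a tiny, self-contained illustration of that (note that unnest_tokens also lowercases by default):

```r
library(dplyr)
library(tidytext)

tibble(text = "My name is John") %>%
  unnest_tokens(word, text)   # one row per word, lowercased by default
#> # A tibble: 4 x 1
#>   word
#>   <chr>
#> 1 my
#> 2 name
#> 3 is
#> 4 john
```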

As a second step, notice that in the previous string vector only 2 out of the 4 words carry meaning / interest: ‘name’ and ‘John’. ‘My’ and ‘is’ are considered stopwords. These are commonly used words that carry no particular meaning and add no further information to the text. Thankfully, lexicons containing stopwords for any given language are publicly available, and we just need to load them into our script. English examples are ‘onix’, ‘snowball’ and ‘SMART’, but lexicons exist for practically any language you want.

In order to get rid of the stopwords in our newly created column where each cell contains one word, we need the opposite of a regular join. With a simple join (say, a left join) we keep everything on the left side and attach whatever matches from the right. Here, however, if a word in the left column matches a stopword, we want to drop that observation / row. That is why the dplyr library offers R users anti_join, which does exactly that: it keeps only the rows of the left table that have no match in the right one. After applying an anti_join of stopwords on the previous string vector we go from [‘My’, ‘name’, ‘is’, ‘John’] to [‘name’, ‘John’].

Before the results, here’s the code to do that, using 2 different stopword lexicons and a manually created vector of strings that are common in the text but carry no ‘meaning’. We unnest the ‘text’ column containing the lines to create a new column called ‘word’, then drop the rows containing stopwords.
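A sketch of that step (the manual noise-word list below is just an illustrative placeholder):

```r
library(tidyverse)
library(tidytext)

data("stop_words")   # tidytext's built-in stopword table (onix, SMART, snowball lexicons)

# illustrative placeholder for filler tokens common in spoken dialogue
custom_stopwords <- tibble(word = c("yeah", "hey", "uh", "um", "gonna"))

tokens <- lines %>%
  unnest_tokens(word, text) %>%                                     # one row per spoken word
  anti_join(filter(stop_words, lexicon %in% c("snowball", "SMART")),
            by = "word") %>%                                        # drop stopwords from 2 lexicons
  anti_join(custom_stopwords, by = "word")                          # drop the manual list

tokens %>%
  count(name, word, sort = TRUE)                                    # most frequent words per character
```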

And here’s the actual output.

Most frequently spoken words by people

So there’s something interesting here. For each person, the list of their top words contains the names of other people. This is logical, as this is a series built on conversation, and once stopwords are removed, the most common tokens come from people turning to their colleagues and calling them by their names. It’s also interesting that some people, like Dwight and Pam, have their own names near the top. This is because they have to introduce themselves over the phone: Dwight being a salesman and Pam being the office receptionist.

This gave me an idea: let’s check who people mostly converse with. For this I’ll shift the column indicating who speaks the line by one row, so I can see who spoke a line and whom it was directed to, and finally I’ll drop rows where the speaker and the addressee are the same person (consecutive lines by one character, or the closing and opening of a scene).
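A simplified sketch of that shift-and-count idea, assuming the lines data frame is ordered as the lines were spoken:

```r
library(tidyverse)
library(visNetwork)

edges <- lines %>%
  mutate(to = lead(name)) %>%            # shift the speaker column: the next speaker is the addressee
  filter(!is.na(to), name != to) %>%     # drop consecutive lines by the same character
  count(from = name, to) %>%
  mutate(width = 10 * n / max(n))        # edge width proportional to lines exchanged

nodes <- lines %>%
  count(name) %>%
  transmute(id = name, label = name, value = n)   # node size represents overall line count

visNetwork(nodes, edges)
```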

Unfortunately I cannot paste an interactive visNetwork object here, but here’s a screenshot.

It shows that most talking happens between Michael, Dwight, Jim and Pam; I also selected Dwight to demonstrate how you can inspect the results by person. I set the width of the edges to scale with the # of lines spoken between two characters, and the size of each node to represent its overall line count. visNetwork is a great tool and very easy to use!

B. Bigrams (ngrams) — frequent phrase analysis

We now know the top words include names of other people and occasionally some tokens that hint at a person’s identity (e.g. ‘Mike’ spoken by Darryl, ‘Tuna’ spoken by Andy or ‘Vance’ spoken by Phyllis). Let’s turn to finding the most common phrases that people use.

A phrase consists of more than just one word. You can analyze any number of words that follow each other — that is why the methodology is called n-gram analysis. If n happens to be 2, we call them bigrams. This is what we’ll be doing now.

Here’s the code to do that.
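A sketch of that pipeline, again assuming the lines data frame from before:

```r
library(tidyverse)
library(tidytext)

data("stop_words")

bigrams <- lines %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%   # two-word phrases
  filter(!is.na(bigram)) %>%                                 # very short lines produce NA bigrams
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%                    # keep bigrams where neither token is a stopword
  unite(bigram, word1, word2, sep = " ")                     # glue the two tokens back together

bigrams %>%
  count(name, bigram, sort = TRUE)                           # most frequent phrases per character
```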

We can leverage the same unnest_tokens function, but this time we call the output column ‘bigram’, we use the ‘text’ column to create the output from, we set the token argument to ‘ngrams’ instead of ‘words’ (the default) and set the ’n’ parameter to 2, indicating we want bigrams.

Sticking to the ‘My name is John’ example: while tokenization (token = ‘words’) created the vector [‘My’, ‘name’, ‘is’, ‘John’], the bigram method results in the vector [‘My name’, ‘name is’, ‘is John’]. We then separate this new column into two, taking the first and second token of each bigram. Next, we are only interested in bigrams where neither token is a stopword, so that both carry some information, which means filtering stopwords out of both the first and second token columns. As a last step, we unite the two columns by gluing them back together with a space. And we’re done: we have ‘bigramized’ the textual data. Here are the results, by person:

Top bigrams (phrases) used by top people

Now this is something! If you’re familiar with the show, you can clearly see that the bigrams are very much capable of identifying people. Who else would use ‘nard dog’ and ‘broccoli rob’ other than Andy? Who would be talking about ‘business school’ and ‘mifflin infinity’ other than Ryan? Dwight clearly likes the phrases ‘regional manager’ and ‘assistant regional’. If we had used trigrams (3 words making up a phrase), we’d see ‘assistant regional manager’ in Dwight’s list for sure.

Bigrams are much more capable of identifying a certain person than simple tokens. However, there’s a method that is even more trustworthy than ngrams. It’s called tf-idf and is the subject of the next topic.

C. tf-idf — finding most personal / unique words by person

I sort of hinted at what this algorithm is capable of, but let me quickly explain how it does that. The tf part of tf-idf stands for Term Frequency, while idf means Inverse Document Frequency.

The first part is straightforward: it measures how frequently each word occurs within a document, i.e. its count relative to the document’s total word count (for example, Michael is a ‘document’ here, or at least his vocabulary is — tf finds Michael’s top words). Essentially a tokenization and count aggregation.

IDF is where the magic happens. It looks across all documents (in this case the characters’ vocabularies) and checks how widely each word is used: if a word is found in most of the documents, it is considered common; if it appears in only a few of them, it is considered rare.

Then tf-idf multiplies the term frequency within a given document by the inverse document frequency, and determines, for any word, how unique it is to a certain document. For example, Michael may have said ‘rabies’ lots of times while others hardly mentioned that word, so the tf-idf algorithm will determine that rabies is a word unique to Michael’s vocabulary.
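In formula form, for a term t in document d, with N documents in total (this is the standard definition and essentially what tidytext’s bind_tf_idf computes):

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}
\qquad
\mathrm{idf}(t) = \ln\frac{N}{\lvert \{ d : t \in d \} \rvert}
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```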

Having understood the basics of the algo, let’s apply it to the data and see what it found. First, the code, then the explanation.
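Something like this (a sketch using the same lines table as before):

```r
library(tidyverse)
library(tidytext)
library(textstem)

personal_words <- lines %>%
  unnest_tokens(word, text) %>%               # 1. tokenize the lines
  mutate(word = lemmatize_words(word)) %>%    # 2. lemmatize ('studying' -> 'study')
  count(name, word) %>%                       # 3. word counts per character
  bind_tf_idf(word, name, n)                  # 4. term, document and count columns

personal_words %>%
  group_by(name) %>%
  slice_max(tf_idf, n = 8) %>%                # 5. top 8 most personal words per character
  ungroup()
```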

It seems there’s a lot going on here, but it’s quite simple.

  1. I unnest the string data into tokens — simple words
  2. I apply lemmatization this time. This is a procedure that tries to get words back to their ‘normal’, ‘root’ forms. It may take the words [‘studying’, ‘studies’, ‘studied’, ‘study’] and turn them all into ‘study’, as they all derive from that root. This way I drop information about the ‘structure’ of the words but gain information about which ‘root’ words were used most.
  3. I then count the words by person, an input necessary for tf-idf to determine a word’s idf (the count function is a group_by followed by a summarize(count) in one step)
  4. And then comes the most important part: the bind_tf_idf function from tidytext. It doesn’t get much easier than that. It takes the document column (name of the person), the token column (words after unnesting and lemmatization) and the count column (how many times the word occurred in the given document) and runs the tf-idf formula.
  5. As a last step I decided to visualize only the top 8 unique words by each person, due to ties.
tf-idf on top 12 people

So let’s see. The most unique words have been identified. Really, ALL of these characters could be identified by a true Office fan just by looking at their tf-idf words.

  • Andy: tuna, nard, treble, Jessica
  • Angela: Sprinkles, parum pum from Little Drummer Boy
  • Darryl: beanie from Justine, Mike
  • Dwight: deputy, Mose, sheriff, farm
  • etc…

Such a simple tool capable of such great results.

After getting familiar with vocabularies, let’s start focusing on the ‘other big thing’ people usually associate NLP with: sentiment analysis.

#3. Sentiment analysis

There are numerous ways of running sentiment analysis / sentiment scoring on textual data. Some possible methods are:

  • Categorical sentiment by words (e.g. positive / negative classes as in the Bing lexicon, or emotion classes like anger, joy, trust and anticipation from the NRC lexicon)
  • Numerical scoring of sentiment (AFINN lexicon: beautiful, amazing +3, troubled, inconvenience -2)
  • Sentiment scoring run on ngram / sentence level — algorithm determining the overall sentiment of a sentence between -1 and 1 where -1 is all negative, +1 is all positive, 0 is neutral / non-classifiable.

Of the above methods I’ll leverage 3:

  1. I’ll classify tokens into positive and negative, count them by person and create a list for each person of their most used positive and negative words.
  2. I’ll apply the AFINN lexicon and score each word, then multiply that sentiment score by the frequency of the word and create a list of words that contribute the most positivity or negativity to people’s vocabularies.
  3. I’ll run sentiment scoring on sentence / line spoken by character level and compare the results to token-level sentiment aggregation by AFINN scores.

A. Running sentiment analysis using categorical classification

This is just a simple intro step. I take all words, classify them into positive and negative categories, count each word, and determine the most frequently used positive and negative words per person. This is really just a warm-up exercise.

Most frequently used positive and negative words by character

What we see here is that, comparing the counts of top positive and negative words, most people use their most positive words more frequently than their most negative ones. The exception is maybe Angela, whose most frequent positive word (fine) occurs almost as many times (~22) as her most frequent negative word (bad). This in itself isn’t really representative of personalities. For that, I’ll be using the AFINN lexicon.

B. Numeric sentiment scoring using AFINN

This time, instead of simply categorizing words, I’ll assign numbers representing positiveness and negativeness to each word. Then I’ll multiply the sentiment scores by the counts of the words, creating a ‘contribution’ factor: how much positivity or negativity a word contributed to a person’s vocabulary. For example, the word ‘disgusting’ has a score of -3, while the word ‘pretty’ is scored at +1. This means it takes 3 ‘prettys’ to balance out 1 ‘disgusting’.

Before showing results, let me show you the code to do that.
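A sketch of those steps (the manual noise-word list and the character-name table below are illustrative placeholders):

```r
library(tidyverse)
library(tidytext)
library(textstem)
library(textdata)   # supplies the AFINN lexicon for get_sentiments()

data("stop_words")
custom_stopwords <- tibble(word = c("yeah", "hey", "uh", "um"))     # placeholder noise words
character_names  <- tibble(word = tolower(unique(lines$name)))      # names carry no sentiment

contributions <- lines %>%
  unnest_tokens(word, text) %>%                          # 1. tokenize, keeping who spoke the line
  mutate(word = lemmatize_words(word)) %>%               # 2. lemmatize ('studying' -> 'study')
  anti_join(stop_words, by = "word") %>%                 # 3. drop stopwords,
  anti_join(custom_stopwords, by = "word") %>%           #    the manual list
  anti_join(character_names, by = "word") %>%            #    and the character names
  count(name, word) %>%                                  # 4. word counts per character
  inner_join(get_sentiments("afinn"), by = "word") %>%   # 5. AFINN score per word
  mutate(contribution = n * value)                       # 6. score x count = contributed sentiment
```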

Here’s what’s going on there:

  1. We unnest the lines into words by people (keeping the information about who spoke the line)
  2. We apply lemmatization to get words back to their root form (‘studying’ → ‘study’)
  3. We get rid of stopwords from 2 lexicons, a manually created list and the list of the names of people (we cannot score sentiment on names, they’ll be classified as neutral and affect mean and median statistics)
  4. We apply the ‘count’ function, which first groups data by ‘name’, then ‘summarizes’ it by count aggregation. We now have 3 columns: (1) name, (2) word and (3) count.
  5. We join the AFINN lexicon to our data by the ‘word’ column. Now each word has been scored between -5 and +5 (we use an inner join so only matches are kept in the end)
  6. As a last step, we multiply the ‘score’ with the ‘count’ to get the contribution factor

Let’s check the visual results.

Contributed sentiment by people

Now all we need to look for is where the red bars are longest to find people who contribute most negativity to their conversations. Angela, Darryl and Dwight seem to be the ones where the average length of the red bars is close to that of the green (positive) ones.

C. Sentiment between people

There are a couple more things I realized I should do with sentiments. One of them is to check who’s nicest and meanest to whom in the series. For this I’ll use an approach similar to the one behind my ‘conversation network’: I’ll determine who spoke each line to whom, run sentiment scoring by the ‘from — to’ columns and visualize my results!
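One way to sketch this, reusing the shift trick from the conversation network together with sentimentr’s line-level scoring (the exact scoring choice here is my assumption):

```r
library(tidyverse)
library(sentimentr)

# score every line once, then aggregate by speaker / addressee pair
line_scores <- sentiment_by(get_sentences(lines$text))$ave_sentiment

pair_sentiment <- lines %>%
  mutate(line_sentiment = line_scores,
         to = lead(name)) %>%                  # the next speaker is treated as the addressee
  filter(!is.na(to), name != to) %>%
  group_by(from = name, to) %>%
  summarise(score = sum(line_sentiment), .groups = "drop")
```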

Sentiment between people (bar chart)

Let’s take Angela: she’s nicest to Dwight, meanest to Oscar. The bar chart is easily interpretable, but a network will make this look a lot nicer. Again, I cannot paste an HTML element here, so check out two screenshots of the otherwise interactive networks.

Sentiment network (visNetwork)

How much nicer! There’s one more thing I can do to make it look even better. I don’t necessarily want to visualize all relationships, but rather focus on the most positive and negative ones. But how do we decide which ‘edges’ to keep in the network? Let’s look at the distribution of scores by person, drop the values around the mean / median and only visualize the ‘extreme’ relationships.

Score distribution

Here’s what the above chart means. Take Jim: his sentiment scores with other people range from around 15 to 100, with high extremes. Most scores tend to fall between -10 and +30, so I’ll keep everything outside of that range to work with the extremes. The visNetwork containing the most ‘extreme’ relationships is the following:

Sentiment network with edges representing measure of positivity / negativity

We can sort of see (again, sorry, this would be an interactive network) that the green edges are widest between Jim and Pam, meaning the sentiment between them is the most positive of all. To take another example from the network, Oscar and Angela are both considerably negative toward each other.

D. Seasonal sentiment trend

As a last step of the sentiment analysis, let me quickly compute by-episode and by-season sentiment trends for all characters to see if we can follow their happiness / sadness.

First, the episode-level overview.

The above chart offers no information whatsoever. The average sentiments by episode (calculated using the sentiment_by function of the sentimentr package, where sentiment is scored per line between -1 and 1) are too volatile to offer any insights. Let’s check the seasonal data.

This is somewhat more interpretable, but no clear trend can be extracted. Maybe Andy’s firing is hinted at, but there is no clear relationship between seasonal average sentiment scores and happiness / sadness of people.

What can also be looked at, though, is how the average sentiment developed over time between the two main rivals of The Office: Jim and Dwight.

I ran two algorithms on this, first using AFINN at the token level, then sentimentr at the line / sentence level, and the results are quite similar.

Both suggest the pair’s relationship got better towards the end, which is in line with the story, as the rivalry stopped and a friendship began. This is arguable, though; the sentiment trend is difficult to model here.

#4. Quick intro to LDA

Before finishing up, let me show you another typical NLP job: topic modeling with LDA (Latent Dirichlet Allocation). To me it’s quite similar to how an unsupervised machine learning algorithm like k-means clustering does its job. The methodology differs a lot, but the results are similar: in the end, clustering finds data points that somewhat ‘belong’ together and form a similar but unlabeled group, while LDA’s output is a list of words that ‘belong together’ and make up an unlabeled topic.

I’m not getting into the details of how the algorithm actually runs, but I’ll give you an example. We’ve been working with the top 12 people (by line count). What we can use LDA for in this case is to find people with similar vocabularies. That is, have (force) LDA create 12 clusters / topics, and spit out the probability of each topic making up a given person’s vocabulary.

Let’s do just that.
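A sketch of those steps, assuming word_counts is the name / word / count table described in the list below (the seed value is arbitrary):

```r
library(tidyverse)
library(tidytext)
library(topicmodels)

office_dtm <- word_counts %>%
  cast_dtm(document = name, term = word, value = n)      # 2. document-term matrix

office_lda <- LDA(office_dtm, k = 12,
                  control = list(seed = 1234))            # 3. 12 topics, fixed random state

gammas <- tidy(office_lda, matrix = "gamma")              # 4. P(topic | document) per person
```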

Here’s what happens in the code above:

  1. The input data has 3 columns: (1) the name of the person who speaks the line (usually referred to as the document), (2) the word column (after tokenization) and (3) the column indicating how many times the word occurred in the given document
  2. We need to create a document-term-matrix for the LDA to run on
  3. We set the # of clusters / topics to 12 and set a random state to make our work reproducible
  4. Once the LDA algo has finished, we can extract 2 probabilities: (1) betas — probability of a word being part of a topic and (2) gammas — probability of a topic being part of a document: here we extract gammas, as we want the probability of a topic (vocabulary) being part of (spoken by) a document (person)

Here is what we get after visualization

LDA vocabularies of 12 people

Most people have one particular vocabulary they use; however, Dwight and Michael each seem to make up 2 vocabularies based on their choice of words. Oscar and Angela share a cluster, meaning their vocabularies sound similar (they’re both accountants), and it’s interesting to see that while Jim and Pam have their own respective topics, they also share one (topic # 8), which may be their personal (out-of-the-office) lives: their family, daughter, wedding planning, and so on.

This is far from perfect, and LDA does not produce named topics like ‘finance’ or ‘IT’; the topics need to be named by the analyst after some creative, and possibly subjective, thinking.

Finishing up

In this blogpost I have touched upon the following NLP topics:

  1. Tokenization, bigramization and tf-idf to extract words, phrases and unique tokens from textual data
  2. Sentiment analysis using categorical and numerical outcomes, and how they can be used to show the sentiment contributed to a text
  3. Minimal LDA to ‘cluster’ similar-sounding people together, or at least to extract the likelihood of one person sharing a topic with another

I showcased all of the above NLP methods on The Office transcripts, and as someone who’s quite familiar with the show, I can honestly say that some of these methods, easy as they are to use, lead to awesome findings. Regarding tf-idf and n-grams, there’s no doubt they’re capable of doing wonders with any textual data. Even sentiment scoring seemed to hold up: sentence-level aggregation and trend analysis is difficult, but token-level comparison is promising. Regarding LDA, let’s just ‘proceed with caution’.

Overall this was a great way for me to try these methods out, see how they work on ‘live’ data. With questions regarding my code, visit my GitHub page (https://github.com/kristofrabay) and contact me there.

Source: giphy
