Sitcoms: a natural language comparison

Applying visualisation and modelling to TV comedy

David Mulholland
Towards Data Science


It’s the analysis no one asked for. Can we learn anything by comparing the scripts of some of the biggest U.S. sitcoms of modern times, using natural language processing?

I collected transcripts for as many of the big sitcoms as I could find: Friends, Frasier, Will and Grace (seasons 1 to 8), The Office, Seinfeld, Married With Children (partial seasons), and Scrubs (seasons 1 to 5). Thanks to all those fan sites for transcribing the episodes so thoroughly. With some processing, stage and intonation directions were removed, leaving each episode transcript as one row per spoken line, recording the character speaking and the spoken text. I spent far longer on this than could be considered sensible.
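
The exact cleaning steps varied by fan site, but as a rough illustration, the parsing step reduces to something like the sketch below. The bracket conventions and the parse_transcript helper are assumptions for illustration, not the code actually used.

```python
import re
import pandas as pd

def parse_transcript(raw_text, show, episode):
    """Reduce a raw fan transcript to one row per spoken line.

    Illustrative only: assumes stage/intonation directions sit in square
    brackets or parentheses, and spoken lines look like 'CHARACTER: dialogue'.
    Real fan sites vary, so each source needed its own tweaks.
    """
    rows = []
    for line in raw_text.splitlines():
        # Strip bracketed directions such as [enters] or (sarcastically)
        line = re.sub(r"\[[^\]]*\]|\([^)]*\)", "", line).strip()
        match = re.match(r"^([A-Za-z .']+):\s*(.+)$", line)
        if match:
            character, text = match.groups()
            rows.append({"show": show, "episode": episode,
                         "character": character.strip().title(),
                         "text": text.strip()})
    return pd.DataFrame(rows)
```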

Line statistics and most distinctive words

Some show-level stats are shown here. Medium doesn’t do HTML tables, but producing this exported version with R’s formattable did, at least, lead me to this post showing how sparklines can be included therein.

Table showing text statistics for seven sitcoms

Scrubs and Frasier included the largest vocabulary of words per episode; the former was probably helped by its use of medical terminology. Seinfeld involved relatively more nouns and fewer verbs (identified using spaCy’s part-of-speech tagger) than the other shows. Does this fit with it being the ‘show about nothing’? Friends and Will and Grace were more narrative-driven, with more verbs describing the characters doing something, whereas in Seinfeld the characters talked about things in the abstract rather than doing them. The sparklines show the median number of characters appearing in each episode, by show and season. The Office and Scrubs regularly featured a larger roster of characters per episode than the others, and The Office added more characters per episode in its later seasons.
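
For reference, the noun and verb counts can be produced with spaCy’s tagger along these lines. This is a sketch: which pipeline was used, and whether proper nouns were counted, are assumptions here.

```python
import spacy

# Small English pipeline; only the tagger is needed here
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def noun_verb_ratio(lines):
    """Ratio of nouns to verbs across a list of spoken lines."""
    nouns = verbs = 0
    for doc in nlp.pipe(lines):
        for token in doc:
            if token.pos_ == "NOUN":
                nouns += 1
            elif token.pos_ == "VERB":
                verbs += 1
    return nouns / verbs if verbs else float("nan")
```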

Using term frequency-inverse document frequency (TF-IDF) by show, after excluding characters’ names (which occur very commonly, but only in one show each), we can look at some of the more ‘distinctive’ words spoken on each show. These are the words that occur commonly in one show but not in the others:

Plot showing most distinctive words by tf-idf for seven sitcoms

A lot of the highest-scoring terms are minor character names or other names that were not removed, which is not surprising, but the method also picks up some terms that are genuinely characteristic of one show, such as ‘regional’, ‘sherry’, and ‘surgical’. The word ‘will’ appears for several shows because it is not considered a stop word by the ‘snowball’ lexicon (but does not appear for Will and Grace, as it is a main character name).
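
Since the scores are calculated by show, each show’s full script is treated as a single document. A rough scikit-learn equivalent is sketched below; scikit-learn’s built-in ‘english’ stop words stand in for the snowball list mentioned above, and distinctive_words is an illustrative helper rather than the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def distinctive_words(show_texts, exclude, top_n=10):
    """show_texts: dict mapping show name -> all of its dialogue as one string.
    exclude: character names and similar terms to drop before ranking."""
    shows = list(show_texts)
    vec = TfidfVectorizer(stop_words="english")  # stand-in for the snowball list
    tfidf = vec.fit_transform(show_texts[s] for s in shows).toarray()
    terms = vec.get_feature_names_out()
    keep = [i for i, t in enumerate(terms) if t not in exclude]
    top_terms = {}
    for row, show in enumerate(shows):
        scores = tfidf[row, keep]
        best = scores.argsort()[::-1][:top_n]
        top_terms[show] = [terms[keep[i]] for i in best]
    return top_terms
```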

Character speech patterns

To get some idea of each character’s vocabulary, below I show the mean word length in letters, and the ratio of unique words used to total words used (both including stop words). Both are calculated by season, using only the first 5000 words spoken per character, and then (weighted) averaged. This avoids biasing the unique-to-total-words ratio towards shows with fewer episodes in some seasons, or towards characters with fewer lines per season (since the more words are spoken, the harder it is to avoid repeating words). Only characters averaging at least 3000 words per season are included in the plot.
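
As a sketch, the two per-character metrics reduce to something like this; the 5000-word cap matches the description above, but the helper itself is illustrative.

```python
def speech_stats(words, cap=5000):
    """Mean word length and unique-to-total word ratio for one
    character-season, using only the first `cap` words spoken."""
    words = words[:cap]
    if not words:
        return float("nan"), float("nan")
    mean_length = sum(len(w) for w in words) / len(words)
    unique_ratio = len(set(words)) / len(words)
    return mean_length, unique_ratio
```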

Plot showing character word length and vocab size for seven sitcoms

The characters who tend to use the longest words are Dr. Kelso and Dr. Cox (perhaps because of their use of medical terminology), Dwight, and Niles, while Rachel, Joey, and Grace tend to use the shortest words. Word length and vocabulary size are positively correlated, but there are some deviations from this, mostly by show, with Frasier and Seinfeld showing longer words relative to their vocabulary sizes (names sitting above the dashed line), and Will and Grace and Married With Children the reverse (names below the dashed line).

It’s noticeable how clustered the characters from each show are, particularly those of Friends, Seinfeld, and Married With Children. If the characters in the plot above were real people, we might expect their speaking manner to be distributed randomly across shows, perhaps with some clustering due to self-selecting groups of friends with similar levels of education. The clustering seen here is probably a reflection of the biases of the writers, who write for all of the characters in a show at once. It could also be due to the differing styles of each show, such as ‘New York twenty-somethings’ or ‘in a hospital’. To be fair, Frasier and Scrubs show characters spread along the main diagonal, displaying a range of speaking styles or education levels.

Plot showing correlation grid between speech patterns of sitcom characters

I calculated a document-term matrix by counting the number of times each character used each of the most common 300 words that appear in all seven shows. The grid above shows the correlation between each pair of rows in this matrix. The similarity of language used within each show can be seen as the boxes of high correlation along the diagonal. The highest correlations between pairs of characters are for Frasier and Niles, Phoebe and Rachel, and Will and Jack, and the lowest correlations are between Frasier and Darryl, and Niles and Darryl. The most unique voices (lowest mean correlation with all other characters) are Dr. Kelso, Frasier, and Darryl.
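
A sketch of this step is below. Note two assumptions: CountVectorizer(max_features=300) takes the 300 most frequent words over the whole corpus rather than strictly those appearing in all seven shows, and the row normalisation to frequencies is my own guess at how prolific characters were handled.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def character_correlations(char_texts, n_terms=300):
    """char_texts: dict mapping character -> all of their dialogue as one string.

    Builds the character-by-term count matrix and correlates its rows."""
    characters = list(char_texts)
    vec = CountVectorizer(max_features=n_terms)
    counts = vec.fit_transform(char_texts[c] for c in characters).toarray()
    # Normalise counts to frequencies so prolific characters don't dominate
    freqs = counts / counts.sum(axis=1, keepdims=True)
    corr = np.corrcoef(freqs)  # rows are characters, so this is pairwise by character
    return characters, freqs, corr
```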

Plot showing sitcom character speech visualised in 2-d space

When this document-term matrix is plotted on principal component (PC) axes, the characters are grouped together by show, and Friends stands out as being somehow different to the other shows (more isolated in PC space). As can also be seen in the grid above, Marcy and Peggy show a surprising similarity with the characters of Frasier. Jim and Pam are virtually identical :) There is no obvious male/female pattern in the characters’ language here. From what we’ve seen previously, we can infer that the first PC axis (left-right) roughly corresponds to the frequency of longer words.
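
The projection itself is a standard two-component PCA of the same matrix; a minimal sketch, reusing characters and freqs from the previous snippet (whether the matrix was normalised or scaled before projection is an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# `characters` and `freqs` come from the previous snippet
coords = PCA(n_components=2).fit_transform(freqs)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1])
for name, (x, y) in zip(characters, coords):
    ax.annotate(name, (x, y))
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
plt.show()
```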

Character sentiment

The sentiment of the characters’ language in each show was scored using the NRC lexicon, by taking the difference between the fractions of words labelled ‘positive’ and ‘negative’. The nice package ggridges allows us to see that Will and Grace is, on average, the most ‘positive’ show, and that Seinfeld and, maybe surprisingly, Scrubs have the most negative sentiment:

Plot showing episode text sentiment scores for seven sitcoms
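
The net score used above is just the difference of two label fractions. A minimal sketch, assuming the NRC lexicon has been loaded as a dict mapping each word to its set of labels (the loading step is omitted):

```python
def nrc_net_sentiment(words, lexicon):
    """Fraction of 'positive' words minus fraction of 'negative' words.

    `lexicon` is assumed to map word -> set of NRC labels,
    e.g. {"cheerful": {"positive", "joy"}, ...}.
    """
    if not words:
        return 0.0
    positive = sum(1 for w in words if "positive" in lexicon.get(w, ()))
    negative = sum(1 for w in words if "negative" in lexicon.get(w, ()))
    return (positive - negative) / len(words)
```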

Recalculating using the VADER compound polarity score (not shown) gives a fairly similar result, with some changes in the show rankings.
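
For comparison, a VADER version might average the compound score over an episode’s spoken lines; whether the original scored individual lines or whole episodes is an assumption here.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_episode_score(lines):
    """Mean VADER compound polarity over an episode's spoken lines."""
    scores = [analyzer.polarity_scores(line)["compound"] for line in lines]
    return sum(scores) / len(scores)
```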

NRC also gives us scores for several specific emotions: Scrubs, Married With Children and Will and Grace score highest on ‘disgust’; Seinfeld is far below the others on ‘joy’; and Scrubs scores highest, and Friends lowest, on ‘sadness’. This is probably somewhat affected by the use of certain favourite and/or slang terms on each show. Some characters seem to ‘hit’ the NRC lexicon words more consistently than others: for example, Turk appears in the top three characters both by fraction of ‘joy’ words and by fraction of ‘sadness’ words.

Indicators of quality

Finally, can the transcript text information be used to infer the quality of an episode? For the most objective indicator of episode quality (in fact, it seems, the only one available), I pulled the IMDB ratings for each episode using the handy IMDbPY package.
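
A sketch of the ratings pull with IMDbPY is below. Updating each episode individually is slow, and taking the first search hit for each series title is an assumption made for illustration.

```python
from imdb import IMDb

ia = IMDb()

def episode_ratings(series_title):
    """Return {(season, episode_number): IMDb rating} for one series."""
    series = ia.search_movie(series_title)[0]  # assume the top hit is the right show
    ia.update(series, "episodes")
    ratings = {}
    for season, episodes in series["episodes"].items():
        for number, episode in episodes.items():
            ia.update(episode)  # fetch the episode's main page, including its rating
            rating = episode.get("rating")
            if rating is not None:
                ratings[(season, number)] = rating
    return ratings
```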

Are these ratings accurate? They at least capture the general sense of a show’s quality increasing from season one to season two, and tailing off after around season five:

Plot of mean episode ratings from IMDB for seven sitcoms

However, there is a strong — probably too strong — ‘by-show’ component to the ratings. Friends (8.46) and Seinfeld (8.44) are highest by mean rating, whereas Frasier is fifth out of the seven here (8.01), which is pretty hard to justify. It’s worth noting then that what our predictive model finds to be indicative of a ‘good’ episode may be weighted towards what goes into a good Friends or Seinfeld episode rather than a good Frasier episode; although we will also try to control for differences between shows in mean ratings.

I derived several statistics for each episode: number of lines, mean words per line, mean word length, number of characters with at least three lines, and number of unique words; some of which were shown above. Noun-to-verb and noun-to-adjective ratios, fractions of lines spoken by each of the main characters, and eventually term frequencies (100 common words and bigrams), were also included as features. I split the episodes randomly into train and test and fitted simple Generalised Linear Models (GLMs) (since my interest is in interpreting the coefficients) using scikit-learn.
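
A minimal version of the fit, assuming the features have been assembled into a pandas DataFrame X (with the show as one-hot columns and season as numeric) and the ratings into y; the actual model family within the GLM class and the split fraction are not specified in the post, so ordinary least squares and a 25% test set are used here as stand-ins.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X: per-episode features described above; y: the IMDb episode ratings
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
coefficients = dict(zip(X.columns, model.coef_))  # for interpreting the fit
```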

Table showing predictive model results for sitcom episode IMDB ratings

As we could tell from the previous plot, predicting the training set mean by show (model B) rather than overall (A) gives a better baseline (lower RMSE). Compared to this, the text summary terms, together with show and season (as numeric), do show a small amount of predictive skill. Model C includes significant negative coefficients for mean words per line and mean word length, meaning that the larger these terms are, the lower the episode’s rating, on average. It also fits a negative coefficient for season, and a weakly positive coefficient for number of unique words per episode.
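
For concreteness, the two baselines compare like this (a sketch assuming pandas DataFrames with ‘show’ and ‘rating’ columns):

```python
import numpy as np

def baseline_rmses(train, test):
    """RMSE of predicting the overall training mean (baseline A) versus
    the per-show training mean (baseline B)."""
    overall_mean = train["rating"].mean()
    rmse_a = np.sqrt(((test["rating"] - overall_mean) ** 2).mean())

    show_means = train.groupby("show")["rating"].mean()
    predictions = test["show"].map(show_means).fillna(overall_mean)
    rmse_b = np.sqrt(((test["rating"] - predictions) ** 2).mean())
    return rmse_a, rmse_b
```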

Noun-verb ratios are around 0.6–0.8; highest for Seinfeld and lowest for Friends. The model finds that a higher value for this term is better, after controlling for the show, to the extent that it is worth around 0.1 rating points to Seinfeld over Friends, on average. Does this reveal a fundamental component of ‘good writing’? I’m not sure. Noun-adjective ratio was not found to be predictive of episode rating.

Character line fractions provide further lift, when added to the show terms (model D). These coefficients, in theory, tell us which characters in each show tended to appear more or less prominently in what are considered to be the better episodes; in other words, the characters that viewers liked or disliked. Significant positive coefficients were returned here for Michael, Daphne, Niles, Monica, Ross (!), George, and Kramer. Significant negative coefficients were found for Pam, Erin, Jerry, and marginally so for Dr. Cox and Marcy. So apparently the better Seinfeld episodes had more George or Kramer in them and less Jerry, all else being equal.

Including the bag-of-words term frequencies did not improve the model, which is probably not surprising, as this would imply that there are ‘magic’ words that boost the episode rating simply by being present in the transcript. A model using just these terms does beat baseline A, but not baseline B, which is probably because it picks up on words that distinguish one show from another (e.g. ‘office’, ‘gay’, ‘dr’).

From the beginning, I had hoped that a neural network model (an LSTM) would be able to ‘read’ the episode transcripts and learn to predict the episode rating based on sequences of words: essentially, learning what makes for good and bad sitcom writing. When I trained such a model on just the transcripts to predict rating, breaking episodes up into scenes and assigning the overall episode rating to each scene (giving 8000 training rows), it did show some skill (model F), and this could probably be improved by altering the model architecture. But when I tried to do the same to predict the residuals (prediction errors) from model C, the LSTM could not extract any further signal. This probably means that model F picked up on key words that distinguish one show from another, allowing it to perform at the level of baseline B, but beyond this it hadn’t learned anything meaningful from the transcript sentences.
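
The post doesn’t specify the architecture used; a plausible minimal version of a scene-level regression LSTM in Keras would look something like the following, where all layer sizes and hyperparameters are assumptions.

```python
from tensorflow.keras import layers, models

def build_scene_rating_lstm(vocab_size, embedding_dim=64):
    """Scene text (as integer word sequences) in, predicted episode rating out."""
    model = models.Sequential([
        layers.Embedding(vocab_size, embedding_dim),
        layers.LSTM(64),
        layers.Dense(32, activation="relu"),
        layers.Dense(1),  # regression output: the episode rating
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```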

Does any of this help us to write a good sitcom episode? Probably not. But some of the terms fitted in the models could potentially be used as a guide for writers, if they could be calculated while the show was still running; particularly character fractions, although there are probably more direct ways to gauge viewer opinion on characters than extracting the information from tens of episode ratings. The noun-verb ratio finding is interesting: I have no idea if this is a concept that has been studied in linguistics before; it was just one of the simple show-level text metrics that I thought to calculate.
