Providing an alternative look at the most-looked-at show

FRIENDS is one of my favourite shows (probably the favourite) and I’m sure I’m not alone in having rewatched the entire series more than once. I’ve always wondered if there was anything left to know about this oh-so-familiar group. After seeing this post using R to look at the show, I thought I would give it a go myself. This post dives into the show’s scripts to find out more, including who the most popular characters are and how their journeys unfold through the seasons. First it will introduce methods to format and export text files into a SQLite database using Python; the text files used in this project contain scripts from the T.V. show F.R.I.E.N.D.S. and were downloaded from this repository. Then it provides some interesting findings about the characters we know so well, some expected and some surprising! It has been a really enjoyable hobby project and one I have been wanting to do for a while. Feel free to skip the coding bits and jump to the visualisations. I hope you enjoy it!
Iterating Through Scripts
Each script is a text file containing some information about the episode, the title, writers and transcribers before the script actually starts. We need to find a way to turn a script into rows in a database and then work out how to do this for multiple scripts.

We will start by trying to iterate through the scripts. They are stored in multiple text files, helpfully titled using the format season.episode. We can utilise the os library in Python to navigate through our text files.

As it stands, the code below will iterate through all the files in our scripts folder to obtain each filename. The filename is then split using the . separator and those numbers are stored in variables to be appended to the master list. The master_list is created as eventually we will want to store the results in a DataFrame.
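A minimal sketch of that loop, assuming each episode lives in a scripts folder as a season.episode text file (e.g. 1.04.txt):

```python
import os

def index_scripts(script_dir):
    """Collect [season, episode, path] for every script file in script_dir."""
    master_list = []
    for filename in sorted(os.listdir(script_dir)):
        name, _ext = os.path.splitext(filename)  # "1.04.txt" -> "1.04"
        season, episode = name.split(".")        # split on the "." separator
        master_list.append([season, episode, os.path.join(script_dir, filename)])
    return master_list
```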
Regular Expressions
Now we know how to move through our FRIENDS files, we need to see how to isolate the lines from each file. To do so I will be using regular expressions, as the scripts are quite messy and all formatted differently depending on the transcriber. The pertinent pattern is character_name: speech, although this can sometimes span multiple lines. Regular expressions are like a really powerful ctrl-F: they are used to search for patterns in strings, and a nice intro can be found here. The aim of our regular expression is to match the space before our intended line, as indicated by the pink dots. We aim to find this space as we can then split the whole file at these positions, giving us groups of character-speech pairs.

The regular expression used is shown below. First we match the string before a colon with \w+(?=:), so now we have "found" the names of each character. However, if we want to match the space before it we must prepend \s. You can test it out for yourself; as you can see in the example, the regular expression also matches the space before the writers and transcribers, which will need to be removed afterwards. Now we implement the regular expression in Python. In the code below we are also able to split the character name from the speech. This is combined with our loop in the previous section and the master_array is converted to a pandas DataFrame:
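A sketch of that step on a two-line snippet (the pattern \s(?=\w+:) is my reading of the description above, not necessarily the post's exact expression):

```python
import re
import pandas as pd

script = "Monica: There's nothing to tell! Joey: C'mon, you're going out with the guy!"

# Split at the space that precedes each "word:" pattern,
# giving one chunk per character-speech pair.
chunks = re.split(r"\s(?=\w+:)", script)

master_array = []
for chunk in chunks:
    char, speech = chunk.split(":", 1)  # separate the name from the speech
    master_array.append([char.strip(), speech.strip()])

df = pd.DataFrame(master_array, columns=["char", "line"])
```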
Cleaning
Despite our best efforts, the results are still not 100% ready for analysis. Our first issue is that there are multiple names for each character; this can be seen by executing sorted(df['char'].unique()), which will return a list of all unique values in the column. Rectifying this takes some manual work, looking at the multiple spellings of each name (case sensitive!). To change the names we use the pandas replace method:
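For example (the mapping below shows only a few illustrative variants, not the full list):

```python
import pandas as pd

df = pd.DataFrame({"char": ["RACH", "Monica", "CHAN"],
                   "line": ["Hi!", "Hey!", "Hello!"]})

# Map variant spellings back to one canonical name per character.
name_map = {"RACH": "Rachel", "Rach": "Rachel", "CHAN": "Chandler"}
df["char"] = df["char"].replace(name_map)
```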
Now we need to address the issues caused by our regular expression, as it caught the authors and transcribers. These lines all end in by, so the regular expression takes the last word before the colon as the character name. This means we can drop all of these rows by removing the character by. Bye, by.

- Written by
- Transcribed by
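Dropping those rows is a one-liner (using the char column as above):

```python
import pandas as pd

df = pd.DataFrame({"char": ["Rachel", "by", "Ross"],
                   "line": ["Hi!", "David Crane & Marta Kauffman", "Hey!"]})

# Remove the credit lines the regex mistook for a character called "by".
df = df[df["char"] != "by"].reset_index(drop=True)
```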
Clean data is key!
Sentiment
Sentiment analysis is on the table when dealing with strings; a more in-depth discussion can be found in this blog post. Similar methods are used here: for each line in the database a sentiment score is calculated and stored in the line_sent column:
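A sketch of that step, using a toy polarity function as a stand-in for whichever sentiment library you prefer (e.g. TextBlob or VADER):

```python
import pandas as pd

POSITIVE = {"love", "great", "happy", "best"}
NEGATIVE = {"hate", "awful", "sad", "worst"}

def polarity(line):
    """Toy scorer in [-1, 1]; a real sentiment library would replace this."""
    words = line.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return max(-1.0, min(1.0, score / max(len(words), 1)))

df = pd.DataFrame({"char": ["Joey", "Ross"],
                   "line": ["I love this sandwich", "We were on a break"]})
df["line_sent"] = df["line"].apply(polarity)
```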
Export to SQL
Now, this may not be a necessary step, as most of the SQL commands we would be using could be replicated in pandas. However, juggling several different data frames can sometimes get messy, and SQL may provide a more readable way to access this data. Therefore we are now going to move the pandas DataFrame into a SQL database. I am using DB Browser for SQLite.
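With pandas' to_sql this takes a couple of lines (the table and file names here are illustrative):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"char": ["Rachel", "Ross"],
                   "line": ["Hi!", "We were on a break!"]})

# Write the formatted lines into a SQLite file that
# DB Browser for SQLite can open directly.
conn = sqlite3.connect("friends.db")
df.to_sql("lines", conn, if_exists="replace", index=False)
conn.close()
```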
Finally our scripts are formatted and placed in a SQL database. Data wrangling in this way can transform raw data into a more useful data set. Even though we are not adding too much to the data set, the different organisational structure can enable a wider breadth of analysis. Now we have the scripts formatted in this way, we can utilise SQL to gain further insights into the show as carried out in this article.

The Most Popular Friend
This section looks at each character’s role in the show. The previous section walked through the process of putting the data into a SQL database, which makes a query like "who had the highest number of lines during the whole series?" fairly simple:
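A sketch of that query, run here against a tiny in-memory copy of the database (table and column names follow the earlier setup):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"char": ["Rachel", "Rachel", "Ross"],
              "line": ["Hi!", "Bye!", "Pivot!"]}).to_sql("lines", conn, index=False)

line_counts = pd.read_sql_query(
    "SELECT char, COUNT(*) AS num_lines "
    "FROM lines GROUP BY char ORDER BY num_lines DESC",
    conn,
)
```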

Rachel just edges the top spot with 9294 lines over the entire series, with Ross coming in a very close second (9070); both average around 39 lines per episode. This isn’t entirely a shock, as they carried the main plot throughout the 10 seasons. Almost inseparable are Monica and Chandler, on 8403 and 8398 lines respectively.

A look at the breakdown of lines throughout the series confirms this pattern: we can see Ross and Rachel dominating the lines until around Season 4. This is when the London episodes happen and Chandler and Monica get a bigger joint storyline, translating into more lines. I think it is a shame Phoebe never got more lines, staying rooted at around 800 per season. Rachel did say it:
Ugh, it was just a matter of time before someone had to leave the group. I just always assumed Phoebe would be the one to go. — Rachel 5.05
Most Spoken About

Being the one doing the most talking does not necessarily mean you’re the most popular, so now we will take a look at who’s talked about the most. It is pretty difficult to accurately capture all mentions of each character. A possible solution is a list of nicknames for each character (let me know if I have missed any out!). It’s pertinent to note that this is the method we will use to find any reference to each character throughout this post, using the nicknames detailed below.
In order to get the count, we iterate through the characters, keeping a running total of the mentions. Using a nested for-loop over each character’s nicknames, we use the pandas count() method to keep a tally of the number of mentions.
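A sketch of that nested loop (the nickname lists here are abbreviated; the word boundaries stop "Rach" from also matching inside "Rachel"):

```python
import pandas as pd

nicknames = {"Rachel": ["Rachel", "Rach"],
             "Chandler": ["Chandler", "Chan"]}

lines = pd.Series(["Rach, have you seen Chandler?",
                   "Chandler Bing!",
                   "Oh Rachel..."])

mention_counts = {}
for character, names in nicknames.items():
    total = 0
    for name in names:  # nested loop over each character's nicknames
        total += lines.str.count(rf"\b{name}\b").sum()
    mention_counts[character] = int(total)
```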

When using only full names, Ross is the most mentioned. "Chan", "Joe", "Mon" and "Rach" are all mentioned more than their full names. This supports the decision to include the nicknames but does also highlight how sensitive the results are to picking the right names.

Words
Catchphrases
There are a few running catchphrases: for example, "Smelly Cat" was mentioned 37 times throughout the whole show, the infamous "We were on a break" line was referred to 17 times, and Joey’s pick-up line "How you doin’?" was said 37 times.
Largest Vocabulary

Another interesting aspect to look at is the lexicon each character uses. This is done by first selecting all the lines said by the main characters, as shown above. After that, all non-alphabetical characters are removed. Every line by each character is then split into words (using the spaces in between) and added to a set. A set allows no repeated values, which is perfect for our purposes here.
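A sketch of the vocabulary count (lines_by_char is an illustrative stand-in for the per-character selections described above):

```python
import re

lines_by_char = {
    "Ross": ["Pivot! Pivot!", "We were on a break."],
    "Joey": ["How you doin'?", "Joey doesn't share food!"],
}

vocab_size = {}
for char, lines in lines_by_char.items():
    words = set()
    for line in lines:
        # Keep letters and spaces only, then split on whitespace.
        cleaned = re.sub(r"[^a-zA-Z ]", "", line).lower()
        words.update(cleaned.split())
    # The set has already discarded repeats, so its size is the vocabulary.
    vocab_size[char] = len(words)
```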
Unsurprisingly Ross tops the list; his passion for dinosaurs is a running joke throughout the series. Despite his career, starting off at the New York Museum of Prehistoric History and then as a professor at New York University, some real-life paleontologists aren’t convinced. I’m sure I’m not the only one surprised to see Joey in not-last place. Given the role’s stereotypical caricature, it appears Joey does have a couple of words up his sleeve, even if they are made up!

How you Doin’?
As we have calculated a sentiment score for each line, we are able to monitor this score throughout the course of a season.

The chart above tracks the sentiment for Rachel and Ross throughout the first 2 seasons. A total sentiment score per episode is calculated; as the line scores range between -1 and 1, the total gives an indication of the prevailing sentiment throughout a particular episode.
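The per-episode totals come from a groupby-sum over the line scores (column names follow the table built earlier; the numbers here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "char": ["Rachel", "Rachel", "Ross"],
    "episode": [104, 104, 104],
    "line_sent": [0.8, 0.5, -0.2],
})

# Sum the per-line sentiment scores per character per episode.
episode_sent = (
    df.groupby(["char", "episode"])["line_sent"]
      .sum()
      .reset_index()
)
```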
Episode 104 is where Rachel gets her first paycheck, may be the cause of such positive sentiment as is episode 117 with a guest appearance from George Clooney. Ross really experiences the highs and lows throughout the first episodes, finding out he was having a boy in episode 112 before saying bye to marcel in episode 121. Before finally, both characters show a spike on episode 207, The One where Ross Finds Out and a conflicted Ross finds out Rachel has feelings for him. This may be why Ross’ overall sentiment for that episode was "muted but positive".
Networks
So far we have mostly looked at our FRIENDS in isolation; here we will see how they interact. By counting how many times each character mentions another character’s name throughout the show, we can draw networks relating each character to the others. The table below shows the results. Reading from left to right tells us that Rachel mentioned herself 187 times and mentioned Joey the most: 739 times. Reading from top to bottom can be understood as Rachel mentioned Chandler 321 times, Ross mentioned him 332 times and his wife (Monica) mentioned him the most: 622 times.
The table throws up some interesting findings: Rachel was mentioned the most by Ross (622 times, and one of those mentions cost him his marriage) and Ross was mentioned the most by Rachel: 550 times. Interestingly, although Monica says Chandler the most, Chandler says Joey the most.

The table does provide some insight, but it isn’t the most aesthetically pleasing way to look at the findings. So we can create a chord diagram using this function provided on Github. The size of the chords in each character’s section represents how many times they said the connecting character’s name. In other words, each character’s portion shows the values you would read from left to right in the table. This makes it clearer just how much both Joey and Monica occupy Chandler’s mentions: look at the pink slice.

Graph and Centrality
Now we have built a network of FRIENDS, we can calculate a centrality score for each of them. Centrality aims to answer the question: who is the most important or central person in this network? Obviously this is a subjective question, depending on the definition of importance. Before we define our measure of importance, we must first convert our table into a graph. We will use networkx to create a directed, weighted graph from the values in the table above (stored in network_data). The nodes are the characters and the edge weights are the numbers of mentions. We can also check the graph has been created correctly by checking the edge weights between nodes.
out: {'weight': 426} # yay! it matches our table
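A sketch of the graph-building step, seeded with the handful of counts quoted from the table above (the full network_data matrix has a row and column per character):

```python
import networkx as nx

# (speaker, mentioned, count) triples taken from the table above.
mentions = [
    ("Rachel", "Joey", 739),
    ("Rachel", "Chandler", 321),
    ("Ross", "Chandler", 332),
    ("Monica", "Chandler", 622),
    ("Ross", "Rachel", 622),
    ("Rachel", "Ross", 550),
]

G = nx.DiGraph()
for speaker, mentioned, count in mentions:
    G.add_edge(speaker, mentioned, weight=count)

print(G["Rachel"]["Joey"])  # inspect one edge weight against the table
```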
Now we have created our graph, we calculate the eigenvector centrality as a measure of importance (a relative of the algorithm used in Google’s PageRank). This algorithm aims to quantify the influence of people in a social network based on their connections with important people; in this case we are defining "importance" as connections with important people. With its emphasis on links with other people, it is easy to see how this may be applied to other, larger networks such as Twitter. Using "interactions" (retweets and likes) as weights, this algorithm may be able to give you the most connected accounts in a network, potentially gaining more insight than a simple count of followers. Valuable information for anyone looking to gauge (or alter) public opinion.
Networkx makes life easy: apply the eigenvector_centrality_numpy method and define the weights to calculate the scores for each node. The result, in order of importance, is shown below. I was surprised when I first looked at the results, but when I thought about the measure it started to make sense. I think Joey could be seen as the glue of the group, always interacting with the other characters. To see Ross and Rachel at the lower end isn’t entirely surprising given that they occupy most of each other’s time. This post hasn’t been great for Phoebe 🙁. These results are subjective, as is the interpretation, and I would love to hear what you think about the centrality scores.
I hope you enjoyed this alternative view on the popular show. Whilst I understand FRIENDS may not be everyone’s cup of tea I do think this kind of analysis can be applied to almost any long running series. Maybe you could try out something similar for your favourite show and let me know what you find!
Thanks for reading 🙂
Originally published at https://quotennial.github.io/friends-analysis/