NLP ON VIDEO SCRIPTS
Recently I started playing with the scripts of YouTube videos, which are uploaded by creators along with the videos or transcribed automatically by the website through speech recognition. Currently, I’m looking for ways to display the content of a video graphically, so that I can quickly explore it without having to watch it all. In the longer term, my goal is to set up a full "video script explorer" that you can use online to quickly overview what the different sections of a video talk about. Stay tuned, because this promises to be a fun project, and maybe a useful one too!
For the moment I have made some interesting progress that I will share here. These were all manual steps, so there is no polished pipeline yet; I only include below a few quick sketches of what each step could look like in code. Briefly, I show you how to (1) get the script of a YouTube video, (2) clean up the content of the sentences by removing stop words, symbols, etc., (3) reformat the data to obtain sentences of sizes reasonable for analysis, (4) convert the sentences into numbers, and (5) finally apply PCA on these numbers to display the results. It’s quite a simple approach, but the results make sense, at least on the script of a 50-minute-long video that presents 3 different stories on a common topic.
I hope this article teaches you a few basic things, and that it serves me as a stepping stone toward more advanced analyses and my own future video script-browsing tool.
1. Retrieving the script of a YouTube video
Not all YouTube videos have scripts available. If they do, then you will see "Open transcript" when you click the three dots in the bottom right of the video:

You can select all the transcript text and paste it into your favorite program. You’ll see that the pasted text results in a single column with alternating rows of time indexes and text, and that there’s quite some trash in there. You will therefore need to perform some cleanup.
Note: there are several programmatic ways to get the scripts of YouTube videos, but none of the methods I found worked consistently on all videos.
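For example, one package I tried is youtube-transcript-api. A minimal sketch of its use, assuming its classic interface (the exact call may differ between versions of the package):

```python
# A minimal sketch using the youtube-transcript-api package
# (pip install youtube-transcript-api). As noted above, no method of this
# kind worked for me on every video.
from youtube_transcript_api import YouTubeTranscriptApi

video_id = "XXXXXXXXXXX"  # placeholder: the 11-character ID in the video URL
transcript = YouTubeTranscriptApi.get_transcript(video_id)

# Each entry holds a piece of text plus its start time and duration
for entry in transcript[:5]:
    print(f"{entry['start']:8.1f}s  {entry['text']}")
```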
2. Cleaning up the content of the script by removing stop words, symbols, etc.
You can see that the script is split into very small "sentences". In the video I analyzed (which is not the one shown in the figure above) I got 2032 lines, which actually means 1016 lines of raw text once the alternating time indexes are discarded. That’s from a 50-minute-long video from a regular TV program in my country.
Many lines actually don’t have any content at all, just indicating that a passage of the video is "[Music]" or some other kind of tag. I removed these lines, as well as all symbols, numbers, and words of 3 or fewer characters, which are mostly connectors and noise from the automatic script generation process without much content.
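A rough sketch of this kind of cleanup (not my exact manual steps; `raw_lines` is assumed to hold the text rows of the pasted transcript, and a stop-word list could be applied in the same function):

```python
import re

def clean_line(line: str) -> str:
    """Drop bracketed tags like [Music], then symbols, numbers, short words."""
    line = re.sub(r"\[.*?\]", " ", line)      # tags such as [Music] or [Applause]
    line = re.sub(r"[^A-Za-z\s]", " ", line)  # symbols and numbers
    kept = [w for w in line.lower().split() if len(w) > 3]  # keep 4+ characters
    return " ".join(kept)

lines = [clean_line(l) for l in raw_lines]  # raw_lines: pasted text rows
lines = [l for l in lines if l]             # drop lines left empty by cleaning
```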
3. Reformatting a script to get sentences of sizes reasonable for analysis
To this point, I extracted 996 lines of text from the script. You can see that each line is quite short, containing between 1 and 10 words. (I suspect the true limit is given by the number of characters, as the system wants to ensure that the whole text fits on the screen; gaps of silence or music also produce shorter lines.)
As they are, these raw lines of text are too short to be analyzed. I therefore rearranged my 996 lines by merging every 12 consecutive lines into a single "sentence". That means I now have 83 lines, each containing between 30 and 40 words.
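In code, this merge boils down to a couple of lines (a sketch, continuing from the `lines` list of the previous sketch):

```python
CHUNK = 12  # number of consecutive cleaned lines merged into one "sentence"
sentences = [" ".join(lines[i:i + CHUNK]) for i in range(0, len(lines), CHUNK)]
print(len(sentences))  # 996 lines / 12 -> 83 sentences
```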
These 83 lines involve 847 distinct words (remember I already cleaned up all the stop words, short words, symbols, numbers, etc.). Of them, 75% appear only once in the whole bag of words, 15% appear twice, 4% appear three times, and 6% appear between 4 and 11 times.
4. Converting the sentences into numbers
At this point I move from words to numbers. For this, I take the words I compiled and count how many times each of them shows up in each of the 83 sentences. That means I get the "bag of words" of each sentence. In what follows I stick with the procedure that includes only those words appearing 2 or more times in the whole script, which means 25% of the 847 words (i.e. 214 words).
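A sketch of how this word-by-sentence count matrix could be built with scikit-learn (note the filter is on total occurrences over the whole script, and the rows end up sorted by total occurrences, which matters for the dot plot shown below):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Count how many times each word shows up in each of the 83 sentences
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences).toarray()  # sentences x words

# Keep only words appearing 2+ times in the whole script, then transpose
# to words (rows) x sentences (columns), sorted by total occurrences
keep = counts.sum(axis=0) >= 2
matrix = counts[:, keep].T
words = np.array(vectorizer.get_feature_names_out())[keep]
order = matrix.sum(axis=1).argsort()  # increasing total occurrences
matrix, words = matrix[order], words[order]
print(matrix.shape)  # about (214, 83) for the script analyzed here
```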
Of course, most words do not appear in most sentences; however, the way I prepared the words and sentences implies that each word appears in at least one sentence and each sentence contains at least one word. Therefore I get a matrix that looks dominated by zeros but actually has at least one number > 0 in every row and column.
Having filtered words with 2 or more total counts, and having 83 sentences, at this point I got a matrix of 214 rows (words) and 83 columns (sentences). The following is a representation of that matrix where all 0s were removed and any number > 0 is shown as a black dot:

You can see that the density of points increases as you go down. That’s because the rows (words) are ordered by increasing total occurrences.
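That kind of dot representation is essentially what matplotlib’s spy plot gives (a sketch, reusing the sorted matrix from the code above):

```python
import matplotlib.pyplot as plt

# Black dot wherever a count is > 0; rows were sorted by total occurrences
plt.spy(matrix, markersize=2, aspect="auto")
plt.xlabel("sentence"); plt.ylabel("word (increasing total occurrences)")
plt.show()
```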
Now that we have a numerical representation of the data (yes, I know it’s very simple and for sure has many problems, but it’s a start) we can begin the fun part of crunching it.
5. Applying PCA on the data and interpreting the results
First PCA attempt
Applying a simple PCA procedure to the matrix above already gives some meaningful results. To aid interpretation, I took advantage of the fact that the video presents 3 separate stories on a common topic: all 3 are about painting, but each one focuses on a different painter who is interviewed separately. In the next picture you can see the input matrix prepared above colored by story number, and then the PCA plot where each dot (a sentence) is colored according to the story it was extracted from.

You can see how stories 1 and 3 are rather separated, especially along PC2. Story 2, instead, remains roughly in the middle.
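A minimal sketch of this step (assuming a hypothetical hand-made list `story_of_sentence` that assigns each of the 83 sentences to story 1, 2 or 3):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# One observation per sentence (hence the transpose), one variable per word
pca = PCA(n_components=3)
scores = pca.fit_transform(matrix.T)  # 83 sentences x 3 components

# story_of_sentence: hypothetical hand-made labels (1, 2 or 3), one per sentence
palette = {1: "red", 2: "green", 3: "blue"}
plt.scatter(scores[:, 0], scores[:, 1], c=[palette[s] for s in story_of_sentence])
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()
```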
Playing with different PCA runs
What’s the effect of choosing words of higher or lower frequency? In my tests, a PCA considering only the words that appear exactly twice in the whole text does not produce any clear spreading of the data. Meanwhile, running PCA only on words that appear a total of 5 or more times (26 words) produces a better separation of the red dots from the green + blue dots (story 1 against stories 2 + 3):

Most interestingly, the loadings plot explains which words weigh more in the separation of points in the PC plot:

In such a plot, where we have one number per input variable (here, words) per principal component (here the first three are shown), both positive and negative values matter. The positive and negative peaks at the last variable correspond to the word "hotel", which appears a total of 11 times, all of them in story 1. By watching the video one understands that the whole of story 1 revolves around pieces of art that are currently exhibited in a hotel, and that the interview itself takes place in the hotel and even covers how the hotel was recycled from ruins.
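Both the stricter frequency filter and the loadings inspection are a few extra lines on top of the sketches above:

```python
# Stricter filter: words appearing 5+ times in the whole script (26 words here)
keep5 = counts.sum(axis=0) >= 5
matrix5 = counts[:, keep5].T
words5 = np.array(vectorizer.get_feature_names_out())[keep5]

pca5 = PCA(n_components=3)
scores5 = pca5.fit_transform(matrix5.T)

# One loading per word per component; the largest |loadings| flag the words
# that weigh most in the separation ("hotel", "yellow", etc. in my run)
for i, comp in enumerate(pca5.components_, start=1):
    top = np.abs(comp).argsort()[::-1][:5]
    print(f"PC{i}:", ", ".join(f"{words5[j]} ({comp[j]:+.2f})" for j in top))
```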
The negative peak at position 6 corresponds to the word "yellow". This word is mentioned a total of 5 times, all of them in a single sentence of story 1 talking about the colors of autumn, when the story was filmed. Such strong sensitivity to a single sentence is probably something that should be softened somehow; in particular, the story is not especially centered on the fall and its colors.
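Excluding a word and re-running the PCA takes only a couple of lines (a sketch on the 26-word matrix from above):

```python
# Drop a word (e.g. "yellow", or later also "hotel") and re-run the PCA
mask = ~np.isin(words5, ["yellow"])
scores_wo = PCA(n_components=3).fit_transform(matrix5[mask].T)
```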
Removing the word "yellow" improves the spread a bit, though it is still dominated by "hotel":

Last, removing "hotel" to shift the focus of the PCA procedure to other words leads to less spread of the sentences and stresses the words "look", "painting", and "art":

Conclusion
The procedure might not be the best, but it is very easy, and it does have some power to spread out the contents of a script, being especially sensitive to words that are very frequent in only one of the stories. If made into an interactive web app where the user can dynamically see the results update as words are removed or included, and possibly see the full sentences when hovering over the data points, this could, I think, be quite a powerful tool.
What do you think? What would you expect from a tool meant to facilitate inspecting the contents of a video script (or any text, for that matter)?
All PCA runs were carried out with the tool I describe in another of my articles.