
Evaluating Cinematic Dialogue - Which syntactic and semantic features are predictive of genre?

This article explores the relationship between a movie's dialogue and its genre, leveraging domain-driven data analysis and informed feature engineering.

DALLE Generated Image by Author


From fragmented speech in thrillers to expletive-laden exchanges in action movies, can we guess a movie’s genre simply from the semantic and syntactic characteristics of its dialogue? If so, which ones?

We will investigate whether the nuanced dialogue patterns within a screenplay – its lexicon, structure, and pacing – can be powerful predictors of genre. The focus here is twofold: to leverage syntactic and semantic script characteristics as predictive features and to underscore the significance of informed feature engineering.

One of the primary gaps in many data science courses is the lack of emphasis on domain expertise and feature generation, engineering, and selection. Many courses also provide students with pre-existing datasets, and sometimes, these datasets are already cleaned. Moreover, in the workplace, the rush to produce results often overshadows the process of hypothesizing and validating predictive features, leaving little room for domain-specific exploration and understanding.

In my own experience outlined in "Using Multi-Task and Ensemble Learning to Predict Alzheimer’s Cognitive Functioning," I witnessed the positive impact of informed feature engineering. Researching known predictors of Alzheimer’s allowed me to question the initial task and data, ultimately leading to the inclusion of key features during modeling.

DALLE Generated Image by Author

In this article, I delve into a project that examines movie dialogue to illustrate my approach to research and feature extraction. The focus will be on identifying and analyzing textual, semantic, and syntactic elements within film dialogue, investigating how they interrelate, and evaluating their capacity to accurately predict a movie’s genre.

Initial Questions

I like to start every project by conducting a literature review. I begin by jotting down relevant concepts and questions to guide my review. This initial phase is crucial and, depending on the time I have, I intentionally steer clear of research directly related to the modeling problem at hand. The goal is to understand the broader context and seek out supplemental information first. This strategy helps in cultivating an unbiased understanding of the subject matter, ensuring that my approach to the problem is informed, yet not prematurely narrowed by the solutions and methodologies already explored by others.

A few questions I’d jotted down:

  • Is there a relationship between dialogue and emotions?
  • How do conversations differ in real life vs. in screenplays?
  • What can I understand about movie dialogue and how it relates to genre?

What I found

There is a body of literature that explores the interplay between natural dialogue and our emotions. Screenwriters capture an emotion or mood by capitalizing on textual and syntactical relationships. These vary across genres since different moods are associated with different genres.

What are these syntactical & textual characteristics?

We will extract and evaluate the 4 characteristics listed below. In each section, I’ll explain the rationale:

  1. Length attributes
  2. Types of sentences
  3. Part of speech and profanity
  4. Sentiment analysis

Data

The dataset used here is the Cornell Movie-Dialogs Corpus (MIT License) from Kaggle, originally distributed through the ConvoKit toolkit (Chang et al., 2020). It comprises over 300k spoken lines across ~220k conversational exchanges from 617 different movies.

Load the Data

We’ll begin by loading data using the movie_lines.txt file.

import os

# Define the directory path where the 'movie_lines.txt' file is located
corpus_directory = 'cornell movie-dialogs corpus'

# Construct the full file path
file_path = os.path.join(corpus_directory, 'movie_lines.txt')

# Open the file in read mode with 'mac_roman' encoding
with open(file_path, 'r', encoding='mac_roman') as file:

    # Read the contents of the file and split them into individual lines
    lines = file.read().splitlines()
lines[:2]
['L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!',
 'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!']

The columns are separated by ' +++$+++ ', so this string will be used as the separator to split each line, extract the columns, and read the data into a data frame.

import pandas as pd

# Split each line in 'lines' using ' +++$+++ ' as the separator
preprocessed_list = list(
    map(lambda x: (str(x).split(' +++$+++ ')), lines)
)

# Define column names for the DataFrame
column_names = ['line', 'speaker_id', 'movie_id', 'name', 'text']

# Create a DataFrame using 'preprocessed_list' 
df = pd.DataFrame(preprocessed_list, columns=column_names)

# Display the first 2 rows of the DataFrame
df.head(2)
Sample of Data. Image by Author.

Preprocess Text

I used spaCy – an open-source natural language processing library written in Python and Cython – to process the text. This included cleaning contractions, removing punctuation, and lemmatizing words.

# clean_contractions, remove_punctuation, and preprocess are user-defined
# helpers (a sketch of each follows below)

# Transform all contractions to their longer form
df['text'] = df.text.map(clean_contractions)

# Remove all punctuation; returns each line as a list of tokens
df['text_no_punct'] = df.text.map(remove_punctuation)

# Remove words <2 chars and stopwords, lemmatize, & transform to lowercase
df['clean_text'] = df.text_no_punct.map(
    lambda x: preprocess(' '.join(x))
)
DataFrame After Preprocessing. Image by Author.
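The three helper functions aren’t defined in the snippets above; here is a minimal sketch of what they might look like, inferred from the comments (the contraction map and thresholds are illustrative, and spaCy’s small English model is assumed).

import spacy

# Load spaCy's small English model (an assumption; any English pipeline works)
nlp = spacy.load('en_core_web_sm')

# A small, non-exhaustive contraction map for illustration
CONTRACTIONS = {"n't": " not", "'re": " are", "'ll": " will",
                "'ve": " have", "'m": " am"}

def clean_contractions(text):
    # Expand common contractions to their longer forms
    for short, long in CONTRACTIONS.items():
        text = text.replace(short, long)
    return text

def remove_punctuation(text):
    # Strip punctuation and return the line as a list of tokens
    return [tok.text for tok in nlp(text) if not tok.is_punct]

def preprocess(text):
    # Lowercase, drop stopwords and words under 3 characters, and lemmatize
    doc = nlp(text.lower())
    return ' '.join(tok.lemma_ for tok in doc
                    if not tok.is_stop and len(tok.text) > 2)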

1. Length Attributes

In suspense movies, dialogue is often sparse, showcasing the link between syntax and emotions. When characters are in states of terror, their speech tends to be concise, while nervousness often leads to longer utterances (i.e. rambling), a trait more commonly seen in comedies. Therefore, we will examine the length attributes of each line in the corpus.

DALLE Generated Image by Author

In this section, we’ll take a look at:

  1. Average # of words in a line
  2. Average # of sentences across the whole corpus
  3. Distribution of the average # of words per line
  4. Distribution of the average # of sentences per line across the corpus
import nltk
nltk.download('punkt', quiet=True)  # models for sent_tokenize

# Calculate the number of words in each line
# (text_no_punct holds token lists, so len() gives the word count)
df['num_words'] = df['text_no_punct'].map(len)

# Extract the number of sentences in each line
df['num_sentences'] = df['text'].map(
    lambda x: len(nltk.sent_tokenize(x))
)

# Remove entries with empty or non-textual content
df = df[df['num_words'] != 0]
DataFrame After Adding Length Features. Image by Author.

Statistics Per Movie: Boxplot and Statistics DataFrame

Length Feature Boxplots. Image by Author.
Length Features Statistics DataFrame. Image by Author.

In the boxplot and the statistics data frame above, we see that:

  • The number of words per line ranges from 0 to 30, with a median of 7. The interquartile range (covering the middle half of the data) shows that the middle 50% of lines contain between 4 and 14 words.
  • Lines usually contain 1 or 2 sentences, occasionally reaching the upper bound of 3 sentences.
Proportion of Lines Greater Than a Sentence. Image by Author.

Fewer than half of the script lines contain more than one sentence. This tells us that each script line is short, and our analysis should be framed accordingly.
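For reference, this proportion can be read straight off the num_sentences column:

# Share of lines containing more than one sentence
prop_multi_sentence = (df['num_sentences'] > 1).mean()
print(f'{prop_multi_sentence:.1%}')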

Distribution on a Per Movie Basis

The metrics mentioned above were calculated on a ‘per line’ basis within the movie script data. In the next section, we shift our focus to explore the average length of lines per movie, allowing us to examine variations in word length at the movie level.

Length Features Statistics Across the Corpus. Image by Author.
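A sketch of the per-movie aggregation behind these statistics, using the column names from the DataFrame built earlier:

# Average words and sentences per line, aggregated per movie
per_movie = df.groupby('movie_id').agg(
    avg_words=('num_words', 'mean'),
    avg_sentences=('num_sentences', 'mean'),
)

# Summary statistics across all movies
per_movie.describe()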

Dialogue Density Variation

The "Length Features Statistics DataFrame" figure shows that individual lines in scripts range from 0 to 582 words, with a median of 7 words, which suggests a high degree of variability in dialogue density on a line-by-line basis. In contrast, the aggregated movie data shows a much narrower range, with a maximum average of 38.69 words per line, indicating that while individual lines can be extremely verbose or concise, movies tend to balance out to a moderate density of words.

Narrative Rhythm

With over 39% of script lines containing more than one sentence, the per-line analysis indicates a tendency towards compound or complex sentences. However, the tighter standard deviation in the movie averages (0.29 for sentences) suggests a consistency in narrative rhythm across different films, aiming for a steady pace in dialogue delivery.

DALLE Generated Image by Author

Scriptwriting Consistency

The contrast between the median length of individual lines (7 words) and the average across movies (11.36 words) implies that screenwriters might often intersperse shorter lines of dialogue with longer monologues or exchanges. This technique could be a deliberate choice to create dynamic interactions between characters, keep the audience engaged, and ensure that each movie has its unique tempo and style.

Visualizing the Outliers That Pull the Average to the Right

The histograms show a right-skewed distribution, with a central tendency for movies to feature lines averaging 7–13 words. This skewness is indicative of a minority of films with unusually long lines, which heavily influence the overall average.

Histograms of Length Features. Image by Author.
Histograms of Length Features with Outliers Removed. Image by Author.

After outliers are excluded, the bimodal distribution for words per line becomes more evident, suggesting that there are two common line lengths in scripts. This observation is interesting as it could reflect different styles or genres within the corpus. The distribution of sentences per line appears to be approximately normal, with a negligible right skew, indicating a consistent sentence structure across screenplays.

2. Types of Sentences

Exclamation Points

There are various ways to represent a heightened emotional state in a script. One is to use an exclamation point (!) for emphasis; another is to use CAPITALIZATION FOR EMPHASIS. We’ll look at the presence of both and see whether there’s a correlation with the overarching sentiment.

Hyphens

A hyphen placed at the end of a character’s dialogue (-) may signify an interruption in their speech or an abrupt pause in the character’s thinking (e.g., the character has an epiphany). It can also convey fragmented speech.

Questions

I had no prior knowledge or intuition about the relationship between the presence of questions in a script and other features. However, the proportion of questions is easily measurable, and it could be intriguing to explore whether any patterns can be detected.

DataFrame with Added Sentence Type Features. Image by Author.
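The feature-extraction code isn’t shown in the article; a rough sketch of how these sentence-type features might be derived (column names are illustrative):

# Sentence-type flags derived from the raw text
df['has_exclamation'] = df['text'].str.contains('!', regex=False)
df['has_question'] = df['text'].str.contains('?', regex=False)
df['ends_with_hyphen'] = df['text'].str.rstrip().str.endswith('-')

# Count of fully capitalized words (length > 1 to skip 'I' and 'A')
df['num_caps_words'] = df['text'].map(
    lambda x: sum(1 for w in x.split() if w.isupper() and len(w) > 1)
)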

Below, we see that the proportion of lines with questions, at 31.4%, suggests a strong preference for interactive dialogue within movies. This is substantially higher than the proportion of lines with exclamations, at 8.9%, which could indicate that intense emotional expressions, while present, are less frequent than interrogative exchanges.

Visualization Representing the Sentence Type Proportions. Image by Author.

The boxplot for the count of all-caps words reveals that the use of capitalized words is not common, suggesting that screenwriters may prefer subtler methods of conveying emphasis in dialogue rather than relying on text formatting.

Boxplot for the Count of All Caps Words. Image by Author.
Histogram for the Punctuation Usage Distributions. Image by Author.

While questions are more common, the range of usage varies widely among movies, potentially reflecting different genres or directorial styles. For example, a thriller may have more questions built into the dialogue to maintain suspense, whereas a comedy may use exclamations to highlight punchlines.

Histogram for Proportion of Lines that Contain Hyphen at the End and the Average Number of Uppercased Words. Image by Author.

The histogram for lines that end with a hyphen shows a significant skew towards a lower proportion, indicating that lines ending with a hyphen are relatively uncommon in movie scripts. This could suggest that interrupted dialogue or sentences leading into actions (which are often denoted by hyphens) are used sparingly, perhaps to maintain the flow of dialogue or to avoid overusing a device that might otherwise lose its impact.

3. Part of Speech and Profanity

Part of speech helps us understand the grammatical function of a word in a sentence. For instance, genres like historical or biographical films are often flooded with proper nouns, making the tracking of these and other common tags potentially revealing.

According to "Judging Screenplays by Their Coverage" by Stephen Follows and Josh Cockcroft, "swear words (are) not spread equally across all scripts […] Comedies are the sweariest, beating Action and Horror scripts by a tiny margin (and) the genres featuring the lowest levels of swearing are Family, Animated and Faith-based scripts" (42).

Part of Speech Tagging

We’ll start by taking a look at the most frequent tags from the text by flattening the text, taking a sample, and using spaCy for POS tagging.

Barplot of the Most Frequent POS Tags from a Sample of Text. Image by Author.
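A sketch of the sampling and tagging step (the sample size here is arbitrary, not the one used in the project):

from collections import Counter
import spacy

nlp = spacy.load('en_core_web_sm')

# Sample lines to keep tagging time manageable
sample = df['clean_text'].dropna().sample(10_000, random_state=42)

# Count Penn Treebank tags across the sample; parser/NER disabled for speed
tag_counts = Counter(
    tok.tag_
    for doc in nlp.pipe(sample, disable=['parser', 'ner'])
    for tok in doc
)
print(tag_counts.most_common(5))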

Overall, nouns are by far the most common part of speech, with adjectives and verbs appearing at relatively similar counts. Adverbs are the rarest part of speech in our movies.

  • NN: noun, singular or mass
  • JJ: adjective
  • VB: verb, base form
  • VBP: verb, non-3rd person singular present
  • RB: adverb
Distribution of POS Tags. Image by Author.

I chose to display all four histograms on this plot because it highlights a clear differentiation in the usage of various parts of speech within movie dialogue. Nouns dominate the linguistic landscape, occupying 40% to 60% of the dialogue, whereas adverbs range between 0 and 10%. This prevalence underlines the concrete and tangible nature of film narratives, which often rely on specific nouns to anchor the conversation and set scenes. Adverbs, conversely, appear infrequently, suggesting that movie dialogue may favor direct and concise language over descriptive or qualifying phrases.

Profanity

We’ll detect profanity using the ‘badwords.txt’ list from the profanityfilter package.
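A minimal sketch of the detection step, assuming the word list has been saved locally:

# Load the profanity word list (local path assumed)
with open('badwords.txt', 'r') as f:
    bad_words = {w.strip().lower() for w in f if w.strip()}

# Flag lines that contain at least one profane word
df['has_profanity'] = df['clean_text'].map(
    lambda x: any(w in bad_words for w in str(x).split())
)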

Histogram of the Proportion of Lines that Contain Profanity. Image by Author.

While most movie lines are devoid of profanity, there is a significant presence of it in certain scripts, with a few reaching a proportion as high as 0.37. This might reflect genre, setting, or character development choices, where profanity is used to add realism and intensity, or to delineate characters’ personalities.

4. Sentiment

We’ll utilize two sentiment analysis models: NLTK’s VADER, which is quick but uses a basic rule-based approach, and Flair, which is more accurate but computationally intensive.

VADER assigns sentiment scores based on individual words, and its scores can be dominated by neutral words even in the presence of strongly negative ones, making it less precise. It also struggles to identify sarcasm or contextual nuance.

Visualizing Frequency of Positive, Negative, Non-Neutral and All Words

Flair is an embedding-based model, which enables it to capture context: words with similar vector representations tend to appear in similar contexts. The downside of this approach is that it’s significantly slower than the rule-based one; the NLTK model took ~4 minutes to run, while Flair took ~3 hours.
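The scoring code isn’t shown in the article; a sketch of how both models might be applied (the column names are illustrative):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from flair.data import Sentence
from flair.models import TextClassifier

nltk.download('vader_lexicon', quiet=True)

# VADER: rule-based compound score in [-1, 1]
sia = SentimentIntensityAnalyzer()
df['vader_sentiment'] = df['text'].map(
    lambda x: sia.polarity_scores(x)['compound']
)

# Flair: embedding-based classifier (far slower; batch where possible)
classifier = TextClassifier.load('en-sentiment')

def flair_score(text):
    sentence = Sentence(text)
    classifier.predict(sentence)
    label = sentence.labels[0]
    # Signed score: positive for POSITIVE labels, negative for NEGATIVE
    return label.score if label.value == 'POSITIVE' else -label.score

df['flair_sentiment'] = df['text'].map(flair_score)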

WordCloud for Various Sentiment Groups. Image by Author.
DataFrame with All Features Added. Image by Author.

Relationships Among Variables

Correlation

"The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable. The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate." (source)

In our analysis, we will use the Spearman correlation coefficient to identify monotonic relationships among all of the features.

Correlation Heatmap Between Features. Image by Author.

Only Display Significant Correlations

Below, only the significant correlations are displayed, i.e., those where the p-value for the Spearman correlation is less than 0.05.

Correlation Heatmap Only Showing Significant Correlations. Image by Author.
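One way to build such a mask (a sketch; features_df stands in for the numeric feature columns):

from scipy.stats import spearmanr
import pandas as pd

# Pairwise Spearman rho and p-values over all feature columns
rho, pval = spearmanr(features_df)

cols = features_df.columns
corr = pd.DataFrame(rho, index=cols, columns=cols)

# Keep only cells whose p-value is below 0.05 before plotting the heatmap
sig_corr = corr.where(pd.DataFrame(pval, index=cols, columns=cols) < 0.05)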

I expected to find some significant correlations, such as those between the average number of words and the average number of sentences or the average number of uppercase words. I’d also anticipated the following correlations:

  • Sentiment features correlate with the proportion of profanity.
  • Correlations among different part-of-speech tag proportions.

There were a few interesting observations:

  • No correlation between the use of exclamation marks and profanity.
  • A significant, albeit weak, correlation between the use of questions and profanity.
  • A weak negative correlation between the Flair sentiment and the use of questions.

A significant positive correlation between the average number of words and the use of proper nouns (prop_noun) may also indicate that more complex dialogues include more specific references to entities or names, which could be characteristic of certain genres like science fiction or fantasy with complex world-building.

Profanity Against Variables

As noted above, I was quite surprised to see a correlation between questions and profanity yet no relationship between exclamation marks and profanity. Therefore, I plotted a slopegraph to see whether any relationship could be uncovered there.
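A quick way to set up that comparison (has_profanity follows the profanity sketch above):

# Mean profanity rate for lines without vs. with an exclamation mark
profanity_by_excl = df.groupby(
    df['text'].str.contains('!', regex=False)
)['has_profanity'].mean()
print(profanity_by_excl)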

Slopegraph for the Percentage Change of Profanity Proportion Between Lines With Exclamation & Those Without. Image by Author.

Profanity Against Variables Summary

Interestingly enough, despite there being no significant correlation between the proportion of exclamations and the proportion of profanity, the largest jump in the proportion of profanity occurs between dialogue without exclamation marks and dialogue with them.

Final Data

Final DataFrame. Image by Author.

Modeling

I am going to fast-forward through this last part and provide a brief overview of the modeling process and performance. However, please feel free to let me know if you’d like a more in-depth exploration of the modeling work done here and I’ll release a part two 🙂

Here we are building a classifier to predict whether a movie belongs to the drama genre.
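The target-construction code isn’t shown; one plausible way to build the binary label, assuming the per-movie features have been joined with the genre metadata from movie_titles_metadata.txt (movie_df and its columns are hypothetical names):

# 1 if 'drama' appears in the movie's genre list, else 0
movie_df['is_drama'] = movie_df['genres'].map(
    lambda g: int('drama' in g)
)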

LazyPredict

To expedite the modeling phase, we utilized LazyPredict, an AutoML-style Python package that fits a broad set of common machine learning algorithms to a dataset and reports standard metrics for the task.
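A minimal sketch of the LazyPredict run, assuming train/test splits have already been prepared:

from lazypredict.Supervised import LazyClassifier

# Fit a battery of off-the-shelf classifiers and rank them by metric
clf = LazyClassifier(verbose=0, ignore_warnings=True)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models.head())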

Hyperparameter Tuning: Bayesian Optimization

We then performed hyperparameter tuning on the top four models:

  1. ExtraTreesClassifier
  2. AdaBoostClassifier
  3. Perceptron
  4. XGBClassifier

Classically, hyperparameter sweeps are run via grid search (brute force), where all possible combinations of hyperparameters are empirically evaluated. Given that the number of trials grows exponentially with every new hyperparameter, this is usually infeasible. Another approach, random search, samples hyperparameter combinations at random, typically reaching a good local optimum more efficiently than grid search when not all combinations can be exhausted.

Instead of either of these options, I will utilize Bayesian optimization. This method fits a Gaussian process as a surrogate model of the black-box objective over the search space, so each new trial is informed by the results of the previous ones. The overarching advantage is that we converge toward a local optimum (as any ML model does) rather than blindly trying out hyperparameter combinations.
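A sketch of how such a search might be run with scikit-optimize’s BayesSearchCV (the parameter ranges here are illustrative, not the ones used in the project):

from skopt import BayesSearchCV
from sklearn.ensemble import ExtraTreesClassifier

# Gaussian-process-driven search over the hyperparameter space
search = BayesSearchCV(
    ExtraTreesClassifier(random_state=42),
    search_spaces={
        'n_estimators': (100, 1000),
        'max_depth': (3, 30),
        'min_samples_split': (2, 20),
    },
    n_iter=32,
    cv=5,
    scoring='f1',
)
search.fit(X_train, y_train)
print(search.best_params_)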

Manual Hyperparameter Tuning: Extra Trees Classifier

Train F1 Score: 0.985
Train Accuracy Score: 0.984

Test F1 Score: 0.696
Test Accuracy Score: 0.675

The F1 score, the harmonic mean of precision and recall, serves as a key indicator of our model’s performance. Precision reflects the model’s reliability in correctly identifying a movie as belonging to the ‘drama’ genre, while recall measures the model’s ability to capture all relevant instances of drama movies. The sizable gap between the train and test scores also indicates overfitting.

Considering the constraints, such as the absence of a fully developed pipeline for filtering low variance columns, addressing potential multi-collinearity, and a more extensive feature engineering process, the model demonstrated reasonable effectiveness. The following section will highlight the features that were most important for the model.

Feature Importance Scores. Image by Author.

Future Work

This article was mainly focused on the process of feature generation and analyzing the data within the context of screenplays. However, if I wanted to work more on modeling, I’d focus on feature engineering, examine the effects of multi-collinearity, and spend more time on model selection.

More specifically, I would:

  • Use an ensemble approach where the first model is fit on the most significant words in the corpus using a TF-IDF vectorizer and the second is fit on the features engineered here. Then, I’d combine the two models to see whether their integration boosts performance.
  • Perform topic modeling using TF-IDF and Latent Dirichlet Allocation (LDA). The extracted topics might boost performance, particularly for a multi-label classifier.
  • Add other features that could hold predictive power for genre, such as the movie rating, the release year, and the average amount of dialogue per character.
  • Analyze the average sentiment across time, and categorize it into a "plot arc" category. According to Follows, there are six common story plot arcs: Riches to Rags (a continuing emotional fall), Rags to Riches (a continuing emotional rise), Oedipus (fall-rise-fall), Cinderella (rise-fall-rise), Man in a Hole (fall-rise), and Icarus (rise-fall).
  • Create a function that takes the average sentiment over time, categorizes each movie into one of these plot arcs, and tests whether the arc is also predictive of genre.
Average Sentiment Visual Example. Image by Author.

Concluding Remarks

I hope you enjoyed this analysis and that this article showcased the potential of tailoring analyses to the unique characteristics of a field. While I focused on cinematic dialogue, the principles of domain-driven data analysis and modeling are universal. I encourage you to research your chosen domain, remain curious, and get creative with feature engineering during your next modeling task. I would also love to hear about your own experiences with interesting domain-driven analyses so feel free to write a comment here or email me at [email protected]. Thanks!
