Moviegoer — 5 Reasons Why Cinema is the Perfect Dataset of Emotional and Anthropological Knowledge

Movies demonstrate emotional communication atop valuable societal context. They are the perfect dataset for emotional AI models.

Tim Lee
Towards Data Science
5 min read · Nov 30, 2020


What can movies teach AI models about how people communicate and express emotion? (image by author)

Smart devices, digital assistants, and service chatbots are becoming ubiquitous, but they don’t yet have the emotional capacity they need to fully understand how we communicate. Emotional AI models should be able to detect specific emotions and act with empathy, understand societal norms, and recognize specific communicational nuances, like irony and humor. Datasets of emotional data exist, but are inadequate in terms of size and realistic context. Fortunately, there’s a dataset which can satisfy all these requirements: movies.

In a previous post, I demonstrated the prototype of Moviegoer, a tool which breaks films into structured data. This essentially allows a machine to “watch” a movie: identifying individual scenes, parsing conversational dialogue, and recognizing characters. This is the third and final post in the unveiling of the prototype.

There are many existing facial emotion datasets, most of which consist of video clips divided into six or seven emotions. The data comes from two sources: filming original content, or hand-labelling clips from movies and TV. The first type films actors or test subjects delivering lines with various emotions; examples include RAVDESS and MUG. The second type hand-labels (very short) existing clips from movies and television shows. AFEW has 971 labeled clips from 37 movies; CAER has 13,000 labeled clips from '90s shows such as Friends. These datasets have shortcomings, especially with regard to size and contextual information.

Cinema is the holy grail of emotional and anthropological data, for these five reasons:

Film Data is Self-Labelling

Movies contain multiple “dimensions” (e.g. word choice, facial expression, etc.) of emotional data, which can be used in tandem as “self-labelling” data. For example, what if we want to know the emotional impact of the phrase “I never want to see you again”? We can look across every film to see a character’s reaction when the phrase is said to them — the reactions to this phrase are already labeled.

Most of the time, the character’s face will fall into a frown. Sometimes, the character will laugh. Never will the character’s head fall off. A film is a reproduction of how we interact with each other; the data is valid because movies are dramatizations of society. A model trained on movies will learn only plausible reactions to this phrase. And these are only two dimensions: there are many more, such as voice tone, word choice, and the expression of the character delivering the phrase.
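As a minimal sketch of what “self-labelling” could look like in practice: once a film has been parsed into dialogue turns (by a tool like Moviegoer), we can pair each line with the listener’s observed expression. The field names and data layout below are illustrative assumptions, not Moviegoer’s actual API.

```python
# Hypothetical sketch: harvesting self-labelled reaction data from parsed films.
# Each "turn" pairs a dialogue line with the facial-expression label detected
# on the character hearing it. Field names are illustrative assumptions.

def collect_reactions(turns, trigger_phrase):
    """Return the listener-expression labels observed whenever a dialogue
    line contains the trigger phrase (case-insensitive)."""
    reactions = []
    for turn in turns:
        if trigger_phrase.lower() in turn["line"].lower():
            reactions.append(turn["listener_expression"])
    return reactions

turns = [
    {"line": "I never want to see you again.", "listener_expression": "sad"},
    {"line": "Nice weather today.", "listener_expression": "neutral"},
    {"line": "I never want to see you again!", "listener_expression": "angry"},
]

print(collect_reactions(turns, "I never want to see you again"))
# ['sad', 'angry']
```

Run across a large film corpus, the distribution of these reaction labels becomes the “self-labelled” emotional impact of the phrase, with no human annotator in the loop.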

This character has a valid reason to be angry. (image by author)

A Movie is 90 Minutes of Cause-and-Effect

Existing datasets lack a clear emotional antecedent, the stimulus that causes a change in behavior. RAVDESS has actors delivering the line “The kids are sitting by the door,” but this has no actual context: there are no kids. The direct, literal antecedent of original-content datasets is “I am demonstrating how this sentence is delivered in six different emotional contexts.”

In cinema, there’s cause-and-effect for everything. At a plot level, we can track character motivations and what’s important to them, which allows us to interpret antecedent reactions on a broader scale. If we know the character is excited about a job interview, how do they react when they hear they didn’t get the job?

Abstractly, a two-character dialogue scene is just a layering of conversational antecedents atop one another. A character says something, which causes the other character to respond, which causes the first character to respond, and so on. What was just previously said in the conversation to make a character declare “Prison changes you”?
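This layering of antecedents can be sketched with a toy data structure: a scene as an ordered list of (speaker, line) turns, where each line’s antecedent is simply the line before it. The helper and scene below are hypothetical examples, not part of the Moviegoer codebase.

```python
# Illustrative sketch: a two-character dialogue scene as a chain of antecedents.
# Each line's conversational antecedent is the line spoken just before it.

def antecedent_of(scene, phrase):
    """Return the line spoken immediately before the first line containing
    `phrase`, or None if that line opens the scene or the phrase is absent."""
    for i, (speaker, line) in enumerate(scene):
        if phrase in line:
            return scene[i - 1][1] if i > 0 else None
    return None

scene = [
    ("A", "You seem different since you got out."),
    ("B", "Prison changes you."),
]

print(antecedent_of(scene, "Prison changes you"))
# You seem different since you got out.
```

A real model would of course look further back than one turn, but the principle is the same: every line arrives with its stimulus attached.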

Cinema is a Document of Societal Norms

Going further into the contextual benefits of cinematic expression, movies offer real-world societal context. Consider an average movie scene that takes place in a restaurant. In a restaurant, there are a number of rules that diners follow. What usually happens when a server asks “What can I get you?” or “Can I take your order?” And a common dynamic in restaurant scenes involves one character starting to lose their cool and yelling, and the other calming them down and asking them to be quiet. It’s an unspoken rule to not make a scene in restaurants.

Airport scenes are typically filled with goodbyes. Driving scenes depict characters seated in a specific pattern, facing forward. Birthday celebrations show a number of rituals which include cake, candle, and song. Although they aren’t necessarily important to the plot of the film (e.g. a restaurant dialogue could be adapted to take place anywhere else), they provide important societal context, reflecting the typical anthropological rules and rituals of how people actually act in specific locations and situations.

Driving scenes feature characters seated in a specific pattern — just like real life. (image by author)

Understanding Human Behavior Requires Multiple Streams of Emotional Data

Many datasets focus on just one type of data: facial expression, word-choice sentiment, voice tone, etc. Each gets the job done in its respective field, but some societal norms are too nuanced to be quantified by a single stream of emotional data. One of the biggest examples, sarcasm, has notoriously stumped AI models. In this clip from The Simpsons, Comic Book Guy sarcastically muses how (un-)useful a sarcasm detector would be.

Sarcasm often fools AI models which track word-choice sentiment, because positive words are used in a negative manner. But if we looked across all of cinema for positive word choice, deadpan voice tone, and a neutral facial expression, we could reliably find instances of characters being sarcastic. A number of other mannerisms also require the context of other streams: facial models detecting crying (tears of heartbreak vs. tears of joy), and body-language models detecting clapping (a crowd cheering vs. a villain emerging from the shadows).
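The idea of combining streams can be sketched as a simple rule: flag moments where the streams disagree. The thresholds and field names below are illustrative assumptions for a toy example, not a published sarcasm model.

```python
# Hypothetical rule-based sketch of multi-stream sarcasm detection.
# A "moment" bundles three independently extracted signals; sarcasm is
# suspected when positive words meet flat delivery and a neutral face.

def looks_sarcastic(moment):
    """Flag positive word choice delivered deadpan with a neutral face."""
    return (
        moment["text_sentiment"] > 0.5      # positive word choice
        and moment["vocal_energy"] < 0.2    # flat, deadpan delivery
        and moment["face_emotion"] == "neutral"
    )

sincere = {"text_sentiment": 0.8, "vocal_energy": 0.9, "face_emotion": "happy"}
deadpan = {"text_sentiment": 0.8, "vocal_energy": 0.1, "face_emotion": "neutral"}

print(looks_sarcastic(sincere), looks_sarcastic(deadpan))
# False True
```

A learned model would replace the hand-set thresholds, but the point stands: no single stream can separate the sincere moment from the sarcastic one; the contradiction between streams is the signal.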

It’s Already There

Hundreds of thousands of movies already exist, ready to be parsed as structured data. They contain rich emotional cause-and-effect information and serve as a mirror of how society perceives itself. Existing datasets are limited in size, may focus on only specific streams of emotional data, are devoid of societal context, and lack the behavioral antecedents needed to truly understand emotional response. Hiring actors to deliver lines, or paying people to watch and label clips of TV, requires significant human effort (which translates to a monetary cost). Again, movies are already there, ready to be turned into self-labelling, contextually rich data.

I hope to convince data scientists of the merits of using cinema as emotional and societal data. I invite you to poke around the Moviegoer prototype to see how films can be turned into structured data, research how we can begin to use this data to train emotional AI models, or reach out to discuss these ideas.

Moviegoer Prototype Release

  1. The Case for Teaching Emotion to AI by Having Them Watch Movies
  2. Prototype in Action: Turning Movies into Emotional Data
  3. 5 Reasons Why Cinema is the Perfect Dataset of Emotional and Anthropological Knowledge
