Predicting the Oscars using Preferential Machine Learning

The Oscars and their preferential balloting led me to create a novel machine learning approach to mimic this voting system

Published in Towards Data Science, Feb 3, 2020

Last year was a great year for film. If you are like me, still basking in the afterglow of the MoviePass craze and seeing a lot of films in theaters, you know that Once Upon a Time in Hollywood, Parasite, 1917 and many more delivered unique cinematic experiences. Every year on Oscar Sunday, Hollywood gets together and gives itself a big pat on the back. The biggest prize of the night is the award for Best Picture, which can cement a movie in the annals of film history. Unlike the other 23 awards given out on Oscar night, the coveted Best Picture award is chosen using a method called preferential balloting, which is more complicated than a traditional vote. Preparing for this year’s Oscars and learning about preferential balloting led me to write some programs that mimic this voting system using machine learning.

‘2020 Oscars’ Art by Reddit user u/Tillmann_S

In this article, I:

Pick the data used to predict the Oscars
Explore how preferential balloting works from a data science perspective
Demonstrate a method of my own design that I call a Preferential Balloting Random Forest
Simulate what is happening behind the scenes of the Best Picture vote
Predict this year’s Best Picture winner

I don’t include any of my code in this article, but here is the repository with my notebooks used in this analysis

How to Predict the Oscars: The Dataset

To predict anything using machine learning, we need a meaningful dataset to train our model on. In the case of the Best Picture race, we have the nine 2019 films nominated for the award. As reverential as I am toward the Oscars (I am interested enough to write this article, after all), I hold no illusions that the best movie of the year is the one that will win the Best Picture Oscar. The Academy is made up of thousands of members working throughout various areas of the film industry, and each has biases that shape their vote. Because there are real people behind the votes, we can’t rely on numerical indicators of film quality like box-office profits or aggregate critic scores. But you know what correlates well with filmmakers’ votes? Other filmmakers’ votes.

There are numerous other awards shows that make up “Awards Season”, and the voters for events like the Screen Actors Guild Awards and the Directors Guild Awards are often the same people who make up the voting body of the Academy Awards. By combining the results of earlier awards shows like the SAGs, DGAs, PGAs, Golden Globes, and BAFTAs with Oscar information such as nomination count, I can train a model on previous years’ Best Picture winners to predict this year’s. To get consistent movie data and naming conventions, I scraped each awards show’s nominees and winners from Wikipedia and merged them all into one dataset in Python using the Pandas and Beautiful Soup packages.
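
As a rough illustration of the merge step, the sketch below left-joins per-show tables onto the Oscar nominees with pandas. The column names, the `build_dataset` helper, and the toy rows ("Film A" and so on) are hypothetical placeholders for this example, not the real scraped data or the notebook code.

```python
import pandas as pd

# Hypothetical toy rows; the real tables would come from the scraped Wikipedia pages.
oscars = pd.DataFrame({
    "year": [2001, 2001, 2002, 2002],
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "oscar_nominations": [10, 6, 8, 5],
    "best_picture_winner": [1, 0, 0, 1],
})
dga = pd.DataFrame({
    "year": [2001, 2001, 2002],
    "film": ["Film A", "Film B", "Film D"],
    "dga_winner": [1, 0, 1],
})
sag = pd.DataFrame({
    "year": [2001, 2002, 2002],
    "film": ["Film B", "Film C", "Film D"],
    "sag_winner": [1, 0, 1],
})


def build_dataset(oscar_nominees, other_shows):
    """Left-join each earlier show's results onto the Oscar nominees."""
    merged = oscar_nominees.copy()
    for show in other_shows:
        merged = merged.merge(show, on=["year", "film"], how="left")
    # A film absent from a show's table has no win there, so fill with 0.
    award_cols = [c for c in merged.columns if c.endswith("_winner")]
    merged[award_cols] = merged[award_cols].fillna(0).astype(int)
    return merged


dataset = build_dataset(oscars, [dga, sag])
print(dataset)
```

Joining on both year and film title is why consistent naming matters: a title spelled differently on two pages would silently fail to match and show up as a missing award.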

How Preferential Balloting Works

Preferential balloting, also called instant-runoff voting, is commonly used in situations where there are many candidates for only one winning spot. The Oscars have used this vote-tallying system to decide the Best Picture race since 2009, when the field expanded from five nominees to up to ten. In preferential balloting, rather than voting for one film, voters submit a ballot with all the options ranked, and the #1 choices are tallied up as votes for each film. If no film has more than 50% of the #1 votes, an iterative process begins: the least popular film is eliminated from every ballot, and any ballot that had the eliminated film in its #1 spot now has its next choice move to the top, which increases the vote count of the remaining films. This repeats until one film has more than 50% of the #1 votes and is declared the winner. A simulation of this elimination process is shown below.

Figure 1: Simulation of preferential balloting elimination. Generated by my Preferential Random Forest
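
The elimination loop can also be sketched in a few lines of Python. This is a minimal illustration rather than the notebook code: the function name `instant_runoff`, the toy ballots, and the alphabetical tie-break for eliminations are all assumptions made for the example, and it expects a non-empty list of fully ranked ballots.

```python
from collections import Counter


def instant_runoff(ballots):
    """Return (winner, rounds) for a list of ranked ballots.

    Each ballot lists films from most to least preferred.
    `rounds` records the first-choice tallies at every stage.
    """
    remaining = {film for ballot in ballots for film in ballot}
    rounds = []
    while True:
        # Count each ballot for its highest-ranked film still in the race.
        tally = Counter({film: 0 for film in remaining})
        for ballot in ballots:
            top = next((film for film in ballot if film in remaining), None)
            if top is not None:
                tally[top] += 1
        rounds.append(dict(tally))

        total = sum(tally.values())
        leader, leader_votes = max(tally.items(), key=lambda kv: kv[1])
        if leader_votes * 2 > total or len(remaining) == 1:
            return leader, rounds

        # Eliminate the film with the fewest first-place votes.
        # Ties are broken alphabetically, an arbitrary choice for this sketch.
        loser = min(sorted(remaining), key=lambda film: tally[film])
        remaining.remove(loser)


# Toy election with made-up ballots: nine voters, three films.
ballots = (
    [["Film A", "Film B", "Film C"]] * 4
    + [["Film B", "Film C", "Film A"]] * 3
    + [["Film C", "Film B", "Film A"]] * 2
)
winner, rounds = instant_runoff(ballots)
print(winner, rounds)
```

In this toy election Film A leads the first round with 4 of 9 votes but has no majority. Film C is eliminated, its two ballots transfer to Film B, and Film B wins 5 to 4, so the first-round leader does not necessarily take the prize.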

Critics of the preferential ballot claim that it rewards easy-to-like, non-controversial films. Such films tend to land around the middle of most people’s rankings, whereas controversial films may sit at the top of some folks’ ballots but at the bottom of others’, which makes it hard for them to pick up votes as other films are eliminated. This effect was seen last year when the more artistic film Roma lost to the more generally appealing film Green Book.

Photo Credit: Left — Alfonso Cuarón/Netflix, Right — UNIVERSAL PICTURES/PARTICIPANT/DREAMWORKS

Preferential Balloting Random Forest

We’ve seen in the past that preferential balloting can change the result of the Best Picture race, so I created a model that reflects this distinct vote-counting method. A Random Forest Classifier makes predictions by combining a number of decorrelated Decision Tree Classifiers. Here is an article focusing more on the specifics of how a traditional Random Forest works. Generally, a Random Forest treats each tree’s prediction as a ‘vote’, often weighted by the class proportions in the leaf a sample lands in, and picks as its final label the class with the most ‘votes’ across all the trees.

For this Preferential Balloting Random Forest, we instead take each tree’s ProbA values for the films in the test set and use them to rank the films from 1st to 9th place. A ProbA value is the tree’s estimated likelihood that a film belongs to the ‘Winner’ class, a softer prediction than the binary ‘Winner’ or ‘Loser’ classification. This softer prediction lets us turn a boolean classification into a ranking. Each Decision Tree produces one ballot, and once the entire Forest has created its ballots, the iterative process of preferential ballot elimination begins to determine the Forest’s choice for winner. By using rankings rather than picking one class, my Preferential Balloting Random Forest keeps information that a traditional Random Forest would discard and uses it later, in the elimination and re-ranking stage of preferential balloting.

Figure 2: An Individual Decision Tree’s vote on the test set
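
To make the "one tree, one ballot" step concrete, here is a small sketch of turning a single tree's winner probabilities into a ranked ballot. The helper name `ballot_from_probabilities`, the made-up probabilities, and the random tie-break are assumptions for illustration; in scikit-learn, values like these would typically come from calling `predict_proba` on a fitted tree.

```python
import random


def ballot_from_probabilities(win_probs, rng=None):
    """Rank films from highest to lowest predicted 'Winner' probability.

    `win_probs` maps film title -> one tree's predicted probability.
    Tied films are ordered randomly (an assumption of this sketch).
    """
    rng = rng or random.Random()
    films = list(win_probs)
    # Shuffle first: the sort below is stable, so ties keep this random order.
    rng.shuffle(films)
    return sorted(films, key=lambda film: win_probs[film], reverse=True)


# Made-up probabilities from one hypothetical tree.
one_tree = {"Film A": 0.10, "Film B": 0.65, "Film C": 0.25}
ballot = ballot_from_probabilities(one_tree, rng=random.Random(0))
print(ballot)  # ['Film B', 'Film C', 'Film A']
```

A hard classifier would report only that Film B is the ‘Winner’ here. The ballot also records that this tree prefers Film C over Film A, which is exactly the information the elimination rounds use when a ballot's top choice drops out.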

Simulating the Oscars

Using my Preferential Balloting Random Forest, I simulated this year’s Best Picture race. To de-correlate the Decision Trees, I varied which awards shows each tree saw, similar to Random Forest’s max_features hyperparameter. In this simulation, max_features represents which guild the voting Academy member may be in, or how closely they follow the other awards shows that season. I also included a random noise feature for each Decision Tree to train on, representing each voter’s innate bias towards certain films. The Academy is made up of around 7,000 unique voters, so I fired up my Forest, which soon produced 7,000 ballots. After 6 rounds of eliminating the last-place film, the top film had over 50% of the #1 votes, and my model had chosen the Best Picture winner…

Figure 3: The final standings after 6 rounds of preferential balloting elimination. The process stopped once the film 1917 had greater than 50% of the vote.
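
For readers who want to experiment, the sketch below shows one way a single "voter" tree could be set up: a random subset of award columns plus a fresh noise column. It is an assumption-laden illustration, not the notebook code. The data is synthetic, the `simulate_voter` helper and the `min_samples_leaf=5` setting are my own choices for the example, and the feature subset is drawn once per tree, whereas scikit-learn's `max_features` resamples features at every split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def simulate_voter(X_train, y_train, X_nominees, n_features, rng):
    """Train one 'voter' tree on a random subset of award columns plus noise.

    Returns that voter's predicted 'Winner' probability for each nominee.
    """
    cols = rng.choice(X_train.shape[1], size=n_features, replace=False)
    # Fresh noise columns stand in for this voter's personal taste.
    train_noise = rng.normal(size=(X_train.shape[0], 1))
    nominee_noise = rng.normal(size=(X_nominees.shape[0], 1))
    tree = DecisionTreeClassifier(
        min_samples_leaf=5,  # larger leaves give graded probabilities, not just 0 or 1
        random_state=int(rng.integers(1_000_000)),
    )
    tree.fit(np.hstack([X_train[:, cols], train_noise]), y_train)
    winner_col = list(tree.classes_).index(1)
    probs = tree.predict_proba(np.hstack([X_nominees[:, cols], nominee_noise]))
    return probs[:, winner_col]


rng = np.random.default_rng(0)
# Synthetic stand-in data: 150 past films x 6 binary "won this award" columns.
X_train = rng.integers(0, 2, size=(150, 6))
y_train = (X_train.sum(axis=1) + rng.normal(size=150) > 4).astype(int)
X_nominees = rng.integers(0, 2, size=(9, 6))

ballots = []
for _ in range(200):  # the article's simulation uses about 7,000 voters
    probs = simulate_voter(X_train, y_train, X_nominees, n_features=3, rng=rng)
    # Rank nominee indices from most to least likely winner; tiny jitter breaks ties.
    jitter = rng.uniform(0, 1e-9, size=probs.size)
    ballots.append(np.argsort(-(probs + jitter)))
ballots = np.array(ballots)

# Share of #1 votes per nominee before any elimination round.
first_choice_share = np.bincount(ballots[:, 0], minlength=9) / len(ballots)
print(first_choice_share.round(3))
```

Each row of `ballots` is one voter's full ranking of the nine nominees, which is the input the elimination rounds need.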

Final Prediction

My Preferential Balloting Random Forest is a novel approach to simulating the Oscars, and I hope it helped you understand a bit about what goes into Best Picture voting and Random Forest Classifiers. But preferential balloting aside, let’s get down to business and really predict these bad boys. With my scraped dataset of award-winning films, I used H2O’s powerful AutoML tool to train 100 different Random Forest, XGBoost, and Deep Learning models with various parameters to predict this year’s Oscars. AutoML chose an XGBoost model that correctly predicted the Oscar outcomes of 147 out of 159 films (about 92%) in cross-validation. And which film did this maelstrom of models predict? Also 1917! Things are looking good for this flick, since the Preferential Balloting Random Forest and my AutoML model both predicted it.

Photo Credit: Universal Pictures, François Duhamel

Links and shoutouts:

GitHub Repo For This Project

Scraping Code Inspired by GitHub User Buzdygan

University of San Francisco MSDS