Super Bowl Prediction Model

Predicting the winner of Super Bowl LIV based on regular season data from 1966 to 2019

Matthew Littman
Towards Data Science


NFL Super Bowl Trophy (Lombardi Trophy) image source: Wikipedia

Written By:

Matthew Littman, M.S. Business Analytics, UCI

Addy Dam, M.S. Business Analytics, UCI

Contributor: Vedant Kshirsagar, M.S. Business Analytics, UCI

Introduction

The Super Bowl is an enormously popular sporting event that takes place each year to determine the championship team of the National Football League (NFL). Millions of fans gather around televisions on a Sunday in February to celebrate this de facto national holiday. Broadcast in more than 170 countries, the Super Bowl is one of the most-watched sporting events in the world, and elaborate halftime shows, celebrity appearances, and hilarious commercials add to the appeal. After more than 50 years, the Super Bowl has become a legendary symbol of American culture.

The Super Bowl closes out a regular season that begins in September. The telecast of last year's Super Bowl LIII drew a worldwide TV audience of about 98.2 million viewers, making it one of the trendiest topics of the year. For this reason, our team decided to build a prediction model to forecast the winner of Super Bowl LIV (54) using data published on the official National Football League website, nfl.com, as well as several others.

The Super Bowl is a huge celebration for Americans. With 98.2 million viewers worldwide, it is the most watched sporting event in the US. Major businesses shell out $5–5.5 million for a 30-second commercial, and the price keeps climbing. These huge numbers and the passion for the game drew our attention to this project. This year also marks the 100th anniversary of America's favorite sport.

NFL 100 year anniversary logo. image source: Wikipedia

Data Summary

We used 3 different websites to collect all of our data: NFL.com, pro-football-reference.com, and topendsports.com. Utilizing the Beautiful Soup package in Python, we began to scrape. We wanted to collect every stat from the first Super Bowl season in 1966 up to the present (Week 13 of the 2019 NFL season). Starting with NFL.com, the stats fell under the 2 categories of Offense and Defense, with 11 tables for offense and 8 tables for defense. NFL.com, the official NFL website, was our main resource for the historic stats because of its extensive and clean records. Its one drawback was that it did not contain each team's wins, losses, and ties. Our project required these important metrics, so we pulled them from Pro Football Reference. After scraping the records for each team dating all the way back to 1966, the last thing we needed was to know which teams had actually won each Super Bowl. With all of this data gathered, we merged it on Team and Year so that each row represented one team in one year, with all of its corresponding stats. After sorting by ascending year, our dataset contained 1579 rows and 242 columns, or almost 400k data points. It was ready for cleaning.

Figure 1. Scraped data sources
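
As a rough, hedged illustration of the scraping step, the sketch below pulls a single HTML stats table with Beautiful Soup and pandas. The URL and table structure are placeholders, not the actual layout of NFL.com or the other sites.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_stats_table(url: str) -> pd.DataFrame:
    """Scrape the first HTML stats table on a page into a DataFrame."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    table = soup.find("table")  # assumes the stats live in the first <table> on the page
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
        if tr.find_all("td")
    ]
    # Assumes the header count matches the cell count; real tables may need per-site tweaks
    return pd.DataFrame(rows, columns=headers)

# Hypothetical usage, one call per stats table and season:
# passing_2019 = scrape_stats_table("https://example.com/nfl/offense/passing/2019")
```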

Cleaning

  • Column names were not intuitive and needed to reflect where the information came from.
  • Columns with multiple pieces of information needed to be separated into multiple columns.
  • Many of the data types were incorrect and needed to be adjusted.
  • Reordering of columns to ensure readability.
  • Columns with similar names to other tables needed to be manually changed to avoid confusion.
  • For the time being, we replaced missing values (NAs) with “-99999”.
  • The most important issue was that many of the columns contained missing data.
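
A minimal sketch of the kind of pandas cleanup these bullets describe; the column names here are illustrative, not our actual schema.

```python
import pandas as pd

def clean_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleanup: rename, split, retype, and placeholder-fill a scraped table."""
    # Prefix ambiguous column names with their source table (names here are made up)
    df = df.rename(columns={"Yds": "Offense_Passing_Yds", "Att": "Offense_Rushing_Att"})

    # Split a combined record column into separate columns (assumes a "W-L-T" format)
    if "Record" in df.columns:
        df[["Wins", "Losses", "Ties"]] = df["Record"].str.split("-", expand=True)

    # Fix data types that came through as strings
    for col in ["Wins", "Losses", "Ties", "Offense_Passing_Yds", "Offense_Rushing_Att"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    # Temporary placeholder for missing values, to be replaced later by imputation
    df = df.fillna(-99999)

    # Reorder for readability: identifiers first, stats after
    id_cols = [c for c in ["Team", "Year", "Wins", "Losses", "Ties"] if c in df.columns]
    return df[id_cols + [c for c in df.columns if c not in id_cols]]
```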

After further exploration, it was apparent that certain stats were not recorded until certain years. For example, most of the defense and kicking stats were not recorded until 1991, and the offensive line stats were not recorded until 2009. In total, 89 of our 242 columns contained missing data, each going back only as far as the year the NFL started recording that stat.

Each team's stats describe that team in a specific year, independent of other years. This means that all calculations or comparisons must be performed after grouping by year. We did not drop any columns yet, as we planned to have the model select only the most important variables.

Exploratory Analysis

The first thing we looked at was the Super Bowl winners over the past 53 years, from 1966 to 2018. Figure 2 shows that the Pittsburgh Steelers and the New England Patriots lead the pack with 6 wins each, followed by the San Francisco 49ers and the Dallas Cowboys with 5 wins each; the New York Giants and the Green Bay Packers are close behind with 4 wins each. The Washington Redskins and Denver Broncos each sit at 3 wins, followed by the Oakland Raiders, Miami Dolphins, and Baltimore Ravens with 2 wins each, and the remaining 11 teams each have one win.

Figure 2. Ranking of Super Bowl winners from 1966 to 2018

Figure 3 below shows the distribution of wins and losses for all teams over the last 53 years. Each red circle represents the regular season record of a team that did not win the Super Bowl, while the green circles represent the Super Bowl winners. Note that some of them are stacked because teams can have identical records. Notice that no team has won the Super Bowl with fewer than 9 wins, except the 1982 Redskins, and that season was shortened to 9 games by a strike. Two key points are labeled in Figure 3: the 1972 Miami Dolphins and the 2007 New England Patriots. In 1972, the Miami Dolphins went undefeated in the regular season and won the Super Bowl. After that impressive season, the Dolphins began a tradition of celebrating once every other team had lost at least one game, meaning they would remain the only team to finish a season undefeated. In 2007, the New England Patriots nearly broke that 35-year tradition by winning all 16 regular season games, but they lost the Super Bowl. It is important to note that the regular season used to be only 14 games, but in 1978 it expanded to 16 games.

Figure 3. Super Bowl wins and losses from 1966 to 2018

Looking at touchdown statistics, there are 8 ways to score a touchdown in a football game: 4 on offense and 4 on defense. As shown in Figure 4, across the entire history of the NFL, receiving and rushing touchdowns make up the majority of all touchdowns, at 55% and 36% of the total, respectively. This is an average across all teams, regardless of whether they won the Super Bowl. Interestingly enough, when looking at the averages for specific Super Bowl winning teams, the central tendency is still around 55% and 36%, but the ranges vary greatly. This leads us to believe that the teams that won in those years were doing something different from everyone else, either running more or throwing more depending on the year. This exploration shows that a team's percentage of rushing or receiving touchdowns will be an important predictor of the Super Bowl winner.

Figure 4. Distribution of 8 types of touchdowns

We looked at several statistics that differentiate Super Bowl winners from everyone else. It is natural that the regular season win count is a big factor in deciding who will win the Super Bowl. Historically, Super Bowl winners make 3.18 more Offense Rushing Attempts per game on average, which suggests that teams should attempt more running plays. Offense Passing average is also 0.75 yards higher among Super Bowl winners than among non-winners, which suggests that teams should throw the ball farther downfield to increase their chances of winning.

The stark difference between winners and losers in Offense Game Stats Turnovers suggests that turnovers are a huge differentiator when it comes to who wins the Super Bowl. Lastly, Defense Passing Sacks are 7.15 higher among Super Bowl winners than among everyone else. Strong defenses lead to more sacks, which could potentially lead to more turnovers!

Table 1. Comparison of key statistics between Super Bowl winners and losers

After exploring all of the data, it was difficult to find trends or correlations among the variables because of their sheer number. With over 240 stats, most of the insight we could gain came from prior knowledge of the sport, and that game sense helped direct our attention to the right fields.

Modeling

The first issue that needed to be addressed in our dataset was the missing data. The problem with dropping the columns that contained missing values is that many of them capture very important aspects of the game, and dropping them would discard information for the years where we did have values. Filling those values with the means of their respective columns did not make sense either, because some columns were about half missing and half present, depending on when the NFL started recording them. Using the mean would severely skew those columns, but more importantly, team performance varies greatly across the years, and a mean does not reflect that variation. The imputed stats should reflect each team's performance in that year, which led us to more advanced predictive techniques for filling in the missing data.

Data Imputation

Linear Regression

The first technique used a linear regression where the dependent variable was one of the columns containing missing data and the independent variables were the other columns. The model was trained on the existing data for that column, so each missing observation could be predicted from the other columns for that team and that year. It should be noted that for all of the predictive imputation methods, the data was normalized within each year to adjust for scaling issues when predicting. To further boost predictive accuracy, a technique known as recursive feature elimination (RFE) was performed for each column. RFE is a way of reducing the number of columns to only the relevant ones. It takes a subset of the columns and uses a decision function to determine which columns in that subset are the most important. In our case, we used a logistic regression and kept the variables with the highest coefficients, which measure each variable's overall impact on the dependent variable. After removing the least impactful column from the subset, the process repeats with a different subset, and it continues until only the user-specified number of columns is left, which in our case was 20. This means that each column with missing data was predicted from its own set of 20 columns specific to that column. After predicting all of the missing data points, they had to be re-aggregated with the original dataframe. Unfortunately, after comparing the existing values to the predicted values, it was clear that the predictions were not in line with the existing values and therefore could not be used.
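
For reference, here is a hedged sketch of the kind of per-column regression imputation we attempted, not the exact code we ran. It assumes the stats have already been normalized within each year, and it uses a LinearRegression estimator inside RFE for simplicity, whereas the write-up above describes ranking by logistic regression coefficients.

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

def impute_column_with_regression(df: pd.DataFrame, target: str, n_features: int = 20) -> pd.Series:
    """Predict missing values of `target` from complete numeric columns via RFE + linear regression."""
    numeric = df.select_dtypes(include="number")
    predictors = [c for c in numeric.columns if c != target and numeric[c].notna().all()]
    known = numeric[numeric[target].notna()]
    missing = numeric[numeric[target].isna()]
    if missing.empty or not predictors:
        return df[target]

    # Keep only the most relevant predictors for this particular column
    selector = RFE(LinearRegression(), n_features_to_select=min(n_features, len(predictors)))
    selector.fit(known[predictors], known[target])
    chosen = [c for c, keep in zip(predictors, selector.support_) if keep]

    # Fit on rows where the target exists, then predict the rows where it is missing
    model = LinearRegression().fit(known[chosen], known[target])
    filled = df[target].copy()
    filled.loc[missing.index] = model.predict(missing[chosen])
    return filled
```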

K-Nearest Neighbors

After consulting a former statistics professor of ours, Mohamed Abdelhamid, we were advised to try K-Nearest Neighbors (KNN). This technique treats every observation as a collection of all of its attributes and plots each observation in n-dimensional space, where n is the number of columns. For an observation with a missing value, KNN finds the minimum distances between that observation and its k nearest neighbors, where k is the number of neighbors you are looking for. An example is shown in Figure 5, where the black point is the observation whose attribute we wish to impute.

Figure 5. KNN visual for classification in order to impute missing values.

Once the number of neighbors has been set, the algorithm replaces the missing value with the average of the neighbors' values for that attribute. In our case, we used the 5 nearest neighbors. If multiple columns need imputed values, as in our case, the algorithm sorts the columns from the fewest missing values to the most. The algorithm only gets better with fewer missing values in a column, so this sorting helps by doing the most reliable predictions first. This method worked well in theory, but again could not be used because the predicted values were not consistent with the values that already existed.
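
For illustration, scikit-learn's KNNImputer captures the core of this idea; it does not perform the column-ordering step described above, and it is an assumption on our part rather than the exact package used in the project.

```python
import pandas as pd
from sklearn.impute import KNNImputer

def knn_impute(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Fill each missing value with the average of its k nearest neighbors in feature space."""
    numeric = df.select_dtypes(include="number")
    imputer = KNNImputer(n_neighbors=k)
    filled = imputer.fit_transform(numeric)
    return pd.DataFrame(filled, columns=numeric.columns, index=numeric.index)
```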

Multiple Imputation by Chained Equations

The two previous imputation methods had not worked, so a final method, multiple imputation by chained equations (MICE), was attempted. This method is also known as “fully conditional specification” or “sequential regression multiple imputation.” In her 2011 article about MICE, Melissa Azur explains that the technique is intended for data that is missing at random. Our data is missing not at random (MNAR), but further research showed that suitability depends on the dataset and that others have used MICE on MNAR data. The Python package fancyimpute was used for the MICE algorithm in this project.

MICE can be thought of as a combination of the linear regression approach described earlier and the KNN algorithm. It takes each column as the dependent variable and runs a regression, in this case a Bayesian ridge regression, using the other attributes of that row to predict the missing value. The user specifies how many cycles to run, and the algorithm creates one data frame per cycle. In detail, the algorithm works as follows (a code sketch follows Figure 7 below):

  • Take each column with missing data (one of our 89) and run the MICE algorithm on it.
  • Begin by filling the missing values in that column with the median.
  • Run a regression for each row in that column that has a missing value.
  • Use the average of the regression values of the 4 nearest neighbors (the number is specified by the user) as the missing value for that row.
  • After all of the values have been filled, repeat for up to the maximum number of cycles specified by the user (in our case 1000).
  • Each cycle produces 1 data frame containing filled-in values for that column.
  • The algorithm stops for the column it is working on based on whichever of these 2 conditions occurs first:
  • If the values have converged and are no longer changing from cycle to cycle, according to the stopping tolerance specified by the user, use the converged value as the missing value for each row.
  • Otherwise, repeat the process until 1000 data frames have been created.
  • If the values did not converge, the algorithm pools the results for each row across all 1000 data frames it created.
  • A distribution is built from those pooled values.
  • The most likely value (the highest peak of the distribution) is used as the missing value for that row.
  • All values for that column should now be filled; repeat for the remaining columns.

Figure 6 gives a visual of how the algorithm works. The number of cycles has been set to 3, creating 3 data frames rather than the maximum of 1000 used in our case.

Figure 6. Visual representation of MICE algorithm with cycles = 3 for one column (Stef Van Buuren, 2011)

The results of this method proved very promising. Most of the values made sense within their respective rows, and because the data was normalized before being fed into the algorithm, the outputs roughly followed a normal distribution, with the predicted values mostly falling within one standard deviation of the mean (between 0.3 and 0.7).

Figure 7. Normal distribution of the output prediction imputations. image source: statisticshowto.datasciencecentral.com
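
As a hedged sketch of the same idea, scikit-learn's IterativeImputer with a BayesianRidge estimator is a close single-imputation analogue of the MICE routine described above. Our project used the fancyimpute package, and the parameter values below are illustrative.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

def mice_style_impute(df: pd.DataFrame, max_cycles: int = 1000, tol: float = 1e-3) -> pd.DataFrame:
    """Iteratively model each column with missing data from the others until convergence."""
    numeric = df.select_dtypes(include="number")
    imputer = IterativeImputer(
        estimator=BayesianRidge(),   # Bayesian ridge regression, as in our MICE setup
        max_iter=max_cycles,         # upper bound on cycles, analogous to our 1000
        tol=tol,                     # stopping tolerance used to decide convergence
        sample_posterior=True,       # draw imputations from the posterior distribution
        random_state=42,
    )
    filled = imputer.fit_transform(numeric)
    return pd.DataFrame(filled, columns=numeric.columns, index=numeric.index)
```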

Handling Unbalanced Data

With the missing data handled, it was time to address the unbalanced data. There have been only 53 Super Bowl winners, and with 32 teams per season (fewer in earlier years), this leaves 1525 teams as “non-winners” and 53 as winners. With only about 3% of the population being winners, it is very difficult for a model to predict winners without intervention. This is similar to fraud detection, where most transactions are legitimate and only a few are fraudulent, although mislabeling a potential Super Bowl winner carries nowhere near the cost of missing a fraudulent transaction. This imbalance makes modeling difficult. Below are some methods for dealing with unbalanced data and their efficacy:

Table 2. Summary table of sampling methods described in original SMOTE research paper (Chawla,Bowyer,Hall,Kegelmeyer, 2002)

After researching the possible sampling methods, whether oversampling the minority (which on its own does not lead to better minority detection), undersampling the majority, or some combination of the two, the Synthetic Minority Oversampling Technique (SMOTE) proved to be the best option.

How SMOTE Works

In addition to undersampling the majority, as mentioned in (Chawla et al., 2002), the minority oversampling method is the true genius behind this technique. The first step is to decide how much oversampling needs to be done; in our case, we wanted a 50/50 split between “winners” and “non-winners.” SMOTE was implemented using the imblearn package in Python.

For example purposes, we will oversample the minority by 200%. This means each minority point is responsible for creating 2 additional synthetic points, tripling our count of minority data points. Next, the user must decide how many nearest neighbors to use for the generation; (Chawla et al., 2002) used 5. After these decisions, the algorithm does the following (a code sketch follows Table 3):

  • Select the first minority point to use.
  • Find the K nearest neighbors specified by the user (in our case 5).

(Note: these nearest neighbors are only other minority points, not all points.)

  • Draw edges between the point and those 5 neighbors.
  • Check how much oversampling needs to be performed (at 500% you would create one new point per edge, for all 5 edges).
  • Since our example percent is 200%, randomly select 2 of the 5 edges to create new points on.
  • For each chosen edge, randomly select a number between 0 and 1 to determine how far along the edge the synthetic point is placed (where 1 lands right on top of the neighbor point).
  • Repeat the process for all minority points to end up with triple the amount you started with.
Figure 8. Synthetic samples visual from Kaggle sampling technique article (Rafjaa, 2017)
Table 3. SMOTE results before and after
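
A minimal sketch of the SMOTE call through the imblearn package, using the parameter values discussed above; the function and variable names are illustrative.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def balance_winners(X: np.ndarray, y: np.ndarray):
    """Oversample the minority class (winners) with SMOTE until the classes are 50/50."""
    smote = SMOTE(
        sampling_strategy=1.0,  # target a 1:1 ratio of winners to non-winners
        k_neighbors=5,          # minority neighbors used to place synthetic points
        random_state=42,
    )
    return smote.fit_resample(X, y)
```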

Feature Selection

As mentioned above in the linear regression imputation section, recursive feature elimination (RFE) is a convenient way of reducing the number of columns to only those that matter most. With the missing data filled in and the minority class balanced, we used RFE to select the 20 columns most impactful for predicting who will win the Super Bowl. They are as follows:

Figure 9. Top 20 most impactful columns after using RFE
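
A hedged sketch of this selection step with scikit-learn's RFE and a logistic regression estimator; the solver settings are illustrative.

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def select_top_columns(X: pd.DataFrame, y: pd.Series, n: int = 20) -> list:
    """Recursively drop the least impactful column until only n remain."""
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n, step=1)
    selector.fit(X, y)
    return list(X.columns[selector.support_])
```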

With the 20 most impactful columns selected, all of the coefficients make sense except for the offensive line stats. These stats only started being recorded in 2009 and therefore had the most missing values. When these columns are taken out, however, the coefficients of the other columns no longer make sense: Wins became negative, and as noted in our exploratory analysis, its coefficient should clearly be positive, since the more wins you have, the more likely you are to make it to the Super Bowl. Our hypothesis is that these 3 columns act as added noise that lets the other columns do a better job and helps avoid overfitting, so we left them in the model. To validate, we checked a correlation table to ensure there was no multicollinearity among the variables. Although there is a strong negative correlation between winning and losing, we decided to keep both as indicators that our model is performing correctly. The correlation table can be found below.

Figure 10. Correlation table with 20 columns used for model

After using RFE, we went back and performed a two-sample independent t-test between Super Bowl winners and non-winners and selected the columns it labeled as significant. We ran just those columns through our model, and we also tried filtering the dataset by those columns and then running RFE on the t-test-filtered results. Overall, using RFE alone gave the best accuracy and recall.

Prediction Algorithm

With the missing data filled, the minority class balanced, and the 20 most impactful columns selected, we fed them into the prediction algorithm. We chose a logistic regression for its explainability and ease of use, though a neural network could have potential given the amount of numeric data we have. This could be explored in the future.

We used an 80/20 train-validation split, training on a random sample of 80% of the 1966–2018 data and validating the model on the remaining 20%. This produced a 95% accuracy on the held-out set.
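
A minimal sketch of the split, fit, and evaluation behind Tables 4 and 5; the random seed and solver settings are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

def train_and_evaluate(X, y, seed: int = 42):
    """80/20 train-validation split followed by a logistic regression fit and report."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    preds = model.predict(X_val)
    print(confusion_matrix(y_val, preds))       # counts like Table 4
    print(classification_report(y_val, preds))  # accuracy, recall, precision as in Table 5
    return model, X_val, y_val
```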

Table 4. Confusion Matrix from Logistic Regression showing actual versus predicted results
Table 5. Classification report from Logistic Regression validating accuracy, recall, and precision

With a recall of 89% for non-winners and 100% for winners, and an overall accuracy of 95%, our model errs on the side of predicting winners when they are truly non-winners. It also predicted 0 winners as non-winners, which means it is more likely to call a team a winner when it is not than to miss a true winner. This is actually favorable for an in-season prediction: we would rather have our model give more teams a “potential” chance than label everyone a non-winner. With a very promising confusion matrix, we plotted our ROC curve, which showed an area under the curve of 0.95.

Figure 11. ROC curve shows an AUC of .95
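
The ROC curve in Figure 11 can be reproduced along these lines; a sketch assuming the fitted model and validation split from the step above.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

def plot_roc(model, X_val, y_val):
    """Plot the validation ROC curve and report the area under it."""
    scores = model.predict_proba(X_val)[:, 1]  # probability of the "winner" class
    fpr, tpr, _ = roc_curve(y_val, scores)
    plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
    plt.plot([0, 1], [0, 1], linestyle="--")   # chance line for reference
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```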

Results

Predicted Winners

After training and validation, it was time to put our model to the test on the current 2019 season. The model ranks each team by its predicted chance of winning the Super Bowl, with the highest percentages at the top, as seen in Figure 12. The San Francisco 49ers and the New England Patriots are predicted to have a 96% and a 76% chance of winning the Super Bowl, respectively. In the current 2019 NFL playoff standings, the 49ers are slated for a Wild Card slot with a record of 10 wins and 2 losses, and the Patriots are one of the division leaders, also with a record of 10 wins and 2 losses. Further, every team to which our model gives a nonzero chance is either slated to make the playoffs or has a good chance of doing so, with the only exception being the Carolina Panthers.

Figure 12. Winners that our model predicts vs. current NFL results
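
Producing a ranking like Figure 12 amounts to scoring the 2019 team-season rows with the fitted model and sorting by predicted probability; a sketch assuming a hypothetical teams_2019 DataFrame with the same 20 feature columns.

```python
import pandas as pd

def rank_teams(model, teams_2019: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Rank current-season teams by predicted chance of winning the Super Bowl."""
    probs = model.predict_proba(teams_2019[feature_cols])[:, 1]
    ranking = teams_2019[["Team"]].assign(win_chance=probs)
    return ranking.sort_values("win_chance", ascending=False).reset_index(drop=True)
```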

Predicted Losers

Not only does our model predict the likelihood of teams becoming champions with astounding accuracy, it also identifies the teams with a low to zero chance of winning the Super Bowl. As seen in Figure 13, our model gives the Lions, Cardinals, Falcons, Giants, Dolphins, and Bengals zero chance of winning the Super Bowl, which makes sense considering they have already been eliminated from playoff contention. Our model also gives several other teams zero chance; these teams are what's known as “in the hunt,” meaning they can still make the playoffs if they improve their records over the last 4 games of the season. Our model does not think these teams can do that. For Figure 12, the bottom teams have been cut off to improve visibility, and for Figure 13, the middle teams were cut out, which explains the different team names listed in each figure.

Figure 13. Losers that our model predicts vs. current eliminated NFL teams

Conclusion

To conclude, the methodology that gives our model its highest prediction accuracy is the sequential combination of MICE, SMOTE, RFE, and logistic regression. It yields a 0.95 area under the curve and recall rates of 0.89 for non-winners and 1.00 for winners. In addition, the F1 scores, the harmonic mean of precision and recall, are 0.94 for non-winners and 0.95 for winners.

Some of the most significant attributes in predicting the winner of Super Bowl 2019 are:

  1. Percentage of offense receiving touchdowns
  2. Number of offensive rushing attempts per game
  3. Average offensive passing yards per play
  4. Average defensive passing yards per play
  5. Turnover differentials
  6. Defensive sacks
  7. Defensive touchdowns total

For future projects, we would like to explore the feasibility of using neural networks with our data, as well as predicting individual matchups at the game level using player statistics, weather data, home/away status, and team matchup history.

In addition, we think a quantitative approach to injury prediction, enabling preventive measures, could benefit players as well as team management. This would be challenging and may not be possible, considering many injuries happen as accidents rather than through degradation.

Using the same dataset, it would be interesting to discover whether a good defense or a good offense matters more for winning the Super Bowl.

Finally, we think that building a recommendation system for buying/trading players or building a fantasy football team, similar to Netflix's advanced recommendation system, could be extremely valuable. Such a model would have to capture team dynamics and the criteria that make up good teams, then evaluate a current team and recommend players based on its weaknesses.

All of our code can be found on GitHub: https://github.com/kelandrin/NFL-Superbowl-Prediction

(We are currently working on a web app that uses this code.)

Related Work/References

Super Bowl History

History.com Editors. “Super Bowl History.” History.com, A&E Television Networks, 11 May 2018, https://www.history.com/topics/sports/super-bowl-history.

Super Bowl Viewership

Perez, Sarah. “Super Bowl LIII Set Streaming Records, While TV Viewership Saw Massive Drop.” TechCrunch, TechCrunch, 5 Feb. 2019, https://techcrunch.com/2019/02/05/super-bowl-liii-set-streaming-records-while-tv-viewership-saw-massive-drop/.

Multiple Imputation by Chained Equations

Drakos, Georgios. “Handling Missing Values in Machine Learning: Part 2.” Medium, Towards Data Science, 5 Oct. 2018, https://towardsdatascience.com/handling-missing-values-in-machine-learning-part-2-222154b4b58e.

Azur, Melissa J, et al. “Multiple Imputation by Chained Equations: What Is It and How Does It Work?” International Journal of Methods in Psychiatric Research, U.S. National Library of Medicine, Mar. 2011, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/#R8.

Buuren, S. van, and K. Groothuis-Oudshoorn. “MICE: Multivariate Imputation by Chained Equations in R.” Stef Van Buuren, 1 Dec. 2011, https://stefvanbuuren.name/publication/2011-01-01_vanbuuren2011a/.

Synthetic Minority Oversampling Technique

Chawla, Nitesh, et al. “SMOTE: Synthetic Minority Over-Sampling Technique.” View of SMOTE: Synthetic Minority Over-Sampling Technique, Journal of Artificial Intelligence Research, June 2002, https://www.jair.org/index.php/jair/article/view/10302/24590.

Rafjaa. “Resampling Strategies for Imbalanced Datasets.” Kaggle, Kaggle, 15 Nov. 2017, https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets.

Recursive Feature Elimination

Li, Susan. “Building A Logistic Regression in Python, Step by Step.” Medium, Towards Data Science, 27 Feb. 2019, https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8.

Keng, Brian, et al. Diving into Data, 20 Dec. 2014, https://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/.
