
The Aaron Bowl

The top name of the NFL Divisional Round

TL;DR. The charts in this blog mostly use randomly created data, as I don’t have the rights to actual data.

On 1/16/21, in the first game of the NFL Divisional Playoffs, the Green Bay Packers will face the Los Angeles Rams. These teams represent the NFL’s top-ranked defense (Rams) and offense (Packers).

Interestingly, the best player on each team is named Aaron. Aaron Rodgers is possibly the best quarterback in football, and he is one of the hot favorites for this year’s Most Valuable Player award. Aaron Donald is one of the best ever to play his position, and he is one of three favorites for this year’s Defensive Player of the Year. Throw in the Packers’ running back Aaron Jones as well. No matter how this game turns out, one of the Aarons will be highly influential. That’s why I’m calling it the Aaron Bowl.

Photo by Nathan Shively on Unsplash

In this blog, I will display some visuals to show how unusual it is to have players at this level in the same game … and even more so for them to have the same first name!

Post Game Recap

Green Bay Packers 32: Los Angeles Rams 18

Aaron Rodgers completed 23 of 36 passes for 296 yards, with two passing touchdowns (including one 58-yard touchdown pass) and one rushing touchdown.

Aaron Jones rushed 14 times for 99 yards and one touchdown, averaging 7 yards per attempt (including one 60-yard run).

Aaron Donald had limited plays, still dealing with an injury from the previous game.

Two of the Aarons did dominate this game – both from Green Bay. Aaron Donald’s injury (and a strong performance from Green Bay’s offensive line) meant that we didn’t see the expected Aaron vs (Aaron + Aaron) battle.

Getting the Data

It is more difficult to get your hands on NFL data than it is for other sports such as baseball and basketball. That may be changing with the NFL supplying data to Kaggle for the Big Data Bowl. Some helpful sources for your own research are the NFL website, last year’s Kaggle NFL Big Data Bowl (last year’s competition has more recent data than this year’s), and Pro Football Reference. Unfortunately, none of these data sources is completely open.

Kaggle allows "Competition Use, Non-Commercial, and Academic Use Only." NFL.com data is "solely for your own individual non-commercial and informational purposes only." Pro Football Reference has the most open data, "we encourage the sharing and reuse of data and statistics our users find on our Site" but warns not "to copy a materially significant portion of our data" – so no scraping.

There are some additional sources of data that create databases by compiling publicly available play-by-play data. However, since these are secondary sources, and I couldn’t access the primary data directly, I chose not to use these in a public blog.

I am going to look at the leaders of the most measurable statistics: passing yards, passing touchdowns, rushing yards, and sacks. I will then compare these to a dataset of dummy statistics generated at random to illustrate how unusual the Aarons are.

I know this approach seems a little weird – using mostly made up data. Sports statistics are commonly reported, are in the public domain, and are treated as facts that are not protected by copyright (commentary on a recent court case involving baseball data here). However, collections of these statistics (databases) are protected. There is a lot of public interest in who the best players are, so it is easy to find statistics on the leaders in the public domain. However, there is not the same interest in identifying all players – so I couldn’t build a dataset without a data source.

There are four Aarons in the NFL in positions that can impact these statistics: Aaron Donald, Aaron Rodgers, Aaron Jones, and Aaron Lynch.

Photo by Alexander Schimmeck on Unsplash

Dummy Data

There are several options for creating dummy data in Python. Because passing yards and touchdowns are highly correlated, I’m using NumPy’s random.multivariate_normal, which draws correlated random samples from a multivariate normal distribution. Here’s a reference to the official doc.

The parameters I use are mean, cov, and size. mean is the mean of each dimension of the data. cov is the covariance matrix of the distribution. size is the size and/or shape of the sample (it is optional; without it, a single sample is returned).
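To make the three parameters concrete, here is a minimal sketch that is not taken from the original notebook. The mean and covariance values are purely illustrative, and it uses NumPy's newer Generator interface, which accepts the same arguments as np.random.multivariate_normal.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable

mean = [0.0, 0.0]        # one mean per dimension (two dimensions here)
cov = [[1.0, 0.5],
       [0.5, 1.0]]       # illustrative 2x2 covariance matrix

# An integer size returns that many samples, one row per sample.
flat = rng.multivariate_normal(mean, cov, size=5)

# A tuple size returns samples arranged in that shape,
# with the dimension of each sample appended as the last axis.
grid = rng.multivariate_normal(mean, cov, size=(3, 4))

print(flat.shape)  # (5, 2)
print(grid.shape)  # (3, 4, 2)
```

Each row of the result is one sample, which is why transposing it with .T gives one array per variable.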

Building the covariance matrix is a little tricky. The variances of yards and touchdowns go on the main diagonal, and the covariance (the product of the two standard deviations and the correlation) goes in both off-diagonal positions.

# assume a correlation of 0.8 between passing yards and TDs
# (it's probably higher than 0.8 in reality)
corr = 0.8
# yd_std and td_std are each set to (desired range) / 3, so that most of the
# data generated (3 standard deviations' worth) is within the range I want.
# std**2 is the variance
covs = [[yd_std**2, yd_std*td_std*corr],
        [yd_std*td_std*corr, td_std**2]]
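As a rough, self-contained sketch of that construction: the upper limits yd_max and td_max below are hypothetical placeholders (assumptions, not real NFL figures and not values from the original notebook). The checks at the end confirm that the matrix encodes the intended correlation and is a valid covariance matrix.

```python
import numpy as np

# Hypothetical upper limits for the dummy data (assumptions for illustration).
yd_max = 5000.0   # assumed largest passing-yard total to generate
td_max = 50.0     # assumed largest touchdown total to generate

# A third of the range, so that 3 standard deviations reach the limit.
yd_std = yd_max / 3
td_std = td_max / 3
corr = 0.8

# Covariance of the two variables: std_x * std_y * correlation.
off_diag = yd_std * td_std * corr
covs = np.array([[yd_std**2, off_diag],
                 [off_diag, td_std**2]])

# Sanity check 1: dividing the covariance by both stds recovers the correlation.
stds = np.sqrt(np.diag(covs))
recovered_corr = covs[0, 1] / (stds[0] * stds[1])

# Sanity check 2: a valid covariance matrix is symmetric with no negative eigenvalues.
is_valid = bool(np.all(np.linalg.eigvalsh(covs) >= 0))

print(recovered_corr, is_valid)
```

Any correlation strictly between -1 and 1 passes the second check, so the 0.8 estimate can be adjusted freely.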

I centered the distribution at zero (a mean of zero for each dimension) and generated twice as many samples as I need, so I can drop the negative half without running short. This leaves me with half of a bell curve with its peak at zero.

data = np.random.multivariate_normal(means, covs, 200).T
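The post does not show how the unwanted half is dropped, so the following is only one possible reading: keep the samples where both values are non-negative. The standard deviations reuse hypothetical ranges (5000 yards, 50 touchdowns), and the seed is there only for repeatability.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded only so the sketch is repeatable

# Hypothetical spreads (assumptions): a third of an assumed 5000-yard / 50-TD range.
yd_std, td_std, corr = 5000 / 3, 50 / 3, 0.8
means = [0, 0]
covs = [[yd_std**2, yd_std * td_std * corr],
        [yd_std * td_std * corr, td_std**2]]

# Transposing turns the (200, 2) sample array into one row per variable.
yards, tds = rng.multivariate_normal(means, covs, 200).T

# Keep only samples where both values are non-negative.
keep = (yards >= 0) & (tds >= 0)
kept_yards = yards[keep]
kept_tds = tds[keep]

print(len(yards), len(kept_yards))
```

Because the two variables are correlated rather than identical, filtering on both columns tends to keep somewhat fewer than half of the samples. Filtering on yards alone would keep roughly half but would leave some negative touchdown values.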

These data are certainly not perfect but are useful for the next illustration.

Quarterback Visualizations

# Plot distribution of QB passing touchdowns
sns.set(color_codes=True)
sns.set(rc={'figure.figsize':(10,8)})
sns.displot(df['QB_TD'], label='QB Passing Touchdowns')
plt.title('Distribution of QB Passing Touchdowns')
plt.legend()
plt.show()
Touchdowns based on dummy data
# scatter chart of QB Rating by Pass Yards
sns.relplot(x="Pass_Yd", y="QB_rating", hue="Aaron", 
            size="Aaron", sizes=(100,50), palette=["b", "r"], alpha=.8, height=6, aspect=2, data=df);
Aaron Rodgers vs. top QBs and dummy data

Aaron Rodgers doesn’t have the most passing yards, but he definitely is an outlier.

References

Pro Football Reference: https://www.pro-football-reference.com/

NFL Website: https://www.nfl.com/stats/player-stats/

Kaggle NFL Big Data Bowl: https://www.kaggle.com/c/nfl-big-data-bowl-2020

Helpful Stack Overflow answer on numpy.random.multivariate_normal: https://stackoverflow.com/questions/18683821/generating-random-correlated-x-and-y-points-using-numpy

