
Expected Goals: What’s Hiding Behind the Black Box?

A glimpse into what proprietary tracking data can tell us.

Note: This article is my submission for the 2021 Big Data Cup. Before I go any further, I’d like to thank Stathletes for hosting the event and giving me the opportunity to work with their data set!

From October through January, I worked extensively with the NHL’s play-by-play (PBP) data and built various analytical models. These included but were not limited to an expected goal model that determined the probability of shots becoming goals, a wins above replacement model that provided a point estimate of the value that a player contributed in a season, and a game simulation model that I used to forecast the 2021 NHL season and obtain daily win probabilities.

The latter two models leaned heavily on the expected goal model, and I believe this was their biggest limitation. While I like the expected goal model and believe I did a very good job with the data I had access to, it was hamstrung by issues with the NHL’s PBP data. This data doesn’t include valuable context like screens or pre-shot passes, and shot coordinates are notoriously inaccurate, partially due to an issue known as scorekeeper bias, which I’ve touched on at greater length before. Throughout my work, I often wondered how much my model outputs would improve if I had access to a more granular data set that allowed me to build a more accurate expected goal model, and which players would see their estimates change the most.

To give an example of a model output that especially perplexed me, my work found that at even strength, Patrick Kane’s isolated impact on expected goals for over the past three seasons was roughly average. This is an extremely difficult conclusion for any hockey fan to reconcile: Patrick Kane is known to be one of the best offensive players in the world, and even outside of his individual goal scoring, his 5-on-5 primary assist rate and high-danger pass rates (tracked manually by Corey Sznajder) are both near the top of the league. My hypothesis is that Patrick Kane consistently generates scoring chances featuring factors, such as pre-shot movement, that aren’t picked up by the NHL’s PBP data but increase the probability of those shots becoming goals. In other words, Patrick Kane does something special that very few NHL players do, which causes the expected goal value of Blackhawks shots taken while he’s on the ice to be consistently underrated. This is just a hypothesis, though; I’m also open to the idea that his estimate is inaccurately low because my ridge regressions have erroneously "given credit" for his offensive impact to other players, or that the outputs of my model are actually accurate and I’m chasing down a red herring. But without access to tracking data, I’m unfortunately left to hypothesize.

As soon as I read about the Big Data Cup and learned that I’d have access to tracking data from Stathletes, there was no doubt in my mind that whatever I did would involve expected goals. Frankly, given the combined sample size of 53 games across the two data sets, the data may have been better suited for a more systematic, micro-level analysis like "is it ideal to shoot from the point on the power play, or should you just always pass?", and I may not have done the "best" thing with the data that I could have. But there was nothing that excited me more than expected goals, so I set out to answer the question: How much would access to Stathletes tracking data improve an expected goal model, and how much uncertainty should we apply to the outputs of expected goal models derived from the NHL’s PBP data due to its issues?

I chose to work exclusively with the prospects data set because the sample size was roughly 3 times larger and I didn’t feel comfortable combining the two data sets. I initially attempted to build an expected goal model from the prospects data set using extreme gradient boosting, a hyper-efficient machine learning technique that I used for my NHL expected goal model, but I came to the conclusion that the sample size of 3,808 unblocked shot attempts (fanned shots were excluded) across 40 games was insufficient for extreme gradient boosting, and that logistic regression would perform better. To avoid overfitting, I used a leave-one-game-out scheme: for every game in the data set, I held that game out, trained an expected goal model on the other 39 games using logistic regression, and then tested it on the held-out game.
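For anyone who wants to replicate this cross-validation scheme, here’s a minimal sketch using scikit-learn’s LeaveOneGroupOut splitter. The column names are placeholders I’ve invented for illustration; the actual Stathletes data set uses its own schema.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder schema for illustration; not the actual Stathletes column names.
FEATURES = ["shot_distance", "shot_angle", "empty_net", "one_timer",
            "royal_road_pass", "traffic", "prior_missed_shot"]

shots = pd.read_csv("ohl_unblocked_shots.csv")  # 3,808 unblocked attempts, 40 games
X, y, games = shots[FEATURES], shots["is_goal"], shots["game_id"]

# Leave-one-game-out: train on the other 39 games, test on the held-out game,
# so every shot receives an out-of-sample expected goal value.
xg = pd.Series(index=shots.index, dtype=float)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=games):
    model = LogisticRegression(max_iter=1000)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    xg.iloc[test_idx] = model.predict_proba(X.iloc[test_idx])[:, 1]
```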

Here are the variables that I used to build my model, alongside their coefficient estimates from a logistic regression run on all 40 games:

The factor that influenced goal probability the most was whether a shot was taken on an empty net, which makes a lot of sense; it’s just easier to score without a goalie in net. While factors exclusive to the Stathletes data, like whether a shot was a one-timer, came after a royal road pass, or was accompanied by traffic, all significantly increased goal probability, none of them increased it as much as the prior event being a missed shot, something we can still derive from the NHL’s PBP data.

These coefficients can be slightly misleading. Unlike most of the variables, shot distance and angle are continuous, with a wide range of observed values, so a one-unit decrease in either only increases goal probability very slightly. Across their full ranges, though, these are the two factors with the largest impact on goal probability, just as they are with the NHL’s PBP data.
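To make that distinction concrete, here’s a toy illustration with made-up coefficient values (not the fitted values from my model) showing how a tiny per-foot effect on the odds compounds across the observed range of shot distances:

```python
import numpy as np

# Illustrative coefficients only; these are not the fitted values.
beta_distance = -0.05   # log-odds change per additional foot of distance
beta_one_timer = 0.60   # log-odds change for the one-timer flag

# A one-foot change in distance barely moves the odds of a goal...
print(np.exp(beta_distance))              # ~0.95x odds per foot

# ...but moving from 60 feet out to 5 feet out compounds that effect,
# dwarfing the swing from any single binary flag.
print(np.exp(-beta_distance * (60 - 5)))  # ~15.6x odds across the range
print(np.exp(beta_one_timer))             # ~1.82x odds for a one-timer
```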

For shots taken in all situations, the area under the curve (AUC) for this expected goal model was 0.787. Here’s how it looks:
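For those following along, computing the AUC and drawing a curve like this takes only a few lines; this sketch assumes the out-of-sample y and xg values from the cross-validation loop above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# y and xg come from the leave-one-game-out loop sketched earlier.
print(f"AUC: {roc_auc_score(y, xg):.3f}")  # 0.787 for the OHL model

fpr, tpr, _ = roc_curve(y, xg)
plt.plot(fpr, tpr, label="OHL expected goal model")
plt.plot([0, 1], [0, 1], linestyle="--", label="Coin flip")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```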

As of January 31st, my NHL model is sporting a very similar AUC of 0.782 in all situations for the 2021 season, which is on par with how it has performed in prior seasons. Here’s how the curve looks:

These curves look almost identical, and the difference in AUC values is only 0.005. Does this mean the tracking data doesn’t actually improve things? No, not at all. This comparison is inherently unfair because the NHL expected goal model was trained through extreme gradient boosting on a sample size of 3,624 games, while the OHL expected goal model was trained through logistic regression on a sample size of 40 games. The NHL model should have a far higher AUC, and the fact that it doesn’t tells us that the additional data provided by Stathletes almost certainly does make a big difference.

To make a proper apples-to-apples comparison between the two models, I took a sample of 40 random games played from 2017–2018 through 2019–2020 from the NHL’s PBP data, trained and tested an expected goal model through logistic regression on those 40 games, and obtained the AUC for that model. I had to ensure that I didn’t just fluke into an especially high or low AUC value, though, so I repeated this process 1,000 times with a different random 40 games each time. The AUC values I obtained from these tests wound up following a fairly normal distribution:
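A condensed sketch of that resampling loop looks something like this; nhl_shots and fit_leave_one_game_out are hypothetical stand-ins for my NHL PBP shot table and the same leave-one-game-out routine described earlier:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2021)
all_games = nhl_shots["game_id"].unique()  # hypothetical NHL PBP shot table

aucs = []
for _ in range(1000):
    # Draw a fresh random 40-game sample each iteration.
    sample = rng.choice(all_games, size=40, replace=False)
    subset = nhl_shots[nhl_shots["game_id"].isin(sample)]
    # Hypothetical helper wrapping the leave-one-game-out logistic regression.
    y_true, y_pred = fit_leave_one_game_out(subset)
    aucs.append(roc_auc_score(y_true, y_pred))

print(np.mean(aucs), np.std(aucs))  # ~0.737 and ~0.018 in my tests
```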

The average AUC value was 0.737, and the standard deviation was 0.018. This means the AUC obtained from the Stathletes data set is 7% higher than the average AUC obtained from the NHL’s PBP data, and we can state with 99.73% confidence that the AUC from this Stathletes data set is higher than the true AUC from the NHL data set.
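Those two figures fall straight out of the distribution above; here’s the quick arithmetic behind them:

```python
from scipy.stats import norm

auc_ohl, auc_nhl_mean, sd = 0.787, 0.737, 0.018

print(auc_ohl / auc_nhl_mean - 1)  # ~0.068, the roughly 7% improvement
z = (auc_ohl - auc_nhl_mean) / sd  # ~2.78 standard deviations above the mean
print(norm.cdf(z))                 # ~0.9973, i.e. 99.73% confidence
```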

If we assume that access to the tracking data set would provide the same 7% improvement to the AUC of the more robust NHL expected goal model that was trained on 3,624 games using extreme gradient boosting, we would expect to see an AUC of around 0.835 for an NHL model built with Stathletes tracking data for the 2021 season (a quick check of that arithmetic follows the list below). However, this figure comes with a few caveats:

  • This tracking data all came from OHL games and is being compared to NHL games. It’s possible that it’s simply easier or harder to predict whether OHL shots become goals, and that this discrepancy is skewing the figure, but I suspect that even if this is the case, the difference is marginal.
  • The sample size for the OHL data is only 40 games and 3,808 unblocked shot attempts, and every single game featured the Erie Otters. It could have been just a good or bad set of 40 games.
  • The granularity of the tracking data provided by Stathletes means that if the sample size were large enough for gradient boosting to be stable, this data would almost certainly benefit more from the switch from logistic regression to gradient boosting than the NHL’s less granular PBP data would, as machine learning techniques handle the unique relationships between different events better than logistic regression does. In other words, I’d actually expect increasing returns – not diminishing returns – from using gradient boosting to train an expected goal model on this Stathletes tracking data.
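And here’s the quick back-of-the-envelope check of the 0.835 projection referenced above:

```python
nhl_auc = 0.782               # current NHL model (gradient boosting, 3,624 games)
improvement = 0.787 / 0.737   # the ~7% lift observed in the 40-game comparison
print(nhl_auc * improvement)  # ~0.835, the projected AUC with tracking data
```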

Despite the caveats, things we’ve heard from those who do have access to tracking data make this predicted 0.835 AUC figure sound reasonable. Alex Novet conducted a more robust test of how an expected goal model would perform with tracking data provided by Corey Sznajder and reported an AUC value of 0.797, but his test used a sample size of only 1,085 games, less than one full season’s worth. In my experience, expected goal models trained with gradient boosting perform far better with a sample size of multiple full seasons, so I suspect that if this experiment were re-run with a much larger sample size, like the 3,624 games I used to build my model from the NHL’s PBP data, the AUC figure would be significantly higher. On a different note, Paul Maurice said that the Winnipeg Jets have a proprietary expected goal model that is "accurate to about 85%." I assume he interpreted an AUC value of roughly 0.85 to mean this; I’ve seen others use AUC as a "percentage" in the past. While I’m generally skeptical of things NHL executives say about their proprietary models and data, I think it’s reasonable that the Winnipeg Jets have an expected goal model with an AUC of around 0.85.

Admittedly, I undertook this project to find actionable information for myself, not for anybody else. I just wanted a better idea of how much uncertainty I should apply to expected goal models built on public data, and the answer is about as much as I had previously applied. But now I can attach a tangible number to that uncertainty and state that expected goal models built from the NHL’s PBP data are about 7% less accurate than they would be if they were built with more accurate tracking data. However, there are still actionable insights from my research that others can use:

  • Fans and analysts should take note of this uncertainty figure when referencing the outputs of public expected goal models or data that is derived from these models.
  • Teams should take note of the uncertainty that comes with the NHL’s PBP data and invest in more accurate and granular tracking data if they’re not already doing so. However, they should also note that the uncertainty isn’t large enough to flip everything on its head; if your team is getting outshot 15–10 every night, public expected goal models all say you’re deep in the red, and your analytics department is saying you’re dominating every game, there’s a solid chance that your analytics department is the one that’s in the wrong.
  • Coaches and players should take note of the fact that outside of shooting against an empty net, whether a shot came after a missed shot is the binary variable that increases goal probability the most, more so than a shot coming after a royal road pass, a shot being a one-timer, or a shot being accompanied by traffic. Teams scheme to make more royal road passes, take more one-timers, and get more traffic in front of the net, but how many of them scheme to miss the net and capitalize on the chaos created by missed shots? Based on the results of my research and what I’ve seen from NHL teams, I’d argue they don’t do it as frequently as they should. This is something I’d like to explore in greater depth with a larger sample size at a later date.
