
Wins Above Replacement 1.1 and Expected Goals 1.1: Model Updates and Validation

A few minor tweaks, a lot more data, and clarity into how much it's really worth.

Just over six months ago, I released two descriptive models for evaluating NHL skaters and goaltenders: Expected Goals and Wins Above Replacement. For anybody who’s either unfamiliar or wants their memory refreshed, here’s a quick run-down on each model:

  • Expected Goals leverages extreme gradient boosting, an advanced machine learning technique, to calculate the probability that an unblocked shot attempt will become a goal based on factors like shot distance, angle, and the event which occurred prior to the shot. Expected goals can be interpreted as weighted shots.
  • Wins Above Replacement (WAR) uses ridge regression (RAPM) to isolate the impact skaters have on expected goals for and against at even strength, as well as their impact on expected goals for on the power play and expected goals against on the penalty kill. It also previously used ridge regression to isolate the impact each player had on the rate at which their teams took and drew penalties, and the impact that shooters and goaltenders had on the unblocked shot attempts they took or faced becoming goals (shooting/saving). For skaters, the outputs of these regressions were broken into six components: even strength offense, even strength defense, power play offense, shorthanded defense, penalties, and shooting. (Note that even strength here differs slightly from the NHL’s definition and does not include plays where a net is empty; it only includes 3-on-3, 4-on-4, or 5-on-5 play where both teams have one goaltender in net. This will be true of any reference to even strength play which I make in this article.) The number of goals a replacement level player would be expected to provide in each component is subtracted from the number of goals a player actually provided, and that difference is then converted to wins by dividing by the number of goals equivalent to one win, yielding wins above replacement. (See the code sketch after this list for what a RAPM ridge setup looks like.)
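For anybody who wants a feel for what the RAPM piece of this actually looks like in code, here is a minimal sketch of a shift-level ridge regression in Python. The column names, the toy numbers, and the penalty strength are purely illustrative assumptions, not the actual design or settings of my model:

```python
import pandas as pd
from sklearn.linear_model import Ridge

# Toy shift-level data: one row per shift, indicator columns for the skaters
# on offense (O_) and defense (D_), and the on-ice xG for rate as the target.
# The players, values, and alpha below are illustrative only.
shifts = pd.DataFrame({
    "O_playerA":  [1, 0, 1, 0, 1],
    "O_playerB":  [0, 1, 1, 0, 0],
    "D_playerC":  [1, 1, 0, 1, 0],
    "D_playerD":  [0, 0, 1, 1, 1],
    "minutes":    [0.8, 1.1, 0.5, 0.9, 0.7],
    "xgf_per_60": [2.4, 1.9, 3.1, 2.2, 2.6],
})

X = shifts.drop(columns=["minutes", "xgf_per_60"])
y = shifts["xgf_per_60"]

# The ridge penalty shrinks every skater's coefficient toward 0 (league
# average), which is what lets the regression untangle teammates who almost
# always share the ice.
rapm = Ridge(alpha=1000.0)
rapm.fit(X, y, sample_weight=shifts["minutes"])

print(pd.Series(rapm.coef_, index=X.columns).sort_values(ascending=False))
```

Roughly speaking, a prior-informed (Bayesian) variant shrinks each coefficient toward the previous season’s estimate instead of toward 0, which is the idea behind the daisy chain discussed later in this article.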

Expected Goals was built for all seasons from 2013–2014 through 2019–2020, while Wins Above Replacement was built for all seasons from 2014–2015 through 2019–2020. Both models were used for the 2020–2021 season as it was played.


I’ve made a few updates to both models. The most exciting update, which I am thrilled to announce, is that I now have data from 2007–2008 through 2020–2021 for both models and will be making all of it available shortly. Full play-by-play data and location coordinates were unfortunately not available for a handful of games between 2007–2008 and 2009–2010; I’ve chosen to exclude these games entirely from both models. You may notice within the outputs that certain teams or players who played 82 games in one of these seasons are listed as having played fewer; that means they played in a game (or games) that could not be modeled.

The other updates are marginal enough that I could probably get away with implementing them and not mentioning anything, but I think transparency is important, and above all else, I just love writing about this stuff.

Both of my models were previously built in R, and both have been rebuilt entirely in Python. Outside of the intentional tweaks described below, the modeling process was followed as closely as possible, but it’s natural that there will be some minor variance between model outputs; a given shot may be worth 0.2 expected goals in one version and 0.3 in the other for no real reason.

With that in mind, here are the specific tweaks I chose to make to each model:

Expected Goals

  • The data which each season’s model was trained on varied greatly. From 2007–2008 through 2009–2010, I removed 100 "target" games from the sample, trained the model on the remaining data, and then ran it on the target games I had removed. I repeated this process for every 100-game sample available in these 3 seasons. I did this because, unlike the data from 2010–2011 through today, all shot location coordinates in these seasons were sourced exclusively from ESPN’s XML reports, which vary slightly from the NHL’s API reports in the way that shots are tracked.
  • From 2010–2011 through 2016–2017, I simply removed one target season from the sample, trained the model on the remaining seasons, and ran the model on the target season. I repeated the process for all 7 of these seasons.
  • From 2017–2018 through 2020–2021, I used the same process I did for 2007–2008 through 2009–2010: remove 100 target games from the sample, train the model on the remaining games, and then run the model on the target games. I split these seasons off from the 2010–2011 through 2016–2017 group because of the adjustments the NHL made to goaltender equipment regulations starting in 2017–2018. A reduction in goaltender equipment size came with a predictable increase in the likelihood that shots become goals, and a model should account for this.
  • Outside of which seasons each model was trained on, the modeling process remained as close to the previous version as possible: the same variables were accounted for, and the parameters for model training were obtained using the same cross-validation methodology. (A rough sketch of the season-grouped holdout scheme follows this list.)
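As a rough illustration of that season-grouped holdout scheme, here is a minimal sketch of the leave-one-season-out loop for 2010–2011 through 2016–2017. The file name, feature columns, and hyperparameters are placeholders, not my actual feature set or tuned values:

```python
import pandas as pd
import xgboost as xgb

# Hypothetical table of unblocked shot attempts; column names are placeholders.
shots = pd.read_csv("unblocked_shot_attempts.csv")

features = pd.get_dummies(
    shots[["shot_distance", "shot_angle", "prior_event"]],
    columns=["prior_event"],
)

predictions = []
for target_season in range(2010, 2017):  # season coded by its starting year
    train_mask = shots["season"] != target_season

    # Train on every other season, then score only the held-out season.
    model = xgb.XGBClassifier(
        n_estimators=500, max_depth=4, learning_rate=0.05,
        objective="binary:logistic",
    )
    model.fit(features[train_mask], shots.loc[train_mask, "is_goal"])

    held_out = shots[~train_mask].copy()
    # The predicted probability that the shot becomes a goal is its xG value.
    held_out["xg"] = model.predict_proba(features[~train_mask])[:, 1]
    predictions.append(held_out)

xg_2010_to_2017 = pd.concat(predictions)
```

The 100-game holdouts used for 2007–2008 through 2009–2010 and 2017–2018 through 2020–2021 follow the same pattern, with the loop running over groups of game IDs rather than over seasons.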

Wins Above Replacement

  • The previous version of this model used a prior-informed (Bayesian) RAPM for the even strength offense, even strength defense, power play offense, and shorthanded defense components. I began this with a non-prior-informed (vanilla) RAPM run for 2013–2014, then calculated a prior-informed RAPM for 2014–2015, then used the outputs of that 2014–2015 RAPM as a prior for 2015–2016. I repeated this process through 2020–2021, effectively creating what is known as a daisy chain. The new model is not unlike the old model in that it uses a daisy chain, but this daisy chain now starts with 2007–2008, and simply uses the outputs of the vanilla 2007–2008 RAPM to create that season’s even strength offense, even strength defense, power play offense, and shorthanded defense components.
  • Impact on penalties is no longer calculated through ridge regression, but instead through individual penalties drawn and taken. While I believe factors outside a player’s control do impact the rates at which they draw and take penalties, I’m just not convinced that regression is adept enough at properly adjusting for this external context. Similar to running a ridge regression using goals for and against as target variables, I believe this sounds great in theory, but in practice, the number of occurrences of the target variable with a player on the ice in one season is probably too low to place much stock in the outputs of such a regression. The number of counter-intuitive outputs and low repeatability of penalty WAR in the 1.0 version of this model pushed me towards this decision.
  • Impact on shooting and saving is no longer calculated through ridge regression, but instead through goals scored and allowed relative to expected goals. This change was made primarily to break down shooting into 3 components which could then be added to other portions of the model; the 1.0 version of the model simply had one "shooting" component and one "saving" component, which made it difficult to truly gauge a player’s impact at even strength and on the power play. This also made it impossible to gauge what a goaltender had done for each team. Given that the sample size of goals scored in a season is quite small, I’m not currently confident using a ridge regression to isolate impact in this facet of the game; especially not after further breaking down these samples into three smaller sub-samples of even strength, power play, and shorthanded play. I plan to re-visit this using some sort of logistic ridge regression at some point, as I do think a metric which evaluates the performance of shooters and goaltenders should account for the impact of the respective goaltenders and shooters they face. For now, though, I’m comfortable with the way that I calculate the shooting and saving components of the model.
  • Because salary data for all skaters is not readily available in some earlier seasons, replacement level is no longer defined by a salary cap hit of $850,000 or lower and UFA signing status. Rather, replacement level is defined by ice time, with a slightly different definition for each game strength:
  • Replacement level at even strength is defined as all forwards who ranked lower than 13th and all defensemen who ranked lower than 7th on their team in even strength time on ice percentage.
  • Replacement level on the power play is defined as all forwards who ranked lower than 9th and all defensemen who ranked lower than 4th on their team in power play time on ice percentage.
  • Replacement level on the penalty kill is defined as all forwards who ranked lower than 8th and all defensemen who ranked lower than 6th on their team in shorthanded time on ice percentage.
  • Replacement level at all situations (for the purpose of penalties) is defined as all forwards who ranked lower than 13th and all defensemen who ranked lower than 7th on their team in all situations time on ice percentage.
  • Replacement level for goaltenders is defined as all goaltenders who ranked lower than 2nd on their team in games played.
  • Note that while penalties at all situations are used, no other component uses play with the net empty. Because of this, the total time on ice value which appears next to WAR, and subsequent WAR per ice time rates which are derived from it, do not incorporate ice time with an empty net.
  • Penalties at all situations are still used because it is impossible to determine with certainty from the NHL’s play-by-play data whether a penalty was taken with a goaltender in net, or whether the goaltender was pulled on a delayed penalty after it was drawn. Penalties taken with a goaltender pulled are few and far between, so the impact of this issue will be negligible, but it is worth noting.
  • I previously calculated the value of a goal relative to the value of a win using a method from Christopher Long and found that from 2017–2018 through 2019–2020, one win was equivalent to roughly 5.33 goals. For version 1.1 of the model, I followed Christopher’s methodology to calculate one stable Pythagorean exponent using every NHL season from 2007–2008 through 2020–2021: 2.022. I then used the league average goals per game scored in each season – excluding all goals scored with either net empty or in the shootout, just as I did with the rest of my model – to determine the number of goals equivalent to one win for each season. Here are the values which I obtained (a worked sketch of this conversion follows the table):
╔═══════════╦═══════════════╗
║ Season    ║ Goals Per Win ║
╠═══════════╬═══════════════╣
║ 2007–2008 ║         5.051 ║
║ 2008–2009 ║         5.345 ║
║ 2009–2010 ║         5.201 ║
║ 2010–2011 ║         5.115 ║
║ 2011–2012 ║         4.989 ║
║ 2012–2013 ║         4.975 ║
║ 2013–2014 ║         5.002 ║
║ 2014–2015 ║         4.908 ║
║ 2015–2016 ║         4.853 ║
║ 2016–2017 ║         5.053 ║
║ 2017–2018 ║         5.375 ║
║ 2018–2019 ║         5.438 ║
║ 2019–2020 ║         5.422 ║
║ 2020–2021 ║         5.285 ║
╚═══════════╩═══════════════╝
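To give a sense of how a Pythagorean exponent turns into the goals-per-win figures above, here is a worked sketch using the standard marginal-goals derivation. I can’t promise this matches Christopher’s method step for step, and the goals-per-game input below is a placeholder rather than an actual league average:

```python
# Near a league-average team, win% = GF**x / (GF**x + GA**x), so one extra
# goal for moves win% by roughly x / (4 * season_goals). Since one win is
# 1 / games of win%, goals per win ~= 4 * (goals per team per game) / x.
PYTHAGOREAN_EXPONENT = 2.022

def goals_per_win(goals_per_team_per_game: float,
                  exponent: float = PYTHAGOREAN_EXPONENT) -> float:
    """Approximate number of goals equivalent to one win."""
    return 4 * goals_per_team_per_game / exponent

# Illustrative input only: roughly 2.67 goals per team per game (excluding
# empty-net and shootout goals) reproduces a figure close to the 2020-2021
# value in the table above.
print(round(goals_per_win(2.67), 3))  # ~5.282
```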

Validation

I’ve summed up the changes made to the Expected Goals and Wins Above Replacement models. But with data going back to 2007–2008, I decided it was time to use the entirety of this sample to validate these models, and see if that old-school play-by-play data was really as bad as it’s made out to be.

The metrics which I’ve used to validate my expected goal model include area under the curve (AUC), which you can read more about here, and expected goals per actual goal. These are the values for the test metrics for every season (a sketch of how both metrics can be computed follows the table):

╔═══════════╦════════════════╦═══════╦═════════════════╗
║  Season   ║ Game Strength  ║  AUC  ║ xGoals per Goal ║
╠═══════════╬════════════════╬═══════╬═════════════════╣
║ 2007–2008 ║ All Situations ║ 0.769 ║           0.985 ║
║ 2007–2008 ║ Even Strength  ║ 0.778 ║           0.990 ║
║ 2007–2008 ║ Power Play     ║ 0.710 ║           0.977 ║
║ 2007–2008 ║ Shorthanded    ║ 0.767 ║           0.931 ║
║ 2008–2009 ║ All Situations ║ 0.775 ║           0.993 ║
║ 2008–2009 ║ Even Strength  ║ 0.783 ║           0.991 ║
║ 2008–2009 ║ Power Play     ║ 0.713 ║           1.000 ║
║ 2008–2009 ║ Shorthanded    ║ 0.815 ║           0.983 ║
║ 2009–2010 ║ All Situations ║ 0.760 ║           1.033 ║
║ 2009–2010 ║ Even Strength  ║ 0.767 ║           1.017 ║
║ 2009–2010 ║ Power Play     ║ 0.700 ║           1.071 ║
║ 2009–2010 ║ Shorthanded    ║ 0.787 ║           1.092 ║
║ 2010–2011 ║ All Situations ║ 0.773 ║           1.000 ║
║ 2010–2011 ║ Even Strength  ║ 0.782 ║           0.987 ║
║ 2010–2011 ║ Power Play     ║ 0.715 ║           1.031 ║
║ 2010–2011 ║ Shorthanded    ║ 0.773 ║           1.075 ║
║ 2011–2012 ║ All Situations ║ 0.775 ║           0.999 ║
║ 2011–2012 ║ Even Strength  ║ 0.781 ║           0.988 ║
║ 2011–2012 ║ Power Play     ║ 0.722 ║           1.028 ║
║ 2011–2012 ║ Shorthanded    ║ 0.807 ║           1.047 ║
║ 2012–2013 ║ All Situations ║ 0.771 ║           0.973 ║
║ 2012–2013 ║ Even Strength  ║ 0.779 ║           0.981 ║
║ 2012–2013 ║ Power Play     ║ 0.697 ║           0.929 ║
║ 2012–2013 ║ Shorthanded    ║ 0.780 ║           1.168 ║
║ 2013–2014 ║ All Situations ║ 0.773 ║           0.991 ║
║ 2013–2014 ║ Even Strength  ║ 0.781 ║           0.981 ║
║ 2013–2014 ║ Power Play     ║ 0.715 ║           1.042 ║
║ 2013–2014 ║ Shorthanded    ║ 0.783 ║           0.849 ║
║ 2014–2015 ║ All Situations ║ 0.771 ║           1.002 ║
║ 2014–2015 ║ Even Strength  ║ 0.778 ║           0.996 ║
║ 2014–2015 ║ Power Play     ║ 0.706 ║           1.013 ║
║ 2014–2015 ║ Shorthanded    ║ 0.812 ║           1.083 ║
║ 2015–2016 ║ All Situations ║ 0.771 ║           1.024 ║
║ 2015–2016 ║ Even Strength  ║ 0.780 ║           1.036 ║
║ 2015–2016 ║ Power Play     ║ 0.696 ║           0.996 ║
║ 2015–2016 ║ Shorthanded    ║ 0.782 ║           0.953 ║
║ 2016–2017 ║ All Situations ║ 0.771 ║           1.011 ║
║ 2016–2017 ║ Even Strength  ║ 0.777 ║           1.017 ║
║ 2016–2017 ║ Power Play     ║ 0.709 ║           1.002 ║
║ 2016–2017 ║ Shorthanded    ║ 0.789 ║           0.925 ║
║ 2017–2018 ║ All Situations ║ 0.766 ║           1.032 ║
║ 2017–2018 ║ Even Strength  ║ 0.772 ║           1.030 ║
║ 2017–2018 ║ Power Play     ║ 0.697 ║           1.050 ║
║ 2017–2018 ║ Shorthanded    ║ 0.828 ║           0.958 ║
║ 2018–2019 ║ All Situations ║ 0.762 ║           1.001 ║
║ 2018–2019 ║ Even Strength  ║ 0.771 ║           1.004 ║
║ 2018–2019 ║ Power Play     ║ 0.676 ║           0.991 ║
║ 2018–2019 ║ Shorthanded    ║ 0.798 ║           0.974 ║
║ 2019–2020 ║ All Situations ║ 0.772 ║           0.989 ║
║ 2019–2020 ║ Even Strength  ║ 0.779 ║           0.984 ║
║ 2019–2020 ║ Power Play     ║ 0.696 ║           1.004 ║
║ 2019–2020 ║ Shorthanded    ║ 0.821 ║           1.003 ║
║ 2020–2021 ║ All Situations ║ 0.774 ║           0.975 ║
║ 2020–2021 ║ Even Strength  ║ 0.782 ║           0.966 ║
║ 2020–2021 ║ Power Play     ║ 0.706 ║           0.994 ║
║ 2020–2021 ║ Shorthanded    ║ 0.813 ║           1.107 ║
╚═══════════╩════════════════╩═══════╩═════════════════╝
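Both test metrics are simple to reproduce. Here is a minimal sketch, assuming a hypothetical table of unblocked shot attempts with an is_goal flag, an xg probability, and a strength label; the file and column names are mine for illustration, not the model’s actual outputs:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical file of one season's shots with xG predictions attached.
shots = pd.read_csv("xg_predictions_2020_2021.csv")

# Per-strength AUC (discrimination) and expected goals per actual goal
# (calibration), mirroring the columns in the table above.
for strength, group in shots.groupby("strength"):
    auc = roc_auc_score(group["is_goal"], group["xg"])
    xg_per_goal = group["xg"].sum() / group["is_goal"].sum()
    print(f"{strength}: AUC = {auc:.3f}, xGoals per Goal = {xg_per_goal:.3f}")

# All-situations values are the same two calculations run on the full table.
print(roc_auc_score(shots["is_goal"], shots["xg"]),
      shots["xg"].sum() / shots["is_goal"].sum())
```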

The expected goal models performed marginally worse in the 2007–2008 through 2009–2010 seasons than they did in later years, but still better than I expected. Note that according to the documentation which I referenced for AUC, a value between 0.6 and 0.7 is poor, a value between 0.7 and 0.8 is fair, and a value between 0.8 and 0.9 is good; this means that at all situations and at even strength, the model is fair, and closer to good than poor, in every single season. Whatever issues are present with the location coordinates from the first 3 seasons are not problematic enough to prevent the model from posting a respectable performance. On the power play, though, the model was classified as poor in 5 seasons, and was closer to poor than good in the other 9.

These numbers bear out my general stance on public expected goal models: in the aggregate, they’re fair, and I would say they’re closer to good than poor. But on the power play in particular, they’re missing a lot of important context.

After testing and validating that my expected goal model is fair in every season, it came time to test my WAR model. While WAR is descriptive in nature, purely descriptive tests of the model are difficult to implement, and I believe an accurate description of past results should generally do a decent job of predicting future results, so I chose to test both the descriptive and predictive capabilities of the model.

A common descriptive test of a WAR model is to test the correlation between the sum of WAR at the team level and another metric which clearly defines team quality, like standings points or goal differential. This test is inherently flattering to WAR because it doesn’t necessarily test how well WAR evaluates skaters within a team, just how well it evaluates the sum of that team’s skaters as a whole.

For example, say a perfect WAR model would tell us that in 2018–2019, Cedric Paquette provided the Tampa Bay Lightning with -1 WAR and Nikita Kucherov provided them with 5 WAR. This would mean the aggregate of their contributions was 4 WAR. Now, say I built a terrible model which said that Paquette was worth 5 WAR and Kucherov was worth -1. The sum of their WAR would still be 4, which would perfectly match the true combined value of their contributions. But I would be horribly off on each player, and my model would have done a terrible job of actually isolating their impact. This is a made-up example; nobody’s model says anything close to that about Paquette or Kucherov, but it’s worth keeping in mind before we begin analyzing the results of my descriptive tests.

As I mentioned, some of the outputs of my WAR model are obtained through ridge regressions. These regressions treat every player within one season as exactly one entity, regardless of whether or not they change teams during the season. This means that I can’t split the contributions of players who played for multiple teams back to the teams they played for, and the best way to test the model is therefore to remove these players entirely. This is actually slightly unflattering to the model. Here are the outputs from my first descriptive test:

The R² value of 0.82 tells us that the model can explain 82% of the variance in standings points per 82 games. The rest can be explained by some combination of wins contributed by players who played for multiple teams, pure luck, and modeling error.

While this value may sound fair, I actually found it quite alarming when I first came across it. When I released the outputs of my model last winter, I used the same exact test methodology and obtained a significantly higher R² value of 0.89.

What happened here? Did the model get significantly worse?

Thankfully, no. It was just applied to a set of seasons where it was not quite as effective. When tested on just the same three seasons that WAR 1.0 was tested on in this image, WAR 1.1 actually performed marginally better:

I can throw around R² values all day, but these aren’t too valuable without some sort of comparison or baseline. In order to get an idea of how much value we really gain from using WAR, the next step is to compare it to hockey’s current most popular statistic: points.

The general consensus among the analytics community is that WAR and similar models are vastly superior to points. If this is true, they should do a better job of describing performance at the team level.

In order to compare the two on equal footing, I first removed all goaltenders, as nobody cares about the number of points a goaltender scores; they won’t be used for either test. Repeating the same test using only skater WAR from 2007–2008 through 2020–2021 leads to a considerable drop in R²:

By comparison, points don’t perform as poorly as the analytics community might like to think:

I have to admit, I’m impressed by how well skater point totals hold their own here. They’re clearly inferior, but not by the massive margin that I’d expected. This evidence suggests that in the absence of a superior metric, there is absolutely nothing wrong with using points descriptively. With that being said, I do also think it stands to reason that while skater point totals alone do a solid job of describing success at the team level, they’re probably much worse at distributing individual credit among skaters on teams than WAR is, and we know that WAR is far from perfect in that regard.
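To make the descriptive tests above concrete: each one boils down to summing a skater metric to the team level (after dropping skaters who changed teams mid-season) and regressing standings points per 82 games on that total. Here is a minimal sketch with hypothetical file and column names:

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical inputs: one row per skater-season-team with a WAR (or points)
# column, and one row per team-season with standings points per 82 games.
skaters = pd.read_csv("skater_seasons.csv")
teams = pd.read_csv("team_seasons.csv")

# Drop skaters who played for multiple teams in a season; the ridge
# regressions treat them as a single entity that can't be split by team.
multi_team = (skaters.groupby(["season", "player_id"])["team"]
              .transform("nunique") > 1)
team_totals = (skaters[~multi_team]
               .groupby(["season", "team"], as_index=False)["war"].sum())

merged = teams.merge(team_totals, on=["season", "team"])
r, _ = pearsonr(merged["war"], merged["points_per_82"])
print(f"R^2 = {r ** 2:.2f}")  # share of variance in standings points explained
```

Swapping the "war" column for individual points reproduces the points-based comparison.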

Perhaps a better way of determining how well these metrics distribute credit among teammates is to shift to testing the predictive capabilities of the model. While a model which said Cedric Paquette was the real driver of Tampa Bay’s success in 2018–2019 may have passed a descriptive test for that season, it would fail badly as soon as Paquette changed teams and was expected to bring major success to his new club at the expense of Tampa.

The methodology I chose for predictive testing remained fairly simple: use the target metric as a rate stat in year 1, then apply that rate to the amount each player actually played in year 2. In this case, I first used WAR in year 1 to predict standings points in year 2:

I obtained an R² value of 0.35, which is quite solid: It means that the number of wins skaters are expected to contribute in year 2 is enough to explain 35% of the variance in standings points at the team level. The other 65% can be explained by some combination of skaters performing better or worse than they’re projected to, skaters the model has never seen before, skaters who played fewer than 10 games in year 1 and were thus ignored, goaltenders, and pure luck.

I then repeated the same exact test with points:

Once again, much like the descriptive test, skater points are clearly outclassed, but they do a better job of holding their own than analytics geeks like myself might suggest. Again, in the absence of superior metrics, I think using skater point totals is just fine.

I was also curious to see how the WAR approach stacked up to team-level metrics like standings points. It wouldn’t make sense to use standings points for a descriptive test, as doing so would mean comparing standings points in year 1 to standings points in year 1. These are the same exact metric and would provide you with an R² of 1.0. But using standings points in year 1 to predict standings points in year 2 is completely valid, so I chose to do that as well:

Similar to aggregate skater points, team standings points alone do a decent job of predicting future standings points, but they don’t stack up to WAR. If you want to determine who the best team is going to be in the following year and you have a good idea of how much everybody is going to play for each team, you’re a lot better off than you would be if you just had last year’s standings points.
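The predictive tests follow the same recipe, just shifted forward one season: convert the year-1 metric into a rate, apply that rate to year-2 ice time, sum to the team level, and compare against year-2 standings points. A minimal sketch, again with hypothetical file and column names and the 10-game cutoff mentioned above:

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical skater-season table; "season" is the starting year as an int.
skaters = pd.read_csv("skater_seasons.csv")
skaters = skaters[skaters["games_played"] >= 10]  # ignore sub-10-game seasons

# Year-1 WAR per minute, carried forward to the following season.
skaters["war_per_min"] = skaters["war"] / skaters["toi_minutes"]
priors = skaters[["player_id", "season", "war_per_min"]].copy()
priors["season"] += 1

# Projected year-2 WAR = year-1 rate times year-2 ice time, summed by team.
year2 = skaters.merge(priors, on=["player_id", "season"],
                      suffixes=("", "_prior"))
year2["projected_war"] = year2["war_per_min_prior"] * year2["toi_minutes"]
team_proj = year2.groupby(["season", "team"], as_index=False)["projected_war"].sum()

teams = pd.read_csv("team_seasons.csv")
merged = teams.merge(team_proj, on=["season", "team"])
r, _ = pearsonr(merged["projected_war"], merged["points_per_82"])
print(f"R^2 = {r ** 2:.2f}")
```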


Limitations

These models come with many limitations: The expected goal models can’t even be classified as good, and the ridge regressions which fuel some WAR components come with a large degree of error. I also believe that at the skater level, this particular WAR model places too much emphasis on shooting and not enough on play-driving. There are two reasons for this:

  1. The play-driving components are obtained through ridge regression, which biases coefficients towards 0. This is necessary to avoid extreme, wacky values, but it also means that the model will likely understate the true per-minute contributions a player makes to play-driving, and assume they’re closer to an average player than they actually are. (A short toy demonstration follows this list.)
  2. Replacement level shooting is much further below average than replacement level play-driving. I’m not certain why this is, but my belief is that while the difference is mathematically exaggerated by the way the model biases coefficients towards 0 to a higher degree for players who’ve played fewer minutes, the difference also exists because it’s simply easier for coaches to identify and quickly cut the ice time of skaters who shoot poorly.
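The first point is easy to demonstrate in isolation: as the ridge penalty grows, estimated coefficients are pulled toward 0, so a genuinely strong or weak play-driver will look closer to average than they really are. A toy example, unrelated to any real player data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Simulated data with a known true effect of +3.0 per unit of the predictor.
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=200)

print(LinearRegression().fit(X, y).coef_[0])  # ~3.0, unbiased
for alpha in (10, 100, 1000):
    # Stronger penalties drag the estimate further toward 0 (league average).
    print(alpha, Ridge(alpha=alpha).fit(X, y).coef_[0])
```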

If I were to "weight" each component in an entirely arbitrary manner based on what I personally think is important, the model would place less emphasis on shooting and more on play-driving. And if I were using WAR to judge an acquisition made by a team, I’d keep in mind the fact that the play-driving components are much more repeatable than shooting.

This is all by way of saying that WAR is not only an imperfect estimate of value added in the past, but that much more than WAR should be considered when projecting future performance. Put quite simply, WAR serves much better as the starting point for any discussion than it serves as an ending point.

With the limitations in mind, I will also state that WAR is a clear-cut upgrade on more rudimentary metrics that are still referenced more frequently. You don’t have to like or use WAR, but if you cover your eyes every time you see it, or plug your ears every time you hear about it, and stick to using skater points instead, you’ll be wrong a lot more often.

