In Part One of this article, I laid out the groundwork for a few key concepts:
- Evolving Hockey’s highly effective goals above replacement model concluded that the Minnesota Wild had the NHL’s best skaters and worst goaltenders in the 2019–2020 regular season.
- The scorekeepers at Xcel Energy Center – the arena where the Minnesota Wild play their home games – exhibit a pattern of erroneously reporting that shots are taken further from the net than they actually are, which I refer to as "scorekeeper bias."
- If shots (defined as shots on goal and missed shots) are reported further from the net than they are taken, this will lead the aforementioned goals above replacement model to overrate the defensive performance of Minnesota’s skaters at the cost of their goaltenders, with no penalty to the offensive performance of their skaters.
- The scorekeepers at Xcel Energy Center are not the only ones who exhibit scorekeeper bias. They’re not even the worst.
I concluded by stating that my aim was not to "fix" this issue, but rather to provide an estimate of how much scorekeeper bias misrepresents what skaters and goaltenders do on the ice in rinks that are prone to it. I then outlined the three things that I planned to do:
- Build an expected goal model that performs well enough in testing that I am comfortable using it as a descriptive model for evaluating the quality of shots that have already occurred.
- Determine an adjustment for scorekeeper bias that does not harm the performance of my expected goal model when I replace the reported shot distances that I’ve obtained with the adjusted shot distances that I’ve calculated.
- Build a Regularized Adjusted Plus-Minus (RAPM) model that allows me to provide a point estimate of a skater’s isolated offensive, defensive, and net impact on expected goals. This will allow me to compare a skater’s isolated impact before and after adjusting for scorekeeper bias, and determine the impact which scorekeeper bias may have on their unadjusted results.
If you haven’t read part one of this article, I recommend you read it before you go any further. It will give you a better understanding of the main concepts at hand and explain why I’m comfortable working under the assumptions that I’m making. With that being said, if you’ve already got a good understanding of concepts such as goals above replacement, expected goals, RAPM, and scorekeeper bias, and you’re really just interested in the methodology behind some of the charts and statistics that I’ve been posting on Twitter lately, then feel free to continue reading.
I started this process with minimal coding experience. I had done a little bit of work in Python, including one introductory-level college course, and I had completed edX’s free online "Statistics and R" course. That was about it. While I had more experience in Python, I chose to work in R instead because I felt more comfortable with R in the limited time I had spent with it, and more importantly because the Evolving Hockey Scraper was written in R. While I’m on the topic of this scraper, I cannot give enough thanks to EvolvingWild for creating it and making it available to the public.
I used this scraper to extract data from every single recorded play in the 2018–2019 and 2019–2020 NHL seasons. I wanted to use more than a one-year sample because all of the metrics that I’m working with are known to be more reliable in multi-year samples, but I opted against using a sample of three or more years because the NHL shrunk maximum goaltender pad size prior to the 2018–2019 season. That change significantly altered the probability of any given shot becoming a goal, and I had noticed in the past that expected goal models built before 2018–2019 exhibited a pattern of severely understating shot quality; I didn’t want mine to suffer from the same problem.
Once I had determined the sample that I wanted and scraped the data for it, the next step was to build the expected goal model. I started by doing some research on the history and theory of expected goal models and found they’re generally built with one of two methods:
- Logistic Regression: A statistical modeling technique that uses other predictor variables to predict the probability of one binary variable. In this case, the binary variable is whether a shot becomes a goal and the predictor variables are distance, shot angle, game strength state, whether a shot was a rebound, etc.
- Gradient Boosting: A machine learning technique for regression and classification problems.
The key trade-off here is that models built with gradient boosting produce slightly better results than those built with logistic regression, but logistic regression is much easier to code and implement and still provides good results. Because my goal was to build a solid model that I was comfortable using, and not necessarily the best model, I opted to go with logistic regression.
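To make that choice concrete, here is a minimal sketch of what fitting a logistic regression expected goal model looks like in R. This is not my exact code; the data frame and column names (`shots`, `is_goal`, `shot_distance`, and so on) are placeholders standing in for the scraped play-by-play data.

```r
# Minimal sketch: a logistic regression expected goal model.
# `shots` is assumed to be a data frame of unblocked shots where `is_goal`
# is 1 if the shot became a goal and 0 otherwise.
xg_fit <- glm(
  is_goal ~ shot_distance + shot_angle + shot_type + strength_state,
  data   = shots,
  family = binomial(link = "logit")
)

# The fitted probabilities are the expected goal values for each shot.
shots$xG <- predict(xg_fit, type = "response")
```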
After establishing that I could build an acceptable model using logistic regression, the next question was how I would determine that a model was acceptable. How do you hold an expected goal model up to scrutiny? One way that I had "tested" public models in the past was by checking the correlation between expected goal shares (xGF%) and actual goal shares (GF%) at the team level. Most public models have an R² between 0.4 and 0.46, so an expected goal model with an R² of 0.4 would pass this test. But there are issues with this test – some teams just have better shooting and goaltending than others – and I’ve never seen a credible analyst test a model this way. I wanted to make sure that whatever method I used to test my model was a solid one that had been used by other analysts, so I set out to determine the best process for testing my model.
My research found that the most common way expected goal models are tested is by measuring the area under the curve (AUC), a value obtained from a receiver operating characteristic (ROC) curve. If you’re unfamiliar with what either of these things are, you’re no different than I was just a few weeks ago, so let me break it down for you: in World War II, a radar receiver operator was a person who analyzed radar returns and determined the likelihood that blips on the radar were friendly ships, enemy ships, or just random debris/noise. Their proficiency at making these distinctions was known as their "Receiver Operating Characteristics." An ROC curve plots the false positive rate of a signal detector on the x-axis and the true positive rate on the y-axis, showing the true positive rate that corresponds to each false positive rate.
A useless or completely random model will have an AUC of 0.5, while a "fair" model is generally considered one with an AUC between 0.7 and 0.8. To my knowledge, this is where every public expected goal model ranges, with most of the more popular models in the upper end of that range. (For reference, Evolving Hockey’s expected goal model had an AUC of 0.7822 for even strength shots, 0.7183 for power play shots, and 0.7975 for short-handed shots.)
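For anyone curious about the mechanics, here is a hedged sketch of how an AUC can be calculated in R with the `pROC` package, reusing the placeholder names from the earlier sketch. It reflects the general approach, not my exact testing code.

```r
library(pROC)

# Build the ROC object from the observed outcomes and the model's
# predicted goal probabilities, then read off the area under the curve.
roc_obj <- roc(response = shots$is_goal, predictor = shots$xG)
auc(roc_obj)   # a "fair" model lands somewhere between 0.7 and 0.8
plot(roc_obj)  # draws the ROC curve
```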
Remember, my goal here was not to create the best expected goal model that exists, but to create one that was solid enough that I could use it as a descriptive model of past events. The very first model that I built used distance as the only variable, and I still obtained an AUC of 0.7045, which technically means it qualified as a fair model. To give you an image, here is what my ROC curve looked like:

At this point, I was excited that I had already seen some results, but I also realized that I hadn’t done all that much, and that I should probably do my due diligence to make some improvements to this model before moving forward.
I spent the next few days tinkering with variables I thought might improve model performance. The process here was just me building different dummy variables (categorical variables with a value of zero or one denoting whether or not that variable is present) and testing the AUC of my model before and after they were added to see if they made any improvements. Here are the variables I used in my final model, with a short construction sketch following the list:
- Shot distance. (Continuous)
- Shot angle. (Continuous)
- Time difference between the shot and the prior event. (Continuous)
- Shot type. (Categorical)
- Shooter Strength. (Categorical)
- Game strength state. (Categorical)
- Whether the prior event was a takeaway by the shooting team. (Dummy)
- Whether the prior event was a giveaway by the opposing team. (Dummy)
- Whether a shot follows a blocked shot by the opposing team. (Dummy)
- Whether a shot is a rush. (Dummy)
- Whether a shot is a rebound occurring zero seconds after the prior shot. (Dummy)
- Whether a shot is a rebound occurring one second after the prior shot. (Dummy)
- Whether a shot is a rebound occurring two seconds after the prior shot. (Dummy)
- Whether a shot is a rebound occurring three seconds after the prior shot. (Dummy)
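Here is that construction sketch. The column names (`last_event_type`, `last_event_team`, `shooting_team`, `time_since_last`) are assumptions standing in for whatever the scraped play-by-play data actually calls them, and my real preprocessing handled more edge cases than this:

```r
# Minimal sketch: building a few of the dummy variables listed above.
# Each flag is 1 when the condition holds for a given shot and 0 otherwise.
shots$prior_takeaway <- as.integer(shots$last_event_type == "TAKEAWAY" &
                                   shots$last_event_team == shots$shooting_team)
shots$prior_giveaway <- as.integer(shots$last_event_type == "GIVEAWAY" &
                                   shots$last_event_team != shots$shooting_team)

# Rebound flags, split by how many seconds elapsed since the prior shot.
is_rebound <- shots$last_event_type %in% c("SHOT", "MISSED_SHOT") &
              shots$last_event_team == shots$shooting_team
shots$rebound_0s <- as.integer(is_rebound & shots$time_since_last == 0)
shots$rebound_1s <- as.integer(is_rebound & shots$time_since_last == 1)
shots$rebound_2s <- as.integer(is_rebound & shots$time_since_last == 2)
shots$rebound_3s <- as.integer(is_rebound & shots$time_since_last == 3)
```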
Why was I aggressively meticulous with how I chose to classify rebounds? Because I found that rebounds had a significantly different impact on goal odds depending on how recently they occurred. Here is the estimate I obtained for each of these coefficients:

I totally would have guessed that zero second rebounds are most likely to become goals. That makes a ton of sense. What I would not have guessed is that one second rebounds are less likely to become goals than two or three second rebounds. Why is this? I can’t say for certain, but here is my theory: zero second rebounds are quick, bang-bang plays that don’t give goalies the time to get back into position, while one second rebounds are often the result of "forced" shots where the goalie has had just enough time to get back into position and prepare themselves to make a save. By contrast, two and three second rebounds are more often the result of plays where there was a pass or a deflection after the first shot that moved the goalie out of position and created more "chaos."
This is just my theory though, and it ultimately doesn’t matter whether it’s true; what matters is that including these variables added a unique touch to the model and improved performance. In the end, I wound up with an AUC of 0.7707, which is just behind the AUC of the top expected goal models out there. Here’s an image of the ROC from my improved model:

If you look back to the first ROC curve from the model that used only distance, you’ll see there’s a massive improvement. I was very proud of what I had done up to that point and I was excited to go forward with the next step of this process, which was to create the venue adjustment which accounted for scorekeeper bias.
Creating the venue adjustment was the easiest part of this process because there’s already a lot of great work on venue adjustments available to the public. Of the adjustments I came across, the two that seemed most viable were Michael Schuckers’ (found on page 9 of [this article](http://www.hockeyanalytics.com/Research_files/SQ-DistAdj-RS0809-Krzywicki.pdf)) and Ken Krzywicki’s (found on page 2 of this article).
I implemented both adjustments and tested the model again using the adjusted distance figures. I found that Schuckers’ adjustment slightly harmed model performance while Krzywicki’s slightly improved model performance, increasing AUC from 0.7707 to 0.7708, so I went with Krzywicki’s. This was a very small improvement, but it was all that I needed. Here is the methodology behind Krzywicki’s adjustment:
Adjusted Distance = (Reported Distance)-(Rink Mean Reported Distance)
Sometimes the simplest method works.
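Taking the formula above at face value, here is a minimal sketch of what applying it looks like in R with `dplyr`; the `venue` and `shot_distance` column names are placeholders, not my actual code.

```r
library(dplyr)

# Subtract each rink's mean reported shot distance from every shot
# reported at that rink, per the adjustment described above.
shots <- shots %>%
  group_by(venue) %>%
  mutate(adj_distance = shot_distance - mean(shot_distance)) %>%
  ungroup()
```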
Now that I had built an expected goal model and implemented an acceptable venue adjustment, I had two sets of expected goal values for every unblocked shot in the 2018–2019 and 2019–2020 seasons: one before venue adjustments were made and one after. The next step was to build a regularized adjusted plus-minus model (RAPM) model to isolate the offensive, defensive, and net impact of every individual skater using the expected goal values calculated from reported shot distance, and then do the same with the expected goal values from adjusted shot distance.
This was by far the hardest part of the process. Remember, my coding experience prior to this project was very limited, and while I learned a lot through scraping the data, creating the expected goal model, and applying the venue adjustments, I was still far from a programming expert. Before I discuss my experience, though, I’m going to provide you with an overview of what RAPM is and why I used it here.
I said before that once I built a RAPM model, I would have "a point estimate of a player’s isolated offensive, defensive, and net impact on expected goals." But how do you get that? Hockey is such a fluid sport that obtaining even a point estimate of a player’s isolated impact is quite difficult. And it sounds even more difficult when you consider that RAPM is calculated without using any individual results, only on-ice results.
To be honest, this is not a simple concept. It took me a while to understand it, and I’ve tried explaining it many different times and failed almost every time. This time around, in order to explain how RAPM works, I’m going to take a lesson that I learned from my experience building these models and "chunk" this concept by explaining it with as few data points as possible.
Let’s pretend that the only contextual factor that impacts offensive on-ice results is whether or not Connor McDavid is on the ice playing offense. Everything else – all other teammates and opponents, which zone a shift began in, what the score is – none of that matters. Does this sound silly? Yes, it totally does, and it totally is silly, but I promise it’ll help get the point across.
This silly made-up world has one thing in common with the real world: Connor McDavid is extremely good at offense and has a strong impact on the rate at which his team generates offense, which is measured by expected goals per hour (xGF/60). We have ten data points that will allow us to determine exactly what McDavid’s impact is: five shifts without Connor McDavid where a team generated offense at a rate of 2.5 xGF/60 and five shifts with Connor McDavid where a team generated offense at a rate of 3.0 xGF/60. If we create a dummy variable called "Connor McDavid" with a value of 1 with McDavid on the ice and 0 with him off the ice, here is what our data looks like:

If we run a very simple regression on this data using xGF/60 as our target variable, our trendline in slope-intercept form will be xGF/60 = 2.5 + 0.5*(Connor McDavid). This means that for a typical shift without Connor McDavid, we expect a team’s on-ice offensive expected goal rate to be 2.5 xGF/60, and the presence of Connor McDavid increases this value by 0.5, so we expect a shift with Connor McDavid to see offense generated at a rate of 3.0 xGF/60. In other words, Connor McDavid’s isolated impact is increasing his team’s rate of offense by 0.5 xGF/60.
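If you want to see this in code, here is the toy regression in R using the made-up data from this example; the result matches the trendline above.

```r
# Ten made-up shifts: five without McDavid at 2.5 xGF/60,
# five with McDavid at 3.0 xGF/60.
toy <- data.frame(
  connor_mcdavid = rep(c(0, 1), each = 5),
  xgf_per_60     = rep(c(2.5, 3.0), each = 5)
)

fit <- lm(xgf_per_60 ~ connor_mcdavid, data = toy)
coef(fit)
#> (Intercept) connor_mcdavid
#>         2.5            0.5
```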
Now, let’s introduce one other player who can also influence their team’s on-ice rate of offense. Remember, nothing else besides McDavid matters, but now this player matters too. We just don’t know exactly how much they matter.
Luckily, with the data we have, we can easily figure out how much they matter. Say this player jumps out there for a shift without Connor McDavid and their on-ice xGF/60 is 2.8. Well, we know that absolutely no other external factors outside of a player’s control influenced this rate, and we know a normal rate would be 2.5 xGF/60, so we can conclude this player’s isolated impact on their team’s rate of offense was increasing it by 0.3 xGF/60. We don’t need to watch this shift, we don’t need to count how many goals or assists this player had – we know exactly what their impact was.
Now, what if this player jumps out there for a shift with Connor McDavid and they generate offense at a rate of 2.9 xGF/60? We know nothing else besides McDavid matters, but McDavid was out there for this shift, so we instead subtract his rate of 3.0 xGF/60 from their on-ice rate, and find that this player’s isolated impact on their team’s rate of offense was decreasing it by 0.1 xGF/60. Again, we don’t need to watch this shift to know what this player’s impact was. This is the essence of top down hockey analysis: using on-ice results to isolate a player’s impact.
Now, back to the real world. Everything matters again. Not only Connor McDavid, but every other teammate, every opponent, whether a shift was started in the offensive or defensive zone, what the score is, etc. Oh, and we want to isolate not only a player’s offensive impact, but their defensive impact as well. This all makes things a lot more complicated and adds a lot of uncertainty to our final values, which is why I use the term "point estimate" of a player’s impact whenever I can.
While things are now more complicated, the McDavid example showed us how to isolate everybody’s impact: create dummy variables that denote whether or not a skater is on the ice, in addition to dummy variables denoting the game strength, score, what zone a shift started in, and a few other variables, since we know those things matter and we want to account for them. Also, for every skater, we must create not just one dummy variable that denotes whether or not they were on the ice, but one that denotes whether they were on the ice playing offense and one that denotes whether they were on the ice playing defense, since we want to isolate their offensive and defensive impact and find separate point estimates for each.
To give an example of what this looks like, I’m going to "chunk" my real data set by pulling just one shift that occurred in the third period of the second game of the 2018–2019 season. Here’s what you need to know about this shift:
- The Washington Capitals played at home against the Boston Bruins.
- The Washington Capitals held a lead of three or more goals.
- This shift lasted 24 seconds.
- This shift began with a faceoff in Boston’s offensive zone.
- The Bruins took an unblocked shot that had a value of 0.1 xGF on this shift.
- The Capitals took an unblocked shot that had a value of 0.02 xGF on this shift.
- Boston’s five skaters were Brandon Carlo, Joakim Nordstrom, Chris Wagner, Noel Acciari, and Zdeno Chara.
- Washington’s skaters were Andre Burakovsky, Brooks Orpik, Chandler Stephenson, Lars Eller, and Madison Bowey.
And here is a partial view of this shift in RAPM prepared format:

As you can see, the top row displays this shift from the perspective of the home team as "offense" and the away team as "defense," while the bottom row displays this shift from the perspective of the away team as "offense" and the home team as "defense." The above image only displays some of the skaters on defense, but here are the skaters on offense.

Remember, the entire top row is from the perspective of the Washington Capitals. In this top row, xGF/60 is the rate at which the Capitals generated offense, skaters marked as "offense" are those who played for the Capitals, and skaters marked as "defense" are those who played for the Bruins. By contrast, the bottom row is from the perspective of the Boston Bruins, which means that xGF/60 is the rate at which the Bruins generated offense, skaters marked as "offense" are those who played for the Bruins, and skaters marked as "defense" are those who played for the Capitals. The Capitals also led by at least 3 goals, so they were marked as "Up 3" while the Bruins were marked as "Down 3."
To home in on two specific players, look at only Andre Burakovsky and Brandon Carlo. From the perspective of the Capitals – the top row – Brandon Carlo is on defense and Andre Burakovsky is on offense. From the perspective of the Bruins – the bottom row – Brandon Carlo is on offense and Andre Burakovsky is on defense.
This example was of one shift with two corresponding rows. The entire data set for the sample that I used contains 544,502 shifts with 1,089,004 total rows. This sounds like a lot, but the key concepts are still in place: every shift contains two rows, one from each team’s perspective, and the coefficients returned for the dummy variables determine the impact that the presence of a certain variable (such as Connor McDavid playing on offense) has on a team’s on-ice rate of expected goals for.
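For a concrete picture, here is a hedged sketch of what the two rows for the example shift above might look like once they are built. The column names are placeholders; my real data set has one offense dummy and one defense dummy for every skater in the league, plus the other contextual variables.

```r
# Two rows for one 24-second shift: the first row treats Washington (home)
# as "offense," the second treats Boston (away) as "offense."
shift_rows <- data.frame(
  xgf_per_60         = c(0.02 * 3600 / 24, 0.10 * 3600 / 24),  # xGF scaled to a per-hour rate
  shift_length       = c(24, 24),                              # later used as regression weights
  score_state        = c("up_3", "down_3"),
  off_zone_start     = c(0, 1),   # the faceoff was in Boston's offensive zone
  offense_burakovsky = c(1, 0),
  defense_burakovsky = c(0, 1),
  offense_carlo      = c(0, 1),
  defense_carlo      = c(1, 0)
)
```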
This concept of using regression to analyze hockey players based on everything that happens when they’re on the ice is at the heart of most modern public hockey analysis. The way that I like to think of it is that instead of analyzing hockey from the ground up by focusing on individual statistics such as goals and assists, or even more "advanced" individual statistics like blocked passes and zone exits, RAPM is analyzing hockey from the top down – hence the username. If a player blocks a lot of passes and in doing so, prevents the opponent from accumulating dangerous scoring chances, that will show in their defensive RAPM statistics. If a player is so-so in the defensive zone and struggles to block passes, but they’re an excellent forechecker who constantly blocks the opponent from building up plays in transition, this will also show in their defensive RAPM statistics. If they do some of these things well, but they also do something else that prevents them from having a strong impact, who really cares? That’s my philosophy.
My explanation makes it obvious that I’m a huge fan of RAPM, but I didn’t use it here because I like it. I used it because a player’s RAPM impact on expected goals against is one of the most important metrics for the defensive component of goals above replacement, and the other components of goals above replacement correlate very closely with RAPM, so providing a point estimate of a player’s RAPM before and after venue adjustments will show how much a venue adjustment may impact certain components of their goals above replacement. (RAPM on its own is also a very popular metric, and I think it helps to know how a venue adjustment may affect RAPM itself.)
Now that I’ve explained the concept of top down hockey analysis, RAPM, and why I chose to use it, it’s time to explain how I implemented it. As I showed above, my data frame contained two rows for every shift from the past two seasons: one where the home team is "offense" and one where the away team is "offense." My goal was to run a regression to isolate the impact that each external variable has on the on-ice rate of xGF/60 by obtaining the coefficient estimates.
Unfortunately, unlike the prior example with the made-up Connor McDavid data, I couldn’t just run a standard multi-linear regression. The reasons for this are three-fold:
- xGF/60 expresses the rate at which something occurs and does not take ice time into account. Two shifts may have an equal rate of 4.0 xGF/60, but if one of them occurs over the span of two minutes and the other occurs in the span of two seconds, we can’t just combine them and act like they’re the same thing. The one that occurs in two minutes is much more meaningful.
- There is a lot of multicollinearity within the data. This is a fancy way of saying that many variables in this data are highly correlated with one another, which is a fancy way of saying that players like Sean Monahan and Johnny Gaudreau spend almost all their minutes playing with one another. Traditional linear regression on datasets with multicollinearity leads to extremely unstable coefficient estimates.
- This instability will also exist for players who posted extreme results at either end of the spectrum in very small samples. If you run a simple multi-linear regression, the "best" player might end up being some guy who played thirteen minutes and had a 72% on-ice xGF% or something ridiculous like that.
There is a very simple fix for issue one: turn this into a weighted multi-linear regression using shift duration as the weight. Bam, easy, done.
There’s also one solution that solves both issues two and three, but it’s a bit less simple: Regularization (The R in RAPM). Regularization is a mathematical technique that biases coefficient estimates towards the population mean of zero. To simplify this concept and put it into hockey terms, we are essentially saying that until we have enough data to accurately estimate what a player truly is, we will assume that they are roughly average. This applies to not only players who have played a very small sample on their own, but also players who have played a large sample, but a small one without a certain teammate.
The version of regularization that I used is known as ridge regularization or Tikhonov regularization, and the degree to which I biased these coefficients towards zero or "shrunk" them was determined by a Lambda value that I obtained through a cross validation test on my original data set. The final regression that was run was considered a "weighted ridge regression." I could go into greater detail as to what exactly this all means mathematically, but for now, with this model still in the beta stage, I would rather stick to explaining the process from a hockey perspective with as little math as possible.
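For readers who want to see the mechanics, here is a minimal sketch of a cross-validated weighted ridge regression using the `glmnet` package. The object names (`X`, `xgf_per_60`, `shift_length`) are placeholders for the full RAPM design matrix, response, and weights described above; my actual code involved more bookkeeping than this.

```r
library(glmnet)

# alpha = 0 selects ridge (Tikhonov) regularization; shift lengths are used
# as weights so that longer shifts count for more than shorter ones.
cv_fit <- cv.glmnet(
  x       = X,            # (sparse) matrix with one column per dummy variable
  y       = xgf_per_60,   # target: on-ice rate of expected goals for
  weights = shift_length,
  alpha   = 0
)

# Coefficients at the cross-validated Lambda are the RAPM point estimates.
rapm_coefs <- coef(cv_fit, s = "lambda.min")
```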
This entire process was extremely similar to the process that Evolving Wild used to build their RAPM model. I owe a lot of credit to them for their write-up; I’m not sure I would’ve been able to build my own model without it. I intentionally followed their process very closely because as I stated in the first article, the defensive component of their RAPM model plays a large part in their goals above replacement model, and most of the other components correlate very closely with RAPM, so finding a point estimate of a player’s defensive impact according to RAPM before and after venue adjustments would give me an idea of how venue adjustments may impact defensive goals above replacement.
While my process was very similar by design, it did differ in two ways: I used a different expected goal model to build it, and I used two years at once. Using two years at once gave me a larger sample size, which meant that the Lambda value I obtained through cross validation was smaller than it would have been if I had run the cross validation on just one year of data. Therefore, my coefficients were "shrunk," or biased towards zero, to a lesser degree than Evolving Wild’s, and I got larger values on both ends of the spectrum.
While this is part two of this article, and I’m not supposed to go into the results until part three, I still feel the need to provide one result as an example of what the smaller Lambda values did, so I will, using Mark Stone. Over the past two years, his net impact on expected goals at even strength was the highest of any player according to both of our models. But their model’s coefficient estimate for his net impact was +0.46 xG/60; my model’s coefficient estimate was +0.701 xG/60 before venue adjustments.
Intuitively speaking, I prefer using multiple years at once because of the lower Lambda value and the consequently larger coefficient estimates. In other words, I think that the true isolated impact of an elite play-driver like Mark Stone is probably a bit closer to my coefficient estimate than theirs, and I’d prefer to use an approach that makes this type of result possible. But this is a matter of personal preference, and both approaches have their strengths and weaknesses.
Once I completed this process using the expected goal values that were calculated with reported shot distance, I repeated the same exact process using the expected goal values that I calculated with adjusted shot distance.
The results were very interesting. They’ll be covered in part three. Stay tuned.