2021 NHL Projection Model: High-Level Overview

What you need to know about the newest game simulation model from TopDownHockey

Patrick Bacon
Towards Data Science

--

A Forecast for the 2021 San Jose Sharks. (Image by Author)

I simulated the 2021 NHL season 10,000 times in order to determine the probability of each outcome. I’ve begun sharing the results of my work on Twitter, and I plan to write a full season preview soon. Before I do so, it’s essential that I provide an overview of what I did, so that readers can analyze the process and determine what they believe to be the strengths and weaknesses of the model.

I began by using extreme gradient boosting to build an expected goal model that determines the probability of each shot becoming a goal. More about this process can be found here. (Note: When I use shot here and for the remainder of this article, I am referring to all unblocked shots, including those that miss the net.) I then used a prior-informed ridge regression to obtain a point estimate of the impact that each skater has on the rate at which their team generates and allows shots and expected goals for and against at even strength, shots and expected goals for on the power play, and shots and expected goals against on the penalty kill. In order to obtain a point estimate of the impact that each skater has on the probability of their own shots becoming goals, and that each goaltender has on the probability of shots they face becoming goals, I followed a very similar process but instead used a non-prior informed (vanilla) ridge regression. More about the process of using regression to analyze NHL skaters can be found here.

I repeated this process for each of the last 6 NHL seasons (2014–2015 through 2019–2020). I then ran a linear regression on these seasons to obtain coefficient estimates that would provide me with the proper weights to place on years 1, 2, and 3 in order to most accurately forecast a player’s performance in year 4, and then applied these weights to every NHL skater who played in each of the past 3 seasons. For skaters who had not played in each of the last three seasons, I repeated this process using only years 1 and 2 to predict year 3 and applied those weights to their years 1 and 2. For skaters who had played only one year, I simply copied their estimated impact from last season and used it as their projected impact for next season.
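
The year-weighting step can be sketched as an ordinary least-squares fit. This is a minimal illustration in Python with NumPy, not the R code behind the actual model. The function names, the inclusion of an intercept, and the randomly generated player data are all assumptions made for the example.

```python
import numpy as np

def fit_year_weights(history, target):
    """Least-squares weights mapping prior-season impacts to the next season.

    history: array of shape (n_players, n_prior_years), oldest year first
    target:  array of shape (n_players,), the season being predicted
    Returns [intercept, w_year1, w_year2, ...]. The intercept is an assumption.
    """
    X = np.column_stack([np.ones(len(history)), history])
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coefs

def project(history, coefs):
    """Apply fitted weights to a matrix of prior-season impacts."""
    X = np.column_stack([np.ones(len(history)), history])
    return X @ coefs

# Made-up data: each player has a fixed "true talent" observed with noise
# in years 1 to 3 (history) and again in year 4 (target).
rng = np.random.default_rng(0)
true_talent = rng.normal(0.0, 1.0, size=500)
history = true_talent[:, None] + rng.normal(0.0, 1.0, size=(500, 3))
target = true_talent + rng.normal(0.0, 1.0, size=500)

coefs = fit_year_weights(history, target)
projection = project(history, coefs)
```

Passing only two columns of history gives the years 1 and 2 variant used for skaters with two seasons of data.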

I initially planned on using the same process for goaltenders but found in testing on prior seasons that goaltender performance was not very repeatable from year-to-year, and that my projected goaltender performance played a much larger role in explaining variance in my projected standings than it did in the actual NHL standings. This led me to “place less weight” on goaltending by repeating a similar process to the one I did with skaters but instead directly obtaining the fitted values from a linear regression, which pulled the projected performance of every goaltender much closer to zero. This improved the performance of the model in testing on prior seasons.

Every projected value that I obtained here came in the form of a rate; for the play-driving components the rate was per-minute, and for the shooting and saving components it was per-shot. Not all skaters will play the same number of minutes and not all skaters will take the same percentage of their team’s shots when they are on the ice, which makes these numbers useless without an idea of how much they will play and shoot next season.

In order to estimate the number of minutes that each skater will play, I repeated the above process using every player’s time-on-ice percentage (TOI%) in order to project their TOI% for 2021 and then adjusted this percentage based on the projected TOI% of all of their teammates to determine a proper estimate of how much each player will actually play. This was done with TOI% at even strength (EV), the power play (PP), and the penalty kill (PK). I then multiplied their projected TOI% at each game strength by the amount of time I expect teams to play at each game strength — roughly 90% at even strength and 10% on special teams, distributed evenly between PP and PK — and then multiplied their projected per-minute impact at each game strength by the minutes that I projected they would play at that game strength.
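
The conversion from TOI% to minutes and then to a projected contribution is simple arithmetic. The Python sketch below uses the 90% / 5% / 5% split described above. The 60-minute game length, the function names, and the example player values are assumptions, and the teammate-based TOI% adjustment is not reproduced here.

```python
GAME_MINUTES = 60.0  # assumed regulation game length
STRENGTH_SHARE = {"EV": 0.90, "PP": 0.05, "PK": 0.05}  # share of game time

def projected_minutes(toi_pct, strength):
    """Minutes per game a skater plays at one game strength."""
    return toi_pct * STRENGTH_SHARE[strength] * GAME_MINUTES

def projected_contribution(per_minute_impact, toi_pct, strength):
    """Per-minute impact scaled by projected minutes at that strength."""
    return per_minute_impact * projected_minutes(toi_pct, strength)

# Hypothetical skater: on the ice for 35% of EV time and 60% of PP time.
ev_minutes = projected_minutes(0.35, "EV")  # 0.35 * 0.90 * 60
pp_minutes = projected_minutes(0.60, "PP")  # 0.60 * 0.05 * 60
ev_shot_contribution = projected_contribution(0.02, 0.35, "EV")
```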

In order to estimate the percentage of their team’s shots that each skater will take, I essentially repeated this process using each skater’s share of their team’s shots while they are on the ice. I then multiplied the percentage of their team’s shots that I projected each shooter would take by the projected impact that each shooter has on their shots becoming goals.

For goaltenders, I used games played last season as a rudimentary forecast of the games they will play next season and then divided the number of games a goaltender is projected to play by the number of games projected to be played by all goaltenders on their team, which provided me with an estimate of the percentage of their team’s games that each goaltender will play. I then multiplied the percentage of their team’s games that each goaltender was projected to play by their projected impact on shots they face becoming goals.

For each team, I used the top-12 forwards, top-6 defensemen, and top-2 goaltenders to inform these calculations. I ranked skaters by EV TOI% to determine which skaters made the cut and used last season’s games played to determine which goaltenders made the cut, but I made a few arbitrary lineup decisions based on external knowledge. For example, Aaron Dell played more games than Jack Campbell last season, but based on the salary that the Toronto Maple Leafs are paying each player and the price in trade assets that they paid to acquire Campbell, I chose to use Campbell in place of Dell. There were a few teams that had only one goaltender; for these teams I multiplied the projected performance of that goaltender by 47 and divided it by 56, as a way of assuming that they would play 47 games and a goaltender I know nothing about would play the other 9. For teams that I knew would be using a skater whom I had not seen before, such as the New York Rangers with Alexis Lafreniere, I simply used one fewer skater at that position to build the team components.
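
The goaltender weighting described above, including the 47-of-56 rule for teams with a single known goaltender, might look like the following in Python. The function name, the tuple layout, and the example numbers are hypothetical. A negative impact here means shots are less likely to become goals.

```python
SEASON_GAMES = 56        # 47 + 9, per the single-goaltender rule
SOLO_STARTER_GAMES = 47  # games assumed for a team's lone known goaltender

def team_goaltending_impact(goalies):
    """Combine goaltenders into one per-shot team impact.

    goalies: list of (games_played_last_season, projected_per_shot_impact)
    """
    if len(goalies) == 1:
        # The unknown backup is treated as having zero impact.
        _, impact = goalies[0]
        return impact * SOLO_STARTER_GAMES / SEASON_GAMES
    total_games = sum(games for games, _ in goalies)
    return sum(games / total_games * impact for games, impact in goalies)

# Made-up examples.
tandem = team_goaltending_impact([(40, -0.004), (20, 0.002)])
solo = team_goaltending_impact([(50, -0.0056)])
```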

I summed up the projected impact on each component of the game for every player on each team. These sums provided me with a team’s isolated impact on the following six components: shots for, shots against, expected goals for, expected goals against, shooting, and goaltending. A team’s impact on the first four components is expressed per 60 minutes, while their impact on the latter two is expressed per shot. These six components make up the “guts” of the model.

To determine the rate at which each team would take shots and score goals on those shots in my simulations, I combined the projected impact of each team with the corresponding impact of their opponent to determine the overall impact on those components, and then added this to the league-average rate at which teams take shots and score goals. To better understand this concept, picture the following fictional game (made up of fictional data) and, for the sake of brevity, focus only on the offense of the home team:

- League-average teams take 44 shots and score 3 goals per 60 minutes.

- Being the home team increases a team’s hourly shot rate by 1 and increases their hourly goal rate by 0.3.

- The home team has an impact on shots for of +5.0 and the away team has an impact on shots against of -1.0.

- The home team has an impact on expected goals for of +0.3 and the away team has an impact on expected goals against of -0.1.

- The home team’s skaters have a shooting impact on goal probability of +0.1% and the away team’s skaters have a goaltending impact on goal probability of -0.01%.

Before adjusting for the quality of the home team’s shooting and the away team’s goaltending, we see that the home team should take 49 shots per hour (44 + 1 + 5.0 - 1.0) and score 3.5 goals per hour (3 + 0.3 + 0.3 - 0.1), which means that they should score on 7.14% of their shots. After adding the shooting and goaltending adjustments (+0.1% and -0.01%), we determine that they should score on 7.23% of their shots.
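
The arithmetic in this fictional game can be checked with a few lines of Python. The constant and function names are made up for the illustration; the numbers are the ones from the list above.

```python
LEAGUE_SHOTS_PER_60 = 44.0
LEAGUE_GOALS_PER_60 = 3.0
HOME_SHOT_BONUS = 1.0
HOME_GOAL_BONUS = 0.3

def home_offense(shots_for, shots_against, xg_for, xg_against,
                 shooting, goaltending):
    """Return (shots/60, goals/60, raw goal %, adjusted goal %) for the home team."""
    shots = LEAGUE_SHOTS_PER_60 + HOME_SHOT_BONUS + shots_for + shots_against
    goals = LEAGUE_GOALS_PER_60 + HOME_GOAL_BONUS + xg_for + xg_against
    raw_pct = goals / shots
    # Shooting and goaltending impacts are added directly to goal probability.
    adjusted_pct = raw_pct + shooting + goaltending
    return shots, goals, raw_pct, adjusted_pct

shots, goals, raw_pct, adjusted_pct = home_offense(
    shots_for=5.0, shots_against=-1.0,
    xg_for=0.3, xg_against=-0.1,
    shooting=0.001, goaltending=-0.0001,  # +0.1% and -0.01%
)
print(f"{shots:.1f} {goals:.2f} {raw_pct:.2%} {adjusted_pct:.2%}")
# prints: 49.0 3.50 7.14% 7.23%
```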

This process is repeated for the home and away team for every single game to obtain the rate at which they will take shots and score on the shots they take. Once these values are acquired, the simulation is ready to be run.

I start the simulation by creating a dataframe in R with 3,600 rows and 2 columns. Each row represents a second of the game I’m simulating, and each column contains a random estimate of whether a team will take a shot or not based on the rate at which they shoot; there is one column for each team’s shots. If a team does take a shot, I estimate whether that shot will become a goal based on the rate at which they score on their shots. Once this process is complete for every row, the number of goals scored by each team is summed and if one team has scored more, a winner is declared. If both teams have scored the same number of goals, I create an additional dataframe with 600 rows and 2 columns in order to simulate overtime. The first team to score in overtime wins the game, and if neither team scores within 600 seconds of overtime, a winner is randomly declared with each team being provided with a win probability of 50%.
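
A second-by-second simulation along these lines can be sketched in Python with NumPy. This is an illustration rather than a port of the R dataframe version. The function names are hypothetical, overtime is assumed to use the same rates as regulation, and a coin flip is assumed if both teams score in the same overtime second.

```python
import numpy as np

def simulate_period(rng, seconds, shots_per_60, goal_prob):
    """Boolean array marking the seconds in which a team scores."""
    shot = rng.random(seconds) < shots_per_60 / 3600.0  # at most one shot per second
    goal = rng.random(seconds) < goal_prob              # does that shot go in?
    return shot & goal

def simulate_game(rng, home, away):
    """home and away are (shots_per_60, goal_prob). Returns 'home' or 'away'."""
    home_goals = simulate_period(rng, 3600, *home).sum()
    away_goals = simulate_period(rng, 3600, *away).sum()
    if home_goals != away_goals:
        return "home" if home_goals > away_goals else "away"
    # Tied after regulation: 600 seconds of overtime, first goal wins.
    home_ot = simulate_period(rng, 600, *home)
    away_ot = simulate_period(rng, 600, *away)
    home_first = int(np.argmax(home_ot)) if home_ot.any() else 600
    away_first = int(np.argmax(away_ot)) if away_ot.any() else 600
    if home_first != away_first:
        return "home" if home_first < away_first else "away"
    # No overtime goal (or a same-second tie): 50/50 coin flip.
    return "home" if rng.random() < 0.5 else "away"

def home_win_probability(home, away, n_sims=10_000, seed=0):
    """Share of simulated games won by the home team."""
    rng = np.random.default_rng(seed)
    wins = sum(simulate_game(rng, home, away) == "home" for _ in range(n_sims))
    return wins / n_sims
```

For example, `home_win_probability((49.0, 0.0723), (44.0, 0.068))` estimates the home team's chances when it shoots 49 times per hour at 7.23% against an opponent shooting 44 times per hour at 6.8%; the opponent's numbers here are invented.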

I tested the model on the 2019–2020 season following the same process, and evaluated it using log loss and the average error between projected and actual standings points per 82 games. The log loss I obtained was roughly 0.678 and the average error was roughly 8.1 points. This was not quite a proper test: I was simultaneously at an unfair advantage because I had access to the first team that a player played for and at an unfair disadvantage because I was not updating lineups every game. I take my results with a grain of salt because of this, but I still feel confident based on the results that the model will perform well for the 2021 season.
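
Both evaluation metrics are easy to compute by hand. In the Python sketch below, the function names are hypothetical and "average error" is assumed to mean mean absolute error. For reference, a forecast of 50% for every game scores a log loss of ln 2, about 0.693, so lower values indicate forecasts that beat a coin flip.

```python
import math

def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (1 = predicted team won)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def points_per_82(points, games_played):
    """Scale standings points to an 82-game pace."""
    return points / games_played * 82

def mean_abs_error(projected, actual):
    """Average absolute gap between projected and actual values."""
    return sum(abs(p - a) for p, a in zip(projected, actual)) / len(projected)

# Made-up numbers for illustration only.
coin_flip = log_loss([0.5, 0.5], [1, 0])            # equals ln 2
example_error = mean_abs_error([90, 100], [95, 92])  # (5 + 8) / 2
```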

In the future I wish to build a more granular simulation engine that more accurately mirrors an NHL game by including penalties, score effects, empty net situations, and more factors, but I suspect that this would not greatly change the probability of teams winning games and would make it take significantly longer to obtain these probabilities. For now, though, I’m comfortable with what I’ve put together.

I currently have win probability calculated for each game. I plan to update lineups throughout the season based on who is projected to play for each team, run 10,000 simulations for each game the night before it is played, and update probabilities based on the results of these simulations. At some point in the season, I plan to use a prior-informed ridge regression to update my projection of each player’s impact, but I would like to obtain a fairly large sample of data from this season before I do this.
