Predicting 2020-21 NBA’s Most Valuable Player using Machine Learning

What do ML models say about the MVP race?

Photo by Keith Allison on Wikimedia Commons

At the end of every season, media members across the National Basketball Association (NBA) are asked to decide on the winner of the league’s most sought-after individual regular season award: The Most Valuable Player (MVP). Created in the 1955–56 season, it aims to reward the best performing and most consistent player of the regular season.

Sparking a lot of debate amongst basketball fans and analysts every year, the MVP race is usually one of the most entertaining (and intense) storylines of the NBA season. While narratives play a big part in the ultimate decision on the winner, it mostly comes down to the player who had the biggest statistical impact on his team’s success.

With more than half of the season’s games in the books, it is becoming clear who the real MVP candidates are. But who will actually end up winning it? Based on the games already played and historical data, the goal of this article is to predict the results of the MVP award using ML models.

Data

For the historical data, we make use of Dribble Analytics’ dataset, which consists of data pertaining to the top 10 players in MVP voting of each season, from 1979–80 (the start of the 3-point era) to 2017–18. To add to this, we have collected the same data for the two following seasons (2018–19 and 2019–20).

As for the current season, we have collected data from the 10 players currently on Basketball Reference’s 2020–21 NBA MVP Award Tracker: Nikola Jokić, Joel Embiid, Giannis Antetokounmpo, James Harden, Damian Lillard, LeBron James, Kawhi Leonard, Luka Dončić, Kyrie Irving and Rudy Gobert.

All of the data can be found on Basketball Reference.

Feature Selection

Before mentioning the features of our models, it is important to define our target value. The target value we are trying to predict is the Share of the total MVP votes each player gets.

Share = (MVP votes on Player)/(Total MVP votes)
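
As a minimal sketch of the target definition above, the share can be computed as a simple ratio. The function name and the numbers below are hypothetical and only illustrate the formula; they are not taken from the dataset.

```python
def vote_share(player_votes: float, total_votes: float) -> float:
    """Return the fraction of the total MVP votes that one player received."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    return player_votes / total_votes


# Made-up numbers, for illustration only.
example_share = vote_share(player_votes=50, total_votes=100)
print(example_share)  # 0.5
```

A share of 1.0 would mean a player collected every available vote, so the target always lies between 0 and 1.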

As for the features, we start by having 16 in total:

Games
Team Wins
Overall Seed
MP
PTS/G
TRB/G
AST/G
STL/G
BLK/G
FG%
3P%
FT%
WS
WS/48
BPM
VORP

The first four are simple stats. They represent the number of games played by each player, how many wins their team has, their team’s position in the league and minutes played per game, respectively.

PTS/G, TRB/G, AST/G, STL/G, BLK/G stand for points, total rebounds, assists, steals and blocks per game.

FG%, 3P%, FT% represent field goal percentage, three-point percentage, and free throw percentage.

WS, WS/48, VORP and BPM are advanced stats. WS and WS/48 stand for Win Shares and Win Shares per 48 minutes. These stats aim to divide credit for team success among the individual members of the team.

BPM stands for Box Plus/Minus and is a metric that estimates a basketball player’s contribution to the team when that player is on the court.

Finally, VORP stands for Value Over Replacement Player and is a box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level player, translated to an average team and prorated to an 82-game season.

To help us decide which features to use, we fit a Random Forest regression and inspect the feature importance results of that model.

RF feature importance results

Following that, we also compute the correlation matrix. We want to identify features that are highly correlated with each other and remove one of each pair, to avoid feeding the model duplicated information.

After investigating the correlation matrix, we identify some correlations that make sense. For example, the feature Overall Seed is strongly and negatively correlated to the number of Wins. There is no need to use the two features. We can also see some strong correlations between the variables WS and WS/48, BPM and VORP. This also makes sense since the metrics WS/48 and VORP are dependent on WS and BPM, respectively.
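
The correlation check described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names, the 0.8 threshold and the toy values are all hypothetical, not the article's actual code or data.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.8):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Made-up values for five hypothetical players, for illustration only.
toy_columns = {
    "Team Wins": [60, 55, 50, 45, 40],
    "Overall Seed": [1, 2, 4, 7, 10],
    "FG%": [0.50, 0.44, 0.52, 0.47, 0.49],
}

flagged = highly_correlated_pairs(toy_columns)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.2f}")
```

On this toy data only the Team Wins and Overall Seed pair is flagged, with a strongly negative coefficient, which mirrors the kind of redundancy discussed above.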

Of all the original 16 features, we end up removing seven of them. The final features for our models are:

Overall Seed
PTS/G
TRB/G
AST/G
STL/G
BLK/G
FG%
WS
VORP

Training and Testing

In order to train and test our models, we need a training set and a test set. As is common in Machine Learning, the full dataset was split randomly, with the training set consisting of 75% of it.
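
A random 75/25 split like the one described above could look roughly like this. The function name, the fixed seed and the placeholder rows are assumptions made for the sake of a reproducible example.

```python
import random


def train_test_split_rows(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder rows standing in for player-season records.
rows = list(range(20))
train_rows, test_rows = train_test_split_rows(rows)
print(len(train_rows), len(test_rows))  # 15 5
```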

The metrics which we used to evaluate the models’ performance on the test set were the Mean Squared Error (MSE) and the R-squared.
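
For reference, both metrics can be written in a few lines of plain Python. This is a sketch of the standard definitions, with made-up shares used as input; the function names are our own.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up vote shares, for illustration only.
true_shares = [0.0, 0.5, 1.0]
predicted_shares = [0.1, 0.5, 0.9]

mse = mean_squared_error(true_shares, predicted_shares)
r2 = r_squared(true_shares, predicted_shares)
print(round(mse, 4), round(r2, 2))  # 0.0067 0.96
```

Because vote shares lie between 0 and 1, squared errors are small by construction, so the MSE is best read alongside the R-squared rather than on its own.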

In our experiment, we used the following models:

  • Deep Neural Network (DNN)
  • k-nearest neighbors regression (KNN)
  • Random forest regression (RF)
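
To give an intuition for one of the models listed above, here is a minimal sketch of k-nearest neighbors regression: the prediction for a player is the average target of the k most similar players in the training set. The value of k, the Euclidean distance and the one-feature toy data are assumptions for illustration, not the configuration used in the experiment.

```python
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Average the targets of the k training points closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Made-up one-feature training data, for illustration only.
toy_X = [(0.0,), (1.0,), (2.0,), (10.0,)]
toy_y = [0.0, 0.1, 0.2, 0.9]

prediction = knn_predict(toy_X, toy_y, query=(1.0,), k=3)
print(round(prediction, 3))  # 0.1
```

Since the method relies on distances, features on very different scales would normally be standardized first so that no single stat dominates.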

The table below shows the MSE and R-squared for each of the three models. As we know, a lower MSE and a higher R-squared indicate a more accurate model.

Looking at the results, the models don’t have a very high R-squared, but they achieve low MSE values. Considering that the majority of MVP winners had a vote share advantage of more than 0.1 over their runners-up, these are good results. As we can see, KNN was the best performing model on the test set, with the highest R-squared and the lowest MSE.

Predictions

The four graphs below show the predictions of each of the three models for the 2020–21 MVP Award Vote Share, along with their average.

Two of our models have Jokić as the winner, one has Embiid. Two models predict Giannis as the runner-up. James Harden comes in third in two models and fourth in the other. One curious result: In comparison to the other models, our RF model absolutely loves Rudy Gobert’s chances (for some reason), ranking him fourth in MVP voting share.

On average, the models have Jokić winning MVP by a slight margin over Giannis. Harden and Embiid also hold a comfortable lead in share over the remaining candidates.
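
Averaging the models' outputs can be sketched as below. We assume a simple unweighted mean per player; the player labels and predicted shares are placeholders, not the actual predictions.

```python
def average_predictions(per_model):
    """Unweighted mean of each player's predicted share across all models."""
    players = next(iter(per_model.values())).keys()
    return {
        player: sum(preds[player] for preds in per_model.values()) / len(per_model)
        for player in players
    }


# Placeholder predictions, for illustration only.
toy_predictions = {
    "DNN": {"Player A": 0.6, "Player B": 0.3},
    "KNN": {"Player A": 0.4, "Player B": 0.5},
    "RF": {"Player A": 0.5, "Player B": 0.4},
}

averaged = average_predictions(toy_predictions)
ranking = sorted(averaged, key=averaged.get, reverse=True)
print(ranking)  # ['Player A', 'Player B']
```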

Conclusion

Our models crown Jokić as the MVP but the battle for the award is still very much alive.

Jokić’s MVP case is stronger than ever. Adding to his incredible statistical impact on the court, the Joker has played every game of the season so far. This is something that can set him apart from the other candidates.

Embiid was having a historic season, but his recent injury might have permanently damaged his chances of winning the race.

Giannis is a perennial MVP candidate, and it should come as no surprise to anyone if he ends up first in the voting.

James Harden can be seen as the dark horse in this race, but there is no denying that his impact has been felt on his new team. If he keeps up his recent production, he will certainly have a strong case for MVP by the end of the season.

A lot can (and will) still change, but one thing seems clear: we can confidently predict that the award will go to one of Jokić, Giannis, Embiid or Harden.

Github Repository
