Using Data Science to Predict the Next NBA MVP

Can statistical models confidently tell us who will receive the MVP each season?

Caio Brighenti
Towards Data Science


Giannis Antetokounmpo giving his acceptance speech after receiving the 2019 NBA MVP award
Michael Kovac/Getty Images

Every year, one standout NBA player is chosen by over one hundred sports media members as the year’s Most Valuable Player (MVP). As with anything else in sports, the discussions around who should or shouldn’t receive the award can be fiery. It’s not hard to narrow down the field to a handful of players, but fans and the media are hardly ever in complete agreement as to who should be the recipient. Well, except for Steph Curry that one time.

Some people believe the MVP should be the player with the most impressive stats. Others argue a true MVP doesn’t have the best individual performances, but rather raises the level of others around them (see LeBron James).

But who’s right? We could endlessly debate the question and get nowhere. Or, we can see what the numbers say. To tackle this debate, I decided to use various basketball statistics to model who would be the 2019 NBA MVP.

Data Collection

Web scraping Basketball Reference with Selenium

The first step in any data science project is acquiring a complete dataset to analyze. Luckily, we’re dealing with basketball, a sport rife with data.

There are two main sources of NBA data: the NBA’s own stats website and the fan-beloved third-party Basketball Reference website. The main difference is how you get at the data: the NBA site exposes a JSON API, while BBallRef lets you download CSV files directly. The first requires JSON requests, the second a bit of clever web scraping. As I was hoping to get some experience with the latter, I went with BBallRef.

Using Selenium — a tool for automating browsers — within Python, I built a simple scraper that loaded each page in a list of years, toggled the built-in “Download as CSV” button, and saved the CSV output to disk. Thankfully, BBallRef pages follow a template, allowing the same scraper to be employed regardless of year or data type. This also seems like a good time to mention that my complete codebase, including the full commented scraper, is available on GitHub.
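
For a flavor of the idea, here is a rough alternative sketch in R using the rvest package rather than Selenium, reading the stats table straight out of the page HTML; the URL pattern, the table position, and the file naming here are assumptions:

library(rvest)

# Download the season-total stats table for one year and save it as a CSV.
scrape_totals <- function(year) {
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_totals.html")
  page <- read_html(url)
  # The main stats table is assumed to be the first <table> on the page;
  # the repeated header rows inside it would still need to be filtered out.
  tbl <- html_table(html_element(page, "table"))
  write.csv(tbl, paste0("totals_", year, ".csv"), row.names = FALSE)
}

# One file per season of interest
for (yr in 1976:2018) scrape_totals(yr)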

Once the scraper was complete, all I had to do was let it work its magic and collect season standings, advanced, total, and per-game individual player statistics, and award data from 1976 to 2018. The next step was to process the data.

Data Processing

Using R to load, clean, and merge data

While player statistics along with names, teams, and seasons aren’t a particularly difficult dataset to deal with, my data did have some peculiarities that proved challenging. First, let’s look solely at total player stats. Here’s what a single entry might look like.

Example row from 2018 total statistics

Alright, this looks simple enough, but there is one problem: the formatting of the player’s name. It takes the format “James Harden\hardeja01,” when really we just want “James Harden.” This became a perfect opportunity to practice simple text cleaning in R.
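
A minimal sketch of that cleanup, assuming the combined name and ID live in a column called Player:

# Split each "Name\playerid" string on the backslash and keep the readable half.
# Note: the "\\" literal is a single backslash character once R parses the string.
clean_player_names <- function(players) {
  cleaned <- character(length(players))
  for (i in seq_along(players)) {
    parts <- strsplit(as.character(players[i]), "\\", fixed = TRUE)[[1]]
    cleaned[i] <- parts[1]
  }
  cleaned
}

dat_totals$Player <- clean_player_names(dat_totals$Player)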

And that’s all we need to do. There’s certainly a more efficient way to accomplish the task, but sometimes it’s better to just accept having a few more lines in exchange for making your code that much more legible.

From here, we need to loop through each year in our dataset, load the CSV for that year, clean the player names, do some more straightforward data type management I’ve omitted for the reader’s sake (you’re welcome), and merge the results into a single spreadsheet. This spreadsheet holds the total stats for every player in every season from 1976 onward, in the format shown above. I also did this for per-game stats, advanced stats, award data, and season standings.
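
Roughly sketched, assuming one CSV per season named totals_<year>.csv and the clean_player_names() helper from the sketch above:

years <- 1976:2018

all_totals <- do.call(rbind, lapply(years, function(yr) {
  season <- read.csv(paste0("totals_", yr, ".csv"), stringsAsFactors = FALSE)
  season$Player <- clean_player_names(season$Player)  # strip the "\playerid" suffix
  season$Season <- yr                                 # tag each row with its season
  season
}))

write.csv(all_totals, "totals_all.csv", row.names = FALSE)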

Next, I wanted to add a Team.Wins column to individual stats, as team wins often factor into discussions about individual awards. I thus matched each player’s seasons with the appropriate team and season in my season standings data. Let’s look at how the standings data is formatted.

Example row from complete standings data

While this is technically just a matter of matching up team names and grabbing the Wins column, we have an issue with team name formatting. The player data codes the Houston Rockets as HOU, while the standings data uses the full Houston Rockets name. This wouldn’t be a problem for any NBA fan, but my computer doesn’t have enough information to match team names with their abbreviations.

The solution was simple yet tedious: manually map each abbreviation to each full team name. This was especially challenging given how frequently teams have changed names over the years. Who knew there was such a thing as the Kansas City Kings (KCK)? After creating this mapping, adding a Team.Wins column to player data required just looping through each player/season pair and finding the appropriate value.
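
A minimal sketch of that lookup, assuming the standings live in a data frame called standings with Team, Season, and Wins columns, and that the player data carries the usual Tm abbreviation column:

# Map abbreviations to full franchise names (truncated here; the real table covers every team and era)
abbrev_to_name <- c(HOU = "Houston Rockets",
                    KCK = "Kansas City Kings",
                    MIL = "Milwaukee Bucks")

all_totals$Team.Wins <- NA_real_
for (i in seq_len(nrow(all_totals))) {
  full_name <- abbrev_to_name[[as.character(all_totals$Tm[i])]]
  match_row <- standings$Team == full_name & standings$Season == all_totals$Season[i]
  all_totals$Team.Wins[i] <- standings$Wins[match_row]
}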

If you want to see what this code looks like, the full data processing code is available here along with the rest of the codebase. With our data processed, it’s time to proceed to summary analysis.

Data Analysis

Summary statistics and preliminary analysis

Before modeling, it’s always good practice to do some summary analysis on your dataset. For the rest of this article, we’ll be looking at just data from the year 2000 onward. First, let’s do a simple summary() in R with our total stats, looking specifically at minutes played, field goals made, three-pointers made, rebounds, assists, steals, blocks, and points. I use the dplyr tibble format to easily select the relevant columns.
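
Something along these lines, with the column names following the Basketball Reference CSV headers:

library(dplyr)

dat_totals %>%
  as_tibble() %>%
  select(MP, FG, X3P, TRB, AST, STL, BLK, PTS) %>%
  summary()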

Summary of individual player season totals

As expected, we see a large range of values for any given statistic. Let’s also quickly see how many observations we’re dealing with using the dim function.

Dataset dimensions for individual player season totals

We see that we have nearly 6500 observations of 36 columns each. Since we’re looking at an 18-year range, that gives us roughly 360 observations for each year. Next, to get a sense of which columns might serve as useful predictors, I created a correlation plot to see which basketball statistics correlated the most with MVP votes received.
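
One way to sketch that plot, assuming the first-place-vote counts have been merged in under an assumed First.Votes column and that the corrplot package is installed:

library(dplyr)
library(corrplot)

# Correlations between box-score totals and first-place MVP votes
cor_mat <- dat_totals %>%
  select(MP, FG, FG., X3P, FT, FT., TRB, AST, STL, BLK, PTS, First.Votes) %>%
  cor(use = "pairwise.complete.obs")

corrplot(cor_mat, method = "circle")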

Plot of correlation between player season totals and number of first place MVP votes received

The first attempt at this was not very informative. The plot above shows weak positive correlations between first-place votes and points, field goals made/percentage, and free throws made/percentage. These statistics are also inherently correlated with one another (more field goals = more points). This suggests that scoring more points increases your odds of being MVP, but that is a fairly obvious conclusion. Let’s try again with more advanced stats.

Plot of correlation between player season advanced stats and number of first place MVP votes received

Our results are more interesting here, suggesting player efficiency rating (PER), win shares, plus-minus, and value over replacement player (VORP) are more strongly correlated with votes received. Let’s further visualize this by seeing where past MVP recipients have fallen in those stats.
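
Here’s a rough ggplot2 sketch of one such panel, assuming the advanced data sits in a data frame called dat_advanced with an MVP winner flag:

library(ggplot2)

eligible <- subset(dat_advanced, G >= 41)  # drop low-minute seasons, as in the plots below

ggplot(eligible, aes(x = PER, y = VORP, color = factor(MVP))) +
  geom_point(alpha = 0.6) +
  scale_color_manual(values = c("grey60", "red"), guide = "none") +
  labs(x = "PER", y = "VORP", title = "VORP vs. PER, MVP seasons in red")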

Plots of VORP vs. PER and BPM vs. WS with MVP recipients shown in red (players with fewer than 41 games played omitted)

These two plots, showing the MVP recipients in red, clearly demonstrate the effectiveness of advanced stats in capturing the qualities of an MVP-level player. However, they do not appear to be enough to judge the winner, with several MVPs falling in the middle of the pack in either graph.

At this stage, the final step I took before modeling was to normalize each statistic by season, dividing it by that season’s maximum. That is to say, each statistic now lies between 0 and 1, with the stat leader setting the bar at 1. This is mostly to keep abnormally low or high raw values from hurting the model. Now we can finally get to modeling.
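
In dplyr terms, the rescaling looks roughly like this; the exact set of columns being rescaled is an assumption:

library(dplyr)

dat_totals <- dat_totals %>%
  group_by(Season) %>%
  mutate(across(c(G, MP, X3P, DRB, AST, BLK, TOV, PF, PTS, Team.Wins),
                ~ .x / max(.x, na.rm = TRUE))) %>%  # divide by that season's leader
  ungroup()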

Modeling

Building and tuning a model to predict the NBA MVP

We start off simple, fitting a logistic regression model to some of the relevant variables in the player season totals dataset using R’s glm() function.

tot.log <- glm(MVP ~ G + X3P + DRB + AST + BLK + PF + PTS + Team.Wins,
               data = dat_totals, family = binomial(link = "logit"))

To clarify what’s happening here, MVP ~ tells R to model the MVP outcome as a function of all of the variables that follow. The family = binomial(link = "logit") argument tells R we intend to perform a simple binary (0/1) classification, i.e. logistic regression. While we’re at it, let’s also fit a model using advanced stats.

adv.log <- glm(MVP ~ PER + TS. + X3PAr + FTr + TRB. + AST. + STL. + BLK. +
                 TOV. + USG. + WS + BPM + VORP + Team.Wins,
               data = adv.shortlist, family = binomial(link = "logit"))

Just as before, we pass the variable we’d like to predict followed by the predictors. Now let’s see how good our models are. To measure model performance, we first have to make predictions using these models, and then see how accurate those predictions turn out to be. To simplify this, I created functions in R to do both tasks as follows.
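
Sketched roughly, with assumed helper names and an MVP column stored as a 0/1 indicator:

# Pick the highest-odds player in each season as the predicted MVP
predict_mvps <- function(model, dat) {
  dat$Odds <- predict(model, newdata = dat, type = "response")
  do.call(rbind, lapply(split(dat, dat$Season), function(season) {
    season[which.max(season$Odds), ]
  }))
}

# Share of seasons in which the predicted MVP actually won
mvp_accuracy <- function(picks) {
  mean(picks$MVP)
}

picks <- predict_mvps(tot.log, dat_totals)
mvp_accuracy(picks)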

The first function makes predictions for each player, then finds the player with the highest predicted odds in each season and awards him the MVP. The second function compares the chosen MVPs to the correct choice, and returns the accuracy. Now let’s test the accuracy of our two models.

Accuracy and error count for total stats and advanced stats models

Our total stats model reports an accuracy of 84 percent, while the advanced stats model was only able to correctly predict the MVP winner about two-thirds of the time. I was surprised to see the advanced stats model performing worse given that advanced statistics showed higher correlation with MVP votes in our preliminary analysis. To better judge what’s going on here, let’s look at the model coefficients for the advanced stats model.

Model coefficients for advanced stats logistic regression classifier

Strangely, the model coefficients suggest that higher PER and BPM actually decrease a player’s odds of becoming MVP. This is obviously false, and so something must be wrong. Let’s investigate by looking at the PER leader for each season.

PER leaders by season

While this list certainly has some expected names like Shaquille O’Neal and LeBron James, others like David Wingate and Jarnell Stokes seem out of place. Fortunately, the problem here is clear: players with unusually few games and minutes played throw off the advanced stat calculations.

The easy fix here is to simply drop observations that fail to meet a minimum number of games played. This is definitely a worthwhile step, but we can go further. Why consider players who have no chance of receiving the award at all? By training our model on every player who so much as set foot on the court each season, we dilute the model’s ability to highlight elite players, instead flooding it with rotation players.

At this point, I decided to implement a two-stage model pipeline. First, we determine which players are even in contention for the award, and then we predict which player within that shortlist has the highest odds of winning the MVP award. Since our dataset contains the voting share each player received, we can easily code a binary Shortlist variable as follows.

dat_totals$Shortlist <- dat_totals$Share != 0

We then fit a simple logistic regression model and predict the shortlist players.

tot.short.mod <- glm(Shortlist ~ G + MP + X3P + DRB + AST + BLK + TOV + PF + PTS + Team.Wins,
                     data = dat_totals, family = binomial(link = "logit"))
## grab the predicted shortlist
tot.shortlist <- dat_totals[which(predict(tot.short.mod, type = "response") > .75), ]

After doing the same for the advanced stats dataset, we’re ready to model just as we did before. Let’s see if our accuracy improves.

Accuracy and error count for total and advanced stats models

Our two-stage pipeline seems to have been a success, as our total stats model maintained the same accuracy while the advanced model significantly improved. In both cases, we correctly predict the NBA MVP out of nearly 400 players each season 84 percent of the time.

Of course, we’re committing a major data science sin: measuring model performance solely on the training data. It’s entirely possible our models are heavily overfitting and would be useless for future seasons. In order to test this, we have to employ cross-validation.

Cross-Validation

Season-wise leave-one-out cross-validation (man that’s a lot of hyphens)

While typically you might just split your data into a train and a test set, it’s a little trickier in this case given that we have only one positive observation (the actual MVP) per season. Holding out a handful of seasons would leave both a tiny test set and fewer MVP examples to train on. Instead, I employed leave-one-out cross-validation, treating each season as its own held-out test case.

To do this, I had to train each model n times, where n is the number of seasons. Each time, one season was held out from the training set and the prediction was made solely for that season, so every season gets a prediction from a model that has never seen the data it’s predicting on. To simplify this, I wrote an R function to quickly run through this procedure.
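
Sketched roughly, with the model formula passed in and the same assumed 0/1 MVP indicator as before:

# Hold out one season at a time, refit, and check whether that season's pick actually won
loo_cv_accuracy <- function(formula, dat) {
  seasons <- unique(dat$Season)
  hits <- sapply(seasons, function(yr) {
    train <- dat[dat$Season != yr, ]
    test  <- dat[dat$Season == yr, ]
    mod   <- glm(formula, data = train, family = binomial(link = "logit"))
    pick  <- test[which.max(predict(mod, newdata = test, type = "response")), ]
    pick$MVP
  })
  mean(hits)
}

loo_cv_accuracy(MVP ~ PER + TS. + X3PAr + FTr + TRB. + AST. + STL. + BLK. +
                  TOV. + USG. + WS + BPM + VORP + Team.Wins, adv.shortlist)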

Armed with this handy helper function, we can now cross-validate the accuracy for our previous models.

Leave-one-out cross-validation error for each model

Unsurprisingly, we see a decrease in model accuracy after applying cross-validation. That being said, we still have a reasonably accurate model, particularly with advanced stats. But why stop there?

At this stage, we have all the tools we need to fit and test a variety of different models. In my search for the ideal model, I fit linear, logistic, and polynomial regressions to the total stats, per-game stats, and advanced stats datasets, as well as to a merged dataset made up of the total and advanced stats. The results of each of these models are shown below.

Accuracy and leave-one-out cross-validation accuracy for each model fit

We can see that many of the models perform quite similarly after cross-validation, with our highest CV accuracy being 73 percent, which is shared by two different models — both using advanced stats. We can also see overfitting in other cases, such as the cubic polynomial models, which have the highest non-validated accuracy but some of the lowest validated scores.

Our trusty logistic advanced stats model is one of the two tied for the highest accuracy, so let’s stick with it. Now it’s time for the moment of truth: predicting the 2019 NBA MVP.

Prediction

Predicting the 2019 NBA MVP

In order to predict the 2019 NBA MVP, we’ll have to first load the advanced data for that year. Thankfully, we’ve prepared a function just for that in our data loading and processing file.

Once the data is loaded in, we simply run it through our two-stage pipeline: first, the top 10 candidates are selected to form the shortlist, and then each shortlisted player’s odds are predicted. The code below does all this and then turns the final predictions into percentages.
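
Roughly, and assuming the loader returns a data frame called adv.2019 and that an advanced-stats shortlist model adv.short.mod was fit the same way as tot.short.mod:

# Stage 1: shortlist the ten most likely candidates
adv.2019$Short.Odds <- predict(adv.short.mod, newdata = adv.2019, type = "response")
shortlist.2019 <- adv.2019[order(-adv.2019$Short.Odds), ][1:10, ]

# Stage 2: predict MVP odds within the shortlist and convert to percentages
shortlist.2019$MVP.Odds <- predict(adv.log, newdata = shortlist.2019, type = "response")
shortlist.2019$MVP.Pct  <- 100 * shortlist.2019$MVP.Odds / sum(shortlist.2019$MVP.Odds)

shortlist.2019[order(-shortlist.2019$MVP.Pct), c("Player", "MVP.Pct")]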

Now all that’s left to do is anxiously look at our results.

MVP prediction results for 2019

Ultimately, our model just barely fails to pick the actual 2019 MVP, Giannis Antetokounmpo, but how close the prediction scores for Harden and Giannis are reflects how close the MVP race actually was. In reality, Giannis earned roughly 35 percent of the vote share, while Harden reached just under 30 percent.

Our shortlist model also performed quite well, nearly perfectly predicting the top 10 vote-getters. The only mistake in the shortlist was including Kyrie Irving over Kawhi Leonard. So while we didn’t exactly make the correct final prediction, our two-stage pipeline still performed quite well.

Future Work

Model improvements and data visualization

While we’ve already covered plenty of different techniques and approaches in this article, there are still many paths I’d like to explore to improve this project. The most obvious one is using more complex models. A support vector machine or random forest might greatly outperform my simple logistic regression classifier.

The more ambitious improvement I’d like to make is to include social media data. Debate around the MVP often centers on how a player with superior numbers might lose to a stronger media narrative (see James Harden in 2015 or LeBron in 2011). Finding a way to quantify social media buzz might be an interesting avenue to explore this question, and could also improve our results.

Finally, the ultimate intention with this project is to create a website that tracks MVP odds over the season by updating predictions after each game. While I’ve already partly built the HTML interface, I still need to code the system to update the predictions after every match using the NBA API.

Thanks for reading! Let me know if you enjoyed this article and if I should make Part 2 in the future. If you’re interested, my entire codebase is available here.
