Predicting the outcome of NBA games with Machine Learning

How we used (and you can too) machine learning to better understand the role statistics play in sports.

Josh Weiner
Towards Data Science


When deciding on a final project for our Big Data Analytics class, my partners Jack Rosener, Jackson Joffe and I looked to combine an interest in sports with the principles learned throughout the semester. After a few days of discussion, we settled on a project that would aim to predict the outcome of NBA games. While implementing our goal, we found it helpful to distill the project into the following steps, each framed as a question:

  1. Scraping Relevant Data – where can we gather the relevant team and player statistics spanning several seasons?
  2. Cleaning and Processing the Data – how can we efficiently combine our scraped data so that it is both readable and usable?
  3. Feature Engineering – what additional metrics can we append to our datasets to help a reader understand, and an ML model predict, outcomes and trends in the data?
  4. Data Analysis – can we determine any collinearity or other relations within the data that may better inform our predictions?
  5. Predictions – which models and features would be most useful in developing an accurate prediction? Do we focus on team statistics or aggregated individual-player statistics?

Before we get into the nitty-gritty of our workflow, let’s take a moment to review and acknowledge the other work done on this exact topic. First, in 2013 Renato Torres of the University of Wisconsin-Madison set out to accomplish a similar goal, using different machine learning models to predict NBA season outcomes. He applied several techniques featured in our project, primarily feature reduction to eliminate multicollinearity from the available data, and compared models to find those with the highest accuracies. Like ours, his selected features included points scored; unlike ours, his had a particular focus on win-loss percentage at home and away. (We will explore our feature analysis later.)
A lot of other fantastic work on this has been done before and can be read here:

Cheng, G., Zhang, Z., Kyebambe, M., & Kimbugwe, N. (2016). Predicting the Outcome of NBA Playoffs Based on the Maximum Entropy Principle.
Jones, E. (2016). Predicting the Outcomes of NBA Games. North Dakota State University.
Fayad, A. Building My First Machine Learning Model | NBA Prediction Algorithm. Towards Data Science.
The Complete History of the NBA. FiveThirtyEight.

Through our research, we found that the best published model had a prediction accuracy of 74.1% (for playoff outcomes), with most others achieving an upper bound between 66% and 72% accuracy. Most published research, too, focused on predicting playoff outcomes — which may bias the data: playoff teams are more consistent across a number of stats throughout the regular season, so the expected outcomes of playoff games likely carry less variance. Critically, note that the upset rate across an entire NBA season averages 32.1%. In the playoffs, the upset rate — defined as the share of games won by the team with the lower regular-season record — drops to 22%. In other words, naively picking the favorite in every playoff game would already be right 78% of the time, so most NBA-playoff prediction models actually underperform that baseline. Because our project looked to predict the outcome of any NBA game and is playoff agnostic, we aimed to develop a model that could reach and hopefully beat 67.9% accuracy (the equivalent pick-the-favorite baseline for the full season) — and in doing so predict some upsets.

Feel free to follow along here or view our files on GitHub.

Scraping our Data

We scraped our data from Synergy Sports — which has extremely detailed team and player data for each game played since the 2008–2009 season. Scraping took several days due to rate limiting (we had to query the results of every game across 12 seasons); the results were initially compiled as JSON and finally saved to CSV files.

Author’s Note: We accessed data from Synergy Sports with funding from the University of Pennsylvania. As an unfortunate consequence of this, we are not at liberty to provide public access to our dataset. However, we have compiled a list of alternative datasets through which one can replicate and improve upon our data:
https://www.basketball-reference.com/leagues
https://www.kaggle.com/datasets/wyattowalsh/basketball
https://www.kaggle.com/datasets/nathanlauga/nba-games
https://www.kaggle.com/datasets/drgilermo/nba-players-stats

Cleaning the Data

We now had both player stats and team stats for each NBA season saved as separate CSV files. Our next step was to read in all this data and combine it into two large dataframes: one containing the player stats for the past 12 seasons, and the other containing the team stats. Once created, we cleaned the dataframes to remove invalid statistics (negative minutes) and columns that served little purpose to us (charges taken/committed, for example).

We then saved these new dataframes to CSV files, which allowed us (and you) to skip the laborious and lengthy steps of scraping and cleaning the data when restarting our notebook’s runtime:
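Below is a minimal sketch of what this read/combine/clean/save step can look like; the file patterns and column names (MIN, CHARGES_TAKEN) are illustrative, not our exact schema:

```python
import glob
import pandas as pd

# Read every per-season CSV and stack them into one large dataframe.
player_frames = [pd.read_csv(f) for f in sorted(glob.glob("player_stats_*.csv"))]
player_stats = pd.concat(player_frames, ignore_index=True)

team_frames = [pd.read_csv(f) for f in sorted(glob.glob("team_stats_*.csv"))]
team_stats = pd.concat(team_frames, ignore_index=True)

# Remove invalid rows (e.g. negative minutes) and low-value columns.
player_stats = player_stats[player_stats["MIN"] >= 0]
player_stats = player_stats.drop(columns=["CHARGES_TAKEN"], errors="ignore")

# Save the combined dataframes so we can skip scraping/cleaning next time.
player_stats.to_csv("all_player_stats.csv", index=False)
team_stats.to_csv("all_team_stats.csv", index=False)
```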

Feature Engineering

This is where the fun stuff begins. Our primary goal was to make all of the available data understandable: game-by-game rebounds for an entire team don’t help us much unless we can use that data in a higher-level analysis that leads us to our ultimate goal – predicting wins and losses. To that end, we sought to create five different features that we would use to understand how teams progressed and regressed throughout each season:

  1. Elo Ratings

This is perhaps the best existing method to relativize NBA team strength and performance over many seasons. The way Elo Ratings are calculated is simple: all teams start at a baseline rating of 1500 and gain or lose points after each game based on the result, with weights given to the point difference, upsets, and where the game was played. In essence, it’s a more sophisticated win-loss record. Most NBA-prediction models don’t look at Elo Ratings but instead amalgamate a simple win-loss record with several other stats. We wanted to use Elo to appropriately weight quality wins (and losses), while also recognizing that not all teams are created equal.
The exact formula is as follows:
If R_i is the current Elo rating of a team, its Elo rating after it has played its next game is defined as follows:

R_(i+1) = R_i + k × (S_team - E_team)

We calculate Elo for each team and each game for every season of data that we have.

Here, S_team is a state variable: 1 if the team wins, 0 if the team loses. E_team represents the expected win probability of the team, which is represented as:

E_team = 1 / (1 + 10^((R_opp - R_team) / 400))

k is a moving constant, dependent on both the margin of victory (MOV) and the difference in Elo ratings, following FiveThirtyEight’s NBA Elo adjustment:

k = 20 × ((MOV_winner + 3)^0.8) / (7.5 + 0.006 × Elo_diff_winner)

where MOV_winner is the winning team’s margin of victory and Elo_diff_winner is the winner’s Elo minus the loser’s.

It’s also important to note that Elo ratings carry over from season to season (as all teams are not created equal, good teams tend to stay good or at least decline gradually — very rarely do teams drop onto or off of the map). If R represents the final Elo of a team in one season, its Elo Rating at the beginning of the next season is approximately:

R_next = 0.75 × R + 0.25 × 1505

that is, each team is regressed a quarter of the way back toward the league mean of 1505, as FiveThirtyEight does.
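Putting the pieces together, here is a minimal Python sketch of the update rule described above; we omit the home-court adjustment for brevity, and the margin-of-victory multiplier follows FiveThirtyEight’s published formula:

```python
def expected_win_prob(r_team: float, r_opp: float) -> float:
    """E_team: expected win probability given both teams' Elo ratings."""
    return 1.0 / (1.0 + 10 ** ((r_opp - r_team) / 400.0))

def k_factor(margin_of_victory: float, winner_elo_diff: float) -> float:
    """Moving k, scaled by margin of victory and the rating gap."""
    return 20.0 * ((margin_of_victory + 3) ** 0.8) / (7.5 + 0.006 * winner_elo_diff)

def update_elo(r_team: float, r_opp: float, pts: int, opp_pts: int) -> float:
    """R_(i+1) = R_i + k * (S_team - E_team) for a single game."""
    s_team = 1.0 if pts > opp_pts else 0.0
    e_team = expected_win_prob(r_team, r_opp)
    winner_diff = (r_team - r_opp) if s_team == 1.0 else (r_opp - r_team)
    k = k_factor(abs(pts - opp_pts), winner_diff)
    return r_team + k * (s_team - e_team)

def carry_over(r_final: float, league_mean: float = 1505.0) -> float:
    """Regress a team's Elo 25% toward the league mean between seasons."""
    return 0.75 * r_final + 0.25 * league_mean
```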

Taking a look at this metric over time for three randomly selected teams, we can immediately get key insights about the strength of teams across seasons:

Here we can see that Elo Ratings track quite well with teams’ performances in given seasons: the years in which Golden State and Cleveland met and dueled in the NBA Finals are apparent from the upper peaks of their Elo Ratings. We can also see what most basketball analysts widely confirmed at the time: that the Western Conference was much tougher than the East — as exhibited by the influence of quality wins on Elo for the Warriors versus the Cavaliers. Finally, we can see how quickly these teams fell after their championship seasons as they suffered roster losses and injuries. (Image by Author)

2. Recent Team Performance (Avg. stats over 10 most recent games)

These are pretty self-explanatory: we simply average each team’s stats over its last 10 games. To do this we wrote a simple function that computes a sliding average of a given team’s stats over a window of n games:
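A minimal sketch of such a sliding-average helper, assuming a dataframe sorted chronologically with a TEAM column (column names are illustrative):

```python
import pandas as pd

def recent_performance(df: pd.DataFrame, stat_cols: list, n: int = 10) -> pd.DataFrame:
    """Replace each team's raw per-game stats with the average of its
    previous n games, shifted by one so a game never sees its own stats."""
    out = df.copy()
    out[stat_cols] = (out.groupby("TEAM")[stat_cols]
                         .transform(lambda s: s.shift(1).rolling(n, min_periods=1).mean()))
    return out

# e.g. team_recent = recent_performance(team_stats, ["PTS", "REB", "AST"], n=10)
```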

After saving this data into a new dataframe, we split each game (which contains stats for both the home and away teams) into one row per team, which makes group-by aggregations of team stats much easier and simplifies existing features. Finally, we added a win state-variable column to capture the most critical measurement of our project: wins and losses.
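A sketch of that reshaping step, assuming a games dataframe with one row per game and separate home/away columns (all names illustrative):

```python
import pandas as pd

# One row per game (home & away columns) -> one row per team per game.
home = games[["GAME_ID", "HOME_TEAM", "HOME_PTS", "AWAY_PTS"]].rename(
    columns={"HOME_TEAM": "TEAM", "HOME_PTS": "PTS", "AWAY_PTS": "OPP_PTS"})
away = games[["GAME_ID", "AWAY_TEAM", "AWAY_PTS", "HOME_PTS"]].rename(
    columns={"AWAY_TEAM": "TEAM", "AWAY_PTS": "PTS", "HOME_PTS": "OPP_PTS"})
by_team = pd.concat([home, away], ignore_index=True)

# The win state variable: 1 if the team outscored its opponent, else 0.
by_team["WIN"] = (by_team["PTS"] > by_team["OPP_PTS"]).astype(int)
```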

3. Recent Player Performance (Avg. stats over 10 most recent games)

We created our player_recent_performance dataframe using methods similar to the section above, this time for individual players rather than teams. The result is a dataframe of each player’s average performance over their last 10 games.

4. Player Season Performance

We also sought to include each player’s average stats over the entire season: unlike teams, individual players get injured or fall in and out of the rotation, so it is perhaps more important for us to understand how a player’s performance in individual games tracks against his averages. We use this later in our models to see whether it allows for accurate predictions at the team level.
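One way to compute such season-to-date averages in pandas (column names illustrative), shifting by one game so a row never includes its own stats:

```python
# Expanding mean over each player's previous games in the season.
# (player_df, GAME_DATE, PLAYER_ID, SEASON, PTS are illustrative names.)
player_df["SEASON_AVG_PTS"] = (
    player_df.sort_values("GAME_DATE")
             .groupby(["PLAYER_ID", "SEASON"])["PTS"]
             .transform(lambda s: s.shift(1).expanding().mean())
)
```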

5. Player Efficiency Ratings (PER)

Critically, just as we had done with teams via Elo Ratings, we wanted to be able to relativize player performance using a metric that combines seemingly unrelated statistics. Our hope was that we could use Hollinger’s Player Efficiency Ratings to compare and predict team performance by the aggregated PER scores of their players. In the NBA, it is easy for players to experience wildly inflated or deflated statistics (such as points per minute) simply by virtue of the amount of playing time they get, whether against bench players or starters, the number of games played, or even outlier performances. We did not want to rely solely on player averages because of how easily they skew. PER solves that problem by weighting certain in-game statistics and dividing by minutes played, yielding a per-minute measure of player performance.
Thus for each player, we added a column for PER in a given game according to the following formula:

PER = (FGM × 85.910 + Steals × 53.897 + 3PTM × 51.757 + FTM × 46.845 + Blocks × 39.190 + Offensive_Reb × 39.190 + Assists × 34.677 + Defensive_Reb × 14.707 - Fouls × 17.174 - FT_Miss × 20.091 - FG_Miss × 39.190 - TO × 53.897) × (1 / Minutes)
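Translated directly into Python (the zero-minutes guard is our addition, to avoid division by zero for players who didn’t take the floor):

```python
def per(fgm, steals, threes_made, ftm, blocks, off_reb, assists, def_reb,
        fouls, ft_miss, fg_miss, turnovers, minutes):
    """Hollinger's Player Efficiency Rating for a single game,
    using the weighted formula above."""
    if minutes == 0:
        return 0.0
    total = (fgm * 85.910 + steals * 53.897 + threes_made * 51.757
             + ftm * 46.845 + blocks * 39.190 + off_reb * 39.190
             + assists * 34.677 + def_reb * 14.707 - fouls * 17.174
             - ft_miss * 20.091 - fg_miss * 39.190 - turnovers * 53.897)
    return total / minutes
```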

Data Analysis

Our data analysis was centered around the use of Elo Ratings as our test metric. Essentially, could we be confident that Elo correlates with and correctly aggregates other statistics? Furthermore, would it be more appropriate for us to predict game outcomes using team stats (Elo Ratings) or averaged player stats (PER Ratings)?

First, let’s explore the density of Elo Ratings across the NBA on a per-season basis. This tells us a little about the level of parity across the league: if Elo Ratings approach a normal distribution, that would suggest the league’s teams are relatively well matched; otherwise, large disparities point to the development of “super-teams”.
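As a sketch of how these per-season density plots can be produced, assuming the Elo data lives in a dataframe elo_df with SEASON and ELO columns (illustrative names, not our exact schema):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One KDE curve of team Elo ratings per season.
fig, ax = plt.subplots(figsize=(10, 6))
for season, grp in elo_df.groupby("SEASON"):
    sns.kdeplot(x=grp["ELO"], label=str(season), ax=ax)
ax.set_xlabel("Elo Rating")
ax.set_ylabel("Density")
ax.legend(title="Season", fontsize="small")
plt.show()
```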

Twelve seasons of league Elo densities. (Image by Author)

Moving away from an understanding of Elo Ratings from a league-perspective, we endeavored to see how Elo Ratings tracked against an individual team’s performance in other statistics.

First, we looked to plot the distribution of Elo for a random team against the average number of points scored in recent games:

(Image by Author)

We can see from this that there is some correlation between the average number of points a team scores and its Elo Rating — the higher the average points scored across a window of games, the higher the Elo Rating seems to climb. However, we can also see that Elo may exhibit a high variance across similar scoring figures. So, to better understand how Elo Ratings track with points scored, we examined how a team’s average points compare to season averages across the league – from there we can determine whether points scored improves Elo, provided that high scoring is relative to the rest of the league. To do this, let’s look at that same team over the same seasons and plot the distribution of its points scored against that of its opponents.

(Image by Author)

This confirms our suspicions: when the distribution of a team’s average points is greater than that of its opponents, or is concentrated at an equal or higher level, the Elo is higher for those seasons. When the groupings approach an even or lesser value, those seasons’ Elo Ratings for the given team are lower. Therefore, average points scored alone is a decent predictor of game outcomes, but a better one when relativized. This demonstrated for us that Elo, which is by design a relative statistic, would be a much better predictor of wins than points.

Shifting away from team statistics, we sought to understand if Elo tracks better with player performance than it does team performance. To do this, we took a similar approach to how we plotted Elo Ratings with average points scored for the same random team, this time with PER.

(Image by Author)

From the plotted data, we can see that aggregated PER compared to opponents doesn’t show much of any correlation with the strength of a team as determined by Elo Rating. Instead, points scored translates better — which makes some sense, as a player’s efficiency isn’t necessarily tied to scoring the most points — and outscoring opponents is what wins games and therefore moves Elo.

We can see this further by mapping the Orlando Magic’s mean and median PER ratings against its opponents’ over the same seasons, finding that there is almost no relation between team-PER averages or medians and team strength.

From these and the above plotted distribution, we see that average PER — while having a slight correlation to Elo Ratings throughout the season — generally shows us little about how individual player efficiency impacts team strength when tracked against opponents. (Image by Author)
Median PER ratings show even less of a correlation to Elo Ratings throughout a given season. Here we can observe that in winning seasons (2011–12), the Orlando Magic had a lower median PER than its opponents for most games, yet they had their highest Elo Ratings and best record in recent years. (Image by Author)

From all of our analysis of relativized team statistics versus aggregated player statistics, it seemed clear that Elo Ratings and their determinants would be the better features to train our models on for predicting the outcome of NBA games.

Predicting the Outcome of Games Based on Team Statistics and Elo Ratings

Our first step here was to split our data into features and labels. Once split, we used sklearn to randomly divide the data into train and test sets with an 80:20 ratio, as sketched below.
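A minimal sketch, assuming the combined team-stats-plus-Elo dataframe is called games_df and the label column is WIN (both names illustrative):

```python
from sklearn.model_selection import train_test_split

# Features = recent team stats + Elo ratings; label = the win state variable.
X = games_df.drop(columns=["WIN"])
y = games_df["WIN"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```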

The first model we aimed to use to predict the outcome of an NBA game was a Logistic Regression model. Unlike a Linear Regression model, which predicts outcomes on a continuous range of values between (and sometimes outside) 0 and 1, a Logistic Regression model groups predictions into binary outcomes. Since we are predicting wins and losses, this type of classification suits us perfectly.

To begin, we used a simple LogisticRegression model with default hyperparameters, trained on our team stats and Elo Ratings using sklearn:
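A minimal baseline, reusing the split above:

```python
from sklearn.linear_model import LogisticRegression

# Default hyperparameters; the small default max_iter can trigger a
# convergence warning on unscaled stats, which is why we tuned it later.
lr = LogisticRegression()
lr.fit(X_train, y_train)
print(f"Test accuracy: {lr.score(X_test, y_test):.4f}")
```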

After playing around with some hyperparameter tuning, we found that setting max_iter=131 (with verbose=2 for training logs) slightly improved our initial testing accuracy to 66.95%. Definitely not bad for an untuned model, and very close to our desired prediction accuracy. However, we wanted to see whether further hyperparameter tuning could improve our overall accuracy. Essentially, we would try out many combinations of possible hyperparameters on our data to find the best settings for our LR model.

We accomplished this using cross-validation: because we only had a vague idea of which parameter values might work, our best approach was to narrow the search by evaluating a wide range of values for each hyperparameter.

Using RandomizedSearchCV, we searched among 2 × 4 × 5 × 11 × 3 × 3 × 5 × 3 = 59,400 possible settings – far too many to check exhaustively, so the most efficient approach was to take a random sample of the values.
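Here is a minimal sketch of that search. The grid below is illustrative (our real grid had the eight dimensions whose sizes multiply to 59,400); swapping in GridSearchCV with the same values as a param_grid checks every combination exhaustively instead:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Illustrative hyperparameter distributions, not our exact grid.
param_dist = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1.0, 10.0],
    "solver": ["liblinear", "saga"],
    "max_iter": [100, 131, 200, 500, 1000],
}

search = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions=param_dist,
    n_iter=50,        # sample 50 random combinations instead of all of them
    cv=3,
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
print(f"Test accuracy: {search.best_estimator_.score(X_test, y_test):.4f}")
```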

Running our model with the best parameter values from the random samples actually decreased the accuracy of our model to 66.27%, which showed us that while random sampling helped narrow our hyperparameter tuning to a distribution, we would have to explicitly check all combinations with GridSearchCV.

In this case, implementing GridSearch only marginally increased our accuracy with our LR model.

The second model we looked to implement was a RandomForestClassifier; random forests can be used efficiently for both regression and classification. In this case, we wanted to see whether the classifier could build proper decision trees to determine wins from the given team stats.
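A baseline fit, reusing the same train/test split:

```python
from sklearn.ensemble import RandomForestClassifier

# Default hyperparameters; tuned below with RandomizedSearchCV/GridSearchCV.
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print(f"Test accuracy: {rf.score(X_test, y_test):.4f}")
```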

Out of the box, the RandomForestClassifier reached an initial accuracy of 66.95%, which again is pretty good. As with the LR model, we attempted to tune the hyperparameters to get more accurate results — first using RandomizedSearchCV.

Unlike with the LR model, we find that RandomizedSearch improves our hyperparameter tuning, giving us a better accuracy of 67.15%.

Running GridSearchCV in a similar manner to what we did above, we also explicitly tested 2 × 1 × 6 × 2 × 3 × 3 × 5 = 1,080 combinations of settings instead of randomly sampling a distribution. GridSearch also gave us an improvement over the base RandomForestClassifier, with an accuracy of 67.11%.

Overall, when running both a LogisticRegression and a RandomForestClassifier on the team stats and Elo Ratings, we achieved win-prediction accuracies of 66.95%–67.15%. For basketball games, which as we established earlier are quite variable in their actual versus predicted results, this is a significant result.

Predicting the Outcome of Games Based on Individual Player Statistics and Scoring

We then took a different approach to predicting the outcome of a game, to see whether we could achieve better performance. Using the larger dataset of individual player statistics that we collected, we trained a model to predict how many points a player will score in a given game, based on the player’s average season stats up until that game as well as his average performance over the past 10 games — both created in the feature engineering section above. We also made use of Elo Ratings: presumably, the higher the opposing team’s rating, the fewer points a player will score. With this model, we can predict how many points a team will score in a game by summing the predicted points of each of its players, and with this information predict which team will score more points and thus win the game.

Before running our models, we needed to clean our data slightly. For some games in this dataset, we have the statistics for one team’s players but not the other’s (generally only for that other team’s first game of the season). We therefore removed all these games from the dataset.

Unlike before, we couldn’t randomly split our data into train and test sets: since we use individual player statistics to predict a team’s final score, all players appearing in the same game must stay together. To do this, we split our train and test sets by game, so that players playing in the same game stay together, with about 80% of games in the train set and 20% in the test set:
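A minimal sketch of such a game-level split, assuming the player dataframe has a GAME_ID column (names illustrative):

```python
import numpy as np

# Shuffle the unique game IDs, then assign whole games to train or test
# so that all players from one game land in the same set.
rng = np.random.default_rng(42)
game_ids = player_df["GAME_ID"].unique()
rng.shuffle(game_ids)

n_train = int(0.8 * len(game_ids))
train_ids = set(game_ids[:n_train])

train_df = player_df[player_df["GAME_ID"].isin(train_ids)]
test_df = player_df[~player_df["GAME_ID"].isin(train_ids)]
```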

Instead of a Logistic Regression model, for player scoring we used a Linear Regression model, as we are predicting a range of possible values (points scored) rather than simply a win or a loss. Our RMSE (Root Mean Squared Error) across all players was 5.56, the equivalent of each player making or missing around 2–3 baskets per game relative to his averages.
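A sketch of the regression and the RMSE computation; the feature column names are illustrative stand-ins for the season-to-date averages, last-10-game averages, and opponent Elo described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

features = ["SEASON_AVG_PTS", "LAST10_AVG_PTS", "OPP_ELO"]  # illustrative
reg = LinearRegression()
reg.fit(train_df[features], train_df["PTS"])

preds = reg.predict(test_df[features])
rmse = np.sqrt(mean_squared_error(test_df["PTS"], preds))
print(f"RMSE: {rmse:.2f}")  # we observed about 5.56 points per player
```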

On the test set, we summed each team’s predicted scoring for each game and compared it with the actual scores. Comparing the predicted winner against the actual winner gave us 1483/2528 correct, an accuracy of 58.66%. Clearly, as we realized earlier when looking at PER distributions of teams versus their opponents, aggregated player performance is too variable a signal to accurately predict the outcome of games — especially compared to team performance, which tends to be more consistent across games.
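A sketch of that aggregation step, continuing from the regression above:

```python
# Sum predicted player points into team totals per game, pick the higher
# total as the predicted winner, and compare with the actual winner.
scored = test_df.assign(PRED_PTS=preds)
totals = (scored.groupby(["GAME_ID", "TEAM"])
                .agg(pred=("PRED_PTS", "sum"), actual=("PTS", "sum"))
                .reset_index())

pred_win = totals.loc[totals.groupby("GAME_ID")["pred"].idxmax()]
true_win = totals.loc[totals.groupby("GAME_ID")["actual"].idxmax()]
correct = (pred_win.set_index("GAME_ID")["TEAM"]
           == true_win.set_index("GAME_ID")["TEAM"])
print(f"Accuracy: {correct.mean():.4f}")  # we got 1483/2528 ≈ 58.66%
```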

Conclusions and Future Considerations

As avid NBA fans, we felt that creating a model to predict the outcome of NBA games would be an interesting project, and it taught us a lot about building classifiers for professional sports game outcomes. We were able to apply many of the concepts learned in our Big Data Analytics class — including scraping, data cleaning, feature analysis, model building and hyperparameter tuning — and want to thank Professor Ives for his fantastic teaching throughout the semester.

Our Random Forest Classifier, with parameters optimized through RandomizedSearchCV, gave us the highest testing accuracy of 67.15%. It is slightly higher than the Logistic Regression model, and much higher than the Linear Regression model based on individual player statistics. Optimizing parameters using GridSearchCV and RandomizedSearchCV was time-consuming and computationally costly, and it resulted in only marginal changes in testing accuracy. If we had more time, we’d likely spend less of it optimizing parameters and more of it on model selection.

The best NBA game prediction models only accurately predict the winner about 70% of the time, so our logistic regression model and random forest classifier are both very close to the upper bound of predictions that currently exist. If we had more time, we would explore other models and see just how high of a test accuracy we could get. Some of those candidates include an SGD Classifier, linear discriminant analysis, convolutional network, or a naïve Bayes classifier.

Hopefully, you enjoyed reading about our work as much as we enjoyed making it — and learned something from it too.
