Predicting NBA Salaries with Machine Learning

Building a machine learning model with Python to predict NBA salaries and analyze the most impactful variables

Gabriel Pastorello
Towards Data Science


(Photo by Emanuel Ekström on Unsplash)

The NBA stands out as one of the most lucrative and competitive leagues in sports. In the last few years, the salaries of NBA players have been on an ascending trend, but behind every awe-inspiring dunk and three-pointer lies a complex web of factors that determine these salaries.

From player performance and team success to market demand and endorsement deals, numerous variables come into play. Who has never wondered why their team spent so much on an underperforming player, or marveled at a successful deal?

In this article, we use machine learning with Python to predict NBA salaries and uncover the factors with the most impact on players’ earnings.

All the code and data used are available on GitHub.

Understanding the problem

Before diving into the problem, it is essential to grasp the fundamentals of the league’s salary system. When a player is available on the market to sign a contract with any team, he is known as a free agent (FA), a term that will be used a lot in this project.

The NBA operates under a complex set of rules and regulations that aim to maintain competitive balance among teams. Two key concepts are at the core of this system: the salary cap and the luxury tax.

The salary cap serves as a spending limit, restricting how much a team can spend on player salaries in a given season. The cap is determined by the league’s revenue, and it is updated every year to ensure that teams operate within a reasonable financial framework. It also intends to prevent large-market teams from significantly outspending smaller-market counterparts, promoting parity among franchises.

The distribution of the salary cap among players can vary, with maximum salaries for top-tier players and minimum salaries for rookies and veterans.

However, exceeding the salary cap is not uncommon, especially for teams aiming to construct championship-contending rosters. When a team surpasses the salary cap, it enters into the realm of the luxury tax. The luxury tax imposes a penalty on teams that spend above a certain threshold, discouraging teams from excessive spending while also providing additional revenue for the league.

There are many other rules that act as exceptions, like the mid-level exception (MLE) and the trade exception, which allow teams to make strategic roster moves, but for this project the knowledge of the salary cap and luxury tax is enough.

NBA Salary Cap Evolution from 1984 to 2023 (Image by Author)

Due to this continuous growth of the salary cap, the selected approach uses the percentage of the cap as the target instead of the salary amount itself. This decision incorporates the cap’s evolving nature, ensuring that the outcome remains unaffected by temporal shifts and stays applicable even when evaluating historical seasons. However, it should be noted that this is not perfect, only an approximation.
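A minimal sketch of this target construction, with approximate cap figures and illustrative column names, simply divides each salary by the cap of the corresponding season:

```python
import pandas as pd

# Approximate salary-cap figures (USD), keyed by the season's end year.
SALARY_CAP = {2021: 109_140_000, 2022: 112_414_000, 2023: 123_655_000}

def add_cap_percentage(df: pd.DataFrame) -> pd.DataFrame:
    """Express each salary as a fraction of that season's cap (the model target)."""
    df = df.copy()
    df["cap"] = df["season"].map(SALARY_CAP)
    df["pct_cap"] = df["salary"] / df["cap"]
    return df

players = pd.DataFrame(
    {"player": ["A", "B"], "season": [2022, 2023], "salary": [5_000_000, 20_000_000]}
)
print(add_cap_percentage(players))
```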

Data

For this project, the goal is to predict the salaries for players signing a new contract next season using data only from the previous season.

The individual statistics used were:

  • Average statistics per game
  • Total statistics
  • Advanced statistics
  • Individual variables: age, position
  • Salary-related variables: salary of the previous season, the salary cap for the previous and current seasons, and the percentage of the cap that salary represented.

Salary distribution for the 2022–23 season (Image by Author)

Since we do not know which team the player will sign with, only individual features were included.

In total, this study had 78 features for each player plus the target.
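As a rough sketch, assuming the per-game, total, advanced, and salary tables have already been scraped into pandas DataFrames that share a `player` column (column names are illustrative), the feature table can be assembled with a few merges:

```python
import pandas as pd

def assemble_features(per_game: pd.DataFrame, totals: pd.DataFrame,
                      advanced: pd.DataFrame, salaries: pd.DataFrame) -> pd.DataFrame:
    """Merge the per-game, total, advanced, and salary tables into one row per player."""
    df = per_game.merge(totals, on="player", suffixes=("_per_game", "_total"))
    df = df.merge(advanced, on="player")
    df = df.merge(salaries, on="player")
    return df
```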

Almost all data was obtained using BRScraper, a recent Python package created by me that allows scraping and easy access to basketball data from Basketball Reference, including the NBA, G League, and other international leagues. All guidelines concerning harming the website or impeding its performance were followed.

Data Treatment

One interesting aspect to consider is the selection of players for training the models. Initially, I selected all available players; however, most of them were already under contract, and in that case the value of the salary does not change drastically.

For example, imagine a player signs a four-year, $20M contract. He receives approximately $5M per year (very rarely are all years the same exact value; usually there is a certain progression around $5M). However, when a free agent signs a new contract, the value may change much more drastically.

This means that training a model with all available players may result in better performance overall (after all, most players would have a salary very close to their previous one!), but when evaluating only free agents, the performance would be significantly worse.

Since the goal is to predict the salary of a player signing a new contract, only this kind of player should be in the data, so the model can better learn the patterns among these players.
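A hedged sketch of that filtering step, assuming a hypothetical boolean `is_free_agent` flag and a `salary` column in the player table:

```python
import pandas as pd

def keep_new_contracts(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only players who hit free agency and actually signed a new contract."""
    mask = df["is_free_agent"] & df["salary"].notna()
    return df.loc[mask].reset_index(drop=True)
```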

The season of interest is the upcoming 2023–24 season, but data from 2020–21 onwards is used to increase the number of samples, which is possible thanks to the choice of target. Older seasons were not used due to the lack of data on FAs.

This leaves 426 players in the three seasons selected, 84 being FAs from 2023–24.

Modeling

The train-test split was designed so that all free agents from 2023–24 were exclusively included in the test set, maintaining an approximately 70/30 split.
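A sketch of that split logic, assuming hypothetical `season`, `player`, and `pct_cap` columns: the 2023–24 free agents are held out entirely, and the test set is then topped up with earlier-season players until it reaches roughly 30% of the data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def make_split(df: pd.DataFrame, target: str = "pct_cap", test_ratio: float = 0.3,
               holdout_season: int = 2024, seed: int = 42):
    """Force every free agent of the holdout season into the test set, then fill
    the rest of the test set with random players from earlier seasons."""
    holdout = df[df["season"] == holdout_season]
    rest = df[df["season"] != holdout_season]

    n_extra = max(int(round(test_ratio * len(df))) - len(holdout), 0)
    if n_extra > 0:
        rest_train, rest_test = train_test_split(rest, test_size=n_extra, random_state=seed)
    else:
        rest_train, rest_test = rest, rest.iloc[:0]

    train, test = rest_train, pd.concat([holdout, rest_test])
    features = [c for c in df.columns if c not in (target, "player", "season")]
    return train[features], test[features], train[target], test[target]
```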

Initially, several regression models were used:

  • Support Vector Machines (SVM)
  • Elastic Net
  • Random Forest
  • AdaBoost
  • Gradient Boosting
  • Light Gradient Boosting Machine (LGBM)

The performance of each one of them was evaluated using the root mean squared error (RMSE) and the coefficient of determination (R²).

You can find the formula and explanation of each metric in my previous article, Predicting the NBA MVP with Machine Learning.
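A minimal benchmarking loop along these lines, with default hyperparameters (the actual tuning may differ), could look like this:

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR

def benchmark(X_train, y_train, X_test, y_test):
    """Fit each candidate regressor and report RMSE and R² on the test set."""
    models = {
        "SVM": SVR(),
        "Elastic Net": ElasticNet(),
        "Random Forest": RandomForestRegressor(random_state=42),
        "AdaBoost": AdaBoostRegressor(random_state=42),
        "Gradient Boosting": GradientBoostingRegressor(random_state=42),
        "LGBM": LGBMRegressor(random_state=42),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, pred))
        print(f"{name:>18}: RMSE = {rmse:.4f}, R² = {r2_score(y_test, pred):.3f}")
```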

Results

Looking at the whole dataset with all seasons, the following results were obtained:

RMSE and R² values obtained among the different models (Image by Author)

The models had an overall good performance, with Random Forest and Gradient Boosting obtaining the lowest RMSE and highest R², while AdaBoost had the worst metrics among the models used.

Variables Analysis

An effective approach for visualizing the key variables influencing the model’s predictions is through SHAP values, a technique that provides a reasonable explanation of how each feature impacts the model’s predictions.

Again, a deeper explanation about SHAP and how to interpret its chart can be found in Predicting the NBA MVP with Machine Learning.
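As a sketch, the SHAP values for a fitted tree ensemble such as the Random Forest can be computed with a TreeExplainer and visualized with the summary plot (the function and variable names here are illustrative):

```python
import shap

def plot_shap_summary(fitted_forest, X_test):
    """Compute SHAP values for a fitted tree model and show the summary chart."""
    explainer = shap.TreeExplainer(fitted_forest)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test, max_display=20)
```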

SHAP chart related to the Random Forest model (Image by Author)

We can draw some important conclusions from this chart:

  • Minutes per game (MP), points per game (PTS), and total points are the three most impactful features.
  • The salary of the previous season (Salary S-1) and the % of the cap of that salary (% Cap S-1) are also very impactful, ranking 4th and 5th respectively.
  • Advanced statistics are not predominant among the most important features, with only two appearing on the list: WS (Win Shares) and VORP (Value Over Replacement Player).

This is a very surprising result: unlike in the MVP project, where advanced statistics dominated SHAP’s final result, player salaries appear much more related to common statistics like minutes, points, and games started.

This is surprising because most advanced statistics were designed exactly with the objective of better evaluating a player’s performance. The absence of PER (Player Efficiency Rating) among the top 20 (it appears at 43rd place) is particularly striking.

It raises the possibility that during salary negotiations, general managers might be adhering to a relatively simpler approach, potentially overlooking the broader spectrum of performance evaluation metrics.

Maybe the problem is not so complex after all! Simply put, the player who plays the most minutes and scores the most points earns more!

Additional Results

Focusing on this year’s free agents and comparing their predictions with the actual salaries:

Main results of Random Forest model for 2023–24 season (values in millions) (Image by Author)

At the top are the five players who appear most undervalued (receiving less than they should), in the middle are five players who appear correctly valued, and at the bottom are the five players who appear most overvalued (receiving more than they should). It is important to note that these assessments are based solely on the model’s outputs.
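A sketch of how such a ranking can be built: the predicted percentage of the cap is converted back to dollars using an approximate 2023–24 cap figure, and players are sorted by the gap between predicted and actual salary (column names are illustrative):

```python
import pandas as pd

CAP_2024 = 136_021_000  # approximate 2023-24 salary cap (USD)

def rank_contracts(test_df: pd.DataFrame, predicted_pct_cap) -> pd.DataFrame:
    """Rank players by how far the predicted salary sits above (undervalued)
    or below (overvalued) the actual salary."""
    out = test_df[["player", "salary"]].copy()
    out["predicted_salary"] = predicted_pct_cap * CAP_2024
    out["difference"] = out["predicted_salary"] - out["salary"]
    return out.sort_values("difference", ascending=False)
```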

Starting from the top, the former MVP Russell Westbrook is the most undervalued player according to the model, which in my opinion is accurate, as he signed a ~$4M/year contract with the Clippers. Eric Gordon, Mason Plumlee, and Malik Beasley are in a similar situation, earning very small contracts despite good performances. D’Angelo Russell also appears in this top five despite earning $17M/year, which indicates that he should be earning even more.

It is interesting to note that all of these players signed with contending teams (Clippers, Suns, Bucks, and Lakers). This is a known behavior where players choose to earn less for the chance to play for a team that can win the title.

In the middle, Taurean Prince, Orlando Robinson, Kevin Knox, and Derrick Rose all earn small salaries that seem to be adequate. Caris LeVert earns $15M/year, but also appears to be worth exactly that.

At the bottom, Fred VanVleet was identified as the most overvalued player. The Rockets, operating as a rebuilding team, made a notable move with his new three-year contract valued at $128.5M. They also signed Dillon Brooks for a value higher than expected.

Khris Middleton signed a big extension this summer. Despite being contenders, the Bucks belong to a smaller market and cannot afford to lose one of their best players. Draymond Green and Cameron Johnson are in similar situations with their respective teams.

Conclusions

Predicting outcomes in sports is consistently challenging. From the choice of the target to the selection of players, this project proved to be more complex than expected. However, the outcome proved to be quite simple and the results obtained were very satisfying!

Certainly, there are multiple ways the results can be improved, one of them being the use of feature selection or dimensionality reduction techniques to shrink the feature space and thus the variance.
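As one possible sketch of that idea (the number of components is an arbitrary placeholder), a scikit-learn pipeline could standardize the features and compress them with PCA before the regressor:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the 78 features, project them onto fewer components, then regress.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    RandomForestRegressor(random_state=42),
)
```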

Moreover, having access to the free agents of previous seasons would also make it possible to increase the number of samples. However, such data doesn’t seem to be publicly accessible at the moment.

Many other external variables also influence this matter. For instance, if the team were somehow known, variables like last year’s seed, playoff outcome, and percentage of the cap already used could be very informative. However, maintaining an approach that mirrors an actual free-agent scenario, where the team is unknown, may yield a result that aligns more closely with the player’s “real value”, regardless of the signing team’s context.

One of the main premises of this project was the use of only data from the previous season to predict the next salary. Incorporating statistics from older seasons could indeed yield improved results, given that a player’s historical performance can offer valuable insights. However, the expansive nature of such data would necessitate thoughtful feature selection to manage the complexity and high dimensionality.

Again, all the code and data used are available on GitHub.

(Photo by Marius Christensen on Unsplash)

I’m always available on my channels (LinkedIn and GitHub).

Thanks for your attention!👏

Gabriel Speranza Pastorello

