NBA Draft Analysis: Using Machine Learning to Project NBA Success

An attempt at using machine learning to project success in the NBA.

Saadan Mir
Towards Data Science



Introduction

It’s no secret that one of the most crucial parts of building a successful sports franchise is acquiring the most talented players possible. For NBA teams, the most surefire way to acquire this talent is with a selection in the NBA draft. Once a year, NBA teams and their fans get filled with the hope that they can bring in a college star who can translate their success to the professional level.

However, in reality, the NBA draft is a very inexact science and one that often leads to players with high expectations failing to live up to them. As an avid fan of the NBA draft, I wanted to determine if it was possible to utilize a draft prospect’s college career to assist in the decision-making process when evaluating their NBA future. This seemed like a great intersection of data analysis and sports for me, and an opportunity to gain new skills along the way.

I knew that my one-stop shop for basketball statistics would be Basketball-Reference. After a bit of research, I discovered the Beautiful Soup library for Python, which looked like an efficient way to pull data from Basketball-Reference.

NOTE: All code used for this post can be found on GitHub, organized in this notebook.

Collecting the Data

As mentioned previously, I obtained the NBA data for this project from Basketball-Reference. I also needed NCAA basketball statistics for my analysis, which I obtained from Sports-Reference College Basketball. To get this data I used the Python library Beautiful Soup, which allows for pulling data out of HTML and XML files. My first step was to obtain data from past NBA Drafts. I chose to look at the past 20 NBA Drafts initially, mainly because the NBA game changes so rapidly and I wanted to predict success in the current NBA landscape.

I obtained this data from each year’s draft page on Basketball-Reference, whose URL follows the pattern https://www.basketball-reference.com/draft/NBA_{draft_year}.html. To build a dataset covering several NBA drafts, I used a for loop to iterate through a list of years and fill in the URL for each one. Once finished, I had a dataset of all players drafted in the past 20 NBA drafts along with their NBA career statistics.
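A minimal sketch of that loop, not the notebook's actual code. It assumes the draft page's first table holds one row per player and that each cell carries a `data-stat` attribute naming its column; both are assumptions about the page markup, and the function names are hypothetical.

```python
import time
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

DRAFT_URL = "https://www.basketball-reference.com/draft/NBA_{draft_year}.html"


def parse_draft_table(html, draft_year):
    """Return one dict per player row found in the first table of the page."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    body = table.find("tbody") or table
    players = []
    for row in body.find_all("tr"):
        # Header rows repeated inside the body contain only <th> cells; skip them.
        if row.find("td") is None:
            continue
        # Assumption: every cell names its column in a data-stat attribute.
        cells = {
            cell["data-stat"]: cell.get_text(strip=True)
            for cell in row.find_all(["th", "td"])
            if cell.has_attr("data-stat")
        }
        cells["draft_year"] = draft_year
        players.append(cells)
    return players


def scrape_drafts(years, pause_seconds=3.0):
    """Fetch and parse each draft page, pausing between requests."""
    all_players = []
    for year in years:
        url = DRAFT_URL.format(draft_year=year)
        request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request) as response:
            html = response.read().decode("utf-8")
        all_players.extend(parse_draft_table(html, year))
        time.sleep(pause_seconds)  # be polite to the server
    return all_players
```

Calling `scrape_drafts(range(2000, 2020))` would walk through twenty draft pages. Keeping the parsing separate from the fetching makes it possible to check the parser against a saved HTML file before sending any requests.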

My next goal was to obtain college statistics that would be tied to each of the applicable players in this data. Before doing so, I chose to look at the distribution of Games Played for all players drafted as I wanted to see what a typical NBA career length looks like.

Prior to creating a visualization, I removed any players who did not play any NBA games as well as those who did not attend college. There are many reasons why a drafted player may never log an NBA minute, including overseas players who are drafted but choose to stay in overseas professional leagues and, in some unfortunate cases, serious injury.

It’s apparent that the distribution of NBA games played is right-skewed, which makes sense as there is an ever-rotating list of hopeful NBA players coming into the league year after year. It’s a lot more common for NBA players to last only 2–3 seasons than it is for there to be someone like LeBron James with a career spanning almost two decades. To ensure that there was a decent baseline of NBA performance accounted for in my upcoming analysis, I decided to remove any players who did not have at least 82 games played in the NBA, the equivalent of one full season.
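The two filters described above can be sketched with pandas. The column names (`player`, `college`, `games`) and the sample rows are hypothetical placeholders, not the real dataset.

```python
import pandas as pd


def filter_players(drafted, min_games=82):
    """Keep players who attended college and played at least min_games NBA games."""
    games = pd.to_numeric(drafted["games"], errors="coerce").fillna(0)
    attended_college = drafted["college"].fillna("").str.strip() != ""
    return drafted[(games >= min_games) & attended_college].reset_index(drop=True)


# Made-up rows for illustration only.
drafted = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C", "Player D"],
    "college": ["Kentucky", None, "Duke", "Arizona"],
    "games": [500, 300, 0, 81],
})
kept = filter_players(drafted)
```

Only "Player A" survives here: "Player B" has no college listed, "Player C" never played an NBA game, and "Player D" falls one game short of the 82-game threshold.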

After making these changes I followed a similar process as earlier to obtain the college statistics for the remaining players in the data.

The link for individual player college statistics is as follows: https://www.sports-reference.com/cbb/players/{player_first_name}-{player_last_name}-1.html. The dataset I obtained for college players included the career college statistics for NBA draftees.
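As a rough sketch, the URL pattern above can be filled in from a player's name as follows. The slug rules (lowercase, punctuation dropped, name parts joined with hyphens) are my assumption, and the `-1` suffix may not point at the right person when two players share a name, so any URL built this way should be checked.

```python
import re
import unicodedata

COLLEGE_URL = "https://www.sports-reference.com/cbb/players/{slug}-1.html"


def college_url(full_name):
    """Build a candidate college-stats URL from a player's full name."""
    # Strip accents and any other non-ASCII characters.
    ascii_name = (
        unicodedata.normalize("NFKD", full_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Keep only letters, whitespace, and hyphens, then join the parts with hyphens.
    cleaned = re.sub(r"[^a-z\s-]", "", ascii_name.lower())
    slug = "-".join(cleaned.split())
    return COLLEGE_URL.format(slug=slug)


print(college_url("Devin Booker"))
```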

Target Selection

I considered four metrics as the target variable for a regression model that would indicate the level of success of an NBA career: WS, WS/48, VORP, and BPM. A quick description of each:

WS — Basketball Reference defines a Win Share as “a player statistic which attempts to divvy up credit for team success to the individuals on the team.” The calculation for a win share includes crediting both offensive and defensive win shares to a player. This involves calculating the marginal offensive and defensive impacts of a player relative to league average to develop a metric meant to quantify the impact on wins a player will make throughout their NBA career.

WS/48 — The same calculation methodology as Win Shares adjusted to standardize for minutes played. Because Win Shares are calculated using counting stats, they favor players who spend more time on the court. This statistic evens the playing field to look at the impact of all players if they played the same number of minutes.

VORP — Per Basketball Reference, VORP is defined as “a box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season. Multiply by 2.70 to convert to wins over replacement.”

BPM — Once again referencing Basketball Reference, “Box Plus/Minus, Version 2.0 (BPM) is a basketball box score-based metric that estimates a basketball player’s contribution to the team when that player is on the court. It is based only on the information in the traditional basketball box score — no play-by-play data or non-traditional box score data (like dunks or deflections) are included.”
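Two of the conversions mentioned in these definitions are simple arithmetic. The sketch below shows the per-48-minute scaling behind WS/48 and the 2.70 multiplier quoted in the VORP definition. The function names and the sample numbers are hypothetical.

```python
def win_shares_per_48(win_shares, minutes_played):
    """Scale Win Shares to a per-48-minute rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return 48.0 * win_shares / minutes_played


def vorp_to_wins(vorp):
    """Convert VORP to wins over replacement using the 2.70 factor quoted above."""
    return 2.70 * vorp


# A made-up season: 10 Win Shares in 2,400 minutes, with a VORP of 2.0.
rate = win_shares_per_48(10.0, 2400)
wins = vorp_to_wins(2.0)
```

With these made-up inputs the rate works out to 0.2 Win Shares per 48 minutes and the VORP converts to 5.4 wins over replacement.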

I decided to utilize WS as the final choice for the target variable in my regression model. None of these metrics is a perfect encapsulation of the impact an NBA player will have on the court, but the list of career leaders in NBA Win Shares suggests that this metric is pointing in the right direction. The top of that list is filled with Hall of Famers and players who would be if they retired today.

Looking at the distribution of WS across the data, we can see that it is right-skewed. This makes sense, as Win Shares is a statistic that accumulates across a player’s entire career, and only a small number of players have long, productive careers.

After deciding on my target variable, I moved on to cleaning the data for a machine learning model. This included handling any missing values in the college statistics and removing any features that were unnecessary.

Before training a regression model on the data, I chose to perform feature selection in order to determine which college statistics had the strongest information gain with respect to NBA Win Shares, and to filter out the rest. After finishing this process, the college statistics kept in my model as explanatory features were:
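One possible version of this cleanup step is sketched below. The post does not say how missing values were handled, so median imputation is my assumption here, and the column names are hypothetical.

```python
import pandas as pd


def clean_college_stats(stats, drop_columns=("conference",)):
    """Drop unneeded columns and fill missing numeric values with the column median."""
    present = [col for col in drop_columns if col in stats.columns]
    cleaned = stats.drop(columns=present)
    numeric_cols = cleaned.select_dtypes(include="number").columns
    medians = cleaned[numeric_cols].median()
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(medians)
    return cleaned


# Made-up rows for illustration only.
stats = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "conference": ["SEC", "ACC", "Pac-12"],
    "fg3_pct": [0.40, None, 0.30],
    "pts_per_g": [18.2, 12.5, None],
})
cleaned = clean_college_stats(stats)
```

Other choices, such as dropping incomplete rows or filling a missing three-point percentage with zero for players who never attempted one, may suit some columns better.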

  • Games Started — The number of games the player started (was on the court at the beginning of the game) throughout their college career.
  • Field Goals Made Per Game — Number of total shots made per game.
  • 3 Pointers Made Per Game — Number of 3-point shots made per game.
  • 3 Point Field Goal Percentage — Number of 3-point shots made divided by the number attempted.
  • Free Throws Made Per Game — Free Throws made per game.
  • Offensive Rebounds Per Game — Number of rebounds per game following a teammate’s missed shot.
  • Defensive Rebounds Per Game — Number of rebounds per game following an opposing player’s miss.
  • Total Rebounds Per Game — Sum of offensive and defensive rebounds per game.
  • Assists Per Game — Number of direct passes leading to a teammate’s made shot per game.
  • Blocks Per Game — Number of times the player deflects (or blocks) a shot attempt from an opposing player per game.
  • Turnovers Per Game — Number of times when the player loses possession of the basketball to the opposing team per game.
  • Points Per Game — Total points scored per game.
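One way to rank features by information gain is scikit-learn's mutual information estimator. The sketch below is an illustration on synthetic data with hypothetical column names, and I am assuming mutual information as the scoring method rather than reproducing the exact procedure behind the list above.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression


def select_top_features(features, target, k):
    """Return the names of the k features sharing the most information with the target."""
    scorer = partial(mutual_info_regression, random_state=0)
    selector = SelectKBest(score_func=scorer, k=k).fit(features, target)
    return list(features.columns[selector.get_support()])


# Synthetic data: two informative columns and one pure-noise column.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "pts_per_g": rng.normal(15, 4, 300),
    "ast_per_g": rng.normal(3, 1, 300),
    "noise": rng.normal(0, 1, 300),
})
target = (
    2.0 * features["pts_per_g"]
    + 8.0 * features["ast_per_g"]
    + rng.normal(0, 1, 300)
)
selected = select_top_features(features, target, k=2)
```

Fixing `random_state` keeps the ranking reproducible, since the estimator adds a small amount of random noise to the data internally.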

Training/Test Data Determination

Because Win Shares is a cumulative stat, I had to be careful in selecting the training set for the predictive model. Many of the players included from these drafts still have a lot of NBA games ahead of them, especially those from the 2010 NBA Draft onward, who may just now be entering their prime NBA years. Because of this, I chose to limit the training data of my model to players from the 2000–2009 NBA Drafts and to test on the 2015–2019 NBA Drafts.

My reasoning was that players from the 2000–09 drafts would be out of the league or on the decline at this point, so their Win Shares would no longer be increasing rapidly, as they still are for players from more recent drafts. I utilized a Random Forest Regression model as the predictive model. The model’s predictions were higher than the actual Win Shares of the players in the test set, which was to be expected, as these players are still within the first 7 years of their careers and are expected to accumulate more Win Shares. To account for this, I chose to analyze my model’s results with a subjective process rather than looking strictly at any accuracy metrics.
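A split by draft year like the one described above can be sketched in a few lines of pandas. The `draft_year` column name and the sample rows are hypothetical.

```python
import pandas as pd


def split_by_draft_year(players, train_years=(2000, 2009), test_years=(2015, 2019)):
    """Split players into training and test sets by inclusive draft-year ranges."""
    year = players["draft_year"]
    train = players[year.between(*train_years)]
    test = players[year.between(*test_years)]
    return train, test


# Made-up rows for illustration only.
players = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E", "F"],
    "draft_year": [2000, 2009, 2012, 2015, 2019, 2020],
})
train, test = split_by_draft_year(players)
```

Players drafted between the two ranges, such as "C" here, end up in neither set.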
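Fitting a Random Forest regressor with scikit-learn might look like the sketch below. It uses synthetic data in place of the college statistics, and the hyperparameters are my own placeholder choices, not the ones used for the results in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins: 12 feature columns, with the target driven by the first one.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 12))
y_train = 10.0 + 5.0 * X_train[:, 0] + rng.normal(size=200)
X_test = rng.normal(size=(50, 12))

# Assumed hyperparameters, chosen only for illustration.
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
predicted_ws = model.predict(X_test)
```

Because each tree predicts an average of training targets, a Random Forest never predicts outside the range of Win Shares seen in training.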

Below, I compare the top 15 players by predicted Win Shares for each NBA Draft against the top 15 players in actual Win Shares thus far.
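The comparison can be sketched as a pair of per-draft rankings plus their overlap. The column names (`draft_year`, `player`, `predicted_ws`, `actual_ws`) and the sample values are hypothetical.

```python
import pandas as pd


def top_n_per_draft(results, column, n=15):
    """Return the n highest rows by column within each draft year."""
    ranked = results.sort_values(["draft_year", column], ascending=[True, False])
    return ranked.groupby("draft_year").head(n)


def overlap_by_draft(results, n=15):
    """List the players appearing in both the predicted and actual top n of each draft."""
    predicted = top_n_per_draft(results, "predicted_ws", n)
    actual = top_n_per_draft(results, "actual_ws", n)
    both = predicted.merge(actual[["draft_year", "player"]], on=["draft_year", "player"])
    return both.groupby("draft_year")["player"].apply(sorted).to_dict()


# Made-up values for illustration only.
results = pd.DataFrame({
    "draft_year": [2015, 2015, 2015, 2016, 2016, 2016],
    "player": ["A", "B", "C", "D", "E", "F"],
    "predicted_ws": [30, 20, 10, 5, 25, 15],
    "actual_ws": [5, 40, 20, 12, 3, 9],
})
overlap = overlap_by_draft(results, n=2)
```

The size of the overlap gives a rough count of how many predicted top players are also top players so far, without depending on the exact Win Shares values.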

Regression Results

[Interactive chart: predicted vs. actual Win Shares for the top 15 players in each draft class]

Visually, we can see the discrepancy between the model’s predictions for Win Shares versus the current standing for these NBA Drafts. As mentioned, this is to be expected as Win Shares are accumulated throughout a player’s entire NBA career. I will have to wait to evaluate my model in terms of mean absolute error or mean absolute percentage error until the players from these draft classes have called it quits on their NBA careers.
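For reference, the two error metrics mentioned above can be computed as in the sketch below. Skipping players with zero actual Win Shares in the percentage error is my own choice, made to avoid dividing by zero. The sample numbers are made up.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute gap as a percentage of the actual value, skipping zero actuals."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# Made-up career Win Shares for three players.
actual_ws = [10.0, 20.0, 40.0]
predicted_ws = [12.0, 15.0, 50.0]
mae = mean_absolute_error(actual_ws, predicted_ws)
mape = mean_absolute_percentage_error(actual_ws, predicted_ws)
```

Career Win Shares can be close to zero or even negative, which makes the percentage error unstable for fringe players, so mean absolute error is likely the safer of the two.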

2015 NBA Draft

The results of my model’s predictions on this draft class are very much a mixed bag. The № 1 overall pick in the 2015 draft was Karl-Anthony Towns, who has led this draft class in Win Shares thus far and would surely be a consensus top-3 pick in a redraft.

However, my model has him outside the top 15 (landing at № 19) in predicted Win Shares, which is quite a large discrepancy. Towns has already blown past my model’s projection for his career Win Shares by over 30, with a long career ahead of him to widen this gap.

Despite this, my model does identify some players who would be considered to have outplayed their original draft slots, including Devin Booker, Delon Wright, Kevon Looney, Rondae Hollis-Jefferson, Norman Powell, Tyus Jones, Myles Turner, and Bobby Portis.

Devin Booker in particular, who was the 13th pick in the original draft, performed extremely well in my model, and he has backed that up in the NBA while leading the Phoenix Suns to the NBA Finals last season. These players would all most likely be drafted earlier than they were on their original draft night if NBA GMs were given a do-over.

The biggest misses from my model’s predictions would have to be Jahlil Okafor, Stanley Johnson, Trey Lyles, and Justise Winslow, who have all thus far in their careers been nothing more than fringe rotation players.

Sports Writer redraft for comparison:

2016 NBA Draft

My model has the original № 1 selection, Ben Simmons, with the highest predicted Win Shares, and he would most likely be proving that prediction right if he were not sitting out the current NBA season.

Players two, four, and five in predicted Win Shares, however, would have to be classified as misses, as Marquese Chriss, Patrick McCaw, and Kris Dunn are nowhere close to top-5 performers from this draft. Players like Caris LeVert, Domantas Sabonis, and Dejounte Murray all perform well in the model compared to where they actually ended up being selected on draft night.

Others such as Tyler Ulis, Deyonta Davis, Isaiah Whitehead, Skal Labissiere, and Henry Ellenson have failed to make a top-15-level impact. Discrepancies from my model also include not having Jaylen Brown, Jamal Murray, and Pascal Siakam within the top 15 in predicted WS, as they have all proven to be All-Star caliber players in the NBA.

Sports Writer redraft for comparison:

2017 NBA Draft

My model, just like a majority of NBA scouts and front offices, predicted that Markelle Fultz would be the best player from the 2017 NBA draft class. Unfortunately, like scouts and front offices, this model did not account for the injuries that have plagued Fultz throughout his still-young NBA career.

The players with the next four highest predicted Win Shares from my model, Tony Bradley, Dennis Smith Jr., Semi Ojeleye, and T.J. Leaf, are nowhere near the top in terms of actual Win Shares from this draft class thus far. Lonzo Ball, at the № 8 slot in predicted Win Shares, looks to be a good bet as someone who has broken out this year for the Chicago Bulls.

For this draft class, there are several players who performed well in the model but have been hampered by injuries thus far in their NBA careers. Besides the aforementioned Markelle Fultz, this also includes Zach Collins and Harry Giles. On top of this, my model failed to identify a few of the biggest steals from this draft class, including Jarrett Allen, Bam Adebayo, and Donovan Mitchell.

Sports Writer redraft for comparison:

2018 NBA Draft

For the 2018 NBA Draft, my model has predicted many players who have outplayed their NBA draft selections thus far into their careers; however, the order of these players is still very much up for debate.

This list of players would include Jarred Vanderbilt, De’Anthony Melton, Shake Milton, and Devonte’ Graham. Marvin Bagley III, the № 2 selection on draft night, also performed well in my model but has failed to live up to expectations due to injury. On the other hand, players such as Wendell Carter Jr., Shai Gilgeous-Alexander, Michael Porter Jr., Collin Sexton, Mo Bamba, and Miles Bridges are still young in their NBA careers, but no one would question that they deserve to be in the top 15 picks of a do-over for this draft class.

Sports Writer redraft for comparison:

2019 NBA Draft

The 2019 NBA Draft is still extremely fresh, with plenty of room for players to shift around in career Win Shares, but I am relatively happy with my model’s predictions based on the NBA performance thus far of the top 15 players in predicted Win Shares. Zion Williamson was named an All-Star in just his second season, while Ja Morant and Darius Garland are joining him this season. Tyler Herro, Matisse Thybulle, Brandon Clarke, Jordan Poole, and P.J. Washington have all outplayed their draft selections thus far into their careers and look to have many seasons ahead of them to prove they were draft day steals.

The misses from my model’s predictions would have to include Ty Jerome, Cam Reddish, and Bruno Fernando, who thus far would not be considered top 15 players from this NBA Draft.

Sports Writer redraft for comparison:

Conclusions

One key takeaway that my analysis has given me is that the NBA Draft is very much an inexact science, and it would take a combination of both numerical analysis and traditional player scouting to get the best accuracy when predicting a college prospect’s NBA future.

My model has correctly predicted quite a few players who would now be considered “steals” (players who were taken later in the draft than their NBA performance warranted). However, if you followed my model to a tee, you would also end up with a few draft “busts” (players who were taken too early), and you would miss out on players who perform much better in real life than my model predicted.

Some issues that I ran into along the way of this project that I could address in future projects include:

  • Having more training data available for the model — I could have included data from previous NBA drafts prior to the year 2000, however, I felt that the NBA has changed too drastically over the past two decades. In order to obtain predictions that would be true to today’s NBA, I chose to omit a selection of data that I believed would muddy up the model’s predictions. However, in doing so, I limited the amount of training data the model would have available.
  • Incorporating international players — It would also have been beneficial to incorporate players who played overseas prior to the draft into the predictions by either including them in the dataset for this model or creating a separate model. International players make up roughly 25% of the NBA today, and missing out on these players in the draft predictions eliminates a pool of talented players, including the reigning league MVP Nikola Jokic.
  • Win Shares might not be the best stat to measure individual success. Basketball is a team sport, and measuring individual performances isn’t simple because of this fact. Some extremely talented players may be drafted into bad situations or onto bad teams, in which case they would not earn the Win Shares their level of talent would suggest. In most cases, this would balance out over a player’s entire career as their on-court situations change, but it could still cause discrepancies in the model results.

All in all, I learned that, yes, one can utilize college basketball statistics to make predictions on the success of NBA Draft prospects. However, projections based solely on college performance are not ideal, as there are many unquantifiable aspects of a prospect that can affect their NBA careers, such as player intangibles and injury, that cannot be explained by college statistics.
