
Predicting the Probability of Scoring a Basket in the NBA using Gradient Boosted Trees

Introduction

In my first blog on Machine Learning in the NBA, I wrote about how professional sports generate extensive amounts of data, opening a wide avenue of possibilities. We can look beyond the perception of spectators and deeper into the inner workings of the game of basketball.

With as much data as we have available, data scientists can be creative in both the models they create and the data used to create them. That brings us to this blog post. To start us off, let’s pose a bizarre question – and use Machine Learning to see if we can get close to a rational answer:

You’re putting $1 million on the line for LeBron James to make a shot in a basketball game. How could you be certain that you will win the payout?

Check out the companion video for this blog as well: https://youtu.be/Lxfsvw7rHgU

Photo by Alexander Schimmeck on Unsplash

The Stats

We want to use information that will give us a strong indication as to whether or not James will score a basket. At the time of writing, LeBron James has a 50.4% career shooting percentage. Loosely speaking, that means if we look at his entire career dating back to 2003, every time he’s attempted a shot, the odds of him scoring were about the same as flipping a coin. Not great if we have $1 million on the line. We need to tip the scale in our favor.

Talent comes in all shapes and sizes in the NBA, and one example of this is seeing how different players have different play styles and shooting preferences. Some instances of this include:

  • Shooting right-handed versus left-handed
  • Shooting long-range versus mid-range
  • Performing well under pressure (e.g. in the final seconds of a close game)

Believe it or not, this information, and more, is captured in NBA stats data. We can leverage this data and exploit it to see how we can find the right basket attempt to place a bet on. While we’re on this note, I’ll mention that all of the data referenced henceforth in this blog comes from https://www.basketball-reference.com/.

Without overcomplicating things, we’ll stick with just two criteria to build our model: location of the shot attempt and the stage of the game. Our model will use these features from LeBron James’ career shooting data:

  • XY coordinates of shot attempt (i.e. the location on the court)
  • Distance from the player to the basket
  • Time remaining in the quarter when the shot took place
  • Whether or not the shot was attempted towards the end of a half
  • Whether or not the shot was in the fourth quarter

The last two features are included to add further context to the stage of the game. In close matchups especially, those last-minute or last-second baskets can influence a game’s outcome significantly, and we want to capture that. This means we’re assuming that baskets scored as the game clock runs out are more valuable, and a player will be more incentivized to score a basket.
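As a sketch of how those features could be derived, here is a minimal pandas example on a toy stand-in table. The column names (`x`, `y`, `distance_ft`, `seconds_left`, `quarter`, `made`) and the 60-second "end of half" threshold are assumptions for illustration; the real exports from Basketball Reference will differ.

```python
import pandas as pd

# Toy stand-in for the scraped shot log; real column names will differ.
shots = pd.DataFrame({
    "x": [2.0, -18.5, 0.5],           # court X coordinate
    "y": [3.0, 21.0, 4.5],            # court Y coordinate
    "distance_ft": [3.6, 27.9, 4.5],  # distance from the basket
    "seconds_left": [412, 35, 8],     # time remaining in the quarter
    "quarter": [1, 2, 4],
    "made": [1, 0, 1],                # 1 if the attempt scored
})

# Game-stage context: last minute of the 2nd or 4th quarter counts as
# "end of a half" (the 60-second cutoff is an assumed choice here).
shots["end_of_half"] = ((shots["quarter"].isin([2, 4]))
                        & (shots["seconds_left"] <= 60)).astype(int)
shots["fourth_quarter"] = (shots["quarter"] == 4).astype(int)

feature_cols = ["x", "y", "distance_ft", "seconds_left",
                "end_of_half", "fourth_quarter"]
X, y = shots[feature_cols], shots["made"]
```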

Gradient Boosted Trees

We’re going to try answering this question using Gradient Boosted Trees. These are a type of decision tree ensemble used in supervised Machine Learning.

Decision trees are a widely-used family of algorithms where a "mathematical flowchart" is learned from the input features. Gradient Boosted Trees are a variant where decision trees are built in series, and each tree tries to correct the mistakes of the previous one. One tree could look something like this:

One of many decision trees that can be used to determine the likelihood of scoring. Visualized with GraphViz.

Above, we can see a tree with its values learned from the input features. It shows us how each value gets us closer to the decision of the tree. The "samples" field tells us how many of the total samples (or basket attempts) fell into this category. The "value" field (along with the darkness of the node’s shading) tells us which nodes are the more significant indicators in this tree. The total number of trees used is equal to the n_estimators parameter described below.
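A visualization like the one above can be produced with scikit-learn’s `export_graphviz`. Here is a minimal sketch on synthetic data (the actual feature names and model come from the shot logs, so the names below are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import export_graphviz

# Synthetic stand-in for the six shot features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.02,
                                   random_state=0).fit(X, y)

# estimators_ is an (n_estimators x 1) array of regression trees;
# export the first boosting stage as GraphViz DOT source.
dot_source = export_graphviz(model.estimators_[0, 0],
                             feature_names=[f"f{i}" for i in range(6)],
                             filled=True)
```

The resulting DOT string can be rendered with any GraphViz tool to produce a filled, shaded tree like the figure.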

Learn more about Gradient Boosting here: Gradient Boosting explained [demonstration] (arogozhnikov.github.io)

There are two hyperparameters we’ll spend time tuning in our trees:

  • n_estimators: the number of boosting stages to perform (in other words, the number of trees in sequence; we "boost" or learn from one tree to the next)
  • learning_rate: how strongly each tree learns from the last

In this experiment, we expect the algorithm to learn "sweet spots" on the court to shoot from as well as a window marking the best time to take a shot. It may look something like "short-range to the hoop with about 3:26 left will yield a 73% chance at scoring a basket."

What I like about decision trees is that they are simpler to understand in concept as well as in analysis. We’re looking for the highest probability of making a field goal attempt. This will be at some node of the tree, and there exists some combination of our feature set that will navigate us there.

Results and Case Analysis

We’ll spend more time discussing the approach and the results in this blog rather than the code setup. If you’re interested in the source code, I have a GitHub repository linked at the end of this blog.

Train-Test Split

LeBron James has more field goal attempts than any active player in the 2021–22 season. This is great news for us in that we can use all of his scored baskets and all of his missed baskets to train a model to understand where and when he performs best. The features we’re using are known for each shot attempt, alongside whether or not the attempt was successful. For training the following models, I took a standard train-test split with 75% of the data used to train and 25% used for testing. It’s important to shuffle the data before splitting it to avoid biasing our model to learn from only the first 3/4 of James’ career.
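As a sketch (on synthetic stand-in data, since the real feature matrix comes from the scraped shot logs), the split looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for the shot feature matrix and made/missed labels.
X, y = make_classification(n_samples=1000, n_features=6, random_state=42)

# shuffle=True (the default) mixes shots from every season before the
# split, rather than training on only the first 3/4 of a career.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42)
```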

Hyperparameter Tuning

Next, to maximize the likelihood of success, I ran a brief experiment to look at the influence of learning_rate and n_estimators. This helps select the best model.

I used learning rates between 0.01 and 0.1 with a step size of 0.01. Accuracy ramped up quickly and gradually tapered off, all while hovering within a 2% range. The best accuracy was 64.619% with a learning rate of 0.02.

Tuning the learning_rate (lr) hyperparameter

Next, I tuned the n_estimators parameter. Accuracy peaked marginally higher at 64.733%. This was found using learning_rate=0.02.

Tuning the n_estimators hyperparameter
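A sweep like the one in the plots above can be written as a simple loop; the sketch below uses synthetic data and the same 0.01–0.10 learning-rate range, keeping whichever setting scores best on held-out data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the shot features and labels.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Sweep learning rates from 0.01 to 0.10 in steps of 0.01 and keep
# the model with the best test accuracy.
best_lr, best_acc = None, 0.0
for lr in np.arange(0.01, 0.11, 0.01):
    clf = GradientBoostingClassifier(learning_rate=lr, random_state=42)
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    if acc > best_acc:
        best_lr, best_acc = lr, acc
```

The same loop, swapping `learning_rate` for `n_estimators`, covers the second sweep.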

Model Testing

With our model ready for action, we can feed it test data to examine the probability of LeBron making a basket given his court position and time remaining on the game clock. I should note that the roughly 64% prediction accuracy is an average prediction rate across all of LeBron’s shots (i.e. the shots used to train the model). This means there are specific shooting scenarios where the model predicts whether or not a basket is scored with even greater accuracy. We can exploit those scenarios to win our cash prize.

With thousands of samples to select from, what I found to be the most insightful way to look at this was to look at the upper percentiles for his shooting. In other words, we can plot out the shot scenarios that LeBron is at least 70% likely to land a field goal:
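Filtering for those upper percentiles comes down to `predict_proba`. Here is a minimal sketch on synthetic data; the real version would pass in the shot feature matrix and plot the X-Y columns of the rows that survive the filter:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the shot features and made/missed labels.
X, y = make_classification(n_samples=1000, n_features=6, random_state=42)
model = GradientBoostingClassifier(learning_rate=0.02,
                                   random_state=42).fit(X, y)

# Column 1 is the probability of the positive class ("shot made").
proba = model.predict_proba(X)[:, 1]

# Keep only the attempts the model rates at 70% or better; plotting
# their X-Y coordinates produces the scatter charts shown here.
high_confidence_shots = X[proba >= 0.70]
```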

70-percentile shooting by LeBron James (axes are X-Y court coordinates)

Very little surprise here. LeBron stands at a tall 6’9" weighing in at 250 pounds. Throughout his career, he’s had few issues driving straight to the basket, knocking down anything from clean layups to high-flying dunks. His most reliable shots should be right under the rim. That blue blob comprises 609 unique shots. Interestingly enough, these are shots that exclusively take place with between 1:30 and 5:07 left on the game clock in any quarter.

Therefore, we now have an answer: despite LeBron James’ 50.4% shooting average, if we were to put $1 million on the line for LeBron James to score a basket, we’d have a 70% chance at winning on an attempt that’s right under the hoop with between 1:30 and 5:07 left on the game clock.

What about other players?

We can apply this algorithm to any player in the league, ideally those with extensive scoring experience. Let’s briefly take a look at two other prolific shooters: Kevin Durant and Stephen Curry.

70-percentile shooting by Kevin Durant

Kevin Durant has a similar stature to James, and therefore makes baskets in the paint (under the rim) with ease. He is, however, also known as the "@EasyMoneySniper," hitting long-range shots as well. If you were to place this same bet on Kevin Durant, you would have a few more locations to choose from, and you’ll want his shot to take place with between 2.8 seconds and 5:33 left on the game clock. Talk about clutch!

70-percentile shooting by Stephen Curry (4th quarter shots in orange)

If you happen to skip over the text in this blog and look only at these plots, you’d still see that Steph Curry is in a league of his own when it comes to shooting. Plotted above are 613 shots taken by Curry. To my surprise, Steph Curry has a lower career shooting percentage than LeBron James, at 47.6%. So why can we select a shot from Steph Curry from virtually anywhere?

Because Steph Curry can shoot from anywhere.

His shots are not concentrated like James’. He simply has the versatility to land a shot from nearly anywhere on the court. If you want to place your bet on Curry instead, make sure it’s between 0.8 seconds and 11:53 remaining. That’s effectively any time throughout the game! I’ve also color-coded his fourth-quarter shots in orange. Given that his fourth-quarter performance still covers the entire court, Curry is solidified as one of the most reliable shooters the game has ever seen.

Model Limitations

Every great machine learning model comes with its limitations. Given that we’re talking about professional basketball games, we should acknowledge that there are many quantified and unquantified variables not accounted for in our model that could influence the outcome of a game or the outcome of a shot attempt. Some examples include:

  • Contested shots: in a game, a player can attempt a shot completely unguarded or have up to 5 defenders guarding them. Obviously, more defenders imply a tougher shot and a lower probability of scoring.
  • Injury history: players coming off of hand or foot injuries may be reluctant to shoot from one side or the other, even if it’s their dominant side. They may also choose to avoid heavy-contact, which is likely to happen under the rim (some still prevail, however).
  • Regular season vs. playoffs: While the model described in this blog accounts for shots made in both the regular season and the playoffs, no explicit distinction is made in the data itself. Some players are known to elevate their game in the NBA post-season or play more minutes overall, which can affect their in-game performance for better or worse.

Further studies would be needed to understand a player’s mentality when taking shots under pressure. We could additionally consider player physiology, such as how high they jump, the angling of their wrist and shoulders, or their view of the net. This data, however, is more scarce.

Bias, Overfitting, and the Context of NBA Basketball

In the proof-of-concept stages of this experiment, I shared a video with my Twitter followers showing that LeBron James could, in fact, have as high as a 98% likelihood of making a basket. Purely from a numbers standpoint, it is possible for a probability to be that high. One scenario could look something like this:

Shaquille O’Neal is known for almost never taking a three-point shot throughout his 19 years in the league. In 22 total attempts, he scored once. Say that, instead, he scored all 22 of those attempts and his 3-point shooting percentage was near 100%. Looking at all of Shaq’s attempts, he’s attempted over 19,000 shots, most of which are two-point attempts averaging a scoring percentage of 58.3%. If his short-range, two-point baskets are successful at a rate of 58.3% and his long-range, three-point attempts are successful at a rate of 100%, you would expect our model to treat long-range three-pointers as an effectively guaranteed shot. This is model bias, as we are overestimating the true likelihood of making a three-point attempt, which is widely agreed to be a more difficult shot.

The other scenario, which was the case in my video, is the result of overfitting to the training data. This highlights the importance of tuning hyperparameters to select an adequate model that can perform in the context of professional sports. The competitive nature of the NBA would likely mean a probability this high would never see the light of day.

Conclusion

To recap this blog, we covered a space-time analysis for how NBA players have varying likelihoods of scoring a basket. Using Gradient Boosted Trees, I created a model that would account for a player’s positioning on the court as well as the remaining time in the quarter to predict the likelihood that a respective combination would yield a scoring shot attempt. In analysis, we found that a player like LeBron James will often score shots close to the rim, and players like Steph Curry are even better than we thought, being able to reliably land a basket at almost any time, virtually anywhere.

I hope you learned something new and useful in this blog. If you’re interested in checking out the source code, you can find my GitHub repo here:

ChristopheBrown/nba-ml: Main Repository for all NBA Machine Learning demos (github.com)
