BASKETBALL ANALYTICS / MACHINE LEARNING

Redefining NBA Player Classifications using Clustering

Using Hierarchical Clustering to define NBA Players

Ahmed Jyad
Towards Data Science
12 min read · Nov 16, 2020

Source: unsplash.com

Basketball has existed for more than a hundred years, and as the game evolved with new rules and regulations, so did the players. The NBA has reached a point where Point Guards grab 10-plus rebounds and Centers shoot effectively from the 3 Point line, where 7-foot players are skilled primary ball handlers while players shorter than 6 feet 6 inches get minutes at Center. Players have expanded their skill sets to assert dominance all over the court and can no longer be defined by the position they play. Yet, in this ‘Position-less Era’ of basketball, teams still bracket players into traditional positions, constraining them to roles meant for players of the past. This article proposes a new way to group players using Unsupervised Learning methods.

Data

All data is sourced from NBA.com and consists of recorded in-game player statistics from the 2018–2019 NBA Season. A total of 530 data points (players) were collected with 336 features, which include Player Data, General Statistics, Advanced Statistics, Defensive Statistics, Hustle Statistics, Play Style Statistics and Passing Statistics. Players who played fewer than 12 minutes per game and fewer than 10 games were dropped, which reduced the number of data points to 388.
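The filtering step could look roughly like the sketch below. The column names (`PLAYER`, `GP`, `MIN`) and the toy rows are assumptions for illustration, not the actual NBA.com export. The mask follows the literal wording above (both conditions must hold); replacing `&` with `|` would give the stricter reading in which failing either threshold drops the player.

```python
import pandas as pd

# Hypothetical column names and toy rows; the real export may differ.
players = pd.DataFrame({
    "PLAYER": ["A", "B", "C", "D"],
    "GP": [70, 8, 45, 5],            # games played
    "MIN": [31.2, 9.5, 10.1, 25.0],  # minutes per game
})

MIN_MINUTES = 12
MIN_GAMES = 10

# Literal reading of the text: drop only when both conditions hold
low_usage = (players["MIN"] < MIN_MINUTES) & (players["GP"] < MIN_GAMES)
kept = players.loc[~low_usage].reset_index(drop=True)
```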

Dimension Reduction

The curse of dimensionality can be daunting. It not only makes a model harder to interpret but can also lead to overfitting. It is reasonable to assume that not all features carry valuable information, and that some merely repeat information already provided by another feature. The dataset here has 336 features, and surely not all of them are relevant to this project.

Feature Selection/Dimension Reduction is an important step in any Data Science pipeline. Here Pearson Correlation and Variance Inflation Factor were used to reduce the number of features from 336 to 87.
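As a rough illustration of how a Pearson correlation filter followed by a Variance Inflation Factor filter can be chained, here is a minimal sketch. The thresholds (0.9 for correlation, 10 for VIF), the helper names and the synthetic data are assumptions; the article does not state the values actually used. The VIF is computed from the diagonal of the inverse correlation matrix, which is equivalent to regressing each feature on all the others.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the later feature of each pair whose |Pearson r| exceeds threshold."""
    corr = df.corr(method="pearson").abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

def vif(df):
    """VIF per feature: diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)

def drop_high_vif(df, threshold=10.0):
    """Repeatedly remove the feature with the largest VIF until all fall below threshold."""
    df = df.copy()
    while df.shape[1] > 1:
        scores = vif(df)
        if scores.max() < threshold:
            break
        df = df.drop(columns=scores.idxmax())
    return df

# Synthetic stand-in for the player statistics
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),       # nearly duplicates "a"
    "b": b,
    "a_plus_b": a + b + rng.normal(scale=0.3, size=200),  # explained by a and b together
})

step1 = drop_correlated(demo, threshold=0.9)   # removes the near-duplicate
step2 = drop_high_vif(step1, threshold=10.0)   # removes the multicollinear feature
```

The two filters catch different problems: the correlation filter only sees pairwise redundancy, while VIF also flags a feature that is a combination of several others.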

Principal Component Analysis

Principal Component Analysis is another popular dimension reduction method. Instead of discarding features, it combines them into new, uncorrelated components, so most of the information can be retained in far fewer dimensions. Feature Selection already reduced 336 features to 87, and throwing away more features could be detrimental to the model. PCA offers a balance between retaining information and simplifying the model. Here PCA was conducted using the scikit-learn library. The following table shows the variance explained by each Principal Component.

Image by Author

PC1, PC2 and PC3 combined explain nearly half of the total variance in the data. Retaining around 60% of the variance was taken as the cutoff point, a balance between model complexity and interpretability. The first 6 Principal Components explain around 61% of the variance. Subsequent components each explain progressively less variance and have little additional impact.
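The component-selection logic can be sketched with scikit-learn as follows. The random matrix stands in for the 388 × 87 feature table, and standardizing the features before PCA is an assumption (the article does not say whether the data were scaled), so the number of components it yields will not match the article's six.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(388, 87))  # placeholder for the real feature matrix

# Assumption: features are standardized so no single stat dominates by scale
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the target
TARGET = 0.60
n_components = int(np.searchsorted(cumulative, TARGET) + 1)

# Project the players onto the retained components
scores = PCA(n_components=n_components).fit_transform(X_scaled)
```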

Interpreting the Principal Components

Principal Components are interpreted using the per-feature weights of their eigenvectors, usually called loadings (sometimes loosely called eigenvalues). A feature with a high positive loading means the component has a strong positive association with that feature. A feature with a high negative loading means the component has a strong negative association with that feature.
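In scikit-learn, these per-feature weights are the rows of `components_`. The sketch below ranks features by absolute weight for one component, which is one way tables like those in the following sections could be produced. The feature names and random data are placeholders, and `top_features` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = [f"feat_{i}" for i in range(8)]  # hypothetical feature names
X = pd.DataFrame(rng.normal(size=(100, 8)), columns=features)

pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X))

# Rows of components_ are eigenvectors; transpose so that rows are features
loadings = pd.DataFrame(pca.components_.T, index=features,
                        columns=["PC1", "PC2", "PC3"])

def top_features(loadings, pc, n=20):
    """Features with the largest absolute weight on one component, sign preserved."""
    order = loadings[pc].abs().sort_values(ascending=False).index[:n]
    return loadings.loc[order, pc]

top_pc1 = top_features(loadings, "PC1", n=5)
```

The sign is kept in the output because it determines whether a component "rewards" or "penalizes" a feature.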

PC1

PC1 explains 25% of the variance in the data and will hence be very important in clustering players. The following table shows the top 20 features with the highest absolute loadings (eigenvector weights).

Image by Author

PC1 seems to reward players who:

  • grab a lot of Rebounds,
  • create Screens and Rolls towards the basket during Pick&Roll,
  • have high Defensive Impact 6 feet from basket,
  • Contest 2 Points attempts,
  • Cut towards the basket,
  • make a lot of Blocks,
  • and have high Field Goal percentage.

PC1 seems to penalize players who:

  • have high Defensive Frequency near the 3 Point line
  • and Handle the ball during Pick&Roll.

High PC1 values indicate that the player predominantly plays close to the basket. It’s safe to assume Centers would have high PC1 values.

PC2

PC2 also accounts for a substantial share of the variance. The following table shows the top 20 features with the highest absolute loadings (eigenvector weights):

Image by Author

PC2 has a positive association with almost all forms of offense. In particular, it rates highly players who:

  • score a lot of Points,
  • have high Player Impact Efficiency,
  • have high Usage,
  • and score a lot through Isolation plays

High PC2 values indicate an elite, highly efficient player. One could assume All-Stars would have high PC2 values. PC2 values both offensive and defensive impact.

PC3

From PC3 onwards, each Principal Component explains a smaller share of the variance in the dataset and hence has less impact in separating players than PC1 and PC2. The following table shows the top 20 features with the highest absolute loadings (eigenvector weights):

Image by Author

PC3 seems to reward players who:

  1. make a lot of Spot Up shots,
  2. make a lot of Off Screen shots,
  3. make shots beyond 20 feet from basket,
  4. Contest 3 Points attempts,
  5. and have high 3 Point percentages

PC3 penalizes players who:

  1. make a lot of Assists,
  2. make a lot of Turnovers,
  3. and handle the ball during Pick&Roll

High PC3 values indicate that the player shoots a lot from beyond 20 feet from the basket and is an efficient 3 Point shooter.

PC4

The following table shows the top 20 features with the highest absolute loadings (eigenvector weights):

Image by Author

PC4 seems to reward players who:

  1. Post Up a lot,
  2. have high Usage,
  3. score beyond 5 feet from basket,
  4. Roll towards the basket during Pick&Roll,
  5. and have high 3 Point percentages

PC4 seems to penalize players who:

  1. score through Transition,
  2. Cut towards the basket,
  3. have high Field Goal and True Shooting percentage,
  4. make Deflections,
  5. make Putbacks,
  6. and have high +/-

High PC4 values indicate players who can post up and stretch beyond 5 feet from basket. Players with high PC4 can score from anywhere, whether it’s close to the basket, midrange or from the 3 Point line.

PC5

The following table shows the top 20 features with the highest absolute loadings (eigenvector weights):

Image by Author

PC5 seems to reward players who:

  1. score from Handoffs and Off Screens,
  2. have high Defensive Frequency beyond 6 feet from basket,
  3. and have high Field Goal and True Shooting percentages

PC5 seems to penalize players who:

  1. make Deflections and Steals
  2. contest 3 Point shots,
  3. have Defensive Impact near the 3 Point line,
  4. take Spot Up shots,
  5. Recover loose balls,
  6. and score through Transitions

High PC5 values indicate that the player predominantly plays away from the basket, has high shooting splits and has low defensive impact. Players with high PC5 are predominantly shooters and have no positive impact in other aspects of the game.

PC6

The following table shows the top 20 features with the highest absolute loadings (eigenvector weights):

Image by Author

PC6 seems to reward players who:

  1. have high Defensive Impact beyond 6 feet from basket,
  2. draw Charges,
  3. and make a lot of Passes

PC6 penalizes players with high offensive impact. A high PC6 value means the player has a massive defensive impact around the perimeter.

Hierarchical Clustering

The goal here is to redefine how players are classified. Classifying players as per their position is an outdated system and should have no place in the modern game. But, if not positions, what should be used to classify players who play similarly? Clustering is a popular method used to group similar data when their labels are unknown. Here, Hierarchical Clustering is used to group players based on the data available.

Hopkins Test

Prior to clustering, a Hopkins test was conducted to assess the spatial randomness of the data, that is, to check whether the data show any clustering tendency at all. (Uniformly random data points have no clustering tendency.)

H0: Data points are uniformly randomly distributed (no clustering tendency)

H1: Data points are not uniformly randomly distributed (clustering tendency present)

The Hopkins test returned a value of 0.3. Read as a Hopkins statistic under the convention where values near 0.5 indicate spatial randomness and values near 0 indicate clustering, this provides some evidence against the null hypothesis, and hence some evidence that the data points have clustering tendencies. (Read literally as a p-value, 0.3 would not justify rejecting the null hypothesis at conventional significance levels.)
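For readers who want to see what such a test computes, here is a minimal sketch of a Hopkins statistic. It compares nearest-neighbour distances of sampled real points with those of uniformly generated points, using the convention in which values near 0.5 suggest spatial randomness and values near 0 suggest clustering (some implementations report the complement, so values near 1 mean clustering). The function name, sampling fraction and toy data are assumptions, and this is not necessarily the implementation used for the article.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, sample_frac=0.1, seed=0):
    """Hopkins statistic: ~0.5 for uniform random data, ~0 for clustered data."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = max(1, int(sample_frac * n))
    rng = np.random.default_rng(seed)

    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # w: distance from sampled real points to their nearest *other* real point
    # (column 0 is the point itself at distance 0)
    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    # u: distance from uniform points in the bounding box to the nearest real point
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]

    return w.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(1)
random_cloud = rng.uniform(size=(500, 2))
blobs = np.vstack([rng.normal(loc=c, scale=0.02, size=(250, 2)) for c in (0.2, 0.8)])

h_random = hopkins(random_cloud)  # expected to land near 0.5
h_blobs = hopkins(blobs)          # expected to land near 0
```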

Linkage

After experimenting with different linkage methods, Ward linkage appeared to do the best job of clustering the data at hand. Below is a Dendrogram that visualizes the resulting hierarchy.
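A minimal sketch of Ward-linkage hierarchical clustering with SciPy is shown below. The synthetic matrix stands in for the principal component scores, and using SciPy (rather than another library) is an assumption about tooling.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
# Placeholder for the matrix of principal component scores (players x 6 PCs)
pc_scores = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 6))
                       for c in (-3, 0, 3)])

# Ward linkage merges the pair of clusters that least increases
# within-cluster variance; it assumes Euclidean distances
Z = linkage(pc_scores, method="ward")

# no_plot=True returns the layout without drawing; remove it to plot
# with matplotlib. truncate_mode="lastp" shows only the last p merged clusters.
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=12)
```

Each row of `Z` records one merge: the two clusters joined, the height of the merge and the size of the new cluster. Swapping `method="ward"` for `"complete"`, `"average"` or `"single"` is how alternative linkages can be compared.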

Image by Author
Image by Author

Intuitive thinking and heuristic methods were used to determine the suitable number of clusters needed to group the data. Below is the elbow plot.

Elbow Plot (Image by Author)

From the Elbow plot, 2, 6 and 9 clusters all appear to be reasonable choices. Having only 2 or 6 clusters defeats the purpose of making an elaborate attempt to group the data. Hence, the data are classified into 9 clusters.
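One way to build an elbow curve for hierarchical clustering is to cut the same tree at different numbers of clusters and record the within-cluster sum of squares each time. The sketch below does this on synthetic data with nine well-separated groups; the metric, the range of k and the helper name are assumptions, since the article does not say how its elbow plot was computed.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def within_cluster_ss(X, labels):
    """Total squared distance of points to their own cluster centroid."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

rng = np.random.default_rng(0)
centers = rng.uniform(-10, 10, size=(9, 6))
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 6)) for c in centers])

Z = linkage(X, method="ward")

# Elbow curve: cut the tree at k = 1..15 clusters and record the WCSS
ks = range(1, 16)
wcss = [within_cluster_ss(X, fcluster(Z, t=k, criterion="maxclust")) for k in ks]

# Final cut into 9 clusters; labels run from 1 to 9
labels = fcluster(Z, t=9, criterion="maxclust")
```

Plotting `wcss` against `ks` gives the elbow plot; candidate values of k are where the curve flattens noticeably.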

Visualizing the Clusters in 3-D space (Image by Author)

Interpreting Clusters

Mean values of Principal Components in each Cluster (Image by Author)
Heatmap of Principal Components in each Cluster. (Image by Author)
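A table of mean Principal Component values per cluster, like the one captioned above, can be produced with a simple group-by. The scores and cluster labels below are random placeholders, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pcs = [f"PC{i}" for i in range(1, 7)]
scores = pd.DataFrame(rng.normal(size=(90, 6)), columns=pcs)
scores["cluster"] = np.repeat(np.arange(1, 10), 10)  # placeholder labels 1..9

# One row per cluster, one column per principal component
cluster_means = scores.groupby("cluster")[pcs].mean()

# Which cluster is highest and lowest on each component
highest = cluster_means.idxmax()
lowest = cluster_means.idxmin()
```

Statements such as "Cluster 2 has the highest average PC1 value" correspond to reading `highest["PC1"]` from this table.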

Cluster 1 — Elite Modern Big Men

Cluster 1 has high PC1, PC2 and PC4, and negative PC5. This means players in Cluster 1 play mostly within 6 feet of the basket, are efficient, score a lot, and can stretch the floor and shoot from beyond 6 feet. Notable players in this Cluster:

Image by Author

Cluster 2 — Traditional Big Men

Cluster 2 has the highest average PC1 value and the lowest PC4 value. Players in this cluster play within 6 feet of the basket and have high defensive impact. However, they are unable to stretch the floor and shoot from beyond 6 feet. Notable players in this Cluster:

Image by Author

Cluster 3 — Elite 3 Point shooters

Cluster 3 has the highest average PC5 value, high PC3 and the lowest average PC1 value. Players in Cluster 3 play around the perimeter and are highly efficient shooters. They rarely go near the basket or grab rebounds. Notable players in this Cluster:

Image by Author

Cluster 4 — Role Players

Cluster 4 has fairly low values across all Principal Components. Players in this cluster are not elite in any specific category. They have a positive average value only for PC3, which could indicate they play near the perimeter and are decent shooters. They have the lowest average PC2 value, indicating that they score less and are not highly efficient. Notable players in this Cluster:

Image by Author

Cluster 5 — 3 and D Players

Cluster 5 has high average PC3 and PC6 values and the lowest average PC5 value. Players in this cluster are high efficiency shooters who have very high defensive impact near the perimeter. Notable players in this Cluster:

Image by Author

Cluster 6 — 3 Level Scorers

Cluster 6 has the highest average PC3 value. Other than PC2 and PC3, the average Principal Component values in this cluster are negative. Players in this cluster are shooters and have some impact in scoring. They do not necessarily score only from beyond the 3 Point line. Notable players in this Cluster:

Image by Author

Cluster 7 — Decent Ball Handlers

Cluster 7 has the lowest average PC3 value and a very low average PC2 value. It has a high average PC4 value. Players in this cluster are highly inefficient, play inside the 3 Point line but away from the basket, and have high usage. This may indicate that they have the ball in their hands a lot but do not score much. Notable players in this Cluster:

Image by Author

Cluster 8 — Elite All Stars

Cluster 8 has the highest PC2 value and the lowest PC6 value. Players in Cluster 8 score massive amounts of points and are highly efficient. They are responsible for scoring most of the points in a game. Notable players in this Cluster:

Image by Author

Cluster 9 — Two way Perimeter Players

Cluster 9 has a high average PC6 value and a fairly high average PC2 value. These players are highly effective on both offense and defense. They score a lot of points and bolster the perimeter defense. Notable players in this Cluster:

Image by Author

Checking the Validity of the Clusters

Image by Author
Image by Author
Image by Author
Comparing various statistics among the Clusters (Image by Author)
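Statements such as "second highest Points/game" in the sections that follow amount to ranking cluster averages of the raw statistics. A minimal sketch with made-up numbers and hypothetical column names (`PTS`, `REB`):

```python
import pandas as pd

# Toy data: two players per cluster, values invented for illustration
stats = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3],
    "PTS": [20, 22, 10, 12, 25, 27],
    "REB": [10, 12, 8, 6, 4, 6],
})

# Average of each statistic within each cluster
means = stats.groupby("cluster").mean()

# Rank 1 = highest cluster average for that statistic
ranks = means.rank(ascending=False, method="min").astype(int)
```

Here `ranks.loc[1, "PTS"]` is 2, meaning cluster 1 has the second highest scoring average, while `ranks.loc[1, "REB"]` is 1.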

Cluster 1

Cluster 1 has the following traits:

  • Second highest Points/game
  • Highest average Total Rebounds/game
  • Highest average Blocks/game
  • Second highest average Field Goal percentage
  • Second highest Post Field Goals/game made
  • Highest Post Defended Field Goals/game made
  • Majority of Cluster 1’s offense comes from Pick&Rolls, Post Ups and Spot Up Shots

All these traits are very common among Elite Big Men.

Cluster 2

Cluster 2 has the following traits:

  • Second highest Rebounds/game
  • Second highest Blocks/game
  • Highest Field Goal percentage
  • Lowest 3 Point percentage
  • High Post Field Goals Made/game
  • Second highest Defended Field Goals made/game
  • Majority of Cluster 2’s offense comes from Pick&Rolls and Cuts

All these traits are very common among Traditional Big Men.

Cluster 3

Cluster 3 has the following traits:

  • Highest 3 Point percentage
  • Second highest Perimeter Field Goals made/game
  • Majority of Cluster 3’s offense comes from Handoffs, Off-Screen shots, handling Pick&Roll and Spot up shots.

All these traits are very common among Elite 3 Point Shooters.

Cluster 4

Cluster 4 has the following traits:

  • Fewest Points/game
  • Second Lowest Total Field Goals/game
  • Majority of Cluster 4’s offense comes from Spot up shots

All these traits are very common among Role Players.

Cluster 5

Cluster 5 has the following traits:

  • High 3 Point percentage
  • High Steals/game
  • High Perimeter Field Goals/game
  • High Post and Perimeter Defended Field Goals/game
  • Majority of Cluster 5’s offense comes from Spot up shots

All these traits are very common among 3 and D Players (3 Point and Defensive Players).

Cluster 6

Cluster 6 has the following traits:

  • High Points/game
  • High Rebounds/game
  • High Field Goal and 3 Point percentages
  • Comparatively high Post and Perimeter Field Goals made/game
  • High Spot up shots, Post ups, Pick&Roll, Off Screen, Isolation, Handoffs, Cuts and Transition Field Goal made compared to other clusters

All these traits are very common among 3 Level Scorers.

Cluster 7

Cluster 7 has the following traits:

  • Very low Points/game
  • Comparatively high Assists/game
  • Lowest Field Goal percentage
  • Lowest Field Goals made/game
  • Lowest Defended Field Goals made/game
  • Majority of Cluster 7’s offense comes from handling Pick&Rolls and Spot up shots.

All these traits are very common among Decent Ball-Handlers.

Cluster 8

Cluster 8 has the following traits:

  • Highest Points/game
  • Highest Assists/game
  • High Rebounds/game
  • High Field Goal percentage
  • Highest Steals/game
  • Highest Field Goals made/game
  • High Defended Field Goals/game
  • Majority of Cluster 8’s offense comes from isolations and handling Pick&Rolls.

All these traits are very common among Elite All-Stars.

Cluster 9

Cluster 9 has the following traits:

  • High Points/game
  • Second highest assists/game
  • High 3 Point percentage
  • Second highest steals/game
  • Highest Perimeter Defended Field Goals/game
  • Majority of Cluster 9’s offense comes from handling Pick&Rolls

All these traits are very common among Perimeter 2 way players.

Commentary

Through Hierarchical Clustering, 9 clusters were formed that describe players in a new light. Each cluster contains a mixture of players from traditional positions. With such classifications, teams no longer have to confine themselves to choosing players based on their positions, but can widen their scope by choosing players that complement each other. Teams can also experiment by playing certain players in different positions based on which cluster they fall into.

Potential Improvements

It is a little unreasonable to classify players into just one category. Having a secondary cluster for each player could be really helpful for teams when building a roster that works well together.
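One possible way to derive a secondary cluster, offered purely as a sketch of the idea and not something done in this project, is to assign each player the nearest cluster centroid other than their own. The function name and toy data below are hypothetical.

```python
import numpy as np

def primary_and_secondary(X, labels):
    """Return each row's own cluster and the nearest *other* cluster centroid."""
    ids = np.unique(labels)
    centroids = np.vstack([X[labels == k].mean(axis=0) for k in ids])
    # Distances from every point to every centroid: shape (n_points, n_clusters)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Mask the player's own cluster so argmin picks the runner-up
    own = labels[:, None] == ids[None, :]
    secondary = ids[np.argmin(np.where(own, np.inf, dists), axis=1)]
    return labels, secondary

# Toy example: three clusters along a line
X = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 0.0],
              [4.2, 0.0], [10.0, 0.0], [10.2, 0.0]])
labels = np.array([1, 1, 2, 2, 3, 3])

primary, secondary = primary_and_secondary(X, labels)
# secondary is [2, 2, 1, 1, 2, 2]
```

Soft-clustering methods such as Gaussian mixture models would be an alternative, since they give each player a membership probability for every cluster.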

Hope you enjoyed this read. You can find all data and scripts on my GitHub. Feel free to reach out to me via LinkedIn. For more fun NBA Analytics reads, try Analyzing NBA Free Agency using Machine Learning.
