Assessing NBA player similarity with Machine Learning (R)

Using Unsupervised Machine Learning & PCA to determine which NBA superstars are the most alike

Nadir N
Towards Data Science



Background

If you know me well, you are probably aware that my basketball fandom has been one of my life’s defining attributes. You may also know how I like to keep track of my favorite players’ season-average stats, often crunching numbers in my head after every shot when watching games. But this is the first time I am using an elaborate arsenal of Applied Statistics/Machine Learning (ML) tools to look at NBA data. Needless to say, I’m very excited about where we can go with this. Let’s get started!

Defining the Problem

As is the nature of unsupervised ML projects, there isn't one particular prediction problem we are looking to solve. But even without specific goals, there is still plenty to learn from open-ended exploration. In this project, we will be looking at player statistics from the NBA's last complete regular season. We will explore individual players and the differences/similarities between them, and see if the stats measure up to what fans and pundits hold up as ground-truths about NBA players. Of course, not every aspect of the game of basketball is statistically measurable, and looking at individual stats in a team sport only gives us a fragmented view of the bigger picture. But even with these limitations, insights offered by advances in sports data-science have revolutionized modern basketball. Given that each team pays hundreds of millions of dollars to its players, it isn't surprising that NBA teams, which are billion-dollar businesses at the end of the day, are investing more into this field to assess the quality of their investments. In this project, we will try to reach conclusions that we couldn't have drawn from a superficial look at the data, and hopefully arrive at findings that are interesting to fans and analysts alike.

Data-sets used

We are using player data from the 2017–18 NBA regular season and are combining three data-sets from the websites Basketball-Reference and NBAminer.com. The first data-set, from Basketball-Reference, consists of traditional NBA statistics (points, rebounds, etc.). We are looking at player statistics normalized to 36 minutes of game-time as opposed to per-game averages. Normalizing by minutes gives us a fairer representation of each player's contributions. If we looked at per-game statistics instead, players who play more minutes like LeBron James would look better on paper than players like Steph Curry, who play on elite teams and often sit out the end of blow-out wins. Now, you could argue that our method is unfair to players who play more minutes, and that LeBron would probably put up better numbers per 36 minutes if he played fewer minutes like Curry and wasn't forced to play longer at the brink of exhaustion. There are also arguments for per-100-possessions stats being the best metric for evaluating a player's impact, as they normalize player stats by the team's pace. I think that metric has its drawbacks as well: it reduces the impact of players who aren't go-to scoring options and therefore don't get used on possessions where the team has to score in a rush. Unfortunately, we can never completely get rid of such biases when evaluating sports statistics. But we can try our best to minimize them and be mindful of them later when evaluating our results. This is part of why the human element in evaluating ML results is crucial.

The second Basketball-Reference data-set we are using consists of "advanced" statistics (usage %, total rebound %, etc.) from the same group of players over the season. The third data-set is from a lesser-known website called NBAminer.com. It provides miscellaneous statistics (fast break points, points in the paint, etc.) that offer useful insights on player impact not included in the first two data-sets.

The “traditional” data-set has data on 540 players (all active players) with 29 features, the “advanced” data-set has data on the same players with 27 features, and the “miscellaneous” data-set contains data on 521 players with 14 features. Like all other projects I work on, I am open to exploring and improving my model. So if you know of any more data-sets/sources of statistics that could be incorporated into this model to better represent player impact, feel free to let me know.

Pre-processing: Data-cleaning & Feature Engineering

This project is done in R. R is great for unsupervised learning projects because data visualization, one of R's main strengths, comes in very handy for them. My R code and plots are publicly available on Github.

Merging data sets and fixing mismatches: The first two data-sets we use are from the same source and have data on the same players, so merging them requires just one line of code. The third, "miscellaneous" data-set, however, is from a different source, so there are discrepancies we need to address. This new data-set has information on 19 fewer players. But as we investigate by matching player names, we find that there are in fact 45 players in the first two data-sets that don't seem to have any data in the third, and 26 players vice-versa. Taking a closer look, we find that the data-sets name players differently when the names include periods (.), numerals (ex: IV) or suffixes (ex: Jr.) at the end. So "JJ Redick" in one data-set is named "J.J. Redick" in the other. We remove the special characters and find that five players are still named differently across the data-sets (one of them decided not to include Nene's last name for some reason), so we fix those manually. Now we are left with only the original 19 players missing from the third data-set. If you look into the statistics for these players, as I did, you will find that most of them played very limited minutes in very few games. We don't have to worry about them, as we would have excluded them anyway when we subset our data (more on this step later).
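To make this concrete, here is a minimal sketch of the merge and name-cleaning steps. The data-frame names (`traditional`, `advanced`, `misc`) and the Nene fix are illustrative placeholders, not the exact code from the repo; the full script is on Github.

```r
# Merge the two Basketball-reference tables (same source, same player names)
bref <- merge(traditional, advanced, by = "Player")

# Strip periods and trailing suffixes so names match across sources
clean_names <- function(x) {
  x <- gsub("\\.", "", x)                  # "J.J. Redick" -> "JJ Redick"
  x <- gsub(" (Jr|Sr|II|III|IV)$", "", x)  # drop suffixes like "Jr" or "IV"
  trimws(x)
}
bref$Player <- clean_names(bref$Player)
misc$Player <- clean_names(misc$Player)

# A handful of remaining mismatches are fixed by hand, e.g. (hypothetical fix):
misc$Player[misc$Player == "Nene"] <- "Nene Hilario"

# Merge in the miscellaneous stats; players absent from `misc` drop out here
nba <- merge(bref, misc, by = "Player")
```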

Removing features: After merging the aforementioned data-sets, we remove repeated features (ex: games played). Next we remove redundant features like season number and season type, as we are only looking at one season. We also remove the feature Ranking, as it's a meaningless indexing variable. Finally, we remove the points-per-quarter features for all 4 quarters, as these values are not normalized by the minutes each player averages in a quarter.

Missing values: Unlike most real-world data-sets for ML, missing values aren't a big issue here. Thankfully, the NBA hires professionals to ensure that's the case. The only features with missing values are 3P%, 2P%, FT% and TS%. After sub-setting our data-set (again, more on that soon), the only feature that still has missing values is 3P%. It is missing for players who didn't attempt a single 3-pointer during the whole season. In this case, we replace the missing values with 0%, as it's fair to say that the 3-point-shooting skills of such players are probably similar to those of players who attempted but never made a 3-pointer during the season.
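In R, that replacement is a one-liner (assuming the merged data-frame is named `nba` and keeps the column name 3P%, which needs backticks because it is non-syntactic):

```r
# Players with zero 3-point attempts have NA in 3P%; treat them as 0%
nba$`3P%`[is.na(nba$`3P%`)] <- 0
```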

Feature extraction: Out of the features we have so far, there are 2 more important features we can extract. The first is minutes-per-game, which can easily be calculated from the total season minutes and the total games played. The second is the assist-to-turnover ratio. Even though it's a simple ratio of 2 existing variables, it is often considered an important metric for evaluating how careful/careless a play-maker is with the ball on offense.
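Both derived features are simple column arithmetic (again assuming the Basketball-reference column names carried over):

```r
# Minutes per game and assist-to-turnover ratio from existing columns
nba$MPG  <- nba$MP / nba$G
nba$A2TO <- nba$AST / nba$TOV  # Inf if TOV is 0; not an issue at 28.5+ MPG
```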

Sub-setting data: Our goal for this project is to make comparisons between players who make significant contributions to games, not to make accurate predictions based on data from all players. So we are going to subset our data to look at the more interesting players. It's a good way to prevent our visualizations and clusters from getting too cluttered. One way to subset the data is to set a threshold for the minimum number of minutes-per-game and include only players who meet it. We set this threshold at 28.5 minutes-per-game, which limits our data-set to 104 players.
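The subsetting itself is a single filter on the derived MPG column:

```r
# Keep only players averaging at least 28.5 minutes per game
nba <- nba[nba$MPG >= 28.5, ]
nrow(nba)  # 104
```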

After combining the 3 tables, feature-engineering and sub-setting our data, we have data on 104 players with the following 55 features:

  1. Player — Player name
  2. Pos — Position
  3. Age — Player's age on February 1 of the season
  4. Tm — Team name
  5. G — Games Played
  6. GS — Games Started
  7. MP — Minutes Played over the entire season
  8. MPG — Minutes averaged per game
  9. FG — Field Goals Per 36 Minutes
  10. FGA — Field Goal Attempts Per 36 Minutes
  11. FG% — Field Goal Percentage
  12. 3P — 3-Point Field Goals Per 36 Minutes
  13. 3PA — 3-Point Field Goal Attempts Per 36 Minutes
  14. 3P% — FG% on 3-Pt FGAs.
  15. 2P — 2-Point Field Goals Per 36 Minutes
  16. 2PA — 2-Point Field Goal Attempts Per 36 Minutes
  17. 2P% — FG% on 2-Pt FGAs.
  18. FT — Free Throws Per 36 Minutes
  19. FTA — Free Throw Attempts Per 36 Minutes
  20. FT% — Free Throw Percentage
  21. ORB — Offensive Rebounds Per 36 Minutes
  22. DRB — Defensive Rebounds Per 36 Minutes
  23. TRB — Total Rebounds Per 36 Minutes
  24. AST — Assists Per 36 Minutes
  25. STL — Steals Per 36 Minutes
  26. BLK — Blocks Per 36 Minutes
  27. TOV — Turnovers Per 36 Minutes
  28. A2TO — Assist-to-turnover ratio
  29. PF — Personal Fouls Per 36 Minutes
  30. PTS — Points Per 36 Minutes
  31. PER — Player Efficiency Rating: A measure of per-minute production standardized such that the league average is 15.
  32. TS% — True Shooting Percentage: A measure of shooting efficiency that takes into account 2-point field goals, 3-point field goals, and free throws.
  33. 3PAr — 3-Point Attempt Rate: Percentage of FG attempts from 3-point range.
  34. FTr — Free Throw Attempt Rate: Number of FT attempts per FG attempt.
  35. ORB% — Offensive Rebound Percentage: An estimate of the percentage of available offensive rebounds a player grabbed while he was on the floor.
  36. DRB% — Defensive Rebound Percentage: An estimate of the percentage of available defensive rebounds a player grabbed while he was on the floor.
  37. TRB% — Total Rebound Percentage: An estimate of the percentage of available rebounds a player grabbed while he was on the floor.
  38. AST% — Assist Percentage: An estimate of the percentage of teammate field goals a player assisted while he was on the floor.
  39. STL% — Steal Percentage: An estimate of the percentage of opponent possessions that end with a steal by the player while he was on the floor.
  40. BLK% — Block Percentage: An estimate of the percentage of opponent two-point field goal attempts blocked by the player while he was on the floor.
  41. TOV% — Turnover Percentage: An estimate of turnovers committed per 100 plays.
  42. USG% — Usage Percentage: An estimate of the percentage of team plays used by a player while he was on the floor.
  43. OWS — Offensive Win Shares: An estimate of the number of wins contributed by a player due to his offense.
  44. DWS — Defensive Win Shares: An estimate of the number of wins contributed by a player due to his defense.
  45. WS — Win Shares: An estimate of the number of wins contributed by a player.
  46. WS/48 — Win Shares Per 48 Minutes: An estimate of the number of wins contributed by a player per 48 minutes (league average is approximately .100).
  47. OBPM — Offensive Box Plus/Minus: A box score estimate of the offensive points per 100 possessions a player contributed above a league-average player, translated to an average team.
  48. DBPM — Defensive Box Plus/Minus: A box score estimate of the defensive points per 100 possessions a player contributed above a league-average player, translated to an average team.
  49. BPM — Box Plus/Minus: A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team.
  50. VORP — Value over Replacement Player: A box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season.
  51. Fast Break PTS — Fast-break points per game
  52. Points in Paint — Points scored in the paint per game
  53. Points off TO — Points scored after the opposing team turned the ball over
  54. 2nd chance points — Points scored during a possession after an offensive player has already attempted one shot and missed
  55. Points scored per shot — Calculated by dividing the total points (2P and 3P made) by the total field goal attempts

Normalizing data: We are also normalizing our data-set. This means that for each value, we subtract the feature's mean and divide the result by the feature's standard deviation. This ensures that features with high-value ranges (such as points scored) do not have a greater impact on our overall similarity comparison than features with low-value ranges (such as blocks or steals). Note that by normalizing the data, we are affecting the outcomes with our own biases. Would it be incorrect to leave the data as it is, or to manipulate it so that defensive statistics such as steals have higher ranges? It wouldn't. It would just mean that the similarity results we assess at the end would be weighted more towards those particular statistics.
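A sketch of the standardization step, holding out the categorical columns (Player, Pos, Tm) and keeping player names as row labels for the plots that follow:

```r
# Z-score every numeric feature: (value - column mean) / column SD
num_cols   <- sapply(nba, is.numeric)
nba_scaled <- scale(nba[, num_cols])  # returns a matrix of z-scores
rownames(nba_scaled) <- nba$Player    # label rows for later plots
```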

Data Analysis Methods

We are going to apply the following statistical methods to investigate our data:
1) Principal Component Analysis
2) K-means Clustering
3) Hierarchical Clustering

For each method, we are going to compare the players in terms of Overall impact (from all available statistics). When interesting, we will also look at Offensive impact (from offensive statistics) and Defensive impact (from defensive statistics).

For Overall Impact calculations, we will be looking at the last 48 features (features 8–55 above). We are not interested in statistics like overall season minutes and games played, as we don't want to separate players who missed games due to injuries from the others. We are also not using categorical variables in our calculations, but we'll use them when examining our results. For Offensive Impact calculations, we are going to look at the following features from above: 8–20, 23, 26, 28, 30–33, 36, 39–41, 45, 49–53, 55. For Defensive Impact calculations, we are going to look at features: 21, 24, 25, 27, 34, 37, 38, 42, 46.

Before applying any statistical methods, it's good practice to take a quick look at the correlation plot between the features to see if there's anything interesting. Since you would need a microscope to read the full 48 x 48 correlation matrix, here's a zoomed-in version below. As expected, Second Chance Points positively correlate with Offensive Rebound %, 3-Point Attempt Rate negatively correlates with Points Scored in the Paint, and so on. Nothing surprising.
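Such a plot can be generated with the corrplot package, roughly as follows (assuming the scaled feature matrix from earlier is named `nba_scaled`; scaling doesn't change correlations, so using it here is harmless):

```r
# Correlation matrix of the numeric features, drawn as a colored grid
library(corrplot)
corr_mat <- cor(nba_scaled)
corrplot(corr_mat, method = "color", type = "upper", tl.cex = 0.5)
```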

Method 1- Principal Component Analysis (PCA)

PCA is a statistical dimensionality-reduction method that has been around for over 117 years. According to the Wikipedia page, PCA is defined as:

“A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. If there are n observations with p variables, then the number of distinct principal components is min(n − 1, p). This transformation is defined in such a way that the first principal component (PC1) has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set.”

For readers interested in learning ML, it is of paramount importance to understand the linear algebra behind this method. For good old basketball fans who couldn't care less, allow me to explain PCA in an over-simplified way. We are trying to get the 2 main principal components in our data-set. This means we are taking all 48 features that we are studying and extracting 2 NEW features, each a weighted combination of the originals. These 2 principal components are the best such pair of combined features for numerically conveying how players differ from each other. For assessing impact on Offense and Defense, we repeat the process but start with the limited sets of features mentioned in the last section. Now comes the fun part of analyzing the data. Let's look at the Overall Impact of players in our first PCA plot.
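Mechanically, the whole transformation is one call to R's prcomp. A minimal sketch of how the plot coordinates come about (base plotting shown here, not the exact code behind the figures):

```r
# PCA on the standardized features; the first two PCs give the plot coordinates
pca <- prcomp(nba_scaled)  # data were already centered and scaled above
summary(pca)               # proportion of variance explained per component

scores <- as.data.frame(pca$x[, 1:2])
plot(scores$PC1, scores$PC2, type = "n", xlab = "PC1", ylab = "PC2")
text(scores$PC1, scores$PC2, labels = rownames(nba_scaled), cex = 0.6)
```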

OVERALL PLAYER IMPACT

The above plot speaks volumes about the players in our data. The size of each player's circle depends on the player's VORP (Value Over Replacement Player) rating, which, along with PER, is considered one of the best all-encompassing statistics in basketball for measuring a player's overall impact. If we drew a line to separate the 8 players in the top-right corner, we would get the 8 top scorers of the season. LeBron James, Russell Westbrook, Steph Curry and particularly James Harden are on the edges of the overall group. Unsurprisingly, these are 4 of the last 5 MVPs, and the 5th one, Kevin Durant, is right there next to them. The proximity of Damian Lillard to this elite group may be a good reason to rethink his value as a player.

Also note that the players are color-coded by their traditional NBA position. As NBA playing styles and positional responsibilities continuously evolve, there is a lot of overlap between positions. But, for the most part, players who play the same position cluster together. At the right (mid to bottom) we have the best rebounding big-men in the NBA. These are mostly players who play as centers (C), but as we move towards the middle, they are joined by freakishly athletic power forwards (PF) like Anthony Davis and Giannis Antetokounmpo, and also Ben Simmons, the tallest point guard in NBA history. Antetokounmpo and Davis seem to have a territory of their own in the plot, which is particularly interesting given that these 2 were voted by NBA general managers as the top players to start a franchise with at the start of the 2018–19 season. Also, notice how centers are separated into clusters of players known mainly for their defense and rebounding (bottom cluster) and those known also for their scoring (top-center cluster). Point guards (PG) mostly group together at the top, with high-scoring ones like Curry, Lillard and Kyrie Irving close together. Shooting guards (SG) and small forwards (SF) have their own space from the center to the bottom-left, with a lot of overlap in between. But in general, the small forwards are noticeably closer to the power forwards than the shooting guards are. Let's move on to the Offense PCA plot.

OFFENSIVE IMPACT

This plot is very similar to the Overall PCA plot. This makes sense, as most of the statistics we are looking at are offensive statistics, so they dominate our overall analysis. But interestingly, a few things do change. Note how players who usually only score from the low post have their own cluster at the top. Also notice that Nikola Jokic and Marc Gasol, arguably the best play-making centers in today's NBA, find themselves closer to point guard territory than to the center area. Ben Simmons is the furthest from all the point guards and right next to the post players. This is understandable given his lack of mid-range/outside shooting and his rebounding abilities. Reigning MVP James Harden is once again in a league of his own, the furthest from all other players. Finally, let's look at PCA on defensive stats.

DEFENSIVE IMPACT

It seems that rebounding prowess is a large factor in Principal Component 1 (PC1), the x-axis. The players farthest to the right are those who led the league in rebounds, and the ones to the left, like JJ Redick, are arguably those least known for their rebounding in our selected subset. But there is more to this component than rebounding. If you look closely, Kevin Durant is to the right of Russell Westbrook. While Durant grabbed far fewer rebounds (6.8 per game) than Westbrook (10.1 per game), he is closer to the big men in terms of other defensive stats like blocks (5th in the league). It's no surprise that he is close to the top-right cluster, which has some of the most feared rim-protectors in the league. If you go back to the previous Offense plot, you will notice that centers seem to impact the game in very diverse ways on offense. But on defense they are not as spread out, as all of them have the primary job of protecting the rim and securing defensive rebounds. The one center who is farthest from the others, Steven Adams, is an understandable outlier. The OKC Thunder are famous for their strategy of having Steven Adams give up defensive rebounds that he would otherwise have secured so that Russell Westbrook can grab them and lead the fast-break. Despite leading the league in boxing out for rebounds, Adams routinely puts up low defensive-rebound numbers. In one of the more bizarre stat-lines last season, he averaged more offensive rebounds than defensive rebounds. Also notice that LeBron James, often credited as one of the most versatile defenders in the league, is right in the middle of this plot.

The shortcoming of PCA plots is that while some players may seem very close to each other, almost overlapping in a 2-D plot, they could be miles apart once we consider further principal components. So this visualization method does not make use of all the variance in our data. To address this issue, we move on to the more sophisticated statistical processes of unsupervised ML.

Method 2- K-means Clustering

If you would like to understand how this popular unsupervised ML algorithm works, start here. Instead of going deep into the statistical methods, we'll focus on their applications. The goal of clustering is to determine the internal grouping in a set of unlabeled data. In K-means clustering, our algorithm divides the data into K clusters, and each cluster is represented by its centroid/center, i.e. the mean of the players in that cluster. For specifics about the programming, refer to my documented R script on Github.

Distance measurement: In order to classify the players into groups, we need a way to compute the distance, or the (dis)similarity, between each pair of players. The result of this computation is known as a dissimilarity or distance matrix. Choosing a distance measure is a critical step, as it defines the similarity between each player-pair and influences the shape of the clusters. We choose the classic Euclidean distance as our measure of player similarity. Standardizing the data and replacing missing values are prerequisites for all of this, but we have already handled that.

Once we calculate the distance matrix, we visualize it in the plot below. Here is a zoomed-in image so we can investigate it better. Red indicates high dissimilarity and teal indicates high similarity between players. Unsurprisingly, players who play similar roles have the least distance between them.

Distance matrix between players
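One way to compute and draw such a matrix, using the factoextra package (a sketch; the gradient colors here are chosen to match the teal/red description above):

```r
# Euclidean distance matrix between players, visualized as a heat map
library(factoextra)
dist_mat <- dist(nba_scaled, method = "euclidean")
fviz_dist(dist_mat,
          gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
```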

Computing K-means: Briefly put, the algorithm can be summarized with these five steps (see the code sketch after the list):

  1. Specify the number of clusters (K) to be created
  2. Randomly select k players from the data set as the initial cluster centroids
  3. Assign each player to their closest centroid, based on Euclidean distance
  4. For each of the k clusters, update the cluster centroid by calculating the new mean values of all the player features in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all features for the players in that cluster, where p is the number of features.
  5. Iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached (R uses 10 as the default maximum).
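In R, all five steps are wrapped in the base kmeans function. A minimal sketch (nstart restarts the algorithm from several random centroid sets and keeps the best result):

```r
# K-means on the standardized features; k = 8 is justified in the next section
set.seed(123)  # fix the random initial centroids for reproducibility
km <- kmeans(nba_scaled, centers = 8, nstart = 25)  # iter.max defaults to 10
table(km$cluster)  # number of players per cluster
```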

Determining optimal number of clusters (k):
There is no perfect way to determine the optimal number of clusters. But there are a few statistical measures that allow us to investigate how the value of k affects our clustering, so we can come to a reasonable conclusion. We use the following 2 methods:

  1. The within-cluster sum of squares (WSS) is the sum of squared deviations between each observation and its cluster centroid. Generally, a cluster with a small sum of squares is more compact than one with a large sum of squares. Intuitively, this score only gets smaller as we increase the value of k and the clusters shrink. So we look for a point in our plot, called the "elbow/knee", where the WSS drops significantly and then levels off for increasing values of k. Unfortunately, as seen from our plot (below), there is no clear elbow.
  2. The second method we use to find the optimal k is the gap statistic. It is more mathematically involved than the previous measure, so we won't get into how it works, but you can take a look by following this link. The higher the gap statistic, the better our value of k. Again, our goal is to get the best clusters while minimizing k. We see (below) that the gap statistic peaks at k = 8.
WSS
Gap Statistic
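Both diagnostics can be produced with factoextra's fviz_nbclust, roughly as follows (again assuming the scaled matrix `nba_scaled`):

```r
# Elbow (WSS) and gap-statistic plots for choosing k
library(factoextra)
fviz_nbclust(nba_scaled, kmeans, method = "wss")
fviz_nbclust(nba_scaled, kmeans, method = "gap_stat", nboot = 50)
```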

Given our findings, we are going to group our data into 8 clusters using K-means clustering and analyze the results. If you are wondering why the X and Y axes below look very similar to the PC1 and PC2 axes from our PCA analysis, it's because they are the same. Whenever we work with more than 2 features, the R function we are using performs PCA and plots the data points onto the first two principal components, as they explain the majority of the data variance. But do not confuse PCA with K-means clustering; there is a lot more that goes into the latter. Moving on to the fun part, we will now analyze how each cluster stands out and come up with appropriate labels based on the cluster means.
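For reference, here is a minimal sketch of how such a cluster plot can be drawn; fviz_cluster is the factoextra function that performs the PCA projection just described:

```r
# Plot the 8 clusters on the first two principal components
library(factoextra)
fviz_cluster(km, data = nba_scaled, repel = TRUE, labelsize = 8)
```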

  1. Black cluster: “Mail-men”
    Characteristic player: DeAndre Jordan
    To the bottom-left of our plot, we have players who deliver in the low "post" and don't do much outside the paint. If you asked for cheesy and unoriginal labels, you got it. But don't think of these players as reliable go-to low-post scorers like Karl Malone. They lead all clusters in FG%, rebound %, and block %, and despite leading in True Shooting %, they are far from leading in points; they are actually 2nd-worst out of all clusters in mean points, and those points often come from put-backs or alley-oops. Unsurprisingly, they have the lowest FT%, 3P% and fast-break points. In terms of both PER and VORP they are 3rd, suggesting they are very valuable on the court despite their low scoring.
  2. Light Grey cluster: “Superstar guards/slashers”
    Characteristic player: James Harden
    This is the best of the best, leading all other clusters in Win Shares, points, PER and VORP. As expected from elite guards/slashers, they also lead in 3-pointers made, assists and FT%, and trail only the "Mail-men" cluster in True Shooting % (TS%).
  3. Gold cluster: “Superstar bigs”
    Characteristic player: Karl Anthony Towns
    To the mid-left we have the best two-way players in the league who play primarily in the paint but are comfortable taking it outside. They lead all clusters in points in the paint and are 2nd in rebounds and DBPM (Defensive Box Plus/Minus), behind only the "Mail-men". They also have the 2nd-best PER, VORP, points-per-36-minutes and Win Share numbers, after the "Superstar guards/slashers".
  4. Hazelnut cluster: “Reliable scorers”
    Characteristic player: Lou Williams
    This cluster is a bit diverse. It mostly consists of players who can score a lot in multiple ways, but are behind our “Superstar guards/slashers” in all scoring categories. They are in the middle of the pack when it comes to all-encompassing stats like PER, VORP and Win-Shares.
  5. Indigo cluster: “Low-scoring playmakers”
    Characteristic player: Lonzo Ball

    To the mid-right, we have players who are better at making others score than at scoring themselves. They have the best assist-to-turnover ratio of all clusters and lead all clusters in TOV% (turnovers per 100 plays). This cluster also trails all other clusters in TS% and points, so I must re-emphasize that their scoring abilities are not what's earning these players their 28.5+ minutes.
  6. Medium Blue cluster: "Backup shooters"
    Characteristic player: Trevor Ariza

    This is one of the larger clusters, with 19 players, so there isn't one specific skill-set here. They do post the highest 3P%, commit the fewest turnovers, and are 2nd-best in FT% and 3-pointers made. But they also score the 3rd-fewest points of all clusters. So it seems this cluster has players who are not go-to scorers who create their own shots, but rather players who specialize in making open shots at a high percentage.
  7. Red cluster: “High-post players”
    Characteristic player: Thaddeus Young
    This cluster seems to be made of players who contribute to the game from both inside and outside the paint. This group is 3rd in Blocks, Rebound stats and 2nd chance points, only trailing the “Superstar bigs” and “Mail-men”. But this cluster takes a lot more shots than the “Mail-men”. This group also takes and makes more outside shots than “Superstar Bigs”.
  8. Sky-blue cluster: “Low-output players”
    Characteristic player: Carmelo Anthony
    Keep in mind that we are only looking at players who play 28.5+ minutes. But out of this group of well-known NBA names, this cluster posts the worst relative numbers in PER, VORP, Win Shares and both offensive and defensive plus-minus. There are of course players who make contributions that don't show up in the statistics. But where that is not the case, and where the same players are also not young with potential to improve, teams should reconsider how much game-time these players get.

K-MEANS CLUSTERS from OFFENSE STATS

Looking at the gap statistic, we once again find 8 to be the optimal number of clusters. As expected, the overall breakdown remains similar, but there are a few interesting changes. Russell Westbrook is no longer part of the "Superstar guards/slashers" once you take out his defensive statistics. This aligns with continuing criticism of Westbrook as an inefficient scorer. He is arguably the most inefficient volume-scorer in recent NBA history, and in fact shot a worse True Shooting % than everyone in the "Superstar guards/slashers" cluster. This makes me wonder where he would land in the overall clusters if he didn't get the defensive rebounds that Steven Adams helped him secure. Ben Simmons also leaves the "Superstar bigs" and joins the "Mail-men". This makes sense, as Ben Simmons's scoring arsenal lacks any form of a jump-shot.

8 CLUSTERS - OFFENSE

K-MEANS CLUSTERS from DEFENSE STATS

The optimal k is found to be 3, and using that, we get the plot below. The Grey cluster is marked by players who put up the highest defensive rebounds, blocked shots and defensive win shares (DWS). The Blue cluster puts up the highest steals and steal % and has the second-best DWS. The third, Yellow cluster has the lowest values in all defensive statistics. But notice that according to these clusters, Klay Thompson, a premier perimeter defender, comes out worse than his team-mate Steph Curry, who is known for being significantly worse than Thompson on defense. So this clustering highlights either a) the limits of the K-means algorithm or b) the limits of NBA stats for accurately conveying the defensive impact of players.

3 CLUSTERS - DEFENSE

Method 3: Agglomerative Hierarchical Clustering

Finally, we move on to our last inspection method. Unlike K-means clustering, hierarchical clustering doesn't require us to pre-specify the number of clusters. We are going to perform bottom-up/agglomerative clustering instead of divisive/top-down hierarchical clustering. The former is better at identifying small clusters, while the latter is better at identifying large ones.

The algorithm begins with every player considered a single-element cluster (a leaf). At each step, the two most similar clusters are combined into a new, bigger cluster (a node). This is iterated until all points converge to one single big cluster (the root), as shown in the plot below.
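A minimal sketch of the clustering and dendrogram, reusing the same Euclidean distances. The linkage method isn't something the plot alone pins down, so Ward linkage is shown here as one common choice, not necessarily the one used for the figure:

```r
# Agglomerative hierarchical clustering on the player distance matrix
hc <- hclust(dist(nba_scaled), method = "ward.D2")  # Ward linkage (assumption)
plot(hc, cex = 0.6, hang = -1, main = "Hierarchical clustering of NBA players")
```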

This, in my opinion, is the best visualization yet. It allows us to pick any player and find out who they are most similar to. The dendrogram overcomes the problem of large K-means clusters containing players that weren't necessarily all that similar. When I first looked at it, I had a few "aha" moments, seeing players paired closest to each other and making connections I had never made before.

Interesting observations:
1. LeBron James and Kevin Durant, the 2 players at the center of the debate about who's the best player in the NBA, are the most similar to each other.
2. The last 5 MVPs (Harden, Westbrook, Curry, Durant and LeBron) all fall into one exclusive mini-cluster. Maybe modern-day MVPs need to fit a stricter pattern than we realized in order to win the award.
3. Antetokounmpo and Anthony Davis, the two young players considered by NBA general managers to have the highest potential, are most similar to each other.
4. Lou Williams is in the company of quite the lethal group of scorers (Irving, Lillard, DeRozan, Butler and Walker). Not bad for someone peaking in his 30s.
5. The “Superstar Bigs” cluster from our K-means plot is still a tight unit.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

That completes our lengthy analysis of NBA players from the last season. If you made it this far, I hope you enjoyed reading this as much as I enjoyed writing it. Thank you for reading. If you think this may be interesting to others you know, please do share it with them. If you liked the article and have something you want to share with me, feel free to comment, contact me via email at nadir.nibras@gmail.com, or reach me at https://www.linkedin.com/in/nadirnibras/.
I am committed to improving my methods, analyses or data-sets, so if you have any suggestions, feel free to comment or let me know otherwise. If you want to follow more of my work on data science, follow me on Medium and LinkedIn.
Also, if you are a fan of Data Science or of NBA analytics, please do connect with me. It’s always fun talking to fellow stats nerds :)
