Finding optimal NBA physiques using data visualization with Python
Using python, pandas and Plotly to explore & visualize data for quick, instant insights (including code & interactive graphs)

Did you ever want to get a better feel for a dataset before starting detailed analysis? This article shows exactly how I did that with data on the physiques of basketball players. If you’re interested, read on. As ever, the subject matter (basketball) discussion is kept to a minimum.
I’ll bet that you’ve seen an extremely tall person, and wondered (aloud) if they play basketball. Or suggested it to a parent of an overgrown child. Basketball players are tall. In a sport that is built around putting a ball into a hoop placed at 10 feet (305cm) above the ground, height is going to be an advantage.
But how much of an advantage is it? Does that advantage manifest itself, especially at the highest levels? If I ran a basketball team, all else being equal, what kinds of physiques are advantageous, and how would that player develop or age over time?
I am looking to build a model which would predict player performance based on past performance data as well as a player’s physical attributes, and I wanted to get a feel for the answers to the questions above before building a rigorous model. So let me share with you how I went about doing this.
As usual, the code is included in my GitLab repo here (nba_physiques directory), so please feel free to download it and play with it / improve upon it.
I also include links to interactive versions of graphs along the way where applicable. The links are in the captions for figures.
Before we get started
Data
This article uses Kaggle’s NBA player dataset. The dataset comes from basketball-reference.com, and as it is relatively small, a copy is included in my repo (in srcdata directory).
Packages
I assume you’re familiar with python. But even if you’re relatively new, this tutorial shouldn’t be too tricky. Feel free to reach out on twitter or here if you’re not sure about something.
You’ll need plotly and pandas. Install them (in your virtual environment) with a simple pip install [PACKAGE NAME].
Data cleaning / pre-processing
Although we got straight into the data visualisation in the previous articles, let’s start this one with a little bit of data work.
Note: I have kept this section brief, but if you are not interested in data cleaning / processing, feel free to skip to the ‘data overview’ section — we pick it up there by loading the processed data (provided in my repo).
We are going to use Seasons_Stats.csv and player_data.csv from the kaggle dataset. As player_data.csv contains players’ physical attributes, we will join datasets by matching player names from the Seasons_Stats.csv dataset.
Load both datasets with:
import pandas as pd
player_data_df = pd.read_csv('srcdata/player_data.csv')
season_stats_df = pd.read_csv('srcdata/Seasons_Stats.csv', index_col=0)
Then have a look at the data with .head(), .info() and .describe(). There are some (minor) issues.
The player heights are stored as strings in [FEET-INCHES] format, which makes comparison difficult. There are also some missing values, as you’ll see from the counts of non-null values in each column (in .info()).
First, let’s fill the NA values in the player data for both the height & weight.
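The conversion and filling code is not shown here, so the following is only a minimal sketch of one way to do it. It runs on a toy dataframe; the column names (name, height, weight), the assumption that weights are in pounds, and the choice of the median as the fill value are all mine rather than taken from the repo.

```python
import pandas as pd

# Toy stand-in for player_data.csv. Column names, units and values are assumptions.
player_data_df = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "height": ["6-10", None, "5-11"],   # [FEET-INCHES] strings
    "weight": [240.0, 200.0, None],     # assumed to be pounds
})


def feet_inches_to_cm(value):
    """Convert a 'FEET-INCHES' string such as '6-10' to centimetres."""
    if pd.isna(value):
        return float("nan")
    feet, inches = value.split("-")
    return (int(feet) * 12 + int(inches)) * 2.54


# Convert to metric so that heights and weights become comparable numbers.
player_data_df["height"] = player_data_df["height"].map(feet_inches_to_cm)
player_data_df["weight"] = player_data_df["weight"] * 0.45359237  # lb -> kg

# Fill missing values with the column median (one simple choice among many).
for col in ["height", "weight"]:
    player_data_df[col] = player_data_df[col].fillna(player_data_df[col].median())

print(player_data_df)
```

Filling with the median keeps the overall distribution roughly intact, but dropping the rows with missing values would be an equally defensible choice.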
Pandas’ .describe() method provides statistical information on the distribution of the dataset. A review of the season stats via season_stats_df[['Year', 'MP']].describe() tells me that 75% of the stats are from 1981 onwards, and that the data includes seasons with 0 minutes played (the median is 1053 minutes).
For this study, let’s discount old (pre-1980) stats, and discount small-sample seasons with fewer than 1000 minutes played (which works out to about 12 minutes a game over an 82-game season).
This will introduce a little bit of survivorship bias, but as we are looking for optimal physiques, it probably doesn’t matter so much. (Please feel free to reach out on twitter if you have comments on this. I would love to learn more.)
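As a rough sketch, the two filters can be written as boolean masks. The toy dataframe below stands in for the real season stats, and the exact year cut-off (1981, to match the range quoted later in the article) is my assumption.

```python
import pandas as pd

# Toy stand-in for Seasons_Stats.csv with only the columns used here.
season_stats_df = pd.DataFrame({
    "Year": [1975, 1985, 1999, 2010],
    "Player": ["A", "B", "C", "D"],
    "MP": [2500, 800, 1000, 3000],
})

MIN_YEAR = 1981      # assumed cut-off for "old" seasons
MIN_MINUTES = 1000   # seasons with fewer minutes are treated as small samples

recent = season_stats_df["Year"] >= MIN_YEAR
enough_minutes = season_stats_df["MP"] >= MIN_MINUTES

# Keep only the rows that pass both filters.
filtered_df = season_stats_df[recent & enough_minutes]
print(filtered_df)
```

Keeping the thresholds as named constants makes it easy to re-run the analysis with a different cut-off and see how sensitive the results are.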
Further data cleaning is carried out here to:
- Drop partial seasons (and just keep the totals) for players who’d changed teams mid-year (e.g. via trade)
- Add height/weight data to the season stats dataframe
- Simplify positions for players listed under multiple positions (“C-F” becomes “C”)
- Fill values,
- Drop some blank columns, and
- Reset the dataframe index
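A few of the steps above could look something like the sketch below. It uses toy data, and it assumes that traded players have one row per team plus a totals row marked 'TOT' in a Tm column, and that the position column is called Pos; treat those names as assumptions rather than a description of the repo code.

```python
import pandas as pd

# Toy stand-ins for the two source files (column names are assumptions).
season_stats_df = pd.DataFrame({
    "Year":   [2015, 2015, 2015, 2015],
    "Player": ["Player A", "Player A", "Player A", "Player B"],
    "Pos":    ["C-F", "C-F", "C-F", "PG"],
    "Tm":     ["TOT", "AAA", "BBB", "CCC"],
    "MP":     [2000, 1200, 800, 2500],
})
player_data_df = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "height": [211.0, 185.0],  # cm
    "weight": [115.0, 84.0],   # kg
})

# 1. Where a player has several rows for one year, keep only the totals row.
multi_team = season_stats_df.duplicated(subset=["Year", "Player"], keep=False)
is_total = season_stats_df["Tm"] == "TOT"
stats_df = season_stats_df[~multi_team | is_total]

# 2. Add height/weight by matching on the player name.
stats_df = stats_df.merge(
    player_data_df, left_on="Player", right_on="name", how="left"
).drop(columns="name")

# 3. Simplify positions by keeping the first listed one ("C-F" becomes "C").
stats_df = stats_df.assign(pos_simple=stats_df["Pos"].str.split("-").str[0])

# 4. Reset the index after all the filtering.
stats_df = stats_df.reset_index(drop=True)
print(stats_df)
```

A left merge keeps every season row even when no bio is found, so unmatched names show up as missing heights and can be checked by hand.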
There was an error with the dataset, where players’ names had been truncated if they had more than two names. “Nick Van Exel” had become “Nick Van”. I corrected these, but only after building in some checks to confirm that the right full names were being looked up. That was just as well, because:
Amazingly, multiple human beings whose first two names are “Hot Rod” have played in the NBA: Hot Rod Williams and Hot Rod Hundley.
A check for the stat year vs player bio years helped me find the right Hot Rod.
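To illustrate the idea (not the exact code in the repo), here is a sketch of a lookup that only accepts a full name when the season year falls inside the player’s career. The helper resolve_name is hypothetical, as are the year_start / year_end column names, and the career years are approximate illustrative values.

```python
import pandas as pd

# Toy bio table; column names and career years are illustrative assumptions.
player_data_df = pd.DataFrame({
    "name": ["Hot Rod Hundley", "Hot Rod Williams", "Nick Van Exel"],
    "year_start": [1958, 1987, 1994],
    "year_end": [1963, 1999, 2006],
})


def resolve_name(truncated, stat_year, bios):
    """Return the full name matching a truncated name and a season year."""
    candidates = bios[
        bios["name"].str.startswith(truncated + " ")
        & (bios["year_start"] <= stat_year)
        & (bios["year_end"] >= stat_year)
    ]
    # Only accept an unambiguous match; otherwise leave the name untouched.
    if len(candidates) == 1:
        return candidates["name"].iloc[0]
    return truncated


print(resolve_name("Hot Rod", 1990, player_data_df))
print(resolve_name("Hot Rod", 1960, player_data_df))
print(resolve_name("Nick Van", 2000, player_data_df))
```

Returning the truncated name unchanged when there is no single match means that doubtful cases stay visible instead of being silently assigned to the wrong player.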
The resulting data is saved as ‘Seasons_Stats_proc.csv’.
Data overview
Load the dataset (if you haven’t) with:
proc_stats_df = pd.read_csv('srcdata/Seasons_Stats_proc.csv', index_col=0)
The data includes a significant number of columns, but we’ll use just a few here for simplicity.
In our analysis of physiques, weight is probably not that helpful a stat by itself, without having the context of height.
Instead, let’s add a new measure (BMI) which by definition takes height into account. The code below introduces a new column:
proc_stats_df = proc_stats_df.assign(bmi=proc_stats_df.weight / ((proc_stats_df.height/100) ** 2))
Going forward, height & BMI are going to be used as the main independent variables (physical attributes). We produce a scatter plot for an initial view:
import plotly.express as px
fig = px.scatter(
proc_stats_df, x='height', y='bmi',
color='pos_simple', category_orders=dict(pos_simple=['PG', 'SG', 'SF', 'PF', 'C']),
marginal_x="histogram", marginal_y="histogram", hover_name='Player')
fig.show()
The height distribution looks roughly normal, which suggests that there is an optimal range of heights for playing in the NBA, although it isn’t yet clear why.
Interestingly, plenty of NBA players have BMIs over 25, which is considered “overweight” for the general population. It goes to show that these standardised measures don’t really apply to everybody. These are (mostly) elite athletes, and there’s just no way that this many NBA players are overweight.
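As a quick worked example of the BMI formula (weight in kg divided by the square of height in metres), take a hypothetical player who is 206 cm tall and weighs 113 kg. The category thresholds below are the standard adult WHO cut-offs; the player’s figures are made up for illustration.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg over height in metres, squared."""
    return weight_kg / (height_cm / 100) ** 2


def who_category(value):
    """Standard adult BMI categories."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    if value < 30:
        return "overweight"
    return "obese"


example = bmi(113, 206)  # hypothetical 206 cm, 113 kg player
print(round(example, 1), who_category(example))  # 26.6 overweight
```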
You might notice that there are more tall players with higher BMIs than there are shorter players. Keep that in mind as we move on.
Additionally, the height & BMI data are going to be put into discrete bins (as in a histogram). This makes the data less sensitive to outliers, allows easier comparisons and helps us avoid overfitting the data mentally / visually. After all, we only have thousands of data points (although each ‘point’ is composed of data collected over an entire season, so not all data points are created equal).
Pandas provides a handy function to do this (pandas.cut). Implemented as:
import numpy as np  # needed for np.inf

ht_limits = [0, 190, 200, 210, np.inf]
ht_labels = [str(i) + '_' + str(ht_limits[i]) + ' to ' + str(ht_limits[i+1]) for i in range(len(ht_limits)-1)]
proc_stats_df = proc_stats_df.assign(
    height_bins=pd.cut(proc_stats_df.height, bins=ht_limits, labels=ht_labels, right=False)
)
The bin widths were derived by looking at the percentile data and using my own judgement.
The lists ht_limits and ht_labels are introduced so that I can keep easier track of what limits are used, and so that the labels can be re-used later on to specify the order in charts.
The BMI is split into four bins in the same way. We are now ready to start looking into the details of physical attributes vs careers, so let’s get going.
Note: If you would like bins based on quantiles, use pandas.qcut instead.
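The BMI binning follows the same pattern as the height bins. The sketch below shows it next to a quantile-based alternative. Only the 22.5 and 24 limits come from the bands discussed in this article; the 25.5 limit and the sample values are placeholders I made up.

```python
import numpy as np
import pandas as pd

# Made-up BMI values standing in for proc_stats_df.bmi.
bmi_values = pd.Series([21.0, 22.5, 23.9, 24.0, 25.2, 27.8, 30.1, 23.0])

# Fixed limits: 22.5 and 24 appear in the article, 25.5 is an assumed placeholder.
bmi_limits = [0, 22.5, 24, 25.5, np.inf]
bmi_labels = [str(i) + '_' + str(bmi_limits[i]) + ' to ' + str(bmi_limits[i+1])
              for i in range(len(bmi_limits)-1)]

# right=False makes each bin include its lower limit and exclude its upper one.
bmi_bins = pd.cut(bmi_values, bins=bmi_limits, labels=bmi_labels, right=False)

# Quantile-based alternative: four bins with (roughly) equal numbers of rows.
bmi_quartiles = pd.qcut(bmi_values, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(pd.DataFrame({"bmi": bmi_values, "cut": bmi_bins, "qcut": bmi_quartiles}))
```

Fixed limits keep the bins interpretable (and comparable across datasets), whereas quantile bins guarantee a similar sample size in each group.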
Let’s get bodied
Metrics
For this article, I use one advanced metric — the “PER” (player efficiency rating). This metric aims to “boil down all of a player’s contributions into one number”. You can read more about it here.
I chose this statistic because it normalises a player’s stats against their peers for that year (so that the league-average PER is always 15 each year), allowing easier comparison of players across eras.
Career length
As a sanity check, let’s take a look at the number of years played in the league (or at least, in our database) vs height / BMI. I calculated them this way.
Interesting. Each height & BMI group shows a very similar number of years per player. So career longevity looks relatively even regardless of BMI / height, at least according to this very rough measure.
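The calculation itself is not reproduced in this text, so here is a rough sketch of one way to get a “years per player” figure for each group: count the season rows and the distinct players in each bin, then divide. The toy data and the derived column names are my own.

```python
import pandas as pd

# Toy stand-in for proc_stats_df: one row per player-season.
proc_stats_df = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B", "C", "D", "D"],
    "Year": [2001, 2002, 2003, 2001, 2002, 2005, 2005, 2006],
    "height_bins": ["0_0 to 190"] * 5 + ["1_190 to 200"] * 3,
})

# Count seasons and distinct players within each height bin.
career_df = proc_stats_df.groupby("height_bins").agg(
    seasons=("Year", "count"),
    players=("Player", "nunique"),
)
career_df["seasons_per_player"] = career_df["seasons"] / career_df["players"]
print(career_df)
```

Bear in mind that only seasons which survived the earlier filters are counted, so this measures qualifying seasons rather than full career length.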
PER (Player Efficiency Rating) — Scatter plots
What about performance metrics, then? Does being taller make you, on average, better? A scatter plot of PER data vs height can be created thus:
fig = px.scatter(
proc_stats_df, x='height', y='PER', hover_name='Player'
, color='Year', color_continuous_scale=px.colors.sequential.Teal,
)
fig.show()
That’s interesting. There appears to be an optimal ‘zone’ of heights. Seasons with PERs of over 30 have only ever been achieved by those with height between 183 cm (6') and 216 cm (7'1").
Plotting the same data against BMI instead of height:

Again, there seems to be a middle, optimum range. The one exception is Shaquille O’Neal, who dominated in his time as an unstoppable force, with defenders bouncing off him like children off a jumping castle.
He is very much an outlier, though, and to be fair the data does not adequately capture his changing physique, as his BMI fluctuated significantly throughout his career.
Another observation is that the lower end of the PER range appears to increase along with BMI.
PER (Player Efficiency Rating) — Box plots
Still, the data is quite noisy. Let’s plot these again, but as a box plot.
Using box plots, we can easily visualise the distribution of data (PER) as a function of the independent variables (BMI & height bins).
fig = px.box(
proc_stats_df, x='height_bins', y='PER', color='bmi_bins', hover_name='Player',
category_orders=dict(height_bins=ht_labels, bmi_bins=bmi_labels))
fig.show()
Two results jump out, and I have marked them in the output above.
The plot for players under 190cm shows a PER increase for the 22.5 to 24 BMI band. And the data for the far right group, of 210cm and taller, suggests that being larger increases your chance of being great.
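If you want the numbers behind the boxes rather than reading them off the chart, the quartiles can be computed directly with a groupby. This is a sketch on toy data with a single grouping column; the values may differ slightly from what Plotly shows on hover, since quartile conventions vary between libraries.

```python
import pandas as pd

# Toy stand-in for proc_stats_df with made-up PER values.
proc_stats_df = pd.DataFrame({
    "height_bins": ["short"] * 4 + ["tall"] * 4,
    "PER": [10.0, 12.0, 14.0, 16.0, 11.0, 15.0, 19.0, 23.0],
})

# Lower quartile, median and upper quartile of PER for each height bin.
box_stats = (
    proc_stats_df.groupby("height_bins")["PER"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
box_stats.columns = ["q1", "median", "q3"]
print(box_stats)
```

Grouping by a list such as ["height_bins", "bmi_bins"] instead would give one row per coloured box in the chart.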
Pausing for a moment, you will recall that the data also included each player’s position. Let’s look to see if those height bins correspond to particular subgroups of player positions.
We can plot counts of these positions as histograms, and use subplots to separate out the distribution of positions for each height and BMI bin.
This code snippet does the trick:
fig = px.histogram(
proc_stats_df, x='pos_simple', facet_row='bmi_bins', facet_col='height_bins', color='pos_simple',
category_orders=dict(height_bins=ht_labels, bmi_bins=bmi_labels, pos_simple=['PG', 'SG', 'SF', 'PF', 'C']))
fig.show()
Firstly, it turns out that height is a pretty reasonable predictor of position. Turning our attention back to the box plot, this suggests that the PER increase for BMIs of 22.5 to 24 mostly relates to point guards.
Point guards in basketball are the primary ball handlers, who run the offence, and distribute the ball.
My amateur interpretation of the increased PER for this BMI band (22.5–24) is that it indicates a body type that maximises the agility and explosiveness valuable in a primary ball handler, without being so small that durability suffers.
The data for big men (on the far right of the box plot and the histogram grid) is perhaps more straightforward. Playing near the basket as a centre or power forward, being bigger might simply allow you to physically dominate your opponent, like Shaq or Charles Barkley.
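To put a number on how well height predicts position, a cross-tabulation of height bins against positions is a handy complement to the histogram grid. The sketch below uses made-up rows; with the real data you would pass the corresponding proc_stats_df columns.

```python
import pandas as pd

# Toy stand-in for proc_stats_df with made-up rows.
proc_stats_df = pd.DataFrame({
    "height_bins": ["0_0 to 190"] * 3 + ["3_210 to inf"] * 4,
    "pos_simple": ["PG", "PG", "SG", "C", "C", "C", "PF"],
})

# Share of each position within each height bin (every row sums to 1).
position_share = pd.crosstab(
    proc_stats_df["height_bins"],
    proc_stats_df["pos_simple"],
    normalize="index",
)
print(position_share.round(2))
```

If one position dominates a row, the PER patterns seen for that height bin are mostly telling you about that position.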
Ageing
Great. What about Father Time? Do certain body types age better than others? This section looks at changes to players’ performance ranges over time, according to body types.
Once again, we create age bins:
age_limits = [0, 23, 25, 27, 29, 31, np.inf]
age_labels = [str(i) + '_' + str(age_limits[i]) + ' to ' + str(age_limits[i+1]) for i in range(len(age_limits)-1)]
proc_stats_df = proc_stats_df.assign(
age_bins=pd.cut(proc_stats_df.Age, bins=age_limits, labels=age_labels, right=False)
)
Below is a PER vs age box plot, subdivided by height.

Isn’t that fascinating! For younger players, height is a definite advantage, but the advantage more or less disappears in their primes (between 23–31) before coming back again in older age.
I would interpret that to indicate that height allows players to compensate more easily for getting older and losing their athleticism.
Similarly, the plot below shows PER vs age, subdivided by BMI ranges.

In this plot, being bigger with a higher BMI appears to be more of an advantage for younger players. The explanation might be that it allows them to perform better on the block/inside, but as they get older, any advantage is negated by the loss in agility.
Does this loss in agility affect certain player types more than others? We can separate the data into subplots by position and take a look:
fig = px.box(
proc_stats_df, x='age_bins', y='PER', color='bmi_bins', hover_name='Player', facet_row='pos_simple',
category_orders=dict(bmi_bins=bmi_labels, age_bins=age_labels, pos_simple=['PG', 'SG', 'SF', 'PF', 'C'])
)
fig.show()
As we divide the data further and further, we are dealing with smaller sample sizes and we should be wary of reading too much into the data. Having said that, the answer appears to be a resounding yes!
This is entirely consistent with what we had thought above about height compensating for loss in athleticism. Guards are generally shorter, and find it difficult to compensate for their loss of athleticism as they get older.
What’s more, point guards (at the top of the subplots) and shooting guards are disproportionately affected by ageing if they have a higher BMI. Centres, at the bottom, seem to do just fine as they get older, and there isn’t much correlation between BMI & performance at all. Older guards who are also larger suffer the most.
Interestingly, older power forwards and centres appear to age the best. Is that true?
Let’s flip the data, and put the height bins as our x-axis data, and arrange them by age brackets.
fig = px.box(
proc_stats_df, x='height_bins', y='PER', color='age_bins', hover_name='Player',
category_orders=dict(height_bins=ht_labels, age_bins=age_labels)
)
fig.show()
This chart shows that taller players absolutely age better than shorter players.
In fact, taller players are expected to be significantly more productive from even a younger age, whereas shorter players take longer to develop. This would mean that significantly less time investment is required by the team that drafts them into the team before they become significant contributors.
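One way to sanity-check a reading like this against the numbers is a pivot table of median PER by height bin and age bin, followed by the difference between the oldest and youngest groups. The sketch below uses toy values, so the output says nothing about the real data.

```python
import pandas as pd

YOUNG, OLD = "0_0 to 23", "5_31 to inf"  # youngest and oldest age bin labels

# Toy stand-in for proc_stats_df with made-up PER values.
proc_stats_df = pd.DataFrame({
    "height_bins": ["short"] * 4 + ["tall"] * 4,
    "age_bins": [YOUNG, YOUNG, OLD, OLD] * 2,
    "PER": [12.0, 14.0, 11.0, 13.0, 15.0, 17.0, 16.0, 18.0],
})

# Median PER for every height bin / age bin combination.
median_per = proc_stats_df.pivot_table(
    index="height_bins", columns="age_bins", values="PER", aggfunc="median"
)

# Positive values mean the oldest group out-performs the youngest one.
ageing_delta = median_per[OLD] - median_per[YOUNG]
print(median_per)
print(ageing_delta)
```

Medians are used because they are what the box plots display, and because they are less sensitive than means to a handful of exceptional seasons.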
Modern basketball
As a final review, let’s look at whether this is true for modern basketball. The dataset used above covers 1981–2017. Basketball has changed significantly in that time, including defensive rule changes, the three point revolution and the emphasis on spacing, rather than big men dominating inside.
What impact has that had? What if we divide the data into three periods of roughly 12 years each?
yr_limits = [0, 1995, 2007, np.inf]
yr_labels = [str(i) + '_' + str(yr_limits[i]) + ' to ' + str(yr_limits[i+1]) for i in range(len(yr_limits)-1)]
proc_stats_df = proc_stats_df.assign(
year_bins=pd.cut(proc_stats_df.Year, bins=yr_limits, labels=yr_labels, right=False)
)
And we can create box plots showing PER as a function of age and height, with subplots (rows) showing different eras:
fig = px.box(
proc_stats_df, x='height_bins', y='PER', color='age_bins', hover_name='Player', facet_row='year_bins'
, category_orders=dict(height_bins=ht_labels, age_bins=age_labels, year_bins=yr_labels)
)
fig.show()
According to these stats, the NBA between 2007 and 2017 has been a golden era for bigs, much more so than 81–94 and 95–06. I will get to the basketball side of things in another post, but it seems that big men in the NBA have become more important and better, not less.
That is the power of visually exploring a dataset. By dividing up the data of approximately 8700 seasons from almost 1500 players, we can quickly identify trends in the data.
In a short time and with just a few graphs, we’ve identified potentially interesting and valuable patterns which might be worth investigating further. We’ve seen optimal BMI ranges for particular heights, how long players of various heights might take to become fully productive or to regress with age, and how different eras compare.
These might also become valuable inputs in choosing what features to include in a machine learning model, or in developing new features.
Also, it might be interesting to revisit this data with height/BMI data that is normalised across eras, compensating for any changes across eras. It would be also interesting to evaluate the data for another league (international, Euroleague, or the WNBA), to see how the data compares.
I hope that was a useful example of the information you can get by simply manipulating and visualising the data. Sometimes, there’s just no substitute for seeing something in front of you before you can understand it well.
As ever, hit me up if you have any questions or comments.
If you liked this, say 👋 / follow on twitter, or follow for updates. I also wrote these articles visualising basketball shot data, and about visualising climate data.

