
Make a Nested Bar Chart with Seaborn

Benchmarking the accuracy of college football polls

Quick Success Data Science

A nested bar chart is a visualization method that compares multiple measurements within categories. One of these measurements represents a secondary or background measure, such as a target or previous value. The primary measurement represents the actual or current value.

The secondary measure is usually presented in a diminished capacity, thus providing context for the primary measurement. Placing a wider and darker primary bar on top of a narrower and lighter secondary bar yields an attractive and compact chart. It also explains why this graphic is sometimes referred to as a lipstick bar chart.

Of course, for this to work properly, the primary bar should never be longer than the secondary bar. Thus, you’ll want to use nested bar charts to plot examples of diminishing values per category, such as a drop in house prices, or a decrease in disease rates due to a new vaccine.
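
Because the chart only reads correctly when every primary value fits inside its secondary value, a quick check before plotting can help. The following is a minimal sketch with made-up numbers; the function name fits_nested_chart is hypothetical and not part of any library.

```python
def fits_nested_chart(primary, secondary):
    """Return True if every primary value is no larger than its secondary value."""
    if len(primary) != len(secondary):
        raise ValueError("primary and secondary must have the same length")
    return all(p <= s for p, s in zip(primary, secondary))


# Made-up example: current values (primary) vs. previous values (secondary).
previous_prices = [310, 295, 402]
current_prices = [290, 280, 385]

print(fits_nested_chart(current_prices, previous_prices))
```

If the check fails, a grouped or side-by-side bar chart is probably a safer choice than nesting.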

In this Quick Success Data Science project, we’ll look at how well the famous Associated Press College Football Top 25 poll does at picking the 25 best American college football teams at the start of each season. Since the number of preseason picks that finish in the final ranking can’t exceed the 25 teams in that ranking, this is a good application for a nested bar chart.

The AP College Football Top 25 Poll

The AP releases its preseason poll around August or September of each year and then updates it every week throughout the season. Over 60 reputable sports writers and broadcasters across the country are called on to cast their votes for the best teams.

They start by creating a list of what they believe are the 25 best teams (out of 133) and assign each team a number of points, awarding the best team the maximum of 25. The AP then combines these points to rank the teams in descending order. After the bowl season and College Football Playoff, it releases a final poll for the year.
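
To make the tallying concrete, here is a minimal sketch. It assumes each ballot is an ordered list and that the points drop by one per place (25 for first, 24 for second, and so on). The shortened three-team ballots and the function name tally_ballots are made up for illustration.

```python
from collections import Counter

MAX_POINTS = 25  # points for the first-place team on a ballot


def tally_ballots(ballots):
    """Combine ranked ballots into (team, points) pairs, highest total first."""
    totals = Counter()
    for ballot in ballots:
        for place, team in enumerate(ballot):
            # Assumption: each place is worth one point less than the one above it.
            totals[team] += MAX_POINTS - place
    # Sort by points (descending), breaking ties alphabetically.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical, shortened ballots from three voters:
ballots = [
    ['Team A', 'Team B', 'Team C'],
    ['Team B', 'Team A', 'Team C'],
    ['Team A', 'Team C', 'Team B'],
]
ranking = tally_ballots(ballots)
print(ranking)
```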

You can find these polls on the AP’s website and on sports sites such as Sports Illustrated, the NCAA, and Sports Reference. For convenience, I’ve already compiled the preseason and final polls for the last 20 years (2002–2022) and stored them as a CSV file in this Gist.

The Code

The following code was inspired by a recent article on lipstick charts by Oscar Leo:

How to Create a Lipstick Chart with Matplotlib

In this article, we’ll build on Oscar’s Python code for setting up a color palette, an attractive Seaborn style, and a function for plotting horizontal bars with different widths and alphas, as required for nesting bars. We’ll tweak some of this code and add some more for loading and preparing the data and for making the final display.

Importing Libraries and Setting the Style

For this project, we’ll need matplotlib and seaborn for plotting and pandas for data loading and preparation. You can find the current installation instructions for each by searching for "install" followed by the library name.

For the color palette, I chose "footbally" greens and browns from the helpful Color Hunt site recommended in Oscar Leo’s article.

Seaborn’s set() function accepts runtime configuration (rc) style parameters that will automatically be applied to every figure. This functionality is helpful if you want to make multiple figures with the same parameters, or if you want to "unclutter" your plotting code by abstracting these details to another cell or location. Obviously, you won’t need this code if you’re using seaborn’s default plotting parameters.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Set up a color palette of grass greens and pigskin browns:
BACKGROUND_COLOR = '#B5CB99'
TEXT_COLOR = '#FCE09B'
BAR_COLOR = '#B2533E'
LEGEND_COLOR = '#186F65'

# Apply a seaborn style with display parameters favorable for nested bars:
sns.set(style='white', rc={
    'axes.facecolor': BACKGROUND_COLOR,
    'figure.facecolor': BACKGROUND_COLOR,
    'text.color': TEXT_COLOR,
    'axes.spines.left': False,
    'axes.spines.bottom': False,
    'axes.spines.right': False,
    'axes.spines.top': False
})

Loading the Data

The following code uses pandas’ read_csv() function to load the CSV-formatted poll data from the gist. It then displays the first three rows for quality control.

# Load AP Top 25 College poll data:
df = pd.read_csv('https://bit.ly/45yEPtI')
df.head(3)

Note: Since we’re going to be comparing team names, it’s good to check that they’re consistent from poll to poll. In this case, a few polls included the team’s record in parentheses after the team’s name. This was a minor issue that I’ve already cleaned up, but you should be aware of it if you want to expand on this project in the future.
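
If you extend the dataset and run into the same issue, a regular expression is one way to handle it. This is a minimal sketch that assumes the record appears as a trailing "(wins-losses)" such as "(13-1)"; the function name strip_record and the sample names are hypothetical. The pattern matches only digits separated by a hyphen, so parentheses that are part of a team name are left alone.

```python
import re

# Assumption: records look like " (13-1)" at the end of the name.
RECORD_PATTERN = re.compile(r'\s*\(\d+-\d+\)\s*$')


def strip_record(team):
    """Remove a trailing win-loss record in parentheses from a team name."""
    return RECORD_PATTERN.sub('', team).strip()


raw_names = ['Alabama (13-1)', 'Ohio State', 'Miami (FL)']
clean_names = [strip_record(name) for name in raw_names]
print(clean_names)
```

On a DataFrame, the same pattern could be applied to the Team column with pandas’ Series.str.replace() and regex=True.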

Preparing the Data for Plotting the Top 25 Results

We’ll start by looking at the number of teams in the preseason poll that made it to the final poll at the end of the season. To do this, we’ll lean on Python’s set data type.

Just as in classical mathematics, a set can contain only unique values (no duplicates) and you can use built-in functions to find the intersection of two sets. This means that we can extract items (teams) that are shared between the preseason and final polls.
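
As a small illustration of the idea, here is a sketch with made-up team names (they are placeholders, not poll data):

```python
# Duplicates collapse automatically when a list is converted to a set:
preseason = set(['Team A', 'Team B', 'Team C', 'Team D', 'Team A'])
final = {'Team B', 'Team D', 'Team E', 'Team F'}

# Teams present in both polls (equivalent to: preseason & final):
shared = preseason.intersection(final)

print(sorted(shared))
print(len(shared))
```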

# Initialize a list to store the intersection results:
top_25 = []

# Get unique years from the DataFrame:
unique_years = df['Year'].unique()

# Loop through each year and find the intersection of Final and Preseason Teams:
for year in unique_years:
    year_data = df[df['Year'] == year]

    # Make sets of the final and preseason teams:
    final_teams = set(year_data[year_data['Poll'] == 'Final']['Team'])
    preseason_teams = set(year_data[year_data['Poll'] == 'Preseason']['Team'])

    # Find the set intersections for each year and append to the top_25 list:
    intersection = final_teams.intersection(preseason_teams)
    num_right = len(intersection)    
    top_25.append({'Year': year, 'Finishers': num_right})

# Create a new DataFrame from the list:
df_25 = pd.DataFrame(top_25)

# Add columns for the number of ranked teams and percent predicted correctly:
df_25['Top 25'] = 25
df_25['Pct Right'] = df_25['Finishers'] / df_25['Top 25']
df_25['Pct Right'] = df_25['Pct Right'].apply(lambda x: f'{x:.0%}')

print(df_25)
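
The '.0%' format specification used above multiplies a fraction by 100, rounds it to zero decimal places, and appends a percent sign. Here is a quick sketch using a hypothetical count of 16 finishers for one season; it also shows that the f-string form and the str.format() form give the same label.

```python
finishers = 16          # hypothetical count for one season
share = finishers / 25  # fraction of the Top 25 predicted correctly

label_fstring = f'{share:.0%}'         # f-string form
label_format = '{:.0%}'.format(share)  # equivalent str.format() form

print(label_fstring, label_format)
```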

Defining a Function to Plot Bars

The following code defines a function that calls seaborn’s barplot() function. Its arguments give you control over the parameters used to generate a nested bar chart. For example, you need the axes object (ax_obj) to overlay bars in the same figure, width to make the primary bar wider than the secondary bar, and alpha to adjust the transparency of each bar so that the primary bar is darker.

def add_bars(ax_obj, x, y, width, alpha, label):
    """Plot a seaborn horizontal bar chart (credit Oscar Leo)."""
    sns.barplot(ax=ax_obj, x=x, y=y, 
                label=label,
                width=width, 
                alpha=alpha,
                color=BAR_COLOR,
                edgecolor=TEXT_COLOR,
                orient="h")

Plotting the Top 25 Nested Bar Chart

In the code that follows, we set up a figure and then call the add_bars() function twice, tweaking the arguments, to make the primary and secondary bars. The label argument is used in the legend.

To make the display more informative, we’ll use the bar_label() method to add text about the percentage of preseason predictions that were correct. We’ll use negative padding to shift this text to the left, inside the primary bar, so that it’s visually associated with the correct bar.

# Make the display, calling add_bars() twice to nest the bars:
fig, ax = plt.subplots(figsize=(8, 9))
ax.set_title('Number of Teams Starting AND Finishing in\n'
             'AP Top 25 College Football Poll',
             color='k', fontsize=13, weight='bold')

# Plot bars for total number of teams (secondary measure):
add_bars(ax_obj=ax, 
         x=df_25['Top 25'],
         y=df_25['Year'],
         width=0.55, 
         alpha=0.6, 
         label='Teams in Preseason Poll')

# Plot bars for teams that started AND finished in the Top 25 (primary measure):
add_bars(ax_obj=ax, 
         x=df_25['Finishers'],
         y=df_25['Year'],
         width=0.7, 
         alpha=1, 
         label='Teams in Preseason AND Final Polls')

# Add informative text stating percent correct:
ax.bar_label(ax.containers[1], 
             labels=df_25['Pct Right'] + ' correct', 
             padding=-70)

# Assign a custom x-axis label and legend:
ax.set_xlabel('Number of Teams') 
ax.legend(bbox_to_anchor=(1.0, -0.085), facecolor=LEGEND_COLOR);

Of the 500 teams ranked in the last 20 years, 313 of the ones picked in the preseason made the final cut, a success rate of about 63%.
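
The arithmetic behind that figure is simple enough to check directly. This sketch uses the totals quoted above; the variable names are only illustrative.

```python
seasons = 20           # seasons covered
teams_per_poll = 25    # teams ranked in each poll
total_finishers = 313  # preseason picks that also made the final poll

ranked_slots = seasons * teams_per_poll
success_rate = total_finishers / ranked_slots

print(ranked_slots, f'{success_rate:.1%}')
```

With the DataFrame built earlier, the same ratio should follow from dividing the sum of the Finishers column by the sum of the Top 25 column.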

Is that a good result? I’m not sure. American college football is dominated by multiple powerhouse programs that regularly appear in the final polls, so picking these is fairly easy and reliable. One interesting observation, however, is the downward trend beginning in 2011.

This trend could be a coincidence or a function of multiple factors including changing game rules, conference realignments, the introduction of NIL (Name, Image, and Likeness) payments, and the opening of the transfer portal.

Preparing the Data for Plotting the Top 4 Results

Starting with the 2014 season, college football adopted a four-team playoff tournament, the College Football Playoff, to determine the national champion of the NCAA (National Collegiate Athletic Association) Division I Football Bowl Subdivision. Let’s revisit our previous code and select the Top 4 teams, to see how well the polls did in selecting the final champion. Here’s a spoiler: the preseason poll selected the final champion in only 2 of the last 20 years, a success rate of only 10 percent!

# Filter the original DataFrame to teams ranked 4 or better:
df_4 = df[(df['Rank'] <= 4)].copy()

# Initialize a list to store the intersection results:
top_4 = []

# Loop through each year and find the intersection of Final and Preseason Teams:
for year in unique_years:
    year_data = df_4[df_4['Year'] == year]

    # Make sets of the final and preseason teams:
    final_teams = set(year_data[year_data['Poll'] == 'Final']['Team'])
    preseason_teams = set(year_data[year_data['Poll'] == 'Preseason']['Team'])

    # Find the set intersections for each year and append to the top_4 list:
    intersection = final_teams.intersection(preseason_teams)
    num_right = len(intersection)    
    top_4.append({'Year': year, 'Finishers': num_right})

# Create a new DataFrame from the list:
df_final_4 = pd.DataFrame(top_4)

# Add columns for the number of ranked teams and percent predicted correctly:
df_final_4['Top 4'] = 4
df_final_4['Pct Right'] = (df_final_4['Finishers'] / df_final_4['Top 4'])
df_final_4['Pct Right'] = df_final_4['Pct Right'].apply('{:.0%}'.format)

print(df_final_4)

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 10))
ax.set_title('Number of Teams Starting AND Finishing in\n'
             'Top 4 of AP College Football Poll',
             color='k', fontsize=14, weight='bold')

add_bars(ax_obj=ax, 
         x=df_final_4['Top 4'],
         y=df_final_4['Year'],
         width=0.55, 
         alpha=0.6, 
         label='Top 4')

add_bars(ax_obj=ax, 
         x=df_final_4['Finishers'],
         y=df_final_4['Year'],
         width=0.7, 
         alpha=1, 
         label='Finishers')

ax.bar_label(ax.containers[1], 
             labels=df_final_4['Pct Right'] + ' correct', 
             padding=3)

ax.set_xticks(range(5))
ax.set_xlabel('Number of Correct Final Four Predictions', 
              fontdict={'size': 16});

One thing to note here is that we had to pad the bar labels to the right this time. The problem is that none of the preseason Top 4 teams finished in the final Top 4 in 2010 and 2013, so the primary bars for those years have zero length. If we pad the labels to the left, as we did in the previous display, the annotations will "fall off the edge" and post over the y-axis values. As you can imagine, having a combination of very short and very long primary bars is not ideal for annotating nested bar charts.

Of the 80 teams landing in the Top 4 over the past 20 years, the preseason poll has identified about 42% of them. In 2011 and 2020 it got 3 out of 4 correct. In 2010 and 2013 it was incorrect for all four.

Summary

Nested bar charts are a clean and compact way to compare categorical measurements where one measurement is consistently lower than the other. By including bar labels that provide additional information, you can easily turn these into attractive infographics that convey meaning as well as information.

Thanks!

Thanks for reading and please follow me for more Quick Success Data Science projects in the future.

