
How to Automatically Extract and Label Data Points on a Seaborn KDE Plot


DALL·E 2023— An impressionist painting of an undulating mountain range with brightly colored circles along the ridgeline (all remaining images by the author).

A Kernel Density Estimate plot is a method – similar to a histogram – for visualizing the distribution of data points. While a histogram bins and counts observations, a KDE plot smooths the observations using a Gaussian kernel. As alternatives to histograms, KDEs are arguably more attractive, easier to compare in the same figure, and better at accentuating patterns in data distributions.

A histogram versus a KDE plot

Annotating statistical measures like the mean, median, or mode on KDEs makes them more meaningful. While adding lines for these measures is easy, making them look clean and uncluttered is not.

Marker lines added with the easy method (left) vs. with the harder but more attractive method (right)

In this Quick Success Data Science project, we’ll use US Census and Congressional datasets to programmatically annotate multiple KDE plots with median values. This approach will ensure that the plot annotation automatically adjusts for updates to the datasets.

For more details on KDE plots, see my previous article here.

The Datasets

Because the United States has Age of Candidacy laws, the birthdays of members of Congress are part of the public record. For convenience, I’ve already compiled a CSV file of the names of the current members of Congress, along with their birthdays, branch of government, and party, and stored it in this Gist.

For the US population, we’ll use the Census Bureau’s Monthly Postcensal Civilian Population table for July 2023. As with the previous dataset, this is public information that I’ve saved to a CSV file in this Gist.

Installing Libraries

For this project, we’ll need to install seaborn for plotting and pandas for data analysis. You can install these libraries as follows:

With conda: conda install pandas seaborn

With pip: pip install pandas seaborn

The Code

The following code was written in JupyterLab and is described by cell.

Importing Libraries

The secret to extracting values from seaborn KDE plots is the matplotlib [Line2D](https://matplotlib.org/stable/api/_as_gen/matplotlib.lines.Line2D.html) class. Each KDE curve is drawn as a Line2D object, which gives us access to the coordinates of the points along the curve. We import the class explicitly because we’ll also use it to build custom legend handles. In addition, we’ll use matplotlib [patches](https://matplotlib.org/stable/api/_as_gen/matplotlib.patches.Patch.html), specifically the Rectangle class, to plot rectangles delineating the legal age limits for serving in the House and the Senate. A patch is a matplotlib artist object with a face color and an edge color.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from matplotlib.patches import Rectangle
import seaborn as sns
import pandas as pd
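As a quick illustration of the patch concept, here is a minimal sketch of a Rectangle added to an empty axes. The coordinates and colors are placeholder values chosen for the example, and the variable names (band, right_edge) are hypothetical.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

fig, ax = plt.subplots()

# Rectangle((x, y), width, height): lower-left corner, then size in data units.
# Placeholder values: a band starting at x=25 that is 85 wide and 0.003 tall.
band = Rectangle((25, 0), 85, 0.003,
                 facecolor='#d62728', edgecolor='none', alpha=0.3)
band.set_zorder(0)  # Draw the band beneath curves and text.
ax.add_patch(band)

# Set the limits explicitly so the band is visible on the empty axes.
ax.set_xlim(0, 110)
ax.set_ylim(0, 0.01)

# Patches expose their geometry through getter methods.
right_edge = band.get_x() + band.get_width()
print(right_edge)
```

The getter methods (get_x(), get_width(), get_height()) are what make it possible to position text relative to a patch later on.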

Loading the Congressional Dataset and Calculating Ages

The following code loads the Congressional dataset and calculates each member’s age as of 8/25/2023. It first converts this reference date, along with the DataFrame’s "Birthday" column, to datetime format using pandas’ to_datetime() method. It then uses these "date aware" formats to generate an "Age" column by subtracting the two values, extracting the number of days, and then converting the days to years by dividing by 365.25.

# Load the data:
df = pd.read_csv('https://bit.ly/3EdQrai')

# Assign the current date:
current_date = pd.to_datetime('8/25/2023')

# Convert "Birthday" column to datetime:
df['Birthday'] = pd.to_datetime(df['Birthday'])

# Make a new "Age" column in years:
df['Age'] = ((current_date - df['Birthday']).dt.days) / 365.25
df['Age'] = df['Age'].astype(int)

df.head(3)
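Dividing by 365.25 is an approximation, and it can come out one year low on or very near a member’s birthday. This is unlikely to move a median, but if exact ages matter, one possible alternative is to compare calendar fields directly. The sketch below uses three made-up birthdays (not taken from the Congressional dataset) and hypothetical column names to compare the two approaches.

```python
import pandas as pd

# Made-up birthdays for illustration only.
demo = pd.DataFrame({'Birthday': ['8/25/1962', '8/25/1963', '8/26/1963']})
demo['Birthday'] = pd.to_datetime(demo['Birthday'], format='%m/%d/%Y')
ref_date = pd.Timestamp('2023-08-25')

# Approximate age: elapsed days divided by the average year length.
demo['Age_approx'] = ((ref_date - demo['Birthday']).dt.days / 365.25).astype(int)

# Exact age: difference in years, minus one if the birthday
# has not yet occurred in the reference year.
had_birthday = (
    (demo['Birthday'].dt.month < ref_date.month)
    | ((demo['Birthday'].dt.month == ref_date.month)
       & (demo['Birthday'].dt.day <= ref_date.day))
)
demo['Age_exact'] = (ref_date.year - demo['Birthday'].dt.year
                     - (~had_birthday).astype(int))

print(demo)
```

In this toy table, the first person turns 61 on the reference date, but the approximate method reports 60 because 61 calendar years span slightly fewer than 61 × 365.25 days.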

Loading the Population Dataset

Next, we load the population data as a pandas DataFrame.

# Load the US population data for July 2023:
df_popl = pd.read_csv('https://bit.ly/3Po0Syf').astype(int)
display(df_popl)

Finding the Median Age of the US Population

Here’s a fun problem. How do you find the median age of the US population? That is, how do you relate the median population value to an age?

The key is to plot the population’s cumulative distribution against age. Since you must be 25 years old or older to serve in Congress, we’ll first filter the DataFrame to those ages. Here’s the concept:

Finding the median age for the US population > 24 years old using a cumulative distribution plot

And here’s the commented code:

# Calculate the cumulative sum of the population over 24 years:
df_popl = df_popl[df_popl['Age'] >= 25].copy()
df_popl['Cumulative_Population'] = df_popl['Population'].cumsum()

# Find the total population:
total_population = df_popl['Population'].sum()

# Find row where the cumulative population crosses half the total population:
median_row = df_popl[df_popl['Cumulative_Population'] 
                     >= total_population / 2].iloc[0]

# Get the median age:
popl_median_age = median_row['Age']

# Get the median population:
popl_median = total_population / 2
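To see why this works, here is a small self-contained sketch that applies the same cumulative-sum idea to a toy table and checks it against a brute-force answer. The function name median_from_counts and the toy numbers are hypothetical and are not part of the article’s data.

```python
import numpy as np
import pandas as pd

def median_from_counts(table, value_col='Age', count_col='Population'):
    """Return the first value whose cumulative count reaches half the total."""
    ordered = table.sort_values(value_col)
    cumulative = ordered[count_col].cumsum()
    half_total = ordered[count_col].sum() / 2
    return ordered.loc[cumulative >= half_total, value_col].iloc[0]

# Toy binned data: 100 people spread over four ages.
toy = pd.DataFrame({'Age': [25, 26, 27, 28],
                    'Population': [10, 20, 30, 40]})
toy_median = median_from_counts(toy)

# Brute-force check: repeat each age once per person and take the median.
expanded = np.repeat(toy['Age'].to_numpy(), toy['Population'].to_numpy())
expanded_median = np.median(expanded)

print(toy_median, expanded_median)
```

The two answers can differ by half a year when the halfway point falls exactly on the boundary between two ages, because np.median() averages the two middle values while the cumulative method returns the lower age.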

Making a Simple Layered KDE Plot

Before we annotate the plot, let’s see what we get "out of the box," so to speak. We’ll layer multiple KDE plots in the same figure. These will include one for the House of Representatives, one for the Senate, and one for the US population over the age of 24.

# Find the median member ages by branch of government (a pandas Series):
median_ages = df.groupby('Branch')['Age'].median()

# Make a custom (red-blue-gray) color palette (optional):
colors = ['#d62728', '#1f77b4', '#7f7f7f']
sns.set_palette(sns.color_palette(colors))

# Plot Congressional ages as a KDE and overlay with population KDE:
fig, ax = plt.subplots()

sns.kdeplot(data=df, 
            x='Age', 
            hue='Branch', 
            multiple='layer', 
            common_norm=True)

sns.kdeplot(df_popl, 
            x='Age', 
            weights='Population', 
            color='grey', 
            alpha=0.3, 
            legend=False, 
            multiple='layer')

ax.set_title('Age Distributions of Senate, House, and US Population > 24 yrs')
ax.legend(loc='upper left', labels=['Senate', 'House', 'Population'])
ax.set_xlim((0, 110));
A simple layered KDE plot

An important parameter for the kdeplot() method is common_norm, which stands for "common normalization."

According to seaborn’s documentation, when common_norm is set to True, each conditional density is scaled by the number of observations so that the total area under all the densities sums to 1. Otherwise, each density is normalized independently. With common_norm=True, the area under each curve is therefore proportional to the size of its group, which is why the House curve (with more members) sits above the Senate curve. If you only want to compare the shapes of the distributions, regardless of group size, set common_norm=False.

Note that, in this case, the normalization is only applied to the House and Senate curves, as the population data is plotted separately from a different DataFrame. We plot it separately because we need to weight the ages by their population value, as our population dataset doesn’t include a distinct age value for each individual.

Finding the Median Age Values for the House and Senate

While attractive, the previous plot makes the reader work too hard. The x-axis needs more resolution, and it would be nice to know where the mean or median value falls on the curves. Since both houses of Congress include a few very old members who can skew the mean, we’ll focus on the median value.

First, we’ll need to find the median values for each branch (we found the population median previously). And since we want to programmatically find the plot coordinates for the annotations, we’ll make a separate DataFrame for each branch. This will make it easier to extract the curve data.

# Filter the DataFrame to each branch of government:
df_house = df.loc[df['Branch'] == 'House'].copy()
df_senate = df.loc[df['Branch'] == 'Senate'].copy()

# Find the median age values for each branch:
median_house = int(df_house['Age'].median())
median_senate = int(df_senate['Age'].median())
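One detail worth knowing: when a group has an even number of members, the median can end in .5, and int() truncates it. The sketch below uses a made-up seven-row table (not the Congressional data) with hypothetical variable names to show the behavior, and also confirms that groupby() gives the same medians as filtering each branch separately.

```python
import pandas as pd

# Made-up ages for illustration only.
demo = pd.DataFrame({
    'Branch': ['House', 'House', 'House',
               'Senate', 'Senate', 'Senate', 'Senate'],
    'Age': [41, 58, 77, 50, 63, 66, 81],
})

# One pass over both branches:
medians_by_branch = demo.groupby('Branch')['Age'].median()

# Same result as filtering first:
senate_only = demo.loc[demo['Branch'] == 'Senate', 'Age'].median()

# int() truncates toward zero, so 64.5 becomes 64.
median_house_demo = int(medians_by_branch['House'])
median_senate_demo = int(medians_by_branch['Senate'])

print(medians_by_branch)
print(median_house_demo, median_senate_demo)
```

If the half-year matters for your labels, you could keep the float and format it in the annotation text instead.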

Plotting and Annotating the KDE

The following commented code draws and annotates the plot. Our goal here is to find the (x, y) coordinates for the median values on the curves so that we can programmatically provide these coordinates when drawing lines and posting text. This makes the code adaptable to any changes in the input data.

So how do we do this? Well, when seaborn makes a KDE plot, it returns a matplotlib [axes](https://matplotlib.org/stable/api/axes_api.html) object. This type of object has a [get_lines()](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.get_lines.html) method that returns a list of the lines contained by that object. These lines are [Line2D](https://matplotlib.org/stable/api/_as_gen/matplotlib.lines.Line2D.html) objects, and each has a get_data() method that returns the line data as a pair of arrays: the x-coordinates and the y-coordinates. Because the x-coordinates lie on an evenly spaced grid that may not include the exact values we want, we’ll use NumPy’s [interp()](https://numpy.org/doc/stable/reference/generated/numpy.interp.html) function to linearly interpolate the curve height at each median.

# Create figure and title:
fig, ax = plt.subplots()
ax.set_xlim((0, 110))
ax.set_xticks(range(0, 110, 10))
ax.set_title('Age Distributions of House, Senate, and US Population > 24 yrs')

# Define colors and labels:
colors = ['#d62728', '#1f77b4', '#7f7f7f']
labels = ['House', 'Senate', 'Population']

# Loop through the datasets and plot KDE, median lines, and labels:
datasets = [df_house, df_senate]
medians = [median_house, median_senate]

for i, (data, color, label) in enumerate(zip(datasets, colors, labels)):
    sns.kdeplot(data=data, x='Age', color=color, fill=False, label=label)
    x, y = ax.get_lines()[i].get_data()
    f = np.interp(medians[i], x, y)
    ax.vlines(x=medians[i], ymin=0, ymax=f, ls=':', color=color)
    ax.text(x=medians[i], y=f, s=f'Median = {medians[i]}', color=color)

# Make and annotate the population KDE plot:
sns.kdeplot(df_popl, x='Age', weights='Population', color='#7f7f7f', fill=False)
x, y = ax.get_lines()[2].get_data()  # Note that this is the 3rd line([2]).
f = np.interp(popl_median_age, x, y)
ax.vlines(x=popl_median_age, ymin=0, ymax=f, ls=':', color='#7f7f7f')
ax.text(x=popl_median_age, y=f, 
        s=f'Median = {popl_median_age}', color='#7f7f7f')

# Build a custom legend:
legend_handles = [Line2D(xdata=[0, 1], ydata=[0, 1], ls='-', 
                         color=color) for color in colors]
ax.legend(handles=legend_handles, loc='upper left', labels=labels)

# Manually annotate the Age Limit shading:
age_limit_rects = [
    Rectangle((25, 0), 85, 0.003, facecolor='#d62728', alpha=0.3),
    Rectangle((30, 0), 85, 0.001, facecolor='#1f77b4', alpha=0.6)
    ]

for age_rect, label, color in zip(
    age_limit_rects, ['House age limits', 'Senate age limits'], 
    ['#d62728', '#1f77b4']):
    age_rect.set_zorder(0)  # Move rect below other elements.
    ax.add_patch(age_rect)
    ax.text(x=age_rect.get_x(), y=age_rect.get_height(), 
            s=label, color=color)
The annotated KDE plots

In the previous code, we plotted the population KDE outside the loop as the age data had to be weighted by its population value.

We also manually annotated the colored rectangles and text for the age limits as the programmatic solution is not very appealing. "Hardwiring" these annotations is acceptable as this information is fixed and won’t change with changes to the input data.

Summary

In this project, we programmatically extracted the (x, y) coordinates of points along a KDE curve and used them to annotate a plot. The result was a much cleaner-looking plot where vertical marker lines terminate when they intersect the curve, and text annotations begin at that intersection point. This makes the code much more flexible, as these annotations will automatically update with any changes to the input data.

We also used a cumulative distribution to find the median age of the US population, defined as the age at which the cumulative population first reaches half of the total. We had to do this because our age-versus-population input data was binned by age rather than listing each individual.

Thanks!

Thanks for reading and please follow me for more Quick Success Data Science projects in the future.

