
College Football Conference Realignment – Exploratory Data Analysis in Python

Exploring the changing CFB landscape as a data scientist

It’s my favorite time of year: fall, which means it’s time for college football. I have always loved college sports. Growing up, I lived in a Big Ten/SEC household and a Big East (now ACC) town, which meant a deluge of college sports filled the television screen from the first kick-off in August to the last buzzer-beater in April. Recently, analytics has come to dominate both football and basketball, but since it is football season, let’s start there.

Photo by David Ireland on Unsplash

The last two off-seasons in college sports have been abuzz with NIL, transfer portal, and conference realignment news. I think the sentiment among most fans is captured by Dr Pepper’s "Chaos Comes to Fansville" commercial. I began to notice that every conversation about conference realignment, in particular, was filled with speculation and fueled by gut feeling. There was, however, a common faith that some great and powerful College Football Oz was crunching numbers to decide which team was worth adding to which conference. I still haven’t had the opportunity to meet this man behind the curtain, so until then I’d like to take a shot at proposing a data-driven conference realignment.

This is a four-part blog series that will hopefully serve as a fun way to learn some new data science tools:

  1. College Football Conference Realignment – Exploratory Data Analysis in Python
  2. College Football Conference Realignment – Regression
  3. College Football Conference Realignment – Clustering
  4. College Football Conference Realignment – node2vec

I’ll preface this post by saying there are many ways to perform exploratory data analysis, so I’ll only cover a few methods here that are relevant to conference realignment.

The Data

I took the time to build my own dataset using sources I compiled from across the web. These data include basic information about each FBS program, a non-canonical approximation of all college football rivalries, stadium size, historical performance, frequency of appearances in AP top 25 polls, whether the school is an AAU or R1 institution (historically important for membership in the Big Ten and Pac 12), the number of NFL draft picks, data on program revenue from 2017–2019, and a recent estimate of the size of college football fan bases.

Finding Latitude and Longitude

The first step in our exploratory data analysis is to convert the city and state data we have for each team into a latitude and longitude. This is easy to do in Python using the geopy package. First, I import the dependencies and load in a csv file with the city and state of each team.

# Import dependencies
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim
# Read csv data with columns 'City' and 'State'
city_list_df = pd.read_csv(r'.\FBS_Football_Cities.csv', encoding='unicode_escape')

Then comes the meat of the code. I run a for loop to collect the latitude and longitude of each city.

# Create the geocoder once, outside the loop
geolocator = Nominatim(user_agent="GetLoc")

# Lists to track the latitude and longitude of each city
lat_list = []
long_list = []

# For each city in the dataframe, get the latitude and longitude
# and add them to the lists
for index, city in enumerate(city_list_df['City']):
    city_name = str(city) + ', ' + str(city_list_df['State'][index])

    # Two cities needed some manual cleaning
    if index == 39:
        city_name = 'Urbana, Illinois'
    elif index == 92:
        city_name = 'San Jose, California'

    print(city_name)
    print(index)

    # Geocode the location name
    getLoc = geolocator.geocode(city_name)

    # Add the latitude and longitude to their respective lists
    lat_list.append(getLoc.latitude)
    long_list.append(getLoc.longitude)
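One practical note: Nominatim’s free public endpoint throttles rapid-fire requests, so a long loop like this can fail partway through. A minimal safeguard, sketched here with geopy’s built-in RateLimiter wrapping the same geolocator as above, spaces the calls out:

from geopy.extra.rate_limiter import RateLimiter

# Wrap the geocode call so successive requests are at least one second apart
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Use inside the loop exactly like geolocator.geocode
getLoc = geocode(city_name)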

Finally, I combine these into a data frame and output to a csv.

lat_long_df = pd.DataFrame(lat_list, columns=['latitude'])
lat_long_df['longitude'] = long_list
lat_long_df.to_csv(r'.\cfb_lat_long.csv')

Conference Geographic Centroid

Now that we have the latitude and longitude data for each team, we can find the geographic center of each conference using the same package. You could think of the geographic center as the optimal neutral site for a conference championship. Note that averaging raw latitudes and longitudes is a flat-map approximation, but at the scale of the continental U.S. it is close enough for our purposes. The data I put together assigns teams to their respective conferences for the 2025 season (e.g. UCLA and USC are in the Big Ten).

# Read csv of all data
cfb_info_df = pd.read_csv(r'.\FBS_Football_Team_Info.csv', encoding='unicode_escape')

# Create the geocoder once, outside the loop
geolocator = Nominatim(user_agent="GetLoc")

# Track conference name, lat., long., and city name
conf_name_list = []
conf_lat_list = []
conf_long_list = []
conf_city_list = []

# For each conference in the data set, calculate the mean latitude and mean longitude
for conf in np.unique(cfb_info_df['Current_conference_2025']):
    conf_latitude = np.mean(cfb_info_df[cfb_info_df['Current_conference_2025'] == conf]['Latitude'])
    conf_longitude = np.mean(cfb_info_df[cfb_info_df['Current_conference_2025'] == conf]['Longitude'])

    # Reverse geocode the mean lat. and long. to get the nearest city name
    getCity = geolocator.reverse(str(conf_latitude) + ', ' + str(conf_longitude))

    # Update lists
    conf_name_list.append(conf)
    conf_lat_list.append(conf_latitude)
    conf_long_list.append(conf_longitude)
    conf_city_list.append(getCity)

    print(f'Conference: {conf}, Centroid City: {getCity} ({conf_latitude}, {conf_longitude})')

# Create data frame by conference
conf_center_df = pd.DataFrame(conf_name_list, columns=['conference'])
conf_center_df['latitude'] = conf_lat_list
conf_center_df['longitude'] = conf_long_list
conf_center_df['city'] = conf_city_list
# Add a column for text to appear on our map
conf_center_df['text'] = conf_center_df['conference'] + ': ' + conf_center_df['city'].astype(str)

Now, we can use the plotly package to create a simple interactive map and visualize the new geographic centers of college football conferences.

import plotly.graph_objects as go
fig = go.Figure(data=go.Scattergeo(
        lon = conf_center_df['longitude'],
        lat = conf_center_df['latitude'],
        text = conf_center_df['text'],
        mode = 'markers'))
fig.update_layout(title = 'Conference Geographic Centers<br>(Hover for conference names)',
        geo_scope='usa')
fig.show()
Plot of the geographic center of each conference (and Independents).
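As a side note, since the figure is interactive, plotly can export it as a standalone HTML file for sharing (the filename here is just a placeholder):

# Save the interactive map as a self-contained HTML file
fig.write_html('conference_centers_map.html')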

These are the closest small and large urban areas to the conference geographic centers:

  • ACC: Greensboro, North Carolina (Charlotte, North Carolina)
  • American: Birmingham, Alabama (Memphis, Tennessee)
  • Big 12: Fayetteville, Arkansas (Oklahoma City, Oklahoma)
  • Big Ten: Springfield, Illinois (St. Louis, Missouri)
  • C-USA: Jackson, Mississippi (Memphis, Tennessee)
  • MAC: Toledo, Ohio (Detroit, Michigan)
  • Mountain West: Las Vegas, Nevada
  • Pac-12: Reno, Nevada (Salt Lake City, Utah)
  • SEC: Memphis, Tennessee
  • Sun Belt: Birmingham, Alabama (Atlanta, Georgia)
  • Independents: Scranton, Pennsylvania (New York, New York)

These results seem pretty reasonable, although the Independents aren’t technically a conference. The South is the center of the college football world. The city of Memphis is looking to build a new stadium, perhaps just in time to become a potential host city for the American, C-USA, or SEC championship game.

Summary Statistics

One of the best ways to get acquainted with any data set is to produce summary statistics of its numeric features. Pandas has a built-in method for getting summary statistics called describe(). Here is an example for enrollment:

cfb_info_df['Enrollment'].describe()

This one-line statement outputs:

count      133.000000
mean     29337.541353
std      13780.834495
min       3297.000000
25%      21200.000000
50%      28500.000000
75%      38500.000000
max      69400.000000
Name: Enrollment, dtype: float64

From the summary statistics, we can see that the mean and median are similar and that the coefficient of variation (standard deviation divided by the mean) is less than 0.5, meaning we likely have a roughly symmetric distribution.
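As a quick check, here is a minimal sketch that computes the coefficient of variation directly from the same column:

enrollment = cfb_info_df['Enrollment']

# Coefficient of variation: standard deviation relative to the mean
cv = enrollment.std() / enrollment.mean()
print(f'Coefficient of Variation: {cv:.2f}')  # roughly 0.47 for these data

We can also verify the shape visually by producing a histogram that breaks out enrollment by conference.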

import plotly.express as px
fig = px.histogram(cfb_info_df, x="Enrollment", color="Current_conference_2025", marginal="box", hover_data=cfb_info_df.columns)
fig.show()
The histogram shows the distribution of enrollment split by conference. The marginal distribution is shown above.

From the histogram, we can see that most universities have enrollments between 20,000 and 35,000 students. The Big Ten leads the way with a median enrollment of over 45,000, while the Sun Belt includes the smallest schools, with a median enrollment of under 20,000. The ACC has the lowest-enrollment schools of any Power 5 conference.

Correlation

Correlation is a measure of the relationship between two features. Correlation ranges between -1 and 1, where a correlation of 1 represents a perfect positive relationship, -1 represents a perfect negative relationship, and 0 represents no relationship. Correlation should not be confused with causation, as correlation is simply a metric of the observed relationship between two features. Pandas once again comes equipped with a method, corr(), to easily calculate the correlation between columns.
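For example, here is a minimal sketch of the pairwise case, correlating stadium capacity with historical win percentage (two of the columns in our dataset):

# Pearson correlation (the pandas default) between two columns
r = cfb_info_df['Stadium_capacity'].corr(cfb_info_df['historical_win_pct'])
print(f'Stadium capacity vs. historical win pct: r = {r:.2f}')

To see every pairwise relationship at once, we can compute the full correlation matrix and visualize it as a heatmap with seaborn.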

import seaborn as sns
# Pull out the numeric columns from the data
numeric_columns = ['Enrollment','years_playing','Stadium_capacity','total_draft_picks_2000_to_2020',
                  'wsj_college_football_revenue_2019', 'tj_altimore_fan_base_size_millions', 'bowl_game_win_pct',
                 'historical_win_pct', 'p_AP_Top_25_2001_to_2021']
# Generate the correlation matrix using only the numeric columns
correlation_matrix = cfb_info_df.loc[:,numeric_columns].corr()
# Set the plot size
sns.set(rc={"figure.figsize":(11, 10)})
# Plot the correlation matrix as a heatmap with the values annotated in each cell
sns.heatmap(correlation_matrix, annot=True)
The correlation matrix shows a perfect positive relationship between each feature and itself. We also see high correlation between stadium capacity, fan base size, 2019 revenue, and the percentage of weeks appearing in the AP top 25 between 2001 and 2021.

The resulting correlation matrix is a great visual for understanding which features might relate to one another. For example, stadium capacity, 2019 revenue, fan base size, and the historical probability of appearing in the AP top 25 poll are all highly correlated. Interestingly, these features are only moderately correlated with historical win percentage and barely correlated with bowl game win percentage. One explanation could be that strength of schedule doesn’t show up in our data: a team that goes 8–4 in a Power 5 conference could be much better than a team that goes 8–4 in a Group of 5 conference. Moreover, this is a direct reminder that correlation does not equal causation, because building a bigger stadium will not guarantee a bigger fan base.

Another interesting takeaway is that the number of years a team has been active does not correlate strongly with any metric of success. Of course, a long history on the gridiron could lead to perennial pain for a fan base. For example, Indiana University recently became the first FBS team to hit the historic 700-loss mark.

Finally, we can zoom in on our latest findings and make a simple yet powerful visualization using plotly.

import plotly.express as px
fig = px.scatter(cfb_info_df, x="Stadium_capacity", y="p_AP_Top_25_2001_to_2021", color="Current_conference_2025",
                 size="tj_altimore_fan_base_size_millions",
                labels=dict(Stadium_capacity="Stadium Capacity", 
                            p_AP_Top_25_2001_to_2021="Percent of Weeks in AP Top 25", 
                            Current_conference_2025="Conference (2025)", 
                            tj_altimore_fan_base_size_millions = "Fan Base Size"))
fig.show()
The plot shows stadium capacity on the x axis, and AP poll success on the y axis. The larger dots represent larger estimated fan bases.

The plot visually shows the strong correlation between stadium capacity on the x-axis and AP poll success on the y-axis. The size of each point on the scatter plot represents the estimated size of the fan base. The plot also suggests that we might have success approximating the results of Tony Altimore’s fan base size estimation analysis with a regression model. I’ll discuss this more in part 2 of my conference realignment blogs.


Thanks for reading! Comment your thoughts below.

Interested in my content? Please consider following me on Medium.

Show some love on Twitter: @malloy_giovanni

