
College Football Conference Realignment – Clustering

Using K means clustering to create data-driven college football conferences

Welcome to part 3 of this series on conference realignment! This is the blog post where we will start using the dataset to inform realignment decisions. There is a common complaint that conference realignment destroys traditional rivalries and the regional aspect of college football. It’s true that college sports tend to be regional. It’s even in the name of the conferences themselves: Pacific 12, Atlantic Coast, Southeastern, and Big East Conferences to name a few. Some get even more specific when we include the FCS: Ohio Valley Conference. Of course, the days of regional conferences in the FBS are long gone. In the last few days, it looks like the Pac 12 may be a relic of the past, as well.

This series is organized into four parts (and the full motivation for it is found in part 1):

  1. College Football Conference Realignment – Exploratory Data Analysis in Python
  2. College Football Conference Realignment – Regression
  3. College Football Conference Realignment – Clustering
  4. College Football Conference Realignment – node2vec
Photo by Gene Gallin on Unsplash

Hopefully, each part of the series provides you with a fresh perspective on the future of the beloved game of College Football. For those of you who did not read part 1 or 2, here is a quick synopsis: I created my own data set compiled from sources across the web. These data include basic information about each FBS program, a non-canonical approximation of all college football rivalries, stadium size, historical performance, frequency of appearances in AP top 25 polls, whether the school is an AAU or R1 institution (historically important for membership in the Big Ten and Pac 12), the number of NFL draft picks, data on program revenue from 2017–2019, and a recent estimate of the size of college football fan bases. In part 1, we found that several features correlated strongly with fan base size, so in part 2, we developed linear regression and random forest regression models to predict fan base size.

Clustering

My motivation for this post is the following: today’s conferences are founded on a traditional core. You can think about them like a new computer hard disk drive, cleanly organized in a contiguous manner into regional conferences. However, over the years, just as we create, manipulate, and delete files on our disk drive, the college football world has seen new conferences formed (most recently the AAC), old conferences implode (Big East football), and once-midwestern conferences grow to span from New York to LA. The most interesting case study of this is the Western Athletic Conference. Below is a membership graphic from Wikipedia; the WAC has pulled itself up by its own bootstraps many times. I’ll reserve a deep dive post for another day.

Wikipedia has these mesmerizing membership graphics for each conference, and the WAC is one of the craziest.

For now, I want to know what happens if we de-frag the college football conference disk drive. What if we came up with conferences from scratch? If we strip away the tradition and the grant of rights, how will schools group together in 2022? A natural way to answer this question is an unsupervised machine learning approach called clustering. Unsupervised machine learning means that we will be feeding our model unlabeled data in the hopes of eliciting hidden patterns. Specifically, the goal of clustering is to group together similar observations. These groups may be obvious or non-obvious even after a thorough exploratory data analysis.

K-Means Clustering

One of the most widely implemented algorithms for clustering is called k-means clustering. The idea behind k-means is described well here. Essentially, the user defines the desired number of clusters, k. The algorithm randomly initializes k centroids and finds the centroid closest to each item using the Euclidean distance. The item is then considered a member of that centroid’s cluster. Then, each centroid is shifted to the average location of the items assigned to it, and the assignment and update steps repeat. In the scikit-learn package, the user sets a maximum number of iterations for this process (the max_iter parameter), and the algorithm stops earlier once the centroids have effectively stopped moving.
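To make the assign-then-update loop concrete, here is a minimal NumPy sketch of the idea. It is an illustration only, not scikit-learn's implementation: the function name kmeans_sketch, its parameters, and the toy points are all made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    """Toy k-means: returns (centroids, labels) for an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Euclidean distance from every observation to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assignment step: each observation joins its nearest centroid
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's centroid where it is
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated toy groups (hypothetical data, not the football dataset)
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
              [10.0, 10.0], [11.0, 11.0], [12.0, 12.0]])
centroids, labels = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random start, so only the grouping is meaningful, not the label numbers themselves.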

Feature Engineering

Now that you have an idea of how we will approach the problem, let’s get coding. First, we will import our dataset and drop the unneeded columns. Remember that we want to hide labels like current and former conferences from the model so that we can start anew.

import numpy as np
import pandas as pd
cfb_info_df = pd.read_csv(r'.FBS_Football_Team_Info.csv', encoding = 'unicode_escape')
clustering_data_df = cfb_info_df.drop(['Team','Nickname', 'City', 'Current_conference', 'Former_conferences', 'First_played', 'Joined_FBS'], axis = 1)

Who doesn’t love an in-state rivalry? From the Iron Bowl to Bedlam to the Old Oaken Bucket, these are the games that mean serious bragging rights at the neighborhood block party for an entire calendar year. So, let’s keep that feature but use pandas to convert it to one-hot encoding like we did in part 2 of this series. (Our data also includes one-hot encoding for existing rivalries.)

clustering_data_df = pd.get_dummies(clustering_data_df,prefix = 'is_state', columns = ['State'])
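As a quick illustration of what this call produces, here is the same get_dummies pattern on a three-row toy frame. The toy_df frame is hypothetical; only the prefix and column name match the article's code.

```python
import pandas as pd

# Hypothetical three-team frame standing in for the real dataset
toy_df = pd.DataFrame({
    'Team': ['Alabama', 'Auburn', 'Oklahoma'],
    'State': ['Alabama', 'Alabama', 'Oklahoma'],
})

# Replace the State column with one indicator column per state
encoded = pd.get_dummies(toy_df, prefix='is_state', columns=['State'])
print(encoded.columns.tolist())
# ['Team', 'is_state_Alabama', 'is_state_Oklahoma']
```

Two teams from the same state share the same indicator column, so they sit closer together in Euclidean distance than two teams from different states, all else being equal.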

For this analysis, we don’t need to worry about a train-test split. Therefore, we can apply our min-max scaling all at once to all our numeric features.

from sklearn.preprocessing import MinMaxScaler

# Numeric features to rescale to the [0, 1] range
numeric_columns = [
    'Latitude', 'Longitude', 'Enrollment', 'years_playing', 'years_playing_FBS',
    'Stadium_capacity', 'total_draft_picks_2000_to_2020',
    'first_rd_draft_picks_2000_to_2020', 'number_1_draft_picks_2000_to_2020',
    'wsj_college_football_revenue_2019', 'wsj_college_football_value_2018',
    'wsj_college_football_value_2017', 'tj_altimore_fan_base_size_millions',
    'bowl_games_played', 'bowl_game_win_pct', 'historical_win_pct',
    'total_games_played',
]

# MinMaxScaler scales each column independently, so one call covers them all
scaler = MinMaxScaler()
clustering_data_df[numeric_columns] = scaler.fit_transform(clustering_data_df[numeric_columns])
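For readers who want to see what the scaler is doing, the sketch below applies the min-max formula by hand with pandas. With its default settings, MinMaxScaler maps each column to the 0 to 1 range using (x - min) / (max - min). The toy values are invented for illustration.

```python
import pandas as pd

# Hypothetical values on very different scales
toy = pd.DataFrame({
    'Stadium_capacity': [20000, 60000, 100000],
    'historical_win_pct': [0.40, 0.55, 0.70],
})

# Column-wise min-max scaling: (x - min) / (max - min)
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

Without this step, a feature measured in tens of thousands (stadium seats) would dominate the Euclidean distance over a feature measured in fractions (win percentage).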

Clustering Implementation – Power 5 versus Group of 5

Again, we will lean on scikit-learn for an easy-to-implement k-means clustering function. The most basic split we can do in college football is the Power 5 v. Group of 5 teams, so let’s start there. This means we want to divide the data into 2 clusters. The model won’t know that these two clusters are Power 5 and Group of 5 teams. It will only work to find a way to divide the 133 FBS teams into two groups. Given the differences in revenue and fan base size between Power 5 and Group of 5 teams, my initial hypothesis is that this should be pretty easy.

from sklearn.cluster import KMeans
# Implement K Means Clustering
kmeans_conf_p5_v_g5 = KMeans(n_clusters=2, random_state=0).fit(clustering_data_df)

I manually compared the output of the model to the true breakdown of FBS teams. The output shows that we correctly identified 59 of 69 Power 5 teams and 63 of 64 Group of 5 teams.

import plotly.express as px

# Manually compare to P5 v G5 conferences 2025
# list_p5 and list_g5 are not defined in this snippet; the counts below treat them as
# the lists of teams that landed in the Power 5-like and Group of 5-like clusters
num_tp = len(list_p5) - 1
num_fp = 1 # Tulane
num_tn = len(list_g5) - 10
num_fn = 10 # Baylor, BYU, Cincinnati, Houston, Oregon State, TCU, Texas Tech, UCF, Wake Forest, Washington State
# Rows are the clustering result, columns are the true group
fig = px.imshow([[num_tn, num_fn],
                 [num_fp, num_tp]], text_auto=True,
                labels=dict(x="True P5 v G5", y="Clustering P5 v G5"),
                x=['Group of 5 Team', 'Power 5 Team'],
                y=['Group of 5 Team', 'Power 5 Team'])
fig.show()
The 2-means clustering discriminates Power 5 and Group of 5 teams well.
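The comparison above was done by hand. If a column with each team's true tier were available, the same table could be tabulated with pandas, as in this sketch. The true_tier and cluster columns and the six toy teams are hypothetical and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical toy data: true tier of six teams and the cluster k-means assigned
toy = pd.DataFrame({
    'Team': ['A', 'B', 'C', 'D', 'E', 'F'],
    'true_tier': ['Power 5', 'Power 5', 'Power 5',
                  'Group of 5', 'Group of 5', 'Group of 5'],
    'cluster': [1, 1, 0, 0, 0, 1],
})

# k-means labels are arbitrary integers, so name each cluster after the tier
# that most of its teams belong to
cluster_to_tier = toy.groupby('cluster')['true_tier'].agg(lambda s: s.mode().iloc[0])
toy['predicted_tier'] = toy['cluster'].map(cluster_to_tier)

# Rows: clustering result, columns: true tier
confusion = pd.crosstab(toy['predicted_tier'], toy['true_tier'])
print(confusion)
```

The resulting 2x2 table has the same row and column layout as the heatmap: rows are the clustering result and columns are the true group.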

The 10 mislabeled Power 5 teams include seven Big 12 teams (Baylor, BYU, Cincinnati, Houston, TCU, Texas Tech, and UCF), two Pac 12 teams (Oregon State and Washington State) and one ACC team (Wake Forest). Given that the data is based on revenue pre-dating the inaugural Big 12 season of BYU, Cincinnati, Houston, and UCF, it is no surprise that they are clustered with their current Group of 5 brethren. Baylor, TCU, and Texas Tech are in a state crowded with competition for football allegiance, and casual fans in the Lone Star State often gravitate to Texas A&M or Texas. For those who have been following the delicate future of the Pac 12, many have predicted that Oregon State and Washington State could be left out of any realignment raids of the Pac 12 by the Big Ten and/or Big 12.

The one team that the model gives Power 5 credibility but that has been left out of the real Power 5 is Tulane. This season is the first time Tulane has been ranked this millennium, so it is no surprise they haven’t been picked up for a spot in a Power 5 conference. For more than 30 years, however, they were a member of the SEC.

Creating 10 new conferences

Now, it’s time to put it all together. Let’s assume that college football maintains 10 conferences for now. This means we will implement 10-means clustering to suggest the conferences resulting from data-driven realignment. The hope is that by including rivalry and geographic information in addition to revenue and fan base size, we can consider both money and tradition when forming our new conferences.

kmeans_10_conf = KMeans(n_clusters=10, random_state=0).fit(clustering_data_df)

labels_10_conf = kmeans_10_conf.labels_

The labels will tell us which team is in which conference. As you look through the clusters, you will notice that some clusters are very large (20+) while others have as few as four teams. According to NCAA bylaws, however, conferences must have at least eight teams. While this entire blog post is an exercise in challenging existing conference structure, we still want to constrain our results to this rule.
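One way to check cluster sizes against a minimum is to count how often each label appears. The sketch below uses a short made-up label array and a toy minimum of 3; with the real labels_10_conf array the threshold would be 8.

```python
import numpy as np

# Toy stand-in for labels_10_conf (three clusters instead of ten)
labels = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2, 2])
min_size = 3  # toy threshold; the rule discussed above would use 8

# sizes[j] is the number of observations assigned to cluster j
sizes = np.bincount(labels)
too_small = np.flatnonzero(sizes < min_size)
print(sizes)      # [3 2 5]
print(too_small)  # [1]
```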

Luckily, researchers at Microsoft and RPI developed a constrained k-means algorithm by modeling it as a min cost flow optimization problem. There is a package to implement this constrained k-means algorithm in Python. It is just as simple to implement as the scikit-learn package. However, I did have to install ortools 9.3.10497 via pip.

In addition to setting our minimum conference size to eight, we will set our maximum conference size to twenty. This is a stochastic model, so if you change random_state, you’ll get a new set of conferences.

import ortools
import ortools.graph.pywrapgraph
from k_means_constrained import KMeansConstrained #needs pip install --user ortools==9.3.10497
clf = KMeansConstrained(n_clusters=10,size_min=8,size_max=20,random_state=0)
kmeans_Constrained_10_conf = clf.fit_predict(clustering_data_df)
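fit_predict returns one integer label per row, in the same order as the input data. To read off the membership of each conference, the labels can be paired with the team names and grouped. This is a sketch on invented data; the label column name and the toy teams are assumptions.

```python
import pandas as pd

# Hypothetical teams and the cluster label each one received
toy = pd.DataFrame({
    'Team': ['Team A', 'Team B', 'Team C', 'Team D', 'Team E'],
    'label': [2, 0, 2, 1, 0],
})

# Collect the team names in each cluster into a list
members = toy.groupby('label')['Team'].apply(list).to_dict()
print(members)
# {0: ['Team B', 'Team E'], 1: ['Team D'], 2: ['Team A', 'Team C']}
```

With the real data, the label column would be filled with the kmeans_Constrained_10_conf array and the Team column would come from cfb_info_df.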

Data-driven FBS Conferences

Now is the moment for the big reveal. I took the liberty of naming each cluster with its new conference name.

  • Southwest: Arkansas, Baylor, LSU, Rice, TCU, Texas, Texas A&M, Texas Tech. With the lone exception of LSU, these teams all used to play each other until the 1990s in what was called the Southwest Conference. Now, it’s back!
  • Sun USA: Akron, Appalachian State, Arkansas State, Bowling Green, Charlotte, Coastal Carolina, Georgia Southern, Jacksonville State, James Madison, Kent State, Liberty, Louisiana-Monroe, Middle Tennessee, North Texas, Old Dominion, Sam Houston State, South Alabama, Texas State, Troy, Western Kentucky. It’s as if the Sun Belt expanded to be the size of a super conference and convinced a few Ohio teams to join.
  • Big 8: Iowa, Michigan, Minnesota, Nebraska, Ohio State, Oklahoma, Penn State, Wisconsin. This is the pulsing heart of Big Ten football with the addition of a powerhouse from another farm-filled football-loving Midwest state: Oklahoma.
  • National Athletic: Arizona State, Boston College, Cincinnati, Kansas State, Kentucky, Louisiana, Louisville, Mississippi State, NC State, New Mexico, Ohio, Oklahoma State, Oregon State, South Carolina, Southern Miss, Syracuse, Temple, UTEP, Washington State, West Virginia. This conference has a presence in all four time zones which is currently the envy of conferences like the Big 12. There is a little bit of everyone and everything in this conference with members from nine of the current ten conferences (no Big Ten schools).
  • SEC: Alabama, Auburn, Clemson, Florida, Georgia, Georgia Tech, Ole Miss, Tennessee. With the exception of Clemson, these schools are founding members of today’s SEC, so they get to keep the name.
  • Basketball Brainiacs: Arizona, Buffalo, Colorado, Duke, Illinois, Indiana, Iowa State, Kansas, Michigan State, Missouri, North Carolina, Northwestern, Pittsburgh, Purdue, Rutgers, Tulane, Utah, Vanderbilt, Virginia, Washington. I call this the Basketball Schools conference because it includes traditional powerhouses Arizona, Duke, Kansas, and UNC along with Big Ten basketball schools Purdue and Indiana. The real feature that distinguishes this conference from the National Athletic Conference is that all of these schools are AAU members hence the secondary title of Brainiac.
  • MAC+: Army, Ball State, BYU, Central Michigan, East Carolina, Eastern Michigan, Louisiana Tech, Marshall, Memphis, Miami (OH), Navy, New Mexico State, Northern Illinois, SMU, Toledo, Tulsa, UMass, Utah State, Wake Forest, Western Michigan. It seems like everybody is adding pluses to their name, so why not to the expanded conference with a core of MAC teams? This is the last of the three conferences with huge geographic diversity. I think it is safe to say that this is the lower revenue, smaller fan base equivalent to the National Athletic Conference.
  • Fun Belt: FIU, Florida Atlantic, Florida State, Georgia State, Houston, Miami (FL), Nevada, South Florida, UAB, UCF, UConn, UNLV, UTSA. This conference is linked by its relative proximity to the equator. I suppose UConn joins in with its fan base of snowbirds. There has been some talk about the SEC or the Big Ten courting Florida State and Miami. In this case, our model tends to leave them out.
  • Mountain West: Air Force, Boise State, Colorado State, Fresno State, Hawaii, San Diego State, San Jose State, Wyoming. This conference consists of eight current Mountain West teams. This is a great example of when the rivalry, geography, and revenue data all align. The Mountain West consistently puts a solid football product on the field. Since its formation in 1999, nine different teams have won a conference championship, which speaks to the long-term parity on the gridiron.
  • Paclantic 8: California, Maryland, Notre Dame, Oregon, Stanford, UCLA, USC, Virginia Tech. These teams represent the biggest names on the west coast combined with a common rival Notre Dame. The schools in the Paclantic 8 are all AAU members except Notre Dame.

Principal Component Analysis

We have our new conferences and their names, but how different are the conferences? One way to answer this is to compare the conference-level distributions of different features. We did a lot of this exploratory data analysis in part 1 of this blog series, so let’s instead reduce the dimensions of all these features using a principal component analysis (PCA). The PCA will reduce the dimensionality of the data, which is helpful for visualization.

First, we will add the new conference names to the data set.

# Initialize new column to define our newly assigned conference;
# teams not matched below keep the default 'Southwest'
cfb_info_df['k_means_conf'] = 'Southwest'

# Map each conference name to its list of team names
# (cluster_1 ... cluster_9 are the team lists for the remaining clusters)
conference_clusters = {
    'Sun USA': cluster_1,
    'Big 8': cluster_2,
    'National Athletic': cluster_3,
    'SEC': cluster_4,
    'Basketball Brainiacs': cluster_5,
    'MAC+': cluster_6,
    'Fun Belt': cluster_7,
    'Mountain West': cluster_8,
    'Paclantic 8': cluster_9,
}

# Assign with .loc and a boolean mask instead of chained indexing,
# which pandas may apply to a copy rather than the original frame
for conf_name, teams in conference_clusters.items():
    cfb_info_df.loc[cfb_info_df['Team'].isin(teams), 'k_means_conf'] = conf_name

Now, we can use scikit-learn to calculate the first two principal components. This will reduce our dimensionality such that we can produce a scatter plot and visualize the similarity of teams. We add the results to a data frame with our new conference names so that we can go ahead and plot them.

from sklearn.decomposition import PCA

# Set the n_components=2
principal = PCA(n_components=2)
principal.fit(clustering_data_df)
pca_clustering_data = principal.transform(clustering_data_df)

# Create data frame for plot
pca_clustering_data_df = pd.DataFrame(pca_clustering_data, columns = ['PCA_1', 'PCA_2'])
pca_clustering_data_df['k_means_conference'] = cfb_info_df['k_means_conf']
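As a rough picture of what the projection involves, the NumPy sketch below centers a toy matrix, takes its singular value decomposition, and projects onto the first two directions. scikit-learn's PCA also centers the data and relies on an SVD, though the signs of its components may differ from this sketch. The random toy matrix is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 50 observations, 3 features, where the third nearly duplicates the first
base = rng.normal(size=(50, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + 0.01 * rng.normal(size=50)])

# Center each feature, then decompose
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first two principal directions
scores = X_centered @ Vt[:2].T

# Share of total variance carried by each component
explained_ratio = S ** 2 / np.sum(S ** 2)
print(scores.shape)
print(explained_ratio)
```

On the fitted scikit-learn object, the attribute explained_variance_ratio_ reports the same kind of shares, which is worth checking before reading too much into a two-dimensional plot.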

Using plotly, we will make a scatter plot of all the teams.

import plotly.express as px

fig = px.scatter(pca_clustering_data_df, x="PCA_1", y="PCA_2", color="k_means_conference",
                labels=dict(PCA_1="PCA Dimension 1", 
                            PCA_2="PCA Dimension 2", 
                            k_means_conference="10-means Conference"))
fig.show()

The result is the following:

The Principal Component Analysis reduces the dimensionality of the clustering data to two. That way we can visually inspect the differences between conferences.

Upon visual inspection, it is easy to see that the Sun USA, Fun Belt, MAC+, and Mountain West conferences sit close together and are closely related to each other. The higher revenue conferences like the Big 8 and the SEC are much more spread apart.

We have a sense of the geography from the conference team names, but we can plot the conferences on a map to see how well we preserved the regional history of the sport:

import plotly.graph_objects as go

fig = go.Figure(data=go.Scattergeo(
    lon = cfb_info_df['Longitude'],
    lat = cfb_info_df['Latitude'],
    text = cfb_info_df['Team'],
    mode = 'markers',
    marker = dict(color = cfb_info_df['map_color'])))

fig.update_layout(title = 'Conference Membership',
        geo_scope='usa')
fig.show()
The map of conference membership shows many geographically spread conferences.
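The map code reads a map_color column that is not created in the snippets shown. One way such a column could be built is by mapping each conference name to a color, as in this sketch. The palette and the three-row frame are hypothetical; any color names plotly accepts would work.

```python
import pandas as pd

# Hypothetical palette: one CSS color name per conference
conference_colors = {'Southwest': 'orange', 'SEC': 'navy', 'Big 8': 'red'}

# Toy frame with the conference assignments listed earlier for these teams
toy = pd.DataFrame({
    'Team': ['Texas', 'Alabama', 'Michigan'],
    'k_means_conf': ['Southwest', 'SEC', 'Big 8'],
})

# Look up the color for each team's conference
toy['map_color'] = toy['k_means_conf'].map(conference_colors)
print(toy)
```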

The result is a mix. Some conferences are truly regional (Southwest, SEC, Big 8, and Mountain West), while others embrace a more national distribution. Perhaps this is destiny for college football.


Thanks for reading! As always, comment your thoughts below. I know this is a thought experiment, so let me know your reactions. Where did your team end up? Would you prefer these allegiances to today’s conferences?

Interested in my content? Please consider following me on Medium.

Follow me on Twitter: @malloy_giovanni

