The other day I published an article about using Monte Carlo simulations to converge on lottery probability distributions. Notably, it has gained traction much more quickly than a number of my previous articles. This could mean either that (A) readers are very interested in Monte Carlo methods or (B) readers are very interested in sports analytics. I’m betting that sports analytics is the dominant factor, and this post will function as an experiment! (Very meta of me, I know.)
Overview
This will be an "end-to-end" project: accessing the data via SQLite, cleaning/munging it via Pandas, modeling the network via NetworkX, and finally comparing our results to the official "seeds".
But I’m getting ahead of myself. Our data comes from the NCAA (National Collegiate Athletic Association) 2019 men’s basketball regular (and post) seasons, limited to the ACC (Atlantic Coast Conference). Our goal is to mine the data for patterns in wins and losses, and to rank the teams in terms of their likelihood of winning the conference championship.
Lastly, let’s discuss the "trillion dollar algorithm." Google launched 22 years ago in Menlo Park, CA, bringing order and relevance to a sprawling web of information. Its algorithm, PageRank, is an adaptation of an established network metric: eigenvector centrality. The algorithm expects a transition matrix as an argument; this matrix quantifies the links between web pages. When one page links to another, it shares some portion of its credibility with that page, and the amount shared depends largely on how many outbound links the linking page has. The algorithm is applied iteratively until it converges on an eigenvector. This eigenvector is the PageRank; in other words, it contains the rank of each page relative to its peers. PageRank has been applied to networks of scholarly journals, where one very popular article may cite a single grandfather article, or seminal piece. The highly popular article shares the bulk of its credibility with the seminal piece. This allows us to conclude that, though many articles did not cite the seminal piece directly, they were influenced by it indirectly via the popular article. This is the true beauty and power of PageRank.
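To make the iteration concrete, here is a minimal sketch of PageRank via power iteration on a toy three-page web. The matrix, damping factor, and convergence tolerance are illustrative choices, not anything from the article's dataset:

```python
import numpy as np

# Column-stochastic transition matrix for a tiny 3-page web:
# page 0 links to pages 1 and 2; page 1 links to page 2; page 2 links to page 0.
# Each column splits that page's credibility evenly among its outbound links.
M = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

damping = 0.85               # standard PageRank damping factor
n = M.shape[0]
rank = np.full(n, 1.0 / n)   # start from a uniform distribution

# Power iteration: repeatedly apply the damped transition matrix until the
# rank vector stops changing, i.e. we converge on the principal eigenvector.
for _ in range(100):
    new_rank = (1 - damping) / n + damping * (M @ rank)
    if np.allclose(new_rank, rank, atol=1e-12):
        break
    rank = new_rank

print(rank)  # page 2, which receives links from both 0 and 1, ranks highest
```

The resulting vector sums to 1 and ranks each page relative to its peers; NetworkX's `pagerank` does this same iteration for us later on.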
We need to preprocess our data in such a way that PageRank can receive it as an argument. Consider two teams, one that was virtually undefeated all season, and one that lost virtually every game all season – but – the "losing team" happens to beat the "winning team." In sports, this is usually referred to as an upset, and PageRank will be able to extract the significance and global context of this event. We simply need an effective way to encode patterns of wins and losses into a matrix. There is no absolute right or wrong answer, but some approaches are better than others; I encourage you to brainstorm ways this method might be improved, implement it yourself, and compare your results to the official NCAA ACC seeds.
Method
In my approach, I iterated over every regular season game and found the point differential. For example, if Team A beat Team B by 20 points, then Team B gives 20 points of credibility to Team A. Graphs can be directed or undirected. In our case, we want to direct the edges (relationships) between nodes (teams) such that there is a unidirectional relationship: Team B gives 20 points to Team A, not vice versa. Networks can have multiple edges between two given nodes (multigraph) or a single edge between two given nodes (simple graph). I’ve simplified the network such that if two teams played each other twice and the same team won both times, the edge connecting them is weighted by the average point differential.
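The steps above can be sketched with NetworkX. The game results below are made up for illustration (the real ones come from the SQLite database), but the graph construction mirrors the method: loser-to-winner directed edges, repeat matchups averaged into a single weighted edge, then PageRank over the result:

```python
from collections import defaultdict

import networkx as nx

# Hypothetical regular-season results: (winner, loser, point differential).
games = [
    ("Duke", "Wake Forest", 20),
    ("Duke", "Wake Forest", 10),   # rematch, same winner: edges get averaged
    ("Virginia", "Duke", 5),
    ("Wake Forest", "Clemson", 3),
]

# Group differentials by (loser, winner) pair: the loser "gives" the
# point differential in credibility to the winner (unidirectional edge).
diffs = defaultdict(list)
for winner, loser, diff in games:
    diffs[(loser, winner)].append(diff)

# Simple directed graph: collapse repeat matchups into one edge
# weighted by the average differential.
G = nx.DiGraph()
for (loser, winner), ds in diffs.items():
    G.add_edge(loser, winner, weight=sum(ds) / len(ds))

# Weighted PageRank ranks each team relative to its peers.
ranks = nx.pagerank(G, weight="weight")
for team, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {score:.3f}")
```

In this toy example, credibility flows up the chain Clemson → Wake Forest → Duke → Virginia, so Virginia ends up on top despite playing only one game; the global context of who you beat matters, not just how often you win.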
I’ll leave it to you to determine whether this method can be improved upon. For example, if Team A beats Team B 90–80, the delta is 10 points. However, if Team A beats Team B 20–10, the delta is still 10 points. Should we do something about this? I’ll leave it to the reader to decide, implement, and evaluate 🙂
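As one possible starting point for that exercise (not the method used in this article), you could normalize the margin by the total points scored, so a 20–10 rout carries more weight than a 90–80 squeaker:

```python
def margin_weight(winner_pts: int, loser_pts: int) -> float:
    """Relative margin: point differential as a share of total points scored."""
    return (winner_pts - loser_pts) / (winner_pts + loser_pts)

print(margin_weight(90, 80))  # ~0.059: a close, high-scoring game
print(margin_weight(20, 10))  # ~0.333: a dominant, low-scoring win
```

Swapping this in for the raw differential as the edge weight is a one-line change, which makes it easy to compare both rankings against the official seeds.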
Code
The repository can be found here, database and all.
But for simplicity, I’ve also made a GitHub gist:
And my results:

And how did we compare to the official seeds?

Overall, we did alright! Our results aren’t identical to the NCAA’s official seeds; this is, of course, because they used a different algorithm entirely. Of note, we seeded Duke second, whereas the NCAA seeded them third – and ultimately Duke won the conference title (hurray). However, we seeded Florida State much lower, and they made it to the final round.

I wonder what we could do to improve our results! Remember discussing that 90–80 and 20–10 both reflected a delta of 10? Perhaps the answer lies there. I encourage you to explore this possibility and share your results.
Thanks for reading! Please subscribe if you think my content is alright 🙂