
Rating Sports Teams – Maximizing A Generic System

How much information can we get from just a score line?

Photo by Markus Spiske on Unsplash

Introduction

My first Medium article was on basic ways to predict sports outcomes. It uses a simple win-loss record as a baseline. For example, if the Chicago Cubs have won 58% of their games this year, you’d expect them to win 58% of their games going forward. That article showed that a simple system invented by a physics professor named Arpad Elo in the 1940s could outperform using win-loss record alone.

There’s some good news: it’s no longer the 1940s, and we have access to troves of data and computing power. Additionally, many people (notably 538 and Harvard professor Mark Glickman) have made many contributions toward improving generic ratings systems. Since I’m interested in models, sports, and the like, I’ve decided to compile and compare some of these ratings systems.

Why?

I’m interested in creating sports models, and I think the most basic input in any model is an estimate of overall skill. If you’re just using wins and losses as a proxy for overall skill, you’re handicapping yourself because you’re relying on your model to infer how important a season’s win-loss record is as more and more games are played. This leads to using multiple inputs (like record over the past 5 games, record over the past 10 games, etc.) to try to estimate skill. I think one of the best first steps when creating a model is to maximize (within reason) a single estimate of overall skill. Worst case, it serves as a good baseline.

Improvements

  1. Data – In the last article, I used Python’s random library to generate fake players and fake scores. No longer! In this story, I’m using real college basketball data going back to 2003 and real PGA and European Tour (golf) data going back to 2000. That’s about 87,500 college basketball games and 1,700 golf tournaments.
  2. Rating Systems – In the last article, I used the most basic form of Elo. No longer! I’ve added four features to Elo to improve it. And remember the professor Glickman I mentioned earlier? He has his own system, and we’ll try that one too. Both of these systems can rate individual players in golf, or teams in basketball.
  3. Baseline – If we’re improving everything else, why not improve our baseline too? Previously, I just used one team’s win-loss percentage to predict the game outcome. While it always feels good to compare yourself to inferior opponents, that’s not a very scientific way to do things. Instead, this time I’ll use the Log5 method, which incorporates both teams’ winning percentages. I modified it slightly so that if both teams are undefeated or both teams have zero wins, it’s guaranteed to return a 50% win chance.
  4. Error Tracking – Nothing from the previous article is satisfactory! We’ll use cross-entropy loss (also known as log-loss) instead of Brier scores. Why? My best answer is that it makes it easier to tell if your ratings system is improving or if you’re just guessing. You’re penalized more for being overconfident, and so it forces your system to be slightly more conservative. It’s also a much more popular error function, though I admit popularity alone isn’t a concrete reason to use something. (A quick example follows this list.)
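To make the metric concrete, here’s a minimal log-loss function for a single game. The names are mine, not lifted from the article’s actual code:

```python
import math

def log_loss(predicted_win_prob, home_team_won, eps=1e-15):
    """Cross-entropy (log-loss) for a single game prediction.

    predicted_win_prob: the model's probability that the home team wins.
    home_team_won: 1 if the home team actually won, else 0.
    """
    # Clamp the prediction so a confident-but-wrong 0 or 1 doesn't blow up to infinity.
    p = min(max(predicted_win_prob, eps), 1 - eps)
    return -(home_team_won * math.log(p) + (1 - home_team_won) * math.log(1 - p))

# A 90% prediction that comes true costs little...
print(round(log_loss(0.90, 1), 3))  # ~0.105
# ...but the same prediction gone wrong is punished heavily.
print(round(log_loss(0.90, 0), 3))  # ~2.303
```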

Systems

Question: Which system is best?

1) Log 5

Log 5 is a straightforward paired-comparison formula that combines both teams’ wins and losses into a single win probability.
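The formula fits in a few lines. Here’s a sketch (my own naming), including the guard mentioned earlier so two unbeaten or two winless teams get an even 50% chance; my exact implementation may differ slightly:

```python
def log5(p_a, p_b):
    """Log5 probability that team A beats team B, given each team's win percentage."""
    denom = p_a + p_b - 2 * p_a * p_b
    if denom == 0:  # both teams at 100% or both at 0% -> avoid dividing by zero
        return 0.5
    return (p_a - p_a * p_b) / denom

# A .750 team should beat a .500 team 75% of the time.
print(round(log5(0.75, 0.50), 3))  # 0.75
```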

2) Elo Ratings

Again, very straightforward. If you don’t know what Elo is, I recommend my first article or one of the 538 explainers.
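For reference, the heart of basic Elo is just an expected score and an update. This is a generic sketch rather than the exact code behind my results:

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, a_won, k=20):
    """Return updated (rating_a, rating_b) after one game. a_won is 1 or 0."""
    exp_a = expected_score(rating_a, rating_b)
    rating_a += k * (a_won - exp_a)
    rating_b += k * ((1 - a_won) - (1 - exp_a))
    return rating_a, rating_b

# A 1600 favorite beats a 1500 underdog and gains a modest ~7 points.
print(elo_update(1600, 1500, a_won=1))
```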

3) Improved Elo

Finally, a newcomer! I improved upon Elo in four ways:

  1. Margin of Victory: Margin of victory is simple: bigger winning margins mean a bigger K reward (your rating increases faster). This especially helps in situations like Alabama football. If they win by 20+ points every game, the improved Elo system can adjust its ratings faster to quickly rate Alabama near the top of all teams. Yes, the implementation is slightly complicated, and I’ll explain further in a second. On the flip side, this also works against teams that lose by 20+ points a game.
  2. Decaying K value: It makes sense to account for early-season uncertainty by starting off with a high K value. This allows the ratings to adjust faster early in the season. As the season progresses, and as the ratings have more data to be certain, the K value decays. If a season-long dominant team goes on a three-game losing streak late in the season, the ratings won’t overreact because the K value is lower. In my implementation I use exponential decay, but there might be better functions out there.
  3. Priors: This is just a fancy word for preseason rankings. Often-maligned preseason rankings contain information, albeit sometimes bad information. The model would like to know if everyone thinks Kentucky basketball and Duke basketball will probably be in the top 10 college basketball teams all season. It gives them a head start toward converging on their true rankings. With this head start, there’s less need to jumpstart the ratings with a super high K value. Obviously, preseason rankings are more useful for short-season sports like the NFL than for long seasons like MLB. The longer the season goes, the less important preseason rankings are. In the college basketball season, I found it was best to keep about 70% of a team’s rating year over year.
  4. Autocorrelation Prevention: Remember how I said implementing margin of victory was complicated? One advantage of standard Elo is that it’s balanced and doesn’t run away over time. Boosting awarded K-values based on margin of victory disrupts that balance. An autocorrelation prevention term (ACP term) keeps our improved system balanced. (A code sketch of these tweaks follows this list.)
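Here’s a rough sketch of how those tweaks can slot into the update step. The margin-of-victory multiplier uses the log form popularized by FiveThirtyEight’s NFL Elo; its constants, along with the decay rate, are illustrative assumptions rather than the values I actually tuned:

```python
import math

def mov_multiplier(point_diff, winner_elo_diff):
    """Margin-of-victory boost with a built-in autocorrelation-prevention term.

    The 2.2 / (winner_elo_diff * 0.001 + 2.2) factor shrinks the boost when the
    winner was already a heavy favorite, which keeps top ratings from running away.
    (Constants are assumptions borrowed from FiveThirtyEight's NFL Elo.)
    """
    return math.log(abs(point_diff) + 1) * (2.2 / (winner_elo_diff * 0.001 + 2.2))

def decayed_k(base_k, games_played, decay_rate=0.02):
    """Exponentially decaying K: fast adjustment early, stability late in the season."""
    return base_k * math.exp(-decay_rate * games_played)

def preseason_rating(last_season_rating, mean_rating=1500, carryover=0.7):
    """Prior: carry ~70% of last season's rating, regress the rest toward the mean."""
    return carryover * last_season_rating + (1 - carryover) * mean_rating
```

In the update step, the margin-of-victory multiplier simply scales the K value before the usual Elo adjustment is applied.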

4) Glicko

What if I told you… there’s a whole different ratings method to try? This one was introduced in 1995 by professor Glickman, whom I mentioned earlier. It’s worth reading the paper itself, but I’ll do my best to summarize. Glicko introduces two main components. First, it introduces uncertainty, which is extremely useful. In the Elo system, a player’s overall rating is captured by a single number. Take a player (or team) rated 1750. The issue is that you can’t be 100% certain that the player’s exact rating, on a random day, in random conditions, two weeks after their last match, is exactly 1750. Instead, you can be pretty sure that the player’s "true" rating is between 1650 and 1850. In the Glicko system, the ratings deviation would be 50 in this case; it represents a 95% confidence interval that the player’s true ability is within 100 points (two ratings deviations) of 1750.

Uncertainty can also be modified. If a player hasn’t played in 3 months, we can increase uncertainty. If a player has very erratic match results, Glicko keeps uncertainty high.
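One way to picture the inactivity adjustment: in the original Glicko formulation, a player’s ratings deviation drifts back up over idle ratings periods, capped at the value a brand-new player starts with (Glicko-2 replaces the fixed constant with a per-player volatility, but the intuition is the same). A sketch, using roughly the constant from the worked example in the Glicko paper:

```python
import math

def inflate_rd(rd, periods_inactive, c=34.6, max_rd=350):
    """Grow a player's ratings deviation for each ratings period they sit out.

    With c ~ 34.6, an RD of 50 climbs back to the 350 ceiling after ~100 idle periods.
    """
    return min(math.sqrt(rd ** 2 + (c ** 2) * periods_inactive), max_rd)

# After 10 idle periods, a once-confident RD of 50 balloons to roughly 120.
print(round(inflate_rd(50, periods_inactive=10)))
```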

The other introduction is the concept of a "ratings period". As a mild spoiler, I’ve empirically found that the ratings period works better in some sports than others. Glicko treats all matches in a ratings period as having occurred simultaneously. This is useful in golf, where you’re competing against over 100 other players simultaneously. Not only does it reduce calculation time, but by using all the other players as a reference, Glicko is able to nail the context of a round score. In college basketball, on the other hand, it works less well, and a ratings period never fits naturally. I found that a ratings period of three games worked best there, but depending on how you group past games into threes, you might end up with different ratings. You’re also waiting three games to update rankings, so the last game or two often hasn’t been factored into your rankings yet.

An implementation note: I’m using Glicko-2, but I refer to it as Glicko. I heavily referenced this implementation by GitHub user sublee.

College Basketball Results

Error per game, averaged from 2003–2019. Less is better!

The graph above shows the error per game for each of the ratings systems I tried. The weekly error is averaged over all games during that week across all seasons from 2003 to 2019. First of all: wow! Improved Elo, when finely tuned, does very well. Crucially, it maintains a clear gap over the other ratings systems late into the season. In sports models, when you’re fighting for decimal places, that’s really impressive! One caveat: improved Elo beat a Glicko system that didn’t use preseason rankings, even though I could have added them to Glicko as well. Still, improved Elo seems better in any case!

In my opinion, the weirdest finding is the effectiveness of my improved Elo implementation early in the season. The error early in the season is smaller than it is late in the season. I believe this is mostly due to traditionally good teams "warming up" with cupcake opponents earlier in the season. If Kentucky plays a no-name directional school, improved Elo will predict a 90%+ probability that Kentucky will win, and the error will be very low when they most likely actually do win.

Along with cupcake schedules, there might be other explanations too. Do teams develop and change skill more during the season (thus being harder to predict) than during the longer offseason? How much does offseason roster turnover impact a team’s ability? I think the evolution of error over the season is worth exploring further. Mid-season injuries, conference vs. non-conference games, and studying coaching impact would all be interesting directions to go from here.

A secondary result is that after much tuning, I couldn’t get Glicko to come close to beating improved Elo. This is surprising because it’s supposed to be a better system! Only after plotting this did I realize the shortcomings of Glicko in the college basketball universe. Let’s not throw it out yet though!

Also, I isolated the individual improvements to Elo to show that they all contributed:

Improvement Isolation

I didn’t fully optimize the individual improvements, but the point is that they all contributed something. It’s also impressive that, combined, they perform at least as well as the sum of their parts would suggest. I half expected decaying K and priors to be two solutions to the same problem, but it turns out they’re both useful even in concert. The code I used, sans data, can be found here.

Golf Results

Note #1: Before updating July 9th 2019, I was displaying golf Glicko results that were better than they should’ve been. I was allowing some data leakage that improved the results.

Each system was optimized as much as possible within reason.

Note #2: If you’re wondering how I used Log 5 in golf, I did it by summing the result of every matchup in a tournament. So in a 144-player tournament, one round involves 143 matchups for each player.
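In code, that pairwise idea could look something like this, reusing the log5 helper sketched earlier (again, my own naming rather than the article’s actual code):

```python
def expected_field_wins(player_pct, field_pcts):
    """Expected number of head-to-head wins for one player against the rest of a field.

    player_pct: the player's historical win percentage.
    field_pcts: win percentages for every other player in the tournament.
    """
    return sum(log5(player_pct, opponent_pct) for opponent_pct in field_pcts)

# In a 144-player field, each player faces 143 implicit matchups per round.
```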

The results are much closer between the two main systems this time! Glicko beats regular Elo by a significant margin and even approaches improved Elo. Glicko achieves its score with less computation and less parameter tweaking than improved Elo. As I alluded to earlier, Glicko is much more useful in contexts where many players compete against each other simultaneously.

In researching this article, I didn’t see it spelled out that Bayesian rating systems like Glicko have shortcomings in head-to-head matches over short seasons. Of course, I can’t rule out adjustments that could improve Glicko’s performance there. Microsoft uses TrueSkill and TrueSkill 2 in most of its games, and both are closely related to Glicko. This makes sense, because many online multiplayer matches involve more than two players. Based on my experiments, though, game designers might be better off employing a version of Elo for head-to-head games.

Conclusion

All in all, I accomplished what I set out to accomplish. I compared generic ratings systems across two very different sports and found that the results differed in each. In general, most further improvements would come from squeezing out extra decimal places or making sport-specific changes. What a waste of time, right? Well, I plan on doing exactly that in the future.

Of course, it goes without saying that we could improve on Glicko a bit. Mark Glickman has made his own improvements (like the Glicko Boost system, 2010). There have also been improved systems based on Glicko, like the Stephenson system that won a 2012 Kaggle competition. Those focus on chess, but they offer clues on where to go from here.

Lastly, I want to compare how well established real-world ranking systems correlate with the systems I described here. If those systems produce rankings very different from my own, all of this was fairly useless. In the interest of brevity, I’ll focus on golf, comparing four ranking systems: my Elo and Glicko ratings alongside two established references. The first reference is the Official World Golf Ranking. It’s not the best system, but it has been used for a really long time and is familiar to any golf fan. It weights winning golf tournaments and doing well in big tournaments more heavily than it probably should (therefore it loves 4-time major winner Brooks Koepka). The other reference, "DG Ranking", is the ranking created by datagolf.ca. Data Golf does great work, and I think even non-golf fans would enjoy seeing some of the visualizations they’ve created. Their overall rankings are usually close to Vegas odds and are a good reference to see how golfers compare at any given time.

As you can see, both of my ranking systems correlate with DataGolf’s ranking at 85% or more! Elo and Glicko also differ from each other substantially. Since they’re different but provide similar prediction accuracy, they could probably be combined into an even better model 🤔.
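If you want to run this kind of comparison yourself, a rank correlation such as Spearman’s is a natural choice (I won’t claim it’s exactly how the figures above were computed). A minimal sketch with made-up ranks:

```python
from scipy.stats import spearmanr

# Hypothetical ranks for five golfers under two systems (1 = best).
elo_ranks = [1, 2, 3, 4, 5]
dg_ranks = [2, 1, 3, 5, 4]

rho, _ = spearmanr(elo_ranks, dg_ranks)
print(round(rho, 2))  # 0.8 -- the two systems mostly agree on ordering
```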

