XGBalling: Hacking Basketball Game Prediction with ML

Hussien Hussien
Towards Data Science
9 min read · Nov 20, 2020


A look at how three-pointers have come to dominate college basketball scoring strategy. (Image by Author)

Every year, the NCAA (National Collegiate Athletic Association) hosts a popular college basketball tournament dubbed ‘March Madness’. Held mostly in March, it pits 68 Division I college basketball teams against each other in a single-elimination tournament, with the final two teams competing for the national championship. Match-ups are determined by a committee using divisions (regions) and seeds (rankings): teams with seeds 1–4 are each assigned to a separate region, teams with seeds 5–8 are each assigned to a separate region, and so on until each region has 16 teams.

It has become wildly popular for fans to predict the outcomes of each game, with an estimated 60 million Americans filling out a bracket each year. In fact, in 2014, investor Warren Buffett’s Berkshire Hathaway and Quicken Loans teamed up to offer $1 billion to any fan who could perfectly predict the 2014 Men’s bracket. With 67 games, the probability of randomly guessing a perfect bracket is 0.5⁶⁷ or, to put it another way, your odds are about 1 in 148 quintillion. Very unlikely.
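For the curious, that figure follows from a quick back-of-the-envelope calculation treating every game as a coin flip:

# Each of the 67 games is treated as a 50/50 coin flip
p_perfect = 0.5 ** 67                 # probability of a perfect random bracket
print(f"1 in {1 / p_perfect:,.0f}")   # 1 in 147,573,952,589,676,412,928 (~148 quintillion)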

In this post, we develop a simple yet powerful strategy for predicting the outcome of match-ups in the NCAA March Madness tournament. We analyze each team’s regular-season stats against one another and test machine learning algorithms such as Logistic Regression, SVM, K-Nearest Neighbors, and XGBoost to output a predicted match outcome. Ultimately, we were able to predict game outcomes with 90% test accuracy using simple feature engineering and a gradient-boosted tree algorithm (XGBoost).

Dataset and Features

Figure 1: Heat map of court location of made vs. missed shots from the 2019 Season. It is interesting to see how concentrated the made shots are vs. the sparsity of the missed ones. (Image by Author)

The datasets used in this project were provided by Google Cloud and the NCAA. Twenty-five datasets were provided, containing game-level, player-level, and team-level data for both the women’s and men’s leagues. The earliest season we have data on is 1985 but, presumably due to advances in technology, much more detailed game statistics are available for the most recent seasons. Some of the datasets include:

  • Tournament Results: Win/Loss results of each tournament game
  • Regular Season Detailed Results: Detailed aggregate results of each team’s regular-season games: assists, blocks, three-pointers, etc. made by each team.
  • Play-By-Play [2016–2019]: A granular event-level log of each play in those seasons. Events tracked include timestamped turnovers, 3/2-point shots made/missed, assists, etc., by player, along with location on the court. Figures 1 and 2 include visualizations from this dataset.
  • Massey Ordinals: Team and Season level log of each team’s ranking according to many well-known ranking systems such as TrueSkill, ESPN SOR, Haslametrics, and PowerRank. Put together by Kenneth Massey.

These datasets give interesting insights into the history of March Madness. For instance, a Seed 1 team has lost to a Seed 16 team only once in tournament history. Seed difference is a common statistic analyzed by casual sports fans when filling out brackets.

Figure 2: Shots-made percentages from 2003 onward. Teams have gradually become more efficient at two-pointers over time. Generated using helper code from Martin Henze’s “Jump Shot to Conclusions” notebook. (Image by Author)

Preprocessing and Selection

To format the data for modeling, each tournament match-up is represented by the difference between the two teams’ season performance and seeds. The outcome y is a binary variable indicating whether Team 1 beat Team 2, and each predictor 𝑋1, …, 𝑋𝑛 is the difference between Team 1’s and Team 2’s season stats. For example, a given predictor 𝑋𝑘 may be Team 1’s average blocks per game minus Team 2’s average blocks per game for that season. This is based on the intuition, supported by data exploration, that a historically stronger team should beat a historically weaker team. Thus, the better each team’s strength can be gauged, the more predictive power statistical models can capture.
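As an illustration, here is a minimal sketch of that difference-based feature construction in Python with pandas. The column names (team1_*, team2_*, team1_won) are hypothetical placeholders, not the actual column names from the Kaggle files:

import pandas as pd

# Hypothetical frame: one row per tournament game, with each team's
# season aggregates already joined on (column names are illustrative).
games = pd.DataFrame({
    "team1_avg_blocks": [4.2, 3.1],
    "team2_avg_blocks": [2.8, 5.0],
    "team1_seed": [1, 8],
    "team2_seed": [16, 9],
    "team1_won": [1, 0],   # binary outcome y
})

# Each predictor X_k is Team 1's stat minus Team 2's stat
X = pd.DataFrame({
    "blocks_diff": games["team1_avg_blocks"] - games["team2_avg_blocks"],
    "seed_diff": games["team1_seed"] - games["team2_seed"],
})
y = games["team1_won"]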

For this project only two datasets are used to build model inputs: Regular Season Detailed Results and Tournament Results. Detailed regular season results are only tracked from 2003 onwards, so only data from the 2003–2019 period is analyzed. In addition to the game statistics provided in the dataset, advanced basketball statistics are also engineered as predictors, using helper functions from publicly available notebooks on Kaggle. All in all, each predictor is a function of the difference between the two teams in statistics like the following (a sketch of this feature engineering appears after the list):

  • Total Season Points
  • Seed
  • Offensive efficiency
  • Defensive efficiency
  • 3-Point Accuracy
  • Average Possession
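To give a sense of those engineered predictors, here is a sketch of the standard possessions and efficiency formulas from basketball analytics. The 0.475 free-throw weight and the exact definitions used in the Kaggle helper notebooks may differ slightly:

# Common basketball-analytics estimates, computed from season totals.
# FGA = field goal attempts, ORB = offensive rebounds, TO = turnovers,
# FTA = free throw attempts.
def possessions(fga, orb, to, fta):
    return fga - orb + to + 0.475 * fta

def offensive_efficiency(points_scored, poss):
    return 100 * points_scored / poss    # points scored per 100 possessions

def defensive_efficiency(points_allowed, poss):
    return 100 * points_allowed / poss   # points allowed per 100 possessions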

To prepare for modeling, each variable is normalized and PCA is applied to reduce the dimensionality from 39 to 15 columns while capturing 90% of the dataset’s total variance. The data is then split into a training set of tournament games from 2003–2015 and a testing set of tournament games from 2016–2019.
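A minimal sketch of this preprocessing with scikit-learn, assuming X is the full 39-column difference-feature matrix, y the outcomes, and seasons a parallel column with each game’s tournament year (all hypothetical names):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Time-based split: train on 2003-2015 tournaments, test on 2016-2019
train_mask = seasons <= 2015
X_train, X_test = X[train_mask], X[~train_mask]
y_train, y_test = y[train_mask], y[~train_mask]

# Normalize each predictor, then project down to 15 principal components
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=15).fit(scaler.transform(X_train))

X_train_pca = pca.transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))
print(pca.explained_variance_ratio_.sum())   # ~0.90 of total variance in this project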

Using the skim() function from the skimr package in R, we can view a summary of the detailed regular season and tournament datasets.

library(dplyr)   # provides the %>% pipe
library(skimr)   # provides skim()

reg_season_stats %>% 
  skim()
(Image by Author)
tourney_stats %>% 
  skim()
(Image by Author)

Statistical Methods

Below are brief descriptions of the ML algorithms we tested to predict the outcomes. I’ll link a YouTube video under each algorithm if you wish to learn more about it. They’re pretty much all StatQuest videos because Josh Starmer is heaven-sent.

Logistic Regression

A binary classification algorithm that extends the linear regression model. It models the log-odds (logit) of a given class as a linear function of the predictors, where 𝐵𝑖 represents the coefficient of the 𝑖-th predictor, and outputs the probability of that class.

The Logit Function (Image by Author)
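In text form, the model pictured above is the standard logit formulation (with p the predicted probability that Team 1 wins):

$$\log\frac{p}{1-p} = B_0 + B_1 X_1 + \dots + B_n X_n$$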

We use an l2 penalty with regularization parameter 𝜆 set to 0.1. After choosing the regularization method, logistic regression fits its coefficients to maximize the penalized likelihood function.
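A minimal sketch of this fit, assuming scikit-learn’s LogisticRegression, which parameterizes l2 regularization via C, the inverse of 𝜆 (so 𝜆 = 0.1 corresponds to C = 10):

from sklearn.linear_model import LogisticRegression

# l2-penalized logistic regression; C = 1 / lambda = 1 / 0.1 = 10 (assumed mapping)
log_reg = LogisticRegression(penalty="l2", C=10.0, max_iter=1000)
log_reg.fit(X_train_pca, y_train)
lr_probs = log_reg.predict_proba(X_test_pca)[:, 1]   # P(Team 1 beats Team 2)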

YouTube: StatQuest: Logistic Regression

Support Vector Machines

The support vector machine algorithm classifies data by fitting a hyperplane in the n-dimensional feature space, where n is the number of features. Data points are classified based on which side of the hyperplane they fall on. SVM chooses the hyperplane that maximizes the margin, i.e., the distance between the plane and the nearest data points of each class. Hyperparameters chosen:

  • Gamma: 1e-05
  • Kernel: rbf
  • Squared l2 penalty, 𝜆: 1000.0

The cost function for the Support Vector Classifier is the Regularized Hinge Loss function.

Hinge Function with Regularization (Image by Author)
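A sketch of the corresponding fit, assuming scikit-learn’s SVC and reading the reported penalty value as its C parameter; probability=True is needed to obtain the win probabilities used for log loss:

from sklearn.svm import SVC

svm = SVC(kernel="rbf", gamma=1e-5, C=1000.0, probability=True)
svm.fit(X_train_pca, y_train)
svm_probs = svm.predict_proba(X_test_pca)[:, 1]   # P(Team 1 beats Team 2)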

YouTube: Support Vector Machines, Clearly Explained!!!

K-Nearest Neighbors

K-Nearest Neighbors is a non-parametric algorithm, used in this case for classification. It classifies a data point by a majority vote of its k nearest neighbors: the k data points with the smallest Euclidean distance to the point we are trying to predict. If the majority of a point’s k neighbors fall into a certain class, then that class is our output ŷ. After testing values of k from 1–30, our grid search showed that 𝐾 = 9 gives us the greatest AUC.
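A sketch of that grid search, assuming scikit-learn’s KNeighborsClassifier and GridSearchCV with AUC as the scoring metric:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Search k = 1..30, scoring each candidate by cross-validated AUC
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": range(1, 31)},
    scoring="roc_auc",
    cv=5,
)
knn_search.fit(X_train_pca, y_train)
print(knn_search.best_params_)   # {'n_neighbors': 9} in this project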

YouTube: StatQuest: K-nearest neighbors, Clearly Explained

XGBoost

XGBoost, short for Extreme Gradient Boosting, is a derivation of the gradient-boosted tree algorithm with regularization. Boosting, an ensemble technique, works by creating new models that correct the errors made by existing models; models are added until no further improvement can be made. Gradient boosting creates new models that predict the residuals of the prior models and then combines them to make a prediction, using a gradient descent algorithm to minimize the loss function, hence the name. XGBoost is a framework that implements gradient-boosted trees with additional system, model, and algorithmic improvements. Hyperparameters for learning rate, regularization, tree depth, and tree pruning are tuned for best model performance:

  • Gamma: 0
  • Learning Rate: 0.03
  • Max Depth: 6
  • Lambda: 6

XGBoost minimizes the Log Loss function which is described in the following section on Evaluation.
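A sketch of the tuned model, assuming the xgboost library’s scikit-learn wrapper (XGBClassifier); the Lambda in the list above maps to reg_lambda, the l2 regularization term on leaf weights:

from xgboost import XGBClassifier

xgb = XGBClassifier(
    objective="binary:logistic",   # outputs P(Team 1 wins)
    eval_metric="logloss",
    gamma=0,                       # minimum loss reduction required to split
    learning_rate=0.03,
    max_depth=6,
    reg_lambda=6,                  # l2 regularization (lambda)
)
xgb.fit(X_train_pca, y_train)
xgb_probs = xgb.predict_proba(X_test_pca)[:, 1]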

YouTube: XGBoost Part 2: Classification

Results

Evaluation

The primary metric we use to evaluate model performance on both train and test sets is Logarithmic Loss:

Logarithmic Loss function (Image by Author)

Where:

  • 𝑛 is the number of games played
  • ŷᵢ is the predicted probability of team 1 beating team 2
  • 𝑦𝑖 is 1 if team 1 wins, 0 if team 2 wins
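Written out with those definitions, the pictured loss is the standard binary log loss:

$$\text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\Big]$$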

The use of the logarithm provides extreme punishment for being both confident and wrong: in the worst possible case, confidently predicting that something is true when it is actually false adds an infinite amount to your error score. To prevent this, predictions are bounded away from the extremes by a small value.
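A small sketch of that bounding step, using scikit-learn’s log_loss on clipped probabilities (the clip value of 1e-15 is an assumption; the competition metric uses a similar tolerance):

import numpy as np
from sklearn.metrics import log_loss

def bounded_log_loss(y_true, y_prob, eps=1e-15):
    # Keep predictions away from exactly 0 or 1 so the log never blows up
    clipped = np.clip(y_prob, eps, 1 - eps)
    return log_loss(y_true, clipped)

score = bounded_log_loss(y_test, xgb_probs)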

We also analyze a variety of secondary metrics like accuracy (the percentage of correct predictions out of all predictions) and the F-measure. The F-measure is the harmonic mean of precision and recall; its highest value is 1, which corresponds to perfect precision and recall.

F-Measure Formula (Image by Author)
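For reference, the formula in the image is the usual harmonic-mean form:

$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$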

Training and Out-Of-Sample Results

CV Train Results on 2003–2015 Seasons (Image by Author)
Test results on 2016–2019 Seasons (Image by Author)

The top table shows how our models performed when predicting the outcomes of our training set. Log loss is our main indicator of performance because it is the metric the models were trained to optimize.

The lower table shows performance on the test set. We report more metrics here to get a broader understanding of the strengths and weaknesses of the models. As we can see, XGBoost has the best performance across all evaluation metrics and datasets. Interestingly, our XGBoost model achieves a slightly lower log loss on the test set than on the training set, which is a good sign when considering whether the models may be overfitting.

Discussion

Figure 3: Confusion Matrix of XGBoost predictions on 2016–2019 Tournaments (Image by Author)

Clearly, our XGBoost model produces the best performance on all of our train and test evaluation metrics. As seen in Figure 3, it misclassified only 24 of the 268 games played over the test period. It would be interesting to see which specific match-ups it got right and wrong, but that is for a later date. The model does not appear to be over-fit, since there is no large discrepancy between test and training results, and proactive measures were taken to avoid overfitting, such as dimensionality reduction, regularization, and cross-validation.

Conclusion/Future Work

The predictive power of these models was better than expected. Based on previous winners of the Kaggle competition around this dataset, it appears these results are significant: the winners of the 2019 challenge achieved a log loss of 0.44.

RJ Barrett: Our Canadian pride and joy doing physically what XGBoost can do metaphorically for your sports betting career. Image via Giphy

Future work is extensive. First, I want to see whether we predicted any upsets or whether the model simply goes with the favorites. There is also much more data to take into account than I had time to look at, and I would love to engineer features from it in the future. Finally, I’d like to see how effective this would be as a sports-betting tool, based on historical betting spreads on the games in the test set.

Find the code here.

If you have any questions or want to help move this project forward, you can find me at the links below. Thanks for reading!

Hussien Hussien is a curious computer scientist with a passion for data science, product, and Hawaiian shirts. Find me on LinkedIn or at hussien.net.
