Building My First Machine Learning Model | NBA Prediction Algorithm

Merging the Worlds of Sports Analytics and Machine Learning

Alexander Fayad
Towards Data Science


EDIT: Since writing this article, we have launched a subscription service at https://infinitysports.ai, which gives access to the API and outputs of our new NBA prediction model. We are now able to predict winners, spreads, and point totals. I have thoroughly loved working on this project, so please feel free to reach out with any questions!

When I started programming a few years back, I never had a clear purpose for my newfound skills. I toyed around with building video games, basic websites, and iOS applications, but in a way, I was spreading myself thin by not specializing in a particular craft. Around this time, the Toronto Raptors were on a historic playoff run, and I had just started playing around with bet365, happening to make a good profit betting exclusively on them throughout the playoffs.

Given my background in both Commerce and Computer Science, I wanted to learn more about the roles of a Quantitative Analyst and a Data Scientist. I began to explore the world of data science, starting with the basics of the Scikit-learn package given my experience with Python. The rest of this article outlines how I went from knowing next to nothing about Data Science and Machine Learning to building my first NBA prediction model with ~72% accuracy (more on this later, but the results aren't as great as they seem).

Table of Contents

  1. The Game Plan
  2. Data Acquisition
  3. Data Processing & Exploration
  4. Choosing the Right Model
  5. Testing and Results

The Game Plan

The NBA, as well as many other sports, has seen the use of statistics grow exponentially over the last 10–20 years. I began my search for the most relevant NBA stats by reading Which NBA Statistics Actually Translate to Wins by Chinmay Vayda. His research found that the best predictors of wins in the NBA were a team's Offensive Rating, Defensive Rating, Rebound Differential, and 3-Point %, among other stats, which you can read more about by following the link above.


I planned to use more recent data by leveraging the NBA's monthly statistics as the predictors for the matches played during that month. Since this was my first attempt at building a model, I wanted to keep the data simple and the mathematical complexity at a minimum.

The next step was figuring out how to acquire this data. I needed a data source for match results of the last ~10 years, as well as a source for a team’s statistics in any given month.

Data Acquisition

For the data acquisition portion of this project, I used Selenium, a Python package for browser automation and web scraping, to pull the data I needed from various websites. I also decided to limit my search to data from the 2008–2009 season to the present.

Past Results

My data source for the past results was Basketball-Reference.com, which had match results going back to 1946.

Screenshot of the data source on Basketball-Reference.com

Using Selenium, I scraped these tables one-by-one and converted them to CSV files for later use.
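For readers curious what that scraping step looks like in practice, here is a minimal sketch, assuming Chrome and the month-by-month schedule pages on Basketball-Reference; the URL pattern, season range, and file names are my assumptions, not the original code.

```python
# Hedged sketch of scraping monthly schedule/result tables with Selenium.
# The URL pattern, season range, and output file names are assumptions.
import pandas as pd
from selenium import webdriver

MONTHS = ["october", "november", "december", "january",
          "february", "march", "april", "may", "june"]

driver = webdriver.Chrome()

for season in range(2009, 2021):          # 2008-09 season onwards
    for month in MONTHS:
        url = (f"https://www.basketball-reference.com/leagues/"
               f"NBA_{season}_games-{month}.html")
        driver.get(url)
        try:
            # pandas can parse the rendered HTML tables directly
            tables = pd.read_html(driver.page_source)
        except ValueError:
            continue                      # no schedule table for this month
        # save without headers, matching the raw files described later
        tables[0].to_csv(f"results_{season}_{month}.csv",
                         index=False, header=False)

driver.quit()
```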

Past Monthly Statistics

I also needed to know what every team's stats were for any given month, and my data source for this was the official NBA Stats page. After tweaking a few filters, I ended up with the following table.

Screenshot of the data source on NBA Advanced Stats

Once again, I used Selenium to scrape these monthly stats tables and saved them as CSV files. My next step was to merge these files into one dataset.

Data Processing & Exploration

Data Processing

We now have to process our two sets of CSV files into a format where they can be compared and analyzed. I have embedded below the first 5 lines of each CSV file, exactly as they were scraped.

First 5 lines of both the results and stats in the raw form.

The first thing we notice is that neither table has a header, so headers will need to be added manually by referencing the original tables. Second, we have multiple empty columns littered throughout our datasets, which need to be dropped. Lastly, in our results dataset, we need to drop the time column, the box score text, and the attendance numbers, since they won't be useful to us.
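A rough sketch of these cleanup steps with pandas might look like the following; the column labels are illustrative assumptions based on the screenshots, not the exact names I used.

```python
# Hedged sketch of the cleanup: add headers, drop empty columns, and
# drop the time, box score, and attendance fields. Column names are assumed.
import pandas as pd

results = pd.read_csv("results_2019_october.csv", header=None)  # one raw file
results.columns = ["Date", "Time", "Team1", "Team1Pts",
                   "Team2", "Team2Pts", "BoxScore", "OT",
                   "Attendance", "Notes"]

results = results.dropna(axis=1, how="all")                 # empty columns
results = results.drop(columns=["Time", "BoxScore", "Attendance"],
                       errors="ignore")
results.to_csv("results_clean.csv", index=False)
```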

Following these steps, our datasets now look like this:

First 5 lines of both the results and stats post-processing.

The code below outlines how I went about merging the two CSV files, as well as adding a new column for whether Team 1 won or lost, which would become our target variable. The specifics of the merging aren't easy to explain in prose, but if you are interested you can take a look below.
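The original gist isn't reproduced here, so below is only a hedged sketch of the general idea: look up each team's stats for the month a game was played in, join them onto the results, and derive the Team1Win target. All column and file names are assumptions.

```python
# Hedged sketch of the merge. Assumes the cleaned results and a monthly
# stats file with one row per team per month ("Team", "Month", OFFRTG, ...).
import pandas as pd

results = pd.read_csv("results_clean.csv", parse_dates=["Date"])
stats = pd.read_csv("monthly_stats.csv")

# Key each game by the month it was played in (format must match the stats file)
results["Month"] = results["Date"].dt.to_period("M").astype(str)

merged = (results
          .merge(stats.add_prefix("Team1"),
                 left_on=["Team1", "Month"],
                 right_on=["Team1Team", "Team1Month"])
          .merge(stats.add_prefix("Team2"),
                 left_on=["Team2", "Month"],
                 right_on=["Team2Team", "Team2Month"]))

# Target variable: 1 if Team 1 (the away team) won, 0 otherwise
merged["Team1Win"] = (merged["Team1Pts"] > merged["Team2Pts"]).astype(int)
merged.to_csv("nba_dataset.csv", index=False)
```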

After executing the code outlined above, our dataset now looks like this:

We now have our dataset, which compares each team’s stats and states which team won the encounter. At the end of this process, our dataset contains over 13,000 unique matchups! This dataset is now very close to being ready for machine learning analysis.

Since I only needed the team names and scores to merge the datasets and determine which team won, I also dropped those columns, leaving a dataset that is ready to be explored and modeled for machine learning.
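That final trimming step is simple enough to sketch, again with the assumed column names from above.

```python
# Drop the identifying columns and join keys now that the target exists.
import pandas as pd

merged = pd.read_csv("nba_dataset.csv")
model_data = merged.drop(columns=["Team1", "Team2", "Team1Pts", "Team2Pts",
                                  "Team1Team", "Team1Month",
                                  "Team2Team", "Team2Month",
                                  "Date", "Month"],
                         errors="ignore")
model_data.to_csv("nba_model_data.csv", index=False)
```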

Data Exploration

Our current dataset looks like this:

I was now interested in how the various statistics in our dataset correlate with one another. A friend of mine introduced me to a great Python package called Speedml, which simplifies exploratory analysis and produces great-looking plots for sharing your data.

When you have numerical features, it is always interesting to see how each feature in the dataset correlates with the others. The plot below illustrates our feature correlations, and there are some interesting insights we can derive from it.

Speedml generated feature correlation plot
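The original plot was generated with Speedml; an equivalent correlation heatmap built with pandas and seaborn (a stand-in, using the file name assumed in the earlier sketches) would look roughly like this.

```python
# Equivalent of the Speedml correlation plot, using pandas + seaborn.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("nba_model_data.csv")
corr = data.corr()                      # pairwise feature correlations

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature correlations")
plt.show()
```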

Firstly, since both the top-left and bottom-right quadrants contain the most correlated features, we can tell that a team's stats correlate strongly with one another. If a team has a high PACE, it will probably also have a high OFFRTG.

We also see that the 'Team1Win' feature is not heavily correlated with any specific feature, although a few stand out as stronger correlations than others. This led me to create a feature importance plot, another easy-to-use implementation included in Speedml.

Team stats feature importance
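Speedml produces this plot directly; a comparable sketch using the feature importances of a scikit-learn random forest (a stand-in, not the article's exact method) is shown below.

```python
# Stand-in for the Speedml importance plot: random-forest feature importances.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("nba_model_data.csv")
X = data.drop(columns=["Team1Win"])
y = data["Team1Win"]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances.sort_values().plot.barh(figsize=(8, 10), title="Feature importance")
```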

Although we do see some features with greater importance, we don't have a clear reason to eliminate any features at this point, but the plot does allow for some interesting observations.

Strip plot for Team2NETRTG vs Team1Win

The reason for Team2NETRTG's high importance can be seen in the strip plot above: when Team 2 has a higher Net Rating, Team 1 tends to win less, and on the flip side, when Team 2 has a lower Net Rating, Team 1 tends to have a higher chance of winning.
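For reference, the same kind of strip plot can be reproduced with seaborn (the original was made with Speedml; column and file names are the same assumptions as above).

```python
# Strip plot of Team 2's Net Rating split by whether Team 1 won.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("nba_model_data.csv")
sns.stripplot(data=data, x="Team1Win", y="Team2NETRTG", jitter=True, alpha=0.5)
plt.title("Team2NETRTG vs Team1Win")
plt.show()
```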

Away team winning percentage

Finally, the last plot I found interesting was the distribution of Team 1's wins. The way our data was acquired, Team 1 is always the away team. This distribution shows that the away team loses about 59% of the time, which illustrates what we know as home-court advantage.
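Since Team 1 is always the away team, that home-court figure falls straight out of the Team1Win column; a quick check might look like this.

```python
# Share of games won by the away team (Team 1).
import pandas as pd

data = pd.read_csv("nba_model_data.csv")
away_win_rate = data["Team1Win"].mean()
print(f"Away team win rate: {away_win_rate:.1%}")   # ~41% per the article
```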

Choosing the Right Model

I chose to use Scikit-learn, given the ease of implementing a variety of algorithms and my experience with Python. Using this flowchart provided by Scikit-learn, I identified a few models I wanted to try, to see which performed the most accurately on my dataset. My attempts began as follows:
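A minimal sketch of how such a comparison might look is below; the candidate models are illustrative, and only the SVC figure quoted afterwards comes from the original experiment.

```python
# Hedged sketch: compare a few scikit-learn classifiers with cross-validation.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("nba_model_data.csv")
X = data.drop(columns=["Team1Win"])
y = data["Team1Win"]

models = {
    "SVC": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=200),
    "KNeighbors": KNeighborsClassifier(),
}

for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.4f}")
```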

Of all the models attempted, the one with the highest accuracy was Scikit-learn's SVC (support vector machine) classifier, with an accuracy of 72.52% on historical data! It was now time to test this model in the real world, where I attempted to predict the outcomes of 67 games before the NBA shut down as a result of COVID-19.

Testing and Results

For my first 28 game predictions, I simply picked who was going to win, without taking into account the spread I was offered. For the remainder of my predictions, I implemented a scoring method that produced a spread of my own, based on which team was more likely to win. My results for the algorithm's accuracy were as follows:

A major weakness of my algorithm is its ability to predict upsets. On the first day of running my model, the underdog won in 5 out of 7 games. My algorithm was typically in line with what online sports betting websites published, which was a good sign. Reading online, we can see that the NBA typically has an upset rate of 32.1%, while my sample had an upset rate of 40.2%.

I also wanted to attempt a financial strategy and began by simply betting on the predicted winner. After updating my model, I implemented a money line of my own, which I compared with the lines offered by bet365, choosing the more favorable of the two. Below we can see the results of the blind and selective strategies.

Immediately, we can tell that the selective strategy, which involves choosing the more favorable odds, performs better.
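To make the selective idea concrete, here is a hedged sketch of one way to compare a model's win probability against a bookmaker's money line and only bet when the offered line is favorable; the helper functions are hypothetical, not the article's code.

```python
# Hypothetical helpers illustrating the selective strategy.
def american_odds_to_prob(odds: int) -> float:
    """Implied win probability from American money-line odds."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)

def should_bet(model_win_prob: float, book_odds: int) -> bool:
    """Bet only when the model's probability beats the bookmaker's implied one."""
    return model_win_prob > american_odds_to_prob(book_odds)

# Example: model says 60% win probability, bet365 offers +120 (implied ~45.5%)
print(should_bet(0.60, 120))   # True -> the offered line is favorable
```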

I have waited until the end of the article to address this, but one major flaw in this model, which took me a few months to realize, is that the data it uses is biased. Put simply, a team's monthly stats are directly influenced by its performance in that month's games, so by using a given month's stats to predict that month's results, the stats already included the outcomes I was trying to predict. This realization led me to start building my new NBA prediction model.

Future Steps

The reason I chose to write and release this article now is that I built this model back in January/February of 2020, and I have been working on an updated, more comprehensive model that will use an Artificial Neural Network for predicting not wins or losses, but the actual scores of each encounter. This new model will be based on the players within a team as opposed to the team as a whole. I will be writing about this journey in the coming weeks and hope to share some good news!

If you enjoyed this article or would like to discuss any of the information mentioned, you can reach out to me on LinkedIn, Twitter, or by email.
