Automating Sports Analytics with Data Science
Automation
Automate everything until only the fun stuff is left.
This is a quote that guides me in almost everything I do. I’m not sure if it is an original thought or something I heard along the way … but after a 15-minute unsuccessful Google search, I’m happy to claim ownership.
What follows is an application of data process automation specific to NFL handicapping, but, as many will notice, the solution is applicable to plenty of other situations.
In my situation I had a spreadsheet. I created it every week to help me analyze NFL matchups. At first I put it together to get an edge in my friend’s fantasy league, but over time, as sports gambling was legalized throughout the country, I found it useful from a handicapping perspective as well.
The spreadsheet aggregates pre-game data such as game spreads/totals, implied fantasy points per team, opponent defensive ranking, points-per-game trends and record against the spread. Having a view of all of this data helps me make informed decisions around risk, value and expected return, much like running a business. The problem was that it took me hours to put the file together before I could start analyzing anything. I also ran the risk of making decisions based on out-of-date or incorrect data due to human error.
So after years of copy-pasting, VLOOKUPs and ridiculous spreadsheet formulas, I finally dedicated some time to building an automated data pipeline that lets me spend more time picking and less time clicking.
The Pipeline
This exercise was easy-ish because I understood my target (the spreadsheet) very well, and I knew exactly where I could get a consistent stream of data. I have been using https://sportsdata.io/ (and more recently https://bettingdata.com/ ) for years, and they have an extremely friendly API for any sports-related data you could ever need.
The vision I had in my head looked something like … SportsDataIO as the source of data, BigQuery to store/transform the data and a DataStudio dashboard as a replacement for my manual spreadsheet … all orchestrated through a Jupyter Notebook.

Authentication
A lot of these articles skip boring things like authentication, which can be really frustrating unless you are familiar with the topic. Below is an example of how I easily authenticated with Google BigQuery (gbq) and connected to a specific project (sportsdataio) within my account.
Getting your credentials from Google Cloud Platform to plug into this template is easy. Follow the instructions for Creating a Service Account and insert the contents of the downloaded JSON file into the template below.
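Here is a minimal sketch of that template, assuming a pandas-gbq workflow; the key file name and project id are placeholders for your own values:

```python
# Minimal sketch: point pandas-gbq at a BigQuery project using a
# service account key. File name and project id below are placeholders.
import pandas_gbq
from google.oauth2 import service_account

# Load the service account key downloaded from Google Cloud Platform
credentials = service_account.Credentials.from_service_account_file(
    "sdio-service-account.json"  # hypothetical file name
)

# Set the default credentials/project so later read_gbq/to_gbq calls just work
pandas_gbq.context.credentials = credentials
pandas_gbq.context.project = "sportsdataio"
```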
Extraction
I decided I needed to extract data from the following endpoints provided by https://sportsdata.io/:
- TeamSeasonStats
- AllTeams
- GameOddsByWeek
- Scores
The framework I use to extract data from each of these endpoints is standardized into 4 steps:
- set parameters needed for request and insert
- make request to sportsdata.io
- normalize request results
- insert request results into gbq
A simple example of accessing TeamSeasonStats for the current season is below:
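Something along these lines, where the endpoint path, season string and destination table are assumptions based on the public sportsdata.io docs rather than the exact notebook code:

```python
# Sketch of the 4-step extraction pattern, using TeamSeasonStats as the example.
import requests
import pandas as pd
import pandas_gbq

# 1. set parameters needed for request and insert
API_KEY = "YOUR_SPORTSDATAIO_KEY"  # placeholder
SEASON = "2020REG"                  # placeholder season identifier
url = f"https://api.sportsdata.io/v3/nfl/scores/json/TeamSeasonStats/{SEASON}"

# 2. make request to sportsdata.io
response = requests.get(url, headers={"Ocp-Apim-Subscription-Key": API_KEY})
response.raise_for_status()

# 3. normalize request results (flatten the JSON into a data frame)
df = pd.json_normalize(response.json())

# 4. insert request results into gbq, replacing the last run's table
pandas_gbq.to_gbq(df, "nfl.team_season_stats", if_exists="replace")
```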
A more complicated historical extraction of GameOddsByWeek is below.
- In this example we set multiple parameters to get the current season and current week for the request.
- We then loop through each week of results from GameOddsByWeek from the beginning of the season.
- For each loop, we append the normalized results derived from the nested JSON request response to a temporary data frame.
- Finally, we use the appended data frame to insert all historical results into gbq, as sketched below.
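A hedged sketch of that loop; the endpoint path, the nested field names (PregameOdds, GameId, Week) and the destination table are assumptions drawn from the sportsdata.io docs, not the exact notebook code:

```python
# Sketch of the historical GameOddsByWeek extraction described above.
import requests
import pandas as pd
import pandas_gbq

API_KEY = "YOUR_SPORTSDATAIO_KEY"  # placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": API_KEY}
SEASON = "2020REG"   # placeholder current season
CURRENT_WEEK = 12    # placeholder; in practice derived from a current-week lookup

frames = []
# loop through each week of results from the beginning of the season
for week in range(1, CURRENT_WEEK + 1):
    url = f"https://api.sportsdata.io/v3/nfl/odds/json/GameOddsByWeek/{SEASON}/{week}"
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    # the odds are nested inside each game record, so record_path flattens
    # the per-book odds while meta keeps the game-level identifiers
    weekly = pd.json_normalize(
        resp.json(), record_path="PregameOdds", meta=["GameId", "Week"]
    )
    frames.append(weekly)

# append all weeks into one frame and insert the history into gbq
history = pd.concat(frames, ignore_index=True)
pandas_gbq.to_gbq(history, "nfl.game_odds_history", if_exists="replace")
```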
All of these extractions, written in a Jupyter Notebook, are executed with a simple command in a Terminal:
```
jupyter nbconvert --to notebook --execute sdio_extraction.ipynb
```
Transformation
There are a number of transformations to make, all of which can be handled with a couple of SQL queries now that the information is stored in gbq.
One of the most involved transformations is calculating how each team performs Against The Spread (ATS). For instance, if the Tampa Bay Buccaneers are laying 7 points (-7) this week against the Atlanta Falcons, and the final score of the game is TB 30 – ATL 20, then Tampa covered the spread: they won by 10, more than the 7 points they were laying. This is a helpful metric when wagering on point spreads because you generally want to gravitate towards teams who have a positive ATS record.
I could probably write a whole post on ATS alone, but in short, this script produces a table that makes it very easy to calculate an ATS win percentage (ats_win/games) per team (e.g., TB at a 53.85% ats_win_p).
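The real query lives in the notebook; here is a hedged sketch of the core ATS logic, where the table and column names (scores_with_spreads, point_spread, and so on) are placeholders:

```python
# Hedged sketch of the ATS calculation. The table and column names are
# placeholders for whatever your joined scores/odds table actually contains.
import pandas_gbq

ats_sql = """
WITH results AS (
  SELECT
    team,
    -- a team covers when its margin plus the spread it was laying is positive
    -- e.g. TB -7 winning 30-20: (30 - 20) + (-7) = 3 > 0, so TB covered
    CASE WHEN (team_score - opponent_score) + point_spread > 0
         THEN 1 ELSE 0 END AS ats_win
  FROM `sportsdataio.nfl.scores_with_spreads`
)
SELECT
  team,
  COUNT(*) AS games,
  SUM(ats_win) AS ats_win,
  ROUND(SUM(ats_win) / COUNT(*), 4) AS ats_win_p
FROM results
GROUP BY team
ORDER BY ats_win_p DESC
"""

ats = pandas_gbq.read_gbq(ats_sql)
```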

From a macro perspective, the more important transformation queries are actually scheduled in BigQuery. I have two queries scheduled to run daily, which complement the extraction schedule. These queries populate views that will ultimately be used in DataStudio.
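The scheduling itself is configured in the BigQuery UI, but the shape of one of those view-populating queries looks roughly like this (the view and table names are placeholders, not the real pipeline objects):

```python
# Sketch of the kind of DDL one of the scheduled queries runs.
from google.cloud import bigquery

client = bigquery.Client(project="sportsdataio")

view_sql = """
CREATE OR REPLACE VIEW `sportsdataio.nfl.ats_summary` AS
SELECT team, games, ats_win, ats_win / games AS ats_win_p
FROM `sportsdataio.nfl.ats_results`
"""

client.query(view_sql).result()  # run the DDL and wait for it to finish
```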

Presentation

Now, instead of a shoddy spreadsheet that took me hours to prepare, I wake up to a dashboard automatically populated with the most recent information. From here, it’s easy for me to create charts and visualizations that make decisions easier.
For example:
- Using the Top 10 ATS Chart, I can see that Miami has an outstanding record ATS, so picking them at -1 against a weak Patriots team feels a little safer.
- Using the points per game vs. public over/under column, I can see that KC and NO combine for 59 points based on average points per game this season, but the current total points line is set at 51.5. Seems like there is some value in taking the over in this situation.
- Using the Top 10 Implied Fantasy Points Chart, I see I should play a lot of LAR skill-position players, as the Rams have the highest implied fantasy point projection based on the projected total points and point spread.
- Using the charts at the bottom, we can see which defenses we should be targeting in fantasy football. Overall, JAX, DET and HOU are defenses to target; DAL and HOU are defenses to target for running backs; and SEA and NYJ are defenses to target for wide receivers.
To be continued!
Setting up this pipeline has given me more time to analyze data and increase my likelihood of success, rather than grinding through repetitive, menial, low-value tasks. This is not only my goal for NFL handicapping, but for life! Data technology today gives us an amazing opportunity to automate the boring stuff so we can empower ourselves, or others, to focus on higher-level work.
You can access the Jupyter Notebook that is the "engine" for this pipeline here.
If you have any questions or suggestions please feel free to comment or reach out to [email protected]