Sixth Man of The Year: Data Acquisition

Jon-Ross Presta
Towards Data Science
7 min read · Jan 22, 2019


All of my eggs are currently in the Data Science basket, and as such I try to spend as much time as possible collecting little nuggets of advice from those who have been practicing data science for a while. After all, I'd much rather learn from other people's mistakes than my own, so that I can make more advanced mistakes. There is one piece of advice that has been beaten to death at the hands of many different professionals. If you interact with Data Scientists on a daily basis, I'm sure you've heard this little tip: "Data Science is actually 80% data collection and cleaning, while only 20% of it is the sexy machine learning and analysis." So here is yet another post highlighting how important data acquisition is, but I hope the example is interesting enough to make this one a little more appealing to read.

When another professional tries to scare you away with the 80/20 split

A handful of classmates and I recently collaborated on a Kaggle competition in which we attempted to predict the success or failure of each of Kobe Bryant's field goal attempts. The dataset provided by Kaggle comprised every shot taken in Kobe's career and included information about each shot such as distance from the hoop, time on the clock, location on the floor, shot type (dunk, jump shot, etc.), opponent, game date, and more.

Very early on, my partners and I were able to leverage our domain knowledge to spot one crucial variable missing from the data. Kobe played 20 seasons in the NBA, and over that time he saw many teams rise and fall in their levels of competitiveness (other than the Knicks). As such, knowing that Kobe took a shot against the Warriors doesn't tell us much on its own, because that could mean a great Warriors defense or a poor one. An extremely well-trained model might be able to learn a general sense of the eras in which certain opponents were better or worse, but that would be asking a lot and would likely result in a model with too much variance. We wanted another way to account for how talented (specifically on defense) an opponent was, as we hypothesized that this would be an important feature to consider.

We decided that the best way to obtain this information would be to get ahold of each NBA team's defensive rating and ranking over the span of Kobe's career. Defensive rating refers to how many points a team allows on average per 100 possessions. We defined defensive ranking as simply the ordinal position a team occupies when all teams in a season are sorted by defensive rating. We used both metrics because defensive rating is consistent across years, while defensive ranking puts an opponent's skill in context with the rest of the league that year.
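
For illustration, deriving the ranking from the rating is a short pandas operation. The team rows and numbers below are made up, and the column names are my own:

```python
import pandas as pd

# Illustrative team-season table; teams, numbers, and column names are invented for the example.
team_defense = pd.DataFrame({
    "season": ["1996-97", "1996-97", "1996-97"],
    "team": ["MIA", "CHI", "GSW"],
    "def_rating": [99.3, 99.8, 108.0],
})

# Lower defensive rating is better, so rank ascending within each season.
team_defense["def_rank"] = (
    team_defense.groupby("season")["def_rating"]
    .rank(method="min", ascending=True)
    .astype(int)
)
print(team_defense)
```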

Data Acquisition

The first responsible step is ensuring that the website we hope to acquire data from has not forbidden automated scraping. Some websites ask that people only scrape between certain times (when they expect less human traffic), others limit scraping to a certain number of requests per second, and others explicitly forbid any scraping at all.

Facebook’s robots.txt, for example

The website that we'll be using, https://stats.nba.com/, was generous enough to collect and present all of this data for free with easy access. We quickly took a look at its robots.txt, which is where we can read what sort of automated access the website prefers. Fortunately, this website asks for no restrictions. The specific statistics that we're interested in collecting are found here.
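
If you'd rather check this programmatically than read the file by hand, Python's built-in robots.txt parser can do it; the stats page path below is illustrative rather than the exact URL we used:

```python
from urllib import robotparser

# Load and parse the site's robots.txt.
rp = robotparser.RobotFileParser()
rp.set_url("https://stats.nba.com/robots.txt")
rp.read()

# Illustrative page path; swap in the page you actually intend to scrape.
page = "https://stats.nba.com/teams/defense/?Season=1996-97"
print(rp.can_fetch("*", page))  # True if robots.txt places no restriction on this path
```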

Example from the 1996–97 season

The biggest roadblock in this process was that these pages are rendered dynamically, meaning a typical GET request does not retrieve the actual data we're after. The site's JavaScript only executes when the page is opened in a browser, and it is that JavaScript that fills in the table.
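
A quick sketch of the problem using the requests library and an illustrative URL: the response may come back fine, but the table we want simply isn't in it.

```python
import requests

# Illustrative URL; the real page and its Season parameter are discussed below.
url = "https://stats.nba.com/teams/defense/?Season=1996-97"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

# Even when the request succeeds, the returned HTML is only the page shell; the
# statistics table is injected later by JavaScript running in a browser.
print(response.status_code, len(response.text))
```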

It would not have been too time-consuming to access and save the HTML from 20 different pages by hand. However, a professor once told me to never let the computer win, and I try to write code that could scale if needed.

I decided to use Selenium, a tool for automating access to webpages through a browser. There are a lot of really cool applications for Selenium, and unfortunately we will be using it in the most basic way possible. As I mentioned earlier, a typical GET request is not sufficient, because without a browser the table of statistics is never populated. Connecting to the page with Selenium forces the table to populate, and then we are able to suck down the HTML and save it locally.

All the code needed to use Selenium
This is what temporarily pops up on your monitor while the data is being collected
The local HTML after sucking it down with Selenium
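
In case a sketch is helpful, the Selenium step boils down to something like this, assuming Chrome and a simple fixed wait; the function name and output path are my own:

```python
import time
from selenium import webdriver

def fetch_rendered_html(url, out_path, wait_seconds=5):
    """Open the page in a real browser, let the JavaScript build the table, then save the HTML."""
    driver = webdriver.Chrome()      # a browser window briefly appears while this runs
    try:
        driver.get(url)
        time.sleep(wait_seconds)     # crude wait for the table to populate; an explicit wait is more robust
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(driver.page_source)   # the rendered HTML, table included
    finally:
        driver.quit()
```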

Automating The Process

The previous paragraphs outline the process for when you know the URL for one year. Again, it wouldn't be outrageous to click through 20 pages in order to copy 20 URLs, but we're not cavemen/cavewomen, and we can do better.

Fortunately, the URL contains a parameter for the season, so for the 1996–1997 season the URL has Season=1996-97. The pattern holds for the rest of the seasons, so we can loop through the URLs for the seasons of interest and apply the function from earlier.
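
Sketching that loop, and reusing the hypothetical fetch_rendered_html helper from above (the base URL and output folder are again illustrative; only the Season parameter pattern comes from the site):

```python
# Kobe's seasons run from 1996-97 through 2015-16; build each season string and fetch its page.
base_url = "https://stats.nba.com/teams/defense/?Season={season}&SeasonType=Regular%20Season"

for start_year in range(1996, 2016):
    season = f"{start_year}-{str(start_year + 1)[-2:]}"   # "1996-97", ..., "1999-00", ..., "2015-16"
    fetch_rendered_html(base_url.format(season=season), f"html/{season}.html")
```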

Selenium is not exactly the speediest tool, but once the code was set up to iterate through all the years, I was able to run it, go make and eat a sandwich, and find all the HTML collected by the time I returned.


Making the Data Useful

While it's great that we've acquired the data, a folder full of HTML files is not useful on its own. Using the library Beautiful Soup, we can parse the HTML to extract the table of statistics that we're interested in.
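
A minimal sketch of that parsing step, assuming each saved page contains one simple statistics table (the function name is my own):

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_defense_table(path):
    """Pull the statistics table out of one saved HTML file and return it as a DataFrame."""
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    table = soup.find("table")   # assumes the stats table is the first <table> on the page
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
        if tr.find_all("td")
    ]
    return pd.DataFrame(rows, columns=headers)
```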

From there we can join the table of statistics onto the raw data from Kaggle wherever the season and team columns match, and now we have more information to work with regarding Kobe's opponent that night.
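
In pandas terms, the join might look roughly like this; the frame and column names are hypothetical stand-ins for the Kaggle data and the tidied scrape:

```python
# Assume the Kaggle shot data (kobe_shots) has "season" and "opponent" columns, and the
# scraped table (team_defense) has been tidied into "season", "team", "def_rating", "def_rank".
kobe_with_defense = kobe_shots.merge(
    team_defense[["season", "team", "def_rating", "def_rank"]],
    left_on=["season", "opponent"],
    right_on=["season", "team"],
    how="left",
).drop(columns="team")
```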

Assessing the Results

Going through this process is a dandy exercise in data acquisition, but is it even helpful? If 80% of data science is collecting and cleaning data, it must be useful, right? Spoiler alert: I would not have written this blog post if the results were useless.

We trained a Random Forest classifier on the raw data from Kaggle joined with the defensive ratings and rankings. A Random Forest comes with the nice ability to report which variables were most effective in predicting shot success. Our model found that the greatest predictor was Kobe's distance from the basket, and the second greatest predictor was defensive rating. We were ecstatic to find that our model considered this external data source useful.
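
A sketch of that step with scikit-learn, using a hypothetical handful of feature columns from the joined data:

```python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Illustrative feature and label names; the real model used many more columns from the joined data.
features = ["shot_distance", "def_rating", "def_rank", "period", "minutes_remaining"]
labeled = kobe_with_defense.dropna(subset=["shot_made_flag"])
X, y = labeled[features], labeled["shot_made_flag"]

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X, y)

# feature_importances_ shows which variables the forest leaned on most when predicting shot success.
importances = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
print(importances)
```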

There is a valid criticism of this practice: using this feature would be impractical if we hoped to use our model to help Kobe decide which shots to take. This is true for two reasons. First, defensive rating is based on season-long statistics, so it would be very volatile toward the beginning of an NBA season. Second, if Kobe were playing a defensive juggernaut that night, our model might simply suggest not taking any shots against them, which isn't an option for Kobe.

While the additional information may not have been all that practical had the Lakers actually hired my team's model, it does serve to further optimize our metric (log loss) in this competition.

Conclusion

For this project, we were given a massive dataset of clean, difficult-to-collect data. Despite this, we still found that one of the most impactful decisions that we made was to go acquire more data.

After the project, my group and I sat down and discussed what we could have done to further improve our model. The consensus answer had nothing to do with better modeling practices; we decided the most helpful thing would be some way of quantifying how contested Kobe's shot was (perhaps his distance to the closest defender), along with the individual defensive rating of the man guarding him. Despite having already added some really useful data to the problem, we still felt that there was more data that could have significantly improved our model.

I opened this blog post by saying that the 80/20 split had been beaten to death by the professionals I've spoken to, but after this project I honestly understand why. Good models are nice, and tuning hyperparameters is important, but I don't believe any improved algorithm can match the marginal gain that comes from adding a useful feature to the data.

Data Acquisition doesn’t quite get the glory of an NBA starting player; very few people get excited to discuss Data Acquisition, and Data Acquisition is never going to lead the team in jersey sales or PPG. However, adding good Data Acquisition practices to an already talented Machine Learning process can be the missing piece that takes a team to a championship. For that reason precisely, I have decided that Data Acquisition deserves to be the official Data Science Sixth Man of The Year.
