My First Battle with Web Scraping

Whitney Dillon
Apr 14, 2017 · 5 min read

In the world of Data Science and Data Analytics, one thing is certain: if you do not have actual data to study, you will not get very far. There are a lot of datasets readily available on the internet. Sites such as data.world and kaggle.com have great compilations of datasets as well as active communities where you can engage with other data enthusiasts. But what do you do if there is no pre-made file for a topic you are interested in?

That was the problem I faced this week. Looking to do some data analysis on the scoring trends of the CrossFit Games, I did some Google searches and came to the realization that I would have to make my own data files. To accomplish this task, I had my first interaction with web scraping.

Web scraping is a process for extracting data from websites. While there are many different techniques to accomplish this, today I focused on HTML parsing. HTML stands for Hypertext Markup Language, and it is the standard markup language used for creating web pages and web applications. The HTML source of a webpage contains tags that hold the data displayed on the page. By searching for these tags, you can extract the data you are interested in and store it for later use.

My goal was to scrape the data from the leaderboard on the CrossFit Games website so that I could make my own Pandas DataFrame for further analysis.

(https://games.crossfit.com/legacy-leaderboard?competition=2&year=2016)

Here was my process and some lessons learned from my first scrape.

Step 1: Pull up source code for desired web page.

If you are using the Chrome browser, you can simply navigate to the page you want to scrape, right click on the screen, and select Inspect. This will bring up a lower window on your page that shows the HTML elements of the page. Here you can see the various tags for the page, such as <body>, <head>, and <div> tags. One tag in particular should be noted: the <iframe> tag. This tag denotes an inline frame, which is used to embed another document within the current HTML document. So, while it may look like you are on the page you want to scrape, the page with the actual information is embedded. To navigate to the page needed for scraping, double click on the hyperlink in the inspection pane, then copy and paste it into a new browser tab. Once the new tab is open, repeat the steps to open the HTML inspection pane. Now we can scrape the table to get the data.
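If you would rather find the embedded page programmatically, a rough sketch with requests and BeautifulSoup looks like this. The assumption that the page contains a single relevant <iframe> is mine, based on what the inspection pane showed:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the outer leaderboard page (URL from above).
outer = requests.get(
    "https://games.crossfit.com/legacy-leaderboard?competition=2&year=2016"
)
outer_soup = BeautifulSoup(outer.text, "html.parser")

# The <iframe> tag's src attribute points to the embedded page
# that actually holds the leaderboard table.
iframe = outer_soup.find("iframe")
if iframe is not None:
    print(iframe.get("src"))
```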

Step 2: Open a notebook for coding

I used a Jupyter notebook to get started.

Import the relevant libraries and search the HTML for the unique id attribute associated with the table you are interested in.
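A minimal version of that setup might look like the following sketch. The id value "lbtable" is only a placeholder for whatever unique id the inspection pane shows on the leaderboard table:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the embedded leaderboard page found in Step 1 (placeholder).
LEADERBOARD_URL = "https://games.crossfit.com/legacy-leaderboard?competition=2&year=2016"

page = requests.get(LEADERBOARD_URL)
soup = BeautifulSoup(page.text, "html.parser")

# Locate the leaderboard table by its unique id attribute.
# "lbtable" is a placeholder -- use the id shown in the inspection pane.
table = soup.find("table", id="lbtable")
```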

Further inspection of the table and the HTML code showed that some of the information I wanted to collect needed to be split apart. For instance, ‘1 (1096)’, ‘100 pts’, and ’34:10.27’ were all data points I needed. Luckily, BeautifulSoup makes it easy to scan through the tags, and with some simple string manipulation in Python the data was easily extracted.
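For example, splitting those strings apart in plain Python can look like this (the exact cell contents are taken from the examples above):

```python
# '1 (1096)' -> overall rank and total points
rank_cell = "1 (1096)"
rank = rank_cell.split(" ")[0]                # '1'
total = rank_cell.split("(")[1].strip(")")    # '1096'

# '100 pts' -> just the numeric score for the event
points_cell = "100 pts"
points = points_cell.split(" ")[0]            # '100'

# '34:10.27' -> minutes and seconds for a timed event
time_cell = "34:10.27"
minutes, seconds = time_cell.split(":")       # '34', '10.27'
```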

The following code shows four functions I defined to retrieve the data, along with the for loop that creates the Data Frame. I used the ‘.find_all’ method from Beautiful Soup to locate each row in the table and, for each row, all of its table data cells. Those cells were passed into the functions, which used the ‘.text’ attribute to extract the strings of text from the HTML. With all of these strings stored in a list of dictionaries, a Pandas Data Frame was then made.
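Here is a rough sketch of that pattern; the helper names, cell positions, and column names are placeholders rather than my exact code, and it builds on the `table` and `pd` objects from the setup above:

```python
def get_rank(cells):
    # Overall rank appears as '1 (1096)' -- keep just the rank.
    return cells[0].text.strip().split(" ")[0]

def get_name(cells):
    # Athlete name lives in one of the leading cells (position is a placeholder).
    return cells[1].text.strip()

def get_points(cell):
    # Event score appears as '100 pts' -- keep just the number.
    return cell.text.strip().split(" ")[0]

def get_time(cell):
    # Event time appears as '34:10.27' -- keep the raw string for now.
    return cell.text.strip()

rows = []
for row in table.find_all("tr")[1:]:       # skip the header row
    cells = row.find_all("td")
    if not cells:
        continue
    rows.append({
        "rank": get_rank(cells),
        "name": get_name(cells),
        "event_1": get_points(cells[2]),   # placeholder column positions
        "event_1_time": get_time(cells[3]),
    })

df = pd.DataFrame(rows)
```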

Step 3: Format the Data Frame

I put in some placeholder names for the events in the previous step, since I felt it would be easier to format column names and data types within the Data Frame itself. Using the ‘.find_all’ method again, I was able to extract the header row from the HTML table to get the correct event names. I then reordered the columns to the order I wanted and made a new copy of the Data Frame.
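A sketch of that step, continuing from the Data Frame above, might look like this; the column mapping is illustrative and should match whatever placeholders you used:

```python
# Pull the header row from the HTML table to get the real event names.
event_names = [th.text.strip() for th in table.find_all("th")]

# Swap the placeholder column names for the real ones, then reorder
# the columns and take a fresh copy of the Data Frame.
df = df.rename(columns={"event_1": event_names[2]})
df = df[["rank", "name", event_names[2], "event_1_time"]].copy()
```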

The only thing left to do was to format the scores to prepare for my data analysis. My idea was to turn each of the event times into a metric of distance per second or repetitions per second.

I wrote the following functions to accomplish that.

The time_sec function was used to turn the measured times into seconds, and the foobar function was used to return the recalculated score based on three inputs: measure (the reps or distance totaled in the workout), cap (the time cap in minutes for the event), and cell (the time cell from the Data Frame).
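A sketch of those two helpers is below. How a time-capped athlete gets scored is my assumption here, not necessarily the original logic:

```python
def time_sec(cell):
    # Convert a '34:10.27' style time string into total seconds.
    minutes, seconds = cell.split(":")
    return int(minutes) * 60 + float(seconds)

def foobar(measure, cap, cell):
    # measure: total reps or distance in the workout
    # cap: time cap for the event, in minutes
    # cell: the time value pulled from the Data Frame
    #
    # If the athlete finished, score the event as measure per second of
    # their finishing time; otherwise fall back to the time cap.
    # (The cap fallback is an assumption about capped athletes.)
    try:
        return measure / time_sec(cell)
    except (ValueError, AttributeError):
        return measure / (cap * 60)
```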

I then used the apply method in pandas to create new columns that used these two functions to calculate the new scores.
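Applied to the Data Frame, that can look something like this; the column name, measure, and cap values below are placeholders rather than the real event figures:

```python
# Create a new repetitions-per-second (or distance-per-second) column
# for one event. Swap in the real event column, total measure, and time cap.
df["event_1_score"] = df["event_1_time"].apply(
    lambda cell: foobar(measure=225, cap=20, cell=cell)
)
```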

The new columns contained the recalculated repetitions-per-second and distance-per-second scores for each event.

Now that my Data Frame is formatted to my liking and I have my new scoring parameters, I am looking forward to analyzing this data set and to scraping the previous years’ leaderboards as well. My goal for future posts is to analyze the history of scoring in the CrossFit Games. I want to see if I can determine the best way to quantify the fittest human on Earth.

Overall, I enjoyed the challenge of my first web scrape. It taught me to really look at the HTML tags before diving into the parsing process. Inline frames are not so scary if you know they are there. I can’t wait to see what new things I learn on my next scraping adventure.
