Data Visualization of Elite Swimmers’ Competition Results (Part 1 — Datasets)

Tanyoung Kim
Towards Data Science
6 min read · Jun 18, 2017


This post is the first part of the production notes for a data visualization project, Swimmers’ History.

Swimming has been a long-time interest of mine, and I have always wanted to build a data visualization around it. Recently (read: a year ago :-)) I found a great website that provides the results of numerous swim meets, and I started playing with its data. In this post, I discuss how I scraped secured HTML pages (with R), extracted useful data points, and finally generated optimal datasets for the visualizations (with Python). With these datasets, I designed visualizations that allow users to select swim meets/events and filter races by swimmer, so I also cover the UI component design for data filtering (regenerating subsets of data) on the front-end. In addition, I talk a bit about the swimming world for a better understanding of the harvested datasets.

Swim Data of Interest

Among the numerous web pages on SwimRankings.net, I focus on the international meets at which elite swimmers aspire to compete. Currently, the full results of major competitions since 2007 are available on the site.

Five Major Swim Meets

From 2007 to 2016, a total of 17 meets are covered.

Olympic Events

Swimming events are largely divided into two categories: individual and team. For this project, I included only the events that are part of the Olympic program. For example, the shorter-distance (50m) backstroke and breaststroke races are contested at the World Championships but not at the Olympics. The men's and women's programs are also not perfectly symmetric; the longest freestyle for men is 1500m, whereas women compete at 800m.

The dataset includes only the final races, which makes a total of 544 races (16 events X 2 genders X 17 meets). Here is the list of events in the dataset I use.

  • Individual Freestyle: 50m, 100m, 200m, 400m, 800m (only for women), 1500m (only for men)
  • Individual Backstroke: 100m, 200m
  • Individual Breaststroke: 100m, 200m
  • Individual Butterfly: 100m, 200m
  • Individual Medley: 200m, 400m
  • Team Freestyle: 4 X 100m, 4 X 200m
  • Team Medley: 4 X 100m

Scraping Secured HTML Pages with R

On the source website, each event is served on a single page identified by parameters for meet ID, gender, and style (event) ID appended to the same base URL. For example, the URL of the page for the men's (gender) 50m freestyle (style) at the 2016 Rio Olympics (meet) looks like this:

page=meetDetail&meetId=596227&gender=1&styleId=16

To derive the IDs of the 17 meets, I parsed the meta info page of each meet series (e.g., the Olympics page), because these pages contain hyperlinks to the URLs above. A reliable way to obtain the correct ID values is to inspect the source code. On the Olympics page, this is the part that I focused on:

A part of the Olympics info page that I inspected; a link to a specific meet contains the meet ID.

Code inspection shows the meet ID of each Olympic Games. In addition, I traced the host country and city names from the same source.

The dropdown menu of styles shows the style IDs.

Now we have all the meet IDs. The style IDs are collected manually by inspecting the <select> tag on the meet page.
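Putting the pieces together, here is a minimal Python sketch of how the result page URLs could be assembled. The base URL and the women's gender code are assumptions of this sketch; only the Rio, men's, and 50m freestyle IDs come from the example above.

from itertools import product

# Base URL is an assumption; only the query parameters below appear in the post.
BASE_URL = 'https://www.swimrankings.net/index.php'

# IDs collected from source-code inspection. Only the values shown above are listed here;
# the full dictionaries cover all 17 meets and 16 events (gender=2 for women is assumed).
meet_ids = {'2016 Rio Olympics': '596227'}
style_ids = {'50m Freestyle': '16'}
gender_ids = {'men': '1', 'women': '2'}

# One result page per (meet, gender, event) combination -> 544 pages in total.
urls = [
    f'{BASE_URL}?page=meetDetail&meetId={m}&gender={g}&styleId={s}'
    for (_, m), (_, g), (_, s) in product(meet_ids.items(), gender_ids.items(), style_ids.items())
]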

The R scripts to extract all 544 races are available here. With these scripts, I simply scrape the entire webpages and save them as files, then do the rest of the data cooking with Python. Why not do everything in R or in Python? I like the simplicity of the Python syntax better. I did try to parse the webpages with Python, but it was tricky to fetch secured pages (https) with it. If you know a way, please let me know!
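For reference, the same scrape-and-save step could be sketched in Python with the requests library, which does fetch HTTPS pages; the output directory and file naming here are arbitrary choices of this sketch, not the actual project's.

from pathlib import Path
import time
import requests

# A rough Python equivalent of the scrape-and-save step (the project itself does this in R).
Path('pages').mkdir(exist_ok=True)
for i, url in enumerate(urls):                    # `urls` from the previous sketch
    response = requests.get(url, timeout=30)      # requests handles HTTPS pages
    response.raise_for_status()
    Path(f'pages/race_{i:03d}.html').write_text(response.text, encoding='utf-8')
    time.sleep(1)                                 # be polite between requests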

Cooking Data with Python

The next step is parsing only the useful parts from the scraped webpages.

The exact part of the page source that I targeted in a webpage

Data Parsing

Similar to what I did with R, I inspected the HTML tags and traced the race information, including the race date and the results of the final (e.g., the men's 50m freestyle at the Rio Olympics).

The hyperlink to a swimmer's info page contains the athlete ID.

For each swimmer, I traced the name, country, swim time, and points. In addition, by inspecting the source code, I was able to find the unique ID of each swimmer, which is also used as an identifier for data filtering/processing on the front-end. In the end, I derived the race results of 869 male and 779 female swimmers.
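As a rough illustration of this parsing step, the sketch below pulls swimmer rows out of a saved page with BeautifulSoup. The athleteId parameter name and the table layout are assumptions; the real selectors come from inspecting the actual page source as described above.

import re
from bs4 import BeautifulSoup

def parse_result_page(html):
    # The 'athleteId' parameter and the <tr>/<td> layout are assumptions for this sketch.
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for link in soup.find_all('a', href=re.compile(r'athleteId=\d+')):
        athlete_id = re.search(r'athleteId=(\d+)', link['href']).group(1)
        row = link.find_parent('tr')                       # assume one table row per result
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        results.append({'id': athlete_id, 'name': link.get_text(strip=True), 'cells': cells})
    return results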

There is no perfectly clean dataset that is ready to serve your needs right away. Especially when you want to design a unique, customized data visualization, it is critical to optimize the datasets before using them on the front-end. Processing a large dataset with JavaScript on the client side takes time, which may result in long loading times and, ultimately, user frustration.

Formatting Data for Presentation

I wanted to show the meets and events in a specific order on the front-end, so I added a single-letter prefix to each meet and event name.

meet_year_letter = {
    '2016': 'a',
    '2015': 'b',
    ...
    '2007': 'j'
}

events_name = {
    '1': ['50m Freestyle', 'a50Fr', '0IND'],
    '2': ['100m Freestyle', 'b100Fr', '0IND'],
    ...
    '19': ['400m Medley', 'n400IM', '0IND'],
    ...
    '40': ['4 X 100m Medley', 'q4X100M', '1TEAM']
}
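These single-letter prefixes are there only so that a plain lexicographic sort produces the intended display order:

# Thanks to the letter prefixes, a plain lexicographic sort yields the intended order:
# meets from newest ('a' = 2016) to oldest ('j' = 2007), and events in program order.
event_short_names = ['q4X100M', 'a50Fr', 'n400IM', 'b100Fr']
print(sorted(event_short_names))
# ['a50Fr', 'b100Fr', 'n400IM', 'q4X100M']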

I parsed the information of all the swimmers. A swimmer's information is structured as below; the race records are stored as an array.

{
    "id": "4038916",
    "name": "Michael Phelps",
    "country": "United States",
    "records": [
        {
            "place": "2",
            "point": 924,
            "race_id": "0OG-a2016--0IND-k100Fly",
            "swimtime": "51.14"
        },
        ....
    ]
}
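With this structure, looking up a swimmer by ID and slicing their records stays simple. A small sketch (the swimmers list and the '0OG' Olympic prefix follow the examples above):

# `swimmers` is assumed to be a list of dictionaries in the structure shown above.
swimmers_by_id = {s['id']: s for s in swimmers}

phelps = swimmers_by_id['4038916']
# Race IDs starting with '0OG' are Olympic races, following the examples above.
olympic_races = [r for r in phelps['records'] if r['race_id'].startswith('0OG')]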

As I will describe in more detail in the next post, I wanted to show the network of the swimmers. With Python, I built datasets that contain relationship information for any pair of swimmers who have competed together in the same race(s).

{
    "source": "4042367",
    "target": "4081509",
    "value": [
        "0OG-a2016--0IND-a50Fr",
        "0OG-e2012--0IND-a50Fr",
        "1WC-d2013--0IND-a50Fr",
        "1WC-d2013--1TEAM-o4X100Fr"
    ]
}
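These pairs can be derived directly from the per-race results: every unordered pair of swimmers in the same final gets an edge whose value lists their shared races. A condensed sketch, assuming a race_results mapping from race ID to the swimmer IDs in that final (the mapping name and the output file name are placeholders):

import json
from itertools import combinations
from collections import defaultdict

# race_results: {race_id: [swimmer_id, ...]} built during parsing (name assumed for this sketch).
edges = defaultdict(list)
for race_id, swimmer_ids in race_results.items():
    for a, b in combinations(sorted(swimmer_ids), 2):   # each unordered pair shares one edge
        edges[(a, b)].append(race_id)

links = [{'source': a, 'target': b, 'value': races} for (a, b), races in edges.items()]

# Saved as a single JSON file for the front-end (file name is a placeholder).
with open('swimmers_history.json', 'w') as f:
    json.dump({'links': links}, f)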

At the end of the Python script, I save everything as one JSON file for the front-end. See the Python code here.

Data Filtering on Front-end Side

Presenting all the swimmers' data at once is not the most useful approach, considering the real-world use cases of this project. People have specific needs, such as "I'd like to see all of Michael Phelps's swims at the Olympics." To support this, I designed UI components that allow users to select meets and events, as well as to specify swimmers by name. I assume that people want to compare the same event across different meets, or multiple events at the same meet. Thus the filtering is designed this way: once a user selects meet(s) and event(s), all combinations of the selected meets and events are included. In addition, the user can also select swimmer(s) to further exclude races in which the selected swimmers did not compete.

Options for filtering races (among a total of 17 meets and 16 events)

For example, selecting three Olympic Games (2008, 2012, 2016) and all seven men's freestyle events (50m, 100m, 200m, 400m, 800/1500m, 4 X 100m relay, and 4 X 200m relay) returns 21 races (3 X 7) and 166 swimmers. If Michael Phelps is then selected through the name filter, the total number of swimmers is reduced to 140.
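The selection logic itself is simple set algebra. The sketch below illustrates it in Python, although the app does this in JavaScript with React-Redux; the make_race_id helper is hypothetical, and whether multiple named swimmers are combined with AND or OR is a detail of the real implementation (the sketch uses OR).

from itertools import product

def filter_races(selected_meets, selected_events, selected_swimmers, swimmers):
    # All combinations of the selected meets and events are included.
    selected_races = {
        make_race_id(meet, event)   # hypothetical helper building IDs like '0OG-a2016--0IND-a50Fr'
        for meet, event in product(selected_meets, selected_events)
    }
    # Optional name filter: keep only races that a selected swimmer competed in.
    if selected_swimmers:
        raced = {
            r['race_id']
            for s in swimmers if s['name'] in selected_swimmers
            for r in s['records']
        }
        selected_races &= raced
    # Show every swimmer with at least one record in the remaining races.
    shown = [s for s in swimmers
             if any(r['race_id'] in selected_races for r in s['records'])]
    return selected_races, shown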

When the selection is updated in this options panel, a new subset of data is generated and the visualizations are re-created. This front-end data processing is gracefully managed by React-Redux, which I don't discuss further here. If you're interested, check out the code in my GitHub repo.

Please continue to the next post on the visualization design and ideation for the same project, and don't forget to check out the post on the insights afterwards.
