
Road to the first marathon with Python

A data scientist's account of his running journey

A data scientist’s account of his marathon journey with Strava API, pandas, and so on.

Running, to me, is more than just exercise. It makes me a happier person. In March 2019, I ran my first marathon in Los Angeles. Back then, I wrote my first ever Medium article about that experience.

As a data scientist, I knew there was more I could do to remember the journey. However, I procrastinated for more than a year. With Covid-19, opportunities to run outside and to take part in races are quite limited, which makes me long even more for the old times. Thanks to Strava, I have a log of all the runs that led up to my first marathon, so I decided to nerd out on that data.

What is this article not about?

It’s not about how to get better at running. For that, I recommend the book "Daniels’ Running Formula".

What is this article about?

It is a technical primer that walks through how to use Python to connect to Strava’s activities API and how to use common Python libraries to do exploratory analysis on my runs before the 2019 LA marathon. It turned out to be a revelation of how far data can diverge from intuition. Code snippets are provided as well.

Strava API

I have used Strava for the better part of the last five years. I love the simplicity and functionality of the app, but only recently did I start exploring the Strava API. At a high level, the Strava API requires OAuth 2.0 authentication before developers can access a user’s data. Each time a user authorizes an application, an auth code is created, which is then exchanged for a temporary access token. The developer can use this access token to get the user’s data.
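
To make the first leg of that flow concrete, here is a minimal sketch of how the auth code might be obtained before any token is requested. It assumes the authorization endpoint as I understand it from Strava's OAuth documentation; the client id, redirect URL, and the helper names `build_authorize_url` and `extract_auth_code` are hypothetical and used for illustration only.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Authorization endpoint, assumed from Strava's OAuth documentation.
AUTHORIZE_URL = 'https://www.strava.com/oauth/authorize'


def build_authorize_url(client_id, redirect_uri, scope='activity:read_all'):
    """Build the URL the user opens in a browser to approve access."""
    query = urlencode({
        'client_id': client_id,
        'redirect_uri': redirect_uri,
        'response_type': 'code',
        'scope': scope,
    })
    return AUTHORIZE_URL + '?' + query


def extract_auth_code(redirect_url):
    """Pull the one-time auth code out of the URL the user is redirected to."""
    query = parse_qs(urlparse(redirect_url).query)
    return query['code'][0]


# Hypothetical values, for illustration only.
url = build_authorize_url('12345', 'http://localhost/exchange_token')
code = extract_auth_code(
    'http://localhost/exchange_token?state=&code=abc123&scope=read,activity:read_all'
)
```

The extracted code plays the role of the AUTH_CODE value that the token request reads from the config file.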

As an example, I first send a POST request to get my access token. This access token allows me to access my activity data.

import pandas as pd
import requests
import configparser
import os.path
import json
from collections import defaultdict
import datetime
import matplotlib.pyplot as plt
config = configparser.ConfigParser()
config.read('/Users/LG/Documents/key_files/strava/strava.cfg')
client_id = config.get('USER','CLIENT_ID')
client_secret = config.get('USER','SECRET')
auth_url = config.get('OAUTH','AUTH_URL')
auth_code = config.get('OAUTH','AUTH_CODE')
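# payload for exchanging the auth code for an access token
params = {'client_id': client_id, 'client_secret': client_secret,
          'code': auth_code, 'grant_type': 'authorization_code'}
# data= sends the payload form-encoded, so no JSON content-type header is needed
r = requests.post(auth_url, data=params)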

Then, I use the access token to send GET requests that pull all my activity data from Strava.

# retrieve access token
access_token = "access_token=" + str(r.json()['access_token'])
# url for getting activities
activity_url = "https://www.strava.com/api/v3/activities"
# only select variables of interest
cols = ['average_heartrate','average_speed','distance','moving_time',
        'start_date','start_latitude','start_longitude','suffer_score','total_elevation_gain',
       'type'
       ]
# loop over the pages to retrieve all activities and store them in a dictionary indexed by activity id
page = 1
d = {}
while True:
    string = activity_url + '?' + access_token + '&per_page=50' + '&page=' + str(page)
    # get the page of activities
    r = requests.get(string)
    r = r.json()
    # an empty page means there are no more activities
    if not r:
        break
    for activity in r:
        if activity['id'] not in d:
            # keep only the variables of interest
            d[activity['id']] = {key: value for key, value in activity.items() if key in cols}
    page += 1
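
The loop above handles two things at once: walking through the pages and picking the fields. As a minimal sketch, the paging logic can be isolated as follows. The names `fetch_all_pages` and `fake_get_page` are hypothetical, and the fake page source stands in for the real HTTP call so that the snippet runs offline.

```python
def fetch_all_pages(get_page, per_page=50):
    """Collect items from a page-numbered source until an empty page comes back."""
    items = []
    page = 1
    while True:
        batch = get_page(page, per_page)
        if not batch:
            break
        items.extend(batch)
        page += 1
    return items


# Stand-in data: five fake activities instead of a live API response.
FAKE_ACTIVITIES = [{'id': i, 'type': 'Run'} for i in range(1, 6)]


def fake_get_page(page, per_page):
    """Return one slice of the fake activities, mimicking a paged endpoint."""
    start = (page - 1) * per_page
    return FAKE_ACTIVITIES[start:start + per_page]


activities = fetch_all_pages(fake_get_page, per_page=2)
by_id = {activity['id']: activity for activity in activities}
```

In real use, `get_page` could wrap `requests.get(activity_url, headers={'Authorization': 'Bearer ' + token}, params={'page': page, 'per_page': per_page})`, which sends the token in a header rather than in the URL. Here `token` is assumed to hold the bare access token string.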

Boom, I have saved all my activities data in a Python dictionary. Now, let’s do some data munging.

  • Filter for running activities only, since I also cycle sometimes.
  • Strava records distance in meters, so I convert it into miles.
  • Strava records speed in meters per second, so I convert it into minutes per mile.
  • I keep only the dates between 2017–12 and 2019–03, when I was actually training for the marathon events.

my_strava = pd.DataFrame.from_dict(d,orient='index')
# filter only run activities
my_strava = my_strava[my_strava['type']=='Run']
# meter to mile translation
def meter_to_mile(meter):
    return meter * 0.000621371192
# average speed meter per second to Minutes Per Mile
def speed_converter(speed):
    return 1 / (meter_to_mile(speed) * 60)
# translate my distance into miles, and moving time into hours/minutes, 
# and start_date into date/time, speed into Minutes per Mile
my_strava['distance'] = my_strava['distance'].apply(meter_to_mile)
my_strava['moving_time'] = my_strava['moving_time'].apply(lambda x:str(datetime.timedelta(seconds=x)))
my_strava['start_date'] = my_strava['start_date'].apply(lambda x:datetime.datetime.strptime(x,'%Y-%m-%dT%H:%M:%SZ'))
my_strava['average_speed'] = my_strava['average_speed'].apply(speed_converter)
# filter dates between 2017-12-01 and 2019-03-24
first_run_date = '2017-12-01'
last_run_date = '2019-03-25'
mar = (my_strava['start_date'] > first_run_date) & (my_strava['start_date'] < last_run_date)
my_marathon = my_strava.loc[mar]
my_marathon.head()
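
As a quick sanity check on the speed conversion above, here is a small worked example. It assumes the standard definition of 1609.344 meters per mile, which matches the constant used in `meter_to_mile`; the helper names `pace_min_per_mile` and `format_pace` are hypothetical. A speed of 2.68224 meters per second is 6 miles per hour, which should come out as a 10-minute mile.

```python
METERS_PER_MILE = 1609.344


def pace_min_per_mile(speed_mps):
    """Convert a speed in meters per second into a pace in minutes per mile."""
    seconds_per_mile = METERS_PER_MILE / speed_mps
    return seconds_per_mile / 60


def format_pace(minutes):
    """Render a decimal pace such as 9.5 as a clock-style string such as '9:30'."""
    whole = int(minutes)
    seconds = int(round((minutes - whole) * 60))
    if seconds == 60:
        # rounding pushed the seconds over the top, so roll into the next minute
        whole += 1
        seconds = 0
    return f'{whole}:{seconds:02d}'


pace = pace_min_per_mile(2.68224)
label = format_pace(pace)
```

A clock-style label like this can be easier to read in a table than a decimal number of minutes.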

The data looks like the table below, and it seems to be ready for some exploratory analysis.

Analysis

This is definitely one of those moments of truth. Only when I looked at the data did I realize I had run much less than I thought. In the year before the LA marathon, my running distance actually trended downwards. It did NOT conform to any of the online marathon training plans. Instead, it was the more-than-a-year-long journey of training that built up my confidence and endurance.

fig,ax = plt.subplots(figsize=(10,7))
ax.set_title('Running Distance (Miles)')
ax.plot('start_date','distance','bo--',data=my_marathon)
arrowprops=dict(arrowstyle='<-', color='red')
# add annotations for 2018-07-29 for SF half, 2019-03-24 for LA full.
# the label is passed positionally, since the `s` keyword was renamed to `text` in newer Matplotlib releases
ax.annotate('SF half marathon',xy=((datetime.datetime.strptime('2018-07-29','%Y-%m-%d')),13.1),
           xytext=((datetime.datetime.strptime('2018-08-01','%Y-%m-%d')),15.0),
           arrowprops=arrowprops);
ax.annotate('LA marathon',xy=(datetime.datetime.strptime('2019-03-24','%Y-%m-%d'),26.2),
           xytext=(datetime.datetime.strptime('2019-01-15','%Y-%m-%d'),20.0),
           arrowprops=arrowprops);

Most of my runs were 3–7 miles at a pace of about 10 minutes per mile. This level of effort is the so-called "easy pace". Occasionally, I’d have a threshold run. The LA marathon was the only time I ran more than 15 miles.

fig,ax = plt.subplots(figsize=(10,7))
ax.set_title('Speed (Minutes Per Mile) vs Distance (Miles)')
ax.scatter('average_speed','distance',data=my_marathon,color='blue',marker='D')
ax.set_xlabel('Minutes Per Mile')
ax.set_ylabel('Miles')
plt.show()

My peak month was 2018–07, with more than 50 miles, when I trained for and ran the San Francisco half marathon. A trough came right after the event. Interestingly, though, my lowest month, with fewer than 15 miles, was February 2019, right before the LA marathon. Again, surprises like this only show up when we look at the data!
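
The plotting code for these monthly views uses two summary tables, `monthly_run` and `monthly_mile`, whose construction is not included in the snippets. One way they might be built with pandas is sketched here on a toy frame. The helper name `monthly_summaries` and the sample rows are hypothetical, and the column names are inferred from the plotting calls rather than taken from the original code.

```python
import pandas as pd

# Toy stand-in for my_marathon: one row per run, distance already in miles.
runs = pd.DataFrame({
    'start_date': pd.to_datetime(
        ['2018-07-01', '2018-07-15', '2018-07-29', '2018-08-05']
    ),
    'distance': [5.0, 8.0, 13.1, 3.0],
})


def monthly_summaries(df):
    """Return (monthly_run, monthly_mile): run counts and total miles per month."""
    # Collapse each start date to the first day of its month.
    month = df['start_date'].dt.to_period('M').dt.to_timestamp().rename('month')
    grouped = df.groupby(month)['distance']
    monthly_run = grouped.count().rename('count').reset_index()
    monthly_mile = grouped.sum().rename('total_distance').reset_index()
    return monthly_run, monthly_mile


monthly_run, monthly_mile = monthly_summaries(runs)
```

Keeping the month as a timestamp rather than a string lets matplotlib place the points on a proper date axis.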

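# monthly_run and monthly_mile are per-month aggregates of my_marathon:
# monthly_run has columns month and count, monthly_mile has month and total_distance
fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(15,5))
ax[0].set_title('Training Frequency: Number of Runs Per Month')
ax[0].plot('month','count','bo--',data=monthly_run)
ax[1].set_title('Training Intensity: Total Miles Per Month')
ax[1].plot('month','total_distance','ro--',data=monthly_mile)
plt.show()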

There you have it: my first marathon experience in a nutshell, with Python requests, dictionaries, pandas, and matplotlib.

