Using Python to Explore Strava Activity Data

Using the Strava API to analyze your activities.
Update: if you want to skip right to the results, I built a web app that will generate stats and charts for your top 3 activities. You can check it out here.
I started using Strava last summer to track my runs and bike rides. I noticed that the app offered advanced analytics on my activities, but only if I paid to upgrade. I don’t like paying for things. So, instead, I found their API, got my data, and did some analysis on my own.
If you’re reading this, you might be interested in doing something similar. I’m here to help you do it. During this project, I found a lot of great resources scattered around the internet, but no unified guide. So I wanted to share a little bit of what I learned to help make it easier for others.
I’ll break this into two parts by first showing how I used the Strava API to get my data. Then, I’ll show how I explored that data using Python, or more specifically, Pandas. While this article is focused specifically on Strava, it will be useful for anyone interested in learning more about how to grab data from public APIs and play with it.
Before we jump in, I want to give credit to the resources that helped me. This tutorial series by Fran Polignano was immensely helpful in showing me how to access the data. The getting-your-data section of this piece is essentially a summary of Fran’s videos.
If you are already comfortable with accessing APIs and using Python, you may be able to skim the above videos and use the following files from Fran’s GitHub repo to get the data. I will also share my complete code up front.
Getting your data
There are three things you will need up front.
- A Strava account
- An environment to run Python. If you don’t have anything set up locally, I suggest creating a Kaggle account and starting a notebook.
- Postman, or a similar tool for testing APIs, will make things easier. You can download Postman here: https://www.postman.com/downloads/.
To start, I will assume you have a Strava account with at least one activity logged. The first thing you will do is log in and create an API application; you can find instructions for doing that [here](https://developers.strava.com/docs/getting-started/). The application is just something Strava uses to associate your API use with. You will be prompted to fill out a form. You can use any website for the website name and any photo as the photo. Use ‘localhost’ as the Authorization Callback Domain.
Once you submit the form, you will have an API Application. You will need to get some information from your Strava account before moving on. To do this, hover over your profile icon in the top right corner and go to settings. Within settings, click on ‘My API Application.’ If you don’t see this as an option in the navigation bar, go back to https://www.strava.com/settings/api and make sure the application was set up.
Assuming everything is working, you should see this:

There are two numbers you want to record from this page: Client ID and Client Secret. We will need an access token and refresh token to get our data. Unfortunately, we can’t use the ones on this page. That is because the provided tokens give us a scope/access level of "read." The way Strava has set up its authorization, we need the "activity:read_all" scope to work with our activities. We will use the given Client ID and Client Secret to get a different pair of access and refresh tokens that have our desired scope.
Getting Access and Refresh Tokens
To start, you will go through a one-time manual process that grants your application access to all of your activity data. To do this, replace your_client_id with your actual client ID in the following link. Then paste it into your browser address bar and hit enter.
https://www.strava.com/oauth/authorize?client_id=your_client_id&redirect_uri=http://localhost&response_type=code&scope=activity:read_all
You should see a page that looks like this, with your photo and website. Go ahead and hit authorize.

After you do that, your browser will probably show an error page. That’s OK. Within this error, we actually get the piece of information that we want from Strava. If you look at the address bar, the URL should now look like:
http://localhost/?state=&code=somecode&scope=read,activity:read_all
Copy down whatever the code is between code= and the & sign (where my snippet says "somecode"). Now you are ready to get your access and refresh tokens.
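If you would rather not pick the code out by eye, a short snippet using Python’s standard library can parse it from the redirect URL for you. This is just a convenience sketch; the "somecode" value is the placeholder from the example URL above.

```python
from urllib.parse import parse_qs, urlparse

# The URL your browser lands on after you hit authorize.
redirect_url = "http://localhost/?state=&code=somecode&scope=read,activity:read_all"

# parse_qs turns the query string into a dict mapping each parameter to a list of values.
params = parse_qs(urlparse(redirect_url).query)
authorization_code = params["code"][0]
print(authorization_code)  # somecode
```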
You will then make a POST request to Strava which will give you the tokens you need. To do this, I would suggest using Postman, which is great for initial API testing. But, if you are comfortable doing this in Python, that is great too. To make the call, replace the placeholders in the call below with your Client ID, Client Secret, and the code from the previous step.
https://www.strava.com/oauth/token?client_id=your_client_id&client_secret=your_client_secret&code=your_code_from_previous_step&grant_type=authorization_code
If you are unfamiliar with Postman, skip to about 9:00 in Fran’s video and he will walk you through it. After you make the POST request, you will get a response which includes an access token and a refresh token. Note both down. These are the keys to accessing what we want from the API. With those in hand, you are ready to pull your data.
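If you would rather skip Postman, the same POST request can be made in Python with the requests library. This is a sketch, not Fran’s code; the function name is mine and the placeholder arguments are yours to fill in.

```python
import requests

def exchange_code_for_tokens(client_id, client_secret, code):
    """Trade the one-time authorization code for access and refresh tokens."""
    response = requests.post(
        "https://www.strava.com/oauth/token",
        data={
            "client_id": client_id,
            "client_secret": client_secret,
            "code": code,
            "grant_type": "authorization_code",
        },
    )
    response.raise_for_status()
    return response.json()  # includes 'access_token' and 'refresh_token'

# tokens = exchange_code_for_tokens("your_client_id", "your_client_secret", "your_code")
```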

Getting the data with Python
Take a look at the code below from Fran. This is the code we will use to get our data.
https://github.com/fpolignano/Code_From_Tutorials/blob/master/Strava_Api/strava_api.py
I’ll break this down a little. His code makes two API calls. The second call uses your access token to ask for your data. For one-time use, you could just use the access token from the POST request you just did in Postman and skip to this step. However, these access tokens expire and you don’t want to have to do all the manual work we just did all over again. So, he first makes a call using the refresh token. This call retrieves the most recent access token to ensure your program will always run. He then plugs that token into the second call and gets the data.
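Fran’s script is the reference here, but the two-call pattern he describes looks roughly like this sketch. The endpoint URLs come from the Strava API docs; the function name is mine and the placeholder credentials are yours.

```python
import requests

AUTH_URL = "https://www.strava.com/oauth/token"
ACTIVITIES_URL = "https://www.strava.com/api/v3/athlete/activities"

def fetch_activities(client_id, client_secret, refresh_token):
    # Call 1: use the long-lived refresh token to get a current access token.
    auth_response = requests.post(AUTH_URL, data={
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    })
    access_token = auth_response.json()["access_token"]
    # Call 2: use the access token to request your activities.
    return requests.get(
        ACTIVITIES_URL,
        headers={"Authorization": f"Bearer {access_token}"},
        params={"per_page": 200, "page": 1},
    ).json()

# my_dataset = fetch_activities("your_client_id", "your_client_secret", "your_refresh_token")
```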
Now is the time to fire up whatever environment you will be using to write your Python code. I love notebooks for data exploration, but scripts offer a lot of advantages. This article, “5 Reasons why you should Switch from Jupyter Notebook to Scripts,” dives more into the advantages of scripts. But, for now, either will do. I just used a Kaggle notebook for this exploration.
To run this code, you can simply copy and paste it into your environment. Then you just need to plug in your Client ID, Client Secret, and Refresh Token which you obtained using the steps above. You will then be able to run it whenever and not have to touch any of the manual steps again. Run the code and you will get a response with all of your activity data, which will be stored in the variable my_dataset.
For each activity, you get a bunch of interesting data points. This includes distance, duration, average speed, max speed, time, date, elevation gain, and more.
In the next section of the piece, I will walk you through how I used Pandas, a popular Python library, to explore this data. Here’s a few of the questions I looked into:
- Have my runs gotten faster over time?
- Is my max speed higher on shorter runs?
- Do I ride my bike faster in my hometown or in DC (where I live now)?
To get started, you can run the following code to create a Pandas DataFrame of your activities. If you are unfamiliar with DataFrames, you can think of them as Excel tables.
import pandas as pd
#json_normalize flattens the JSON response into a flat table.
#(In older pandas versions: from pandas.io.json import json_normalize)
activities = pd.json_normalize(my_dataset)
Analyzing your activity data
Now you have your Client ID, Client Secret, and refresh token. You can then run Fran’s code to get your data, which will be saved in the variable my_dataset.
From here on out, I’m just going to show what I did with my data. I mainly explored the relationship between different variables. This is just one way you could look at your data.
Imports
There are a handful of libraries I will use to analyze my data, with Pandas as the backbone of the work. If you are following along, you can run the imports below.
#Pandas will be the backbone of our data manipulation.
import pandas as pd
#json_normalize flattens JSON into a flat table.
#(In pandas 1.0+ it is also available as pd.json_normalize.)
from pandas.io.json import json_normalize
#Seaborn is a data visualization library.
import seaborn as sns
#Matplotlib is a data visualization library.
#Seaborn is actually built on top of Matplotlib.
import matplotlib.pyplot as plt
#The dates module will help us plot a trend line over dates later on.
import matplotlib.dates as mdates
#Numpy will help us handle some work with arrays.
import numpy as np
#Datetime will allow Python to recognize dates as dates, not strings.
from datetime import datetime
Data Manipulation
Now, to get the data into a table, I’ll run the following code. Commented out are a few commands I used to explore the data set.
activities = json_normalize(my_dataset)
#activities.columns #See a list of all columns in the table
#activities.shape #See the dimensions of the table.
By calling activities.columns, I can see this list of features included in our dataset.
Index(['resource_state', 'name', 'distance', 'moving_time', 'elapsed_time',
'total_elevation_gain', 'type', 'id', 'external_id', 'upload_id',
'start_date', 'start_date_local', 'timezone', 'utc_offset',
'start_latlng', 'end_latlng', 'location_city', 'location_state',
'location_country', 'start_latitude', 'start_longitude',
'achievement_count', 'kudos_count', 'comment_count', 'athlete_count',
'photo_count', 'trainer', 'commute', 'manual', 'private', 'visibility',
'flagged', 'gear_id', 'from_accepted_tag', 'upload_id_str',
'average_speed', 'max_speed', 'has_heartrate', 'heartrate_opt_out',
'display_hide_heartrate_option', 'elev_high', 'elev_low', 'pr_count',
'total_photo_count', 'has_kudoed', 'athlete.id',
'athlete.resource_state', 'map.id', 'map.summary_polyline',
'map.resource_state', 'workout_type', 'device_watts'],
dtype='object')
For my purposes, I only care about a subset of these columns. I am going to reassign my table to only include the columns I want.
In the same cell, I am going to do a little transformation of the time data, which will make it more useful later on. If I run activities['start_date_local'], I can see all the values in this series. For example: 2020-07-20T21:27:24Z. I want to break this date into two columns: date and time. To do that, I first change the data type of this series to datetime. When I convert the data, I am changing it from a string to an object of the class datetime, which lets Python work with dates as dates, for example to compute the time between two of them.
I then create a column called start_time that extracts only the time for each activity and I reassign start_date_local to only the date of the activity. I preview the changes by calling activities.head(5).
#Create new dataframe with only columns I care about
cols = ['name', 'upload_id', 'type', 'distance', 'moving_time',
'average_speed', 'max_speed','total_elevation_gain',
'start_date_local'
]
activities = activities[cols]
#Break date into start time and date
activities['start_date_local'] = pd.to_datetime(activities['start_date_local'])
activities['start_time'] = activities['start_date_local'].dt.time
activities['start_date_local'] = activities['start_date_local'].dt.date
activities.head(5)
The corresponding dataframe looks like this:

Visualizations
Now that I have the data I want, I’m ready to start doing some visualization and analysis. I rarely worry about manipulating the data too much ahead of time. If a question I want to answer requires me to manipulate my data in some way, I will do it then.
Calling activities['type'].value_counts(), I can see that I have recorded 17 runs and 12 bike rides. I want to do some analysis of my runs specifically, so I’ll create a table of just that data.
runs = activities.loc[activities['type'] == 'Run']
At this point, I just start asking questions. The first that popped into my head was: is there a relationship between how far I run and my average speed? I have been trying to run further, but I think I slow down significantly when I do so.
I will use Seaborn to create plots. In the examples below, the syntax is pretty self-explanatory. But you can find far more information on their website. The following code will create a regression plot of my average speed vs distance. Note that, for now, the speed is in meters per second. Later we will change that to miles per hour.
sns.set(style="ticks", context="talk")
sns.regplot(x='distance', y = 'average_speed', data = runs).set_title("Average Speed vs Distance")

We could take it a step further and analyze the R² value, or do some more advanced regression. But, in this case, it’s pretty clear there is no relationship.
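If you did want a number to back that up, the R² of a simple linear fit is a one-liner with NumPy. This is a sketch on made-up data; in practice you would pass runs['distance'] and runs['average_speed'] instead.

```python
import numpy as np

# Hypothetical stand-ins for runs['distance'] (meters) and runs['average_speed'] (m/s).
distance = np.array([2000.0, 3500.0, 5000.0, 8000.0, 10000.0])
avg_speed = np.array([3.1, 2.9, 3.0, 2.8, 2.9])

# For a simple linear regression, R^2 is the squared Pearson correlation.
r = np.corrcoef(distance, avg_speed)[0, 1]
r_squared = r ** 2
print(round(r_squared, 3))
```

A value near 0 would confirm what the plot already suggests: essentially no linear relationship.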
Then, I ran the same code, but for max, as opposed to average, speed. Sometimes when I run, I do sprints. These are usually shorter runs. So I expected to see most of my max-speed runs sitting closer to the y-axis.

Another question I wanted to ask was: have I started running faster, on average, over the past few months? Seaborn is a boiled-down framework built on top of matplotlib. It’s great for many things, but sometimes it’s hard to know what’s going on under the hood. If you try to plot time on the x-axis, you may run into some strange formatting issues. It can be nice to have more control. So, for this, I’ll switch to matplotlib, another visualization library. It is a bit more verbose, but it also provides more flexibility.
There’s a little bit of weird stuff that needs to happen before we create a visual with matplotlib. You can read through the code below to get the details, but I will summarize how it works. Essentially, we create a container, which is like a canvas to paint the graph on. This is called a figure. Then we add a subplot, which is like outlining the area we will draw our graph in. Then, we use a special function in matplotlib called plot_date to do just that – plot the date, average speed points. Then, we create a trend line and add this to the plot. Finally, we do a little bit of formatting and display.
import matplotlib.dates as mdates #needed below to convert dates for the trend line

fig = plt.figure() #create overall container
ax1 = fig.add_subplot(111) #add a 1 by 1 plot to the figure
x = np.asarray(runs.start_date_local) #convert data to numpy array
y = np.asarray(runs.average_speed)
ax1.plot_date(x, y) #plot data points in scatter plot on ax1
ax1.set_title('Average Speed over Time')
#ax1.set_ylim([0,5])
#add trend line
x2 = mdates.date2num(x)
z = np.polyfit(x2, y, 1)
p = np.poly1d(z)
plt.plot(x, p(x2), 'r--')
#format the figure and display
fig.autofmt_xdate(rotation=45)
fig.tight_layout()
plt.show()

It looks like my runs have indeed gotten a little faster, on average, over time! But when you note the scale of the y-axis, the change is a little less exciting: about 0.25 m/s, or roughly 0.5 mph. That’s barely a dent in my minutes per mile.
Biking
I have also been biking a lot. I went home for a month this summer and was curious if there was any difference between my rides there, as opposed to DC (where I normally live). Specifically, I wanted to see: do I bike faster at home, or in DC?
So, I created a table of just my bike data by running
rides = activities.loc[activities['type'] == 'Ride']
One of the fields returned by the Strava API is location but, for whatever reason, the API only returned null values for it. To get around this, I created a boolean column called ‘dc.’ I classified all rides before I went home as in DC and all rides after as at home.
from datetime import datetime

home_date = '2020-06-27' #date I flew home
home_date = datetime.strptime(home_date, '%Y-%m-%d').date()
rides['start_date_local'] = pd.to_datetime(rides['start_date_local']).dt.date
rides['dc'] = np.where(rides['start_date_local'] < home_date, 'true', 'false')
Then, I compared my average speed and average max speed at home vs in DC using the following code.
#Create separate tables for DC and Home
dc = rides.loc[rides['dc'] == 'true']
home = rides.loc[rides['dc'] == 'false']
#Gather descriptive statistics
#Raw data is in meters and seconds. Conversion to mph is m/s * 2.237.
dc_speed = round(dc['average_speed'].mean() * 2.237, 2)
home_speed = round(home['average_speed'].mean() * 2.237, 2)
dc_max_speed = round(dc['max_speed'].mean() * 2.237, 2)
home_max_speed = round(home['max_speed'].mean() * 2.237, 2)
print("Average DC Speed: " + str(dc_speed) + " | Average DC Max Speed: " + str(dc_max_speed) + '\n'
      + "Average Home Speed: " + str(home_speed) + " | Average Home Max Speed: " + str(home_max_speed))
The results:
Average DC Speed: 9.51 | Average DC Max Speed: 25.76
Average Home Speed: 13.29 | Average Home Max Speed: 35.43
So, it appears that I bike faster at home. Actually, on average I bike almost 40% faster at home.
percent_increase_average = round((home_speed - dc_speed) * 100 / dc_speed, 2)
percent_increase_average #39.75
There are a lot of other questions we might ask of this data. One hypothesis I had was that I actually bike faster when there is more up and down. Obviously I go faster on downhills. However, I usually power my way up hills pretty quickly too, which led me to believe that more hills might actually mean a higher average speed. You can see below that I created a regression plot. However, due to the number of confounding variables, the line means nothing. The more I’ve thought about this, the more confident I am that the slower speed in DC is simply due to more stop-and-go riding in the city.
rides_filtered = rides.loc[rides['total_elevation_gain'] < 250]
sns.regplot(x='total_elevation_gain', y='average_speed', data=rides_filtered).set_title("Speed vs Elevation Gain")

Final thoughts
There are so many ways you might explore this data. What I did above is just one example. As I finish writing this at the end of 2020, Strava has just released a year in review for its users. This shows stats like total miles moved, highest speed of the year, favorite time of day for a given activity, etc. These are all things you could explore yourself with the same techniques covered here. It’s all about asking the questions that are interesting to you. Being able to capture, acquire, and analyze this data is so cool because it enables us all to be mini scientists, exploring whatever we find interesting.
I hope you found this helpful. If there is anything I missed, anything you would like to see, or anything you think is incorrect, please comment below.
Find me on Twitter @matt_ambrogi
Additionally, here are some resources you might find helpful.