Parsing fitness tracker data with Python
A look at some of the most common file formats and useful Python libraries
Introduction
It is now very common to use fitness tracking apps and devices (like Garmin, Fitbit, Strava, etc.) to track your exercise, particularly cardio activities such as running and cycling. There are plenty of phone and web apps out there that allow you to view and analyse your activities in increasingly sophisticated ways. Some of these apps are provided by the device manufacturers (such as Garmin Connect), whereas others are device- and manufacturer-independent, such as Strava.
But maybe you don’t want to rely on your device manufacturer or one of the third-party providers to store and access your data. (For example, Garmin’s servers went down for several days in July 2020, leaving Garmin users completely unable to access Garmin Connect.) Or maybe you want to review and analyse your activities in a way that isn’t possible with those apps.
Luckily, with some basic programming skills it is not difficult to parse and analyse this data yourself, and then the only limit to what you can do with it is your imagination. In this article we will discuss the basics of fetching activity data and parsing it using Python. We will look at the most common formats for storing and exporting activity data, and explore a couple of useful Python libraries for parsing it. There will also be some example scripts to parse data files and build pandas DataFrames with the data (we will not otherwise be talking about pandas here, and no familiarity with pandas is necessary to follow this article, although some basic knowledge of Python and XML would be helpful). The example scripts can be found at the following GitHub repo (the minimum Python version required to run them is 3.6, and the dependencies can be found in the Pipfile):
When we say “activity data”, we are primarily talking about GPS and time data which describes some movement-based activity such as a run, walk or cycle, as well as complementary data that may be provided by your device, such as elevation, heart rate and cadence data.
The methods described in this article are mainly based on my playing around with activities recorded by my Garmin vívoactive 3 watch (which I’ll refer to as VA3 for convenience) or my older Garmin Forerunner 30 watch (which I’ll refer to as FR30), which I have mainly used to record runs and walks. There will be differences in how each device records and exports data so your mileage may vary, but the file formats and libraries we will be discussing should be useful for working with a broad range of popular tracking devices and apps.
Unless explicitly stated, I was not involved in the development of any of the services, software or articles discussed or linked to in this article.
How to get the data
Export from app
Many of the more popular apps give you an option to export your activities to common file formats. For example, both Strava and Garmin Connect allow you to export activities to GPX and TCX formats, and to download the original source file (which may be a FIT file). Strava’s instructions are here and Garmin Connect’s instructions are here. Of course, you need to have an account and have already uploaded your activity to the relevant app.
If you have a lot of activities you want to analyse, you will probably want to download the files in bulk rather than one by one. Not all apps offer a bulk activity export function. However, Strava, Garmin and other companies that comply with the General Data Protection Regulation (GDPR) should give you an option to download all of the personal data they hold about you, which will include your activity data (among other things). See here for Strava and here for Garmin. You may also be able to find third-party scripts that bulk download your activities in a more convenient way, such as this Python script for Garmin Connect (any such script will likely ask for your username and password in order to download your activities, so be careful and only use software you trust).
For apps other than Strava and Garmin Connect, consult their FAQs or technical support for their export options.
Directly from the device
Some devices can be hooked up to your computer using a USB connection so that you can access the activity files on them directly. Whether this is possible and exactly where to find the activity files will depend on your device, so if in doubt check the FAQs or technical support provided by your device manufacturer. For example, both the FR30 and VA3 have a directory called GARMIN which contains (among other things) a directory called ACTIVITY. That directory contains activity data as FIT files.
How to parse the data
Parsing GPX files with gpxpy
The GPS Exchange Format (GPX) is an open XML-based format that is commonly used to store GPS-based data. Of the three formats we will discuss in this article, GPX is probably the easiest to work with. It is a simple and well-documented format and there are several useful tools and libraries out there for working with GPX data. (On the other hand, TCX and FIT files can contain more information about the activity than GPX files can.) We will be using the gpxpy library for Python to work with GPX files.
But first, let’s see what a GPX file looks like. By way of example, here is a screenshot showing (the first few lines of) two GPX files side by side, one downloaded from Strava and the other downloaded from Garmin Connect (but both generated using the same underlying data from my FR30).
In each file, you can see that the root element is a gpx element, with several attributes describing the creator of the GPX file and the XML namespaces used therein. Within the gpx element there is a metadata element, with metadata about the file itself, and a trk element representing a “track”, which is “an ordered list of points describing a path”. This loosely corresponds to what we would generally consider to be a single activity (a run, cycle, walk, etc).
The trk element contains some metadata about the activity, such as its name and its activity type (more on that later), as well as one or more trkseg elements, each representing a “track segment”, which is “a list of Track Points which are logically connected in order”. In other words, a trkseg should contain contiguous GPS data. If your activity simply involves turning on your GPS, running for 10km and then turning off your GPS when you’re done, that whole activity will normally be a single track segment. However, if, for whatever reason, you have turned your GPS off and on again (or lost and then regained GPS functionality) during the activity, the trk may consist of multiple trkseg elements. (At least, that’s the theory, according to the documentation; when I pause and restart my VA3 during a run, it still seems to represent the whole run as a single track segment.)
Each trkseg element should contain one or more (likely many) trkpt or “track point” elements, each representing a single (geographical) point detected by your GPS device. These points are usually a few seconds apart.
At a minimum, a trkpt must contain latitude and longitude data (as attributes lat and lon of the element) and may optionally include time and elevation (ele) data, as child elements (data generated by a fitness tracker is highly likely to at least include time). A trkpt may also contain an extensions element, which can contain additional information. In the example above, extension elements (in Garmin’s TrackPointExtension (TPE) format) are used to store heart rate and cadence data that is provided by the FR30.
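To make the gpx, trk, trkseg and trkpt nesting concrete, here is a minimal sketch that walks a tiny hand-written GPX document using only Python's standard xml.etree.ElementTree module. The sample document, its values and the read_track_points helper are illustrative assumptions and are not taken from the files shown in the screenshot.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written GPX 1.1 document (illustrative values, not real data).
SAMPLE_GPX = """<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <trk>
    <name>Morning Walk</name>
    <type>walking</type>
    <trkseg>
      <trkpt lat="40.642868" lon="14.593911">
        <ele>147.2</ele>
        <time>2020-10-13T07:44:13Z</time>
      </trkpt>
      <trkpt lat="40.642901" lon="14.593950">
        <time>2020-10-13T07:44:16Z</time>
      </trkpt>
    </trkseg>
  </trk>
</gpx>
"""

# GPX 1.1 elements live in this XML namespace.
NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def read_track_points(gpx_text):
    """Return one dict per trkpt, walking trk -> trkseg -> trkpt."""
    root = ET.fromstring(gpx_text)
    rows = []
    for trk in root.findall("gpx:trk", NS):
        track_name = trk.findtext("gpx:name", default=None, namespaces=NS)
        for seg_index, seg in enumerate(trk.findall("gpx:trkseg", NS)):
            for pt in seg.findall("gpx:trkpt", NS):
                rows.append({
                    "track": track_name,
                    "segment": seg_index,
                    # lat and lon are required attributes of trkpt.
                    "latitude": float(pt.get("lat")),
                    "longitude": float(pt.get("lon")),
                    # ele and time are optional child elements (None if absent).
                    "elevation": pt.findtext("gpx:ele", default=None, namespaces=NS),
                    "time": pt.findtext("gpx:time", default=None, namespaces=NS),
                })
    return rows


points = read_track_points(SAMPLE_GPX)
print(len(points), points[0]["latitude"], points[1]["elevation"])
```

The second point has no ele child, so its elevation comes back as None, which mirrors the fact that only lat and lon are mandatory.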
There are three main differences between the two GPX files displayed above that I want to point out. First, the type of the trk element: the Garmin file describes this as “running”, whereas the Strava file simply describes it as “9”. There is no standardised way to represent the type of a track. Garmin uses words such as “running”, “walking”, “hiking”, etc., whereas Strava uses numeric codes, such as “4” for hiking, “9” for running, “10” for walking, etc. I couldn’t find a comprehensive mapping of Strava numeric codes to activity types. If you want to find the code for a specific activity type, you could edit the activity type of an existing activity on Strava (click the pencil icon on the left hand side of the activity page) and then export it to GPX to check the value in the type element.
Secondly, the reported elevations of the track points are different, which might seem surprising given that they are based on the same underlying data. Some fitness trackers (including, it seems, the FR30) either do not record elevation data or take highly inaccurate recordings based on GPS signal. In these cases, apps like Strava and Garmin use their own internal elevation databases and algorithms to either generate their own elevation data or adjust the data recorded by the device in order to give a more realistic reading (see here for more information from Strava). Each app’s methods for generating or adjusting elevation data will be slightly different, and you are witnessing the difference here.
Finally, you will note that the latitude and longitude data reported by the Garmin file is far more precise, sometimes giving the value to about 30 decimal places, whereas the Strava file gives the value to seven decimal places. The Garmin file appears to reflect the precision of the raw data reported by the FR30, whereas Strava seems to round the data. It is important to note that precision is not the same as accuracy. Reporting latitude and longitude to thirty decimal places suggests a truly microscopic level of precision, whereas the GPS in your fitness tracker is likely accurate to a few metres at best. Therefore, all that extra precision reported by your fitness tracker isn’t particularly useful. It can, however, have a small but noticeable impact on the total recorded distance of the activity (calculated by adding up the distances between all of the points) so the total distance may differ slightly depending on where the data is coming from.
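As a rough illustration of how rounding coordinates can shift a total distance, the sketch below sums point-to-point distances using the haversine formula on a spherical Earth. This is an assumption made for illustration: it is not necessarily the formula that gpxpy, Strava or Garmin use, and the coordinates are made up.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def total_distance_m(points):
    """Sum the distances between consecutive (lat, lon) pairs."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


# Made-up high-precision points, and the same points rounded to seven decimals.
precise = [
    (40.64286812345678, 14.59391187654321),
    (40.64290143219876, 14.59395023456789),
    (40.64293398765432, 14.59399112345678),
]
rounded = [(round(lat, 7), round(lon, 7)) for lat, lon in precise]

print(total_distance_m(precise))
print(total_distance_m(rounded))
```

The difference per pair of points is tiny, but an activity with thousands of points gives it many chances to accumulate.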
So, let’s take a look at the gpxpy library. First, make sure it is installed:
pip install gpxpy
Now let’s fire up a Python interpreter in the same directory as our GPX file (for the rest of the article I am using data for a different activity to the one we saw above). Parsing the file is as easy as:
>>> import gpxpy
>>> with open('activity_strava.gpx') as f:
...     gpx = gpxpy.parse(f)
...
>>> gpx
GPX(tracks=[GPXTrack(name='Morning Walk', segments=[GPXTrackSegment(points=[...])])])
You can see that calling gpxpy.parse on the GPX file object will give you a GPX object. This is a data structure that reflects the structure of the GPX file itself. Among other things, it contains a list of GPXTrack objects, each representing a track. Each GPXTrack object contains some metadata about the track and a list of segments.
>>> len(gpx.tracks)
1
>>> track = gpx.tracks[0]
>>> track
GPXTrack(name='Morning Walk', segments=[GPXTrackSegment(points=[...])])
>>> track.type
'10'
>>> track.name
'Morning Walk'
>>> track.segments
[GPXTrackSegment(points=[...])]
Each GPXTrackSegment, in turn, contains a list of GPXTrackPoint objects, each reflecting a single track point.
>>> segment = track.segments[0]
>>> len(segment.points)
2433
>>> random_point = segment.points[44]
>>> random_point
GPXTrackPoint(40.642868, 14.593911, elevation=147.2, time=datetime.datetime(2020, 10, 13, 7, 44, 13, tzinfo=SimpleTZ("Z")))
>>> random_point.latitude
40.642868
>>> random_point.longitude
14.593911
>>> random_point.elevation
147.2
>>> random_point.time
datetime.datetime(2020, 10, 13, 7, 44, 13, tzinfo=SimpleTZ("Z"))
Information that is stored as an extension in the GPX file can be accessed too. In that case the relevant XML elements (i.e., the children of the extensions element in the GPX file) are stored in a list.
>>> random_point.extensions
[<Element {http://www.garmin.com/xmlschemas/TrackPointExtension/v1}TrackPointExtension at 0x7f32bcc93540>]
>>> tpe = random_point.extensions[0]
>>> for child in tpe:
...     print(child.tag, child.text)
...
{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}hr 134
{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}cad 43
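The namespace-qualified tags printed above are awkward to work with directly. One possible approach, sketched here, is to strip the namespace and collect the children into a dictionary. The extension_values helper is hypothetical, and the tpe element is built by hand with the standard library as a stand-in for an item of random_point.extensions; the sketch relies only on the tag and text attributes used in the session above.

```python
import xml.etree.ElementTree as ET

TPE_NS = "http://www.garmin.com/xmlschemas/TrackPointExtension/v1"

# Hand-built stand-in for one extension element, so the sketch runs
# without a GPX file or gpxpy installed.
tpe = ET.fromstring(
    '<TrackPointExtension xmlns="%s"><hr>134</hr><cad>43</cad></TrackPointExtension>'
    % TPE_NS
)


def extension_values(extension_element):
    """Map each child's local tag name (namespace stripped) to its text."""
    values = {}
    for child in extension_element:
        local_name = child.tag.split("}", 1)[-1]  # '{namespace}hr' -> 'hr'
        values[local_name] = child.text
    return values


print(extension_values(tpe))
```

The values stay as strings here; converting them with int() is a reasonable next step once you know which fields your device reports.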
In addition to holding the data parsed from the underlying GPX file, GPXTrack and GPXTrackSegment objects have some useful methods for calculating things we might want to know based on the data. For example, you can calculate the total length of a track or segment:
>>> segment.length_2d() # ignoring elevation
8104.369313043303
>>> segment.length_3d() # including elevation
8256.807195641411
Or data about moving time, or speed (in metres/second):
>>> segment.get_moving_data()
MovingData(moving_time=7829.0, stopped_time=971.0, moving_distance=8096.192269756624, stopped_distance=160.6149258847903, max_speed=1.7427574692488983)
>>> segment.get_speed(44) # The number of the point at which you want to measure speed
1.157300752926421
There are various other methods available to calculate other metrics, as well as methods to adjust or modify the data, such as by adding or removing points, splitting segments, smoothing values, etc. You can explore these by calling help on the relevant object.
Finally, here is a Python script to parse a GPX file and place some of the key data into a pandas DataFrame. Calling this script on our GPX file:
python3 parse_gpx.py activity_strava.gpx
… will output something like the following:
Parsing TCX files with lxml
The Training Center XML (TCX) format is another common format for storing activity data, and was created by Garmin. The easiest way to understand the difference between GPX and TCX is to look at the two files side by side:
The first thing you will probably notice is that the data points in the TCX file are grouped into “Laps” and that each Lap element has some useful data associated with it, such as the total time taken for the lap, calories burned, average and maximum heart rate, etc. What constitutes a “lap” depends on how the device is configured; in this case, the activity is divided into “laps”, or splits, of 1,000 metres.
The other thing you may have noticed is that the first Trackpoint element in the TCX file contains altitude, distance, heart rate and speed data, but not latitude or longitude data. This happens occasionally and reflects the structure of the raw (FIT) data recorded by the device. I can only guess that this happens because that data (which does not depend on GPS) is reported separately from the latitude and longitude data. Because a trkpt element in a GPX file must contain latitude and longitude, it is not possible for GPX files to record the altitude (etc.) separately; it must be associated with some latitude and longitude data. So the GPX file downloaded from Garmin Connect seems to simply ignore those data points which do not have latitude and longitude data, whereas it appears that the GPX file downloaded from Strava includes them and “fills in” the missing latitude and longitude data using the data from the next point.
Other than the above points, the structure of a TCX file is not that different to that of a GPX file. The root element is a TrainingCenterDatabase element, which contains an Activities element. That element has one or more Activity elements, each describing an activity. As well as some metadata, the Activity element contains a number of Lap elements. Each Lap element contains some metadata about the relevant lap (or split), as well as a Track element which contains many Trackpoint elements, each representing a data point that was reported by the device, and which may (or may not) contain, among other things, latitude and longitude, altitude, heart rate, cadence, distance and speed data.
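The nesting described above can be sketched with the standard library alone. The example below reads laps and track points from a small hand-written TCX document. The sample values and the helper names (read_tcx, _float, _time) are illustrative assumptions, and the sketch assumes Python 3.7 or later because it uses datetime.fromisoformat.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# A small hand-written TCX document (illustrative values, not real data).
SAMPLE_TCX = """<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Activities>
    <Activity Sport="Running">
      <Id>2020-10-13T07:40:00Z</Id>
      <Lap StartTime="2020-10-13T07:40:00Z">
        <TotalTimeSeconds>360.0</TotalTimeSeconds>
        <DistanceMeters>1000.0</DistanceMeters>
        <Track>
          <Trackpoint>
            <Time>2020-10-13T07:40:00Z</Time>
            <AltitudeMeters>147.2</AltitudeMeters>
            <HeartRateBpm><Value>98</Value></HeartRateBpm>
          </Trackpoint>
          <Trackpoint>
            <Time>2020-10-13T07:40:03Z</Time>
            <Position>
              <LatitudeDegrees>40.642868</LatitudeDegrees>
              <LongitudeDegrees>14.593911</LongitudeDegrees>
            </Position>
            <AltitudeMeters>147.4</AltitudeMeters>
            <HeartRateBpm><Value>101</Value></HeartRateBpm>
          </Trackpoint>
        </Track>
      </Lap>
    </Activity>
  </Activities>
</TrainingCenterDatabase>
"""

NS = {"tcx": "http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"}


def _float(element, path):
    """Return the text at path as a float, or None if the element is absent."""
    text = element.findtext(path, default=None, namespaces=NS)
    return float(text) if text is not None else None


def _time(text):
    # fromisoformat() before Python 3.11 does not accept a trailing "Z".
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def read_tcx(tcx_text):
    """Return (laps, points) as two lists of dicts."""
    root = ET.fromstring(tcx_text)
    laps, points = [], []
    lap_tag = "{%s}Lap" % NS["tcx"]
    for lap_number, lap in enumerate(root.iter(lap_tag), start=1):
        laps.append({
            "number": lap_number,
            "start_time": _time(lap.get("StartTime")),
            "total_time": _float(lap, "tcx:TotalTimeSeconds"),
            "distance": _float(lap, "tcx:DistanceMeters"),
        })
        for tp in lap.findall("tcx:Track/tcx:Trackpoint", NS):
            points.append({
                "lap": lap_number,
                "time": _time(tp.findtext("tcx:Time", namespaces=NS)),
                # Position may be missing, as discussed above.
                "latitude": _float(tp, "tcx:Position/tcx:LatitudeDegrees"),
                "longitude": _float(tp, "tcx:Position/tcx:LongitudeDegrees"),
                "altitude": _float(tp, "tcx:AltitudeMeters"),
                "heart_rate": _float(tp, "tcx:HeartRateBpm/tcx:Value"),
            })
    return laps, points


laps, points = read_tcx(SAMPLE_TCX)
print(len(laps), len(points), points[0]["latitude"], points[1]["latitude"])
```

The first Trackpoint has no Position element, so its latitude and longitude come back as None rather than raising an error.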
I am not aware of any established Python library for working with TCX files, but given that it is just a type of XML file, you can use lxml or Python’s standard xml library to parse it. Here is a Python script that uses the lxml library to parse a TCX file and put some of the key data into a pandas DataFrame, similar to the one linked for GPX files above. Note that we also use the python-dateutil library to easily parse the ISO 8601-formatted timestamps. Taking advantage of the extra information contained in the TCX file, we create an additional DataFrame with lap information. Calling this script as follows (make sure you have lxml and python-dateutil installed):
python3 parse_tcx.py activity_strava.tcx
… will give you something like this:
Parsing FIT files with fitdecode
Unlike the GPX and TCX formats, which are based on XML, the Flexible and Interoperable Data Transfer (FIT) protocol is a binary format created by Garmin. fitdecode is a Python library for parsing FIT files. The documentation for the library is here. It can be installed like so:
pip install fitdecode
The fitdecode library allows you to create a FitReader object which reads a FIT file. You can then iterate through the FitReader to access each “frame” or chunk of data present in the FIT file, in order. Each frame is represented by a FitHeader, FitDefinitionMessage, FitDataMessage or FitCRC object, depending on the type of the underlying data record. FitDataMessage is the one we’re interested in, because that is the object that will contain the actual data. But not every FitDataMessage will be relevant; many of them may just contain data about the device status or metadata about the file itself. For present purposes, what we are looking for is a FitDataMessage where the name attribute is lap or record:
import fitdecode

with fitdecode.FitReader('activity_garmin.fit') as fit_file:
    for frame in fit_file:
        if isinstance(frame, fitdecode.records.FitDataMessage):
            if frame.name == 'lap':
                # This frame contains data about a lap.
                pass
            elif frame.name == 'record':
                # This frame contains data about a "track point".
                pass
You can inspect what data fields a given frame has as follows:
for field in frame.fields:
    # field is a FieldData object
    print(field.name)
And you can use the has_field, get_field and get_value methods of the FitDataMessage object to access the relevant data.
# Assuming the frame is a "record"
if frame.has_field('position_lat') and frame.has_field('position_long'):
    print('latitude:', frame.get_value('position_lat'))
    print('longitude:', frame.get_value('position_long'))

# Or you can provide a "fallback" argument to give you a default
# value if the field is not present:
print('non_existent_field:', frame.get_value('non_existent_field', fallback='field not present'))
The above code (if called in a context where frame is a FitDataMessage object of message type record and has latitude and longitude data) will produce an output something like:
latitude: 484805747
longitude: 174290634
non_existent_field: field not present
Now, you will notice that latitude and longitude are stored as integers. According to this Stack Overflow post, the way to convert these integers to degrees is to divide them by (2**32)/360:
>>> 484805747 / ((2**32)/360)
40.63594828359783
>>> 174290634 / ((2**32)/360)
14.608872178941965
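The conversion works because FIT stores positions in units often called “semicircles”, where 2**31 semicircles correspond to 180 degrees. If you are converting many points, it may be convenient to wrap the division in a small helper. The function name below is my own, and passing None through unchanged is an assumption made so that points without position data do not raise an error.

```python
def semicircles_to_degrees(semicircles):
    """Convert a FIT position value (semicircles) to degrees.

    None is passed through so that records without a position can be
    handled without a special case at the call site.
    """
    if semicircles is None:
        return None
    return semicircles / ((2 ** 32) / 360)


print(semicircles_to_degrees(484805747))  # latitude from the example above
print(semicircles_to_degrees(174290634))  # longitude from the example above
print(semicircles_to_degrees(None))
```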
The following are just some of the more useful fields present in the FIT files generated by my VA3:
- For lap frames: start_time, start_position_lat, start_position_long, total_elapsed_time, total_distance, total_calories, avg_speed, max_speed, total_ascent, total_descent, avg_heart_rate, max_heart_rate, avg_cadence, max_cadence, avg_power, max_power
- For record frames: timestamp, position_lat, position_long, distance, altitude, enhanced_altitude, speed, enhanced_speed, heart_rate, cadence
There are others, and different devices may report different data, so it is worth exploring your own files to see what you can find. Additionally, not every field will always be present; for example, as we saw in the previous section, sometimes latitude and longitude data may not be reported. So it is good practice to use the has_field method or provide a fallback argument to get_value.
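One way to apply that practice is to collect a fixed set of fields from each record frame into a dictionary, using a fallback for anything that is missing. The sketch below is a hedged illustration: record_to_row and RECORD_FIELDS are names I made up, StandInFrame is a hypothetical stand-in that mimics only the get_value(name, fallback=...) call so the example runs without a FIT file, and it assumes None is an acceptable fallback value (substitute another placeholder if your version of fitdecode treats None differently).

```python
# Fields to collect from each "record" frame (a subset of those listed above).
RECORD_FIELDS = [
    "timestamp",
    "position_lat",
    "position_long",
    "distance",
    "enhanced_altitude",
    "heart_rate",
    "cadence",
]


def record_to_row(frame, fields=RECORD_FIELDS):
    """Build a dict from a record frame, using None for any absent field."""
    return {name: frame.get_value(name, fallback=None) for name in fields}


class StandInFrame:
    """Hypothetical stand-in mimicking only get_value(name, fallback=...)."""

    def __init__(self, values):
        self._values = values

    def get_value(self, name, fallback=None):
        return self._values.get(name, fallback)


# A frame that reports heart rate, cadence and distance but no position.
frame = StandInFrame({"heart_rate": 134, "cadence": 43, "distance": 812.4})
row = record_to_row(frame)
print(row["heart_rate"], row["position_lat"])
```

Because every row has the same keys, a list of such rows is a convenient input for building a table of track points.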
Here is a basic script that parses a FIT file and produces pandas DataFrames with lap and track point information, similar to the script I linked in the previous section for TCX files.
What next?
Now that you know the basics of fetching and parsing your fitness tracker data using Python, it’s up to you what you want to do with it. One of the things you’ll probably want to do is visualise that data in some way, using matplotlib, seaborn, plotly or some other data visualisation library.
Here are a few articles and libraries that you might find interesting:
- This article partially goes over the same ground we covered in this article, but goes on to discuss the basics of plotting and transforming GPS data and includes a helpful discussion of how to calculate the distance between two points.
- This article discusses how to visualise GPS data using Folium.
- If your device does not report elevation data, check out srtm.py, a library from the same author as gpxpy, which lets you look up elevation data using NASA’s Shuttle Radar Topography Mission data.
Thanks for reading. Hopefully you found this article helpful!