Parsing fitness tracker data with Python

A look at some of the most common file formats and useful Python libraries

Alan Bunbury
Towards Data Science


Photo by Cameron Venti on Unsplash

Introduction

It is now very common to use fitness tracking apps and devices (such as Garmin, Fitbit and Strava) to track your exercise, particularly cardio activities such as running and cycling. There are plenty of phone and web apps that let you view and analyse your activities in increasingly sophisticated ways. Some of these apps are provided by device manufacturers (such as Garmin Connect), whereas others, such as Strava, are device- and manufacturer-independent.

But maybe you don’t want to rely on your device manufacturer or one of the third-party providers to store and access your data. (For example, Garmin’s servers went down for several days in July 2020, leaving Garmin users completely unable to access Garmin Connect.) Or maybe you want to review and analyse your activities in a way that isn’t possible with those apps.

Luckily, with some basic programming skills it is not difficult to parse and analyse this data yourself, and then the only limit to what you can do with it is your imagination. In this article we will discuss the basics of fetching activity data and parsing it using Python. We will look at the most common formats for storing and exporting activity data, and explore a couple of useful Python libraries for parsing it. There will also be some example scripts to parse data files and build pandas DataFrames with the data (we will not otherwise be talking about pandas here, and no familiarity with pandas is necessary to follow this article — some basic knowledge of Python and XML would be helpful). The example scripts can be found at the following GitHub repo (the minimum Python version required to run is 3.6, and the dependencies can be found in the Pipfile):

When we say “activity data”, we are primarily talking about GPS and time data which describes some movement-based activity such as a run, walk or cycle, as well as complementary data that may be provided by your device, such as elevation, heart rate and cadence data.

The methods described in this article are mainly based on my playing around with activities recorded by my Garmin vívoactive 3 watch (which I’ll refer to as VA3 for convenience) or my older Garmin Forerunner 30 watch (which I’ll refer to as FR30), which I have mainly used to record runs and walks. There will be differences in how each device records and exports data so your mileage may vary, but the file formats and libraries we will be discussing should be useful for working with a broad range of popular tracking devices and apps.

Unless explicitly stated, I was not involved in the development of any of the services, software or articles discussed or linked to in this article.

How to get the data

Export from app

Many of the more popular apps give you an option to export your activities to common file formats. For example, both Strava and Garmin Connect allow you to export activities to GPX and TCX formats, and to download the original source file (which may be a FIT file). Strava’s instructions are here and Garmin Connect’s instructions are here. Of course, you need to have an account and have already uploaded your activity to the relevant app.

If you have a lot of activities you want to analyse, you will probably want to download the files in bulk rather than one by one. Not all apps offer a bulk activity export function. However, Strava, Garmin and other companies that comply with the General Data Protection Regulation (GDPR) should give you an option to download all of the personal data they hold about you, which will include your activity data (among other things). See here for Strava and here for Garmin. You may also be able to find third-party scripts that bulk download your activities in a more convenient way, such as this Python script for Garmin Connect (any such script will likely ask for your username and password in order to download your activities, so be careful and only use software you trust).

For apps other than Strava and Garmin Connect, consult their FAQs or technical support for their export options.

Directly from the device

Some devices can be hooked up to your computer using a USB connection so that you can access the activity files on them directly. Whether this is possible and exactly where to find the activity files will depend on your device, so if in doubt check the FAQs or technical support provided by your device manufacturer. For example, both the FR30 and VA3 have a directory called GARMIN which contains (among other things) a directory called ACTIVITY. That directory contains activity data as FIT files.
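As a rough sketch of how you might collect those files once the device is mounted, the helper below lists the FIT files under GARMIN/ACTIVITY. The function name and the idea of passing in the mount point are my own assumptions for illustration; where the device appears on your system will vary.

```python
from pathlib import Path


def find_fit_files(device_root):
    """Return the FIT files under <device_root>/GARMIN/ACTIVITY, sorted by path."""
    activity_dir = Path(device_root) / "GARMIN" / "ACTIVITY"
    if not activity_dir.is_dir():
        # Device not mounted here, or it uses a different layout.
        return []
    # Compare the extension case-insensitively, since file names may be upper case.
    return sorted(
        path for path in activity_dir.iterdir()
        if path.is_file() and path.suffix.lower() == ".fit"
    )
```

You would call it with wherever your device is mounted, for example `find_fit_files("/media/you/GARMIN_DEVICE")` (a hypothetical path).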

How to parse the data

Parsing GPX files with gpxpy

The GPS Exchange Format (GPX) is an open XML-based format that is commonly used to store GPS-based data. Of the three formats we will discuss in this article, GPX is probably the easiest to work with. It is a simple and well-documented format and there are several useful tools and libraries for working with GPX data. (On the other hand, TCX and FIT files can contain more information about the activity than GPX files can.) We will use the gpxpy library for Python to work with GPX files.

But first, let’s see what a GPX file looks like. By way of example, here is a screenshot showing (the first few lines of) two GPX files side by side, one downloaded from Strava and the other downloaded from Garmin Connect (but both generated using the same underlying data from my FR30).

Two GPX files: Strava on the left; Garmin on the right

In each file, you can see that the root element is a gpx element, with several attributes describing the creator of the GPX file and the XML namespaces used therein. Within the gpx element there is a metadata element, with metadata about the file itself, and a trk element representing a “track”, which is “an ordered list of points describing a path”. This loosely corresponds to what we would generally consider to be a single activity (a run, cycle, walk, etc).

The trk element contains some metadata about the activity, such as its name and its activity type (more on that later), as well as one or more trkseg elements, each representing a “track segment”, which is “a list of Track Points which are logically connected in order”. In other words, a trkseg should contain contiguous GPS data. If your activity simply involves turning on your GPS, running for 10km and then turning off your GPS when you’re done, that whole activity will normally be a single track segment. However, if, for whatever reason, you have turned your GPS off and on again (or lost and then regained GPS functionality) during the activity, the trk may consist of multiple trkseg elements. (At least, that’s the theory, according to the documentation; when I pause and restart my VA3 during a run, it still seems to represent the whole run as a single track segment.)

Each trkseg element should contain one or more (likely many) trkpt or “track point” elements, each representing a single (geographical) point detected by your GPS device. These points are usually a few seconds apart.

At a minimum, a trkpt must contain latitude and longitude data (as attributes lat and lon of the element) and may optionally include time and elevation (ele) data, as child elements (data generated by a fitness tracker is highly likely to at least include time). A trkpt may also contain an extensions element, which can contain additional information. In the example above, extension elements (in Garmin’s TrackPointExtension (TPE) format) are used to store heart rate and cadence data that is provided by the FR30.
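To make the hierarchy concrete, here is a small sketch that walks a GPX document using only Python's standard xml library. The sample document is hand-written for illustration (it is not one of my real files), and the read_points function and the NS prefix names are my own; the namespaces are the GPX 1.1 and Garmin TrackPointExtension v1 ones.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative GPX document with a single track point.
SAMPLE_GPX = """<gpx version="1.1" creator="example"
     xmlns="http://www.topografix.com/GPX/1/1"
     xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1">
  <trk>
    <name>Morning Walk</name>
    <type>walking</type>
    <trkseg>
      <trkpt lat="40.642868" lon="14.593911">
        <ele>147.2</ele>
        <time>2020-10-13T07:44:13Z</time>
        <extensions>
          <gpxtpx:TrackPointExtension>
            <gpxtpx:hr>134</gpxtpx:hr>
            <gpxtpx:cad>43</gpxtpx:cad>
          </gpxtpx:TrackPointExtension>
        </extensions>
      </trkpt>
    </trkseg>
  </trk>
</gpx>
"""

# Prefixes used in the search paths below (the prefix names are arbitrary).
NS = {
    "gpx": "http://www.topografix.com/GPX/1/1",
    "tpe": "http://www.garmin.com/xmlschemas/TrackPointExtension/v1",
}


def read_points(xml_text):
    """Walk gpx -> trk -> trkseg -> trkpt and return one dict per track point."""
    root = ET.fromstring(xml_text)
    points = []
    for trk in root.findall("gpx:trk", NS):
        for seg in trk.findall("gpx:trkseg", NS):
            for pt in seg.findall("gpx:trkpt", NS):
                points.append({
                    "track": trk.findtext("gpx:name", namespaces=NS),
                    # lat and lon are mandatory attributes of trkpt.
                    "lat": float(pt.get("lat")),
                    "lon": float(pt.get("lon")),
                    # ele, time and the extensions are optional, so these
                    # are left as text and may be None.
                    "ele": pt.findtext("gpx:ele", namespaces=NS),
                    "time": pt.findtext("gpx:time", namespaces=NS),
                    "hr": pt.findtext(".//tpe:hr", namespaces=NS),
                    "cad": pt.findtext(".//tpe:cad", namespaces=NS),
                })
    return points


print(read_points(SAMPLE_GPX))
```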

There are three main differences between the two GPX files displayed above that I want to point out. First, the type of the trk element: the Garmin file describes this as “running”, whereas the Strava file simply describes it as “9”. There is no standardised way to represent the type of a track. Garmin uses words such as “running”, “walking”, “hiking”, etc., whereas Strava uses numeric codes, such as “4” for hiking, “9” for running, “10” for walking, etc. I couldn’t find a comprehensive mapping of Strava numeric codes to activity types. If you want to find the code for a specific activity type, you could edit the activity type of an existing activity on Strava (click the pencil icon on the left hand side of the activity page) and then export it to GPX to check the value in the type element.
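If you mix files from both sources, you may want to normalise the type value. The sketch below only knows the three Strava codes mentioned above, so it is deliberately incomplete; the function and dictionary names are my own, and any unknown value is passed through unchanged.

```python
# Only the codes mentioned in this article; this is not a complete mapping.
STRAVA_TYPE_CODES = {"4": "hiking", "9": "running", "10": "walking"}


def normalise_track_type(raw_type):
    """Map a known Strava numeric code to a word; leave anything else as-is."""
    if raw_type is None:
        return None
    cleaned = raw_type.strip().lower()
    return STRAVA_TYPE_CODES.get(cleaned, cleaned)
```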

Secondly, the reported elevations of the track points are different, which might seem surprising given that they are based on the same underlying data. Some fitness trackers (including, it seems, the FR30) either do not record elevation data or take highly inaccurate recordings based on GPS signal. In these cases, apps like Strava and Garmin use their own internal elevation databases and algorithms to either generate their own elevation data or adjust the data recorded by the device in order to give a more realistic reading (see here for more information from Strava). Each app’s methods for generating or adjusting elevation data will be slightly different, and you are witnessing the difference here.

Finally, you will note that the latitude and longitude data reported by the Garmin file is far more precise, sometimes giving the value to about 30 decimal places, whereas the Strava file gives the value to seven decimal places. The Garmin file appears to reflect the precision of the raw data reported by the FR30, whereas Strava seems to round the data. It is important to note that precision is not the same as accuracy. Reporting latitude and longitude to thirty decimal places suggests a truly microscopic level of precision, whereas the GPS in your fitness tracker is likely accurate to a few metres at best. Therefore, all that extra precision reported by your fitness tracker isn’t particularly useful. It can, however, have a small but noticeable impact on the total recorded distance of the activity (calculated by adding up the distances between all of the points) so the total distance may differ slightly depending on where the data is coming from.
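To illustrate what "adding up the distances between all of the points" means, here is a minimal sketch that uses the haversine formula on a spherical Earth. That choice of formula and the Earth radius are my assumptions; the apps and libraries discussed here may well calculate distance differently.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres (an approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def total_distance_m(points):
    """Sum the distances between consecutive (lat, lon) pairs."""
    return sum(
        haversine_m(lat1, lon1, lat2, lon2)
        for (lat1, lon1), (lat2, lon2) in zip(points, points[1:])
    )
```

To see the effect of rounding on your own data, you could compare `total_distance_m(points)` with `total_distance_m([(round(lat, 7), round(lon, 7)) for lat, lon in points])`.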

So, let’s take a look at the gpxpy library. First, make sure it is installed:

pip install gpxpy

Now let’s fire up a Python interpreter in the same directory as our GPX file (for the rest of the article I am using data for a different activity to the one we saw above). Parsing the file is as easy as:

>>> import gpxpy
>>> with open('activity_strava.gpx') as f:
...     gpx = gpxpy.parse(f)
...
>>> gpx
GPX(tracks=[GPXTrack(name='Morning Walk', segments=[GPXTrackSegment(points=[...])])])

You can see that calling gpxpy.parse on the GPX file object will give you a GPX object. This is a data structure that reflects the structure of the GPX file itself. Among other things, it contains a list of GPXTrack objects, each representing a track. Each GPXTrack object contains some metadata about the track and a list of segments.

>>> len(gpx.tracks)
1
>>> track = gpx.tracks[0]
>>> track
GPXTrack(name='Morning Walk', segments=[GPXTrackSegment(points=[...])])
>>> track.type
'10'
>>> track.name
'Morning Walk'
>>> track.segments
[GPXTrackSegment(points=[...])]

Each GPXTrackSegment, in turn, contains a list of GPXTrackPoint objects, each reflecting a single track point.

>>> segment = track.segments[0]
>>> len(segment.points)
2433
>>> random_point = segment.points[44]
>>> random_point
GPXTrackPoint(40.642868, 14.593911, elevation=147.2, time=datetime.datetime(2020, 10, 13, 7, 44, 13, tzinfo=SimpleTZ("Z")))
>>> random_point.latitude
40.642868
>>> random_point.longitude
14.593911
>>> random_point.elevation
147.2
>>> random_point.time
datetime.datetime(2020, 10, 13, 7, 44, 13, tzinfo=SimpleTZ("Z"))

Information that is stored as an extension in the GPX file can be accessed too. In that case the relevant XML elements (ie, the children of the extensions element in the GPX file) are stored in a list.

>>> random_point.extensions
[<Element {http://www.garmin.com/xmlschemas/TrackPointExtension/v1}TrackPointExtension at 0x7f32bcc93540>]
>>> tpe = random_point.extensions[0]
>>> for child in tpe:
...     print(child.tag, child.text)
...
{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}hr 134
{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}cad 43
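Those namespaced tags are a little unwieldy, so you may prefer to flatten the extensions into a plain dictionary. The helper below is a sketch of one way to do that; extension_values is my own name rather than part of gpxpy. It only relies on the tag and text attributes seen above, and is demonstrated here on an element built by hand with the standard library.

```python
import xml.etree.ElementTree as ET


def extension_values(extensions):
    """Flatten a list of extension elements into {local_tag_name: text}."""
    values = {}
    for extension in extensions:
        # iter() yields the element itself and then all of its descendants.
        for element in extension.iter():
            if not isinstance(element.tag, str):
                continue  # skip comments and similar nodes
            text = (element.text or "").strip()
            if text:
                # Drop the "{namespace}" prefix from the tag.
                values[element.tag.rsplit("}", 1)[-1]] = text
    return values


# A hand-built stand-in for random_point.extensions[0].
TPE = "http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
tpe = ET.Element(f"{{{TPE}}}TrackPointExtension")
ET.SubElement(tpe, f"{{{TPE}}}hr").text = "134"
ET.SubElement(tpe, f"{{{TPE}}}cad").text = "43"

print(extension_values([tpe]))  # {'hr': '134', 'cad': '43'}
```

Note that the values are still strings, so convert them with int or float as needed.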

In addition to holding the data parsed from the underlying GPX file, GPXTrack and GPXTrackSegment objects have some useful methods for calculating things we might want to know based on the data. For example, you can calculate the total length of a track or segment:

>>> segment.length_2d()  # ignoring elevation
8104.369313043303
>>> segment.length_3d() # including elevation
8256.807195641411

Or data about moving time, or speed (in metres/second):

>>> segment.get_moving_data()
MovingData(moving_time=7829.0, stopped_time=971.0, moving_distance=8096.192269756624, stopped_distance=160.6149258847903, max_speed=1.7427574692488983)
>>> segment.get_speed(44) # The number of the point at which you want to measure speed
1.157300752926421
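Runners and walkers often think in terms of pace rather than speed. As a small worked example, the following sketch turns a distance in metres and a time in seconds (such as the moving_distance and moving_time values above) into a pace per kilometre. The function is my own and is not part of gpxpy.

```python
def pace_per_km(distance_m, time_s):
    """Return pace as an 'M:SS' string per kilometre, rounded to the second."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    seconds_per_km = round(time_s / (distance_m / 1000))
    minutes, seconds = divmod(seconds_per_km, 60)
    return f"{minutes}:{seconds:02d}"


# Using the moving_distance and moving_time reported above:
print(pace_per_km(8096.192269756624, 7829.0))  # 16:07
```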

There are various other methods available to calculate other metrics, as well as methods to adjust or modify the data, such as by adding or removing points, splitting segments, smoothing values, etc. You can explore these by calling help on the relevant object.

Finally, here is a Python script to parse a GPX file and place some of the key data into a pandas DataFrame. Calling this script on our GPX file:

python3 parse_gpx.py activity_strava.gpx

… will output something like the following:

A pandas DataFrame with track point data.
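The general shape of such a script can be sketched as follows. This is not the linked script itself, just a minimal illustration under my own naming: it walks the tracks, segments and points attributes we explored above and builds one dictionary per track point, which pandas can turn into a DataFrame.

```python
def gpx_to_rows(gpx):
    """Return one dict per track point from a parsed gpxpy GPX object."""
    rows = []
    for track in gpx.tracks:
        for segment in track.segments:
            for point in segment.points:
                rows.append({
                    "track_name": track.name,
                    "time": point.time,
                    "latitude": point.latitude,
                    "longitude": point.longitude,
                    "elevation": point.elevation,  # may be None
                })
    return rows


# Typical use (requires gpxpy and pandas to be installed):
#     import gpxpy
#     import pandas as pd
#     with open("activity_strava.gpx") as f:
#         df = pd.DataFrame(gpx_to_rows(gpxpy.parse(f)))
```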

Parsing TCX files with lxml

The Training Center XML (TCX) format is another common format for storing activity data, and was created by Garmin. The easiest way to understand the difference between GPX and TCX is to look at the two files side by side:

GPX file on the left; TCX file on the right.

The first thing you will probably notice is that the data points in the TCX file are grouped into “Laps” and that each Lap element has some useful data associated with it, such as the total time taken for the lap, calories burned, average and maximum heart rate, etc. What constitutes a “lap” depends on how the device is configured; in this case, the activity is divided into “laps”, or splits, of 1,000 metres.

The other thing you may have noticed is that the first Trackpoint element in the TCX file contains altitude, distance, heart rate and speed data, but not latitude or longitude data. This happens occasionally and reflects the structure of the raw (FIT) data recorded by the device. I can only guess that this happens because that data (which does not depend on GPS) is reported separately to the latitude and longitude data. Because a trkpt element in a GPX file must contain latitude and longitude, it is not possible for GPX files to record the altitude (etc) separately; it must be associated with some latitude and longitude data. So the GPX file downloaded from Garmin Connect seems to simply ignore those datapoints which do not have latitude and longitude data, whereas it appears that the GPX file downloaded from Strava includes them and “fills in” the missing latitude and longitude data using the data from the next point.
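The two behaviours described above can be sketched in a few lines. This is only my reading of what the exported files suggest, not a description of how Garmin Connect or Strava actually implement it, and the function names and dictionary keys are my own.

```python
def has_position(point):
    """True if the point has both latitude and longitude."""
    return point.get("latitude") is not None and point.get("longitude") is not None


def drop_missing_positions(points):
    """Ignore points without a position (what the Garmin GPX seems to do)."""
    return [dict(p) for p in points if has_position(p)]


def backfill_positions(points):
    """Copy the next known position into points that lack one
    (what the Strava GPX appears to do)."""
    filled = [dict(p) for p in points]  # leave the input untouched
    next_position = None
    # Walk backwards so the "next" known position is always at hand.
    for point in reversed(filled):
        if has_position(point):
            next_position = (point["latitude"], point["longitude"])
        elif next_position is not None:
            point["latitude"], point["longitude"] = next_position
    return filled
```

Points at the very end of an activity with no later position are left unchanged by backfill_positions.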

Other than the above points, the structure of a TCX file is not that different to that of a GPX file. The root element is a TrainingCenterDatabase element, which contains an Activities element. That element has one or more Activity elements, each describing an activity. As well as some metadata, the Activity element contains a number of Lap elements. Each Lap element contains some metadata about the relevant lap (or split), as well as a Track element which contains many Trackpoint elements, each representing a data point that was reported by the device, and which may (or may not) contain, among other things, latitude and longitude, altitude, heart rate, cadence, distance and speed data.

I am not aware of any established Python library for working with TCX files, but given that it is just a type of XML file, you can use lxml or Python’s standard xml library to parse it. Here is a Python script that uses the lxml library to parse a TCX file and put some of the key data into a pandas DataFrame, similar to the one linked for GPX files above. Note that we also use the python-dateutil library to easily parse the ISO 8601-formatted timestamps. Taking advantage of the extra information contained in the TCX file, we create an additional DataFrame with lap information. Make sure you have lxml and python-dateutil installed, then call the script as follows:

python3 parse_tcx.py activity_strava.tcx

… will give you something like this:

pandas DataFrames with lap and track point data.
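For a feel of what the parsing involves, here is a minimal sketch that uses Python's standard xml library rather than lxml, and leaves timestamps as strings rather than parsing them. It is not the linked script; the function names and output keys are my own, and it only reads a handful of the elements described above.

```python
import xml.etree.ElementTree as ET

NS = {"tcx": "http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"}


def _number(element, path, cast=float):
    """Return the element at `path` converted with `cast`, or None if absent."""
    text = element.findtext(path, namespaces=NS)
    return cast(text) if text is not None else None


def parse_tcx(xml_text):
    """Return (laps, points) as two lists of dicts."""
    root = ET.fromstring(xml_text)
    laps, points = [], []
    for lap_number, lap in enumerate(root.iterfind(".//tcx:Lap", NS), start=1):
        laps.append({
            "number": lap_number,
            "start_time": lap.get("StartTime"),
            "total_time_s": _number(lap, "tcx:TotalTimeSeconds"),
            "distance_m": _number(lap, "tcx:DistanceMeters"),
            "calories": _number(lap, "tcx:Calories", int),
            "avg_hr": _number(lap, "tcx:AverageHeartRateBpm/tcx:Value", int),
            "max_hr": _number(lap, "tcx:MaximumHeartRateBpm/tcx:Value", int),
        })
        for tp in lap.iterfind("tcx:Track/tcx:Trackpoint", NS):
            points.append({
                "lap": lap_number,
                "time": tp.findtext("tcx:Time", namespaces=NS),
                # Position may be missing, in which case these are None.
                "latitude": _number(tp, "tcx:Position/tcx:LatitudeDegrees"),
                "longitude": _number(tp, "tcx:Position/tcx:LongitudeDegrees"),
                "altitude_m": _number(tp, "tcx:AltitudeMeters"),
                "distance_m": _number(tp, "tcx:DistanceMeters"),
                "heart_rate": _number(tp, "tcx:HeartRateBpm/tcx:Value", int),
                "cadence": _number(tp, "tcx:Cadence", int),
            })
    return laps, points
```

Given the text of a TCX file, `laps, points = parse_tcx(text)` returns two lists that can each be passed to `pandas.DataFrame`.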

Parsing FIT files with fitdecode

Unlike the GPX and TCX formats, which are based on XML, the Flexible and Interoperable Data Transfer (FIT) protocol is a binary format created by Garmin. fitdecode is a Python library for parsing FIT files. The documentation for the library is here. It can be installed like so:

pip install fitdecode

The fitdecode library allows you to create a FitReader object which reads a FIT file. You can then iterate through the FitReader to access each “frame” or chunk of data present in the FIT file, in order. Each frame is represented by a FitHeader, FitDefinitionMessage, FitDataMessage or FitCRC object, depending on the type of the underlying data record. FitDataMessage is the one we’re interested in, because that is the object that will contain the actual data. But not every FitDataMessage will be relevant; many of them may just contain data about the device status or metadata about the file itself. For present purposes, what we are looking for is a FitDataMessage where the name attribute is lap or record:

import fitdecode

with fitdecode.FitReader('activity_garmin.fit') as fit_file:
    for frame in fit_file:
        if isinstance(frame, fitdecode.records.FitDataMessage):
            if frame.name == 'lap':
                # This frame contains data about a lap.
                pass
            elif frame.name == 'record':
                # This frame contains data about a "track point".
                pass

You can inspect what data fields a given frame has as follows:

for field in frame.fields:
    # field is a FieldData object
    print(field.name)

And you can use the has_field, get_field and get_value methods of the FitDataMessage object to access the relevant data.

# Assuming the frame is a "record"
if frame.has_field('position_lat') and frame.has_field('position_long'):
    print('latitude:', frame.get_value('position_lat'))
    print('longitude:', frame.get_value('position_long'))

# Or you can provide a "fallback" argument to give you a default
# value if the field is not present:
print('non_existent_field:', frame.get_value('non_existent_field', fallback='field not present'))

The above code (if called in a context where frame is a FitDataMessage object of message type record and has latitude and longitude data) will produce an output something like:

latitude: 484805747
longitude: 174290634
non_existent_field: field not present

Now, you will notice that latitude and longitude are stored as integers. According to this StackOverflow post, the way to convert these integers to degrees is to divide them by (2**32)/360:

>>> 484805747 / ((2**32)/360)
40.63594828359783
>>> 174290634 / ((2**32)/360)
14.608872178941965
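The FIT SDK refers to these integer units as "semicircles", where 2**31 semicircles correspond to 180 degrees. The conversion above can be wrapped in a small helper like the one below; the function and constant names are my own, and returning None for a missing value is just one reasonable convention.

```python
SEMICIRCLES_PER_DEGREE = 2**31 / 180  # the same value as (2**32) / 360


def semicircles_to_degrees(semicircles):
    """Convert a FIT position value to degrees; None is passed through."""
    if semicircles is None:
        return None
    return semicircles / SEMICIRCLES_PER_DEGREE


print(semicircles_to_degrees(484805747))
print(semicircles_to_degrees(174290634))
```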

The following are just some of the more useful fields present in the FIT files generated by my VA3:

  • For lap frames: start_time, start_position_lat, start_position_long, total_elapsed_time, total_distance, total_calories, avg_speed, max_speed, total_ascent, total_descent, avg_heart_rate, max_heart_rate, avg_cadence, max_cadence, avg_power, max_power
  • For record frames: timestamp, position_lat, position_long, distance, altitude, enhanced_altitude, speed, enhanced_speed, heart_rate, cadence

There are others, and different devices may report different data, so it is worth exploring your own files to see what you can find. Additionally, not every field will always be present — for example, as we saw in the previous section, sometimes latitude and longitude data may not be reported. So it is good practice to use the has_field method or provide a fallback argument to get_value.

Here is a basic script that parses a FIT file and produces pandas DataFrames with lap and track point information, similar to the script I linked in the previous section for TCX files.

pandas DataFrames with lap and track point data.
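Putting the pieces from this section together, the core of such a script might look something like the sketch below. It is not the linked script; the function names and the selection of fields are my own. fitdecode is imported inside read_fit only so that the small frame_to_row helper can be used without the library installed.

```python
LAP_FIELDS = ("start_time", "total_elapsed_time", "total_distance",
              "total_calories", "avg_heart_rate", "max_heart_rate")
RECORD_FIELDS = ("timestamp", "position_lat", "position_long", "distance",
                 "enhanced_altitude", "enhanced_speed", "heart_rate", "cadence")


def frame_to_row(frame, field_names):
    """Pull the named fields out of a data message, using None if absent."""
    return {name: frame.get_value(name, fallback=None) for name in field_names}


def read_fit(path):
    """Return (laps, records) as two lists of dicts read from a FIT file."""
    import fitdecode  # third-party: pip install fitdecode

    laps, records = [], []
    with fitdecode.FitReader(path) as fit_file:
        for frame in fit_file:
            if not isinstance(frame, fitdecode.records.FitDataMessage):
                continue  # headers, definitions and CRCs carry no activity data
            if frame.name == "lap":
                laps.append(frame_to_row(frame, LAP_FIELDS))
            elif frame.name == "record":
                records.append(frame_to_row(frame, RECORD_FIELDS))
    return laps, records
```

Each list can then be passed to `pandas.DataFrame`. Remember that position_lat and position_long still need to be converted to degrees as described above.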

What next?

Now that you know the basics of fetching and parsing your fitness tracker data using Python, it’s up to you what you want to do with it. One of the things you’ll probably want to do is visualise that data in some way, using matplotlib, seaborn, plotly or some other data visualisation library.

Here are a few articles and libraries that you might find interesting:

  • This article covers some of the same ground as we have here, but goes on to discuss the basics of plotting and transforming GPS data and includes a helpful discussion of how to calculate the distance between two points.
  • This article discusses how to visualise GPS data using Folium.
  • If your device does not report elevation data, check out srtm.py, from the same author as gpxpy, that will let you look up elevation data using NASA’s Shuttle Radar Topography Mission data.

Thanks for reading. I hope you found this article helpful!


Lawyer with an interest in programming, data analysis and the financial markets. Based in Dublin, Ireland.