For the past six or so months I've been trying to learn some Python for data analysis. What's the goal? Skill building, I guess. I like to learn new things, and as an undergrad in engineering I think Python would be a good place to start learning how to code. I like the idea of AI for healthcare and machine learning, so here I am, trying to learn Python. My main issue, though, is that I have yet to really start my own project. Last holiday season I attempted the Advent of Code 2020 challenge, but I only managed a few days in. My attempts were nothing to get out of bed for, and I eventually gave up as my internship responsibilities grew a little too much for me to juggle everything. I've been working through the Data Science course on DataCamp, and while I enjoy the structure it gives me, I find its datasets/projects uninspiring. I often complete them for the practice but I don't bother to investigate anything on my own.
However, this past year a few things aligned that prompted me to start playing around with my first dataset. With lockdowns in sight for us here in Ireland in early January, and no sign of returning to college (we went online), I found I had a lot of free time to run a lot more than I usually do. Almost every day or so I'd head out for a 5 kilometre run with the intention of staying healthy, fit, and hopefully finally achieving a sub-20-minute run (so far I haven't made it). I took my Strava account somewhat seriously and signed up to run 100 kilometres each month. One issue I've had, though, is that Strava wants me to pay a premium subscription to get some nice graphs and analytics. That subscription is not happening. I'm a student, possibly one of the few young adults without a music streaming service because I can't stretch the budget that far. It got me thinking: why don't I just pore over the data myself and make some nice graphs? Turns out Strava has a lovely way of downloading your data into a CSV file, something that I, as a beginner Python coder, know how to work with. And so this piece is about that small six month dataset. Along the way I hope you'll learn a bit about Python, pandas, and plotting, and maybe it'll inspire you to find your own datasets to explore and extract some meaningful and exciting information.
1. But first, a few lessons
Those of you who know the basics of pandas and matplotlib can skip ahead; maybe I'll put in some sort of gif that will catch your attention and make you stop at the start of the data analysis. Maybe not, this is my first time writing, I've no clue what I might do. For the rest of you, I'll use this little block to explain the very basics of pandas and matplotlib. I'd like it if this article could be used by beginners to learn a thing or two, while at the same time giving you a taste of what you might do with some datasets that can be found online. I'm gonna split this quick little lesson into three parts: Python, pandas, and matplotlib. The Python part will be really short; if anyone at all wants a lesson on Python I can write a whole new article about it, just let me know!
1.1 Python
So what is Python? Or even why bother with it? Well, I'm no computer scientist, but having learned a small bit of MATLAB and Java, I can safely say that Python is one of the easier programming languages to learn. Luckily, it's also one of the most common programming languages and has a lot of support from the data science community. Without Python I'd be unable to use tools like pandas and matplotlib, and would have to defer to other programming languages to achieve certain tasks. It has its drawbacks, but for a beginner's programming language and for data science needs, it's ideal. For a real quick example just take a look at this:
Printing ‘Hello World’ in Java:
System.out.println("Hello World");
Printing ‘Hello World’ in Python:
print('Hello World')
There’s a chance most of us can agree that one of those pieces of code is more intuitive than the other. Again, I’m no computer scientist, but I am an engineer, and I want the most efficient tool to do the job. (Also, I think I’m missing some public static void stuff for the Java code. It’s been a while, please don’t shout at me.)
1.2 Pandas
Data scientists traditionally used R, a programming language built for statistical software and data analysis. I like it. It gets the job done for sure. But as Python had a larger audience, Wes McKinney set out to implement R's dataframe in Python. He built the pandas library, and thus data science was able to reach far more people. Using pandas, we're able to create, manipulate, and display tabular data. For the most part it looks like an Excel spreadsheet, with columns, rows, and many functions to alter everything in between. Its power comes from you, the user, and whatever way you want to manipulate the data.
There are ways for us to create our own custom dataframes, one of which looks something like this:
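(A small sketch of my own; the names and numbers here are just made-up examples.)

```python
import pandas as pd

# each dictionary key becomes a column name, and each list
# holds that column's rows
df = pd.DataFrame({
    'Name': ['Anna', 'Brian', 'Cara'],
    'Age': [21, 24, 22],
    'Distance': [5.0, 10.0, 21.1]
})
print(df)
```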
Which will output this:
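```
    Name  Age  Distance
0   Anna   21       5.0
1  Brian   24      10.0
2   Cara   22      21.1
```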

Each column has three rows of data, and each row of data has individual entries for each column. These entries could have been blank, True/False values, floats, addresses, anything really. Regarding the code box, I started with `import pandas as pd`. Pandas isn't automatically built into Python. Without that line of code I'd be unable to execute any of the pandas commands. I could type them out no problem, but when I try to run the code I'd get an error. The `as pd` part just allows me to give pandas a shorter name. I used a dictionary to create the dataframe in the above example. Each key is the column name and each value is the entries in that column's rows. Why might you create a dataframe like this? Well, if you were gathering a few columns and rows of data for a project in class, and you prefer pandas' flexibility and matplotlib's graphs over Excel, then it might be worth it to manually create a small dataframe like the one above.
However, usually you’ll have data stored elsewhere on your hard drive and you’ll want to import that instead. You’ll see how to do that below.
1.3 Matplotlib
Python has no inherent method to plot and display data. Matplotlib helps us solve this problem. Let's look at an example:
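(A minimal sketch with some made-up numbers.)

```python
import matplotlib.pyplot as plt

# some example data: x against x squared
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()  # create a figure and an axes
ax.plot(x, y)             # draw a line plot on the axes
plt.show()                # display the figure
```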

Again, I must import the relevant library. Then, I create a figure and an axes using `plt.subplots()`. Once those are created, I can make any type of plot I can think of. Here I used `.plot()`, which gave me a line plot, but I could have used `.scatter()` or `.hist()` for a scatterplot or a histogram respectively. I then call for the plot to be displayed using `plt.show()`.
2. Data Exploration
Before diving in I think it's best to start with a few goals. Something might catch my interest when I start to look at the data and explore whatever Strava will give to me, but if we start with something in mind at least we won't be struggling to get started. As I've been running with a monthly 100 km target in mind at all times, I think I'd like to compare across each month. Strava gives me my record runs, but won't allow me to filter on month/week/day of the week. I'd also like to see what my average time/distance is over each month. I started out running five kilometres, then slowly moved up to 10, and during the month of June I ran three half marathons. Maybe I can illustrate the increase in distance? Did my time overall increase? Let's get all this down.
- Monthly distance/ Number of runs.
- Was there a drastic increase in distance or gradual?
- Distance/Time/Month graph.
The rest of this will hopefully be more code-based. I'll comment on code blocks whenever necessary. I won't break down every line, but hopefully the message of exploring a dataset will become clear.
2.1 Importing, Cleaning, Initial Insights
My downloaded data comes in multiple .csv files. CSV stands for comma-separated values and is one of the most common file types when working with data. Using `pd.read_csv()` with the file path where I have it saved, I can load my data into `strava_df`.
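Something like this (the `'data/activities.csv'` path is just where I'm assuming you've saved the export; yours will differ):

```python
import pandas as pd

# load the Strava export into a dataframe
strava_df = pd.read_csv('data/activities.csv')
```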
To ensure that my data covers the first six months of 2021 I can use the `.head()` and `.tail()` methods to view the top and bottom of my dataset. `.head()` gives:

and `.tail()` gives:

The output defaults to five rows. The columns are truncated in the middle as there are far too many to display (there are 79 columns, most of them blank). As you can see, my dataset runs from January 4th 2020 to July 18th 2021. So long as I uploaded an activity to Strava within that time, it'll show up. One of the first things I'll do to clean the data is remove all the columns that aren't important. I could do this using the `.drop()` function, but as I want to keep the data at the start of my dataframe (date, time, distance) I'd prefer to use `.loc[]`, as it will allow me to subset my data. Its basic use case is accessing data using the following convention: `df.loc[row_A:row_Z, column_A:column_Z]`. The `:` indicates I want everything in between, and the `,` differentiates between rows and columns. For now, I'd like to keep every row, and remove every column after 'Distance'.
Calling `.head()` on this shows our updated `strava_df` dataframe:

Next, I want my dataset to only include my runs. Strava automatically labels each activity correctly, but I have sometimes edited an activity and relabelled a run as a 'workout' or a 'race'. Let's check what types of activity are saved in the 'Activity Type' column.
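The `.unique()` method is a quick way to do that:

```python
# list each distinct value in the 'Activity Type' column
strava_df['Activity Type'].unique()
```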

My output shows five activity types. I'll filter out 'Ride', 'Hike', and 'Swim', and then check the maximum distance of 'Run' and 'Workout' to ensure that none of my cycles were mislabelled.
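A sketch of that filter and check (the `|` means 'or', so a row survives if either condition holds):

```python
strava_df = strava_df[(strava_df['Activity Type'] == 'Run') | (strava_df['Activity Type'] == 'Workout')]
print(strava_df.groupby('Activity Type')['Distance'].max())  # longest 'Run' and 'Workout'
```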

Line 1 filters the dataframe for two conditions. If the 'Activity Type' is either a 'Run' or a 'Workout' we keep it; if not, we ignore it. Executing this code gives one instance where a run was labelled as a workout. As it was in 2020 we don't need to convert the type, as we're going to be ignoring data from before 2021. To further clean the data let's drop the 'Name', 'Type', and 'Description' columns, as they won't be needed for our goals (`axis=1` tells pandas to look across the columns for the listed labels, as opposed to the rows).
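A sketch of the drop (I'm using the column names as written above; in a raw Strava export they may read 'Activity Name', 'Activity Description', and so on):

```python
# drop the unneeded columns; axis=1 means "look along the columns"
strava_df = strava_df.drop(['Name', 'Type', 'Description'], axis=1)
```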

And then convert the time in seconds to time in minutes:
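Strava stores elapsed time in seconds, so one line does it:

```python
# convert 'Elapsed Time' from seconds to minutes
strava_df['Elapsed Time'] = strava_df['Elapsed Time'] / 60
```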

Now the next task will be filtering all the rows for activities that were recorded in 2021 (and ultimately to split them up on a daily and monthly basis). There are a few ways to achieve this, all of which require converting each item in the ‘Activity Date’ column into a datetime object:
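Roughly like so, using `pd.to_datetime()` and the `.dt` accessor:

```python
strava_df['Activity Date'] = pd.to_datetime(strava_df['Activity Date'])
strava_df['Day'] = strava_df['Activity Date'].dt.day
strava_df['Month'] = strava_df['Activity Date'].dt.month
strava_df['Year'] = strava_df['Activity Date'].dt.year
```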
The first line converts each row entry into a datetime object, and the next three lines extract the day, month, and year data and put each into its own separate column, as follows:

As a personal preference I’ll rearrange the column order:
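One possible ordering (a sketch; arrange yours however you like):

```python
strava_df = strava_df[['Year', 'Month', 'Day', 'Elapsed Time', 'Distance']]
```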
Which will give:

Including all data we have 128 records to work with; however, we need to drop all the rows from 2020. This is much easier now that there's a column holding just the yearly information.
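A simple boolean filter on the new 'Year' column:

```python
# keep only activities recorded in 2021
strava_df = strava_df[strava_df['Year'] == 2021]
```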

And removing July data:
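Same filter pattern, on the 'Month' column this time:

```python
# drop anything from July onwards
strava_df = strava_df[strava_df['Month'] < 7]
```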
And that’s pretty much it for importing and cleaning. We’ve 102 rows to draw information from and the table is laid out with the variables that we need.
I can create four dataframes that store the row information for my longest and shortest runs/times:
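A rough reconstruction, written as thirteen lines so the line numbers match the description that follows:

```python
longest_run = strava_df[strava_df['Distance'] == strava_df['Distance'].max()]
print(longest_run)
longest_run = longest_run.rename(index={longest_run.index[0]: 'Longest Run'})
shortest_run = strava_df[strava_df['Distance'] == strava_df['Distance'].min()]
print(shortest_run)
shortest_run = shortest_run.rename(index={shortest_run.index[0]: 'Shortest Run'})
longest_time = strava_df[strava_df['Elapsed Time'] == strava_df['Elapsed Time'].max()]
print(longest_time)
longest_time = longest_time.rename(index={longest_time.index[0]: 'Longest Time'})
shortest_time = strava_df[strava_df['Elapsed Time'] == strava_df['Elapsed Time'].min()]
print(shortest_time)
shortest_time = shortest_time.rename(index={shortest_time.index[0]: 'Shortest Time'})
records_df = pd.concat([longest_run, shortest_run, longest_time, shortest_time])
```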

The first line in each block (lines 1, 4, 7, and 10) isolates the row holding either the maximum (`.max()`) or minimum (`.min()`) in the dataframe. Printing those variables separately shows me the index where those rows are held. I can then use the `.rename()` function to change the index value to a custom label. Line 13 then uses `pd.concat()` to join all four dataframes on top of each other.
We can use some statistical functions built into pandas, together with string formatting, to give some meaningful information about the distances and times (other functions include `.mean()`, `.std()`, and `.var()`):
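For example (a sketch; format these however you like):

```python
print(f"Longest run: {strava_df['Distance'].max():.2f} km")
print(f"Average distance: {strava_df['Distance'].mean():.2f} km")
print(f"Average elapsed time: {strava_df['Elapsed Time'].mean():.2f} minutes")
```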

Finally, before we graph anything let's take a look at the spread of data across each month and each week:
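A sketch of the grouping:

```python
# total and count of time and distance, grouped by month
month_group = strava_df.groupby('Month')[['Elapsed Time', 'Distance']].agg(['sum', 'count'])
print(month_group)
```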

Using the `.groupby()` function we can group both the 'Elapsed Time' and 'Distance' columns by month. Then, using the `.agg()` function we can apply both the `sum` and `count` functions to the data. I ran 15 times in June for a total distance of 149.78 km (three half marathons helped). On the other hand, I ran 17 times in April for only 105.38 km. That looks like the first goal taken care of!
There’s a lot more we could do at this stage. We could loop through each day value and use them to determine the week value. Using that we could create similar tables. We could also divide distances by run times to determine the pace of each run, and see how that changed over time. But to keep things short and sweet I think it’s time to look at visualising some of this data.
2.2 Data Visualisation
I know the main graph that I want. I’d like a scatter plot with distance on the x-axis, elapsed time on the y-axis, and maybe a way to tell what month each data point belongs to.
Using matplotlib we should be able to achieve this fairly easily:
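A first attempt, with no styling at all:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(strava_df['Distance'], strava_df['Elapsed Time'])
ax.set_xlabel('Distance (km)')
ax.set_ylabel('Elapsed Time (minutes)')
plt.show()
```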

Well, that's fairly uninspiring. How can we improve? Well, we can easily change the style of the graph using `plt.style.use()`. Also, we will hopefully be able to use the `color` and `label` arguments to enhance the graph (spoiler: it won't be that easy). If I add the `color` argument and, say, six colours that I want mapped to the plot, I'll receive an error:

I need a list with 94 colour values. I’m obviously not going to go through each item in the dataframe and type out a colour for each month. Instead, we could create a simple loop:
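Something like this (the six colour names are just my pick from matplotlib's defaults):

```python
colours = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple', 'tab:brown']
col_list = []
for month in strava_df['Month']:
    # month runs from 1 to 6, so shift by one to index the list
    col_list.append(colours[month - 1])
```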
This piece of code will go through each row in `strava_df['Month']` and, for each month, add a specific colour to the list `col_list`. Now, setting `c` equal to `col_list`, we should get an appropriate colour gradient representing each month. Further, we can change the style to something a little nicer. I'll use the `fivethirtyeight` style from the popular data science website.
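Putting the style and the colour list together looks something like:

```python
plt.style.use('fivethirtyeight')

fig, ax = plt.subplots()
ax.scatter(strava_df['Distance'], strava_df['Elapsed Time'], c=col_list)
ax.set_xlabel('Distance (km)')
ax.set_ylabel('Elapsed Time (minutes)')
plt.show()
```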

Let's quickly use some dataframe subsetting to remove all the distance values greater than 20 (allowing us to drop the top right data points and avoid the large gap in the middle):
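(I'm assuming the cut-off applies to 'Distance', and the colour list needs rebuilding afterwards so it still matches the rows one-to-one.)

```python
# drop the half marathons: keep runs of 20 km or less
strava_df = strava_df[strava_df['Distance'] <= 20]
# rebuild the colour list to match the remaining rows
col_list = [colours[month - 1] for month in strava_df['Month']]
```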

It's getting closer to the final product. However, as it stands it's fairly static. I don't get much of an impression with the months plotted on top of each other. Maybe we can make an animation where the January data is plotted first, followed by February, and so on? While everything above is focused towards beginners and should be accessible enough to understand and follow, this next bit might be a bit difficult. However, this is a clear example of where writing your own data analysis code in Python has advantages over traditional graphing/data analysis packages.
First I'll make a few lists. I worked on this for a while and found it too awkward to update a legend and a vertical line for the month of data being presented, as well as the average distance and pace per month. Instead, it's much easier to illustrate it through the title. The first loop breaks the dataframe into its six months, extracts the pace, and appends the integer value to `pace_avg`. The second loop converts each pace in `pace_avg` to a string and adds 'km' to the end of it.
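A sketch of those two loops (treating pace as elapsed minutes divided by distance, an assumption on my part):

```python
pace_avg = []
for month in range(1, 7):
    month_df = strava_df[strava_df['Month'] == month]
    # average minutes per kilometre for the month, as an integer
    pace_avg.append(int((month_df['Elapsed Time'] / month_df['Distance']).mean()))

for i in range(len(pace_avg)):
    pace_avg[i] = str(pace_avg[i]) + 'km'
```

Now, onto the animated plot. Here's a rough sketch of it (the figure size, axis limits, month names, and colour handling are my own guesses):

```python
import numpy as np
from matplotlib.animation import FuncAnimation

months = ['January', 'February', 'March', 'April', 'May', 'June']

fig, ax = plt.subplots(figsize=(10, 6))
ax.set_xlim(0, 22)   # guessed from the data ranges
ax.set_ylim(0, 150)
ax.set_xlabel('Distance (km)')
ax.set_ylabel('Elapsed Time (minutes)')
scat = ax.scatter([], [])

def animate(i):
    # frame i shows every run up to and including month i + 1
    month = min(i + 1, 6)
    month_df = strava_df[strava_df['Month'] <= month]
    ax.set_title(f'{months[month - 1]}, avg pace: {pace_avg[month - 1]}')
    scat.set_offsets(np.c_[month_df['Distance'], month_df['Elapsed Time']])
    # rows are date-ordered, so the first len(month_df) colours line up
    scat.set_color(col_list[:len(month_df)])
    return scat,

# keep a reference so the animation isn't garbage collected
anim = FuncAnimation(fig, animate, frames=7, interval=1000)
plt.show()
```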
Let's break this down. First we need to import `FuncAnimation` from matplotlib, and numpy (a numerical package, something like MATLAB). We then initialise the figure size, the axes limits, the title and axes labels.
For FuncAnimation we need to create an `animate` function, essentially what we would like to animate. FuncAnimation needs a few arguments: namely, it needs the figure to animate on, and an animate loop to plot over. Our animate loop first updates the title of the plot to reflect the month and pace plotted (hence the loops above) and then loops through `strava_df` to extract the 'Distance' and 'Elapsed Time' data. This gets plotted and the limits are kept in check, while the colours are updated with `scat.set_color()`. Finally, `FuncAnimation` is called using the figure `fig`, the function `animate`, over seven separate frames with 1000 milliseconds in between each frame. With `plt.show()` we get the following:

It seems that up until June my pace was improving. I think the rise in the last month was down to a few interval sessions where I never paused my watch, so it recorded me walking in between intervals. Otherwise though, I think we've created a fairly nice graph that displays exactly what I was hoping it would.
3. Conclusion
Most people suggest you should pick a project and then learn to code by working on it. Once you get used to reading documentation the easy stuff comes easily, and resources online can help guide you through the trickier parts of your work. I always found that advice difficult to swallow though. I like a structured programme. But I've found that a lot of structured programmes don't give you a whole lot to work with. They introduce a topic, get you to do a few problems, then move on. Usually at the end of a module all the material is tested in a slightly larger, more coherent problem, but that's about it. If you're learning Python, it's important to find some small projects to work on that test what you've learned and reinforce those lessons. Mine just happened to be a handy dataset that I had from a website I wasn't willing to pay a premium for.
_If anything here is unclear or you'd like elaboration, please let me know. I'd be happy to help. Further, if there's any interest in a full tutorial on Python, Pandas, Numpy, Matplotlib, and Data Analysis, I'd be happy to write those up, just let me know! The link to the code used to output all of the above is found here._