Using Pandas and Seaborn to compare ride counts and trip duration in 2019 and 2020 to see how usage has changed.
By comparing ride counts from 2019 and 2020 we can see the effects of the pandemic on New York City bike share usage. The bike share system, Citi Bike, makes tripdata files available on its website so that anyone can analyze how the system is used.

In my earlier article Exploring NYC Bike Share Data I went through the Exploratory Data Analysis process for March 2020, the month when New York City shut down. In a follow-up article I used September 2020, when ridership had recovered, but I realized I really needed to look at two full years of data to understand the changes the pandemic caused.
While I was working on this article I got an email from the staff of Sam Schwartz (known to New Yorkers as "Gridlock Sam"), who recently researched Citi Bike ridership. It’s available at Shifting Gears: Citi Bike Ridership Surge Continues over the Winter and there are also links to earlier articles on this topic. However, their charts were produced using a commercial data visualization tool.
I wanted to do some analysis using my preferred open source tools: Pandas for analysis and Seaborn for graphs. Then anyone can reproduce and expand on my analysis.
All the source code shown below is available in a Jupyter notebook on Github as twoyears.
Download and Prepare Citi Bike Trip Data
The Citi Bike System Data page describes the information that Citi Bike makes available. The specific information for each ride is:
- Trip Duration (seconds)
- Start and End Time and Date
- Start and End Station ID and Name
- Start and End Station Latitude and Longitude
- Bike ID
- User Type (Customer = day pass or single ride; Subscriber = Annual Member)
- Gender (Zero=unknown; 1=male; 2=female)
- Year of Birth
This can be augmented with some derived columns to simplify analysis:
- Trip Duration (minutes)
- Start Year, Day of the Month, Hour of the Day
- Weekend Indicator (Boolean)
- Age (omitting riders 75 and older and with default year of 1969)
- Distance (calculated using Haversine formula)
These calculations were done as part of the Exploratory Data Analysis process. Now I needed to run the same calculations for two years of monthly data. I’ve saved the Python programs in a Jupyter notebook on GitHub as gettripdata.
Each month’s tripdata file is saved as a Parquet file with the same name as the original CSV file but with an extension of .parquet
. I saved all 24 monthly files into a tripdata
sub-directory. This lets me read individual months using the complete file name, and all the data by using the directory name.
Parquet is a column store that uses substantially less space than a CSV file. It’s also a lot faster to read, especially when selecting just a few columns, so I can just read the columns I need for any specific analysis.
Parquet is well integrated with Pandas. To read and write Parquet files you only need to install one library:
conda install pyarrow
Monthly Ridership
To look at monthly ridership totals I need the year
and month
columns, but to break the data down further I also need usertype
and gender.
I read them into a DataFrame:
df = pd.read_parquet('tripdata',
columns=['year','month','gender','usertype'])
df.shape
The shape
tells me that there were just over 40 million rides in the two years.
(40058554, 4)
While I could use Seaborn countplot
it has to scan the entire DataFrame to do so. Since it was an iterative process to get the chart, I found it was faster to aggregate the data just once, using Pandas groupby
to count the number of rides by year and month. To display the data in a wide format use unstack()
.
counts = df.groupby(["year", "month"]).size()
counts.unstack()

The object created by groupby
has a multi-index on year
and month
. I want to use those values in my chart, so I’ll convert them to columns. I’ll also rename the Ride Count
column.
counts=counts.reset_index( level = ['year','month'])
counts.rename(columns = {0:'Ride Count'}, inplace=True)
Now I can use barplot
to visualize the data, but creating an attractive chart similar to the one in the "Shifting Gears" article took a few additional steps:
- To show the months as text instead of numbers create a variable
months
with the three-letter month abbreviations. - To make the bars two shades of blue to show the years use
hue='year', palette='Paired'
-
To get the years to appear vertically on each bar use
annotate
and loop through the bars to annotate each one. Note: most references show how to annotate a bar with its height, but that’s not the only way to use this feature. Here I used a countery
to put the current value ofyear
there.
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
plt.figure(figsize=(12,5))
ax=sns.barplot(data=counts, x="month" , y='Ride Count', hue='year', palette='Paired' )
ax.set_xlabel("Months")
ax.set_ylabel("Ride Counts in Millions")
ax.set_xticklabels(months)
ax.legend_.remove()
y=0
for p in ax.patches:
ax.annotate(counts.year[y], (p.get_x() + p.get_width()/ 2, 0),
rotation=90, ha='center', size=12, fontweight='bold',
xytext=(0, 10), textcoords='offset points')
y+=1

After declines in the spring and summer, ridership grew in September and continued to grow in the fall and winter. To calculate the percentage change use:
100 * counts.unstack().pct_change().round(3)[1:]

Ridership by Gender and User Type
The analysts noted that the share of female riders has increased. To view this change I can use the same DataFrame and aggregate the rides by usertype
and gender
.
countsg=df.groupby(["usertype","year", "month","gender"])
.size().unstack()
countsg.head()
As a reminder, Citi Bike distinguishes between Subscriber (Annual Member) and Customer (one trip or daily user) and the values for gender are 0 (unspecified), 1 (male) and (2) female.

Calculate the percentage of female riders by first summing across the columns to get the total number of riders, then dividing the count of female riders by the total and multiplying by 100 to get the percentage:
countsg['total'] = countsg.sum(axis=1)
countsg['pct female'] = countsg[2] / countsg.total * 100
countsg.head()

For charting I’ll convert the multi-indexes to columns:
countsg=countsg.reset_index( level = ['usertype','year','month'])
To show the years side-by-side I used catplot
and to show Subscribers and Customers separately I created a list usertypes
of the two types that I could then iterate over. To make the comparison clear I set the maximum of both charts to the same value, 40%.
usertypes=['Subscriber','Customer']
sns.set(font_scale = 2)
for u in usertypes:
ax=sns.catplot(data=countsg[countsg.usertype==u],
x="month", y="pct female", col="year",
color='Green', kind='bar', height=5, aspect=12/5 )
ax.set_axis_labels("", u + a"n% Female")
ax.set_xticklabels(months)
ax.set_titles('{col_name}')
ax.set(ylim=(0,40))
This does indeed show the percentage of female riders increasing, for both types of users, but especially for Subscribers (annual members).

One thing it doesn’t show, however, is the percentage of rides by customers (daily users), which has always been much lower than by subscribers (annual members). To find out if this has changed over time I did another groupby
on usertype
:
count_type=df.groupby(["year", "month", "usertype"])
.size().unstack()
count_type.head()

Then I created another DataFrame with percentage of rides by customers:
custpct = count_type.Customer / (count_type.Customer +
count_type.Subscriber) * 100
custpct = custpct.reset_index(level = ['year','month'])
custpct.rename(columns = {0:'Customer Percent'}, inplace=True)
Use catplot
again to get two side-by-side charts:
sns.set(font_scale = 3)
ax = sns.catplot(data=custpct,
x="month", y="Customer Percent", col="year", dodge=False,
color='blue', kind='bar', height=10, aspect=20/10 )
ax.set_axis_labels("", "Customern% Rides")
ax.set_xticklabels(months)
ax.set_titles('{col_name}')
ax.set(ylim=(0,40)) ;
Here I can see that the ratio of Customers to Subscribers has increased from 2019 to 2020, especially over the summer.

Ridership by Hour of Day
Another change the analysts noted was the distribution of rides over the course of the day. This analysis requires different columns then those above, so I’ll go back to the files and create a new DataFrame with year
, weekday
and start hour
.
df = pd.read_parquet('tripdata',
columns=['year','weekday','start hour'])
Aggregate by the same columns, change the multi-index to columns, and rename the Ride Count
column.
count_hr = df.groupby(["year", "weekday","start hour"]).size()
count_hr=count_hr.reset_index(level=['year','weekday','start hour'])
count_hr.rename(columns = {0:'Ride Count'}, inplace=True)
count_hr['Ride Count'] /= 1000
Since catplot
doesn’t have a line chart I’m using subplot
to create a lineplot
for each day of the week. I’ll iterate over the day numbers, from 0 (Monday) to 6 (Sunday), using the day of the week to customize the chart and labels.
sns.set(font_scale = 1)
fig, ax = plt.subplots(1,7, figsize=(20,5))
days=['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
for d in range(len(days)):
sns.lineplot(ax=ax[d], data=count_hr[count_hr.weekday==d],
linewidth=4,x="start hour", y="Ride Count", hue='year',
palette='Paired')
.set_title(days[d]) ;
if ~d:
ax[d].set_ylabel('Ride Count in Thousands')
if d:
ax[d].set_ylabel('')
ax[d].set_yticklabels('')
if d != 3:
ax[d].set_xlabel('')
if d < 6:
ax[d].legend_.remove()
Here we can see the changes from 2019 to 2020. The rush hour peaks are much less pronounced in 2020 than they were in 2019, especially the morning peaks. The ride counts on weekends have also increased. Both of these trends indicate a shift from commuting trips to leisure rides.

Note that these charts reflect the entire year; individual months would look different.
Trip Duration
One more indication of this shift is how long the rides last. The analysts note that in December, "The average trip duration was longer by over 20% – 17 minutes versus 14 minutes." While I confirmed that number is correct, I think it’s misleading. In my analysis I’ve seen outliers for trip duration: bikes that were not returned or docked properly, resulting in exceedingly long trip times.
Instead of average I think it’s better to use the median trip duration, as the mean is thrown off by those large values.
I’ll re-read the tripdata files to get year
, month
and tripduration
and convert the duration from seconds to minutes. Then do a groupby
, but instead of using size()
to get a count of records I’ll use agg('median')
to get the median trip duration for each group.
df = pd.read_parquet('tripdata',
columns=['year','month','tripduration'])
df.tripduration /= 60
dur = df.groupby(["year", "month"]).agg('median')
dur.unstack().round(2) # just to display

Then I can create a chart similar to the first one:
dur.reset_index( level = ['year','month'], inplace=True)
sns.set(font_scale = 1)
plt.figure(figsize=(12,5))
ax=sns.barplot(data=counts, x="month" , y='tripminutes', hue='year', palette='Paired' )
ax.set_xlabel("Months")
ax.set_ylabel("Trip Duration in Minutes")
ax.set_xticklabels(months)
ax.legend_.remove()
y=0
for p in ax.patches:
ax.annotate(dur.year[y], (p.get_x()+p.get_width()/ 2, 0),
rotation=90, ha='center', size=12, fontweight='bold',
xytext=(0, 10), textcoords='offset points')
y+=1
Using the median rather than mean does show that trip durations increased from 2019 to 2020. However the increase is more pronounced in the summer than it is in December.

Indeed if I look at the percent change from 2019 to 2020 I see the change in December was just 18%.
dur.unstack().pct_change().round(2)[1:] * 100

Conclusion
Using Pandas and Seaborn I can do the same kind of analysis and visualization that was done with a commercial product. By analyzing the data with these libraries anyone can reproduce my results and do their own additional analysis.
The pandemic caused a substantial decline in Citi Bike usage during the initial lockdown in the spring of 2020. Ridership rebounded in the summer and surpassed the prior year’s usage in the fall.
Weekday usage has shifted with lower peaks during rush hours and more ridership in the afternoon and evening, indicating a shift from commuting to leisure rides.