Exploring the Effects of the Pandemic on NYC Bike Share Usage

Using Pandas and Seaborn to compare ride counts and trip duration in 2019 and 2020 to see how usage has changed.

Clif Kranish
Towards Data Science

--

By comparing ride counts from 2019 and 2020 we can see the effects of the pandemic on New York City bike share usage. The bike share system, Citi Bike, makes tripdata files available on its website so that anyone can analyze how the system is used.

Citi Bike Ride Counts for 2019/2020 — All images by author

In my earlier article Exploring NYC Bike Share Data I went through the Exploratory Data Analysis process for March 2020, the month when New York City shut down. In a follow-up article I used September 2020, when ridership had recovered, but I realized I really needed to look at two full years of data to understand the changes the pandemic caused.

While I was working on this article I got an email from the staff of Sam Schwartz (known to New Yorkers as “Gridlock Sam”), who recently researched Citi Bike ridership. It’s available at Shifting Gears: Citi Bike Ridership Surge Continues over the Winter and there are also links to earlier articles on this topic. However, their charts were produced using a commercial data visualization tool.

I wanted to do some analysis using my preferred open source tools: Pandas for analysis and Seaborn for graphs. Then anyone can reproduce and expand on my analysis.

All the source code shown below is available in a Jupyter notebook on Github as twoyears.

Download and Prepare Citi Bike Trip Data

The Citi Bike System Data page describes the information that Citi Bike makes available. The specific information for each ride is:

  • Trip Duration (seconds)
  • Start and End Time and Date
  • Start and End Station ID and Name
  • Start and End Station Latitude and Longitude
  • Bike ID
  • User Type (Customer = day pass or single ride; Subscriber = Annual Member)
  • Gender (Zero=unknown; 1=male; 2=female)
  • Year of Birth

This can be augmented with some derived columns to simplify analysis:

  • Trip Duration (minutes)
  • Start Year, Day of the Month, Hour of the Day
  • Weekend Indicator (Boolean)
  • Age (omitting riders 75 and older and with default year of 1969)
  • Distance (calculated using Haversine formula)

These calculations were done as part of the Exploratory Data Analysis process. Now I needed to run the same calculations for two years of monthly data. I’ve saved the Python programs in a Jupyter notebook on GitHub as gettripdata.

Each month’s tripdata file is saved as a Parquet file with the same name as the original CSV file but with an extension of .parquet. I saved all 24 monthly files into a tripdata sub-directory. This lets me read individual months using the complete file name, and all the data by using the directory name.

Parquet is a column store that uses substantially less space than a CSV file. It’s also a lot faster to read, especially when selecting just a few columns, so I can just read the columns I need for any specific analysis.

Parquet is well integrated with Pandas. To read and write Parquet files you only need to install one library:

conda install pyarrow 

Monthly Ridership

To look at monthly ridership totals I need the year and month columns, but to break the data down further I also need usertype and gender. I read them into a DataFrame:

df = pd.read_parquet('tripdata',\
columns=['year','month','gender','usertype'])
df.shape

The shape tells me that there were just over 40 million rides in the two years.

(40058554, 4)

While I could use Seaborn countplot it has to scan the entire DataFrame to do so. Since it was an iterative process to get the chart, I found it was faster to aggregate the data just once, using Pandas groupby to count the number of rides by year and month. To display the data in a wide format use unstack().

counts = df.groupby(["year", "month"]).size()
counts.unstack()
Ride Counts by Year and Month

The object created by groupby has a multi-index on year and month. I want to use those values in my chart, so I’ll convert them to columns. I’ll also rename the Ride Count column.

counts=counts.reset_index( level = ['year','month'])
counts.rename(columns = {0:'Ride Count'}, inplace=True)

Now I can use barplot to visualize the data, but creating an attractive chart similar to the one in the “Shifting Gears” article took a few additional steps:

  • To show the months as text instead of numbers create a variable months with the three-letter month abbreviations.
  • To make the bars two shades of blue to show the years use hue=’year’, palette=’Paired’
  • To get the years to appear vertically on each bar use annotate and loop through the bars to annotate each one.
    Note: most references show how to annotate a bar with its height, but that’s not the only way to use this feature. Here I used a counter y to put the current value of year there.
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']plt.figure(figsize=(12,5))
ax=sns.barplot(data=counts, x="month" , y='Ride Count', hue='year', palette='Paired' )
ax.set_xlabel("Months")
ax.set_ylabel("Ride Counts in Millions")
ax.set_xticklabels(months)
ax.legend_.remove()
y=0
for p in ax.patches:
ax.annotate(counts.year[y], (p.get_x() + p.get_width()/ 2, 0),\
rotation=90, ha='center', size=12, fontweight='bold',\
xytext=(0, 10), textcoords='offset points')
y+=1
Citi Bike Monthly Ride Counts — 2019 and 2020

After declines in the spring and summer, ridership grew in September and continued to grow in the fall and winter. To calculate the percentage change use:

100 * counts.unstack().pct_change().round(3)[1:]
Percent change in ridership from 2019 to 2020

Ridership by Gender and User Type

The analysts noted that the share of female riders has increased. To view this change I can use the same DataFrame and aggregate the rides by usertype and gender.

countsg=df.groupby(["usertype","year", "month","gender"])\
.size().unstack()
countsg.head()

As a reminder, Citi Bike distinguishes between Subscriber (Annual Member) and Customer (one trip or daily user) and the values for gender are 0 (unspecified), 1 (male) and (2) female.

Ridership by User Type and Gender

Calculate the percentage of female riders by first summing across the columns to get the total number of riders, then dividing the count of female riders by the total and multiplying by 100 to get the percentage:

countsg['total'] = countsg.sum(axis=1)
countsg['pct female'] = countsg[2] / countsg.total * 100
countsg.head()
Percentage of Female Riders

For charting I’ll convert the multi-indexes to columns:

countsg=countsg.reset_index( level = ['usertype','year','month'])

To show the years side-by-side I used catplot and to show Subscribers and Customers separately I created a list usertypes of the two types that I could then iterate over. To make the comparison clear I set the maximum of both charts to the same value, 40%.

usertypes=['Subscriber','Customer']
sns.set(font_scale = 2)
for u in usertypes:
ax=sns.catplot(data=countsg[countsg.usertype==u],\
x="month", y="pct female", col="year",\
color='Green', kind='bar', height=5, aspect=12/5 )
ax.set_axis_labels("", u + a"\n% Female")
ax.set_xticklabels(months)
ax.set_titles('{col_name}')
ax.set(ylim=(0,40))

This does indeed show the percentage of female riders increasing, for both types of users, but especially for Subscribers (annual members).

Percentage of Female Riders

One thing it doesn’t show, however, is the percentage of rides by customers (daily users), which has always been much lower than by subscribers (annual members). To find out if this has changed over time I did another groupby on usertype:

count_type=df.groupby(["year", "month", "usertype"])\
.size().unstack()
count_type.head()
Ridership by Year, Month and User Type

Then I created another DataFrame with percentage of rides by customers:

custpct = count_type.Customer / (count_type.Customer +\
count_type.Subscriber) * 100
custpct = custpct.reset_index(level = ['year','month'])\
custpct.rename(columns = {0:'Customer Percent'}, inplace=True)

Use catplot again to get two side-by-side charts:

sns.set(font_scale = 3)
ax = sns.catplot(data=custpct,\
x="month", y="Customer Percent", col="year", dodge=False, \
color='blue', kind='bar', height=10, aspect=20/10 )
ax.set_axis_labels("", "Customer\n% Rides")
ax.set_xticklabels(months)
ax.set_titles('{col_name}')
ax.set(ylim=(0,40)) ;

Here I can see that the ratio of Customers to Subscribers has increased from 2019 to 2020, especially over the summer.

Percentage of Rides by Customers

Ridership by Hour of Day

Another change the analysts noted was the distribution of rides over the course of the day. This analysis requires different columns then those above, so I’ll go back to the files and create a new DataFrame with year, weekday and start hour.

df = pd.read_parquet('tripdata',\
columns=['year','weekday','start hour'])

Aggregate by the same columns, change the multi-index to columns, and rename the Ride Count column.

count_hr = df.groupby(["year", "weekday","start hour"]).size()
count_hr=count_hr.reset_index(level=['year','weekday','start hour'])
count_hr.rename(columns = {0:'Ride Count'}, inplace=True)
count_hr['Ride Count'] /= 1000

Since catplot doesn’t have a line chart I’m using subplot to create a lineplot for each day of the week. I’ll iterate over the day numbers, from 0 (Monday) to 6 (Sunday), using the day of the week to customize the chart and labels.

sns.set(font_scale = 1)
fig, ax = plt.subplots(1,7, figsize=(20,5))
days=['Mon','Tue','Wed','Thu','Fri','Sat','Sun']for d in range(len(days)):
sns.lineplot(ax=ax[d], data=count_hr[count_hr.weekday==d],\
linewidth=4,x="start hour", y="Ride Count", hue='year',\
palette='Paired')\
.set_title(days[d]) ;
if ~d:
ax[d].set_ylabel('Ride Count in Thousands')
if d:
ax[d].set_ylabel('')
ax[d].set_yticklabels('')
if d != 3:
ax[d].set_xlabel('')
if d < 6:
ax[d].legend_.remove()

Here we can see the changes from 2019 to 2020. The rush hour peaks are much less pronounced in 2020 than they were in 2019, especially the morning peaks. The ride counts on weekends have also increased. Both of these trends indicate a shift from commuting trips to leisure rides.

2019 and 2020 — Ride Counts by Day of Week and Hour of Day

Note that these charts reflect the entire year; individual months would look different.

Trip Duration

One more indication of this shift is how long the rides last. The analysts note that in December, “The average trip duration was longer by over 20% — 17 minutes versus 14 minutes.” While I confirmed that number is correct, I think it’s misleading. In my analysis I’ve seen outliers for trip duration: bikes that were not returned or docked properly, resulting in exceedingly long trip times.

Instead of average I think it’s better to use the median trip duration, as the mean is thrown off by those large values.

I’ll re-read the tripdata files to get year, month and tripduration and convert the duration from seconds to minutes. Then do a groupby, but instead of using size() to get a count of records I’ll use agg('median') to get the median trip duration for each group.

df = pd.read_parquet('tripdata',\
columns=['year','month','tripduration'])
df.tripduration /= 60
dur = df.groupby(["year", "month"]).agg('median')
dur.unstack().round(2) # just to display
Median Trip Duration in Minutes by Year and Month

Then I can create a chart similar to the first one:

dur.reset_index( level = ['year','month'], inplace=True)
sns.set(font_scale = 1)
plt.figure(figsize=(12,5))

ax=sns.barplot(data=counts, x="month" , y='tripminutes', hue='year', palette='Paired' )
ax.set_xlabel("Months")
ax.set_ylabel("Trip Duration in Minutes")
ax.set_xticklabels(months)
ax.legend_.remove()
y=0
for p in ax.patches:
ax.annotate(dur.year[y], (p.get_x()+p.get_width()/ 2, 0),\
rotation=90, ha='center', size=12, fontweight='bold',\
xytext=(0, 10), textcoords='offset points')
y+=1

Using the median rather than mean does show that trip durations increased from 2019 to 2020. However the increase is more pronounced in the summer than it is in December.

Median Trip Duration in Minutes by Year and Month

Indeed if I look at the percent change from 2019 to 2020 I see the change in December was just 18%.

dur.unstack().pct_change().round(2)[1:] * 100
Percent change in median Trip Duration 2019–2020

Conclusion

Using Pandas and Seaborn I can do the same kind of analysis and visualization that was done with a commercial product. By analyzing the data with these libraries anyone can reproduce my results and do their own additional analysis.

The pandemic caused a substantial decline in Citi Bike usage during the initial lockdown in the spring of 2020. Ridership rebounded in the summer and surpassed the prior year’s usage in the fall.

Weekday usage has shifted with lower peaks during rush hours and more ridership in the afternoon and evening, indicating a shift from commuting to leisure rides.

--

--

Explorer of data and bicycle routes. Leader in agile product management for data access and data preparation.