The world’s leading publication for data science, AI, and ML professionals.

Interpolating NYC Bike Share Data to Discover Rebalancing Movements

Using Pandas concat to restructure Citi Bike trip data

To ensure there are bikes (and docks) available when needed, Citi Bike, like most bike share systems, rebalances or moves bikes from where there are too many to where they are needed. Citi Bike doesn’t disclose data about when and where bikes are moved, but as it turns out these movements can be discovered.

Nearly empty Citi Bike Station - Photo by author
Nearly empty Citi Bike Station – Photo by author

This article is part of a series of explorations into the bike share data provided by Citi Bike. If you you find this article interesting you may also want to read:

Exploring NYC Bike Share Data – An introduction to data preparation and analysis and how to get started with the tools used: Jupyter, Python, Pandas and Seaborn.

Reverse Geocoding with NYC Bike Share Data – A tutorial on how to augment the data Citi Bike provides with the borough, neighborhood and zip code.

Exploring the Effects of the Pandemic on NYC Bike Share Usage – A look at two years of data and how ridership has changed in this time period.

Download Citi Bike Trip Data

Citi Bike provides tripdata files so that anyone can analyze how the system is used. The Citi Bike System Data page describes the information provided. Each file contains one month’s data; each record represents one trip.

For this article I’m using the tripdata file for September 2020. After a falloff in the first months of the pandemic, usage had recovered and it was the busiest month to date.

On Windows use the link above to find and download the September tripdata file for New York and then uznip it to a bikeshare directory.

On Linux from a command prompt create a bikeshare directory, download the tripdata file, unzip it and then delete the zip file.

mkdir bikeshare && cd bikeshare 
wget https://s3.amazonaws.com/tripdata/202009-citibike-tripdata.csv.zip
unzip 202009-citibike-tripdata.csv.zip
rm 2020009-citibike-tripdata.csv.zip

Import Libraries and Data

Start Jupyter and from a browser window create a new notebook in your bikeshare directory. The Jupyter notebook with the Python code used in this article is on GitHub as rebalancing.ipynb.

Import the common libraries used: Pandas for analysis and Seaborn for charts.

import pandas as pd
from pandas import to_datetime
import seaborn as sns
import matplotlib.pyplot as plt

Read the Citi Bike trip data file into a Pandas dataframe. For this analysis we only need the start and end station ID and times.

df = pd.read_csv('~/bikeshare/202009-citibike-tripdata.csv',
     usecols=['starttime','start station id',
              'stoptime','end station id','bikeid'],
     parse_dates=['starttime','stoptime'])
df.info()

I see that there were almost 2½ million rides in September.

Tripdata info for selected columns
Tripdata info for selected columns

Ghost Riders

Citi Bike is a traditional system where a user picks up a bike from one station and drops it off at another.

When analyzing the trip data files I can follow each bike’s movement from station to station across the system if I create a new dataframe and sort it by bikeid and by starttime.

dfbike=df.sort_values(by=['bikeid','starttime'])
dfbike.head(10)

Now I can follow a bike’s journeys throughout the month. The end station id of one ride is (almost) always the start station id of the next ride.

Trips sorted by bikeid and starttime - image by author
Trips sorted by bikeid and starttime – image by author

If it’s not the same then the bike was (somehow) moved from one station to another between rides.

While I could find these "ghost rides" by looping through the dataframe and comparing each trip with the next one, with almost 2½ million rides that would take a while.

A better way is to take advantage of Panda’s vector operations and create a dataframe with pairs of rides.

To do this I’ll create an offset dataframe with a single dummy record:

offset = pd.DataFrame({'starttime': pd.to_datetime('2010-09-01'),
  'start station id':0,'stoptime': pd.to_datetime('2010-09-01'),
  'end station id':0,'bikeid':0},index=[0])

Then create two new dataframes, one with offset first, one with it last:

dfbike1 = pd.concat([offset,dfbike]).reset_index(drop=True)
dfbike2 = pd.concat([dfbike,offset]).reset_index(drop=True)

Combine those two dataframes, but side-by-side using axis=1

dfbike=pd.concat ([dfbike1[['bikeid','stoptime','end station id']]
            ,dfbike2[['bikeid','starttime','start station id']] ],
             axis=1 )
dfbike.head()

Here each record represents the time period that a bike was docked. I can see that one ride ended at station 388 and the next one started at the same station. It then was then docked at station 480, and so on.

image by author
image by author

But as I track the bike on its journey I found that one day the bike was moved! It was docked at station 3906 one day and picked up at station 4123 the next.

Image by author
Image by author

To detect these movements I want to create a dataframe of these "ghost rides" as though they were actual rides. That requires swapping the start/stop names. Then I select records where the bikeid is the same but the station id values are different.

dfbike.columns=['bikeid1','starttime','start station id',
                'bikeid2','stoptime','end station id']
dfrebal = dfbike[['starttime','start station id',
                  'stoptime','end station id']].
           loc[(dfbike.bikeid1==dfbike.bikeid2) & 
          (dfbike['start station id'] != dfbike['end station id']) ]
dfrebal.reset_index(drop=True, inplace=True)
dfrebal
rebalancing movements for September 2020 - image by author
rebalancing movements for September 2020 – image by author

Each row of this dataframe represents a bike that was somehow moved from one station to another for rebalancing, repairs or some other purpose.

Now that I have this dataframe I’m going to save it as a Parquet file so that I can use it in my next article Estimating Bike Share Availability from NYC Bike Share Data.

This requires pyarrow (install by: conda install -c conda-forge pyarrow)

dfrebal.to_parquet('202009-citibike-reblance.parquet')

Note that the report above shows that close to 40 thousand bikes were rebalanced in September. Is that number reasonable?

The Citi Bike Monthly Operating Reports page has links to operating reports for each month of operation. For September 2020 it says:

Citi Bike staff rebalanced a total of 51,179 bicycles during the month of September… utiliz[ing] box trucks, vans, contracted trikes, articulated trikes (‘bike trains’), valets, and member incentives (‘Bike Angels’) to redistribute bikes system-wide.

So although the official count of over 51,179 is higher than what I found, it also includes rides by "Bike Angels," members who are incentivized to ride bikes for the purpose of rebalancing. Those rides would appear as "regular" rides in the tripdata files. I would say my derived counts are close enough for government work.

Which stations get rebalanced?

Now that I know how many bikes are rebalanced, I want to know which stations have the most rebalancing.

That information is more meaningful with the station names rather than the numeric station IDs, so I’ll go back to the tripdata file and create a table of station IDs and names.

dfstations = 
  pd.read_csv('~/bikeshare/202009-citibike-tripdata.csv',
  usecols=['start station id','start station name']).
  drop_duplicates()                
dfstations.columns=['station id','station name']
dfstations.set_index('station id',drop=True, inplace=True)

I’ll use Pandas merge to get start and end station names.

dfrebal = pd.merge(dfrebal, dfstations[['station name']],
     how = 'left', left_on='start station id', right_on='stationid')
dfrebal = pd.merge(dfrebal, dfstations[['station name']],
     how = 'left', left_on='end station id', right_on='stationid')

And rename the column names from the generated defaults.

dfrebal.rename(columns = 
        {'station name_x':'start station name',
         'station name_y':'end station name'},
          inplace = True)

Create two series with the top 20 stations with bikes moved in and out.

rebalin = dfrebal['end station name'].value_counts()[:20]
rebalout = dfrebal['start station name'].value_counts()[:20]

Use barplot to show the total count of bikes moved out of the top 20 stations, and similarly for bikes moved in.

plt.figure(figsize=(10,8))
plt.title('Citi Bike Rebalancing - September 2020
 - Top 20 Stations bikes moved OUT'  ) 
plt.xlabel('Count of bikes moved') 
sns.barplot( x=rebalout.values, y=rebalout.index,  orient="h" ) ;
Count of bikes moved out - image by author
Count of bikes moved out – image by author
Count of bikes moved in - image by author
Count of bikes moved in – image by author

After looking at the counts of bikes moved by station, I realized it would be more interesting to see the counts by neighborhood. Citi Bike doesn’t identify stations by neighborhood, but as I described in an earlier article Reverse Geocoding with NYC Bike Share Data it’s possible to do so using the latitude and longitude provided. See the Jupyter notebook rebalancing.ipynb for the steps to create these charts.

Neighborhood counts - Image by author
Neighborhood counts – Image by author
Neighborhood counts - image by author
Neighborhood counts – image by author

The charts show that bikes are more often moved out from stations downtown, predominantly a business area; while bikes are moved to stations uptown, which is more residential, and to midtown.

Rebalancing by Time of Day

When does the rebalancing take place? That information isn’t available from Citi Bike nor can it be determined from the tripdata files. I know what time a bike was dropped off before it was moved, and what time it was first picked up after it was moved, but the actual move could have happened any time in between, even days apart. Since that’s all the data I have, I’ll use it.

Start by adding columns with the hours corresponding to the recorded time.

dfrebal['starthour'],dfrebal['endhour'] = 
    dfrebal.starttime.dt.hour, dfrebal.stoptime.dt.hour

To create a bar chart showing the number of bikes moved in and out each hour, I could use countplot but the weren’t rebalances every hour so the chart is missing bars. Instead I can do the counting first and generate counts of zero when there were no movements.

This lets me see the counts in a wide format with one row for each station and one column for each hour of the day.

pd.set_option('display.max_columns', 24)
dfrebal.value_counts(subset=['start station id', 'starthour'])
        .unstack().astype('Int64').fillna(0).head(10)

There’s a lot to unpack here so I’ll break it down:

  • .value_counts(subset=['start station id', 'starthour']) counts number of trips by start station id and starthour.
  • .unstack() pivots starthour from rows to columns. When there are no rides for an hour NaN is filled in ("Not a Number").
  • .astype('Int64') Whenever there are NaN values Pandas converts the data to float64 which is not suitable for a count. This converts data back to an Integer format Int64 with missing values represented as <NA>.
  • .fillna(0) replaces the missing or <NA>values with zero.
  • .head(10) shows the first ten rows.

That statement generates a wide report like this, by start station id and across starthour.

Wide report of rebalance counts by station across hour - image by author
Wide report of rebalance counts by station across hour – image by author

The last step is to pivot the data back to the long format using stack(), which I’ll do for the bikes moved in and out of the station.

Convert the data from a series to a dataframe and add a new column Movementto indicate the direction of the movement (In or Out) and concatenate the two dataframes into one to easily show both types of movements in a single chart.

startcounts = 
    dfrebal.value_counts(subset=['start station id','starthour'])
   .unstack().astype('Int64').fillna(0).stack()
startcounts=pd.DataFrame(startcounts).assign(Movement='Out')
endcounts = 
    dfrebal.value)counts(subset=['end station id', 'endhour'])
   .size().astype('Int64').unstack().fillna(0).stack()
endcounts=pd.DataFrame(endcounts).assign(Movement='In')
rebalcounts=pd.concat([startcounts, endcounts])
rebalcounts.columns=['Bike Count','Movement']

To create charts for multiple stations I put the code to generate a chart into a function. It requires the two dataframes, stationrebal and dfstations and takes a station name as a parameter.

def stationrebal (stationname):
    plt.figure(figsize=(12,5))
    if stationname not in list(dfstations['station name']):
        raise RuntimeError("Station name not found.")
    plt.suptitle('Citi Bike Rebalanancing - ' +  stationname   )         
    ax=sns.barplot(data=rebalcounts.loc[station], 
                x=list(rebalcounts.loc[station].index),
                y="Bike Count", hue="Movement" ,
                palette=["darkred", "darkgreen"], edgecolor = 'w')  
    ax.set(xlabel="Hour of Day", ylabel = "Count of bikes moved") ;

Here’s the chart for the station with the most bikes removed for rebalancing:

Top station for bikes moved out - image by author
Top station for bikes moved out – image by author

This station is in the middle of Soho, which is both a residential and retail district. The chart shows that bikes are mostly moved out of this station, with some in the morning and many in the afternoon.

And here’s the station with the most bikes added:

Top station for bikes moved in - image by author
Top station for bikes moved in – image by author

In this primarily residential area, the chart shows that bikes are moved in to the station in the morning and less so in the afternoon.

Conclusion

Even though Citi Bike doesn’t provide information about bikes moved manually for rebalancing and other purposes, it can be derived from the tripdata by looking at the movements for individual bikes and where their location changed without a ride.

This information is more meaningful in the aggregate (by neighborhood) than it is for individual stations.

We can get a rough idea when bikes are rebalanced at a station by looking at when the bikes were taken for a ride.


Related Articles