To ensure there are bikes (and docks) available when needed, Citi Bike, like most bike share systems, rebalances or moves bikes from where there are too many to where they are needed. Citi Bike doesn’t disclose data about when and where bikes are moved, but as it turns out these movements can be discovered.

This article is part of a series of explorations into the bike share data provided by Citi Bike. If you you find this article interesting you may also want to read:
Exploring NYC Bike Share Data – An introduction to data preparation and analysis and how to get started with the tools used: Jupyter, Python, Pandas and Seaborn.
Reverse Geocoding with NYC Bike Share Data – A tutorial on how to augment the data Citi Bike provides with the borough, neighborhood and zip code.
Exploring the Effects of the Pandemic on NYC Bike Share Usage – A look at two years of data and how ridership has changed in this time period.
Download Citi Bike Trip Data
Citi Bike provides tripdata files so that anyone can analyze how the system is used. The Citi Bike System Data page describes the information provided. Each file contains one month’s data; each record represents one trip.
For this article I’m using the tripdata file for September 2020. After a falloff in the first months of the pandemic, usage had recovered and it was the busiest month to date.
On Windows use the link above to find and download the September tripdata file for New York and then uznip it to a bikeshare
directory.
On Linux from a command prompt create a bikeshare
directory, download the tripdata file, unzip it and then delete the zip file.
mkdir bikeshare && cd bikeshare
wget https://s3.amazonaws.com/tripdata/202009-citibike-tripdata.csv.zip
unzip 202009-citibike-tripdata.csv.zip
rm 2020009-citibike-tripdata.csv.zip
Import Libraries and Data
Start Jupyter and from a browser window create a new notebook in your bikeshare
directory. The Jupyter notebook with the Python code used in this article is on GitHub as rebalancing.ipynb.
Import the common libraries used: Pandas for analysis and Seaborn for charts.
import pandas as pd
from pandas import to_datetime
import seaborn as sns
import matplotlib.pyplot as plt
Read the Citi Bike trip data file into a Pandas dataframe. For this analysis we only need the start and end station ID and times.
df = pd.read_csv('~/bikeshare/202009-citibike-tripdata.csv',
usecols=['starttime','start station id',
'stoptime','end station id','bikeid'],
parse_dates=['starttime','stoptime'])
df.info()
I see that there were almost 2½ million rides in September.

Ghost Riders
Citi Bike is a traditional system where a user picks up a bike from one station and drops it off at another.
When analyzing the trip data files I can follow each bike’s movement from station to station across the system if I create a new dataframe and sort it by bikeid
and by starttime
.
dfbike=df.sort_values(by=['bikeid','starttime'])
dfbike.head(10)
Now I can follow a bike’s journeys throughout the month. The end station id
of one ride is (almost) always the start station id
of the next ride.

If it’s not the same then the bike was (somehow) moved from one station to another between rides.
While I could find these "ghost rides" by looping through the dataframe and comparing each trip with the next one, with almost 2½ million rides that would take a while.
A better way is to take advantage of Panda’s vector operations and create a dataframe with pairs of rides.
To do this I’ll create an offset
dataframe with a single dummy record:
offset = pd.DataFrame({'starttime': pd.to_datetime('2010-09-01'),
'start station id':0,'stoptime': pd.to_datetime('2010-09-01'),
'end station id':0,'bikeid':0},index=[0])
Then create two new dataframes, one with offset first, one with it last:
dfbike1 = pd.concat([offset,dfbike]).reset_index(drop=True)
dfbike2 = pd.concat([dfbike,offset]).reset_index(drop=True)
Combine those two dataframes, but side-by-side using axis=1
dfbike=pd.concat ([dfbike1[['bikeid','stoptime','end station id']]
,dfbike2[['bikeid','starttime','start station id']] ],
axis=1 )
dfbike.head()
Here each record represents the time period that a bike was docked. I can see that one ride ended at station 388 and the next one started at the same station. It then was then docked at station 480, and so on.

But as I track the bike on its journey I found that one day the bike was moved! It was docked at station 3906 one day and picked up at station 4123 the next.

To detect these movements I want to create a dataframe of these "ghost rides" as though they were actual rides. That requires swapping the start/stop names. Then I select records where the bikeid
is the same but the station id
values are different.
dfbike.columns=['bikeid1','starttime','start station id',
'bikeid2','stoptime','end station id']
dfrebal = dfbike[['starttime','start station id',
'stoptime','end station id']].
loc[(dfbike.bikeid1==dfbike.bikeid2) &
(dfbike['start station id'] != dfbike['end station id']) ]
dfrebal.reset_index(drop=True, inplace=True)
dfrebal

Each row of this dataframe represents a bike that was somehow moved from one station to another for rebalancing, repairs or some other purpose.
Now that I have this dataframe I’m going to save it as a Parquet file so that I can use it in my next article Estimating Bike Share Availability from NYC Bike Share Data.
This requires pyarrow (install by:
conda install -c conda-forge pyarrow
)
dfrebal.to_parquet('202009-citibike-reblance.parquet')
Note that the report above shows that close to 40 thousand bikes were rebalanced in September. Is that number reasonable?
The Citi Bike Monthly Operating Reports page has links to operating reports for each month of operation. For September 2020 it says:
Citi Bike staff rebalanced a total of 51,179 bicycles during the month of September… utiliz[ing] box trucks, vans, contracted trikes, articulated trikes (‘bike trains’), valets, and member incentives (‘Bike Angels’) to redistribute bikes system-wide.
So although the official count of over 51,179 is higher than what I found, it also includes rides by "Bike Angels," members who are incentivized to ride bikes for the purpose of rebalancing. Those rides would appear as "regular" rides in the tripdata files. I would say my derived counts are close enough for government work.
Which stations get rebalanced?
Now that I know how many bikes are rebalanced, I want to know which stations have the most rebalancing.
That information is more meaningful with the station names rather than the numeric station IDs, so I’ll go back to the tripdata file and create a table of station IDs and names.
dfstations =
pd.read_csv('~/bikeshare/202009-citibike-tripdata.csv',
usecols=['start station id','start station name']).
drop_duplicates()
dfstations.columns=['station id','station name']
dfstations.set_index('station id',drop=True, inplace=True)
I’ll use Pandas merge
to get start and end station names.
dfrebal = pd.merge(dfrebal, dfstations[['station name']],
how = 'left', left_on='start station id', right_on='stationid')
dfrebal = pd.merge(dfrebal, dfstations[['station name']],
how = 'left', left_on='end station id', right_on='stationid')
And rename the column names from the generated defaults.
dfrebal.rename(columns =
{'station name_x':'start station name',
'station name_y':'end station name'},
inplace = True)
Create two series with the top 20 stations with bikes moved in and out.
rebalin = dfrebal['end station name'].value_counts()[:20]
rebalout = dfrebal['start station name'].value_counts()[:20]
Use barplot
to show the total count of bikes moved out of the top 20 stations, and similarly for bikes moved in.
plt.figure(figsize=(10,8))
plt.title('Citi Bike Rebalancing - September 2020
- Top 20 Stations bikes moved OUT' )
plt.xlabel('Count of bikes moved')
sns.barplot( x=rebalout.values, y=rebalout.index, orient="h" ) ;


After looking at the counts of bikes moved by station, I realized it would be more interesting to see the counts by neighborhood. Citi Bike doesn’t identify stations by neighborhood, but as I described in an earlier article Reverse Geocoding with NYC Bike Share Data it’s possible to do so using the latitude and longitude provided. See the Jupyter notebook rebalancing.ipynb for the steps to create these charts.


The charts show that bikes are more often moved out from stations downtown, predominantly a business area; while bikes are moved to stations uptown, which is more residential, and to midtown.
Rebalancing by Time of Day
When does the rebalancing take place? That information isn’t available from Citi Bike nor can it be determined from the tripdata files. I know what time a bike was dropped off before it was moved, and what time it was first picked up after it was moved, but the actual move could have happened any time in between, even days apart. Since that’s all the data I have, I’ll use it.
Start by adding columns with the hours corresponding to the recorded time.
dfrebal['starthour'],dfrebal['endhour'] =
dfrebal.starttime.dt.hour, dfrebal.stoptime.dt.hour
To create a bar chart showing the number of bikes moved in and out each hour, I could use countplot
but the weren’t rebalances every hour so the chart is missing bars. Instead I can do the counting first and generate counts of zero when there were no movements.
This lets me see the counts in a wide format with one row for each station and one column for each hour of the day.
pd.set_option('display.max_columns', 24)
dfrebal.value_counts(subset=['start station id', 'starthour'])
.unstack().astype('Int64').fillna(0).head(10)
There’s a lot to unpack here so I’ll break it down:
.value_counts(subset=['start station id', 'starthour'])
counts number of trips bystart station id
andstarthour
..unstack()
pivotsstarthour
from rows to columns. When there are no rides for an hourNaN
is filled in ("Not a Number")..astype('Int64')
Whenever there areNaN
values Pandas converts the data tofloat64
which is not suitable for a count. This converts data back to an Integer formatInt64
with missing values represented as<NA>
..fillna(0)
replaces the missing or<NA>
values with zero..head(10)
shows the first ten rows.
That statement generates a wide report like this, by start station id
and across starthour
.

The last step is to pivot the data back to the long format using stack()
, which I’ll do for the bikes moved in and out of the station.
Convert the data from a series to a dataframe and add a new column Movement
to indicate the direction of the movement (In or Out) and concatenate the two dataframes into one to easily show both types of movements in a single chart.
startcounts =
dfrebal.value_counts(subset=['start station id','starthour'])
.unstack().astype('Int64').fillna(0).stack()
startcounts=pd.DataFrame(startcounts).assign(Movement='Out')
endcounts =
dfrebal.value)counts(subset=['end station id', 'endhour'])
.size().astype('Int64').unstack().fillna(0).stack()
endcounts=pd.DataFrame(endcounts).assign(Movement='In')
rebalcounts=pd.concat([startcounts, endcounts])
rebalcounts.columns=['Bike Count','Movement']
To create charts for multiple stations I put the code to generate a chart into a function. It requires the two dataframes, stationrebal
and dfstations
and takes a station name as a parameter.
def stationrebal (stationname):
plt.figure(figsize=(12,5))
if stationname not in list(dfstations['station name']):
raise RuntimeError("Station name not found.")
plt.suptitle('Citi Bike Rebalanancing - ' + stationname )
ax=sns.barplot(data=rebalcounts.loc[station],
x=list(rebalcounts.loc[station].index),
y="Bike Count", hue="Movement" ,
palette=["darkred", "darkgreen"], edgecolor = 'w')
ax.set(xlabel="Hour of Day", ylabel = "Count of bikes moved") ;
Here’s the chart for the station with the most bikes removed for rebalancing:

This station is in the middle of Soho, which is both a residential and retail district. The chart shows that bikes are mostly moved out of this station, with some in the morning and many in the afternoon.
And here’s the station with the most bikes added:

In this primarily residential area, the chart shows that bikes are moved in to the station in the morning and less so in the afternoon.
Conclusion
Even though Citi Bike doesn’t provide information about bikes moved manually for rebalancing and other purposes, it can be derived from the tripdata by looking at the movements for individual bikes and where their location changed without a ride.
This information is more meaningful in the aggregate (by neighborhood) than it is for individual stations.
We can get a rough idea when bikes are rebalanced at a station by looking at when the bikes were taken for a ride.