The New York City bike share system Citi Bike provides trip data files that allow anyone to analyze the use of the system. Chord diagrams provide a way to visualize flows between entities. In this article I will show that once you have the data in the required format it is easy to create an interactive chord diagram that helps to understand how these shared bikes are used.

The diagram above was created using the Holoviews Chord element. If you are using Anaconda you can install the Holoviews library with the command:
conda install holoviews
For this article I’m also using Jupyter Notebook, Pandas, Seaborn, and Pyarrow. See my previous article Exploring NYC Bike Share Data for instructions on how to install these libraries.
All of the Python code used in this article and the output it generates can be found on GitHub in the Jupyter Notebook chords.ipynb.
Download Citi Bike Trip Data
The Citi Bike System Data page describes the information provided and has a link to a page where you can download the data. For this article I’m using data from September 2020. On Windows, find the NYC file for 202009, download it, and unzip it to a bikeshare directory. On Linux issue these commands:
mkdir bikeshare && cd bikeshare
wget https://s3.amazonaws.com/tripdata/202009-citibike-tripdata.csv.zip
unzip 202009-citibike-tripdata.csv.zip
rm 202009-citibike-tripdata.csv.zip
Citi Bike is a traditional system with fixed stations where users pick up and drop off the shared bikes. Each record in the trip data files is a single trip and has a starting and ending station name and ID. However, the only geographic information included for each station is its latitude and longitude.
In order to make sense of the trip data I created a file with the borough, neighborhood, and zip code for each station. (In New York City a borough is an administrative unit. While there are five, the four with Citi Bike stations are Manhattan, Brooklyn, Queens, and the Bronx.)
You can download the file from 202009-stations.parquet. On Linux issue the command:
wget https://github.com/ckran/bikeshare/raw/main/202009-stations.parquet
To read a Parquet file from Python, install pyarrow. If you are using conda you can install it with:
conda install -c conda-forge pyarrow
This file was created using data from OpenStreetMap. If you want to see how, read my article Reverse Geocoding with NYC Bike Share Data.
Import libraries and data
Start Jupyter and create a new notebook in your bikeshare directory. Enter each block of code into a cell and run it.
Import these libraries and set options as shown:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from holoviews import opts, dim
import holoviews as hv
hv.extension('bokeh')
hv.output(size=200)
Then read the Citi Bike trip data file. These files have one record for each trip, and include information about the ride (trip duration, station names and geographical coordinates) and the rider (birth year, gender, user type). If I wanted to analyze trips by time or day, or by type of rider, I would read the entire file. However, for this analysis I just need to count the number of rides by starting and ending location, so the only columns I need are start station id and end station id.
dfa = pd.read_csv('202009-citibike-tripdata.csv',
usecols=['start station id','end station id'])
dfa.shape
The output shows the number of rows (rides) and columns. There were almost 2½ million rides this month!
(2488225, 2)
Then read the stations file into a DataFrame and look at the first ten rows.
dfstations=pd.read_parquet('202009-stations.parquet')
dfstations.head(10)

Then we can join the tripdata table (with start and end station IDs) to the stations table (on the station ID) using the Pandas merge function. We’ll do this twice, once for start stations and once for end stations.
dfa = pd.merge(dfa, dfstations[['boro','neighborhood','zipcode']], how = 'left', left_on='start station id', right_on='stationid')
dfa = pd.merge(dfa, dfstations[['boro','neighborhood','zipcode']], how = 'left', left_on='end station id', right_on='stationid')
dfa.head(10)
Note in the output below that two columns were created for each attribute, for example boro_x for the starting location and boro_y for the ending location.
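The suffix behavior is easier to see on a toy example. The following is only a minimal sketch: the station IDs and the two-row stations table are made up, and it assumes stationid is an ordinary column rather than the index. It also passes explicit suffixes, an optional variation that makes the resulting column names self-describing.

```python
import pandas as pd

# Hypothetical toy data: three trips between two made-up station IDs.
toy_trips = pd.DataFrame({
    "start station id": [1, 1, 2],
    "end station id": [2, 1, 1],
})
toy_stations = pd.DataFrame({
    "stationid": [1, 2],
    "boro": ["Manhattan", "Brooklyn"],
    "neighborhood": ["Chelsea", "Williamsburg"],
})

# First merge: attach the attributes of the start station.
merged = toy_trips.merge(toy_stations, how="left",
                         left_on="start station id", right_on="stationid")
merged = merged.drop(columns="stationid")

# Second merge: attach the attributes of the end station. The overlapping
# column names receive these suffixes instead of the default _x and _y.
merged = merged.merge(toy_stations, how="left",
                      left_on="end station id", right_on="stationid",
                      suffixes=("_start", "_end"))
merged = merged.drop(columns="stationid")

print(merged.columns.tolist())
```

With the default suffixes the same two merges produce the boro_x and boro_y columns used in the rest of this article.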

Format the Data
Now I need to put the data into the format required by the Holoviews Chord Diagram, while limiting the data to only trips that start and end in Manhattan. For this analysis I’m using the starting and ending neighborhood, but I could also use zip code for more detailed analysis if I restricted the data further. Calling value_counts() returns the count of rides for each pair of neighborhoods, sorted from most to fewest rides.
trips = (dfa[['neighborhood_x','neighborhood_y']]
    .loc[(dfa['boro_x']=='Manhattan') & (dfa['boro_y']=='Manhattan')]
    .value_counts())
trips.head()
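To illustrate what value_counts() returns when called on a two-column DataFrame, here is a small sketch on made-up trips (it assumes pandas 1.1 or later, where DataFrame.value_counts is available). The result is a Series whose index has two levels, one for each column, sorted by count in descending order.

```python
import pandas as pd

# Hypothetical toy data: four trips between two neighborhoods.
toy = pd.DataFrame({
    "neighborhood_x": ["Chelsea", "Chelsea", "Midtown", "Chelsea"],
    "neighborhood_y": ["Chelsea", "Midtown", "Chelsea", "Chelsea"],
})

# Count each distinct (start, end) pair; the most frequent pair comes first.
counts = toy.value_counts()

print(counts.index.nlevels)  # number of index levels, one per column
print(counts.index[0])       # the most frequent (start, end) pair
print(counts.iloc[0])        # its ride count
```

Each index entry is a (start, end) tuple, which is why the next step unpacks the index into two columns.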

Now I need to format the data in a three column Pandas DataFrame. I’ll call the columns start, end and trips.
links=pd.DataFrame.from_records(list(trips.index),
columns=['start','end'])
links['trips']=trips.values
links.head(10)
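An alternative way to get the same three-column shape is to rename the index levels and reset the index. This is a sketch on a made-up Series rather than the real trip counts, and it is offered as an option, not as a replacement for the code above.

```python
import pandas as pd

# Hypothetical stand-in for the value_counts() result: a Series indexed
# by (neighborhood_x, neighborhood_y) pairs.
counts = pd.Series(
    [2, 1],
    index=pd.MultiIndex.from_tuples(
        [("Chelsea", "Chelsea"), ("Chelsea", "Midtown")],
        names=["neighborhood_x", "neighborhood_y"],
    ),
)

# Rename the two index levels, then turn them into ordinary columns;
# name= sets the column name used for the counts.
links_alt = counts.rename_axis(["start", "end"]).reset_index(name="trips")

print(links_alt)
```

Either approach yields one row per start/end pair with the ride count in the trips column.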

This data can be easily viewed in a bar chart. First I’ll create a series called names combining the start and end neighborhood names, then plot the trips.
names = links.start + '/' + links.end
plt.figure(figsize=(12,10))
sns.barplot(x=links.trips[:60], y=names[:60], orient="h");

We see a classic "long tail" here; I limited the chart to the first 60 of the start/end pairs.
I can easily see that the second most popular trips are those that start and end in Chelsea. But what about the trips that start or end in Chelsea? I’d have to read the labels on the chart to find them. If I just wanted to see the trip counts I could see them by pivoting the data.
pd.pivot_table(links,index='start',values='trips',columns='end')
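The pivot is easier to follow on a tiny example. The sketch below uses made-up counts; the fill_value and aggfunc arguments are my additions for illustration, so that pairs with no trips show as 0 rather than NaN.

```python
import pandas as pd

# Hypothetical toy links table: no trips were recorded from Midtown to Midtown.
links_toy = pd.DataFrame({
    "start": ["Chelsea", "Chelsea", "Midtown"],
    "end": ["Chelsea", "Midtown", "Chelsea"],
    "trips": [2, 1, 1],
})

# Rows are start neighborhoods, columns are end neighborhoods.
# Each pair occurs once, so summing simply carries the count through.
matrix = pd.pivot_table(links_toy, index="start", columns="end",
                        values="trips", aggfunc="sum", fill_value=0)

print(matrix)
```

Reading across the Chelsea row gives the trips that start there, and reading down the Chelsea column gives the trips that end there.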

But what if I want to see a graphical representation of this data?
View Chord Diagram
This is where a chord diagram comes in. It lets me create a single chart that shows the flows between the start and end neighborhoods.
But if I try to use the entire table I get an incomprehensible cat’s cradle of chords. It’s simply too much information.

So I’m going to limit this diagram to the sixty pairs with the most rides. The options here set the colors for the edges and nodes.
chord=hv.Chord(links[:60])
chord.opts(node_color='index', edge_color='start',
label_index='index',cmap='Category10', edge_cmap='Category10' )
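When limiting a diagram to the top pairs, it can be useful to know what share of all rides those pairs represent. The following is a minimal sketch using only Pandas and made-up numbers; top_pairs is a hypothetical helper, not part of Holoviews or of the notebook.

```python
import pandas as pd

def top_pairs(links, n):
    """Return the n rows with the most trips and their share of all trips."""
    top = links.nlargest(n, "trips")
    share = top["trips"].sum() / links["trips"].sum()
    return top, share

# Hypothetical toy links table with four start/end pairs.
toy_links = pd.DataFrame({
    "start": ["A", "A", "B", "B"],
    "end": ["A", "B", "A", "B"],
    "trips": [50, 30, 15, 5],
})

top, share = top_pairs(toy_links, 2)
print(f"Top 2 pairs cover {share:.0%} of trips")
```

Applied to the real links table with n=60, the same calculation would show how much of the Manhattan traffic the diagram covers.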
Here I can clearly see the amount of Citi Bike usage between the most popular neighborhoods.

In Jupyter this diagram is automatically enabled for exploration. So when I hover over the node for Chelsea, the trips starting there are highlighted in green and I can see their destinations around the circle.

And if I click on a node the rest of the diagram is dimmed so that I can see the volume of trips starting and ending at the selected location. Here I see that most trips that start in Chelsea also end there, and those that don’t mostly go to adjacent neighborhoods.

And what if I don’t want to go to Chelsea? I just click on a different node.
Conclusion
When you are exploring data that includes flows between entities such as docking stations in a bike share system, a chord diagram is a great way to visualize data and one that leads to further exploration.