Exploring and Visualizing Chicago Transit data using pandas and Bokeh — Part I (intro to pandas)

Colin Patrick Reid
Towards Data Science
12 min read · May 30, 2018


Part 1 of this series is an introduction to pandas, in which I will describe how to read a large open-access dataset into a Jupyter notebook with python3 and demonstrate a few capabilities of pandas as a high-performance data manipulation/analysis package.

Part 2 will dive deeper into the Bokeh package, where I will demonstrate its interactive visualization capabilities and geospatial mapping features to present the data from Part 1 in an engaging way.

The City of Chicago provides a considerable amount of open data for its citizens to explore. The Chicago Data Portal (https://data.cityofchicago.org/) houses thousands of datasets available for download, covering everything from where movies are being filmed to sanitation data about your favorite restaurant. People have already created apps that integrate this data in unique and useful ways! For example, I found an app that alerts you via text message when your street will be swept (to avoid those lovely ticket$ the city doles out for failing to move your car during a cleaning): see https://sweeparound.us

I thought it would be interesting to explore a dataset that can be found on the Chicago Data Portal, around a topic that most Chicagoans would be familiar with: the Chicago Transit Authority (CTA) public train system (also known as the ‘L’ trains). For those that are not native to the Chicago area, the CTA train system is split into 8 different ‘L’ train lines: Red, Blue, Brown, Green, Orange, Pink, Purple, and Yellow.

Each train line passes through a set of stations. Some stations have multiple lines running through them. Below is a map of the CTA train system:

CTA Transit Map. Source: https://www.transitchicago.com/assets/1/6/ctamap_Lsystem.png

The Data

The Chicago Data Portal has many datasets relating to the public transit system. The two datasets we will use for this tutorial are located below:

  1. CTA — Ridership — ‘L’ Station Entries — Monthly Day-Type Averages & Totals
  2. CTA-System-Information-List-of-L-Stops-Map

The first data set contains monthly ridership totals for all the CTA train stations, which are further broken down by day type (weekday, Saturday, or Sunday/Holiday). The awesome thing is that this data is available all the way back to 2001. Pretty impressive!

The second data set is a map, which contains geospatial data about each CTA train stop (including stop name, and coordinates on a map). You can export the information in a number of formats (JSON, TSV, XML). We’ll use this more in Part 2 of this tutorial.

Formulating Questions / Exploring the Data

As a Chicagoan I’ve ridden most of the train lines (except the Purple Line). By riding the trains frequently, you can get a sense for certain transit patterns. For example, you would know that taking your entire family through the State/Lake stop on the Red Line at 7:30 AM would be a VERY BAD idea. Using this transit data, more concrete and data-informed patterns can be discovered.

Before jumping full-force into analyzing data and plotting curves, it is essential that you ask questions — or begin to explore the data. Overall, what is it you hope to discover? What are some specific questions you have before and after reviewing the data? What don’t you know that you would like to know? This process can be iterative: after discovering new information, you can start to ask new questions.

Some general questions I had about the CTA train system that this data will help to answer:

  1. Which of the two train stations next to my apartment is more popular?
  2. What are the most popular stations overall (and least popular)?
  3. What part of the city has the most ridership?
  4. Are there any interesting trends in ridership I can see over time?

As with most data science problems, the first place to start is reading in and processing the data (and checking for data quality problems) to begin answering some basic questions. That will be the main focus of this article (answering Questions 1–2). In Part 2, I plan to dive into more advanced topics, including interactive visualization with Bokeh and choropleth maps (Questions 3–4).

Getting Started with Jupyter Notebooks

The analysis I will perform runs in a Jupyter Notebook with python3, along with some other helpful packages (pandas and Bokeh). If you don’t know what a Jupyter notebook is, you are missing out! It’s an incredible visual way of programming and doing exploratory data analysis that has helped me become a better python programmer and freed me from the shackles of Excel. Here are some instructions for installing Jupyter on your computer: http://jupyter.org/install

You’ll also need to install python3 and the pandas/bokeh packages. I did that using the following commands at the command line:

pip3 install pandas
pip3 install bokeh
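
To confirm the installs worked, you can print the package versions from the command line (a quick, optional sanity check):

python3 -c "import pandas, bokeh; print(pandas.__version__, bokeh.__version__)"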

To launch the Jupyter notebook from the command line (in the desired directory), you type:

jupyter notebook

This launches the Jupyter notebook dashboard in the browser (by default, http://localhost:8888).

This dashboard view shows a list of the notebooks, files, and subdirectories in the directory where the notebook server was started.

To create a new notebook, you click on the New button at the top, and choose a Python 3 environment.

This opens the notebook view where you can start coding in python3!

For a quick rundown of how to use a Jupyter notebook, see this article.

The Tools

The two main tools I will use for this analysis are pandas and Bokeh:

  • pandas: “Python package providing fast, flexible, and expressive data structures designed to make working with ‘relational’ or ‘labeled’ data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python”
  • Bokeh: “Bokeh is an interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of versatile graphics, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.”

The bottom line is that pandas will be primarily used for storing and manipulating the data (in a dataframe) and Bokeh will be used for the presentation and visualization of the data.

Reading in the Data

To read in the data, I import the pandas package and read in the CSV of L-stop ridership averages. I can use the .read_csv() function to do this, passing the file name and a separator as the arguments:

import pandas as pd

L_Rides = pd.read_csv('CTA-L-Monthly-Ridership.csv', sep=',')
L_Rides.head()

L_Rides is the dataframe containing the ridership data. Using the .head() function allows me to view only the first few lines of the data frame. It’s a great way to see what the data looks like.

Here we see the columns in the data:

  • station_id, stationame, month_beginning, avg_weekday_rides, avg_saturday_rides, avg_sunday-holiday_rides, monthtotal

It looks like for each station and each month, there are ridership statistics for weekday/weekend rides, as well as a monthly total for each station.

pandas also has a helpful function called .describe() that summarizes data frames, providing helpful statistics about the data set (mean, min, max, etc). Running the following command will provide these statistics:

L_Rides.describe()

Looking at this data frame, we can see there are a total of 29,460 data points (rows) in this data set. A few more interesting tidbits:

  • The highest average weekday ridership at a single station is 24,041 rides.
  • The average station sees about 100,573 rides in a month.
  • In general, weekdays have the most ridership, while Sundays have the least.
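
These summary figures can also be pulled out individually; for example (a small sketch using the columns above):

L_Rides['avg_weekday_rides'].max()   # highest average weekday ridership (24,041)
L_Rides['monthtotal'].mean()         # average monthly total per station (about 100,573)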

Looking at these individual numbers can give clues on what statistics would be interesting to visualize, though creating tables and visualizations can give a more complete picture and tell a more compelling story.

Answering Questions

Question 1: Which of the two train stations next to my apartment is more popular?

After having read in the data, I can now begin to answer some of the questions I originally had. I live on the north side of the city in Edgewater, and the two stations closest to me are the Bryn Mawr and Berwyn Red Line stops. I want to know which of the two was more popular in 2017. I can use the very useful indexing feature of pandas to find out.

To perform a selection of the data for the Berwyn stop, I am ultimately answering the query:

Find all the data where the Berwyn stop is the ‘stationame’

You can translate this query into a pandas indexing selection like the one below:

L_Rides[L_Rides['stationame'] == 'Berwyn']

You can see that this filtered the data frame down to only the rows where ‘Berwyn’ is the ‘stationame’.

Ok, that’s cool. But how do I get only the 2017 Berwyn station data?

Simple. You can chain additional Boolean conditions to filter the dates.

Find all the data between 1/01/2017 and 12/31/2017

The tricky part here is that each column of imported data is assigned a variable type. pandas does its best to infer the right type for each column. For example, the stationame column was brought in as a string (technically a python object: read more here). You can check this with the following command:

L_Rides.dtypes

You can see all the columns and their associated data type above. There’s some good news and bad news.

Bad News: the ‘month_beginning’ column is not a date-time type and we can’t do date comparisons very easily

Good News: pandas makes it easy to change this type

To change the data type of an entire column in pandas, you can use the to_datetime() function. My general practice is to store the converted data in a new column. I do this with the following command:

L_Rides['date_time'] = pd.to_datetime(L_Rides['month_beginning'])

You can see another column was added called date_time, that now is a ‘datetime64[ns]’ data type.

Now we can perform date time comparison operations. I can select Berwyn data between 1/1/2017 and 12/31/2017 using the following command:

Berwyn_Rides = L_Rides[(L_Rides['stationame'] == 'Berwyn') &
                       (L_Rides['date_time'] >= '01/01/2017') &
                       (L_Rides['date_time'] < '12/31/2017')]

Now we have 12 data points for the various months of data in 2017 for the Berwyn stop.
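
A quick sanity check confirms the row count, and (as an aside) the same filter can also be written with pandas’ .dt accessor; a sketch:

len(Berwyn_Rides)   # 12 — one row per month of 2017

# An equivalent filter using the .dt accessor on the datetime column
Berwyn_Rides = L_Rides[(L_Rides['stationame'] == 'Berwyn') &
                       (L_Rides['date_time'].dt.year == 2017)]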

The final step is obtaining the total ridership for the year. To do this, I must sum up the different rows of data that I have selected. Again, pandas makes this very easy, as data frames have mathematical functions that can be applied to each column of data.

Unsurprisingly, the summation function is called .sum(). You can use this function as-is to get a sum of every column in the data frame, or you can pass specific column names. I get the Berwyn 2017 ridership total using the following command:

Berwyn_Rides['monthtotal'].sum()

The same set of commands can be executed for the Bryn Mawr Station:
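
Something like this (a sketch mirroring the Berwyn selection, assuming the stop is recorded in the data as ‘Bryn Mawr’):

Bryn_Mawr_Rides = L_Rides[(L_Rides['stationame'] == 'Bryn Mawr') &
                          (L_Rides['date_time'] >= '01/01/2017') &
                          (L_Rides['date_time'] < '12/31/2017')]
Bryn_Mawr_Rides['monthtotal'].sum()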

Ultimately we see that the Bryn Mawr station was about 38% more popular than the Berwyn station in 2017 (1,094,141 Berwyn rides to 1,509,613 Bryn Mawr rides).

Question 2: What are the most popular stations (and least popular)?

The next question is about the data from all the stations in the CTA transit system, so we will be using the entire L station data set. To get a listing of the most popular transit stations, we must first group the data by each station.

Creating individual totals for each station like I have done in the previous example would be a LOT of unnecessary work. Instead we are going to use this wonderful .groupby() functionality that pandas has!

The groupby function is documented extremely well here, but the basic concept is to perform one or more of the following functions:

Split — data into groups

Apply — function to the group(s)

Combine — the results into a data structure

For this transit example, we want to Split the data into transit station groups, Apply a summation function for the total ridership, then Combine the results into a single data frame.

To begin we can perform the split by station using the following command:

Station_Group = L_Rides.groupby(['stationame'])

Above we see that the group is just a pandas DataFrameGroupBy object, to which you can apply a variety of Apply functions. To view the unique groups into which the data has been split, you can use the .groups attribute:

Station_Group.groups

The result shows that .groups is just a dictionary, with each group key (the station name) mapped to the row indices belonging to that group.
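
If you want to pull out the rows for a single group, the .get_group() method does that; for example (a quick sketch):

# Retrieve every row belonging to a single group
Station_Group.get_group('Berwyn').head()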

With the grouped object created, we can perform an aggregation (Apply) operation on the groups. I want to apply a sum operation on the ‘monthtotal’ column of the dataframe. I do this with the following command:

Station_Group['monthtotal'].sum()
# or
Station_Group['monthtotal'].agg('sum')

Here we see the resulting month totals of every station, summed all the way back to the beginning of 2001.

The last thing to do is Combine the results back into a dataframe (and sort them using the .sort_values() function):

Station_Totals = pd.DataFrame(Station_Group['monthtotal'].sum())
Station_Totals.sort_values(ascending=False,by='monthtotal',inplace=True)
Station_Totals

Now that we have the data frame with ridership totals per station, we can answer the question that was posed. I’ll return the top 5 stations and bottom 5 stations using some handy pandas commands:

Station_Totals.head(5)   # Top 5 stations
Station_Totals.tail(5)   # Bottom 5 stations

Now we have the most popular and least popular CTA stations (since the beginning of 2001).

Ok now about those 27 rides on Homan….

Mini History Lesson

From Wikipedia:

Homan was a station on the Chicago Transit Authority’s Green Line. The station was located at Homan Avenue and Lake Street in the East Garfield Park neighborhood of Chicago. Homan was situated east of Pulaski and west of Kedzie. Homan opened in March 1894 and closed on January 9, 1994, when the entire Green Line closed for a renovation and rehabilitation project. The station did not reopen with the rest of the Green Line on May 12, 1996.

Relocation of the former station house began in March 2001 and the project was finished on June 30, 2001, when the new station opened at Conservatory.

So it is quite possible that 27 people took some rides just before/after the station was closed….

Basic Plots using Pandas

While pandas is primarily used for data wrangling, it does have some useful plotting features. pandas can plot graphs from dataframes (using the matplotlib library under the hood). The .plot() function can be called on a dataframe to plot all of its columns.

Berwyn_Rides.plot()

The graph above isn’t particularly useful, though additional arguments (such as the columns to use for the x and y axes) can be passed to the .plot() function to make a more usable graph. Below I pass column names as the x and y parameters to plot the average weekday rides over time.

Berwyn_Rides.plot(x='date_time', y='avg_weekday_rides')

The above graph displays the Berwyn stop average weekday rides over the course of 2017.

pandas plots data using lines by default, though there are a number of other visualization types that can be used: histograms, box plots, pie charts, etc.
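
For instance, a one-line histogram of the Berwyn weekday averages might look like this (a sketch reusing the Berwyn_Rides frame from earlier):

# Distribution of average weekday rides across the months of 2017
Berwyn_Rides['avg_weekday_rides'].plot(kind='hist')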

As a fuller demonstration, I will create a bar plot of the top 25 CTA stations. I can use the Station_Totals data frame that we made above, and pass ‘bar’ as the ‘kind’ parameter and 60 as the ‘rot’ parameter to rotate the labels 60 degrees.

Station_Totals.head(25).plot(kind='bar',rot=60)

As you can see, the plotting functionality in pandas is fairly simple to use. Because pandas plotting is a thin wrapper around the matplotlib library, plots can be further customized using the features of matplotlib.
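
For example (a minimal sketch), .plot() returns a matplotlib Axes object, which you can use to add a title and axis labels:

import matplotlib.pyplot as plt

# .plot() returns a matplotlib Axes that can be customized further
ax = Station_Totals.head(25).plot(kind='bar', rot=60)
ax.set_title('Top 25 CTA stations by total ridership since 2001')
ax.set_ylabel('Total rides')
plt.tight_layout()
plt.show()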

The great thing about pandas is that dataframes are usable in other major python plotting packages (e.g., Plotly and Bokeh), so you aren’t stuck with just one option for plotting your data.
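
As a tiny preview of Part 2, here is a minimal sketch of a dataframe feeding Bokeh directly via a ColumnDataSource:

from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

# A pandas dataframe can be passed straight into a ColumnDataSource
source = ColumnDataSource(Berwyn_Rides)

p = figure(x_axis_type='datetime', title='Berwyn average weekday rides, 2017')
p.line(x='date_time', y='avg_weekday_rides', source=source)
show(p)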

In my next post (Part 2), I introduce the Bokeh visualization package to begin visualizing some of this CTA transit data. I will also attempt to integrate the CTA map into the visualization to provide geographic contextual information about the ridership data! Stay tuned! Hope you enjoyed Part 1.
