Using subway data to place volunteers — and dive into data science

Noah Jaffe
Towards Data Science
8 min read · Oct 1, 2020


Introduction

I recently began a Metis data science bootcamp, joining about 55 other aspiring data scientists for 12 jam-packed weeks of coding, lectures, challenges, and projects. I come from a background in biology; I finished a master’s in marine biology at San Francisco State University in January. I entered 2020 planning to work as a laboratory technician or some other low-level pipette monkey while I figured out a concrete career path, but the Covid-19 pandemic completely disrupted that plan. As a 26-year-old living in a new city (Portland, OR), I found myself looking for some career direction and day-to-day purpose. I had a burgeoning interest in coding and soon began researching careers that use that skill set. I quickly found Metis and formally began my data science journey.

The project

Here I describe my experience with the first project in the Metis curriculum — and my very first data science project (check out the repo). The assignment was to analyze New York City MTA subway turnstile data to figure out where to place volunteers recruiting guests for an annual gala held by the fictional company WomenTechWomenYes.

Many have analyzed this dataset (Google it), and as I embarked on this project as well as a new blog, I wondered to myself: what can I contribute? As a data science newbie, how can I hope to provide revelations that other, more seasoned data scientists have missed? It seemed unlikely. Instead, I would like to share this project with you from my own perspective: as a brand-new, doe-eyed data scientist looking to sink his teeth into the nitty-gritty, math- and Python-driven world of data science.

By sharing my thought process throughout this project, I hope to shed light on why I think data science is so useful — and so downright cool. As I share snippets of code and outputs from my analysis, I hope that other burgeoning data scientists will be able to follow along and become inspired to dig into the many, many available datasets just waiting to be analyzed. Because at the end of the day, my goal is to get more people interested in data science, a broadly applicable set of skills that centers on solving problems and helping people. Let’s get started.

The first step of the project was to find and import the dataset so that we could conduct some exploratory data analysis, or EDA (acronyms are very important in data science).

Multiple weeks of turnstile data were loaded from the MTA website:

import pandas as pd

df_1 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_190323.txt')
df_2 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_190330.txt')
df_3 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_190406.txt')
.
.
.
df_12 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_190608.txt')
df_13 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_190615.txt')
df_14 = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_190622.txt')

mta_2019_df_input_1 = pd.concat([df_1, df_2, df_3, df_4, df_5, df_6, df_7, df_8, df_9, df_10])
mta_df = pd.concat([mta_2019_df_input_1, df_11, df_12, df_13, df_14])
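
Fourteen nearly identical read_csv calls also invite a loop. Here is a minimal sketch of the same load, where the list of week-ending dates (taken from the file names above, most weeks elided) is the only input:

# Sketch of a more compact version of the block above: build each URL from its week-ending date
base_url = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt'
week_endings = ['190323', '190330', '190406', '190608', '190615', '190622']  # ...plus the weeks in between
mta_df = pd.concat([pd.read_csv(base_url.format(week)) for week in week_endings],
                   ignore_index=True)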

The resulting dataframe was a sight to behold:

[Image by author: the head of the raw, concatenated turnstile dataframe]

Little did I know how challenging this dataset would be to clean and use! But that’s the fun part.

The first step in cleaning was to add a few columns and clean up the column names:

# Parse DATE + TIME into a single datetime column:
mta_df['DATETIME'] = pd.to_datetime(mta_df['DATE'] + ' ' + mta_df['TIME'],
                                    format='%m/%d/%Y %H:%M:%S')

# Build a unique turnstile identifier from its component IDs:
mta_df['TURNSTILE'] = mta_df['C/A'] + ' ' + mta_df['UNIT'] + ' ' + mta_df['SCP'] + ' ' + mta_df['STATION']

Then remove some duplicates:

mta_df.drop_duplicates(subset=['C/A', 'UNIT', 'SCP', 'STATION', 'DATETIME'], inplace=True)

We then made a new frame, grouped by the date and the four attributes that together identify a unique turnstile (C/A, Unit, SCP, and Station), keeping the first cumulative ENTRIES reading of each day.

mta_daily = (mta_df.groupby(['C/A', 'UNIT', 'SCP', 'STATION', 'DATE'], as_index=False)
                   .ENTRIES.first())
mta_daily.head()
[Image by author: the head of mta_daily, one row per turnstile per day]

With that cleaning out of the way, we were ready to dive into the analysis that would help us achieve our goal: maximizing volunteer impact by targeting the busiest stations and turnstiles. But there was one more issue: the turnstile entry and exit counts were cumulative, rather than counts of passengers since the previous observation (observations were made every four hours).

[Image by author: cumulative ENTRIES values for a single turnstile]

I had no idea how to solve this issue and get our entry and exit totals into nice four-hour increments. This was an important moment for me in this process. I felt overwhelmed by even this seemingly simple problem, and I felt that I lacked the skills to truly be a data scientist (impostor syndrome is a common ailment in many technical fields, I am told). Things were feeling bleak.

However, I was not conducting this analysis alone: this was a group project after all, and on top of that, we had instructors willing to help. After hacking away at it for a day or so, one of our instructors showed us a truly elegant solution that allowed us to turn the cumulative entries and exits into daily ones:

# Adding previous date and previous entries, so that we can look at the deltas:
mta_daily[["PREV_DATE", "PREV_ENTRIES"]] = (mta_daily
    .groupby(["C/A", "UNIT", "SCP", "STATION"])[["DATE", "ENTRIES"]]
    .apply(lambda grp: grp.shift(1)))

# Dropping the first row of each turnstile, which has no previous reading (NaN):
mta_daily.dropna(subset=["PREV_DATE"], axis=0, inplace=True)

# Where is the counter going the wrong way?
mta_daily[mta_daily["ENTRIES"] < mta_daily["PREV_ENTRIES"]].head()

# Fixing the counter
def get_daily_counts(row, max_counter):
    counter = row["ENTRIES"] - row["PREV_ENTRIES"]
    if counter < 0:
        # Counter reversed?
        counter = -counter
    if counter > max_counter:
        # Counter reset to 0?
        counter = min(row["ENTRIES"], row["PREV_ENTRIES"])
    if counter > max_counter:
        # Still too big?
        return 0
    return counter

# Finally, adding our daily entries column!
mta_daily["DAILY_ENTRIES"] = mta_daily.apply(get_daily_counts, axis=1, max_counter=40000)
mta_daily.head()
[Image by author: mta_daily with the new PREV_DATE, PREV_ENTRIES, and DAILY_ENTRIES columns]

Yay! Now we have a column for daily entries (we focused solely on entries for the remainder of the project).
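
As an aside, the same per-turnstile differencing can also be written with pandas’ built-in diff. A minimal sketch, applied to the daily frame before the PREV_* / dropna steps and with a blunter treatment of counter resets than get_daily_counts above (this is not what we used):

# Sketch of an alternative to the shift/apply approach above:
# diff() computes ENTRIES minus the previous ENTRIES within each turnstile group
deltas = mta_daily.groupby(["C/A", "UNIT", "SCP", "STATION"])["ENTRIES"].diff()

# Absolute value handles counters that run backwards; anything above the
# 40,000-per-day sanity cap (likely a counter reset) is simply zeroed out
daily_entries_alt = deltas.abs().where(deltas.abs() <= 40000, 0)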

Beyond letting us move forward with the analysis, this collaborative intervention drove home an important point: I was not alone in solving this problem. This was a team project, and our instructor had shown he was there to help.

In hindsight, this seems like an obvious observation. Metis stresses the collaborative aspects of the program, and I was literally working with a group on this project. But I want to stress this series of events because I really feel it was a "level-up" moment for me. I had never written code collaboratively, and upon entering a new program, I wanted to prove to myself and others that I had the stuff to perform and compete. However, I feel this lesson of working together and asking for help is one that more people should take to heart.

Filled with a fresh sense of "it's-going-to-be-okay," we continued with our analysis. We now had daily entries for each turnstile, and we wanted to find the busiest stations.
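
One housekeeping note: the snippets below use a frame called stations_daily, which isn't constructed in any of the code shown above. It is simply the turnstile-level mta_daily rolled up to the station level, roughly like this sketch:

# Rough sketch: aggregate per-turnstile daily entries up to one row per station per day
stations_daily = (mta_daily
                  .groupby(['STATION', 'DATE'], as_index=False)['DAILY_ENTRIES']
                  .sum())

From this station-level frame, we grouped by station and sorted by total daily entries to see the top 10: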

top10_stations = \
    (stations_daily.groupby(['STATION'])['DAILY_ENTRIES'].sum()
                   .reset_index()
                   .sort_values(by='DAILY_ENTRIES', ascending=False)
                   .STATION.head(10))

Then we extracted the parts of our dataframe that belonged to those top 10:

station_daily_top10 = \
    stations_daily[stations_daily['STATION'].isin(top10_stations)] \
    .sort_values(by='DAILY_ENTRIES', ascending=False)

And finally plotted it using a standard box plot in Seaborn:

import seaborn as sns

sns.boxplot(x='DAILY_ENTRIES', y='STATION', data=station_daily_top10,
            order=top10_stations, showfliers=False) \
   .set_title('10 Busiest Subway Stations', size=15);
[Image by author: box plot of daily entries at the 10 busiest stations]

Awesome! With a few lines of code, we had taken a huge, messy dataset and extracted the first piece of information valuable to our client: the 10 busiest subway stations in NYC! Next, we visualized the busiest stations on a map using GeoPandas.
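
Two frames appear in the next snippet that aren't constructed anywhere in this post: districts_geo, the NYC community-district boundaries, and combined_df, station coordinates joined to the daily entry totals. A rough sketch of how they can be put together (the file names and coordinate column names here are placeholders):

import geopandas as gpd

# Sketch only; the file names below are placeholders.
# NYC community-district boundaries:
districts_geo = gpd.read_file('nyc_community_districts.geojson')

# Station coordinates joined to station-level entry totals, then converted to point geometries:
station_totals = stations_daily.groupby('STATION', as_index=False)['DAILY_ENTRIES'].sum()
station_coords = pd.read_csv('station_locations.csv')  # assumed STATION, LATITUDE, LONGITUDE columns
combined = station_totals.merge(station_coords, on='STATION')
combined_df = gpd.GeoDataFrame(combined,
                               geometry=gpd.points_from_xy(combined['LONGITUDE'], combined['LATITUDE']),
                               crs=districts_geo.crs)

With those in hand, the plotting itself: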

import matplotlib.pyplot as plt

top10_stations_geo = combined_df[combined_df['STATION'].isin(top10_stations)]

fig, ax = plt.subplots(1, figsize=(10, 10))
districts_geo.plot(ax=ax)
top10_stations_geo.plot(ax=ax, column=top10_stations_geo['DAILY_ENTRIES'],
                        legend=True, cmap='hot', markersize=50)
[Image by author: the 10 busiest stations plotted over the NYC community districts]

And focused in on Manhattan:

# Community districts coded 101-112 are Manhattan, so boro_cd < 200 keeps only Manhattan
manhattan = districts_geo[districts_geo['boro_cd'].astype(int) < 200]

fig, ax = plt.subplots(1, figsize=(20, 20))
manhattan.plot(ax=ax)
top10_stations_geo.plot(ax=ax, column=top10_stations_geo['DAILY_ENTRIES'],
                        legend=True, cmap='hot', markersize=80)
[Image by author: the busiest stations, zoomed in on Manhattan]

Beautiful! We could now show our clients exactly where the best target stations were according to our analysis.

At this point, we were nearly finished and I was feeling accomplished. The project was going smoothly, and we were almost ready to deliver the final presentation.

The final aspect of our project was incorporating demographic data into our analysis and recommendation. First, we read in demographic data obtained from data.cccnewyork.org, a site that tracks income data for NYC. We narrowed it to a list of the most affluent neighborhoods in the city, reasoning that more affluent people would be more likely to donate at a gala.

bougie_df = pd.read_csv('Income_Stations_by_Neighborhood.csv')
bougie_df.head()

We then wrote a function to fuzzy-match the station names in our demographic dataframe with those in our 'stations_daily' frame (the two sources spell station names slightly differently).

import re
from difflib import SequenceMatcher

def get_station_name(row, col_name, station_names, check_length=False):
    """
    Takes in a row of a dataframe and matches the value of the column labeled by
    col_name with the most similar value in station_names.
    """
    # Normalize the neighborhood-file spelling (e.g. '86th' -> '86', drop dashes, uppercase)
    row_name = row[col_name]
    row_name = re.sub('th', '', row_name)
    row_name = re.sub('rd', '', row_name)
    row_name = re.sub('nd', '', row_name)
    row_name = re.sub('-', '', row_name).upper()
    similarity = 0
    similar_name = ''
    for name in station_names:
        if check_length:
            # Pad short candidate names with dashes so they don't dominate the matching
            if len(name) < 14:
                name += "".join(['-' for i in range(len(name), 14)])
        ratio = SequenceMatcher(None, row_name, name).ratio()
        if ratio > similarity:
            similarity = ratio
            similar_name = name.strip('-')
    return similar_name
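
The heart of the matching is difflib's SequenceMatcher, whose ratio() returns a similarity score between 0 and 1; the function simply keeps the candidate station name with the highest score. A toy illustration with made-up strings:

from difflib import SequenceMatcher

# Identical strings score 1.0; unrelated strings score much lower
print(SequenceMatcher(None, '86 ST', '86 ST').ratio())           # 1.0
print(SequenceMatcher(None, '86 ST', 'TIMES SQ-42 ST').ratio())  # a much smaller value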

We added these station names to our demographic dataframe:

bougie_df['MTA_name'] = bougie_df.apply(get_station_name, axis=1, col_name='MAP NAME',
                                        station_names=stations_daily['STATION'].unique())
bougie_df.head()

We then cross-referenced the stations to find the busiest stations of those in affluent neighborhoods:

bougie_list = bougie_df['MTA_name']

bougie_stations = \
    stations_daily[stations_daily['STATION'].isin(bougie_list)] \
    .sort_values(by='DAILY_ENTRIES', ascending=False)

bougie_top10 = \
    (bougie_stations.groupby(['STATION'])['DAILY_ENTRIES'].sum()
                    .reset_index()
                    .sort_values(by='DAILY_ENTRIES', ascending=False)
                    .STATION.head(10))

Finally, we plotted the top 10 busiest stations in these neighborhoods:

sns.boxplot(x='DAILY_ENTRIES', y='STATION',
            data=bougie_stations[bougie_stations['STATION'].isin(bougie_top10)],
            order=bougie_top10, showfliers=False);
[Image by author: box plot of daily entries at the 10 busiest stations in affluent neighborhoods]

So there it was! A final deliverable to our client: the top 10 busiest subway stations in NYC’s most affluent neighborhoods.

Final thoughts

Having completed my first of hopefully many data science projects, I feel like I have passed a milestone. Not to be dramatic, but there were numerous times during the lead-up to, and first week of, Metis when I was legitimately unsure whether I could do it, whether my math, coding, and analytic skills were up to the high standard Metis sets for its students. After completing this project (repo!), I feel that I can succeed in data science, and I am really excited to continue this bootcamp.

I have recently begun working on my second project (which I will blog about soon, don't you worry), bolstered by the confidence gained from this first one. If you take nothing else from this blog post, I hope you can see the start of my data science journey as inspiration to really get your feet wet in whatever it is you are doing, and know that there are always people around you who can help and support you. Until next time.
