Not one week into my Data Science program at Flatiron School, I was already haunted by the colloquial wisdom that "police data is a headache to work with. It’s messy." And even under the best circumstances, I would expect police data to be messy. Police officers are not trained to make data collection their first priority, nor should they be. Contrast this with the entire teams of data collectors working at Netflix, Spotify, or Google, whose first priority is engineering their data to be optimized for analytics.
However, I think messy police data is practically ripe for visualization, precisely because it’s too messy, complex, and large for the average person to understand. These are the circumstances where visualization can bring much-needed intuitive understanding. A simple choropleth can become the beginning of the story our data has for us, and it’s a story the average New Yorker deserves access to. (And if you are an average New Yorker who cares more about the findings than the code, feel free to scroll down to Findings.)
NYPD and Stop-and-Frisks, a Boilerplate:
For this project, I used 2020 NYPD stop-and-frisk data. The New York Civil Liberties Union (NYCLU) has already thoroughly analyzed this data.[1] They found that of the 9,544 stops recorded, 5,791 were innocent (61 percent).
They also found:
- 5,367 were Black (56 percent).
- 2,879 were Latinx (30 percent).
- 856 were white (9 percent).
- 219 were Asian / Pacific Islander (2 percent).
- 132 were Middle Eastern/Southwest Asian (1.4 percent).
These demographics are relevant to a baseline understanding of stop-and-frisk because the practice was deemed unconstitutional in 2013, as a violation of the Fourth Amendment’s prohibition of unreasonable searches and seizures.
In New York City, Mayor Eric Adams is a former police captain and a founding member of 100 Blacks In Law Enforcement Who Care, an advocacy group focused on fighting injustices in the African American community’s interactions with the NYPD. Adams has also promised to bring a legal form of stop-and-frisk back,[2] stating: "If you have a police department where you’re saying you can’t stop and question, that is not a responsible form of policing…"
An Expensive Institution:
Per the City Council’s budget report, "The proposed NYPD expense budget for Fiscal 2022 is $5.44 billion, representing 5.5 percent of the City’s total." [3]

Business Insider estimated the real cost of the NYPD’s funding was closer to $10 billion in 2020.[4] New Yorkers’ taxes make up around 95 percent of NYPD funding, making it the most heavily funded police department anywhere. It’s extremely important for the health of New York City that the average New Yorker knows about the everyday goings-on of their city’s police. But there’s a huge bottleneck on that information.
The Accessibility Problem With Public Police Data:
For transparency, NYPD makes much of its data public.[5] But in the case of stop-and-frisk data, none of that data was aggregated. Instead, each row of information represented one independent stop-and-frisk.
There’s nothing unusual about public data sets that aren’t made for the average human to discern meaning from. It’s common practice. But the data isn’t doing the average New Yorker much good as is, until we do something with it. Let’s put it on a map!
New Yorkers Know Neighborhoods:
Maybe it’s from constantly looking at subway maps?
After all, the average New Yorker can now look at this map and understand "Stop-and-frisk meant something extremely different to East Harlem, East New York, and The Bronx (marked by red arrows) than it did to Staten Island, the Financial District, and Bay Ridge (marked by green arrows)."

The Code:
As a note, this blog post won’t teach you anything that I didn’t first learn from Jade Adams in her amazing post, "A Python Tutorial on Geomapping using Folium and GeoPandas." [6] Look at that post if you want a great Folium tutorial; here, I’ll keep talking about how to visualize NYPD data specifically, taking you through the steps I used to make the map above.
First, open up whatever you use to write code; I like Jupyter Notebooks. Let’s import pandas to read the Excel file and folium to make the map.
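Sketched minimally, that setup might look like this. The filename is a placeholder for wherever you saved the NYPD spreadsheet, and the tiny stand-in frame lets the snippet run without the real file:

```python
import pandas as pd

# Real workflow (hypothetical filename -- substitute the spreadsheet
# downloaded from the NYPD's Stop, Question and Frisk page; .xlsx files
# also need the openpyxl package installed alongside pandas):
# df = pd.read_excel("sqf-2020.xlsx")

# Tiny stand-in with the two columns this post uses, so the snippet
# runs without the real spreadsheet:
df = pd.DataFrame({
    "STOP_ID": [1, 2, 3, 4, 5, 6],
    "STOP_LOCATION_ZIP_CODE": ["10029", "10029", "11207",
                               "(null)", "11206", "10029"],
})
```

(The folium import shows up once we get to the map-making section.)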
The next step is to take a look at the shape and columns of the data.
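On a loaded frame this is just two attribute lookups; the stand-in frame here is only so the snippet runs on its own:

```python
import pandas as pd

# Stand-in for the loaded spreadsheet (the real frame is 9544 x 83):
df = pd.DataFrame({
    "STOP_ID": [1, 2, 3],
    "STOP_LOCATION_ZIP_CODE": ["10029", "(null)", "11207"],
})

print(df.shape)    # a (rows, columns) tuple: (9544, 83) on the real data
print(df.columns)  # the names of all the columns
```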
The output tells us that we’re working with 9,544 rows and 83 columns, and lists the names of all the columns.
Let’s get rid of anything that’s not ID or zip code data, just to keep things extra simple. We can always come back to this extremely complex data set later for further investigation (as well we should!). After all, this map is not the end of our process, only the beginning.
Now that we have this new data frame, saved as zipcode_frequency, we need to do some basic reformatting before we can use it.
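A sketch of that selection, again on a stand-in frame (OTHER_COLUMN is a hypothetical placeholder for the 81 columns we’re dropping):

```python
import pandas as pd

# Stand-in for the loaded 9544 x 83 spreadsheet; OTHER_COLUMN is a
# placeholder representing the 81 columns we drop.
df = pd.DataFrame({
    "STOP_ID": [1, 2, 3, 4],
    "STOP_LOCATION_ZIP_CODE": ["10029", "10029", "(null)", "11207"],
    "OTHER_COLUMN": ["a", "b", "c", "d"],
})

# Keep only the id and zip code columns:
zipcode_frequency = df[["STOP_ID", "STOP_LOCATION_ZIP_CODE"]].copy()

# The report below comes from .info():
zipcode_frequency.info()
```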
Calling .info() on the data frame returns the following report: each column, how many non-null values it contains, and its data type.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9544 entries, 0 to 9543
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   STOP_ID                 9544 non-null   int64
 1   STOP_LOCATION_ZIP_CODE  9544 non-null   object
dtypes: int64(1), object(1)
memory usage: 149.2+ KB
Notice how the Non-Null Count shows 9544 non-null values for both STOP_ID and STOP_LOCATION_ZIP_CODE. This tells us that our missing values are not stored as true nulls, so they won’t be caught by a simple zipcode_frequency.fillna(<params here>). However, when we run the following code:
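(A sketch of that call on a stand-in column; the real column has 9,544 entries and 654 "(null)" strings.)

```python
import pandas as pd

# Stand-in zip code column with a couple of "(null)" strings:
zipcode_frequency = pd.DataFrame({
    "STOP_ID": [1, 2, 3, 4],
    "STOP_LOCATION_ZIP_CODE": ["(null)", "(null)", "10029", "11207"],
})

# .describe() on an object column reports count / unique / top / freq:
print(zipcode_frequency["STOP_LOCATION_ZIP_CODE"].describe())
```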
We get the following output: a count of each time the zip code field was filled in, a count of the unique zip codes, the most frequently occurring zip code, and the frequency of that zip code.
count 9544
unique 179
top (null)
freq 654
Name: STOP_LOCATION_ZIP_CODE, dtype: object
This tells us that our missing values appear as the string "(null)" in the zip code column, 654 times. But what percentage of the data is that? How many times did police officers in NYC record a stop-and-frisk but fail to record the zip code?
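One way to sketch that percentage check, again on a stand-in frame (on the real data the count is 654 out of 9,544 rows, which works out to roughly 6.9 percent):

```python
import pandas as pd

# Stand-in: one "(null)" out of four rows.
zipcode_frequency = pd.DataFrame({
    "STOP_LOCATION_ZIP_CODE": ["(null)", "10029", "11207", "10029"],
})

# Count the "(null)" strings and express them as a percent of all rows:
nulls = (zipcode_frequency["STOP_LOCATION_ZIP_CODE"] == "(null)").sum()
pct = nulls / len(zipcode_frequency) * 100
print(f"This data has {pct:.1f} percent of its zipcode data missing.")
```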
This code returns our string with its value filled in as the percent of data we’re getting rid of:
This data has about 6.9 percent of its zip code data missing.
Removing the nulls, adding the null values to the Yes column, or adding them to the No column: there is a bit of a dilemma here. In the future, I’d like to come back to this project and figure out a way to both aggregate the data and hold on to all three distinct values. Normally, I wouldn’t hold onto null values, but with this kind of data, null values could tell their own story (do certain zip codes have higher tendencies for their reports to contain null values?). However, for the purposes of this blog post, simplicity, and just getting an initial idea of what the data says, I’ll be removing null values from the data and keeping track of these losses.
Aggregating the Data with Pandas:
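The aggregation itself can be a one-line groupby; sketched on a stand-in frame:

```python
import pandas as pd

# Stand-in for zipcode_frequency:
zipcode_frequency = pd.DataFrame({
    "STOP_ID": [1, 2, 3, 4, 5],
    "STOP_LOCATION_ZIP_CODE": ["10029", "10029", "11207", "(null)", "10029"],
})

# Count rows per zip code; STOP_ID becomes the number of stops recorded
# in each zip code.
aggregated = (
    zipcode_frequency
    .groupby("STOP_LOCATION_ZIP_CODE", as_index=False)
    .count()
)
print(aggregated)
```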
And the return shows our aggregated data! Each row now represents a whole zip code instead of a single stop.

But we still have three more things to do before we map:
- We will rename STOP_ID to FREQUENCY for clarity.
- We will cast our zip codes as strings, instead of numeric values. This will allow folium to assign the frequencies of the zip codes to their zip code labels.
- Finally, we will remove those "(null)" values from the data, with the understanding that we are introducing bias into our process and changing what our map will mean.

Making A Choropleth Map:
A choropleth is a geographic heat map. Folium is a Python map-making library. We require two basic things to make a choropleth: some boundaries for our map as a GeoJSON file, and our data. We’ve prepared our data. I’ll be using the NYC GeoJSON file I found here: https://github.com/fedhere/PUI2015_EC/blob/master/mam1612_EC/nyc-zip-code-tabulation-areas-polygons.geojson.
For more on folium choropleths, I will refer you to the folium documentation directly: https://python-visualization.github.io/folium/quickstart.html
It’s true that, ideally, I’d make these maps interactive, so that one could click on each zip code area of the map and see statistics for that zip code pop up, and I’d probably want to build them in Tableau rather than folium. That would certainly increase usability. There are many ways to improve these maps to increase how much understanding we can draw from them.
But I’d argue that even a quick look at the map we’ve been able to make together here (let’s take another look at it) already tells us an order of magnitude more than looking at the original Excel file.

Embracing Questions to Find a Story:
While this isn’t the dashboard of our dreams, it’s a fantastic storytelling tool because it already makes us curious with questions to ask of our data:
- How do these zip codes correlate to the PHYSICAL_FORCE_RESTRAINED_USED_FLAG column, which we have yet to explore in our data? Are there zip codes with a higher percentage of folks being restrained, and are they the zip codes where the most stop-and-frisks happened?
- How do these zip codes correlate to the ASK_FOR_CONSENT_FLAG column in our data? Were folks across zip codes asked for their consent at the same rates?
- What about the OFFICER_IN_UNIFORM_FLAG column? What happens to this column when we aggregate by zip code? (Were there more undercover police officers in certain zip codes?)
- How would this map compare to a map with the same parameters from 2019? Were the zip codes with a lot of stop-and-frisks the same as the year before?
- Perhaps most notably: "what is happening in East Harlem, East New York, and the Bronx?"
My enthusiasm about messy data and choropleths can be summed up for fellow data scientists thus: just as we use histograms to establish a baseline understanding of frequency data, we can use simple choropleths to establish a baseline understanding of geographical data. From there, we can use our choropleth to figure out where in the data we’d like to explore more.
Consent-Asked:
Using similar methods, I was able to get the percentage frequency of consent-asked by zip code. (As an important note, ASK_FOR_CONSENT_FLAG does not measure when consent was asked for and given in this data set, only the times when consent was asked for. There are many instances of stop-and-frisks where consent was asked for, consent was not given, and the stop-and-frisk still happened.)
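A sketch of that aggregation on stand-in rows. I’m assuming here that the flag is stored as "Y"/"N" strings; check the encoding in your copy of the file before relying on this:

```python
import pandas as pd

# Stand-in rows; the "Y"/"N" flag encoding is an assumption.
df = pd.DataFrame({
    "STOP_LOCATION_ZIP_CODE": ["10029", "10029", "10029", "11207", "11207"],
    "ASK_FOR_CONSENT_FLAG": ["Y", "N", "N", "Y", "Y"],
})

# Mean of a boolean column per group = fraction of stops where consent
# was asked; multiply by 100 for a percentage.
consent_asked = (
    df.assign(ASKED=df["ASK_FOR_CONSENT_FLAG"].eq("Y"))
      .groupby("STOP_LOCATION_ZIP_CODE", as_index=False)["ASKED"]
      .mean()
)
consent_asked["ASKED"] = consent_asked["ASKED"] * 100
print(consent_asked)
```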

Using that data I was able to make the following map:

The above map is great for showing the real range of how consensual stop-and-frisks were in 2020: there are two small zip codes, one in eastern Queens and one right near Yonkers, that had a 100% consent-asked rate. But what if we want to compare the consent-asked rates across zip codes in a way that really highlights where most of the data is?
Using Quartile Ranges in Choropleth Maps:
The following code:
Will give us the following map:

It’s important to point out the legend, which shows that a zip code must be in the top quartile (above the 75th percentile) to have a consent-asked rate higher than 18%. It’s our duty as storytellers to point this out, lest the map be misinterpreted as showing that consent-asked rates were generally high throughout NYC in 2020, since so much of the map is dark blue. What we should or shouldn’t consider a "high" consent-asked rate is a topic I won’t discuss in this blog, but intuitively, "high" could be associated with "a majority of the time," and we must let our audience know this isn’t the case with this data.
Keep Digging to Find a Story:
When taking a close look at this map, I again want to point out some neighborhoods of interest:

The red arrows: Here are two zip codes in Staten Island that had a relatively low frequency of stop-and-frisks but relatively high consent-asked rates. Bay Ridge shows a similar pattern.
The purple arrows: Here is Sunset Park, which had a relatively low consent-asked rate compared to its frequency of stop-and-frisks. East Harlem also looks to be showing that pattern.
The orange arrow: This is East New York which had a relatively high frequency of stop-and-frisks along with East Harlem, but here the recorded consent-asked rates are higher than they are in East Harlem.
I won’t go into every neighborhood on this map, nor will I delve too far into my further analysis; it’s all on my GitHub. However, by looking at some maps and inspecting the data, I had the following findings. Please note that I’ll be using zip codes and neighborhood names interchangeably for context, though this is not perfectly accurate.
Findings
High-Frequency Stop-and-Frisk Zip Codes
10029:

In East Harlem, the most common stop-and-frisk zip code, where 220 folks were reported stop-and-frisked in 2020, the consent-asked rate is around 11% (between the first and second quartiles, so slightly below the median for this data set).
11207:

In East New York, the second most common stop-and-frisk zip code, the consent-asked rate is around 20% (putting it in the top quartile of consent-asked rates as well). That’s a big difference from East Harlem’s 11% consent-asked rate, for the 169 folks stop-and-frisked there.
11206:

In Bushwick, the seventh most common stop-and-frisk zip code, the consent-asked rate is around 25% (also putting it in the top quartile of consent-asked zip codes). This zip code has one of the highest consent-asked rates among the zip codes where stop-and-frisks were most frequent.
11234:

In Flatlands/Bergen Beach, the 14th most common stop-and-frisk zip code, the consent-asked rate is about 23%, while just next door in Canarsie (zip code 11236) the consent-asked rate is less than half of that, at about 10.5%.
11220:

In Sunset Park, the 15th most common stop-and-frisk zip code, the consent-asked rate is the lowest among the 25 highest-frequency zip codes, at just under 4%. This is a very interesting case, since every other zip code in the top 25 has a consent-asked rate at least about three times higher. The next lowest consent-asked rate in the top 25 belongs to 10002 (the Lower East Side), at about 9%.
Low-Frequency Consent-Asked Zip Codes:
11232:

This is Bush Terminal/Greenwood/Sunset Park. Of the zip codes where consent was never asked in 2020, this is the one with the highest stop-and-frisk frequency: 40 people.
11375:

Forest Hills also had a low consent-asked rate relative to its stop-and-frisk frequency of about 30 people, similar to Sunset Park.
10306:

This zip code is in Staten Island and includes Midland Beach. It had a surprising 0% consent-asked rate for the 31 people stop-and-frisked there. Just next door in 10309, where 22 people were stop-and-frisked, the consent-asked rate is close to 20%.
11210:

For the roughly 30 people who were stop-and-frisked in this part of the Flatlands in 2020, none were asked if they consented.
In Conclusion:
I’d like to challenge my fellow data scientists to visualize police data specifically. I know it’s messy. I know it’s unwieldy. But fear not the null value! Instead, keep diligent track of your missing data, make a visual, and take it one column at a time. You don’t need to tell the whole story behind the data before you start to understand it. Let your questions guide your process. Say YES to the MESS of police data with choropleths!
References:
[1] NYCLU, Stop-and-Frisk in the de Blasio Era (2019), NYCLU
[2] E. Ngo, Eric Adams explains why he supports stop-and-frisk, when it’s used legally (2021), Spectrum News
[3] New York City Council, Report to the Committees on Finance and Public Safety on the Fiscal 2022 Executive Budget for the New York Police Department (2021)
[4] A. Narishkin, K. Hauwanga, H. Gavin, J. Chou, A. Schmitz, The real cost of the police, and why the NYPD’s actual price tag is $10 billion a year (2020), Business Insider
[5] NYPD, Stop, Question and Frisk Data (2020)
[6] J. Adams, A Python Tutorial on Geomapping using Folium and GeoPandas (2022), Medium