EDA with Tableau: 2021 NYPD Stop-and-Frisks by Demographics

The case for making accessible dashboards when working with public data

Louis Casanave
Towards Data Science

--

(“Silent march to end stop and frisk and racial profiling” by longislandwins is licensed under CC BY 2.0.)

The image above was taken on June 17th, 2012, at a silent march to end stop-and-frisk and racial profiling, one year before a federal court ruled that the NYPD's use of stop-and-frisk was unconstitutional. The public has a vested interest in exploring NYPD Stop-and-Frisk data. This last year in NYC, there were just under nine thousand stops made.

But what is a stop-and-frisk?

A NYPD Stop-and-Frisk Overview:

Dictionary.com defines stop-and-frisk as “the policing practice of stopping a person briefly in order to search them for weapons or prohibited items.”

The NYPD has started calling its program “Stop, Question and Frisk” this last year, a reflection of the attitude toward the practice held by Mayor Adams, a former NYPD Captain.

“I was continually pressed on my position on the policing procedure known as ‘stop and frisk’ — which is actually in law enforcement known as ‘stop, question and frisk’ — and why I believed that, if used properly, it could reduce crime without infringing on personal liberties and human rights,” — Mayor Adams [1]

This matters to New York City’s history of policing because if you Google “stop and frisk + unconstitutional”, the first result you see is the following, from Wikipedia:

In Floyd v. City of New York, decided on August 12, 2013, US District Court Judge Shira Scheindlin ruled that stop-and-frisk had been used in an unconstitutional manner and directed the police to adopt a written policy to specify where such stops are authorized.

The NYCLU has been clear on its stance on NYPD’s stop-and-frisk program.[4]

From the NYCLU:

Annual Stop-and-Frisk Numbers:

An analysis by the NYCLU revealed that innocent New Yorkers have been subjected to police stops and street interrogations more than 5 million times since 2002, and that Black and Latinx communities continue to be the overwhelming target of these tactics. At the height of stop-and-frisk in 2011 under the Bloomberg administration, over 685,000 people were stopped. Nearly 9 out of 10 stopped-and-frisked New Yorkers have been completely innocent.

…In 2021, 8,947 stops were recorded.
5,422 were innocent (61 percent).
5,404 were Black (60 percent).
2,457 were Latinx (27 percent).
732 were white (8 percent).
192 were Asian / Pacific Islander (2 percent)
71 were Middle Eastern/Southwest Asian (1 percent)

However, there’s a problem stopping most New Yorkers from approaching stop-and-frisk data. As I’ve mentioned in my previous article about NYPD Stop-and-Frisk Data, this information is made available to the public but it isn’t accessible to the public. The files published by NYPD only exist in their raw form; they aren’t summarized or aggregated. Each row of the file represents one record of a stop, and looking at the raw files goes a bit like this:

(Video by author. Link for the video clip on youtube is here: https://www.youtube.com/watch?v=illObkP3TLQ)

For data that needs to be accessible to the public, a tool like Tableau Public is an excellent match. Its point-and-click graphical user interface is intuitive for the data-savvy and data novice alike. By using Tableau Public for my EDA, I got the benefit of interactive visualization, which helped my own understanding of the data. Perhaps more importantly, I built something accessible that the average New Yorker can use to understand this last year’s Stop-and-Frisks.

Let’s take a look at what I was able to build. Then I’ll be working backward to show my methods so you can understand how I was able to build this dashboard and utilize it for EDA.

My EDA is going to be focused on understanding the relationships between demographics (race, age and sex), the police precincts where the stop-and-frisks happened, and the suspected crime of the person stopped. I’ll also be looking into whether or not Physical Force of any kind was used, and whether or not the stop led to an arrest.

Results:

https://public.tableau.com/views/StopandFrisk2021NYPD/StopandFrisk2021DemographicOverview?:language=en-US&:display_count=n&:origin=viz_share_link

Here’s a static picture of the dashboard I made, as Medium doesn’t support embedding Tableau Public dashboards.

(Image by the author)

The Data:

NYPD posts all of its available data to NYC Open Data, and here’s their Stop, Question, and Frisk Data [2].

NYC has its NYPD precinct geojson file here under “School, Police, Health and Fire” [3].

The Code:

My GitHub repo, where I cleaned the data and prepared it for aggregation, is here. I won’t be going over every step I took; instead I’ll give an overview of the choices I made.

1. I made STOP_FRISK_DATE and STOP_FRISK_TIME into a date time column for each stop and resaved the file as a csv.gz file to save some space:

(Image by the author)
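
Since the original code lives in an image, here is a minimal sketch of what this step could look like in pandas. The sample rows, the date and time string formats, the new column name STOP_FRISK_DATETIME and the output file name are all assumptions made for illustration; only STOP_FRISK_DATE and STOP_FRISK_TIME come from the data set.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the raw NYPD file; the values are made up.
raw = pd.DataFrame({
    "STOP_FRISK_DATE": ["2021-01-01", "2021-01-02"],
    "STOP_FRISK_TIME": ["01:50:00", "23:15:00"],
})

# Join the date and time strings, then parse them into one datetime column.
# STOP_FRISK_DATETIME is a hypothetical column name.
raw["STOP_FRISK_DATETIME"] = pd.to_datetime(
    raw["STOP_FRISK_DATE"] + " " + raw["STOP_FRISK_TIME"]
)

# A .csv.gz file is an ordinary CSV compressed with gzip.
out_path = os.path.join(tempfile.gettempdir(), "sqf_2021_example.csv.gz")
raw.to_csv(out_path, index=False, compression="gzip")

# pandas infers the compression from the file extension when reading back.
reloaded = pd.read_csv(out_path, parse_dates=["STOP_FRISK_DATETIME"])
```

Reading the compressed file back needs no extra arguments, which is what makes csv.gz a convenient space saver.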

2. I separated out the columns of interest and only kept those:

(Image by the author)
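
A rough sketch of keeping only the columns of interest, assuming a small made-up frame. OFFICER_NOTE is a hypothetical column standing in for everything that gets dropped, and the list of kept columns is illustrative rather than the full list used in the repo.

```python
import pandas as pd

# Made-up example rows; OFFICER_NOTE is a hypothetical column to drop.
data = pd.DataFrame({
    "SUSPECT_RACE_DESCRIPTION": ["BLACK"],
    "SUSPECT_SEX": ["MALE"],
    "SUSPECT_REPORTED_AGE": ["40"],
    "OFFICER_NOTE": ["not needed for this analysis"],
})

# Illustrative subset of the columns kept for the dashboard.
columns_of_interest = [
    "SUSPECT_RACE_DESCRIPTION",
    "SUSPECT_SEX",
    "SUSPECT_REPORTED_AGE",
]

# .copy() avoids chained-assignment warnings in later cleaning steps.
data = data[columns_of_interest].copy()
```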

3. I replaced some of the more encoded crime descriptions with their full wordage:

(Image by the author)
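
One way to do this kind of replacement is a dictionary passed to `Series.replace`. In the sketch below, the column name SUSPECTED_CRIME_DESCRIPTION and the abbreviations are assumptions for illustration; the actual mapping is in the repo.

```python
import pandas as pd

# Made-up values; the column name and abbreviations are assumptions.
data = pd.DataFrame({
    "SUSPECTED_CRIME_DESCRIPTION": ["CPW", "ROBBERY", "CPSP"],
})

# Hypothetical mapping from encoded descriptions to full wording.
full_names = {
    "CPW": "CRIMINAL POSSESSION OF A WEAPON",
    "CPSP": "CRIMINAL POSSESSION OF STOLEN PROPERTY",
}

# Values that are not keys in the dictionary are left untouched.
data["SUSPECTED_CRIME_DESCRIPTION"] = (
    data["SUSPECTED_CRIME_DESCRIPTION"].replace(full_names)
)
```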

4. I changed some of the racial descriptions towards more nuanced language where I could. (Business Insider does a great job of breaking down why “Middle Eastern” is dated language here.[4]) Changing mistakes like MALE and (null) values to UNKNOWN in the SUSPECT_RACE_DESCRIPTION column was pretty simple. I didn’t want to change too many of these values because I found it pertinent to keep precision in how the police are racially profiling people.

However, arguments could be made to combine BLACK HISPANIC with BLACK and WHITE HISPANIC with WHITE, or to make a HISPANIC or LATINX category unto itself. Changing ASIAN / PACIFIC ISLANDER to E. ASIAN is the only category where I lose some of this nuance in exchange for a more clipped name for ease of use, which is something I fix once I’m in Tableau with an alias.

(Image by the author)
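
A minimal sketch of the replacements described in this step, using made-up rows. Only the three substitutions named in the text are shown.

```python
import pandas as pd

# Made-up rows that include the mistaken MALE and (null) entries.
data = pd.DataFrame({
    "SUSPECT_RACE_DESCRIPTION": [
        "BLACK", "MALE", "(null)", "ASIAN / PACIFIC ISLANDER", "WHITE HISPANIC",
    ],
})

# Only the changes mentioned in the text; everything else stays as recorded.
race_fixes = {
    "MALE": "UNKNOWN",
    "(null)": "UNKNOWN",
    "ASIAN / PACIFIC ISLANDER": "E. ASIAN",
}

data["SUSPECT_RACE_DESCRIPTION"] = (
    data["SUSPECT_RACE_DESCRIPTION"].replace(race_fixes)
)
```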

5. I replaced Y with 1 and UNKNOWN with 0 in the columns about Physical Force Used. When I aggregate, this will tally all the Ys together.

(Image by the author)
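
Here is a sketch of one way to turn the flags into numbers. The two column names are assumed examples that follow the PHYSICAL_FORCE_ prefix, and the comparison treats anything other than Y as 0, which matches the step above only if Y and UNKNOWN are the sole values present.

```python
import pandas as pd

# Made-up rows; the column names are assumed examples of the
# PHYSICAL_FORCE_ columns in the data set.
data = pd.DataFrame({
    "PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG": ["Y", "UNKNOWN", "Y"],
    "PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG": ["Y", "Y", "UNKNOWN"],
})

force_cols = [c for c in data.columns if c.startswith("PHYSICAL_FORCE_")]

# True/False becomes 1/0, so Y maps to 1 and UNKNOWN maps to 0.
for col in force_cols:
    data[col] = (data[col] == "Y").astype(int)
```

Once the flags are 0/1, summing a column counts the Y values and averaging it gives the share of stops where that kind of force was recorded.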

6. For aggregation purposes and our descriptive statistics, I also made a column that tallies all counts of Physical Force used during a single encounter (row.)

(Image by the author)

Which returns to us the following table:

(Image by the author)

What’s very important to note about this return is that out of the 8,947 rows this data represents, fewer than 100 had an UNKNOWN amount of Physical Force used. All the other stops in this data set had at least one known count of some kind of Physical Force used. We’ll come back to this when we get our descriptive statistics.
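
The per-row tally from step 6 can be sketched as a row-wise sum over the 0/1 force columns. PHYSICAL_FORCE_TOTAL is a hypothetical name for the new column, and the rows are made up.

```python
import pandas as pd

# Made-up rows with the force flags already converted to 0/1.
data = pd.DataFrame({
    "PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG": [1, 0, 0],
    "PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG": [1, 1, 0],
})

# Collect the flag columns before adding the total, so the total
# is not included in its own sum.
force_cols = [c for c in data.columns if c.startswith("PHYSICAL_FORCE_")]

# axis=1 sums across the columns of each row (one row = one stop).
data["PHYSICAL_FORCE_TOTAL"] = data[force_cols].sum(axis=1)

# How many stops had 0, 1, 2, ... counts of force recorded.
total_counts = data["PHYSICAL_FORCE_TOTAL"].value_counts().sort_index()
```

A total of 0 corresponds to the rows where every force flag was UNKNOWN.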

7. I made a boolean column to capture if Physical Force of any kind was used. Note that when I aliased this column in Tableau, FALSE became UNKNOWN :

(Image by the author)
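
A sketch of the boolean column, assuming a per-row total like the one from step 6 already exists. Both column names here are hypothetical.

```python
import pandas as pd

# Made-up per-stop totals of physical force counts.
data = pd.DataFrame({"PHYSICAL_FORCE_TOTAL": [2, 1, 0]})

# True when at least one kind of force was recorded for the stop.
data["PHYSICAL_FORCE_USED"] = data["PHYSICAL_FORCE_TOTAL"] > 0
```

Because a 0 only means no force flag was marked Y, aliasing FALSE as UNKNOWN in Tableau keeps the dashboard from implying that no force was used.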

8. I set the DateTime column as the index and changed the STOP_LOCATION_PRECICNT column to a string datatype, since it’s categorical information and we don’t want to compute on these values.

(Image by the author)
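
A minimal sketch of this step. The datetime column name is an assumption, the rows are made up, and the precinct column is spelled as it appears in this article.

```python
import pandas as pd

# Made-up rows; STOP_FRISK_DATETIME is a hypothetical column name.
data = pd.DataFrame({
    "STOP_FRISK_DATETIME": pd.to_datetime(
        ["2021-01-01 01:50:00", "2021-01-02 23:15:00"]
    ),
    "STOP_LOCATION_PRECICNT": [75, 14],
})

# Use the timestamp of each stop as the row index.
data = data.set_index("STOP_FRISK_DATETIME")

# Precinct numbers are labels, not quantities, so store them as strings.
data["STOP_LOCATION_PRECICNT"] = data["STOP_LOCATION_PRECICNT"].astype(str)
```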

9. I changed some of the values in SUSPECT_REPORTED_AGE that didn’t make sense into UNKNOWN after doing a data['SUSPECT_REPORTED_AGE'].value_counts() and discovering some age values that were illogical like 0 years and 120 years.

(Image by the author.)
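
One possible way to flag implausible ages is sketched below. The cutoffs (younger than 1 or older than 100) are assumptions chosen to catch values like 0 and 120; the repo may use a different rule.

```python
import pandas as pd

# Made-up values, stored as strings as they would be in a raw CSV.
data = pd.DataFrame({
    "SUSPECT_REPORTED_AGE": ["40", "0", "120", "(null)", "19"],
})

# Non-numeric entries become NaN instead of raising an error.
ages = pd.to_numeric(data["SUSPECT_REPORTED_AGE"], errors="coerce")

# Assumed plausibility window; adjust after inspecting value_counts().
implausible = ages.isna() | (ages < 1) | (ages > 100)

data.loc[implausible, "SUSPECT_REPORTED_AGE"] = "UNKNOWN"
```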

10. I sliced out of our data the rows where the age was a known value, and converted those values into integers so we can compute them.

(Image by the author)

I did this because I want to make an AGE_CATEGORY categorical column based on the quartile ranges of ages. By splitting up ages by quartile ranges, I knew I could split the data into groups of roughly equal size that follow how the ages are distributed.

(Image by the author)

This bit of code returned to us the numbers that correlate to where the age bins should be.

(Image by the author)
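
The slicing and quartile lookup can be sketched like this, with made-up ages. On the real data, the three quartile values are what suggest where the bin edges should go.

```python
import pandas as pd

# Made-up ages, with UNKNOWN marking the rows cleaned in step 9.
data = pd.DataFrame({
    "SUSPECT_REPORTED_AGE": [
        "18", "UNKNOWN", "25", "31", "40", "22", "UNKNOWN", "52",
    ],
})

# Keep only rows with a known age and make the values numeric.
known = data[data["SUSPECT_REPORTED_AGE"] != "UNKNOWN"].copy()
known["SUSPECT_REPORTED_AGE"] = known["SUSPECT_REPORTED_AGE"].astype(int)

# 25th, 50th and 75th percentiles of the known ages.
quartiles = known["SUSPECT_REPORTED_AGE"].quantile([0.25, 0.5, 0.75])
```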

So I made my bins to capture ages 20 and below, 21–28, 28–38 and 38 and above:

This bit of code returns the SUSPECT_REPORTED_AGE column next to the AGE_CATEGORY column we just made so we can compare them. Let’s see if it worked.

It worked! I can tell it worked because our first row, indexed at 2021-01-01 01:50:00 is listed as being 40 years old and in the Thirtyeight-Above category and so forth.
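
A sketch of the binning using `pd.cut`. The bin ranges listed in the text share their edges (28 and 38), so this example assumes an age that falls exactly on an edge goes into the older bin. Only the label Thirtyeight-Above appears in the article; the other three labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up known ages, already converted to integers.
known = pd.DataFrame({"SUSPECT_REPORTED_AGE": [40, 19, 21, 28, 37, 38]})

# With right=False each bin includes its left edge and excludes its right:
# [0, 21), [21, 28), [28, 38), [38, inf)
bin_edges = [0, 21, 28, 38, np.inf]
labels = [
    "Twenty-Below",             # hypothetical label
    "Twentyone-Twentyseven",    # hypothetical label
    "Twentyeight-Thirtyseven",  # hypothetical label
    "Thirtyeight-Above",        # label seen in the article
]

known["AGE_CATEGORY"] = pd.cut(
    known["SUSPECT_REPORTED_AGE"],
    bins=bin_edges,
    labels=labels,
    right=False,
)
```

Printing `known[["SUSPECT_REPORTED_AGE", "AGE_CATEGORY"]]` gives the same kind of side-by-side check described above.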

From there, I wanted to mix our age unknowns back into the data with their age category as UNKNOWN :

(Image by author)

Which returns the following:

(Image by the author)
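
Mixing the unknown ages back in can be sketched with `pd.concat`. The rows and timestamps are made up, and AGE_CATEGORY is kept as plain strings here so both frames combine cleanly.

```python
import pandas as pd

# Made-up rows with known ages and their categories.
known = pd.DataFrame(
    {
        "SUSPECT_REPORTED_AGE": [40, 19],
        "AGE_CATEGORY": ["Thirtyeight-Above", "Twenty-Below"],
    },
    index=pd.to_datetime(["2021-01-01 01:50:00", "2021-01-01 05:00:00"]),
)

# Made-up row whose age was marked UNKNOWN earlier.
unknown = pd.DataFrame(
    {"SUSPECT_REPORTED_AGE": ["UNKNOWN"]},
    index=pd.to_datetime(["2021-01-01 03:10:00"]),
)
unknown["AGE_CATEGORY"] = "UNKNOWN"

# Stack the two frames and restore chronological order.
data = pd.concat([known, unknown]).sort_index()
```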

Looks like it took! From there, I wanted to get a sense of the overall distribution of ages in the AGE_CATEGORY column.

Which returned the following:

(Image by the author)

Looks like our quantiles worked to capture the distribution evenly, and our UNKNOWN age category is just under half the size of each of our other categories.

11. When I looked into the SUSPECT_SEX distributions like so:

(Image by the author)

It returned the following:

(Image by the author)

This makes FEMALE a small minority in this data set. Having the majority of folks be an UNKNOWN gender didn’t make very much sense. I decided to change UNKNOWN to MALE, with the understanding that there may have been some folks who were neither female nor male who may experience erasure in the data by handling it this way.

(Image by the author)
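
A minimal sketch of checking the distribution and then applying the replacement described above, using made-up rows.

```python
import pandas as pd

# Made-up rows for illustration.
data = pd.DataFrame({"SUSPECT_SEX": ["MALE", "UNKNOWN", "FEMALE", "UNKNOWN"]})

# Distribution before the change.
before = data["SUSPECT_SEX"].value_counts()

# Fold UNKNOWN into MALE, as described in the text.
data["SUSPECT_SEX"] = data["SUSPECT_SEX"].replace({"UNKNOWN": "MALE"})

# Distribution after the change.
after = data["SUSPECT_SEX"].value_counts()
```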

12. I made a column simply for aggregation purposes to count the total number of stops in any given category. I will use this column quite a bit in Tableau.

(Image by the author)
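
The counting column can be as simple as a constant 1 on every row. TOTAL_STOPS is an assumed name based on the Total Stops field used in Tableau, and the rows are made up.

```python
import pandas as pd

# Made-up rows; the precinct column is spelled as in this article.
data = pd.DataFrame({"STOP_LOCATION_PRECICNT": ["75", "75", "14"]})

# Every row is one stop, so a constant 1 sums to a count in any grouping.
data["TOTAL_STOPS"] = 1

stops_by_precinct = data.groupby("STOP_LOCATION_PRECICNT")["TOTAL_STOPS"].sum()
```

In Tableau the same effect comes from dragging the field in and aggregating it with SUM.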

13. I made the SUSPECT_ARRESTED_FLAG column suitable for aggregation and made its counterpart column SUSPECT_NOT_ARRESTED_FLAG show the opposite values so that I can use both.

(Image by the author)

So if someone was arrested, it would show a value of 1 for SUSPECT_ARRESTED_FLAG and a value of 0 for SUSPECT_NOT_ARRESTED_FLAG which will help me out during visualization.
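
A sketch of the two flag columns, assuming the raw SUSPECT_ARRESTED_FLAG holds Y and N values (an assumption; check the raw file).

```python
import pandas as pd

# Made-up rows; Y/N as raw values is an assumption.
data = pd.DataFrame({"SUSPECT_ARRESTED_FLAG": ["Y", "N", "N"]})

# 1 when the stop led to an arrest, 0 otherwise.
data["SUSPECT_ARRESTED_FLAG"] = (data["SUSPECT_ARRESTED_FLAG"] == "Y").astype(int)

# The counterpart column is simply the opposite value.
data["SUSPECT_NOT_ARRESTED_FLAG"] = 1 - data["SUSPECT_ARRESTED_FLAG"]
```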

14. I took a look into the descriptive statistics using data.describe() and here are my findings:

(Image by the author)
  • On average, about 37% of stops led to an arrest and about 62% did not. The standard deviation is about 50%, which means the variability is higher than the arrest rate itself.
  • Firearms were drawn in just under 6% of stops.
  • Handcuffs were used in about 20% of stops.
(Image by the author)
  • OC spray wasn’t used during stops in 2021, according to the data.
  • Physical force “other” was used in about 3% of stops.
  • Restraint was used in just under 3% of stops.
  • Physical Force Verbal Instruction was used in 90% of stops, making it a regular part of most stops.
(Image by the author)
  • Weapon Impact was hardly ever recorded as being used.
  • The physical force total, perhaps the most interesting metric in this set of stats, averages about 1.25 counts per stop. That means if we sum up all the various counts of physical force used, there were about a quarter more counts of physical force than there were stops. This is a very interesting statistic. With the standard deviation at about 0.5, or half a count of a kind of physical force, the average experience was that physical force was used on a stop, and the amount used normally varied by about half a count.
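
To see why these readings work, note that the mean of a 0/1 column is the share of rows equal to 1, while the mean of a count column is an average count per row. The sketch below uses made-up rows and hypothetical column names, chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Made-up rows: 3 arrests out of 8 stops, and 10 force counts over 8 stops.
data = pd.DataFrame({
    "SUSPECT_ARRESTED_FLAG": [1, 0, 0, 1, 0, 0, 0, 1],
    "PHYSICAL_FORCE_TOTAL": [1, 1, 2, 1, 0, 2, 1, 2],
})

summary = data.describe()

# Mean of a 0/1 flag = share of stops with the flag set (3 / 8).
arrest_rate = summary.loc["mean", "SUSPECT_ARRESTED_FLAG"]

# Mean of a count column = average counts per stop (10 / 8), not a percent.
mean_force_counts = summary.loc["mean", "PHYSICAL_FORCE_TOTAL"]

# For a 0/1 flag the standard deviation is close to sqrt(p * (1 - p)),
# which sits near 0.5 whenever p is not close to 0 or 1.
arrest_std = summary.loc["std", "SUSPECT_ARRESTED_FLAG"]
```

This is also why a standard deviation near 50% on the arrest flag says little beyond the rate itself: it is largely determined by the rate.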

From here I just saved the data using data.to_csv('2021_data.csv') and proceeded to make the dashboard with Tableau Public.

The Dashboard:

This was my first time using Tableau Public and it took me a minute to get my bearings. Here are my generalized steps, but for more detail, I encourage you to download the dashboard itself and start playing around with it.

1. Adding the precincts.json file as a Spatial file and the 2021_data.csv file as a text file, and defining the relationship between them as Precicnt1 = STOP_LOCATION_PRECICNT:

(Image by the author)

2. Making the Outcome by Precinct map by dragging the Geometry onto the blank page and making Stop Location Precicnt a dimension detail (not a SUM), which allows me to hover over each precinct and have that precinct pop. From there it was a matter of what I wanted to show by precinct. I chose to show the total stops made in that precinct and a calculated field I made called Not-Arrested Rate. Details on how I calculated that field are here:

(Image by the author)
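
The Tableau formula itself is in the image. As a rough pandas equivalent, and assuming the rate is the sum of the not-arrested flag divided by the sum of total stops within each precinct, the same number could be computed like this. The rows and the names TOTAL_STOPS and NOT_ARRESTED_RATE are made up for illustration.

```python
import pandas as pd

# Made-up rows: precinct 75 has 3 of 4 stops without arrest, precinct 14 has 1 of 2.
data = pd.DataFrame({
    "STOP_LOCATION_PRECICNT": ["75", "75", "75", "75", "14", "14"],
    "SUSPECT_NOT_ARRESTED_FLAG": [1, 1, 1, 0, 1, 0],
    "TOTAL_STOPS": [1, 1, 1, 1, 1, 1],
})

# Sum both columns within each precinct, then divide the sums.
by_precinct = data.groupby("STOP_LOCATION_PRECICNT")[
    ["SUSPECT_NOT_ARRESTED_FLAG", "TOTAL_STOPS"]
].sum()

by_precinct["NOT_ARRESTED_RATE"] = (
    by_precinct["SUSPECT_NOT_ARRESTED_FLAG"] / by_precinct["TOTAL_STOPS"]
)
```

Dividing the sums, rather than averaging per-row ratios, keeps the rate correct when the view is filtered or regrouped.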

3. For the Suspected Crime of Stopped plot, I dragged Total Stops to the Columns and made sure it was taking the SUM. I then dragged the Suspected Crime pill over to rows.

(Image by the author)

I decided to alias “Selling of Marijuana” as “Selling of Cannabis” because I wanted to avoid language that stigmatizes Spanish speakers. More on that from NPR [5].

Making the numbers pop next to the bar was simple too, just dragging SUM(Total Stops) to the text box like this:

(Image by author)

4. Making the Racial Disparity Over Time plot was a two-step process. First I dragged the time index column over to Columns and selected Month instead of Year. Then I dragged Total Stops over to Rows and made sure it was taking the SUM.

(Image by the author)

Then I dragged the Suspect Race Description pill over to Marks and made it a color detail.

(Image by the author)
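
For comparison, the table behind a plot like this could be built in pandas roughly as follows. This is only a sketch with made-up rows; MONTH and TOTAL_STOPS are assumed column names.

```python
import pandas as pd

# Made-up stops indexed by timestamp; TOTAL_STOPS is 1 on every row.
data = pd.DataFrame(
    {
        "SUSPECT_RACE_DESCRIPTION": ["BLACK", "BLACK", "WHITE", "BLACK"],
        "TOTAL_STOPS": 1,
    },
    index=pd.to_datetime(
        ["2021-01-03", "2021-01-20", "2021-01-25", "2021-02-02"]
    ),
)

# Month number of each stop, taken from the datetime index.
data["MONTH"] = data.index.month

# One row per month, one column per race, cells hold summed stops.
monthly = (
    data.groupby(["MONTH", "SUSPECT_RACE_DESCRIPTION"])["TOTAL_STOPS"]
    .sum()
    .unstack(fill_value=0)
)
```

Each column of `monthly` corresponds to one colored line in the Tableau plot.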

5. Making the Ages plot was similar, I dragged Age Category over to columns and Total Stops over to Rows, making sure it was taking the SUM.

(Image by the author)

Then I likewise dragged Suspect Race Description over to the color Marks.

(Image by the author.)

6. Putting it all together onto one dashboard, making each plot interact, and setting up all the filters, was a fun process of mess-around-and-find-out. To see my full process, the easiest thing would be to download the dashboard from Tableau Public and mess/find out yourself. Further resources are abundant in the Tableau community documentation: https://help.tableau.com/current/pro/desktop/en-us/filtering.htm [6]

As a note: this dashboard could still stand to see some changes or improvements. I haven’t forayed into using the SETS functionality to group all precincts in the same borough together for example. Nor have I connected a way to show the stops that have a count of 2 or more kinds of Physical Force used, which may be pertinent when separating out PHYSICAL_FORCE_.

Why is this better than doing all my EDA in Python?

  1. I can interface with the data with rapid-fire questions without being slowed down by making a ton of static tables and graphs.
  2. I can share my findings extremely easily with the public and my stakeholders.

For example, what if I just wanted to get a breakdown of stats where the Not-Arrested rate is the highest? On the map, I could easily spot precinct 75 (East New York) as the darkest red, and when I click on it, I can see the distribution breakdowns of the Suspected Crimes just in that precinct being overwhelmingly Criminal Possession of a Weapon. From here I can also see the distributions of ages by race, and the time the arrests happened by race. A single click has saved me many, many lines of code!

(Image by the author)

When we use those descriptive statistics as insights into this precinct, we can understand that on average, Precinct 75 is stopping folks much younger than the average age across all stops (28 years old). We also learn that Precinct 75’s Not-Arrested rate is much higher than the average for the city, 83% compared to 67%. If we take a look at the timeline, it occurs to me to ask “What started happening in April to spike stops, especially when stopping Black folks?”

(Image by the author)

When we compare our graphs using the Outcome filter, we can see that out of the 330 people stopped in Precinct 75 in this last year for suspected criminal possession of a weapon, 278 of those folks were not arrested.

Further findings:

Suspicion of Criminal Possession of a Weapon Stops That Do Not Lead to Arrest:

Of the 8,947 stops made this year, 2,471 of them were under suspicion of criminal possession of a weapon and did not lead to an arrest. That’s just under 28% of all stops. By looking at the Ages plot, we can see that more than a fourth of those stops were stopping folks under twenty years of age, or where their age was not listed. We can also see that Black folks make up a majority of these stops. And when we look over to the Racial Disparity Over Time plot we can see a wave that starts in September and peaks in November, where the number of stops is more than double what it was in September. It raises the question: what started happening in September, and why are the police stopping young Black men under suspicion of criminal possession of a weapon as a result?
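
As a quick arithmetic check on the share of stops quoted above, using only the two counts from the text:

```python
# Counts taken from the paragraph above.
total_stops = 8947
cpw_not_arrested = 2471

# Share of all stops that were weapon-suspicion stops with no arrest.
share = cpw_not_arrested / total_stops
share_pct = round(share * 100, 1)
```

This works out to roughly 27.6% of all stops.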

(Image by the author)

Staten Island Stops:

When we take a look at the stops that happened only within Staten Island, we get a different slice of the data. We see even more evidence of targeting younger folks for stops (but less of a tendency to leave the age unlisted), and young Black folks make up half of the stops in the twenty-and-under category.

(Image by the author.)

What’s most notable about the stops that happened in Staten Island is a vast majority of them used physical force. These two photos show all the stops that happened on Staten Island, and then I isolate only the stops that used Physical Force of some kind. These two photos are almost identical.

(Image by the author)

That’s because there were only 5 stops that happened in Staten Island this last year where it was unconfirmed if Physical Force of some kind was used or not. All the rest of the stops there involved Physical Force. In the future, I’d want to run hypothesis tests to see how different Staten Island data is from the rest of NYC, as we previously discovered that at least one count of Physical Force being used is common in 90% of stops this last year city-wide.

Precinct 14’s Stops and Women:

When just looking at the stops where those who were stopped were listed as Female (and bearing in mind that while most female folks ID as women, this is an oversimplification), we can see one precinct in Lower Manhattan in particular where the not-arrested rate is particularly high (dark red).

(Image by the author)

On further investigation, we can see that this is precinct 14, which had 60 stops made up of women this last year, 90% of whom were not arrested. That’s quite a high rate compared to our 67% not-arrest rate for all stops across the city.

(Image by the author.)

Of these 60 stops, 40 (67% or two-thirds) were under suspicion of criminal possession of a weapon. When we look at their ages, we can see that more than half of these women were under 28 years old, and a majority of those who didn’t have their age listed were of East Asian or Pacific descent. When we look at our Racial Disparity Over Time plot, we can see that these stops had a big spike between October and November.

When we take an even closer look, we can see that a majority of these stops did not lead to arrests. By comparing these two snapshots we can see the difference in the Racial Disparity Over Time is made up of mostly Black women. Importantly, most of the stops that led to the arrests of these Black women didn’t happen when stops were spiking from October to November.

By using Tableau, I can share these findings with a much larger audience, and I was able to rapidly fire questions at the data with ease. Tableau has proven extremely useful in helping me explore the data, and in finding where I want to inquire into the data even further.

In conclusion, I’ll be incorporating Tableau into my EDA process and I encourage you to consider doing the same. It saves you time as the person asking questions of the data, and your stakeholders will thank you for being able to understand and interact with the data themselves.

[1] E. Adams, How we make New York City safe: Mayor-elect Eric Adams explains why we need stop and frisk and proactive policing (2021), New York Daily News

[2] NYPD, Stop, Question and Frisk Data (2021)

[3] NYC, Political and Administrative Districts — Download and Metadata (2021)

[4] NYCLU, Stop-And-Frisk Data (2021), NYCLU

--

Louis is a Data Scientist who loves: writing, python and maps. He knows a better world is possible, and data science can help.