Analysis of NYC Reported Crime Data Using Pandas

Bruna Mendes
Towards Data Science
7 min read · Feb 23, 2021


Photo by Andre Benz on Unsplash

Introduction

While I was learning Data Analysis using Pandas in Python, I decided to analyze the open data about New York City — the largest and most influential American metropolis. New York City is in reality a collection of many neighborhoods scattered among the city’s five boroughs: Manhattan, Brooklyn, the Bronx, Queens, and Staten Island. New York is the most populous and the most international city in the country.

Pandas is a high-level library for doing practical, real-world data analysis in Python. It is one of the most powerful and flexible open-source tools to analyze and manipulate data. In this article, my goal is to explore the wide range of opportunities for visual analysis with Pandas.

About the dataset

The dataset includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 to the end of 2019. The original dataset can be found on the NYC Open Data website.

Import libraries and data

  • read_csv reads the .csv data and returns a Pandas DataFrame.
  • I made the .csv dataset available on Kaggle for public use.
#import pandas
import pandas as pd # data processing and manipulation

#import data
df = pd.read_csv('NYPD_Complaint_Data_Historic.csv')
  • Check if data is successfully obtained.

df.head()

Data pre-processing

First look at the data

Firstly, we check the number of rows in the dataset to understand the size we are working with. For that, we use the DataFrame's shape attribute, df.shape, which returns the dimensions of the data as (rows, columns).
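A quick check, assuming df is the DataFrame loaded above:

rows, cols = df.shape  # shape is an attribute: (number of rows, number of columns)
print(rows, cols)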

  • Number of observations: 6,983,207
  • Variables: 35

After looking at the head of the dataset, we could already notice some NaN values, so we need to examine the missing data further before continuing with the analysis.

  • The isna() method flags missing values; combined with mean(), it gives the percentage of missing values for each variable, as computed below.
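A minimal sketch of how a summary like the one below can be produced, pairing each column's dtype with its share of missing values (the exact formatting of the original output may differ):

missing_pct = df.isna().mean().mul(100)  # share of NaN per column, as a percentage
summary = pd.DataFrame({'dtype': df.dtypes, 'missing_%': missing_pct.round(2)})
print(summary.sort_values('missing_%', ascending=False))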
PARKS_NM             object    99.64%
STATION_NAME         object    97.75%
TRANSIT_DISTRICT     float64   97.75%
HADEVELOPT           object    95.04%
HOUSING_PSA          object    92.31%
SUSP_AGE_GROUP       object    67.41%
SUSP_SEX             object    49.72%
SUSP_RACE            object    47.81%
CMPLNT_TO_DT         object    23.89%
CMPLNT_TO_TM         object    23.82%
VIC_AGE_GROUP        object    23.46%
LOC_OF_OCCUR_DESC    object    21.19%
PREM_TYP_DESC        object     0.56%
Y_COORD_CD           float64    0.34%
X_COORD_CD           float64    0.34%
Lat_Lon              object     0.34%
Latitude             float64    0.34%
Longitude            float64    0.34%
OFNS_DESC            object     0.26%
BORO_NM              object     0.15%
PATROL_BORO          object     0.09%
PD_DESC              object     0.08%
PD_CD                float64    0.08%
JURISDICTION_CODE    float64    0.08%
ADDR_PCT_CD          float64    0.03%
CMPLNT_FR_DT         object     0.009%
VIC_RACE             object     0.004%
VIC_SEX              object     0.004%
CMPLNT_FR_TM         object     0.0006%
CRM_ATPT_CPTD_CD     object     0.0001%
JURIS_DESC           object     0.00%
LAW_CAT_CD           object     0.00%
KY_CD                int64      0.00%
RPT_DT               object     0.00%
CMPLNT_NUM           int64      0.00%

Dealing with missing data

Since some of the columns are essential for the analysis, I dropped every row that was missing a value in any of those crucial columns. For that, I used Pandas' dropna() function.

For the columns where I didn't want to drop entire rows, I opted to fill the missing entries with the value 'UNKNOWN' instead (these include the variables that describe the victims of the crime, such as their age group, race, and gender). I used the fillna() function for that.

It is worth mentioning that some variables have a lot of NaN values and aren't really useful for this analysis (PARKS_NM, for instance, names the park or public place near where the crime happened, and the columns describing the suspect also have too much missing data to be informative), so I dropped those columns entirely with the drop() function.
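A minimal sketch of this cleaning step; the column lists are assumptions reconstructed from the missing-value table above, not necessarily the exact choices made in the original notebook:

# Mostly empty or irrelevant columns (assumed list)
cols_to_drop = ['PARKS_NM', 'STATION_NAME', 'TRANSIT_DISTRICT', 'HADEVELOPT',
                'HOUSING_PSA', 'SUSP_AGE_GROUP', 'SUSP_SEX', 'SUSP_RACE']
df = df.drop(columns=cols_to_drop)

# Drop rows that are missing a value in any crucial column (assumed list)
crucial_cols = ['OFNS_DESC', 'BORO_NM', 'CMPLNT_FR_DT', 'CMPLNT_FR_TM']
df = df.dropna(subset=crucial_cols)

# Keep the rows, but flag missing victim details as 'UNKNOWN'
victim_cols = ['VIC_AGE_GROUP', 'VIC_RACE', 'VIC_SEX']
df[victim_cols] = df[victim_cols].fillna('UNKNOWN')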

Preprocessing Text about Crime Type

After taking a good look at this data and removing NaN values, I realized that the crime type data is really confusing. By extracting the unique values in the OFNS_DESC column (description of offense), I can see which descriptions are less intuitive and rename those values to make them more understandable.

The replace() method returns a copy of the data with the specified values swapped for new ones. With that, I was able to copy() the dataset into a new DataFrame called 'df_clean' and rename some of the crime descriptions (e.g. 'HARRASSMENT 2' to 'HARASSMENT', 'OFF. AGNST PUB ORD SENSBLTY &' to 'OFFENSES AGAINST PUBLIC ORDER/ADMINISTRATION').
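A minimal sketch of this step, using only the two renames mentioned above (the full mapping in the original analysis is longer):

df_clean = df.copy()  # work on a copy so the original data stays untouched
rename_map = {
    'HARRASSMENT 2': 'HARASSMENT',
    'OFF. AGNST PUB ORD SENSBLTY &': 'OFFENSES AGAINST PUBLIC ORDER/ADMINISTRATION',
}
df_clean['OFNS_DESC'] = df_clean['OFNS_DESC'].replace(rename_map)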

Exploratory Analysis

1. Types of Crimes

After cleaning the data, I want to know how many types of crimes there are in New York City.

  • Use value_counts in Pandas.

The value_counts method counts how many times each type of crime appears and sorts the counts in descending order.

Below are the top 10 crimes reported in the dataset.

df_clean.OFNS_DESC.value_counts().iloc[:10].sort_values().plot(kind="barh", title = "Types of Crimes")

Here, we can see that the crime reported most frequently in New York City is “Petit Larceny”, a form of larceny in which the value of the stolen property is generally $1,000 or less under New York law.

There are three levels of crime in New York State: Violation, Misdemeanor, and Felony.

From the graph below, I can tell that Misdemeanor, an offense punishable by more than 15 days but no more than one year in jail, is the most common level of crime. The second most common is Felony, the most serious category of offense, and the third is Violation, a lesser offense punishable by no more than 15 days.
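A minimal sketch of how that graph can be produced, assuming the LAW_CAT_CD column holds the level of offense:

# Count reports per level of offense and plot them
df_clean['LAW_CAT_CD'].value_counts().plot(kind='bar', title='Level of Crime', rot=0)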

2. Distribution of Crime Over Time

I also want to know about the trend of crime incidents that have been taking place in NYC.

The first graph shows crime events by year, then by month, and, finally, the distribution of crime within a day.
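The year, month, and time columns used in the plots below are not in the raw dataset; a plausible way to derive them, assuming CMPLNT_FR_DT and CMPLNT_FR_TM hold the complaint's start date (MM/DD/YYYY) and time (HH:MM:SS), would be:

dates = pd.to_datetime(df_clean['CMPLNT_FR_DT'], format='%m/%d/%Y', errors='coerce')  # parse the date strings
df_clean['year'] = dates.dt.year
df_clean['month'] = dates.dt.month
# Hour of the day at which the incident started
df_clean['time'] = pd.to_datetime(df_clean['CMPLNT_FR_TM'], format='%H:%M:%S', errors='coerce').dt.hour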

df_clean['year'].value_counts().sort_index().plot(kind="line", title="Total Crime Events by Year")

Overall, by looking at the number of offenses recorded by the NYC police, it’s possible to see that crime levels have been decreasing consistently since 2006.

df_clean.groupby('month').size().plot(kind='bar', title='Total Crime Events by Month', color='#C0392B', rot=0)

Crimes happened most often during July, August, and October, whereas February, November, and December appear to be safer. From that, one might suspect a positive correlation between temperature and crime.

df_clean.groupby('time').size().plot(kind='bar', title='Total Crime Events by Hour of Day', color='#E67E22', xlabel='hour', rot=0)

We can tell that the safest time of day, when a crime is least likely to happen, is 5 am, while crimes are most likely to happen between 12 pm and 6 pm.

3. Distribution of Crime in each borough

df_clean['BORO_NM'].value_counts().sort_values().plot(kind="barh", color = '#1ABC9C', title = 'Total of Crime Events by Borough')

According to this visualization, Brooklyn has the overall highest number of crime events, with over 2 million reports.

4. Analyzing a Specific Crime

I want to specifically analyze sex-related crimes in NYC. For that, I filtered the rows whose crime description contains ‘SEX CRIMES’ or ‘RAPE’ into another data frame, which I called “sex_crimes.”

sex_crimes = df_clean[df_clean.OFNS_DESC.str.contains('SEX CRIMES|RAPE')]
  • The new dataset contains 104,211 sex crime reports in NYC in total.

We may be interested in the distribution of values across the years, so I’m going to group the data by year and plot the results.
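A minimal sketch of that grouping, reusing the year column derived earlier:

sex_crimes.groupby('year').size().plot(kind='bar', title='Sex Crime Reports by Year', rot=0)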

  • Based on the bar graph, sex crimes were reported most often during the last three years of the dataset compared to previous years.
  • On average, there are 7443 victims of rape and sexual assault each year in New York City.

Let’s also look at how the number of reports changes within a day.
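A minimal sketch, reusing the hour-of-day time column derived earlier:

sex_crimes.groupby('time').size().plot(kind='bar', title='Sex Crime Reports by Hour of Day', rot=0)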

Here, we can tell that the safest time of day, when a sex crime is least likely to happen in NYC, is 6 am. However, people need to be more careful between midnight and 1 am.

Analyzing the victims

Sexual violence affects millions of people. The impact of sexual assault, domestic violence, dating violence, or stalking can be life-altering for survivors and their families. Therefore, I decided to make a brief analysis of the sex crime victims in New York.
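A minimal sketch of how the breakdowns below can be computed, assuming VIC_SEX, VIC_AGE_GROUP, and VIC_RACE hold the victim's sex, age group, and race:

for col in ['VIC_SEX', 'VIC_AGE_GROUP', 'VIC_RACE']:
    # Share of each category among sex crime victims, as a percentage
    print(sex_crimes[col].value_counts(normalize=True).mul(100).round(2))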

Victim sex:
FEMALE     83.20%
MALE       14.75%
UNKNOWN     1.93%

Victim age group:
<18      48.85%
25-44    22.43%
18-24    16.44%
45-64     4.69%
65+       0.53%

Victim race:
BLACK                             32.87%
WHITE HISPANIC                    28.59%
WHITE                             16.66%
UNKNOWN                           10.13%
ASIAN / PACIFIC ISLANDER           5.85%
BLACK HISPANIC                     5.60%
AMERICAN INDIAN/ALASKAN NATIVE     0.28%

Through this analysis of the victims, I have found the following insights:

  • 83% of all victims are female. Women are approximately four times more likely to be victims of rape, attempted rape, or sexual assault.
  • Ages under 18 are at the highest risk for rape/sexual assault.
  • Black and White Hispanic residents are twice as likely to experience a rape/sexual assault compared to any other race.

Conclusion

By using Pandas, I analyzed and visualized the open data of NYC Crime Incident Reports. This library proved to be a powerful tool in data analysis, and it’s a good way to start.
After this, I am intrigued to look into more data related to crime and try different libraries for data visualization.

Thanks for reading! Feel free to check out my GitHub for the full code.

Bruna M.
