Analysis of NYC Reported Crime Data Using Pandas

Bruna Mendes
Towards Data Science
7 min read · Feb 23, 2021


Photo by Andre Benz on Unsplash

Introduction

While I was learning Data Analysis using Pandas in Python, I decided to analyze the open data about New York City — the largest and most influential American metropolis. New York City is in reality a collection of many neighborhoods scattered among the city’s five boroughs: Manhattan, Brooklyn, the Bronx, Queens, and Staten Island. New York is the most populous and the most international city in the country.

Pandas is a high-level library for doing practical, real-world data analysis in Python. It is one of the most powerful and flexible open-source tools to analyze and manipulate data. In this article, my goal is to explore the wide range of opportunities for visual analysis with Pandas.

About the dataset

The dataset includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 to the end of 2019. The original dataset can be found on the NYC Open Data website.

Import libraries and data

  • read_csv reads the .csv data and returns a Pandas DataFrame.
  • I made the .csv dataset available on Kaggle for public use.
#import pandas
import pandas as pd # data processing and manipulation

#import data
df = pd.read_csv('NYPD_Complaint_Data_Historic.csv')
  • Check if data is successfully obtained.

df.head()

Data pre-processing

First look at the data

Firstly, we check the number of rows in the dataset to understand the size we are working with. For that, we use the DataFrame's shape attribute, df.shape, which returns the dimensions of the data as (rows, columns).
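A quick check, assuming df is the DataFrame loaded above:

rows, cols = df.shape  # shape is an attribute: (number of rows, number of columns)
print(rows, cols)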

  • Number of observations: 6,983,207
  • Variables: 35

After looking at the head of the dataset, we could already notice some NaN values, so we need to examine the missing data further before continuing with the analysis.

  • The isna() method flags missing values; combined with mean(), it gives the percentage of missing values for each variable, as computed below.
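A minimal sketch of how a summary like the one below can be produced, pairing each column's dtype with its share of missing values (the exact formatting of the original output may differ):

missing_pct = df.isna().mean().mul(100)  # share of NaN per column, as a percentage
summary = pd.DataFrame({'dtype': df.dtypes, 'missing_%': missing_pct.round(2)})
print(summary.sort_values('missing_%', ascending=False))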
PARKS_NM             object    99.64%
STATION_NAME         object    97.75%
TRANSIT_DISTRICT     float64   97.75%
HADEVELOPT           object    95.04%
HOUSING_PSA          object    92.31%
SUSP_AGE_GROUP       object    67.41%
SUSP_SEX             object    49.72%
SUSP_RACE            object    47.81%
CMPLNT_TO_DT         object    23.89%
CMPLNT_TO_TM         object    23.82%
VIC_AGE_GROUP        object    23.46%
LOC_OF_OCCUR_DESC    object    21.19%
PREM_TYP_DESC        object     0.56%
Y_COORD_CD           float64    0.34%
X_COORD_CD           float64    0.34%
Lat_Lon              object     0.34%
Latitude             float64    0.34%
Longitude            float64    0.34%
OFNS_DESC            object     0.26%
BORO_NM              object     0.15%
PATROL_BORO          object     0.09%
PD_DESC              object     0.08%
PD_CD                float64    0.08%
JURISDICTION_CODE    float64    0.08%
ADDR_PCT_CD          float64    0.03%
CMPLNT_FR_DT         object     0.009%
VIC_RACE             object     0.004%
VIC_SEX              object     0.004%
CMPLNT_FR_TM         object     0.0006%
CRM_ATPT_CPTD_CD     object     0.0001%
JURIS_DESC           object     0.00%
LAW_CAT_CD           object     0.00%
KY_CD                int64      0.00%
RPT_DT               object     0.00%
CMPLNT_NUM           int64      0.00%

Dealing with missing data

Since some of the columns are essential for the analysis, I dropped every row that was missing a value in any of those crucial columns. For that, I used Pandas' dropna() function.

For the columns where I didn't want to drop entire rows, I opted to fill the missing entries with the value 'UNKNOWN' instead (these include the variables that describe the victims of the crime, such as their age group, race, and gender). I used the fillna() function for that.

It is worth mentioning that some variables have a lot of NaN values and aren't really useful for this analysis (PARKS_NM, for instance, names the park or public place near where the crime happened, and the columns describing the suspect also have too much missing data to be informative), so I dropped those columns entirely with the drop() function.
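A minimal sketch of this cleaning step; the column lists are assumptions reconstructed from the missing-value table above, not necessarily the exact choices made in the original notebook:

# Mostly empty or irrelevant columns (assumed list)
cols_to_drop = ['PARKS_NM', 'STATION_NAME', 'TRANSIT_DISTRICT', 'HADEVELOPT',
                'HOUSING_PSA', 'SUSP_AGE_GROUP', 'SUSP_SEX', 'SUSP_RACE']
df = df.drop(columns=cols_to_drop)

# Drop rows that are missing a value in any crucial column (assumed list)
crucial_cols = ['OFNS_DESC', 'BORO_NM', 'CMPLNT_FR_DT', 'CMPLNT_FR_TM']
df = df.dropna(subset=crucial_cols)

# Keep the rows, but flag missing victim details as 'UNKNOWN'
victim_cols = ['VIC_AGE_GROUP', 'VIC_RACE', 'VIC_SEX']
df[victim_cols] = df[victim_cols].fillna('UNKNOWN')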

Preprocessing Text about Crime Type

After taking a good look at this data and removing NaN values, I realized that the crime type data is really confusing. By extracting the unique values in the OFNS_DESC column (description of offense), I can see which descriptions are less intuitive and rename those values to make them more understandable.

The replace() method returns a copy of the data with the specified values swapped for new ones. With that, I was able to copy() the dataset into a new DataFrame called 'df_clean' and rename some of the crime descriptions (e.g. 'HARRASSMENT 2' to 'HARASSMENT', 'OFF. AGNST PUB ORD SENSBLTY &' to 'OFFENSES AGAINST PUBLIC ORDER/ADMINISTRATION').
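A minimal sketch of this step, using only the two renames mentioned above (the full mapping in the original analysis is longer):

df_clean = df.copy()  # work on a copy so the original data stays untouched
rename_map = {
    'HARRASSMENT 2': 'HARASSMENT',
    'OFF. AGNST PUB ORD SENSBLTY &': 'OFFENSES AGAINST PUBLIC ORDER/ADMINISTRATION',
}
df_clean['OFNS_DESC'] = df_clean['OFNS_DESC'].replace(rename_map)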

Exploratory Analysis

1. Types of Crimes

After cleaning the data, I want to know how many types of crimes there are in New York City.

  • Use value_counts in Pandas.

The value_counts method counts how many times each type of crime appears and sorts the counts in descending order.

Below are the top 10 crimes reported in the dataset.

df_clean.OFNS_DESC.value_counts().iloc[:10].sort_values().plot(kind="barh", title = "Types of Crimes")

Here, we can see that the crime reported most frequently in New York City is “Petit Larceny”, a form of larceny in which the value of the stolen property is generally $1,000 or less under New York law.

There are three levels of crime in New York State: Violation, Misdemeanor, and Felony.

From the graph below, I can tell that Misdemeanor, an offense punishable by more than 15 days but no more than one year in jail, is the most common level of crime. The second most common is Felony, the most serious category of offense, and the third is Violation, a lesser offense punishable by no more than 15 days.
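A minimal sketch of how that graph can be produced, assuming the LAW_CAT_CD column holds the level of offense:

# Count reports per level of offense and plot them
df_clean['LAW_CAT_CD'].value_counts().plot(kind='bar', title='Level of Crime', rot=0)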

2. Distribution of Crime Over Time

I also want to know about the trend of crime incidents that have been taking place in NYC.

The first graph shows crime events by year, then by month, and, finally, the distribution of crime within a day.
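The year, month, and time columns used in the plots below are not in the raw dataset; a plausible way to derive them, assuming CMPLNT_FR_DT and CMPLNT_FR_TM hold the complaint's start date (MM/DD/YYYY) and time (HH:MM:SS), would be:

dates = pd.to_datetime(df_clean['CMPLNT_FR_DT'], format='%m/%d/%Y', errors='coerce')  # parse the date strings
df_clean['year'] = dates.dt.year
df_clean['month'] = dates.dt.month
# Hour of the day at which the incident started
df_clean['time'] = pd.to_datetime(df_clean['CMPLNT_FR_TM'], format='%H:%M:%S', errors='coerce').dt.hour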

df_clean['year'].value_counts().sort_index().plot(kind="line", title="Total Crime Events by Year")

Overall, by looking at the number of offenses recorded by the NYC police, it’s possible to see that crime levels have been decreasing consistently since 2006.

df_clean.groupby('month').size().plot(kind='bar', title='Total Crime Events by Month', color='#C0392B', rot=0)

Crimes happened most often during July, August, and October, whereas February, November, and December appear to be safer. From that, one might suspect a positive correlation between temperature and crime.

df_clean.groupby('time').size().plot(kind='bar', title='Total Crime Events by Hour of Day', color='#E67E22', xlabel='hour', rot=0)

We can tell that the safest time of day, when a crime is least likely to happen, is 5 am, while crimes are most likely to happen between 12 pm and 6 pm.

3. Distribution of Crime in each borough

df_clean['BORO_NM'].value_counts().sort_values().plot(kind="barh", color = '#1ABC9C', title = 'Total of Crime Events by Borough')

According to this visualization, Brooklyn has the overall highest number of crime events, with over 2 million reports.

4. Analyzing a Specific Crime

I want to specifically analyze sex-related crimes in NYC. For that, I filtered the rows whose crime description contains ‘SEX CRIMES’ or ‘RAPE’ into another data frame, which I called “sex_crimes.”

sex_crimes = df_clean[df_clean.OFNS_DESC.str.contains('SEX CRIMES|RAPE')]
  • The new dataset contains 104,211 sex crime reports in NYC in total.

We may be interested in the distribution of values across the years, so I’m going to group the data by year and plot the results.
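A minimal sketch of that grouping, reusing the year column derived earlier:

sex_crimes.groupby('year').size().plot(kind='bar', title='Sex Crime Reports by Year', rot=0)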

  • Based on the bar graph, sex crimes were reported most often during the last three years of the dataset compared to previous years.
  • On average, there are 7443 victims of rape and sexual assault each year in New York City.

Let’s also look at how the number of reports changes within a day.
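A minimal sketch, reusing the hour-of-day time column derived earlier:

sex_crimes.groupby('time').size().plot(kind='bar', title='Sex Crime Reports by Hour of Day', rot=0)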

Here, we can tell that the safest time of day, when a sex crime is least likely to happen in NYC, is 6 am. However, people need to be more careful between midnight and 1 am.

Analyzing the victims

Sexual violence affects millions of people. The impact of sexual assault, domestic violence, dating violence, or stalking can be life-altering for survivors and their families. Therefore, I decided to make a brief analysis of the sex crime victims in New York.
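A minimal sketch of how the breakdowns below can be computed, assuming VIC_SEX, VIC_AGE_GROUP, and VIC_RACE hold the victim's sex, age group, and race:

for col in ['VIC_SEX', 'VIC_AGE_GROUP', 'VIC_RACE']:
    # Share of each category among sex crime victims, as a percentage
    print(sex_crimes[col].value_counts(normalize=True).mul(100).round(2))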

Victim sex:
FEMALE     83.20%
MALE       14.75%
UNKNOWN     1.93%

Victim age group:
<18      48.85%
25-44    22.43%
18-24    16.44%
45-64     4.69%
65+       0.53%

Victim race:
BLACK                             32.87%
WHITE HISPANIC                    28.59%
WHITE                             16.66%
UNKNOWN                           10.13%
ASIAN / PACIFIC ISLANDER           5.85%
BLACK HISPANIC                     5.60%
AMERICAN INDIAN/ALASKAN NATIVE     0.28%

Through this analysis of the victims, I have found the following insights:

  • 83% of all victims are female. Women are approximately four times more likely to be victims of rape, attempted rape, or sexual assault.
  • Ages under 18 are at the highest risk for rape/sexual assault.
  • Black and White Hispanic residents are twice as likely to experience a rape/sexual assault compared to any other race.

Conclusion

By using Pandas, I analyzed and visualized the open data of NYC Crime Incident Reports. This library proved to be a powerful tool in data analysis, and it’s a good way to start.
After this, I am intrigued to look into more data related to crime and try different libraries for data visualization.

Thanks for reading! Feel free to check out my GitHub for the full code.

Bruna M.
