Visualizing Houston violent crime trends via data collection and cleaning

Downtown Houston at dusk. Photo courtesy of Pixabay.

Violent crime is one of the symptoms of societal illness. It occurs in large part because a society and its government fail to provide the components essential to human flourishing: food, shelter, education, job training and access to physical and mental healthcare. At the same time, heinous and petty crimes are committed by individuals well endowed with these essentials. In this article, I will illustrate fundamental data science methods in Python using violent crime statistics as an example. In particular, I will be looking at violent crime data in my hometown of Houston, Texas.

The goal of this article is to present three steps of the data science workflow:

  1. Data collection: web scraping to download files into a local directory
  2. Data cleaning: extracting and storing only the desired values
  3. Data visualization: plotting data to reveal trends

Disclaimer: I will not be going through the code required to perform the above three steps line by line. My hope is that the code blocks below will help you understand: (1) how to perform web scraping for data files, (2) how to read Excel files and convert them into Pandas dataframes, (3) how to perform standard data cleaning operations in Pandas and (4) how to plot data stored in Pandas dataframes with the matplotlib library.

I. Data collection with web scraping and browser automation

To start, we go to the Houston Police Department crime statistics web page.

Houston Police Department monthly crime statistics web page for years 2009 to 2018.

Monthly crime statistics are available in both Access and Excel formats. I will be working with Excel files. When you click on an Excel link, an Excel file containing the statistics for that month is downloaded to your local machine. Below is an example of the first few columns of the June 2009 data.

June 2009 Houston crime data

Before moving on to the data science, let’s take a moment to reflect on all the suffering caused by each one of the incidents above, and another moment to reflect on the countless such incidents that go unreported.

Note that some of the rows have dates lying outside of June 2009. This could be due to earlier incidents being reported at later dates. We will have to handle these cases in our data cleaning and processing step. To begin collecting data, we observe that all Excel links have URLs that, when visited, initiate a download of the associated file. So first, we need to collect all the Excel link URLs on the webpage above. We will do this using the requests and bs4 (a.k.a. BeautifulSoup) libraries. The Excel links can be obtained using the following function.

Function to scrape all URLs of Excel files on monthly crime statistics webpage
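Here is a minimal sketch of such a function, assuming the page’s Excel links end in .xls or .xlsx and that relative links must be joined onto the base URL (the name get_excel_links is mine):

```python
import requests
from bs4 import BeautifulSoup

def get_excel_links(home_url, base_url):
    """Scrape the URLs of all Excel files linked on the crime stats page."""
    response = requests.get(home_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    excel_links = []
    for a in soup.find_all('a', href=True):
        href = a['href']
        if href.lower().endswith(('.xls', '.xlsx')):
            # Relative links need the base URL prepended
            if not href.startswith('http'):
                href = base_url + href
            excel_links.append(href)
    return excel_links
```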

The function takes two URLs as input: a home URL and a base URL. Since these two URLs are fixed, they can be defined in a configuration file called config.py. Here are their definitions:

Fixed URLs for web scraping step
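For example (these values are illustrative; check the HPD site for the current addresses):

```python
# config.py -- fixed URLs for the web scraping step
# (illustrative values; check the HPD site for the current addresses)
HOME_URL = 'https://www.houstontx.gov/police/cs/stats2.htm'
BASE_URL = 'https://www.houstontx.gov/police/cs/'
```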

With the links in hand, we need to write browser automation code to visit each link and download the associated file into a directory of our choice. One obstacle with this particular data set is that the file names and extensions come in two different formats. In addition, the files with longer names containing the string NIBRS_Public_Data_Group_A&B have a more complicated structure and result in gobbledygook when imported with the Pandas library. Since these files comprise only the last 6 months out of the 114 months in the data set, we will ignore them for now. To properly ignore them, we save them in a sub-directory called messy. Here is a function for downloading the files from the URL links into local directories.

Function for downloading Excel files from HPD crime statistics webpages
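A sketch of such a function, assuming the source files are named like jun09.xls so that m_dict (defined below) can normalize them to 06-2009.xls; the NIBRS files are routed to the messy sub-directory:

```python
import os
import requests

def download_excel_files(excel_links, data_dir, m_dict):
    """Download each Excel file, parking the messy NIBRS files in a
    'messy' sub-directory and normalizing the remaining file names."""
    messy_dir = os.path.join(data_dir, 'messy')
    os.makedirs(messy_dir, exist_ok=True)  # also creates data_dir

    for url in excel_links:
        file_name = url.split('/')[-1]
        if 'NIBRS_Public_Data_Group_A&B' in file_name:
            # Messy files: save them untouched, to be ignored later
            out_path = os.path.join(messy_dir, file_name)
        else:
            # e.g. 'jun09.xls' -> '06-2009.xls' (assumed source format)
            stem, ext = os.path.splitext(file_name)
            month, year = stem[:3].lower(), stem[3:]
            if month in m_dict and year.isdigit():
                file_name = m_dict[month] + '-20' + year + ext
            out_path = os.path.join(data_dir, file_name)
        response = requests.get(url)
        response.raise_for_status()
        with open(out_path, 'wb') as f:
            f.write(response.content)
```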

where data_dir is the path to the download directory and the dictionary m_dict is used to convert between 3-character alphabetic and 2-character numeric string representations of the months.
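Written out in full:

```python
# Map 3-character month abbreviations to 2-character numeric strings
m_dict = {'jan': '01', 'feb': '02', 'mar': '03', 'apr': '04',
          'may': '05', 'jun': '06', 'jul': '07', 'aug': '08',
          'sep': '09', 'oct': '10', 'nov': '11', 'dec': '12'}
```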

Ignoring the messy files, all the others are in the format mm-yyyy.xlsx. I want to change these to the format yyyy-mm.xlsx so that the file system will list them in chronological order when sorting alphanumerically. The following function will do the trick.
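A sketch, assuming every non-messy file in data_dir follows the mm-yyyy naming (the function name is mine):

```python
import os

def rename_chronologically(data_dir):
    """Rename files from 'mm-yyyy.xlsx' to 'yyyy-mm.xlsx' so that an
    alphanumeric sort lists them in chronological order."""
    for file_name in os.listdir(data_dir):
        stem, ext = os.path.splitext(file_name)
        if ext in ('.xls', '.xlsx') and '-' in stem:
            mm, yyyy = stem.split('-')
            new_name = yyyy + '-' + mm + ext
            os.rename(os.path.join(data_dir, file_name),
                      os.path.join(data_dir, new_name))
```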

Now the non-messy files are listed chronologically and the messy files are stored in a sub-directory.

Our data collection step is now complete. To organize the code, I placed all the functions shown above into a file called helper_funcs.py and imported the requests, os and bs4 libraries. Then, I created another file named data_collection.py to handle this data collection step. The entire contents of this file are provided below.

Full code for web scraping and file renaming (a.k.a. data collection)
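A sketch of that file, using the helper function names assumed above:

```python
# data_collection.py -- web scraping and file renaming
from config import HOME_URL, BASE_URL
from helper_funcs import (get_excel_links, download_excel_files,
                          rename_chronologically, m_dict)

data_dir = 'data'  # hypothetical download directory

excel_links = get_excel_links(HOME_URL, BASE_URL)
download_excel_files(excel_links, data_dir, m_dict)
rename_chronologically(data_dir)
```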

II. Data cleaning with Pandas

Now we’re ready to clean the data and extract a subset of the information. We would like to import a single spreadsheet into a Pandas data frame using the read_excel function. When I tried this directly, I ran into the following error for a significant subset of files:

WARNING file size (3085901) not 512 + multiple of sector size (512)
WARNING OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero

To bypass these errors, I found the following hack here, which requires importing the xlrd library.
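The gist of the workaround is to open the workbook with xlrd directly, routing its warnings to os.devnull, and hand the resulting Book object to Pandas (pandas versions of this era accept an xlrd.Book as input to read_excel):

```python
import os
import pandas as pd
import xlrd

def read_crime_excel(file_path):
    """Read a possibly malformed Excel file into a Pandas dataframe,
    suppressing xlrd's OLE2/sector-size warnings."""
    book = xlrd.open_workbook(file_path, logfile=open(os.devnull, 'w'))
    return pd.read_excel(book, engine='xlrd')
```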

Once we have the spreadsheet converted into a Pandas dataframe, we need to drop all the columns of extraneous information. In this case, we are only interested in the date of occurrence, the offense type and the number of offenses. We are also only going to keep track of violent offenses, so we will be dropping the rows associated with non-violent incidents. The crimes considered violent and non-violent are listed below
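The violent categories are the four analyzed later in this article; the non-violent ones below are my assumption, based on the remaining Part I categories HPD reports:

```python
# Offense categories in the HPD data
VIOLENT = ['Murder', 'Rape', 'Robbery', 'Aggravated Assault']
NON_VIOLENT = ['Burglary', 'Theft', 'Auto Theft']  # assumed
```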

where we will later rename ‘Aggravated Assault’ to ‘Assault’. Here is the list of all column names common to many of the files at the beginning of the data set (around 2009):
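An illustrative version of that column list, based on the public HPD format (the exact names vary from file to file):

```python
# Columns common to many of the early (circa 2009) files -- illustrative
columns = ['Date', 'Hour', 'Offense Type', 'Beat', 'Premise',
           'Block Range', 'Street Name', 'Type', 'Suffix',
           '# Of Offenses']
```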

The subset of columns we are interested in is [‘Date’, ‘Offense Type’, ‘# Of Offenses’]. For simplicity, we will rename the column ‘Offense Type’ to ‘Offense’ and ‘# Of Offenses’ to ‘#’. However, in going through the data I noticed that this last column is not labelled ‘# Of Offenses’ in every file. In fact, here is the exhaustive list of all alternative names for this column:

To handle these irregularities, we will look for any occurrence of the above items and rename them to ‘#’. With this issue handled, we can now drop the extraneous columns. Taking the June 2009 data as an example, here is a print out of the Pandas dataframe with dropped columns

The following function will take a Pandas dataframe created from an Excel file and produce as output a three-column dataframe like the one above.
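A sketch of such a function; ALT_COUNT_NAMES stands in for the exhaustive list of alternative column names, with hypothetical entries:

```python
def rename_and_drop_columns(df):
    """Reduce a raw monthly dataframe to the columns
    ['Date', 'Offense', '#'], renaming as needed."""
    # Stand-in for the exhaustive list of alternative names found
    # for the offense-count column (hypothetical entries)
    ALT_COUNT_NAMES = ['# Of Offenses', '# Offenses', '# Of']
    rename_map = {name: '#' for name in ALT_COUNT_NAMES}
    rename_map['Offense Type'] = 'Offense'
    df.rename(columns=rename_map, inplace=True)

    # Drop every column except the three we want to keep
    keep = ['Date', 'Offense', '#']
    df.drop(columns=[c for c in df.columns if c not in keep], inplace=True)

    # Row entry renaming rather than column dropping
    df.replace({'Offense': {'Aggravated Assault': 'Assault'}}, inplace=True)
```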

Note the usage of the inplace=True argument. This is necessary when passing dataframes to functions because it tells Pandas to update the original dataframe instead of creating a modified copy and leaving the original unchanged. The last line performs row entry renaming rather than column dropping.

The two remaining steps in the data cleaning operation are: (1) dropping extraneous rows and (2) summing up all offenses occurring in a given month. For dropping extraneous rows, we use the following code to drop rows associated with non-violent offenses and rows with null entries.
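A sketch, assuming ‘Aggravated Assault’ has already been renamed to ‘Assault’ in the previous step:

```python
def drop_extraneous_rows(df):
    """Drop rows for non-violent offenses and rows with null entries."""
    violent = ['Murder', 'Rape', 'Robbery', 'Assault']
    df.drop(df[~df['Offense'].isin(violent)].index, inplace=True)
    df.dropna(inplace=True)
```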

Now we need to total up the violent offenses for each month. Before doing that, there are a few things that need to be done.

First, for some of the data files, the strings associated with the offense type contain extra white space, which needs to be cleaned up. The following code will do the trick.
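One way to do it is to split each offense string on whitespace and re-join with single spaces, which collapses leading, trailing and internal runs of spaces:

```python
# e.g. 'Aggravated  Assault ' -> 'Aggravated Assault'
df['Offense'] = df['Offense'].str.split().str.join(' ')
```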

Second, I want to reformat how the incident dates are stored, both because I do not want to work with Pandas’ Timestamp objects and because it gives me a chance to demonstrate how to split data into multiple columns. The code below has two functions: the first converts the dates into a numeric format, while the second splits this numeric information into Year and Month columns. The second function also drops rows that have missing Date values.
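A sketch of the two functions, assuming a numeric yyyymm representation (my choice; the exact numeric format is not essential):

```python
import pandas as pd

def convert_dates(df):
    """Convert the Date column from Timestamp objects to numeric
    yyyymm values; unparseable dates become nulls."""
    dates = pd.to_datetime(df['Date'], errors='coerce')
    df['Date'] = dates.dt.year * 100 + dates.dt.month

def split_dates(df):
    """Split the numeric yyyymm dates into Year and Month columns,
    dropping rows with missing Date values."""
    df.dropna(subset=['Date'], inplace=True)
    df['Year'] = (df['Date'] // 100).astype(int)
    df['Month'] = (df['Date'] % 100).astype(int)
    df.drop(columns=['Date'], inplace=True)
```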

The very last step of our data cleaning operation involves summing up all the offenses of a given type and storing them in an output dataframe. The function append_monthly_sum below achieves this:
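A sketch; since a few rows carry dates from earlier months, I take the most common Year/Month in the file as the month label (an assumption about how the stray dates are handled):

```python
import pandas as pd

def append_monthly_sum(df, df_out):
    """Sum the counts of each offense type for one month and append
    the totals as a new row of the output dataframe."""
    year = int(df['Year'].mode()[0])
    month = int(df['Month'].mode()[0])
    totals = df.groupby('Offense')['#'].sum().to_dict()
    row = pd.DataFrame([{'Year': year, 'Month': month, **totals}])
    return pd.concat([df_out, row], ignore_index=True)
```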

Putting it all together, we create a file called data_cleaning.py to carry out the above steps. First we import libraries and functions, then we create an empty Pandas dataframe and loop over all the data files in the directory.

Inside the loop, for each file in the directory (each corresponding to one month of crime statistics), we call the functions defined above.

Once all the months have been appended to the output dataframe, we save it to a pickle file using the to_pickle command. We have now successfully cleaned the data.
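A sketch of the whole script, using the function names assumed above (the pickle file name is mine):

```python
# data_cleaning.py -- clean every monthly file and save the totals
import os
import pandas as pd
from helper_funcs import (read_crime_excel, rename_and_drop_columns,
                          drop_extraneous_rows, convert_dates,
                          split_dates, append_monthly_sum)

data_dir = 'data'  # hypothetical download directory
df_out = pd.DataFrame()

for file_name in sorted(os.listdir(data_dir)):
    if not file_name.endswith(('.xls', '.xlsx')):
        continue  # skips the 'messy' sub-directory and stray files
    df = read_crime_excel(os.path.join(data_dir, file_name))
    rename_and_drop_columns(df)
    # Normalize whitespace before matching offense names
    df['Offense'] = df['Offense'].str.split().str.join(' ')
    drop_extraneous_rows(df)
    convert_dates(df)
    split_dates(df)
    df_out = append_monthly_sum(df, df_out)

# Save the cleaned dataframe for the visualization step
df_out.to_pickle('violent_crime.pkl')
```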

III. Data visualization with matplotlib

With the data now cleaned, let’s plot the monthly reported violent crime incidents over the nine-year data period to visually capture the time series dynamics. We will be plotting the data using the matplotlib library. Of course, there are better open-source plotting libraries out there. In particular, I find the plotly library to have a much wider range of functionality, simpler syntax and a more beautiful natural presentation. However, at the time of writing, I used matplotlib.

Before jumping into the final version of the code for this data visualization step, let’s look at a plot of the monthly data for assaults.

Reported monthly counts of assault in Houston, TX from June 2009 to May 2018

We can see that there is a lot of fluctuation in the data from month to month. It would be easier to identify overall trends in the data using a moving average to smooth out these fluctuations. To see both the raw data and the smoothed out trend, I plot both the monthly data and the moving average.

Here is the function for computing a moving average of an array called data with n_w windows of size w_size.
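A minimal sketch, where the number of windows n_w is determined by the array length and the window size:

```python
import numpy as np

def moving_average(data, w_size):
    """Return the moving average of data over n_w windows of size
    w_size, where n_w = len(data) - w_size + 1."""
    n_w = len(data) - w_size + 1
    return np.array([np.mean(data[i:i + w_size]) for i in range(n_w)])
```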

And here is the function for plotting the monthly and moving average data, for a given offense type:
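A sketch, assuming df_out holds one row per month in chronological order with a column per offense type; the 12-month window, colors and file naming are my choices:

```python
import os
import matplotlib.pyplot as plt

def plot_offense(df_out, offense, w_size=12, save_dir='plots'):
    """Plot the monthly counts and their moving average for one
    offense type and save the figure to disk."""
    os.makedirs(save_dir, exist_ok=True)
    monthly = df_out[offense].values
    smoothed = moving_average(monthly, w_size)

    fig, ax = plt.subplots()
    ax.plot(range(len(monthly)), monthly, color='lightgray')
    # Center each averaged point over the months in its window
    offset = (w_size - 1) / 2
    ax.plot([i + offset for i in range(len(smoothed))], smoothed,
            color='red')
    ax.set_xlabel('Months after June 2009')
    ax.set_ylabel('Reported incidents per month')
    ax.set_title(offense)
    fig.savefig(os.path.join(save_dir, offense + '.png'))
    plt.close(fig)
```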

We put everything together in a file called data_visualization.py, which simply loops over all violent crime offense types and applies this function.
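A sketch of the script, assuming the pickle file and function names from the previous steps:

```python
# data_visualization.py -- plot each violent crime offense type
import pandas as pd
from helper_funcs import plot_offense

df_out = pd.read_pickle('violent_crime.pkl')
for offense in ['Assault', 'Murder', 'Rape', 'Robbery']:
    plot_offense(df_out, offense)
```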

With a little extra finagling in Photopea, a free online photo editor, to insert a legend, here are the final plots for each violent crime offense type.

The whole point of data visualization is to obtain intuition about the entire time series data set. Indeed, the plots above help us identify broad trends in the occurrence of each offense type. The stories the above plots tell are:

(1) Assaults decreased by about a third from their initial level until a period of rapid increase beginning roughly 80 months after June 2009.

(2) Murders stayed roughly constant over the entire nine-year period, though there was a decrease over the first 60 months followed by an increase.

(3) Rape cases stayed roughly constant for the first 70 months. Afterward, the number of incidents per month increased dramatically and ended up at almost double its initial value by May 2018. This trend is alarming and its origins need to be investigated.

(4) Robberies almost halved in the first two years. Afterward, they rebounded and settled at rates slightly less than their initial monthly values.

