
Demand Forecasting: Boston Crime Data

Alptekin Uzel
Towards Data Science


A model that predicts how likely a violent crime is to happen on a given day at a specific location.

Demand forecasting is a hot topic and a never-ending goal in retail, supply-chain management, and logistics. Decision makers need to optimize their resources based on the predicted demand, so it is directly linked with optimization problems as well.

Forecasting, in its common machine learning usage, deals with one-dimensional time series data. Predicting a stock price or the number of purchases over several years are the most common use cases. Classical models like ARIMA are among the oldest mathematical models in the domain:

Forecasting a clear trend. Seems easy, huh?

Using this well-known data set (the number of airline passengers over several years) to understand the basics of forecasting is of course helpful. But even at first glance, it is not hard to guess that real-life problems are rarely this easy.
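To make the contrast concrete, here is a minimal single-series forecasting sketch with statsmodels' ARIMA. It is only an illustration: the file name, column names and the (2, 1, 2) order are placeholder assumptions, not something taken from this article's data.

# minimal classical forecasting sketch (illustrative placeholders, not the article's data)
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# hypothetical local copy of the classic airline passengers data set
passengers = pd.read_csv("airline-passengers.csv",
                         index_col="Month", parse_dates=True)["Passengers"]

# fit a simple ARIMA model and forecast the next 12 months
model = ARIMA(passengers, order=(2, 1, 2)).fit()
print(model.forecast(steps=12))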

Introduction to the Boston Crime Data

Photo by Matt Popovich on Unsplash

Analyze Boston, the City of Boston's open data hub, publishes various city data sets. Its crime incident report is one of them: tabular data listing crime incidents since 2015. Here are the leading features (the data set has 16 columns in total):

  • Incident number: Unique identifier for the incident
  • Offense Code Group: Incidents are grouped with crime types
  • Reporting Area: The code for the reporting areas or zones.
  • Occurred On Date: Date of the incident
  • Lat: Latitude of the crime location
  • Long: Longitude of the crime location

Their most up-to-date data has more than 400K rows. They have also uploaded their data set to Kaggle to introduce it and challenge the data science community.

Analysis of the crime types and their trends can surely help us to understand the crime dynamics in the city. But what if we can go a little bit further than that?

Goal

The historical data set has a time and space dimension for different types of crimes in the city.

So the most exciting project that can be built is to predict crimes for neighborhoods before they actually happen!

No, not going in that direction! — Image by Geraldine Dukes from Pixabay

Now we need a frame to structure the problem. Just predicting the number of crimes in a neighborhood or generally in the whole city does not say much and is not useful. We need to predict whether or not rare crimes are going to happen in a specific region.

Also, problems of this kind are not studied often, and the Boston crime data gives us an opportunity to investigate a challenging one. Why? More on that in the Challenge section.

Problem Framing

Here I will follow the approach shared in the paper "Crime Prediction Using Data Analytics: the Case of the City of Boston" [1]. It contains the core of the solution. I will implement it in Python with some modifications to the features (less spatial engineering) and the model (XGBoost, for better accuracy). I will also share the code so that everyone can follow along and try to improve the predictions. We can summarize the approach as follows:

  • We need to predict the probability of some rare event (violent crimes in this case) happening on a specific date with a spatial feature. That way, the police force can use this to concentrate on specific parts of the city each day to prevent violent crimes or at least increase the chance of patrolling there.

And how does this framework shape the data solution? Without too much thought we can say that we are going to:

  • Aggregate historical crime data for each region/reporting area for each day.
  • Label each location and day combination according to whether a violent crime happened there on that day.
  • Predict a binary value with a probability. This is a classification problem, so logistic regression or decision tree based models can be used.

Since our data set is going to be highly imbalanced, some tuning for the machine learning model and the decision threshold is going to be necessary. Remember that we are predicting rare events.
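As a rough sanity check, once the aggregated table described below exists, the imbalance can be quantified in a couple of lines. This is only a sketch; it assumes the label ends up as an isviolentcrime column of the crime_stats_fordays_df frame built later in the article.

# sketch: quantify the class imbalance of the aggregated cluster & day table
n_pos = (crime_stats_fordays_df["isviolentcrime"] == 1).sum()
n_neg = (crime_stats_fordays_df["isviolentcrime"] == 0).sum()
print("positive rate:", n_pos / (n_pos + n_neg))   # only a few percent of rows are positive
print("negative/positive ratio:", n_neg / n_pos)   # a natural value for XGBoost's scale_pos_weight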

Photo by Kristopher Roller on Unsplash

Challenge — Why not use Time Series Forecasting?

Notice that we have translated a forecasting problem into a traditional machine learning classification. Why is that?

Difficulties start to arise when we need to forecast many series, since setting up and analyzing ARIMA-type models is not easy.

So do we need to build multiple models for that? What happens if we need to build a model for hundreds of time series data? Is that feasible at all?

Let's list the challenges of forecasting demand when that demand is spread across the city without clear boundaries, over a long period of time such as a year.

Thinking about the Taxi Demand Prediction problem is a good starting point, since we will follow a similar logic. In that case we have taxi trip data starting from a certain location at a certain date. Here is what makes forecasting demand there challenging:

  • Demand is distributed spatially and it is continuous; it has no natural centers. Think of several markets in a city: they have fixed locations, and you can build a forecast model to predict total sales for each of them. That is not the case here.
  • Even if you cluster demand, there may be far too many cluster centers to build a forecast model for each one.
  • You know that certain features (like weather, or calendar effects such as a holiday season) heavily affect demand, but adding such features to traditional time series models is not straightforward. Deep learning (LSTM models, for instance) does allow it.

And now, for the Crime Prediction, there is an additional challenge. We need to predict rare events like violent crimes:

  • Remember that traditional time series forecasting deals with numerical data. Here we need to predict a binary value: 0 or 1.

Method

Just start building it! Who needs a method? — Photo by Markus Spiske on Unsplash

Now let’s list what we are going to do to create a predictive model:

  • Divide the space to cluster demand for each spatial unit (grids or reporting areas)
  • Group crime types into violent and nonviolent crimes.
  • Aggregate historical crime data for each spatial unit for each day. The label for that row is going to be 1 or 0 depending on whether a violent crime has happened that day in that location.
  • Integrate historical weather data as a feature for each day in Boston.
  • Divide the data into a train and test set. Pick a cut-off date such that the data before it makes up around 70% of the set (train) and the data after it the remaining 30% (test).
  • Build a model which focuses on predicting positive labels, i.e. increasing the True Positive Rate, also known as sensitivity (more on that later).
  • Evaluate accuracy and sensitivity on the test set after choosing a decision threshold.

1- Cluster Demand

From this point on you can follow the code in the GitHub repo.

After a bit of data exploration we can start from the "Cluster Centers from Reporting Area" section of the notebook. The reporting area centers are calculated by averaging the latitude and longitude of the incidents in each area:

KeplerGL package used to visualize spatial data
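For reference, a minimal sketch of how such centers can be computed, assuming the Kaggle column names (REPORTING_AREA, Lat, Long) and a local copy of the raw CSV; the file path and encoding are assumptions:

import pandas as pd

# sketch: one (Lat, Long) center per reporting area, averaged over its incidents
crimes = pd.read_csv("./data/crime.csv", encoding="latin-1")
reporting_area_centers = (crimes.dropna(subset=["Lat", "Long"])
                                .groupby("REPORTING_AREA", as_index=False)[["Lat", "Long"]]
                                .mean())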

Obviously, there are some outliers. Using the city boundaries file here, we extract the Boston polygon and drop the reporting areas that fall outside it:

# load shape file for 500 cities
import geopandas as gpd
city_data = gpd.read_file("geodata/CityBoundaries.shp")

# some spatial processing here...

# select the reporting areas (points in geodf_ra) that are
# in the geodf_boston polygon, using the spatial join function
# in the geopandas package
geodf_ra_inboston = gpd.sjoin(geodf_ra, geodf_boston, how="inner", op="within")
Now they are all inside the Boston Polygon
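The elided "spatial processing" above essentially builds the two GeoDataFrames used in the join. A minimal sketch, assuming the lat/long values are WGS84 and that the 500-cities shapefile has a NAME field identifying the city (both assumptions):

import geopandas as gpd

# sketch: turn the reporting area centers into a GeoDataFrame of points (assumed WGS84)
geodf_ra = gpd.GeoDataFrame(
    reporting_area_centers,
    geometry=gpd.points_from_xy(reporting_area_centers["Long"], reporting_area_centers["Lat"]),
    crs="EPSG:4326")

# pick the Boston polygon out of the 500-cities file and match the CRS of the points
# (the NAME column and its "Boston" value are assumptions about the shapefile)
geodf_boston = city_data[city_data["NAME"] == "Boston"].to_crs("EPSG:4326")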

In the paper, 500 grids are used. Following the same logic, we can create 500 clusters so that we have fewer demand centers and a more evenly distributed demand surface.

from sklearn.cluster import KMeans

# create 500 centers with K-means
clusterer = KMeans(n_clusters=500, random_state=101).fit(reporting_area_centers[["Long", "Lat"]])
# get cluster assignments from our KMeans model
preds_1 = clusterer.predict(reporting_area_centers[["Long", "Lat"]])
# set our new column: cluster_no
reporting_area_centers["cluster_no"] = preds_1

Reporting areas are clustered into 500 centers.
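With the clusters in place, each incident can be tagged with a cluster_no, for example by joining on its reporting area. This is only a sketch; the repo may do this step differently:

# sketch: attach a cluster_no to every incident via its reporting area
crimes = crimes.merge(reporting_area_centers[["REPORTING_AREA", "cluster_no"]],
                      on="REPORTING_AREA", how="inner")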

2- Group Crime Types

Now we need to group the crime categories (OFFENSE_CODE_GROUP) so that some of them are labeled as violent crimes. I assigned the categories based on my own judgment of which ones count as violent crimes, but of course you can cross-reference the legal definitions.

# check labeled crime groups: violent, property and other
# the crimes are grouped manually into 3 groups: violent, property or other
data_1718_ocg=pd.read_csv("./data/data_1718_ocg_grouped.csv")

The ones that are labeled as violent crimes are:
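(The full list is shown as a table in the original article.) Mechanically, the grouping is just a manual mapping from OFFENSE_CODE_GROUP to one of three labels. A sketch with a few illustrative, non-exhaustive category names, not the exact mapping used in the repo:

# sketch: manual mapping from OFFENSE_CODE_GROUP to a coarse crime_group label
# the categories below are illustrative examples, not the full mapping from the repo
violent_groups = {"Homicide", "Aggravated Assault", "Robbery"}
property_groups = {"Larceny", "Auto Theft", "Residential Burglary"}

def crime_group(offense_code_group):
    if offense_code_group in violent_groups:
        return "violent"
    if offense_code_group in property_groups:
        return "property"
    return "other"

crimes["crime_group"] = crimes["OFFENSE_CODE_GROUP"].map(crime_group)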

3- Aggregate Historical Crime Data

At this point, in the notebook file, we have reached the “Preparing Data for ML: Aggregate data for each cluster&day combination” section.

We have cluster centers labeled with their latitude and longitude. Now we need predictive features!

This is the workhorse of the script: for each day, calculate the past crime statistics for all cluster centers. It is done for the past 120 days, 30 days, 7 days and 1 day.

While preparing the script I first did the calculation for a single day, then wrapped it in a for loop over all days.

# check one of the lists:
# here we calculated the sums of the different crime types for the date 2017-10-03
# they are aggregated per cluster
# this table shows the results for the past 120 days
working_120_results.head()

Here we can see the sums of the different crime types in different regions for a single day. They are calculated by counting the relevant crime types in that region over the previous 120 days.

After checking the sanity of the results we can move on to the larger calculation in the "Calculate the stats for each day for the last 365 days and the last day" section. Now we run a loop to aggregate the historical crime data for each cluster and day combination.
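A condensed sketch of what that loop can look like, assuming the crimes frame from earlier with its cluster_no and crime_group columns. The repo's version also attaches the cluster coordinates and the isviolentcrime label for the current day, which are omitted here for brevity:

import pandas as pd

# sketch: for each day and look-back window, count past crimes per cluster and crime group
crimes["theday"] = pd.to_datetime(crimes["OCCURRED_ON_DATE"]).dt.normalize()
days = sorted(pd.to_datetime(crimes["theday"].unique()))
windows = [120, 30, 7, 1]

daily_rows = []
for day in days:
    per_day = None
    for w in windows:
        mask = (crimes["theday"] < day) & (crimes["theday"] >= day - pd.Timedelta(days=w))
        counts = (crimes.loc[mask]
                        .groupby(["cluster_no", "crime_group"]).size()
                        .unstack(fill_value=0)          # columns: other / property / violent
                        .add_prefix("sum")
                        .add_suffix(f"crime{w}"))       # e.g. sumviolentcrime120
        per_day = counts if per_day is None else per_day.join(counts, how="outer")
    per_day = per_day.fillna(0).reset_index()
    per_day["theday"] = day
    daily_rows.append(per_day)

crime_stats_fordays_df = pd.concat(daily_rows, ignore_index=True)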

# check resulting dataframe. This is going to be our baseline for train&test data
crime_stats_fordays_df.info()
All the historical predictive features are aggregated
Peek into the resulting data frame

4- Integrate Weather Data

Again following the paper, we would like to integrate weather data into our training set as a predictive feature, since we think weather conditions might affect violent crime.

For this reason, we can use the Python Package wwo-hist which encapsulates the Weather API from World Weather Online. I have created a free trial account to gather the API key needed. You might try other options but this seems to be the fastest solution.

# use package: https://github.com/ekapope/WorldWeatherOnline
from wwo_hist import retrieve_hist_data

# daily frequency
frequency = 24
start_date = '1-JAN-2015'
end_date = '31-DEC-2018'
api_key = 'your-api-key-comes-here'
location_list = ['boston,ma']

# this one runs for all days from start_date to end_date
# and saves the results as csv to the current directory.
hist_weather_data = retrieve_hist_data(api_key,
                                       location_list,
                                       start_date,
                                       end_date,
                                       frequency,
                                       location_label=False,
                                       export_csv=True,
                                       store_df=True)

# some processing...

# include weather data in our aggregated data: rain, cloud cover and minimum temperature.
# you can include more features here
crime_stats_fordays_df = pd.merge(crime_stats_fordays_df,
                                  weather_data[["date_time", "precipMM", "cloudcover", "mintempC"]],
                                  left_on="theday", right_on="date_time")

We used cloud coverage, minimum temperature and precipitation amount as predictive features.

5- Modelling

First we split the data into train and test sets. Since this is a time series prediction, we choose a cut-off date such that the data prior to it corresponds to roughly 70 percent of the data.

Then we decide on the feature columns:


# feature set
x_columns = ['Lat', 'Long',
             'sumviolentcrime120', 'sumpropertycrime120', 'sumothercrime120',
             'sumviolentcrime30', 'sumpropertycrime30', 'sumothercrime30',
             'sumviolentcrime7', 'sumpropertycrime7', 'sumothercrime7',
             'sumviolentcrime1', 'sumpropertycrime1', 'sumothercrime1',
             'precipMM', 'cloudcover', 'mintempC']
# outcome
y_column = ["isviolentcrime"]

Notice that unlike the paper we have been following, we use Longitude and Latitude to identify the location. This is because if you instead use something like income data for each grid, you lose the "locality-sensitive" property of the spatial encoding. Put simply, we need spatial features whose encodings are close to each other when the locations are close in two-dimensional space. We lose that property with income data alone, since far-away neighborhoods might have the same average income.

Another consequence of this choice is that we need to go with tree-based models, since raw latitude and longitude do not carry useful information in linear models such as Logistic Regression.

I continued with the XGBoost classifier for the model. As we said at the beginning, this is a classification problem: we have created aggregated historical features for each cluster center and day, and now we are going to predict whether a violent crime happened on that day based on those historical features and the weather data.

I skip the tuning of the model so as not to make the article too long, but it can be found in the "Tune Parameters" section of the Jupyter Notebook.
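For completeness, a rough sketch of what such a cross-validated search can look like; the grid values below are illustrative assumptions, not the notebook's exact ones:

# sketch of a cross-validated hyperparameter search (illustrative grid, not the repo's)
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

param_grid = {"max_depth": [3, 5, 7],
              "min_child_weight": [1, 5, 10],
              "gamma": [0, 1, 5],
              "subsample": [0.6, 0.8, 1.0],
              "colsample_bytree": [0.6, 0.8, 1.0]}

search = RandomizedSearchCV(
    xgb.XGBClassifier(objective="binary:logistic", scale_pos_weight=83, random_state=101),
    param_distributions=param_grid,
    n_iter=20, scoring="roc_auc", cv=3, random_state=101)
search.fit(X_train[x_columns], y_train.values.ravel())
print(search.best_params_)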

# tuned parameters: in the next section of the notebook we run a hyperparameter search
# with cross validation; these parameters come from that tuning session,
# so you can just continue with them.
# you can switch back and forth between the baseline model and the tuned model
# to see the change in accuracy and the other metrics
import xgboost as xgb

xgb_model = xgb.XGBClassifier(base_score=0.5, booster='gbtree',
                              colsample_bylevel=1, colsample_bynode=1,
                              colsample_bytree=0.8, learning_rate=0.1,
                              max_delta_step=0, missing=None,
                              n_estimators=100, n_jobs=1, nthread=None,
                              scale_pos_weight=83,
                              objective='binary:logistic', random_state=101,
                              subsample=0.6, min_child_weight=10,
                              max_depth=3, gamma=1)
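The notebook then fits the model and pulls out the positive-class probabilities that the threshold selection below works with. Roughly (assuming X_train holds the feature columns listed above and y_train the isviolentcrime label):

# fit on the training split and get the probability of the positive class
xgb_model.fit(X_train[x_columns], y_train.values.ravel())
y_pred_prob_pos = xgb_model.predict_proba(X_train[x_columns])[:, 1]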

6- Threshold Selection

Now we have probabilities for labeling the “isviolentcrime” target either 1 or 0. But how should we select the decision threshold to label rows (region & day combinations) as 1 or 0?

The assumption, taken from the paper, is as follows: suppose the police force can patrol 30 percent of the city each day. This finalizes the framing of our problem: we need a model that predicts positive labels for 30 percent of the rows, so that on each day the police get roughly 30 percent of the regions flagged as positive.

import numpy as np

# 30 percent of the training labels: 38250.0
X_train.shape[0] * 0.3
# tune the threshold so that about 30 percent of all labels are predicted as positive
# we found we can use 0.48 as our threshold
# (38819,)
y_pred_prob_pos[y_pred_prob_pos > 0.48].shape
# predict 1 if the probability is greater than 0.48
xgb_predictions = np.where(y_pred_prob_pos > 0.48, 1, 0)
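Instead of eyeballing a value like 0.48, the cut-off can also be computed directly as the 70th percentile of the predicted probabilities, so that roughly 30 percent of the rows land above it:

# equivalent, without manual trial and error: take the 70th percentile as the threshold
threshold = np.quantile(y_pred_prob_pos, 0.70)
xgb_predictions = np.where(y_pred_prob_pos > threshold, 1, 0)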

7- Evaluation

Let’s check the metrics for train data:

from sklearn.metrics import accuracy_score, confusion_matrix
# my accuracy score: 0.7031058823529411
accuracy_score(y_train, xgb_predictions)
train_confusion_matrix = confusion_matrix(y_train, xgb_predictions)
pd.DataFrame.from_dict({"Negatives": train_confusion_matrix[0],
                        "Positives": train_confusion_matrix[1]},
                       orient='index', columns=["Negative Pred.", "Positive Pred."])

Notice that out of 1523 positive cases our model predicted 1244 correctly, giving a True Positive Rate of 1244/1523 = 0.82.

True Positive Rate formula (from Wikipedia): TPR = TP / (TP + FN)
We can see overall evaluation metrics. Notice Recall is 0.82 for label 1.
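A table like that can be reproduced with scikit-learn, for example:

from sklearn.metrics import classification_report

# per-class precision, recall and F1 on the training split
print(classification_report(y_train, xgb_predictions, digits=2))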

Now for the Test set:

Our True Positive Rate (Recall) decreases to 0.73.

It is calculated as 524 / (191 + 524).

Also, we set the threshold so that 30 percent of the training labels are predicted as positive. We used the same threshold on the test set, and if you check:

# our predictor predicts 28 percent positive labels
# so the police need to patrol 28 percent of the areas
xgb_predictions_test.sum() / y_test.shape[0]

In the test set there are even fewer positive predicted labels, which is even better.

Summary and Usage

So what have we done here? And most importantly, how can this method be used?

  • First, we clustered the crime reporting areas into 500 demand centers that are smoothly distributed across the city of Boston.
  • We labeled some crime categories as violent crimes.
  • Then we used historical crime data to create a training data set. For each cluster center and day we aggregated past crime data to use as predictive features: the total number of violent crimes in that region over the past 120 days, 30 days, 7 days, and so on, and likewise for property crimes and other crimes.
  • We integrated historical weather data as a predictive feature.
  • We built a machine learning model to predict whether a violent crime happened on a given day in a given region.
Our training data: some features and the target variable, "isviolentcrime"
  • We selected a threshold for our classification model such that 30 percent of the predictions are positive.

This means that if historical crime data is aggregated accordingly for each spatial unit (in our case, 500 demand centers distributed across the city), the police can use the model to predict violent crimes. Each day they can run the model and get around 30 percent of the cluster centers, roughly 150 locations out of 500, flagged as positive for violent crime. They can even rank the flagged areas by their predicted probabilities and prioritize the riskiest ones.
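As a sketch of that daily usage (today_df is a hypothetical frame holding today's aggregated features for the 500 cluster centers, with the same x_columns as above):

# sketch: a daily patrol list ranked by the predicted risk of violent crime
today_df["risk"] = xgb_model.predict_proba(today_df[x_columns])[:, 1]
patrol_list = (today_df[today_df["risk"] > 0.48]          # the threshold chosen earlier
               .sort_values("risk", ascending=False)
               [["cluster_no", "Lat", "Long", "risk"]])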

Photo by Bill Oxford on Unsplash

With this model performance, the police can expect to capture about 70 percent of violent crimes by patrolling just 30 percent of the area. If you analyze the historical crime data you can see that there are around seven violent crimes daily, so with our True Positive Rate the police can expect to be present at around five of them.

Conclusion

The problem of predicting violent crimes across the city of Boston was particularly challenging. It differs from the usual demand forecasting problems:

  • The usual demand forecasting models have fixed demand and supply centers.
  • Demand is a continuous feature like total sales or total requests.

In our case:

  • Demand is distributed in the city continuously. Clustering is needed.
  • Demand is a categorical feature. The problem is whether violent crime happened there or not.
  • And as a binary classification problem, our data set is highly imbalanced. Positive labels are only around 2 percent. This adds an extra challenge to the modelling and tuning.

Future Work

We know that there are 12 police districts in the city of Boston, so we can try to find the optimal locations for 12 supply centers to capture the violent crime demand.

This corresponds to the facility location problem, and I have actually already solved it using Python optimization packages. My next article will introduce those packages and implement them to solve a linear optimization problem.

Source Code and Notes

I shared my Python code and the prepared data in my GitHub repo.

References

[1] G. Martegiani, L. Berrada, Crime Prediction Using Data Analytics: the Case of the City of Boston (2016)

[2] A. Jain, Boston Crime Data / Kaggle Data Set, (2018)

[3] Analyze Boston, Crimes in Boston / Kaggle Dataset, (2018)

[4] G. Sharma, Taxi Demand Prediction — New York City — Good Audience (2018)
