Human Rights First: A Data Science Approach

Using Natural Language Processing to Find Instances of Police Brutality In The United States

Daniel Benson
Towards Data Science

--

The Journey

Human Rights First is an independent organization that pushes for human rights and equality within the United States through political advocacy, creative policy development, various campaigns, data gathering and research, and mass education. These human rights issues are especially important in today’s world, where inequality and injustice continue to run rampant. You can find more information and ways to help at the Human Rights First website.

Our team, consisting of one project lead, five web developers, and three data scientists, worked on one very specific subdomain of the Human Rights First organization’s mission: police brutality. The project lead met weekly with a stakeholder from the organization to gain insight into the project’s expectations. We asked questions such as, “What is the user expected to see upon loading the landing page?”, “What interactivity should the user be given?”, and “What kind of data is expected, and is precision more important than the number of instances?” The problem was laid out before us: create a website where a user is met with an interactive choropleth map of the United States offering visual insight into which states and cities had incidents of police brutality, as well as textual insight through linked articles, videos, and other sources, with a heavy focus on data precision.

Each of the teams inherited a GitHub repository for its respective field, created by a previous team working on the same project. The Data Science team received a repo that included IPython notebooks, app directories and files, a baseline predictive model, and .csv files containing pre-collected data. Our approach began by determining the usefulness of this repository’s contents and planning what our own contributions would be.

Every Journey Begins At The Pathway’s End

Our team began the planning process immediately, spending a week meeting as a full team at least once a day and within our specific field teams several times a day. In our brainstorming sessions we came up with a list of tasks deemed important to the final product. This was followed by writing user stories, which gave us valuable insight into the envisioned final product through a user’s eyes. Using a Trello board and a mix of creative and logical thought, we settled on a final list of user stories and a sublist of tasks for each. We tackled stories such as, “As a user, I can see immediately on the landing page a map of the US with informational data on how many police brutality incidents occur in different states” and “As a user, I can zoom in to the map enough to view individual incident data,” with tasks labeled and sorted by which team(s) would need to contribute.

A Trello card outlining the broadest user story and its associated tasks.

During the team’s planning stage we worked through a number of brainstorming sessions to hash out sub-lists for each task outlining the technical possibilities for completing them. This included the models, libraries, methods, and programming environments we felt we might need as tools. We implemented these lists into a full-product flowchart, combining each field (DS, Backend, Frontend) through their anticipated connections.

Product engineering architecture mapping out all the anticipated connections
The Data Science technical decisions outline for the Human Rights First project

When first approaching the development phase of this project, I anticipated two major challenges. The first was using Natural Language Processing as part of our modeling process, as none of us had worked with NLP libraries for several months. We handled this risk through exploration, research, and by building on the previous team’s use of the same methods, which let us examine what we had inherited and modify it as needed to fit our own modeling needs.

Below is a snippet of code used to tokenize our data.

# Create tokenizer functions
import re

import spacy

# "en" was the spaCy 2.x shortcut for the small English model;
# newer spaCy versions use the full name "en_core_web_sm"
nlp = spacy.load("en")


def tokenize(text):
    """Lemmatize text, dropping stop words, punctuation, and whitespace tokens."""
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc
              if not token.is_stop and not token.is_punct and token.text != " "]
    return tokens


def retoken(text):
    """Strip non-alphanumeric characters, lowercase, and return the word list as a string."""
    tokens = re.sub(r'[^a-zA-Z ^0-9]', '', text)
    tokens = tokens.lower().split()
    tokens = str(tokens)
    return tokens
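
As a quick, purely illustrative check of what these functions return (the sentence below is made up, and the exact token output can vary by spaCy version):

# Illustrative usage of the tokenizer functions above
sample = "Officers fired tear gas at protesters near the courthouse."

print(tokenize(sample))
# lemmatized tokens with stop words and punctuation removed,
# e.g. ['officer', 'fire', 'tear', 'gas', 'protester', ...]

print(retoken(sample))
# the lowercased word list rendered back as a string,
# e.g. "['officers', 'fired', 'tear', 'gas', 'at', ...]"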

The second challenge involved the Data Science team as a whole. We needed to collaborate frequently and effectively so that we were all on the same page regarding the modeling research and development process, the time management of data collection, and proper, useful data cleaning; this mattered because each of these tasks depends heavily on the success of the one before it. To that end we worked closely five days a week, upwards of eight hours a day, to keep communication accurate. Whenever coding was being done we would meet as a team and pair program, rotating who was driving and who was navigating round-robin style. In this way we avoided the pitfalls that can come from a lack of teamwork and communication.

The Journey Is As Much The Process And Struggle As It Is The End Result

The Data Science side of the project included a number of time-consuming steps. We began by exploring the notebooks we inherited from the previous team, recreating and modifying their work to build a thorough understanding. We asked questions such as, “How did they go about cleaning the data?”, “What features did they feel were important?”, “Why these features?”, “What parameters did they use to create their model?”, and “How accurate is their model?” Using these questions as a layout for our exploration, we created new Google Colab notebooks and recreated the inherited notebooks one by one, putting together tests and making modifications as needed. This process included using the Reddit API wrapper PRAW to pull news articles and Reddit posts from “news” subreddits, working with pre-collected data from Reddit, Twitter, and various other internet and news sources, and cleaning the data and performing some feature engineering as needed.

Below is the code we used to access the Reddit API and pull the 1,000 hottest submissions from the “news” subreddit; these were then appended to a list called “data” and used to create a new dataframe:

# Grabbing the 1000 hottest posts on Reddit
import pandas as pd
import praw

# Authenticated Reddit instance (credentials are placeholders)
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="hrf-police-brutality-scraper")

data = []

# Grab the data from the "news" subreddit
for submission in reddit.subreddit("news").hot(limit=1000):
    data.append([submission.id, submission.title, submission.score,
                 submission.subreddit, submission.url, submission.num_comments,
                 submission.selftext, submission.created])

# Create and assign column names
col_names = ['id', 'title', 'score', 'subreddit', 'url',
             'num_comments', 'text', 'created']
df_reddit = pd.DataFrame(data, columns=col_names)
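
A quick sanity check on the resulting dataframe might look something like this (the output file name here is a placeholder, not a path from the project):

# Inspect the shape and a few rows of the pulled submissions
print(df_reddit.shape)
print(df_reddit[["title", "url", "num_comments"]].head())

# Persist the pull so it can be combined with the pre-collected data later
df_reddit.to_csv("reddit_hot_news.csv", index=False)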

Next, we decided to recycle the previous team’s data collection, cleaning, and feature engineering, while modifying their Natural Language Processing model to include a number of tags the previous team had left out. We followed this up by putting together our baseline predictive model using TfidfVectorizer and a RandomForestClassifier, with a RandomizedSearchCV for early parameter tuning. Using these methods we were able to create a csv file we felt comfortable sending over to the web team for use in their baseline choropleth map. The code used to build our model can be found in the embedding below.

# Build model pipeline using RFC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', RandomForestClassifier(random_state=42, n_jobs=-1,
                                          max_depth=5, n_estimators=45)),
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

# Randomized search over tree depth for early parameter tuning
param_distributions = {
    'classifier__max_depth': [1, 2, 3, 4, 5]}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring='accuracy',
    verbose=10,
    return_train_score=True,
    n_jobs=-1
)

search.fit(X_train, y_train)
print('Best hyperparameters', search.best_params_)
# >> Best hyperparameters {'classifier__max_depth': 5}
print('Best Score', search.best_score_)
# >> Best Score 0.9075471698113208
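
To sanity-check the tuned pipeline, the best estimator found by the search can be scored on the held-out split; a minimal sketch, assuming the same X_test and y_test used above:

from sklearn.metrics import accuracy_score, classification_report

# Score the refit best pipeline on the held-out test split
best_pipeline = search.best_estimator_
test_preds = best_pipeline.predict(X_test)

print("Test accuracy:", accuracy_score(y_test, test_preds))
print(classification_report(y_test, test_preds))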

On top of my contributions in the exploration, cleaning, and modeling phases, I took the lead in working with the Data Science API. Our project used FastAPI to build the Data Science app and Docker to hold an image of the app for deployment to AWS Elastic Beanstalk. Within my local environment I included the previously mentioned csv file along with a file containing data cleaning and feature engineering methods put together by myself, a fellow team member, and the previous Data Science team. Using this I was able to create two new csv files, one containing the raw final data and the other containing the final data cleaned up and pre-processed for jsonification. This data was converted to a JSON object before being added to a GET endpoint for access by the web team’s back end. The router set up to achieve this task can be found in the following embedding:

from fastapi import APIRouter, HTTPException
import pandas as pd
import numpy as np
# from .update import backlog_path  # Use this when able to get the backlog.csv filled correctly
from ast import literal_eval
import os
import json

# Create router access
router = APIRouter()


@router.get('/getdata')
async def getdata():
    """
    Get jsonified dataset from all_sources_geoed.csv

    ### Response
    dataframe: JSON object
    """
    # Path to the dataset used in our endpoint
    locs_path = os.path.join(os.path.dirname(__file__), '..', '..',
                             'all_sources_geoed.csv')
    df = pd.read_csv(locs_path)

    # Fix issue where "Unnamed: 0" is created when reading in the dataframe
    df = df.drop(columns="Unnamed: 0")

    # Remove the string-type output from the src and tags columns,
    # leaving them as arrays for easier use by the backend
    for i in range(len(df)):
        df.at[i, 'src'] = literal_eval(df.at[i, 'src'])
        df.at[i, 'tags'] = literal_eval(df.at[i, 'tags'])

    # Initial conversion to JSON - use records to jsonify by instances (rows)
    result = df.to_json(orient="records")

    # Parse the jsonified data, removing instances of '\"' that make it
    # difficult for the backend to collect the data
    parsed = json.loads(result.replace('\"', '"'))
    return parsed
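
For context, here is a minimal sketch of how a router like this gets mounted into the FastAPI app; the module names and import path are assumptions about the layout, not the project’s exact structure:

from fastapi import FastAPI

# Import path is an assumed layout, e.g. app/api/getdata.py holding the router above
from app.api import getdata

app = FastAPI(
    title="Human Rights First DS API",
    description="Serves preprocessed police brutality incident data",
)

# Mount the /getdata endpoint defined by the router
app.include_router(getdata.router)

Locally, an app laid out this way could be served with uvicorn (for example, uvicorn app.main:app --reload) before being packaged into the Docker image for Elastic Beanstalk.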

One of the major challenges we faced came during the deployment stage of the project. I was able to get the data set up and deployed to AWS Elastic Beanstalk, but several times there was a problem with the jsonification of the data that made it unusable for the web team. First, the data was coming back with several out-of-place backslashes (“\”) and forward slashes (“/”). Second, some of the data features, specifically “src” and “tags”, were being returned as strings instead of arrays. The DS team sat down together in chat to research and brainstorm how to fix the issue. After a good deal of trial and error in our deployment, we found the preprocessing steps needed to ensure the data being sent was formatted correctly. The code for this process is embedded below:

import os
import pandas as pd

# Set up various things to be loaded outside of the function
# Geolocation data
locs_path = os.path.join(os.path.dirname(__file__),
                         'latest_incidents.csv')

# Read the csv file into a dataframe
sources_df = pd.read_csv(locs_path)

# Remove instances in which backslashes and newlines occur together in the
# data (the literal two-character sequence "\n")
sources_df["desc"] = sources_df["desc"].str.replace("\\n", " ", regex=False)

# Remove the "Unnamed: 0" column created when reading in the csv
sources_df = sources_df.drop(columns=["Unnamed: 0"])

# Fix instances in which real newline characters appear in the data
for i in range(len(sources_df)):
    sources_df.at[i, "desc"] = str(sources_df.at[i, "desc"]).replace("\n", " ")

# Create a csv file from the dataframe
sources_df.to_csv("all_sources_geoed.csv")
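
A quick way to confirm the fix, not part of the shipped code, is to read the regenerated csv back in and check that the records serialize cleanly:

import json
import pandas as pd

# Reload the regenerated file and confirm the records serialize cleanly
check_df = pd.read_csv("all_sources_geoed.csv").drop(columns=["Unnamed: 0"])
records = json.loads(check_df.to_json(orient="records"))

# Spot-check one record for leftover literal "\n" sequences in the description
first = records[0]
assert "\\n" not in str(first.get("desc", ""))
print(json.dumps(first, indent=2)[:500])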

The End Is Not The End But A New Beginning

As of now our contributions to the product have come to an end. The Data Science team was able to collect, clean, feature engineer, and ship a useable dataset to the web team, including 1177 instances of police brutality across the United States over the last seven months. Each instance sent to the back end included the following features:

- “src”: an array of urls linking to the source videos, articles, and/or posts, cleaned up using newspaper3k to extract the correct text
- “state”: the state in which the incident occurred, engineered using spaCy to extract the state name from the text
- “city”: the city in which the incident occurred, engineered using spaCy to extract the city name from the text
- “desc”: a text description of the incident
- “tags”: the tags used to identify whether the story was a case of police brutality or not, engineered using Natural Language Processing
- “title”: a descriptive title of the incident
- “date”: the date the incident occurred
- “id”: a unique string identifier for each instance
- “lat”: the latitude code used to map the location of the incident, engineered from the state and city names and a separate csv file listing city, state, and geolocation codes (lat and lon)
- “long”: the longitude code used to map the location of the incident, engineered the same way

Our team was also able to ship a useable API that connects to the backend and, through it, to the frontend visualizations. A sample selection of the data being sent to the backend team can be found below:

[{"src":["https://www.youtube.com/watch?v=s7MM1VauRHo"],"state":"Washington","city":"Olympia","desc":"Footage shows a few individuals break off from a protest to smash City Hall windows. Protesters shout at vandals to stop.  Police then arrive. They arrest multiple individuals near the City Hall windows, including one individual who appeared to approach the vandals in an effort to defuse the situation.  Police fire tear gas and riot rounds at protesters during the arrests. Protesters become agitated.  After police walk arrestee away, protesters continue to shout at police. Police respond with a second bout of tear gas and riot rounds.  A racial slur can be heard shouted, although it is unsure who is shouting.","tags":["arrest","less-lethal","projectile","protester","shoot","tear-gas"],"geolocation":"{'lat': '47.0378741', 'long': '-122.9006951'}","title":"Police respond to broken windows with excessive force","date":"2020-05-31","date_text":"May 31st","id":"wa-olympia-1","lat":47.0378741,"long":-122.9006951},{"src":["https://mobile.twitter.com/chadloder/status/1267011092045115392"],"state":"Washington","city":"Seattle","desc":"Officer pins protester with his knee on his neck. His partner intervenes and moves his knee onto the individual's back.  Possibly related to OPD Case 2020OPA-0324 - \"Placing the knee on the neck area of two people who had been arrested\"","tags":["arrest","knee-on-neck","protester"],"geolocation":"{'lat': '47.6062095', 'long': '-122.3320708'}","title":"Officer pins protester by pushing his knee into his neck","date":"2020-05-30","date_text":"May 30th","id":"wa-seattle-1","lat":47.6062095,"long":-122.3320708},{"src":["https://twitter.com/gunduzbaba1905/status/1266937500607614982"],"state":"Washington","city":"Seattle","desc":"A couple of police officers are seen beating and manhandling an unarmed man. The officers are throwing punches while he was on the ground and pinned.  Related to Seattle OPA Case 2020OPA-0330.","tags":["beat","protester","punch"],"geolocation":"{'lat': '47.6062095', 'long': '-122.3320708'}","title":"Police beat unarmed man on the ground","date":"2020-05-31","date_text":"May 31st","id":"wa-seattle-2","lat":47.6062095,"long":-122.3320708}, . . . ]

The web team was able to use this data to put together a functioning interactive choropleth map of the United States on the landing page. The map shows the user a number of clickable pins; clicking a pin opens an information box that includes the title of the incident, the city and state where it occurred, a text description, and the source links showing where the data was collected, where the incident was reported, and any accompanying video and news coverage.

The shipped choropleth map with interactive pins
An example information box of an incident in Salt Lake City, Utah seen upon clicking the pin

With the product at this stage, there are a number of possible future modifications and feature additions. The Data Science team’s baseline model, while working and deployed, had a tendency to output too many false positives, which makes it not yet dependable enough for use in an automatic data gathering architecture. Future teams can work to improve the model significantly and can use and modify the update skeleton that collects new data every 24 hours and uses the model to predict instances of police brutality (a rough sketch of that cycle follows below). On the web side, possible future additions include an updated and more visually pleasing map, access to more visualizations and data (which would also involve extra data and possibly seaborn or plotly visualizations shipped by future Data Science teams), and the ability for users to sort the data by dates, locations, and so on.
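
For future teams, the 24-hour update skeleton mentioned above could take a shape roughly like the sketch below; every helper function here is a hypothetical placeholder rather than the repository’s actual code:

import time


def update_loop(model, fetch_new_posts, geocode_and_clean, append_to_csv):
    """Hypothetical daily update cycle: pull new posts, predict, keep positives."""
    while True:
        raw_posts = fetch_new_posts()            # e.g. PRAW/Tweepy pulls
        cleaned = geocode_and_clean(raw_posts)   # same preprocessing as above
        preds = model.predict(cleaned["text"])   # flag likely brutality incidents
        append_to_csv(cleaned[preds == 1])       # add new positives to the dataset
        time.sleep(24 * 60 * 60)                 # wait 24 hours before the next pull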

The Reward At Journey’s End

Throughout this journey I was exposed to a plethora of new and rewarding experiences. I learned the importance of prioritizing planning early on, and how to go about that process productively and efficiently, both as a team and individually. I learned the banes and boons of pair programming; this project had us pair programming five days a week for four weeks, a span long enough to gain plenty of insights and frustrations. This led me directly to my next and arguably most important learning opportunity: team development and communication. When working on a multidisciplinary project that requires many tasks from many different people, most of them heavily dependent on the completion of other tasks, working closely as a team to build trust, understanding, and a strong communication system is of the utmost importance. Beyond this, I learned the ins and outs of a number of technical libraries and tools, including FastAPI, AWS Elastic Beanstalk, and newspaper3k, and gained more experience with NLP, spaCy, tokenization, RandomForestClassifier, RandomizedSearchCV, and APIs such as PRAW and Tweepy, among many others.

This product and experience are something I will carry with me into the future: insight to lean on when approaching new NLP or Data Engineering problems, invaluable teamwork experience, a reference for my resume, applications, and job interviews, and so much more beyond what I can currently anticipate.

My name is Daniel Benson. I am a Data Scientist and aspiring Machine Learning Engineer trained through Lambda School’s Data Science program. I currently hold an Associate’s Degree from Weber State University and am finishing up a Bachelor’s Degree in Biology and Neuroscience. If you wish to read more of my work, I make an effort to publish at least one blog post a week here on Medium outlining a project I have worked on or am working on. If you wish to contact me, I can be found on LinkedIn or Twitter, and for a closer look at my project methods you can visit my GitHub, where I store all of my technical projects. Finally, if you would like to follow the development of this police brutality product, you can consult the repository, the deployed landing page, or the Human Rights First website itself. Thank you for following me on this journey, brave reader, and keep an eye out for the next post. Rep it up and happy coding.
