Into a Textual Heart of Darkness

Going zero to not-quite-hero in NLP via hate speech classification

Leon Zhou
Towards Data Science


Image courtesy of Peggy and Marco Lachmann-Anke, CC0 license.

The internet is a jungle. Here, rich diversity and wondrous color combine to create a unique ecosystem, enabling novel technologies and methods of communication. But beneath this bright canopy is the dark understory of the unmoderated internet — a place where the safety of anonymity blurs the lines of civil discourse.

Here, you’ll find insults and indignity, abuse and ridicule. In this brave new world, the person is merely an abstraction, removed from the things he or she says. Here, entire swaths of people are demonized and denigrated through no fault of their own. Thankfully, some of these are the result of internet trolls — comments carefully constructed to feign ignorance and provoke outrage.

Some. But, not all.

Some of these comments are the product of real prejudice and malice. I wanted to see if I could construct a model able to detect and classify these comments and separate them from the rest. As someone with very limited (read: zero) knowledge of natural language processing techniques, I was interested in seeing how far I could get in two weeks.

“The Horror! The Horror!”

To limit the scope of my investigation, I wanted to home in on a specific category of distasteful comment: hate speech. While the term is increasingly thrown around, it is just as increasingly misunderstood. Justice Alito, in his written opinion in Matal v. Tam, defines hate speech and our relationship with it in the following way:

Speech that demeans on the basis of race, ethnicity, gender, religion, age, disability, or any other similar ground is hateful; but the proudest boast of our free speech jurisprudence is that we protect the freedom to express “the thought that we hate”.

Hate speech in the United States is surprisingly unregulated at the federal level, and has been upheld to be protected under the First Amendment. Institutions such as universities, however, have frequently defined codes of conduct and decency that place limitations around speech, to limited effect.

With the anonymity the Internet offers, and the insulation from consequences that comes with it, hate speech is a growing concern for communities and platforms seeking to maintain civility.

The Data

To start, I needed some hate speech to work with. A quick internet search led me to a collection of human-labeled tweets. This data formed the basis for the paper “Automated Hate Speech Detection and the Problem of Offensive Language”. A cursory exploration of the data revealed what I was up against:

A sample tweet from the data, classified as offensive, but not hateful.

The numerical columns corresponded to, from left to right:

  • Index, or entry number of the tweet
  • Number of human readers who viewed the tweet
  • Number of human readers who considered the tweet hate speech
  • Number of human readers who considered the tweet offensive
  • Number of human readers who considered the tweet neither of the above
  • Final classification, i.e. the majority opinion

Each tweet was read by a minimum of three readers, with as many as double that for more ambiguous statements.

As my challenge was to classify each tweet solely on its text, I disregarded all the numeric columns except the final classification decision.
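Loading the data and paring it down takes only a couple of lines of pandas. The sketch below is illustrative rather than my exact code, and the file and column names are assumptions based on the description above:

```python
import pandas as pd

# File and column names are assumptions, not necessarily those used in
# the original dataset release.
df = pd.read_csv("labeled_data.csv")

# Keep only the tweet text and the final majority-vote label;
# the per-annotator counts are dropped.
data = df[["tweet", "class"]]
print(data.head())
```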

Taking Out the Trash

Tweets are inherently messy; Internet communication is informal and fluid — repeated letters for emphasis, unconventional abbreviations, and ever-evolving slang are just some of the things that both bring the platform to life and make the text just thaaaaaaaat much harder to process. In order to get any sort of accurate read on the words I had in my data, I needed to standardize and normalize.

To do this, I generated a list of words from a number of books on Project Gutenberg. Unfortunately (and understandably), much of the profanity and jargon on the Internet was not part of these classic works of literature, and needed to be added manually. This was a very imperfect system, as it would be impossible for me to account for everything, but it preserved the profanity I needed and normalized the majority of my text.

An example from my normalization workflow. Note the repeated ‘yyy’ in ‘crazy’ is corrected, though ‘bahaha’ is interpreted as an unrelated word.

I hoped most of what would fall through the cracks would at least be normalized incorrectly in a consistent fashion, thus limiting the potential impact.
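For a flavor of what that normalization looked like, here is a minimal, illustrative sketch (not my actual pipeline). It collapses repeated letters and checks candidates against a vocabulary; the toy vocabulary below stands in for the much larger word list built from Project Gutenberg plus the manual additions:

```python
import re

# Toy vocabulary standing in for the Project Gutenberg word list plus
# manually added slang and profanity (hypothetical contents).
VOCAB = {"that", "crazy"}

def squeeze_repeats(word, vocab):
    """Collapse runs of repeated letters until the word matches the vocabulary."""
    # First collapse any run of 3+ of the same letter down to 2.
    word = re.sub(r"(.)\1{2,}", r"\1\1", word)
    if word in vocab:
        return word
    # Then try collapsing doubled letters entirely.
    single = re.sub(r"(.)\1+", r"\1", word)
    return single if single in vocab else word

print(squeeze_repeats("thaaaaaaaat", VOCAB))  # -> "that"
print(squeeze_repeats("crazyyy", VOCAB))      # -> "crazy"
```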

Breaking it Down

With my (mostly) corrected words, I was ready to convert my tweets into something a little more machine-readable. A common approach when handling text is to break it into individual words or short contiguous sequences of words, known as n-grams.

Partial n-grams for a normalized tweet, with corresponding part-of-speech tags (discussed later).
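Generating those chunks is nearly a one-liner with NLTK. This is an illustrative snippet on a made-up tweet, not my exact code:

```python
from nltk.util import ngrams

tokens = "you are one crazy person".split()  # a made-up, normalized tweet

# Unigrams, bigrams, and trigrams of the tokenized tweet.
for n in (1, 2, 3):
    print(list(ngrams(tokens, n)))
```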

Each tweet was then turned into a machine-readable vector of numbers, with each number corresponding to the tf-idf weight of one of its n-grams.

Term frequency-inverse document frequency (tf-idf) can be thought of as a measure of how distinctive a word is to a class of document. One approach to determining how important a word is would be simply to count how many times it shows up (its term frequency).

We have a problem, though. Some words just appear more often in general across all documents — for example “and,” “the,” or “a.” Without removing these words, a simple term frequency would be dominated by these non-distinguishing words, thus hiding the real information.

Tf-idf rectifies this by penalizing words that appear across many documents. Although “the” might have a high term frequency, because it appears in almost every document its tf-idf score will suffer, allowing the truly unique and descriptive words to rise to the surface.

In my case, the “offensive” and “hate” classes shared a lot of vocabulary, such as curse words. My hope was that tf-idf would allow the distinctive slurs that make hate speech what it is to emerge.
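In scikit-learn, the whole n-gram-plus-tf-idf step collapses into a single vectorizer. The snippet below is a hedged sketch on made-up documents rather than my actual feature pipeline; the classic weighting is tf-idf(t, d) = tf(t, d) × log(N / df(t)), though scikit-learn uses a smoothed variant by default:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the dog ran home",
    "the cat ran home",
    "that dog is crazy",
]

# Unigrams through trigrams; ubiquitous words like "the" get down-weighted
# because they appear in nearly every document.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x n-grams

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```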

Keeping it Classy

I performed two major classification steps — part-of-speech tagging, and the final classification as hateful, offensive, or neither.

Superlatives, Infinitives, and Participles, Oh My!

Now, I freely admit I’m no grammar buff. I didn’t know what a “semantically superlative adjective” was until five minutes ago. But part-of-speech information could potentially help me squeeze some extra signal from the data I had, so I buckled down and prepared myself for some hardcore linguistic instruction.

Now, smarter people have walked this road before, and excellent code already exists to tag text with minimal effort. But in the name of learning, I resolved to build an inferior model of my own to do the same job. Starting from a pre-tagged corpus, I trained a simple, quick algorithm, a naive Bayes classifier, to assign the tags.

The result of my part-of-speech classifier.

As “naive” might suggest, this type of classifier is simple-minded: it assumes all of its input features are independent of one another, trading accuracy for simplicity and speed. And simple and speedy it was; in a few lines of code I had my tagger up and running, with an accuracy of 85%. Certainly room for improvement, but quite sufficient: I was eager to move on to the meat of the problem, detection of hate speech itself.
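For a sense of scale, a bare-bones tagger of this kind fits in a screenful of code. The sketch below is illustrative rather than my original implementation: it uses NLTK’s bundled Penn Treebank sample as the pre-tagged corpus and a handful of crude suffix features.

```python
import nltk
from nltk.corpus import treebank  # requires nltk.download("treebank")

# Flatten the pre-tagged corpus into (word, tag) pairs.
tagged_words = [pair for sent in treebank.tagged_sents() for pair in sent]

def pos_features(word):
    """Crude per-word features: a few suffixes and capitalization."""
    return {
        "suffix1": word[-1:],
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "is_title": word.istitle(),
    }

featuresets = [(pos_features(word), tag) for word, tag in tagged_words]
cutoff = int(len(featuresets) * 0.9)
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

tagger = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(tagger, test_set))   # accuracy on the held-out 10%
print(tagger.classify(pos_features("running")))   # likely a verb tag such as 'VBG'
```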

Recap: At this point, I’ve cleaned up my original tweets, attached part-of-speech tags, broken my text down into small chunks of words, and turned them into a whole lot of numbers. I’m ready for the main event.

Off-Balance

On my first pass through the data, I trained five different types of models on each of my six types of features: chunks of one, two, or three words, and their corresponding part-of-speech tags.

Average F1-Score. Each number corresponds to a distinctly trained model.

Looks great for a first pass, right? My top F1-scores are already in the high-80s, and with optimization, I’m confident I can push higher.

Unfortunately, my data was not so kind. As I was seeking to build a hate speech classifier, the metric of greatest interest to me was the F1-score specific to hate speech; showing only that data painted a radically different picture.

F1-Score for “hate speech” classification of tweets.
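The difference between the two pictures comes down to how the F1-score is averaged. A tiny, made-up example (labels 0 = hate, 1 = offensive, 2 = neither) shows how a weighted average can look respectable while the minority class scores zero:

```python
from sklearn.metrics import f1_score

# Made-up labels and predictions: 0 = hate, 1 = offensive, 2 = neither.
y_true = [1, 1, 1, 1, 1, 1, 2, 0]
y_pred = [1, 1, 1, 1, 1, 1, 2, 1]

# The weighted average over all classes looks respectable...
print(f1_score(y_true, y_pred, average="weighted"))
# ...but the per-class breakdown shows class 0 scoring zero.
print(f1_score(y_true, y_pred, average=None, zero_division=0))
```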

What’s going on here? Sadly, in my excitement and haste, I was not diligent in performing my exploratory data analysis. Had I been more thorough, I might have noticed something very off about my data.

Class 0: hate speech. Class 1: offensive speech. Class 2: neither offensive nor hateful.

The proportions of my classes were wildly off balance. Class 0, hate speech, the class I was most interested in, made up only a little over 7% of my total data. Given how much vocabulary the “hateful” and “offensive” classes share, my models had trouble seeing where the boundary between them lay.
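Checking for this kind of imbalance is a one-liner, and it is exactly the exploratory step I skipped (column name assumed, as before):

```python
import pandas as pd

df = pd.read_csv("labeled_data.csv")  # same file as in the loading sketch

# Proportion of each final label: 0 = hate, 1 = offensive, 2 = neither.
print(df["class"].value_counts(normalize=True))
```

One common mitigation, which I did not try at the time, is to weight classes inversely to their frequency, for example via class_weight="balanced" in scikit-learn’s tree-based estimators.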

There’s No “I” In Team

With my best models only reaching a hate-speech F1-score of around 25%, my approach clearly was not working. Since my independent models weren’t up to the task, I looked at combining the best of them into a poor man’s random forest.

The five models I selected for my ensemble approach.

The five models I ended up selecting were all decision trees or variants thereof. For each tweet, I took the majority decision as my classification; if three or more models predicted class 0, that became my classification for that tweet. This approach yielded a class-0 F1-score of 21%, which was actually less than what I had achieved before.
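A hard-voting ensemble like this is only a few lines of scikit-learn and NumPy. The sketch below runs on synthetic data with an illustrative set of five tree-based models, so it stands in for the idea rather than reproducing my exact setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the tf-idf features, with a class skew similar to
# the real data (three classes, the first one rare).
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           n_classes=3, weights=[0.07, 0.77, 0.16],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Five tree-based models (illustrative variants, not necessarily my five).
models = [
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
    ExtraTreesClassifier(random_state=0),
    AdaBoostClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Each model votes; the most common label wins for each tweet.
votes = np.array([m.fit(X_train, y_train).predict(X_test) for m in models])
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

print(f1_score(y_test, majority, average=None))  # per-class F1, class 0 first
```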

Running out of time, I started to worry. Even with a majority decision, my models often could not reach the correct conclusion. But that gave me a second idea: if most of my models were getting the hate speech classification wrong, maybe they were getting it wrong consistently. If there was any sort of pattern to their mistakes, I could feed those five decisions into a new classifier that would be able to make sense of it all.

My ploy paid off: feeding my five models’ predictions into a new naive Bayes classifier that made the final decision got me a hate-speech F1-score of 31%. Not a terribly significant improvement, granted, but an improvement nonetheless.
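This idea is known as stacking, and scikit-learn now ships it as StackingClassifier. The sketch below is the idiomatic equivalent of what I did rather than a reproduction of it: the base models are illustrative, and StackingClassifier feeds the meta-learner cross-validated probability estimates rather than the raw hard votes I used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Same synthetic stand-in for the tf-idf features as in the voting sketch.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           n_classes=3, weights=[0.07, 0.77, 0.16],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The base models' outputs become features for a naive Bayes meta-classifier.
stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("et", ExtraTreesClassifier(random_state=0)),
        ("ada", AdaBoostClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=GaussianNB(),
)
stack.fit(X_train, y_train)

print(f1_score(y_test, stack.predict(X_test), average=None))  # per-class F1
```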

Closing Thoughts

While my success in classifying hate speech was not all that impressive, it was a decent start. Corny as it may be, more important than my accuracy or precision was the understanding of good code and data analysis that could only come from making mistakes and living with a boneheaded code architecture.

As for the model itself, there’s certainly a lot more I could have done, especially with what I know now. From handling class imbalance gracefully to exploring more advanced algorithms or word embeddings, the only way to go is up.


I am a data scientist with a background in chemical engineering and biotech. I am also homeless and live in my car, but that's another thing entirely. Hire me!