An Introduction

Introduction
The Perspective API from Jigsaw and Google aims at moderating toxic content in online social media platforms to promote a more civil and inclusive environment for everyone.
Although it seems to work pretty well at first, a deeper look 🕵️‍♂️ reveals some serious cracks. Let’s understand them using example input sequences and the corresponding classifications from the API:
(1) I can kill you with my knife ➡️ 98.65% threatening
(2) I can kill you with my sarcasm ➡️ 97.62% threatening
(3) If looks could kill, I would be dead by now 😂 ➡️ 95.35% threatening
(4) If looks could pay, I would be rich by now 😎 ➡️ 18.16% threatening
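If you want to try this yourself, here is a minimal sketch of querying the Perspective API over its public REST endpoint. The endpoint and response fields follow the v1alpha1 docs as I know them, and `PERSPECTIVE_API_KEY` is a placeholder for your own key, so treat this as a rough template rather than a guaranteed recipe:

```python
import requests

API_KEY = "PERSPECTIVE_API_KEY"  # placeholder: substitute your own key
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def threat_score(text: str) -> float:
    """Return the THREAT probability assigned by the Perspective API."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"THREAT": {}},
    }
    response = requests.post(URL, json=payload)
    response.raise_for_status()
    body = response.json()
    return body["attributeScores"]["THREAT"]["summaryScore"]["value"]

for sentence in [
    "I can kill you with my knife",
    "I can kill you with my sarcasm",
    "If looks could kill, I would be dead by now",
    "If looks could pay, I would be rich by now",
]:
    print(f"{threat_score(sentence):.2%}  {sentence}")
```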
Their model has clearly developed a bias towards the words kill and dead, and flags sequences as threatening even when they are used in a non-threatening context. In the terms defined by [4], the model is over-generalising these words towards being threatening.
As a result, it fails to appropriately capture the context they appear in.
Let’s look at some reasons that can lead to these unintended generalisations by the model.
Before we proceed
Due to my field of interest, the examples used in this article revolve around hate-speech/abuse detection. However, the principles discussed here are not limited to any one field and extend to a variety of other datasets. Read on, and decide for yourself if your dataset shows any symptoms of bias 👩‍⚕️
Feel free to follow along the companion Jupyter notebook for a more hands-on introduction! 💻 🤓
The meaning of bias in datasets
Our machine learning algorithms can only be as good as the data that they try to model 📊.
Textual datasets are vulnerable to capturing biases due to a wide variety of reasons. Some of the common types of biases are:
Bias due to social stereotypes: [2] show that even the Google News article dataset exhibits a disturbing amount of gender stereotyping: word2vec embeddings trained on this dataset place ‘female’ close to ‘homemaker’ and ‘male’ close to ‘computer programmer’.
Bias due to skewed classes: It is often the case that "purely" sampled data contains a very sparse representation of a certain label. For example, [3] estimate that hateful content constitutes at most 3% of all tweets on Twitter. Hence, a dataset generated by randomly picking tweets can be expected to contain (at best) 3% hateful tweets. Models trained on such low-density datasets tend to develop a bias against the under-represented classes, predicting the better-represented classes by default (a quick check for this kind of skew is sketched after this list).
Bias introduced during sampling: Bias due to skewed classes can prompt dataset creators to resort to different sampling/filtering techniques to increase the density of a particular label. Some of these commonly adopted techniques include:
- Boosted random sampling: Simple (or "pure") random sampling followed by some heuristic to boost the desirable label. For example, collect tweets at random from a specific time frame followed by adding more tweets from the users who are known to post content with the desirable label.
- Biased topic sampling: Directly pick up samples from specific topics which are known to engage content with the desirable label. For example, tweets related to migrants are known to attract abusive posts.
Bias in annotation: [6] point to the drawbacks of crowdsourcing annotation on platforms like Mechanical Turk, CrowdFlower, etc.: without proper supervision, the annotators’ demographic (gender/ethnicity/belief/other) alignments can bias their annotations. [7] also discuss the drawback of creating a gold-standard dataset by methods like majority-vote aggregation, arguing for the importance of separately capturing each individual annotator’s viewpoint.
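As promised above, here is a quick sanity check for class skew. It is a minimal sketch that assumes a CSV with a `label` column; both the file name and the column name are hypothetical, so adapt them to your own dataset:

```python
import pandas as pd

# Hypothetical file/column names -- adapt to your own dataset.
df = pd.read_csv("tweets.csv")           # e.g. one row per tweet
counts = df["label"].value_counts()      # absolute counts per class
print(counts)
print((counts / len(df)).round(3))       # relative frequencies per class

# A heavily skewed ratio (e.g. ~3% 'hateful' vs ~97% 'normal') is a first
# warning sign that a model may simply default to the majority class
# unless the data is re-sampled or the loss is re-weighted.
```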
Why should you care about these biases anyway? 🤷‍♂️
These unintended irregularities in the training data leak into the models trained on them, dragging down their performance on new, unseen data.
Let’s take the hate-speech classification dataset from [5]. It is known to suffer from false-positive bias due to the presence of abusive words in the non-hateful class of the test split. This leads to a large proportion of neutral samples being classified as hateful (i.e. false positives):

As we can see, a BERT ([8]) model trained on this dataset performs near-perfectly on the train split and moderately well on the validation split, but fails to generalise to the test split, where both accuracy and F1-score drop sharply.
(Try plugging in your own dataset into the companion Jupyter notebook and see if you can notice any bias in model predictions 😉 )
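If you are not following the notebook, here is a rough sketch of how such false positives can be surfaced with a fine-tuned classifier. The checkpoint path is a placeholder for whatever BERT model you trained on [5], and the positive label name (`LABEL_1`) is an assumption that depends on your label mapping; I use the Hugging Face `transformers` pipeline API here:

```python
from transformers import pipeline

# Placeholder checkpoint: substitute the BERT model you fine-tuned on [5].
clf = pipeline("text-classification", model="path/to/finetuned-bert")

def false_positives(texts, labels, hateful_label="LABEL_1"):
    """Return non-hateful samples that the model flags as hateful."""
    preds = clf(texts, truncation=True)
    return [
        text
        for text, pred, gold in zip(texts, preds, labels)
        if gold == 0 and pred["label"] == hateful_label
    ]

# test_texts / test_labels come from the test split of the dataset, e.g.:
# for sample in false_positives(test_texts, test_labels):
#     print(sample)
```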
Let’s take a look at some of the samples that wrongly get classified as hateful:
(1) @user b***h i love you with my whole heart ur my fave person ever thank u ❤️️
(2) anyway this b***h is back and ready to detest boygroups minus bts and bap with her whole heart and being
(3) @user b***h dont test me
(4) @user @user agreed! tonight's show was very informative. tucker is brilliant, and @user is so smart buildthatwall
(5) california needs voter id! ice at every polling station! voterid voterfraud buildthatwall
...
The model seems to be biased towards words like:
- b***h: an abusive word, but one that can also be used in a non-hateful context (e.g. in AAE)
- buildthatwall: a hashtag that attracts hate against immigrants
Yes, you guessed it right! Our dataset is suffering from bias 🤒 which is visible in our model’s predictions.
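One crude but effective way to confirm this by hand is to compare how often such tokens appear in each class of the training split. A minimal sketch, assuming you have the training texts and integer labels loaded as Python lists (the variable names below are hypothetical):

```python
from collections import Counter

def class_token_counts(texts, labels):
    """Count token frequencies separately for each class label."""
    counters = {}
    for text, label in zip(texts, labels):
        counters.setdefault(label, Counter()).update(text.lower().split())
    return counters

# train_texts / train_labels come from the training split, e.g.:
# counters = class_token_counts(train_texts, train_labels)
# for token in ["b***h", "buildthatwall"]:
#     print(token,
#           "hateful:", counters[1][token],
#           "non-hateful:", counters[0][token])
```

If a token shows up almost exclusively in one class, the model can latch onto it as a shortcut, which is exactly the over-generalisation we saw above.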
What’s next 🤔
In this article, we saw some of the common causes of bias in NLP datasets and the models trained on them. We also looked at a way of manually inspecting 🔍 our dataset for bias.
In the next article, we will dive deeper into systematic ways of detecting and mitigating bias in NLP datasets 🌊🏊‍♂️.
References
[1]: Caliskan, Aylin & Bryson, Joanna & Narayanan, Arvind. (2017). Semantics derived automatically from language corpora contain human-like biases. Science. 356. 183–186.
[2]: Bolukbasi, Tolga & Chang, Kai-Wei & Zou, James & Saligrama, Venkatesh & Kalai, Adam. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.
[3]: Founta, Antigoni-Maria & Djouvas, Constantinos & Chatzakou, Despoina & Leontiadis, Ilias & Blackburn, Jeremy & Stringhini, Gianluca & Vakali, Athena & Sirivianos, Michael & Kourtellis, Nicolas. (2018). Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior.
[4]: Badjatiya, Pinkesh & Gupta, Manish & Varma, Vasudeva. (2020). Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations.
[5]: Basile, Valerio & Bosco, Cristina & Fersini, Elisabetta & Nozza, Debora & Patti, Viviana & Rangel Pardo, Francisco & Rosso, Paolo & Sanguinetti, Manuela. (2019). SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. 54–63. 10.18653/v1/S19-2007.
[6]: Geva, Mor & Goldberg, Yoav & Berant, Jonathan. (2019). Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets.
[7]: Akhtar, S., Basile, V., & Patti, V. (2020). Modeling Annotator Perspective and Polarized Opinions to Improve Hate Speech Detection. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 8(1), 151–154.
[8]: Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT.