Fairness and Bias

Towards Removing Gender Bias in Writing

Learn how to understand the amount of gender bias in your daily consumption of literature and avoid it in your own writing.

Srihitha Pallapothula
Towards Data Science
11 min read · Sep 30, 2021


A yellow family sign painted on cement. From left to right: a daughter holding a father’s hand, a father, and a mother holding a baby.
Photo by Sandy Millar on Unsplash

Why gender bias?

I grew up in a culture in which gender bias was pervasive. As a writer, I found this concerning and tried to mitigate the effects of bias in my stories. I believe that addressing this issue begins with detection, so I decided to undertake a project that could detect possible gender bias.

Disclaimer

This project was my first introduction to data science, machine learning, and natural language processing, and an exploratory effort into areas I have always been intrigued by. It is therefore important to note that it is a proof of concept, not a perfect algorithm.

Methodology

I decided to start by examining gender bias at the sentence level. The alternative would have been to examine paragraphs or even the whole text and, if I found a biased word anywhere, to label the entire text as biased. I steered away from that method because I felt a single word did not necessarily make an entire text biased. Instead, I felt it was more important to examine that word’s relationship with other words to better understand the role it played in generating bias. The only way to do so was to use a high level of granularity and process sentences instead of larger chunks of text.

When examining sentences, I wanted to break each sentence down into words and identify the part of speech of every word. I then needed to decide which parts of speech to examine to determine whether the sentence was biased. I chose the subject and the object for the following reasons: the object is what the writer associates with a gendered subject (man, boy, woman, girl, etc.), these are the parts of speech most likely to be targets of gender bias, and the relationship between the two most often indicates bias. For example, if the writer correlates something stereotypical with the subject ‘girl’, such as the word ‘nurse’ or ‘dolls’, then it becomes clear that the sentence is biased. My observation that, in most sentences with gender stereotypes, the biased relationship was between these two components reinforced this choice.

Note: The sentences I examined were from an article from The Gender Equality Law Center[1]. I also used some of these sentences to help me develop my own algorithm.

I used the Python library Spacy to deconstruct each sentence, since it is one of the few libraries that would parse sentences for me, and loaded the model ‘en_core_web_lg’, which allowed me to separate the words in each sentence into Spacy token objects. Tokens have many features, but the main ones that I used were:

  • token.pos_, for part of speech
  • token.dep_, for dependencies or the relationship the word has to other words in the sentence
  • token.text, which gave me the string form of the words
  • token.lemma_, which gave me the root form of the word; for example, token.lemma_ for ‘women’ would be ‘woman’

Here is an example of the sentence ‘A girl should like dolls.’ displayed with Spacy:

Spacy visual of parts of speech for the sentence ‘A girl should like dolls.’
Image by author.
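Here is a minimal sketch of how these token attributes can be inspected for that same sentence; it assumes the en_core_web_lg model has already been downloaded.

import spacy
from spacy import displacy

# Assumes the model was installed with: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")
doc = nlp("A girl should like dolls.")

# Print the token attributes used throughout this project.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.lemma_)

# Reproduce the dependency visual shown above (e.g. inside a notebook).
displacy.render(doc, style="dep")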

Using Spacy, I created two methods that checked for the subject and object of each sentence. I only proceeded with a sentence if the subject was either male or female, which I determined by checking whether the lemma and lowercase form of the subject was in the following lists:

lemma_female = ['girl', 'woman', 'women', 'female']

lemma_male = ['boy', 'man', 'men', 'male']

Note: I included women and men even though they are not root words because, in some cases, Spacy fails to detect that the lemma form is woman and defaults to women. To account for this issue, I added women and men to the lemma lists.
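Here is a minimal sketch of what such a subject and object check could look like using Spacy’s dependency labels; the function names are illustrative rather than my exact implementation.

import spacy

nlp = spacy.load("en_core_web_lg")

lemma_female = ['girl', 'woman', 'women', 'female']
lemma_male = ['boy', 'man', 'men', 'male']

def get_gendered_subject(doc):
    """Return the nominal subject if its lemma is a male or female term."""
    for token in doc:
        if token.dep_ == "nsubj" and token.lemma_.lower() in lemma_female + lemma_male:
            return token
    return None

def get_object(doc):
    """Return the direct object of the sentence, if any."""
    for token in doc:
        if token.dep_ == "dobj":
            return token
    return None

doc = nlp("A girl likes dolls.")
print(get_gendered_subject(doc), get_object(doc))  # girl dolls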

With the example sentence ‘A girl likes dolls.’, my methods would give me the subject, girl, and the object, dolls.

However, I realized that the sentence ‘A girl likes dolls.’ would be unbiased because there was no insistence that a girl should like dolls, unlike in a sentence like ‘A girl should like dolls.’

So, to ensure that my methods only looked at sentences that could be biased, I checked for qualifiers, such as the word should. Qualifiers are words that “change how absolute, certain or generalized a statement is” [2]. I checked specifically for necessity and quantity qualifiers because they indicate whether the writer believes males or females should do something, or is making broad generalizations about either gender or both:

necessity_qualifiers = ['must', 'should', 'ought', 'required', 'have to', 'has to']

quantity_qualifiers = ['all', 'only', 'most']

Additionally, I checked whether the subject was a plural gendered term because, if so, the writer would be making a generalization about the entire gender, not just a single member. To do so, I checked if the subject was in the following list:

plural_form = ['women', 'men', 'girls', 'boys', 'females', 'males']

If the sentence contained one or more of the phrases above, then the methods used to find the subject and object would process the text.

For example, with the sentence ‘A girl likes dolls.’, the methods now wouldn’t output anything, whereas with the sentence ‘A girl should like dolls.’, they would give me the subject girl and the object dolls.
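A rough sketch of this gating step; the function name and the simple substring matching are illustrative rather than my exact implementation.

necessity_qualifiers = ['must', 'should', 'ought', 'required', 'have to', 'has to']
quantity_qualifiers = ['all', 'only', 'most']
plural_form = ['women', 'men', 'girls', 'boys', 'females', 'males']

def should_process(sentence):
    """Only pass sentences containing a qualifier or a plural gendered term."""
    lowered = sentence.lower()
    # Substring matching handles multi-word qualifiers such as 'have to';
    # a fuller implementation would match on tokens instead.
    has_qualifier = any(q in lowered for q in necessity_qualifiers + quantity_qualifiers)
    has_plural = any(word.strip('.,') in plural_form for word in lowered.split())
    return has_qualifier or has_plural

print(should_process("A girl likes dolls."))        # False
print(should_process("A girl should like dolls."))  # True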

Next I needed to find a way to determine whether or not the relationship between the subject and object detected was biased. Because this project operates within the binary, I realized the only way to do so was to compare the relationship detected to the opposite gender’s relationship with the object. If the relationship was somehow different, then there would be a possibility of bias.

I decided to use cosine similarity for this part of the project because it is a common NLP technique for comparing words: it measures how similar two word vectors are. A higher cosine similarity score means the words are more closely related; a lower score means they are more distantly related. The similarity scores depend on the word vectors used, which in this case were trained on Twitter data, so they reflect the way words were used on Twitter.

For the comparison, take the example ‘A girl should like dolls.’ First, my methods identify the subject as ‘girl’ and the object as ‘dolls’. Then I compare the pair ‘girl’, ‘doll’ (the lemma, or root, form of dolls) to the pair ‘boy’, ‘doll’. Using cosine similarity, I found that the similarity between ‘girl’ and ‘doll’ was 0.58 and the similarity between ‘boy’ and ‘doll’ was 0.48. Both pairs are closely related, but because the score for ‘girl’ and ‘doll’ is higher, that relationship is closer than the one between ‘boy’ and ‘doll’. Subtracting the second similarity from the first gives 0.10. This significant difference suggests that there is bias in the relationship between ‘girl’ and ‘doll’. Beyond sentences with a single subject and object like this one, the algorithm can also process sentences with multiple subjects and objects, as well as sentences joined by conjunctions.
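Here is a minimal sketch of that comparison using Spacy’s built-in vector similarity; the exact scores depend on the vectors loaded, so they may not match the numbers above.

import spacy

nlp = spacy.load("en_core_web_lg")

def bias_difference(subject_lemma, object_lemma, opposite_lemma):
    """Similarity(subject, object) minus similarity(opposite gender, object)."""
    obj = nlp.vocab[object_lemma]
    return nlp.vocab[subject_lemma].similarity(obj) - nlp.vocab[opposite_lemma].similarity(obj)

# A positive difference means 'girl' is closer to 'doll' than 'boy' is.
print(bias_difference("girl", "doll", "boy"))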

Based on the difference I obtained from the work above, I needed to determine whether or not the relationship was biased. Because there was little research on what size of difference counts as biased, I set a threshold of greater than 0.02 or less than -0.02; I decided on this number after testing many biased cases, such as ‘A girl should like dolls.’, and observing the differences in their scores.

For my metric, if the difference was greater than 0.02, the sentence was biased towards females. If it was less than -0.02, it was biased towards males. If it was in between, there was no bias.

Here is an image of the spectrum:

Less than -0.02 is Biased towards males. Greater than 0.02 is Biased towards females. Between -0.02 and 0.02, inclusive, is Unbiased.
Image by author.
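A sketch of this decision rule applied to the difference computed above:

FEMALE_THRESHOLD = 0.02
MALE_THRESHOLD = -0.02

def classify_bias(difference):
    """Place a similarity difference on the spectrum above."""
    if difference > FEMALE_THRESHOLD:
        return "biased towards females"
    if difference < MALE_THRESHOLD:
        return "biased towards males"
    return "unbiased"

print(classify_bias(0.10))  # biased towards females
print(classify_bias(0.01))  # unbiased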

Based on the difference in cosine similarities, if it was greater than 0.02, I incremented a count of sentences biased towards females, and if it was less than -0.02, I incremented a count of sentences biased towards males. I also maintained a count of the total sentences. To see whether a sentence was biased towards both males and females, I checked if it contained female gendered terms, male gendered terms, and at least one qualifier. If all three components were present, I marked the sentence as biased and added it to a count of sentences biased towards both males and females.

I also checked for sentences where the subject and object methods did not output anything but gendered terms were still present. In that case, instead of the process above, I used the Python library vaderSentiment to calculate the compound sentiment score of the sentence, which ranges from -1 (negative sentiment) to 1 (positive sentiment). If the score was less than 0 and the sentence contained female gendered terms, male gendered terms, or both, I marked it as biased and added it to the respective count. I used sentiment analysis because I wanted to identify general bias and detect bias in sentences that had no qualifiers.
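For reference, here is a minimal sketch of obtaining the compound score with vaderSentiment.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# The 'compound' key is a normalized score in the range [-1, 1].
score = analyzer.polarity_scores("Women should never go to work.")["compound"]
print(score)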

Lastly, I calculated the percentages of bias in the entire text — biased towards males, biased towards females, biased towards males and females — by dividing each of the counts by the total number of sentences. Then I displayed my output.

To display my output, I used Streamlit, which allowed me to create my app efficiently. I organized the information from the work above into a format Streamlit would accept and displayed it on screen.
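Here is a minimal sketch of what such a Streamlit front end could look like; analyze_bias is a placeholder standing in for the pipeline described above, not my actual code.

import streamlit as st

def analyze_bias(text):
    # Placeholder for the bias-detection pipeline described above.
    return {"female": 0.0, "male": 0.0, "both": 0.0}

st.title("How gender biased is your book, article, etc.?")
uploaded_file = st.file_uploader("Upload a PDF or DOCX (Word) file", type=["pdf", "docx"])

if uploaded_file is not None:
    # The real app extracts text from the PDF/DOCX first; this is a rough stand-in.
    text = uploaded_file.read().decode(errors="ignore")
    results = analyze_bias(text)
    st.write(f"Biased Towards Females: {results['female']:.2f} percent")
    st.write(f"Biased Towards Males: {results['male']:.2f} percent")
    st.write(f"Biased Towards Both Genders: {results['both']:.2f} percent")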

My algorithm extracts text from books, articles, etc. in PDF and Word document form. It first processes the text, then runs each sentence through the algorithm above, calculates the bias percentages, and displays the biased sentences along with the type of bias detected.
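The extraction step could look something like the sketch below, which uses PyPDF2 and python-docx as one plausible choice of libraries (not necessarily the ones behind the actual app).

import PyPDF2
import docx  # python-docx

def extract_text(path):
    """Extract raw text from a PDF or Word (.docx) document."""
    if path.lower().endswith(".pdf"):
        with open(path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            return "\n".join(page.extract_text() or "" for page in reader.pages)
    if path.lower().endswith(".docx"):
        document = docx.Document(path)
        return "\n".join(paragraph.text for paragraph in document.paragraphs)
    raise ValueError("Unsupported file type")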

Demonstration:

The sentence I used to explain how my algorithm works is quite simple; I chose it primarily so that the analysis would be easy to follow. Here, instead, is an example of my algorithm’s analysis of a more complex piece of writing:

Here is the homepage of my website, where users can upload either a PDF or DOCX (Word) file. As can be seen, I chose to analyze this Women’s Web article about dismantling the stereotype that women should not work, and uploaded it as a PDF.

PDF uploaded to site. Title text above the uploading area: How gender biased is your book, article, etc.?
Image by author

After clicking the process button, here are the results:

Biased Towards Females: 28.99 percent, Biased Towards Males: 2.9 percent, and Biased Towards Both Genders: 1.45 percent.
Image by author.

The text seems to primarily be biased towards females. There is little bias towards males and towards both genders. Now let’s take a look at a couple of the sentences in which bias was detected.

Starting with bias towards females, here is one sentence:

It all started when we were talking about the generic things happening in the society and out of nowhere, my friend said, ‘Women should never go to work.’

  • In this sentence, the subject identified is women and the object identified is work. The sentence is flagged as biased because it has a plural gendered term (women) and a necessity qualifier (should). Comparing the cosine similarity of ‘woman’ (the lemma, or root, form of women) and ‘work’, which is 0.32, with that of ‘man’ and ‘work’, which is 0.39, gives an absolute difference of 0.07 between the two relationships. This makes the sentence biased towards females.

Next, here is one sentence where there is bias towards males:

Before they are married, most men are completely dependent on their mums and after, on their wives.

  • In this sentence, the subject is identified as ‘men’ and the object as ‘mothers’. The sentence is flagged because it has a plural gendered term (men) and a quantity qualifier (most). This sentence has multiple objects, ‘wives’ and ‘mums’, so I used the Python library gensim and the Google News word vectors to find the single word closest to the identified objects, which turned out to be ‘mothers’ (a sketch of that lookup follows this example). The cosine similarity of ‘man’ (the lemma, or root, form of men) and ‘mother’ (the lemma form of mothers) is 0.50, and that of ‘woman’ and ‘mother’ is 0.68. The absolute difference of 0.18 shows a clear gap between the two relationships, meaning the sentence is biased towards males.
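Here is a rough sketch of how such a lookup could be done with gensim; the vector file path is an assumption, and the exact call may differ from what the app does.

from gensim.models import KeyedVectors

# Assumes the pretrained Google News vectors have been downloaded locally.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Find the single word closest to both objects (assuming both are in the vocabulary).
print(vectors.most_similar(positive=["wives", "mums"], topn=1))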

Lastly, here is the only sentence biased towards males and females:

Another reason women can’t work is because of how men are raised.

  • In this sentence, the subjects identified are women and men. Because both are plural, the author is making some sort of generalization about both genders. Because there is no way to compare the two genders against each other in this case, the algorithm automatically marks the sentence as biased towards both males and females.

What I learned:

This project taught me how machine learning can be used to tackle issues such as gender bias in text. However, it also showed me how simultaneously expansive and limiting ML can be. Though ML has a very wide scope, at times, it fell short in addressing my algorithm’s needs. Thus I had to find ways to fill in ML’s gaps algorithmically.

I have also learned a lot about different Python libraries and now understand how to use many of them. Overall, my data science knowledge base has expanded considerably. I have gone from being an absolute beginner to quickly adapting to Python, learning about NLP (specifically, the intricacies of Spacy, the library I used to process text and identify parts of speech), and understanding how to employ large datasets in my work.

Potential Areas for Improvement:

  • As of now, this project operates within the gender binary. In the future, I hope to expand it to address bias towards nonbinary and transgender individuals as well.
  • Additionally, I’d like to focus on bias towards certain genders in certain occupations. For example, men in nursing or women in computer science are often stigmatized, so I aim to find ways to address biases towards and stereotypes about these groups.
  • For comparing a subject with multiple objects, I originally wanted to take the subject and compare it to every single object. In the case of a sentence like ‘A girl should like dolls and trucks.’, I would have compared ‘girl’ and ‘doll’ (the lemma version of dolls or its root form) and ‘girl’ and ‘truck’ (the lemma version of trucks or its root form). But if there were more than two objects or potentially more subjects, this process would become tedious and time-consuming. However, I feel doing so would make my algorithm more accurate. Thus, I would like to search for an efficient way to do so.
  • Lastly, I am currently working to reduce the time required for large files to load and for the results to appear. Thus, in the future, I hope to find ways to improve the speed and efficiency of the algorithm.

If you have any questions, please feel free to reach out to me through LinkedIn.

Thank you for reading. I hope you’ve learned something from my project!

References:

[1] Examples of Gender Stereotypes: Gender-Equality-Law (2015), Gender Equality Law Center

[2] Using Qualifiers (2005), Changing Minds


Srihitha Pallapothula is a high school senior located in the Bay Area. She is fascinated by the intersection between English and computer science.