Creating a “Gender Equality in the News” Dataset

Extracting face data from news website images using Python and Azure Cognitive Services

Richard Farnworth
5 min read · Sep 4, 2020

The news media is supposed to be a mirror of our society. It’s how many of us learn about the events that shape our society and affect our lives. Unfortunately, we live in a world of inequality, with white, heterosexual men disproportionately occupying positions of power across industry, politics and culture. For example, only 23% of the members of the U.S. Congress are women, and the numbers are only marginally better in the UK and Australia. When the news media holds up its mirror, the stories told and the people given a voice come predominantly from a privileged minority.

While we’ve undoubtedly made a great deal of progress in the last hundred years, there’s still a long road ahead to close the gap. The question I want to help answer is whether the news media is helping or hindering this process through its representation of people and stories from across society.

Photo by Roman Kraft on Unsplash

The Project

To create a dataset of faces featured in images on popular news websites.

By periodically scraping images from a selection of popular news websites and using computer vision to detect faces and predict their gender and age, we will build a dataset that records where, when and what type of faces are presented.

Potential lines of analysis

Using this data, we can ask some of the following questions:

  • Is there an equal balance between male and female faces in images on news websites?
  • Which websites are the most equal/unequal in their presentation of faces of different genders?
  • Are male faces often featured more prominently (size or closeness to the top of the image) than female faces?
  • Do images featuring male faces stay on the site longer than those featuring female faces?
  • Are female faces largely confined to specific sections of the front page (e.g. the Daily Mail’s celebrity sidebar)?
  • Are the men pictured on news media of a different age distribution than women?
  • Are news stories biased towards certain age groups?
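
As a rough illustration of the first two questions, here is a minimal sketch of how a per-site gender share might be computed once the data exists. The function name, the (site, gender) row shape and the example rows are all assumptions made for illustration; the rows are placeholders, not real measurements.

```python
from collections import Counter, defaultdict


def female_share_by_site(faces):
    """Return {site: fraction of faces labelled 'female'} for (site, gender) pairs."""
    counts = defaultdict(Counter)
    for site, gender in faces:
        counts[site][gender] += 1
    return {
        site: by_gender["female"] / sum(by_gender.values())
        for site, by_gender in counts.items()
    }


# Placeholder rows for illustration only, not real measurements.
example_rows = [
    ("site-a.example", "male"),
    ("site-a.example", "male"),
    ("site-a.example", "female"),
    ("site-b.example", "female"),
]
print(female_share_by_site(example_rows))
```

A share close to 0.5 would suggest balance on that site. The same grouping could be repeated by page section or by age band for the other questions.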

Method

The data is collected via a Python script, which runs periodically (once an hour via cron). It polls each website using Selenium, records any new images from each page and then uses Azure’s Face API to extract face data for each image. Four tables are built up over time:

  • Scans: Each row represents a scan of a particular website, recording its url and a timestamp of that scan.
  • Images: Each row represents a unique (by url) image. An image can appear in many scans and a scan can yield many images, necessitating the following table:
  • Appearances: Each row represents the occurrence of a particular image within a particular scan.
  • Faces: Each row represents a face, identified within an image.
The tables are linked by one-to-many relationships. For example, each image can contain zero or more faces, but each face belongs to exactly one image.
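
The four tables could be sketched in SQLite roughly as follows. This is an assumption-laden illustration, not the storage used by the actual script: the column names, the choice of SQLite and the helper `record_appearance` are all hypothetical.

```python
import sqlite3

# Hypothetical schema: table names follow the text, column names are assumed.
SCHEMA = """
CREATE TABLE scans (
    scan_id    INTEGER PRIMARY KEY,
    site_url   TEXT NOT NULL,
    scanned_at TEXT NOT NULL            -- ISO 8601 timestamp
);
CREATE TABLE images (
    image_id  INTEGER PRIMARY KEY,
    image_url TEXT NOT NULL UNIQUE      -- an image is unique by URL
);
CREATE TABLE appearances (
    scan_id  INTEGER NOT NULL REFERENCES scans(scan_id),
    image_id INTEGER NOT NULL REFERENCES images(image_id),
    PRIMARY KEY (scan_id, image_id)
);
CREATE TABLE faces (
    face_id  INTEGER PRIMARY KEY,
    image_id INTEGER NOT NULL REFERENCES images(image_id),
    gender   TEXT,
    age      REAL
);
"""


def record_appearance(conn, scan_id, image_url):
    """Link an image to a scan, creating the image row on first sight.

    Returns True when the image is new, i.e. it still needs face detection.
    """
    row = conn.execute(
        "SELECT image_id FROM images WHERE image_url = ?", (image_url,)
    ).fetchone()
    is_new = row is None
    if is_new:
        image_id = conn.execute(
            "INSERT INTO images (image_url) VALUES (?)", (image_url,)
        ).lastrowid
    else:
        image_id = row[0]
    conn.execute(
        "INSERT OR IGNORE INTO appearances (scan_id, image_id) VALUES (?, ?)",
        (scan_id, image_id),
    )
    return is_new


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Returning whether the image is new means face detection only has to run once per image URL, however many scans the image appears in.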

Selenium

Because some content on popular news sites is loaded dynamically, a simple HTTP request would risk missing some images.

Selenium is a framework for the automated testing of web applications, and serves our purposes quite nicely. Because the page is actually rendered in a browser, we have the added benefit of being able to detect the pixel location at which each image is displayed.
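
A minimal sketch of that idea, assuming Selenium's Python bindings and a Chrome driver are installed. The helper names `open_chrome` and `collect_images` are hypothetical and are not taken from the project's scrape_news.py.

```python
def open_chrome():
    """Start Chrome through Selenium (imported lazily, so the helper below
    can be exercised without a browser installed)."""
    from selenium import webdriver

    return webdriver.Chrome()


def collect_images(driver, url):
    """Render `url` and return one dict per <img> with its source and page box."""
    driver.get(url)
    found = []
    # "tag name" is the locator string behind selenium's By.TAG_NAME.
    for img in driver.find_elements("tag name", "img"):
        src = img.get_attribute("src")
        if not src:
            continue  # skip <img> elements that have no source
        found.append(
            {
                "src": src,
                "x": img.location["x"],  # pixels from the left of the page
                "y": img.location["y"],  # pixels from the top of the page
                "width": img.size["width"],
                "height": img.size["height"],
            }
        )
    return found
```

Typical usage would be `driver = open_chrome()`, then `collect_images(driver, site_url)` for each site, then `driver.quit()`. The `y` value is what would later let us ask how close to the top of the page an image was shown.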

Azure API

The Face API sits within Azure’s Cognitive Services suite. Pass in an image URL and it returns a list of the faces present in the image. Each face comes complete with a set of attributes including age, gender, head position and orientation, emotion, hair colour and whether or not the person is wearing glasses.

By extracting this information for images found on popular news sites, we can build up a rich dataset describing how different demographics are represented.

The API was easy to set up, and I was calling it from Python within a few minutes. Pricing includes a free tier, limited to 20 images per minute, or USD $1 per 1,000 images if you creep above that. There are alternatives such as Amazon Rekognition, which is nearly identical in features and in price.
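
For illustration, a call to the detect operation might look something like the sketch below, using only the standard library. It assumes the v1.0 REST interface (a POST to `/face/v1.0/detect` with the key in the `Ocp-Apim-Subscription-Key` header) and that the `age` and `gender` attributes are available on your resource, which is worth checking against the current documentation. The function names are hypothetical, and the endpoint and key are placeholders you would supply.

```python
import json
import urllib.parse
import urllib.request


def build_detect_request(endpoint, key, image_url):
    """Build (but do not send) a detect request asking for age and gender."""
    query = urllib.parse.urlencode({"returnFaceAttributes": "age,gender"})
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/face/v1.0/detect?{query}",
        data=json.dumps({"url": image_url}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def flatten_faces(detected):
    """Turn the API's list of face objects into flat rows for a faces table."""
    rows = []
    for face in detected:
        box = face["faceRectangle"]
        attrs = face.get("faceAttributes", {})
        rows.append(
            {
                "top": box["top"],
                "left": box["left"],
                "width": box["width"],
                "height": box["height"],
                "age": attrs.get("age"),
                "gender": attrs.get("gender"),
            }
        )
    return rows


def detect_faces(endpoint, key, image_url):
    """Send the request and return flat face rows (needs network access and a valid key)."""
    request = build_detect_request(endpoint, key, image_url)
    with urllib.request.urlopen(request) as response:
        return flatten_faces(json.load(response))
```

To stay inside the free tier's 20 images per minute, a pause of about three seconds between calls (60 / 20) would be enough.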

The Websites

The websites chosen for this analysis were selected from Alexa’s top 50 news sites. I excluded any message boards or aggregation sites, and also added a handful of UK and Australian news providers so as not to be too U.S.-focused.

The Code

The full code is located at https://github.com/richfarnworth/news_face_data. The environment.yml file is provided to set up your environment via Conda, and the main script to run is scrape_news.py.

The Data

The process was kicked off on the 4th September 2020. As such we are collecting data as we speak and hope to do some preliminary analysis by early October. The dataset itself will be uploaded to Kaggle at regular intervals to allow others to do their own analysis.

Limitations

As with any approach, it’s important to understand the shortfalls of our method.

  • We’re assuming a binary view of gender. Gender is a hugely complex subject and there are many different ways in which people can identify. Constrained by the tools available, however, we’re forced to assume a simplistic binary categorisation.
  • Age and gender are the only demographic dimensions we are able to analyse, as (probably wisely) Microsoft doesn’t attempt to identify race with its Face API service.
  • We’re completely at the mercy of the (unpublished) accuracy of the Azure Face API. If, for example, it’s worse at detecting male faces, or estimates female age less accurately, then this would introduce a skew to any further analysis. I did do a few experiments with pictures of celebrities whose ages I could look up, and it seemed to guess within a couple of years of the correct answer most of the time. But this was far from rigorous, and given more time it would be useful to properly analyse the service’s performance.
  • The data will be highly dependent on the main stories going on right now. I started the scraping process on the 31st August, meaning that many news sites will be prominently featuring the US Election and the Coronavirus Pandemic. Given both candidates are male and in their 70s, and the chief medical officers of the US, UK and Australia are all in a similar demographic, this will likely skew any analysis. To mitigate this, we’d want to collect data over a longer period of time.
  • I use the default Selenium browser window settings using Chrome as the webdriver. While some dynamically loaded content will be extracted, features such as infinite scroll will not be activated.
  • Because the images are taken from the home page of each news site, the files are often thumbnails with a lower resolution than the feature image on the actual article. This can lower the identification rate of the Azure Face API. “Clicking” through the link to the original article and extracting the feature image from there might help, although some news sites use different images for home page links and feature images.
  • All content will be assumed to be static, with no personalisation based on location, cookies, browser etc.
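
On the accuracy point above, a more rigorous check could start from something like the following sketch, which summarises predicted against known ages. The function name and the two-year tolerance are assumptions chosen to mirror the informal celebrity experiment; no real results are shown here.

```python
def age_error_summary(pairs, tolerance=2.0):
    """Summarise (predicted_age, true_age) pairs.

    Returns the mean absolute error and the share of predictions that fall
    within `tolerance` years of the true age.
    """
    errors = [abs(predicted - actual) for predicted, actual in pairs]
    if not errors:
        raise ValueError("need at least one (predicted, true) pair")
    return {
        "mae": sum(errors) / len(errors),
        "within_tolerance": sum(e <= tolerance for e in errors) / len(errors),
    }
```

Running it separately for faces labelled male and female would show whether the age estimates are skewed for one group, which is the kind of bias that would carry through to any later analysis.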

Potential expansion

While this initial scraping is just focused on the front pages of each site, it would be interesting to expand this to top level subsections. For example, do the gender and age biases change between political, business and sports reporting?

Watch this space

Updates, datasets and analysis to come.

Richard Farnworth

Data scientist, computer programmer and all-round geek with 10 years of using data in finance, retail and legal industries. Based in Adelaide, Australia.