Bias in Machine Learning: How Facial Recognition Models Show Signs of Racism, Sexism and Ageism

Examining bias in facial recognition through the lens of age and gender prediction to encourage the development of fair, accountable, and transparent machine learning.

Rachel Meade
Towards Data Science


By: Amber Camilleri, Robbie Geoghegan, Rachel Meade, Sebastian Osorio, Qinpei Zou


Back in 2018, an article by the technology market research firm, Counterpoint, predicted that over one billion smartphones would be equipped with facial recognition by 2020. Today, Apple, Samsung, Motorola, OnePlus, Huawei, and LG all offer devices that feature facial recognition.

When we pocket our phones and step outside, public spaces are dotted with facial recognition cameras, and hundreds if not thousands of retail stores around the globe use them as well. Most large retailers are tight-lipped about their use of facial recognition for theft prevention, but news reports confirm that big names like Target and Walmart have already experimented with the technology in their stores.

Soon, those same stores and others may launch facial recognition to enhance store loyalty programs or to customize services. A Forbes article points out that loyalty members have generally already agreed to share personal data with the brand, so enhancing loyalty programs with facial recognition may not be far in the future. Nor is this trend limited to retail: machine learning models are spreading across many industries, from identifying cancerous cells to predicting the likelihood that criminals will reoffend.


Given the technology's burgeoning ubiquity, we felt it was important to understand how it works, so we decided to explore the inner workings of image detection and classification by building our own convolutional neural network capable of predicting a person's age and gender from an image. We used a model and datasets that are common industry benchmarks, then examined the model's predictions for bias. Although our findings may not perfectly represent the models in commercial deployment today, this project aims to shed light on bias in machine learning and to highlight the importance of thoughtfully developing fair, accountable, and transparent systems.

INITIAL RESEARCH

Research into fair, accountable and transparent machine learning has garnered significant attention in recent years, and unintended bias has been found in virtually every form of machine learning. For example, Bolukbasi et al. (2016) analyzed gender bias in a commonly used text analysis technique: word embeddings trained on Google News articles. The researchers built a model to complete analogies using gender-specific words, such as "man is to king as woman is to _________", where the model would predict the female equivalent, "queen". They then posed the same analogies using non-gendered professions. For "man is to computer programmer as woman is to _________" the model predicted "homemaker". Other extreme "female" professions in the model included "nurse", "receptionist", "librarian" and "hairdresser", while extreme "male" professions included "maestro", "skipper", "philosopher" and "captain". These embeddings reflect the societal biases baked into commonly used data sources such as Google News, which are fed into many business, government and research applications without awareness of, or correction for, gender bias.
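For readers who want to try the analogy test themselves, it only takes a few lines. The sketch below is illustrative rather than a reproduction of the paper's exact setup: it assumes the gensim library and a locally downloaded copy of the pretrained Google News word2vec vectors.

```python
# Minimal analogy sketch, assuming gensim and a local copy of the pretrained
# GoogleNews word2vec vectors (downloaded separately).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# "man is to king as woman is to ___": add 'woman' and 'king', subtract 'man'.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# "man is to computer programmer as woman is to ___" (the phrase token used
# here is an assumption; it may differ between embedding files).
print(vectors.most_similar(positive=["computer_programmer", "woman"],
                           negative=["man"], topn=1))
```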

The same is true of many facial classification models. Buolamwini and Gebru (2018) analyzed the accuracy of commercial gender classification products on light- and dark-skinned males and females. Their research considered products sold by Microsoft, Face++ and IBM and found all of them to perform far better on males and on light-skinned people; the table below shows each product's accuracy in predicting a binary male/female classification from an image.

This disparity is alarming given that these products are being used by governments and businesses today. Evidence of such bias is surfacing in many industries, from genomic studies whose datasets consist of roughly 81% participants of European descent (Popejoy and Fullerton, 2016) to criminal prosecutions informed by models that are twice as likely to wrongly predict repeat offending for black defendants as for white defendants (Angwin et al., 2016).

In researching our specific problem of predicting age and gender from images, we found that most models performing these predictions are trained on images of the 100,000 most popular actors and actresses sourced from Wikipedia and IMDb. Because it is a dataset of celebrities, the training data is mostly of white people, skews male, and features celebrities who typically look much younger than non-celebrities of the same age, creating bias against correctly classifying older people.

Our work sought to quantify these biases and to experiment with approaches to correct them.

MODEL SELECTION

Our choice of a Convolutional Neural Network (CNN) as our primary model was also informed by our initial research. Age prediction is inherently a regression problem, so a CNN, which is typically trained with a classification loss function, was not an obvious choice. However, CNNs consistently perform well on image recognition tasks and have become the industry standard for them.

Over the last ten years, the ImageNet Large Scale Visual Recognition Challenge has acted as a proving ground for the power and accuracy of CNNs, and their excellent performance in these annual competitions has led to the widespread acceptance of CNNs as the "go-to" model for image classification. The ImageNet challenge is a 1,000-category classification task with over 1 million images in the training set, and academics and professionals alike have competed, improving CNNs along the way. Between 2010 and 2017, the winning classification error was reduced from 28.2% to 2.3%.

Ultimately, we employed the VGG-Face architecture, a 16-layer CNN with 13 convolutional layers (some followed by down-sampling), two fully-connected layers, and a softmax output. A visualization of a model like ours can be found here. Although a classification loss function is not ideally suited to age prediction, we chose this particular model because of its excellent benchmark performance, its extensive documentation, and our ability to initialize its weights via transfer learning. Our initial model (before retraining) relied entirely on transfer learning, and the original source can be found here.
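For orientation, here is a rough sketch of what such an architecture looks like in Keras. It is not our exact implementation: the stock ImageNet VGG16 stands in for the VGG-Face weights (which must be obtained and converted separately), and the 101-way softmax head corresponds to ages 0 through 100.

```python
# Sketch of a VGG-16-style age classifier with a 101-way softmax head.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(
    weights="imagenet",          # stand-in for the VGG-Face weights we used
    include_top=False,
    input_shape=(224, 224, 3),
)
base.trainable = False           # keep the convolutional feature extractor fixed

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dense(4096, activation="relu"),
    layers.Dense(101, activation="softmax"),   # one class per age, 0-100
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```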

DATA SELECTION

The images on which the model was originally trained were sourced from IMDb and Wikipedia. The data was gathered using a list of the 100,000 most popular actors and actresses on IMDb and comprises their facial images along with timestamps, dates of birth, and gender. Of the 100,000 actors and actresses, 20,284 had usable data, with an average of 26 images per celebrity and roughly 523,000 images in total.

Age labels are derived from the date of birth recorded on IMDb or Wikipedia and the timestamp of each image, which assumes the recorded dates of birth are accurate. Images without a timestamp were removed, since the person's age at the time of the photo cannot be ascertained. Some images are stills from movies whose timestamp reflects the time of production, so films with extended production schedules can introduce further error. As a result, some age labels are inevitably noisy.
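As a simple illustration of how an age label can be derived from a date of birth and a photo timestamp (the field names here are hypothetical, not those of the raw IMDb-Wikipedia metadata):

```python
from datetime import date

def age_at_photo(dob: date, photo_date: date) -> int:
    """Whole years between the date of birth and the photo timestamp."""
    years = photo_date.year - dob.year
    # Subtract a year if the birthday had not yet occurred when the photo was taken.
    if (photo_date.month, photo_date.day) < (dob.month, dob.day):
        years -= 1
    return years

print(age_at_photo(date(1970, 6, 15), date(2005, 3, 1)))   # -> 34
```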

Although the Wikipedia and IMDb datasets are large and well-labeled, celebrity images are unlikely to be representative of the general population. To get a more representative measure of how our model would perform on the general public, we also employed the UTKFace dataset, a large-scale face dataset of over 20,000 photos annotated with age, gender, and ethnicity, both to retrain the model and as a test set for error measurement. The UTK data comes in cropped and uncropped versions; examples of each are shown below.
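UTKFace encodes its annotations in the filenames, which follow the documented pattern [age]_[gender]_[race]_[date&time].jpg. Below is a small sketch for collecting them into a table; the paths and the handling of malformed names are our own assumptions.

```python
import glob
import os
import pandas as pd

GENDERS = {0: "male", 1: "female"}
RACES = {0: "white", 1: "black", 2: "asian", 3: "indian", 4: "other"}

rows = []
for path in glob.glob("UTKFace/*.jpg"):
    parts = os.path.basename(path).split("_")
    if len(parts) < 4:           # skip the few files with malformed names
        continue
    rows.append({
        "path": path,
        "age": int(parts[0]),
        "gender": GENDERS.get(int(parts[1])),
        "race": RACES.get(int(parts[2])),
    })

utk = pd.DataFrame(rows)
print(utk[["age", "gender", "race"]].describe(include="all"))
```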

Examples of photos from each dataset

AGE PREDICTION

Our model was pre-trained on a subset of about 450,000 photos from the IMDB-Wikipedia dataset; the remaining photos were held out as an out-of-sample test set. The plot below shows the average predicted ages versus actual ages on this test set, with the actual (annotated) ages on the x axis and the average predictions from our CNN on the y axis. One would expect a model trained on a large dataset and making predictions on images very similar to the training set to perform well, and our observations match this expectation.

To obtain predictions, we evaluated two different interpretations of the softmax output. The final layer of the CNN is a softmax layer which outputs, for a given image, the probability that it belongs to each of 101 classes, with each class corresponding to an age (0 to 100). These probabilities can be combined into a weighted average, multiplying the probability of each class by its age value and summing, or the prediction can simply be the age of the class with the maximum probability. The maximum-probability interpretation consistently produced higher errors, so we used the weighted average to make predictions.
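A sketch of the two interpretations, assuming `probs` is the (n_images, 101) array of softmax outputs from the model:

```python
import numpy as np

ages = np.arange(101)                      # class k corresponds to age k

def predict_weighted(probs: np.ndarray) -> np.ndarray:
    """Expected value of the predicted age distribution (the approach we used)."""
    return probs @ ages

def predict_argmax(probs: np.ndarray) -> np.ndarray:
    """Age of the single most probable class."""
    return probs.argmax(axis=1)
```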

We chose Mean Absolute Error (MAE) as our primary measure of error because it is more robust to outliers than squared-error measures and appears to be the industry-standard metric for age prediction. For comparison, a paper from 2017 cites the MAE of several commercially available age-prediction APIs and lists Microsoft's Face API at 7.62 years. Our MAE when predicting the ages of people in the Wikipedia images was 5.3 years.

We know, however, that this error may not reflect how well the model predicts age for non-celebrity faces. Non-celebrity images typically differ from celebrity photos in resolution, lighting, and facial features, all of which should affect the model's predictive power. The MAE may also be artificially low because of information leakage between the training and test sets: if a celebrity had multiple images in the dataset and some ended up in both the train and test sets, the neural network may simply be "recognizing" a familiar face and matching his or her known age rather than "predicting" it. Our next step, therefore, was to make predictions on the UTK dataset.
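One way to guard against this kind of leakage is to split the data by identity rather than by image, so no celebrity appears in both sets. A sketch with scikit-learn, assuming a DataFrame `df` with a (hypothetical) `celebrity_id` column:

```python
from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["celebrity_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: the two sets share no identities.
assert set(train_df["celebrity_id"]).isdisjoint(test_df["celebrity_id"])
```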

We began by predicting ages on the set of cropped photos from UTK. Without looking too closely at examples from the UTK and Wikipedia-IMDb datasets, we assumed the cropped photos would be most similar to the data on which the CNN was originally trained. Although we expected the model to perform worse in this experiment, we were still surprised by the extent to which it failed to predict ages accurately, particularly for older people. Upon further investigation, we realized that the UTK images are cropped much more tightly than the Wikipedia-IMDb images, so hair, ears, and other features are often cut out. To match the format of the photos on which the model was originally trained, we would have been better served by the uncropped photos.

Before re-running the model, we also removed photos from the uncropped UTK dataset whose annotated age was lower than 16. Our rationale for removing this age group rested on two arguments: first, both the original training data and the UTK dataset are sparse for people under 16; second, our use cases are generally less relevant for very young people.

The resulting predictions had an MAE of about 10 years. This is a significant improvement over the predictions on the cropped dataset, confirming our intuition that the test photos should closely resemble the original training data in crop, resolution, lighting, and so on.

The next clear step was to improve the model's predictive power for non-celebrity faces by retraining the last two fully-connected layers and the softmax output of the network. We left the earlier convolutional layers and their weights unchanged to preserve the feature-extractor portion of the network, which would have required a very large amount of data to retrain, while still allowing us to fine-tune the portion of the network that generates predictions.
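Continuing the Keras sketch from earlier (and assuming the same `model` object), freezing the feature extractor and leaving only the dense head trainable looks roughly like this:

```python
import tensorflow as tf

# Freeze everything up to the dense head; the exact indices depend on how the
# architecture was assembled.
for layer in model.layers[:-3]:
    layer.trainable = False
for layer in model.layers[-3:]:        # two dense layers + the softmax output
    layer.trainable = True

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
)
```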

Retraining fully-connected layers

Initially, we retrained on a subset of the UTK uncropped data using Adam optimization for 250 epochs. The visualization below shows the first 100 epochs; the lowest validation loss was reached within about 30 epochs. The blue line represents validation loss and the orange line training loss. We saved the weights from the epoch with the lowest validation loss to avoid over-fitting. As a result of this retraining, our predictions on a held-out UTK uncropped test set improved substantially.
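The retraining loop itself is standard Keras, with a checkpoint callback keeping the weights from the epoch with the lowest validation loss. The arrays and file path below are placeholders, not our actual pipeline:

```python
import tensorflow as tf

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_weights.h5",
    monitor="val_loss",
    save_best_only=True,
    save_weights_only=True,
)

history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=250,
    batch_size=64,
    callbacks=[checkpoint],
)
model.load_weights("best_weights.h5")   # roll back to the best epoch
```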

Retraining reduced our test MAE from about 10 years to about 8.4 years. Adjusting the weights of the fully-connected layers using photos more similar to the test set made a clear difference. The UTK photos generally have lower resolution, poorer lighting, and women wearing less makeup, so some of the original model's poor performance can be attributed to these aesthetic differences.

Despite this improvement, the model continued to perform poorly when predicting the ages of older people. Examining the UTK dataset, we observed that, like the Wikipedia dataset the model was originally trained on, the vast majority of its images are of people between 20 and 35. We speculated that the lack of data for older people was contributing to the poor performance in that region.

Proportion of Data in UTK Dataset By Gender & Age

Setting aside the option of retraining the entire network, there are two common strategies for improving performance on an under-represented group: adjust the loss function to penalize poor predictions on that group more heavily, or duplicate the data points in the poorly performing region of the training data, which has effectively the same result. We began by duplicating all the photos of people over 60, and then tried triplicating them.

By including multiple copies of images of people over 60, the model is penalized more heavily for mislabeling their ages. Most of the improvement came from duplicating the photos, which reduced MAE from 8.4 to 7.9 years; triplicating the images yielded a small further improvement of about 0.1 years.
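A pandas sketch of the oversampling step, assuming a (hypothetical) `train_df` with an `age` column:

```python
import pandas as pd

def oversample_older(frame: pd.DataFrame, copies: int = 2) -> pd.DataFrame:
    """Repeat rows for people aged 60+ so each appears `copies` times in total."""
    older = frame[frame["age"] >= 60]
    extra = pd.concat([older] * (copies - 1), ignore_index=True)
    return pd.concat([frame, extra], ignore_index=True).sample(frac=1, random_state=0)

doubled = oversample_older(train_df, copies=2)   # duplicate the 60+ photos
tripled = oversample_older(train_df, copies=3)   # triplicate them
```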

Beyond simply refining our CNN's weights, we were interested in examining bias in the model's predictions. We separated the predictions of our best model (retrained with triplicated images of people over 60) by demographic group to look for discrepancies in performance. When separating the predictions by gender, the biases appear broadly similar. The model consistently over-predicts the ages of young women to a greater extent than it does for young men, but appears to underestimate the ages of older women somewhat less severely than those of older men. The variance for both genders increases with age. This increase in variance and decrease in accuracy for older people can likely be attributed to the comparatively sparse data in that region.

Another striking comparison comes from separating the predictions by race. The uncropped UTK dataset is predominantly photos of Caucasian people, so it is unsurprising that the model performs about as well for Caucasian people as it does on the dataset as a whole. Note how close the predictions are for Caucasians compared with the far noisier predictions for Asians in the test data: Asians represent a minority of the photos, and predictions for older Asian people are both less accurate and highly volatile. As one might expect, and as these two examples demonstrate, predictions were generally more accurate for demographic groups with more data and less accurate for groups with less representation in the dataset.
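The demographic breakdowns above boil down to grouping the absolute errors by annotation. A sketch, assuming a `results` DataFrame with (hypothetical) `actual`, `predicted`, `gender` and `race` columns:

```python
import pandas as pd

results["abs_error"] = (results["predicted"] - results["actual"]).abs()
results["age_bucket"] = pd.cut(
    results["actual"],
    bins=[16, 30, 45, 60, 120],
    labels=["16-30", "31-45", "46-60", "60+"],
)

print(results.groupby("gender")["abs_error"].mean())
print(results.groupby("race")["abs_error"].mean())
print(results.groupby(["race", "age_bucket"], observed=True)["abs_error"]
      .mean().unstack())
```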

CORRECTION MODEL

To further refine the neural network's predictions, we created a secondary model that corrects for systematic under- or over-estimation based on age, gender and race. Building this model also gave us a better understanding of the differing magnitude of bias associated with each demographic group.

The features used in the correction model were the predicted age from the convolutional neural network, the person's gender (male or female) and the person's race (white, black, Asian, Indian or other). Several model types were tested and tuned so we could select the most appropriate correction model; the table below shows the best performing model of each type.
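The correction model itself is a small supervised model sitting on top of the CNN's output. Below is a sketch using a gradient-boosted regressor as one of several model types that could fill this role; column names are hypothetical, and in practice the model should be fit and evaluated on separate splits.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# One-hot encode gender and race alongside the CNN's predicted age.
X = pd.get_dummies(results[["predicted", "gender", "race"]],
                   columns=["gender", "race"])
y = results["actual"]

correction = GradientBoostingRegressor(n_estimators=200, max_depth=3)
correction.fit(X, y)
corrected_age = correction.predict(X)
```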

None of the models we tried improved upon the output of the original model, which we attribute to the simple feature set of just predicted age, race and gender. This was surprising given the consistent under-prediction of ages for older people. As a fallback, we tested some simple heuristics to see if they could improve our predictions: we took the average difference between predicted and actual ages for different age brackets, genders and races. The results are shown in the following table:
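The heuristic amounts to computing a group-level offset and adding it back to new predictions. A sketch, reusing the same (hypothetical) `results` columns:

```python
import pandas as pd

results["pred_bucket"] = pd.cut(
    results["predicted"],
    bins=[0, 30, 45, 60, 120],
    labels=["<=30", "31-45", "46-60", "60+"],
)

# Average residual (actual - predicted) within each bracket/gender/race group.
offsets = (results["actual"] - results["predicted"]).groupby(
    [results["pred_bucket"], results["gender"], results["race"]],
    observed=True,
).mean()

def correct(pred, bucket, gender, race):
    """Add the group offset, falling back to no correction for unseen groups."""
    key = (bucket, gender, race)
    return pred + (offsets.loc[key] if key in offsets.index else 0.0)
```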

Average Difference Between Predicted and Actual Age, Grouped by Predicted Age, Race and Gender

GENDER PREDICTION

Separately from age prediction, we also built a convolutional neural network to predict gender on the same sets of images. Gender classification was binary, considering only males and females as labeled in the respective datasets. The pre-trained network performed very well on the IMDb and Wikipedia celebrity dataset, predicting the correct gender with 99% accuracy. On the UTKFace dataset of non-celebrities of mixed races, overall accuracy dropped to 78%, and it varies substantially by race and gender: from 98% for Indian and white men to a low of 46% for black women.

Accuracy for Gender Prediction Model
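A table like the one above can be produced with a simple group-by over the per-image results; the sketch below assumes a `gender_results` DataFrame with (hypothetical) `actual`, `predicted` and `race` columns.

```python
import pandas as pd

gender_results["correct"] = gender_results["predicted"] == gender_results["actual"]

# Accuracy by race and annotated gender, pivoted into a small table.
accuracy_table = (gender_results
                  .groupby(["race", "actual"])["correct"]
                  .mean()
                  .unstack())
print(accuracy_table)
```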

These results mirror the gender bias found in prior research, with the model performing consistently worse on images of women than on photos of men. Across all categories, the model performs worst on photos of black women, and the magnitude of the difference was surprising: men of all races were classified correctly with 96% accuracy, while the best performing group of women reached only 63% accuracy. The inaccuracy was further amplified by a woman's age, as shown in the graph below.

Accuracy of Gender Predictions for Females Only

As with age prediction, the gender predictions perform worst for the oldest subjects in the test data. We also see the bias against black women resulting in consistently poor predictions across all ages.

These results reflect issues in the training data, which contained two to three times more men than women and under-represented minority races. Most surprising are the results for men across all races: given that the training data was predominantly white, this suggests race does not significantly influence gender prediction for men. For women, race plays an important role, as seen in the varied accuracies across ethnicities at all age groups. Furthermore, the data suggests that the most significant driver of bias in gender prediction for women is age.

TESTING OUR MODEL

As a final test (and for our own enjoyment), we pointed our models at ourselves and our friends. We gathered 51 photos of people's faces, including Facebook profile pictures and photos taken in person. Our goal was to show that our retrained models would predict age and gender more accurately than the original model trained on celebrity images.

The retrained model outperformed the original model on the candid photos taken in person, while the original model performed better on carefully chosen and well-lit Facebook profile pictures.

We believe this difference in performance aligns with the difference in training datasets: people's curated, higher-quality Facebook profile pictures more closely resemble celebrity images, while in-person photos more closely resemble the everyday faces used to retrain our model. Overall, the retrained model performed best, improving accuracy by 10-20%.

CONCLUSION

Having been trained on imbalanced datasets, our CNN acutely demonstrated race, gender, and age bias in its predictions. Through retraining and rebalancing the data we were able to make incremental improvements, reducing the initial MAE for age prediction by 24%. Yet these improvements were only incremental. Thoughtful curation of balanced datasets that better represent minorities is needed to properly eliminate these biases. This is particularly important given how pervasive machine learning models trained on these sorts of datasets have become in business and government. Evidence abounds of biased models harming minorities, yet change is slow because of knowledge gaps, the cost of data collection, and limited checks and balances on how these models are deployed. Transparency and awareness are important first steps, but there is much more work to be done to truly ensure machine learning is fair and accountable to everyone.
