Reweighing the Adult Dataset to Make it “Discrimination-Free”

One Example of Bias Mitigation in the Pre-Processing Stage

Haniyeh Mahmoudian, PhD
Towards Data Science


In the previous blog post, we discussed a machine learning workflow for identifying bias and ways to mitigate it. Over the next few weeks, I will write a series of posts exploring this workflow in greater detail, starting with a few bias mitigation techniques organized by the stage of the modeling pipeline in which they operate.

The first stage in the machine learning (ML) pipeline where we can intervene to reduce bias is called pre-processing. Pre-processing describes the set of data preparation and feature engineering steps that occur before the machine learning algorithm is applied. Sampling, massaging, reweighing, and suppression are among the pre-processing bias mitigation techniques proposed in the academic literature [1].

In this post, I will focus on exploring reweighing [2], a pre-processing technique that assigns weights to the data.

The advantage of this approach is that, instead of modifying the labels, it assigns different weights to the examples based on their combination of protected attribute and outcome, such that bias is removed from the training dataset. The weights are derived from frequency counts. However, because this technique works only with classifiers that can handle row-level weights, it may limit your modeling options.

To demonstrate how this technique can be used to reduce bias, I used the Adult dataset [3]. The binary target in this dataset is whether an individual has an income higher or lower than $50k. It contains several features that are protected by law in the US, but for simplicity, this post focuses on sex. As can be seen in the table below, Male is the privileged group, with a 31% probability of having the positive outcome (>$50k) compared to an 11% probability for the Female group.

Group      >$50k     ≤$50k     Total     P(>$50k)
Male       6,662    15,128    21,790         31%
Female     1,179     9,592    10,771         11%
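These counts can be reproduced directly from the raw data. Below is a minimal sketch, assuming the UCI training file adult.data has been downloaded locally; the column names come from the dataset documentation:

```python
import pandas as pd

# Column names from the UCI Adult dataset documentation
columns = ["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex",
           "capital-gain", "capital-loss", "hours-per-week",
           "native-country", "income"]

# adult.data is the training split from the UCI repository
df = pd.read_csv("adult.data", names=columns,
                 skipinitialspace=True, na_values="?")

# Frequency counts of sex versus the binary income target
counts = pd.crosstab(df["sex"], df["income"])
print(counts)

# Probability of the positive outcome (>$50k) per group
print(counts[">50K"] / counts.sum(axis=1))
```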

The disparate impact metric, shown below, measures discrimination in the data as the ratio of the positive-outcome rates of the unprivileged and privileged groups:

disparate impact = P(income > $50k | Female) / P(income > $50k | Male)

A score of 1 indicates the dataset is discrimination-free. Calculated on the unweighted Adult dataset for Male versus Female, the score is 0.11 / 0.31 ≈ 0.36.
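This is straightforward to compute with a small helper, sketched here against the df loaded above (the column and group names follow the UCI file):

```python
def disparate_impact(df, protected="sex", privileged="Male",
                     unprivileged="Female", target="income", positive=">50K"):
    # Ratio of positive-outcome rates: unprivileged over privileged
    p_unpriv = (df.loc[df[protected] == unprivileged, target] == positive).mean()
    p_priv = (df.loc[df[protected] == privileged, target] == positive).mean()
    return p_unpriv / p_priv

print(disparate_impact(df))  # ≈ 0.36 on the raw training data
```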

Using the frequency counts in the table above, the reweighing technique assigns each combination of group and outcome the weight it would need for the protected attribute and the outcome to look statistically independent: the probability expected under independence divided by the probability actually observed.

W(group, outcome) = P(group) × P(outcome) / P(group, outcome)

For example, for the privileged group with the positive outcome (that is, Male with greater than $50k income), the weight is calculated as:

W(Male, >$50k) = (21,790/32,561) × (7,841/32,561) / (6,662/32,561) ≈ 0.79

Thus the weights for each category in the training data are:

Group      Outcome    Weight
Male       >$50k        0.79
Male       ≤$50k        1.09
Female     >$50k        2.20
Female     ≤$50k        0.85
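The same weights can be derived in a few lines from the joint and marginal frequencies. A sketch, continuing from the df loaded earlier:

```python
# Expected probability under independence of sex and income, divided by
# the observed joint probability (Kamiran & Calders, 2012)
weights = {}
for s in df["sex"].unique():
    for y in df["income"].unique():
        p_expected = (df["sex"] == s).mean() * (df["income"] == y).mean()
        p_observed = ((df["sex"] == s) & (df["income"] == y)).mean()
        weights[(s, y)] = p_expected / p_observed

# Attach each row's group weight as a new column
df["weight"] = [weights[(s, y)] for s, y in zip(df["sex"], df["income"])]
print(weights)
```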

By applying these weights to the counts, the disparate impact metric becomes 1 for the training data, which is thus now “discrimination-free.” Once the weights have been calculated on your training data in the pre-processing phase, they can be passed as row-level weights to classifiers such as logistic regression, SVM, and XGBoost.

To evaluate the effect of the reweighing technique, I trained two logistic regression models, one with the weights and one without. The results of the experiment point to the usefulness of the reweighing method in reducing discrimination, as seen in the table below:
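A minimal sketch of that experiment, reusing the df, weight column, and disparate_impact helper from the earlier snippets (feature preparation is deliberately simplified here):

```python
from sklearn.linear_model import LogisticRegression

# One-hot encode the categorical features; numeric columns pass through
X = pd.get_dummies(df.drop(columns=["income", "weight"]))
y = (df["income"] == ">50K").astype(int)

# Baseline model: every row counts equally
unweighted = LogisticRegression(max_iter=1000).fit(X, y)

# Reweighed model: the same classifier, but each row contributes
# according to its reweighing weight
reweighed = LogisticRegression(max_iter=1000).fit(X, y,
                                                  sample_weight=df["weight"])

# Compare the disparate impact of the two models' predictions
for name, model in [("unweighted", unweighted), ("reweighed", reweighed)]:
    preds = pd.DataFrame({"sex": df["sex"], "income": model.predict(X)})
    print(name, round(disparate_impact(preds, positive=1), 2))
```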

This technique is particularly helpful for use cases where the data owners are not willing to share sensitive or protected attributes with data scientists. By providing the data owners with a script that assigns these weights to the records, data scientists can reduce bias in their modeling process without direct access to the protected attributes.

Experiments show that pre-processing techniques are more effective at reducing bias than simple methods like removing the sensitive attributes from the training data. That said, pre-processing techniques are, on average, not as effective as in-processing ones, since they do not intervene directly in the model training process, and some accuracy typically needs to be traded off to lower the discrimination.

References:

[1] Kamiran, F. and Calders, T., “Classifying without discrimination,” Proceedings of the 2nd IEEE International Conference on Computer, Control and Communication (IC4), IEEE Press, 2009.

[2] Kamiran, F. and Calders, T., “Data preprocessing techniques for classification without discrimination,” Knowledge and Information Systems, 33(1):1–33, 2012.

[3] “Adult,” UCI Machine Learning Repository, 1 May 1996, http://archive.ics.uci.edu/ml/datasets/Adult.

