Photo by Christophe Hautier on Unsplash

FAIRNESS AND BIAS

Tutorial: Breaking myths about AI fairness

The case of biased automated recruitment

14 min read · May 6, 2021


After countless scandals about bias in AI, fairness appears as one of the major challenges of the field. However, AI fairness is hard to understand and hard to implement. In this tutorial, we build and analyze a particular use case: automated recruitment, which hit the headlines in the past.

Throughout this tutorial, we are leveraging the AIF360 library by IBM Research.

Let us create a biased dataset from scratch

Say the goal is to recruit candidates for a given job (e.g. data scientist). There are basically two distinct populations: men and women.

We intentionally introduce three different biases:

  • Representativity bias: there are three times fewer women than men,
  • Social bias: women expect lower salaries than men (wrongly!), $6,000 less a year on average, but have similar skills,
  • Historical bias: recruitment has been biased against women in the past, independently of expected salary or skill level.

(For an exhaustive list of biases in datasets, see Mehrabi et al. 2019, A Survey on Bias and Fairness in Machine Learning.)

The true historical model is a logistic regression fed with:

  • 𝑋𝑠𝑘𝑖𝑙𝑙: skill level, a grade between 0 and 20,
  • 𝑋𝑠𝑎𝑙𝑎𝑟𝑦: expected salary, an amount in thousands of dollars a year,
  • 𝑋𝑠𝑒𝑥: sex as a binary variable, 0 for women and 1 for men,
  • 𝑌: whether the candidate was recruited at their expected salary, 0 means no and 1 means yes.

High skill levels with low expected salaries maximize recruitment. Whatever the values of these two features, men are systematically favored. This translates into a logistic model with coefficients 𝛽 such that:

Image by the author.
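The exact coefficients are only shown in the figure above, but to make the setting concrete, here is a minimal sketch of what create_dataset() could look like. The column names match the AIF360 code below; the numerical coefficients and sample size are illustrative assumptions, not the author's exact values.

import numpy as np
import pandas as pd

def create_dataset(n=2000, seed=0):
    """Simulate a biased recruitment dataset (illustrative coefficients)."""
    rng = np.random.default_rng(seed)
    # Representativity bias: roughly three times fewer women (0) than men (1)
    sex = rng.binomial(1, 0.75, size=n)
    skill = np.clip(rng.normal(10, 3, size=n), 0, 20)
    # Social bias: women expect about $6k less a year, for similar skills
    salary = 35 + 1.5 * skill + 6 * sex + rng.normal(0, 2, size=n)
    # Historical bias: men are favored regardless of skill and expected salary
    logit = 1.0 * skill - 0.3 * salary + 3.0 * sex + 3.2
    recruited = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    # Column order matters: the code below removes X_sex with features[:, 0:2]
    return pd.DataFrame(
        {"skill": skill, "salary": salary, "sex": sex, "recruited": recruited}
    )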

Here is the code that allows us to leverage the AIF360 library.

from sklearn.model_selection import train_test_split
from aif360.datasets import BinaryLabelDataset

df = create_dataset()  # this is our simulated dataset
df_train, df_test = train_test_split(df, test_size=0.2)

attributes_params = dict(
    protected_attribute_names=["sex"],
    label_names=["recruited"]
)
dt_train = BinaryLabelDataset(df=df_train, **attributes_params)
dt_test = BinaryLabelDataset(df=df_test, **attributes_params)

Simulating such a recruitment policy, we obtain the following training set:

Biased recruitment training set. Image by the author.

In this skill/salary plane, we can still guess sex via the social bias: women lie in the lower cloud, expecting lower salaries, while men lie in the upper cloud. Higher skills lead to more frequent recruitment, and higher salary expectations (for a fixed skill level) lead to more rejections.

So far, so good.

However, you can see that women are disfavored, since the decision boundary for women (on the lower cloud) is shifted to the right: for a given skill level, women are less likely to be recruited than men. This is precisely the historical bias we are going to tackle in this article.

Key takeaway : before tackling a bias, you must choose which bias to consider. You cannot simply “correct biases”, and you certainly cannot correct all biases at once.

Measuring fairness

In this tutorial, we choose disparate impact (DI) as a measure of discrimination. Disparate impact is defined as follows:

Definition of disparate impact of past labels. Image by the author.

Disparate impact has several advantages:

  • It is simple to compute,
  • It is simple to interpret: no discrimination implies a value equal to one,
  • It corresponds to a legal threshold in US law (the "four-fifths rule"): a process with DI < 0.8 can be challenged as discriminatory.

Disparate impact also has drawbacks:

  • It assumes that the sensitive attribute has no legitimate influence on the target, which is not always true.

In our case, we assume that there is no fundamental reason for women to be hired less than men, and hence stick to disparate impact as a fairness measure.

(For more about fairness metrics and disparate impact, see Caton and Haas 2020, Fairness in Machine Learning: A Survey.)
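Before looking at the numbers, note that AIF360 can compute this quantity directly from a BinaryLabelDataset via its metrics module. A short sketch (the group definitions are the ones used throughout this article):

from aif360.metrics import BinaryLabelDatasetMetric

# DI = P(recruited = 1 | sex = 0) / P(recruited = 1 | sex = 1)
metric_train = BinaryLabelDatasetMetric(
    dt_train,
    unprivileged_groups=[{"sex": 0}],
    privileged_groups=[{"sex": 1}]
)
print(metric_train.disparate_impact())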

So what is the disparate impact of the past recruitment policy? Here is the answer.

Disparate impact of past recruitments. Dashed lines represent the threshold considered legal in US law. Image by the author.

We see that the disparate impact is close to 0.5, which means that women are about half as likely to be recruited as men. Now let us review different bias mitigation strategies.

Strategy n°1: do nothing

Well, this is not a strategy strictly speaking… This is just what people do when they are not aware of fairness. Let us train a logistic regression on 𝑋𝑠𝑘𝑖𝑙𝑙, 𝑋𝑠𝑎𝑙𝑎𝑟𝑦, 𝑋𝑠𝑒𝑥 and see what happens.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

preprocessing = ColumnTransformer(
    [("scaler", RobustScaler(), [0, 1])],
    remainder="passthrough"
)
lr = Pipeline([
    ("preprocessing", preprocessing),
    ("lr", LogisticRegression(C=1e9))
])
lr.fit(
    dt_train.features,
    dt_train.labels.ravel()
);
y_pred = lr.predict(dt_test.features)

First things first, what is the accuracy of the model on the testing set?

Test accuracy of the baseline “no one recruited” model and a simple logistic regression that “does nothing” about fairness. Image by the author.

We compare our model with a really naive baseline: the “no one” model that rejects every candidate (yeah, really naive…). This baseline has a 68% accuracy, which just means that 68% of the candidates were rejected in the past. In comparison, our first logistic model makes much better predictions and reaches a shiny 93% accuracy, hurray!

Oh wait… What is the disparate impact of our model, again? We slightly amend the definition of disparate impact by replacing the true labels with the predicted labels:

Disparate impact of automated recruitment decisions. Image by the author.
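Concretely, this amended disparate impact can be computed directly from the predictions. Here is a minimal sketch, assuming (consistently with the code above) that the sex attribute is the third feature column:

# Selection rate of women divided by selection rate of men, using predictions
sex = dt_test.features[:, 2]
di_model = y_pred[sex == 0].mean() / y_pred[sex == 1].mean()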

We can now compare our model with the historical process on the testing set:

Image by the author.

Uh-oh… We have a highly accurate model, but of course we just learned the same bias as the one in the historical process. We have almost the same disparate impact as the biased recruitment process itself.

What is the decision boundary of our biased classifier?

Decision boundary of a simple logistic model trained on the historical dataset. The upper part features men’s decision boundary, the lower part features the women’s boundary. Image by the author.

In this graph, we have split the decision boundary in two parts: the one for men on top and the one for women on the bottom. Unsurprisingly, our model has perfectly reproduced the bias against women, who must have much higher skills than men to be recruited.

Key takeaway : if you do nothing about fairness, your model is just as fair as your training data.

Strategy n°2: remove sensitive attribute

This is the most popular and easiest to implement strategy: we simply prevent the model from using the sensitive attribute (𝑋𝑠𝑒𝑥 in our case) and hope for the disparate impact to improve.

lr.fit(
    dt_train.features[:, 0:2],  # this is how we remove X_sex
    dt_train.labels.ravel()
);
y_pred = lr.predict(dt_test.features[:, 0:2])

In terms of disparate impact, this is what we obtain on the testing set:

Image by the author.

Wow! This is kind of an improvement! It seems that our algorithm now complies with American law. Let us visualize our new decision boundary.

Decision boundary of a simple logistic model trained with sensitive attribute removal. Image by the author.

Oh wait… That was unexpected… We now have a single decision boundary for both men and women. But it seems there is a new problem: ask for a higher salary to get recruited! If only I had known…

This salary bias disfavors women, of course, since they tend to expect lower salaries; hence a disparate impact that is still far from one!

And this is how our model learned to ignore sex… Despite the numbers, there is still a fundamental discrimination induced by this decision boundary. The model reproduces the historical bias to some extent and uses the expected salary as a proxy for sex to discriminate against women, even though it makes no logical sense!

Key takeaway : Removing sensitive attributes does not necessarily make your model fair.

Key takeaway : Having disparate impact close to one is a necessary but not sufficient condition for being fair.

Strategy n°3: preprocessing by reweighing

This strategy is pretty simple: it focuses on the training set only and tries to remove its disparate impact by reweighing instances. Basically, unprivileged groups with a favorable outcome (recruited women) are given larger weights. This motivates the model to readjust its decision boundary in the opposite direction of the existing bias.

Let us assume that there is no bias in the training set. This means that 𝑌 and 𝑋𝑠𝑒𝑥 are independent. So we expect that:

Left-hand side: expected joint probability of Y and X under the independence hypothesis. Right-hand side: product of the observed marginal probabilities of Y and X. Image by the author.

However, in the biased dataset, we observe 𝑃𝑜𝑏𝑠(𝑌=1,𝑋𝑠𝑒𝑥=0), which is different from the right-hand side of the above equation. So let us define weights 𝑊:

Image by the author.

And that’s it! The disparate impact on the reweighed dataset is now 1 by construction:

Image by the author.

In our case of binary 𝑌 and binary 𝑋𝑠𝑒𝑥, there are only four weights to compute, pretty easy!

(For further details, see Kamiran and Calders 2012, Data preprocessing techniques for classification without discrimination.)
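To make the construction concrete, the four weights can also be computed by hand from the simulated training dataframe (a sketch; the column names follow the dataset created above):

# W(s, y) = P_obs(sex = s) * P_obs(recruited = y) / P_obs(sex = s, recruited = y)
p_sex = df_train["sex"].value_counts(normalize=True)
p_y = df_train["recruited"].value_counts(normalize=True)
p_joint = df_train.groupby(["sex", "recruited"]).size() / len(df_train)
weights_by_group = {
    (s, y): p_sex[s] * p_y[y] / p_joint[(s, y)]
    for s in (0, 1)
    for y in (0, 1)
}

In our biased dataset, recruited women get a weight larger than one, rejected women a weight smaller than one, and conversely for men.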

So let us instantiate a Reweighing object.

from aif360.algorithms.preprocessing import Reweighing

privileged_groups_params = dict(
    privileged_groups=[{"sex": 1}],
    unprivileged_groups=[{"sex": 0}]
)
RW = Reweighing(**privileged_groups_params)
dt_train_reweighed = RW.fit_transform(dt_train)
weights = dt_train_reweighed.instance_weights

Let us visualize our new, reweighed training set.

Reweighed training set. Dot size reflects individual weights. Image by the author.

Let us train a new logistic regression on this balanced dataset. This is simply done by adding a sample_weight option to the fit method of our pipeline. Note that we keep removing the sensitive attribute.

lr.fit(
    dt_train.features[:, 0:2],
    dt_train.labels.ravel(),
    lr__sample_weight=weights  # this is how we use weights
);
y_pred = lr.predict(dt_test.features[:, 0:2])

Here come the disparate impacts of all previous strategies:

Image by the author.

There are two interesting things to notice:

  • First, the reweighed history has a disparate impact of one, up to machine precision. This is a consequence of the construction discussed above.
  • Second, even if the disparate impact on the training set is one, the disparate impact of the model trained on it is not one.

Let us investigate the case by visual inspection of the obtained decision boundary:

Decision boundary of a simple logistic model trained with sensitive attribute removal and training set reweighing. Image by the author.

Right… This looks like a more acceptable decision function than before, and indeed more women get recruited. But still… it makes no logical sense to give larger scores to higher expected salaries, and discrimination against women persists through the social bias: at a fixed skill level, women get hired less because they ask for lower salaries. This is how the historical bias sneaked into our model, despite all our efforts to rebalance it.

Key takeaway: Removing bias in the training set does not necessarily make your model unbiased.

Strategy n°4: post-processing by rejecting

This strategy is quite simple: whenever the model is uncertain about the decision, implement positive discrimination. In other words, whenever the score is close to 0.5, systematically recruit women and reject men.

So let us first retrain a model on the original training set (not reweighed).

lr.fit(
    dt_train.features[:, 0:2],
    dt_train.labels.ravel()
);
y_prob = lr.predict_proba(dt_test.features[:, 0:2])[:, 1]

Now, we define the rejection margin 𝜃 so that:

Image by the author.

This simply means that men are disfavored close to the decision boundary. The idea is that discrimination more easily happens when there is no clear distinction between candidates.

(See Kamiran et al. 2012, Decision Theory for Discrimination-Aware Classification for further details.)
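In plain Python, the decision rule amounts to something like the following sketch (the helper name is ours; the 0.5 threshold and 0.15 margin mirror the values used just below):

import numpy as np

def reject_option_predict(scores, sex, theta=0.15, threshold=0.5):
    """Inside the critical band around the threshold, recruit women and reject men."""
    y_hat = (scores >= threshold).astype(int)
    in_band = np.abs(scores - threshold) <= theta
    y_hat[in_band & (sex == 0)] = 1  # favor the unprivileged group (women)
    y_hat[in_band & (sex == 1)] = 0  # disfavor the privileged group (men)
    return y_hat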

So let us instantiate a RejectOptionClassification object (ROC).

We specify two parameters:

  • the classification threshold
  • the ROC margin 𝜃.

from aif360.algorithms.postprocessing import (
    RejectOptionClassification
)

ROC = RejectOptionClassification(**privileged_groups_params)
ROC.classification_threshold = 0.5
ROC.ROC_margin = 0.15

dt_test.scores = y_prob.reshape(-1, 1)
y_pred = ROC.predict(dt_test).labels

We obtain the following disparate impact (bar on the right):

Image by the author.

Haha! It seems that we again obtained an acceptable disparate impact. We are even making men the unprivileged group… Oops!

What does it mean for the decision boundary?

Decision boundary of a simple logistic model trained on the historical dataset with sensitive attribute removal and reject option post-processing. The upper part features men’s decision boundary, the lower part features women’s decision boundary. Image by the author.

See how we implemented positive discrimination? Men are systematically disfavored with a lower score close to the decision boundary, while the opposite is true for women.

But damn! There is still this inconsistency between score and expected salary: even if there is an offset between men and women, both would maximize their chances of being recruited by asking for a higher salary.

Now, let us play with the 𝜃 parameter. This parameter encodes the amount of positive discrimination. How do accuracy and disparate impact evolve with it?
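The curves below can be reproduced with a simple sweep over 𝜃, reusing the hypothetical reject_option_predict helper sketched earlier (a sketch, not the author's exact code):

import numpy as np
from sklearn.metrics import accuracy_score

sex = dt_test.features[:, 2]
y_true = dt_test.labels.ravel()
for theta in np.linspace(0.0, 0.3, 13):
    y_hat = reject_option_predict(y_prob, sex, theta=theta)
    acc = accuracy_score(y_true, y_hat)
    di = y_hat[sex == 0].mean() / y_hat[sex == 1].mean()
    print(f"theta={theta:.3f}  accuracy={acc:.3f}  DI={di:.3f}")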

Test accuracy (blue) and test disparate impact (red) as a function of the rejection margin 𝜃. Larger rejection margins lead to smaller accuracy and larger disparate impact. Image by the author.

This is the so-called fairness-utility trade-off: improving disparate impact necessarily degrades accuracy.

But! Recall that maximizing accuracy is not what we want in the first place: welcome, evaluation bias! Maximizing accuracy simply means that we reproduce the historical bias perfectly, which is not our primary objective.

Key takeaway: There is a fairness-utility trade-off: you cannot be both perfectly fair and maximally accurate. But that’s ok! Maximizing accuracy is, at best, the best way to learn a pre-existing bias. Decreasing accuracy does not mean you are going to lose money, just that you are going to make money in a different way.

Strategy n°5: in-processing with constraints

This strategy is nothing but a logistic regression with an additional penalty encoding the prejudice made by the algorithm. The prejudice index 𝑃𝐼 is defined as the mutual information between the prediction and the sensitive attribute 𝑆 (which is 𝑋𝑠𝑒𝑥 in our case):

Prejudice index (PI) definition. Image by the author.

In the implementation, this term is added to the standard log-loss cost function, weighted by a coefficient 𝜂.

(See Kamishima et al. 2012, Fairness-Aware Classifier with Prejudice Remover Regularizer for further details).
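As an aside, an analogous prejudice index can be estimated for the historical labels themselves from the contingency table of Y and X_sex, for instance with scikit-learn's mutual information between two discrete label vectors (a sketch; values are in natural-log units):

from sklearn.metrics import mutual_info_score

# Mutual information between the historical labels and the sensitive attribute;
# it only depends on their 2x2 contingency table
pi_history = mutual_info_score(
    dt_train.labels.ravel(),
    dt_train.features[:, 2]  # the sex column
)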

Let us see what happens with 𝜂=100.

from aif360.algorithms.inprocessing import PrejudiceRemover

pr = PrejudiceRemover(eta=100.0)

dt_train.features = preprocessing.fit_transform(dt_train.features)
dt_test.features = preprocessing.transform(dt_test.features)

pr.fit(dt_train);
y_pred = pr.predict(dt_test).scores >= 0.5

Image by the author.

Again, the disparate impact seems acceptable. But let us keep a critical mind. What does the decision boundary look like? Since the sex variable is used by our algorithm (hopefully in a positive way?), we again split the graph into two parts: men’s decision boundary (top) and women’s decision boundary (bottom).

Decision boundary of a simple logistic model trained with a prejudice penalty. Image by the author.

Hmm… Compared to our “do nothing” strategy, the women’s decision boundary has been shifted to the left and slightly tilted, so that more of them get recruited.

However, if you look carefully at the graph, you see that a woman with a skill level of 10 expecting a salary of 40 is not recruited, whereas a man with the same skill and salary levels would definitely be recruited! This is because prejudice (i.e. mutual information) relies mainly on the contingency table of 𝑌 and 𝑋𝑠𝑒𝑥, which is only an aggregated view of the data: welcome, aggregation bias!

Moreover, imagine that women tend to ask for higher and higher salaries in the future (which would be great!): the women’s cloud would then shift slightly upward, and fewer and fewer women would get hired by the algorithm, eventually degrading your disparate impact: welcome, AI model lifecycle!

So you got it: fairness is not just about another metric to optimize!

What about the fairness-utility trade-off in this case? Let us just play with the 𝜂 penalty parameter:
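As before, the trade-off curve can be reproduced by refitting the prejudice remover for several values of 𝜂 (a sketch; the grid of values is an assumption):

from sklearn.metrics import accuracy_score

sex = dt_test.features[:, 2]   # passthrough column, still binary after scaling
y_true = dt_test.labels.ravel()
for eta in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    pr_eta = PrejudiceRemover(eta=eta)
    pr_eta.fit(dt_train)
    y_hat = (pr_eta.predict(dt_test).scores.ravel() >= 0.5).astype(int)
    acc = accuracy_score(y_true, y_hat)
    di = y_hat[sex == 0].mean() / y_hat[sex == 1].mean()
    print(f"eta={eta:7.1f}  accuracy={acc:.3f}  DI={di:.3f}")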

Test accuracy (blue) and test disparate impact (red) as a function of the prejudice penalty 𝜂. Larger prejudice penalties lead to smaller accuracy and larger disparate impact. Image by the author.

This is another illustration of the fairness-utility trade-off: fairer models are less accurate. But again, the very notion of performance is biased when the dataset itself is biased.

Key takeaway: several bias removal techniques exist; none is perfect.

Strategy n°6: your strategy

If you could draw a decision boundary by hand, how would you shape it so as to be accurate, fair and consistent?

Here is a proposal. Let us instantiate a custom FairModel class and see what it says.

fm = FairModel(lr)  # our custom fair model (what is it? suspense!)
fm.fit(
    dt_train.features[:, 0:2],
    dt_train.labels.ravel()
);
y_prob = fm.predict_proba(dt_test.features[:, 0:2])

Here is how our custom strategy compares with all the other ones in terms of accuracy and disparate impact:

Test accuracies of all modeling strategies seen in this article. Image by the author.
Test disparate impacts of all modeling strategies seen in this article. Image by the author.

Our strategy performs relatively well in terms of both accuracy and disparate impact. But since we have already been fooled by the numbers, let us have a closer look at the decision boundary.

Decision boundary of a simple logistic model combined with a simple business rule. Image by the author.

Tadaa! See what we did? We simply applied a business rule on top of our original logistic regression:

“If salary is high, rely on historical data. Else, simply ignore salary.”

This model thus seems to be fair for both men and women, independently of the representativity, social and historical biases. Furthermore, even if society changes and women progressively ask for higher salaries, our model remains fair and consistent. This is what we call robustness.
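The article does not disclose the FairModel implementation, so here is one plausible sketch of the stated business rule: below a salary threshold, the salary feature is simply clipped, so the score no longer depends on it. The threshold value is an illustrative guess, and the men-only fitting mentioned just after is not reproduced here.

import numpy as np
from sklearn.base import BaseEstimator, clone

class FairModel(BaseEstimator):
    """Wrap a classifier and apply the rule: 'if salary is high, rely on
    historical data; else, ignore salary'. Columns are assumed to be
    [skill, salary]; the threshold is an illustrative assumption."""

    def __init__(self, base_estimator, salary_threshold=55.0):
        self.base_estimator = base_estimator
        self.salary_threshold = salary_threshold

    def _apply_rule(self, X):
        X = np.array(X, dtype=float, copy=True)
        # Below the threshold, everyone is scored as if they asked for the
        # threshold salary, so the decision no longer depends on salary there
        X[:, 1] = np.maximum(X[:, 1], self.salary_threshold)
        return X

    def fit(self, X, y):
        self.model_ = clone(self.base_estimator).fit(self._apply_rule(X), y)
        return self

    def predict_proba(self, X):
        return self.model_.predict_proba(self._apply_rule(X))

    def predict(self, X):
        return self.model_.predict(self._apply_rule(X))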

How did we deal with all three biases?

  • Representativity bias: our logistic model is fitted only on men, who represent the ground truth of what to expect in terms of skills and salary.
  • Social bias: low expected salaries have no impact on recruitment, and hence neither has sex.
  • Historical bias: the decision boundary for women does not optimize accuracy at all: it is made continuous with that of men (consistency) and ignores salary.

Key takeaway: the optimal solution in terms of business logic, performance and fairness may be hybrid between machine learning and business rules.

Let’s remove biases in models!

In this tutorial, we have encountered five different biases:

  • Representativity bias: there are not enough women to rely on disparate impact measures alone (not to mention statistical tests, which are equally flawed).
  • Social bias: sex and expected salary are strongly correlated, which impairs naive bias mitigation methods such as sensitive attribute removal.
  • Historical bias: women are already discriminated against in the training set.
  • Evaluation bias: women are discriminated against in the testing set as well, so considering the accuracy metric alone is meaningless.
  • Aggregation bias: disparate impact, mutual information and all related significance tests rely on coarse summaries of the dataset, namely contingency tables. These numbers can be artificially inflated without suppressing discrimination.

In this tutorial, the first three biases were countered with a simple business rule. The last two biases were countered with visualizations and highly interpretable models.

Key takeaways

Algorithmic biases are subtle and ubiquitous. Fairness is hard to achieve and calls into question the very notion of “performance”. There is no magical metric to optimize, nor a magical library that solves the problem.

And there will never be.

Transparency is key. Human-in-the-loop is crucial.

Affiliation

I work at Quantmetry. Pioneer and independent since its creation in 2011, Quantmetry is the leading French pure-player artificial intelligence consultancy. Driven by the desire to offer superior data governance and state-of-the-art artificial intelligence solutions, Quantmetry’s 120 employees and researcher-consultants put their passion at the service of companies in all sectors for high business results.

Read our latest white paper: https://www.quantmetry.com/lp-ia-de-confiance/

References

[1] Mehrabi et al., A Survey on Bias and Fairness in Machine Learning, 2019, arXiv.
[2] Caton and Haas, Fairness in Machine Learning: A Survey, 2020, arXiv.
[3] Kamiran and Calders, Data preprocessing techniques for classification without discrimination, 2012, Knowledge and Information Systems.
[4] Kamiran et al., Decision Theory for Discrimination-Aware Classification, 2012, IEEE 12th International Conference on Data Mining.
[5] Kamishima et al., Fairness-Aware Classifier with Prejudice Remover Regularizer, 2012, Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
