
Fairness Metrics Won’t Save You from Stereotyping

Understanding the basis for decisions is crucial in ensuring fair models


Fairness metrics are often used to verify that machine learning models do not produce unfair outcomes across racial/ethnic groups, gender categories, or other protected classes. Here, I will demonstrate that, whatever usefulness such metrics might have in identifying other types of bias, they are unable to detect one major form of algorithmic discrimination: stereotyping. In fact, the behavior of fairness metrics is essentially identical whether a model bases its decisions on arguably reasonable factors or on stereotypes arising from the inclusion of a sensitive feature, or a proxy for such a feature.

Here is a current example. Recently, Apple and Goldman Sachs were accused of sexism in their credit decisions for the Apple Card [1,2]. As far as I know, there are no official answers about whether these accusations were justified, or what the cause of any issue might have been; the matter is still under investigation. However, some possible explanations have been discussed in the media. The speculations I’ve heard tend to fall into 2 major categories:

  1. Women have lower income because of workplace discrimination, and credit limits are linked to income; therefore, women are assigned lower limits. (e.g. [1])
  2. The Apple Card algorithm’s training data contained information about shopping patterns, which were a proxy for gender, and led to stereotyping of female applicants (e.g. [2])

These 2 scenarios are very different in terms of fairness. Whatever concerns there may be about scenario 1, scenario 2 is much worse.

Scenario 1 involves decisions based on income, which may be an appropriate basis for credit determination, even if outcomes are unequal. Personally, I’ve experienced gender-based discrimination, but I don’t see how an Apple card would help me with that. A trendy credit card can’t replace lost income, and having money seems like a necessary condition for paying back a loan.

Scenario 2 is likely to occur when female status is correlated with loan default due to a confounding feature. Lower average salaries for women would lead to such a correlation. But, basing decisions on typical incomes by gender ensures that all women are disadvantaged due to workplace discrimination that may have been experienced by only some of us. If we assume that all women have lower incomes, we will deny credit even to women who have been spared, or who have overcome, workplace discrimination. This kind of model takes pre-existing discrimination and amplifies it so that it becomes even more widespread. (Stereotyping risk is a reason that including sensitive features such as race or sex in models is widely discouraged, or even illegal in some scenarios.)

Stereotyping is not the only possible source of unfairness in machine learning models, but it is an insidious one. Our models are designed to estimate population risks, with similar people receiving similar scores. In the context of machine learning, "similar" means comparable values of the features included in the data (weighted by predictive power). Models are designed to generalize; there is always variation in individuals with the same "risk score", but, when decisions are made, all people with the same score will receive the same treatment. It may be acceptable if people with similar incomes and debt levels are treated as indistinguishable, but a model can consider sex, race, or other characteristics to be relevant similarities, especially if these are correlated with actual outcomes.

The usual strategy for mitigating stereotyping risk for machine learning models is to examine input features and verify that (1) sensitive features are not incorporated, and (2) obvious proxies for these are not included in the model. However, less obvious proxies may well occur, especially in a data set with many features.

Here I’ll use example data to simulate this issue, and to demonstrate that fairness metrics can’t distinguish stereotyping from a decision based on an arguably reasonable factor. Therefore, detection of stereotyping requires a different approach.

Demonstration Data Set

Demonstration code can be found on GitHub [3]. I use a simplified public loans data set [4], which does not contain any gender information (although the source data presumably includes multiple genders). The data set contains loan records with default ("bad loan") status and 13 predictors, including income, features related to debt load, loan amount, employment length, and US state of residence (which I group into regions).

I artificially impose a simplified concept of "gender" on this data set, for simplicity considering only "male" and "female" classes, separated based on income. I assume lower-income borrowers are more likely to be female, and higher-income borrowers more likely to be male. My assignment greatly exaggerates income inequality in the US, with the income of "females" about half that of "males" ($38k vs. $74k), but (importantly) retains significant overlap between the male and female groups. Because the loans data set contains people with mostly relatively high incomes, "females" make up only about 26% of loans. Although my simulation is unrealistic, it gives us a very clear contrast, which is useful as an example and is relevant for situations where representation is unequal.
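The sketch below shows one way such an assignment could be made; it is an illustration under stated assumptions, not the exact code from the repository [3]. The probability of the "female" label decreases with income rank, so the groups overlap but differ sharply in average income; the column names follow the H2O loans file [4], and the probability curve and random seed are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Load the simplified loans data [4]
loans = pd.read_csv(
    "https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv"
)

rng = np.random.default_rng(42)

# Probability of the "female" label falls with income rank, so the two
# groups overlap but "females" end up with much lower average income
income_rank = loans["annual_inc"].rank(pct=True)
p_female = np.clip(0.85 - 0.8 * income_rank, 0.05, 0.95)
loans["female"] = (rng.random(len(loans)) < p_female).astype(int)

print(loans.groupby("female")["annual_inc"].mean())  # check the income gap
print(loans["female"].mean())                        # share of "female" loans
```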

I create two data sets to illustrate different decision bases. The first is just the original data set. The second removes the income feature and replaces it with my inferred "gender" feature (a female indicator). I then build an XGBoost classification model for each data set and calculate fairness metrics for my artificial gender groups, as sketched below. Using this process, I hope to illustrate 2 decision paths, one based on income and other features (Model A), and another which has the potential to base decisions on gender (Model B).
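A minimal sketch of this setup, continuing the snippet above, is shown below. It one-hot encodes the text columns and fits the two XGBoost classifiers; the hyperparameters are placeholders, and details such as grouping states into regions are omitted, so this is not the exact pipeline in the repository [3].

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# One-hot encode string columns (term, purpose, state, etc.) so both XGBoost
# and the later SHAP analysis see purely numeric features
loans_enc = pd.get_dummies(loans, drop_first=True, dtype=float)

target = "bad_loan"
features_a = [c for c in loans_enc.columns if c not in (target, "female")]      # keeps income
features_b = [c for c in loans_enc.columns if c not in (target, "annual_inc")]  # keeps "female"

train, test = train_test_split(loans_enc, test_size=0.3, random_state=42,
                               stratify=loans_enc[target])

def fit_model(feature_list):
    model = xgb.XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    model.fit(train[feature_list], train[target])
    return model

model_a = fit_model(features_a)  # decisions can be based on income
model_b = fit_model(features_b)  # decisions can be based on "gender"
```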

Examining global feature importances shows that income is the most important feature in Model A, while gender (a less detailed feature) is of lesser importance in Model B. However, gender does still appear to be an important feature, indicating that Model B uses this factor to make predictions.
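As a quick check, the built-in importances of the two classifiers can be compared directly (the plots in the repository [3] may use a different importance type):

```python
import pandas as pd

def global_importance(model, feature_list):
    """Built-in global importances (relative), sorted descending."""
    return (pd.Series(model.feature_importances_, index=feature_list)
              .sort_values(ascending=False))

print(global_importance(model_a, features_a).head())  # income should rank near the top
print(global_importance(model_b, features_b).head())  # "female" should appear prominently
```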

The question now is, to what extent can common fairness metrics distinguish Model A (income-based decisions) from Model B (gender-based decisions)? Spoiler alert: They can’t.

Fairness Metrics

In recent years, attention has been devoted to the development of "fairness metrics", which promise to detect bias in models. Typically, these are post-hoc tests, with values calculated separately for groups of interest; differences in values are thought to indicate unfairness. There are numerous fairness metrics, and many of them cannot be mutually satisfied [5]. Here I will discuss some of the more common ones used for binary classification.

Demographic Parity

One common test of fairness is simply whether the rate of positive predictions from a model’s binary yes/no output is similar across groups.

The actual outcomes vary by "gender", as might be expected because I have preferentially sorted low-income borrowers into the artificial "female" category. Both models’ predictions reproduce the actual rates fairly well. Importantly, Model A and Model B look very similar to each other!
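Demographic parity can be checked in a few lines, continuing the earlier sketch; predict() is assumed to use its default 0.5 probability threshold as the "yes" decision.

```python
import pandas as pd

def positive_rate_by_group(model, feature_list, data, group_col="female"):
    """Share of predicted defaults ("yes" decisions) within each group."""
    preds = model.predict(data[feature_list])
    return pd.Series(preds, index=data.index).groupby(data[group_col]).mean()

print("Model A:", positive_rate_by_group(model_a, features_a, test).to_dict())
print("Model B:", positive_rate_by_group(model_b, features_b, test).to_dict())
print("Actual: ", test.groupby("female")["bad_loan"].mean().to_dict())
```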

Calibration

A calibration test for fairness examines whether the relationship between actual and predicted risk is similar across groups. One common assessment is to create decile groups from model probability outputs. How do Models A and B fare in this test?

We see above that the male and female curves lie on top of each other, indicating that both models appear calibrated with respect to gender. Again, Models A and B show similar results.
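A sketch of the decile-based calibration check is below, continuing the earlier code; the repository [3] may bin and plot the curves differently.

```python
import pandas as pd

def calibration_by_decile(model, feature_list, data, group_col="female"):
    """Mean predicted vs. actual default rate per probability decile, split by group."""
    cal = pd.DataFrame({
        "prob": model.predict_proba(data[feature_list])[:, 1],
        "actual": data["bad_loan"].values,
        "group": data[group_col].values,
    })
    cal["decile"] = pd.qcut(cal["prob"], 10, labels=False, duplicates="drop")
    return (cal.groupby(["decile", "group"])
               .agg(mean_prob=("prob", "mean"), default_rate=("actual", "mean")))

print(calibration_by_decile(model_a, features_a, test))  # repeat for model_b / features_b
```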

False Positive/Negative Rate

False positive / false negative rate tests are satisfied when these rates are similar across groups. Importantly, these tests are expected to fail for a calibrated model where actual outcomes vary by group [5], as is the case for our models. But can these distinguish Model A from Model B?

False Positive Rates

False Negative Rates

As expected, both tests fail for both models. False positive rates are high for females overall, as they tend to have risk scores near the decision boundary, while false negative rates are high for males, who default rarely. In general, error rates for Model B are slightly higher than for Model A, as expected given that raw income is a stronger predictor than the correlated but much less detailed gender feature. However, there is no strong pattern distinguishing Model A from Model B: both models fail both tests with similar patterns.
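For reference, these group error rates can be computed directly from the per-group confusion counts, as in the sketch below (continuing the earlier code).

```python
import pandas as pd

def error_rates_by_group(model, feature_list, data, group_col="female"):
    """False positive and false negative rates within each group."""
    err = pd.DataFrame({"y": data["bad_loan"].values,
                        "pred": model.predict(data[feature_list]),
                        "group": data[group_col].values})
    rows = {}
    for g, sub in err.groupby("group"):
        rows[g] = {"FPR": (sub.loc[sub["y"] == 0, "pred"] == 1).mean(),
                   "FNR": (sub.loc[sub["y"] == 1, "pred"] == 0).mean()}
    return pd.DataFrame(rows).T

print(error_rates_by_group(model_a, features_a, test))
print(error_rates_by_group(model_b, features_b, test))
```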

Model Performance

Another category of fairness metrics addresses whether model performance is similar across groups. Any performance metric can be used. Below I show accuracy and F1 scores:

Accuracy

F1 Score

Accuracy is higher for males, partly because the default assumption of no default is more often correct for them. In addition, males are the majority group (~74% of loans). However, the F1 score, which reflects separation in the positive class, is higher for females, probably reflecting a greater number of very high-risk loans in this group. In any case, the general behavior of these metrics is again the same for both models.
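A minimal sketch of the per-group performance comparison, using scikit-learn’s metrics on the same test split:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def performance_by_group(model, feature_list, data, group_col="female"):
    """Accuracy and F1 score computed separately for each group."""
    preds = pd.Series(model.predict(data[feature_list]), index=data.index)
    rows = {}
    for g, sub in data.groupby(group_col):
        rows[g] = {"accuracy": accuracy_score(sub["bad_loan"], preds[sub.index]),
                   "f1": f1_score(sub["bad_loan"], preds[sub.index])}
    return pd.DataFrame(rows).T

print(performance_by_group(model_a, features_a, test))
print(performance_by_group(model_b, features_b, test))
```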

Metrics, What are they Good For?

The chart below summarizes metric results for both models:

As we can see, the pass/fail pattern is identical for Models A and B, although Model A bases its decisions on income, and Model B leverages gender stereotypes. One might conclude that fairness metrics are useless; what is certain is that they cannot detect stereotyping.

In my opinion, fairness metrics are not helpful when used to "prove" that bias does or does not exist. However, they are very useful when considered as one piece of evidence among many that direct an investigation down certain paths. Fairness metrics help get us oriented to potential risks, and provide evidence that may support or contradict our understanding of our model. Here, the pattern of fairness metrics is very much as expected, or the best that is possible, when we have an actual outcome that varies by sex and the model reproduces the actual distribution.

Can Stereotyping Be Stopped?

Since fairness metrics don’t detect stereotyping, must we rely on existing methods of examining input features for obvious proxy effects? My model B included a feature explicitly called "female", which may be an obvious red flag. However, I could have named this feature something else, or could have created an intermediate feature correlated with both income and "gender"; in other words, this feature does not have to be an obvious proxy for the conclusions to hold.

Fortunately, we can leverage explainability techniques to isolate features that contribute to group differences. We narrow our investigations to specific features and proceed with additional tests.

Shapley Explanations for Fairness

I am following a process for explaining group differences which has been described by Scott Lundberg [6]. However, I make a modification: I use one sex (males) as the reference population, or "foil", and generate contrastive explanations for individuals of the other sex (females). Using such a foil simplifies the analysis and may be especially helpful in cases when groups are imbalanced.

Shapley explanations ("phi values") distribute model results across input features. The phi values reflect the marginal contribution of each feature to the overall model score, calculated as the change in outcome when a given feature value is replaced by a value randomly drawn from the reference population, averaged over combinations of the other features. When probabilities are allocated via Shapley values, the phi values for an individual sum to the difference between that individual’s score and the mean score for the foil data. Individual phi values can then be summed across individuals to identify features that contribute to overall population differences [6].
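A rough sketch of this contrastive analysis with the shap package is below, continuing the earlier code; the sample sizes are arbitrary, and the repository [3] may differ in the exact SHAP settings used.

```python
import pandas as pd
import shap

# Male borrowers serve as the reference population ("foil"); female borrowers are explained
male_background = test.loc[test["female"] == 0, features_a].sample(200, random_state=0)
female_rows = test.loc[test["female"] == 1, features_a].sample(500, random_state=0)

# An interventional TreeExplainer with a male background yields probability-scale
# explanations relative to the male reference population
explainer = shap.TreeExplainer(model_a, data=male_background,
                               feature_perturbation="interventional",
                               model_output="probability")
phi = explainer.shap_values(female_rows)  # shape: (n_females, n_features)

# explainer.expected_value is the mean predicted probability over the male
# background; each row of phi sums to (that individual's probability - expected_value)
mean_phi = (pd.Series(phi.mean(axis=0), index=features_a)
              .sort_values(key=abs, ascending=False))
print(mean_phi.head())  # features driving the group difference; repeat for model_b / features_b
```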

The following plots show the results of this analysis for a sample of female loan recipients for our two models:

In both models, the technique isolates the exact features that (by design) I expect to drive sex differences! Model A predicts that females default more based on the income feature, while for Model B the major determinant is female status. The pattern is very clear, even though income (and therefore female status) has fairly strong correlations with other features such as loan amount and number of credit accounts. And we can see a clear difference between the models, namely the features on which decisions are based.

It wasn’t guaranteed we would highlight these exact features in the Shapley plots. Income is a very strong overall predictor, which makes it more likely to be used in decisions. If income were a weaker predictor, but correlated with a stronger predictor, that stronger feature may have popped out in the plots. That would have been fine, however, as we need to explain how our model makes decisions, regardless of the construction of the data.

At this point, we have a lot of information relevant to assessing stereotyping risks. Instead of scrutinizing every feature as a potential proxy, we have isolated key features for additional testing. Even inspection of the features provides information. We can ask follow-up questions such as, "is the feature something with a known causal relationship to our outcome?", or "how did we calculate this feature, and what is the source of the information?" Or perhaps the most important question: "is this feature important enough to our goal to justify the group differences it generates?"

Additional tests are possible to examine whether the feature or features obtained by this Shapley-based analysis are likely to be independently associated with the outcome, or a proxy for race or gender; such tests can also help uncover label or feature bias. I hope to discuss some of these in a later article.

Final Thoughts

The behavior of commonly-used fairness metrics is similar whether a model bases its decisions on an appropriate predictor or on an incidental feature (such as race or sex) that happens to be correlated with a predictor. Explainability techniques can identify features driving group differences and help distinguish unfair outcomes from justifiable decisions.

A data scientist is seldom called on to decide what is fair or appropriate. Instead, our role is to understand how our models work and assess them for potential risks. We need to be able to test models, fully understand mechanisms underlying group differences, and communicate those results to stakeholders, regulators, and possibly the public.

An analysis that focuses solely on metrics can miss important categories of bias. In the case of stereotypes, AKA proxy variables or "statistical discrimination", we take pre-existing societal biases and impose those even on people who managed to escape discrimination. Our algorithms have the capacity not only to continue the status quo, but to create additional victims. We data scientists have the responsibility to detect and measure such risks, and therefore must examine decision bases in addition to metrics.

References

[1] J. M. Germain, Apple Card Algorithm May Tilt Favorably Toward Men (2019), E-Commerce Times.

[2] W. Knight, The Apple Card Didn’t ‘See’ Gender – and That’s the Problem (2019), Wired.

[3] V. Carey, GitHub Repository, https://github.com/vla6/Stereotyping_ROCDS.

[4] h2o.ai, Lending Club Data Set, https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv.

[5] G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, and K. Q. Weinberger, On Fairness and Calibration (2017), Advances in Neural Information Processing Systems, 5680–5689.

[6] S. Lundberg, Explaining Measures of Fairness (2020), Towards Data Science.
