
Measuring “Fairness” When Ages Differ


Photo by Anna Vander Stel on Unsplash

Constructing fairness metrics typically involves dividing a population into sub-groups, and then examining differences in model performance across groups. For example, you might split your population by gender, then measure accuracy and false positive rates for females vs. males. However, when the underlying age distributions of the sub-populations differ, and your outcome varies with age, differences in fairness metrics are inevitable, even when there is no intrinsic effect of group membership. This is true even if a model "adjusts for" age.

Age is not unique – variations in any key predictor can cause fairness metric differences. However, I think age deserves special consideration. In the United States (and many other places), racial/ethnic groups have large differences in age characteristics that have the potential to impact numerous models [1]. Although gender differences in the overall US population are less extreme, I’ve found ages often vary by gender in practice – for example, in the US labor market, females skew younger than males, which may result from educational differences in some birth cohorts and from family caretaking responsibilities [2]. In my experience at least, most populations used for modeling show age differences across the groups of interest for fairness metrics. Age is also an important feature to consider because it affects so many aspects of people’s lives, from income to health status to behavior. Moreover, age is often considered a sensitive feature, which sometimes means it is ignored even though its effects may need thoughtful consideration. Even if age is inappropriate to use as a basis for decisions, it can be a mistake to ignore it during model validation.

For example, let’s say we want to model 401(k) participation among a firm’s employees, perhaps to construct targeted email campaigns to increase enrollment. When we construct the model, we will probably find that it predicts higher participation for males. Some of this is predictable based on age alone; in the workforce, males are older on average, and age is related to retirement plan access and participation [3]. In addition, assuming that mean ages are near mid-career (30s-50s), we’d expect higher income for the older group (males) [4], which may also increase savings rates for men compared to women.

If we examine fairness metrics for this model, we’ll see differences across genders, whether age is or isn’t included in the model. If the model incorporates an age feature (or proxy), we’ll probably see higher false positive rates for males because they are more likely to have intermediate participation probabilities. In contrast, if age isn’t included, we might expect higher false positive rates for females because we are implicitly assuming "average" age characteristics for everyone, which means over-estimating for females. And I haven’t yet mentioned gender-related differences in features like income, which will lead to additional discrepancies. When evaluating this model, we must be careful about what differences are due to age distributions vs. gender effects vs. errors or biases in our algorithm.

In this post, I use US Census data to illustrate how fairness metrics can be affected by differences in population ages, even in the absence of any direct effect of group membership. Calculations can be followed in a Jupyter (Python 3) notebook [5]. Age by race/ethnicity data is downloaded from the US Census Bureau’s 2019 American Community Survey and used to simulate a process with random and age-based variations.

The simulated process does not depend on race/ethnicity; differences across groups are due to age only. For simplicity, I show results for two groups only, selecting one relatively old (White alone, not Hispanic or Latino) and one relatively young (Hispanic or Latino) group. A logistic regression model is fit to the data, and fairness metrics are computed across these racial/ethnic groups. I demonstrate large differences in fairness metrics due to age distributions only. Metric differences can be partially mitigated by age stratification.

Population Age Characteristics

Census data is available for direct download; this process is illustrated in the Jupyter notebook [5]. I use 2019 American Community Survey age by race/ethnicity data ("B01001" tables) [1]. This source contains counts by age, sex, and race/ethnicity group, for eight race/ethnicity groupings. Counts are provided in age buckets of varying sizes, for example ages 18 to 19 versus ages 65–74. I combine gender counts, and construct a finer-grained distribution using smoothing. This allows me to obtain approximate age distributions by race/ethnicity with single-year resolution. Resulting curves for my selected sub-populations are shown below.
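To give a sense of the bucket-to-single-year step, here is a minimal sketch of one possible approach. The bucket boundaries and counts below are purely illustrative (real values come from the B01001 tables), and the moving-average smoother is just one simple choice; the notebook [5] contains the actual processing.

```python
import numpy as np

# Illustrative ACS-style age buckets for one group: (lower age, upper age, count).
# These counts are made up for demonstration; real values come from the B01001 tables.
buckets = [(18, 19, 120_000), (20, 24, 310_000), (25, 34, 640_000),
           (35, 44, 600_000), (45, 54, 580_000), (55, 64, 560_000),
           (65, 74, 430_000), (75, 84, 240_000), (85, 100, 90_000)]

# Spread each bucket's count evenly over its single-year ages,
# then lightly smooth so the distribution has no artificial steps.
ages = np.arange(18, 101)
density = np.zeros(len(ages))
for lo, hi, count in buckets:
    in_bucket = (ages >= lo) & (ages <= hi)
    density[in_bucket] += count / in_bucket.sum()

kernel = np.ones(5) / 5                       # simple moving-average smoother
smoothed = np.convolve(density, kernel, mode="same")
age_probs = smoothed / smoothed.sum()         # single-year age distribution (sums to 1)
```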

Approximate age distributions for two racial/ethnic groups, based on 2019 American Community Survey data [1]. Image by author.

I selected these two groups because they are relatively "young" and "old". In the following sections, I restrict the populations to age 18+. I draw 5,000 individuals at random from each of the above distributions, with the total dataset consisting of 10,000 adult individuals evenly split across the racial/ethnic groups.
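The sampling step might look roughly like the sketch below, continuing from the smoothing sketch above. The names age_probs_hispanic and age_probs_white_nh are hypothetical placeholders for the two groups’ single-year age distributions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
ages = np.arange(18, 101)

def sample_group(label, age_probs, n=5_000):
    """Draw n adults with single-year ages from one group's age distribution."""
    return pd.DataFrame({"group": label, "age": rng.choice(ages, size=n, p=age_probs)})

# age_probs_hispanic / age_probs_white_nh: distributions built as in the smoothing sketch
population = pd.concat([
    sample_group("Hispanic or Latino", age_probs_hispanic),
    sample_group("White alone, not Hispanic or Latino", age_probs_white_nh),
], ignore_index=True)
```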

In my sample, the median adult ages are 43 for the Hispanic or Latino group and 53 for the White alone, not Hispanic or Latino group. This median difference is large (~10 years), but the relatively "fat" tail at high ages for the White alone, not Hispanic or Latino group also contributes to differences in predictions and metric values. Populations with similar medians but different extremes can still show large metric differences, especially when age effects are nonlinear (here, I use a simple linear simulation).

Simulation and Model

I simulate a simple binary process that depends on two inputs: a variable x that is independent of both age and race/ethnicity, and a term that increases linearly with age. For each individual in my sample, I construct a score (log-odds) as:

score = -6 + x + 0.1 * age + (random noise)

In the above, age is measured in years and x is drawn from a standard normal distribution. The random noise component is also random-normal, scaled by 0.1. Each score is converted to a probability with the logistic (sigmoid) function. The simulation coefficients were selected so that the base rate is about 1 in 3 positive outcomes, and so that the relative effects of age and x are similar. I use these probabilities to draw samples from a binomial distribution for each case, assigning a binary (0/1) outcome to every person.
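A minimal sketch of this simulation, continuing from the sampled population above (the exact seed and code are in the notebook [5]):

```python
import numpy as np
from scipy.special import expit   # logistic (sigmoid) function

rng = np.random.default_rng(0)
n = len(population)                               # 10,000 sampled adults

x = rng.normal(size=n)                            # predictor independent of age and group
noise = 0.1 * rng.normal(size=n)                  # small random-normal disturbance

score = -6 + x + 0.1 * population["age"] + noise  # log-odds
p_true = expit(score)                             # convert log-odds to probabilities
y = rng.binomial(1, p_true)                       # simulated binary (0/1) outcomes

population = population.assign(x=x, y=y)
```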

I then fit a logistic regression model of the form y ~ x + age to my simulated binary outcomes. Since the simulated process matches the model form so well, the fit recovers the input coefficients: 1.029823 for x and 0.10084 for age, close to the true values of 1 and 0.1. I then use the predict() method of the sklearn.linear_model.LogisticRegression object to generate a predicted outcome for each person (this is effectively a 50% probability threshold). Fairness metrics, e.g. false positive rates, are constructed by comparing simulated vs. modeled outcomes.
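In code, the fit and prediction step might look like this sketch (continuing from the simulation above; not the exact notebook code):

```python
from sklearn.linear_model import LogisticRegression

features = population[["x", "age"]]
model = LogisticRegression().fit(features, population["y"])
print(model.coef_)                      # roughly [1.0, 0.1], matching the simulation

# predict() applies an (effective) 50% probability threshold
population["y_pred"] = model.predict(features)
```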

Fairness Metric Results

I look at three common fairness metrics: false positive rates, false negative rates, and model accuracy. The table below shows results for the selected race/ethnicity groups:

Error rates differ dramatically for these two populations. False positive rates are more than twice as high for the White alone group, whereas the false negative trend is reversed, with the Hispanic or Latino group having nearly double the error rate.
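These per-group metrics are simple to compute from the simulated (y) and predicted (y_pred) outcomes; a minimal version, continuing from the code above, might look like:

```python
import pandas as pd

def group_metrics(df):
    """False positive rate, false negative rate, and accuracy for one group."""
    fp = ((df["y_pred"] == 1) & (df["y"] == 0)).sum()
    fn = ((df["y_pred"] == 0) & (df["y"] == 1)).sum()
    return pd.Series({
        "false_positive_rate": fp / (df["y"] == 0).sum(),
        "false_negative_rate": fn / (df["y"] == 1).sum(),
        "accuracy": (df["y_pred"] == df["y"]).mean(),
    })

print(population.groupby("group").apply(group_metrics))
```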

Although the accuracy measure is similar for both groups in this simple example, in a more complex model where the decision threshold is tuned, we could see this metric vary. For example, thresholds are often set by maximizing the f1 metric. This often pushes the decision threshold towards lower probability values, potentially decreasing the accuracy for the group that is more likely to have positive outcomes.
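For illustration only, a hypothetical threshold-tuning step of that kind might look like the following sketch (this tuning is not part of the model above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical illustration: pick the threshold that maximizes F1
# instead of using the default 50% cutoff.
probs = model.predict_proba(features)[:, 1]
precision, recall, thresholds = precision_recall_curve(population["y"], probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]   # last precision/recall pair has no threshold
population["y_pred_f1"] = (probs >= best_threshold).astype(int)
```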

Metric Differences are Expected

Differences in fairness metrics for these two populations are not surprising. Researchers have shown that error rate discrepancies are (with trivial exceptions) inevitable for a calibrated model where base rates differ [6, 7].

My simple example is a model that is nearly perfect, meaning that the model probability is very similar to the "actual" probability for individual cases (for most applications, actual individual probability is unknowable). This model is, by definition, very well-calibrated. Although this example is unrealistic, it provides a visual depiction of the principles outlined in the literature. If we examine the distributions of model probabilities by race/ethnicity, we see:

Histogram of model outputs by race/ethnicity. Image by author.

Consider false positives as an example. Positive predictions can be represented by the area under the portion of each probability distribution curve that lies to the right of the dashed line, which marks the 50% decision threshold. The expected number of false positives is the integral of (1 - probability) times the curve over that region. Therefore, because the White alone, not Hispanic or Latino curve has more weight to the right of the dashed line, particularly in regions not close to 100% probability, we expect more false positives.
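With individual model probabilities in hand, that integral is just a sum over the cases above the threshold; a minimal sketch, continuing from the fitted model above:

```python
# Expected false positives per group: sum of (1 - p) over cases the model
# predicts positive (probability at or above the 50% threshold).
population["p_model"] = model.predict_proba(features)[:, 1]

expected_fp = (
    population[population["p_model"] >= 0.5]
    .groupby("group")["p_model"]
    .apply(lambda p: (1 - p).sum())
)
print(expected_fp)
```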

It’s hard to imagine curves that would yield equal false positive rates when overall rates differ. Kleinberg et al. show this can occur only under trivial conditions – for example, when we know each person’s outcome perfectly and have no false positives or negatives; this would be represented by bimodal peaks at 0% and 100% for both groups [6].

In a real-world model, the probabilities would not reflect the underlying process so perfectly, but we would still expect that when base rates differ, the model probability distribution curves would not coincide. Curves for one population would be shifted to the right or left of the other(s). Depending on the location of the decision threshold, one group would then have greater "weight" in the false positive- or false negative-generating region of the plots.

Incorporating Age into Fairness Metrics

One potential mitigation strategy for age-driven disparities in metrics is to stratify the population by age, then compare race/ethnicity outcomes within age buckets. One attempt at this solution is illustrated below for false positive rates:

In the above, we see that stratifying by age helps reduce discrepancies in metrics. However, some differences could still be considered meaningful. For example, for the age 40–69 group, the White alone false positive rate is 51% higher than the Hispanic or Latino rate. This is because the underlying White alone, not Hispanic or Latino group has more individuals near the upper limit of the age bucket, while the Hispanic or Latino group is weighted toward the lower end.
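A minimal sketch of this kind of stratified comparison, continuing from the earlier code (the bucket edges below are illustrative, not necessarily those used for the figure):

```python
import pandas as pd

# Compare false positive rates within age buckets, by group.
population["age_bucket"] = pd.cut(population["age"], bins=[18, 40, 70, 101],
                                  right=False, labels=["18-39", "40-69", "70+"])

# FPR = share of true negatives that the model predicts positive
negatives = population[population["y"] == 0]
fpr_stratified = negatives.groupby(["age_bucket", "group"], observed=True)["y_pred"].mean()
print(fpr_stratified)
```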

Choosing age groups to examine can be tricky, especially if data is limited. It’s desirable that the shapes of the age distributions are similar, or flat-ish, within age buckets. In practice, that can be difficult to achieve without using fine-grained age levels. In the US, the White alone, not Hispanic or Latino group has a large peak near age 70, whereas most other populations show decreasing counts around that age. Sometimes, specific age levels are relevant for the process you are modeling; for example, there are characteristic or cutoff ages for educational savings, retirement, and health insurance eligibility. Therefore, both the business question and the age distribution shapes are important in selecting buckets.

Although stratifying by age can be an art form, effects become apparent for even simple splits, which can help you decide whether age needs further consideration. Therefore, it’s usually worthwhile to stratify fairness metrics by age, even with just two large buckets.

Including vs. Omitting Age

One misconception is that if a model "adjusts for" (incorporates) age, fairness metrics will also be corrected. However, because age affects base rates, we expect fairness metric differences, including accuracy and error rate discrepancies, whether or not the model contains age. Above, I have shown discrepancies even for a near-perfect model that incorporates age.

If I construct a model that omits age, accuracy degrades, as expected given that this model is a less perfect approximation of the simulated process (see the Jupyter notebook [5]). Without age stratification, some metrics look more similar between race/ethnicity categories, mostly because error rates are worse for everyone. Again, stratifying by age reduces group metric disparities.
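A sketch of the age-omitting comparison, continuing from the earlier code (the notebook [5] has the full version, including stratified metrics for this model):

```python
from sklearn.linear_model import LogisticRegression

# Refit using only x, then compare accuracy by group with and without age.
model_no_age = LogisticRegression().fit(population[["x"]], population["y"])
population["y_pred_no_age"] = model_no_age.predict(population[["x"]])

accuracy_by_group = (
    (population["y_pred_no_age"] == population["y"])
    .groupby(population["group"])
    .mean()
)
print(accuracy_by_group)
```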

Adding age to this model is to some extent like a "rising tide that lifts all boats" – it reduces overall error rates. However, inclusion of age may make the model appear less fair because group differences may become more apparent.

Final Thoughts

Fairness metric differences that appear to be associated with race/ethnicity, gender, etc., may have age as a cause. Such "failures" arise even when a model incorporates age, and even in the absence of independent effects of group membership.

Stratifying metrics by age can partially correct some metric disparities. In addition, it is important to consider mechanisms by which age might influence outcomes. Is age’s effect primarily related to its correlation with another factor, for example income, education, health status, marital status, or job level? Or does it have an independent effect? Are there interactions with gender or race?

Deciding whether a model is fair or unfair requires understanding reasons for differences. Attempts to correct or adjust a model to equalize metrics could have unintended consequences if age effects are not considered. Moreover, it may or may not be reasonable to "excuse" differences in error rates across (for example) racial/ethnic groups because of age differences. Ethical review boards and stakeholders should take context into account, and ask questions related to age distributions, before making judgments based on fairness metrics alone.

References

[1] U.S. Census Bureau, B01001 Tables (2019), American Community Survey.

[2] Congressional Budget Office, Factors Affecting the Labor Force Participation of People Ages 25 to 54 (2018), Report, February 7.

[3] The Pew Charitable Trusts, Retirement Plan Access and Participation Across Generations (2017).

[4] Bureau of Labor Statistics, U.S. Department of Labor, Differences in Earnings by Age and Sex in 2005 (2006), The Economics Daily.

[5] V. Carey, GitHub Repository, https://github.com/vla6/Blog_age_fairness.

[6] J. Kleinberg, S. Mullainathan and M. Raghavan, Inherent Trade-Offs in the Fair Determination of Risk Scores (2017), Proceedings of Innovations in Theoretical Computer Science.

[7] G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, and K. Q. Weinberger, On Fairness and Calibration (2017), Advances in Neural Information Processing Systems, 5680–5689.

