The world’s leading publication for data science, AI, and ML professionals.

Race and Ethnicity in Data Science

Why it's important and how we should approach it


Photo by Jon Tyson on Unsplash

It’s undeniable that considering race or ethnicity (abbreviated as R/E; treated as a singular noun in this article, with statements referring to race and ethnicity collectively) is important when quantitatively studying healthcare outcomes. Ask any respectable statistician, epidemiologist, or data scientist and they’ll tell you at least this much! While the importance of R/E is widely understood, we can always strive to build a stronger fundamental vocabulary for why it matters. This article aims to summarize conclusions from the (fairly mature) literature on R/E in model-building. Specifically, I’ll briefly cover R/E in explanatory and predictive contexts (for more on the difference, check out my previous article on the topic!).

Below is an outline of this article. I’ve ordered the topics to follow my own progression in understanding R/E (1. what does this variable represent, 2. how do we record its measurement, 3. why is it important, 4. how do we draw statistical conclusions about it), but please feel free to skip to whichever topic interests you most!

  • Differences Between Race and Ethnicity
  • Encoding R/E
  • R/E in Predictive Models
  • R/E in Explanatory Models
  • Summary and Conclusions

Differences Between Race and Ethnicity

A widely accepted distinction between race and ethnicity is that race refers to a collection of (perceived) physical characteristics, while ethnicity encapsulates the sociocultural components of one’s identity. Here’s a brief example delineating the two:

A girl is born in China to Chinese parents, but as an infant, she was adopted by an Italian family in Italy. Ethnically, she grows up feeling Italian: She eats Italian food, she speaks Italian, she knows Italian history and culture. She knows nothing about Chinese history and culture. But when she comes to the United States, she’s treated racially as Asian.

A ThoughtCo article describes this difference in more detail, but the point I’d like to draw your attention to is that when we think about R/E (whether in our sample population, our target population, or the covariates of whatever model we build), we have to know exactly what we’re dealing with. Otherwise, including race in a model and then interpreting or applying that model with ethnicity unaccounted for, for example, may result in biased, inaccurate, or incomplete conclusions. Race and ethnicity are commonly collected as separate data variables, so understanding the difference is one fundamental step everyone can take toward better understanding their data. The meaning of R/E in a model is discussed in more detail below in "R/E in Explanatory Models".

Encoding R/E

Race and ethnicity are what are known as nominal categorical variables. This means that there is no inherent order to the categories, unlike, say, cold, warm, hot, which have an inherent temperature-based ordering. There are many different ways to encode categorical variables (UCLA IDRE has a nice list), but two common ones are simple and dummy encoding.

In R, you can simple-encode this variable in the following way:

# simple encoding: store race as a factor with an explicit level order
df$race <- factor(df$race, levels = c("white", "black", "asian", "hispanic"))

# A tibble: 4 x 3
  personid desired_encoding race    
     <dbl>            <dbl> <chr>   
1        1                1 white   
2        2                2 black   
3        3                3 asian   
4        4                4 hispanic

where you can explicitly specify the levels of race. In this encoding, each estimate you get from your regression model will reflect a comparison of the mean of the dependent variable (e.g. blood pressure) for a given race against that of the white group, since white was specified as the reference level (the first level listed).

Alternatively, you can dummy-encode this variable like this:

# A tibble: 4 x 5
  personid is_white is_black is_asian is_hispanic
     <dbl>    <dbl>    <dbl>    <dbl>       <dbl>
1        1        1        0        0           0
2        2        0        1        0           0
3        3        0        0        1           0
4        4        0        0        0           1

The estimates will differ in that they now reflect different comparisons (e.g., is_black compares the expected dependent variable for Black versus non-Black individuals). You should choose the coding system that yields the most meaningful comparison for your goal.
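For reference, the dummy columns above can be generated in R with model.matrix(). This is a minimal sketch; the data frame is invented to match the table above, and note that model.matrix() orders levels alphabetically, so the column names will differ from the is_* names shown:

```r
# toy data frame matching the table above
df <- data.frame(
  personid = 1:4,
  race = c("white", "black", "asian", "hispanic")
)

# "~ 0 + race" drops the intercept so that EVERY race level gets
# its own 0/1 indicator column (raceasian, raceblack, ...)
dummies <- model.matrix(~ 0 + race, data = df)
cbind(df["personid"], dummies)
```

In an actual regression, you would normally keep the intercept and let R drop one level as the reference instead of creating all four columns yourself.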

One pitfall to avoid: do not use R/E as a continuous variable! In other words, with simple encoding, make sure the model understands it to be a factor variable. Otherwise, it will generate estimates for R/E assuming a linear relationship across the numeric codes, which is nonsensical.
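Here is a quick sketch of that pitfall (the blood pressure values are invented for illustration):

```r
df <- data.frame(
  race_num = c(1, 2, 3, 4),          # simple-encoded race stored as plain numbers
  bp       = c(120, 130, 118, 125)   # made-up blood pressure outcome
)

# Treated as numeric, lm() fits a single slope: a nonsensical
# "linear effect of race" (2 coefficients: intercept + slope)
coef(lm(bp ~ race_num, data = df))

# Wrapped in factor(), lm() estimates each group against the
# reference level (4 coefficients: intercept + 3 group contrasts)
coef(lm(bp ~ factor(race_num), data = df))
```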

Also, be aware that R/E in data is complex; this article won’t go into depth about how to handle intersectional or more specific R/E identities.

R/E in Predictive Models

There are probably many nuanced ways to think about R/E within predictive modeling, but in this section, I’ll focus on the types of potential bias in "non-R/E-representative" datasets and the controversy surrounding R/E as a predictor variable.

Types of bias

Racial bias in predictive algorithms has received a ton of attention over the years in the news (e.g., the Google Photos "gorilla" labeling incident), in books (e.g., Weapons of Math Destruction), and in a plethora of journal articles. In healthcare, the goal of an algorithm is generally to produce a prediction or measurement that informs a health need (e.g., predicting disease risk). Crucial to this goal is articulating who the algorithm is meant for: predicting disease risk in American males between the ages of 40 and 65? British children under 18? When we don’t consider the who, we run the risk of generating bias. Ziad Obermeyer, a physician-researcher at UC Berkeley who has written several articles on the subject and, most recently, the Algorithmic Bias Playbook, describes two main types of bias: representation bias and measurement bias.

For the first type, Sjoding et al. provided a striking example: algorithms that used pulse oximetry measurements to predict patients’ blood oxygen levels performed poorly on Black patients because they were trained primarily on white patients. Here, R/E matters in that, ideally, you would train on a representative dataset in order to make public health decisions (as a tangent, in other cases a nonrepresentative study should not always be immediately discarded). And if representative data is not available, then at the very least, strong and explicit caveats should accompany the predictive model.

For the second type, Obermeyer describes a scenario where the variables a model predicts can be biased to begin with. For instance, consider healthcare cost as a proxy variable for future health needs (a proxy reflects our best attempt at measuring something). A model finds that Black people are predicted to have lower healthcare costs than white people, so a naive follow-up would be to focus targeted policy efforts on improving health outcomes for white people. However, this result actually stems from the fact that Black people in the US have lower recorded costs in the data due to discrimination and barriers to accessing healthcare.

R/E as a predictor variable?

Setting aside the data reflecting structural inequalities related to R/E, I want to provide a brief overview of how R/E is modeled (or not). Race and ethnicity are included in some popular prognostic clinical prediction models, such as the Pooled Cohort Equations for cardiovascular disease and the NIH breast cancer risk assessment tool. However, you may be surprised to hear that including R/E is not all that common; in fact, only 3% (23/854) of cardiovascular prediction models included it. Including R/E in predictive models is deeply controversial within the literature, and I’ll attempt a brief summary of the arguments.

The main argument against its inclusion is that, some claim, it evokes major elements of racial profiling in race-sensitive decision-making. The main counterpoint is that racial profiling in healthcare is fundamentally different from racial profiling in, say, law enforcement or insurance. Jessica Paulus, a researcher at Tufts, argues that whereas racial bias in models in fields like law enforcement results in unambiguous punishments and rewards, the effects in healthcare are less clear-cut. There are harms and benefits to both under- and over-estimating risk, so including R/E as a variable serves to improve the accuracy of predictions, which (on average) allows for improved clinical decision-making. Contrast that with a so-called "race-blind" predictive model that doesn’t account for race, which may reduce accuracy (and, consequently, the quality of decisions based on the predictions) for all R/E groups, especially those underrepresented in the data.

To be fair, R/E is a complex variable, and while some models have found some manifestation of R/E to be a statistically significant predictor, it remains to be seen how practically valid its predictive abilities are, given how nuanced and fluid R/E is (read more about it in this article).

R/E in Explanatory Models

In this section, I’ll briefly introduce the challenge of interpreting R/E in a causal context. Recall that the goal of explanatory models is often to establish causal mechanisms. Understanding racial disparities in health outcomes in order to improve targeted public health policies and interventions is a long-standing goal, and explanatory models are often used to gauge the extent of racial disparities associated with a health outcome. One of the seminal articles on how to thoughtfully include and interpret R/E as a variable in explanatory models comes from Harvard professor Tyler VanderWeele.

It is a complex topic, and as a personal note, I would recommend understanding some fundamentals of causal inference before attempting to read this article.

The problem surfaces in the following common way:

You have an outcome you’d like to explain, and you’re interested in R/E as your primary exposure. You also remember from your stats classes to adjust for confounders, so you pick some variables that you feel are relevant. You build the model and come up with a regression estimate for R/E.

It is at this point that VanderWeele recommends you stop and re-evaluate what you’ve done so far. How do you interpret that regression estimate? The challenge is that once you’ve included other variables in the model, the interpretation of R/E changes (if you really want to go down the rabbit hole, here’s a comprehensive but relatively abstract review of what a confounder means). The following diagram illustrates the defining characteristic of a confounder: it affects both the exposure (e.g. R/E) and the outcome (e.g. cardiovascular risk).

Author's own image
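The diagram can be made concrete with a toy simulation (all variables and effect sizes here are invented): the exposure has no causal effect on the outcome, yet omitting the confounder produces a large spurious estimate.

```r
set.seed(42)
n <- 10000
confounder <- rnorm(n)                    # e.g. an SES-like variable
exposure   <- confounder + rnorm(n)       # confounder affects the exposure...
outcome    <- 2 * confounder + rnorm(n)   # ...and the outcome; NO exposure effect

# Unadjusted model: a sizable (spurious) "effect" of exposure
coef(lm(outcome ~ exposure))["exposure"]

# Adjusting for the confounder shrinks the exposure estimate toward zero
coef(lm(outcome ~ exposure + confounder))["exposure"]
```

This is exactly why the choice of adjustment variables, and therefore the question of what R/E "is" causally, matters so much for interpretation.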

However, if the goal is to understand the underlying causal mechanisms by which an exposure leads to an outcome, we have to ask what R/E really encapsulates. Is it the biological effects of skin color (e.g. darker skin protects against UV light)? The health behaviors resulting from others’ perception of skin color (e.g. discrimination in hospitals)? Genetic background? Familial socioeconomic status (SES)? When we want to make a causal statement of effect, we want to control for as many variables as possible (think of a randomized controlled trial or a scientific experiment). But if we don’t even know what an R/E effect really means, it will be difficult to decide which variables to control for and how to interpret R/E. For instance, VanderWeele concludes that one possible interpretation of a coefficient for R/E in a model is along the lines of (paraphrasing here):

What would happen to an observed health difference if you set the distribution of all conceivable aspects that R/E captures (i.e. physical phenotype, genetics, SES, etc.) for one R/E group equal to that of another.

The bottom line: including R/E in an explanatory model is not a straightforward task, and you should openly question instances where you see it done.

Summary and Conclusions

R/E in model-building is a complex topic that’s rich with discussion. Many brilliant experts have weighed in on their thoughts, and at the very least, listening in on this conversation has personally made me a more thoughtful and knowledgeable data scientist and epidemiologist. I’ll end this article with a series of questions you can ask yourself anytime you find yourself handling R/E in your own data.

  • Can I explain the difference between race and ethnicity to someone?
  • How is R/E in my data encoded? Does the encoding match my research question?
  • Who is this predictive model valid for?
  • Are there any variables in this model that might suffer from historic or systemic racial biases?
  • How has R/E been interpreted in this explanatory model? Does it make sense?
