Legacies of Statistics & AI

p-values: A Legacy of “Scientific Racism”

A deeper look at the untold history of p-values and its legacy

Raveena Jay
Towards Data Science
12 min read · May 26, 2022


Disclaimer: I’ll be putting quotes around certain words, like “race”, “racial measurements”, etc., because not only are these terms loaded with historical discrimination, but “race” itself is an un-scientific term. Also, the reader should have a bit of familiarity with p-values, hypothesis testing, and Bayes’ theorem.

Photo by Edge2Edge Media on Unsplash

It’s easy to think of science as objective. Science, after all, is about studying and recording what occurs in the natural world, right? The natural world is “beyond” the human realm, beyond our imagining and our minds, as science classes would teach.

But we should not lose sight of who is doing the science: sure, a microscope will help you examine smaller objects and organisms more closely, and yes, an X-ray telescope can reveal characteristics of a galaxy that might not be obvious in visible light — but in the end, we humans are doing the observing. And we are doing the interpreting of the data.

Before we delve into the history of p-values and some possible alternatives (along with their shortcomings), let’s briefly go over the definition.

In statistics, the definition is: the p-value of a hypothesis test is the probability that your test statistic takes the observed value or one more extreme, assuming your null hypothesis is true. In mathematical notation (for a right-tailed test statistic), this would be

p-value = P(X ≥ x | H₀)

The vertical line means “given H₀”, which means “assuming the null hypothesis is true”. The value x would be, say, a “z-score”, a “t-statistic”, or a “chi-squared statistic” — if those are familiar words from your statistics classes.

One of the most common mistakes is assuming that the p-value tells you the probability of your null hypothesis given the data (the evidence). This is wrong. The mathematical notation for this would be:

P(H₀ | x)

These two formulas are different, but it’s easy to confuse the two. Usually, after doing experiments, scientists want the second quantity but are stuck with the first. Later we’ll see how we could get from one to the other, and how that connects to alternative ways of measuring the statistical results of experiments.
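
To make the first formula concrete, here’s a minimal sketch in Python (my own illustration, not from the article), assuming a hypothetical one-sided z-test where the observed z-score is 2.1:

```python
# Minimal sketch: p-value as P(test statistic >= observed value | H0).
# The z-score of 2.1 is a made-up number, purely for illustration.
from scipy import stats

z_observed = 2.1  # hypothetical z-score from some experiment

# Under H0 the z-statistic follows a standard normal distribution,
# so the one-sided p-value is the area to the right of z_observed.
p_value = stats.norm.sf(z_observed)  # sf(x) = 1 - cdf(x) = P(Z >= x | H0)
print(f"one-sided p-value: {p_value:.4f}")  # roughly 0.018

# Note: this number is P(data at least this extreme | H0),
# NOT P(H0 | data) -- the quantity researchers usually want.
```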

Wikipedia has a nice illustration of this common error:

Image source: https://en.wikipedia.org/wiki/P-value#/media/File:P-value_in_statistical_significance_testing.svg

Statisticians have a love-hate relationship with p-values, and recently they have been critiquing them as a measurement of “significance” in statistical testing and warning scientists about common pitfalls. Don’t believe me? Read this 2015 article by the National Institutes of Health about it. We’ll revisit this love-hate relationship later on, so keep it in the back of your mind.

What is Scientific Racism?

Before I discuss the history behind the p-value, I need to briefly give some background on the phrase “scientific racism” — because not only is it loaded with years of racial policies supported by governments, but it’s one of the histories intertwined with p-values and significance testing in statistics. “Scientific racism”, or more accurately pseudo-scientific racism (because racism is not scientific), was a way in which European colonial governments — and the statisticians they hired to do government surveys and data collection — justified their racist policies by using statistical measurements, often in an extremely biased and incorrect way. For example, if you’ve ever heard of the infamous “skull measurements” used by European pseudo-scientists in the colonial era to try and demonstrate a (fake) correlation between skull size and intelligence — they did that in order to put scientific backing behind claims such as “Africans, Native Americans and Asians are less intelligent than Europeans due to smaller skull size.” Of course, these claims are completely un-scientific and baseless — but using scientific terminology and measurements as backing was a way to soothe their guilty consciences.

Now that I’ve given a brief context to what scientific racism is, we’ll see more specific examples later in the piece related to significance testing and p-values. But first — a bit of history and context for the p-value.

I’ll be citing mathematician Aubrey Clayton’s wonderful article, “How Eugenics Shaped Statistics” in Nautilus, about the history of statistics.

The History of the P-value

P-values didn’t just come out of the blue; they grew out of efforts by mathematicians in the 1700s and 1800s to understand the theoretical concepts of probability and randomness, and to tie them to calculating the statistical results of physical experiments — like rolling a die, flipping a coin, or more complex physics experiments.

If you’ve ever taken a statistics class (or are a statistician, or a statistics major in college) I’m sure you’ve heard of names like Galton, Pearson and Fisher. We’ll remind ourselves that they were human beings (sometimes it was hard for me to remember that after hearing [insert name]’s formula so many times in class!). Although these individuals propped up statistics as an infallible, objective way to draw conclusions from data, even in their own work these statisticians “revealed how thin the myth of objectivity always was”, as Clayton writes.

Galton came from an upper-class Victorian family, and he often published ideas about how human intelligence could be “refined” by selective breeding in humans — by breeding the wealthiest, “learned” men as opposed to other classes in British society. Pretty unsurprising…I know. He was also alive during the 1850s, when Europe was ever-expanding its colonialism in Asia and Africa, and America was still deeply entrenched in racist slavery and segregation. So it’s no surprise that Galton applied his “breeding” ideas to different “races” (as racist scientists used the term) of people.

But how does this tie in with racism? Well — when colonial European “scientists” began measuring the heights, weights and appearances of “races” in the world, they leaned on European statisticians like Galton to make conclusions based on the data. He and his contemporaries believed that the measurements of people in each of their “races” would follow a bell curve — the normal distribution.

Image source: https://en.wikipedia.org/wiki/Normal_distribution#/media/File:Normal_Distribution_PDF.svg

If these “racial measurements” followed a normal distribution — well, since every normal curve has a specific mean and standard deviation — it “follows” that each “race” of people has an “average look”. If this logic already sounds creepy to you — that’s because it IS — and it’s frankly mind-blowing how some of these statisticians convinced themselves their research was wholly “scientific.”

Raveena, how is the p-value related to all this?!

Well, later in the 19th century, scientists studying all manner of (non-human) biological creatures noticed that there were quite a few “unusual measurements”, as Clayton notes, and one statistician working toward the end of that century — Karl Pearson — yes, of beloved Pearson’s correlation — wondered how to decide whether data points were normally distributed. In theory, if you could figure that out, it would tell you a boatload of information, especially for detecting the outlier (“unusual”) data that scientists were puzzling over.

As Clayton notes, before Pearson, statisticians would simply draw a histogram of the results, and then basically play connect-the-dots (or connect-the-bars??) to see if it looked like a bell curve. But Pearson got more mathematical in his approach, and created a 3-step process:

  1. Assume the data fits some type of distribution you initially had in mind. This is the “null hypothesis” I mentioned at the beginning of the article. Also write down your “alternative” hypothesis.
  2. Compare the data to what you would expect under the distribution assumed by the null hypothesis, and calculate what’s called Pearson’s “chi-squared test statistic.”
  3. Compute the p-value of this test statistic, and determine if the result is “significant”. And assess if this provides any evidence in favor of the alternative hypothesis.

Clayton notes how Pearson didn’t actually believe that statistical significance necessarily meant “worthy of scientific discovery” — he literally just used the word to mean, “oh, the data is indicative of a different hypothesis.”

So here, we can see that the “outliers”, or “unusual observations”, are the unlikely ones marked off by the test statistic. And the p-value is just the total area (probability) of that region.
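
To see Pearson’s three-step recipe in action, here’s a minimal sketch in Python (a hypothetical example of mine, not Pearson’s original biological data): we pretend we rolled a die 600 times, take “the die is fair” as the null-hypothesis distribution, and let scipy compute the chi-squared statistic and its p-value.

```python
# A toy goodness-of-fit test following Pearson's 3 steps.
# Hypothetical data: counts of each face over 600 die rolls.
from scipy.stats import chisquare

observed = [95, 110, 88, 105, 112, 90]   # step 1: data + H0 ("fair die")
expected = [sum(observed) / 6] * 6       # 100 per face if H0 is true

# Step 2: Pearson's chi-squared statistic; step 3: its p-value.
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {stat:.2f}, p-value = {p_value:.3f}")

# A tiny p-value would flag the counts as "unusual" relative to the
# assumed distribution -- exactly the outliers Pearson wanted to detect.
```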

The Cardinal Sin of Justifying Racism with Statistics

But the main sin that Pearson and his statistician contemporaries committed was not applying these tests to crabs, as Clayton points out — but to human beings. Humans are far too diverse in our life experiences and cultures, but that didn’t stop Pearson from trying to apply his statistical methods to “biological measurements” made on different “races” of humans, as he thought of it. In the early 1910s Pearson did a statistical study of 4,000 pairs of siblings in the UK and “determined” that measurements of eye color and other physical features had “strong correlations” — according to the p-value significance test — with character traits such as “assertiveness and introspection” (Clayton). By today’s standards of science we would immediately dismiss this claim, but just a decade later, when Jewish refugees fled Eastern Europe for Britain, Pearson applied the same flawed p-value correlation analysis to Jewish children — and he concluded from his significance testing that they were of “inferior stock” and that their “intelligence had almost no correlation with environmental factors and discrimination” (Clayton). If you think you’d be laughing in his face for this judgment — you’re not alone — but don’t forget that Pearson’s correlation still exists in Python and R libraries. It’s been very difficult to separate the fact that these statisticians supported eugenics judgments using mathematics from their otherwise valid and impressive contributions to the field.

Statisticians like Pearson, Galton, and later professors like Davenport believed in eugenics and racism, and would use statistical arguments to back up horrible theories on the “dangers of interracial mixing”, “breeding within the race to preserve intelligence”, and various other arguments (Clayton). Pearson even involved himself with “eugenics departments” and societies, both in universities and outside of them in America — organizations which were adamant about preventing race-mixing between White Americans and people of color (Asians, Africans, Native peoples, etc.).

So What’s the Story Now?

Fast-forward about 100 years: in 2019, almost a thousand statisticians and scientists called for a deep critique of significance testing, and their call for change was echoed at the American Statistical Association. In fact, the statistician Ron Wasserstein gave an amazing talk that same year at the United States Conference on Teaching Statistics on why the p-value — especially the low bar for significance at p < 0.05 — should be re-evaluated.

Their valid arguments point out that a binary “significant or not” decision between two hypotheses, made against a largely arbitrary threshold, isn’t that meaningful a statistical task. And given enough variables in your data, “spurious correlations” — as they are called — can often occur: because of the way the probabilities of many independent tests combine, the chance that at least one measurement clears the “p < 0.05” bar grows quickly — even if that measurement is not meaningful to the observer.
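
Here’s a quick simulation of that multiple-testing problem (my own sketch, with made-up noise data): every variable is pure noise, yet with dozens of tests a few of them will typically sneak under p < 0.05 — fittingly, using Pearson’s own correlation.

```python
# Spurious correlations: many tests on pure noise still yield "significant" hits.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_samples, n_variables = 50, 40

# Forty predictor columns and one outcome, all independent random noise,
# so every "significant" correlation below is spurious by construction.
X = rng.normal(size=(n_samples, n_variables))
y = rng.normal(size=n_samples)

false_alarms = 0
for j in range(n_variables):
    r, p = stats.pearsonr(X[:, j], y)
    if p < 0.05:
        false_alarms += 1

print(f"{false_alarms} of {n_variables} pure-noise variables cross p < 0.05")
# On average we expect about 0.05 * 40 = 2 such false alarms per run.
```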

So if p-values and so-called “classical hypothesis testing” are in need of critique and review, what are some alternatives, and are they the “best” options out there?

Possible Alternatives

Do the alternatives wander into similarly vague and risky territory as p-value significance? Well, the answer is, as always: it’s complicated, but there are some avenues.

Prior Context to Statistical Data

In the beginning of the article, I mentioned how it’s easy to think of statistics and scientific data measurements as objective, when they really aren’t — the key point is that data don’t speak for themselves — they must be interpreted in a larger context to reach an actual conclusion.

For example: say I find a coin on the sidewalk and do an experiment: I flip it 10 times, and it comes up with 9 heads and 1 tail. Now say person A, Jason, has never seen a coin before — maybe he’s a time-traveler from the year 2100 where no coins exist, his society has no historical memory of coins, and everyone only uses electronic payment — just go with me here! How would Jason interpret the data? Well, since 9 out of 10 flips are heads, the data suggest that the probability of a head “should” be 9/10. At least, the evidence is consistent with the hypothesis that p = 9/10. Since Jason has never seen a coin before, it makes sense that he would believe just the evidence.

Person B, Verity, is from 2022, and has definitely seen and held coins before and has common knowledge about them. Again, the data suggest that since 9 out of 10 flips are heads, the evidence is consistent with the hypothesis that p = 9/10. BUT, here’s the catch for Verity: from her experience (and society), randomly-found coins on sidewalks are usually 50–50 fair, so she has prior context about how coins work. She would hold the “fair coin” (p = 1/2) assumption pretty highly in her mind and conclude that the 9-heads data was a statistical fluke in the context of her hypothesis.

(By the way sorry, Jason — I get that you come from a vastly more technologically advanced time period than me and Verity, but…this time around, I was flipping a fair coin. The lack of context about coins skewed your conclusion!)

If I had flipped the coin 100 times and gotten, say, 90 heads — that might be enough data for Verity to change her assumption about the coin being fair — in other words, the odds of her believing the coin is biased go up. MIT cognitive scientist Joshua Tenenbaum has a great YouTube video detailing this exact experiment, around the 37:20 mark.
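
Here’s a small sketch of Jason’s and Verity’s reasoning in Python (the prior numbers are hypothetical, chosen just to illustrate the point): both see the same 9-heads-out-of-10 data, but they weigh a “fair coin” hypothesis against a “heads-biased coin” hypothesis with very different prior beliefs.

```python
# Same data, different priors -> different conclusions (Bayes' theorem).
from scipy.stats import binom

heads, flips = 9, 10
hypotheses = {"fair (p=0.5)": 0.5, "biased (p=0.9)": 0.9}

# Likelihood of observing 9 heads in 10 flips under each hypothesis.
likelihood = {name: binom.pmf(heads, flips, p) for name, p in hypotheses.items()}

priors = {
    "Jason":  {"fair (p=0.5)": 0.5,   "biased (p=0.9)": 0.5},    # no context
    "Verity": {"fair (p=0.5)": 0.999, "biased (p=0.9)": 0.001},  # strong prior
}

for person, prior in priors.items():
    unnormalized = {h: prior[h] * likelihood[h] for h in hypotheses}
    total = sum(unnormalized.values())
    posterior = {h: round(v / total, 3) for h, v in unnormalized.items()}
    print(person, posterior)

# Jason ends up favoring the biased coin; Verity's strong prior means she
# still leans toward "fair" and treats the 9 heads as a fluke.
```

Swap in 90 heads out of 100 and even Verity’s strong prior gets overwhelmed — which is exactly the shift described in the Tenenbaum experiment above.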

Interpreting the Data to Make Conclusions

Notice that earlier, Verity interpreted the same data in the different context of her own hypothesis. The word “context” is important. It turns out that the field of “Bayesian statistics” is quite concerned with the contexts of hypotheses — what it calls “prior context”, or “prior assumptions” — and the prior assumptions Jason and Verity made greatly shaped the conclusions they drew from the same data. And it turns out Bayesians are trying to provide alternatives to the usual significance testing.

One of these alternatives is called “Bayes factors”. The Bayes factor compares how likely the observed data are under one hypothesis versus the other. In effect, it tells us how much the data (evidence) support the alternative over the null, which can be very useful. Mathematically, it’s written like this:

K = P(data | H₁) / P(data | H₀)    (Equation 1: the Bayes factor, written as “K”)

posterior odds = K × prior odds    (Equation 2: Bayesian inference, written as odds)

If you’ve ever seen bettors write their gambling odds as “2-to-1”, or “the odds of a win like that is 1 in 1,000”, that’s the same type of odds we are using here. “2-to-1” and “1 in 1,000” as odds would be written as 2:1, and 1:1000, respectively.

Steven Goodman, a Professor of Medicine at Stanford University, has a great article here comparing Bayes factors and p-values, and he notes that Bayes factors around 200 — meaning the data are about 200 times more likely under the alternative hypothesis than under the null — often have a “corresponding” p-value around 0.001, indicating that “classical significance” might need to use stronger p-value conditions. Statistician Brendan Kline of the University of Texas at Austin notes in his research paper that Bayes factors around 0.1 denote “strong” evidence against the null hypothesis in the context of big data (note that some authors write the factor with the null hypothesis on top, so small values count against the null), and that for hypothesis testing on those types of data sets, a corresponding p-value for “significance should be less than 0.0025” (Kline). And back in 2018, this article in Springer Nature Psychiatry mentioned that the strength of Bayes factors lies in complementing p-values and effect sizes when it comes to hypothesis testing. However, the strength of the Bayes factor does depend on the “width” of the prior distribution, which is set by the researcher — so again, if you’re looking for some “objective” measure, you’re out of luck. Finally, the Bayes factor by itself does not determine which hypothesis you, the researcher, should choose — you must combine it with the prior odds, determined wholly by the researcher before seeing any data.
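
To tie the two equations together, here’s a minimal sketch (my own toy numbers, not Goodman’s or Kline’s examples) that computes a Bayes factor for the sidewalk-coin data and then combines it with Verity’s prior odds:

```python
# Equation 1: Bayes factor K = P(data | H1) / P(data | H0).
# Equation 2: posterior odds = K * prior odds.
from scipy.stats import binom

heads, flips = 9, 10
p_data_given_alt = binom.pmf(heads, flips, 0.9)    # H1: heads-biased coin
p_data_given_null = binom.pmf(heads, flips, 0.5)   # H0: fair coin

K = p_data_given_alt / p_data_given_null
print(f"Bayes factor K ~ {K:.1f}")                 # data favor H1 by ~40x

# The prior odds are chosen by the researcher BEFORE seeing the data;
# here we reuse Verity's (hypothetical) 1-to-999 odds against a biased coin.
prior_odds = 1 / 999
posterior_odds = K * prior_odds
print(f"posterior odds (biased : fair) ~ {posterior_odds:.3f}")  # still < 1
```

The Bayes factor alone says the data favor the biased-coin hypothesis by roughly a factor of 40, but once Verity’s strong prior odds enter Equation 2, the posterior odds still favor the fair coin — which is exactly the point that the researcher’s prior assumptions matter.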

Final Thoughts

I hope that by reading this, you’ve gotten a sense of the history of p-values, of the problematic applications of p-values and classical hypothesis testing that revealed their limitations, and of the attempts researchers are making to remedy those challenges. I don’t want readers to think that p-values are useless after reading this article. P-values are far more common in statistical software like R or Python’s scikit-learn and scipy.stats libraries — Bayesian techniques are still relatively new to Python’s mainstream data libraries. But it’s important to understand why the critique of classical testing exists, and an underlying theme of statistics is that subjectivity runs through and through the field. The data almost never speak for themselves. They have to be interpreted in the surrounding context.


I recently earned my B.A. in Mathematics and I'm interested in AI's social impact & creating human-like AI/ML systems. @raveena-jay.