Data Isn’t Usually Normal

Use other data science techniques to unlock more accurate insights

Tony Pizur
Towards Data Science


Photo by James Cousins on Unsplash

Despite what you learned in statistics class, most data isn’t normally distributed. Think about commonly cited examples of normally distributed phenomena like height or college entrance exam scores. If you could gather the individual heights of all people on Earth and plot them on a histogram, the left-hand tail would be heavier than the right; infants and children are shorter than adults, so the distribution is skewed. The blunt statistical fix is to remove the non-complying observations, like children, or pick a small subset of the population that follows a normal pattern. The classic normal height example is from a very specific subset: adult male prisoners in 19th-century London jails. College entrance exams aren’t neatly distributed either. They are graded using a raw score — the number of correct answers — that is then scaled to a normal distribution. We know the raw scores aren’t normally distributed because otherwise they wouldn’t need to be scaled and fitted to a normal curve. Real life isn’t so normal after all.

This article begins with a review of the uses and limitations of the normal distribution. We will then examine two case studies with non-normal data. Traditional statistical tools will be used to help identify the distributions. A data-fitting function that generates unexpected results will also be applied. Optimal solutions for both case studies will be developed. However, the most important takeaway may be that the shape of data itself reveals a prodigious amount of information, perhaps more than any statistical tests.

The Normal Distribution

The normal distribution is closely associated with German mathematician and physicist Carl Friedrich Gauss, who formalized it in the early 19th century; that is why it is also known as a Gaussian distribution. Gauss recorded and analyzed observation errors from planetoid movements. This is where the term “observation” comes from in regression and other analyses — literally from observing through a telescope. In essence, the normal distribution was about human error in observing phenomena. In many instances, that is still its main purpose today.

The normal distribution has two parameters: the mean (µ) and standard deviation (σ). For any value x, its probability density function is: [1]

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}

The curve is symmetrical around the mean and is bell-shaped. For a mean of zero and standard deviation of one, the curve for a million random normal observations looks like this:

Image by Author
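
For reference, a plot like the one above can be reproduced in R with a few lines. This is only an illustrative sketch; the variable name x is arbitrary:

set.seed(1)                        # for reproducibility
x <- rnorm(1e6, mean = 0, sd = 1)  # one million standard normal draws
plot(density(x), main = "Density of 1,000,000 N(0, 1) observations")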

There are features of the curve that make statistical analysis more convenient. The mean, median, and mode are all the same. Being a probability distribution, the area under the curve will equal one, even when the parameters are changed. Its symmetry is ideal for testing the tails for extreme values.

The law of large numbers (LLN) doesn’t directly relate to the normal distribution, but it’s important vis-à-vis the central limit theorem (CLT). The LLN states that as we take more samples from a distribution, their average becomes more likely to be close to the mean of that distribution. This makes intuitive sense because more samples imply a closer approximation to the distribution. The LLN works on any distribution with a finite mean, even highly non-normal ones, so the law’s assumptions are about the samples themselves. Each sample must be random, independent, and identically distributed. This is sometimes called “i.i.d.” In a mathematical sense, this means each observation is drawn from the same distribution, with the same mean and standard deviation. It’s important to note that the LLN governs only random errors and does not account for systematic error or sampling bias. [2] The CLT states that as we take more i.i.d. samples, the distribution of their sample means converges to a normal distribution. Curiously, the CLT doesn’t require large samples; the normal approximation often holds with as few as 15 to 30 observations, depending on the underlying distribution.
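
A quick simulation in R makes the CLT concrete. This is only an illustrative sketch with arbitrary choices (an exponential distribution, samples of size 30, and 10,000 repetitions), not data from the case studies below:

set.seed(42)
# 10,000 sample means, each computed from an i.i.d. exponential sample of size 30
sample_means <- replicate(10000, mean(rexp(30, rate = 1)))
# The exponential distribution is heavily skewed, yet the sample means are roughly bell-shaped
hist(sample_means, breaks = 50, main = "Sample means of exponential draws (n = 30)", xlab = "Sample mean")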

The LLN and CLT are tools to investigate the mean of a distribution and the shape of sampling errors. That’s it. Neither tool describes the shape or distribution of your data. Neither will transform your data to make it normal. Both have surprisingly strict requirements on the sampling data. Taken together, they are two of the most misused tools in statistics and data science, as we shall see in the two case studies.

Case Study: Hospital Quality

Scenario: Medicare set out to improve healthcare quality by rewarding hospitals that balanced patient safety and overall experience with lower insurance claims. The Department of Health and Human Services, which administers Medicare, created a Value-Based Purchasing (VBP) Total Performance Score (TPS) to compare how well each of about 3,000 hospitals performed each year. The score has a complex calculation that includes many heterogeneous inputs, like patient surveys, insurance claims, and readmission rates. As an incentive, the government withholds 2% of national Medicare hospital reimbursement and redistributes that amount according to each hospital’s TPS. Higher-scoring hospitals receive a bonus, while lower-scoring ones are penalized.

The VBP system sparked a research idea by a hospital executive. He proposed to research a possible link between a hospital’s quality and its leadership style. The TPS was used as a proxy for quality, and hospital quality managers were surveyed using a Likert-type instrument. The data was downloaded, and a questionnaire was emailed to 2,777 quality managers. Managers who worked for more than one hospital received only one questionnaire. Because prior TPS data had been analyzed by qualified experts and judged to be “a fairly normal distribution centered around a score of 37, with a small number of exceptional hospitals scoring above 80,” [3] it was assumed the current data was similar. A total of 150 completed questionnaires were returned, and the executive ran a multiple linear regression to look for any correlation between quality and leadership style. Several outliers were eliminated so that the error residuals looked more normal, and different transformations of the leadership data were attempted. The errors were then deemed normally distributed by appeal to the LLN and CLT. The results were disappointing: leadership style explained less than 2% of the variation in scores, and the P-values on the independent variables were far from significant. What went wrong? How would you advise correcting it?

Solution: Let’s start with the dependent data. The most recent VBP data did look similar to prior years, with the histogram and density of TPS appearing like this:

Images by Author

However, was the data “a fairly normal distribution” as the experts asserted? A Normal Q-Q Plot suggested that it was not:

Image by Author
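
For reference, a normal Q-Q plot like the one above can be produced with base R, assuming tps is the vector of Total Performance Scores:

qqnorm(tps)               # sample quantiles against theoretical normal quantiles
qqline(tps, col = "red")  # reference line through the first and third quartiles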

The scores deviated from the reference line at both extremes. If it wasn’t normal, then what was it? The descdist function in the fitdistrplus package (which builds on the MASS library in R) helps identify candidate distributions for univariate data. It computes the sample skewness and kurtosis and plots them against theoretical distributions on a skewness-kurtosis plot, also known as a Cullen and Frey graph; its companion function fitdist can then estimate parameters using methods such as maximum likelihood and moment matching. [4] The code to create the Cullen and Frey graph for Total Performance Scores (tps) is:

library(MASS)
library(fitdistrplus)  # provides descdist() and fitdist()

# tps: numeric vector of hospital Total Performance Scores
descdist(tps, discrete = FALSE, boot = NULL, method = "unbiased",
         graph = TRUE, obs.col = "darkblue", obs.pch = 16, boot.col = "orange")

The resulting graph is:

Image by Author

The blue dot observation of the TPS is on the theoretical gamma distribution line. It seems as if the experts were wrong — and with significant implications.

Recall that the researcher used multiple linear regression. The first assumption for this approach is that the dependent and independent variables are linearly related. This is intuitive because if they are not linearly related, the model is misspecified. Indeed, a dependent variable with a gamma distribution calls for a generalized linear model (GLM) using maximum-likelihood estimation rather than ordinary least squares (OLS). The link function in the GLM accounts for the extreme observations in the right-hand tail of the gamma distribution.
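
A minimal sketch of the properly specified model follows. The data frame vbp and the predictor leadership are illustrative names, not variables from the study; only the general form of the call matters:

# Misspecified ordinary least squares model
fit_ols <- lm(tps ~ leadership, data = vbp)

# Gamma GLM with a log link, fitted by maximum likelihood
fit_glm <- glm(tps ~ leadership, family = Gamma(link = "log"), data = vbp)
summary(fit_glm)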

Several common mistakes can lead to a misspecification. The initial question, “Does this data look normal?” is quite different from “What is the shape of this data?” Graphs can be viewed subjectively, leading a researcher to see the pattern they are looking for. [5] It can also be tempting to believe that if more samples are taken, the distribution will always look more normal. But the TPSs were a census of all hospitals, and thus the true, non-probabilistic distribution; no more samples could be taken. Misunderstanding the meaning of the data itself might be another factor. The scores weren’t really a proxy for quality; they functioned as an insurance reimbursement redistribution scheme. Insurance payments follow a gamma distribution, which is why the scoring system mirrored one. Research and subject matter expertise alone could have replaced much of this statistical work by revealing both that the data was gamma distributed and why.

Interestingly, running an OLS regression with a gamma-distributed dependent variable might produce estimated coefficient results that aren’t too far off from the properly specified GLM. [6] This may partly depend on the shape of the distribution (with “shape” being one of the specified parameters). However, the researcher unfortunately changed the data in two ways. By eliminating “outliers” on the right-hand tail, they inadvertently excluded the top-performing hospitals from the analysis. In other words, the scope of the inquiry was narrowed to leadership styles at non-excellent hospitals. The researcher also attempted square root transformations on the independent data to make it behave as desired; this exacerbated the misspecification. It again showed a misunderstanding of the data’s meaning since the “square root of leadership” is nonsensical.

There were other problems with the researcher’s approach. The TPSs were ultimately influenced by hundreds of workers in each hospital, who represent the true population to be surveyed. The researcher chose a purposive non-probability sampling technique, selecting quality managers only. Recall that to use the LLN and CLT, samples must be randomly selected and independently and identically distributed. The sampling frame no longer matched the true population and so was biased. This also meant the error terms were not normally distributed in the linear regression. With some statistical hand waving, it’s possible to claim the quality managers were the population under study, but the bias inherent in non-probability sampling cannot be measured or incorporated into the model. Furthermore, the selection of the quality managers was not i.i.d. Because some managers worked for more than one hospital, their surveys could not be tied to a single TPS, so the observations were not independent. Each of these factors contributed to the misconceived research project.

Conclusions: Let the data guide you as you formulate a research plan. From its composition and shape, you can infer its general meaning. Resist transforming data just to meet the needs of specific statistical tests you want to run. Don’t simply trust experts; research and verify what they’ve asserted, and in this way you become the expert on your data and your study. Remember that the LLN and CLT carry assumptions about probability sampling; not every dataset of 30 or more observations is governed by the mean and the normal distribution.

In our case study, the initial idea for a hypothesis might have been stated as: “There is a correlation between leadership styles and hospital quality.” The final study results were something closer to: “There is a misspecified linear correlation between the square root of leadership styles and a hospital insurance payment redistribution scheme (excluding top performers) as governed by probability techniques misapplied to non-probabilistic data, all subject to an unknown and unmeasurable statistical bias.” Most essentially, realize that each assumption and data manipulation has an effect on the meaning of results.

Case Study: Taxes and Social Justice

Scenario: A social scientist wanted to examine marginal tax rates to check for any correlation with various social justice measures. In theory, they wanted to find a relationship demonstrating that higher top marginal state taxes implied a stronger commitment to social justice in those states. The social justice data was to be mined later to fit the marginal tax data. Beyond this general notion, the research idea was unformed. The researcher created a vector, tax, of the highest marginal income tax rates for each state and the District of Columbia. The tax data turned out to be “problematic” because it was oddly distributed, and the social scientist wanted to know how to run statistical tests with it using the normal distribution.

Solution: Data can be “problematic” for a number of reasons. It can be inaccessible or incomplete. Data may be mismeasured. A data frame may not represent the population. However, correctly measured and sufficiently collected information about a population tells its own unique story. Difficulty analyzing that data may prove to be a “problem,” but the data itself may be quite perfect. In the case of the tax rates, take a look at the histogram and density plots:

Images by Author

Based on these graphs, the data is definitely not normal, particularly because of the heavy left tail. The Cullen and Frey graph tells a different story, though:

Image by Author

The Cullen and Frey graph plotted the data as normal. This happened because our data follows two different probability distributions. It’s difficult to see at first, but the clue is the 9 zero observations in the histogram. A top marginal rate of 0% indicates a state has no income tax, while any other value implies a tax exists. This kind of either/or classification follows a binomial distribution. The states that do have an income tax follow a different distribution for their top marginal rates. Keeping only the jurisdictions that levy an income tax takes a single line of code,

nozerotax <- tax[tax != 0]  # drop the 9 jurisdictions with no income tax

we can graph the nozerotax data:

Images by Author.

The Cullen and Frey graph for the remaining 42 tax jurisdictions indicates a beta distribution. This is exactly what we would expect because percentages often take the form of a beta distribution, and marginal tax rates are expressed as percentages.
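
To go one step further, the fit could be checked with fitdist from the same fitdistrplus package. This sketch assumes the rates are stored as percentages (for example, 5.75 for 5.75%), so they are divided by 100 to land in the (0, 1) interval the beta distribution requires:

library(fitdistrplus)

beta_fit <- fitdist(nozerotax / 100, "beta")  # maximum-likelihood beta fit
summary(beta_fit)  # estimated shape parameters, log-likelihood, AIC
plot(beta_fit)     # density, CDF, Q-Q, and P-P diagnostic panels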

Conclusions: The researcher picked a single variable, tax, that followed two different distributions. This makes the LLN and CLT inapplicable because i.i.d. assumptions were violated. By extension, any statistical test relying upon a normal distribution cannot be used. The tax variable is not suited for the researcher’s goals and desired statistical methods.

Moving Forward

Every single methodological and statistical decision a data scientist makes will impact the results of their study. Because so many of the statistical tests taught in college depend on normal distributions, it may be tempting to contort our data or overlook key assumptions to run the tests we know well. It’s a bit like the old saying: if the only tool you have is a hammer, everything looks like a nail. Rather than asking whether data is normal, start by asking: How is it distributed? And what is that telling me? You may discover far more than you anticipated.

References and Data Sources

[1] U.S. Department of Commerce, NIST/SEMATECH e-Handbook of Statistical Methods (2012), https://www.itl.nist.gov/div898/handbook/pmc/section5/pmc51.htm

[2] J. Orloff and J. Bloom, “Central Limit Theorem and the Law of Large Numbers,” Introduction to Probability and Statistics, (2014), MIT OpenCourseWare, https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading6b.pdf

[3] E. Klein and P. Shoemaker, “Value-Based Purchasing: A Preview of Quality Scoring and Incentive Payments,” Healthcare Financial Management, (2012), https://www.ahd.com/news/HFM_FeatureStory_ValueBasedPurchasing_2012_January.pdf

[4] M. Delignette-Muller and C. Dutang, “fitdistrplus: An R Package for Fitting Distributions,” Journal of Statistical Software, (2015), https://www.jstatsoft.org/article/view/v064i04

[5] N. Totton and P. White, “The Ubiquitous Mythical Normal Distribution,” Research & Innovation, (2011), https://www.researchgate.net/publication/322387030_The_Ubiquitous_Mythical_Normal_Distribution

[6] P. Johnson, “GLM with a Gamma-distributed Dependent Variable,” University of Colorado Boulder, (2014), https://pj.freefaculty.org/guides/stat/Regression-GLM/Gamma/GammaGLM-01.pdf

Data Sources: Data for Total Performance Scores is in the public domain and available from the Center for Medicare & Medicaid Services. Income tax data is from the following state tax authorities: AL, AZ, AR, CA, CO, CT, DE, DC, GA, HI, IL, IN, IA, KS, KY, LA, ME, MD, MA, MI, MN, MS, MO, MT, NE, NJ, NM, NY, NC, ND, OH, OK, OR, PA, RI, SC, UT, VT, VA, WV, WI.
