
The Normal Distribution

The Most Important Distribution in Data Science


There’s a reason the Normal Distribution is called "normal". Its presence can be felt throughout Data Science and machine learning, as well as in a variety of unexpected real-world scenarios. From the distribution of heights and weights, to the volume of milk collected from cows, to SAT scores – the normal distribution is seemingly omnipresent!

Image Courtesy of energepic on Pexels

Some History

Carl Friedrich Gauss first described the normal distribution in an 1809 essay introducing least squares and maximum likelihood. While history has given Gauss naming rights (it’s called the Gaussian distribution, after all), it was Pierre-Simon Laplace who, building on Gauss’s work, formulated the Central Limit Theorem (CLT). The CLT is a key statistical result: the sum of many independent random variables, suitably normalized, tends toward a normal distribution even when the underlying distribution is not itself normal.

Check out this interactive applet to play around with this!

The term "normal" here actually refers to the sum of these independent random variables being normalized; it is not meant to suggest that the Gaussian is the "normal" or default distribution against which all other distributions are "abnormal" (that was a little joke earlier).


Describing a Normal Distribution

The normal distribution has several characteristics that make it very useful –

  1. Symmetric around the mean
  2. Mean, median, mode are equal
  3. Area under the curve = 1
  4. Empirical rule: 68/95/99.7 (we’ll get back to this)

A normal distribution can be described with just two parameters, the mean and the standard deviation, denoted by the Greek letters mu (μ) and sigma (σ). Its probability density function is provided here:

Image by Author
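For readers who prefer text to pictures, the density shown above, written in standard notation, is:

```latex
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```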

If this PDF means nothing to you, check out my previous blog on probability mass and density functions here! This modern form utilizing sigma (σ) was popularized by Karl Pearson in 1915.

By altering the mean and standard deviation, we can change the shape and location of the distribution. Changing the mean shifts the curve along the number line, while changing the standard deviation stretches or squashes the curve.

Image by Author
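If you want to reproduce this effect yourself, here’s a minimal sketch using NumPy, SciPy, and Matplotlib (assuming all three are installed); the particular μ and σ values are just illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-10, 10, 500)

# Changing the mean shifts the curve along the number line
for mu in (-2, 0, 2):
    plt.plot(x, norm.pdf(x, loc=mu, scale=1), label=f"μ={mu}, σ=1")

# Changing the standard deviation stretches or squashes the curve
for sigma in (0.5, 2):
    plt.plot(x, norm.pdf(x, loc=0, scale=sigma), label=f"μ=0, σ={sigma}")

plt.legend()
plt.show()
```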

The Standard Normal Distribution

The standard normal distribution is a special case where μ = 0 and σ = 1. This case is pictured below.

Image by Author

Empirical Rule

I mentioned the 68/95/99.7 rule above, but let’s go deeper. The rule states that 68% of observations fall within ±1 stdev of the mean, 95% fall within ±2 stdev, and 99.7% fall within ±3 stdev. These values become very important during hypothesis testing.
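You can verify these percentages directly with SciPy; a quick sketch:

```python
from scipy.stats import norm

# Probability mass within ±k stdev of the mean (standard normal)
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within ±{k} stdev: {p:.4f}")

# within ±1 stdev: 0.6827
# within ±2 stdev: 0.9545
# within ±3 stdev: 0.9973
```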

Values outside of ±3 stdev account for less than 0.3% of observations and, depending on the situation, could be considered outliers or noise. In short, the farther a value falls from the mean, the less likely it is that the observation belongs to that distribution.

Image by Author

With a standard normal distribution, we can use a standard score, or z-score, to calculate the probability that a given value comes from a given distribution, or to compare values drawn from different distributions.

Here’s a resource for the z-score table.
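If you’d rather compute than look values up, SciPy’s norm.cdf reproduces the z-score table (a minimal sketch):

```python
from scipy.stats import norm

z = 1.0
print(norm.cdf(z))  # P(Z <= 1.0) ≈ 0.8413, the value a z-table gives
print(norm.sf(z))   # P(Z > 1.0) ≈ 0.1587, the upper-tail probability
```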

Any normal distribution can be transformed into a standard normal distribution with the following equation, where x is a value from the original distribution. This is why the standard normal distribution is sometimes called the z-distribution. Each raw value x is converted into z (its distance from the mean in stdev) by subtracting the mean and dividing by the standard deviation.

Image by Author
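Written out, the transformation is:

```latex
z = \frac{x - \mu}{\sigma}
```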

An Example

Let’s say we have a normal distribution of adult masses with a mean of 80 kg and a standard deviation of 5 kg. For an adult with mass x = 85 kg, z = (85 − 80) / 5 = 1. This mass is 1 stdev from the mean.

This equation can be useful when attempting to find outliers in raw data, but you should always examine your data carefully before eliminating data that doesn’t immediately conform to the distribution you want.

For example, if our data contains some masses clustering around 60 kg, we can calculate that these observations sit about 4 standard deviations below the mean of 80 kg, since (60 − 80) / 5 = −4. The probability of drawing such a value from this distribution is quite low (P ≈ 0.00003)! After looking into this issue further, perhaps we discover that some children’s masses were accidentally included in the dataset.
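We can confirm that tail probability with SciPy, using the hypothetical masses from this example:

```python
from scipy.stats import norm

z = (60 - 80) / 5   # -4.0: four stdev below the mean
print(norm.cdf(z))  # ≈ 3.17e-05, i.e. P ≈ 0.00003
```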


Data Standardization

This process of transforming raw values to a standard normal distribution is called data standardization, and it’s very important for machine learning models. It’s how we fairly compare features with different distributions and scales without incorrectly assigning more importance to features with larger raw values. For example, if we want to see how a person’s protein intake and fruit consumption affect their health, we don’t want to treat 50 grams of protein as an order of magnitude more important than 3 servings of fruit just because the raw value is greater. If we standardize both features, we may discover that fruit actually has a greater impact on health than protein!
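Here’s a minimal sketch of standardization using scikit-learn’s StandardScaler; the intake numbers are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical daily intake per person: [protein_grams, fruit_servings]
X = np.array([[50.0, 3.0],
              [65.0, 1.0],
              [40.0, 5.0],
              [55.0, 2.0]])

# Each column is rescaled to mean 0 and stdev 1, so a model sees
# both features on a comparable scale
X_std = StandardScaler().fit_transform(X)
print(X_std.round(2))
```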


Conclusion

There are, of course, many more applications of the normal distribution that we will talk about later, for example in the distribution of residuals for linear regression models. The normal distribution shows up in just about everything you’ll do as a data scientist, so I hope this overview was useful. And, again, I’ve explored different distributions’ probability functions if you want to dive even deeper!


Connect

I’m always looking to connect and explore other projects! The code to generate the various visualizations in this article can be found here.

LinkedIn | Medium | GitHub

