
What In The World Are QQ Plots?

Understanding What QQ Plots Do And How To Make One From Scratch

Photo by Alexander Andrews on Unsplash

If you’ve ever used linear regression or worked with statistical tools that require the data (or the errors) to be normally distributed, then you’ve probably run into QQ plots before. But you may not remember much about them beyond the rule of thumb that a straight QQ plot (dots hugging the 45 degree line) is good.

But QQ plots are actually a really nifty and intuitive way to visualize whether something is normally distributed. Let’s find out how they work and why they’re cool.

If you need a refresher on the normal distribution, I wrote this post about it.


Normal Distributions

Things that are normally distributed are great. Knowing that something conforms to the normal distribution (and knowing its mean and standard deviation) allows us to make all kinds of useful inferences about it. For example, we can be reasonably sure where its value will fall say 95% of the time (between -1.96 and +1.96 standard deviations of the mean).

But if our variable is actually not normally distributed, then our inferences will be wrong, sometimes very wrong. And depending on the application, the consequences of our inaccurate inferences can range from being merely inconvenient to even dangerous.

That’s where QQ plots come in. They’re a quick and visual way to assess whether a variable is normal or not (we can use QQ plots to check our data against any distribution, not just the normal distribution).


QQ Plots

Let’s make up some data that we already know is normally distributed:

import numpy as np
# Generate some normally distributed random numbers
random_normals = [np.random.normal() for i in range(1000)]

We can use the QQ plot function from the statsmodels library:

import statsmodels.api as sm
from matplotlib import pyplot as plt
# Create QQ plot
sm.qqplot(np.array(random_normals), line='45')
plt.show()

The code above creates the following plot:

QQ plot of a normally distributed random variable

See how our data (the blue dots) falls pretty cleanly on the red line? That means that our data is normally distributed (which we already knew). And that’s it. If our data adheres to the red 45 degree line, it’s normal or close to it; if it doesn’t, it’s not.

Let’s take a look at the QQ plot for something that’s not normal:

import random
# Generate some uniformly distributed random variables
random_uniform = [random.random() for i in range(1000)]
# Create QQ plot
sm.qqplot(np.array(random_uniform), line='45')
plt.show()

Which generates this plot:

QQ plot of a random variable that is not normally distributed

Our data (the blue dots) is nowhere close to the red line, meaning it’s not normally distributed (it’s uniformly distributed). So now that we understand what QQ plots do, let’s figure out how they do it.


How QQ Plots Work

The "QQ" in QQ plot means quantile-quantile – that is, the QQ plot compares the quantiles of our data against the quantiles of the desired distribution (defaults to the normal distribution, but it can be other distributions too as long as we supply the proper quantiles).

Quantiles are breakpoints that divide our numerically ordered data into equally proportioned buckets. For example, you’ve probably heard of percentiles before – percentiles are quantiles that divide our data into 100 buckets (ordered by value), with each bucket containing 1% of observations. Quartiles are quantiles that divide our data into 4 buckets (0–25%, 25–50%, 50–75%, 75–100%). Even our old friend the median is a quantile – it divides our data into two buckets, where half our observations are lower than the median and half are higher than it.
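These breakpoints are easy to compute directly. Here is a small sketch using numpy’s built-in quantile functions (the sample data is made up for illustration):

```python
import numpy as np

data = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

# The median is just the 0.5 quantile: half the (sorted) data lies below it
median = np.quantile(data, 0.5)

# Quartiles split the sorted data into 4 equal buckets
quartiles = np.quantile(data, [0.25, 0.5, 0.75])

# Percentiles do the same with 100 buckets
p90 = np.percentile(data, 90)
```

Note that numpy interpolates between observations by default, which is why these quantiles need not equal any single observed value.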

So what does it mean to compare quantiles? Let’s step back from QQ plots for a moment and think about a simpler way to compare two distributions: histograms. How would we figure out whether two distributions are the same? Well, a decent first pass would be to overlay one distribution on the other and stare really hard. But what should we be staring at? One simple test would be to pick a point on the X axis and see what proportion of each distribution lies to each side of it. For example, in finance we are often concerned with downside risk (the left tail of the distribution) – or in other words, what happens to our portfolio when things go bad.

Let’s say we are concerned with really terrible events so we decide to look at outcomes that lie more than 1.65 standard deviations to the left of (in other words, below) the mean – we will call this point our threshold. If the distribution of our data were normal, then approximately 5% of our observations would lie to the left of our threshold:

Normal distribution (blue) and -1.65 SD threshold (red)

But what if our data were not normal? We can do the same analysis as above and see how many observations lie to the left of our threshold:

Normal and Not Normal Distribution visual comparison

Visually we can see that a lot more of the Not Normal distribution (the gray line – it’s a Student’s T-distribution with 1 degree of freedom) lies to the left of the threshold. So if the distribution of our portfolio is actually the gray line, but we model it with the blue line, we will significantly understate the frequency of a terrible outcome (terrible outcomes are ones to the left of our threshold, the red line). We would be assuming that there is only a 5% chance of a terrible outcome, when in reality 17% of the area under the gray line (per its cumulative distribution function) lies to the left of our terrible-outcomes threshold.

So we would be understating the risk of a terrible outcome by a factor of 3!
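We can verify that tail comparison numerically. A quick sketch using scipy, with the same -1.65 threshold and the same t-distribution with 1 degree of freedom as above:

```python
from scipy.stats import norm, t

threshold = -1.65  # our "terrible outcome" cutoff, in standard deviations

# Probability of landing below the threshold under each distribution
p_normal = norm.cdf(threshold)   # about 5%
p_t1 = t.cdf(threshold, df=1)    # about 17% for a t-distribution with 1 df

ratio = p_t1 / p_normal          # at least 3x more tail risk
```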

That’s why it’s important to check that something is normal. And that’s where QQ plots really shine. In essence, QQ plots do what we just did with our overlaid histograms (and threshold), but it does it for every observation in our data.


QQ Plots From Scratch

If you are interested in the following code, you can also grab it from my GitHub here.

Let’s make a simplified QQ plot from scratch so we can understand how it works from the ground up. Recall that quantiles are the breakpoints (e.g. percentiles) that divide our data up into numerically ordered, equally sized buckets. To calculate quantiles, we need to first order our data from smallest to largest. Let’s generate some data and sort it:

In:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import norm
N = 10
t_dist = sorted(np.random.standard_t(1, size=N))
t_dist
Out:
[-3.078322498951745,
 -0.44257926668743197,
 -0.28442575572329976,
 -0.08391894271348821,
 0.5619567861663247,
 1.0176669779384615,
 1.3448162439504447,
 1.874164646241363,
 2.936477005326059,
 5.340289177069092]

Cool, we now have 10 numerically sorted observations from a random number generator that follows a Student’s T-distribution with 1 degree of freedom. It’s the same distribution as in our histogram above. Just eyeballing the numbers, we can see that they don’t look very normal. A standard normal distribution (one that hasn’t been shifted or scaled to fit empirical data) has a mean of 0 and a standard deviation of 1, so observations more than a few standard deviations from the mean should be very rare. Yet in our sample of just 10 observations, we already see a -3.08 and a 5.34 (the latter a whopping 5+ standard deviations above the expected mean).

Calculating Actual Quantiles

Next it’s time to calculate the quantiles of our observed data. Since we have 10 observations, we want 9 quantiles (we always want N-1 quantiles, one fewer than the number of observations). An easy way to obtain the quantiles is to just calculate the midpoints between consecutive observations:

t_dist_quantiles = []
quantiles_percent = []
for i, val in enumerate(t_dist[:-1]):
    t_dist_quantiles.append((val + t_dist[i+1])/2)
    quantiles_percent.append((i+1)/len(t_dist))

The list t_dist_quantiles records the midpoints that we calculate, and the list quantiles_percent records the proportion of the data that lies below each quantile. This will become clearer once we see the resulting dataframe:

In:
qp_array = np.array(quantiles_percent).reshape(-1,1)
tq_array = np.array(t_dist_quantiles).reshape(-1,1)
qq_df = pd.DataFrame(np.concatenate((qp_array, tq_array), axis=1),
                     columns=['percent_below', 'quantile'])
print(qq_df)
Out:
   percent_below    quantile
0            0.1   -1.760451
1            0.2   -0.363503
2            0.3   -0.184172
3            0.4    0.239019
4            0.5    0.789812
5            0.6    1.181242
6            0.7    1.609490
7            0.8    2.405321
8            0.9    4.138383

Notice how the first quantile, -1.76, is halfway between -3.08 and -0.44. And the percent_below column tells us that 10% of our data lies below -1.76. This makes sense because only a single observation, -3.08, is less than -1.76 – 1 out of 10 observations, in other words 10%. So for each quantile, we can figure out the percent below it by dividing the number of observations less than it by the total number of observations.
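We can sanity-check that relationship with a direct count. Here is a small sketch using a rounded version of the sorted sample above:

```python
# A fixed sorted sample, standing in for our 10 t-distributed observations
sample = [-3.08, -0.44, -0.28, -0.08, 0.56, 1.02, 1.34, 1.87, 2.94, 5.34]

for i in range(len(sample) - 1):
    q = (sample[i] + sample[i + 1]) / 2        # midpoint quantile
    n_below = sum(1 for x in sample if x < q)  # observations below it
    # exactly i+1 of the 10 observations lie below the i-th midpoint
    assert n_below == i + 1
```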

Theoretical Quantiles

Once we have our actual quantiles, we need something to compare them to. The benchmark in the case of QQ plots is the theoretical quantiles of the distribution we desire. So how do we get these theoretical quantiles?

We already have all the ingredients we need, actually. Recall that quantiles break our data up into fixed proportions. So if we know the proportion in each bucket, then for a given quantile we also know the proportion of our data that lies to the left and right of it. For example, referring back to the percent_below column in our dataframe, qq_df, we see that for the third quantile, -0.184172, 30% of our data lies to the left of it.

So to get our third theoretical quantile, we just need to figure out the point on the normal distribution where 30% of the area under the curve lies to its left. We can easily do this for each of our quantiles with the following line of code:

qq_df['theoretical_quantile'] = [norm.ppf(percentage) for percentage in qq_df['percent_below']]
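As a quick check of what norm.ppf is doing here (ppf, the percent point function, is scipy’s name for the inverse CDF):

```python
from scipy.stats import norm

# The point on the standard normal with 30% of the area to its left
q30 = norm.ppf(0.3)  # roughly -0.5244

# ppf inverts cdf, so the round trip recovers our 30%
round_trip = norm.cdf(q30)
```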

We now have all the values we need for our QQ plot in qq_df: actual quantiles from our data for the Y axis and theoretical quantiles (from the normal distribution) for the X axis. I’ve printed the contents of qq_df below:

   percent_below    quantile  theoretical_quantile
0            0.1   -1.760451             -1.281552
1            0.2   -0.363503             -0.841621
2            0.3   -0.184172             -0.524401
3            0.4    0.239019             -0.253347
4            0.5    0.789812              0.000000
5            0.6    1.181242              0.253347
6            0.7    1.609490              0.524401
7            0.8    2.405321              0.841621
8            0.9    4.138383              1.281552

All that’s left to do is to draw our QQ plot:

plt.subplots(figsize=(9,7))
plt.scatter(x=qq_df['theoretical_quantile'],
            y=qq_df['quantile'], label='Actual');
plt.scatter(x=qq_df['theoretical_quantile'],
            y=qq_df['theoretical_quantile'], 
            c='red', label='Normal')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Actual Quantiles')
plt.legend()
plt.savefig('qq_plot', dpi=150);
plt.show()
Our QQ plot

What The QQ Plot Is Telling Us

It would be remiss of me to conclude this post without explaining how to read a QQ plot. We already know that if the dots of our plot fall on a 45 degree line, then our data is normally distributed (assuming we are using the normal distribution’s theoretical quantiles). But when they do not fall on the line, we can still learn a lot about the distribution of our data. Here are some general tips for reading a QQ plot:

  • The slope tells us whether the steps in our data are too big or too small (or just right). Remember, each step (where a step is going from one quantile to the next) traverses a fixed and constant proportion of the data – with N observations and midpoint quantiles like ours, each step traverses 1/N of the data. So we are seeing how the step sizes (a.k.a. quantile spacings) compare between our data and the normal distribution.
  • A steeply sloping section of the QQ plot means that in this part of our data, the observations are more spread out than we would expect them to be if they were normally distributed. One example cause of this would be an unusually large number of outliers (like in the QQ plot we drew with our code previously).
  • A flat QQ plot means that our data is more bunched together than we would expect from a normal distribution. For example, in a uniform distribution, our data is bounded between 0 and 1. And within that range, each value is equally likely. So the extremes of the range (like 0.01 and 0.99) are just as likely as something in the middle like 0.50. This is very different from a normal distribution (with mean of 0 and standard deviation of 1) where something like a -3 or a 4 would be much less likely to be observed than a 0. So the QQ plot of a uniformly distributed variable (where the observations are equally spaced and therefore more bunched up relative to a normal distribution) would have a very shallow slope.
QQ plot of a uniformly distributed random variable
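As a numeric sketch of that last point (assuming numpy and scipy), we can estimate just how shallow the QQ slope of a uniform variable is:

```python
import numpy as np
from scipy.stats import norm

# For a Uniform(0, 1) variable, the q-th quantile is just q itself
probs = np.linspace(0.05, 0.95, 19)
uniform_quantiles = probs

# Matching theoretical quantiles of the standard normal
normal_quantiles = norm.ppf(probs)

# Slope of the best-fit line through the QQ points:
# well below 1, reflecting how bunched up the uniform data is
slope = np.polyfit(normal_quantiles, uniform_quantiles, 1)[0]
```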

Conclusion

I’m a big fan of QQ plots as they’re a clever, visual, and fast way to compare the distribution of our data to a desired statistical distribution. Hopefully after reading this post, you will be a fan of them too. Cheers!


More Data Science Related Posts By Me:

Understanding The Normal Distribution

Pandas Join vs. Merge

Understanding RNNs

Understanding PCA

Understanding Bayes’ Theorem

