
1. Introduction
You might have already encountered one type of probability plot – the Q-Q plot – while working with linear regression. One of the assumptions we should check after fitting a regression model is whether the residuals follow a Normal (Gaussian) distribution, and this can often be verified visually with a Q-Q plot such as the one presented below.
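As a quick illustration (a minimal sketch with simulated data, not the exact plot from the article), such a residual check could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# simulate a simple linear relationship with Gaussian noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 2 + 3 * x + rng.normal(scale=1.5, size=200)

# fit OLS and draw a Q-Q plot of the residuals against the Normal distribution
results = sm.OLS(y, sm.add_constant(x)).fit()
sm.qqplot(results.resid, line="s")
plt.show()
```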

To fully understand the concept of probability plots, let's quickly go over a few definitions from probability theory and statistics:
- probability density function (PDF) – a function that allows us to calculate the probability of finding a random variable in any interval belonging to the sample space. It is important to remember that the probability of a continuous random variable taking an exact value is equal to 0.

- cumulative distribution function (CDF) – a function that gives the probability of a random variable taking a value less than or equal to a given value x. For continuous variables, the CDF is the area under the PDF from minus infinity to x.

- quantile – quoting Wikipedia: "cut points dividing the range of a probability distribution into continuous intervals with equal probabilities"
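These three objects map directly onto scipy.stats methods – pdf, cdf, and ppf (the quantile function) – so a quick sketch for the Standard Normal looks like this:

```python
from scipy import stats

z = 1.0
print(stats.norm.pdf(z))      # density at z (not a probability by itself)
print(stats.norm.cdf(z))      # P(X <= z), approximately 0.841
print(stats.norm.ppf(0.841))  # quantile function, the inverse of the CDF, approximately 1.0
```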
The following plot presents a distribution of a random variable drawn from the Standard Normal distribution, together with its PDF and CDF.

In this article I will be using two other distributions for comparison:
- Normal distribution with mean 1 and standard deviation 2.5 – N(1, 2.5)
- Skew Normal Distribution with alpha = 5
I use the Skew Normal distribution because, by adjusting the alpha parameter (while leaving scale and location at their defaults), I control the skewness of the distribution. As the absolute value of alpha increases, the absolute value of skewness increases as well. Below we can inspect the difference between the distributions by looking at histograms of random variables drawn from them.

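To see how alpha drives the skewness, here is a quick check using scipy.stats (a small sketch, not part of the original article):

```python
from scipy import stats

# theoretical skewness of the Skew Normal for a few values of alpha
for alpha in [0, 1, 5, 20, -5]:
    skewness = stats.skewnorm.stats(alpha, moments="s")
    print(f"alpha = {alpha:>3}: skewness = {float(skewness):.3f}")
```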
2. Probability plots
We use probability plots to visually compare data coming from different datasets or distributions. The possible scenarios involve comparing:
- two empirical sets
- one empirical and one theoretical set
- two theoretical sets
The most common use of probability plots is the second scenario, in which we compare observed (empirical) data to data coming from a specified theoretical distribution such as the Gaussian. I use this variant to explain the particular types of plots below; however, the explanations also apply to the other two cases.
2.1 P-P plot
In short, a P-P (probability–probability) plot is a visualization that plots the CDFs of the two distributions (empirical and theoretical) against each other.

Some key information on P-P plots:
- Interpretation of the points on the plot: assuming we have two distributions (f and g) and a point of evaluation z (any value), the point on the plot indicates what percentage of data lies at or below z in both f and g (as per definition of the CDF).
- To compare the distributions, we check whether the points lie on the 45-degree line (y = x). If they deviate from it, the distributions differ.
- P-P plots are well suited to compare regions of high probability density (center of distribution) because in these regions the empirical and theoretical CDFs change more rapidly than in regions of low probability density.
- P-P plots require fully specified distributions, so if we are using Gaussian as the theoretical distribution we should specify the location and scale parameters.
- Changing the location or scale parameters does not necessarily preserve the linearity in P-P plots.
- P-P plots can be used to visually evaluate the skewness of a distribution.
- The plot may produce odd patterns (e.g. points hugging the axes of the chart) when the distributions barely overlap, so P-P plots are most useful for comparing distributions with nearby or equal locations. Below I present a P-P plot comparing random variables drawn from N(1, 2.5) to N(5, 1).

2.2. Q-Q plot
Similarly to P-P plots, Q-Q (quantile-quantile) plots allow us to compare distributions by plotting their quantiles against each other.
Some key information on Q-Q plots:
- Interpretation of the points on the plot: a point on the chart corresponds to a certain quantile coming from both distributions (again in most cases empirical and theoretical).
- On a Q-Q plot, the reference line depends on the location and scale parameters of the theoretical distribution: its intercept and slope are equal to the location and scale parameters, respectively.
- A linear pattern in the points indicates that the given family of distributions reasonably describes the empirical data distribution.
- A Q-Q plot provides very good resolution in the tails of the distribution but worse resolution in the center (where the probability density is high).
- Q-Q plots do not require specifying the location and scale parameters of the theoretical distribution, because the theoretical quantiles are computed from a standard distribution within the specified family.
- The linearity of the point pattern is not affected by changing location or scale parameters.
- Q-Q plots can be used to visually evaluate the similarity of location, scale, and skewness of the two distributions.
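To make the mechanics concrete, here is a minimal sketch of how the points of a theoretical Q-Q plot can be computed by hand, using one common choice of plotting positions (statsmodels handles this internally):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=1, scale=2.5, size=300)

# the ordered sample values are the empirical quantiles
empirical_q = np.sort(sample)

# theoretical quantiles of the Standard Normal at the plotting positions
n = len(sample)
positions = (np.arange(1, n + 1) - 0.5) / n
theoretical_q = stats.norm.ppf(positions)

plt.scatter(theoretical_q, empirical_q, s=10)
plt.xlabel("Theoretical quantiles")
plt.ylabel("Sample quantiles")
plt.show()
```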
3. Examples in Python
I use the statsmodels library to create probability plots with the [ProbPlot](https://www.statsmodels.org/dev/generated/statsmodels.graphics.gofplots.ProbPlot.html) class.
Data
First, I generated random observations coming from three distributions: Standard Normal, Normal and Skew Normal. You can see the exact parameters of the distributions in the snippet below.
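The original snippet is not reproduced here, but a sketch consistent with the parameters described above (the sample size and random seed are my own assumptions) could look like this:

```python
import numpy as np
from scipy import stats

N = 10_000                       # assumed sample size
rng = np.random.default_rng(42)  # assumed seed

sample_standard_normal = stats.norm.rvs(size=N, random_state=rng)
sample_normal = stats.norm.rvs(loc=1, scale=2.5, size=N, random_state=rng)
sample_skewnorm = stats.skewnorm.rvs(a=5, size=N, random_state=rng)
```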
P-P plots
When I started creating some P-P plots using statsmodels, I noticed an issue – when comparing random draws from N(1, 2.5) to the Standard Normal, the plot showed a perfect fit, while it should not have. I investigated the issue and found a post on StackOverflow explaining that the current implementation always tries to estimate the location and scale parameters of the theoretical distribution, even when they are provided. So in the case above, we are checking whether our empirical data comes from a Normal distribution in general, not from the specific one we specified.
That is why I wrote a function for direct comparison of empirical data to a theoretical distribution with provided parameters.
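The exact implementation is available in the notebook linked at the end of the article; a minimal sketch of such a pp_plot helper (my reconstruction – the theoretical CDF goes on the y-axis, so a point above the 45-degree line means the theoretical CDF is higher) could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt


def pp_plot(sample, dist, ax=None):
    """P-P plot of `sample` against a fully specified (frozen) scipy.stats
    distribution `dist`, e.g. stats.norm(0, 1)."""
    if ax is None:
        ax = plt.gca()
    ordered = np.sort(np.asarray(sample))
    n = len(ordered)
    ecdf = np.arange(1, n + 1) / n   # empirical CDF at the ordered observations
    tcdf = dist.cdf(ordered)         # theoretical CDF at the same points
    ax.scatter(ecdf, tcdf, s=10)
    ax.plot([0, 1], [0, 1], "r--")   # 45-degree reference line
    ax.set_xlabel("Empirical CDF")
    ax.set_ylabel("Theoretical CDF")
    return ax
```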
Let's first try comparing a random draw from N(1, 2.5) to N(0, 1) using both statsmodels and pp_plot. In the case of statsmodels it is a perfect fit, as the function estimated both the location and scale parameters of the Normal distribution. When inspecting the result of pp_plot, we see that the distributions differ significantly, which can also be observed on the histograms.
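The calls behind these two charts could look roughly like this (sample_normal is the N(1, 2.5) draw from the data snippet, pp_plot is the helper sketched above, and fit=True is my way of reproducing the parameter estimation described earlier):

```python
from scipy import stats
from statsmodels.graphics.gofplots import ProbPlot

sample_normal = stats.norm.rvs(loc=1, scale=2.5, size=10_000, random_state=42)

# statsmodels with fit=True: the location and scale of the Normal are estimated
# from the data, so the points end up on the 45-degree line even though the
# sample comes from N(1, 2.5)
ProbPlot(sample_normal, dist=stats.norm, fit=True).ppplot(line="45")

# direct comparison against N(0, 1) with the pp_plot helper sketched above
pp_plot(sample_normal, stats.norm(0, 1))
```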

Let's also try to interpret the shape of the P-P plot from pp_plot. To do so I will once again show the chart, together with the histograms. The horizontal movement along the x-axis is caused by the fact that the distributions are not entirely overlapping. When a point is above the reference line, it means that the value of the CDF of the theoretical distribution is higher than that of the empirical one.

The next case compares a random draw from the Skew Normal distribution to the Standard Normal. The plot from statsmodels shows that it is not a perfect match, as the function has trouble finding location and scale parameters of a Normal distribution that account for the skewness in the provided data. The plot also shows that the value of the CDF of the Standard Normal is always higher than that of the considered Skew Normal distribution.

Note: We can also obtain a perfect fit using statsmodels. To do so we need to specify the theoretical distribution in ProbPlot as skewnorm and pass an additional parameter distargs=(5,) to indicate the value of alpha.
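In code, this could look roughly as follows (sample_skewnorm is the draw from the Skew Normal with alpha = 5; regenerated here so the snippet runs on its own):

```python
from scipy import stats
from statsmodels.graphics.gofplots import ProbPlot

sample_skewnorm = stats.skewnorm.rvs(a=5, size=10_000, random_state=42)

# the theoretical distribution is the Skew Normal with alpha = 5, passed via distargs
ProbPlot(sample_skewnorm, dist=stats.skewnorm, distargs=(5,)).ppplot(line="45")
```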
Q-Q plots
Application and interpretation
Let's begin by comparing the Skew Normal distribution to the Standard Normal (with ProbPlot's default settings).
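With the default settings this boils down to something like the following (the choice of the standardized reference line is mine):

```python
from scipy import stats
from statsmodels.graphics.gofplots import ProbPlot

sample_skewnorm = stats.skewnorm.rvs(a=5, size=10_000, random_state=42)

# by default the theoretical distribution is the Normal; 's' draws the standardized line
ProbPlot(sample_skewnorm).qqplot(line="s")
```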

The first thing that can be observed is that the points form a curve rather than a straight line, which is usually an indication of skewness in the sample data. Another way of interpreting the plot is to look at the tails of the distributions. In this case, the considered Skew Normal distribution has a lighter left tail (less mass, points on the left side of the Q-Q plot above the line) and a heavier right tail (more mass, points on the right side of the Q-Q plot above the line) than one would expect under the Standard Normal distribution. We need to remember that the skewed distribution is shifted (as can be observed on the histograms), so these results are in line with our expectations.
I also wanted to quickly go over two other variations of the same exercise. In the first one, I specify the theoretical distribution as Skew Normal and pass alpha=5 in distargs. This results in the following plot, on which we see a linear pattern (though shifted relative to the standardized reference line). The point pattern is, however, basically a 45-degree line, indicating a good fit (the standardized reference line turns out not to be a good choice in this case).
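A sketch of that variant:

```python
from scipy import stats
from statsmodels.graphics.gofplots import ProbPlot

sample_skewnorm = stats.skewnorm.rvs(a=5, size=10_000, random_state=42)

# the theoretical quantiles now come from the Skew Normal with alpha = 5
ProbPlot(sample_skewnorm, dist=stats.skewnorm, distargs=(5,)).qqplot(line="s")
```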

The second approach is comparing two empirical samples – one drawn from Skew Normal (alpha=5), the second one from Standard Normal. I set fit=False in order to turn off the automatic fitting of location, scale, and distargs.
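One way to set this up with statsmodels (qqplot_2samples accepts either raw arrays or ProbPlot instances; the samples are regenerated here so the snippet runs on its own):

```python
from scipy import stats
from statsmodels.graphics.gofplots import ProbPlot, qqplot_2samples

sample_skewnorm = stats.skewnorm.rvs(a=5, size=10_000, random_state=42)
sample_standard_normal = stats.norm.rvs(size=10_000, random_state=0)

# fit=False keeps both samples as they are – no estimation of loc, scale or distargs
pp_skew = ProbPlot(sample_skewnorm, fit=False)
pp_norm = ProbPlot(sample_standard_normal, fit=False)
qqplot_2samples(pp_skew, pp_norm, line="45")
```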
The results seem to be in line with the initial approach (which is a good sign 🙂 ).

Example using stock returns
I would also like to show a practical example of using a Q-Q plot to evaluate whether the returns on Microsoft stock follow a Normal distribution (please refer to this article for more details). The conclusion is that there is definitely more mass in the tails (indicating more extreme negative and positive returns) than assumed under Normality.
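A hedged sketch of how such a check could be set up (the yfinance data source and the date range are my assumptions, not necessarily the setup used in the referenced article):

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm
import yfinance as yf

# daily simple returns of Microsoft stock over an assumed date range
prices = yf.download("MSFT", start="2015-01-01", end="2020-01-01")["Close"]
returns = prices.pct_change().dropna().to_numpy().ravel()

# Q-Q plot of the returns against the Normal distribution
sm.qqplot(returns, line="s")
plt.show()
```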

Further implementation details
In the qqplot method of ProbPlot we can specify what kind of reference line we would like to draw. The options (aside from None for no line) are:
- s – standardized line (expected order statistics are scaled by the standard deviation of the given sample and have the mean added to them)
- q – line fit through the quartiles
- r – regression line
- 45 – y=x line (as the one used in P-P plots)
Below I show a comparison of the three methods, which – as we can see – are very similar.
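A sketch of such a comparison, drawing the three variants side by side (the sample is regenerated so the snippet runs on its own):

```python
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.graphics.gofplots import ProbPlot

sample_skewnorm = stats.skewnorm.rvs(a=5, size=10_000, random_state=42)
pp = ProbPlot(sample_skewnorm)

# the same Q-Q plot with the three reference-line variants
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, line in zip(axes, ["s", "q", "r"]):
    pp.qqplot(line=line, ax=ax)
    ax.set_title(f"line='{line}'")
plt.show()
```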

When working with Q-Q plots we can also use another feature of statsmodels that adopts non-exceedance probabilities in place of theoretical quantiles (the probplot method instead of qqplot).
You can read more about this methodology here.
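Calling it is analogous to qqplot (again a minimal sketch with a regenerated sample):

```python
from scipy import stats
from statsmodels.graphics.gofplots import ProbPlot

sample_skewnorm = stats.skewnorm.rvs(a=5, size=10_000, random_state=42)

# the x-axis now shows non-exceedance probabilities instead of theoretical quantiles
ProbPlot(sample_skewnorm).probplot()
```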

4. Summing Up
In this article, I have tried to explain the key concepts of probability plots using the examples of P-P and Q-Q plots. You can find the notebook with the code used to generate the plots in this article on my GitHub. In case you have questions or suggestions, please let me know in the comments or reach out on Twitter.
You might also be interested in one of the following:
Prediction Strength – a simple, yet relatively unknown way to evaluate clustering