
Explaining probability plots

What they are, how to implement them in Python and how to interpret the results


1. Introduction

You might have already encountered one type of probability plot – the Q-Q plot – while working with linear regression. One of the assumptions we should check after fitting a regression model is whether the residuals follow a Normal (Gaussian) distribution. This can often be verified visually using a Q-Q plot such as the one presented below.

Example of a Q-Q plot

To fully understand the concept of probability plots, let's quickly go over a few definitions from probability theory and statistics:

  • probability density function (PDF) – a function that allows us to calculate the probability of finding a random variable in any interval belonging to the sample space. It is important to remember that the probability of a continuous random variable taking an exact value is equal to 0.
PDF of the Gaussian distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
  • cumulative distribution function (CDF) – a function that gives the probability of a random variable taking a value less than or equal to a given value x. When dealing with continuous variables, the CDF is the area under the PDF from minus infinity to x.
General formula for the CDF: $F_X(x) = P(X \le x)$, where $X$ is the random variable and $x$ is the point of evaluation.
  • quantile – quoting Wikipedia: "cut points dividing the range of a probability distribution into continuous intervals with equal probabilities"

The following plot presents the distribution of a random variable drawn from the Standard Normal Distribution, together with its PDF and CDF.

In this article I will be using two other distributions for comparison: a (non-standard) Normal distribution and a Skew Normal distribution.

I use the Skew Normal Distribution because by adjusting the alpha parameter (while leaving scale and location at their defaults) I can control the skewness of the distribution. As the absolute value of alpha increases, so does the absolute value of the skewness. Below we can inspect the difference between the distributions by looking at histograms of random variables drawn from them.
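To illustrate, here is a minimal sketch of how such histograms could be generated with scipy.stats.skewnorm (the seed, alpha values, and sample size are arbitrary choices for the example):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# Draw samples from Skew Normal distributions with different alphas;
# loc and scale are left at their defaults (0 and 1).
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, alpha in zip(axes, [0, 5, -5]):
    sample = stats.skewnorm.rvs(a=alpha, size=10_000)
    ax.hist(sample, bins=50)
    ax.set_title(f"alpha = {alpha}")
plt.show()
```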

2. Probability plots

We use probability plots to visually compare data coming from different datasets (distributions). The possible scenarios involve comparing:

  • two empirical sets
  • one empirical and one theoretical set
  • two theoretical sets

The most common use of probability plots is the second scenario, in which we compare observed (empirical) data to data coming from a specified theoretical distribution such as the Gaussian. I use this variant to explain the particular types of plots below; however, the same logic also applies to the other two cases.

2.1. P-P plot

In short, a P-P (probability–probability) plot is a visualization that plots the CDFs of the two distributions (empirical and theoretical) against each other.

Example of a P-P plot comparing random numbers drawn from N(0, 1) to the Standard Normal – a perfect match

Some key information on P-P plots:

  • Interpretation of the points on the plot: assuming we have two distributions (f and g) and a point of evaluation z (any value), a point on the plot indicates what percentage of data lies at or below z in f and in g (as per the definition of the CDF); see the sketch after this list.
  • To compare the distributions, we check whether the points lie on the 45-degree line (x = y). If they deviate from it, the distributions differ.
  • P-P plots are well suited to compare regions of high probability density (center of distribution) because in these regions the empirical and theoretical CDFs change more rapidly than in regions of low probability density.
  • P-P plots require fully specified distributions, so if we are using Gaussian as the theoretical distribution we should specify the location and scale parameters.
  • Changing the location or scale parameters does not necessarily preserve the linearity in P-P plots.
  • P-P plots can be used to visually evaluate the skewness of a distribution.
  • The plot may produce odd patterns (e.g. points following the axes of the chart) when the distributions barely overlap, so P-P plots are most useful when comparing probability distributions that have nearby or equal locations. Below I present a P-P plot comparing random variables drawn from N(1, 2.5) to N(5, 1).
Random variables drawn from N(1, 2.5) vs. N(5, 1)
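As a rough illustration of how the points on a P-P plot are constructed (and of the barely-overlapping case above), here is a minimal sketch that evaluates both CDFs on a common grid of z values; it is a simplified construction, not the statsmodels implementation:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)
sample = stats.norm.rvs(loc=1, scale=2.5, size=1000)

# Evaluate both CDFs at the same z values; the empirical CDF is
# the fraction of observations at or below each z.
grid = np.linspace(sample.min(), sample.max(), 200)
empirical_cdf = np.array([(sample <= z).mean() for z in grid])
theoretical_cdf = stats.norm.cdf(grid, loc=5, scale=1)

plt.plot(theoretical_cdf, empirical_cdf, "o", markersize=3)
plt.plot([0, 1], [0, 1], "r--")  # 45-degree reference line
plt.xlabel("Theoretical CDF of N(5, 1)")
plt.ylabel("Empirical CDF")
plt.show()
```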

2.2. Q-Q plot

Similarly to P-P plots, Q-Q (quantile-quantile) plots allow us to compare distributions by plotting their quantiles against each other.

Some key information on Q-Q plots:

  • Interpretation of the points on the plot: a point on the chart corresponds to a certain quantile coming from both distributions (again, in most cases, empirical and theoretical); see the sketch after this list.
  • On a Q-Q plot, the reference line is dependent on the location and scale parameters of the theoretical distribution. The intercept and slope are equal to the location and scale parameters respectively.
  • A linear pattern in the points indicates that the given family of distributions reasonably describes the empirical data distribution.
  • A Q-Q plot gives very good resolution at the tails of the distribution but worse resolution in the center (where the probability density is high).
  • Q-Q plots do not require specifying the location and scale parameters of the theoretical distribution, because the theoretical quantiles are computed from a standard distribution within the specified family.
  • The linearity of the point pattern is not affected by changing location or scale parameters.
  • Q-Q plots can be used to visually evaluate the similarity of location, scale, and skewness of the two distributions.
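As a rough illustration of how the points on a Q-Q plot are constructed, here is a minimal sketch that pairs empirical quantiles of a sample with theoretical Standard Normal quantiles at the same probability levels (the alpha value and sample size are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)
sample = stats.skewnorm.rvs(a=5, size=1000)

# Empirical quantiles of the sample vs. theoretical quantiles of
# N(0, 1) at the same probability levels.
probs = np.linspace(0.01, 0.99, 99)
empirical_q = np.quantile(sample, probs)
theoretical_q = stats.norm.ppf(probs)  # inverse CDF of N(0, 1)

plt.plot(theoretical_q, empirical_q, "o", markersize=3)
plt.xlabel("Theoretical quantiles of N(0, 1)")
plt.ylabel("Empirical quantiles")
plt.show()
```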

3. Examples in Python

I use the statsmodels library to create probability plots with the [ProbPlot](https://www.statsmodels.org/dev/generated/statsmodels.graphics.gofplots.ProbPlot.html) class.

Data

First, I generated random observations coming from three distributions: Standard Normal, Normal and Skew Normal. You can see the exact parameters of the distributions in the snippet below.
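The original snippet is not reproduced here; below is a minimal sketch of the data generation, assuming N(1, 2.5) for the Normal sample and alpha=5 for the Skew Normal sample (the parameters referenced in the plots later on). Variable names are illustrative:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
N = 10_000

# Standard Normal, Normal and Skew Normal samples; N(1, 2.5) and
# alpha=5 match the values referenced in the plots below.
data_std_normal = stats.norm.rvs(loc=0, scale=1, size=N)
data_normal = stats.norm.rvs(loc=1, scale=2.5, size=N)
data_skew_normal = stats.skewnorm.rvs(a=5, size=N)
```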

P-P plots

When I started creating P-P plots using statsmodels, I noticed an issue: when comparing random draws from N(1, 2.5) to the Standard Normal, the plot showed a perfect fit when it should not have. I investigated and found a post on StackOverflow explaining that the current implementation always estimates the location and scale parameters of the theoretical distribution, even when values are provided. So in the case above, we are checking whether our empirical data comes from a Normal distribution in general, not the one we specified.

That is why I wrote a function for direct comparison of empirical data to a theoretical distribution with provided parameters.
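The original function is not reproduced here; a minimal sketch of such a pp_plot might look as follows, assuming it takes the empirical sample and a fully specified (frozen) scipy distribution:

```python
import numpy as np
import matplotlib.pyplot as plt


def pp_plot(sample, dist, ax=None):
    """Plot the empirical CDF of `sample` against the CDF of the
    fully specified (frozen) scipy distribution `dist`."""
    if ax is None:
        _, ax = plt.subplots()
    sample = np.sort(sample)
    # Empirical CDF evaluated at the sorted observations.
    ecdf = np.arange(1, len(sample) + 1) / len(sample)
    ax.plot(dist.cdf(sample), ecdf, "o", markersize=3)
    ax.plot([0, 1], [0, 1], "r--")  # 45-degree reference line
    ax.set_xlabel("Theoretical CDF")
    ax.set_ylabel("Empirical CDF")
    return ax
```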

Let's first try comparing a random draw from N(1, 2.5) to N(0, 1) using both statsmodels and pp_plot. In the case of statsmodels it is a perfect fit, as the function estimated both the location and scale parameters of the Normal distribution. Inspecting the result of pp_plot, we see that the distributions differ significantly, which can also be observed on the histograms.
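The two calls might look roughly like this (relying on pp_plot and the variable names from the sketches above):

```python
from scipy import stats
from statsmodels.graphics.gofplots import ProbPlot

# statsmodels: loc/scale of the theoretical Normal are estimated
# from the data, so the fit looks perfect.
ProbPlot(data_normal, fit=True).ppplot(line="45")

# pp_plot: compares directly against the frozen N(0, 1).
pp_plot(data_normal, stats.norm(loc=0, scale=1))
```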

P-P plots of N(1, 2.5) vs. Standard Normal

Let's also try to interpret the shape of the P-P plot from pp_plot. To do so, I will once again show the chart together with the histograms. The horizontal movement along the x-axis is caused by the fact that the distributions do not entirely overlap. When a point lies above the reference line, the value of the CDF of the theoretical distribution is higher than that of the empirical one.

The next case compares a random draw from the Skew Normal to the Standard Normal. The plot from statsmodels shows that it is not a perfect match, as the function has trouble finding location and scale parameters of a Normal distribution that account for the skewness in the provided data. The plot also shows that the value of the CDF of the Standard Normal is always higher than that of the considered Skew Normal distribution.

P-P plots of Skew Normal (alpha=5) vs. Standard Normal

Note: We can also obtain a perfect fit using statsmodels. To do so, we need to specify the theoretical distribution in ProbPlot as skewnorm and pass an additional parameter distargs=(5,) to indicate the value of alpha.
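That call might look roughly as follows (data_skew_normal is the assumed variable name from the data-generation sketch; since that sample was drawn with default location and scale, ProbPlot's defaults match):

```python
from scipy import stats
from statsmodels.graphics.gofplots import ProbPlot

# distargs passes the skewness parameter alpha=5 to skewnorm.
ProbPlot(data_skew_normal, dist=stats.skewnorm, distargs=(5,)).ppplot(line="45")
```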

Q-Q plots

Application and interpretation

Let's begin by comparing the Skew Normal distribution to the Standard Normal (using ProbPlot's default settings).
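A sketch of that call (with the defaults, the theoretical distribution is the Standard Normal; the choice of reference line here is illustrative):

```python
from statsmodels.graphics.gofplots import ProbPlot

# Default dist is scipy.stats.norm with loc=0 and scale=1;
# 's' draws the standardized reference line.
ProbPlot(data_skew_normal).qqplot(line="s")
```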

Q-Q plots of Skew Normal (alpha=5) vs. Standard Normal

The first thing we can observe is that the points form a curve rather than a straight line, which is usually an indication of skewness in the sample data. Another way of interpreting the plot is by looking at the tails of the distributions. In this case, the considered Skew Normal distribution has a lighter left tail (less mass; points on the left side of the Q-Q plot lie above the line) and a heavier right tail (more mass; points on the right side of the Q-Q plot lie above the line) than one would expect under the Standard Normal distribution. We need to remember that the skewed distribution is shifted (as can be observed on the histograms), so these results are in line with our expectations.

I also want to quickly go over two other variations of the same exercise. In the first one, I specify the theoretical distribution as Skew Normal and pass alpha=5 in distargs. This results in the following plot, in which we see a linear pattern (though shifted relative to the standardized reference line). The point pattern is basically a 45-degree line, indicating a good fit (the standardized reference line turns out not to be a good choice in this case).

Q-Q plots of Skew Normal (alpha=5) vs. Skew Normal (alpha=5)

The second approach compares two empirical samples: one drawn from the Skew Normal (alpha=5), the second one from the Standard Normal. I set fit=False in order to turn off the automatic fitting of location, scale, and distargs.
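A sketch of that two-sample comparison, using the other argument of qqplot (variable names are from the data-generation sketch):

```python
from statsmodels.graphics.gofplots import ProbPlot

# Two empirical samples: neither is fitted to a theoretical family.
pp_skew = ProbPlot(data_skew_normal, fit=False)
pp_norm = ProbPlot(data_std_normal, fit=False)
pp_skew.qqplot(other=pp_norm, line="45")
```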

The results seem to be in line with the initial approach (which is a good sign 🙂 ).

Example using stock returns

I would also like to show a practical example of using a Q-Q plot to evaluate whether the returns generated by Microsoft's stock price follow the Normal distribution (please refer to this article for more details). The conclusion is that there is definitely more mass in the tails (indicating more extreme negative and positive returns) than assumed under Normality.
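A minimal sketch of such a check is shown below; the data source (yfinance), date range, and column handling are assumptions for illustration, not the article's original code:

```python
import yfinance as yf
from statsmodels.graphics.gofplots import ProbPlot

# Download prices and compute simple daily returns.
msft = yf.download("MSFT", start="2019-01-01", end="2019-12-31",
                   auto_adjust=False)
returns = msft["Adj Close"].pct_change().dropna()

# Compare the returns to a fitted Normal distribution.
ProbPlot(returns.to_numpy().ravel(), fit=True).qqplot(line="45")
```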

Returns on MSFT vs. the Normal distribution

Further implementation details

In the qqplot method of ProbPlot, we can specify what kind of reference line we would like to draw. The options (aside from None for no line) are:

  • s – standardized line (expected order statistics are scaled by the standard deviation of the given sample and have the mean added to them)
  • q – line fit through the quartiles
  • r – regression line
  • 45 – y=x line (as the one used in P-P plots)

Below I show a comparison of the three methods, which – as we can see – are very similar.
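A sketch of how that comparison could be generated (the figure layout is illustrative):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot

pp = ProbPlot(data_skew_normal, fit=True)

# Draw the same Q-Q plot with the three different reference lines.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, line in zip(axes, ["s", "q", "r"]):
    pp.qqplot(line=line, ax=ax)
    ax.set_title(f"line='{line}'")
plt.show()
```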

When working with Q-Q plots, we can also use another feature of statsmodels that plots non-exceedance probabilities in place of theoretical quantiles (the probplot method instead of qqplot).
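A sketch of such a call might be:

```python
from statsmodels.graphics.gofplots import ProbPlot

# probplot shows non-exceedance probabilities on the x-axis
# instead of theoretical quantiles.
ProbPlot(data_skew_normal).probplot(line="s")
```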

You can read more about this methodology here.

4. Summing Up

In this article, I tried to explain the key concepts of probability plots using the examples of P-P and Q-Q plots. You can find the notebook with the code used to generate the plots in this article on my GitHub. In case you have questions or suggestions, please let me know in the comments or reach out on Twitter.

Liked the article? Become a Medium member to continue learning by reading without limits. If you use this link to become a member, you will support me at no extra cost to you. Thanks in advance and see you around!

You might also be interested in one of the following:

Phik (𝜙k) – get familiar with the latest correlation coefficient that is also consistent between categorical, ordinal, and interval variables! (towardsdatascience.com)

Prediction Strength – a simple, yet relatively unknown way to evaluate clustering

The new kid on the statistics-in-Python block: pingouin
