Significance of Q-Q Plots

What secrets about the data will it tell us?

Sundaresh Chandran
Towards Data Science


Photo by Matt Duncan on Unsplash

Understanding the distribution of a variable is one of the first tasks when exploring a dataset. One way to check the distribution of a continuous variable graphically is a Q-Q plot. Personally, I find these plots most handy before running parametric tests, since those tests assume normality, although a Q-Q plot can be used to check against any underlying distribution.

What is a Q-Q plot?

A quantile-quantile plot, or Q-Q plot, is a scatter plot created by plotting two sets of quantiles against each other. The first set comes from the variable you are testing, and the second comes from the distribution you are testing it against. For example, if you are testing whether the ages of the employees in your team are normally distributed, you compare the quantiles of your team members’ ages against the quantiles of a normal distribution. If both sets of quantiles come from the same distribution, the points should fall roughly on a straight line.
Since this is a visual tool for comparison, the results can be somewhat subjective, but they are nonetheless useful for understanding the underlying distribution of a variable.

How is it generated?

Below are the steps to generate a Q-Q plot of team members’ ages to test for normality:

  1. Take your variable of interest (team member age in this scenario) and sort it from the smallest to the largest value. Let’s say you have 19 team members in this scenario.
  2. Take a standard normal curve and divide the area under it into 20 segments of equal probability (n+1, where n is the number of data points). This gives 19 cut points, one per data point.
  3. Compute the z-score at each of these cut points.
  4. Plot the z-scores against the sorted values. Usually, the z-scores go on the x-axis (also called theoretical quantiles, since we use them as the base for comparison) and the sorted values of the variable go on the y-axis (also called ordered values).
  5. Observe whether the data points align closely along a straight line (a 45-degree line if the variable has been standardized first).
  6. If they do, the ages are consistent with a normal distribution. If they do not, you might want to check the variable against other possible distributions.
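The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper name `normal_qq_points` and the 19 ages are made up for the example, and the cut points use the simple i/(n+1) rule from step 2. Library routines such as `scipy.stats.probplot` use slightly different plotting positions, so their x-values will not match these exactly.

```python
import numpy as np
from scipy import stats


def normal_qq_points(values):
    """Return (theoretical z-scores, sorted values) for a normal Q-Q plot."""
    # Step 1: sort the variable from smallest to largest
    ordered = np.sort(np.asarray(values, dtype=float))
    n = ordered.size
    # Step 2: n cut points that split the normal curve into n + 1 equal-area segments
    probabilities = np.arange(1, n + 1) / (n + 1)
    # Step 3: z-score at each cut point (inverse of the standard normal CDF)
    theoretical = stats.norm.ppf(probabilities)
    return theoretical, ordered


# Hypothetical ages for 19 team members (made-up data, for illustration only)
rng = np.random.default_rng(0)
ages = rng.normal(loc=35, scale=6, size=19).round()

z_scores, ordered_ages = normal_qq_points(ages)

# Step 4: each (z-score, age) pair is one point on the Q-Q plot
for z, age in zip(z_scores, ordered_ages):
    print(f"{z:+.2f}  {age:.0f}")
```

Passing `z_scores` and `ordered_ages` to a scatter plot draws the Q-Q plot itself; steps 5 and 6 are then a matter of judging how closely the points follow a straight line.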

How can I plot this in Python?

Below is sample code that checks for normality on data drawn from a normal distribution:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
# %matplotlib inline  # uncomment when running in a Jupyter notebook
np.random.seed(100)  # for reproducibility
# Generate 200 random normal data points with mean=0, standard_deviation=0.1
# Documentation: https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
random_normal_datapoints = pd.Series(np.random.normal(0, 0.1, 200))
# Let's plot the data points along with their KDE to see how they look
fig, ax = plt.subplots()
random_normal_datapoints.plot.kde(ax=ax, legend=False, title='A random normal distribution with mean 0 and SD 0.1')
random_normal_datapoints.plot.hist(density=True, ax=ax)
ax.set_ylabel('Density')  # density=True normalizes the histogram, so the y-axis is a density
plt.show()  # finish this figure so the Q-Q plot is drawn on a new one
# Plot the Q-Q plot to graphically check the hypothesis
# Documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html
res = stats.probplot(random_normal_datapoints, plot=plt)
plt.show()

As you can observe, the data points lie approximately on a straight line. Thus, we can say the data is consistent with a normal distribution, which is what we expect since it was sampled from one.

If you were to plot a sample from a uniform distribution against the normal distribution, you would get the plot below.

As you can see, most of the points do not lie on a straight line: they bend away from it at both ends, which shows that the underlying distribution is not normal.
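A rough numeric companion to the two visual checks is sketched below. When called without `plot=...`, `scipy.stats.probplot` returns the plotted points together with the slope, intercept and correlation coefficient `r` of the fitted line, so the two samples can be compared without drawing anything. The sample sizes and parameters here are assumptions chosen to mirror the example above, and `r` is only a summary of how straight the points are, not a formal normality test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(100)  # seeded for reproducibility

# Two samples of 200 points: one normal (mean 0, SD 0.1), one uniform on [0, 1)
normal_sample = rng.normal(0, 0.1, 200)
uniform_sample = rng.uniform(0, 1, 200)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_uniform) = stats.probplot(uniform_sample, dist="norm")

# The closer r is to 1, the closer the points lie to the fitted straight line
print(f"normal sample : r = {r_normal:.4f}")
print(f"uniform sample: r = {r_uniform:.4f}")
```

The uniform sample should give the lower `r` of the two, matching the bend seen in its Q-Q plot. Adding `plot=plt` to either call draws the corresponding plot.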

More examples, including checks against other distributions, can be found in the GitHub repository below:

https://github.com/SundareshPrasanna/QQPlot-Medium

What else can we do with it?

A Q-Q plot can also be used to compare the distributions of two different datasets. For example, if the age variable has 200 records in dataset 1 and 20 records in dataset 2, you can plot the quantiles of one against the quantiles of the other to see whether they follow the same distribution. This can be particularly helpful in machine learning, where we split data into train, validation and test sets and want to check that the distribution is indeed the same across them. It is also used after deployment to visually identify covariate shift, dataset shift or concept shift.
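One way to build such a two-sample comparison is sketched below. The helper `two_sample_qq` and both age samples are hypothetical. The sketch assumes one common convention: evaluate both datasets at the same set of probabilities, using as many points as the smaller dataset has records.

```python
import numpy as np


def two_sample_qq(sample_a, sample_b):
    """Return matching quantiles of two samples of possibly different sizes."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # Use as many quantiles as the smaller sample has records
    m = min(a.size, b.size)
    # Evenly spaced probabilities strictly between 0 and 1
    probabilities = (np.arange(1, m + 1) - 0.5) / m
    # np.quantile interpolates, so the larger sample is reduced to m points
    return np.quantile(a, probabilities), np.quantile(b, probabilities)


# Made-up age data: 200 records in dataset 1, 20 records in dataset 2
rng = np.random.default_rng(42)
ages_dataset_1 = rng.normal(35, 6, 200)
ages_dataset_2 = rng.normal(35, 6, 20)

q1, q2 = two_sample_qq(ages_dataset_1, ages_dataset_2)

# Each row is one point of the two-sample Q-Q plot
print(np.column_stack([q1, q2]).round(1))
```

Plotting `q1` against `q2` gives the two-sample Q-Q plot. If both datasets follow the same distribution, the points should fall close to the 45-degree line y = x, because both axes are on the same scale.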

Summary

In summary, a Q-Q plot helps you graphically compare the sample distribution of the variable at hand against any other possible distribution.


Data Scientist @Royal Dutch Shell | Deep Learning | NLP | TensorFlow 2.0 | Python | Astrophysics ❤