The world’s leading publication for data science, AI, and ML professionals.

Methods for Normality Test with Application in Python


An exploration of graphical and statistical methods for testing normality in Python

Image by Author created by wordart.com

When we hear of a normality test, we often find ourselves puzzled about which test best determines whether a variable is normally distributed. Since many algorithms assume the data are normally distributed, the question matters in practice. To explore it, the UCI Abalone dataset is used, which has nine variables, eight of which are numeric.

Graphical Methods

Visually, a distribution can be understood using histogram plots. The histogram plot below shows the distribution of each variable; Python is used throughout to explore the topic. As we can see, the distributions of all the variables are skewed, and hence none is normally distributed.

Image by Author
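A histogram like the one above takes only a few lines of matplotlib. A minimal sketch follows; since the Abalone data is not loaded here, a synthetic right-skewed sample (a gamma draw) stands in for one of its numeric variables.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for script use
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Synthetic right-skewed sample standing in for an Abalone variable
sample = rng.gamma(shape=2.0, scale=0.2, size=4177)

fig, ax = plt.subplots()
ax.hist(sample, bins=30, edgecolor="black")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
fig.savefig("histogram.png")
```

On a real pandas DataFrame, `df.hist(figsize=(10, 8))` draws all numeric columns at once.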

A Q-Q plot is a scatter plot of two sets of quantiles against each other; if both sets come from the same distribution, the points form a roughly straight line. In the Q-Q plot, only the "Height" variable looks straight, though it shows some outliers. Because "Rings" is a count variable, its Q-Q plot looks different, like a staircase. The other variables' Q-Q plots are not straight, showing deviation from the normal distribution.

Image by Author
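A Q-Q plot against the normal distribution can be drawn with scipy.stats.probplot. A minimal sketch, again using a synthetic skewed sample in place of the Abalone data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for script use
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.gamma(shape=2.0, scale=0.2, size=4177)

fig, ax = plt.subplots()
# Sample quantiles vs. theoretical normal quantiles, plus a reference line
stats.probplot(skewed, dist="norm", plot=ax)
fig.savefig("qq_plot.png")
```

For a skewed sample the points curve away from the reference line at the tails; for a truly normal sample they hug it.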

A boxplot graphically depicts groups of numeric data through their quartiles. Boxplots are used not only to detect outliers but also to see and understand the distribution. Here too, every variable except "Height" looks skewed, since the two whiskers are of unequal length, and hence not normally distributed.

Image by Author
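The whisker asymmetry described above can be seen both in the plot and directly in the quartiles. A minimal sketch with the same synthetic skewed stand-in sample:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for script use
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
skewed = rng.gamma(shape=2.0, scale=0.2, size=4177)

fig, ax = plt.subplots()
ax.boxplot(skewed)
ax.set_ylabel("Value")
fig.savefig("boxplot.png")

# Unequal quartile distances hint at skew even without the plot:
q1, median, q3 = np.percentile(skewed, [25, 50, 75])
print(median - q1, q3 - median)
```

For a right-skewed sample the upper half (q3 − median) is wider than the lower half (median − q1); for a symmetric one they are roughly equal.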

Statistical Tests

Since one can easily get baffled trying to judge a distribution from visual representations alone, there are statistical tests for normality that are much clearer to interpret.

D’Agostino’s K-squared test: This tests whether a sample differs from a normal distribution.

H0: The sample comes from a normal distribution.

HA: The sample does not come from a normal distribution.

It is based on D'Agostino and Pearson's [1], [2] test, which combines skewness and kurtosis into an omnibus test of normality. In Python, scipy.stats.normaltest is used. It returns the statistic s^2 + k^2, where s is the z-score from the skew test and k is the z-score from the kurtosis test, along with the p-value, i.e., the two-sided chi-squared probability for the hypothesis test. Using an alpha of 0.05, the results below were found: the null hypothesis is rejected for every variable, suggesting that none is normally distributed. Although "Height" looked normal in the graphical analysis, its p-value here is also 0, suggesting that it too is not normally distributed.

Image by Author
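Running the test takes one call. A minimal sketch, with a synthetic skewed sample standing in for an Abalone variable:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.gamma(shape=2.0, scale=0.2, size=4177)

stat, p = stats.normaltest(skewed)  # statistic is s^2 + k^2
alpha = 0.05
if p < alpha:
    print(f"p = {p:.3g}: reject H0, the sample looks non-normal")
else:
    print(f"p = {p:.3g}: fail to reject H0")
```

On a real DataFrame, applying the same call column by column reproduces the table of statistics and p-values shown above.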

Jarque-Bera test [3]: This tests whether the sample's skewness and kurtosis match those of a normal distribution, i.e., skewness = 0 and kurtosis = 3. The null hypothesis is the same as for D'Agostino's K-squared test. The test statistic is always nonnegative; values far from zero indicate that the data do not follow a normal distribution. The test is asymptotic and only reliable for large samples (more than about 2000). In Python, scipy.stats.jarque_bera is used. Below we can see that every variable's test statistic is far from zero and every p-value is 0, suggesting that none of the variables is normally distributed.

Image by Author
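The call mirrors normaltest. A minimal sketch with the synthetic stand-in sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.gamma(shape=2.0, scale=0.2, size=4177)

stat, p = stats.jarque_bera(skewed)
# A statistic far from zero, with a tiny p-value, signals non-normality
print(f"JB statistic = {stat:.2f}, p = {p:.3g}")
```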

Anderson-Darling test (AD test) [4–5]: This tests whether a sample comes from a particular distribution. The null hypothesis is that the sample is drawn from a population following that distribution. For the Anderson-Darling test, the critical values depend on the distribution being tested; the supported distributions are normal, exponential, logistic, and Gumbel (Extreme Value Type I). If the test statistic is larger than the critical value for a given significance level, the null hypothesis (i.e., that the data come from the chosen distribution) can be rejected at that level. Precisely, the hypotheses for the AD test are: H0: The data come from a particular distribution. HA: The data do not come from a particular distribution.

In our case, we test against the normal distribution. Below we can see that each statistic exceeds the critical value at the corresponding significance level. At the 5% significance level, we conclude that the variables do not come from a normally distributed population. In Python, scipy.stats.anderson is used for the test.

Image by Author
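Unlike the previous tests, scipy.stats.anderson returns an array of critical values rather than a p-value. A minimal sketch of the comparison at the 5% level, using the synthetic stand-in sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.gamma(shape=2.0, scale=0.2, size=4177)

result = stats.anderson(skewed, dist="norm")
# significance_level holds [15, 10, 5, 2.5, 1] (%); index 2 is the 5% level
crit_5pct = result.critical_values[2]
if result.statistic > crit_5pct:
    print(f"{result.statistic:.2f} > {crit_5pct:.3f}: "
          "reject H0 at the 5% level, not normal")
```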

Kolmogorov-Smirnov test [6]: This is a non-parametric test, i.e., it makes no assumption about the distribution of the data. The Kolmogorov-Smirnov test is used to understand how well the distribution of sample data conforms to some theoretical distribution. It compares a theoretical cumulative distribution function, Ft(x), with the sample's cumulative distribution function, Fs(x), where the sample is a random sample with unknown cumulative distribution function Fs(x). Precisely,

H0: Fs(x) is equal to Ft(x) for all x from −∞ to ∞.

HA: Fs(x) is not equal to Ft(x) for at least one x.

In Python, scipy.stats.kstest is used for the test. In the kstest function, the parameter "alternative" specifies the alternative hypothesis (default "two-sided"), 'norm' is passed as the cdf argument to compare against the normal distribution, and "args" supplies that distribution's parameters. Again, we can conclude that none of the variables is normally distributed, as the p-value is 0 for all of them.

Image by Author
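One caveat worth showing in code: 'norm' defaults to a standard normal (mean 0, std 1), so the sample should either be standardised first or the fitted parameters passed via args. A minimal sketch with the synthetic stand-in sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.gamma(shape=2.0, scale=0.2, size=4177)

# Standardise before comparing against the standard normal;
# equivalently: stats.kstest(skewed, "norm", args=(mean, std))
z = (skewed - skewed.mean()) / skewed.std()
stat, p = stats.kstest(z, "norm", alternative="two-sided")
print(f"KS statistic = {stat:.3f}, p = {p:.3g}")
```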

Lilliefors test [7]: This is also a normality test, based on the Kolmogorov–Smirnov test. It is used specifically to test the null hypothesis that the sample comes from a normally distributed population when the null hypothesis does not specify which normal distribution, i.e., it does not specify the expected value and variance. In Python, the scipy library does not provide a Lilliefors test, but the statsmodels library does: statsmodels.stats.diagnostic.lilliefors(x, dist='norm', pvalmethod='table').
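Note that, unlike the kstest sketch above, the Lilliefors variant estimates the mean and variance from the sample itself, so no standardisation or args are needed. A minimal sketch, assuming statsmodels is installed and using the synthetic stand-in sample:

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(42)
skewed = rng.gamma(shape=2.0, scale=0.2, size=4177)

# Mean and variance are estimated internally from the sample
ksstat, p = lilliefors(skewed, dist="norm", pvalmethod="table")
print(f"Lilliefors statistic = {ksstat:.3f}, p = {p:.3g}")
```

With pvalmethod='table', p-values are interpolated from a lookup table and clipped to a limited range, so a very small reported p should be read as "at most" that value.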

Shapiro-Wilk test: This is the most popular test of normality. Its hypotheses are:

H0: The sample comes from a normal distribution.

HA: The sample does not come from a normal distribution.

In Python, scipy.stats.shapiro(x) is used. Below we can see again that none of the variables is normally distributed, as the null hypothesis is rejected: taking alpha as 0.05, the p-values calculated by the Shapiro-Wilk test are all less than alpha.

Image by Author
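A minimal sketch, with the synthetic skewed sample standing in for an Abalone variable:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.gamma(shape=2.0, scale=0.2, size=4177)

# SciPy warns that the p-value may be inaccurate for n > 5000; 4177 is fine
stat, p = stats.shapiro(skewed)
alpha = 0.05
if p < alpha:
    print(f"p = {p:.3g}: reject H0, not normally distributed")
```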

All of these tests are available to assess normality. Graphical representations are easy to understand, but statistical tests let us reach the conclusion precisely, and they explain it numerically whenever the graphs leave any room for doubt.

The notebook for this project can be found here.

References:

1. D’Agostino, R. B. (1971), "An omnibus test of normality for moderate and large sample size", Biometrika, 58, 341–348.
2. D’Agostino, R. and Pearson, E. S. (1973), "Tests for departure from normality", Biometrika, 60, 613–622.
3. Jarque, C. and Bera, A. (1980), "Efficient tests for normality, homoscedasticity and serial independence of regression residuals", Economics Letters, 6, 255–259.
4. Anderson, T. W. and Darling, D. A. (1952), "Asymptotic theory of certain 'goodness-of-fit' criteria based on stochastic processes", Annals of Mathematical Statistics, 23, 193–212.
5. Anderson, T. W. and Darling, D. A. (1954), "A Test of Goodness of Fit", Journal of the American Statistical Association, 49, 765–769.
6. Kolmogorov–Smirnov Test (2008), in The Concise Encyclopedia of Statistics, Springer, New York, NY.
7. Lilliefors, H. W. (1967), "On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown", Journal of the American Statistical Association, 62 (318), 399–402.
