
Skewness and Kurtosis with Outliers

Statistics in R Series

Photo by Aaron Burden on Unsplash

Introduction

Real-world data often contains extreme values that can lead to a skewed distribution. Skewed data is not suitable for many statistical analyses, and the presence of a single outlier can drastically change the overall statistics of a distribution. These extreme values therefore need to be handled carefully: if there is no justification for keeping them, the general guideline is to remove them. In this article, we will go through the effect of outliers on skewness as well as kurtosis.

The following figure shows an example of normally distributed data, which is suitable for many statistical analyses.

Image by Author

Skewness

In statistics, skewness is a measure of the asymmetry of a distribution. Basically, it describes how far the bell curve has been distorted from its symmetrical form. Skewness can be classified into two types:

  1. Distributions exhibiting positive skewness have a tail on the right side that is longer or more spread out than the tail on the left side. The distribution’s mean is greater than its median.
  2. A negatively skewed distribution has a tail that is longer or more spread out on the left side than on the right. As a result, the mean is less than the median.
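As a quick sketch of the mean–median relationship (using a simulated right-skewed sample, not the article's data):

```r
# A positively skewed sample: exponential draws have a long right tail,
# which pulls the mean above the median.
set.seed(42)
x <- rexp(10000, rate = 1)
mean(x) > median(x)   # TRUE under positive skew
```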

It is important to identify and analyze skewed distributions in statistical analysis, as they can have a significant impact on methods that assume normality.

Image by Author
Image by Author

Kurtosis

Kurtosis describes the shape of the tails of a distribution in relation to its peak. It measures both the degree of flatness of a distribution and the amount of data concentrated around its mean. Kurtosis can be categorized into three types:

  1. Mesokurtic distributions have an excess kurtosis equal to zero (a kurtosis of 3), matching the normal distribution’s bell-shaped curve.
  2. A leptokurtic distribution is characterized by a higher peak and heavier tails than a normal distribution, as its excess kurtosis is greater than zero. Compared to a normal distribution, the data are more concentrated around the mean.
  3. In a platykurtic distribution, the peak is flatter and the tails are lighter than in a normal distribution because the excess kurtosis is less than zero. As a result, there is less concentration around the mean than in a normal distribution.
Image by Author
Image by Author
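A sketch of the three categories using simulated data (note that the `moments` package reports raw kurtosis, so the mesokurtic baseline is 3 rather than an excess of 0):

```r
library(moments)

set.seed(1)
kurtosis(rnorm(1e5))        # mesokurtic: close to 3 (excess kurtosis ~ 0)
kurtosis(rt(1e5, df = 5))   # leptokurtic: well above 3 (heavy tails)
kurtosis(runif(1e5))        # platykurtic: about 1.8 (flat, light tails)
```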

Outlier

Now let’s deal with real-world data, which is most of the time skewed and often includes outliers. We have discussed outlier detection procedures before. One common strategy is the IQR (interquartile range) method, which is an industry standard. The article below was written using Python.

Practical implementation of outlier detection in python

The IQR method for determining the upper and lower limits is as follows:

  • Find the first quartile, Q1.
  • Find the third quartile, Q3.
  • Calculate the IQR: IQR = Q3 − Q1.
  • Define the normal data range with the lower limit as Q1 − 1.5·IQR and the upper limit as Q3 + 1.5·IQR.
  • Any data point outside this range is considered an outlier and should be removed before further analysis.
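The steps above can be sketched in R (an illustrative helper, not code from the article):

```r
# IQR-based outlier removal, following the steps listed above
remove_outliers_iqr <- function(x) {
  q1  <- quantile(x, 0.25)
  q3  <- quantile(x, 0.75)
  iqr <- q3 - q1                 # interquartile range
  lower <- q1 - 1.5 * iqr        # lower fence
  upper <- q3 + 1.5 * iqr        # upper fence
  x[x >= lower & x <= upper]     # keep only points inside the fences
}

remove_outliers_iqr(c(1:10, 100))   # the extreme value 100 is dropped
```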

Dataset

Our data source for this case study is the Adult Data Set from the UCI Machine Learning Repository. It describes about 30,000 individuals with attributes such as race, education, occupation, gender, hours worked per week, and income.
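As a sketch, the data can be pulled straight from the UCI repository (the raw file has no header row, so the age column is named manually here, following the repository’s published column order):

```r
# Load the UCI Adult data; the first column is age
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
adult <- read.csv(url, header = FALSE, strip.white = TRUE)
names(adult)[1] <- "age"

hist(adult$age, breaks = 30, main = "Distribution of age", xlab = "Age")
```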

Let’s check the distribution of age.

Image by Author

The distribution seems to be slightly skewed to the right, and there are some possible outliers on the far right as well. Let’s work on determining skewness and kurtosis in R.

Implementation in R

skewness = (1/n) Σ ((xᵢ − µ)/σ)³

kurtosis = (1/n) Σ ((xᵢ − µ)/σ)⁴

The formulas for the skewness and kurtosis measurements are given above. Here, µ is the sample mean and σ is the sample standard deviation. In R, we can either define a function ourselves or use the moments library to calculate skewness, as shown in the code. The skewness determined using the moments library is 0.2213737, and using the formula it is 0.2211937, which is pretty close. Since the value is greater than 0, the distribution is positively skewed.
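A sketch of both routes (the `age` vector below is a simulated stand-in for the article’s age column, so the printed values will differ from the article’s):

```r
library(moments)

# Simulated stand-in for the article's age column (right-skewed)
set.seed(1)
age <- round(rgamma(5000, shape = 4, scale = 5)) + 17

# Skewness from the formula: mean of standardized cubed deviations
my_skewness <- function(x) {
  mu    <- mean(x)
  sigma <- sd(x)
  mean(((x - mu) / sigma)^3)
}

skewness(age)      # moments library (0.2213737 on the article's data)
my_skewness(age)   # formula version (0.2211937 on the article's data)
```

The tiny discrepancy between the two numbers comes from the standard-deviation convention: `sd()` uses the n − 1 denominator, while the moments library standardizes with the population moment.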

Now the real deal. We want to remove outliers and see whether the distribution is still skewed. Let’s trim the top 2.5% and bottom 2.5% of the data to exclude extreme data points. In this case, the quantile skewness at trim fraction p is defined as:

quantile skewness = ((Q₁₋ₚ − Q₀.₅) − (Q₀.₅ − Qₚ)) / (Q₁₋ₚ − Qₚ)

where Qₚ denotes the quantile at p.

The calculated quantile skewness from the age data is 0.09677419, which is still positive but smaller in magnitude. If we exclude the top 1% and bottom 1%, the skewness is 0.1304348, also positive but smaller in magnitude than for the original data. A 10% trim ends up with a skewness value of 0.0212766. Trimming 25% makes the data negatively skewed, but it is not practical to trim 25% of the data on both sides. It is therefore clear that the more data we trim, the less skewed the data become.

1% trim → skewness 0.1304348

2.5% trim → skewness 0.09677419

10% trim → skewness 0.0212766
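The trimmed results above can be reproduced with a small helper (hypothetical function name; `age` is again a simulated stand-in, so the printed values differ from the article’s):

```r
# Quantile (Bowley-type) skewness at trim fraction p:
# ((Q(1-p) - Q(0.5)) - (Q(0.5) - Q(p))) / (Q(1-p) - Q(p))
quantile_skewness <- function(x, p = 0.025) {
  q <- quantile(x, c(p, 0.5, 1 - p), names = FALSE)
  ((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1])
}

# Simulated stand-in for the article's age column
set.seed(1)
age <- round(rgamma(5000, shape = 4, scale = 5)) + 17

quantile_skewness(age, p = 0.01)    # 0.1304348 on the article's data
quantile_skewness(age, p = 0.025)   # 0.09677419 on the article's data
quantile_skewness(age, p = 0.10)    # 0.0212766 on the article's data
```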

Next, we can do the same study for kurtosis. The original kurtosis values using the moments library and using the formula are 2.298557 and 2.296066, respectively. The kurtosis of a normal distribution is 3 (an excess kurtosis of 0). If the kurtosis is greater than 3, the distribution is considered leptokurtic, and if it is smaller than 3, the distribution is considered platykurtic.
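The kurtosis computation mirrors the skewness one (a sketch; `age` is a simulated stand-in for the article’s age column):

```r
library(moments)

# Kurtosis from the formula: mean of standardized fourth powers
my_kurtosis <- function(x) {
  mu    <- mean(x)
  sigma <- sd(x)
  mean(((x - mu) / sigma)^4)
}

set.seed(1)
age <- round(rgamma(5000, shape = 4, scale = 5)) + 17

kurtosis(age)      # moments library (2.298557 on the article's data)
my_kurtosis(age)   # formula version (2.296066 on the article's data)
```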

We can perform quantile-based kurtosis analysis using the formula below.

quantile kurtosis = (Q₁₋ₚ − Qₚ) / (Q₃ − Q₁)

Here, Q₁ and Q₃ are the first and third quartiles of the distribution, and Qₚ is the quantile at p. When p = 0.025 (trimming 2.5% from the top and 2.5% from the bottom), the quantile kurtosis is the ratio of the 95% inner quantile range to the interquartile range. The calculated quantile-based kurtosis for age is 2.214286, which is also smaller than 3. Therefore, removing the outliers did not have much impact on the kurtosis value.
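A sketch of the quantile-based kurtosis (hypothetical helper name; `age` is a simulated stand-in for the article’s age column):

```r
# Quantile-based kurtosis: (1 - 2p) inner quantile range over the IQR
quantile_kurtosis <- function(x, p = 0.025) {
  q <- quantile(x, c(p, 0.25, 0.75, 1 - p), names = FALSE)
  (q[4] - q[1]) / (q[3] - q[2])
}

set.seed(1)
age <- round(rgamma(5000, shape = 4, scale = 5)) + 17

quantile_kurtosis(age, p = 0.025)   # 2.214286 on the article's data
```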

Instead of trimming a fixed portion of the data outright, we can use the IQR method to remove outliers.

This procedure gives us skewness and kurtosis values of 0.1967011 and 2.203808, respectively.
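A sketch of that IQR-based procedure (again with a simulated stand-in for `age`, so the printed values differ from the article’s):

```r
library(moments)

set.seed(1)
age <- round(rgamma(5000, shape = 4, scale = 5)) + 17

# Drop points outside the Q1 - 1.5*IQR and Q3 + 1.5*IQR fences
q1  <- quantile(age, 0.25)
q3  <- quantile(age, 0.75)
iqr <- q3 - q1
age_clean <- age[age >= q1 - 1.5 * iqr & age <= q3 + 1.5 * iqr]

skewness(age_clean)   # 0.1967011 on the article's data
kurtosis(age_clean)   # 2.203808 on the article's data
```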

Conclusion

We have covered the fundamentals of skewness and kurtosis and implemented them in R. Dealing with real-world data containing outliers requires some cleaning to remove the extreme values. Quantile-based skewness and kurtosis measurements were discussed, and the industry-standard outlier detection method (IQR) was implemented in R. Readers should use their own judgment when choosing an outlier removal method.

Acknowledgment for Dataset

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Thanks for reading.


