Outlier Detection (Part 1)

IQR, Standard Deviation, Z-score and Modified Z-score

Image by Author

Introduction

It is risky to include outliers in data-driven models. A single misleading value has the potential to change the conclusion implied by the model. It is therefore important to detect outliers and then decide whether or not to remove them from the dataset. Sometimes a data point is extremely high or low without being an outlier that we want to get rid of. It may simply be an extreme but valid observation, and whether to include it in the model is the user’s decision. At other times, the data point is simply a typo or an artifact of data collection or processing.

Outlier detection methods

In this article, I will go through several statistical outlier detection methods: Inter-Quartile Range (IQR), standard deviation, Z-score and modified Z-score. I will use Python to implement these methods and will share the data as well as the notebook at the end. I will use the Boston AirBnB data, which is available under a CC0: Public Domain license.

Inter Quartile Range

Inter Quartile Range (IQR) is one of the most extensively used procedures for outlier detection and removal. The procedure consists of the following steps:

  • Find the first quartile, Q1.
  • Find the third quartile, Q3.
  • Calculate the IQR. IQR = Q3-Q1.
  • Define the normal data range with the lower limit as Q1 - 1.5 * IQR and the upper limit as Q3 + 1.5 * IQR.
  • Any data point outside this range is considered an outlier and can be removed before further analysis.

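A boxplot uses this same IQR method to flag extreme data points. The upper whisker extends to the largest data point that is still at or below Q3 + 1.5 * IQR, and the lower whisker extends to the smallest data point that is still at or above Q1 - 1.5 * IQR. Points beyond the whiskers are drawn individually as outliers.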

Implementation in python

The AirBnB data has a price column that we will consider for the implementation. After determining Q1, Q3 and IQR, the outlier points are removed from the data frame as shown below.
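The steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: the function names `iqr_limits` and `remove_iqr_outliers` are hypothetical, the prices are made-up numbers rather than the AirBnB data, and the quartiles use NumPy's default linear interpolation, which may differ slightly from other quartile conventions.

```python
import numpy as np

def iqr_limits(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_iqr_outliers(values, k=1.5):
    """Split values into (kept, outliers) using the IQR limits."""
    values = np.asarray(values, dtype=float)
    lower, upper = iqr_limits(values, k)
    keep = (values >= lower) & (values <= upper)
    return values[keep], values[~keep]

# Made-up prices for illustration only (not the AirBnB data)
prices = [50, 60, 65, 70, 75, 80, 85, 90, 100, 400]
kept, outliers = remove_iqr_outliers(prices)
print("limits:", iqr_limits(prices))
print("outliers:", outliers)
```

With a pandas data frame, the same boolean mask could be applied to the price column to drop the flagged rows.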

One outlier is detected after IQR method is applied [Image by Author]
Outlier value shown with other boxplot details [Image by Author]

The value of the outlier is 417, which is above the upper whisker shown in the boxplot (whishi = 402). We do not have any outlier below the lower whisker (whislo = 10).
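The whisker values quoted above are data points, not the computed limits themselves. The sketch below shows one way to obtain them, assuming the usual boxplot convention that each whisker stops at the most extreme data point inside the 1.5 * IQR limits. The function name `whiskers_and_fliers` is hypothetical, the return names simply mirror the `whislo` and `whishi` labels used in the text, and the data is made up.

```python
import numpy as np

def whiskers_and_fliers(values, k=1.5):
    """Return (whislo, whishi, fliers) for a 1.5*IQR-style boxplot."""
    values = np.sort(np.asarray(values, dtype=float))
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Whiskers end at the most extreme points that are still inside the limits
    inside = values[(values >= lower) & (values <= upper)]
    whislo, whishi = inside.min(), inside.max()
    # Everything beyond the limits is drawn as an individual flier
    fliers = values[(values < lower) | (values > upper)]
    return whislo, whishi, fliers

# Made-up prices for illustration only (not the AirBnB data)
prices = [50, 60, 65, 70, 75, 80, 85, 90, 100, 400]
whislo, whishi, fliers = whiskers_and_fliers(prices)
print(whislo, whishi, fliers)
```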

Note for normal distribution

If the dataset is normally distributed, the quartiles can be determined from the mean and the standard deviation.

[Image by Author]
[Image by Author]
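For readers who cannot see the images, the relationship can be written out as below. This is a reconstruction from the standard normal quantiles rather than a copy of the original figures. The symbols μ (mean), σ (standard deviation) and k (IQR multiplier) are notation introduced here.

```latex
\begin{align*}
Q_1 &= \mu - 0.6745\,\sigma, \qquad Q_3 = \mu + 0.6745\,\sigma \\
\mathrm{IQR} &= Q_3 - Q_1 = 1.349\,\sigma \\
Q_3 + 1.5\,\mathrm{IQR} &= \mu + 0.6745\,\sigma + 1.5 \times 1.349\,\sigma \approx \mu + 2.7\,\sigma \\
Q_1 - 1.5\,\mathrm{IQR} &= \mu - 0.6745\,\sigma - 1.5 \times 1.349\,\sigma \approx \mu - 2.7\,\sigma \\
Q_3 + k\,\mathrm{IQR} = \mu + 3\,\sigma
  &\;\Longrightarrow\; k = \frac{3 - 0.6745}{1.349} \approx 1.72
\end{align*}
```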

Therefore, for normally distributed data, applying the IQR method with a multiplier of 1.5 is equivalent to applying the standard deviation method with a multiplier of about 2.7, as shown above. However, we are often interested in setting the limits at 3 standard deviations. For that purpose, the IQR multiplier should be ~1.7 instead of 1.5.
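The two multipliers can also be converted numerically. The snippet below is a small sketch that uses only the Python standard library. The helper names `iqr_to_sigma_multiplier` and `sigma_to_iqr_multiplier` are hypothetical, and the conversion holds only under the assumption that the data is normally distributed.

```python
from statistics import NormalDist

# 75th percentile of the standard normal distribution (about 0.6745)
Z_Q3 = NormalDist().inv_cdf(0.75)

def iqr_to_sigma_multiplier(k):
    """Stdev multiplier equivalent to an IQR multiplier k (normal data only)."""
    # Q3 + k*IQR = Z_Q3*sigma + k*(2*Z_Q3*sigma) = Z_Q3*(1 + 2k)*sigma
    return Z_Q3 * (1 + 2 * k)

def sigma_to_iqr_multiplier(m):
    """IQR multiplier equivalent to a stdev multiplier m (normal data only)."""
    return (m / Z_Q3 - 1) / 2

print(round(iqr_to_sigma_multiplier(1.5), 2))
print(round(sigma_to_iqr_multiplier(3), 2))
```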

Standard Deviation

The standard deviation method is similar to the IQR procedure discussed above. Depending on whether the limits are set at 2 times stdev or 3 times stdev from the mean, we can detect and remove outliers from the dataset.

Upper limit = mean + 3 * stdev

Lower limit = mean – 3 * stdev
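The limits above can be sketched in a few lines. This is an illustration under stated assumptions, not the notebook's code: `stdev_limits` and `split_by_stdev` are hypothetical names, the sample standard deviation (`ddof=1`) is used, and the data is made up.

```python
import numpy as np

def stdev_limits(values, k=3.0):
    """Return (lower, upper) = mean -/+ k * sample standard deviation."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    stdev = values.std(ddof=1)  # assumption: sample (n - 1) standard deviation
    return mean - k * stdev, mean + k * stdev

def split_by_stdev(values, k=3.0):
    """Split values into (kept, outliers) using mean -/+ k * stdev limits."""
    values = np.asarray(values, dtype=float)
    lower, upper = stdev_limits(values, k)
    is_outlier = (values < lower) | (values > upper)
    return values[~is_outlier], values[is_outlier]

# Made-up data: 19 ordinary values plus one extreme value
data = list(range(90, 109)) + [1000]
kept, outliers = split_by_stdev(data, k=3.0)
print(outliers)
```

Changing `k` to 2 narrows the limits, so more points fall outside them.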

More outliers are found when mean +/- 3 times stdev are set as limits [Image by Author]

After the points outside mean +/- 3 times stdev are removed, a new boxplot of the updated data still shows 57 outliers. When 2 times stdev is used, the number of fliers drops to 20. This is because 2 times stdev sets stricter limits, so the majority of the probable extreme points are already removed by the procedure. Only a few remain after the removal, and those are still flagged as outliers when a new boxplot is generated with the updated data.

Z-score

The Z-score is just another form of the standard deviation procedure. It converts the data into another dataset with mean = 0 and standard deviation = 1.

z = (x - X-bar) / s
[Image by Author]

Here, X-bar is the mean value and s is the standard deviation. Once the data is converted, the center becomes 0 and the z-score of each data point represents its distance from the center in units of standard deviation. For example, a z-score of 2.5 indicates that the data point is 2.5 standard deviations away from the mean. Usually a z-score of 3 is used as the cut-off value. Therefore, any data point with a z-score greater than +3 or less than -3 is considered an outlier, which is essentially the same as the standard deviation method with a multiplier of 3.
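A minimal sketch of this conversion and cut-off follows. The names `z_scores` and `z_score_outliers` are hypothetical, the sample standard deviation (`ddof=1`) is an assumption, and the data is made up.

```python
import numpy as np

def z_scores(values):
    """Convert values to z-scores: (x - mean) / sample standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

def z_score_outliers(values, cutoff=3.0):
    """Return the values whose absolute z-score exceeds the cut-off."""
    values = np.asarray(values, dtype=float)
    return values[np.abs(z_scores(values)) > cutoff]

# Made-up data: 19 ordinary values plus one extreme value
data = list(range(90, 109)) + [1000]
z = z_scores(data)
print("mean of z:", round(float(z.mean()), 6))
print("outliers:", z_score_outliers(data))
```

Because the z-score is only a rescaling of the distance from the mean, this flags the same points as mean +/- 3 times stdev limits computed with the same standard deviation.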

This method flagged 21 outliers. After removing those 21 points, 20 outliers remain in the boxplot of the updated data.

Outliers after implementing Z-score method with cut-off value set at 3 [Image by Author]

In fact, these 20 outliers are the same data points that we obtained from the 3 times stdev method. Therefore, the user may proceed with either one.

Modified Z-score

The Z-score is itself susceptible to extreme data points. A single extreme value can move the mean significantly away from the center of the data and inflate the standard deviation, which distorts the z-scores of all points, including the extreme one. The modified z-score is more robust than the standard z-score because it is calculated from the median and the median absolute deviation (MAD). The formula for the modified Z-score is [1]

[Image by Author]

Since it relies on the median, it is less susceptible to outliers. We can use a cut-off value of 3.5 for the modified Z-score [1]. Once this procedure was implemented, we obtained the same result as with the Z-score, probably because the dataset does not have any influential outlier.
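The sketch below illustrates the idea, assuming the commonly used form of the modified Z-score, M = 0.6745 * (x - median) / MAD, where MAD is the median of the absolute deviations from the median. The function names are hypothetical and the data is made up. It is chosen so that the ordinary z-score of the extreme point stays below 3 while the modified Z-score clearly exceeds 3.5.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-score, assuming M = 0.6745 * (x - median) / MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return 0.6745 * (values - median) / mad

def modified_z_outliers(values, cutoff=3.5):
    """Return the values whose absolute modified Z-score exceeds the cut-off."""
    values = np.asarray(values, dtype=float)
    return values[np.abs(modified_z_scores(values)) > cutoff]

# Made-up prices for illustration only (not the AirBnB data)
prices = np.array([50, 60, 65, 70, 75, 80, 85, 90, 100, 400], dtype=float)

# Ordinary z-scores for comparison (sample standard deviation)
plain_z = (prices - prices.mean()) / prices.std(ddof=1)

print("largest |z|:", round(float(np.abs(plain_z).max()), 2))
print("modified Z outliers:", modified_z_outliers(prices))
```

In this small sample the value 400 inflates the standard deviation enough that its ordinary z-score never reaches 3, whereas the median-based score is not affected in the same way.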

Outliers after implementing modified Z-score method with cut-off value set at 3.5 [Image by Author]

Conclusion

In this article, we have shown four statistical procedures for detecting outliers in a dataset. For normally distributed data, an IQR multiplier of 1.7 is similar to a stdev multiplier of 3, which in turn is the same as setting the cut-off Z-score at 3. Statisticians use the modified Z-score to minimize the influence of outliers on the score itself, since it is built on the median and the MAD rather than on the mean and the standard deviation. The modified Z-score therefore indicates how strongly a point deviates from the bulk of the data. All of these are standard procedures for determining outliers statistically.

Github Page for code

My website: Learning From data

Youtube for this article

Code Walkthrough

Reference

  1. Sarmad, Majid. Robust data analysis for factorial experimental designs: Improved methods and software. http://etheses.dur.ac.uk/2432/1/2432_443.pdf
