
Outliers, one of the buzzwords in the manufacturing industry, have driven engineers and scientists to develop newer algorithms and more robust techniques for continuous quality improvement. Even a single outlier in the data can dramatically skew the calculated parameters, so it is important to analyze the data without those deviant points. It is equally important to understand which data points actually count as outliers: an extreme data point is not necessarily an outlier.
In this article, I will discuss the algorithm and the Python implementation for three outlier detection techniques: the interquartile range (IQR) method, the Hampel method and the DBSCAN clustering method.
Interquartile range (IQR) method
Each dataset can be divided into quartiles. The first quartile is the value below which 25% of the data points fall, the second quartile is the median of the dataset, and the third quartile is the value below which 75% of the data points fall. The interquartile range method finds outliers in numerical datasets with the following procedure:
- Find the first quartile, Q1.
- Find the third quartile, Q3.
- Calculate the IQR: IQR = Q3 - Q1.
- Define the normal data range with lower limit Q1 - 1.5 × IQR and upper limit Q3 + 1.5 × IQR.
- Any data point outside this range is considered an outlier and should be removed before further analysis.
The concept of quartiles and IQR is best visualized with a boxplot. Its whiskers are bounded by Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, and any point outside this range is drawn as an outlier.

IQR in Python
I will take a dataset with Airbnb data from Kaggle. The dataset contains listings of thousands of Airbnb rentals with price, rating, type and so on. I will focus on the numerical price value of the rentals and create a function that can be applied to any numerical data frame column. Let’s begin.
- First import the libraries.
- Read the file.
- Remove special characters such as ‘$’ from the price column.
- See the initial distribution in boxplots.
Image by author
This boxplot shows a number of outliers in several of the rental types.
- Create a function to implement the IQR method.
- Revisit the boxplot after outlier removal. The indices of the bad data points are determined and those rows are removed from the initial dataset.
Image by author
As seen in the boxplot, the majority of the outliers are removed. One can also apply the IQR method to each rental type individually, which removes the deviant points within each type and results in a cleaner boxplot.
- Check number of outliers removed. The total number of outliers determined by this process is 124.
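The steps above can be sketched as follows. This is a minimal illustration, not the original code: the function name `iqr_outlier_index`, the column names and the toy price values are all assumptions standing in for the Kaggle data.

```python
import pandas as pd

def iqr_outlier_index(df, column, k=1.5):
    """Return the index labels of rows whose `column` value lies
    outside [Q1 - k*IQR, Q3 + k*IQR]. (Hypothetical helper.)"""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (df[column] < lower) | (df[column] > upper)
    return df.index[mask]

# Toy stand-in for the Airbnb listings (made-up values, not the real data).
listings = pd.DataFrame({
    "room_type": ["Private room", "Entire home", "Private room",
                  "Entire home", "Shared room", "Entire home", "Private room"],
    "price": ["$50.00", "$120.00", "$65.00", "$150.00",
              "$30.00", "$1,200.00", "$80.00"],
})

# Strip '$' and thousands separators, then convert the column to float.
listings["price"] = (
    listings["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Find the bad rows and drop them from the initial dataset.
bad_idx = iqr_outlier_index(listings, "price")
cleaned = listings.drop(index=bad_idx)
print(f"Outliers removed: {len(bad_idx)}")
```

To apply the method per rental type instead, the same helper could be called on each group of `listings.groupby("room_type")` and the resulting indices combined.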

Hampel method
This method applies Hampel’s filter to the data to detect outliers. The procedure is as follows.
- Find the median of the dataset.
- Calculate the absolute deviation of each data point from the median.
- Calculate the median of the deviations.
- Compare each absolute deviation against 4.5 × the median of the deviations.
- Any data point whose absolute deviation is greater than or equal to that critical value is considered an outlier.
Hampel method in Python
I used the same dataset’s price column to find the outliers.
- Define the function for the Hampel method that can work on a dataframe’s numerical column and return the indices of good data points.
- Similar boxplots are generated after the outliers are removed.
Image by author
- Check number of outliers removed. The total number of outliers determined by this process is 95.
- Comparing the y-axis range of this boxplot with that of the IQR method suggests that the data points removed by the Hampel method are a subset of those removed by the IQR method.
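A minimal sketch of such a function, following the Hampel steps listed earlier, might look like this. The name `hampel_good_index` and the toy prices are assumptions for illustration; the 4.5 multiplier is the one given in the procedure.

```python
import pandas as pd

def hampel_good_index(df, column, threshold=4.5):
    """Return the index labels of rows that are NOT flagged as outliers.
    (Hypothetical helper, not the original function.)"""
    values = df[column]
    median = values.median()
    abs_dev = (values - median).abs()   # absolute deviation from the median
    mad = abs_dev.median()              # median of those deviations
    # Note: if mad is 0 (more than half the values identical),
    # the ">=" comparison flags every point.
    is_outlier = abs_dev >= threshold * mad
    return df.index[~is_outlier]

# Made-up prices standing in for the Airbnb price column.
listings = pd.DataFrame({"price": [50.0, 120.0, 65.0, 150.0, 30.0, 1200.0, 80.0]})

good_idx = hampel_good_index(listings, "price")
cleaned = listings.loc[good_idx]
print(f"Outliers removed: {len(listings) - len(cleaned)}")
```

Here the median is 80 and the median of the deviations is 40, so the critical value is 180 and only the 1200 listing is dropped.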
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. I would like to apply this clustering algorithm to find outliers in the same dataset. The algorithm performs best when the clusters in the data have similar density. It groups together data points that are closely located, treating them as neighbors.
Python’s sklearn.cluster provides a DBSCAN class that takes two important arguments. The first and most important is eps, the maximum distance between two data points for them to be considered neighbors. An optimum value needs to be chosen for eps. The publication [1] provides a procedure for finding it: candidate eps values are plotted against the data points, and the value at which the slope of the curve changes most sharply is taken as the optimum. The second argument is min_samples, the minimum number of data points that must lie close together for the group to be considered a cluster. The higher the min_samples value, the fewer clusters are found, and vice versa [2].
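One common way to look for that change in slope is a k-distance curve: the distance from each point to its k-th nearest neighbor, sorted in ascending order. The sketch below assumes this is close to the procedure in [1]; the helper names, the toy data and the "largest second difference" rule are illustrative assumptions, and the last is only a crude stand-in for reading the elbow off a plot by eye.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sorted_k_distances(values, k=5):
    """Distance from each point to its k-th nearest neighbor, sorted ascending.
    (Hypothetical helper.)"""
    X = np.asarray(values, dtype=float).reshape(len(values), -1)
    # k + 1 neighbors because each query point is returned as its own
    # nearest neighbor (distance 0) when querying the training data.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

def elbow_eps(k_dist):
    """Crude elbow estimate: the k-distance just before the curve bends
    most sharply (largest second difference). An assumption, not a rule."""
    second_diff = np.diff(k_dist, n=2)
    return k_dist[np.argmax(second_diff) + 1]

# Toy 1-D data: a tight group plus one far-away point.
values = [10, 11, 12, 13, 14, 15, 16, 17, 100]
k_dist = sorted_k_distances(values, k=2)
eps_guess = elbow_eps(k_dist)
print(k_dist, eps_guess)
```

In practice the sorted curve would be plotted and inspected before settling on a value.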
Our Airbnb price data has some high-end rentals that could be considered outliers, but the fundamental difference between DBSCAN and the IQR or Hampel methods is that those high-end rentals can form a cluster of their own, provided the minimum number of data points is present. Let’s see the code for DBSCAN.
DBSCAN in Python
- First import the library and define a function that performs DBSCAN on the data and returns the cluster labels. A cluster label of -1 marks an outlier.
- Start with the default eps value of 0.5 and min_samples value of 5.
- Get the indices of the outliers.
- Plot the data after the outliers are removed. The total number of outliers found here is 384.
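These steps can be sketched roughly as below. The function name `dbscan_labels` and the toy prices are assumptions for illustration; `DBSCAN`, `eps`, `min_samples` and `fit_predict` are the scikit-learn names, and 0.5 and 5 are the library defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_labels(values, eps=0.5, min_samples=5):
    """Run DBSCAN on a single numerical column and return the cluster labels.
    A label of -1 marks a noise point, treated here as an outlier.
    (Hypothetical helper.)"""
    X = np.asarray(values, dtype=float).reshape(-1, 1)  # sklearn expects 2-D input
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

# Made-up prices: a budget group, a high-end group, and one isolated listing.
prices = np.array([50.0, 50.2, 50.4, 50.1, 50.3,
                   900.0, 900.2, 900.1, 900.3, 900.4,
                   300.0])

labels = dbscan_labels(prices)
outlier_idx = np.where(labels == -1)[0]
cleaned = np.delete(prices, outlier_idx)
print(labels, outlier_idx)
```

In this toy data the five high-end prices are dense enough to form their own cluster, so only the isolated 300 listing is labeled -1. Because eps is a distance in the units of the data, its value would normally need tuning to the scale of the price column.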


Conclusion
The IQR and Hampel methods are very successful for extreme outliers in data with a single pattern, whereas DBSCAN is a better choice if the data follow different patterns. For example, if we have linear data as well as circular data, DBSCAN will be able to separate the samples into different groups. In our case, some extreme high-end rentals are grouped together and form a cluster. This cluster is then isolated from some other data points that have a smaller rent value (considered outliers in this method but good data points in the IQR or Hampel method). Again, one needs to figure out what the requirement is and apply the best method. As mentioned earlier, extreme data points are not always outliers. Consider the following scatterplot with a linear fit. It does not seem to have any outliers.

Now let’s have the same scatterplot with an extreme data point.

The point is outside the main distribution but lies on the fitting line very well. It may not be an outlier but an extreme data reading. Such points can be included to train a better machine learning model. If there are enough data points outside the main distribution, even if they do not lie on the fitting line, they will form a cluster, and that is where DBSCAN is very successful.
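The situation in the scatterplot can be reproduced with a small synthetic example. The data below are made up and lie exactly on a line, which is an idealization; the point is only to show that a value-based rule such as IQR flags the extreme reading while its residual from the linear fit stays negligible.

```python
import numpy as np

# Synthetic data on the line y = 2x + 1, with one extreme reading at x = 30.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30], dtype=float)
y = 2.0 * x + 1.0

# IQR fences computed on y alone.
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
flagged_by_iqr = (y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)

# Residuals from a least-squares straight-line fit.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

print(flagged_by_iqr)           # the last point is flagged by the IQR rule
print(np.abs(residuals).max())  # yet every residual is essentially zero
```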
Reference
[1] Nadia Rahmah and Imas Sukaesih Sitanggang, "Determination of Optimal Epsilon (Eps) Value on DBSCAN Algorithm to Clustering Data on Peatland Hotspots in Sumatra", 2016 IOP Conf. Ser.: Earth Environ. Sci. 31 012012