This article discusses a few commonly used methods for detecting outliers while preprocessing data for machine learning models.

What are outliers?
Outliers are values that differ markedly from the other values in the data. The plot below highlights the outliers in red; they can be seen at both extremes of the data.

Reasons for outliers in data
- Errors during data entry or a faulty measuring device (a faulty sensor may result in extreme readings).
- Natural occurrence (salaries of junior-level employees vs. C-level employees).
Problems caused by outliers
- Outliers in the data may cause problems during model fitting (especially for linear models).
- Outliers may inflate error metrics that give higher weight to large errors (for example, mean squared error and RMSE).
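To make the second point concrete, here is a minimal sketch with made-up prediction errors (the numbers and the helper `mse` are purely illustrative and not taken from any dataset). Replacing a single error of -1 with an error of 10 is enough to change both metrics noticeably.

```python
import numpy as np


def mse(errors):
    """Mean squared error computed from an array of prediction errors."""
    return float(np.mean(np.square(errors)))


# Hypothetical prediction errors: the second set differs only in its last value.
clean = np.array([1.0, -1.0, 1.0, -1.0])
with_outlier = np.array([1.0, -1.0, 1.0, 10.0])

for name, errs in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name}: MSE={mse(errs):.2f}, RMSE={np.sqrt(mse(errs)):.2f}")
# clean: MSE=1.00, RMSE=1.00
# with outlier: MSE=25.75, RMSE=5.07
```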
Methods to identify outliers in the data
In this article, we’ll use the wine dataset from scikit-learn. Before proceeding further, we’ll load and prepare the data.
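One way to load the data is sketched below. It assumes a scikit-learn version that supports `as_frame=True` (0.23 or later) and keeps only the feature columns; the variable name `df` is a choice made here for illustration.

```python
from sklearn.datasets import load_wine

# Load the wine dataset and keep the features as a pandas DataFrame.
wine = load_wine(as_frame=True)
df = wine.data

print(df.shape)             # (178, 13): 178 samples, 13 numeric features
print(df.columns.tolist())  # feature names such as 'alcohol' and 'proline'
```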

1. Box plots
Box plots are a visual method for identifying outliers and one of the many ways to visualize a data distribution. A box plot shows q1 (25th percentile), q2 (50th percentile or median) and q3 (75th percentile) of the data, with whiskers that extend to the most extreme values lying within (q1 - 1.5*(q3-q1)) and (q3 + 1.5*(q3-q1)). Outliers, if any, are plotted as individual points above and below the whiskers.
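A grid of box plots, one per feature, could be drawn roughly as follows. This is a sketch using matplotlib; the helper name `plot_boxplots` and the grid layout are assumptions made for this example, and `whis=1.5` (matplotlib's default) corresponds to the 1.5 * IQR rule described above.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

df = load_wine(as_frame=True).data


def plot_boxplots(data, n_cols=5):
    """Draw one box plot per column of `data` on a grid of subplots."""
    n_rows = -(-data.shape[1] // n_cols)  # ceiling division
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(3 * n_cols, 3 * n_rows), squeeze=False
    )
    axes = axes.ravel()
    for ax, col in zip(axes, data.columns):
        # Points beyond 1.5 * IQR from the box are drawn as individual markers.
        ax.boxplot(data[col].to_numpy(), whis=1.5)
        ax.set_title(col, fontsize=8)
        ax.set_xticks([])
    # Hide the unused subplots at the end of the grid.
    for ax in axes[data.shape[1]:]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes


fig, axes = plot_boxplots(df)
# Call plt.show() to display the figure when running interactively.
```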

In the above plot, outliers are shown as points below and above the box plot. ‘alcohol’, ‘total_phenols’, ‘od280/od315_of_diluted_wines’, and ‘proline’ have no outliers.
2. IQR method
The IQR method is what box plots use to highlight outliers. IQR stands for interquartile range, which is the difference between q3 (75th percentile) and q1 (25th percentile). The IQR method computes a lower bound and an upper bound to identify outliers.
Lower Bound = q1 - 1.5 * IQR
Upper Bound = q3 + 1.5 * IQR
Any value below the lower bound or above the upper bound is considered to be an outlier. Below is the implementation of the IQR method in Python.
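The bounds above can be computed with pandas along these lines. This is a minimal sketch rather than a definitive implementation: the function name `iqr_outliers` and the parameter `k` are introduced here for illustration, and the data is reloaded so the snippet runs on its own.

```python
from sklearn.datasets import load_wine

df = load_wine(as_frame=True).data


def iqr_outliers(data, k=1.5):
    """Return a boolean DataFrame marking values outside the IQR bounds."""
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare every column against its own lower and upper bound.
    return data.lt(lower, axis="columns") | data.gt(upper, axis="columns")


mask = iqr_outliers(df)
print(mask.sum())  # number of flagged values per feature
```

Because the result has the same shape as the data, `df[mask.any(axis=1)]` would list the rows that contain at least one flagged value.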

Consistent with the box plots, ‘alcohol’, ‘total_phenols’, ‘od280/od315_of_diluted_wines’, and ‘proline’ have no outliers under the IQR method.
3. Z-score method
The Z-score method is another method for detecting outliers. It is generally used when a variable’s distribution looks close to Gaussian. A Z-score is the number of standard deviations a value lies from the variable’s mean.
Z-Score = (X - mean) / standard deviation
When the values of a variable are converted to Z-scores, the resulting distribution has mean = 0 and standard deviation = 1; if the original variable is Gaussian, the result is the standard normal distribution. The Z-score method requires a user-specified cut-off to identify outliers. The widely used lower-end cut-off is -3 and the upper-end cut-off is +3, because about 99.7% of the values in a standard normal distribution lie between -3 and +3. Let’s look at the implementation of the Z-score method in Python.
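A possible implementation with pandas is sketched below. The function name `zscore_outliers` is hypothetical, and the choice of the population standard deviation (`ddof=0`) is an assumption; pandas defaults to the sample standard deviation (`ddof=1`), which gives slightly smaller Z-scores.

```python
from sklearn.datasets import load_wine

df = load_wine(as_frame=True).data


def zscore_outliers(data, cutoff=3.0):
    """Return a boolean DataFrame marking values with |Z-score| > cutoff."""
    # Z-score = (X - mean) / standard deviation, computed per column.
    z = (data - data.mean()) / data.std(ddof=0)
    return z.abs() > cutoff


mask = zscore_outliers(df)
print(mask.sum())  # number of flagged values per feature
```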

4. ‘Distance from the mean’ method (Multivariate method)
Unlike the previous methods, this method considers multiple variables in a data set to detect outliers. It calculates the Euclidean distance of each data point from the mean of the data and converts the distances into absolute z-scores. Any data point whose z-score is greater than the pre-specified cut-off is considered to be an outlier. We’ll consider two variables (‘malic_acid’ and ‘magnesium’) of the wine dataset for implementing this method in Python using a cut-off of 3.
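The steps described above might look like this in NumPy. This is a sketch under stated assumptions: the function name `distance_outliers` is introduced here, and the two variables are used on their original scales. Since ‘magnesium’ has a much larger range than ‘malic_acid’, it dominates the distances unless the variables are standardized first.

```python
import numpy as np
from sklearn.datasets import load_wine


def distance_outliers(data, cutoff=3.0):
    """Flag rows whose distance from the mean point has |z-score| > cutoff."""
    values = np.asarray(data, dtype=float)
    centre = values.mean(axis=0)
    # Euclidean distance of every row from the mean of all rows.
    dist = np.sqrt(((values - centre) ** 2).sum(axis=1))
    # Convert the distances into absolute z-scores.
    z = np.abs((dist - dist.mean()) / dist.std())
    return z > cutoff


df = load_wine(as_frame=True).data
X = df[["malic_acid", "magnesium"]]
mask = distance_outliers(X, cutoff=3.0)
print(X[mask])  # rows flagged as multivariate outliers
```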

These are a few commonly used outlier detection methods in machine learning. The presence of outliers may cause problems during model fitting (especially for linear models) and may also inflate error metrics that give higher weight to large errors. Hence, it is necessary to treat outliers before building a machine learning model.
Know more about my work at https://ksvmuralidhar.in/