Outlier detection methods in Machine Learning

This article discusses a few commonly used methods to detect outliers while preprocessing data for machine learning models.

Image by Clay Banks on Unsplash

What are outliers?

Outliers are values that look markedly different from the other values in the data. Below is a plot highlighting the outliers in red; they appear at both extremes of the data.

Image by author

Reasons for outliers in data

  1. Errors during data entry or a faulty measuring device (a faulty sensor may result in extreme readings).
  2. Natural occurrence (e.g., salaries of junior-level employees vs. C-level employees).

Problems caused by outliers

  1. Outliers in the data may cause problems during model fitting (especially for linear models).
  2. Outliers may inflate error metrics that give higher weight to large errors (e.g., mean squared error, RMSE).

Methods to identify outliers in the data

In this article, we’ll use the wine dataset from Scikit-Learn. Before proceeding further, we’ll load and prepare the data.

Image by author
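Below is a minimal sketch of this step, assuming pandas and scikit-learn are installed; the DataFrame `df` it creates is reused in the later snippets, though the author's original code in the screenshot may differ.

```python
import pandas as pd
from sklearn.datasets import load_wine

# Load the wine dataset and place the features in a DataFrame
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)

print(df.shape)   # (178, 13)
print(df.head())
```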

1. Box plots

Box plots are a visual method to identify outliers and one of the many ways to visualize a data distribution. A box plot shows q1 (25th percentile), q2 (50th percentile or median), and q3 (75th percentile) of the data, along with whiskers at (q1 - 1.5*(q3 - q1)) and (q3 + 1.5*(q3 - q1)). Outliers, if any, are plotted as points above and below the whiskers.
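The box plots can be produced with a short snippet like the one below (a sketch assuming the DataFrame `df` loaded earlier and matplotlib; the styling of the original figure may differ).

```python
import matplotlib.pyplot as plt

# One box plot per feature; points beyond the whiskers
# (q1 - 1.5*IQR and q3 + 1.5*IQR) are the outliers.
df.plot(kind="box", subplots=True, layout=(4, 4),
        figsize=(14, 10), sharex=False, sharey=False)
plt.tight_layout()
plt.show()
```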

Image by author

In the above plot, outliers are shown as points below and above the box plot. ‘alcohol’, ‘total_phenols’, ‘od280/od315_of_diluted_wines’, and ‘proline’ have no outliers.

2. IQR method

The IQR method is the rule box plots use to highlight outliers. IQR stands for interquartile range, the difference between q3 (75th percentile) and q1 (25th percentile). The IQR method computes a lower bound and an upper bound to identify outliers.

Lower Bound = q1 - 1.5*IQR

Upper Bound = q3 + 1.5*IQR

Any value below the lower bound or above the upper bound is considered an outlier. Below is an implementation of the IQR method in Python.
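A sketch of that implementation, assuming the DataFrame `df` from the data-loading step (the author's exact code and output format may differ):

```python
# Per-column quartiles and IQR
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag values outside [lower_bound, upper_bound] and count them per column
outliers = (df < lower_bound) | (df > upper_bound)
print(outliers.sum())
```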

Image by author

As seen in the box plots, ‘alcohol’, ‘total_phenols’, ‘od280/od315_of_diluted_wines’, and ‘proline’ have no outliers.

3. Z-score method

The Z-score method is another way to detect outliers. It is generally used when a variable’s distribution looks close to Gaussian. A value’s Z-score is the number of standard deviations it lies away from the variable’s mean.

Z-Score = (X-mean) / Standard deviation

When the values of a variable are converted to Z-scores, the resulting distribution is the standard normal distribution, with mean = 0 and standard deviation = 1. The Z-score method requires a user-specified cut-off to identify outliers. The widely used cut-offs are -3 at the lower end and +3 at the upper end, because 99.7% of the values in a standard normal distribution lie between -3 and +3. Let’s look at an implementation of the Z-score method in Python.
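A sketch of this method, again assuming the DataFrame `df` and the cut-off of 3 described above (the original code may differ in detail):

```python
# Convert every column to Z-scores and flag |z| > 3 as outliers
z_scores = (df - df.mean()) / df.std()
outliers = z_scores.abs() > 3
print(outliers.sum())   # number of outliers per column
```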

Image by author

4. ‘Distance from the mean’ method (Multivariate method)

Unlike the previous methods, this method considers multiple variables of a data set jointly to detect outliers. It calculates the Euclidean distance of each data point from the mean of the data, converts the distances into z-scores, and flags any point whose absolute z-score exceeds a pre-specified cut-off as an outlier. We’ll use two variables of the wine dataset (‘malic_acid’ and ‘magnesium’) to implement this method in Python with a cut-off of 3.
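A sketch of this multivariate check, assuming the DataFrame `df` and NumPy (an illustrative implementation rather than the author's exact code):

```python
import numpy as np

X = df[["malic_acid", "magnesium"]].to_numpy()

# Euclidean distance of each point from the mean of the two variables
distances = np.linalg.norm(X - X.mean(axis=0), axis=1)

# Convert the distances to z-scores and flag |z| > 3 as outliers
dist_z = (distances - distances.mean()) / distances.std()
outlier_rows = df.loc[np.abs(dist_z) > 3, ["malic_acid", "magnesium"]]
print(outlier_rows)
```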

Image by author

These are a few commonly used outlier detection methods in Machine Learning. The presence of outliers may cause problems during model fitting (especially for linear models) and may inflate error metrics that give higher weight to large errors (e.g., mean squared error, RMSE). Hence, it is necessary to treat outliers before building a machine learning model.

Learn more about my work at https://ksvmuralidhar.in/

