5 Outlier Detection Techniques that every “Data Enthusiast” Must Know

Outlier Detection Methods (Visuals and Code)

Prakhar Mishra
Towards Data Science
8 min read · Jun 12, 2021

Modified Image from Source

Outliers are observations that differ strongly (that is, have markedly different properties) from the other data points in a sample of a population. In this blog, we will go through 5 outlier detection techniques that every “Data Enthusiast” must know. But before that, let’s look at where outliers come from.

What are the possible sources of outliers in a dataset?

Outliers can enter a dataset for several reasons: human error (wrong data entry), measurement error (system or tool faults), data manipulation error (faulty preprocessing), sampling error (building samples from heterogeneous sources), and so on. Detecting and treating these outliers is important for learning a robust and generalizable machine learning system.

The Z-score (also called the standard score) is an important concept in statistics that indicates how far a given point is from the mean, measured in standard deviations. Applying the Z-transformation shifts and rescales the distribution so that it has zero mean and unit standard deviation. For example, a Z-score of 2 means the data point is 2 standard deviations away from the mean.
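To make the idea concrete, here is a minimal sketch of Z-score based flagging with NumPy. The function name `zscore_outliers`, the synthetic data, and the cutoff of 3 are assumptions chosen for illustration (3 is a common convention, not a rule), so treat this as one possible way to do it rather than the definitive implementation.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Return (z_scores, mask), where mask marks points with |z| > threshold.

    `threshold=3.0` is a conventional default, assumed here for illustration.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        # All points are identical, so nothing can be an outlier.
        z_scores = np.zeros_like(values)
    else:
        z_scores = (values - mean) / std
    return z_scores, np.abs(z_scores) > threshold


# Synthetic example: 200 roughly normal points plus one injected extreme value.
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50.0, scale=5.0, size=200), 95.0)

z_scores, mask = zscore_outliers(data)
print("Flagged values:", data[mask])
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so on small samples a single large outlier can inflate the standard deviation enough to hide itself. The cutoff is usually worth tuning to the size and shape of the data.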

Responses (3)

Very informative 👍

--

I appreciate your efforts in sharing the knowledge, brother.
I have a small query: which works better, a manual technique like IQR or autoencoders?

--

Doesn't a conclusion using the z-score method depend on the size of the data set? I'd expect a point to have z ≥ 3 for data sets having around 1000 points.

--