Outlier Detection Using Random Forest Regressors: Leveraging Algorithm Strengths to Your Advantage

Using a model’s robustness to outliers to detect them

Michael Zakhary
Towards Data Science

Photo by Will Myers on Unsplash

Problem Statement

Outlier detection can be tricky, especially when the ground truth, or the definition of what counts as an outlier, is ambiguous or depends on multiple factors. Mathematically speaking, an outlier is often defined as a data point more than three standard deviations away from the mean. However, in most real-life problems, not all data points far from the mean are equally significant; sometimes we need a bit more nuance when flagging outliers.

Image by Rohanukhade

Let's take a quick example:

We have a dataset of water consumption per household. By analyzing the water consumption as a whole and isolating the points more than three standard deviations from the mean, we can quickly identify the outlier households that use the most water.
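To make the three-standard-deviation rule concrete, here is a minimal sketch in Python. The consumption figures are synthetic and purely illustrative (they are not from any real dataset), and the function name `three_sigma_outliers` is a hypothetical helper introduced only for this example.

```python
import random
import statistics


def three_sigma_outliers(values, n_sigmas=3.0):
    """Return the values lying more than n_sigmas standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) > n_sigmas * std]


# Synthetic data (an assumption for illustration): 200 "typical" households
# plus two households with very high consumption.
rng = random.Random(0)
consumption = [rng.gauss(300.0, 40.0) for _ in range(200)]
consumption += [900.0, 1000.0]

flagged = three_sigma_outliers(consumption)
print(flagged)
```

Note that the rule only looks at the consumption values themselves: it flags the two extreme households but carries no information about why their consumption is high.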

This, however, fails to take into account the reason behind the increase in consumption: water consumption can be high for multiple reasons, and some reasons are of more interest…
