
Outliers: Keep Or Drop?

A guide to dealing with extreme values

Photo by Crispin Jones on Unsplash

Although outliers are common in datasets, many practitioners are unsure how to handle such observations.

It is often tempting to remove them, since they can skew the results of data analyses and hamper model performance. That being said, removing outliers from a dataset indiscriminately can also have a detrimental effect on a project.

Whether or not extreme values in a dataset should be removed depends on the circumstances.

Here, we discuss the importance of examining outliers prior to removing them and explore ways to identify them with statistical techniques.

Are Outliers Bad?

Outliers are often portrayed as inaccurate or inappropriate information that must be removed to obtain clean data.

However, in reality, keeping these data points can have its merits.

Outliers can be legitimate anomalies that are vital for capturing information on the subject of interest. While their inclusion may influence results significantly, they can be accounted for with certain statistical or data modeling techniques.

Outliers can also stem from human error. Errors in areas like data collection and data entry can lead to inappropriate values being included in the procured data (e.g., assigning negative values to a feature that should only contain positive values). These types of data points need to be removed in order to build robust models or conduct accurate analyses.

Ultimately, outliers are data regardless of where they come from. Thus, the decision to remove data should always be backed with sufficient evidence.

To justify the removal of outliers, these data points first need to be identified. Upon identifying them, users can then exercise judgment and drop any values based on their criteria.

Identifying Outliers

There are many approaches for identifying outliers in a collection of data. Let’s cover some of the more popular ones and implement them in Python.

To demonstrate these approaches, we’ll use a toy dataset from the Scikit-learn module. It doesn’t contain any extreme values, so one will be added manually.

1. Z-score

Data points can be identified as outliers based on their z-score: a measure of how many standard deviations a data point lies from the mean.

z = (x − μ) / σ

where μ is the mean and σ is the standard deviation of the feature.

Records whose z-scores exceed a certain threshold in magnitude (typically 3) are deemed outliers.

We can compute the z-scores of all data points and identify the ones with a magnitude greater than 3.

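Below is a minimal sketch of this check in Python. The specific dataset and feature (Scikit-learn’s diabetes data and its bmi column) and the injected extreme value are illustrative assumptions, not necessarily the choices made in the original:

```python
# A minimal sketch; the diabetes dataset, the "bmi" feature, and the
# injected value 5.0 are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes(as_frame=True).frame
values = np.append(data["bmi"].to_numpy(), 5.0)  # manually add an extreme value

# z-score: distance from the mean in units of standard deviation
z_scores = (values - values.mean()) / values.std()
print(values[np.abs(z_scores) > 3])  # flag points with |z| > 3
```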

2. Interquartile Range (IQR)

Another solution is the IQR method. It entails using the first quartile (Q1), the third quartile (Q3), and the IQR to define a lower bound and an upper bound for the data points.

Lower bound = Q1 − 1.5 × IQR
Upper bound = Q3 + 1.5 × IQR

where IQR = Q3 − Q1.

Observations that lie outside these bounds can be classified as outliers.

Here, we find the outlier in the dataset using the lower bound and upper bound as references.

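Continuing with the values array from the z-score sketch above, the bounds and the outlier check might look like this:

```python
# Continues from the z-score sketch above (reuses the `values` array).
import numpy as np

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag points outside [lower_bound, upper_bound]
print(values[(values < lower_bound) | (values > upper_bound)])
```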

Instead of manually deriving the lower and upper bounds, users can also find outliers with a box plot. This visualization tool flags outliers using those same bounds.

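With Matplotlib, a box plot of the same array takes a single call; its whiskers extend to 1.5 × IQR by default, so it draws the same outliers as dots:

```python
# Continues from the sketches above (reuses the `values` array).
import matplotlib.pyplot as plt

plt.boxplot(values)  # whiskers span 1.5 * IQR beyond Q1 and Q3 by default
plt.show()
```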

The dots in box plots represent the outliers. Although the box plot offers a simpler way to detect outliers, it doesn’t allow users to examine each of these data points individually.

Based on the findings of outlier detection, users can decide if there is sufficient evidence to support the exclusion of these data points and keep or drop them accordingly.

Additional Factors to Consider

There are other factors to consider when deciding whether to keep or drop outliers.

1. Domain Knowledge

Statistical tools like the z-score and IQR can be used to define thresholds that separate outliers from the other data points. However, users can also leverage their domain knowledge of the subject when defining the criteria for outliers.

2. Modeling Techniques

In cases where outliers are legitimate anomalies, there may not be sufficient justification to drop them. Instead of removing these data points, one can utilize techniques that better account for them.

For instance, linear models like linear regression and logistic regression tend to be easily influenced by outliers. In such cases, one can utilize non-linear models (e.g., tree-based models) that are more resistant to extreme values.
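As a purely illustrative sketch on synthetic data (none of this comes from the original article), here is how a single extreme value can sway a linear fit more than a tree-based one:

```python
# A purely illustrative sketch on synthetic data: one extreme target
# value tends to distort the linear fit more than the tree's
# piecewise-constant fit.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + rng.normal(0, 0.5, size=100)
y[0] = 200.0  # inject one extreme target value

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# The underlying relationship is y ≈ 2x; compare predictions at x = 5
print(linear.predict([[5.0]]), tree.predict([[5.0]]))
```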

3. Amount of Data

Unfortunately, in the real world, data can be scarce. Even the most meticulously planned project can be hindered by a lack of data. Removing data in such a case would be unfavorable and may render any subsequent analyses invalid.

The removal of data points in smaller datasets should always be justified.

Key Takeaways

Photo by Prateek Katyal on Unsplash

In general, it is always a good idea to familiarize oneself with the data before making any important decisions with regard to preprocessing or modeling.

While removing extreme values can be conducive to the success of a project, the decision to remove any data needs to be backed with sufficient evidence.

Neglecting outliers or dropping them indiscriminately are equally bad habits that can hamper the results of a project.

To account for these types of data points, it is best to be thorough by first identifying the outliers and then ascertaining their nature before deciding how to deal with them.

I wish you the best of luck in your Data Science endeavors!

