How to Handle Outliers, Anomalies, and Skews

TDS Editors
Towards Data Science
3 min read · Sep 15, 2022

Data science is about finding patterns and extracting meaningful insights from their analysis. As any practitioner knows, however, data loves throwing us the occasional curveball: a weird spike, an unexpected dip, or (gasp!) an oddly shaped cluster.

This week, we turn our attention to those jarring moments when things (and our graphs) turn out to be less smooth than we’d hoped. Our selection of highlights covers different approaches to tackling irregularity and coming to terms with the unpredictable.

  • Finding outliers, the right way. As Hennie de Harder observes, “In most projects the data will contain multiple dimensions, and this makes it hard to spot the outliers by eye.” Rather than rely on our fallible powers of observation, Hennie shows how to leverage Cook’s distance, DBSCAN, and Isolation Forest for identifying data points that require extra scrutiny.
  • How to minimize the potential dangers of skewed data. Bias has been a charged buzzword among data and ML professionals in the past few years. Adam Brownell invites us to think about bias as “a skew that produces a type of harm,” and walks us through three strategies to measure it effectively in the context of natural language processing models.
Photo by Jennifer Boyle on Unsplash
  • Adversarial training to the rescue? Anomaly detection is particularly hard in computer vision, where small data volumes and an often-limited variety of images make model training a challenge. Eugenia Anello’s helpful explainer walks us through a novel approach, GANomaly, which leverages the power of generative adversarial networks to address the shortcomings of previous methods.
  • Keeping linear regressions outlier-proof. For a hands-on demonstration of robust linear algorithms and how you can use them to handle outliers lurking within your data, check out Eryk Lewinson’s recent tutorial. It covers Huber regression, random sample consensus (RANSAC) regression, and Theil-Sen regression, and benchmarks their performance on the same dataset.
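
To make the idea of outlier-robust regression a little more concrete, here is a minimal pure-Python sketch comparing ordinary least squares with a one-feature Theil-Sen estimator (the median of all pairwise slopes). It is not taken from any of the articles above: the function names `ols_fit` and `theil_sen_fit` and the toy data are hypothetical, chosen only for illustration. For real projects, scikit-learn ships `HuberRegressor`, `RANSACRegressor`, and `TheilSenRegressor` in `sklearn.linear_model`.

```python
from itertools import combinations
from statistics import mean, median


def ols_fit(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def theil_sen_fit(x, y):
    """Theil-Sen for one feature: the median of all pairwise slopes."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]  # skip pairs with identical x to avoid dividing by zero
    ]
    slope = median(slopes)
    # Intercept: median of the residuals left after removing the fitted slope.
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Hypothetical toy data: y = 2x + 1, with the last point deliberately corrupted.
x = list(range(10))
y = [2 * xi + 1 for xi in x]
y[9] = 100  # a single outlier

ols_slope, ols_intercept = ols_fit(x, y)
ts_slope, ts_intercept = theil_sen_fit(x, y)
print(f"OLS:       slope={ols_slope:.2f}, intercept={ols_intercept:.2f}")
print(f"Theil-Sen: slope={ts_slope:.2f}, intercept={ts_intercept:.2f}")
```

On this toy data, the single corrupted point drags the least-squares slope well above the true value of 2, while the median of pairwise slopes stays at 2. Note that this naive version examines every pair of points, so its cost grows quadratically with the sample size.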

For top-notch articles on other topics, we’ve collected some of our recent favorites below. Have no fear: there are no outliers here, only consistently enlightening discussions.

We love sharing great data science work with you, and your support, including your Medium membership, makes it possible. Thank you!

Until the next Variable,

TDS Editors
