How to Handle Outliers, Anomalies, and Skews
Data science is about finding patterns in data and extracting meaningful insights from them. As any practitioner knows, however, data loves throwing us the occasional curveball: a weird spike, an unexpected dip, or (gasp!) an oddly shaped cluster.
This week, we turn our attention to those jarring moments when things (and our graphs) turn out to be less smooth than we’d hoped. Our selection of highlights covers different approaches to tackling irregularity and coming to terms with the unpredictable.
- Finding outliers, the right way. As Hennie de Harder observes, “In most projects the data will contain multiple dimensions, and this makes it hard to spot the outliers by eye.” Rather than rely on our fallible powers of observation, Hennie shows how to leverage Cook’s distance, DBSCAN, and Isolation Forest to identify data points that require extra scrutiny (for a quick taste of the latter two, see the first sketch after this list).
- How to minimize the potential dangers of skewed data. Bias has been a charged buzzword among data and ML professionals in the past few years. Adam Brownell invites us to think about bias as “a skew that produces a type of harm,” and walks us through three strategies to measure it effectively in the context of natural language processing models.
- Adversarial training to the rescue? Anomaly detection is particularly hard in computer vision, where small data volumes and an often-limited variety of images make model training a challenge. Eugenia Anello’s helpful explainer walks us through a novel approach, GANomaly, which leverages the power of generative adversarial networks to address the shortcomings of previous methods.
- Keeping linear regressions outlier-proof. For a hands-on demonstration of robust linear algorithms and how you can use them to handle outliers lurking within your data, check out Eryk Lewinson’s recent tutorial. It covers Huber regression, random sample consensus (RANSAC) regression, and Theil-Sen regression, and benchmarks their performance on the same dataset (the second sketch below shows how these estimators line up in scikit-learn).
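If you want to experiment before diving into Hennie’s article, here is a minimal sketch (not Hennie’s code) of two of the three methods, using scikit-learn’s DBSCAN and IsolationForest on synthetic data; Cook’s distance is omitted since it presumes a fitted regression. The dataset, contamination rate, and eps value are illustrative assumptions, not values from the article.

```python
# Minimal sketch: flagging multi-dimensional outliers with scikit-learn.
# The synthetic data and parameter values below are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(0, 1, size=(500, 5))  # bulk of the data: 500 points in 5D
X[:5] += 8                           # five injected outliers, shifted far away

# Isolation Forest isolates anomalies with random splits;
# fit_predict returns -1 for outliers and 1 for inliers.
iso = IsolationForest(contamination=0.01, random_state=42)
iso_flags = iso.fit_predict(X) == -1

# DBSCAN clusters by density; points assigned to no cluster
# get the label -1 and can be treated as outliers.
db = DBSCAN(eps=2.0, min_samples=5).fit(X)
db_flags = db.labels_ == -1

print(f"Isolation Forest flagged {iso_flags.sum()} points")
print(f"DBSCAN flagged {db_flags.sum()} points")
```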
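And as a companion to Eryk’s benchmark (again, a sketch under our own assumptions rather than his actual code), the snippet below fits scikit-learn’s HuberRegressor, RANSACRegressor, and TheilSenRegressor on synthetic data with injected outliers and compares the recovered coefficients against a plain OLS baseline. With a true slope of 2 and intercept of 1, the robust estimators should stay close to those values while OLS gets pulled toward the outliers.

```python
# Minimal sketch: comparing robust regressors on data with injected outliers.
# Synthetic data and default hyperparameters are assumptions for illustration.
import numpy as np
from sklearn.linear_model import (
    HuberRegressor, LinearRegression, RANSACRegressor, TheilSenRegressor,
)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.3, 200)  # true slope 2, intercept 1
y[:10] += 20  # inject a handful of vertical outliers

models = {
    "OLS (baseline)": LinearRegression(),
    "Huber": HuberRegressor(),
    "RANSAC": RANSACRegressor(random_state=0),
    "Theil-Sen": TheilSenRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    # RANSAC wraps a base estimator, so its coefficients live one level down.
    est = model.estimator_ if name == "RANSAC" else model
    print(f"{name:>14}: slope={est.coef_[0]:.2f}, intercept={est.intercept_:.2f}")
```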
For top-notch articles on other topics, we’ve collected some of our recent favorites below. Have no fear: there are no outliers here, only consistently enlightening discussions.
- Ari Joury, PhD’s new article is already making a splash, arguing that before spending inordinate amounts of time finding the right algorithm, you should make sure you know what problem you’re actually solving.
- Picking the fastest-moving checkout line at the supermarket is an age-old conundrum, but LeAnne Chan is here to help with a game theory-informed analysis.
- If you’d like to use your dataset to train a deep learning model on Hugging Face, you’re in luck: Dr. Varshita Sher’s latest tutorial shows how you can port your data over in a smooth, painless process.
- The booming field of synthetic-data generation hasn’t paid much attention to tabular data; Javier Marin’s deep dive covers a new, open-source project that aims to rectify this problem.
- Time to find your headphones: we were thrilled to share a new episode of the TDS Podcast, featuring Jeremie Harris and analyst Ryan Fedasiuk; their conversation revolved around the potential for the U.S. and China to collaborate on AI safety.
We love sharing great data science work with you, and your support, including your Medium membership, makes it possible. Thank you!
Until the next Variable,
TDS Editors