Anomaly Detection with PyOD

PyOD is a Python library that gathers a wide range of algorithms to make outlier detection more accessible and comprehensive

Samantha Nasti
Towards Data Science

Photo by Rupert Britton via Unsplash

Last week, my cohort in Flatiron School's Data Science Full-Time Immersive Program and I entered the Scientific Computing and Quantitative Methods phase of our bootcamp journey, and we embraced the challenge of absorbing a great deal of statistics in a short period of time. Though it was certainly not our easiest week, the usefulness of statistics and machine learning became more apparent with each demonstration, calculation, and visualization.

One thing in particular stood out to me: the importance of a simple three-variable calculation, the z-score. For those unfamiliar, the z-score is calculated as follows:

z = (x-μ)/σ

The z-score tells you how far a data point lies from the population mean, measured in units of standard deviations. In this calculation, x is an observed data point, μ is the population mean, and σ is the population standard deviation. A z-score of 2, for example, means the point sits two standard deviations above the mean. Pretty simple, right? To the untrained eye, it’s just a number.

But I was surprised to learn how many applications it has in statistical testing. It turns out the z-score is incredibly useful in data analysis and probability calculations. Furthermore, it is a critical component of several techniques for detecting anomalies in data sets.
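To make the formula concrete, here is a minimal sketch of z-score based outlier flagging using only the Python standard library. The sample readings, the function names, and the cutoff of 2.5 standard deviations are all assumptions chosen for illustration; cutoffs of 2 or 3 are also common, and the right value depends on your data.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the z-score of every value, using the population mean and std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("all values are identical, so z-scores are undefined")
    return [(x - mu) / sigma for x in values]


def flag_outliers(values, threshold=2.5):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Made-up sample: nine ordinary readings and one suspicious one.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 40]
print(flag_outliers(readings))
```

One caveat with this approach: in a small sample, a single extreme value inflates the standard deviation it is measured against, so its z-score can never get very large. That is part of why more robust techniques exist.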

Detecting Anomalies in Data

There are many techniques for detecting anomalies in data sets, largely because data comes in so many shapes and distributions that no single approach fits every case. We have learned a few so far, such as:

  • Using data visualization tools like Matplotlib and Seaborn to plot data sets and spot outliers in the spread of the data
  • Using Pandas to clean data and generate descriptive statistics
  • Using hypothesis testing strategies such as z-tests and p-values, based on manually calculated descriptive statistics

We learned these strategies, among others, in just a matter of weeks, which made me wonder: in the largest, messiest data sets, in a world of hundreds (if not more) of probability distribution types, these can’t be the only methods to detect outliers. And if there are many other ways to detect anomalies in data, is there a simple way to select a technique and apply it to your data? After all, outlier detection holds great importance not only in data science but in most other mathematical or computational disciplines. That question led me to stumble upon a fascinating Python library: PyOD!

PyOD — Easier Anomaly Tests

Would you look at that! There exists a comprehensive and scalable Python toolkit for detecting outliers/anomalies (note: these two terms are commonly used interchangeably) called PyOD. Py, O, D. Can you guess what the name stands for?! (Hint: it may have something to do with “Python” and “Outlier Detection”).

The PyOD full documentation can be found here.

I was immediately impressed by the third sentence in the PyOD documentation: “PyOD includes more than 30 detection algorithms.” Welp, that answers my question from before: clearly there are more than just a few techniques to detect outliers in different types of data sets. Many of them would be incredibly time-consuming and complex to implement without the PyOD toolkit!

The Why

Outlier detection packages exist in many other programming languages, such as Java and R, but the team behind PyOD recognized the lack of a dedicated outlier detection toolkit in Python. Single-algorithm Python tools, like PyNomaly, were limited to just one algorithm. Meanwhile, multipurpose frameworks such as the widely known scikit-learn did not cater specifically to outlier detection.

So the developers of PyOD created a comprehensive library to fill that gap: it caters specifically to scalable outlier detection, with the option to select an approach based on the type(s) of data being assessed.

How It Works

PyOD works with both Python 2 and 3, and relies on the widely used libraries NumPy, SciPy, and scikit-learn for its core functionality. Other dependencies (like Pandas, Matplotlib, or TensorFlow) are optional and only needed for certain additional features, such as deep learning models and running benchmarks.

PyOD provides access to its collection of outlier detection algorithms through a set of easy-to-use, unified APIs. Each algorithm comes with detailed documentation and accompanying examples that show how to start using it. Here is a very simple example I pulled from the PyOD documentation that demonstrates outlier detection with just 5 lines of code:

Outlier detection in just 5 lines of code using PyOD

In the above example, COPOD (Copula-Based Outlier Detection) was selected as the anomaly detection algorithm. As you can see, PyOD makes it easy to select one of the 30+ algorithms in its library with a simple import and apply a complex probabilistic method in seconds. Fortunately, these algorithms are accompanied by detailed step-by-step examples in the documentation that walk you through sample applications.

More complex examples of PyOD in action can be found in the documentation as well. For instance, it includes a notebook that applies 12 different outlier detection models to the same data set, allowing for side-by-side comparisons that would otherwise not be easily accessible in Python. Take a look at the accompanying visualization of the results below. This clear, organized visualization comes straight from that PyOD example.

via PyOD Documentation

Changing the Game

It’s no surprise that PyOD is well known and widely acknowledged in the machine learning community. It makes critical statistical methods easier to access and more versatile to apply. The documentation is easily accessible on GitHub, and the examples make complex machine learning techniques slightly less intimidating to curious students like me. I can’t wait to continue to play with the PyOD toolkit and explore the endless possibilities!
