
Thresholding Outlier Detection Scores with PyThresh

Methods that remove the need for a contamination level in outlier detection

Daniel Kulik
Towards Data Science
6 min read · Jun 9, 2022


Real life is often chaotic and unpredictable. It seems to like throwing the metaphorical “spanner in the works”, making data appear baffling and random. Most recorded or extracted data requires some form of cleaning before further methods, such as modeling, can be applied. However, it is often difficult or near impossible to visually distinguish which data points are true, noise, or anomalies. Sometimes anomalies may even be necessary, further complicating the decision of which data to use. That is where the techniques of outlier detection come in.

Outlier detection methods come in many shapes and forms. But the crux of their ability to detect outliers generally lies in a set of statistical conditions. Some are rigorous, others simple, yet all of these methods are valuable additions to a data scientist’s toolkit when hunting for data that stands out from the rest.

PyOD is one of many Python libraries that contain a useful collection of methods for outlier detection. It is simple to use and offers far more outlier detection methods than most other libraries, which is why it is such a favorite amongst data scientists. However, a pesky yet necessary parameter exists for most of its methods… the contamination level. Most methods return an outlier confidence score when applied to a dataset. This score is useful in its own right but, much as regression output differs from classification output, it lacks labels. Therefore, a contamination level is required to set the boundary separating inliers from outliers by thresholding the confidence scores.

In PyOD, and many other libraries, the contamination level is often set before outlier detection. However, this input parameter raises the question: how does one know how contaminated a dataset is before testing for outliers? And that is a great question! The answer usually means that some form of statistical test must be done prior to outlier detection to estimate the contamination level. Popular methods include using the z-score or the interquartile range (IQR). While these give a good approximation of the correct contamination level, they are not free of their own disadvantages. A better approach is to apply statistical tests to the outlier detection scores themselves in order to threshold inliers from outliers.
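For context, here is a minimal sketch of what that traditional prior-estimate step looks like with the IQR rule (the 1.5 × IQR fence is the conventional rule of thumb; the function name and the synthetic data are purely illustrative):

```python
import numpy as np

def iqr_contamination(x):
    """Rough contamination estimate for one variable using the IQR rule.
    Points beyond 1.5 * IQR outside the quartiles are flagged as outliers."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    return mask.mean()

# Synthetic example: 950 inlier points plus 50 shifted outliers
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 950), rng.normal(8, 1, 50)])
print(f"Estimated contamination: {iqr_contamination(x):.3f}")  # roughly 0.05
```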

The following examples will use the PyThresh library, a library for thresholding outlier detection scores that I am affiliated with and actively developing. PyThresh consists of a collection of statistical methods that attempt to threshold outlier detection scores without the need to set a contamination level prior to fitting the data. These statistical methods range from classical thresholding, such as k-means clustering and probability distances, to more obscure methods that involve topology and graph theory. All in all, they attempt to take the guessing of contamination levels (educated as those guesses may be) out of the equation when it comes to outlier detection.

Let’s take a look at how thresholding can be implemented on unsupervised outlier detection scores. For the example below, the open-source cardio dataset from Outlier Detection DataSets (ODDS) will be used. This dataset was made freely available by Stony Brook University and represents real-world data [1].

The cardio (Cardiotocography) dataset [2][3] consists of measurements of fetal heart rate and uterine contraction features on cardiotocograms that were classified by expert obstetricians. The measurements were classed as either being normal (inliers) or pathologic (outliers).

To start, let’s load the cardio dataset and standardize the variables that will be scored by the outlier detection method.
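A minimal sketch of that step (this assumes cardio.mat has already been downloaded from the ODDS page into the working directory, and that the .mat file stores the features under the key 'X' and the labels under 'y', as the ODDS files do):

```python
import numpy as np
from scipy.io import loadmat
from sklearn.preprocessing import StandardScaler

# Assumes cardio.mat was downloaded beforehand from the ODDS website
mat = loadmat('cardio.mat')
X = mat['X']                          # (1831, 21) explanatory variables
y = mat['y'].ravel().astype(int)      # 0 = inlier (normal), 1 = outlier (pathologic)

# Standardize each variable to zero mean and unit variance
X_norm = StandardScaler().fit_transform(X)

print(X_norm.shape, np.bincount(y))
```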

If we take a quick look at the data, we can see that there are 21 explanatory variables to fit, each with 1831 entries. Of those entries, 1655 (~90%) are inliers and 176 (~10%) are outliers, as classed in the response variable.

In order to correctly identify the two classes, it is important to select the correct outlier detection and outlier thresholding methods. For this dataset, the Principal Component Analysis (PCA) unsupervised outlier detection method was selected due to the number of explanatory variables. This method reduces the dimensionality of the data by constructing a subspace from the eigenvectors that represent the highest variance while still explaining the data. Outliers can become more apparent during this dimensionality reduction, and the outlier scores are the sum of a sample’s projected distances on all eigenvectors. For this dataset, we will set the number of components to reduce from 21 to 5.
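With PyOD, that amounts to instantiating its PCA detector with five components, roughly as follows (the contamination argument is deliberately left at its default, since thresholding will be handled separately):

```python
from pyod.models.pca import PCA

# PCA outlier detector: project the 21 standardized variables onto
# 5 principal components; a sample's outlier score is the sum of its
# projected distances on the eigenvectors
clf = PCA(n_components=5)
```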

Synergy between the outlier detection and thresholding methods often provides better results. This is because the scores generated by the outlier detection method are then handled by the thresholding method. The more similarly the two methods behave, the less chance there is of an irrational threshold occurring. However, this is not always the case, and selecting the correct outlier detection and thresholding methods may require robust statistical tests before and after application.

For this dataset, a good thresholding method to pair with the outlier detection method is the Distance Shift from Normal (DSN) threshold. This threshold compares the probability distribution of the outlier scores with a normal distribution and measures the difference between the two. When working with statistical distances, it is important to select the metric with which measurements will be made. The Bhattacharyya distance measures the similarity between two probability distributions, returning the amount of overlap that exists between them. Technically it is not a metric; however, it will be used here for thresholding the dataset. The Bhattacharyya distance is defined as,
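$$ D_B(p, q) = -\ln\big(BC(p, q)\big) $$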

where p and q represent the two probability distributions over the same domain X, and the Bhattacharyya coefficient (BC) for continuous probability distributions can be expressed by,
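$$ BC(p, q) = \int_X \sqrt{p(x)\, q(x)}\, dx $$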

The threshold for the outlier detection scores is finally set such that any score higher than 1 minus the Bhattacharyya distance is labelled as an outlier.

With a better understanding now, we can apply these methods to the dataset:
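A sketch of that step (PyThresh’s DSN thresholder is assumed here to take the Bhattacharyya distance via metric='BHT', per the library’s documented metric options; verify the option name against your installed version):

```python
from pythresh.thresholds.dsn import DSN

# Fit the PCA detector on the standardized data and retrieve the raw scores
clf.fit(X_norm)
scores = clf.decision_scores_

# Threshold the scores with DSN using the Bhattacharyya distance
thres = DSN(metric='BHT')
labels = thres.eval(scores)  # binary array: 0 = inlier, 1 = outlier
```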

The labels returned from the thresholder consist of a binary array where inliers are represented by zeros and outliers by ones. Before revealing the final tally of how well these methods performed, let’s first look at a side-by-side plot of the true versus the predicted labels. Again we will use PCA, but this time to visualize the data by reducing it to two dimensions. This step, along with others, is essential when working with real-world data that has no response variable, as it provides a visual representation of what was thresholded. Note that PCA is an orthogonal linear transformation, so outliers with a non-linear relationship may not always appear as evident outliers visually. Reducing the data to two dimensions may also remove the dimensionality needed to distinguish inliers from outliers. Even with these disadvantages, however, PCA visualization is a powerful confirmation tool.
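One way to build that side-by-side view with scikit-learn and matplotlib (a sketch; the styling choices are arbitrary):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA as SkPCA

# Reduce the standardized data to 2 components purely for plotting
X_2d = SkPCA(n_components=2).fit_transform(X_norm)

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)
for ax, lab, title in zip(axes, [y, labels],
                          ['True labels', 'DSN predicted labels']):
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=lab, cmap='coolwarm', s=10)
    ax.set_title(title)
    ax.set_xlabel('Component 1')
axes[0].set_ylabel('Component 2')
plt.show()
```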

As we can see from the plot, the DSN threshold combined with the PCA outlier detection scores was able to separate outliers from inliers with significant accuracy! The accuracy for this example was 99%, with only two outliers classed as inliers and no inliers classed as outliers. However, depending on the dataset, the applied outlier detector, and the thresholder, the prediction accuracy will vary. Even though the contamination level has been replaced by a more statistical approach to thresholding outlier scores, at the end of the day it is up to the data scientist to make the final call on whether the predicted results appear correct.
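The tally itself can be verified with a quick confusion matrix:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Rows of the confusion matrix are the true classes (inlier, outlier),
# columns are the predicted classes
print(f"Accuracy: {accuracy_score(y, labels):.3f}")
print(confusion_matrix(y, labels))
```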

In closing, thresholding of outlier detection scores is not a new science and has many well-established implementations. These statistical and mathematical methods add to a data scientist’s toolkit and assist in navigating the wonderful world of information and data.

[1] Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.

[2] C. C. Aggarwal and S. Sathe, “Theoretical foundations and algorithms for outlier ensembles.” ACM SIGKDD Explorations Newsletter, vol. 17, no. 1, pp. 24–47, 2015.

[3] S. Sathe and C. C. Aggarwal, “LODES: Local density meets spectral outlier detection,” SIAM Conference on Data Mining (SDM), 2016.
