
Kernel Density Estimation for Anomaly Detection in Python: Part 1

Combining classic approaches with deep learning for better representations

Photo by Mulyadi on Unsplash

Fraud detection in financial datasets, rare-event detection in network traffic, visual inspection of buildings and roads, and defect detection on production lines: these are common problems where machine learning techniques for anomaly detection can help.

In short, anomaly detection is a field of research that aims to find abnormal observations in datasets.

For this problem setting, we assume that we have two types of data points: normal and abnormal. The dataset is usually highly imbalanced, because abnormal data points occur far less frequently than normal ones. In different scenarios, the word abnormal can carry different meanings. For example, we speak of a novelty when a data point differs from every type of abnormality seen before, and of an outlier when it is very rare but has a known cause [1].

Regardless of the exact definition and use case, we can group all of these problems under anomaly detection.

PCA, one-class SVM, and Kernel Density Estimation are classical machine learning techniques used to find abnormal observations. Nowadays, many deep learning techniques, such as GANs and autoencoders, are also used for anomaly detection.

A very common approach for anomaly detection in images is one-class classification combined with self-supervised learning. It is called one-class classification because we fit the model using only normal data. Roughly speaking, we force the model to learn good representations of normal data points. This helps it pick up subtle differences later, when we evaluate on a test set containing both normal and abnormal data points. These techniques have become state of the art in anomaly detection.

The first part of this article covers the classic approach to detecting abnormal data using Kernel Density Estimation; afterwards, we will dive deeper into the self-supervised techniques used to improve KDE. To make the discussion concrete, we will look at CutPaste [2], a recent paper from Google that combines a novel self-supervised technique with KDE.

Kernel Density Estimation

Kernel Density Estimation (KDE) is an unsupervised learning technique that estimates the probability density function (PDF) of a random variable in a non-parametric way. It is related to a histogram, but with data smoothing applied.

Histogram and KDE visualizations: Image source

As the example above shows, both histograms are plotted from the same data, yet merely shifting the bin edges produces a very different picture. To smooth the distribution, different kernels can be used; in the example above, Tophat and Gaussian kernels were applied. The output is a smooth density estimate. So, how do we obtain this distribution?

The generalized formula for KDE is as follows:

f(y) = (1 / (n·h)) · Σᵢ K((y − xᵢ) / h)

where K is a kernel and h is a bandwidth parameter that controls smoothness: the larger h is, the smoother the estimated distribution. y is the point at which the density is estimated, and the xᵢ are the n points of the sample dataset.

As mentioned above, K is a kernel, and there are multiple options such as Gaussian, Tophat, and Epanechnikov. For anomaly detection we will use Gaussian Density Estimation (GDE), where the log-density is computed as:

log p(y) ∝ log( (1/n) · Σᵢ exp( −‖y − xᵢ‖² / (2h²) ) )

Following the CutPaste paper [2], we calculate anomaly scores using the formula above.
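The log-density score can be sketched directly in NumPy. This is a minimal illustration, not the paper's implementation: the function name, the bandwidth value, and the toy 2-D data are made up for the example.

```python
import numpy as np

def gde_log_density(y, train_points, h=1.0):
    """Log-density of y under a Gaussian KDE fit on train_points.

    Computes log( (1/n) * sum_i exp(-||y - x_i||^2 / (2 h^2)) ),
    dropping the constant normalization term.
    """
    diffs = train_points - y                # shape (n, d)
    sq_dists = np.sum(diffs ** 2, axis=1)   # ||y - x_i||^2 for each i
    exponents = -sq_dists / (2 * h ** 2)
    # log-sum-exp trick for numerical stability
    m = exponents.max()
    return m + np.log(np.mean(np.exp(exponents - m)))

# toy example: "normal" data clustered near the origin
rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(100, 2))
print(gde_log_density(np.zeros(2), train))       # near the data: higher log-density
print(gde_log_density(np.full(2, 10.0), train))  # far from the data: much lower log-density
```

A point far from the training data gets a much lower log-density, which is exactly what we will exploit as an anomaly score.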

The following steps describe the process of anomaly detection using Gaussian Density Estimation:

Step 1: Fit a GDE to the normal data points from the training split

Step 2: Calculate anomaly scores on the test dataset

Step 3: If a sample's score falls below a predefined threshold, flag it as abnormal; otherwise, treat it as normal
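The three steps can be sketched with scikit-learn's `KernelDensity`. This is a minimal sketch on synthetic 2-D data; the bandwidth and the 1% quantile threshold are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(42)

# Step 1: fit a Gaussian KDE on normal training data only
X_train = rng.normal(0, 1, size=(200, 2))                # normal samples
gde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)

# Step 2: score the test set (score_samples returns the log-density)
X_test = np.vstack([rng.normal(0, 1, size=(10, 2)),      # normal test points
                    rng.normal(6, 1, size=(10, 2))])     # abnormal test points
scores = gde.score_samples(X_test)

# Step 3: scores below the threshold are flagged as anomalies;
# here the threshold is the 1% quantile of the training scores
threshold = np.quantile(gde.score_samples(X_train), 0.01)
is_anomaly = scores < threshold
print(is_anomaly)
```

In practice the threshold is tuned on a validation set, or avoided entirely by reporting a threshold-free metric such as ROCAUC, as described next.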

ROCAUC metric for anomaly detection

In benchmarks, ROCAUC is used to compare different models. The MVTec AD dataset is the most common benchmark dataset for anomaly detection.

To evaluate a model, the Receiver Operating Characteristic (ROC) curve is plotted and the Area Under the Curve (AUC) is computed. The curve is built from test data containing both normal and abnormal examples; a higher AUC means the model separates defects from normal samples better.

Examples of ROC curves after training the model: Image by author

To calculate the ROC curve and AUC, the sklearn package can be used:
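A minimal sketch with `roc_curve` and `roc_auc_score`; the labels and log-density values below are made up for the example. Note that the log-density is negated so that a higher score means "more anomalous", which is the convention these functions expect for the positive (anomaly) class.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# hypothetical test labels (1 = anomaly) and GDE log-densities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
log_density = np.array([-1.0, -1.5, -0.8, -2.0, -7.5, -6.0, -1.2, -3.2])
anomaly_score = -log_density  # higher score = more anomalous

fpr, tpr, thresholds = roc_curve(y_true, anomaly_score)
auc = roc_auc_score(y_true, anomaly_score)
print(f"ROCAUC: {auc:.3f}")  # prints "ROCAUC: 0.875"
```

The `fpr` and `tpr` arrays can be passed straight to a plotting library to draw the ROC curve shown above.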

Self-supervised setting

We saw how to implement GDE for anomaly detection. However, it is very hard to get good results on raw images; with good representations extracted from an image, GDE performs much better. To obtain them, in the second part of this article we will build a self-supervised model with pretext tasks that improve the results of GDE.

References

[1] Ruff, Lukas & Kauffmann, Jacob & Vandermeulen, Robert & Montavon, Gregoire & Samek, Wojciech & Kloft, Marius & Dietterich, Thomas & Müller, Klaus-Robert. (2021). A Unifying Review of Deep and Shallow Anomaly Detection. Proceedings of the IEEE. PP. 1–40. 10.1109/JPROC.2021.3052449.

[2] Li, Chun-Liang & Sohn, Kihyuk & Yoon, Jinsung & Pfister, Tomas. (2021). CutPaste: Self-Supervised Learning for Anomaly Detection and Localization.

The code is available on GitHub.

