Photo by Will Myers on Unsplash

Hands-On Unsupervised Outlier Detection Using Machine Learning, with Python

Here’s how to use a simple Machine Learning algorithm to detect outliers in an unlabeled dataset

Towards Data Science
4 min read · Feb 7, 2022


One of the most informative and powerful techniques in Data Science is Outlier Detection.

What makes this technique so sexy is that the definition of “outlier” is really generic. In other words, we can define an arbitrary subset of our dataset as an “outlier set” and try to identify it by using the other features that we have.

Of course, a very simple way to detect outliers is to plot the distribution and extract the values that are far away from its median. The problem with this approach is that it assumes your data have a Gaussian-like distribution, which is not always true.
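To make this simple approach concrete, here is a minimal sketch. It is not taken from the original article: it assumes a small made-up sample and uses a robust score based on the median absolute deviation (MAD), which is one common way to measure “far from the median”. The 3.5 cut-off is a conventional choice, not a rule.

```python
import numpy as np

# Made-up sample: seven ordinary readings and one extreme value (assumption).
values = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 50.0])

median = np.median(values)
# Median absolute deviation (MAD): a robust estimate of the spread.
mad = np.median(np.abs(values - median))

# 0.6745 rescales the MAD so that the score is comparable to a z-score
# when the data are roughly Gaussian; 3.5 is a commonly used cut-off.
robust_z = 0.6745 * (values - median) / mad
is_outlier = np.abs(robust_z) > 3.5

print(values[is_outlier])
```

Note that the rescaling constant itself relies on the Gaussian assumption, which is exactly the limitation described above.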

Another way is to use a classification approach and apply a binary classification algorithm to your data (outlier/non-outlier). Nonetheless, this approach requires labelled data.

So what can we use? We can use Gaussian Process Regression. This model gives you a regression technique that outputs a predicted mean together with boundaries around it, which express the uncertainty of the prediction at a chosen probability level.

Let’s dive in:

1. Gaussian Process Regression

As I said, the assumption of Gaussian Process Regression is that your data are nothing but a random realization of a Gaussian Process, whose mean and variance are given by a certain mean function and a certain variance function.

Image made by using this

Both the mean and the variance function depend on the kernel function that is used.

The most commonly used one is the Radial Basis Function (RBF) kernel:

Image made by me

Here l is the length scale, which sets how quickly the covariance between two points decays as they move apart.
We then add a White Noise kernel, which accounts for the fact that your measurements can be noisy as well.
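As a rough sketch of how this kernel can be written with scikit-learn, the snippet below builds an RBF kernel, checks it against the formula exp(-d² / (2 l²)), and then adds a WhiteKernel on top. The length scale, the noise level and the three sample points are arbitrary values picked for illustration, not the ones used in the original article.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

length_scale = 2.0  # arbitrary illustrative value (assumption)
X = np.array([[0.0], [1.0], [3.0]])  # three made-up time points

# Covariance matrix produced by scikit-learn's RBF kernel.
rbf = RBF(length_scale=length_scale)
K = rbf(X)

# The same matrix written out by hand: exp(-d^2 / (2 * l^2)).
squared_distances = (X - X.T) ** 2
K_manual = np.exp(-squared_distances / (2 * length_scale**2))

# Adding a WhiteKernel puts the noise level on the diagonal only.
kernel = RBF(length_scale=length_scale) + WhiteKernel(noise_level=0.5)
K_noisy = kernel(X)

print(np.round(K, 3))
print(np.round(K_noisy, 3))
```

A larger length scale makes distant points more strongly correlated, so the fitted curve becomes smoother.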

2. The Code

2.1 The Libraries

It’s a hands-on article, so let’s start writing the code.
This is what you need:

It is basically the GPR part of scikit-learn, plus matplotlib, numpy and pandas.
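The original import cell is not reproduced here, so the following is a plausible reconstruction based on the libraries listed above. The choice of the RBF and WhiteKernel classes follows the kernel described in the previous section.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# The GPR model and the two kernels discussed above.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
```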

2.2 The Dataset

The dataset that I used is a Time Series that I found here.
The dataset will essentially be used for a regression task, but the real value of the method comes from the boundaries that we are able to predict.

Let’s import it:

And show some rows:
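The original file is not reproduced here, so this sketch builds a synthetic stand-in instead: a noisy, non-uniformly sampled daily series. The “Value” column name comes from the article; the “Date” column name, the date range and the shape of the signal are assumptions. With the real file you would load it with pandas instead of generating it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the time series (assumption, not the real data).
n_days = 200
all_dates = pd.date_range("2020-01-01", periods=n_days, freq="D")

# Keep only 150 of the 200 days so the sampling is not uniform.
keep = np.sort(rng.choice(n_days, size=150, replace=False))

trend = 100 + 10 * np.sin(np.arange(n_days) / 20)
noise = rng.normal(0.0, 1.0, size=n_days)

data = pd.DataFrame({"Date": all_dates[keep], "Value": (trend + noise)[keep]})

# Show some rows.
print(data.head())
```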

As our purpose is just to show how the method works, let’s use a slightly smaller amount of data and limit ourselves to 80% of the dataset:
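A minimal sketch of this step, assuming that “80% of the dataset” means the first 80% of the rows in chronological order (the article does not say which 80% is kept). The tiny DataFrame is a made-up stand-in for the real data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real time series (assumption).
data = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "Value": np.arange(10, dtype=float),
})

# Keep the first 80% of the rows, in chronological order.
n_keep = int(0.8 * len(data))
data = data.iloc[:n_keep].reset_index(drop=True)

print(len(data))
```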

If we want to apply the GPR method, the fact that we have datetime objects is a problem, because the model needs numeric inputs. Let’s take the first date as day 0 and use the number of days elapsed since day 0 as our time:
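One way to do this conversion with pandas is sketched below. The “Date” column name and the four sample rows are assumptions; “X” is the name the article uses for the new time axis.

```python
import pandas as pd

# Made-up, non-uniformly sampled rows (assumption).
data = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-09"]),
    "Value": [100.0, 101.5, 99.8, 102.3],
})

# Day 0 is the earliest date; X is the number of days elapsed since then.
day_zero = data["Date"].min()
data["X"] = (data["Date"] - day_zero).dt.days

print(data)
```

Gaps between consecutive X values show where the sampling is not uniform.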

Let’s plot the date and our new time axis (X):

2.3 The Model

Let’s fit the GPR model using the following line (it may take a while):
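The fitting step could look like the sketch below. It runs on a small synthetic series so that it finishes quickly; the initial length scale, the noise level and the use of normalize_y are assumptions, not the settings of the original article.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Synthetic stand-in: 60 non-uniformly spaced time points (assumption).
X = np.sort(rng.uniform(0, 100, size=60)).reshape(-1, 1)
y = 10 * np.sin(X.ravel() / 15) + rng.normal(0.0, 1.0, size=60)

# RBF for the smooth signal, WhiteKernel for the measurement noise.
# The starting values below are refined during fitting.
kernel = RBF(length_scale=10.0) + WhiteKernel(noise_level=1.0)

gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(X, y)

# Kernel with the hyperparameters found by the optimizer.
print(gpr.kernel_)
```

Fitting a GPR scales poorly with the number of points, which is why it may take a while on the full dataset.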

And let’s get the mean and the std boundaries:
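Here is a sketch of how the mean and the boundaries can be obtained from a fitted model, again on synthetic data. Using three standard deviations for the boundaries is an assumption chosen to match the 99.7% interval mentioned later in the article.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Synthetic stand-in for the time series (assumption).
X = np.sort(rng.uniform(0, 100, size=60)).reshape(-1, 1)
y = 10 * np.sin(X.ravel() / 15) + rng.normal(0.0, 1.0, size=60)

kernel = RBF(length_scale=10.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(X, y)

# Predicted mean and standard deviation at every time point.
mean, std = gpr.predict(X, return_std=True)

# Boundaries at three standard deviations from the mean.
upper = mean + 3 * std
lower = mean - 3 * std

print(mean[:3], lower[:3], upper[:3])
```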

Let’s convert them into two dataframe objects:
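The article does not say what the two DataFrames contain, so this sketch guesses that one holds the upper boundary and the other the lower one. The arrays are small made-up stand-ins for the GPR output, and the column names “X”, “Upper” and “Lower” are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the time axis and the GPR output (assumption).
X = np.array([0, 1, 4, 8])
mean = np.array([100.0, 100.5, 101.0, 101.8])
std = np.array([0.5, 0.4, 0.6, 0.9])

# One DataFrame per boundary, both indexed by the same time axis.
upper = pd.DataFrame({"X": X, "Upper": mean + 3 * std})
lower = pd.DataFrame({"X": X, "Lower": mean - 3 * std})

print(upper)
print(lower)
```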

Let’s plot the results:

We have shown before that the original time sampling is not uniform, and this is reflected in our smaller portion of the data as well. This is why, in the lowest figure, the GPR method is less able to identify the variance and probably overestimates it (at the end of the day, if you have few data points, how can you say that a new one is an outlier?).

2.4 Outlier Detection

The threshold that we set in the two previous plots is pretty large (99.7% confidence interval, which for a Gaussian corresponds to three standard deviations around the mean). For this reason, we can say that everything that falls outside this boundary is probably an anomaly (or an outlier) of our process.

Let’s create a new column that states if “Value” is out of the boundary for each row.
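A minimal sketch of this step is shown below. The “Value” column comes from the article; the “Lower”, “Upper” and “Outlier” column names and the four sample rows are assumptions.

```python
import pandas as pd

# Made-up values and boundaries standing in for the GPR output (assumption).
data = pd.DataFrame({
    "Value": [100.0, 105.0, 99.0, 94.0],
    "Lower": [98.5, 99.3, 98.0, 99.1],
    "Upper": [101.5, 101.7, 102.8, 104.5],
})

# A row is an outlier if its value falls outside the [Lower, Upper] band.
data["Outlier"] = (data["Value"] > data["Upper"]) | (data["Value"] < data["Lower"])

print(data)
```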

Let’s plot the outliers defined this way:

The interesting thing is that we can tell whether a point is an outlier because it is larger than the upper boundary or because it is lower than the lower boundary. In other words, we are able to detect if the price of gold is higher or lower than we are expecting:
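One way to label the direction of each outlier is sketched below with np.select. The “Value” column comes from the article; the other column names, the label strings and the sample rows are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up values and boundaries standing in for the GPR output (assumption).
data = pd.DataFrame({
    "Value": [100.0, 105.0, 99.0, 94.0],
    "Lower": [98.5, 99.3, 98.0, 99.1],
    "Upper": [101.5, 101.7, 102.8, 104.5],
})

# First condition that holds wins; rows matching neither get the default.
conditions = [
    (data["Value"] > data["Upper"]).to_numpy(),
    (data["Value"] < data["Lower"]).to_numpy(),
]
labels = ["above upper", "below lower"]
data["Outlier type"] = np.select(conditions, labels, default="none")

print(data)
```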

And here you go:

3. Conclusions

This idea seems simple and naive, but it can actually become pretty complex and intriguing when the kernel function is customized (given a more precise understanding of the system we are considering).

The powerful idea is that you can analyze the behavior of your system without any labels and detect the outliers by doing so.

I hope this article was fun and inspiring. If so, and you would like to discuss it or have any questions, please send me an email at piero.paialunga@hotmail.com or add me on LinkedIn.

Ciao :)


PhD in Aerospace Engineering at the University of Cincinnati. Machine Learning Engineer @ Gen Nine, Martial Artist, Coffee Drinker, from Italy.