
Isolation Forest – Auto Anomaly Detection with Python

Detecting Outliers Using Python's Scikit-learn Library

Photo by Pixabay: https://www.pexels.com/photo/black-tree-near-body-of-water-35796/

Isolation Forest is a popular unsupervised machine learning algorithm for detecting anomalies (outliers) within datasets. Anomaly Detection is a crucial part of any machine learning and data science workflow. Erroneous values that are not identified early on can result in inaccurate predictions from machine learning models, and therefore impact the interpretation of those results.

Within this short article, we will cover the basics of the algorithm and how it can be easily implemented with Python using Scikit-learn.

But first, we need to cover what outliers actually are.

What are outliers?

When working with real-world data it is common to come across data points that fall outside of the normal trend. These are often referred to as outliers or anomalies. Detecting anomalies can be useful for a number of reasons, including:

  • Detecting fraudulent credit card transactions
  • Identifying unusual network traffic which could indicate unauthorised access
  • Detecting anomalous spots/pixels within deep space images that have the potential to be a new star
  • Detecting anomalous "features" within medical scans

Within well log measurements and petrophysical data, outliers can occur due to washed-out boreholes, tool and sensor issues, rare geological features, and issues in the data acquisition process.

It is essential that outliers are identified and investigated early on in the Data Science/machine learning workflow as they can result in inaccurate predictions from machine learning models.

There are numerous algorithms, both supervised and unsupervised, available within Python that allow these anomalies to be detected.

In the supervised learning case, we can use data samples that we have already gone through and labelled as good or bad to train a model and use it to predict whether new data samples are anomalous or not.

However, these methods come with a number of issues: labelled datasets are often small, dimensionality often has to be kept low to reduce computational time, and anomalies may be rare within the labelled data, which can result in false positives.

One of the unsupervised methods is called Isolation Forest. Full details of how the algorithm works can be found in the original paper by Liu et al. (2008), which is freely available online.

Isolation Forest Methodology

Isolation Forest is a model-based outlier detection method that attempts to isolate anomalies from the rest of the data using an ensemble of decision trees. It does not rely on training a model on labelled data.

This method selects a feature and makes a random split in the data between the minimum and maximum values. This process then carries on down the decision tree until all possible splits have been made in the data or a limit on the number of splits is reached.

Any anomalies/outliers will be split off early in the process making them easy to identify and isolate from the rest of the data.

Detection of anomalies with this method assumes:

  • Anomalies make up only a small proportion of the data
  • Anomalous values are different from normal values

The image below illustrates a very simple example using a single variable – bulk density (RHOB) and a single tree.

Example of an isolation forest for well log data. Image by the author.

As we are using an ensemble (group) of trees that are randomly fitted to the data, we can take the average of the depth within the tree at which the outliers occur, and make a final scoring on the "outlierness" of that data point.

Advantages of Isolation Forest

Isolation Forest has a number of advantages compared to traditional distance and density-based models:

  • Reduced computational times, as anomalies are isolated early in the splitting process
  • Easily scalable to high dimensional and large datasets
  • Sub-samples the data to a degree which is not possible with other methods
  • Works when irrelevant features are included

Isolation Forest Python Tutorial

In the following examples, we will see how we can apply the Isolation Forest algorithm to a well log dataset using Scikit-learn and visualise the results.

Data Source

For this, we will be using a subset of a larger dataset that was used as part of a Machine Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020). It is released under an NLOD 2.0 licence from the Norwegian Government, details of which can be found here: Norwegian Licence for Open Government Data (NLOD) 2.0.

The full dataset can be accessed at the following link: https://doi.org/10.5281/zenodo.4351155.

All of the examples within this article can be used with any dataset.

Importing Libraries and Data

For this tutorial, we will need to import seaborn, pandas and IsolationForest from Scikit-learn.
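Assuming a fairly standard setup, the imports would look something like this (I have also included matplotlib, which seaborn uses behind the scenes for displaying figures):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Isolation Forest model from Scikit-learn
from sklearn.ensemble import IsolationForest
```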

Once these have been imported, we next need to load our well log data. In this example, the data is stored within a CSV file and contains measurements for a single well: 15/9–15.

Do not worry if you are not familiar with this type of data, the technique shown within this tutorial can equally be applied to other datasets.
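A minimal sketch of loading the data and generating a summary is shown below. Note that the file name used here is an assumption, so replace it with the path to your own copy of the data:

```python
# Load the well log measurements from CSV
# (the file name is an assumption -- point this at your own copy of the data)
df = pd.read_csv('Data/15_9-15.csv')

# Summary statistics for the numeric columns
df.describe()
```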

This returns the following dataframe summary:

The summary above only shows the numeric data present within the file. If we want to take a look at all features within the dataframe, we can call upon df.info(), which will inform us we have 12 columns of data with varying levels of completeness.

Overview of features within the dataframe. Image by the author.

As with many machine learning algorithms, we need to deal with missing values. As seen above, several columns are incomplete; for example, NPHI (neutron porosity) has only 13,346 values and GR (gamma ray) has 17,717 values.

The simplest way to deal with these missing values is to drop them. Even though this is a quick method, it should not be done blindly and you should attempt to understand the reason for the missing values. Removing these rows results in a reduced dataset when it comes to building machine learning models.

I would recommend checking out my other articles on dealing with missing data if you want to find out more about this topic.

To remove missing rows, we can call upon the following:
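In its simplest form, this is just pandas' dropna():

```python
# Remove any rows containing missing values and keep the cleaned dataframe
df = df.dropna()
```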

If we call upon df.info() again, we will see that we are now down to 13,290 values for every column.

Overview of features within the dataframe after null values have been removed. Image by the author.

Building the Isolation Forest Model with Scikit-Learn

From our dataframe, we need to select the variables we will train our Isolation Forest model with.

In this example, I am going to use just two variables (NPHI and RHOB). In reality, we would use more and we will see an example of that later on. Using two variables allows us to visualise what the algorithm has done.

First, we will create a list of our column names:
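Something along the following lines works; the name anomaly_inputs is just a convenient label for the list:

```python
# Columns used to train the Isolation Forest model
anomaly_inputs = ['NPHI', 'RHOB']
```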

Next, we will create an instance of our Isolation Forest model. This is done by creating a variable called model_IF and assigning an instance of IsolationForest() to it.

We can then pass in a number of parameters for our model. The ones I have used in the code below are:

  • contamination: This is how much of the overall data we expect to be considered as an outlier. We can pass in a value between 0 and 0.5 or set it to 'auto'.
  • random_state: This allows us to control the random selection process for splitting trees. In other words, if we were to rerun the model with the same data and parameters with a fixed value for this parameter, we should get repeatable outputs.
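A sketch of that set-up, using a 10% contamination level for this first example, is shown below; the random seed value is an assumption:

```python
# Create the Isolation Forest model
# (the random seed of 42 is an arbitrary choice for reproducibility)
model_IF = IsolationForest(contamination=0.1, random_state=42)
```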

Once our model has been initialised, we can train it on the data. To do this we call upon the .fit() function and pass in our dataframe (df), selecting only the columns we defined earlier.
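That call looks something like this:

```python
# Train the model on the selected columns only
model_IF.fit(df[anomaly_inputs])
```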

After fitting the model, we can now create some predictions. We will do this by adding two new columns to our dataframe:

  • anomaly_scores : Generated by calling upon model_IF.decision_function() and provides the anomaly score for each sample within the dataset. The lower the score, the more abnormal that sample is. Negative values indicate that the sample is an outlier.
  • anomaly : Generated by calling upon model_IF.predict() and is used to identify if a point is an outlier (-1) or an inlier (1)
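Adding those two columns can be sketched as:

```python
# Anomaly score for each sample (more negative = more anomalous)
df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])

# Predicted label: -1 for outliers, 1 for inliers
df['anomaly'] = model_IF.predict(df[anomaly_inputs])
```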

Once the anomalies have been identified, we can view our dataframe and see the result.

Pandas dataframe showing results of anomaly detection. Values of 1 indicate data points are good. Image by the author.

In the returned values above, we can see the original input features, the generated anomaly scores and whether that point is an anomaly or not.

Visualising Anomaly Data Using Seaborn

Looking at the numeric values and trying to determine if the point has been identified as an outlier or not can be tedious.

Instead, we can use seaborn to generate a basic figure. We can use the data we used to train our model and visually split it up into outliers or inliers.

This simple function is designed to generate that plot and provide some additional metrics as text. The function takes:

  • data : Dataframe containing the values
  • outlier_method_name : The name of the method we are using. This is just for display purposes
  • xvar , yvar : The variables that we want to plot on the x and y axis respectively
  • xaxis_limits, yaxis_limits : The x and y-axis ranges
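The original function is not reproduced here, but a minimal sketch with the same signature could look like the code below. The exact styling, colours, and printed metrics are assumptions:

```python
def outlier_plot(data, outlier_method_name, xvar, yvar,
                 xaxis_limits=None, yaxis_limits=None):
    """Scatter plot of inliers vs outliers with simple summary counts."""
    # Basic metrics displayed as text
    n_outliers = (data['anomaly'] == -1).sum()
    n_inliers = (data['anomaly'] == 1).sum()
    print(f'Outlier method: {outlier_method_name}')
    print(f'Number of anomalous values: {n_outliers}')
    print(f'Number of non-anomalous values: {n_inliers}')
    print(f'Total number of values: {len(data)}')

    # Colour each point by its predicted label (-1 = outlier, 1 = inlier)
    ax = sns.scatterplot(data=data, x=xvar, y=yvar, hue='anomaly',
                         palette=['tab:orange', 'tab:blue'], hue_order=[-1, 1])
    ax.set_title(f'Outlier detection with {outlier_method_name}')
    if xaxis_limits is not None:
        ax.set_xlim(xaxis_limits)
    if yaxis_limits is not None:
        ax.set_ylim(yaxis_limits)
    plt.show()
```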

Once our function has been defined, we can then pass in the required parameters.
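For example (the axis ranges below are assumptions based on typical values for these measurements):

```python
outlier_plot(df, 'Isolation Forest', 'NPHI', 'RHOB',
             xaxis_limits=[0, 0.8], yaxis_limits=[1.5, 3.0])
```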

Which generates the following plot.

Seaborn Scatter Plot showing outliers and inliers as identified by the Isolation Forest model. Image by the author.

Right away we can tell how many values have been identified as outliers and where they are located. As we are only using two variables, we can see that we have essentially formed a separation between the points at the edge of the data and those in the centre.

Increasing Isolation Forest Contamination Value

The previous example uses a value of 0.1 (10%) for the contamination parameter. What if we increased that to 0.3 (30%)?
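Refitting the model and regenerating the plot with the higher contamination value might look like this:

```python
# Re-create the model with a 30% contamination level
model_IF = IsolationForest(contamination=0.3, random_state=42)
model_IF.fit(df[anomaly_inputs])

# Update the scores and labels, then regenerate the plot
df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
df['anomaly'] = model_IF.predict(df[anomaly_inputs])

outlier_plot(df, 'Isolation Forest (30% contamination)', 'NPHI', 'RHOB',
             xaxis_limits=[0, 0.8], yaxis_limits=[1.5, 3.0])
```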

When we run the above code, we get the following plot. We can see that significantly more points have been selected and identified as outliers.

Seaborn Scatter Plot showing outliers and inliers as identified by the Isolation Forest model with 30% contamination. Image by the author.

How do we know which contamination value to set?

Setting the contamination value allows us to identify what percentage of values should be identified as outliers, but choosing that value can be tricky.

There are no hard and fast rules for picking this value, and it should be based on the domain knowledge surrounding the data and its intended application(s).

For this particular dataset, which I am very familiar with, I would consider other features such as borehole caliper and delta-rho (DRHO) to help identify potentially poor data.

Using More than 2 Features for Isolation Forest

Now that we have seen the basics of using Isolation Forest with just two variables, let’s see what happens when we use a few more.
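A sketch of that workflow is shown below; the exact list of features is an assumption based on the measurements discussed in this article:

```python
# Train on a larger set of well log measurements
# (the feature list below is an assumption)
anomaly_inputs = ['NPHI', 'RHOB', 'GR', 'CALI', 'PEF', 'DRHO']

model_IF = IsolationForest(contamination=0.1, random_state=42)
model_IF.fit(df[anomaly_inputs])

df['anomaly_scores'] = model_IF.decision_function(df[anomaly_inputs])
df['anomaly'] = model_IF.predict(df[anomaly_inputs])

# Plot the same two variables as before so the results can be compared
outlier_plot(df, 'Isolation Forest (multiple features)', 'NPHI', 'RHOB',
             xaxis_limits=[0, 0.8], yaxis_limits=[1.5, 3.0])
```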

Seaborn Scatter Plot showing outliers and inliers as identified by the Isolation Forest model using multiple input features and 10% contamination. Image by the author.

We now see that the points identified as outliers are much more spread out on the scatter plot, and there is no hard edge around a core group of points.

Visualising Outliers with Seaborn’s Pairplot

Instead of just looking at two of the variables, we can look at all of the variables we have used. This is done by using the seaborn pairplot.

First, we need to set the palette, which will allow us to control the colours being used in the plot.
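One way to do this is to define a dictionary that maps each predicted label to a colour (the specific colours here are an assumption):

```python
# Orange for outliers (-1), blue for inliers (1)
palette = {-1: 'tab:orange', 1: 'tab:blue'}
```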

Then, we can call upon sns.pairplot and pass in the required parameters.
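Restricting the plot to the features used to train the model, the call could look like:

```python
# Pairplot of the input features, coloured by the Isolation Forest prediction
sns.pairplot(df, vars=anomaly_inputs, hue='anomaly', palette=palette)
```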

Which returns:

Seaborn pairplot of all features within the dataset after Isolation Forest. Orange points indicate outliers (-1) and blue points indicate inliers (1). Image by the author.

This provides us with a much better overview of the data, and we can now see some of the outliers clearly highlighted within the other features, especially PEF and GR.

Summary

Isolation Forest is an easy-to-use and easy-to-understand unsupervised machine learning method that can isolate anomalous data points from good data. The algorithm can be scaled up to handle large and high-dimensional datasets if required.

If you are interested in seeing how this method compares to other methods, you may like the following article:

Well Log Data Outlier Detection With Machine Learning and Python


Thanks for reading. Before you go, you should definitely subscribe to my content and get my articles in your inbox. You can do that here! Alternatively, you can sign up for my newsletter to get additional content straight into your inbox for free.

Secondly, you can get the full Medium experience and support me and thousands of other writers by signing up for a membership. It only costs you $5 a month, and you have full access to all of the amazing Medium articles, as well as the chance to make money with your writing. If you sign up using my link, you will support me directly with a portion of your fee, and it won’t cost you more. If you do so, thank you so much for your support!

References

Bormann, Peter, Aursand, Peder, Dilib, Fahad, Manral, Surrender, & Dischington, Peter. (2020). FORCE 2020 Well log and lithofacies dataset for machine learning competition [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4351156

Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation Forest. 2008 Eighth IEEE International Conference on Data Mining, 413–422.

