Anomaly Detection: Isolation Forest with Statistical Rules

adithya krishnan
Towards Data Science
7 min read · Aug 24, 2019


This is a follow-up article on anomaly detection, this time with isolation forest. In the previous article we looked at anomaly detection with time series forecasting and classification. With isolation forest we had to deal with the contamination parameter, which fixes the percentage of points in our data that will be labeled anomalous.

While that can be a good start for initial results, it leaves you with a problem: at any point in time, exactly x% of the points will be returned as anomalous, regardless of what the data actually looks like. Let's look at one possible way to avoid this.

Let's start with a direct implementation of isolation forest on a dataset that has a timestamp and cpu_utilization as the value:

Link to dataset (https://github.com/numenta/NAB/tree/master/data/realAWSCloudwatch)

import pandas as pd
import numpy as np
full_df=pd.read_csv('ec2_cpu_utilization_5f5533.csv')
full_df.head()

We have around 4000 data points at a 5 minute level ranging from Feb 14 to Feb 28.

print(full_df['timestamp'].min())
print(full_df['timestamp'].max())
print(len(full_df['timestamp']))
Min & Max Timestamps with number of data points

Visualize the points to get an overview of the time series data:

# Using plotly graph_objects
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from matplotlib import pyplot
init_notebook_mode(connected=True)
fig = go.Figure(data=[go.Scatter(x=full_df['timestamp'], y=full_df['value'])])
iplot(fig)

I'll filter the data to a single day just to keep the visualization simple. Feb 24 looks like a good option, as there is a dip and a significant spike in the data.

df=full_df.loc[(full_df['timestamp']>'2014-02-24 00:00:00')&(full_df['timestamp']<'2014-02-24 23:59:59')]
df.head()
plot_data=go.Scatter(x=df['timestamp'], y=df['value'])
fig=go.Figure(data=[plot_data])
iplot(fig)

Now let's fit an isolation forest model to this data, with the contamination parameter set to 4% based on my intuition from the visualization.

from sklearn.ensemble import IsolationForest

# behaviour='new' from the original code is the default in scikit-learn >= 0.22
# and the parameter has since been removed, so it is dropped here
clf=IsolationForest(n_estimators=10, max_samples='auto', contamination=float(.04),
                    max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0)
clf.fit(df[['value']])
df['scores']=clf.decision_function(df[['value']])
df['anomaly']=clf.predict(df[['value']])
df.head()
# predict returns 1 for inliers and -1 for outliers; relabel them as 0/1
df.loc[df['anomaly'] == 1,'anomaly'] = 0
df.loc[df['anomaly'] == -1,'anomaly'] = 1
df['anomaly'].value_counts()
def plot_anomaly(df,metric_name):
    df.timestamp = pd.to_datetime(df['timestamp'].astype(str), format="%Y-%m-%d %H:%M:%S")
    dates = df.timestamp
    # identify the anomaly points and create an array of their values for the plot
    bool_array = (abs(df['anomaly']) > 0)
    actuals = df["value"][-len(bool_array):]
    anomaly_points = bool_array * actuals
    anomaly_points[anomaly_points == 0] = np.nan
    # a dictionary for conditionally formatting the table based on the anomaly label
    color_map = {0: "rgba(228, 222, 249, 0.65)", 1: "red"}
    # table which includes Date and Actual values
    table = go.Table(
        domain=dict(x=[0, 1],
                    y=[0, 0.3]),
        columnwidth=[1, 2],
        header=dict(height=20,
                    values=[['<b>Date</b>'], ['<b>Actual Values </b>']],
                    font=dict(color=['rgb(45, 45, 45)'] * 5, size=14),
                    fill=dict(color='#d562be')),
        cells=dict(values=[df.round(3)[k].tolist() for k in ['timestamp', 'value']],
                   line=dict(color='#506784'),
                   align=['center'] * 5,
                   font=dict(color=['rgb(40, 40, 40)'] * 5, size=12),
                   suffix=[None] + [''] + [''] + ['%'] + [''],
                   height=27,
                   # colour each row based on the anomaly label
                   fill=dict(color=[df['anomaly'].map(color_map)])))
    # plot the actual points as a line
    Actuals = go.Scatter(name='Actuals',
                         x=dates,
                         y=df['value'],
                         xaxis='x1', yaxis='y1',
                         mode='lines',
                         marker=dict(size=12,
                                     line=dict(width=1),
                                     color="blue"))
    # highlight the anomaly points as red markers
    anomalies_map = go.Scatter(name="Anomaly",
                               showlegend=True,
                               x=dates,
                               y=anomaly_points,
                               mode='markers',
                               xaxis='x1',
                               yaxis='y1',
                               marker=dict(color="red",
                                           size=11,
                                           line=dict(color="red",
                                                     width=2)))
    axis = dict(showline=True,
                zeroline=False,
                showgrid=True,
                mirror=True,
                ticklen=4,
                gridcolor='#ffffff',
                tickfont=dict(size=10))
    layout = dict(width=1000,
                  height=865,
                  autosize=False,
                  title=metric_name,
                  margin=dict(t=75),
                  showlegend=True,
                  xaxis1=dict(axis, **dict(domain=[0, 1], anchor='y1', showticklabels=True)),
                  yaxis1=dict(axis, **dict(domain=[2 * 0.21 + 0.20, 1], anchor='x1', hoverformat='.2f')))
    fig = go.Figure(data=[table, anomalies_map, Actuals], layout=layout)
    iplot(fig)
    pyplot.show()

plot_anomaly(df,'anomalies')
print("Percentage of anomalies in data: {:.2f}".format((len(df.loc[df['anomaly']==1])/len(df))*100))

This looks reasonable: the plot shows we have captured the significant spike as well as a few other dips and spikes, along with some minor false anomalies. Now let's see what role the contamination parameter plays here.

Inliers occur close together, so it takes many splits to isolate an inlier x(i), whereas an outlier x(o) can be separated in just a few splits.

Isolation forest randomly partitions points away from the rest of the data and builds a tree in which each point ends up isolated in its own leaf; the number of splits needed to isolate a point determines its depth. Outliers end up closer to the root of the tree, while inliers sit at greater depth. A forest of such trees is built according to the n_estimators and max_samples parameters, and the anomaly score is derived from the average depth across the forest.
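To make the depth intuition concrete, here is a minimal sketch (my own illustration, not code from the article) that computes the average path length of each point across the fitted trees in clf.estimators_. It assumes the single-feature, max_features=1.0 setup used above, and average_depth is a hypothetical helper:

import numpy as np

def average_depth(forest, X):
    # decision_path returns the nodes each sample visits in a tree;
    # summing the indicator matrix row-wise gives that sample's path length
    depths = [tree.decision_path(X).sum(axis=1) for tree in forest.estimators_]
    return np.asarray(depths).mean(axis=0).ravel()

X = df[['value']].values
df['avg_depth'] = average_depth(clf, X)
# anomalies should tend to sit at a shallower average depth than inliers
print(df.groupby('anomaly')['avg_depth'].mean())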

You can use score_samples / decision_function to get the anomaly score of each point; in scikit-learn's computation, the more negative the score, the more anomalous the point.

Up to this step the contamination factor has no influence on the scores. It only comes into play when we call predict to get the 1/0 anomaly labels: contamination acts as a cutoff (a percentile) on the scores, and the top x% most negative scores are returned as anomalies. (For example, if contamination is set to 0.05, i.e. 5%, then the points with the 5% most negative scores are labeled anomalies.)
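A quick way to convince yourself of this (a sketch based on scikit-learn's behaviour, not code from the article): with a float contamination, the fitted offset_ is the corresponding percentile of score_samples on the training data, decision_function is score_samples minus that offset, and predict flags everything below it. Each print below should come out True on this data:

import numpy as np

raw = clf.score_samples(df[['value']])              # lower = more anomalous
cutoff = np.percentile(raw, 100 * 0.04)             # contamination = 4%
print(np.isclose(cutoff, clf.offset_))              # offset_ is that percentile
print(np.allclose(clf.decision_function(df[['value']]), raw - clf.offset_))
manual = (raw < cutoff).astype(int)                 # 1 = anomaly
print((manual == df['anomaly'].values).all())       # matches the predict-based labels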

df['scores'].hist()

Inliers, with positive scores, are clustered on the right; points with negative scores are the anomalies. This is the distribution of scores for this specific set of input points. If we finalize the model on this data and then predict on future points, the distribution of scores may change, but the contamination cutoff will still adjust to force 4% of points to be labeled anomalous. To overcome this, instead of a hard-coded percentage of points always being labeled as anomalies regardless of how the scores change, we can apply a statistical rule such as Z-score or IQR on the scores to detect those changes and classify anomalies better.

def iqr_bounds(scores,k=1.5):
    # Tukey-style fences on the score distribution
    q1 = scores.quantile(0.25)
    q3 = scores.quantile(0.75)
    iqr = q3 - q1
    lower_bound=(q1 - k * iqr)
    upper_bound=(q3 + k * iqr)
    print("Lower bound:{} \nUpper bound:{}".format(lower_bound,upper_bound))
    return lower_bound,upper_bound

lower_bound,upper_bound=iqr_bounds(df['scores'],k=2)
df['anomaly']=0
df['anomaly']=(df['scores'] < lower_bound) | (df['scores'] > upper_bound)
df['anomaly']=df['anomaly'].astype(int)
plot_anomaly(df,'iqr based')

With k set to 2 in the IQR rule, the percentage of points labeled as anomalies comes out to around 2% here.

print("Percentage of anomalies in data: {:.2f}".format((len(df.loc[df['anomaly']==1])/len(df))*100))

Now let's validate this on another day of data, using the same contamination value and the same k in the IQR rule on the scores:

df=full_df.loc[(full_df['timestamp']>'2014-02-17 00:00:00')&(full_df['timestamp']<'2014-02-17 23:59:59')]
# Using graph_objects
plot_data=go.Scatter(x=df['timestamp'], y=df['value'])
fig=go.Figure(data=[plot_data])
iplot(fig)

Looking at the plot, the data for this day looks different, with fewer anomalies.

from sklearn.ensemble import IsolationForest
clf=IsolationForest(n_estimators=10, max_samples='auto', contamination=float(.04),
                    max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0)
clf.fit(df[['value']])
df['scores']=clf.decision_function(df[['value']])
df['anomaly']=clf.predict(df[['value']])
df.loc[df['anomaly'] == 1,'anomaly'] = 0
df.loc[df['anomaly'] == -1,'anomaly'] = 1
plot_anomaly(df,'anomalies')
print("Percentage of anomalies in data: {:.2f}".format((len(df.loc[df['anomaly']==1])/len(df))*100))

And indeed, the visualization shows more false anomalies here. The percentage of anomalous points stays the same, because we used the same contamination value on the new time frame.

df['scores'].hist()
lower_bound,upper_bound=iqr_bounds(df['scores'],k=2)
df['anomaly']=0
df['anomaly']=(df['scores'] < lower_bound) |(df['scores'] > upper_bound)
df['anomaly']=df['anomaly'].astype(int)
plot_anomaly(df,'iqr second case')
print("Percentage of anomalies in data: {:.2f}".format((len(df.loc[df['anomaly']==1])/len(df))*100))

In this case the percentage of outliers drops with the IQR method, because the data distribution, and therefore the distribution of scores, has changed.

In real-time anomaly detection, this combination of statistical rules on top of isolation forest works better: you train your model once, deploy it, and predict on a future stream of data whose distribution may change over time, so the scores of new data will be different.
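A minimal sketch of that deployment pattern (my own illustration of the idea; fit_detector and label_batch are hypothetical helpers): fit once on historical data, then for each new batch compute scores and re-derive the IQR bounds from the batch's scores before labeling, so the cutoff adapts to the current score distribution:

from sklearn.ensemble import IsolationForest

def fit_detector(train_df, n_estimators=100, random_state=42):
    # fit once on historical data
    clf = IsolationForest(n_estimators=n_estimators, random_state=random_state)
    clf.fit(train_df[['value']])
    return clf

def label_batch(clf, batch, k=2):
    # score the new batch, then apply the IQR rule to *these* scores
    batch = batch.copy()
    batch['scores'] = clf.decision_function(batch[['value']])
    lower, upper = iqr_bounds(batch['scores'], k=k)
    batch['anomaly'] = ((batch['scores'] < lower) | (batch['scores'] > upper)).astype(int)
    return batch

# e.g. labeled = label_batch(fit_detector(full_df), df)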

The k parameter in IQR can also be tuned based on feedback on the anomalies detected: if there are too many false positives, increase k to widen the bounds; if there are false negatives, decrease k to flag more points as anomalies.

I will follow this up with some details on the isolation forest algorithm, the newly added warm_start option, and interpreting the results from isolation forest.

Add-on: below is a multivariate anomaly detection 3D visualization using plotly express, which I found to be cool. Play with it in all 3 dimensions to take a look at the anomalies.
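The interactive embed does not carry over in text, but here is a sketch of how such a plot can be built with plotly express (the DataFrame multi_df and its feature column names are placeholders, not from the article):

import plotly.express as px

# multi_df is a placeholder DataFrame with three numeric features and a 0/1 anomaly label
fig = px.scatter_3d(multi_df, x='feature_1', y='feature_2', z='feature_3',
                    color='anomaly')
fig.show()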
