Anomaly Detection with Isolation Forest & Visualization

adithya krishnan
Towards Data Science
Feb 10, 2019


A sudden spike or dip in a metric is anomalous behavior, and both cases need attention. Anomaly detection can be handled with supervised learning algorithms if we have labels for anomalous behavior before modeling, but initially, without any feedback, it is difficult to identify those points. So we model this as an unsupervised problem using algorithms like Isolation Forest, One-Class SVM and LSTM. Here we identify anomalies using Isolation Forest.

The data here is for a use case (e.g. revenue, traffic, etc.) at a day level, with 12 metrics. We first have to identify whether there is an anomaly at the use-case level. Then, for better actionability, we drill down to the individual metrics and identify anomalies in them.
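To make the expected input concrete, here is a minimal sketch of the long format the code below assumes, with made-up values (the column names load_date, metric_name and actuals are taken from the read and pivot calls that follow):

# Hypothetical illustration of the raw data layout -- the real file is read from ../input below
import pandas as pd

sample = pd.DataFrame({
    "load_date": [20190101, 20190101, 20190102, 20190102],
    "metric_name": ["revenue", "traffic", "revenue", "traffic"],
    "actuals": [120000.0, 53000.0, 118500.0, 51200.0],
})
print(sample)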

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings('ignore')
import os
print(os.listdir("../input"))
df=pd.read_csv("../input/metric_data.csv")
df.head()

Now do a pivot on the dataframe to create a dataframe with all metrics at a date level. Flatten the resulting multi-index pivot dataframe by resetting its index and fill the missing values with 0.

metrics_df=pd.pivot_table(df,values='actuals',index='load_date',columns='metric_name')
metrics_df.reset_index(inplace=True)
metrics_df.fillna(0,inplace=True)
metrics_df

Isolation Forest tries to separate each point in the data. In the 2D case, it randomly draws a line and tries to single out a point. An anomalous point can be isolated in a few such steps, while normal points that lie close together take significantly more steps to be separated.
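As a quick illustration of that intuition (a toy sketch, not part of the pipeline below), an obvious outlier in 2D gets a noticeably lower score from score_samples than points inside the cluster:

# Toy 2D example: one far-away point is isolated quickly and scores as more anomalous
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0, 1, size=(100, 2)),  # dense cluster of normal points
                   [[8.0, 8.0]]])                    # one obvious outlier
toy_clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X_toy)
scores = toy_clf.score_samples(X_toy)  # lower (more negative) = more anomalous
print(scores[:3], scores[-1])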

I am not going deep into each parameter. Contamination is the important one here: it stands for the expected proportion of outlier points in the data. I arrived at its value by trial and error, validating the results against the outliers visible in the 2D plot; a quick sweep of a few values (sketched after the code block below) also helps sanity-check the choice.

I am using sklearn's Isolation Forest here, as this is a small dataset with a few months of data. H2O's recently released Isolation Forest, which scales better to high-volume datasets, would also be worth exploring.
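For larger datasets, the H2O variant would look roughly like the sketch below. This is untested here, and the parameter and output names should be checked against the H2O tutorial linked below:

# Rough sketch of the H2O Isolation Forest variant (assumes the h2o package is installed)
import h2o
from h2o.estimators import H2OIsolationForestEstimator

h2o.init()
hf = h2o.H2OFrame(metrics_df.drop(columns=['load_date']))  # same metric columns as modelled below
iso = H2OIsolationForestEstimator(ntrees=100, seed=42)
iso.train(training_frame=hf)
h2o_scores = iso.predict(hf)  # per-row anomaly scores / mean path lengths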

More details of the algorithm can be found here : https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf

More details on H2O Isolation forest : https://github.com/h2oai/h2o-tutorials/tree/master/tutorials/isolation-forest

metrics_df.columns
#specify the 12 metrics column names to be modelled
to_model_columns=metrics_df.columns[1:13]
from sklearn.ensemble import IsolationForest
clf=IsolationForest(n_estimators=100, max_samples='auto', contamination=float(.12), \
max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0)
clf.fit(metrics_df[to_model_columns])
pred = clf.predict(metrics_df[to_model_columns])
metrics_df['anomaly']=pred
outliers=metrics_df.loc[metrics_df['anomaly']==-1]
outlier_index=list(outliers.index)
#print(outlier_index)
#Find the number of anomalies and normal points; points classified as -1 are anomalous
print(metrics_df['anomaly'].value_counts())
There are 15 outliers, indicated by -1.
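To sanity-check the contamination setting mentioned earlier, a quick sweep over a few values shows how many points each choice would flag (a rough sketch using the dataframe defined above):

# Rough sketch: count how many points each contamination value flags as outliers
for c in [0.05, 0.1, 0.12, 0.15]:
    sweep_clf = IsolationForest(n_estimators=100, contamination=c, random_state=42)
    n_flagged = (sweep_clf.fit_predict(metrics_df[to_model_columns]) == -1).sum()
    print("contamination=%.2f flags %d points" % (c, n_flagged))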

We have now classified anomalies across the 12 metrics using Isolation Forest. Let's visualize the results and check whether the classification makes sense.

Normalize the metrics, fit a PCA to reduce the number of dimensions, and then plot the points in 3D, highlighting the anomalies.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from mpl_toolkits.mplot3d import Axes3D
pca = PCA(n_components=3) # Reduce to k=3 dimensions
scaler = StandardScaler()
#normalize the metrics
X = scaler.fit_transform(metrics_df[to_model_columns])
X_reduce = pca.fit_transform(X)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_zlabel("x_composite_3")
# Plot the compressed data points
ax.scatter(X_reduce[:, 0], X_reduce[:, 1], zs=X_reduce[:, 2], s=4, lw=1, label="inliers",c="green")
# Plot x's for the ground truth outliers
ax.scatter(X_reduce[outlier_index,0],X_reduce[outlier_index,1], X_reduce[outlier_index,2],
lw=2, s=60, marker="x", c="red", label="outliers")
ax.legend()
plt.show()
3D plot of outliers highlighted

As the 3D plot shows, the anomalous points mostly lie away from the cluster of normal points, but a 2D plot will help us judge even better.
Let's plot the same data reduced to 2 dimensions with PCA.

from sklearn.decomposition import PCA
pca = PCA(2)
pca.fit(metrics_df[to_model_columns])
res=pd.DataFrame(pca.transform(metrics_df[to_model_columns]))
Z = np.array(res)
plt.title("IsolationForest")
plt.contourf(Z, cmap=plt.cm.Blues_r)
b1 = plt.scatter(res[0], res[1], c='green',
s=20,label="normal points")
b1 =plt.scatter(res.iloc[outlier_index,0],res.iloc[outlier_index,1], c='green',s=20, edgecolor="red",label="predicted outliers")
plt.legend(loc="upper right")
plt.show()

The 2D plot gives us a clearer picture that the algorithm classifies the anomalous points in the use case correctly.

Anomalies are highlighted with red edges and normal points are shown in green in the plot.

Here the contamination parameter plays a big role.
Our idea is to capture all the anomalous points in the system.
So it is better to flag a few normal points as anomalous (false positives) than to miss an actual anomaly (a false negative). (That is why I specified 12% as the contamination; the right value varies by use case.)

Now we have figured out the anomalous behavior at the use-case level. But to act on an anomaly, it is important to identify and provide information on which individual metrics are anomalous.

The anomalies identified by the algorithm should make sense when viewed visually (sudden dips/peaks) so that the business user can act on them. So creating a good visualization is equally important in this process.

This function plots the actuals as a time series with the anomaly points highlighted on it, along with a table that shows the actual values, the change from the previous point, and conditional formatting based on the anomaly class.

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.plotly as py
import matplotlib.pyplot as plt
from matplotlib import pyplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)
def plot_anomaly(df, metric_name):
    df.load_date = pd.to_datetime(df['load_date'].astype(str), format="%Y%m%d")
    dates = df.load_date
    # Identify the anomaly points and create an array of their values for the plot
    bool_array = (abs(df['anomaly']) > 0)
    actuals = df["actuals"][-len(bool_array):]
    anomaly_points = bool_array * actuals
    anomaly_points[anomaly_points == 0] = np.nan
    # A dictionary for the conditionally formatted table based on the anomaly class
    color_map = {0: "'rgba(228, 222, 249, 0.65)'", 1: "yellow", 2: "red"}

    # Table which includes date, actuals and the change from the previous point
    table = go.Table(
        domain=dict(x=[0, 1],
                    y=[0, 0.3]),
        columnwidth=[1, 2],
        # columnorder=[0, 1, 2],
        header=dict(height=20,
                    values=[['<b>Date</b>'], ['<b>Actual Values </b>'], ['<b>% Change </b>']],
                    font=dict(color=['rgb(45, 45, 45)'] * 5, size=14),
                    fill=dict(color='#d562be')),
        cells=dict(values=[df.round(3)[k].tolist() for k in ['load_date', 'actuals', 'percentage_change']],
                   line=dict(color='#506784'),
                   align=['center'] * 5,
                   font=dict(color=['rgb(40, 40, 40)'] * 5, size=12),
                   # format = [None] + [",.4f"] + [',.4f'],
                   # suffix=[None] * 4,
                   suffix=[None] + [''] + [''] + ['%'] + [''],
                   height=27,
                   # map cell colours to the anomaly class via the dictionary above
                   fill=dict(color=[df['anomaly_class'].map(color_map)])
                   ))
    # Plot the actuals as a line
    Actuals = go.Scatter(name='Actuals',
                         x=dates,
                         y=df['actuals'],
                         xaxis='x1', yaxis='y1',
                         mode='lines',
                         marker=dict(size=12,
                                     line=dict(width=1),
                                     color="blue"))
    # Highlight the anomaly points
    anomalies_map = go.Scatter(name="Anomaly",
                               showlegend=True,
                               x=dates,
                               y=anomaly_points,
                               mode='markers',
                               xaxis='x1',
                               yaxis='y1',
                               marker=dict(color="red",
                                           size=11,
                                           line=dict(
                                               color="red",
                                               width=2)))
    axis = dict(
        showline=True,
        zeroline=False,
        showgrid=True,
        mirror=True,
        ticklen=4,
        gridcolor='#ffffff',
        tickfont=dict(size=10))
    layout = dict(
        width=1000,
        height=865,
        autosize=False,
        title=metric_name,
        margin=dict(t=75),
        showlegend=True,
        xaxis1=dict(axis, **dict(domain=[0, 1], anchor='y1', showticklabels=True)),
        yaxis1=dict(axis, **dict(domain=[2 * 0.21 + 0.20, 1], anchor='x1', hoverformat='.2f')))
    fig = go.Figure(data=[table, anomalies_map, Actuals], layout=layout)
    iplot(fig)
    pyplot.show()

A helper function to compute the percentage change and classify anomalies based on severity.

The predict function classifies points as anomalies based on whether the result of the decision function crosses a threshold.
If the business needs to find a next level of anomalies that might still have an impact, the decision-function scores can be used to identify those points.
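Concretely, in scikit-learn the two are linked by a zero threshold: predict returns -1 exactly where decision_function is negative, and contamination controls the offset that sets that threshold. A quick check (a small sketch on the dataframe used above):

# Sketch: points flagged by predict() are exactly those with a negative decision_function
scores_check = clf.decision_function(metrics_df[to_model_columns])
labels_check = clf.predict(metrics_df[to_model_columns])
print(np.all((scores_check < 0) == (labels_check == -1)))  # expected: True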

The points in the lowest 12% of scores are the identified (high-severity) anomalies; based on the decision function we also pick out the points between the 12th and 24th percentiles and classify them as low-severity anomalies.

def classify_anomalies(df, metric_name):
    df['metric_name'] = metric_name
    df = df.sort_values(by='load_date', ascending=False)
    # Shift actuals by one timestamp to find the percentage change between the current and previous data point
    df['shift'] = df['actuals'].shift(-1)
    df['percentage_change'] = ((df['actuals'] - df['shift']) / df['actuals']) * 100
    # Categorise anomalies as 0 - no anomaly, 1 - low anomaly, 2 - high anomaly
    df.loc[df['anomaly'] == 1, 'anomaly'] = 0
    df.loc[df['anomaly'] == -1, 'anomaly'] = 2
    df['anomaly_class'] = df['anomaly']
    max_anomaly_score = df['score'].loc[df['anomaly_class'] == 2].max()
    medium_percentile = df['score'].quantile(0.24)
    df.loc[(df['score'] > max_anomaly_score) & (df['score'] <= medium_percentile), 'anomaly_class'] = 1
    return df

Identify anomalies for individual metrics and plot the results.

X axis — date
Y axis — Actual values and anomaly points.

Actual values of metrics are indicated in the blue line and anomaly points are highlighted as red points.

In the table, the background red indicates high anomalies and yellow indicates low anomalies.

import warnings
warnings.filterwarnings('ignore')
for i in range(1, len(metrics_df.columns)-1):
    clf.fit(metrics_df.iloc[:, i:i+1])
    pred = clf.predict(metrics_df.iloc[:, i:i+1])
    test_df = pd.DataFrame()
    test_df['load_date'] = metrics_df['load_date']
    # Use the decision function to get the score and classify anomalies
    test_df['score'] = clf.decision_function(metrics_df.iloc[:, i:i+1])
    test_df['actuals'] = metrics_df.iloc[:, i]
    test_df['anomaly'] = pred
    # Get the indexes of outliers in order to compare the metrics with use-case anomalies if required
    outliers = test_df.loc[test_df['anomaly'] == -1]
    outlier_index = list(outliers.index)
    test_df = classify_anomalies(test_df, metrics_df.columns[i])
    plot_anomaly(test_df, metrics_df.columns[i])

Yes, from the plots we are able to capture the sudden spikes and dips in the metrics and surface them.

Also, the conditionally formatted table gives us insight into cases like missing data (a value of zero) being captured as a high anomaly, which could be the result of a broken data-processing pipeline that needs fixing, in addition to highlighting high- and low-severity anomalies.

How to use this?

If the current timestamp is anomalous at the use-case level, drill down to the metrics, figure out which metrics have high anomalies at that timestamp, and perform RCA on them.
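A rough sketch of that drill-down, assuming the per-metric results from the loop above were collected into a single dataframe (for example all_results = pd.concat(...) over each test_df, which the code above does not do explicitly):

# Hypothetical sketch: list the metrics with high-severity anomalies on the latest date
latest_date = all_results['load_date'].max()
high_anomaly_metrics = all_results.loc[
    (all_results['load_date'] == latest_date) & (all_results['anomaly_class'] == 2),
    'metric_name'].tolist()
print("Metrics to investigate for RCA:", high_anomaly_metrics)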

Also, feedback from business users can be fed back into the data, which would help turn this into a supervised/semi-supervised learning problem whose results could then be compared.

An enhancement here would be to combine anomalous behavior that occurs continuously. For example, big sale days that cause a spike in metrics for several days in a row could be shown as a single event.
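One way to sketch that grouping (a hypothetical helper, not part of the code above): number the runs of consecutive anomalous days and collapse each run into a single event:

# Hypothetical sketch: collapse consecutive anomalous days into single events
def group_anomaly_runs(df):
    df = df.sort_values('load_date').copy()
    is_anom = df['anomaly_class'] > 0
    # a new run starts whenever the anomaly flag switches from off to on
    run_id = (is_anom & ~is_anom.shift(fill_value=False)).cumsum()
    return (df[is_anom]
            .groupby(run_id[is_anom])
            .agg({'load_date': ['min', 'max'], 'anomaly_class': 'max'}))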
