Anomaly Detection in Python — Part 2: Multivariate Unsupervised Methods and Code

A guide to performing anomaly detection on multivariate data for business analysis or a machine learning pipeline, along with relevant Python code.

Nitish Kumar Thakur
Towards Data Science

--

In my previous article (https://medium.com/analytics-vidhya/anomaly-detection-in-python-part-1-basics-code-and-standard-algorithms-37d022cdbcff), we discussed the basics of Anomaly detection: the types of problems and the types of methods used. We covered EDA, Univariate, and Multivariate methods of performing Anomaly Detection, with one example of each. We also discussed why Multivariate Outlier detection is a difficult problem that requires specialized techniques, and walked through the Mahalanobis Distance method with FastMCD for detecting Multivariate Outliers.

In this article, we will discuss 2 other widely used methods for performing Multivariate Unsupervised Anomaly Detection:

  1. Isolation Forests
  2. OC-SVM (One-Class SVM)

Some General Thoughts on Anomaly Detection

Anomaly detection is a tool for identifying unusual or interesting occurrences in data. However, it is important to analyze the detected anomalies from a domain/business perspective before removing them. Each method has its own definition of an anomaly, and different methods will often disagree on which points are anomalous. It is our responsibility to validate the results from a domain/business perspective.

Sometimes, detected Anomalies represent “under-sampled” regimes in data. In this case, instead of removing them, we should aim to collect more data in that regime.

Brief Recap of Multivariate Anomalies

Multivariate Anomalies occur when the values of several features, taken together, seem anomalous even though the individual features do not take unusual values. For example, imagine we have 2 features:
1. odo: the car's speed reading in mph. It typically lies between 0–50.
2. rpm: the rpm (rotations per minute) of the car's wheels. It typically lies between 0–650.

Now, imagine odo reads 0 mph. We know this is possible — it simply means the car is not moving. Let's say, on another occasion, the rpm reads 600. We know that the car is moving. Taken separately, neither reading is anomalous — each represents a perfectly normal mode of operation of the car. However, let us imagine that odo reads 0 mph and rpm reads 600 at the same time. This is not possible — the readings are in conflict: rpm suggests that the car is moving, while odo suggests that the car is stationary. This is an example of a multivariate outlier.

Detecting a Multivariate Outlier involves examining the values of all the features simultaneously and verifying whether the combination of values they take is unusual. As we can imagine, this becomes intractable to do manually when we have a large number of features (say, hundreds). Instead, we can use specialized algorithms that identify such points for us. Let us take a look at them.

Data used for Anomaly Detection

We want the Multivariate Methods to detect the 2 major outliers — Image by Author

Isolation Forests

Isolation forests are known to work well for high-dimensional data. An Isolation forest is an ensemble of “Isolation Trees”. It is discussed in detail in the following paper: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest.
An Isolation tree is a binary tree that stores data by dividing it into boxes (called nodes). To understand why Isolation Forests are anomaly detectors, it is important to understand how Isolation Trees are built. Here are the steps to build an isolation tree:

  1. Select a feature at random from the data. Let us call this random feature f.
  2. Select a random value from the feature f. We will use this random value as a threshold. Let us call it t.
  3. Data points where f < t are stored in Node 1 and the data points where f ≥ t go in Node 2.
  4. Repeat Steps 1–3 for Node 1 and Node 2.
  5. Terminate either when the tree is fully grown or a termination criterion is met.
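
To make these steps concrete, here is a minimal sketch of growing a single isolation tree on a NumPy array and measuring how many splits it takes to isolate a point. The function names (grow_tree, path_length) and the depth-limit termination are illustrative choices of mine, not sklearn's API.

import numpy as np

def grow_tree(X, depth = 0, max_depth = 10):
    # Terminate when the node can no longer be split or the depth limit is reached
    if len(X) <= 1 or depth >= max_depth:
        return {'size': len(X)}
    f = np.random.randint(X.shape[1])                      # Step 1: pick a random feature
    t = np.random.uniform(X[:, f].min(), X[:, f].max())    # Step 2: pick a random threshold
    left, right = X[X[:, f] < t], X[X[:, f] >= t]          # Step 3: split into two child nodes
    return {'feature': f, 'threshold': t,                  # Steps 4-5: recurse on each child
            'left': grow_tree(left, depth + 1, max_depth),
            'right': grow_tree(right, depth + 1, max_depth)}

def path_length(tree, x, depth = 0):
    # Number of splits taken before the point x lands in a leaf
    if 'size' in tree:
        return depth
    child = 'left' if x[tree['feature']] < tree['threshold'] else 'right'
    return path_length(tree[child], x, depth + 1)

For a point that lies far from the bulk of the data, path_length will typically be small, which is exactly the signal the forest aggregates.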

For simplicity, let us start with how an Isolation tree works on univariate data; we will explore multivariate examples later. The following figure shows the mechanism for 1-dimensional data:

An Isolation tree built for data having NO Outliers. A Random Threshold is chosen to split the data into child nodes. This process is then repeated for each child node. — Image by Author

It is important to remember that the feature to split on and the threshold are chosen at random as shown in the above figure. Since the above example was univariate, we only choose the threshold at random.

Now, assume the univariate data above has an anomaly. In that case, the anomalous point will lie far away from the other data points. Isolation forests are able to isolate anomalies very early in the splitting process because the random threshold used for splitting has a high probability of lying in the empty space between the outlier and the rest of the data, provided that empty space is large enough. After all, the split point (the threshold) is chosen at random, so the larger the empty space, the more likely it is for a randomly chosen split point to lie in that empty region. As a result, anomalies have shorter path lengths.

Let us take a look at how an Isolation tree would look in the presence of an Anomaly.

The Isolation tree “Isolates” out the anomaly in the first split. Due to the large space between the anomaly and the rest of the data, it is very likely that a random split will lie in this region. — Image by Author

As we can see, due to the large space between the anomaly and the rest of the data, it is very likely that a random split will lie in this empty region.

Let us now see how this would look if we had multivariate data. As we will see, Isolation trees work very similarly to what we saw above — they isolate the anomalies before isolating the other points.

The Anomaly gets isolated at split 2. — Image by Author

Please note that the trees can grow either:

  1. Until there is exactly one data point in each leaf node, or
  2. Until a termination criterion regarding the minimum number of data points in a leaf node is reached.

I have only shown the first few splits here for illustration. As we can see, the Isolation tree divides the data into “boxes”. It has the property that it isolates the region containing anomalies earlier than the boxes containing normal data points.

We can extend the idea of an Isolation tree to an isolation forest which is an ensemble of multiple Isolation trees. Here is briefly how Isolation forests work:

  1. Construct an Isolation Tree either from the entire feature set or a randomly chosen subset of the feature set.
  2. Construct n such Isolation trees.
  3. Calculate an Anomaly score for each data point. The Anomaly score is a non-linear function of the average path length over all Isolation trees; the path length is the number of splits an Isolation tree makes to isolate a point. The shorter the average path length, the larger the chance that the point is an anomaly (as we saw earlier in the diagram; a sketch of this score follows below).
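
For reference, the anomaly score defined in the Liu et al. paper normalizes the average path length by c(n), the expected path length of an unsuccessful search in a binary search tree built on n points. Here is a small sketch of that formula; the helper names are mine.

import numpy as np

EULER_GAMMA = 0.5772156649    # Euler-Mascheroni constant, used in the harmonic-number approximation

def c(n):
    # Expected path length of an unsuccessful BST search over n points (the paper's normalizing constant)
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def isolation_score(avg_path_length, n):
    # Scores close to 1 indicate likely anomalies; scores well below 0.5 indicate normal points
    return 2.0 ** (-avg_path_length / c(n))

Note that sklearn's decision_function reports a shifted, negated version of this score, which is why more negative values there indicate stronger anomalies.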

Isolation forests work well even for data having hundreds of dimensions.

The Isolation Forest in sklearn has 4 important inputs:

n_estimators: Number of Isolation trees trained.
max_samples: Number of data points used to train each tree.
contamination: Fraction of anomalous data points. For example, if we suspect 5% of the data to be anomalous, we set contamination to 0.05.
max_features: Number of features used to train each tree (this is in contrast to Random Forests, where a random subset of features is chosen for each split).

It has 2 Important methods:

decision_function(X): Returns a score — such that examples having more negative scores are more anomalous.
predict(X): Returns -1 for Anomalous points and +1 for normal points. The number of points output as anomalous depends on the contamination value set while fitting the model.

Let us train an Isolation Forest on the above data (we set contamination to 0.01):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import ensemble

# Create Artificial Data with Multivariate Outliers
d1 = np.random.multivariate_normal(mean = np.array([-.5, 0]),
                                   cov = np.array([[1, 0], [0, 1]]), size = 100)
d2 = np.random.multivariate_normal(mean = np.array([15, 10]),
                                   cov = np.array([[1, 0.3], [.3, 1]]), size = 100)
outliers = np.array([[0, 10], [0, 9.5]])
d = pd.DataFrame(np.concatenate([d1, d2, outliers], axis = 0), columns = ['Var 1', 'Var 2'])

################### Train Isolation Forest #################
# Note: a max_samples value larger than the dataset is capped at the number of samples by sklearn
model = ensemble.IsolationForest(n_estimators=50, max_samples=500, contamination=.01, max_features=2,
                                 bootstrap=False, n_jobs=1, random_state=1, verbose=0,
                                 warm_start=False).fit(d)

# Get Anomaly Scores and Predictions
anomaly_score = model.decision_function(d)   # more negative = more anomalous
predictions = model.predict(d)               # -1 = anomaly, +1 = normal

######### Visualize Anomaly scores and Anomaly Status ########
plt.figure(figsize = (10, 6), dpi = 150)
s = plt.scatter(d['Var 1'], d['Var 2'], c = anomaly_score, cmap = 'coolwarm')
plt.colorbar(s, label = 'More Negative = More Anomalous')
plt.xlabel('Var 1', fontsize = 16)
plt.ylabel('Var 2', fontsize = 16)
plt.grid()

# To Plot Predictions
plt.figure(figsize = (10, 6), dpi = 150)
s = plt.scatter(d['Var 1'], d['Var 2'], c = predictions, cmap = 'coolwarm')
plt.colorbar(s, label = '-1 = Anomaly, +1 = Normal')
plt.xlabel('Var 1', fontsize = 16)
plt.ylabel('Var 2', fontsize = 16)
plt.grid()
plt.title('Contamination = 0.01', weight = 'bold')
LEFT: Output from the Decision Function — more negative values imply stronger anomalies. RIGHT: The top “contamination” fraction of points flagged as anomalies. — Image by Author

As we can see, the 2 injected points are detected as strong outliers.

We can thus decide whether a point is an anomaly using 2 methods:

  1. Use the predict function: if the model predicts -1, label the point as an anomaly. For this method, it is important to set the contamination carefully.
  2. Analyze the distribution of the decision function output and, based on visual inspection, set a threshold below which points are labelled anomalous. We can also apply a univariate anomaly detection algorithm to the decision function output; this is a very common pattern in which we convert a multivariate problem into a univariate one — we calculate anomaly scores with a multivariate method and then run a univariate algorithm on those scores. This method does not depend on our choice of contamination.

However, let us first see what happens when we set different values of contamination and use Method 1.
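
The figure below compares predictions across several contamination values. Here is a sketch of how such a comparison can be generated by refitting the forest on the DataFrame d created above; the specific contamination values chosen here are illustrative.

import matplotlib.pyplot as plt
from sklearn import ensemble

contaminations = [0.01, 0.05, 0.1, 0.2]    # illustrative contamination values
fig, axes = plt.subplots(1, len(contaminations), figsize = (20, 5))
for c_val, ax in zip(contaminations, axes):
    preds = ensemble.IsolationForest(n_estimators = 50, contamination = c_val,
                                     random_state = 1).fit(d).predict(d)
    ax.scatter(d['Var 1'], d['Var 2'], c = preds, cmap = 'coolwarm')
    ax.set_title('Contamination = {}'.format(c_val), weight = 'bold')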

Effect of Contamination on Isolation Forest Output when choosing Method 1 for prediction. — Image by Author

With increasing Contamination values, the model labels more data as anomalous.

What if we do not know the right contamination?

In this case, we analyze the decision_function() output. There are 2 ways of doing this:

  1. Apply a Univariate Anomaly Detection algorithm to the Isolation Forest decision function output (like Tukey's method, which we discussed in the previous article; a short code sketch follows after this list). This is a standard approach — we calculate an ‘Anomaly Score’ (here, the decision function output) using a multivariate algorithm; then, to select which of these scores correspond to outliers, we apply a Univariate Anomaly Detection algorithm to the scores.
  2. Plot a Histogram and manually select a threshold.
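
Tukey's method itself was covered in the previous article; for completeness, here is a minimal sketch of applying its fences to the decision function output. The helper name tukey_outliers is mine, and k = 1.5 is the conventional default.

import numpy as np

def tukey_outliers(scores, k = 1.5):
    # Flag values falling outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    return (scores < q1 - k * iqr) | (scores > q3 + k * iqr)

# anomaly_score is the Isolation Forest decision_function output computed earlier
print(d[tukey_outliers(anomaly_score)])    # rows whose scores fall outside the fences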

Let us see the results of applying Tukey’s method on the Decision Function output given by our Isolation Forest:

Tukey’s Method applied to decision function output of Isolation Forest selects 6 outliers. We can see 2 extreme outliers — which are indeed the anomalies we were trying to detect. — Image by Author

We see 2 clear outliers, which are the 2 extreme points on the left. These are indeed the 2 major outliers we wanted to detect. However, 4 additional points are also labelled as outliers. To decide which of these flagged points are genuine anomalies, domain/business analysis is needed.

Let us discuss some advantages of Isolation forests:

  1. Scalability — Isolation Forests can use parallel processing during training and prediction, as all the isolation trees are built independently of each other.
  2. Interpretability — Individual trees in an Isolation forest can be visualized to give the exact rules as to what made the data point an outlier. These rules can have large business/domain importance. However, this can become hard for large data.
  3. Flexibility — They can capture very complex outliers and do not require the data to belong to a specific distribution. If we remember, the Mahalanobis Distance method with FastMCD discussed in the previous article assumed the clean data to belong to a multivariate normal distribution. Isolation forests make no such assumptions.

From experience, I have noticed that the decision function values of severe outliers and minor outliers can often be close. We saw that here: we had 2 clear outliers, yet their decision function output was close to that of some other points. This is in contrast to the Mahalanobis distance method, where for the same problem the distinction was very clear. Overall, the outliers detected by the Isolation Forest were reasonable, and applying a univariate method to the decision function output also yielded reasonable results.

One-Class Support Vector Machine (OC-SVM)

The OC-SVM is a multivariate method that belongs to the family of One-Class classification methods. It is discussed in detail in the following paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-99-87.pdf

We choose a fraction of the data, say ν (pronounced “nu”), that we treat as an upper bound on the fraction of anomalies present in the data. The OC-SVM then tries to find a boundary that encloses regions of high data density while excluding at most a fraction ν of the data points. Boundaries that are linear in the original problem variable space are too simple for most problems, and since the SVM is a linear classifier by nature, we resort to kernel methods to build a flexible model with non-linear boundaries.

Instead of finding a decision boundary to separate 2 classes, the OC-SVM tries to find a hyperplane separating the Data from the Origin. The idea is to map the data into the kernel feature space and to separate it from the origin with maximum margin using a linear classifier in the kernel feature space. This corresponds to using a non-linear boundary in our original problem space.

Let us briefly discuss the use of kernels in the OC-SVM. The SVM is a linear model, which restricts it to very simple decision rules. One way to increase the capacity of an SVM is to create polynomial features from the data and then use the linear SVM on the transformed feature space. The SVM still finds a linear boundary in the polynomial feature space, but when mapped back to our original problem variable space, that boundary looks non-linear. The drawback is that we had to calculate the polynomial features explicitly, which can require a lot of memory when we have many features to begin with. Kernels allow us to fit simple models (like the linear SVM) in very high-dimensional feature spaces without explicitly calculating the high-dimensional features. One of the most widely used kernels is the RBF kernel. The important thing to remember is that the SVM always fits a linear model in the kernel feature space, even though the decision boundary looks non-linear in the original problem variable space.
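
As a rough illustration of the difference, here is a sketch contrasting the two approaches on the same DataFrame d used earlier; the degree and nu values are arbitrary choices for illustration.

from sklearn import svm
from sklearn.preprocessing import PolynomialFeatures

# Option A: explicitly build polynomial features, then fit a linear one-class SVM on the expanded matrix
poly_features = PolynomialFeatures(degree = 3).fit_transform(d)    # memory grows quickly with degree and feature count
linear_model = svm.OneClassSVM(kernel = 'linear', nu = 0.01).fit(poly_features)

# Option B: use the RBF kernel; the high-dimensional mapping is never computed explicitly
rbf_model = svm.OneClassSVM(kernel = 'rbf', gamma = 'scale', nu = 0.01).fit(d)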

Here are some general points about the OC-SVM:

  1. In general, it is advised to use a non-linear kernel when using OC-SVM. The RBF Kernel is widely used.
  2. Hyperparameters to tune for OC-SVM with RBF Kernel are: Gamma, ν
  3. Prediction can be done using predict() and decision_function() methods. We will discuss more about them later.

Gamma is a parameter specific to the RBF kernel, and it controls how strongly each point influences the decision boundary. Large values of Gamma mean that only nearby points shape the decision boundary, while smaller values of Gamma allow both neighboring and distant points to have an effect on it.

Let us discuss the effect of using different values of Gamma. Larger values of Gamma produce models with high variance, which can come at the cost of generalization. Let us vary Gamma and see the impact on the model.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm

def plot_anomaly2(data, predicted, ax):
    data2 = data.copy()
    data2['Predicted'] = predicted

    normal = data2.loc[data2['Predicted'] == 1, :]
    anomalies = data2.loc[data2['Predicted'] == -1, :]

    # Make Scatterplot
    column1 = data.columns[0]
    column2 = data.columns[1]

    anomalies.plot.scatter(column1, column2, color = 'tomato', fontsize = 14, sharex = False, ax = ax)
    normal.plot.scatter(column1, column2, color = 'grey', fontsize = 14, sharex = False, ax = ax)

    plt.xlabel(column1, fontsize = 14, weight = 'bold')
    plt.ylabel(column2, fontsize = 14, weight = 'bold')
    return ax

# Create Fake data to classify
x_fake = pd.DataFrame(np.random.uniform(-5, 19, (35000, 2)), columns = ['Var 1', 'Var 2'])

# Visualize effect of changing Gamma
# The OC-SVM is fit on the same DataFrame d created in the Isolation Forest section
gammas = [.00005, .005, .01, .025, .05, .1, .3, .6, .9, 2, 5, 10]
fig, axes = plt.subplots(2, 6, figsize = (25, 6), tight_layout = True)
for i, ax in zip(range(len(gammas)), axes.flatten()):
    gamma = gammas[i]
    model = svm.OneClassSVM(kernel='rbf', gamma=gamma, coef0=0.0, tol=0.001, nu=0.01,
                            shrinking=True, cache_size=200, verbose=False, max_iter=-1).fit(d)
    model_predictions = model.predict(x_fake)   # classify a dense cloud of fake points to visualize the decision regions
    ax = plot_anomaly2(x_fake, model_predictions, ax)
    ax.scatter(d.iloc[:, 0], d.iloc[:, 1], color = 'k', s = 10)
    ax.set_title('Gamma: {}'.format(np.around(gamma, 6)), weight = 'bold', fontsize = 14)

The blue region in each of the following images refers to the region that the OC-SVM predicts as “Normal”. The red points are detected as anomalies.

Anomaly Status of data points as a function of Gamma — Image by Author

We observe the following:

  1. When Gamma is extremely low or high, the OC-SVM misses at least one of the major anomalies.
  2. For mid-range Gamma values, roughly .005 to .1, the OC-SVM identifies both major anomalies.

One way to select Gamma is to let sklearn choose it for us, by setting gamma = 'scale'. Here is what happens when we set gamma = 'scale':
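
In code this is just a string argument. A minimal sketch, with nu set to 0.05 as in the figure below and the remaining parameters left at their defaults:

from sklearn import svm

model = svm.OneClassSVM(kernel = 'rbf', gamma = 'scale', nu = 0.05).fit(d)
predictions = model.predict(d)           # -1 for anomalies, +1 for normal points
scores = model.decision_function(d)      # signed distance from the decision boundary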

Gamma = ‘scale’ detects the 2 outliers. nu is set to 0.05. — Image by Author

As discussed earlier, in the OC-SVM the data is separated from the origin in the kernel space using a linear decision boundary, and a fraction (up to ν) of the data is allowed to fall on the wrong side of that boundary. Essentially, we want all the inliers on one side of the decision boundary and all the outliers on the other side. The decision_function method of the OC-SVM outputs the signed distance of a point from the decision boundary: it takes negative values for outliers and positive values for normal points (inliers).

Let us apply tukey’s method on the decision_function output as we did earlier.

2 points identified as anomalies for k = 1.5 in Tukey's method. Here, Tukey's method was applied to the OC-SVM decision function output. — Image by Author

Here, luckily tukey’s method identified the 2 major anomalies that we had in our data. However, in the general case, it may identify additional or lesser anomalies. Anomalies identified by Tukey’s method depend on our value of ‘k’(discussed in the previous article) which can be tuned. it is sometimes useful to treat k as a hyperparameter in the ML pipeline — which can be finalized through domain analysis or Optimization.

Conclusion

So far we have discussed unsupervised methods of performing Anomaly detection, covering Isolation Forests and the OC-SVM for Multivariate Anomaly detection. One of the advantages of these methods is that they do not require the data to follow a particular distribution.

The OC-SVM can be used for both Unsupervised and Semi-Supervised Anomaly detection. In the next articles, we will discuss Semi-Supervised and Supervised methods of performing Anomaly detection, including PCA, Auto-Encoders, OC-SVM, and imbalanced classification methods.

Please feel free to let me know if you have any feedback, and check out my previous introductory article on Anomaly detection, where we discuss the different types of Anomaly detection problems and methods (https://medium.com/analytics-vidhya/anomaly-detection-in-python-part-1-basics-code-and-standard-algorithms-37d022cdbcff).
