Ensemble Learning for Anomaly Detection
A dive into the isolation forest model to detect anomalies in time-series data
Anomaly detection is a must-have capability for any organization. By detecting anomalies and outliers, we not only identify data that seems suspicious (or possibly wrong), but can also establish what ‘normal’ data looks like. Anomaly detection can prove to be a vital capability for a strong data governance system by identifying data errors. And for analysis, outliers can be a point of interest in certain cases such as fraud detection and predictive maintenance.
However, as data grows, anomaly detection becomes more and more difficult. High-dimensional data is noisy, which makes it harder to use for analysis and insights. Large datasets are also more likely to contain errors and special cases. Thankfully, ensemble learning brings speed and efficiency to help us wrangle high-dimensional data and detect anomalies.
What is ensemble learning?
Ensemble learning is a machine learning technique that combines the predictions of multiple individual models to achieve better predictive performance than any single model. Each model is considered a “weak learner” and is trained on a small subset of the data to make a prediction. The predictions then go to a vote: each weak learner is surveyed, and the majority vote becomes the final prediction.
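To make the voting step concrete, here is a minimal sketch of a majority vote. The function name majority_vote and the five example votes are hypothetical and only serve as an illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of weak learners."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label

# Hypothetical votes from five weak learners on a single observation
votes = ["normal", "anomaly", "normal", "normal", "anomaly"]
final_prediction = majority_vote(votes)
print(final_prediction)
```

Three of the five learners vote “normal”, so that label becomes the ensemble's prediction.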
Ensemble models (trained on high-quality data) are robust, accurate, and efficient, and they are good at avoiding overfitting. They have many use cases, such as classification, optimization, and, in our case, anomaly detection.
The Isolation Forest Model
The isolation forest model is an ensemble of trees that isolates observations that are few and far between. It is very similar to the popular ‘Random Forest’ model, but instead of a forest of decision trees, the isolation forest produces a forest of ‘isolation trees’.
So how does it work? Let’s look at one isolation tree.
Consider the data above. We can see that one data point is farther away from the rest of the data (our suspected anomaly). Each isolation tree randomly chooses a ‘split value’ to begin to isolate observations. In this case, the suspected outlier is immediately isolated. This would be the case for most of the isolation trees due to its distance from the rest of the data.
Next, it chooses another split. This time, the suspected ‘normal’ data begins to get cut up. This process repeats until each observation is isolated. Ultimately, the model ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
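The splitting process can be sketched in a few lines for one-dimensional data. This is a simplified illustration rather than scikit-learn's implementation: the function isolation_path_length and the seven data values are made up, and with a single feature there is no random feature selection.

```python
import random

def isolation_path_length(point, data, rng, depth=0):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        # All remaining values are identical, so no further split is possible
        return depth
    # Pick a random split value between the minimum and the maximum
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that still contains the point
    if point < split:
        side = [x for x in data if x < split]
    else:
        side = [x for x in data if x >= split]
    return isolation_path_length(point, side, rng, depth + 1)

# Hypothetical data: six values near 10 and one suspected outlier at 25
data = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 25.0]
rng = random.Random(0)

def mean_path_length(point, n_trees=200):
    """Average the path length over many independently randomized trees."""
    total = sum(isolation_path_length(point, data, rng) for _ in range(n_trees))
    return total / n_trees

outlier_depth = mean_path_length(25.0)
inlier_depth = mean_path_length(10.0)
print(outlier_depth, inlier_depth)
```

Because 25.0 sits far from the other values, almost any first split separates it, so its average path is much shorter than the path for 10.0.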
Now that each observation is isolated, we need to ask: How many splits did it take for each observation to be isolated? In other words, how long is the partition path for each data point? Let’s say the results are the following:
Now that we know how many splits it took to isolate each observation, we calculate the mean number of splits. In our example, on average, it takes 2.6 splits to isolate an observation. Observations that have a noticeably shorter partition path, that is, observations that took noticeably fewer splits to be isolated, are highly likely to be anomalies or outliers. The degree to which they must differ from the mean number of splits is a parameter in the model. Finally, the isolation tree determines that observation G is an anomaly.
The last step of the isolation forest model is for each isolation tree to ‘vote’ on which observations are anomalies. If a majority of them think that observation G is an anomaly, then the model determines that it is.
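The ‘vote’ is a useful mental model. In the original isolation forest formulation, the trees' path lengths are averaged and turned into a score, 2 raised to the power of -E[h(x)] / c(n), where E[h(x)] is the mean path length of an observation and c(n) is the expected path length for a sample of n points. A rough sketch follows, with hypothetical function names and example numbers:

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): expected path length for a sample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    # The harmonic number H(n - 1) is approximated by ln(n - 1) + Euler's constant
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Score near 1 suggests an anomaly; a score of 0.5 means an average path length."""
    if n < 2:
        raise ValueError("need at least two points to compute a score")
    return 2.0 ** (-mean_path_length / average_path_length(n))

# Hypothetical mean path lengths for two observations, with 256 points per tree
short_path_score = anomaly_score(4.0, 256)
long_path_score = anomaly_score(20.0, 256)
print(short_path_score, long_path_score)
```

Note that scikit-learn reports the opposite sign: its score_samples method returns the negative of this quantity, so lower values there mean more abnormal.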
Detecting Anomalies in Time Series Data
Let’s see a simple example using the isolation forest model to detect anomalies in time-series data. Below, we have imported a sales dataset that contains the day of an order, information about the product, geographical information about the customer, and the amount of the sale. To keep this example simple, let’s just look at one feature (sales) over time.
See data here: https://www.kaggle.com/datasets/rohitsahoo/sales-forecasting (GPL 2.0)
#packages for data manipulation
import pandas as pd
from datetime import datetime
#packages for modeling
from sklearn.ensemble import IsolationForest
#packages for data visualization
import matplotlib.pyplot as plt
#import sales data
sales = pd.read_excel("Data/Sales Data.xlsx")
#subset to date and sales
revenue = sales[['Order Date', 'Sales']].copy()  #copy so later column assignments do not trigger SettingWithCopyWarning
revenue.head()
As you can see above, we have the sale amount for every order on a particular day. Since we have a sufficient amount of data (four years’ worth), let’s try to detect months where total sales are either noticeably higher or lower than expected.
First, we need to conduct some preprocessing and sum the sales for every month. Then we can visualize monthly sales.
#format the order date to datetime month and year
revenue['Order Date'] = pd.to_datetime(revenue['Order Date']).dt.to_period('M')
#sum sales by month and year
revenue = revenue.groupby(revenue['Order Date']).sum()
#format the date index as month-year strings
revenue.index = revenue.index.strftime('%m-%Y')
#set the fig size
plt.figure(figsize=(8, 5))
#create the line chart
plt.plot(revenue.index, revenue['Sales'])  #after the groupby, the dates live in the index
#add labels and a title
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.title('Monthly Sales')
#rotate x-axis labels by 90 degrees for better visibility
plt.xticks(rotation = 90)
#display the chart
plt.show()
Using the line chart above, we can see that while sales fluctuate from month to month, total sales trend upward over time. Ideally, our model will identify months where total sales fluctuate more than expected and are highly influential to our overall trend.
Now we need to initialize and fit our model. The model below uses the default parameters, except for contamination, which is set to 0.1 (the scikit-learn default is “auto”). I have highlighted these four parameters as they are the most important to the model’s performance.
- n_estimators: The number of base estimators in the ensemble.
- max_samples: The number of samples to draw from X to train each base estimator (if “auto”, then max_samples = min(256, n_samples)).
- contamination: The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the scores of the samples.
- max_features: The number of features to draw from X to train each base estimator.
#set isolation forest model and fit to the sales
model = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.1, max_features=1.0)
model.fit(revenue[['Sales']])
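The contamination parameter deserves a closer look, because it sets the score threshold as a percentile of the training scores and therefore largely decides how many observations get flagged. The sketch below uses synthetic monthly totals (not the article's sales data) and a hypothetical helper, count_flagged, to compare two settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for 48 monthly sales totals (made-up values)
rng = np.random.default_rng(42)
monthly_sales = rng.normal(50_000, 8_000, size=48).reshape(-1, 1)

def count_flagged(contamination):
    """Fit an isolation forest and count the observations labeled -1."""
    model = IsolationForest(n_estimators=100, contamination=contamination, random_state=0)
    labels = model.fit_predict(monthly_sales)
    return int((labels == -1).sum())

flagged_low = count_flagged(0.05)
flagged_high = count_flagged(0.20)
print(flagged_low, flagged_high)
```

With 48 months and contamination = 0.1, we should expect roughly five months to be flagged whether or not they are truly unusual, which is worth keeping in mind when interpreting the results.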
Next, let’s use the model to display the anomalies and their anomaly scores. The anomaly score is the mean measure of normality of an observation among the base estimators. The lower the score, the more abnormal the observation. Negative scores represent outliers, and positive scores represent inliers.
#add anomaly scores and prediction
revenue['scores'] = model.decision_function(revenue[['Sales']])
revenue['anomaly'] = model.predict(revenue[['Sales']])
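To see how the two outputs relate, the following sketch checks on synthetic data (46 made-up ‘normal’ values plus two extreme ones) that decision_function is score_samples shifted by the fitted offset_, and that predict returns -1 exactly where the shifted score is negative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 46 values near 100 plus two extreme values (made up)
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(100, 5, size=46), [10.0, 220.0]]).reshape(-1, 1)

model = IsolationForest(contamination=0.1, random_state=0).fit(values)

raw = model.score_samples(values)          # lower means more abnormal
shifted = model.decision_function(values)  # raw score minus the fitted threshold
labels = model.predict(values)             # -1 for outliers, 1 for inliers

# decision_function is score_samples shifted by offset_
print(np.allclose(shifted, raw - model.offset_))
# predict flags exactly the observations with a negative shifted score
print(np.array_equal(labels, np.where(shifted < 0, -1, 1)))
```

In other words, the ‘anomaly’ column is just the sign of the ‘scores’ column, and the threshold between the two is set by contamination.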
Lastly, let’s bring up the same line chart from before, but highlight the anomalies with plt.scatter.
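One way such a chart could be drawn is sketched below. The helper plot_anomalies and the small demo frame are hypothetical; the sketch assumes a frame with a ‘Sales’ column, an ‘anomaly’ column containing -1 for flagged months, and month labels in the index.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_anomalies(df):
    """Line chart of monthly sales with flagged months overlaid as red points."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(df.index, df['Sales'], label='Monthly sales')
    # Keep only the months the model labeled as anomalies (-1)
    flagged = df[df['anomaly'] == -1]
    ax.scatter(flagged.index, flagged['Sales'], color='red', zorder=3, label='Anomaly')
    ax.set_xlabel('Month')
    ax.set_ylabel('Total Sales')
    ax.set_title('Monthly Sales with Detected Anomalies')
    ax.tick_params(axis='x', labelrotation=90)
    ax.legend()
    return fig, ax

# Hypothetical stand-in for the revenue frame (values are made up)
demo = pd.DataFrame(
    {'Sales': [120.0, 20.0, 150.0, 140.0, 400.0], 'anomaly': [1, -1, 1, 1, -1]},
    index=['01-2015', '02-2015', '03-2015', '04-2015', '05-2015'],
)
fig, ax = plot_anomalies(demo)
```

Calling plot_anomalies(revenue) followed by plt.show() would produce the same kind of chart for the monthly sales data.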
The model appears to do well. Since the data fluctuates so much from month to month, a worry could be that inliers would get marked as anomalies, but this does not appear to be the case, likely because each month’s score is averaged across many randomly built trees. The anomalies appear to be the larger fluctuations where sales deviated from the trend by a ‘significant’ amount.
However, knowing the data is important here as some of the anomalies should come with a caveat. Let’s look at the first (February 2015) and last (November 2018) anomaly detected. At first, we see that they both are large fluctuations from the mean.
However, the first anomaly (February 2015) is only our second month of recording sales and the business may have just started operating. Sales are definitely low, and we see a large spike the next month. But is it fair to mark the second month of business an anomaly because sales were low? Or is this the norm for a new business?
For our last anomaly (November 2018), we see a huge spike in sales that appears to deviate from the overall trend. However, we have run out of data. As more data is recorded, this month may turn out not to be an anomaly, but rather an early sign of a steeper upward trend.
Conclusion
In conclusion, anomaly detection is a must-have capability for both strong data governance and rigorous analysis. While detecting outliers and anomalies in large datasets can be difficult, ensemble learning methods can help, as they are robust and efficient with large, tabular data.
The isolation forest model detects these anomalies by using a forest of ‘weak learners’ to isolate observations that are few and far between.
I hope you have enjoyed my article! Please feel free to comment, ask questions, or request other topics.