Detecting real-time and unsupervised anomalies in streaming data: a starting point

Jesus L. Lobo
Towards Data Science
7 min read · Jan 10, 2020


Fig. 1: Smart devices generate real-time data which may suffer from anomalies, leading us to wrong data-driven decisions if we do not detect and properly manage them. Image taken from pixabay.com.

Sensors enable the Internet of Things (IoT) by collecting the data needed for smarter decisions in all kinds of systems. These data are usually produced in real time, so we may find ourselves forced to process them in real time as well (stream data mining [1]). The behaviour of a system is not always constant and unalterable: it may exhibit behaviour that is unusual and significantly different from previous normal behaviour (an anomaly [2]). Anomaly detection is valuable, yet it can turn into a difficult task to execute reliably in practice.

This article does not claim to be an exhaustive list of methods and solutions, but rather an easy entry point for practitioners who tackle this problem for the first time and need simple, understandable, and scalable solutions. Anomalies appear in real-world applications such as fraud prevention, finance, energy demand or consumption, e-commerce, cybersecurity, medical diagnosis, social media, predictive maintenance, and fault detection, among others. In this article we analyse some algorithms to deal with anomalies: Welford's algorithm, a quartiles-based solution, a z-score metric-based solution, and a machine learning-based solution called Half-Space Trees (HST). The first three solutions are based on statistical indicators/metrics, while the last one comes from the machine learning field.

Streaming: the rules of the game

In contrast with batch learning, where data are assumed to be at rest (historical data are available) and models do not continuously integrate new information into already constructed models, stream learning imposes strict restrictions on real-time processing:

  • Instances (data) in the stream arrive online (frequently one instance at a time) and can be read at most once, which is the strongest constraint for processing data streams; the system has to decide whether the current instance should be discarded or archived.
  • Only selected past instances can be accessed by storing them in memory, which is typically small relative to the size of the data stream. When designing stream learning algorithms, we have to take several algorithmic and statistical considerations into account; for example, since we cannot store all the inputs, we cannot unwind a decision made on past data.
  • The processing time of each instance must be small and constant.
  • The processing algorithm must produce a model that is equivalent to the one that would be produced by a batch processing algorithm.

This set of constraints is the reason why most existing anomaly detection algorithms for batch processing are not applicable to streaming applications.

In most practical cases, the data consist of a sequential and univariate dataset, where supervised information about the anomalies (how many there are, where they are) is not available.

Welford's algorithm

Welford's method is a usable single-pass method for computing the running variance or the running standard deviation. It can be derived by looking at the differences between the sums of squared differences for N and N-1 instances; the data do not need to be stored for a second pass [3]. In order to use Welford's method for anomaly detection problems, I suggest incorporating the following simple modification.

We create an upper limit (UL) and a lower limit (LL). When a new reading goes beyond either of these limits (that is, more than X times the online standard deviation away from the online mean), we classify that reading (instance) as an anomaly. The limits can be calculated as:

UL = online mean consumption + X * online standard deviation

LL = online mean consumption - X * online standard deviation

The higher X is, the more false negatives (FNs) we will get; the lower X is, the more false positives (FPs) we will get. The choice of X is therefore not trivial: there is a trade-off between this parameter and the number of FPs and FNs. You then need to set X = 1, 2, 3, 4, … depending on this decision. As you can see, both the online mean consumption and the limits are updated online every time a new instance arrives.

Code: A simple implementation of the original method can be found here.
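
To make the modification concrete, below is a minimal, self-contained sketch of Welford's single-pass update combined with the UL/LL check described above (the class name, the choice X = 3, and the example readings are mine, for illustration only):

    class WelfordAnomalyDetector:
        """Online mean/std via Welford's method, with X-sigma anomaly limits."""

        def __init__(self, x=3.0):
            self.x = x        # X: number of standard deviations for the limits
            self.n = 0        # number of instances seen so far
            self.mean = 0.0   # running (online) mean
            self.m2 = 0.0     # running sum of squared differences from the mean

        def update(self, value):
            """Process one reading; return True if it lies outside [LL, UL]."""
            is_anomaly = False
            if self.n > 1:
                std = (self.m2 / (self.n - 1)) ** 0.5
                ul = self.mean + self.x * std   # upper limit
                ll = self.mean - self.x * std   # lower limit
                is_anomaly = value > ul or value < ll
            # Welford's single-pass update of the running mean and variance
            self.n += 1
            delta = value - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (value - self.mean)
            return is_anomaly

    detector = WelfordAnomalyDetector(x=3.0)
    for reading in [5.1, 4.9, 5.0, 5.2, 5.0, 4.8, 5.1, 11.7, 5.0]:
        if detector.update(reading):
            print("anomaly:", reading)

Note that each reading is scored against the statistics of the previous readings before it is incorporated into them, which is the natural order for online detection.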

Notes: you should decide whether readings equal to 0 should be considered anomalies or not. You may also have noticed that the LL can fall below 0 depending on the readings and X; this can easily be adjusted if it is a problem for your graphs. Please note that, if you obtain many consecutive anomalies, you may need to consider them as a single anomaly, or even as an "anomalous" period. Finally, I would like to mention the possibility of using a sliding window of size w and calculating all the metrics over it.

The quartiles-based solution

A boxplot is a popular way of representing the distribution of a dataset based on its five-number summary: the minimum, the first quartile (Q1/25th percentile), the median, the third quartile (Q3/75th percentile), and the maximum. This representation can tell you about your outliers and their values.

Fig. 5: Elements of a boxplot. Image modified and taken from WikiMedia Commons.

The interquartile range (IQR) is the distance from Q1 to Q3 (IQR = Q3 - Q1), and we can calculate the maximum and the minimum as:

maximum = Q3 + 1.5 * IQR

minimum = Q1 - 1.5 * IQR

(For more detail on boxplots and quartiles, I recommend the following article: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)

Then we can consider as outliers all those points that fall above the maximum or below the minimum. These number summaries can also be calculated in an online manner.

Code: as you see, this solution can be easily implemented. For lazy people, there are many implementations of this method available in well-known repositories and other websites.

Notes: here we could also use a sliding window, as previously mentioned.
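
As an illustration, here is a minimal sketch that combines the quartile rule with such a sliding window (the window size w = 50, the multiplier 1.5, and the synthetic stream are illustrative choices):

    from collections import deque
    import numpy as np

    def iqr_limits(window, k=1.5):
        """Return (minimum, maximum) according to the quartile rule."""
        q1, q3 = np.percentile(window, [25, 75])
        iqr = q3 - q1
        return q1 - k * iqr, q3 + k * iqr

    w = 50                                 # sliding window size
    window = deque(maxlen=w)
    rng = np.random.default_rng(42)
    stream = np.concatenate([rng.normal(5, 1, 200), [15.0],
                             rng.normal(5, 1, 50)])  # synthetic stream

    for reading in stream:
        if len(window) == w:
            minimum, maximum = iqr_limits(window)
            if reading < minimum or reading > maximum:
                print(f"anomaly: {reading:.2f}")
        window.append(reading)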

The z-score metric-based solution

The standard score or z-score (z) gives an idea of how far from the mean a data instance is, i.e. how many standard deviations above or below the population mean a raw score is.

The z-score for an instance x can be calculated as: z = (x - μ) / σ

The standard score can also be calculated in an online manner, but this time using a sliding window, in which case it is known as the running or moving z-score. Given a window size w, the moving z-score is the number of standard deviations that each instance is away from the mean, where the mean and standard deviation are computed only over the previous w instances.

Code: as you see, this solution can be easily implemented. For lazy people, there are many available implementations of this method in well-known source code repositories.
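
As a concrete example, here is a minimal sketch of the moving z-score over a sliding window of size w (the values w = 30 and threshold = 3.0 are illustrative choices):

    from collections import deque
    import numpy as np

    def moving_zscore_anomalies(stream, w=30, threshold=3.0):
        """Flag instances whose moving z-score exceeds the threshold in absolute value."""
        window = deque(maxlen=w)
        for x in stream:
            if len(window) == w:
                mu, sigma = np.mean(window), np.std(window)
                z = (x - mu) / sigma if sigma > 0 else 0.0
                if abs(z) > threshold:
                    print(f"anomaly: {x:.2f} (z = {z:.2f})")
            window.append(x)

    rng = np.random.default_rng(42)
    stream = np.concatenate([rng.normal(5, 1, 200), [15.0],
                             rng.normal(5, 1, 50)])  # synthetic stream
    moving_zscore_anomalies(stream)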

Half-Space Trees

Half-Space Trees (HST) [4] is a fast one-class anomaly detector for evolving data streams. It requires only normal data for training and works well when anomalies are spread out in time; it does not work well if anomalies are packed together in windows of time. The technique is an ensemble of random HSTs, where each tree structure is constructed without any data, making the technique highly efficient because it does not require model restructuring when adapting to evolving data streams.

This technique is trained incrementally and uses a sliding window of size w. Other relevant parameters are the number of trees in the ensemble (nt) and the threshold for declaring anomalies (th): any instance whose anomaly score is above this threshold will be declared an anomaly. (In the original formulation anomalies receive low mass scores; implementations typically normalise the score so that higher values indicate more anomalous instances.) To know more about all the parameters, please check the Code section below, or this paper.

Code: this solution can be found in Creme or in scikit-multiflow, both frameworks implemented in Python.
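
As a hedged sketch (not a definitive implementation), the following shows how HST could be used with Creme's streaming API; the parameter and method names reflect Creme at the time of writing and may differ across versions, HST implementations typically expect features scaled to [0, 1], and the threshold th = 0.8 and the synthetic stream are illustrative choices:

    # Sketch only: parameter/method names follow Creme's API and may vary
    # across library versions; features are assumed to be scaled to [0, 1].
    import random
    from creme import anomaly

    hst = anomaly.HalfSpaceTrees(
        n_trees=25,       # nt: number of trees in the ensemble
        height=8,         # depth of each tree
        window_size=100,  # w: size of the sliding window
    )

    th = 0.8  # threshold for declaring anomalies (illustrative value)

    random.seed(42)
    stream = [random.uniform(0.45, 0.55) for _ in range(500)]
    stream[250] = 0.99  # injected anomaly

    for reading in stream:
        x = {"value": reading}
        score = hst.score_one(x)  # normalised score: higher -> more anomalous
        hst = hst.fit_one(x)      # incremental (one-instance) update
        if score > th:
            print(f"anomaly: {reading} (score = {score:.2f})")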

Other approaches

Some other well-known approaches also deserve special attention, among many others:

  • Local Outlier Factor (LOF, see here for more details),
  • Univariate parametric approaches such as Grubbs' test (where the mean may act as a mask for outliers) or the Tietjen-Moore test (which may suffer from swamping),
  • Do not forget to check [5], where an extensive comparison between many detectors has been carried out.

Conclusions

  • As you have already realised, all these techniques require the tuning of one or more parameters that influence their performance in terms of FPs and FNs (see the F1-score performance metric).
  • As this is an unsupervised scenario, it is very important to have a deep knowledge of the domain in order to evaluate the performance of each technique and to properly find the equilibrium between FPs and FNs. In any case, their evaluation will be subjective, which is why domain knowledge is crucial.
  • Each technique finds anomalies according to different criteria, so their classifications (anomaly: yes/no) will not necessarily match. But we can find similarities, for example between the z-score metric-based solution and Welford's solution.
  • Finally, you should consider the possibility of combining some of these techniques (as an ensemble), since they can be complementary and detect different types of anomalies.

References

[1] Bifet, A., Holmes, G., Kirkby, R., & Pfahringer, B. (2010). Moa: Massive online analysis. Journal of Machine Learning Research, 11(May), 1601–1604.

[2] Chandola, V., Mithal, V., & Kumar, V. (2008, December). Comparative evaluation of anomaly detection techniques for sequence data. In 2008 Eighth IEEE international conference on data mining (pp. 743–748). IEEE.

[3] Knuth, D. E. (2014). Art of computer programming, volume 2: Seminumerical algorithms. Addison-Wesley Professional.

[4] Tan, S. C., Ting, K. M., & Liu, T. F. (2011). Fast anomaly detection for streaming data. In IJCAI Proceedings - International Joint Conference on Artificial Intelligence (Vol. 22, No. 1, pp. 1511–1516).

[5] Ahmad, S., Lavin, A., Purdy, S., & Agha, Z. (2017). Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 262, 134–147.


Dr. Jesús L. Lobo is currently based in Bilbao (Spain), working at TECNALIA as a Data Scientist and Researcher in Machine Learning and Artificial Intelligence.