
The PCA Trick with Time-Series

PCA can be used to reject cyclic time-series behavior, which makes it useful for anomaly detection.

John Mark Agosta
Towards Data Science
7 min read · Dec 21, 2020


Top: noisy periodic signals; bottom: reduced anomaly signal. (All images by author.)

Detecting an anomaly typically means thresholding a signal, so that an alarm fires when the signal goes out of range. For something like an assembly line, where tolerances are precise, the difference between normal and abnormal is clear. But network traffic has a troublesome characteristic that makes this hard: it varies with large daily cycles as customers’ activity peaks and wanes. One could try to tease out this daily variation by modelling it explicitly. However, this trick using Principal Component Analysis (PCA) avoids that hard work.

The periodic components embedded in a set of concurrent time-series can be isolated by PCA, to uncover any abnormal activity hidden in them.¹ This puts the same math commonly used to reduce feature sets to a different purpose. PCA and similar dimension-reduction methods may be part of your everyday data science toolkit, but I bet you never thought of using PCA for this.

Here’s the problem it solves.

Say we’ve got a cluster of networked machines running a distributed web service. Customer traffic flows through various machines, depending on their function, and each machine records a set of performance variables, such as memory and CPU usage. Because of the common external origin of the traffic, namely the customer load, these variables share periodic fluctuations on top of whatever innate source of variation one may seek to detect. That variation could be due to software or hardware faults among the machines: the stuff that you’re monitoring for.

Any stationary time-series can be expressed as a sum of sine and cosine functions, in what is called a Fourier expansion. These periodic functions are a natural way to analyze stationary time-series, for a fundamental reason that will become clear. My first foray at solving the problem was to extract the Fourier expansion explicitly, much as a tool like Facebook’s Prophet forecaster does. With the extracted components in hand, I then attempted to weave together the subset responsible for the external traffic into a predictor of normal traffic, deviations from which could be classified as abnormal. But all this effort turns out to be unnecessary.

The linear algebra of PCA.

Recall from linear algebra that one may construct a basis for any vector space, meaning a set of independent vectors that span the space, of which any other vector in the space is a unique linear combination. All bases for the space have the same size: this size defines the dimension of the space. PCA discovers a basis with two desirable properties. First, among the many possible bases, the subspace spanned by the first d PCA components minimizes the reconstruction error of the original signals: for any d less than the full dimension, it gives the best d-dimensional approximation of the feature vectors. As a bonus, since the covariance matrix whose eigenvectors are recovered is symmetric, the PCA eigenvectors are orthogonal. This result has long been known, earlier as the Karhunen-Loève expansion, originally published in the 1940s.
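
Both properties are easy to check numerically. A minimal sketch in R, where X stands in for any data matrix (the sizes here are arbitrary):

# Any centered data matrix will do for the demonstration.
X <- scale(matrix(rnorm(200 * 5), 200, 5), center = TRUE, scale = FALSE)

e <- eigen(cov(X))               # symmetric covariance matrix
p <- prcomp(X, center = FALSE)   # X is already centered

# The principal directions match the covariance eigenvectors up to sign...
all.equal(abs(e$vectors), abs(p$rotation),
          check.attributes = FALSE, tolerance = 1e-6)

# ...and they are orthogonal: rotation' %*% rotation is the identity.
round(t(p$rotation) %*% p$rotation, 10)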

Second, note that the Fourier expansion also forms an orthonormal basis, one that is invariant to translation. This means the components don’t change as the time-index is shifted. This matters because, besides removing any concern about how the starting point of the time-series affects the results (it doesn’t), it means the Fourier components are the eigenvectors of a stationary time-series. In fact, any basis for a stationary time-series can arguably be expressed as a combination of Fourier components.

A bit of math shows how the translation invariance arises. Any exponential function has the desired property

exp(x + h) = exp(h) · exp(x)

where x can be a complex number, and h is a translation. In words, displacing the exponential by h leaves it unchanged except for a constant factor; it is an eigenfunction of translation. Of course, for our purposes, this property carries over to the familiar pair of trigonometric functions because, by Euler’s formula,

exp(ix) = cos(x) + i sin(x).

Thus trigonometric functions are eigenvectors of translation, meaning any periodic signal has the same eigen-decomposition if shifted in time. Consequently, and one might say magically, if there is a common periodic component to the set of time-series variables, PCA will find it, and the Fourier components will appear in the PCA results.
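
The shift-invariance is easy to confirm numerically; a toy check, with illustrative sizes and phases:

n <- 400
t <- (0:(n - 1)) * 4 * pi / n     # exactly two whole cycles
phases <- runif(4, 0, 2 * pi)     # four sinusoids with random phases
make <- function(shift) sapply(phases, function(ph) sin(t + ph + shift))

# The PCA variance spectrum is unchanged by a displacement in time.
round(prcomp(make(0))$sdev, 3)
round(prcomp(make(1.3))$sdev, 3)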

How this works.

In our case, we are interested in the PCA maximum-variation subspace as a way to identify the components of the periodic signal. In short, instead of throwing away the lowest-variation basis vectors, as we do for dimension reduction, we ignore the largest (in the sense of variance) basis vectors until we are left with those that are normally “quiet”, where the anomalies appear.

Our original signals are treated as vectors, one for each time-series, with length equal to n, the number of time-steps. This n is not to be confused with the dimension of the set of signal features: think of the signals matrix as a table of p features by n samples.

Here’s the R code to convert the signals time-series matrix into the pca_timeseries components. prcomp() is part of R’s built-in stats package.
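
A minimal sketch of that conversion, assuming signals is a numeric matrix with one row per time-step and one column per feature:

# signals: n time-steps (rows) by p features (columns).
pca <- prcomp(signals, center = TRUE, scale. = TRUE)

# Each column of pca$x is one component time-series, ordered from
# largest variance (the common cycles) to smallest (the quiet residual).
pca_timeseries <- pca$x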

Let’s have a look.

We can see how PCA rejects periodic noise with a simulation. We create a combination of a few sine and cosine features, with a pinch of noise, and run it through a PCA decomposition. The result is another set of orthogonal time-series that pick up the periodicities of the input set in the first two components, leaving the residual in the rest. Here are five noisy features with different phases, and their five PCA components:

PCA isolates the common Fourier component among time-series.
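
A figure like the one above can be re-created with a short simulation; the sizes, phases, and noise level below are assumptions, not the article’s exact values:

set.seed(42)                          # illustrative values throughout
n <- 500
t <- seq(0, 4 * pi, length.out = n)
phases <- runif(5, 0, 2 * pi)

# Five copies of one sinusoid at different phases, plus a pinch of noise.
signals <- sapply(phases, function(ph) sin(t + ph) + rnorm(n, sd = 0.2))

# The shared cycle lands in the first two components (its sine and cosine
# parts); the remaining components are near-white residual noise.
pca <- prcomp(signals, center = TRUE, scale. = TRUE)
matplot(pca$x, type = "l", lty = 1, xlab = "time-step", ylab = "PCA components")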

Now we corrupt the features with a step function that is largely invisible in the original features. Looking at the basis vectors returned by PCA, we see the periodic parts expressed in the first two components. Once PCA has captured these, what’s left is a residual where the step function is obvious.

An anomaly introduced in the cyclic features is evident in the PCA components.
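
Continuing the sketch, the corruption step might look like this (the step’s position and size are assumed):

# A brief level shift in one feature, small relative to the cycles.
step <- rep(c(0, 0.5, 0), c(200, 50, 250))
signals[, 3] <- signals[, 3] + step

# Re-run PCA: the cycles still occupy the first two components, and the
# step stands out in the low-variance residual components.
pca <- prcomp(signals, center = TRUE, scale. = TRUE)
matplot(pca$x[, 3:5], type = "l", lty = 1, ylab = "residual components")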

This works because there’s a common cause behind the periodic features: they are driven by the external customer cycles. Most prominent is a daily cycle, but weekly and hourly cycles can also appear, as can any period that is a fraction of the time-series length.

Back to the application.

I took data for several performance measures from one cluster, for which we know a significant anomaly occurred over the course of a day within an observation period of two weeks. See the title illustration at the top of the article. In the top graph we can see the daily cycles in the different features, to different degrees, before and after the anomaly. The point is that the PCA component, shown in black in the bottom graph, preserves the anomaly while rejecting the various levels of daily cycles.

Once the PCA model is trained on “normal” traffic, we can take the last PCA component that we’ve kept and threshold it. Compared to thresholding the individual measures directly, the result is a significant reduction in spurious detections, so we can tighten the threshold to increase sensitivity.
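
A sketch of that detection rule; the window names signals_normal and signals_new, and the 4-MAD cutoff, are assumptions for illustration:

# Fit on a window of known-normal traffic; set the alarm limit there too.
fit <- prcomp(signals_normal, center = TRUE, scale. = TRUE)
residual_train <- fit$x[, ncol(fit$x)]
limit <- 4 * mad(residual_train)              # assumed 4-MAD cutoff

# Project new traffic onto the trained basis and threshold the residual.
scores <- predict(fit, newdata = signals_new)
alarms <- which(abs(scores[, ncol(scores)]) > limit)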

Runs on a real dataset.

Let’s look at an example of the method at work. I ran the anomaly detector on suspect traces for 400 machine clusters, a few of which I knew had obvious anomalies. To cross-validate, the detector was trained on the normal but cyclic traffic before the known anomalous event; the detection rate was then measured on the day the anomaly occurred, and separately on the week after, once operations returned to normal. I found that the PCA-based anomaly detector clearly picked up the anomaly in the known cases but, more importantly, was almost silent during normal traffic. Here’s an example clearly identified as anomalous:

Example of anomalous Memory (left) and CPU (right) features

Here’s a case where the data has strong cyclic, noisy features, but no anomaly is detected:

Memory (left) and CPU (right) features not classified as anomalous

More interestingly, on examination of previously unseen cases, many turned out to be anomalous yet were missed entirely by existing detectors. This can be put in statistical terms. The lack of spurious detections when no anomaly occurs means the false positive rate is small, in this case below measurable levels for the one-week testing trace I ran. The other impressive news is that the detection rate for anomalies, the true positive rate, was higher than that of the current detection methods.

tl;dr

1. The PCA trick is practical for revealing anomalous common effects while rejecting strongly cyclic daily customer traffic.
2. Applying “train and test” for anomaly detection analysis on a known event demonstrated substantially improved hourly anomaly detection rates at zero measurable false positive rates.
3. Similar analysis of 400 clusters for the same episode discovered numerous anomalous clusters not caught by current tools.

¹ Lakhina, A., Crovella, M., & Diot, C. (2004). Diagnosing network-wide traffic anomalies. ACM SIGCOMM Computer Communication Review, 34(4), 219–230.
