Multivariate Process Control by Principal Component Analysis Using T² and Q Errors

Using and interpreting Hotelling’s T² and Squared Prediction Error Q in anomaly detection systems

Davide Massidda
Towards Data Science



A substantial part of my job as a data scientist consists of building anomaly detection systems for process control in manufacturing, and Principal Component Analysis (from now on, PCA) plays an essential role in my toolbox.

The scientific literature suggests two measures to trace anomalies through PCA: Hotelling’s T² and the Squared Prediction Error, aka Q error. Although their calculation is not complicated, standard software packages often omit them, and their meaning remains quite enigmatic.

There are hundreds of good tutorials about PCA on the web. However, despite this large amount of information, some questions about the usage of PCA in production environments still need to be adequately addressed. In this tutorial, I’ll avoid repeating what hundreds of people have said before me and go straight to the point.

This tutorial

After a quick introduction to PCA (just enough to emphasize some points), in this tutorial we’ll dive deep into the usage of the T² and Q errors, disassembling their formulas to understand what they can tell us. Here you will not find the whole story about PCA and its usage in anomaly detection: I will focus on the meaning of the T² and Q measures, where they come from, and how to interpret them.

I will use basic Python code, directly leveraging the SVD algorithm implemented in numpy. I will propose only the code necessary to obtain the described results, omitting the code that generates the graphs (you can find it here).

Read on if you have at least a shallow knowledge of PCA as a data reduction technique. If you don’t know what a principal component is, I suggest you first read a general introduction to PCA and then return here.

Table of contents

Process control data
Centering and scaling data
Process control by PCA
An encoding-decoding system
Distillation
Anomaly detection by PCA
Squared Prediction Error Q
Hotelling’s T²
Anomaly detection in practice
From data to PCs… and back
Q contributions
T² contributions
Concluding remarks

Process control data

For this tutorial, I generated some ad-hoc data, simulating a straightforward process control in an industrial plant. In this repository, you can find the data and a notebook, including the complete code used to run the analyses.

Let’s start reading the data and briefly exploring it.

import pandas as pd

plant = pd.read_csv("plant_generated.csv")
plant.head()
   group  time        var1       var2        var3       var4        var5
0  train     1   83.346705  28.115372  455.898265  12.808663  227.974152
1  train     2   90.594521  33.497319  462.503195  14.079053  228.173486
2  train     3  101.275664  30.396332  492.407791  16.832834  250.212025
3  train     4   90.898109  29.143537  472.162499  16.505277  234.079354
4  train     5   84.898605  29.459506  467.872180  12.801665  238.440786

The table includes a column “group” splitting the observations into a train set (100 records) and a test set (30 records). Five variables are recorded.

variables = plant.columns[2:].tolist()
print(variables)
['var1', 'var2', 'var3', 'var4', 'var5']

Each variable is monitored over time.

plant.groupby(["group"]).agg({"time": [len, min, max]})
       time
        len  min  max
group
test     30    1   30
train   100    1  100

The correlation matrix shows sizeable correlations across the whole set of variables. No “isolated communities” of variables or redundant measures appear.

plant.loc[plant.group=="train", variables].corr()
          var1      var2      var3      var4      var5
var1  1.000000  0.536242  0.773872  0.534615  0.727272
var2  0.536242  1.000000  0.535442  0.451099  0.632337
var3  0.773872  0.535442  1.000000  0.613662  0.709600
var4  0.534615  0.451099  0.613662  1.000000  0.571399
var5  0.727272  0.632337  0.709600  0.571399  1.000000

Centering and scaling data

Since the variables have different scales, we need to center and scale the values before running the PCA, obtaining a training set where each variable has a mean of 0 and a standard deviation of 1.

To be able to rescale future data, I fit a “scaler” object that stores the center and scale parameters, and then I transform the training data. I call “X” the raw observed data and “Z” the scaled data.

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = np.array(plant.loc[plant.group=="train", variables])
Z_train = scaler.fit_transform(X_train)

Finally, let’s see the time series of records for each autoscaled variable.

Figure 1. Time series of autoscaled variables — Image from the author, license CC0.

Process control by PCA

A way to view PCA is as an encoding-decoding system trained on a data set representing a process in control. During the training, the system learns the rules relating the variables to monitor. Subsequently, relying on these rules, the system can evaluate new data, establishing whether the process is in control.

PCA takes the observed variables as an ensemble and redistributes their variability, building new orthogonal variables: the Principal Components (PCs). Starting from k variables, PCA yields k PCs. The algorithm estimates the coefficients (loadings) by which the observed variables (Z) are multiplied to get the PCs.

Figure 2. Relations between the data space and the components space in PCA — Image from the author, license CC0.

In practice, PCA finds the rules (i.e., the loadings) to project data from the space of observed data to the space of principal components, obtaining the PC scores. This projection can be reverted by multiplying the PC scores by the transposed loadings matrix (see Figure 2).

I use the Singular Value Decomposition algorithm in the following code block to estimate loadings for our plant data.

u, s, vh = np.linalg.svd(Z_train)
loadings = np.transpose(vh)
print(loadings)
array([[ 0.46831293, -0.07447163,  0.50970523,  0.1634115 , -0.69902377],
       [ 0.40387459,  0.76147727, -0.41265167,  0.29133908, -0.04333374],
       [ 0.47556921, -0.21952275,  0.30115426,  0.43988217,  0.66441965],
       [ 0.40644337, -0.58699292, -0.68019685, -0.01720318, -0.16516475],
       [ 0.47561121,  0.14783576,  0.12867607, -0.83344223,  0.20187888]])

The loadings array reports variables on the rows and PCs on the columns. This is the usual orientation of the matrix, but software packages can differ in how they arrange it.
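
If you want to double-check the orientation in your own environment, a quick sanity check is to compare the SVD loadings with the components estimated by scikit-learn’s PCA (sklearn is already available, since we imported the scaler from it): sklearn stores the components on the rows, and some signs may be flipped, so we compare absolute values. This is just a verification sketch, not part of the original pipeline.

from sklearn.decomposition import PCA

# sklearn's components_ matrix is the transpose of our loadings matrix,
# up to sign flips: compare absolute values.
check = PCA().fit(Z_train)
print(np.allclose(np.abs(check.components_), np.abs(np.transpose(loadings))))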

We can obtain the PC scores by multiplying the observed data matrix by the loadings matrix.

train_scores = Z_train @ loadings
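
Since the full loadings matrix is orthonormal, projecting with all k components and decoding back reproduces the scaled data exactly. A minimal check, reusing the objects defined above:

# With all the components retained, decoding is lossless.
Z_rebuilt = train_scores @ np.transpose(loadings)
print(np.allclose(Z_rebuilt, Z_train))  # True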

An encoding-decoding system

A key concept of PCA is that each PC is fed by all the variables (although to varying degrees, depending on the loading values). Therefore, when we inspect the scores of a single PC, we are inspecting all the variables at once.

However, the overall data variability isn’t equally distributed between PCs, because PCs are ranked from the one accounting for the greatest variability (PC_1) to the one accounting for the least variability (PC_k). Realistically, the first PCs account for the “signal” present in the data, while the last ones account for the noise. So, we can drop the PCs that don’t convey a significant amount of variability, separating the signal from the noise.

The figure below shows the time series of PC scores on our data. The decrease in variability as the rank of the component increases is evident (in particular, jumping from the first to the second component).

Figure 3. Time series of principal component scores — Image from the author, license CC0.

Distillation

In the image below, I exploded Figure 2. The gray node named distilled data represents all the retained PCs, while the other gray node represents all the dropped PCs. I used the term “distilled” because each component is an amalgamated essence of many variables.

Figure 4. PCA as an encoding-decoding system — Image from the author, license CC0.

We can calculate the amount of variance absorbed by each component by squaring the singular values from the SVD and dividing them by the degrees of freedom.

n = Z_train.shape[0]
variances = s**2 / (n - 1)

Let’s look at the results.

pd.DataFrame({
    "Component": [*range(1, len(s)+1)],
    "Variance": variances,
    "Proportion": variances/np.sum(variances),
    "Cumulative": np.cumsum(variances/np.sum(variances))
})
   Component  Variance  Proportion  Cumulative
0          1  3.485770    0.690182    0.690182
1          2  0.573965    0.113645    0.803828
2          3  0.498244    0.098652    0.902480
3          4  0.276403    0.054728    0.957208
4          5  0.216122    0.042792    1.000000

The first component accounts for 69% of the total variance of the data. Adding the second component, we can explain 80%. Two components can be enough to summarize the whole dataset if we assume the remaining 20% is noise.

Using only two of the five components is like performing a lossy data compression. However, since the lost part is theoretically noise, we gain an advantage by dropping the last components.
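
If you prefer not to hard-code the number of retained components, a common heuristic is to keep the smallest number of PCs reaching a given share of explained variance. A minimal sketch follows; the 80% threshold is an arbitrary choice for illustration, not a rule.

# Smallest number of PCs whose cumulative explained variance reaches 80%.
explained = np.cumsum(variances) / np.sum(variances)
n_comp_auto = int(np.argmax(explained >= 0.80) + 1)
print(n_comp_auto)  # 2 on these data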

Let’s isolate the loadings and variances of our system.

n_comp = 2
pca_loadings = loadings[:,:n_comp]
pca_variances = variances[:n_comp]

When new data arrive at a trained PCA, they are projected into the component space (encoding). Afterward, the process is reverted and the data are rebuilt using just the first PCs (decoding).

Anomaly detection by PCA

Now we can dive into detecting anomalies with the Q and T² error measures. First of all, let’s understand what they represent. We will see the formulas directly applied to the data later.

Squared Prediction Error Q

Since we dropped the noise in the encoding step, the rebuilt data cannot be exactly equal to the observed data. This difference generates the error Q.

When the process control system returns an anomaly of type Q, something has broken the correlation structure: one or more variables are no longer varying in harmony with the others (where the concept of “in harmony” is defined by the correlation matrix).

We can distinguish two extreme scenarios.

  • A variable takes an unexpected value (not necessarily out of range) and is no longer “predictable” from the others. If a certain type of correlation is expected between two variables, and suddenly the direction of one changes, their expected relation is no longer observed, generating an alarm.
  • A variable behaves normally, but all the others deviate from expectations. If prior information says that when the other variables deviate, our variable should deviate too, and it does not, the process has an anomaly.

The second scenario can appear paradoxical. For this reason, we need to consider another type of error: Hotelling’s T².

Hotelling’s T²

Unlike Q, T² is more linked to the metric of the observed data: values out of the expected range generate extreme T² values.

The T² statistic is strictly related to the Mahalanobis distance. We can conceive T² as a multivariate version of an anomaly detection system based on thresholds imposed on observed data but with some relevant differences.

  1. Observed data are denoised by the encoding system.
  2. Anomalies are searched for at the PC level, and only later are they decoded back to the observed data level.

The second point is interesting because PCs synthesize the entire group of variables. Using T², we do not search for anomalies variable by variable from a univariate perspective; we analyze the data all at once.

During the training process, we stored the variances of the PCs. Now, we can use this information to compare the observed variance of PCs against the expected one, calculating an “anomaly score” for each PC. Afterward, we can decode this anomaly score to return to the data space, obtaining the contribution of each variable to the variation of the PCs.

Anomaly detection in practice

Previously, we built our PCA system. Now let’s start using it, projecting new data into the PC space. We start by centering and scaling the data; next, we visualize the time series of the variables.

X_test = np.array(plant.loc[plant.group=="test", variables])
Z_test = scaler.transform(X_test)

In the image below, the time series of each variable for the new data is represented.

Figure 5. Time series of centered and scaled test data — Image from the author, license CC0.

Insane data, isn’t it? As you can see, I simulated a process going out of control. Four variables drift strongly from the expected mean toward high values, while the fifth variable, after an initial drift, gradually returns to normal values.

From data to PCs… and back

We can encode the data by calculating the PC scores:

test_scores = Z_test @ pca_loadings

Thereafter, we can decode the PCs by calculating the expected data:

test_expect = test_scores @ np.transpose(pca_loadings)

Now, let’s visualize the time series of rebuilt data:

Figure 6. Time series of expected data built using a subset of PCs — Image from the author, license CC0.

Well, here we finally reveal a fundamental issue of multivariate process control. If you look closely at the chart above, you’ll notice that the rebuilt data show two main defects compared to the observed data (Figure 5).

  1. The observed data of the first four variables peak and settle on a plateau. By contrast, their expected values gradually decrease after peaking.
  2. The observed data of the fifth variable climb to a peak and quickly decrease. By contrast, its expected values show only a slight decrease after peaking. Essentially, its trend is similar to that of the other variables.

The reconstruction of the first four variables has been influenced by what happens in the fifth, while the reconstruction of the fifth variable has been influenced by what happens in the first four. Thus, each rebuilt variable is a mix of the entire dataset.

This inaccuracy happens because we are decoding after dropping the last PCs. According to our PCA, the “madness” of the new data is noise, so it is absorbed by the last PCs, which capture the “singularities” of new data that were absent in the training data.

In production environments, Q and T² errors can be used to monitor these anomalies and understand their origin. For both statistics, we can calculate the following values:

  • the contribution of each variable to each measure of error;
  • the cumulated squared contributions for each measure of error, yielding the Q and T² statistics proper.

You can find details with good math explanations here. In this post, I’ll focus on the contributions since the main interest in data monitoring is locating the source of the anomaly.

Q contributions

The contribution of each variable to the Q error is the difference between the observed scaled data and their expected (rebuilt) values.

contrib_Q = Z_test - test_expect

In the image below, you can visualize the time series of Q contributions of new data.

Figure 7. Time series of Q contributions — Image from the author, license CC0.

Roughly speaking, Q contributions are near zero for all the variables during the first half of the monitoring period. In the second part of the period, Q contributions drift away from zero for all the variables, especially the fifth.

This shouldn’t surprise us. As we have seen, up to the middle of the period, the variables are anomalous, but anomalous in a synchronous way (consistently with the correlations observed in the training set). In the second part of the period, the fifth variable returns towards the mean, going back to “normal” values (near the ones of the training set) but breaking the correlation structure.
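
If you also need the cumulated statistic mentioned above, the Q value of each record can be obtained as the row-wise sum of its squared contributions; a minimal sketch:

# Q (Squared Prediction Error) per record: squared contributions
# summed across the variables.
Q = np.sum(contrib_Q**2, axis=1)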

T² contributions

The formula to obtain T² contributions is similar to the formula to decode PCs to the data level but with an important difference: the scaling factor.

PCs are all mean-centered, so the expected value of every PC is zero. However, since each PC absorbs a different amount of data variance (see Figure 3), each PC has its own scale. The same score may represent a small variation on one PC but a large variation on another!

For this reason, before decoding, each PC is scaled by dividing it by its standard deviation. In this way, PCs are put on the same scale before decoding.

pc_scaling = np.diag(1/np.sqrt(pca_variances))
contrib_T2 = test_scores @ pc_scaling @ np.transpose(pca_loadings)

The image below shows the time series of T² contributions.

Figure 8. Time series of T² contributions — Image from the author, license CC0.
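
As for Q, the cumulated T² of each record can be obtained as the row-wise sum of its squared contributions; since the retained loadings are orthonormal, this is equivalent to summing the squared PC scores divided by their variances. A minimal sketch:

# T² per record: squared scaled contributions summed across the variables,
# equivalent to summing score**2 / variance over the retained PCs.
T2 = np.sum(contrib_T2**2, axis=1)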

Concluding remarks

Process control using PCA considers the monitored variables as a system that should move in unison, expecting its elements to vary consistently with each other (Q statistic) and within certain ranges (T² statistic).

The main difference between Q and T² anomalies is where they originate: the former at the data level and the latter at the PC level.

Q warns us that the encoding-decoding system isn’t working as expected on new data because the observed relations between the variables have changed. So, the Q contributions allow us to identify which variables are unpredictable given the observed data.

T² warns us that the encoding system is producing PC scores too distant from the center of the whole data. The T² contributions decode the errors, “back-propagating” them to the data level, allowing us to identify the variables with anomalous observations.

The core idea is that each data record consists of signal and noise. PCA removes the noise and evaluates the signal, raising an alarm if the signal of a variable is too far from what is expected (T²) or if it is overwhelmed by the noise (Q). One does not necessarily imply the other, although they can occur together.


Data scientist at kode-solutions.net with a background in cognitive psychology. Rock enthusiast and addicted to comic books with a serious history of abuse.