
Outlier Detection Using Principal Component Analysis and Hotelling’s T2 and SPE/DmodX Methods

Thanks to PCA's sensitivity, it can be used to detect outliers in multivariate datasets.

Photo by Andrew Ridley on Unsplash

Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction while preserving relevant information. Due to its sensitivity, it can also be used to detect outliers in multivariate datasets. Outlier detection can provide early warning signals for abnormal conditions, allowing experts to identify and address issues before they escalate. However, detecting outliers in multivariate datasets can be challenging due to the high dimensionality and the lack of labels. PCA offers several advantages for this task. In this article, I will describe the concepts of outlier detection using PCA and, with a hands-on example, demonstrate how to create an unsupervised model to detect outliers in continuous and, separately, categorical data sets.


Outlier Detection.

Outliers can be modeled in either a univariate or a multivariate approach (Figure 1). In the univariate approach, outliers are detected one variable at a time, for which analyzing the data distribution works well (a minimal sketch follows after the linked post below). Read more details about univariate outlier detection in the following blog post [1]:

Outlier Detection Using Distribution Fitting in Univariate Datasets
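To make the univariate route concrete, the sketch below fits a normal distribution to a single synthetic feature and flags samples in the extreme tails. It is a minimal, simplified stand-in for the distribution-fitting approach of the linked post; the 2.5% tail threshold is an arbitrary choice for illustration.

# Minimal univariate sketch: fit a normal distribution to one feature
# and flag samples in the extreme tails (illustration only).
import numpy as np
from scipy import stats

x = np.random.default_rng(1).normal(loc=10, scale=2, size=500)
x[:3] = [25, -5, 30]  # inject a few artificial outliers

mu, sigma = stats.norm.fit(x)                # estimate distribution parameters
cdf = stats.norm.cdf(x, loc=mu, scale=sigma)
is_outlier = (cdf < 0.025) | (cdf > 0.975)   # samples in the 2.5% tails

print(np.where(is_outlier)[0])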

The multivariate approach uses multiple features and can therefore detect outliers with (non-)linear relationships or skewed distributions. The scikit-learn library provides multiple solutions for multivariate outlier detection, such as the one-class classifier, isolation forest, and local outlier factor [2]. In this blog, I will focus on multivariate outlier detection using Principal Component Analysis [3], which has advantages of its own, such as explainability: because we rely on the dimensionality reduction of PCA itself, the outliers can be visualized directly.
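For reference, the scikit-learn estimators mentioned above can be applied in a few lines. The sketch below is a generic illustration on synthetic data; the contamination value and the number of neighbors are arbitrary choices, not recommendations.

# Sketch: multivariate outlier detection with two scikit-learn estimators.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
X[:5] += 6  # shift a few samples away from the bulk

# Isolation Forest: fit_predict returns -1 for outliers and 1 for inliers
iso_labels = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)

# Local Outlier Factor: same -1/1 convention
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

print((iso_labels == -1).sum(), (lof_labels == -1).sum())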

Figure 1. Overview of univariate versus multivariate analysis for the detection of outliers. Outlier detection for multivariate data sets is described in this blog (image by the author).

Anomalies vs. Novelties

Anomalies and novelties are both deviant observations from standard or expected behavior and are also referred to as outliers. There are some differences, though: anomalies are deviations that have been seen before and are typically used for detecting fraud, intrusion, or malfunction. Novelties are deviations that have not been seen before and are used to identify new patterns or events. In such cases, it is important to use domain knowledge. Both anomalies and novelties can be challenging to detect, as the definition of what is normal or expected can be subjective and vary based on the application.


Principal Component Analysis for Outlier Detection.

Principal Component Analysis (PCA) is a linear transformation that reduces the dimensionality and searches for the directions in the data with the largest variance. Due to the nature of the method, it is sensitive to variables with different value ranges and, thus, also to outliers (a small scaling demonstration follows below). An advantage is that it allows visualization of the data in a two- or three-dimensional scatter plot, making it easier to visually confirm the detected outliers. Furthermore, it provides good interpretability of the response variables. Another great advantage of PCA is that it can be combined with other methods, such as different distance metrics, to improve the accuracy of the outlier detection. Here I will use the pca library, which contains two methods for the detection of outliers: Hotelling’s T2 and SPE/DmodX. For more details, read the blog post about Principal Component Analysis and the pca library [3].

What are PCA loadings and how to effectively use Biplots?
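To see the sensitivity to value ranges in practice, the short sketch below compares the explained variance of the principal components with and without standardization. The synthetic two-feature data set is purely illustrative.

# Sketch: without scaling, PCA is dominated by the feature with the largest range.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 500),       # small value range
                     rng.normal(0, 1000, 500)])   # large value range

print(PCA(n_components=2).fit(X).explained_variance_ratio_)
# PC1 captures nearly all variance: it simply follows the large-range feature.

Xs = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(Xs).explained_variance_ratio_)
# After standardization, the variance is spread across both components.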




Outlier Detection for Continuous Random Variables.

Let’s start with an example to demonstrate how outlier detection with Hotelling’s T2 and SPE/DmodX works for continuous random variables. I will use the wine dataset from sklearn, which contains 178 samples with 13 features and 3 wine classes [4].

# Installation of the pca library
pip install pca
# Load other libraries
from sklearn.datasets import load_wine
import pandas as pd

# Load dataset
data = load_wine()

# Make dataframe
df = pd.DataFrame(index=data.target, data=data.data, columns=data.feature_names)

print(df)
#     alcohol  malic_acid   ash  ...   hue  od280/od315_of_diluted_wines  proline
# 0     14.23        1.71  2.43  ...  1.04  3.92   1065.0
# 0     13.20        1.78  2.14  ...  1.05  3.40   1050.0
# 0     13.16        2.36  2.67  ...  1.03  3.17   1185.0
# 0     14.37        1.95  2.50  ...  0.86  3.45   1480.0
# 0     13.24        2.59  2.87  ...  1.04  2.93    735.0
# ..      ...         ...   ...  ...   ...  ...
# 2     13.71        5.65  2.45  ...  0.64  1.74    740.0
# 2     13.40        3.91  2.48  ...  0.70  1.56    750.0
# 2     13.27        4.28  2.26  ...  0.59  1.56    835.0
# 2     13.17        2.59  2.37  ...  0.60  1.62    840.0
# 2     14.13        4.10  2.74  ...  0.61  1.60    560.0
# 
# [178 rows x 13 columns]

We can see in the data frame that the value ranges per feature differ heavily, and a normalization step is therefore important. Normalization is a built-in functionality of the pca library that can be enabled with normalize=True. During initialization, the outlier detection methods can be specified separately: ht2 for Hotelling’s T2 and spe for the SPE/DmodX method.

# Import library
from pca import pca

# Initialize pca to also detect outliers.
model = pca(normalize=True, detect_outliers=['ht2', 'spe'], n_std=2)

# Fit and transform
results = model.fit_transform(df)

After running the fit, the pca library scores, sample-wise, whether a sample is an outlier. For each sample, multiple statistics are collected, as shown in the code section below. The first four columns in the data frame (y_proba, p_raw, y_score, and y_bool) are from Hotelling’s T2 method. The last two columns (y_bool_spe and y_score_spe) are based on the SPE/DmodX method.

# Print outliers
print(results['outliers'])

#     y_proba     p_raw    y_score  y_bool  y_bool_spe  y_score_spe
#0   0.982875  0.376726  21.351215   False       False     3.617239
#0   0.982875  0.624371  17.438087   False       False     2.234477
#0   0.982875  0.589438  17.969195   False       False     2.719789
#0   0.982875  0.134454  27.028857   False       False     4.659735
#0   0.982875  0.883264  12.861094   False       False     1.332104
#..       ...       ...        ...     ...         ...          ...
#2   0.982875  0.147396  26.583414   False       False     4.033903
#2   0.982875  0.771408  15.087004   False       False     3.139750
#2   0.982875  0.244157  23.959708   False       False     3.846217
#2   0.982875  0.333600  22.128104   False       False     3.312952
#2   0.982875  0.138437  26.888278   False       False     4.238283

# [178 rows x 6 columns]

Hotelling’s T2 computes chi-square tests and P-values across the top n_components, which allows ranking the outliers from strong to weak using y_proba. Note that the search space for outliers is the dimensions PC1 to PC5, as it is expected that the highest variance (and thus the outliers) will be seen in the first few components. If the variance is poorly captured in the first five components, the depth can be increased. Let’s plot the outliers and mark them for the wine dataset (Figure 2).
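To build intuition for what this statistic does, the sketch below re-computes a Hotelling’s T2-like score by hand from the PC scores: the squared scores are weighted by the variance of each component, summed, and converted to a P-value with a chi-square approximation. This is a simplified illustration, not the exact internals of the pca library; it assumes df is the wine data frame loaded above.

# Sketch: Hotelling's T2 from PC scores (simplified, not the pca library internals).
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

n_components = 5
X = StandardScaler().fit_transform(df)            # df: the wine data frame from above
pc = PCA(n_components=n_components).fit(X)
scores = pc.transform(X)                          # sample scores on PC1..PC5

# T2 per sample: squared scores weighted by the component variances
t2 = np.sum(scores**2 / pc.explained_variance_, axis=1)

# Chi-square approximation for the P-value (degrees of freedom = n_components)
p_raw = stats.chi2.sf(t2, df=n_components)

print(np.argsort(p_raw)[:10])                     # ten strongest outlier candidates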

# Plot Hotellings T2
model.biplot(SPE=False, HT2=True, density=True, title='Outliers marked using Hotellings T2 method.')

# Make a plot in 3 dimensions
model.biplot3d(SPE=False, HT2=True, density=True, arrowdict={'scale_factor': 2.5, 'fontsize': 20}, title='Outliers marked using Hotellings T2 method.')

# Get the outliers using Hotelling's T2 method.
df.loc[results['outliers']['y_bool'], :]
Figure 2. Left panel: PC1 vs PC2 and the projected samples with 9 detected outliers using Hotelling’s T2 method. Right panel: Three-dimensional plot with the outliers. (image by the author)

The SPE/DmodX method measures the distance between the actual observation and its projection onto the principal components, whereas the distance to the center is expressed by Hotelling’s T2 values. A sample is flagged as an outlier by SPE/DmodX based on the mean and covariance of the first two PCs (Figure 3); in other words, when it lies outside the ellipse.
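The underlying quantity can be sketched in a few lines: each sample is projected onto the retained components, reconstructed back into the original feature space, and the squared residual between observation and reconstruction is the SPE. This is a simplified illustration of the idea, not the exact pca library implementation, and again assumes df is the wine data frame from above.

# Sketch: squared prediction error (SPE) per sample (simplified illustration).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df)       # df: the wine data frame from above
pc = PCA(n_components=2).fit(X)

scores = pc.transform(X)                     # project onto the first two PCs
X_hat = pc.inverse_transform(scores)         # reconstruct back to all 13 features

spe = np.sum((X - X_hat)**2, axis=1)         # squared residual per sample
print(np.argsort(spe)[-5:])                  # five samples with the largest SPE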

# Plot SPE/DmodX method
model.biplot(SPE=True, HT2=True, title='Outliers marked using SPE/dmodX method and Hotelling T2.')

# Make a plot in 3 dimensions
model.biplot3d(SPE=True, HT2=True, title='Outliers marked using SPE/dmodX method and Hotelling T2.')

# Get the outliers using SPE/DmodX method.
df.loc[results['outliers']['y_bool_spe'], :]
Figure 3A. Outliers detected using the SPE/DmodX method are depicted with diamonds. Outliers detected using the Hotelling T2 method are depicted with crosses. (image by the author)
Figure 3B. Outliers detected using the SPE/DmodX method, visualized in a 3D plot.

Using the results of both methods, we can now also compute the overlap. In this use case, there are 5 outliers that overlap (see code section below).

# Import numpy for the logical AND
import numpy as np

# Grab overlapping outliers
I_overlap = np.logical_and(results['outliers']['y_bool'], results['outliers']['y_bool_spe'])

# Print overlapping outliers
df.loc[I_overlap, :]

Outlier Detection for Categorical Variables.

For the detection of outliers in categorical variables, we first need to discretize the categorical variables and make the distances comparable to each other. With the discretized (one-hot encoded) data set, we can proceed using the PCA approach and apply Hotelling’s T2 and SPE/DmodX methods. I will use the Student Performance data set [5] for demonstration purposes, which contains 649 samples and 33 variables. We will import the data set as shown in the code section below. More details about the column descriptions can be found here. I will not remove any columns, but if there were an identifier column or variables of floating type, I would remove it or bin it into discrete categories (a small binning sketch follows below).
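As a side note, binning a continuous column into discrete categories can be done with pandas; the column name, number of bins, and labels below are purely hypothetical.

# Sketch: discretize a hypothetical continuous column into categorical bins.
import pandas as pd

df_example = pd.DataFrame({'grade': [8.5, 12.0, 15.5, 19.0, 3.0]})  # illustrative data
df_example['grade_bin'] = pd.cut(df_example['grade'], bins=3, labels=['low', 'mid', 'high'])
print(df_example)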

# Import library
from pca import pca

# Initialize
model = pca()

# Load Student Performance data set
df = model.import_example(data='student')

print(df)
#     school sex  age address famsize Pstatus  ...  Walc  health absences
# 0       GP   F   18       U     GT3       A  ...     1       3        4
# 1       GP   F   17       U     GT3       T  ...     1       3        2
# 2       GP   F   15       U     LE3       T  ...     3       3        6
# 3       GP   F   15       U     GT3       T  ...     1       5        0  
# 4       GP   F   16       U     GT3       T  ...     2       5        0  
# ..     ...  ..  ...     ...     ...     ...  ...   ...     ...      ...  
# 644     MS   F   19       R     GT3       T  ...     2       5        4  
# 645     MS   F   18       U     LE3       T  ...     1       1        4  
# 646     MS   F   18       U     GT3       T  ...     1       5        6  
# 647     MS   M   17       U     LE3       T  ...     4       2        6  
# 648     MS   M   18       R     LE3       T  ...     4       5        4  

# [649 rows x 33 columns]

The variables need to be one-hot encoded to make sure the distances between the variables become comparable to each other. This results in 177 columns for 649 samples (see code section below).

# Install onehot encoder
pip install df2onehot

# Initialize
from df2onehot import df2onehot

# One hot encoding
df_hot = df2onehot(df)['onehot']

print(df_hot)
#      school_GP  school_MS  sex_F  sex_M  ...  
# 0         True      False   True  False  ...  
# 1         True      False   True  False  ...  
# 2         True      False   True  False  ...  
# 3         True      False   True  False  ...  
# 4         True      False   True  False  ...  
# ..         ...        ...    ...    ...  ...  
# 644      False       True   True  False  ...  
# 645      False       True   True  False  ...  
# 646      False       True   True  False  ...  
# 647      False       True  False   True  ...  
# 648      False       True  False   True  ...  

# [649 rows x 177 columns]

We can now use the processed one-hot data frame as input for pca and detect outliers. During initialization, we can set normalize=True to normalize the data and we need to specify the outlier detection methods.

# Initialize PCA to also detect outliers.
model = pca(normalize=True,
            detect_outliers=['ht2', 'spe'],
            alpha=0.05,
            n_std=3,
            multipletests='fdr_bh')

# Fit and transform
results = model.fit_transform(df_hot)

# [649 rows x 177 columns]
# [pca] >Processing dataframe..
# [pca] >Normalizing input data per feature (zero mean and unit variance)..
# [pca] >The PCA reduction is performed to capture [95.0%] explained variance using the [177] columns of the input data.
# [pca] >Fit using PCA.
# [pca] >Compute loadings and PCs.
# [pca] >Compute explained variance.
# [pca] >Number of components is [116] that covers the [95.00%] explained variance.
# [pca] >The PCA reduction is performed on the [177] columns of the input dataframe.
# [pca] >Fit using PCA.
# [pca] >Compute loadings and PCs.
# [pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[116]
# [pca] >Multiple test correction applied for Hotelling T2 test: [fdr_bh]
# [pca] >Outlier detection using SPE/DmodX with n_std=[3]
# [pca] >Plot PC1 vs PC2 with loadings.

# Overlapping outliers between both methods
overlapping_outliers = np.logical_and(results['outliers']['y_bool'],
                                      results['outliers']['y_bool_spe'])

# Show overlapping outliers
df.loc[overlapping_outliers]

#     school sex  age address famsize Pstatus  ...  Walc  health absences 
# 279     GP   M   22       U     GT3       T  ...     5       1       12  
# 284     GP   M   18       U     GT3       T  ...     5       5        4 
# 523     MS   M   18       U     LE3       T  ...     5       5        2 
# 605     MS   F   19       U     GT3       T  ...     3       2        0 
# 610     MS   F   19       R     GT3       A  ...     4       1        0 

# [5 rows x 33 columns]

The Hotelling T2 test detected 85 outliers and the SPE/DmodX method detected 6 outliers (Figure 4, see legend). The number of outliers that overlap between both methods is 5. We can make a plot with the biplot functionality and color the samples by any category for further investigation (such as the sex label). The outliers are marked with x or *. This is a good starting point for deeper inspection; in our case, we can see in Figure 4 that the 5 outliers drift away from all other samples. We can rank the outliers, look at the loadings, and investigate these students in more detail (see the previous code section). To rank the outliers, we can use y_proba (lower is better) for the Hotelling T2 method and y_score_spe for the SPE/DmodX method. The latter is the Euclidean distance of the sample to the center (thus larger means more outlying).
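A straightforward way to create such a ranking is to sort the outlier table on those two columns; a minimal sketch, reusing the results object from the fit above:

# Sketch: rank candidate outliers from the results of fit_transform.
outliers = results['outliers']

# Hotelling's T2: a lower y_proba indicates a stronger outlier
rank_ht2 = outliers.sort_values('y_proba', ascending=True)

# SPE/DmodX: a larger y_score_spe indicates a stronger outlier
rank_spe = outliers.sort_values('y_score_spe', ascending=False)

print(rank_ht2.head(5))
print(rank_spe.head(5))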

# Make biplot
model.biplot(SPE=True,
             HT2=True,
             n_feat=10,
             legend=True,
             labels=df['sex'],
             title='Student Performance',
             figsize=(20, 12),
             color_arrow='k',
             arrowdict={'fontsize':16, 'c':'k'},
             cmap='bwr_r',
             gradient='#FFFFFF',
             edgecolor='#FFFFFF',
             density=True,
             )
Figure 4. Outliers detected using the SPE/DmodX method are depicted with diamonds. Outliers detected using the Hotelling T2 method are depicted with crosses. (image by the author)

Final words.

I demonstrated how to use PCA for multivariate outlier detection for both continuous and categorical variables. With the pca library, we can use Hotelling’s T2 and/or the SPE/DmodX method to determine candidate outliers. The contribution of each variable to the principal components can be retrieved using the loadings and visualized with the biplot in the low-dimensional PC space. Such visual insights help build intuition about the detected outliers and whether they require follow-up analysis. In general, the detection of outliers can be challenging because determining what is considered normal can be subjective and vary depending on the specific application.

Be Safe. Stay Frosty.

Cheers E.


If you find this article about outlier detection helpful, follow me to stay up-to-date with my latest content! Support this content using my referral link which will give you unlimited learning and reading with the Medium membership.




References

  1. E. Taskesen, How to Find the Best Theoretical Distribution for Your Data, Medium, Towards Data Science, February 2023
  2. Scikit-learn, Outlier Detection
  3. E. Taskesen, What are PCA loadings and Biplots?, Medium, Towards Data Science, April 2022
  4. Wine Data Set, https://archive-beta.ics.uci.edu/dataset/109/wine
  5. P. Cortez and A. Silva, Using Data Mining to Predict Secondary School Student Performance, ISBN 978-9077381-39-7
