
Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction while preserving relevant information. Because of its sensitivity to variance in the data, it can also be used to detect outliers in multivariate datasets. Outlier detection can provide early warning signals for abnormal conditions, allowing experts to identify and address issues before they escalate. However, detecting outliers in multivariate datasets is challenging due to the high dimensionality and the lack of labels. PCA offers several advantages for this task. In this blog, I will describe the concepts of outlier detection using PCA and, with a hands-on example, demonstrate how to create an unsupervised model for the detection of outliers in continuous and, separately, categorical data sets.
Outlier Detection.
Outliers can be modeled with either a univariate or a multivariate approach (Figure 1). In the univariate approach, outliers are detected one variable at a time, for which data distribution analysis is well suited. Read more details about univariate outlier detection in the following blog post [1]:
Outlier Detection Using Distribution Fitting in Univariate Datasets
The multivariate approach uses multiple features and can therefore detect outliers with (non-)linear relationships or skewed distributions. The scikit-learn library offers multiple solutions for multivariate outlier detection, such as the one-class support vector machine, isolation forest, and local outlier factor [2]. In this blog, I will focus on multivariate outlier detection using Principal Component Analysis [3], which has its own advantages, such as explainability: the outliers can be visualized because we rely on the dimensionality reduction of PCA itself.
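For orientation, a minimal sketch of one such scikit-learn detector is shown below; the random data and the contamination value are purely illustrative assumptions and not part of the example that follows.
# Minimal sketch: isolation forest as a multivariate detector (illustrative data).
import numpy as np
from sklearn.ensemble import IsolationForest
# Hypothetical data: 200 samples with 5 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# Fit the detector; predict() returns -1 for outliers and 1 for inliers.
clf = IsolationForest(contamination=0.05, random_state=0).fit(X)
print('Number of flagged outliers:', (clf.predict(X) == -1).sum())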

Anomalies vs. Novelties
Anomalies and novelties are both deviant observations from standard or expected behavior and are often referred to as outliers. There are some differences, though: anomalies are deviations of a kind that has been seen before, and anomaly detection is typically used for detecting fraud, intrusion, or malfunction. Novelties are deviations that have not been seen before and are used to identify new patterns or events. In such cases, it is important to use domain knowledge. Both anomalies and novelties can be challenging to detect, as the definition of what is normal or expected can be subjective and vary based on the application.
Principal Component Analysis for Outlier Detection.
Principal Component Analysis (PCA) is a linear transformation that reduces the dimensionality by searching for the directions in the data with the largest variance. Due to the nature of the method, it is sensitive to variables with different value ranges and, thus, also to outliers. An advantage is that it allows visualization of the data in a two- or three-dimensional scatter plot, making it easier to visually confirm the detected outliers. Furthermore, it provides good interpretability of the response variables. Another great advantage of PCA is that it can be combined with other methods, such as different distance metrics, to improve the accuracy of the outlier detection. Here I will use the pca library, which contains two methods for the detection of outliers: Hotelling's T2 and SPE/DmodX. For more details, read the blog post about Principal Component Analysis and the pca library [3].
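To build some intuition for the first method, the sketch below computes a Hotelling's T2-style statistic by hand on random data: each PC score is squared, scaled by the variance of its component, summed per sample, and compared against a chi-square threshold. This is only a conceptual sketch with assumed data and alpha, not the exact implementation of the pca library.
# Conceptual sketch of Hotelling's T2 on PC scores (assumed data; not the pca library code).
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA
# Hypothetical data: 300 samples, 10 standardized features.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
# Project onto the first k principal components.
k = 5
scores = PCA(n_components=k).fit_transform(X)
# T2 per sample: squared scores scaled by the variance of each component, then summed.
t2 = np.sum(scores ** 2 / scores.var(axis=0, ddof=1), axis=1)
# Compare against a chi-square threshold with k degrees of freedom (alpha=0.05 assumed).
print(np.where(t2 > chi2.ppf(0.95, df=k))[0])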
Outlier Detection for Continuous Random Variables.
Let's start with an example to demonstrate how outlier detection using Hotelling's T2 and SPE/DmodX works for continuous random variables. I will use the wine dataset from sklearn, which contains 178 samples with 13 features and 3 wine classes [4].
# Installation of the pca library
pip install pca
# Load other libraries
from sklearn.datasets import load_wine
import pandas as pd
# Load dataset
data = load_wine()
# Make dataframe
df = pd.DataFrame(index=data.target, data=data.data, columns=data.feature_names)
print(df)
# alcohol malic_acid ash ... hue ..._wines proline
# 0 14.23 1.71 2.43 ... 1.04 3.92 1065.0
# 0 13.20 1.78 2.14 ... 1.05 3.40 1050.0
# 0 13.16 2.36 2.67 ... 1.03 3.17 1185.0
# 0 14.37 1.95 2.50 ... 0.86 3.45 1480.0
# 0 13.24 2.59 2.87 ... 1.04 2.93 735.0
# .. ... ... ... ... ... ...
# 2 13.71 5.65 2.45 ... 0.64 1.74 740.0
# 2 13.40 3.91 2.48 ... 0.70 1.56 750.0
# 2 13.27 4.28 2.26 ... 0.59 1.56 835.0
# 2 13.17 2.59 2.37 ... 0.60 1.62 840.0
# 2 14.13 4.10 2.74 ... 0.61 1.60 560.0
#
# [178 rows x 13 columns]
We can see in the data frame that the value range per feature differs heavily, and a normalization step is therefore important. Normalization is a built-in functionality of the pca library that can be enabled by setting normalize=True.
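For reference, normalize=True roughly corresponds to standardizing every feature to zero mean and unit variance; a hand-rolled equivalent with scikit-learn is sketched below (not needed in practice, since the pca library does this internally).
# Rough equivalent of normalize=True: zero mean and unit variance per feature.
import pandas as pd
from sklearn.preprocessing import StandardScaler
df_scaled = pd.DataFrame(StandardScaler().fit_transform(df), index=df.index, columns=df.columns)
print(df_scaled.describe().round(2))  # means around 0 and standard deviations around 1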
During the initialization, we can specify the outlier detection methods separately: ht2 for Hotelling's T2 and spe for the SPE/DmodX method.
# Import library
from pca import pca
# Initialize pca to also detect outliers.
model = pca(normalize=True, detect_outliers=['ht2', 'spe'], n_std=2)
# Fit and transform
results = model.fit_transform(df)
After running the fit function, the pca library scores sample-wise whether a sample is an outlier. For each sample, multiple statistics are collected, as shown in the code section below. The first four columns in the data frame (y_proba, p_raw, y_score, and y_bool) come from Hotelling's T2 method. The latter two columns (y_bool_spe and y_score_spe) are based on the SPE/DmodX method.
# Print outliers
print(results['outliers'])
# y_proba p_raw y_score y_bool y_bool_spe y_score_spe
#0 0.982875 0.376726 21.351215 False False 3.617239
#0 0.982875 0.624371 17.438087 False False 2.234477
#0 0.982875 0.589438 17.969195 False False 2.719789
#0 0.982875 0.134454 27.028857 False False 4.659735
#0 0.982875 0.883264 12.861094 False False 1.332104
#.. ... ... ... ... ... ...
#2 0.982875 0.147396 26.583414 False False 4.033903
#2 0.982875 0.771408 15.087004 False False 3.139750
#2 0.982875 0.244157 23.959708 False False 3.846217
#2 0.982875 0.333600 22.128104 False False 3.312952
#2 0.982875 0.138437 26.888278 False False 4.238283
# [178 rows x 6 columns]
Hotelling's T2 computes chi-square tests and P-values across the top n_components, which allows the ranking of outliers from strong to weak using y_proba (a small ranking sketch follows the plotting code below). Note that the search space for outliers spans the dimensions PC1 to PC5, as it is expected that the highest variance (and thus the outliers) is captured in the first few components. The depth can optionally be increased in case the variance is poorly captured in the first five components. Let's plot the outliers and mark them for the wine dataset (Figure 2).
# Plot Hotellings T2
model.biplot(SPE=False, HT2=True, density=True, title='Outliers marked using Hotellings T2 method.')
# Make a plot in 3 dimensions
model.biplot3d(SPE=False, HT2=True, density=True, arrowdict={'scale_factor': 2.5, 'fontsize': 20}, title='Outliers marked using Hotellings T2 method.')
# Get the outliers using Hotelling's T2 method.
df.loc[results['outliers']['y_bool'], :]
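Because y_proba ranks the outliers from strong to weak, we can also sort the results table directly; a small sketch using only the columns shown above:
# Rank the samples from strongest to weakest outlier (lower y_proba = stronger outlier).
ranked = results['outliers'].sort_values('y_proba')
print(ranked.head(10))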

The SPE/DmodX method measures the distance between the actual observation and its projection onto the principal components. The distance to the center is expressed by the Hotelling's T2 values, and the ellipse in Figure 3 therefore represents the boundary beyond which a sample is outlying with respect to Hotelling's T2. With the SPE/DmodX method, a sample is flagged as an outlier based on the mean and covariance of the first two PCs (Figure 3); in other words, when it lies outside the ellipse.
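Conceptually, the SPE of a sample is the squared residual between the observation and its reconstruction from the retained components. The snippet below illustrates that idea by hand with scikit-learn; the two components and the mean-plus-three-standard-deviations cutoff are assumptions for illustration, not the pca library's exact DmodX computation.
# Conceptual SPE: squared residual between a sample and its PCA reconstruction.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(df)   # df is the wine data frame from above
pca_sklearn = PCA(n_components=2).fit(X)
X_hat = pca_sklearn.inverse_transform(pca_sklearn.transform(X))
spe = np.sum((X - X_hat) ** 2, axis=1)
# Flag samples whose residual exceeds mean + 3 standard deviations (assumed cutoff).
print(np.where(spe > spe.mean() + 3 * spe.std())[0])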
# Plot SPE/DmodX method
model.biplot(SPE=True, HT2=True, title='Outliers marked using SPE/dmodX method and Hotelling T2.')
# Make a plot in 3 dimensions
model.biplot3d(SPE=True, HT2=True, title='Outliers marked using SPE/dmodX method and Hotelling T2.')
# Get the outliers using SPE/DmodX method.
df.loc[results['outliers']['y_bool_spe'], :]


Using the results of both methods, we can now also compute the overlap. In this use case, there are 5 outliers that overlap (see code section below).
# Import library
import numpy as np
# Grab overlapping outliers
I_overlap = np.logical_and(results['outliers']['y_bool'], results['outliers']['y_bool_spe'])
# Print overlapping outliers
df.loc[I_overlap, :]
Outlier Detection for Categorical Variables.
For the detection of outliers in categorical variables, we first need to discretize the categorical variables and make the distances comparable to each other. With the discretized (one-hot encoded) data set, we can proceed with the PCA approach and apply Hotelling's T2 and the SPE/DmodX method. For demonstration purposes, I will use the Student Performance data set [5], which contains 649 samples and 33 variables. We will import the data set as shown in the code section below; more details about the column descriptions can be found in [5]. I will not remove any columns, but if there were an identifier column or variables of floating type, I would have removed them or categorized them into discrete bins (a small illustration of such binning follows).
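As a small illustration of that last point, a floating-point column could be binned into discrete categories before one-hot encoding; the column name and bin edges below are hypothetical.
# Hypothetical example: bin a floating-point column into discrete categories.
import pandas as pd
df_tmp = pd.DataFrame({'grade_avg': [9.5, 12.0, 14.3, 17.8, 6.2]})  # assumed values
df_tmp['grade_avg_bin'] = pd.cut(df_tmp['grade_avg'], bins=[0, 10, 15, 20], labels=['low', 'mid', 'high'])
print(df_tmp)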
# Import library
from pca import pca
# Initialize
model = pca()
# Load Student Performance data set
df = model.import_example(data='student')
print(df)
# school sex age address famsize Pstatus ... Walc health absences
# 0 GP F 18 U GT3 A ... 1 3 4
# 1 GP F 17 U GT3 T ... 1 3 2
# 2 GP F 15 U LE3 T ... 3 3 6
# 3 GP F 15 U GT3 T ... 1 5 0
# 4 GP F 16 U GT3 T ... 2 5 0
# .. ... .. ... ... ... ... ... ... ... ...
# 644 MS F 19 R GT3 T ... 2 5 4
# 645 MS F 18 U LE3 T ... 1 1 4
# 646 MS F 18 U GT3 T ... 1 5 6
# 647 MS M 17 U LE3 T ... 4 2 6
# 648 MS M 18 R LE3 T ... 4 5 4
# [649 rows x 33 columns]
The variables need to be one-hot encoded to make sure the distances between the variables become comparable to each other. This results in 177 columns for 649 samples (see code section below).
# Install onehot encoder
pip install df2onehot
# Initialize
from df2onehot import df2onehot
# One hot encoding
df_hot = df2onehot(df)['onehot']
print(df_hot)
# school_GP school_MS sex_F sex_M ...
# 0 True False True False ...
# 1 True False True False ...
# 2 True False True False ...
# 3 True False True False ...
# 4 True False True False ...
# .. ... ... ... ... ...
# 644 False True True False ...
# 645 False True True False ...
# 646 False True True False ...
# 647 False True False True ...
# 648 False True False True ...
# [649 rows x 177 columns]
We can now use the processed one-hot data frame as input for pca and detect outliers. During initialization, we can set normalize=True to normalize the data, and we need to specify the outlier detection methods.
# Initialize pca to also detect outliers.
model = pca(normalize=True,
            detect_outliers=['ht2', 'spe'],
            alpha=0.05,
            n_std=3,
            multipletests='fdr_bh')
# Fit and transform
results = model.fit_transform(df_hot)
# [649 rows x 177 columns]
# [pca] >Processing dataframe..
# [pca] >Normalizing input data per feature (zero mean and unit variance)..
# [pca] >The PCA reduction is performed to capture [95.0%] explained variance using the [177] columns of the input data.
# [pca] >Fit using PCA.
# [pca] >Compute loadings and PCs.
# [pca] >Compute explained variance.
# [pca] >Number of components is [116] that covers the [95.00%] explained variance.
# [pca] >The PCA reduction is performed on the [177] columns of the input dataframe.
# [pca] >Fit using PCA.
# [pca] >Compute loadings and PCs.
# [pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[116]
# [pca] >Multiple test correction applied for Hotelling T2 test: [fdr_bh]
# [pca] >Outlier detection using SPE/DmodX with n_std=[3]
# [pca] >Plot PC1 vs PC2 with loadings.
# Overlapping outliers between both methods
overlapping_outliers = np.logical_and(results['outliers']['y_bool'],
                                      results['outliers']['y_bool_spe'])
# Show overlapping outliers
df.loc[overlapping_outliers]
# school sex age address famsize Pstatus ... Walc health absences
# 279 GP M 22 U GT3 T ... 5 1 12
# 284 GP M 18 U GT3 T ... 5 5 4
# 523 MS M 18 U LE3 T ... 5 5 2
# 605 MS F 19 U GT3 T ... 3 2 0
# 610 MS F 19 R GT3 A ... 4 1 0
# [5 rows x 33 columns]
The Hotelling's T2 test detected 85 outliers and the SPE/DmodX method detected 6 outliers (Figure 4, see legend). The number of outliers that overlap between both methods is 5. We can make a plot with the biplot functionality and color the samples by any category for further investigation (such as the sex label). The outliers are marked with x or *. This is a good starting point for a deeper inspection; in our case, we can see in Figure 4 that the 5 outliers drift away from all other samples. We can rank the outliers, look at the loadings, and investigate these students in more detail (see the previous code section). To rank the outliers, we can use y_proba (lower is better) for the Hotelling's T2 method and y_score_spe for the SPE/DmodX method; the latter is the Euclidean distance of the sample to the center, so a larger distance indicates a stronger outlier (a small ranking sketch follows the plotting code below).
# Make biplot
model.biplot(SPE=True,
             HT2=True,
             n_feat=10,
             legend=True,
             labels=df['sex'],
             title='Student Performance',
             figsize=(20, 12),
             color_arrow='k',
             arrowdict={'fontsize': 16, 'c': 'k'},
             cmap='bwr_r',
             gradient='#FFFFFF',
             edgecolor='#FFFFFF',
             density=True,
             )
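A small sketch of such a ranking, using only the columns shown earlier and assuming the index of the outliers table matches the input data frame (as in the printed output above):
# Rank candidate outliers: low y_proba (Hotelling's T2), high y_score_spe (SPE/DmodX).
outlier_stats = results['outliers']
ranked_ht2 = outlier_stats.sort_values('y_proba')                       # strongest first
ranked_spe = outlier_stats.sort_values('y_score_spe', ascending=False)  # largest distance first
# Inspect the top-ranked students in the original data frame.
print(df.loc[ranked_spe.index[0:5]])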

Final words.
I demonstrated how to use PCA for multivariate outlier detection for both continuous and categorical variables. With the pca library, we can use Hotelling's T2 and/or the SPE/DmodX method to determine candidate outliers. The contribution of each variable to the principal components can be retrieved using the loadings and visualized with the biplot in the low-dimensional PC space. Such visual insights help to build intuition about the detected outliers and whether they require follow-up analysis. In general, the detection of outliers can be challenging because determining what is considered normal can be subjective and vary depending on the specific application.
Be Safe. Stay Frosty.
Cheers E.
References
[1] E. Taskesen, How to Find the Best Theoretical Distribution for Your Data, Medium, Towards Data Science, February 2023.
[2] Scikit-learn, Outlier Detection.
[3] E. Taskesen, What are PCA loadings and Biplots?, Medium, Towards Data Science, April 2022.
[4] Wine Data Set, https://archive-beta.ics.uci.edu/dataset/109/wine
[5] P. Cortez and A. Silva, Using Data Mining to Predict Secondary School Student Performance, ISBN 978-9077381-39-7.