Principal Component Analysis (PCA): A Physically Intuitive Mathematical Introduction

Lydia Nemec
Towards Data Science
9 min read · Jan 16, 2022


The principal component analysis (PCA) involves rotating a cloud of data points in Euclidean space such that the variance is maximal along the first axis, the so-called first principal component. The principal axis theorem ensures that the data can be rotated in such a way. In mathematical terms, the PCA involves finding an orthogonal linear coordinate transformation or, more generally, a new basis.

The same mathematics appears in the description of rotations of rigid bodies. This physical interpretation is instructive for understanding the PCA.

This blog article is available as a Julia notebook on GitHub.

Define parameters to generate sample data

First, we will generate a cloud of N randomly distributed data points in Euclidean space ℝⁿ with coordinates { x⃗⁽¹⁾, x⃗⁽²⁾,… , x⃗⁽ᴺ⁾ } = X. We will demonstrate the concept with 3-dimensional data, where n = 3 and X ⊂ ℝ³, with basis vectors e⃗₁, e⃗₂ and e⃗₃ centered around the origin (0, 0, 0). For simplicity, we will set all coordinates along the basis vector e⃗₃ to zero. This allows us to visualise the data in the plane spanned by e⃗₁ and e⃗₂.

In our example, we create a dataset with N = 820 points. The variance of the data is set by R1 = 4.0 and R2 = 8.5, and the data is rotated in space by an angle of 35.0°.
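The full data-generation code is in the linked notebook; below is a minimal Julia sketch of how such a cloud could be produced. The seed, the interpretation of R1 and R2 as the spread along the two in-plane directions, and all variable names are illustrative assumptions, not the notebook's exact code.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(42)                       # illustrative seed for reproducibility

N      = 820                           # number of points
R1, R2 = 4.0, 8.5                      # spread along the two in-plane directions (assumption)
angle  = 35.0                          # rotation angle in degrees

# Draw normally distributed coordinates; the third coordinate is zero by construction.
X0 = [R1 .* randn(N)  R2 .* randn(N)  zeros(N)]   # N×3 matrix, one point per row

# Rotate the cloud in the e1-e2 plane by `angle` degrees.
θ  = deg2rad(angle)
Rz = [cos(θ) -sin(θ) 0.0;
      sin(θ)  cos(θ) 0.0;
      0.0     0.0    1.0]
X = X0 * Rz'                           # rotated data, still N×3

X .-= mean(X, dims=1)                  # center the cloud around the origin
```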

Figure 1. a) Histogram of the random points distributed along e⃗₂. b) Cloud of randomly distributed data points in Euclidean space ℝ³ with coordinates { x⃗⁽¹⁾, x⃗⁽²⁾,… , x⃗⁽ᴺ⁾ } = X and the basis vectors e⃗₁ and e⃗₂ (the axes in plot b)). The grey lines in plot b) indicate the principal axes u⃗₁ and u⃗₂. c) Histogram of the random points distributed along e⃗₁. All values along e⃗₃ are equal to zero.

Moment of Inertia

The moment of inertia J of a rigid body, also called rotational inertia, determines the torque required for a desired angular acceleration around an axis of rotation. It depends on the mass distribution of the body and the selected axis of rotation. A body with a larger moment of inertia J requires more torque to change the body's rate of rotation. For the same rigid body, different axes of rotation will have different moments of inertia associated with them.

All moments of inertia of a rigid body can be summarised by a matrix. In general, it can be determined with respect to any point in space. For simplicity, we will calculate the moment of inertia with respect to the center of mass.

The principal axes of the body, also known as figure axes, and the principal moments of inertia can be found by rotating the cloud of point masses. In mathematical terms, this again amounts to finding an orthogonal linear coordinate transformation or, more generally, a new basis, just as in the PCA.

The figure axis corresponding to the largest principal moment of inertia is the area vector of the plane with the maximal spread of mass points, i.e. it is normal to that plane.

Figure 2. The six cylinders have the same mass but different moments of inertia J. As they roll down the slope, cylinders with lower moments of inertia accelerate more quickly. (Image taken from Wikipedia)

Visual Comparison of the Principal Axis in PCA and the Moment of Inertia

In the following, we will interpret the set of random data points from above in two ways. First, we interpret the data points X as statistically distributed data with covariance matrix C. Second, we interpret X as a set of point masses forming a rigid body with moment of inertia matrix J.

Figure 3. a) PCA: The first principal component u⃗₁ lies along the axis where the variance is maximal, indicated by a red arrow. b) Moment of inertia: The figure axis u⃗₂ corresponding to the (actually second) largest principal moment of inertia, indicated by a green arrow.

With the visual support of Figures 1 and 3, we expect the principal axes of the PCA and of the moment of inertia to be the same. However, the values of the largest principal component and of the largest principal moment of inertia will differ for most sets of data points.

Note: In physics, the moment of inertia is defined for a 3-dimensional rigid body. For simplicity, we projected the data into the plane spanned by e⃗₁ and e⃗₂. In our example, the plane with the maximal spread of mass points is then spanned by e⃗₁ and e⃗₂. The principal axis corresponding to the largest moment points out of the plane along e⃗₃ and is orthogonal to e⃗₁ and e⃗₂.

The line along u⃗₁ is equivalent to the direction where the variance is maximal. In the following, we will support our visual understanding by exploring the PCA and moment of inertia mathematically.

Definition of the Moment of Inertia of a Rigid Body

For a rigid object of N point masses mᵢ in ℝⁿ, the components of the moment of inertia matrix J are defined by Eq. (1) as

$$ J_{j,j'} \;=\; \frac{1}{M} \sum_{i=1}^{N} m_i \left( \lVert \vec{x}^{(i)} \rVert^{2}\, \delta_{j,j'} \;-\; x^{(i)}_{j}\, x^{(i)}_{j'} \right) $$

Equation (1)

where δⱼ,ⱼ’ is the Kronecker delta and M = ∑ᵢᴺ mᵢ is the total mass.

Note: Here, we normalise the moment of inertia by the total mass. In physics, the moment of inertia would not commonly be normalised like this.

Looking closer, we see that J is symmetric, with Jⱼ,ⱼ’ = Jⱼ’,ⱼ. The spectral theorem tells us that J has real eigenvalues λ and is diagonalisable by an orthogonal matrix (it is orthogonally diagonalisable).
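As a concrete reference for Eq. (1), here is a minimal Julia sketch of the mass-normalised moment of inertia matrix. The function name and the default choice of unit masses are assumptions for illustration.

```julia
using LinearAlgebra

# Moment of inertia matrix of Eq. (1), normalised by the total mass.
# `X` holds one point per row (N×n); `m` is a vector of N point masses.
function inertia_matrix(X::AbstractMatrix, m::AbstractVector = ones(size(X, 1)))
    N, n = size(X)
    M = sum(m)                                       # total mass
    J = zeros(n, n)
    for i in 1:N
        x = X[i, :]
        J .+= m[i] .* (dot(x, x) .* I(n) .- x * x')  # |x|² δ_{jj'} − x_j x_{j'}
    end
    return J ./ M                                    # normalise by the total mass
end
```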

Definition of the Covariance Matrix

The covariance matrix C for a cloud of points in Euclidean space centered around the mean has components defined by Eq. (2) as

$$ C_{j,j'} \;=\; \frac{1}{N} \sum_{i=1}^{N} x^{(i)}_{j}\, x^{(i)}_{j'} $$

Equation (2)

Here we normalise by N; together with unit point masses in Eq. (1), this makes the relation between C and J derived below exact.
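A matching sketch for Eq. (2), assuming the data in X is already centered around the mean; the function name is illustrative.

```julia
# Covariance matrix of Eq. (2) for mean-centered data with one point per row of `X`,
# normalised by N to match Eq. (1) with unit point masses.
covariance_matrix(X::AbstractMatrix) = (X' * X) ./ size(X, 1)
```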

Solving the Eigenvalue Problem

The principal axes of the PCA and moment of inertia can be determined by rotating the data points in space. More precisely, the principal components and axes are calculated by solving an eigenvalue problem.

A real symmetric matrix (like C and J) has an eigendecomposition into the product of a rotation matrix R and a diagonal matrix Λ, given by J = R Λ Rᵀ. The columns of the rotation matrix R define the directions of the principal axes, and the constants λ₁, …, λₙ are the diagonal elements of the matrix Λ and are called the principal moments.

The structure of the matrices J and C is closely related: their off-diagonal elements differ only in sign. We will see in the following that the eigenvectors are the same for C and J. In addition, we will see how the eigenvalues of C relate to the eigenvalues of J.

Showing the Equality of the Eigenvectors of C and J

Let’s rewrite the moment of inertia matrix J defined by Eq. (1) in terms of the covariance matrix C in Eq. (2).

$$ J \;=\; \operatorname{Tr}(C)\, I \;-\; C $$

Equation 3

where I is the identity matrix.
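With the two sketches above (unit masses, normalisation by N), Eq. (3) can be checked numerically on the sample data; the tolerance below is an arbitrary choice.

```julia
using LinearAlgebra

C = covariance_matrix(X)          # from the sketch above
J = inertia_matrix(X)             # unit point masses assumed

# Eq. (3): J = Tr(C)·I − C, checked here up to floating-point round-off.
@assert isapprox(J, tr(C) * I(size(C, 1)) - C; rtol = 1e-10)
```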

To obtain the eigenvectors and eigenvalues of C, we solve the eigenvalue problem by decomposing C into the product of a rotation matrix R and a diagonal matrix Λₒᵥ,

$$ C \;=\; R\, \Lambda_{\mathrm{cov}}\, R^{\mathsf{T}} $$

where R is composed of the eigenvectors v⃗ʲ. In the case of a 3-dimensional space, R = [ v⃗¹ v⃗² v⃗³ ] and

$$ \Lambda_{\mathrm{cov}} \;=\; \operatorname{diag}\!\left( \lambda^{1}_{\mathrm{cov}},\, \lambda^{2}_{\mathrm{cov}},\, \lambda^{3}_{\mathrm{cov}} \right) $$

The jᵗʰ eigenvector v⃗ʲ and jᵗʰ eigenvalue λʲₒᵥ of the covariance matrix C satisfy

$$ C\, \vec{v}^{\,j} \;=\; \lambda^{j}_{\mathrm{cov}}\, \vec{v}^{\,j} $$

In the following, we will drop the index j and write this as C v⃗ = λₒᵥ v⃗. Multiplying Eq. (3) by the eigenvector v⃗ from the right, we find Eq. (4).

$$ J\, \vec{v} \;=\; \bigl( \operatorname{Tr}(C)\, I - C \bigr)\, \vec{v} \;=\; \bigl( \operatorname{Tr}(C) - \lambda_{\mathrm{cov}} \bigr)\, \vec{v} $$

Equation 4

Eq. (4) implies that C and J have the same eigenvectors v⃗.

Calculating the Eigenvalues of the Moment of Inertia Matrix

While the eigenvectors of C and J are the same, the eigenvalues are not. To relate them, we note that

$$ \operatorname{Tr}(C) \;=\; \operatorname{Tr}\!\bigl( R\, \Lambda_{\mathrm{cov}}\, R^{\mathsf{T}} \bigr) \;=\; \operatorname{Tr}\!\bigl( \Lambda_{\mathrm{cov}}\, R^{\mathsf{T}} R \bigr) \;=\; \operatorname{Tr}\!\bigl( \Lambda_{\mathrm{cov}} \bigr) \;=\; \sum_{j} \lambda^{j}_{\mathrm{cov}} $$

Equation 5

where we have used that the trace of a matrix is invariant under cyclic permutations and that R is orthogonal, so RᵀR = I.

Using Eq. (4) and Eq. (5), we can write the eigenvalues λ_J of the moment of inertia J (Eq. (1)) as

$$ \lambda^{k}_{J} \;=\; \operatorname{Tr}(C) - \lambda^{k}_{\mathrm{cov}} \;=\; \sum_{j} \lambda^{j}_{\mathrm{cov}} - \lambda^{k}_{\mathrm{cov}} \;=\; \sum_{j \neq k} \lambda^{j}_{\mathrm{cov}} $$

Equation 6

We see that the kᵗʰ eigenvalue λᵏ_J of J is the sum of all eigenvalues λₒᵥ of C except the kᵗʰ one.

Physically, we can gain an intuitive understanding of the relation between these two sets of eigenvalues: the covariance eigenvalue along, say, the axis v⃗¹ is determined by the components of the data points along that direction, while the moment of inertia for a rotation around the axis v⃗¹ is determined by the Euclidean distance of the data points from that axis.

Calculating Eigenvalues and Eigenvectors Using the Above Dataset X

Next, we calculate the covariance matrix elements, eigenvalues, and eigenvectors for the dataset X. For the covariance matrix C, we find the eigenvalues λₒᵥ and the corresponding eigenvectors v⃗ʲ. Using Eq. (6), we can then calculate the eigenvalues λ_J of J directly from the eigenvalues of C.
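In Julia, this step might look as follows; the concrete numbers depend on the random draw, so they are not reproduced here, and the variable names are illustrative.

```julia
using LinearAlgebra

C = covariance_matrix(X)              # covariance matrix of the centered data
λ_cov, V = eigen(Symmetric(C))        # eigenvalues (ascending) and eigenvectors (columns of V)

# Eigenvalues of J predicted from Eq. (6): λ_J^k = Tr(C) − λ_cov^k
λ_J_from_C = tr(C) .- λ_cov
```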

Calculating the Eigenvalues and Eigenvectors of J

Now, we calculate the matrix elements of the moment of inertia matrix J of the dataset X directly and determine its eigenvalues and eigenvectors.
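The corresponding sketch for the moment of inertia matrix, reusing the helper function from above:

```julia
using LinearAlgebra

J = inertia_matrix(X)                 # moment of inertia matrix (unit masses assumed)
λ_J, U = eigen(Symmetric(J))          # its eigenvalues and eigenvectors

# The columns of U and V agree up to sign and ordering, and λ_J matches
# tr(C) .- λ_cov from Eq. (6) once the eigenvalues are sorted consistently.
```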

In both interpretations of the cloud of data points — first as a cloud of points centered around the mean and second as a rigid body rotating around the center of mass — we obtain the same eigenvectors.

Figure 4. Cloud of randomly distributed data points X. Overlaid are the scaled eigenvectors (shown as red and green arrows).

Next, we use the eigenvectors to rotate the data, i.e. to represent the dataset X in the new basis spanned by the principal axes.
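A minimal sketch of this change of basis, reusing the eigenvalues λ_cov and eigenvectors V computed above; sorting the axes by decreasing variance is a convention choice so that the first column corresponds to u⃗₁.

```julia
# Sort the principal axes by decreasing variance so that the first column is u1.
order = sortperm(λ_cov; rev = true)
U_pca = V[:, order]

# Express each data point in the basis of principal axes [u1 u2 u3].
X_new = X * U_pca                     # N×3 coordinates in the new basis
```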

Figure 5. Cloud of randomly distributed data points X represented in the basis spanned by the vectors [u⃗₁ u⃗₂ u⃗₃] . The grey arrows indicate the old basis vectors e⃗₁ and e⃗₂ in this new basis.

Final Thoughts

In machine learning and data science, PCA is used for two reasons.
First, the accuracy and numeric stability of some machine learning algorithms are sensitive to correlated input data. In particular, algorithms that perform an inversion of the covariance matrix may experience the singularity problem (Gaussian Mixture Models come to mind). A different example is the application of random forest algorithms to detect interactions between features, where large correlations can mask these interactions. Performing a PCA first allows us to tease out the effect of correlations, which can improve feature importance analysis.

Second, PCA is used to reduce the dimensionality of a dataset, e.g. for data compression. In our example, we used a 3-dimensional dataset X, but the e⃗₃ component did not carry any information (by construction). We could have used a PCA to justify dropping the third dimension, since a PCA would have shown that the variance is minimal in the e⃗₃ direction. Projecting a higher-dimensional dataset onto a lower-dimensional space like this is a powerful tool to handle high-dimensional data and deal with the curse of dimensionality.
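Dropping the low-variance directions then amounts to keeping only the leading columns of the sorted eigenvector matrix from the sketch above; k and the variable names are illustrative.

```julia
k = 2                                 # keep the two directions with the largest variance
X_reduced = X * U_pca[:, 1:k]         # N×2 representation; the (empty) e3 direction is dropped
```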

Both applications of PCA have been extensively discussed. See recommendations for further reading below.

Further Reading

  1. Bauckhage, Christian & Dong, Tiansi; A Tutorial on Principal Component Analysis Part 1: Motivation (2018)
  2. Bauckhage, Christian & Dong, Tiansi; A Tutorial on Principal Component Analysis Part 2: Principal Axis via Moment of Inertia (2018)
  3. Hong, Liang; The Proof of Equivalence between "Moment of Inertia Based Method" and "PCA Based Method" in LOS Detection; Advanced Materials Research. 945–949. 2071–2074 (2014)
  4. Murphy, Kevin P.; Machine Learning A Probabilistic Perspective, The MIT Press, Chapter 12.2 (2012)
  5. Marsland, Stephen; Machine Learning An Algorithmic Perspective, 2nd Edition, Chapman and Hall/CRC (2014)
  6. Badr, Will; Why Feature Correlation Matters …. A Lot!; Towards Data Science blog (2019)


I am the Head of ZEISS AI Accelerator, with a background in computational physics, numerics, and machine learning, bridging the way from research to innovation.