Dimensionality Reduction For Dummies — Part 3: Connect The Dots

Hussein Abdulrahman
Towards Data Science
7 min read · Nov 1, 2018


This is where everything clicks into place.

Back in Part 1 & Part 2, we saw that our ultimate goal is to find the vectors that define the directions of greatest variance. With the insights we have gathered so far, that process is finally ready to make sense.

I assume you know the basics of linear transformations, eigenvectors, and eigenvalues, since there’s no point in repeating what has been thoroughly established elsewhere. But if you need a refresher, watch this great tutorial.

Let’s dive in.

What Do You Want?

You can’t solve a problem you haven’t defined clearly, which is why we should consider what exactly we mean by finding the directions of greatest variance. It’s only natural to begin by asking ourselves what variance is.

As we saw previously, the variance along the x-axis is the variance of the feature represented by that axis, i.e. the variance of the x-coordinates of the dataset. To generalize this notion to any line in space, a simple observation is in order:

The x-coordinates of the data points are merely their projections on the x-axis.

Using this fact, we can easily find the variance along any line by projecting our points on the unit vector representing the line, and then taking the variance of these projections.

If A is our (n x d) mean-centered data matrix, where n is the number of examples (points) and d is the number of dimensions (features), and v is the unit vector of our line, then, using Floor #1 from Part 2, projecting the points becomes a snap:
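
p = Av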

where p is an (n x 1) column vector whose elements are the projection lengths, one for each point.

Now that we have the lengths in a column vector, we can use Floor #5 to find their variance:
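
σ² = (p^T)p / n = ((Av)^T)(Av) / n = (v^T)((A^T)A / n)v = (v^T)Cv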

Suddenly, the covariance matrix of our data, C, popped up! This extremely curious appearance will serve us greatly. Given our definition of variance along any line, we can reformulate our problem: we want to find the unit vector v for which σ² = (v^T)Cv is maximized. It’s evident from the definition above that this vector depends solely on C, which compels us to study this curious matrix carefully.
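
If you want to see this identity in action, here is a minimal NumPy sketch (the toy data and variable names are my own, not from the article) that compares the variance of the projections with (v^T)Cv:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(500, 2))   # toy 2D dataset
    A = A - A.mean(axis=0)          # mean-center the data

    v = np.array([1.0, 1.0])
    v = v / np.linalg.norm(v)       # unit vector defining a line

    p = A @ v                       # projections of the points onto the line
    C = (A.T @ A) / len(A)          # covariance matrix of the centered data

    print(p.var())                  # variance of the projections ...
    print(v @ C @ v)                # ... equals (v^T)Cv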

Curious Covariance

In the last part we learned that an (n x d) data matrix has a (d x d) covariance matrix, i.e. a 2D dataset has a 2x2 covariance matrix. As you should know, a matrix can be viewed as a linear transformation that maps vectors to other vectors by rotating, shearing, and scaling them. This is no different for our covariance matrix.

The best method to learn about a linear transformation matrix is to look at its eigenvectors — those vectors that aren’t rotated or sheared by the matrix but only scaled by an amount equal to their eigenvalues. If v is an eigenvector and λ is its eigenvalue, then:
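
Cv = λv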

However, the eigenvectors for the covariance matrix have special properties. The covariance matrix, being symmetric, has orthonormal eigenvectors — vectors of unit length and orthogonal to each other (proof). This gives rise to the following identities:
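
(vi^T)vi = 1 for every eigenvector vi, and (vi^T)vj = 0 whenever i ≠ j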

The elegance of eigenvectors comes from a simple fact: because they form a basis, any vector in space can be expressed as a linear combination of the eigenvectors of the matrix. The payoff is that we can now calculate the effect of the linear transformation represented by the matrix C on any vector u purely in terms of the eigenvectors and eigenvalues. This turns the complicated matrix multiplication Cu into simple vector scaling. To see how this is done, notice that a (2 x 2) covariance matrix has two orthonormal eigenvectors, one per dimension. Any vector u can then be expressed in terms of the two eigenvectors v1 and v2 as:
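
u = k1·v1 + k2·v2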

where k1 and k2 are scalars.

So to find the effect of the covariance matrix on any vector u:
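
Cu = C(k1·v1 + k2·v2) = k1·Cv1 + k2·Cv2 = k1·λ1·v1 + k2·λ2·v2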

We can now write the quantity we want to maximize, (u^T)Cu, as:
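
σ² = (u^T)Cu = (k1·v1 + k2·v2)^T(k1·λ1·v1 + k2·λ2·v2) = k1²·λ1 + k2²·λ2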

where we used the identities of orthonormality in the last step.

The Big Surprise

Finding the u that maximizes σ² = (u^T)Cu is a matter of finding k1 and k2, the coefficients of the eigenvectors that define u. But since we are only concerned with unit vectors that define the line of greatest variance, and orthonormality gives |u|² = k1² + k2², the constraint is k1²+k2²=1. Can you now guess how to maximize σ²?

The key is to notice that if λ1>λ2, then the only way to maximize σ², given the constraint k1²+k2²=1, is to assign 1 to k1 and 0 to k2, so that the largest eigenvalue λ1 dominates the sum.

This leads us to the great surprise:

The direction of greatest variance is the eigenvector of the covariance matrix that has the largest absolute eigenvalue.

For if k1=1 and k2=0, then u becomes:
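
u = 1·v1 + 0·v2 = v1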

That is, u, the direction of greatest variance, is precisely the eigenvector v1 with the largest eigenvalue. A direct consequence is that the direction of second-greatest variance corresponds to the eigenvector with the second-largest eigenvalue, and so forth.

Another Big Surprise

What is even more interesting is that if you look at our problem statement:
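
σ² = (v^T)Cv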

and combine it with the fact that v is an eigenvector, then:
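
σ² = (v^T)Cv = (v^T)(λv) = λ(v^T)v = λ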

That is

The variances in the direction of the eigenvectors are their eigenvalues.

Now What?

Alright. We have finally found the vectors we all wanted. Now what?

If you recall from Part 1, we reduced a 2D dataset to a 1D line by projecting the points onto the line of greatest variance. Now that we have all the quantities we need, namely the data matrix A, the covariance matrix C, and its eigenvectors v1 and v2 that define our lines, we can carry out the projection:
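
p = A v1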

Generalizing to higher dimensions is only a matter of augmenting our matrices with the extra components. If we are in 3D, the covariance matrix has 3 eigenvectors, v1, v2, v3, ordered from largest to smallest λ. But to organize our problem more neatly, we use the properties of matrix multiplication. Our problem can now be stated as:
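
CV = VE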

where E is a diagonal matrix of the variances (eigenvalues) and V is the orthonormal matrix of the eigenvectors, arranged column-wise. To reduce the dimensionality, take however many principal components (eigenvectors) with the largest eigenvalues you need and project the data onto them. If we choose v1 and v2, the plane of highest variance is the one they define, and projecting the points onto that plane is equivalent to projecting onto v1 and v2 and then combining the components:
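
P = A[v1 v2] = AV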

This is just an extension of projection onto a line (p = Av), where the single vector v is replaced by a matrix V of column-wise eigenvectors.

A General Procedure

The process of finding the eigenvectors of the covariance matrix is called the Eigen-decomposition of the covariance matrix, and is one of the methods for solving PCA. The name comes from the fact that if we re-organize our problem statement as:
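
C = VEV^-1 = VE(V^T)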

where we used the identity V^-1 = V^T, which holds for orthonormal matrices. The matrix C has thus been decomposed into its eigenvectors V and eigenvalues E.

How many eigenvectors should we choose to project our data on? A common rule of thumb is to keep the first k eigenvectors whose variances (eigenvalues) sum to more than 95% of the total variance.

We can now summarize the procedure of reducing dimensionality using PCA:

  1. Mean-center the data by subtracting each feature’s mean from that feature.
  2. Find the covariance matrix of the centered data, C = (A^T)A / n (some texts divide by n - 1 instead; the eigenvectors are the same either way).
  3. Apply eigen-decomposition to C.
  4. Order the eigenvectors by their eigenvalues, in descending order.
  5. Choose the first k eigenvectors that together explain 95% of the total variance.
  6. Create a matrix V, where each column is one of the k chosen vectors.
  7. Project the data onto the subspace defined by V: P = AV.
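
To make these steps concrete, here is a minimal NumPy sketch of the whole procedure (the function name, the toy data, and the exact handling of the 95% threshold are my own choices, not from the article):

    import numpy as np

    def pca_reduce(X, variance_threshold=0.95):
        """Reduce the dimensionality of X (n examples x d features) with PCA."""
        # 1. Mean-center the data
        A = X - X.mean(axis=0)

        # 2. Covariance matrix of the centered data
        C = (A.T @ A) / len(A)

        # 3. Eigen-decomposition (eigh handles symmetric matrices like C)
        eigenvalues, eigenvectors = np.linalg.eigh(C)

        # 4. Order the eigenvectors by their eigenvalues, descending
        order = np.argsort(eigenvalues)[::-1]
        eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

        # 5. Keep the first k eigenvectors explaining >= 95% of the variance
        explained = np.cumsum(eigenvalues) / eigenvalues.sum()
        k = int(np.searchsorted(explained, variance_threshold)) + 1

        # 6. V holds the k chosen eigenvectors as its columns
        V = eigenvectors[:, :k]

        # 7. Project the data onto the subspace defined by V
        return A @ V

    # Example: a toy 3D dataset whose third feature carries almost no variance
    X = np.random.default_rng(1).normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])
    print(pca_reduce(X).shape)   # expect (200, 2): the low-variance feature is dropped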

What’s Next

Another very interesting approach to solving PCA is Singular Value Decomposition (SVD). It is similar to eigen-decomposition but more general and more widely used in practice. This will suffice for now, though, so I’ll leave SVD to the next part.

