Humans are visual creatures. We need to see in order to believe. But when a dataset has more than three dimensions, we can no longer see what's going on with our own eyes. Who said, though, that all those extra dimensions are really necessary? Isn't there a way to somehow reduce the data to one, two, or three human-friendly dimensions? It turns out there is.
Principal Component Analysis (PCA) is one such technique. It's simple and elegant. Unfortunately, simple doesn't mean easy to see through, and it takes some effort to really understand what's going on. If you have read about it before, you may have encountered a fully mathematical and abstract treatment with no intuition about what it all means. Or it may have been explained 'layman-style', with no mathematical grounding. Or maybe a little bit of both, with a conceptual gap in the middle that doesn't connect intuition with rigor. I'll try to avoid this here.
The main objective of Dimensionality Reduction is this: find a low-dimensional representation of the data that retains as much information as possible. That’s a pretty bold statement. Let’s see what that means.
Getting Rid Of The Unnecessary
Suppose we have the following dataset of house values (prices, in thousands of dollars) for a certain district:

This dataset has 4 features (4 dimensions), so it's impossible to visualize graphically as a whole. However, if you carefully study how each feature relates to itself and to the others, you'll notice that not all features are equally important.
For example, can you characterize each house by its floor count? Does the number of floors help to distinguish one house from another? It doesn't seem so, since the values are almost equal, that is, they have low variance, namely _σ² = 0.2_, and thus aren't very helpful. What about households? It doesn't vary much either, but its variance is certainly higher than that of floors (σ² = 28), and thus it is more helpful. As for the remaining two features, area (σ² = 43) and value (σ² = 127), they vary much more, and hence they are more representative of our data than the other two.
But there's more we can do to exploit each feature fully without sacrificing much accuracy. So far, we have studied each feature individually. What about their relation to each other? If you look carefully at the first two features, value and area, you'll notice that the value is roughly double the area. This is incredibly useful, for we can now deduce one feature from the other, and so we need only one of them instead of two. This relationship is measured by covariance. The higher the covariance, the more correlated the two features are, which implies redundancy in the data: it carries more information than needed, since one feature can be deduced from the other.
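Here is a minimal numpy sketch of how these two quantities, variance and covariance, are computed. The numbers are made up for illustration only (the article's actual table isn't reproduced here), so they won't match the σ² values quoted above:

```python
import numpy as np

# Made-up house data, for illustration only: value, area, floors, households.
value      = np.array([250.0, 190.0, 310.0, 270.0, 230.0])   # thousands of dollars
area       = np.array([126.0,  94.0, 154.0, 136.0, 114.0])   # roughly half the value
floors     = np.array([1.0, 1.0, 2.0, 1.0, 1.0])
households = np.array([5.0, 12.0, 8.0, 15.0, 9.0])

# Variance: how much a feature varies on its own.
for name, feature in [("value", value), ("area", area),
                      ("floors", floors), ("households", households)]:
    print(f"var({name}) = {feature.var():.2f}")

# Covariance: how strongly two features vary together.
# A large value here signals redundancy: area tells us most of what value would.
print(f"cov(value, area) = {np.cov(value, area)[0, 1]:.2f}")
```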
From the preceding discussion, it should be evident that:
- It’s a good thing to have features with high variance, since they will be more informative and more important.
- It’s a bad thing to have highly correlated features, i.e., high covariance, since one can be deduced from the other with little loss of information, and thus keeping all of them is redundant.
And that’s just what PCA does. It tries to find another representation of the data (another set of features), such that the features in this representation have the highest possible variance and the lowest possible covariance. This may not fully make sense yet, not until we see it with our own eyes…
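Before we see it visually, here is a small numerical sketch using scikit-learn's PCA (an implementation assumed here for convenience, not something the article has introduced): it builds two highly correlated features and shows that the new features PCA produces have near-zero covariance, with the variance concentrated in the first one.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two strongly correlated features, like value and area: x2 is roughly double x1.
x1 = 10 * rng.normal(size=500)
x2 = 2 * x1 + rng.normal(size=500)
X = np.column_stack([x1, x2])

print(np.cov(X, rowvar=False).round(1))   # large off-diagonal entry: redundancy

# The same data expressed in PCA's new set of features.
Z = PCA(n_components=2).fit_transform(X)
print(np.cov(Z, rowvar=False).round(1))   # off-diagonal ~0, almost all variance in the first feature
```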
Show Me The Magic
In an interview with PCA, I asked her the following question: "What does your perfect dataset look like?" She showed me this:

And it makes sense: the data can essentially be reduced to a single line. Observe that the variance along the _x_1 direction is very high compared to the _x_2 direction, which means we can safely drop the _x_2 feature without much damage (by "variance along the _x_1 direction", I mean the variance of the first feature, since we chose to represent it with the _x_1-axis; same for _x_2). Moreover, _x_1 doesn’t seem to depend on _x_2 at all; it just goes on increasing regardless of the value of _x_2, which implies relatively low covariance. As to why this is actually perfect for PCA, it is simply because all she needs to do is this:


It’s crucial to understand what happened. Because _x_1 is much more important than _x_2 (according to the two criteria stated earlier), we have decided to keep only _x_1 by projecting the data points onto its axis. This is equivalent to keeping only the _x_1-coordinate of the points, which in turn is equivalent to dropping the "unimportant" feature _x_2. And now we have a 1D dataset instead of a 2D one!
This is essentially Dimensionality Reduction: finding the best low-dimensional representation of the data. Of course, there’ll be some error because we neglected the second feature; that error is represented by the dashed lines above. But it is kept to a minimum since the data lies almost on a line, and it’s the best we can actually do with the given information (more on this later).
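A tiny numpy sketch of that projection, with synthetic data standing in for the "perfect" dataset (made-up values, just to show that dropping _x_2 is the whole operation):

```python
import numpy as np

rng = np.random.default_rng(1)

# A stand-in for the 'perfect' dataset: lots of spread along x1, almost none along x2.
x1 = np.linspace(0.0, 10.0, 20)
x2 = rng.normal(scale=0.1, size=20)
X = np.column_stack([x1, x2])          # shape (20, 2)

# Projecting onto the x1 axis is just keeping the x1 coordinate: 2D becomes 1D.
X_reduced = X[:, 0]

# The error of the reduction is exactly the discarded x2 coordinates
# (the dashed lines in the figure); it's small because the data hugs the x1 axis.
print("mean absolute error:", np.abs(X[:, 1]).mean())
```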
The Plot Thickens
"So what about your not-so-perfect dataset? How does it look like?" I asked PCA. Without hesitation, she said: "I really don’t believe there’s such a thing, you know. Everyone is perfect, you only need to change your perspective. Here, take a look at this."

"Not as easy as before, don’t you think? Nope. Tilt your head", she said.

"It’s actually the same ‘perfect’ dataset I showed you previously, just rotated 45 degrees. All we have to do is rotate our own axes to align with the dataset, and proceed as before: projecting on the _new x_1 axis, and omitting the _new x_2 axis"
It’s crucial to pause and ponder about this unexpected rotation of events. Like what engineers usually do, PCA reduces the not-so-perfect problem of a dataset that isn’t "aligned", to a perfect, "aligned" problem that is easy to solve.
How does this happen? Essentially, PCA tries to find another set of axes such that the variance along that axis is as large as possible. When I said "variance along _x_1 axis", I meant the variance of the feature _x_1. But after rotating our axes, the axes lost their meaning – they no longer represent _x_1 or _x_2. Rather, they represent a linear combination of both. To see how, notice that the new _x_1 and _x_2 axes above can be obtained by performing a rotation transformation on the old axes, obtaining:
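Concretely, for a rotation of the axes by an angle θ (θ = 45° here), the standard rotation formulas (assuming the usual counter-clockwise convention; the article's figure may use a different sign choice) give:

$$
z_1 = x_1\cos\theta + x_2\sin\theta, \qquad
z_2 = -x_1\sin\theta + x_2\cos\theta,
$$

and with θ = 45°:

$$
z_1 = \frac{x_1 + x_2}{\sqrt{2}}, \qquad
z_2 = \frac{x_2 - x_1}{\sqrt{2}}.
$$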
These two new directions, _z_1 and _z_2, are the Principal Components of the dataset. The 1st principal component, _z_1, is the one with the highest variance, and hence the one that is most important, carries the most information, and, in a sense, is the one the data hinges on. The variance in the direction of the first principal component is now interpreted as the variance of the new, made-up feature _z_1. As for the 2nd principal component, _z_2, it’s simply the direction, perpendicular to the 1st, with the next highest variance.
As its name implies, Principal Component Analysis is all about finding these principal components, so that we may use the first few of them, the ones that carry the most variance, to represent our data, just like we did when we projected the perfect dataset onto a line.
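As a preview of how such directions are typically found (the standard recipe, presumably what the master formula in the next part amounts to): the principal components can be read off the eigenvectors of the data's covariance matrix. A minimal numpy sketch on a tilted dataset like the one above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Build a 'perfect' dataset and tilt it by 45 degrees, like the one in the figures.
t = np.linspace(0.0, 10.0, 200)
aligned = np.column_stack([t, rng.normal(scale=0.1, size=200)])
theta = np.deg2rad(45)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (aligned - aligned.mean(axis=0)) @ R.T      # centered, then rotated

# Standard recipe: principal components = eigenvectors of the covariance matrix.
C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)            # eigh returns eigenvalues in ascending order
z1 = eigvecs[:, -1]                             # direction of largest variance
print("1st principal component:", z1.round(3))  # ~ [0.707, 0.707] up to sign: the 45-degree direction

# Projecting onto z1 gives the 1D representation of the tilted dataset.
X_1d = X @ z1
```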
It’s easy to see how this generalizes to more than two dimensions. Instead of projecting onto lines, we project onto planes: a perfect dataset in 3D is one that lies approximately on a plane (or, even better, on a line):

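A quick sketch of the 3D case, again with made-up data that lies close to a plane, using scikit-learn's PCA to project onto that plane:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Made-up 3D points lying almost on a plane: the third coordinate is nearly
# determined by the first two, plus a little noise.
a = 5 * rng.normal(size=300)
b = 3 * rng.normal(size=300)
X3 = np.column_stack([a, b, 0.5 * a + 0.2 * b + rng.normal(scale=0.05, size=300)])

pca = PCA(n_components=2)
X2 = pca.fit_transform(X3)                # projected onto the best-fitting plane: 3D -> 2D
print(pca.explained_variance_ratio_)      # the two kept components capture nearly all the variance
```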
Magical Lines And Where To Find Them
Now comes the real deal: how do we actually find these principal components?
I’ll leave that matter to the next part, where we shall develop the master formula:

And along the way, we will discover deep insights into how the qualitative intuition developed here can be turned into an elegant mathematical structure that generalizes to all dimensions.