Dimensionality Reduction For Dummies — Part 1: Intuition

Humans are visual creatures. We need to see in order to believe. When a dataset has more than three dimensions, it becomes impossible to see what’s going on with our own eyes. But who said all those extra dimensions are really necessary? Isn’t there a way to somehow reduce them to one, two, or three human-friendly dimensions? It turns out there is.

Principal Component Analysis (PCA) is one such technique. It’s simple and elegant. Unfortunately, simple doesn’t mean easy to see through: it can still be hard to grasp what’s really going on. If you have read about it before, you may have encountered a fully mathematical, abstract treatment with no intuition about its significance. Or it may have been explained ‘layman-style’, with no mathematical grounding. Or maybe a little bit of both, with a conceptual gap in the middle that fails to connect intuition with rigor. I’ll try to avoid all of that here.

The main objective of dimensionality reduction is this: find a low-dimensional representation of the data that retains as much information as possible. That’s a pretty bold statement. Let’s see what it means.

Getting Rid Of The Unnecessary

Suppose we have the following dataset of houses in a certain district, where the value (price) is given in thousands of dollars:

Dataset with 4 features per item

This dataset has 4 features (4 dimensions) and is impossible to visualize graphically as a whole. However, if you carefully study how each feature varies on its own and how the features relate to each other, you’ll notice that not all of them are equally important.

For example, can you characterize each house by its floor count? Does the number of floors help to distinguish one house from another? It doesn’t seem so: the counts are almost all equal, that is, the feature has low variance (σ² = 0.2), and thus isn’t very helpful. What about the number of households? It doesn’t vary much either, but its variance (σ² = 28) is certainly higher than that of the floor count, so it is more helpful. As for the last two features, area (σ² = 43) and value (σ² = 127), they vary much more, and hence they are more representative of our data than the other two.
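To make the idea concrete, here is a minimal sketch of how such per-feature variances could be computed with NumPy. The numbers below are made up for illustration, since the original table isn’t reproduced here:

```python
import numpy as np

# Hypothetical house data: columns are value, area, households, floors.
# These numbers are invented for illustration only.
X = np.array([
    [230.0, 107.0, 35.0, 1.0],
    [245.0, 121.0, 42.0, 1.0],
    [222.0, 113.0, 30.0, 2.0],
    [250.0, 128.0, 44.0, 1.0],
    [235.0, 117.0, 38.0, 2.0],
])

# Variance of each feature (column). A tiny variance, like the floor
# count's, means the feature barely distinguishes one house from another.
print(np.var(X, axis=0))
```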

But there’s something more we can do to exploit each feature without sacrificing much accuracy. So far, we have studied each feature individually. What about their relation to each other? If you look carefully at the first two features, value and area, you’ll notice that the value is roughly double the area. This is incredibly useful, for we can now deduce one feature from the other, and we need only one of them instead of two. This property is measured by covariance. The higher the covariance (in magnitude), the more correlated the two features are, which implies redundancy in the data: there’s more information than needed, because one feature can be deduced from the other.
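If it helps to see that redundancy numerically, here is a small sketch along the same lines, again with hypothetical numbers:

```python
import numpy as np

# Hypothetical area/value pairs where the value is roughly double the area.
area = np.array([107.0, 121.0, 113.0, 128.0, 117.0])
value = 2.0 * area + np.array([0.5, -1.0, 0.8, -0.3, 0.2])  # small noise

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is
# Cov(value, area). It being large relative to the individual variances
# means the two features move together, so one of them is nearly redundant.
print(np.cov(value, area))

# The correlation coefficient makes this even more obvious: it is close to 1.
print(np.corrcoef(value, area)[0, 1])
```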

From the preceding discussion, it should be evident that:

  • It’s a good thing to have features with high variance, since they will be more informative and more important.
  • It’s a bad thing to have highly correlated features, i.e. high covariance, since one of them can be deduced from the other with little loss of information, and thus keeping both is redundant.

And that’s just what PCA does. It tries to find another representation of the data (another set of features) such that the features in this representation have the highest possible variance and the lowest possible covariance. This may not fully make sense yet, not until we see it with our own eyes…
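For readers who like to peek ahead, the sketch below shows (with made-up data) one standard way this is computed in practice. Diagonalizing the covariance matrix yields a new set of axes along which the variances are as large as possible and the covariances are zero; the rest of this article builds the geometric intuition for why that is the right thing to do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D data: stretched along one direction, then rotated,
# so the two original features end up strongly correlated.
stretch = np.array([[3.0, 0.0], [0.0, 0.3]])
theta = 0.6
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
X = rng.normal(size=(200, 2)) @ stretch @ rotate.T

Xc = X - X.mean(axis=0)                  # center the data
C = np.cov(Xc, rowvar=False)             # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # columns of eigvecs are the new axes

Z = Xc @ eigvecs                         # re-express the data in the new axes
print(np.cov(Z, rowvar=False).round(3))  # ~diagonal: large variances, zero covariance
```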

Show Me The Magic

In an interview with PCA, I asked her the following question: “What does your perfect dataset look like?” She showed me this:

And it makes sense: the data can essentially be reduced to a single line. Observe that the variance along the x1 direction is very high compared to the x2 direction, which means we can safely drop the x2 feature without much damage (by the “variance along the x1 direction”, I mean the variance of the first feature, since we chose to represent it with the x1-axis; same for x2). Moreover, x1 doesn’t seem to depend on x2 at all; it just keeps increasing regardless of the value of x2, which implies a relatively low covariance. Now, as to why this is actually perfect for PCA: it is simply because all she needs to do is this:

Projecting the dataset on the x1 axis.
After projection, the data only has one dimension.

It’s crucial to understand what happened. Because x1 is much more important than x2 (according to the two criteria stated earlier), we decided to keep only x1 by projecting the data points onto its axis. This is equivalent to keeping only the x1-coordinate of each point, which in turn is equivalent to dropping the “unimportant” feature x2. And now we have a 1D dataset instead of a 2D one!
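In code, this projection amounts to nothing more than keeping the first coordinate. A minimal sketch with made-up points:

```python
import numpy as np

# Hypothetical "perfect" dataset: x1 varies a lot, x2 barely varies.
X = np.array([
    [1.0, 0.10],
    [2.1, 0.05],
    [3.0, 0.12],
    [4.2, 0.08],
    [5.1, 0.11],
])

# Projecting onto the x1-axis == keeping only the x1-coordinate.
X_reduced = X[:, 0]           # shape (5,): a 1-D dataset
print(X_reduced)

# The dashed-line error from the figure is just the discarded x2 values.
print(np.abs(X[:, 1]).max())  # small, because the data hugs the x1-axis
```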

This is essentially Dimensionality Reduction: finding the best low-dimensional representation of the data. Of course, there will be some error from neglecting the second feature, represented by the dashed lines above. But we know this error is kept to a minimum, since the data lies almost on a line, and it’s the best we can actually do with the given information (more on this later).

The Plot Thickens

“So what about your not-so-perfect dataset? What does it look like?” I asked PCA. Without hesitation, she said: “I really don’t believe there’s such a thing, you know. Everyone is perfect, you only need to change your perspective. Here, take a look at this.”

“Not so easy as before, don’t you think? Nope. Tilt your head”, she said.

“It’s actually the same ‘perfect’ dataset I showed you previously, just rotated 45 degrees. All we have to do is rotate our own axes to align with the dataset and proceed as before: project onto the new x1 axis and omit the new x2 axis.”

It’s crucial to pause and ponder this unexpected rotation of events. As engineers usually do, PCA reduces the not-so-perfect problem of a dataset that isn’t “aligned” to a perfect, “aligned” problem that is easy to solve.

How does this happen? Essentially, PCA tries to find another set of axes such that the variance along each of those axes is as large as possible. When I said “variance along the x1 axis”, I meant the variance of the feature x1. But after rotating our axes, the axes lost their original meaning: they no longer represent x1 or x2 alone. Rather, each represents a linear combination of both. To see how, notice that the new x1 axis in the figure above is the line satisfying x2 = x1 (with respect to the old feature axes), i.e. it points along the direction (1, 1). The coordinate of a point along this axis is therefore a linear combination of its old features, weighted equally (the 1/√2 below just keeps the new axis at unit scale). This rotated axis now represents a new feature, call it z1, that is:

z1 = (x1 + x2) / √2
Likewise, the rotated x2-axis points along the perpendicular direction (−1, 1) and represents a second new feature, call it z2:

z2 = (x2 − x1) / √2
These two new directions, z1 and z2, are the Principal Components of the dataset. The 1st principal component, z1, is the one with the highest variance, and hence the most important one: it carries the most information and is, in a sense, the direction the data hinges on. The variance in the direction of the first principal component is now interpreted as the variance of the new, made-up feature z1. As for the 2nd principal component, z2, it’s simply the direction with the second-highest variance among those perpendicular to the 1st.
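Here is a short sketch of how these two new features could be computed for such a 45°-rotated dataset. The data is hypothetical, and the two directions are the unit vectors along and perpendicular to the line x2 = x1 suggested by the figure:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "perfect" data hugging the x1-axis, then rotated by 45 degrees.
flat = np.column_stack([rng.uniform(0, 10, 100), rng.normal(0, 0.1, 100)])
R = np.array([[np.cos(np.pi / 4), -np.sin(np.pi / 4)],
              [np.sin(np.pi / 4),  np.cos(np.pi / 4)]])
X = flat @ R.T                             # the tilted dataset from the figure

# The two principal directions suggested by the figure.
u1 = np.array([1.0, 1.0]) / np.sqrt(2)     # along the line x2 = x1
u2 = np.array([-1.0, 1.0]) / np.sqrt(2)    # perpendicular to it

z1 = X @ u1   # new feature z1 = (x1 + x2) / sqrt(2): carries almost all the variance
z2 = X @ u2   # new feature z2 = (x2 - x1) / sqrt(2): carries barely any
print(z1.var().round(3), z2.var().round(3))
```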

As its name implies, Principal Component Analysis is all about finding these principal components, so that we can use the first few of them, the ones carrying the most variance, to represent our data, just like we did when we projected the perfect dataset onto a line.

It’s easy to see how this generalizes to more than two dimensions. Instead of projecting onto lines, we project onto planes, and our perfect dataset in 3D is one that lies approximately on a plane (or, even better, on a line).
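A sketch of the same idea one dimension up, using scikit-learn’s PCA on hypothetical 3-D points that lie close to a plane:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Hypothetical 3-D data lying almost on a plane: every point is (roughly)
# a combination of two underlying directions plus a little noise.
a = rng.normal(size=(300, 1))
b = rng.normal(size=(300, 1))
noise = 0.05 * rng.normal(size=(300, 1))
X = np.hstack([a + b, a - 2.0 * b, 0.5 * a + noise])

pca = PCA(n_components=2)
Z = pca.fit_transform(X)               # project onto the best-fitting plane
print(Z.shape)                         # (300, 2): a 2-D dataset
print(pca.explained_variance_ratio_)   # the first two components carry ~all the variance
```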

Magical Lines And Where To Find Them

Now comes the real deal: How to find these principal components?

I’ll leave that matter to the next part, where we shall develop the master formula of PCA.

And along the way, we will discover deep insights into how the qualitative intuition developed here turns into an elegant mathematical structure that generalizes to all dimensions.