# Dimensionality Reduction For Dummies — Part 1: Intuition

Humans are visual creatures. We need to see in order to believe. When a dataset has more than three dimensions, it becomes impossible to see what’s going on with our own eyes. But who said that these extra dimensions are *really* necessary? Isn’t there a way to somehow reduce them to one, two, or three humanly viewable dimensions? It turns out there is.

Principal Component Analysis (PCA) is one such technique. It’s simple and elegant. Unfortunately, simple doesn’t mean easy to see through and really understand. If you have read about it before, you may have encountered a fully mathematical, abstract treatment with no intuition about its significance. Or it may have been explained ‘layman-style’, with no mathematical grounding. Or maybe a little bit of both, with a conceptual gap in the middle that doesn’t connect intuition with rigor. I’ll try to avoid all of that here.

The main objective of dimensionality reduction is this: **find a low-dimensional representation of the data that retains as much information as possible.** That’s a pretty bold statement. Let’s see what that means.

### Getting Rid Of The Unnecessary

Suppose we have the following dataset of house values (prices, in thousands of dollars) for a certain district:

This dataset has 4 features (4 dimensions) and is impossible to visualize graphically as a whole. However, if you carefully study how the features relate to one another, and how each varies on its own, you’ll notice that not all of them are equally important.

For example, can you characterize each house by its floor count? Does the number of floors **help to distinguish** one house from another? It doesn’t seem so: the counts are almost equal, that is, they have **low variance** (σ² = 0.2), and thus aren’t very helpful. What about households? It doesn’t vary much either, but its variance (σ² = 28) is certainly larger than that of floors, and so it is **more** helpful. Now, for the last two features, area (σ² = 43) and value (σ² = 127), they vary much more, and hence they represent our data better than the other two.
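To make the variance criterion concrete, here is a minimal NumPy sketch. The numbers below are made up purely for illustration (they are not the table from this article); the point is only that a near-constant feature like floor count shows up as a tiny variance.

```python
import numpy as np

# A made-up 4-feature dataset (one row per house), for illustration only.
# Columns: value, area, households, floors.
X = np.array([
    [250.0, 120.0, 30.0, 2.0],
    [180.0,  95.0, 25.0, 2.0],
    [310.0, 150.0, 40.0, 3.0],
    [220.0, 110.0, 28.0, 2.0],
])

# Variance of each feature (column); a low-variance feature barely
# distinguishes one house from another.
variances = np.var(X, axis=0)
print(dict(zip(["value", "area", "households", "floors"], variances)))
```

Running this, the floors column comes out with by far the smallest variance and the value column with the largest, mirroring the ranking discussed above.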

But there’s something we can do to exploit each feature to the maximum without sacrificing much accuracy. So far, we have studied each feature individually. What about their **relation** to each other? If you look carefully at the first two features, value and area, you’ll notice that the value is **roughly double** the area. This is incredibly useful, for we can now deduce one feature from the other, and so we need only one of them instead of two. This property is called **covariance**. The higher the covariance, the more correlated the two features are, which implies redundancy in the data: there is more information than needed, because one feature can be deduced from the other.
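The redundancy between such a pair is easy to check numerically. A small sketch with hypothetical numbers, assuming value ≈ 2 × area plus a little noise:

```python
import numpy as np

# Hypothetical 'area' feature, and a 'value' that is roughly double the
# area plus noise, mimicking the redundancy described above.
area  = np.array([120.0, 95.0, 150.0, 110.0, 135.0])
value = 2.0 * area + np.array([3.0, -4.0, 5.0, -2.0, 1.0])

cov = np.cov(area, value)              # 2x2 covariance matrix
corr = np.corrcoef(area, value)[0, 1]  # correlation coefficient
print(cov[0, 1], corr)                 # large covariance, correlation near 1
```

A correlation this close to 1 is exactly the signal that keeping both features together is redundant.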

From the preceding discussion, it should be evident that:

- It’s a *good thing* to have features with *high variance*, since they will be more informative and more important.
- It’s a *bad thing* to have highly correlated features, i.e. *high covariance*, since one can be deduced from the other with little loss of information, and thus keeping both is redundant.

And that’s just what PCA does. It tries to find another representation of the data (another set of features) such that the features in this representation have the highest possible variance and the lowest possible covariance. This may not quite make sense yet, not until we see it with our own eyes…

### Show Me The Magic

In an interview with PCA, I asked her the following question: “What does your perfect dataset look like?” She showed me this:

And it makes sense — the data can essentially be **reduced to a single line.** Observe: the variance along the *x*1 direction is very high compared to the *x*2 direction, which means we can safely drop the *x*2 feature without much damage. (By the “variance along the *x*1 direction”, I mean the variance of the first feature, since we chose to represent it on the *x*1-axis; same for *x*2.) Moreover, *x*1 doesn’t seem to depend on *x*2 at all: it just goes on increasing regardless of the value of *x*2, which implies relatively *low covariance*. Now, as to why this is actually perfect for PCA, it is simply because all she needs to do is this:

It’s crucial to understand what happened. Because *x*1 is much more important than *x*2 (according to the two criteria stated earlier), we decided to keep only *x*1 by **projecting** the data points onto its axis, which is equivalent to keeping only the *x*1-coordinate of each point, which is in turn equivalent to dropping the “unimportant” feature *x*2. And now we have a 1D dataset instead of a 2D one!

This is essentially **Dimensionality Reduction:** finding the best low-dimensional representation of the data. Of course, there will be some error from neglecting the second feature, which is represented by the dashed lines above. But we know this error is kept to a minimum, since the data lies almost on a line, and it’s the best we can actually do with the given information (more on this later).
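The projection step can be sketched in a few lines. The dataset below is synthetic, generated to hug the *x*1 axis like the “perfect” one; projecting onto the *x*1 axis is literally just keeping the first coordinate, and the error is the dropped *x*2 coordinate.

```python
import numpy as np

# A toy 'perfect' dataset: x1 spreads widely, x2 stays close to zero.
rng = np.random.default_rng(0)
x1 = np.linspace(0.0, 10.0, 50)
x2 = rng.normal(scale=0.1, size=50)
X = np.column_stack([x1, x2])

# Projecting onto the x1 axis == keeping only the first coordinate.
X_reduced = X[:, 0]

# The projection error is exactly the discarded x2 coordinate (the
# dashed lines in the figure): small, because the data hugs the axis.
error = np.abs(X[:, 1])
print(X_reduced.shape, error.max())
```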

### The Plot Thickens

“So what about your not-so-perfect dataset? What does it look like?” I asked PCA. Without hesitation, she said: “I really don’t believe there’s such a thing, you know. Everyone is perfect, you only need to change your perspective. Here, take a look at this.”

“Not as easy as before, don’t you think? Nope. Tilt your head,” she said.

“It’s actually the same ‘perfect’ dataset I showed you previously, just rotated 45 degrees. All we have to do is rotate our own axes to align with the dataset, and proceed as before: projecting onto the *new x*1 axis, and omitting the *new x*2 axis.”

It’s crucial to pause and ponder this unexpected rotation of events. As engineers often do, PCA reduces the not-so-perfect problem, a dataset that isn’t “aligned”, to a perfect, “aligned” problem that is easy to solve.

How does this happen? Essentially, PCA tries to find another set of axes such that the variance along each axis is as large as possible. When I said “variance along the *x*1 axis”, I meant the variance of the feature *x*1. But after rotating our axes, the axes lost their original meaning: they no longer represent *x*1 or *x*2. Rather, each represents a **linear combination** of both. To see how, notice that the new *x*1 axis in the above figure is the line that satisfies the equation *x*2 = *x*1, or *x*1 − *x*2 = 0 (with respect to the old feature axes). And *x*1 − *x*2 = 0 is nothing more than a linear combination of the features, weighted by 1 and −1. This rotated axis now represents a **new feature**, call it *z*1, that is:

Likewise, the rotated *x*2-axis now represents a new feature, call it *z*2:

These two new directions, *z*1 and *z*2, are the **Principal Components** of the dataset. The *1st principal component*, *z*1, is the one with the highest variance, and hence the one that is most important, carries the most information, and, in a sense, is what the data *hinges* on. The variance in the direction of the first principal component is now interpreted as the variance of the new, made-up feature *z*1. As for the *2nd principal component*, *z*2, it’s simply the direction with the second-most variance that is perpendicular to the 1st.
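The “tilt your head” step is just a change of basis. Here is a small sketch, assuming a 45-degree rotation for concreteness, showing that each coordinate in the rotated frame is a linear combination of the old features:

```python
import numpy as np

theta = np.pi / 4  # 45 degrees
# Rotation matrix whose columns are the new (rotated) axes,
# expressed in the old x1, x2 coordinates.
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

point = np.array([3.0, 3.0])   # a point lying on the line x2 = x1
z = R.T @ point                # its coordinates in the rotated frame

# Each new coordinate is a linear combination of x1 and x2:
#   z1 =  cos(t)*x1 + sin(t)*x2
#   z2 = -sin(t)*x1 + cos(t)*x2
print(z)
```

Note that a point on the line *x*2 = *x*1 ends up with a zero second coordinate in the rotated frame: all of its information lives along the new first axis.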

As its name implies, Principal Component Analysis is all about finding these principal components, so that we may use the first few of them, the ones that carry the most variance, to represent our data, just like what we did when we projected the perfect dataset onto a line.
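Everything in this section can be sketched end to end. One standard way to compute principal components (there are others, which the next part will develop) is via the eigenvectors of the covariance matrix; the dataset below is synthetic, a tilted “perfect” line plus noise:

```python
import numpy as np

# Tilted 'perfect' dataset: points along the 45-degree direction plus noise.
rng = np.random.default_rng(1)
t = rng.uniform(-5.0, 5.0, size=200)
noise = rng.normal(scale=0.2, size=(200, 2))
X = np.column_stack([t, t]) / np.sqrt(2) + noise

# 1. Center the data, 2. form the covariance matrix,
# 3. its eigenvectors are the principal components.
Xc = X - X.mean(axis=0)
C = np.cov(Xc.T)
eigvals, eigvecs = np.linalg.eigh(C)          # eigh returns ascending order
order = eigvals.argsort()[::-1]               # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the 1st principal component: a 1-D representation.
z1 = Xc @ eigvecs[:, 0]
print(eigvals)        # the first value dominates
print(eigvecs[:, 0])  # roughly the 45-degree direction (1, 1)/sqrt(2)
```

The first eigenvalue dwarfs the second, which is precisely the “high variance along one direction, low along the other” picture from the interview.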

It’s easy to see how this generalizes to more than two dimensions. Instead of projecting onto lines, we project onto planes, and our perfect dataset in 3D now lies *approximately* on a plane (or maybe, even better, a line):
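The same recipe works in 3D. A sketch with a synthetic dataset lying near a plane, keeping the top two principal components (again using the covariance-eigenvector route as one way to find them):

```python
import numpy as np

# A toy 3-D dataset lying near a plane: x3 is almost determined by x1, x2.
rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.05, size=300)
X = np.column_stack([x1, x2, x3])

# Keep the top-2 eigenvectors of the covariance matrix: projecting the
# points onto the plane they span gives a 2-D representation.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
W = eigvecs[:, ::-1][:, :2]   # columns for the two largest eigenvalues
X2d = Xc @ W
print(X2d.shape)              # 300 points, now in 2 dimensions
```

Mapping back with `X2d @ W.T` reconstructs the centered points almost exactly, since hardly any variance lived off the plane.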

### Magical Lines And Where To Find Them

Now comes the real deal: How to find these principal components?

I’ll leave that matter to the next part, where we shall develop that master formula:

And along the way, we will discover deep insights into how the qualitative intuition developed here turns into an elegant mathematical structure that generalizes to all dimensions.