How exactly does PCA work?

Simplest guide to PCA, ever.

Chayan Kathuria
Towards Data Science

--

Principal Component Analysis is the process of compressing a lot of data into something that captures the essence of the data.

The Intuition

PCA (Principal Component Analysis) is a technique which finds the major patterns in data for dimensionality reduction. Reading this line for the first time may trigger a few questions:

What are these patterns?

How to find these patterns?

What is dimensionality reduction?

And what are dimensions anyway?

And why reduce them?

Let’s go through them one by one.

Assume we have a dataset with, say, 300 columns. So our dataset has 300 dimensions. Neat. But do we really need so many dimensions? We might. But most of the time, we don’t. So we need a fast and easy way not just to remove a certain number of features, but to capture the essence of those 300 dimensions in a much smaller number of transformed dimensions.
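To make this concrete before digging into the details, here is a minimal sketch of what that compression looks like in practice. It assumes scikit-learn and NumPy are available, and the 300-column matrix of random numbers is just a stand-in for a real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in dataset: 1,000 rows and 300 columns of random values.
X = np.random.rand(1000, 300)

# Compress the 300 original dimensions into 6 transformed ones.
pca = PCA(n_components=6)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (1000, 300) - original dimensions
print(X_reduced.shape)  # (1000, 6)   - transformed dimensions
```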

Variance

Each of the 300 features will have a certain amount of variance, that is, a measure of how much its values change throughout the data. If a feature records the number of floors in a particular building for 200 days, its variance will be 0, as there is no change in its value throughout. Features with 0 variance are of no use, as they provide no insight. So variance is indeed our friend! And this is the pattern I mentioned earlier.

The more the variance, the more important that feature is, as it contains more ‘information’. A variable with 0 variance contains 0 information. Do not confuse variance with correlation! Variance is not measured with respect to the target variable of your data. It simply states how the values of a particular feature vary throughout the data.
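As a quick illustration, here is a minimal sketch of how you could inspect per-feature variance and spot a zero-variance column like the ‘number of floors’ example. It assumes pandas is available, and the toy dataset is invented for the purpose.

```python
import pandas as pd

# Toy dataset: 'floors' never changes, so it carries no information.
df = pd.DataFrame({
    "floors": [12, 12, 12, 12, 12],               # constant -> variance 0
    "daily_visitors": [310, 280, 395, 260, 340],
    "energy_kwh": [41.0, 39.5, 47.2, 36.8, 43.1],
})

# Variance of each feature across the rows (pandas divides by n-1 by default).
print(df.var())

# Zero-variance features can be dropped outright; they provide no insight.
zero_variance = df.columns[df.var() == 0]
print(list(zero_variance))  # ['floors']
```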

Principal Components

Now that we know what variance is, we need to find a new, transformed feature set that explains this variance in a much better manner. The original 300 features are combined linearly in order to push as much of the variance as possible into a few transformed features. These transformed features are called the Principal Components.

The Principal Components no longer correspond one-to-one to the original features; each one is a blend of all of them. We still get 300 principal components from 300 features. Now here comes the beauty of PCA: the newly formed transformed feature set, the Principal Components, is ordered so that the first PC explains the maximum variance, the second PC explains the second highest variance, and so on.

For example, say the first PC explains 68% of the total variance in the data, the second PC explains 15%, and the next four PCs together account for another 14%. So you have 97% of the variance explained by just 6 Principal Components! Now, let’s say the next 100 components in total explain only another 1% of the total variance. It makes little sense to include 100 more dimensions just to gain one more percent of variance. By taking just the top 6 Principal Components, we have reduced the dimensionality from 300 to a mere 6!
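In code, that decision can be read straight off the explained variance ratios. Here is a minimal sketch with scikit-learn; the random 300-column matrix is only a stand-in for a real dataset, and the 97% threshold is just the number from the example above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data with 300 features; replace with your own dataset.
X = np.random.rand(500, 300)

# PCA is sensitive to scale, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)  # keep all components for now
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance >= 97%.
n_components = int(np.argmax(cumulative >= 0.97)) + 1
print(n_components)
# Purely random features are uncorrelated, so this number will be large here;
# real, correlated data compresses far better.
```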

Eigenvectors & Eigenvalues

Let’s now consider a simpler example with only 2 features, which is easier to visualize. Below is the figure we get if we plot feature 1 against feature 2.

What PCA (with SVD) does is find the best-fit line for these data points: the line that minimizes the distances between the data points and their projections onto it. Now, consider the average of the data points across feature 1 and feature 2. It will be somewhere around point A. Equivalently, PCA can instead maximize the distances of the projected points on the best-fit line from point A.

We shift the data so that point A coincides with the origin, which makes it easier to visualize.

The distance d1 is the distance of the projection of point 1 from the origin. Similarly, d2, d3, d4, d5 and d6 are the respective distances of the other projected points from the origin. The best-fit line is the one with the maximum sum of squared distances. Suppose the slope of the line is 0.25. That means the line consists of 4 parts of feature 1 for every 1 part of feature 2. This would look something like:

where B = 4 and C = 1. Hence we can easily find A by the Pythagorean theorem: A = √(4² + 1²) ≈ 4.12. PCA scales these values so that the vector A is one unit long. Hence A = 1, B = 4/4.12 ≈ 0.97 and C = 1/4.12 ≈ 0.242. This unit vector along A is the eigenvector for PC1! The sum of the squared distances d1², d2², …, d6² is its eigenvalue. Quite straightforward! This is the linear combination of feature 1 and feature 2 I mentioned earlier. It tells us that for PC1, feature 1 is almost 4 times as important as feature 2, or in other words, feature 1 contributes almost 4 times more of the spread (variation) in the data along PC1 than feature 2 does.
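The same numbers can be recovered with a few lines of NumPy. This is only a sketch: the six points below are invented so that the spread is roughly 4 parts of feature 1 to 1 part of feature 2.

```python
import numpy as np

# Six invented points (feature 1, feature 2) lying roughly along a slope of 0.25.
X = np.array([
    [2.0, 0.4], [4.1, 1.1], [6.0, 1.4],
    [8.2, 2.1], [10.0, 2.4], [12.1, 3.1],
])

# Step 1: shift the data so that its average (point A) sits at the origin.
X_centered = X - X.mean(axis=0)

# Step 2: eigendecompose the scatter matrix X^T X. Its eigenvalues equal the
# sums of squared distances of the projected points from the origin.
scatter = X_centered.T @ X_centered
eigenvalues, eigenvectors = np.linalg.eigh(scatter)  # ascending eigenvalues

pc1 = eigenvectors[:, -1]  # eigenvector for the largest eigenvalue
print(pc1)                 # unit vector, roughly [0.97, 0.25] (sign may flip)
print(eigenvalues[-1])     # eigenvalue = sum of squared distances for PC1
```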

Now, Principal Component 2 will be the vector orthogonal to PC1, as the principal components have 0 correlation among them. That will be something like the red line:

By the same reasoning, PC2 will have -0.242 parts of feature 1 and 0.97 parts of feature 2. This tells us that for PC2, feature 2 is almost 4 times as important as feature 1. The eigenvector and the eigenvalue can be calculated similarly for PC2. So we have finally found our Principal Components!
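Continuing the sketch above, the orthogonality is easy to check: the dot product of the two loading vectors is zero.

```python
import numpy as np

pc1 = np.array([0.970, 0.242])    # loadings of PC1 from above
pc2 = np.array([-0.242, 0.970])   # loadings of PC2, orthogonal to PC1

# Orthogonal vectors have a dot product of 0.
print(np.dot(pc1, pc2))  # 0.0
```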

Explained Variance

We calculated the sum of squared distances for both principal components. If we divide those values by n-1 (where n is the sample size), we get the variance along each principal component.

Let us suppose that the variance for PC1 comes out to be 15 and that for PC2 comes out to be 3. Hence the total variance across both principal components is 18. So PC1 accounts for 15/18, which is about 0.83 or 83% of the total variance in the data, and PC2 accounts for 3/18, which is about 0.17 or 17%. This is the explained variance ratio. It tells you how much of the variance in the data is explained by a particular principal component. Principal components are ranked in order of their explained variance ratio, and we can select the top m components once their cumulative explained variance ratio reaches a sufficiently high value.
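The arithmetic is only a couple of lines; scikit-learn exposes the same quantity as explained_variance_ratio_ after fitting a PCA object. The two variances below are just the made-up numbers from this example.

```python
import numpy as np

# Variances along PC1 and PC2 from the example above (made-up numbers).
variances = np.array([15.0, 3.0])

# Explained variance ratio: each component's variance over the total.
explained_variance_ratio = variances / variances.sum()
print(explained_variance_ratio)             # [0.833... 0.166...]
print(np.cumsum(explained_variance_ratio))  # cumulative: [0.833... 1.0]
```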

Principal Component Analysis reduces dimensionality, which helps to overcome overfitting. Your model might not need all the features to perform well: it might give a great training score but a very low test score; in other words, it might overfit. Note that PCA is not a feature selection or feature elimination technique. It is a feature extraction technique, and you might also group it under the feature engineering umbrella.
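As a final sketch, this is one common way to use PCA as a feature extraction step in front of a model. The breast cancer dataset, the logistic regression model, and the 95% variance threshold are arbitrary placeholders, not part of the example above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, extract enough principal components to cover 95% of the variance,
# then fit the model on those extracted features.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

print(model.score(X_train, y_train))  # training accuracy
print(model.score(X_test, y_test))    # test accuracy
```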

That is all for this article.
