Principal Component Analysis

(Updated Sep. 2021) Step-by-step intuition, mathematical principles and Python code snippets behind one of the most important algorithms in unsupervised learning

Andrea Grianti
Towards Data Science


This animation shows how the covariance matrix of the projected points (A…E) gets diagonalized when the rotating direction reaches two special directions (the solutions to the eigen-decomposition equation below). To verify it, slow down the video or try to pause it when the direction overlaps the orange dotted line (first principal component) or the pink dotted line (second principal component). You will also notice that the diagonal elements of the matrix reach their maximum values (max variance). (Video created by the author using GeoGebra 6 software)
This formula is called the eigen-decomposition equation. Initial work on it dates back to the end of the 1800s, but only in recent times has it become of practical use in big data analysis, thanks to advances in (personal) computing power.

Hi everybody, my name is Andrea Grianti and I write from Milan, Italy. I wrote this to share my thoughts after reading many books and papers on the subject. This is not a textbook but a starting point for further understanding of the topic. As the math/algebra behind PCA is rather tough, I divided the article into four parts:

  1. PCA Intuition
  2. Math/Algebra (easy)
  3. Math/Algebra (difficult)
  4. Python snippets

1. PCA Intuition

If you have a large data set with thousands/millions of observations (rows) and hundreds of different variables (columns), one of the first objectives is to verify whether it is possible to simplify and reduce the data set in order to carry out the analysis on a much smaller subset of the original data.

Direct elimination of variables is the obvious way, but it clearly has an impact on the information content of your data set. Eliminate too much, or the wrong variables, and your data set becomes useless; eliminate too little and the data set remains large and difficult to analyse.

The term ‘information’ is quite generic and difficult to define, because it depends on the data set and on the reader: a data set could contain valuable information for me and nothing of interest for others, and vice versa.

We could try to define the quantity of information in a data set using concepts like “information content”, which is bound to the probability that a specific value, among all those possible for a variable of the data set, occurs.

Following this concept, the more variety a variable x has in its possible outcomes, the lower the probability of predicting its value and therefore the higher its information content.

This is our first assumption to keep in mind: higher variance => higher information content. Only the context and our data can tell if this assumption holds.

Before focusing on the high-variance concept we must deal with a second assumption: correlation between variables is a form of data redundancy. If there is a clearly defined relation between two variables, like for example angles in degrees and angles in radians, or temperature in Celsius and in Fahrenheit, then one of the two variables is redundant and can be eliminated directly from the data set. When instead the dependencies are less clear, direct elimination is not recommended, and we can try to manage the lack of evidence for correlation through a third assumption.

The third assumption is: whatever the correlation between the variables is, we assume it is linear.

So to recap we are assuming that:

  • variance is related to information content and should be maximized
  • redundant variables and high correlations between variables are a form of noise that should be minimized
  • correlations between variables are linear

Why are these assumptions important? Because to reduce the dimensionality of a data set we should evaluate the contribution of each variable to the overall variance (= information) of the data set, in order to keep those with the greatest contribution and discard those with the lowest.

The battlefield for this operation is the Covariance Matrix, or its sibling the Correlation Matrix. Because the Correlation Matrix is strictly bound to the Covariance Matrix by the same relation that binds simple covariance and correlation between two variables, I will work with the Covariance Matrix: the choice is not relevant for understanding the math principles behind PCA.

In any case, for those who prefer a quick refresher of the covariance and correlation concepts in order to understand that relationship, here it is.

We know that for two variables the covariance formula is (in case our data set is a sample taken out of a population):

Covariance for sample data not centered around the mean (Source: Author)
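In LaTeX form (the standard sample covariance, with m observations and x̄, ȳ the means of the two variables):

$$\operatorname{cov}(x,y)=\frac{1}{m-1}\sum_{i=1}^{m}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)$$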

Covariance represents a dispersion measure that includes the concept of linear “synchronicity” between two variables in relation to their respective means.

That means measuring, for each point, how much the difference between <the x coordinate of the point> and <the mean of all the x points> is “in synch” with the difference between <the y coordinate of the same point> and <the mean of all the y points>, and then averaging that number.

Points drawn on a Cartesian axis system. Note the average point in red. Note also that when you redefine the axes, setting the (0,0) point at the average, the dispersion is not affected. This actually simplifies the reasoning needed to understand the relationship between covariance and correlation. (Source: Author)

It’s interesting to note in the sample picture above that there is the same number of points to the left and to the right of the average (for dimension x) and the same number of points above and below the average (for dimension y).

This simple example gives you the idea that the sign of the covariance is important. A positive or negative result tells you in which pair of quadrants the direction of the “synch” lies. Note also that the sign alone doesn’t tell you anything about the slope of that direction, just the quadrants where the direction lies.

If the points, instead of going from lower-left to upper-right, went from upper-left to lower-right (reflected along the horizontal line of the means), the covariance would have the same absolute value but a negative sign.

How Covariance is bound to Correlation

If you leave aside for the moment the constant in the above covariance formula, the rest of the equation is a sum of products of differences, which should remind you (at least for the “sum of products” part) of the dot product concept in algebra.

To make it look like a dot product between two vectors we can center the data points (you subtract from each variable’s values the mean of that variable), so that both means become 0 while the variance remains unchanged (shifting all points does not change the distances between them, so the dispersion stays the same).

The covariance formula in that case for centered data simplifies to:

Covariance formula for data centered around the mean (Source: Author)

which contains the definition of the dot product, which we know is:

Dot product (Source: Author)

but we know from geometry that the dot product can also be written as:

Dot Product : Geometry version (Source: Author)

Considering that for centered data the variance formula also simplifies, we have:

Variance formula for centered variables X and Y.
Standard Deviations of centered data in relation to the length of vectors X and Y
For centered data, the covariance formula expressed in terms of the standard deviations and the cosine of the angle between the vectors X and Y, which is the correlation coefficient. (Source: Author)
The relationship between covariance and correlation (rho) in the case of 2 centered variables. The same concept can be transposed in terms of matrices when there are n variables. (Source: Author)

Relationship between simple covariance and correlation for two centered variables: the sigmas are their standard deviations and rho is the correlation coefficient. At one extreme, where correlation is zero, covariance is zero. At the other, where correlation is maximal (rho = 1 or rho = -1), the covariance equals (plus or minus) the product of the standard deviations of the two variables.
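For centered variables the pictured relations can be condensed as follows (same sample convention as above; θ is the angle between the vectors x and y):

$$\operatorname{cov}(x,y)=\frac{1}{m-1}\sum_{i=1}^{m}x_i y_i=\frac{x\cdot y}{m-1},\qquad \sigma_x=\frac{\lVert x\rVert}{\sqrt{m-1}},\qquad \sigma_y=\frac{\lVert y\rVert}{\sqrt{m-1}}$$

$$x\cdot y=\lVert x\rVert\,\lVert y\rVert\cos\theta \;\Rightarrow\; \operatorname{cov}(x,y)=\sigma_x\,\sigma_y\cos\theta=\sigma_x\,\sigma_y\,\rho$$

In matrix terms the same relation reads R = D⁻¹ Cx D⁻¹, with D = diag(σx1, …, σxn): this is the link between the covariance and correlation matrices mentioned earlier.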

From Covariance to Covariance Matrix

Considering that the covariance formula above was just for 2 variables we can see the covariance matrix as the “big picture” of all the covariances between all of the variables in our data set.

Covariance Matrix (also called in literature Variance/Covariance Matrix) (Source:Author)

When we calculate the Variance/Covariance Matrix of our original data set (we call the original data set X, a collection of column vectors X1, X2…Xn), we see the variances along the diagonal, but we also see the joint covariances in the off-diagonal elements, which are a (difficult to interpret) measure of the magnitude of the ‘synchronicity’ between one variable and another.
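For a centered data set X the whole matrix can be written compactly; with the sample convention used earlier (the Python snippets in section 4 divide by m instead of m-1) it is:

$$C_x=\frac{1}{m-1}X^{T}X,\qquad (C_x)_{ij}=\operatorname{cov}(x_i,x_j),\qquad (C_x)_{ii}=\operatorname{var}(x_i)$$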

Of course we cannot modify our original data X in order to eliminate the correlations between variables (unless we completely remove a variable, which is always possible but risky because we could unwillingly remove important information). Instead, we can try to find a way to ‘transform’ X into a different data set Y, having a special (special = ‘eigen’ in German) link with X, but built in such a way that the new variables of Y (Y1, Y2…Yn) have a different covariance matrix Cy, where the variance of each variable is “isolated” from the interferences (correlations) of the other variables (=> covariances = 0).

This is where we would like to land: a new covariance matrix Cy which is diagonalized. This means that we are looking to find a new data set Y, bound to the original data set X (we’ll see how the binding works), where all the new variables of Y are uncorrelated with each other. In this way the covariances in the Cy matrix would be zero and the variances isolated on the diagonal. This would clearly allow us to define the total variance of the data set as the sum of the diagonal elements (the trace). (Source: Author)

This operation is the magic of the eigen-decomposition equation. By solving that equation we will find a way to transform the original data set X (and consequently its covariance matrix Cx) into a new data set Y (and consequently a diagonalized covariance matrix Cy) using a special transformation matrix we call B. B will be the way we go back and forth between X and Y.
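In compact form, and anticipating the result of the next sections (B will turn out to be orthonormal, so B⁻¹ = Bᵀ):

$$Y=XB,\qquad X=YB^{T},\qquad C_y=B^{T}C_xB=\operatorname{diag}(\lambda_1,\dots,\lambda_n)$$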

Assuming we are able to solve that eigen-decomposition equation… so what? With the original data X I was forced to work with Cx, unable to isolate the contribution of single variables to the overall variance because of the correlations. Now, solving that equation, I can work with a new data set Y whose covariance matrix Cy is diagonal, so the variance contribution of every variable of Y is clearly defined and not affected by joint interferences/correlations. We can then decide, based on the values of the variances, where to cut the Y data set and keep a subset of the most important Y variables, according to the weight of their contribution with respect to the sum of the variances in Cy.

What’s the price to pay when we work with Y instead of X? Because the variables of the new Y data set are (as we will see) linear combinations of the original X data, their meaning is undefined. To give a contextual meaning to the y variables, a certain acumen is required to make them ‘talk’ from a practical standpoint. The matter is a bit complex and there are many details we skip here about how to name the newly found variables, but this is how the problem and the solution are framed.

In any case, to go from X to Y (and from Cx to Cy) we need to understand the logic of the transformation (through a still-to-be-defined matrix B) that we can apply to our original data. So we need to talk about (and study) projections and linear transformations, as they are strictly connected.

2. Math/Algebra behind PCA (easier)

Projection concept: in short, what we draw in charts depends on the system of coordinates we use to represent the data. Think of perspective in art: it’s a projection of real-life measures transformed according to specific rules.

In 3 dimensions, for example, we use the Cartesian system, which is represented by a matrix (E) of orthonormal vectors (orthogonal/perpendicular vectors of length = 1).

Cartesian basis (Source: Author)

When we have for example a data set X (measures are in 3 dimensions: column vectors x1,x2,x3), to calculate the variance of variable x1 we apply the usual formula for variance. But what we actually do is to project the coordinates of each data point of X on the 3 directions in E through a dot product.

So in general we have X.E = X. This projection operation is “transparent” with the Cartesian system, and we don’t even realize we are doing a projection, as it is “automatic” in our minds.

But if we decide to move away from the Cartesian system, we can build a similar set of orthonormal vectors that define a new basis B made of b1, b2, b3 (instead of E made of e1, e2, e3) like this:

(Source:Author)

And if we dot multiply (for example) a generic row vector of our data set X containing the coordinates of a single data point (single observation if you prefer):

generic row of data, (1,3) means 1 row, 3 columns

with B (=3 rows,3 columns) we obtain a new row vector (=1 row, 3 columns):

The corresponding new row in Y, obtained from a generic row of X transformed with B

where each element of this vector has new “measures”/”coordinates” given by:

The three equations here are the expansion of the yi equation in the previous figure. Dimensionally the shape is still (1 row, 3 columns), but each element of the row is a linear combination of the xi weighted by the elements of B.

In algebra terms and in general in our case we change basis simply with:

The first is the matrix form. In our case Y is made of 3 column vectors (y1, y2, y3 with m rows) and B is made of 3 column vectors (b1, b2, b3 with 3 rows). So Y (m,n) = X (m,n) . B (n,n). The 3 equations, once the solution for B is found, will represent the so-called “Principal Components”.

where X is the original data matrix (m rows by n columns) in basis E, B is a new orthonormal basis (n by n), and Y (m by n) is the resulting new data matrix with the measures expressed in the new basis B.

Note that y1, y2, y3 are new variables whose meaning is NOT related to x1, x2, x3: they are new, and a new meaning must be semantically defined for them, since they are linear combinations of all the X variables weighted by the components of the b1, b2, b3 vectors of B respectively. That is (see the formulas right after this list):

  • y1 is made of every variable of X weighted by vector b1=[b11,b12,b13]
  • y2 is made of every variable of X weighted by vector b2=[b21,b22,b23]
  • y3 is made of every variable of X weighted by vector b3=[b31,b32,b33].
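Written out explicitly (using the indexing of the list above, where bij is the j-th component of vector bi and x1, x2, x3 are the column vectors of X):

$$y_1=Xb_1=b_{11}x_1+b_{12}x_2+b_{13}x_3$$
$$y_2=Xb_2=b_{21}x_1+b_{22}x_2+b_{23}x_3$$
$$y_3=Xb_3=b_{31}x_1+b_{32}x_2+b_{33}x_3$$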

So if we find a solution for B, we can work with the new y variables. But which B should we use to reach our objective of finding a Y with a diagonal covariance matrix Cy?

B can be built progressively if we understand that there exists a direction that maximizes the variance of X projected along it. We could project the data set X along the x axis of the Cartesian system, but in that case the variance is probably not maximized. Therefore we need to find that particular direction. Once we find both the direction and the value of the maximum variance, we have found the first eigenvector and the first eigenvalue. In other words, we have found the first principal component PC1: y1 = X.b1, whose strength is the first eigenvalue (lambda 1).

To evaluate y1 in terms of how much variance it explains, at this stage you can divide lambda1 by the sum of the diagonal elements of the covariance matrix Cx, because the total variance does not change from Cx to Cy.
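In symbols, this is the ratio that PCA libraries typically report as the “explained variance ratio”:

$$\frac{\operatorname{Var}(y_1)}{\text{total variance}}=\frac{\lambda_1}{\operatorname{tr}(C_x)}=\frac{\lambda_1}{\lambda_1+\lambda_2+\dots+\lambda_n}$$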

Then, with the same principle, we can find the second direction b2 (second eigenvector) as the one that maximizes the variance (second eigenvalue) among all the possible projections of X along a second direction of unit length orthogonal to b1. Once found, this is the second principal component, PC2: y2 = X.b2.

Then the third direction b3 maximizes the variance of X along a third direction, defined again by a unit vector that must also be orthogonal to both b1 and b2. Once found, this is the third principal component, PC3: y3 = X.b3.

At the end of these iterations we will have built a special B = [b1*, b2*, b3*] (the eigenvector matrix), a new data set Y = [y1=X.b1, y2=X.b2, y3=X.b3] made of 3 ‘principal components’ ranked by strength of variance, and a special vector of eigenvalues (lambda1, lambda2, lambda3) where each lambda is the variance of the corresponding y1, y2, y3.

Of course the iteration can go on up to n dimensions. At the end we have all the elements to reduce Y according to the weight of each variance over the total variance sum(lambda1, lambda2, …, lambdan). Hopefully with few variables of Y we can “explain” a large percentage of the total variance, so that we reduce the number of variables but not (much of) the amount of information the data set contains.
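As a minimal numpy sketch of this reduction step (a hypothetical helper of mine, not the snippets of section 4; eigenvalues are sorted explicitly because numpy does not guarantee any particular order, and np.linalg.eigh is used since the covariance matrix is symmetric):

import numpy as np
def pca_reduce(X,k): #sketch: project X onto its first k principal components
    Xc=X-np.mean(X,axis=0) #center each variable around its mean
    Cx=Xc.T.dot(Xc)/Xc.shape[0] #covariance matrix (population convention, as in section 4)
    lam,B=np.linalg.eigh(Cx) #eigenvalues/eigenvectors of the symmetric matrix Cx
    order=np.argsort(lam)[::-1] #indices that sort eigenvalues from largest to smallest
    lam,B=lam[order],B[:,order]
    Yk=Xc.dot(B[:,:k]) #keep only the first k principal components
    weights=lam/lam.sum() #fraction of total variance carried by each component
    return Yk,weights[:k]
#example: keep 2 of the 3 components of the toy data set used in section 4
X=np.array([[1,1,6],[4,2,9],[2,-2,3],[-3,3,1],[-5,1,7]])
Y2,w2=pca_reduce(X,2)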

3. Math/Algebra behind PCA (difficult)

The above was the verbose part, explaining a manual iterative procedure to solve the eigen-decomposition equation. Now we take a look at the whole mathematical procedure, from the start to the solution of the eigen-decomposition equation (the key formulas are summarized in compact form right after the list).

We call X the original data set (m rows x n columns) whose column variables have already been “centered” around their respective means; Cx is the covariance matrix of X; b1 is an unknown vector of n elements and the first column of the to-be-built transformation matrix B; y1 is the projection of X along the still unknown vector b1: y1 = X.b1.

  • Variance of the first new variable of Y, which we call y1:
  • To maximize Var(y1) and find the corresponding first direction we use the method of Lagrange multipliers. The symbol is the gradient, f is the function to maximize, which in our case is the variance of y1 = X.b1, and g is the function where we set the constraint that the length of the vector b1 must be 1:
Fig.1
  • The vector of partial derivatives of f with respect to b1 (the unknown vector of n components) is:
Fig.2 If you want to find out why, you can try with a simple vector of unknowns b1 = (b11, b12) and a given symmetric Cx like ([4,2],[2,1]). You will get a second-degree polynomial, and the two partial derivatives (with respect to b11 and b12) make up a vector which is 2 times Cx multiplied by b1.
  • The vector of partial derivatives of g with respect to b1 is:
Fig.3
  • The eigen decomposition equation becomes:
Fig.4
  • To find some vector b1 ≠ 0 we must take the determinant of the term in parentheses and set it equal to zero. Only in this case does the system of n equations in the n unknown components of b1 have non-trivial solutions:
Fig.5
  • Setting that determinant to zero gives a polynomial equation of degree n in lambda, which will likely yield (I keep it simple because there could be exceptions) n different lambdas. These lambdas represent the n different variances of the projected points of X along still-to-be-determined directions.
  • Now we take the largest of the lambdas found (let’s call it lambda1) and find b1 by solving the system given by:
Fig.6
  • After we have found b1, we take lambda2 and solve the system in the equation above again to find b2, and so on until we have a b vector for every lambda.
  • The calculations, as you can imagine, are very long even for just 3 variables; that’s why we need Python and numerical algorithms to do the dirty work.
  • In the end we will (exceptions apart) have: the eigenvector matrix B as a sequence of column vectors b1, b2, b3 … bn; and the eigenvalue vector L(ambda), representing the variances of every y of Y (the Principal Components), i.e. the “strength” associated with each eigenvector.
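Putting the steps of the list above in compact form (my notation for the content of Fig.1–Fig.6):

$$f(b_1)=\operatorname{Var}(y_1)=b_1^{T}C_x\,b_1,\qquad g(b_1)=b_1^{T}b_1-1=0$$
$$\nabla f=\lambda\,\nabla g\;\Rightarrow\;2\,C_x b_1=2\,\lambda\,b_1\;\Rightarrow\;(C_x-\lambda I)\,b_1=0$$
$$\det(C_x-\lambda I)=0\;\Rightarrow\;\lambda_1\ge\lambda_2\ge\dots\ge\lambda_n,\qquad (C_x-\lambda_k I)\,b_k=0\ \text{ gives each } b_k$$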

4. Python snippets

There are many examples and libraries around, but here I want to use plain numpy code snippets to show that you can quickly try a small sample data set and understand what’s going on by looking at the results. I skipped the print statements, as you can work in the console and check the content of the variables yourself. Even charting is omitted, but you can do that as you like with matplotlib or similar. The point here is not building an app, but showing the understanding of the logic.

import numpy as np
import pandas as pd
pd.options.display.max_columns = 200
pd.options.display.width=200
pd.options.display.max_rows=200
def fcenter(X): #function to center data
    data_mean=np.mean(X,axis=0) #calc mean by column
    X=X-data_mean #centered data
    return X

def fcov(X): #function to find the covariance matrix
    covx=(X.T.dot(X))/X.shape[0] #calc cov matrix of X
    #alternative to: covx=np.cov(X,rowvar=False,ddof=0)
    return covx

def feigen_decomp(Cx): #this is the core to solve the eigen decomposition equation
    eigval,eigvec=np.linalg.eig(Cx) #solve eigen equation
    return eigval,eigvec

X=np.array([[1,1,6],[4,2,9],[2,-2,3],[-3,3,1],[-5,1,7]]) #some data
Xc=fcenter(X) #centered data
#----------------------
#this shows that variance does not change when centering data
vX=np.var(X,axis=0) #variance of columns of X
vXc=np.var(Xc,axis=0) #variance of columns of Xc
#----------------------
Cx=fcov(Xc) #Cx=covariance matrix of Xc
L,B=feigen_decomp(Cx) #L=eigenvalues vector, B=eigenvectors matrix
Lw=L/L.sum() #weight in % of every PC (not cumulative)
Y=Xc.dot(B) #Y=Principal Components matrix (columns = y scores)
Cy=fcov(Y) #shows that the diagonal of the covariance matrix of Y coincides with L
#----------------------
#Loadings analysis
Loadings=np.sqrt(L)*B #see comments
Loadings_sq=Loadings**2 #see comments
Loadings_sq.sum(axis=0) #see comments
Loadings_sq.sum(axis=1) #see comments

One note on loadings: loadings are useful when you want to understand the results. Recall that each new variable of Y is a linear combination of all the X variables. The loadings matrix represents, vertically, how much of the variance of each PC is explained by each variable x of X (in fact the sum of each column of squared loadings is equal to L) and, horizontally, how much of the variance of each x is explained by each PC (in fact the sum of squared loadings by row is equal to the variance of x). An example of using the loadings is given in another story I wrote about FIFA 21 players. Check that out if you want; it’s more fun than this one in the end :-).

Loadings are important when trying to define a name for the PCs in relation to the names of the most relevant x variables that contribute to the values (scores) of the Principal Components.
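As a minimal sketch of that naming exercise (continuing from the variables computed in the snippet above; pc_index and ranking are names of mine):

pc_index=0 #look at the first principal component
ranking=np.argsort(np.abs(Loadings[:,pc_index]))[::-1] #x variables sorted by absolute loading, most relevant first
#ranking[0] is the index of the x variable contributing most to PC1,
#a natural candidate to suggest a name for that component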

… just Follow me

Hi everybody, my name is Andrea Grianti. I spent my professional life in IT and data warehousing, but I later became more and more passionate about data science and analytics topics.

Please consider following me so that I can reach the follower threshold needed for the Medium platform to consider me for their partner program.
